Real-world datasets often have missing data.
Realistically, there are only three ways to deal with it:
1.   Leave it as missing
2.   Remove the missing data
3.   Fill in the missing data




Leave it as missing - Depending on the type of data, this is a valid choice. For example, when dealing with categorical data, we can simply treat NaN as another category.
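As a rough illustration of treating NaN as its own category, here is a minimal sketch. The `colors` Series and the `"Missing"` label are made up for this example; `fillna` and `pd.get_dummies(..., dummy_na=True)` are the pandas calls doing the work.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with two missing entries
colors = pd.Series(["red", np.nan, "blue", "red", np.nan], name="color")

# Option 1: replace NaN with an explicit label so it becomes its own category
colors_labeled = colors.fillna("Missing")
print(colors_labeled.value_counts())

# Option 2: one-hot encode and let pandas add a dedicated indicator column for NaN
encoded = pd.get_dummies(colors, dummy_na=True)
print(encoded)
```

Either option keeps every row; which one fits better likely depends on what the downstream model or analysis expects.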

Remove the missing data - The right approach depends on how much data is missing.

*   Large Percentage - too much is missing to make a reasonable guess. Remove the column.
*   Small Percentage - only removes a few data points from our dataset. Remove the rows.
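One way to apply the two rules above is to measure the fraction of missing values per column first. The sketch below uses a made-up DataFrame and an assumed 50% cut-off (`COLUMN_DROP_THRESHOLD`); the right cut-off is a judgment call for each dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'mostly_empty' is 80% missing, 'almost_full' is 10% missing
df_example = pd.DataFrame({
    "mostly_empty": [np.nan] * 8 + [1.0, 2.0],
    "almost_full": [np.nan] + list(range(9)),
    "complete": list(range(10)),
})

# Fraction of missing values per column
missing_fraction = df_example.isna().mean()
print(missing_fraction)

# Assumed cut-off: drop any column that is more than half empty
COLUMN_DROP_THRESHOLD = 0.5
columns_to_drop = missing_fraction[missing_fraction > COLUMN_DROP_THRESHOLD].index
trimmed = df_example.drop(columns=columns_to_drop)

# For the remaining columns, drop the few rows that still contain NaN
cleaned = trimmed.dropna(axis=0)
print(cleaned.shape)
```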



Fill in the missing data -
Appropriate when a non-trivial percentage is missing and the affected rows are too important to drop.
Several strategies are available:
*   Fill in missing data using the mode, mean, or median
*   Derive a reasonable value from another feature column

You are making an educated guess to fill in the missing data.
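The two strategies listed above could look roughly like this. The `animals` DataFrame and its column names are invented for the example; the group-wise fill assumes that `species` is a sensible feature to base the guess on.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'height' and 'coat' have gaps, 'species' is always known
animals = pd.DataFrame({
    "species": ["cat", "cat", "cat", "dog", "dog", "dog"],
    "coat": ["black", "black", np.nan, "brown", "black", "brown"],
    "height": [25.0, np.nan, 23.0, 60.0, 58.0, np.nan],
})

# Simple fills: one summary statistic for the whole column
filled_median = animals["height"].fillna(animals["height"].median())
filled_mode = animals["coat"].fillna(animals["coat"].mode()[0])

# Fill based on another feature: use the mean height within each species
group_mean = animals.groupby("species")["height"].transform("mean")
animals["height_filled"] = animals["height"].fillna(group_mean)
print(animals)
```

The group-wise fill gives each missing cat height the cat average and each missing dog height the dog average, which is usually a closer guess than a single column-wide value.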

What is the best way to deal with missing data? There is no single correct answer, because every dataset and situation is different. Use common sense and your overall goals to decide which strategy makes sense.

In [None]:
import numpy as np
import pandas as pd

In [None]:
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [10, 20, 30, 40]})
df

Unnamed: 0,A,B,C
0,1.0,5.0,10
1,2.0,,20
2,,,30
3,4.0,8.0,40


In [None]:
# keep only the columns that have no missing data
df.dropna(axis=1)

Unnamed: 0,C
0,10
1,20
2,30
3,40


In [None]:
# keep only the columns that have at least 2 non-NaN values
df.dropna(axis=1, thresh=2)

Unnamed: 0,A,B,C
0,1.0,5.0,10
1,2.0,,20
2,,,30
3,4.0,8.0,40


In [None]:
# set the threshold as a fraction of the DataFrame length:
# keep only the columns where at least 75% of the values are non-NaN
df.dropna(axis=1, thresh=int(0.75 * len(df)))

Unnamed: 0,A,C
0,1.0,10
1,2.0,20
2,,30
3,4.0,40
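The cells above all drop columns (`axis=1`). As a sketch of the row-wise counterpart mentioned in the introduction, the same `dropna` call can be used with `axis=0`. The variable names `rows_complete`, `rows_thresh`, and `rows_subset` are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [10, 20, 30, 40]})

# drop every row that has at least one NaN (axis=0 is the default)
rows_complete = df.dropna(axis=0)

# keep only the rows that have at least 2 non-NaN values
rows_thresh = df.dropna(axis=0, thresh=2)

# only look at column 'A' when deciding which rows to drop
rows_subset = df.dropna(subset=['A'])

print(rows_complete)
print(rows_thresh)
print(rows_subset)
```

None of these calls modify `df`; each returns a new DataFrame.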


In [None]:
# fill every missing value with the given value
# the fill value can have a different data type than the column
df.fillna(value='FILL VALUE')

Unnamed: 0,A,B,C
0,1.0,5.0,10
1,2.0,FILL VALUE,20
2,FILL VALUE,FILL VALUE,30
3,4.0,8.0,40


In [None]:
# the integer fill value is stored as a float because columns A and B are float columns
df.fillna(value=0)

Unnamed: 0,A,B,C
0,1.0,5.0,10
1,2.0,0.0,20
2,0.0,0.0,30
3,4.0,8.0,40


In [None]:
# fill in a single column and assign the result back
df['A'] = df['A'].fillna(value=0)
df

Unnamed: 0,A,B,C
0,1.0,5.0,10
1,2.0,,20
2,0.0,,30
3,4.0,8.0,40


In [None]:
# fill column B with its own mean (NaN values are skipped when the mean is computed)
df['B'] = df['B'].fillna(value=df['B'].mean())
df

Unnamed: 0,A,B,C
0,1.0,5.0,10
1,2.0,6.5,20
2,0.0,6.5,30
3,4.0,8.0,40


In [None]:
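# recreate the original DataFrame, since A and B were filled in above
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [10, 20, 30, 40]})

# we can also pass a Series as the fill value
# df.mean() returns one mean per column, so each column is filled with its own mean
df.fillna(df.mean())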

Unnamed: 0,A,B,C
0,1.0,5.0,10
1,2.0,6.5,20
2,2.333333,6.5,30
3,4.0,8.0,40
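If different columns call for different strategies, `fillna` also accepts a dict that maps column names to fill values. The following is a small sketch; the choice of median for `A` and mean for `B` is arbitrary and only meant to show the mechanics.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8], 'C': [10, 20, 30, 40]})

# choose a fill value per column: median for A, mean for B (arbitrary choice for illustration)
fill_values = {'A': df['A'].median(), 'B': df['B'].mean()}

# columns not listed in the dict (here C) are left untouched
filled = df.fillna(value=fill_values)
print(filled)
```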
