# Missing Data 

In this lecture, we will look at a few convenient methods for dealing with missing data in pandas.

In [14]:
import numpy as np
import pandas as pd

In [15]:
d = {
    'A':[1, 2, np.nan],
    'B':[5, np.nan, np.nan],
    'C':[1, 2, 3]
}
df = pd.DataFrame(data=d)

When you read data in with pandas, missing data points are often filled in automatically with a null value, shown as NaN.
Let us explore how the dropna() and fillna() methods can be used to drop or fill those missing values, respectively.
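As a minimal sketch of that automatic filling, the snippet below reads a small CSV from an in-memory string. The CSV text and the names `csv_text`, `loaded` and `missing_per_column` are made up for illustration and are not part of the lecture's data files.

```python
import io

import pandas as pd

# Hypothetical CSV text with empty fields (illustrative only)
csv_text = "A,B,C\n1,5,1\n2,,2\n,,3\n"

# read_csv turns the empty fields into NaN
loaded = pd.read_csv(io.StringIO(csv_text))

# isna() flags each missing cell; summing per column counts them
missing_per_column = loaded.isna().sum()
print(missing_per_column)
```

Counting missing values per column this way is a quick check before deciding whether to drop or fill them.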

In [16]:
# print the dataframe
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


Note that row 0 and column C have no missing values, whereas the other rows and columns each contain at least one.

## dropna()

A lot of times you just want to drop missing values from your data set, especially if they are sparsely present.

Calling dropna() on the dataframe with no arguments drops every row that contains at least one missing value. The same operation can be carried out on columns by specifying axis=1.

In [17]:
# use the dropna() method to drop rows which contain at least 1 NaN value
df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [18]:
# use the dropna() method to drop columns which contain at least 1 NaN value
df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3
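Beyond axis, dropna() also accepts the how and subset parameters, which the lecture does not cover. The sketch below rebuilds the same dataframe and shows one possible use of each. The variable names `all_missing_dropped` and `a_present` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [1, 2, 3]
})

# how='all' drops a row only if every value in it is NaN;
# no row in this dataframe qualifies, so all rows are kept
all_missing_dropped = df.dropna(how='all')

# subset limits the check to the listed columns:
# only rows with a NaN in column 'A' are dropped
a_present = df.dropna(subset=['A'])
```

Here `a_present` keeps rows 0 and 1, even though row 1 still has a NaN in column 'B'.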


#### threshold value 

The thresh parameter of dropna() sets the minimum number of non-NaN values a row (or column) must contain in order to be kept. Only rows/columns with fewer non-NaN values than the specified threshold are dropped.

In [19]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


As we can observe, our dataframe contains 3 rows of which row 0 has 3 non-NaN values, row 1 contains 2 non-NaN values and row 2 contains just 1 non-NaN value.

This means that if we set thresh=2 in dropna(), only row 2 is dropped, because it does not have at least 2 non-NaN values.

In [20]:
df.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
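To check the row counts described above, the non-NaN values per row can be counted directly. This is a small sketch that rebuilds the same dataframe. The names `non_nan_per_row`, `kept` and `cols_kept` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [1, 2, 3]
})

# count the non-NaN values in each row
non_nan_per_row = df.notna().sum(axis=1)

# keeping rows with at least 2 non-NaN values mirrors dropna(thresh=2)
kept = df[non_nan_per_row >= 2]

# thresh also works column-wise: keep columns with at least 2 non-NaN values
cols_kept = df.dropna(axis=1, thresh=2)
```

With axis=1, column 'B' is dropped because it holds only one non-NaN value.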


## fillna()

Dropping NaN values can be handy in some specific situations but a lot of times we would just like to fill in the missing cells with some other value. This can be achieved using the fillna() method.

In [21]:
df.fillna(value="FILL VALUE")

Unnamed: 0,A,B,C
0,1,5,1
1,2,FILL VALUE,2
2,FILL VALUE,FILL VALUE,3
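Two details are worth a quick sketch here: fillna() returns a new dataframe by default and leaves the original untouched, and the value argument can be a dict that gives a different fill value per column. The fill values 0 and -1 below are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [1, 2, 3]
})

# fill column 'A' with 0 and column 'B' with -1 (arbitrary example values)
filled = df.fillna(value={'A': 0, 'B': -1})

# the original dataframe still contains its 3 NaN values
remaining_in_original = int(df.isna().sum().sum())
remaining_in_filled = int(filled.isna().sum().sum())
```

To keep the result, assign it to a variable as done with `filled` here.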


As we can see, the NaN values are replaced by the value specified in the fillna() method. Another common approach is to fill the missing cells with the mean of the remaining values in the respective row/column.

In [22]:
df['A']

0    1.0
1    2.0
2    NaN
Name: A, dtype: float64

As we can see, column 'A' contains a NaN value in its third row. We can replace it with the mean of the other values present in column 'A'.

In [23]:
df['A'].fillna(value=df['A'].mean())

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
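The same idea can be applied to every column at once. The sketch below assumes all columns are numeric, as they are in this dataframe, and passes the Series of column means to fillna(), which matches each mean to its column by name. The names `column_means` and `mean_filled` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data={
    'A': [1, 2, np.nan],
    'B': [5, np.nan, np.nan],
    'C': [1, 2, 3]
})

# mean() skips NaN values by default, so each mean uses only the present values
column_means = df.mean()

# each column's NaN cells are filled with that column's own mean
mean_filled = df.fillna(value=column_means)
```

Column 'B' has a single known value, so both of its missing cells are filled with 5.0.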