# Missing Data

Cleaning data is an important step in any data analysis. Part of that work is finding and handling missing values, and pandas provides methods that make this easier.

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Creating a DataFrame from a dictionary in a single statement

df = pd.DataFrame({'A':[1,2,np.nan],
                  'B':[5,np.nan,np.nan],
                  'C':[1,2,3]})

df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


#### df.isnull()

Returns a DataFrame of boolean values with the same shape as `df`: `True` where a value is missing (`NaN`) and `False` otherwise.

In [3]:
df.isnull()

Unnamed: 0,A,B,C
0,False,False,False
1,False,True,False
2,True,True,False


In [4]:
# Null values in a single column

df.isnull()["B"]

0    False
1     True
2     True
Name: B, dtype: bool
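
Because the boolean mask is made of `True`/`False` values, it can be summarized as well as displayed. The sketch below is a minimal example using the same `df` as above; the variable names `null_counts` and `rows_with_nulls` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# True counts as 1 and False as 0, so summing the mask
# gives the number of missing values in each column
null_counts = df.isnull().sum()
print(null_counts)

# any(axis=1) is True for rows with at least one missing value,
# so this selects only the incomplete rows
rows_with_nulls = df[df.isnull().any(axis=1)]
print(rows_with_nulls)
```

Counting nulls per column is often a useful first check before deciding whether to drop or fill values.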

#### df.dropna(axis,thresh)

Drops any rows or columns that contain missing values. The result is returned as a new DataFrame; `df` itself is left unchanged.

By default, `axis=0`, which drops rows. Setting `axis=1` makes pandas drop any columns that have missing values instead.

In [5]:
# Dropping rows that have null values

df.dropna()

Unnamed: 0,A,B,C
0,1.0,5.0,1


In [6]:
# Dropping columns that contain null values

df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3
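
Dropping on any missing value can be too aggressive. As a hedged sketch beyond what the notebook covers, `dropna` also accepts `how` and `subset` parameters; the variable names below are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# how='all' drops a row only when every value in it is missing;
# no row here is entirely empty, so all three rows survive
all_missing_dropped = df.dropna(how='all')
print(all_missing_dropped)

# subset limits the check to the listed columns;
# only row 2 has a missing value in column A
a_required = df.dropna(subset=['A'])
print(a_required)
```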


We can specify a threshold with `thresh`: a row (or column) is kept only if it has at least that many non-null values; otherwise it is dropped.

In [7]:
# Keeping only rows that have at least 2 non-null values

df.dropna(thresh=2)

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
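
To see why row 2 is the only one dropped, it can help to count the non-null values in each row by hand. The following is a small sketch on the same `df`; `non_null_per_row` and `kept` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# Number of non-null values in each row: 3, 2 and 1
non_null_per_row = df.notnull().sum(axis=1)
print(non_null_per_row)

# thresh=2 keeps rows with at least 2 non-null values
kept = df.dropna(thresh=2)

# The same rows can be selected manually with a boolean filter
manual = df[non_null_per_row >= 2]
print(kept.equals(manual))
```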


#### df.fillna()

Instead of deleting data, we may want to fill in the missing values.

In [8]:
# Filling every missing value with the string "FILL VALUE"

df.fillna(value='FILL VALUE')

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,FILL VALUE,2
2,FILL VALUE,FILL VALUE,3


In [9]:
# Filling the missing values in column A with the column mean
# Assigning the result back so the change is saved in df

df["A"] = df['A'].fillna(value=df['A'].mean())
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,1.5,,3
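
Different columns often call for different fill values. As a hedged sketch that goes slightly beyond the notebook, `fillna` also accepts a dictionary mapping column names to fill values, and `ffill` carries the last valid value forward. The choice of `0` for column `B` is an arbitrary assumption for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# A dictionary fills each column with its own value:
# the mean for A, and 0 (an arbitrary choice) for B
filled = df.fillna({'A': df['A'].mean(), 'B': 0})
print(filled)

# ffill propagates the last valid value down each column
forward = df.ffill()
print(forward)
```

Neither call modifies `df`; as in the cell above, the result has to be assigned back to keep it.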
