
# Missing Data

Let's show a few convenient methods to deal with Missing Data in pandas:

In [2]:
import numpy as np
import pandas as pd

In [3]:
# just like we can create a series from a dictionary, we can create a dataframe from a dictionary as well

df = pd.DataFrame({'A':[1,2,np.nan], # "np.nan" marks a missing (null) value; every column needs the same number of entries
                   'B':[5,np.nan,np.nan],
                   'C':[1,2,3]})

# the keys become the column names of the dataframe
# each list of values becomes that column's data, one entry per row

In [4]:
df

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
2,,,3


In [5]:
# ".dropna()" is typically used to drop missing values from a dataframe
df.dropna()

# called with no arguments, it drops every row that contains at least one null value

Unnamed: 0,A,B,C
0,1.0,5.0,1
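
Dropping every row with any null value can be too aggressive. As a small sketch that goes beyond the cells in this notebook, `.dropna()` also accepts the `how` and `subset` arguments to narrow down which rows get dropped. The dataframe is rebuilt here so the cell runs on its own, and the result names are only illustrative.

```python
import numpy as np
import pandas as pd

# same dataframe as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# how='all' drops a row only if every value in it is missing
# column "C" is never missing, so no row is dropped here
all_missing_dropped = df.dropna(how='all')

# subset limits the check to the listed columns
# only row "2" is missing a value in "A", so rows "0" and "1" are kept
a_present = df.dropna(subset=['A'])
```

Neither call changes `df` itself; each returns a new dataframe.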


In [6]:
# to drop columns instead of rows, pass in "axis=1" as an argument
# the axis (if not stated) defaults to "0", which drops rows
# columns "A" and "B" each contain a null value, so only "C" remains

df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3


In [7]:
# "thresh" stands for threshold: an integer giving the minimum number of non-"NA" values a row needs in order to be kept

df.dropna(thresh=2)

# this keeps only the rows (again, "axis=0" by default here) that have at least 2 non-null values
# it kept row "1" (the second row) because it has two non-null values ("A" and "C")
# it dropped row "2" (the third row) because it has only one non-null value ("C")

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,,2
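
To see why `thresh=2` keeps rows "0" and "1", it can help to count the non-null values per row by hand. The sketch below assumes the same dataframe as above (rebuilt so the cell is self-contained) and uses `.notna()`, which is not shown elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# same dataframe as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# ".notna()" gives True for every non-null value; summing across axis=1 counts them per row
non_null_counts = df.notna().sum(axis=1)

# keeping the rows with at least 2 non-null values reproduces "df.dropna(thresh=2)"
kept_by_hand = df[non_null_counts >= 2]
kept_by_thresh = df.dropna(thresh=2)
```

Here `non_null_counts` is 3, 2 and 1 for rows "0", "1" and "2", so only row "2" falls below the threshold.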


In [8]:
# what if we wanted to fill in those missing values? we can use the ".fillna()" method

df.fillna(value='FILL VALUE')

# it replaced all "NaN" values with the string "FILL VALUE"

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,FILL VALUE,2
2,FILL VALUE,FILL VALUE,3
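
A single fill value is not always what you want for every column. As a sketch, `.fillna()` also accepts a dictionary that maps column names to fill values. The values `0` and `-1` below are arbitrary placeholders chosen for illustration, not recommendations.

```python
import numpy as np
import pandas as pd

# same dataframe as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# fill "A" with 0 and "B" with -1 (placeholder values); "C" is not listed, so it is left alone
filled = df.fillna(value={'A': 0, 'B': -1})
```

Because the fill values are numbers, the columns stay numeric instead of mixing floats with strings.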


In [9]:
df['A']

# notice how there's a missing value for row "2" (third row)?
# we can fill it in

0    1.0
1    2.0
2    NaN
Name: A, dtype: float64

In [17]:
df['A'].fillna(value=df['A'].mean())

# we fill the missing value with the mean of column "A" using the ".mean()" method
# ".mean()" skips "NaN" by default, so the mean is (1.0 + 2.0) / 2 = 1.5

0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
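
The same idea extends to the whole dataframe. As a minimal sketch, passing `df.mean()` to `.fillna()` fills each column with its own mean, since `df.mean()` returns one value per column. Note that `.fillna()` returns a new object, so the result has to be assigned to a name (here the illustrative `filled`) if you want to keep it.

```python
import numpy as np
import pandas as pd

# same dataframe as above, rebuilt so this cell is self-contained
df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

# df.mean() is computed per column and skips NaN: A -> 1.5, B -> 5.0, C -> 2.0
column_means = df.mean()

# each missing value is replaced by the mean of its own column
filled = df.fillna(value=column_means)

# the original dataframe is unchanged; "A" still has its missing value
still_missing_in_a = df['A'].isna().sum()
```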

# Great Job!