# Missing Data

- When pandas reads in missing values, it displays them as NaN.
- Missing datetime values are displayed as pd.NaT ("Not a Time").
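
A minimal sketch of how both markers can appear after a read. The CSV text, the column names, and the `sample` variable are made up for illustration and are not part of the notebook's data.

```python
import io

import pandas as pd

# Hypothetical two-row CSV; the empty fields in the second row are missing values.
csv_text = "name,score,seen_on\nTom,8.0,2020-01-15\nHugh,,\n"

# parse_dates asks pandas to treat seen_on as a datetime column.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["seen_on"])

print(sample)         # the missing score shows as NaN, the missing date as NaT
print(sample.dtypes)
```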

## Options for missing data

- Three options: keep it, drop it, or fill it in.

## Keeping it

- Easiest to do.
- Does not change the true data.
BUT
- Many methods do not support NaN.
- Often there are reasonable guesses for the missing values.
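
As a rough illustration of the "many methods do not support NaN" point, the sketch below compares a plain NumPy mean with NaN-aware alternatives. The `scores` array is an assumed toy example.

```python
import numpy as np
import pandas as pd

# Toy scores with one missing entry (illustrative values only).
scores = np.array([8.0, np.nan, 6.0, 7.0])

plain_mean = np.mean(scores)            # NaN propagates through the plain mean
nan_aware_mean = np.nanmean(scores)     # NaN-aware variant skips the missing entry
pandas_mean = pd.Series(scores).mean()  # pandas skips NaN by default (skipna=True)

print(plain_mean, nan_aware_mean, pandas_mean)
```

Whether a given library function propagates, skips, or rejects NaN varies, so it is worth checking before keeping missing values in place.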

## Dropping it

- Easy to do.
- Can be based on rules.
BUT
- Potential loss of data or useful info.
- Limits trained models on future data (they never see rows with missing values).

### Dropping a row vs. a column (feature)
- Dropping a row: makes sense when a lot of info on an instance is missing (the row is not very valuable).
    - Often good to calculate the percentage of rows dropped (are too many data points lost?).

- Dropping a column:
    - Makes sense if only a few instances have info for the column (the column is quite incomplete).
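
The percentage check mentioned above can be sketched as follows. The frame is a small stand-in rebuilt by hand from the movie-score sample used in this notebook, and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the movie-score sample (row 1 is entirely empty).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Share of missing values in each column, as a percentage.
pct_missing_per_column = df.isnull().mean() * 100

# Share of rows that would be lost by dropping every row with any null.
rows_dropped = len(df) - len(df.dropna())
pct_rows_dropped = 100 * rows_dropped / len(df)

print(pct_missing_per_column)
print(f"{pct_rows_dropped:.0f}% of rows would be dropped")
```

A high per-column percentage points toward dropping the column, while a high row-loss percentage suggests that dropping rows may cost too much data.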
    
## Fill it in
- Potential to save a lot of data for use in a training model.
BUT
- The hardest to do.
- The reasoning used to fill in data COULD lead to misleading findings, so be careful.

In [1]:
import numpy as np
import pandas as pd

In [2]:
np.nan # NumPy's missing-value marker (newer pandas versions also have pd.NA)

nan

In [3]:
pd.NA

<NA>

In [4]:
pd.NaT

NaT

In [6]:
# Typical comparisons should be avoided
np.nan == np.nan # Comparing two missing values for equality evaluates to False

False

In [7]:
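# Identity check: True here only because both sides are the same np.nan object.
# Prefer pd.isna() or np.isnan() for a reliable check.
np.nan is np.nan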

True
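
The identity check only works when the value is literally the `np.nan` object. A small sketch of value-based checks instead, where `parsed` is a hypothetical NaN produced elsewhere:

```python
import numpy as np
import pandas as pd

# A NaN created independently of np.nan (a different Python object).
parsed = float("nan")

print(parsed is np.nan)   # identity check fails for this NaN
print(np.isnan(parsed))   # value-based check on floats
print(pd.isna(parsed))    # pandas check; also handles None, pd.NA and pd.NaT
print(pd.isna(pd.NaT))
```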

In [8]:
df = pd.read_csv('movie_scores.csv')

In [9]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [10]:
# Returns a DataFrame of booleans: True wherever a value is null (checked element by element).
df.isnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [11]:
# The inverse: gives False wherever a value is null.
df.notnull()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True
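
The boolean frames above are often reduced to counts or per-row flags. A sketch under the assumption that `df` matches the sample shown earlier (rebuilt by hand here so the block runs on its own):

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the movie-score sample (row 1 is entirely empty).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

missing_per_column = df.isnull().sum()        # number of nulls in each column
missing_per_row = df.isnull().sum(axis=1)     # number of nulls in each row
rows_with_any_null = df.isnull().any(axis=1)  # True if a row has at least one null

print(missing_per_column)
print(df[rows_with_any_null])
```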


In [16]:
# All actors where we are not missing their pre-movie score
premovie_cond = df['pre_movie_score'].notnull() # notnull can be done on a series!

In [17]:
# Now we can use it as a condition to the dataframe
df[premovie_cond] # Only select rows where the pre-movie score is present.

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [19]:
premovie_cond = (df['pre_movie_score'].isnull() & df['first_name'].notnull()) # Where pre-movie score null and first name is present
df[premovie_cond]

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,


In [20]:
# KEEP THE DATA
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [21]:
# DROP THE DATA
help(df.dropna)

Help on method dropna in module pandas.core.frame:

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. versionchanged:: 1.0.0
    
           Pass tuple or list to drop on multiple axes.
           Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
        * 'any' : If any NA values are present, dro

In [22]:
# Will drop any rows with any missing values
df.dropna()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [23]:
# Only drop rows where every value is missing.
df.dropna(thresh=1) # Keep rows with at least 1 non-null value; drop the rest

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [25]:
df.dropna(thresh=5) # Keep only rows with at least 5 non-null values

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
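
One way to read `thresh` is as a minimum count of non-null values per row. The sketch below checks that reading by hand and shows `how="all"` as another way to drop only the fully empty rows. `df` is rebuilt from the sample shown earlier so the block is self-contained.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the movie-score sample (row 1 is entirely empty).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Count the non-null values in each row, then keep rows meeting the threshold.
non_null_counts = df.notnull().sum(axis=1)
manual = df[non_null_counts >= 5]

print(non_null_counts.tolist())
print(manual.equals(df.dropna(thresh=5)))

# how="all" drops a row only when every value in it is missing.
only_empty_rows_dropped = df.dropna(how="all")
print(only_empty_rows_dropped.index.tolist())
```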


In [26]:
# Axis argument
df.dropna(axis=1) # Drop columns that are missing any values.
# We usually want axis=0 (drop rows with null features); axis=1 is less common.
# Here every column has a null in row 1, so all columns are dropped and only the index remains.

0
1
2
3
4
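
Because a single empty row is enough to remove every column, it can help to drop fully empty rows before dropping columns. A sketch of that ordering, assuming `df` matches the sample shown earlier (rebuilt by hand here):

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the movie-score sample (row 1 is entirely empty).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Dropping columns directly removes all of them, since row 1 is null everywhere.
cols_dropped_directly = df.dropna(axis=1)
print(cols_dropped_directly.shape)

# Drop the fully empty row first, then drop columns that still contain nulls.
cleaned = df.dropna(how="all").dropna(axis=1)
print(cleaned.columns.tolist())
```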


In [28]:
# Consider where certain columns are null (subset = choose columns)
df.dropna(subset=['last_name'])

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


## FILL IN THE DATA

In [30]:
# Fill nulls with a default value! Typically we would not want this because it mixes up data types.
df.fillna('NEW VALUE!')

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!,NEW VALUE!
2,Hugh,Jackman,51.0,m,NEW VALUE!,NEW VALUE!
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
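
To make the data-type concern concrete, the sketch below compares a column's dtype before and after a string fill, then shows a per-column fill using a dict. The fill values `"unknown"` and `0` are arbitrary choices for illustration, and `df` is rebuilt from the sample shown earlier.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the movie-score sample (row 1 is entirely empty).
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# A string fill puts text into numeric columns, so their dtype changes.
filled = df.fillna("NEW VALUE!")
print(df["age"].dtype, "->", filled["age"].dtype)

# A dict fills each named column with its own value and leaves the others alone.
per_column = df.fillna({"first_name": "unknown", "pre_movie_score": 0})
print(per_column)
```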


In [32]:
# We can run fillna on a series
df['pre_movie_score'].fillna(0)

# To make the change permanent, assign the result back:
# df['pre_movie_score'] = df['pre_movie_score'].fillna(0)

0    8.0
1    0.0
2    0.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [34]:
# Fill with a calculated value
mean = df['pre_movie_score'].mean()

In [35]:
df['pre_movie_score'].fillna(mean)

0    8.0
1    7.0
2    7.0
3    6.0
4    7.0
Name: pre_movie_score, dtype: float64

In [37]:
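# Could do this, but it probably doesn't make sense
# numeric_only=True skips the text columns (recent pandas versions raise an error without it)
df.fillna(df.mean(numeric_only=True)) # Fill in averages for ALL numerical cols.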

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,52.75,,7.0,9.0
2,Hugh,Jackman,51.0,m,7.0,9.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [39]:
# Could do linear interpolation like:
# ser.interpolate()
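
A minimal sketch of what linear interpolation does on a Series. The values in `ser` are made up for illustration; the default method treats the entries as equally spaced.

```python
import numpy as np
import pandas as pd

# Toy series with two missing values between known endpoints.
ser = pd.Series([8.0, np.nan, np.nan, 14.0])

# Default method is linear: gaps are filled along a straight line
# between the nearest known values on either side.
interpolated = ser.interpolate()

print(interpolated)
```

Interpolation tends to make sense for ordered data such as time series, and less so for unordered rows like the actor table above.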

In [40]:
# Recap: np.nan, isnull(), notnull(), dropna(), fillna()