##### * Real-world data is often missing values for a variety of reasons
##### * Many machine learning models and statistical methods cannot work with missing data points, so we need to decide what to do with the missing data

#### Keeping the missing data

##### * Pros:
      - Does not manipulate or change the true data
      - Easiest to do

##### * Cons:
      - Many methods do not support NaN
      - Reasonable guesses for the missing values often exist and go unused

#### Dropping or Removing the missing data

##### * Pros:
      - Easy to do
      - Can be based on rules

##### * Cons:
      - Potential to lose a lot of data or useful information
      - Limits trained models for future data

#### Filling in the missing data
##### * Pros:
      - Potential to save a lot of data for use in training a model
##### * Cons:
      - Hardest to do and somewhat arbitrary
      - Potential to lead to false conclusions
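
The three options above can be compared side by side on a tiny example. This is only a sketch: the `scores` Series is made up for illustration and is not part of the movie data set, and filling with the mean is just one of several possible fill strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing entry (not from the original data set)
scores = pd.Series([8.0, np.nan, 6.0, 7.0])

kept = scores                           # keep: the NaN stays in place
dropped = scores.dropna()               # drop: remove the missing entry
filled = scores.fillna(scores.mean())   # fill: use the mean of the observed values

print(kept.isna().sum())   # number of missing entries still present
print(len(dropped))        # number of entries left after dropping
print(filled.tolist())     # values after filling
```

Note that `scores.mean()` skips NaN by default, which is why it can be used as the fill value here.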

## Pandas Operations

In [1]:
import numpy as np
import pandas as pd

In [2]:
np.nan

nan

In [3]:
pd.NA

<NA>

In [4]:
np.nan == np.nan

False

In [5]:
np.nan is np.nan

True

In [7]:
# check whether a variable holds a missing value
my_var = np.nan
pd.isna(my_var)         # prefer this over `my_var is np.nan` or `my_var == np.nan`

True
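
An identity check against `np.nan` only succeeds when the value is that exact object, so it can miss NaN values that come from elsewhere. The sketch below uses a hypothetical `other_nan` variable to show the difference; `pd.isna` is the more general check and also recognizes `pd.NA` and `None`.

```python
import numpy as np
import pandas as pd

other_nan = float("nan")    # a NaN that is not the np.nan object itself

is_same_object = other_nan is np.nan    # False: a different object
is_equal = other_nan == np.nan          # False: NaN never compares equal

print(is_same_object, is_equal)
print(pd.isna(other_nan))   # True
print(pd.isna(np.nan))      # True
print(pd.isna(pd.NA))       # True
print(pd.isna(None))        # True
```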

In [8]:
df = pd.read_csv('/home/ziya/Documents/machine/03-Pandas/movie_scores.csv')

In [9]:
df.head()

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [10]:
df

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
1,,,,,,
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [13]:
df.isnull()         # returns True wherever a value is missing

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,False,False,False,False,False,False
1,True,True,True,True,True,True
2,False,False,False,False,True,True
3,False,False,False,False,False,False
4,False,False,False,False,False,False


In [15]:
df.notnull()        # returns True wherever a value is present

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,True,True,True,True,True,True
1,False,False,False,False,False,False
2,True,True,True,True,False,False
3,True,True,True,True,True,True
4,True,True,True,True,True,True


In [19]:
df[df['pre_movie_score'].notnull()]         # keep only the rows where pre_movie_score is not null

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [22]:
df[(df['pre_movie_score'].isnull()) & (df['first_name'].notnull())]         # combining conditions

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
2,Hugh,Jackman,51.0,m,,
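
Because `isnull()` returns booleans, the masks can also be summed to count missing values. The sketch below rebuilds the frame inline from the output displayed above (an assumption, since the CSV file itself is not available here) and counts missing values per column and per row.

```python
import numpy as np
import pandas as pd

# Reconstructed from the displayed output; the real CSV may differ
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

missing_per_column = df.isnull().sum()          # True counts as 1
missing_per_row = df.isnull().sum(axis=1)       # same count, row by row
fully_empty_rows = df[df.isnull().all(axis=1)]  # rows where every value is missing

print(missing_per_column)
print(missing_per_row)
print(fully_empty_rows.index.tolist())
```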


In [23]:
# KEEP DATA
# DROP DATA
# FILL DATA

In [25]:
help(df.dropna)         # show the docstring for dropna

Help on method dropna in module pandas.core.frame:

dropna(axis: 'Axis' = 0, how: 'str' = 'any', thresh=None, subset: 'IndexLabel' = None, inplace: 'bool' = False) method of pandas.core.frame.DataFrame instance
    Remove missing values.
    
    See the :ref:`User Guide <missing_data>` for more on which values are
    considered missing, and how to work with missing data.
    
    Parameters
    ----------
    axis : {0 or 'index', 1 or 'columns'}, default 0
        Determine if rows or columns which contain missing values are
        removed.
    
        * 0, or 'index' : Drop rows which contain missing values.
        * 1, or 'columns' : Drop columns which contain missing value.
    
        .. versionchanged:: 1.0.0
    
           Pass tuple or list to drop on multiple axes.
           Only a single axis is allowed.
    
    how : {'any', 'all'}, default 'any'
        Determine if row or column is removed from DataFrame, when we have
        at least one NA or all NA.
    
      

In [28]:
df.dropna(thresh=1)             # keep only the rows that have at least 1 non-NaN value

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [29]:
df.dropna(thresh=5)             # keep only the rows that have at least 5 non-NaN values

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
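
`thresh` is a minimum count of non-missing values a row needs in order to be kept. As a rough sketch, the same selection can be written by counting non-null values per row; the frame is rebuilt inline from the displayed output, which is an assumption about the CSV contents.

```python
import numpy as np
import pandas as pd

# Reconstructed from the displayed output; the real CSV may differ
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

non_null_counts = df.notnull().sum(axis=1)   # non-missing values in each row
by_thresh = df.dropna(thresh=5)              # built-in threshold filter
by_mask = df[non_null_counts >= 5]           # equivalent boolean mask

print(non_null_counts.tolist())
print(by_thresh.index.tolist())
print(by_thresh.equals(by_mask))
```

Row 2 has only four non-missing values, so it survives `thresh=1` but not `thresh=5`.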


In [31]:
df.dropna(axis=1)           # drop every column that has at least one missing value

0
1
2
3
4


In [33]:
df.dropna()                 # default: drop every row that has at least one missing value

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0


In [35]:
df.dropna(subset=['last_name'])             # drop the rows where last_name is missing

Unnamed: 0,first_name,last_name,age,sex,pre_movie_score,post_movie_score
0,Tom,Hanks,63.0,m,8.0,10.0
2,Hugh,Jackman,51.0,m,,
3,Oprah,Winfrey,66.0,f,6.0,8.0
4,Emma,Stone,31.0,f,7.0,9.0
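
The `how` parameter from the help text can be combined with `axis` and `subset`. The sketch below is illustrative only and rebuilds the frame inline from the displayed output, which is an assumption about the CSV contents.

```python
import numpy as np
import pandas as pd

# Reconstructed from the displayed output; the real CSV may differ
df = pd.DataFrame({
    "first_name": ["Tom", np.nan, "Hugh", "Oprah", "Emma"],
    "last_name": ["Hanks", np.nan, "Jackman", "Winfrey", "Stone"],
    "age": [63.0, np.nan, 51.0, 66.0, 31.0],
    "sex": ["m", np.nan, "m", "f", "f"],
    "pre_movie_score": [8.0, np.nan, np.nan, 6.0, 7.0],
    "post_movie_score": [10.0, np.nan, np.nan, 8.0, 9.0],
})

# Drop only the rows in which every value is missing
no_empty_rows = df.dropna(how="all")

# Drop the rows in which both score columns are missing
has_a_score = df.dropna(subset=["pre_movie_score", "post_movie_score"], how="all")

# Drop only the columns that are entirely missing (none are, so all columns stay)
no_empty_cols = df.dropna(axis=1, how="all")

print(no_empty_rows.index.tolist())
print(has_a_score.index.tolist())
print(no_empty_cols.columns.tolist())
```

With `how="all"`, the fully empty row 1 is removed while row 2, which is only missing its scores, is kept.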
