# Missing Values
__Math 3080: Fundamentals of Data Science__

Reading:
* [McKinney, *Python for Data Analysis*, Chapter 6](https://wesmckinney.com/book/accessing-data)
* Chapter 7

Class notes are available on GitHub and are updated there automatically as changes are made. A link to the repository is on Canvas.

-----
## Outline
* Missing data
  * Locating/Identifying missing data
  * Ways to handle missing data

-----

CRISP-DM: Cross-Industry Standard Process for Data Mining
1. Business Understanding
2. Data Understanding
3. Data Preparation <--- Starting this today
4. Modeling  
5. Evaluation
6. Deployment

Before data can be analyzed or modeled, it needs to be prepared. We call this step __preprocessing__.
1. Cleaning the Data
    * Handling missing data <--- Today's subject
    * Cleaning Labels
    * Formats
      * str to int/float
      * DateTime
2. Data Wrangling
    * Encoding categorical data
    * Rearranging data
    * Combining datasets

-----
## Missing Data

Missing data can show up in a dataset in several forms:
* `NaN` (Not a Number), the standard missing-value marker in pandas and NumPy
* Blank cells (pandas reads these in as `NaN` automatically)
* Large or otherwise unreasonable sentinel values (such as 9999 or -1)
* Characters/strings, such as `-`, `N/A`, or `missing`
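
As a minimal sketch of how such markers can be converted to `NaN` at load time, the example below reads a small made-up CSV (the column names and values are hypothetical, not taken from the Titanic data) and passes the extra markers to the `na_values` parameter of `pd.read_csv`. Blank cells and `N/A` are among the markers pandas treats as missing by default.

```python
import io
import pandas as pd

# Hypothetical toy data (not the Titanic file) with several missing-value markers
csv_text = (
    "name,age,city\n"
    "Ann,34,Provo\n"
    "Bob,,Ogden\n"
    "Cy,9999,-\n"
    "Di,N/A,missing\n"
)

# Blank cells and 'N/A' are read as NaN by default;
# na_values adds our own markers to that default list
df = pd.read_csv(io.StringIO(csv_text), na_values=["9999", "-", "missing"])

print(df)
print(df.isna().sum())
```

Note that `na_values` given as a list applies to every column, so this approach only fits when a marker such as 9999 can never be a legitimate value anywhere in the file.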

We can deal with missing values in two ways:
1. Dropping them
2. Filling them in with a reasonable value

In [None]:
import numpy as np
import pandas as pd

titanic = pd.read_csv('./data/titanic.csv')
display(titanic)

### Dropping missing values
When to drop a variable or an observation:
* When so much data is missing that the variable or observation no longer provides meaningful information
    * If more than half of a variable is missing, consider dropping the variable
    * If less than 10% of a variable is missing, consider dropping the observations with the missing values
* When entries are duplicated

The exact thresholds are up to the data scientist's discretion, but these are good ballpark numbers.
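
One possible way to apply these ballpark thresholds is sketched below. The DataFrame and its column names are made up for illustration, and the cutoffs (0.5 and 0.10) are simply the numbers from the list above, not fixed rules.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame with 20 rows
df = pd.DataFrame({
    "mostly_missing": [np.nan] * 12 + [1.0] * 8,  # 60% missing
    "rarely_missing": [np.nan] + [2.0] * 19,      # 5% missing
    "complete": list(range(20)),                  # 0% missing
})

# Fraction of missing values in each column
missing_frac = df.isna().mean()

# More than half missing -> drop the variable (column)
cols_to_drop = missing_frac[missing_frac > 0.5].index.tolist()

# Some, but less than 10%, missing -> drop the affected observations (rows)
cols_to_drop_rows = missing_frac[
    (missing_frac > 0) & (missing_frac < 0.10)
].index.tolist()

cleaned = df.drop(columns=cols_to_drop).dropna(subset=cols_to_drop_rows)

print(missing_frac)
print(cleaned.shape)
```

Columns that fall between the two thresholds are left alone here; those are candidates for imputation rather than dropping.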

*Note*: Removing data can unintentionally remove other valuable data. For example, dropping a whole observation because a couple of its values are missing also discards that observation's non-missing values for the other variables. So, before dropping missing data, consider how the drop will affect the other variables.
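
To make this cost concrete, the sketch below counts how many non-missing values would be thrown away by dropping every incomplete row. The data are hypothetical, and this is just one way to check before calling `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each incomplete row is missing only one value
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, 4.0],
    "y": [10.0, 20.0, np.nan, 40.0],
    "z": [100.0, 200.0, 300.0, 400.0],
})

# Rows that df.dropna() would remove
rows_with_missing = df[df.isna().any(axis=1)]

# Non-missing values that would be discarded along with those rows
values_lost = int(rows_with_missing.notna().sum().sum())

print(f"Rows dropped: {len(rows_with_missing)}")
print(f"Non-missing values lost: {values_lost}")
```

Here, dropping two rows to remove two missing values also discards four good values.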

In [None]:
## Identify Missing Values
print(titanic.isna().sum())

In [None]:

display(titanic.describe())
display(titanic['embarked'].value_counts())
display(titanic['deck'].value_counts())

In [None]:
## 1. 'age' variable has values of 9999 - these are ridiculous ages and should be treated as missing
## 2. 'embarked' and 'embark_town' variables have '9999' values - these will be treated as missing
## 3. 'deck' variable has NaN values - pandas already recognizes these as missing, so we don't need to do anything

titanic.replace({'age':9999}, np.nan, inplace=True)
titanic.replace({'embarked':'9999'}, np.nan, inplace=True)
titanic.replace({'embark_town':'9999'}, np.nan, inplace=True)

print(titanic.isna().sum())
print("Size of titanic DF: ", titanic.shape)

In [None]:
## How should we address missing values?
##   'deck' variable has many missing values - we will drop this column
titanic.drop('deck', axis=1, inplace=True)

In [None]:
##   'embarked' only has two missing values - can we drop these two observations?
titanic[titanic['embarked'].isna()]

In [None]:
##   'embarked' only has two missing values and we're not likely to lose any crucial data - we will drop these two observations
titanic.dropna(subset=['embarked'], inplace=True)

#### Duplicate Values
Check for duplicate rows using `df.duplicated()`, which returns a Boolean Series marking each row that repeats an earlier row.

Drop duplicated entries with `df.drop_duplicates()`.
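
A small sketch of how these two methods behave, using a hypothetical toy DataFrame. The `keep` parameter shown here is part of pandas; by default only the later copies are flagged, so the first occurrence survives.

```python
import pandas as pd

# Hypothetical toy data: the rows at index 0 and 2 are identical
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "age": [34, 25, 34, 41],
})

# Default keep='first': only the later copy is flagged
print(df.duplicated().tolist())            # [False, False, True, False]

# keep=False flags every copy, which helps when inspecting duplicates
print(df.duplicated(keep=False).tolist())  # [True, False, True, False]

# Keeps the first copy and drops the rest
deduped = df.drop_duplicates()
print(deduped.index.tolist())              # [0, 1, 3]
```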

In [None]:
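# Boolean mask of rows that repeat an earlier row
# (named is_duplicate to avoid shadowing Python's built-in filter)
is_duplicate = titanic.duplicated()
display(titanic[is_duplicate])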

In [None]:
titanic.drop_duplicates(inplace=True)
display(titanic)

### Imputing missing values
We can also fill in a missing value with another value that makes sense. Common choices are:
* Constant value
* A calculated value (Mean/Median/Min/Max of the variable)
* Interpolation (good for values that follow a sequence, such as a time series)
    * Average of surrounding values (good when the variable is continuous)
    * Forward fill: carry the previous value forward (good when the variable is discrete)
    * Back fill: carry the next value backward (good when the variable is discrete)

We can fill in values using the `.fillna()`, `.ffill()`, `.bfill()`, and `.interpolate()` methods:
* To fill all missing values in the dataframe with a specific value:
    ```python
    val = 7.0
    df = df.fillna(val)
    ```

* To fill specific columns, each with its own value, use a dictionary that maps column names to fill values:
    ```python
    fill_values = {'col1':0, 'col2':5, 'col3':df['col3'].mean()}
    df = df.fillna(fill_values)
    ```

* To perform a forward fill: 
    ```python
    df['col'] = df['col'].ffill()
    ```

* To perform a back fill: 
    ```python
    df['col'] = df['col'].bfill()
    ```

* To perform an interpolation:
    ```python
    df['col'] = df['col'].interpolate()
    ```
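
To compare what these methods actually produce, here is a minimal sketch on a hypothetical four-value Series with a gap in the middle. `interpolate()` is used with its default linear method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with a two-value gap
s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())           # [1.0, 1.0, 1.0, 4.0]
print(s.bfill().tolist())           # [1.0, 4.0, 4.0, 4.0]
print(s.interpolate().tolist())     # [1.0, 2.0, 3.0, 4.0]
print(s.fillna(s.mean()).tolist())  # [1.0, 2.5, 2.5, 4.0]
```

Forward and back fill only copy existing values, while interpolation and mean imputation can create values that never appeared in the data.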

In [None]:
## How should we address these missing values?
##   'age' variable has some missing values - we will impute these with the mean age

mean_age = titanic['age'].mean()
titanic.fillna({'age': mean_age}, inplace=True)
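
The mean is only one of the calculated values listed earlier. As a hedged illustration of why the choice can matter, the sketch below uses a made-up set of ages (not the Titanic data) with one extreme value and compares the mean and the median as fill values.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one extreme value and one missing value
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 98.0, np.nan])

mean_value = ages.mean()      # 38.0, pulled upward by the 98
median_value = ages.median()  # 24.0, unaffected by the 98

filled_with_mean = ages.fillna(mean_value)
filled_with_median = ages.fillna(median_value)

print(filled_with_mean.tolist())
print(filled_with_median.tolist())
```

When a variable is strongly skewed, the median may be the more representative fill value; for roughly symmetric data the two give similar results.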

In [None]:
# Double-check that there are no missing values left
print(titanic.isna().sum())

In [None]:
display(titanic)