# Missing Data 

* 72% of corporations believe that data quality issues hinder customer trust and perception

#### Workflow for treating missing values
* 1) Convert all missing values to null values
* 2) Analyze the amount and type of missingness in the data
* 3) Appropriately delete or impute missing values
* 4) Evaluate and compare the performance of the treated/imputed dataset.

#### NULL value operations
* There are two types of null values for consideration:
    * 1) **`None`** 
        * `None or True` #returns `True`
        * does not support arithmetic operations
            * returns `TypeError`
        * `type(None)` = `NoneType`
            * supports logical operations only
        * `None == None` returns `True`
        * `np.isnan(None)` raises `TypeError`
    * 2) **`np.nan`**
        * `np.nan or True` #returns `np.nan`
        * `np.nan` does not return an error with arithmetic operations (it returns `np.nan`)
        * Note that `np.nan` is equivalent to undefined and any operation on 'undefined' is undefined
        * `type(np.nan)` = `float` 
            * supports both logical and arithmetic operations
        * `np.nan == np.nan` returns `False`
            * "an undefined number cannot be equal and therefore is false"
            * instead, the correct way to check for `NaN` is:
                * `np.isnan(np.nan)`
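
The differences listed above can be checked directly in a short session. This is a minimal sketch; the variable names are only illustrative, and `pd.isnull()` is added here as one convenient way to detect both kinds of null value.

```python
import numpy as np
import pandas as pd

# None supports logical operations but not arithmetic
none_or_true = None or True            # evaluates to True

try:
    None + 1
except TypeError as err:
    print("None + 1 ->", type(err).__name__)

# np.isnan() rejects None as well
try:
    np.isnan(None)
except TypeError as err:
    print("np.isnan(None) ->", type(err).__name__)

# np.nan is a float, so arithmetic works and propagates NaN
nan_plus_one = np.nan + 1              # nan
nan_equals_itself = np.nan == np.nan   # False

# pd.isnull() recognizes both kinds of null value
both_detected = pd.isnull(None) and pd.isnull(np.nan)
print(none_or_true, nan_plus_one, nan_equals_itself, both_detected)
```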
                
#### Handling Missing Values
* Missing values are usually filled with dummy values like `NA`, `-`, `.`, etc.
* Replace missing values in `read_csv()` call:
    * `college = pd.read_csv('college.csv', na_values='.')`
* Look for incorrect 0 values ("hidden" missing values):
    * `diabetes.BMI[diabetes.BMI ==0]`
    * Replace incorrect zero values with NaN:
        * `diabetes.loc[diabetes.BMI == 0, 'BMI'] = np.nan` (using `.loc` avoids chained assignment, which may not modify the original DataFrame)
    * Re-check for NaN values:
        * `diabetes.BMI[np.isnan(diabetes.BMI)]`
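
As a rough illustration of both steps, the sketch below reads a small in-memory CSV instead of a real file. The column names and values are made up for the example; only `na_values` and the zero-to-NaN replacement come from the notes above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for a real file; '.' marks a missing value
csv_text = "name,BMI,Glucose\nA,33.6,148\nB,0,85\nC,.,183\nD,23.3,.\n"

# Step 1: convert the placeholder to NaN while reading
df = pd.read_csv(io.StringIO(csv_text), na_values=".")

# Step 2: a BMI of zero is implausible, so treat it as a hidden missing value
df.loc[df["BMI"] == 0, "BMI"] = np.nan

# Re-check: count the NaN values per column
print(df.isnull().sum())
```

Without `na_values`, the `.` entries would be read as strings and the affected columns would not be numeric.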
        
#### Analyze the amount of missingness
* Find the total number and percentage of missing values in each column of the dataset
* `.isnull()` = `.isna()`
    * `airquality_nullity = airquality.isnull()` # returns a Boolean DataFrame that can be called a nullity or dummy DataFrame
    * total missing values: `airquality_nullity.sum()`
    * percentage missing values: `airquality_nullity.mean() * 100`
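
A small sketch of these two summaries, using a made-up stand-in for the `airquality` data (the column names and values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the airquality dataset
airquality = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Solar": [190.0, 118.0, np.nan, 313.0],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})

# Boolean DataFrame: True where a value is missing
airquality_nullity = airquality.isnull()

# True counts as 1, so sum() gives counts and mean() gives fractions
missing_count = airquality_nullity.sum()
missing_percent = airquality_nullity.mean() * 100

print(pd.DataFrame({"count": missing_count, "percent": missing_percent}))
```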

* **`Missingno`** package:
    * Package for graphical analysis of missing values
    * `import missingno as msno`
    * **Nullity bar:** Bar chart of the number of non-missing values in each column
    * `msno.bar(airquality)`
    * **Nullity matrix:** Visualize the locations of missing values in the dataset
    * Allows us to quickly analyze the patterns in missing data
    * `msno.matrix(airquality)`
    * The sparkline on the right summarizes the general shape of data completeness and marks the rows with the most and the fewest missing values, labeling each with its number of non-missing columns.
    * **Nullity matrix for time-series data:**
    * `msno.matrix(airquality, freq='M')`
        * Clearly shows during which season missingness is higher
    * Slice the DataFrame to a specific time period for even more clarity:
        * `msno.matrix(airquality.loc['May-1976':'Jul-1976'], freq='M')`
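
The seasonal pattern that the time-series matrix shows visually can also be checked numerically with plain pandas. The sketch below uses a made-up daily series with an artificial gap in June; the data and the `Ozone` column name are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering May through July 1976
dates = pd.date_range("1976-05-01", "1976-07-31", freq="D")
airquality = pd.DataFrame(
    {"Ozone": np.arange(len(dates), dtype=float)}, index=dates
)

# Introduce an artificial ten-day gap in June
airquality.loc["1976-06-10":"1976-06-19", "Ozone"] = np.nan

# Count the missing values in each calendar month
months = airquality.index.to_period("M")
monthly_missing = airquality["Ozone"].isnull().groupby(months).sum()
print(monthly_missing)
```

This gives one number per month, which is handy when the matrix plot is too dense to read.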
        
#### Is the data missing at random?
* It turns out missingness has a pattern, and often for a good reason

* **Types of missingness:**
    * 1) **Missing Completely At Random (MCAR):** Missingness has no relationship with any values, observed or missing.
        * Visually demonstrate MCAR with `msno.matrix(diabetes)`
    * 2) **Missing At Random (MAR):** There is a systematic relationship between missingness and other observed data, but not the missing data.
        * It's important here to note that missingness is dependent only on the observed values and not the missing values for MAR.
    * 3) **Missing Not At Random (MNAR):** There is a relationship between missingness and its values, missing or non-missing
        * often easier to spot after sorting by the column in question and re-plotting the nullity matrix: `sorted_diabetes = diabetes.sort_values('Serum_Insulin')`
* Identifying the missingness type helps narrow down the methodologies you can use for treating missing data.
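
To make the three definitions concrete, the sketch below simulates each mechanism on synthetic data. The variables, probabilities, and thresholds are all made-up assumptions chosen only to show the differences, not values from any real dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 80, n)        # always observed
insulin = rng.normal(100, 30, n)    # the column that will get missing values
df = pd.DataFrame({"age": age, "insulin": insulin})

# MCAR: every value has the same chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on an observed column (age), not on insulin itself
mar_mask = rng.random(n) < np.where(age > 60, 0.4, 0.1)

# MNAR: missingness depends on the value that goes missing (high insulin)
mnar_mask = rng.random(n) < np.where(insulin > 130, 0.6, 0.1)

# Compare the mean of the values that remain observed with the true mean
observed_means = {
    name: df.loc[~mask, "insulin"].mean()
    for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]
}
print("true mean:", round(df["insulin"].mean(), 1))
print({name: round(value, 1) for name, value in observed_means.items()})
```

In this setup only the MNAR case biases the observed mean, because the high values are the ones that disappear. Under MAR the missing rate differs between age groups, but here insulin is generated independently of age, so its observed mean stays close to the true one.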

#### Finding patterns in missing data
* **Finding correlations between missingness:**
    * `msno.heatmap` or correlation maps
    * missingness dendrograms
* **Missingness heatmap:**
    * Graph of correlation of missing values between columns
    * Explains the dependencies of missingness between columns
    * `msno.heatmap(diabetes)`
        * in the graph, the redder the color, the more negative the correlation (when one column is missing, the other tends to be present)
        * the bluer the color, the more positive the correlation (the columns tend to be missing together)
* **Missingness dendrogram:**
    * Tree diagram of missingness
    * Tree diagram groups similar objects in close branches
    * Missingness dendrogram describes correlation of variables by grouping them
    * `msno.dendrogram(diabetes)`
        * to interpret this graph, read it from the top down
        * cluster leaves that are linked together at a distance of zero fully predict one another's presence
            * one variable might always be empty, while the other is filled
            * or, they may always appear missing together (or filled together)
* **Visualizing missingness across a variable:**
    * Visualize how missingness of a variable changes against another variable
    * To create this graph, we will use the Matplotlib library. However, Matplotlib skips all missing values while plotting. Therefore, we first need a function that fills in dummy values for all the missing values in the DataFrame before plotting:

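```
from numpy.random import rand

# The default scaling_factor of 0.075 is an assumed value; adjust as needed
def fill_dummy_values(df, scaling_factor=0.075):
    # Create copy of dataframe
    df_dummy = df.copy(deep=True)
    # Iterate over each column (assumes all columns are numeric)
    for col_name in df_dummy:

        # Get column, mask of missing values, and range
        col = df_dummy[col_name]
        col_null = col.isnull()
        num_nulls = col_null.sum()
        if num_nulls == 0:
            continue  # nothing to fill in this column
        col_range = col.max() - col.min()

        # Shift random values below zero, then scale them so the
        # dummy values land below the column minimum
        dummy_values = rand(num_nulls) - 2
        dummy_values = dummy_values * scaling_factor * col_range + col.min()

        # Assign dummy values to the missing positions
        df_dummy.loc[col_null, col_name] = dummy_values
    return df_dummy
```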

```
# Create dummy dataframe
diabetes_dummy = fill_dummy_values(diabetes, scaling_factor=0.075)  # 0.075 is an assumed value
# Get missing values of both columns for coloring
nullity = diabetes.Serum_Insulin.isnull() | diabetes.BMI.isnull()  # Boolean Series: True if either value is missing
# Generate scatter plot
diabetes_dummy.plot(x='Serum_Insulin', y='BMI', kind ='scatter', alpha=0.5, c=nullity, cmap='rainbow')
```
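
The correlations that the missingness heatmap displays earlier in this section can be approximated with plain pandas by correlating the nullity indicators. This is a sketch on made-up data, not a copy of missingno's implementation; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two columns missing together, one missing in the other rows
df = pd.DataFrame({
    "Serum_Insulin": [np.nan, 94.0, np.nan, 168.0, np.nan, 88.0],
    "Skin_Fold": [np.nan, 23.0, np.nan, 35.0, np.nan, 32.0],
    "BMI": [33.6, np.nan, 23.3, np.nan, 28.1, np.nan],
    "Glucose": [148.0, 85.0, 183.0, 89.0, 137.0, 116.0],
})

# 1 where a value is missing, 0 where it is present
nullity = df.isnull().astype(int)

# Columns that are never (or always) missing have no variance, so drop them
nullity = nullity.loc[:, nullity.nunique() > 1]

# Pairwise correlation of missingness between columns
nullity_corr = nullity.corr()
print(nullity_corr)
```

A value near 1 means two columns tend to be missing together, and a value near -1 means one tends to be present when the other is missing.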

#### When and how to delete missing data:
* **Types of deletion:**
    * 1) **Pairwise deletion:** only the missing values are skipped during calculations
    * 2) **Listwise deletion:** the complete row is deleted
    * **Note:** Both types of deletion should be used only when the data is MCAR
    
* **Listwise deletion or complete case:** 
    * `diabetes.dropna(subset=['Glucose'], how='any', inplace=True)`
    * recommended to only use when number of missing values is very small
    * disadvantage: loss of data
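
The difference between the two deletion types shows up even on a tiny example. The sketch below uses made-up values; pandas reductions such as `mean()` skip missing values by default, which corresponds to pairwise deletion.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "Glucose": [148.0, np.nan, 183.0, 89.0],
    "BMI": [33.6, 26.6, np.nan, 28.1],
})

# Pairwise deletion: each column's mean skips only its own missing values
pairwise_means = df.mean()

# Listwise deletion: drop every row that has a missing value in these columns
complete_cases = df.dropna(subset=["Glucose", "BMI"], how="any")
listwise_means = complete_cases.mean()

print(pairwise_means)
print(listwise_means)
print("rows kept:", len(complete_cases), "of", len(df))
```

Listwise deletion keeps only two of the four rows here, which illustrates the loss of data noted above.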
