# Missing Data 

* 72% of corporations believe that data quality issues hinder customer trust and perception

#### Workflow for treating missing values
* 1) Convert all missing values to null values
* 2) Analyze the amount and type of missingness in the data
* 3) Appropriately delete or impute missing values
* 4) Evaluate and compare the performance of the treated/imputed dataset.

#### NULL value operations
* There are two types of null values for consideration:
    * 1) **`None`** 
        * `None or True` #returns `True`
        * does not support arithmetic operations
            * returns `TypeError`
        * `type(None)` = `NoneType`
            * supports logical operations only
        * `None == None` returns `True`
        * `np.isnan(None)` raises `TypeError` (`np.isnan` only accepts numeric input)
    * 2) **`np.nan`**
        * `np.nan or True` #returns `np.nan`
        * `np.nan` does not return an error with arithmetic operations (it returns `np.nan`)
        * Note that `np.nan` is equivalent to undefined and any operation on 'undefined' is undefined
        * `type(np.nan)` = `float` 
            * supports both logical and arithmetic operations
        * `np.nan == np.nan` returns `False`
            * an undefined value is not equal to anything, including itself, so the comparison is `False`
            * instead, the correct way to check for `NaN` is:
                * `np.isnan(np.nan)`
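
A minimal sketch of the comparisons listed above, assuming only NumPy is installed. The helper `raises_type_error` is a hypothetical name introduced for illustration and is not part of any library.

```python
import numpy as np


def raises_type_error(func):
    """Hypothetical helper: True if calling func() raises TypeError."""
    try:
        func()
    except TypeError:
        return True
    return False


# None: logical operations work, arithmetic does not
none_or_true = None or True                              # True
none_equals_none = (None == None)                        # True
none_add_fails = raises_type_error(lambda: None + 1)     # arithmetic raises TypeError
isnan_none_fails = raises_type_error(lambda: np.isnan(None))  # also raises TypeError

# np.nan: a float, so arithmetic works but always yields nan
nan_or_true = np.nan or True        # nan is truthy, so `or` returns nan
nan_plus_one = np.nan + 1           # nan
nan_equals_nan = (np.nan == np.nan) # False
nan_check = np.isnan(np.nan)        # True, the reliable way to test for NaN

print(type(None), type(np.nan))
print(none_or_true, nan_or_true, nan_equals_nan, nan_check)
```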
                
#### Handling Missing Values
* Missing values are often encoded with placeholder values such as `NA`, `-`, `.`, etc.
* Convert these placeholders to `NaN` in the `read_csv()` call:
    * `college = pd.read_csv('college.csv', na_values='.')`
* Look for incorrect 0 values ("hidden" missing values):
    * `diabetes.BMI[diabetes.BMI == 0]`
    * Replace incorrect zero values with NaN (use `.loc` to avoid chained assignment):
        * `diabetes.loc[diabetes.BMI == 0, 'BMI'] = np.nan`
    * Re-check for NaN values:
        * `diabetes.BMI[np.isnan(diabetes.BMI)]`
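
A small, self-contained sketch of both steps. The inline CSV text and its column names are made up for illustration and stand in for files such as `college.csv`; they are not the real datasets.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical data: '.' marks a missing score, and a BMI of 0 is physically impossible
csv_text = "name,BMI,score\nA,23.5,10\nB,0,.\nC,31.2,7\n"

# Step 1: turn the '.' placeholder into NaN while reading
df = pd.read_csv(io.StringIO(csv_text), na_values='.')

# Step 2: find "hidden" missing values encoded as 0
hidden_zeros = df.loc[df['BMI'] == 0, 'BMI']
print(hidden_zeros)

# Step 3: replace them with NaN, using .loc to avoid chained assignment
df.loc[df['BMI'] == 0, 'BMI'] = np.nan

# Step 4: re-check
print(df[df['BMI'].isna()])
```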
        
#### Analyze the amount of missingness
* Find the total number and percentage of missing values in each column of the dataset
* `.isnull()` = `.isna()`
    * `airquality_nullity = airquality.isnull()` # returns a Boolean DataFrame, often called a nullity or dummy DataFrame
    * total missing values: `airquality_nullity.sum()`
    * percentage missing values: `airquality_nullity.mean() * 100`
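
A short sketch of the count and percentage calculation. The tiny `airquality` frame below is invented for illustration and only borrows the name of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the airquality dataset
airquality = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Solar": [190.0, 118.0, np.nan, 313.0],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})

# Boolean DataFrame: True where a value is missing
airquality_nullity = airquality.isnull()

# True counts as 1, so sum() gives counts and mean() gives fractions
missing_count = airquality_nullity.sum()
missing_percent = airquality_nullity.mean() * 100

print(missing_count)
print(missing_percent)
```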

* **`Missingno`** package:
    * Package for graphical analysis of missing values
    * `import missingno as msno`
    * **Nullity bar:** Visualize the number of non-missing values in each column
    * `msno.bar(airquality)`
    * **Nullity matrix:** Visualize the locations of missing values in the dataset
    * Allows us to quickly analyze the patterns in missing data
    * `msno.matrix(airquality)` (pass the DataFrame itself, not its name as a string)
    * The sparkline on the right summarizes the general shape of data completeness and points out the row with the minimum number of null values in the DataFrame as well as the total count of columns at the bottom.
    * **Nullity matrix for time-series data:**
    * `msno.matrix(airquality, freq='M')` (requires a `DatetimeIndex`)
        * Clearly shows which season has a higher amount of missingness
    * Slice the DataFrame to a specific time period for even more clarity:
        * `msno.matrix(airquality.loc['May-1976': 'Jul-1976'], freq='M')`
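
As a numeric counterpart to the time-series matrix, the sketch below counts missing values per month and slices a date range with the same partial-string style used above. It uses only pandas (not `missingno`), and the daily `airquality` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series covering May to July 1976
dates = pd.date_range("1976-05-01", "1976-07-31", freq="D")
ozone = pd.Series(np.arange(len(dates), dtype=float), index=dates)

# Simulate a six-day sensor outage in June
ozone.loc["1976-06-10":"1976-06-15"] = np.nan
airquality = pd.DataFrame({"Ozone": ozone})

# Missing values per calendar month
monthly_missing = airquality.isnull().groupby(airquality.index.to_period("M")).sum()
print(monthly_missing)

# Partial-string slicing on a DatetimeIndex includes the whole end month
may_to_june = airquality.loc["May-1976":"Jun-1976"]
print(len(may_to_june))
```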
        
   
