# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 1)}}$
## $\color{purple}{\text{Introduction to Missingness}}$

### $\color{purple}{\text{Colab Environmental Setup}}$
We will be saving and reading files. If you are following along in Colab, it will be a lot easier to mount your Google Drive.


In [None]:
from google.colab import drive
drive.mount('/content/drive')
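import os

# Create the tutorial folder on first run (no error if it already exists), then work from it.
os.makedirs('/content/drive/My Drive/missingness_tutorial', exist_ok=True)
os.chdir('/content/drive/My Drive/missingness_tutorial')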

### $\color{purple}{\text{Libraries for this lesson}}$

In [None]:
import pandas as pd
import numpy as np

## $\color{purple}{\text{Nomenclature}}$

* columns
  * features
  * variables

* rows
  * observations
  * cases
  * records

Below is a little mnemonic `DataFrame` that encapsulates these synonyms:

In [None]:
pd.DataFrame(np.random.randn(4, 3), columns=['columns', 'features', 'variables'], index=['rows', 'observations', 'cases', 'records'])

## $\color{purple}{\text{Identifying Missing Features}}$
$\color{red}{\Large{\text{ ⚠}}}$ Always examine your data and determine which values are actually missing

  * Values that are missing:
      * CSV files with blanks
      * NULL in a database
      * Impossible Values
  * Values that are not necessarily missing:
      * N/A
        * Not Available - missing
        * Not Applicable - maybe not missing
      * NaN

| Patient | Sex| Pregnant | Testicular Cancer |
|---------|----|----------|-------------------|
| Mary    | F  |  N       |   N/A             |
| John    | M  |  N/A     |    N              |


Is N/A really missing or just Not Applicable in this case?
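As a minimal sketch of why this matters in practice: by default, `pandas.read_csv` converts strings such as `N/A` to NaN, so "Not Applicable" and a genuinely blank field become indistinguishable. The small CSV below is an illustrative variant of the table above (John's `Pregnant` field is left blank on purpose), and the names `raw`, `default_df`, and `strict_df` are assumptions made for this example.

```python
import io
import pandas as pd

# Illustrative variant of the table above: one "N/A" string and one blank field.
raw = io.StringIO(
    "Patient,Sex,Pregnant,Testicular Cancer\n"
    "Mary,F,N,N/A\n"
    "John,M,,N\n"
)

# Default parsing: both "N/A" and the blank field are read as NaN.
default_df = pd.read_csv(raw)

# Stricter parsing: only blank fields count as missing, "N/A" stays a string.
raw.seek(0)
strict_df = pd.read_csv(raw, keep_default_na=False, na_values=[""])

print(default_df.isna())
print(strict_df.isna())
```

With the stricter settings you can decide afterwards, column by column, whether `N/A` should be treated as missing or as a valid "Not Applicable" category.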

### $\color{red}{\Large{\text{ ⚠}}}$ Remember NaN really stands for Not a Number. 

Many Python tools use NaN to denote missing values in a table. But a NaN does not necessarily mean the value is missing. It could be produced by bad processing somewhere in the pipeline, or it could even be expected.

However, for the rest of this lesson we will use NaN in a `pandas` `DataFrame` to denote a missing value.

In [None]:
df = pd.DataFrame({'a': [1,2,3], 'b': [1, -1, 4]})
df['sqrt_b']=np.sqrt(df.b)

In [None]:
df
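One way to track down where such a NaN came from is to locate it with `isna()` and look at the inputs in the same row. The sketch below rebuilds the same small frame under the illustrative name `demo`; `np.errstate` is used only to silence the warning NumPy raises for the square root of a negative number.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [1, -1, 4]})

# The square root of a negative number is NaN: a processing artifact, not a lost value.
with np.errstate(invalid="ignore"):
    demo["sqrt_b"] = np.sqrt(demo["b"])

# NaN never compares equal to itself, so use isna() rather than == to find it.
print(np.nan == np.nan)

# Rows where the derived column is NaN, together with the input that caused it.
nan_rows = demo[demo["sqrt_b"].isna()]
print(nan_rows)
```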

## $\color{purple}{\text{Missingness Mechanism vs. Pattern}}$

* Missingness Mechanism (Is an observation complete?)
  * Missing Completely at Random (MCAR)
  * Missing at Random (MAR)
  * Missing not at Random (MNAR)

* Missingness Pattern (What features are missing within an observation?)
  * Uniform
  * Monotonic
  * Random

## $\color{purple}{\text{Missingness Patterns}}$
There is no standard terminology, but here is one version.

#### $\color{purple}{\text{Uniform}}$
* If a row has missing data, the same features are always missing

| Height | Weight | Sex  | Temperature | Pulse | BP Systolic |
| ------ | ------ | ---- | ----------- | ----- | ----------- |
| NaN    | NaN    | M    | 98.6        | 70    | NaN         |
| 61     | 120    | F    | 98.2        | 77    | 110         |
| 65     | 160    | M    | 99.1        | 62    | 140         |
| NaN    | NaN    | F    | 98.9        | 55    | NaN         |

#### $\color{purple}{\text{Monotonic}}$
* If a feature is missing, all features after it in the row (under a suitable column ordering) are also missing

<font color='red'>
In a monotone missing pattern, the set of observed rows in one column is always a subset of the set of observed rows in another column. This means the columns can be ordered by their number of missing elements, from least to greatest. When the columns are ordered this way, a missing element in any row implies that all columns to its right in that row are also missing.
</font>

| Height | Weight | Sex  | Temperature | Pulse | BP Systolic |
| ------ | ------ | ---- | ----------- | ----- | ----------- |
| 71     | 190    | M    | 98.6        | 70    | 120         |
| 61     | 120    | NaN  | NaN         | NaN   | NaN         |
| 65     | 160    | M    | 99.1        | 62    | 140         |
| 63     | 125    | F    | 98.9        | NaN   | NaN         |

#### $\color{purple}{\text{Random}}$
* No particular pattern to the missingness

| Height | Weight | Sex  | Temperature | Pulse | BP Systolic |
| ------ | ------ | ---- | ----------- | ----- | ----------- |
| 71     | NaN    | M    | 98.6        | NaN   | 120         |
| 61     | 120    | F    | 98.2        | 77    | 110         |
| 65     | NaN    | M    | NaN         | NaN   | 140         |
| 63     | 125    | F    | 98.9        | 55    | 100         |
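The monotone definition can be checked mechanically. The sketch below is one possible implementation, not a standard library routine: the helper name `is_monotone` and the two small frames are assumptions made for illustration. It orders the columns by their number of missing values and then verifies that, in every row, nothing is observed to the right of a missing value. Note that under this definition a uniform pattern also counts as monotone.

```python
import numpy as np
import pandas as pd

def is_monotone(df: pd.DataFrame) -> bool:
    """Return True if the missingness pattern of df is monotone (hypothetical helper)."""
    mask = df.isna()
    # Order columns from fewest to most missing values (stable sort keeps ties in place).
    order = mask.sum().sort_values(kind="mergesort").index
    m = mask[order].to_numpy().astype(int)
    # Along each row the 0/1 missing indicator must never drop from 1 back to 0.
    return bool((np.diff(m, axis=1) >= 0).all())

# The Monotonic table from above.
monotone_df = pd.DataFrame({
    "Height": [71, 61, 65, 63],
    "Weight": [190, 120, 160, 125],
    "Sex": ["M", np.nan, "M", "F"],
    "Temperature": [98.6, np.nan, 99.1, 98.9],
    "Pulse": [70, np.nan, 62, np.nan],
    "BP Systolic": [120, np.nan, 140, np.nan],
})

# A made-up frame where each row is missing a different feature.
scattered_df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, 2.0, 3.0]})

print(is_monotone(monotone_df), is_monotone(scattered_df))

# How often each distinct pattern of missing features occurs.
print(monotone_df.isna().value_counts())
```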






## $\color{purple}{\text{Missingness Mechanisms}}$

Rubin’s taxonomy:
  * MCAR: the probability of data being missing does not depend on the values of the
observed or unobserved variables
  * MAR: the probability of data being missing does not depend on the unobserved
data, conditional on the observed data
  * MNAR: the probability of data being missing depends on the unobserved data,
**even after conditioning on the observed data** 

Source: [Missing data in propensity score analysis](https://www.ucl.ac.uk/population-health-sciences/sites/population_health_sciences/files/nash-missing_dataps_clemence_leyret.pdf)
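To make the three mechanisms concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: the synthetic `age` and `income` variables, the logistic missingness probabilities, and the seed. Only `income` is ever deleted. Under MCAR the deletion is a coin flip, under MAR it depends on the always-observed `age`, and under MNAR it depends on `income` itself.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data: income is correlated with age.
age = rng.normal(40, 10, n)
income = 50 + 0.5 * (age - 40) + rng.normal(0, 8, n)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# MCAR: every income has the same 30% chance of being missing.
miss_mcar = rng.random(n) < 0.3
# MAR: the chance of a missing income depends only on the observed age.
miss_mar = rng.random(n) < sigmoid((age - 40) / 5)
# MNAR: the chance of a missing income depends on the income value itself.
miss_mnar = rng.random(n) < sigmoid((income - 50) / 5)

# Mean income among the rows that remain observed under each mechanism.
observed_means = pd.Series({
    "full data": income.mean(),
    "MCAR": income[~miss_mcar].mean(),
    "MAR": income[~miss_mar].mean(),
    "MNAR": income[~miss_mnar].mean(),
})
print(observed_means.round(2))
```

In this setup the MCAR mean stays close to the full-data mean, while the MAR and MNAR means are pulled downward. The difference is that the MAR shift is explained by `age`, which is observed, whereas the MNAR shift is driven by the very values that were lost.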

## $\color{purple}{\text{Why Should I Care? Missing is Missing}}$
 * MCAR: deletion can be applied
 * MAR: imputation can be used
 * MNAR: handling it requires looking outside the data
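The first two treatments can be sketched in a few lines of `pandas`. The frame `df_demo` and its values are made up for illustration, and the group-mean fill is only one simple form of imputation, shown here because it conditions on an observed feature (`sex`), which is the idea behind imputing MAR data.

```python
import numpy as np
import pandas as pd

# Made-up data: weight is missing for one person in each group.
df_demo = pd.DataFrame({
    "sex": ["F", "F", "F", "M", "M", "M"],
    "weight": [120.0, np.nan, 130.0, 160.0, 190.0, np.nan],
})

# Deletion (reasonable under MCAR): keep only rows where weight was observed.
complete_cases = df_demo.dropna(subset=["weight"])

# Simple imputation conditioned on an observed feature (the MAR idea):
# fill each missing weight with the mean weight of the same sex.
group_means = df_demo.groupby("sex")["weight"].transform("mean")
imputed = df_demo.assign(weight=df_demo["weight"].fillna(group_means))

print(complete_cases)
print(imputed)
```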

$\color{red}{\text{⚠}}$ Many research papers purport to be able to impute MNAR data. Do not believe them; most of the methodology behind these claims is faulty.

## $\color{purple}{\text{Simple Case of Unimputable MNAR Data}}$
The following is a simple example of an MNAR mechanism.


This data set is a fictitious scenario where 100 people were asked to pick a number between 1 and 10. Due to some unknown mechanism, even numbers are occasionally lost.

* `first_name` - first name of the person polled
* `number` - the number they picked, between 1 and 10
* missingness - with a small probability, numbers are missing if they are even

In [None]:
df=pd.read_csv('data/favorite_numbers.csv')
df.head(20)

### $\color{purple}{\text{What can we tell about the missing data?}}$
* Data itself can tell us nothing
* Outside knowledge can help
* More data can help

Based on the data set alone, there is no knowledge as to what the missing values are, aside from the fact that they are even.
One might assume that most people tend to pick numbers from the middle of the range, and hence infer that the numbers 4 and 6 are more common among the missing values. However, this is based on knowledge outside the data set itself. Furthermore, you might have statistics from outside sources indicating gender preference.
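A scenario like this can be simulated to see why the observed data alone are not enough. The sketch below does not use the tutorial's CSV file. It generates its own picks, and the uniform choice of numbers, the 30% loss rate, and the seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 100 simulated picks between 1 and 10 (uniform here, purely as an assumption).
true_picks = rng.integers(1, 11, size=100)

# MNAR mechanism: only even numbers can be lost, each with probability 0.3.
is_even = true_picks % 2 == 0
lost = is_even & (rng.random(100) < 0.3)

# What the analyst actually sees: NaN wherever a value was lost.
observed = pd.Series(true_picks, dtype="float").mask(lost)

# Counts per observed number, with NaN as its own category.
print(observed.value_counts(dropna=False).sort_index())

# Only because we ran the simulation ourselves can we look at the lost values.
print(np.unique(true_picks[lost]))
```

The observed counts show that even numbers are under-represented, but nothing in `observed` says which even number any particular NaN used to be.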


Suppose now we have their last names. Seeing the surnames, we might be able to apply additional domain knowledge to better estimate the likely distribution of the missing numbers.

Traditionally, in Chinese culture, 6 and 8 are considered lucky, whereas 4 and 10 are considered unlucky.

In [None]:
df=pd.read_csv('data/favorite_numbers_full_name.csv')
df.head(20)

## $\color{purple}{\text{Key Takeaways}}$
* Be sure to correctly identify what is truly missing
* The proper treatment of missing data requires identifying the type of missingness mechanism (MCAR, MAR, MNAR)
* Data missing due to MNAR requires domain knowledge to address

## $\color{purple}{\text{References}}$
 * Leyrat, C., Williamson, E.: Missing data in propensity score analysis. _Perils and promises of propensity scores_. https://www.ucl.ac.uk/population-health-sciences/sites/population_health_sciences/files/nash-missing_dataps_clemence_leyret.pdf
 * Li, L., Shen, C., Li, X., Robins, J.M.: On weighting approaches for missing data. _Statistical Methods in Medical Research_ 22(1), 14–30 (Feb 2013). https://doi.org/10.1177/0962280211403597, https://www.ncbi.nlm.nih.gov/pubmed/21705435 (PMID: 21705435)
 * Rubin, D.: Inference and missing data. _Biometrika_ 63(3), 581–592 (Dec 1976). https://doi.org/10.1093/biomet/63.3.581