# $\color{purple}{\text{Understanding Missing Data and How to Deal with It (Part 1)}}$

### $\color{purple}{\text{Colab Environmental Setup}}$
We will be saving and reading files. If you are following along in Colab, it will be a lot easier to mount your Google Drive.


In [None]:
from google.colab import drive
drive.mount('/content/drive')
import os
# Create the tutorial folder on the first run; exist_ok avoids an error on reruns
os.makedirs('/content/drive/My Drive/missingness_tutorial', exist_ok=True)
os.chdir('/content/drive/My Drive/missingness_tutorial')

## $\color{purple}{\text{Nomenclature}}$

* columns
  * features
  * variables

* rows
  * observations
  * cases
  * records

In [None]:
import pandas as pd
import numpy as np

In [None]:
pd.DataFrame(np.random.randn(4, 3), columns=['columns', 'features', 'variables'], index=['rows', 'observations', 'cases', 'records'])

## $\color{purple}{\text{Identifying Missing Features}}$
$\color{red}{\Large{\text{ ⚠}}}$ Always examine your data and determine which values are actually missing

  * Values that are missing:
      * CSV files with blanks
      * NULL in a database
      * Impossible Values
  * Values that are not necessarily missing:
      * N/A
        * Not Available - missing
        * Not Applicable - maybe not missing
      * NaN

| Patient | Sex| Pregnant | Testicular Cancer |
|---------|----|----------|-------------------|
| Mary    | F  |  N       |   N/A             |
| John    | M  |  N/A     |    N              |


Is N/A really missing or just Not Applicable in this case?
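
As a minimal sketch of why this matters in practice: `pandas.read_csv` treats the string `N/A` as missing by default, so both cells in the table above would silently become NaN. The CSV text below is a made-up copy of that table, and the variable names are illustrative only. Passing `keep_default_na=False` keeps the literal strings so you can decide for yourself which ones mean "missing".

```python
import io
import pandas as pd

# Hypothetical CSV version of the patient table above
csv_text = (
    "Patient,Sex,Pregnant,Testicular Cancer\n"
    "Mary,F,N,N/A\n"
    "John,M,N/A,N\n"
)

# Default parsing: "N/A" is on pandas' default list of NA strings
default_df = pd.read_csv(io.StringIO(csv_text))

# Keep the literal strings so "Not Applicable" is not confused with "missing"
literal_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

n_default_missing = int(default_df.isna().sum().sum())
n_literal_missing = int(literal_df.isna().sum().sum())
print(n_default_missing, n_literal_missing)
```

If only some strings should count as missing, `na_values` can be combined with `keep_default_na=False` to list them explicitly.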

### $\color{red}{\Large{\text{ ⚠}}}$ Remember NaN really stands for Not a Number. 

Many Python tools use NaN to denote missing values in a table. But a NaN does not necessarily mean the value is missing. It could be introduced by bad processing somewhere in the pipeline, or it could even be expected.

However, for the rest of this tutorial we will use NaN in a `pandas` `DataFrame` to denote a missing value.

In [None]:
df = pd.DataFrame({'a': [1,2,3], 'b': [1, -1, 4]})
# The square root of -1 is not a real number, so NumPy returns NaN (with a RuntimeWarning)
df['sqrt_b']=np.sqrt(df.b)

In [None]:
df
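
The NaN in `sqrt_b` came from processing, not from a value that was never recorded. A small sketch of how to find such values follows; it rebuilds the same toy frame under the hypothetical name `demo` so it runs on its own. Because NaN never compares equal to itself, `==` cannot be used to look for it, and `isna()` is the usual tool.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3], "b": [1, -1, 4]})

# Silence the RuntimeWarning that sqrt(-1) would otherwise emit
with np.errstate(invalid="ignore"):
    demo["sqrt_b"] = np.sqrt(demo["b"])

# NaN is not equal to anything, including itself
nan_equals_itself = bool(demo.loc[1, "sqrt_b"] == demo.loc[1, "sqrt_b"])

# isna() flags NaN cells; summing counts them per column
missing_per_column = demo.isna().sum()
print(nan_equals_itself)
print(missing_per_column)
```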

## $\color{purple}{\text{Missingness Mechanism vs. Pattern}}$

* Missingness Mechanism (Is an observation complete?)
  * Missing Completely at Random (MCAR)
  * Missing at Random (MAR)
  * Missing not at Random (MNAR)

* Missingness Pattern (What features are missing within an observation?)
  * Uniform
  * Monotonic
  * Random

## $\color{purple}{\text{Missingness Patterns}}$
There is no standard terminology, but here is one version.
#### $\color{purple}{\text{Uniform}}$
* If a row has missing data, the same features are always missing

| Height | Weight | Sex  | Temperature | Pulse | BP Systolic |
| ------ | ------ | ---- | ----------- | ----- | ----------- |
| NaN    | NaN    | M    | 98.6        | 70    | NaN         |
| 61     | 120    | F    | 98.2        | 77    | 110         |
| 65     | 160    | M    | 99.1        | 62    | 140         |
| NaN    | NaN    | F    | 98.9        | 55    | NaN         |

#### $\color{purple}{\text{Monotonic}}$
* If a feature is missing, all of the features to its right in the row are also missing

| Height | Weight | Sex  | Temperature | Pulse | BP Systolic |
| ------ | ------ | ---- | ----------- | ----- | ----------- |
| 71     | 190    | M    | 98.6        | 70    | 120         |
| 61     | 120    | NaN  | NaN         | NaN   | NaN         |
| 65     | 160    | M    | 99.1        | 62    | 140         |
| 63     | 125    | F    | 98.9        | NaN   | NaN         |

#### $\color{purple}{\text{Random}}$
* No particular pattern to the missingness

| Height | Weight | Sex  | Temperature | Pulse | BP Systolic |
| ------ | ------ | ---- | ----------- | ----- | ----------- |
| 71     | NaN    | M    | 98.6        | NaN   | 120         |
| 61     | 120    | F    | 98.2        | 77    | 110         |
| 65     | NaN    | M    | NaN         | NaN   | 140         |
| 63     | 125    | F    | 98.9        | 55    | 100         |
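
The three patterns above can be checked mechanically from the boolean mask returned by `isna()`. The helper below, `missingness_pattern`, is a hypothetical sketch written for this tutorial's definitions, not a library function. It assumes the column order is meaningful for the monotonic check and that a blank cell has already been read as NaN.

```python
import numpy as np
import pandas as pd

def missingness_pattern(df):
    """Label a table's missingness pattern (hypothetical helper)."""
    mask = df.isna().to_numpy()
    incomplete = mask[mask.any(axis=1)]      # keep only rows with a missing value
    if len(incomplete) == 0:
        return "complete"
    if (incomplete == incomplete[0]).all():  # every incomplete row shares one mask
        return "uniform"
    # Monotonic: once a value is missing, everything to its right is missing too
    if (np.diff(incomplete.astype(int), axis=1) >= 0).all():
        return "monotonic"
    return "random"

cols = ["Height", "Weight", "Sex", "Temperature", "Pulse", "BP Systolic"]
nan = np.nan
uniform_df = pd.DataFrame([[nan, nan, "M", 98.6, 70, nan],
                           [61, 120, "F", 98.2, 77, 110],
                           [65, 160, "M", 99.1, 62, 140],
                           [nan, nan, "F", 98.9, 55, nan]], columns=cols)
monotonic_df = pd.DataFrame([[71, 190, "M", 98.6, 70, 120],
                             [61, 120, nan, nan, nan, nan],
                             [65, 160, "M", 99.1, 62, 140],
                             [63, 125, "F", 98.9, nan, nan]], columns=cols)
random_df = pd.DataFrame([[71, nan, "M", 98.6, nan, 120],
                          [61, 120, "F", 98.2, 77, 110],
                          [65, nan, "M", nan, nan, 140],
                          [63, 125, "F", 98.9, 55, 100]], columns=cols)

for name, table in [("uniform", uniform_df), ("monotonic", monotonic_df), ("random", random_df)]:
    print(name, "->", missingness_pattern(table))
```

For larger tables, `df.isna().value_counts()` gives a quick tally of how often each distinct mask occurs.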






## $\color{purple}{\text{Missingness Mechanisms}}$

Rubin’s taxonomy:
  * MCAR: the probability of data being missing does not depend on the values of the observed or unobserved variables
  * MAR: the probability of data being missing does not depend on the unobserved data, conditional on the observed data
  * MNAR: the probability of data being missing depends on the unobserved data, **even after conditioning on the observed data**

Source: [Missing data in propensity score analysis](https://www.ucl.ac.uk/population-health-sciences/sites/population_health_sciences/files/nash-missing_dataps_clemence_leyret.pdf)
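
The same three definitions can be written compactly. The notation here is an assumption of this sketch rather than something taken from the source slides: $R$ is the indicator of which values are missing, $Y_{\text{obs}}$ is the observed part of the data, and $Y_{\text{mis}}$ is the unobserved part.

$$
\begin{aligned}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R) \\
\text{MAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) = P(R \mid Y_{\text{obs}}) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}) \neq P(R \mid Y_{\text{obs}})
\end{aligned}
$$

Read this way, MCAR is the special case of MAR in which the observed data do not matter either.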

## $\color{purple}{\text{Why Should I Care? Missing is Missing}}$
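 * MCAR: deleting incomplete rows is safe (it costs data but does not bias estimates)
 * MAR: imputation from the observed features can work
 * MNAR: handling it requires information from outside the data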
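
A minimal sketch of the first two bullets, on made-up data: the feature names `a` and `c`, the coefficients, and the deletion rates are all assumptions chosen for illustration. Deleting rows under MCAR leaves the mean of `a` roughly unchanged, deleting rows under MAR biases it, and a simple regression imputation from the observed feature `c` brings it back close to the truth.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20000
c = rng.normal(size=n)
a = 2.0 + 1.5 * c + rng.normal(scale=0.5, size=n)   # a is correlated with c
demo = pd.DataFrame({"a": a, "c": c})
true_mean = demo["a"].mean()

# MCAR: every value of a has the same 30% chance of being deleted
mcar = demo.copy()
mcar.loc[rng.uniform(size=n) < 0.3, "a"] = np.nan
mcar_deleted_mean = mcar["a"].dropna().mean()

# MAR: a is deleted (60% chance) only where the observed feature c is above its median
mar = demo.copy()
eligible = demo["c"] > demo["c"].median()
mar.loc[eligible & (rng.uniform(size=n) < 0.6), "a"] = np.nan
mar_deleted_mean = mar["a"].dropna().mean()

# Regression imputation: fit a ~ c on complete rows, predict a where it is missing
complete = mar.dropna()
slope, intercept = np.polyfit(complete["c"], complete["a"], deg=1)
mar_imputed_mean = mar["a"].fillna(intercept + slope * mar["c"]).mean()

print(f"true {true_mean:.3f}  MCAR deleted {mcar_deleted_mean:.3f}")
print(f"MAR deleted {mar_deleted_mean:.3f}  MAR imputed {mar_imputed_mean:.3f}")
```

Regression imputation is used here only because it is short; it understates the variance of `a`, so it is not a recommendation for real analyses.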

$\color{red}{\text{⚠}}$ Many research papers purport to be able to impute MNAR data. Do not believe them; most of the methodology behind the claims is faulty.

## $\color{purple}{\text{Simple Case of Unimputable MNAR Data}}$

In [None]:
import numpy as np

In [None]:
cov=[[1.6545195264181267,
  0.6346001403246381,
  1.573255077832285,
  0.7457615955325402],
 [0.6346001403246381,
  0.5636389213610075,
  0.5861890592085826,
  0.6638139531999303],
 [1.573255077832285,
  0.5861890592085826,
  1.6461885333121087,
  0.4916921086792136],
 [0.7457615955325402,
  0.6638139531999303,
  0.4916921086792136,
  1.0900299890979697]]
mean = np.random.normal(np.random.uniform(low=1, high=5), np.random.uniform(high=5), 4)

In [None]:
mean

In [None]:
import pandas as pd
df = pd.DataFrame(np.random.multivariate_normal(mean, cov, size=10000), columns=['feature a', 'feature b', 'feature c', 'feature d'])
df.cov()

In [None]:
pd.DataFrame(cov)

In [None]:
df['uncorrelated']=np.random.rand(10000)

In [None]:
df.corr()

In [None]:
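def clobber(df, column, probability, depends=()):
  # Rows eligible for deletion; start with every row
  clob = pd.Series(True, index=df.index)
  for dep_column in depends:
    # Only rows above the median of each dependency stay eligible
    clob &= df[dep_column] > df[dep_column].median()
  # Per-row chance of deletion: `probability` if eligible, otherwise 0
  p_missing = clob * probability
  rand = np.random.uniform(0, 1, size=len(df))
  ret = df.copy()
  # Replace the value with NaN where the random draw falls below that chance
  ret[column] = np.where(rand < p_missing, np.nan, df[column])
  return ret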

In [None]:
mar_df=clobber(df, 'feature a', 0.4, depends=['feature c'])
mar_df['missing']=mar_df['feature a'].isnull()

In [None]:
mcar_df = clobber(df, 'feature a', 0.2)
mcar_df['missing']=mcar_df['feature a'].isnull()

In [None]:
fmnar_df=clobber(df, 'feature a', 0.4, depends=['feature a'])
fmnar_df['missing']=fmnar_df['feature a'].isnull()

In [None]:
mcar_df.mean()

In [None]:
fmnar_df.mean()

In [None]:
mar_df.mean()

In [None]:
df.mean()

In [None]:
mar_df['feature a'].corr(mar_df['feature b'])

In [None]:
missing=mcar_df['feature a'].isnull()

In [None]:
mcar_df['missing']=missing

In [None]:
fmnar_df.corr()

In [None]:
from sklearn.covariance import EmpiricalCovariance

In [None]:
mcar_df.mean()

In [None]:
mcar_df.dropna().mean()

In [None]:
mcar_df[mcar_df['feature a'].isnull()].mean()

In [None]:
mar_df.dropna().mean()

In [None]:
mar_df.mean()

In [None]:
mar_df[mar_df['feature a'].isnull()].mean()
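
The MNAR case can be condensed into one self-contained sketch. The data, the names `a` and `c`, and the 40% deletion rate are assumptions for illustration and are separate from the frames built above. When the chance of losing `a` depends on `a` itself, both deletion and regression imputation from a correlated feature leave the mean of `a` biased, which is consistent with the point that MNAR needs information from outside the data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 20000
c = rng.normal(size=n)
a = 2.0 + 0.5 * c + rng.normal(scale=1.0, size=n)   # a is only moderately tied to c
demo = pd.DataFrame({"a": a, "c": c})
true_mean = demo["a"].mean()

# MNAR: a is deleted (40% chance) only where a itself is above its median
mnar = demo.copy()
eligible = demo["a"] > demo["a"].median()
mnar.loc[eligible & (rng.uniform(size=n) < 0.4), "a"] = np.nan

# Deletion: average whatever values of a survived
deleted_mean = mnar["a"].dropna().mean()

# Regression imputation from c, fitted on the surviving rows only
complete = mnar.dropna()
slope, intercept = np.polyfit(complete["c"], complete["a"], deg=1)
imputed_mean = mnar["a"].fillna(intercept + slope * mnar["c"]).mean()

print(f"true {true_mean:.3f}  deleted {deleted_mean:.3f}  imputed {imputed_mean:.3f}")
```

Both estimates come out below the true mean because the surviving rows under-represent large values of `a`, and nothing in the observed columns records that fact.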

