# Missing Data

Most datasets will have at least some missing values. Cleaning the dataset can even increase the number of missing values.

Even if missingness is random, it can still cause difficulties during analysis. Many basic statistical routines in `Python`, such as `ANOVA`, `t-tests`, and `correlations`, will fail or return `NaN` if there are any missing values in the variables involved, unless those missing values are handled explicitly.
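
As a small illustration of how a single missing value can propagate through a calculation, here is a minimal sketch using only `NumPy`. The height values are made up for illustration.

```python
import numpy as np

# Hypothetical heights with one missing entry (np.nan)
heights = np.array([64.0, np.nan, 71.0, 66.0, 68.0])

# A plain mean propagates the missing value, so the result is nan
plain_mean = np.mean(heights)

# np.nanmean ignores the missing entry and averages the other four values
nan_aware_mean = np.nanmean(heights)

print(plain_mean)      # nan
print(nan_aware_mean)  # 67.25
```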

One solution is to use the `dropna()` method from the `Pandas` package.

In [9]:
import numpy as np
import pandas as pd

In [5]:
# sample data
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height' : [64, None, 71, 66, 68, None],
    'weight' : [140, None, 130, 110, 160, None],
}

df = pd.DataFrame(data)

# view data
print(df)

# drop all rows with missing values in any column
print(df.dropna())

# drop only rows where all values are missing
print(df.dropna(how='all'))

# keep only rows with at least two non-missing values (thresh counts non-NA values)
print(df.dropna(thresh=2))

# drop only rows that have missing values in the 'gender' or 'height' column
print(df.dropna(subset=['gender','height']))

# drop only rows that have missing values in both the 'height' and 'weight' columns
print(df.dropna(how='all', subset=['height', 'weight']))

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
5   NaN   None     NaN     NaN
    age gender  height  weight
0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
    age gender  height  weight
0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
    age gender  height  weight
0  27.0      f    64.0   140.0
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0


### When does missingness matter?

Sometimes dropping all rows with missing data is fine; other times it creates more problems than it solves.
Missing data matters when we believe the missingness will cause:
- loss of analytical relevance because so many rows had to be dropped
- bias because certain values are more likely to be missing than others
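
One way to gauge the first concern is to count the missing values per column and see how many rows a blanket `dropna()` would remove. This is a minimal sketch that reuses the lesson's sample data; the variable names are chosen for illustration.

```python
import pandas as pd

# Same sample data as used earlier in the lesson
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height': [64, None, 71, 66, 68, None],
    'weight': [140, None, 130, 110, 160, None],
}
df = pd.DataFrame(data)

# Number of missing values in each column
missing_per_column = df.isna().sum()

# Rows that survive dropping every row with any missing value
rows_before = len(df)
rows_after = len(df.dropna())
share_lost = 1 - rows_after / rows_before

print(missing_per_column)
print(f"{share_lost:.0%} of rows would be dropped")
```

Here only two of the six rows are complete, so dropping every incomplete row would discard most of the sample.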

To help decide when to drop missing data and when not to, determine which of the following categories the missing data falls into:

__Missing Completely at Random (MCaR)__:
For example, damaged equipment caused a loss of 20% of all data, and the loss is unrelated to any of the values being measured. In this case, the missingness is tolerable.
Unless so much data was lost that our sample size is now too small, we can throw out the rows with missing data.

__Missing at Random (MaR)__:
Women are more likely to skip questions about weight; lower-income individuals are more likely to skip questions about earnings; people who consume high amounts of alcohol each week are more likely to skip questions about alcohol consumption.
If we can explain why this data is missing using the data we already have, we can proceed without the missing values, though we must include the variable that 'explains' the missingness in our analysis.
There is no way to know for certain whether data is __MaR__, but sometimes it's safe to assume it is.
If we find a variable during our analysis that seems to have a clear differentiation between missing and non-missing values (for example 90% of missing entries from a mental wellness survey are from males), we can reasonably suspect __MaR__.
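
A quick way to look for this kind of pattern is to compare the share of missing answers across the groups of another variable. The sketch below uses a small hypothetical survey (the data and names are invented for illustration).

```python
import pandas as pd

# Hypothetical survey in which 'weight' is skipped more often by one group
survey = pd.DataFrame({
    'gender': ['f', 'f', 'f', 'f', 'm', 'm', 'm', 'm'],
    'weight': [None, 130, None, None, 180, 175, None, 190],
})

# Share of missing 'weight' answers within each gender group
missing_rate = survey['weight'].isna().groupby(survey['gender']).mean()

print(missing_rate)
```

A large gap between the groups (here 75% versus 25%) suggests that the grouping variable helps explain the missingness, although it does not prove the data is __MaR__.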

__Missing Not at Random (MNaR)__:
Data whose missingness appears systematic, and cannot be explained by the other variables we have, can be classified as __MNaR__.
This type of data cannot be thrown out, as we would end up with a biased sample and biased conclusions.
An instance of __MNaR__ data would be people who would answer a survey question a certain way not answering the question at all.
For example, people out of work might be less likely to answer a survey question about unemployment.
In that case our analysis would show proportionately fewer unemployed people in a population than there truly are.
Since we can't know what the __MNaR__ values would have been, the best we can do is make an assumption by looking for what's not in the data.
Warning signs include abnormally low counts of reported homelessness, fewer LGBTQ people in a population than expected, 90% of questions about mental wellness left blank by male survey-takers, and variables with missingness where no one picks the highest or lowest values.
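
The unemployment example can be sketched as a small simulation. The population size, the true unemployment rate, and the response probabilities below are all assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population in which about 10% are unemployed (True = unemployed)
n = 10_000
unemployed = rng.random(n) < 0.10

# Assumed response rates: unemployed people answer 50% of the time,
# employed people answer 90% of the time
response_prob = np.where(unemployed, 0.5, 0.9)
responded = rng.random(n) < response_prob

# Compare the true rate with the rate seen among those who answered
true_rate = unemployed.mean()
observed_rate = unemployed[responded].mean()

print(f"true rate: {true_rate:.3f}")
print(f"observed rate among respondents: {observed_rate:.3f}")
```

Because the people who skip the question are disproportionately unemployed, the rate among respondents understates the true rate, and simply dropping the non-responders would not remove that bias.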


### Imputation
What do you do when you have missing data that you can't drop, or when dropping it would leave your sample too small?
We can "guess" what the missing data would have been and use a `fill` method to input that data.
This process is called `imputation`.

In most cases, `imputation` involves replacing missing values with some kind of statistical measure--the mean, median, or mode--of the variable.
This method isn't perfect, but it preserves central tendency at the cost of reducing variance and correlations among variables.
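
To make that trade-off concrete, the following minimal sketch fills the missing entries of a single column with its mean and compares the mean and variance before and after. The values mirror the lesson's `height` column.

```python
import pandas as pd

# Heights with two missing entries (None becomes NaN)
height = pd.Series([64, None, 71, 66, 68, None], dtype=float)

# Mean imputation: every missing entry gets the column mean
imputed = height.fillna(height.mean())

# The mean is unchanged, but the variance shrinks because the
# filled-in values sit exactly at the center of the distribution
print(height.mean(), imputed.mean())
print(height.var(), imputed.var())
```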

In [63]:
# Sample data to play with.
data = {
    'age': [27, 50, 34, None, None, None],
    'gender': ['f', 'f', 'f', 'm', 'm', None],
    'height' : [64, None, 71, 66, 68, None],
    'weight' : [140, None, 130, 110, 160, None],
}

df = pd.DataFrame(data)

print(df, '\n')

# For each numeric column, replace the missing values with the mean for that column
# numeric_only=True skips the non-numeric 'gender' column when computing means
df.fillna(df.mean(numeric_only=True), inplace=True)

print(df, '\n')

# For each column, replace the missing values with the most common value for that column.
# This is useful for filling in missing categorical values.
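# As written, this command will fill in missing values for both numerical and categorical columns.
# Note: when several values tie for most common (as in 'age'), which one is chosen may depend on the pandas version.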
df = pd.DataFrame(data)
df = df.apply(lambda x: x.fillna(x.value_counts().index[0]))

print(df, '\n')

# for each column, replace each missing value with the median, mode, or another statistic of your choice.
df = pd.DataFrame(data)
# replace missing 'age' values with the mean value
df['age'] = df['age'].fillna(df['age'].mean())
# replace missing 'gender' values with the most common value
df['gender'] = df['gender'].fillna(df['gender'].value_counts().index[0])
# replace missing 'height' and 'weight' values with median values
df['height'] = df['height'].fillna(df['height'].median())
df['weight'] = df['weight'].fillna(df['weight'].median())

print(df)

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f     NaN     NaN
2  34.0      f    71.0   130.0
3   NaN      m    66.0   110.0
4   NaN      m    68.0   160.0
5   NaN   None     NaN     NaN 

    age gender  height  weight
0  27.0      f   64.00   140.0
1  50.0      f   67.25   135.0
2  34.0      f   71.00   130.0
3  37.0      m   66.00   110.0
4  37.0      m   68.00   160.0
5  37.0   None   67.25   135.0 

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f    68.0   160.0
2  34.0      f    71.0   130.0
3  34.0      m    66.0   110.0
4  34.0      m    68.0   160.0
5  34.0      f    68.0   160.0 

    age gender  height  weight
0  27.0      f    64.0   140.0
1  50.0      f    67.0   135.0
2  34.0      f    71.0   130.0
3  37.0      m    66.0   110.0
4  37.0      m    68.0   160.0
5  37.0      f    67.0   135.0


Proper __imputation__ is a complex topic. A more sophisticated method than the one used above would be to group existing data entries into categories based on similarities.
We can then subset the data by those groups and compute statistics specific to each group to fill in the missing data.
For more information, check out this [imputation tutorial](https://www.analyticsvidhya.com/blog/2014/09/data-munging-python-using-pandas-baby-steps-python/).
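
As a rough sketch of the group-based idea, the example below fills each missing height with the mean height of that person's gender group. The data is hypothetical, and grouping by a single column is only one of many possible ways to define "similar" entries.

```python
import pandas as pd

# Hypothetical data: height is missing for one person in each gender group
people = pd.DataFrame({
    'gender': ['f', 'f', 'f', 'm', 'm', 'm'],
    'height': [64.0, None, 66.0, 70.0, 72.0, None],
})

# Mean height within each gender group, aligned to the original rows
group_mean = people.groupby('gender')['height'].transform('mean')

# Fill each missing height with the mean of that person's group
people['height'] = people['height'].fillna(group_mean)

print(people)
```

Compared with filling every gap with the overall mean, this keeps more of the between-group differences in the data.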

### Beyond Imputation
If the cause of the missing data is an easy fix, then collecting new data may be a simpler option than imputing it.
Either run the study again, refresh the API, or collect more data with a focus on the groups with the highest instances of missingness.
For example, if a coding error in a survey means data wasn't recorded for any Mac users, it may be easier to fix the coding error and run the study again (or fix the coding error and collect data from just Mac users) than to try to impute such a centrally important variable.