# Introduction

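There are several ways to deal with missing data in tabular datasets. This notebook explores a few of them: dropping a column, filling numerical values with the mean or median, and filling categorical values with the mode.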

**It is important to note that any decision about missing data introduces biases and assumptions about the dataset. Missing data may be a normal occurrence in the population the data represents. In other cases, such as health and safety applications, making false assumptions about the dataset may put people at risk. Please evaluate your situation and compute responsibly.**

# Remove a variable

If a given column (a feature) does not contain enough information, we can simply drop it. This is usually a good choice only when a high percentage of the column's values are missing.

Base data set

In [37]:
import pandas as pd
data = [
    ["A", 25, "Ford", None], 
    ["A", 37, "Toyota", "Yes"],
    ["B", 26, "Chevy", None],
    ["B", 41, "Subaru", None],
    ["C", 66, "Honda", None],
    ["C", 28, "Toyota", None],
    ["D", 52, "Chevy", None],
    ["D", 43, "Toyota", None],
    ["E", 54, "Lexus", None],
    ["E", 47, "BMW", None],
    ["F", None, "Honda", None],
    ["F", 42, None, None],
]
df = pd.DataFrame(data, columns=["Region", "Age", "Car Brand", "Due For Service Soon"])
df

Unnamed: 0,Region,Age,Car Brand,Due For Service Soon
0,A,25.0,Ford,
1,A,37.0,Toyota,Yes
2,B,26.0,Chevy,
3,B,41.0,Subaru,
4,C,66.0,Honda,
5,C,28.0,Toyota,
6,D,52.0,Chevy,
7,D,43.0,Toyota,
8,E,54.0,Lexus,
9,E,47.0,BMW,


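Columns whose fraction of missing values is at or above a threshold can be dropped. With the threshold of 0.7 used here, only "Due For Service Soon" (11 of 12 values missing) is removed.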

In [65]:
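import numpy as np
import pandas as pd

# Drop any column whose fraction of missing values is at or above this value.
threshold = 0.7

# df.isnull().mean() gives the fraction of missing values per column.
reduced_data = df.loc[:, df.isnull().mean() < threshold]
reduced_data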

Unnamed: 0,Region,Age,Car Brand
0,A,25.0,Ford
1,A,37.0,Toyota
2,B,26.0,Chevy
3,B,41.0,Subaru
4,C,66.0,Honda
5,C,28.0,Toyota
6,D,52.0,Chevy
7,D,43.0,Toyota
8,E,54.0,Lexus
9,E,47.0,BMW
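Before choosing a threshold, it can help to look at the per-column missing rates directly. The following is a minimal sketch that rebuilds the base data set so it runs on its own; the names `missing_rate` and `to_drop` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same 12 rows as the base data set, rebuilt so this cell is self-contained.
data = [
    ["A", 25, "Ford", None],
    ["A", 37, "Toyota", "Yes"],
    ["B", 26, "Chevy", None],
    ["B", 41, "Subaru", None],
    ["C", 66, "Honda", None],
    ["C", 28, "Toyota", None],
    ["D", 52, "Chevy", None],
    ["D", 43, "Toyota", None],
    ["E", 54, "Lexus", None],
    ["E", 47, "BMW", None],
    ["F", None, "Honda", None],
    ["F", 42, None, None],
]
df = pd.DataFrame(data, columns=["Region", "Age", "Car Brand", "Due For Service Soon"])

# Fraction of missing values per column (0.0 = complete, 1.0 = entirely empty).
missing_rate = df.isnull().mean()

# Columns at or above the threshold are the ones the filter above removes.
threshold = 0.7
to_drop = missing_rate[missing_rate >= threshold].index.tolist()

print(missing_rate.round(2))
print("Columns to drop:", to_drop)
```

Here "Age" and "Car Brand" are each missing 1 of 12 values, so they stay well below the threshold, while "Due For Service Soon" is missing 11 of 12.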


# Fill numerical data with mean or median

This approach has pros and cons:

- Pros
  - Simplicity
  - Preserves data structure
  - Preserves central tendency of the data
- Cons
  - Can be influenced by outliers
  - May not be accurate if the data is skewed
  - Potential information loss 

In [66]:
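# Select only the numeric columns (here, just "Age").
numeric_df = reduced_data.select_dtypes(include=np.number)
# Column-wise mean, ignoring missing values, rounded to a whole number.
mean = numeric_df.mean(skipna=True).round()
# Avoid inplace=True: numeric_df may be a copy, which triggers pandas warnings.
numeric_df = numeric_df.fillna(mean)
reduced_data = reduced_data.assign(Age=numeric_df["Age"])
reduced_data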

Unnamed: 0,Region,Age,Car Brand
0,A,25.0,Ford
1,A,37.0,Toyota
2,B,26.0,Chevy
3,B,41.0,Subaru
4,C,66.0,Honda
5,C,28.0,Toyota
6,D,52.0,Chevy
7,D,43.0,Toyota
8,E,54.0,Lexus
9,E,47.0,BMW
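The cell above fills with the mean; the median works the same way and is less sensitive to outliers. A minimal sketch, assuming the same "Age" values as the base data set (the names `ages`, `median_age`, and `filled` are illustrative):

```python
import pandas as pd

# "Age" values from the base data set; row 10 is the missing one.
ages = pd.DataFrame({"Age": [25, 37, 26, 41, 66, 28, 52, 43, 54, 47, None, 42]})

# Series.median() skips missing values by default.
median_age = ages["Age"].median()

# Fill the gap with the median, returning a new DataFrame instead of mutating in place.
filled = ages.assign(Age=ages["Age"].fillna(median_age))

print(median_age)
print(filled.loc[10, "Age"])
```

For this data the median is 42.0, which happens to coincide with the rounded mean (about 41.9), so both strategies fill the same value. On skewed data the two can differ noticeably.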


# Fill in categorical data with the mode

Categorical data must take one of a discrete set of valid values, so a missing entry can be filled with the mode (the most frequent value). As with filling a numerical value with the mean, this has downsides: the filled value may not be in line with the true distribution, it may not be accurate for the dataset, and it may hide information.
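A minimal sketch of mode filling, assuming the "Car Brand" values from the base data set (the names `brands`, `mode_value`, and `filled` are illustrative):

```python
import pandas as pd

# "Car Brand" values from the base data set; the last row is missing.
brands = pd.DataFrame({
    "Car Brand": [
        "Ford", "Toyota", "Chevy", "Subaru", "Honda", "Toyota",
        "Chevy", "Toyota", "Lexus", "BMW", "Honda", None,
    ]
})

# Series.mode() returns all most-frequent values in sorted order; take the first.
# If several values tie, this choice is arbitrary and worth reviewing.
mode_value = brands["Car Brand"].mode(dropna=True)[0]

# Fill only the "Car Brand" column, returning a new DataFrame.
filled = brands.fillna({"Car Brand": mode_value})

print(mode_value)
print(filled.loc[11, "Car Brand"])
```

"Toyota" appears three times, more than any other brand, so it becomes the fill value for the missing row.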