In [63]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Handling Missing Values

Not all missing values are equal. Consider arbitrary survey data taken from the general population.

**Missing Not At Random** - when a value is missing for a reason related to the true value itself. (Ex: if a survey respondent chooses not to disclose their income, this could be because their income is abnormally high or low.)

**Missing At Random** - when a value is missing for a reason related to another observed variable, but not to the missing value itself. (Ex: age values are missing far more often for survey respondents of a particular gender.)

**Missing Completely At Random** - when there are no patterns in the missing values: the chance that a value is missing is unrelated to any variable, observed or not.
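
To make the three mechanisms concrete, here is a minimal sketch that simulates each one on synthetic survey data. The `survey` frame, its column names, and every probability and threshold are made up for illustration and have nothing to do with the dataset loaded below.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Hypothetical survey data, generated only for this illustration
survey = pd.DataFrame({
    "gender": rng.choice(["F", "M"], size=n),
    "age": rng.integers(18, 80, size=n).astype(float),
    "income": rng.lognormal(mean=10.5, sigma=0.6, size=n),
})

# MCAR: every income value has the same 10% chance of going missing
mcar = survey["income"].mask(rng.random(n) < 0.10)

# MAR: age goes missing more often for one gender (an observed column)
p_missing = np.where(survey["gender"] == "M", 0.30, 0.05)
mar = survey["age"].mask(rng.random(n) < p_missing)

# MNAR: income goes missing when the income itself is unusually high
threshold = survey["income"].quantile(0.90)
mnar = survey["income"].mask(survey["income"] > threshold)

# Missing rate of age per gender exposes the MAR pattern
print(mar.isna().groupby(survey["gender"]).mean())

# The mean of the remaining MNAR values is pulled below the true mean
print(survey["income"].mean(), mnar.mean())
```

In the MAR case the pattern is visible from the observed `gender` column. In the MNAR case nothing in the observed data reveals why values are missing, which is what makes it the hardest mechanism to detect and correct.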

In [64]:
df = pd.read_csv("../data/eramissingvalues.csv")

## Deletion

- Column deletion: removing a column that has too many missing values and is non-essential for your model
- Row deletion: removing rows with missing values, ideally only when the values are Missing Completely At Random; otherwise the remaining rows are no longer a representative sample and can bias your model
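
Both kinds of deletion can be sketched on a small stand-in frame. The `demo` frame below only mimics the shape of the solar radiation data, and `max_missing` is a hypothetical cutoff, not a recommended value. A `KeyError` ending in `not found in axis`, like the one in this section's output, is what pandas raises when `drop` is given a label that does not exist on the axis it searches, for example a column name passed without `columns=` or `axis=1`. Whether that is what happened in the original cell is an assumption.

```python
import numpy as np
import pandas as pd

# Stand-in frame shaped like the solar radiation data (one missing reading)
demo = pd.DataFrame({
    "time": pd.date_range("2022-01-01 06:00", periods=5, freq=pd.Timedelta(hours=1)),
    "solar radiation": [0.0, np.nan, 374614.12, 834108.25, 1202242.5],
})

# Row deletion: drop rows where the column of interest is missing
rows_dropped = demo.dropna(subset=["solar radiation"])

# Column deletion: drop columns whose share of missing values is too high
max_missing = 0.5  # hypothetical cutoff
too_sparse = demo.columns[demo.isna().mean() > max_missing]
cols_kept = demo.drop(columns=too_sparse)

# Dropping one column by name requires columns= (or axis=1)
without_radiation = demo.drop(columns=["solar radiation"])

# Without it, drop() searches the row index and raises KeyError
try:
    demo.drop("solar radiation")
except KeyError as err:
    message = str(err)
    print(message)
```

Note that `dropna` keeps the original index labels, so the deleted rows show up as gaps in the index of the result.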

| Unnamed: 0 | time | solar radiation |
|---|---|---|
| 0 | 2022-01-01 06:00:00 | 0.000 |
| 2 | 2022-01-01 08:00:00 | 374614.120 |
| 3 | 2022-01-01 09:00:00 | 834108.250 |
| 4 | 2022-01-01 10:00:00 | 1202242.500 |
| 5 | 2022-01-01 11:00:00 | 1403760.400 |
| ... | ... | ... |
| 3697 | 2022-08-08 17:00:00 | 1238234.800 |
| 3698 | 2022-08-08 18:00:00 | 534686.500 |
| 3699 | 2022-08-08 19:00:00 | 83661.125 |
| 3701 | 2022-08-08 21:00:00 | 0.000 |


KeyError: "['solar radiation'] not found in axis"

## Imputation

- Fill missing values with a default (empty string, zero, etc.)
- Fill missing values with the mean, median, or mode of the column
- Backward or forward fill (copy the next or the previous observed value)
- Imputation risks injecting your own bias and adding noise to the data, and should be performed with caution
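
The constant and statistic-based fills can be sketched with `fillna`. The `demo` frame is a small stand-in that mimics the shape of the solar radiation data; it is not the notebook's `df`.

```python
import numpy as np
import pandas as pd

# Stand-in frame shaped like the solar radiation data (one missing reading)
demo = pd.DataFrame({
    "time": pd.date_range("2022-01-01 06:00", periods=5, freq=pd.Timedelta(hours=1)),
    "solar radiation": [0.0, np.nan, 374614.12, 834108.25, 1202242.5],
})
col = "solar radiation"

# Default value: replace every missing reading with zero
filled_zero = demo[col].fillna(0)

# Statistic: replace missing readings with the column mean or median
# (both are computed from the observed values only; NaN is skipped)
filled_mean = demo[col].fillna(demo[col].mean())
filled_median = demo[col].fillna(demo[col].median())

# mode() returns a Series because ties are possible; take the first entry
filled_mode = demo[col].fillna(demo[col].mode().iloc[0])

print(pd.DataFrame({
    "zero": filled_zero,
    "mean": filled_mean,
    "median": filled_median,
    "mode": filled_mode,
}))
```

Each `fillna` call returns a new Series and leaves `demo` unchanged, so the strategies can be compared side by side before one is committed to.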

| Unnamed: 0 | time | solar radiation |
|---|---|---|
| 0 | 2022-01-01 06:00:00 | 0.000 |
| 1 | 2022-01-01 07:00:00 | 0.000 |
| 2 | 2022-01-01 08:00:00 | 374614.120 |
| 3 | 2022-01-01 09:00:00 | 834108.250 |
| 4 | 2022-01-01 10:00:00 | 1202242.500 |
| ... | ... | ... |
| 3698 | 2022-08-08 18:00:00 | 534686.500 |
| 3699 | 2022-08-08 19:00:00 | 83661.125 |
| 3700 | 2022-08-08 20:00:00 | 0.000 |
| 3701 | 2022-08-08 21:00:00 | 0.000 |


1135037.7999999998

| Unnamed: 0 | time | solar radiation |
|---|---|---|
| 0 | 2022-01-01 06:00:00 | 0.000 |
| 1 | 2022-01-01 07:00:00 | 0.000 |
| 2 | 2022-01-01 08:00:00 | 374614.120 |
| 3 | 2022-01-01 09:00:00 | 834108.250 |
| 4 | 2022-01-01 10:00:00 | 1202242.500 |
| ... | ... | ... |
| 3698 | 2022-08-08 18:00:00 | 534686.500 |
| 3699 | 2022-08-08 19:00:00 | 83661.125 |
| 3700 | 2022-08-08 20:00:00 | 83661.125 |
| 3701 | 2022-08-08 21:00:00 | 0.000 |


| Unnamed: 0 | time | solar radiation |
|---|---|---|
| 0 | 2022-01-01 06:00:00 | 0.000 |
| 1 | 2022-01-01 07:00:00 | 374614.120 |
| 2 | 2022-01-01 08:00:00 | 374614.120 |
| 3 | 2022-01-01 09:00:00 | 834108.250 |
| 4 | 2022-01-01 10:00:00 | 1202242.500 |
| ... | ... | ... |
| 3698 | 2022-08-08 18:00:00 | 534686.500 |
| 3699 | 2022-08-08 19:00:00 | 83661.125 |
| 3700 | 2022-08-08 20:00:00 | 0.000 |
| 3701 | 2022-08-08 21:00:00 | 0.000 |


| Unnamed: 0 | time | solar radiation |
|---|---|---|
| 0 | 2022-01-01 06:00:00 | 0.000000e+00 |
| 1 | 2022-01-01 07:00:00 | 1.873071e+05 |
| 2 | 2022-01-01 08:00:00 | 3.746141e+05 |
| 3 | 2022-01-01 09:00:00 | 8.341082e+05 |
| 4 | 2022-01-01 10:00:00 | 1.202242e+06 |
| ... | ... | ... |
| 3698 | 2022-08-08 18:00:00 | 5.346865e+05 |
| 3699 | 2022-08-08 19:00:00 | 8.366112e+04 |
| 3700 | 2022-08-08 20:00:00 | 4.183056e+04 |
| 3701 | 2022-08-08 21:00:00 | 0.000000e+00 |
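
The values in rows 1 and 3700 of the last three tables are consistent with a forward fill, a backward fill, and a linear interpolation, in that order. That reading is inferred from the numbers alone, since the cells that produced them are not shown. A minimal sketch of the three operations on a stand-in Series that reuses the first few readings:

```python
import numpy as np
import pandas as pd

# Stand-in Series built from the first readings, with the 07:00 value missing
s = pd.Series([0.0, np.nan, 374614.12, 834108.25], name="solar radiation")

# Forward fill: copy the previous observed value into the gap
forward = s.ffill()

# Backward fill: copy the next observed value into the gap
backward = s.bfill()

# Linear interpolation: midpoint of the neighbours for a single-row gap
linear = s.interpolate()

print(pd.DataFrame({"ffill": forward, "bfill": backward, "interpolate": linear}))
```

For an evenly spaced hourly series, `interpolate()` with its default linear method fills a one-row gap with the average of the two neighbouring readings, which is often a more plausible guess for a smoothly varying signal than repeating either neighbour.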
