# Missing Value Inputation

## Why Do Missing Values Occur?
### Missing values can sneak into your data for a variety of reasons. 

Here are some common reasons:

1. **Data Entry Errors**: Sometimes, it’s just human error. Someone might forget to input a value or accidentally delete one.

2. **Sensor Malfunctions**: In IoT or scientific experiments, a faulty sensor might fail to record data at certain times.

3. **Survey Non-Response**: In surveys, respondents might skip questions they’re uncomfortable answering or don’t understand.

4. **Merged Datasets**: When combining data from multiple sources, some entries might not have corresponding values in all datasets.

5. **Data Corruption**: During data transfer or storage, some values might get corrupted and become unreadable.

6. **Intentional Omissions**: Some data might be intentionally left out due to privacy concerns or irrelevance.

7. **Sampling Issues**: The data collection method might systematically miss certain types of data.

8. **Time-Sensitive Data**: In time series data, values might be missing for periods when data wasn’t collected (e.g., weekends, holidays).


[Link to the original source](https://towardsdatascience.com/missing-value-imputation-explained-a-visual-guide-with-code-examples-for-beginners-93e0726284eb)


![Alt text](https://miro.medium.com/v2/resize:fit:700/format:webp/1*8q4lX67ocYMFXgFcIr5SFA.png)

### Code

Data set code: 

```
import pandas as pd
import numpy as np

# Create the dataset as a dictionary
data = {
    'Date': ['08-01', '08-02', '08-03', '08-04', '08-05', '08-06', '08-07', '08-08', '08-09', '08-10',
             '08-11', '08-12', '08-13', '08-14', '08-15', '08-16', '08-17', '08-18', '08-19', '08-20'],
    'Weekday': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5],
    'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'Temp': [25.1, 26.4, np.nan, 24.1, 24.7, 26.5, 27.6, 28.2, 27.1, 26.7, np.nan, 24.3, 23.1, 22.4, np.nan, 26.5, 28.6, np.nan, 27.0, 26.9],
    'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0, 98.0, 78.0, np.nan, 70.0, 75.0, np.nan, 77.0, 77.0, 89.0, 80.0, 88.0, 76.0, np.nan, 73.0, 73.0],
    'Wind': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
    'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy', np.nan, 'rainy', 'rainy', 'overcast', 'sunny', np.nan, 'overcast', 'sunny', 'rainy', 'sunny', 'rainy', np.nan, 'rainy', 'overcast', 'sunny'],
    'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20, 0.32, 0.72, 0.61, np.nan, 0.54, np.nan, 0.67, 0.66, 0.38, 0.46, np.nan, 0.52, np.nan, 0.62, 0.81]
}
```

In [1]:
import pandas as pd
import numpy as np

# Create the dataset as a dictionary
data = {
    'Date': ['08-01', '08-02', '08-03', '08-04', '08-05', '08-06', '08-07', '08-08', '08-09', '08-10',
             '08-11', '08-12', '08-13', '08-14', '08-15', '08-16', '08-17', '08-18', '08-19', '08-20'],
    'Weekday': [0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5, 6, 0, 1, 2, 3, 4, 5],
    'Holiday': [0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, np.nan, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    'Temp': [25.1, 26.4, np.nan, 24.1, 24.7, 26.5, 27.6, 28.2, 27.1, 26.7, np.nan, 24.3, 23.1, 22.4, np.nan, 26.5, 28.6, np.nan, 27.0, 26.9],
    'Humidity': [99.0, np.nan, 96.0, 68.0, 98.0, 98.0, 78.0, np.nan, 70.0, 75.0, np.nan, 77.0, 77.0, 89.0, 80.0, 88.0, 76.0, np.nan, 73.0, 73.0],
    'Wind': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, np.nan, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, np.nan, 1.0, 0.0],
    'Outlook': ['rainy', 'sunny', 'rainy', 'overcast', 'rainy', np.nan, 'rainy', 'rainy', 'overcast', 'sunny', np.nan, 'overcast', 'sunny', 'rainy', 'sunny', 'rainy', np.nan, 'rainy', 'overcast', 'sunny'],
    'Crowdedness': [0.14, np.nan, 0.21, 0.68, 0.20, 0.32, 0.72, 0.61, np.nan, 0.54, np.nan, 0.67, 0.66, 0.38, 0.46, np.nan, 0.52, np.nan, 0.62, 0.81]
}

# Create a DataFrame from the dictionary
df = pd.DataFrame(data)

# Display basic information about the dataset
print(df.info())

# Display the first few rows of the dataset
print(df.head())

# Display the count of missing values in each column
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         20 non-null     object 
 1   Weekday      20 non-null     int64  
 2   Holiday      18 non-null     float64
 3   Temp         16 non-null     float64
 4   Humidity     16 non-null     float64
 5   Wind         18 non-null     float64
 6   Outlook      17 non-null     object 
 7   Crowdedness  15 non-null     float64
dtypes: float64(5), int64(1), object(2)
memory usage: 1.4+ KB
None
    Date  Weekday  Holiday  Temp  Humidity  Wind   Outlook  Crowdedness
0  08-01        0      0.0  25.1      99.0   0.0     rainy         0.14
1  08-02        1      0.0  26.4       NaN   0.0     sunny          NaN
2  08-03        2      0.0   NaN      96.0   0.0     rainy         0.21
3  08-04        3      0.0  24.1      68.0   0.0  overcast         0.68
4  08-05        4      NaN  24.7      98.0   0.0   

In [2]:
print(df)

     Date  Weekday  Holiday  Temp  Humidity  Wind   Outlook  Crowdedness
0   08-01        0      0.0  25.1      99.0   0.0     rainy         0.14
1   08-02        1      0.0  26.4       NaN   0.0     sunny          NaN
2   08-03        2      0.0   NaN      96.0   0.0     rainy         0.21
3   08-04        3      0.0  24.1      68.0   0.0  overcast         0.68
4   08-05        4      NaN  24.7      98.0   0.0     rainy         0.20
5   08-06        5      0.0  26.5      98.0   0.0       NaN         0.32
6   08-07        6      0.0  27.6      78.0   0.0     rainy         0.72
7   08-08        0      0.0  28.2       NaN   0.0     rainy         0.61
8   08-09        1      0.0  27.1      70.0   0.0  overcast          NaN
9   08-10        2      1.0  26.7      75.0   NaN     sunny         0.54
10  08-11        3      0.0   NaN       NaN   0.0       NaN          NaN
11  08-12        4      NaN  24.3      77.0   0.0  overcast         0.67
12  08-13        5      0.0  23.1      77.0   1.0  

## Listwise Deletion

In [3]:
# Count missing values in each row
missing_count = df.isnull().sum(axis=1)

# Keep only rows with less than 4 missing values
df_clean = df[missing_count < 4].copy()

In [4]:
print(missing_count)

0     0
1     2
2     1
3     0
4     1
5     1
6     0
7     1
8     1
9     1
10    4
11    1
12    0
13    0
14    1
15    1
16    1
17    4
18    0
19    0
dtype: int64


## Simple imputation

In [5]:
# Mean imputation for Humidity
df_clean['Humidity'] = df_clean['Humidity'].fillna(df_clean['Humidity'].mean())

# Mode imputation for Holiday
df_clean['Holiday'] = df_clean['Holiday'].fillna(df_clean['Holiday'].mode()[0])

In [6]:
print(df_clean)

     Date  Weekday  Holiday  Temp  Humidity  Wind   Outlook  Crowdedness
0   08-01        0      0.0  25.1   99.0000   0.0     rainy         0.14
1   08-02        1      0.0  26.4   82.1875   0.0     sunny          NaN
2   08-03        2      0.0   NaN   96.0000   0.0     rainy         0.21
3   08-04        3      0.0  24.1   68.0000   0.0  overcast         0.68
4   08-05        4      0.0  24.7   98.0000   0.0     rainy         0.20
5   08-06        5      0.0  26.5   98.0000   0.0       NaN         0.32
6   08-07        6      0.0  27.6   78.0000   0.0     rainy         0.72
7   08-08        0      0.0  28.2   82.1875   0.0     rainy         0.61
8   08-09        1      0.0  27.1   70.0000   0.0  overcast          NaN
9   08-10        2      1.0  26.7   75.0000   NaN     sunny         0.54
11  08-12        4      0.0  24.3   77.0000   0.0  overcast         0.67
12  08-13        5      0.0  23.1   77.0000   1.0     sunny         0.66
13  08-14        6      0.0  22.4   89.0000   1.0  

# Linear Interpolation

In [7]:
df_clean['Temp'] = df_clean['Temp'].interpolate(method='linear')

In [8]:
print(df_clean)

     Date  Weekday  Holiday   Temp  Humidity  Wind   Outlook  Crowdedness
0   08-01        0      0.0  25.10   99.0000   0.0     rainy         0.14
1   08-02        1      0.0  26.40   82.1875   0.0     sunny          NaN
2   08-03        2      0.0  25.25   96.0000   0.0     rainy         0.21
3   08-04        3      0.0  24.10   68.0000   0.0  overcast         0.68
4   08-05        4      0.0  24.70   98.0000   0.0     rainy         0.20
5   08-06        5      0.0  26.50   98.0000   0.0       NaN         0.32
6   08-07        6      0.0  27.60   78.0000   0.0     rainy         0.72
7   08-08        0      0.0  28.20   82.1875   0.0     rainy         0.61
8   08-09        1      0.0  27.10   70.0000   0.0  overcast          NaN
9   08-10        2      1.0  26.70   75.0000   NaN     sunny         0.54
11  08-12        4      0.0  24.30   77.0000   0.0  overcast         0.67
12  08-13        5      0.0  23.10   77.0000   1.0     sunny         0.66
13  08-14        6      0.0  22.40   8

# Forward/Backward Fill

In [9]:
df_clean['Outlook'] = df_clean['Outlook'].fillna(method='ffill').fillna(method='bfill')

  df_clean['Outlook'] = df_clean['Outlook'].fillna(method='ffill').fillna(method='bfill')


In [10]:
print(df_clean)

     Date  Weekday  Holiday   Temp  Humidity  Wind   Outlook  Crowdedness
0   08-01        0      0.0  25.10   99.0000   0.0     rainy         0.14
1   08-02        1      0.0  26.40   82.1875   0.0     sunny          NaN
2   08-03        2      0.0  25.25   96.0000   0.0     rainy         0.21
3   08-04        3      0.0  24.10   68.0000   0.0  overcast         0.68
4   08-05        4      0.0  24.70   98.0000   0.0     rainy         0.20
5   08-06        5      0.0  26.50   98.0000   0.0     rainy         0.32
6   08-07        6      0.0  27.60   78.0000   0.0     rainy         0.72
7   08-08        0      0.0  28.20   82.1875   0.0     rainy         0.61
8   08-09        1      0.0  27.10   70.0000   0.0  overcast          NaN
9   08-10        2      1.0  26.70   75.0000   NaN     sunny         0.54
11  08-12        4      0.0  24.30   77.0000   0.0  overcast         0.67
12  08-13        5      0.0  23.10   77.0000   1.0     sunny         0.66
13  08-14        6      0.0  22.40   8

# Constant Value Imputation

In [11]:
df_clean['Wind'] = df_clean['Wind'].fillna(-1)

In [12]:
print(df_clean)

     Date  Weekday  Holiday   Temp  Humidity  Wind   Outlook  Crowdedness
0   08-01        0      0.0  25.10   99.0000   0.0     rainy         0.14
1   08-02        1      0.0  26.40   82.1875   0.0     sunny          NaN
2   08-03        2      0.0  25.25   96.0000   0.0     rainy         0.21
3   08-04        3      0.0  24.10   68.0000   0.0  overcast         0.68
4   08-05        4      0.0  24.70   98.0000   0.0     rainy         0.20
5   08-06        5      0.0  26.50   98.0000   0.0     rainy         0.32
6   08-07        6      0.0  27.60   78.0000   0.0     rainy         0.72
7   08-08        0      0.0  28.20   82.1875   0.0     rainy         0.61
8   08-09        1      0.0  27.10   70.0000   0.0  overcast          NaN
9   08-10        2      1.0  26.70   75.0000  -1.0     sunny         0.54
11  08-12        4      0.0  24.30   77.0000   0.0  overcast         0.67
12  08-13        5      0.0  23.10   77.0000   1.0     sunny         0.66
13  08-14        6      0.0  22.40   8

# KNN Imputation

In [13]:
from sklearn.impute import KNNImputer

# One-hot encode the 'Outlook' column
outlook_encoded = pd.get_dummies(df_clean['Outlook'], prefix='Outlook')

# Prepare features for KNN imputation
features_for_knn = ['Weekday', 'Holiday', 'Temp', 'Humidity', 'Wind']
knn_features = pd.concat([df_clean[features_for_knn], outlook_encoded], axis=1)

# Apply KNN imputation
knn_imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(knn_imputer.fit_transform(pd.concat([knn_features, df_clean[['Crowdedness']]], axis=1)),
                          columns=list(knn_features.columns) + ['Crowdedness'])

# Update the original dataframe with the imputed Crowdedness values
df_clean['Crowdedness'] = df_imputed['Crowdedness']

In [14]:
print(df_clean)

     Date  Weekday  Holiday   Temp  Humidity  Wind   Outlook  Crowdedness
0   08-01        0      0.0  25.10   99.0000   0.0     rainy     0.140000
1   08-02        1      0.0  26.40   82.1875   0.0     sunny     0.580000
2   08-03        2      0.0  25.25   96.0000   0.0     rainy     0.210000
3   08-04        3      0.0  24.10   68.0000   0.0  overcast     0.680000
4   08-05        4      0.0  24.70   98.0000   0.0     rainy     0.200000
5   08-06        5      0.0  26.50   98.0000   0.0     rainy     0.320000
6   08-07        6      0.0  27.60   78.0000   0.0     rainy     0.720000
7   08-08        0      0.0  28.20   82.1875   0.0     rainy     0.610000
8   08-09        1      0.0  27.10   70.0000   0.0  overcast     0.703333
9   08-10        2      1.0  26.70   75.0000  -1.0     sunny     0.540000
11  08-12        4      0.0  24.30   77.0000   0.0  overcast     0.660000
12  08-13        5      0.0  23.10   77.0000   1.0     sunny     0.380000
13  08-14        6      0.0  22.40   8