# Missing data imputation by group

The `GroupImputer` class imputes missing values within groups defined by a categorical variable. The example uses Australian weather data, which contains many missing values. They are filled with the mean, median, or mode of each column, computed separately for each location where the data were collected.

The data is downloaded directly with the Kaggle API, but it is also available [here](https://www.kaggle.com/jsphyg/weather-dataset-rattle-package).

The code is available [here](https://github.com/abreukuse/ml_utilities/blob/master/group_imputer.py). The class provides `fit` and `transform` methods, so it can be used in scikit-learn pipelines.
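To make the idea concrete, here is a minimal sketch of how a group-wise imputer with `fit` and `transform` could be written. It is not the code from the linked repository: the class name `SimpleGroupImputer`, its attributes, and the fallback to the overall statistic for groups with no observed values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleGroupImputer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in: fills NaNs with a statistic computed per group."""

    def __init__(self, grouping, columns, strategy="mean"):
        self.grouping = grouping
        self.columns = columns
        self.strategy = strategy

    def _statistic(self, series):
        if self.strategy == "mean":
            return series.mean()
        if self.strategy == "median":
            return series.median()
        if self.strategy == "mode":
            modes = series.mode()  # ignores NaN; may hold several tied values
            return modes.iloc[0] if not modes.empty else np.nan
        raise ValueError(f"Unknown strategy: {self.strategy!r}")

    def fit(self, X, y=None):
        self.group_stats_ = {}
        self.global_stats_ = {}
        for column in self.columns:
            # One statistic per group, indexed by the group label.
            self.group_stats_[column] = X.groupby(self.grouping)[column].apply(self._statistic)
            # Assumption: overall statistic used when a group has no observed values.
            self.global_stats_[column] = self._statistic(X[column])
        return self

    def transform(self, X):
        X = X.copy()
        for column in self.columns:
            per_group = X[self.grouping].map(self.group_stats_[column])
            X[column] = X[column].fillna(per_group).fillna(self.global_stats_[column])
        return X


# Toy data: location "C" has no observed values at all.
toy = pd.DataFrame({
    "Location": ["A", "A", "A", "B", "B", "C"],
    "MaxTemp": [20.0, np.nan, 24.0, 30.0, np.nan, np.nan],
    "WindDir": ["N", "N", np.nan, "S", np.nan, np.nan],
})

filled = SimpleGroupImputer("Location", ["MaxTemp"], strategy="mean").fit_transform(toy)
filled_mode = SimpleGroupImputer("Location", ["WindDir"], strategy="mode").fit_transform(toy)
```

Storing the statistics in `fit` and applying them in `transform` is what lets such a class slot into a scikit-learn pipeline; whether the real `GroupImputer` handles empty groups the same way should be checked in its source.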

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import make_pipeline

from group_imputer import GroupImputer

import os
os.environ['KAGGLE_USERNAME'] = 'kaggle_username'
os.environ['KAGGLE_KEY'] = 'kaggle_api_key'

In [2]:
!kaggle datasets download -d jsphyg/weather-dataset-rattle-package

Downloading weather-dataset-rattle-package.zip to /content
  0% 0.00/3.83M [00:00<?, ?B/s]
100% 3.83M/3.83M [00:00<00:00, 62.5MB/s]


In [3]:
!unzip 'weather-dataset-rattle-package.zip'

Archive:  weather-dataset-rattle-package.zip
  inflating: weatherAUS.csv          


In [4]:
data = pd.read_csv('weatherAUS.csv')
data.head()

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,,,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,,,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,,,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,,,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,,,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,,,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,,,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No


In [5]:
data.isnull().sum()

Date                 0
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64

In [6]:
# Continuous measurements: fill with the per-location mean
imputer_mean = GroupImputer(grouping='Location',
                            columns=['MaxTemp',
                                     'MinTemp',
                                     'Rainfall',
                                     'Evaporation',
                                     'Humidity9am',
                                     'Humidity3pm',
                                     'Temp9am',
                                     'Temp3pm'],
                            strategy='mean')

# Sunshine, wind speed and pressure: fill with the per-location median
imputer_median = GroupImputer(grouping='Location',
                              columns=['Sunshine',
                                       'WindSpeed9am',
                                       'WindSpeed3pm',
                                       'WindGustSpeed',
                                       'Pressure9am',
                                       'Pressure3pm'],
                              strategy='median')

# Categorical and discrete columns: fill with the per-location mode
imputer_mode = GroupImputer(grouping='Location',
                            columns=['WindGustDir',
                                     'WindDir9am',
                                     'WindDir3pm',
                                     'Cloud9am',
                                     'Cloud3pm',
                                     'RainToday'],
                            strategy='mode')

In [7]:
pipeline = make_pipeline(imputer_mean, imputer_median, imputer_mode)
data_nan_imputed = pipeline.fit_transform(data)
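
Here the pipeline is fitted and applied to the same data. If a model is trained afterwards, a common practice is to learn the statistics from the training rows only and reuse them on held-out rows, which with a scikit-learn pipeline would be `pipeline.fit(train)` followed by `pipeline.transform(test)`. The sketch below shows the same idea with plain pandas on made-up data; the frames `train` and `test` and their values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Made-up training and held-out rows (illustrative values only).
train = pd.DataFrame({
    "Location": ["Albury", "Albury", "Sydney", "Sydney"],
    "MaxTemp": [22.0, 26.0, 30.0, np.nan],
})
test = pd.DataFrame({
    "Location": ["Albury", "Sydney"],
    "MaxTemp": [np.nan, np.nan],
})

# "fit": per-location means are learned from the training rows only.
location_means = train.groupby("Location")["MaxTemp"].mean()

# "transform": the stored means fill the gaps in the held-out rows.
test_filled = test.assign(
    MaxTemp=test["MaxTemp"].fillna(test["Location"].map(location_means))
)
```

Keeping the two steps separate prevents values from the held-out rows from influencing the imputation statistics.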

In [8]:
data_nan_imputed.head(10)

Unnamed: 0,Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,Sunshine,WindGustDir,WindGustSpeed,WindDir9am,WindDir3pm,WindSpeed9am,WindSpeed3pm,Humidity9am,Humidity3pm,Pressure9am,Pressure3pm,Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday,RainTomorrow
0,2008-12-01,Albury,13.4,22.9,0.6,5.476532,8.35,W,44.0,W,WNW,20.0,24.0,71.0,22.0,1007.7,1007.1,8.0,8.0,16.9,21.8,No,No
1,2008-12-02,Albury,7.4,25.1,0.0,5.476532,8.35,WNW,44.0,NNW,WSW,4.0,22.0,44.0,25.0,1010.6,1007.8,8.0,8.0,17.2,24.3,No,No
2,2008-12-03,Albury,12.9,25.7,0.0,5.476532,8.35,WSW,46.0,W,WSW,19.0,26.0,38.0,30.0,1007.6,1008.7,8.0,2.0,21.0,23.2,No,No
3,2008-12-04,Albury,9.2,28.0,0.0,5.476532,8.35,NE,24.0,SE,E,11.0,9.0,45.0,16.0,1017.6,1012.8,8.0,8.0,18.1,26.5,No,No
4,2008-12-05,Albury,17.5,32.3,1.0,5.476532,8.35,W,41.0,ENE,NW,7.0,20.0,82.0,33.0,1010.8,1006.0,7.0,8.0,17.8,29.7,No,No
5,2008-12-06,Albury,14.6,29.7,0.2,5.476532,8.35,WNW,56.0,W,W,19.0,24.0,55.0,23.0,1009.2,1005.4,8.0,8.0,20.6,28.9,No,No
6,2008-12-07,Albury,14.3,25.0,0.0,5.476532,8.35,W,50.0,SW,W,20.0,24.0,49.0,19.0,1009.6,1008.2,1.0,8.0,18.1,24.6,No,No
7,2008-12-08,Albury,7.7,26.7,0.0,5.476532,8.35,W,35.0,SSE,W,6.0,17.0,48.0,19.0,1013.4,1010.1,8.0,8.0,16.3,25.5,No,No
8,2008-12-09,Albury,9.7,31.9,0.0,5.476532,8.35,NNW,80.0,SE,NW,7.0,28.0,42.0,9.0,1008.9,1003.6,8.0,8.0,18.3,30.2,No,Yes
9,2008-12-10,Albury,13.1,30.1,1.4,5.476532,8.35,W,28.0,S,SSE,15.0,11.0,58.0,27.0,1007.0,1005.7,8.0,8.0,20.1,28.2,Yes,No


In [9]:
data_nan_imputed.isnull().sum()

Date                0
Location            0
MinTemp             0
MaxTemp             0
Rainfall            0
Evaporation         0
Sunshine            0
WindGustDir         0
WindGustSpeed       0
WindDir9am          0
WindDir3pm          0
WindSpeed9am        0
WindSpeed3pm        0
Humidity9am         0
Humidity3pm         0
Pressure9am         0
Pressure3pm         0
Cloud9am            0
Cloud3pm            0
Temp9am             0
Temp3pm             0
RainToday           0
RainTomorrow     3267
dtype: int64

`RainTomorrow` was left unimputed because it is the target variable.
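
One common way to deal with the remaining gaps is to drop the rows whose target is missing before training a model. This is an assumption about the downstream workflow, not a step the notebook performs; the sketch uses a small made-up frame in place of `data_nan_imputed`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the imputed frame, with one unlabelled row.
toy_imputed = pd.DataFrame({
    "MaxTemp": [22.9, 25.1, 25.7],
    "RainTomorrow": ["No", np.nan, "Yes"],
})

# Keep only the rows that have a known target value.
labelled = toy_imputed.dropna(subset=["RainTomorrow"])
```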