<a id="toc"></a>

# <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:175%; text-align:center; border-radius:5px 5px;">Auto Scout Car Prices Prediction Project: <br> Data Cleaning</p>

## <p style="background-color: #008080; font-family:newtimeroman; color:#FFF9ED; font-size:150%; text-align:center; border-radius:10px 10px;">Content</p>

* [INTRODUCTION](#0)
* [IMPORTING LIBRARIES NEEDED IN THIS NOTEBOOK](#1)
* [DATA IMPUTATION COLUMN BY COLUMN](#2)
* [THE END OF DATA IMPUTATION](#3)

<a id="0"></a>

## Introduction

Copy from the file 00_data_cleaning after the project is finished


<a id="1"></a>

## Importing Libraries Needed For Data Imputation

In [1]:
import json
import matplotlib.pyplot as plt
from itertools import product
import numpy as np
import os
import pandas as pd
from random import choices
import re

## User-Created Functions for Data Imputation

In [2]:
def drop_rows(df, f, vals):
    '''
    function takes a dataframe, drops rows for which df[f] assumes values listed in vals. 
    returns mutated dataframe
    
    df = dataframe
    f = name of feature
    vals = a list of values for which rows are to be dropped
    '''
    for v in vals:
        df.drop(df[df[f] == v].index, axis=0, inplace=True)
        
    return df

In [3]:
def fill(df, f_fill, f_use, how):
    '''
    function takes a dataframe, a feature for which missing values are to be filled,
    a feature to group by, and a method, either 'mode' or 'median'.
    missing values are filled in place with the statistic of their group.
    returns the mutated dataframe (None if how is neither 'mode' nor 'median')
    
    df = dataframe
    f_fill = name of feature with missing values to impute
    f_use = name of the feature whose values define the groups
    how = 'mode' or 'median'
    '''
    
    if how == 'mode':
        uniq_f_use = df[f_use].unique()
        for u in uniq_f_use:
            if len(df[df[f_use] == u][f_fill]) > 0:
                v = df[df[f_use] == u][f_fill].mode()[0]
                df.loc[df[f_use] == u, f_fill] = \
                    df.loc[df[f_use] == u, f_fill].fillna(v)
            else:
                print('empty class')
                return df
        return df
    
    if how == 'median':
        uniq_f_use = df[f_use].unique()
        for u in uniq_f_use:
            if len(df[df[f_use] == u][f_fill]) > 0:
                v = df[df[f_use] == u][f_fill].median()
                df.loc[df[f_use] == u, f_fill] = \
                    df.loc[df[f_use] == u, f_fill].fillna(v)
            else:
                print('empty class')
                return df
        return df
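
The same group-wise idea can also be written with `groupby(...).transform`, which avoids the explicit loop and the helper grouping column. This is only a sketch on toy data, not the function used in the rest of this notebook; `fill_by_group` and `_first_mode` are hypothetical names introduced for illustration.

```python
import pandas as pd

def _first_mode(s):
    # mode() is empty when a group holds only NaN, so guard against that
    m = s.mode()
    return m.iloc[0] if not m.empty else float("nan")

def fill_by_group(df, f_fill, keys, how="median"):
    """Return a copy of df with NaN in f_fill replaced by a per-group statistic."""
    if how == "median":
        stat = df.groupby(keys)[f_fill].transform("median")
    elif how == "mode":
        stat = df.groupby(keys)[f_fill].transform(_first_mode)
    else:
        raise ValueError("how must be 'median' or 'mode'")
    out = df.copy()
    out[f_fill] = out[f_fill].fillna(stat)
    return out

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "make_model": ["a", "a", "a", "b", "b"],
    "km": [10.0, 30.0, None, 100.0, None],
})
filled = fill_by_group(toy, "km", ["make_model"], how="median")
print(filled["km"].tolist())
```

Unlike `fill`, this sketch leaves the input dataframe untouched and raises an error for an unknown method instead of silently returning `None`.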


<a id="2"></a>
## Data Imputation Column by Column


In [4]:
df = pd.read_json('data_post00.json', lines=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15919 entries, 0 to 15918
Data columns (total 34 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   make_model               15919 non-null  object 
 1   body_type                15859 non-null  object 
 2   price                    15919 non-null  int64  
 3   km                       14895 non-null  float64
 4   prev_owner               9091 non-null   float64
 5   hp                       15831 non-null  float64
 6   type                     15917 non-null  object 
 7   first_registration       14322 non-null  float64
 8   body_color               15322 non-null  object 
 9   paint_type               10147 non-null  object 
 10  nr_doors                 15707 non-null  float64
 11  nr_seats                 14942 non-null  float64
 12  gearing_type             15919 non-null  object 
 13  displacement             15423 non-null  float64
 14  cylinders             

### Explore 'make_model'
* there are very few rows for audi_a2 (1) and renault_duster (34)
* we can drop these 35 rows from the data

In [6]:
df.make_model.value_counts(dropna=False)

audi_a3           3097
audi_a1           2614
opel_insignia     2598
opel_astra        2526
opel_corsa        2219
renault_clio      1839
renault_espace     991
renault_duster      34
audi_a2              1
Name: make_model, dtype: int64

In [7]:
df.drop(df[df['make_model'] == "audi_a2"].index, axis=0, inplace=True)

In [8]:
df.drop(df[df['make_model'] == "renault_duster"].index, axis=0, inplace=True)

In [9]:
df.make_model.value_counts(dropna=False)

audi_a3           3097
audi_a1           2614
opel_insignia     2598
opel_astra        2526
opel_corsa        2219
renault_clio      1839
renault_espace     991
Name: make_model, dtype: int64

### Impute 'body_type'

* there are too many categories; any category with fewer than 100 rows can be merged into 'other'
* 60 values are missing
* fill the missing values with the mode of body_type within each make_model
* this uses the fill function defined above, which fills missing values of column x with the mode of x grouped by y
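
Merging rare categories can be expressed with an explicit count threshold, so the rule stays correct if the counts change. A minimal sketch on toy data; `lump_rare` is a hypothetical helper, and the threshold of 3 is chosen only to suit the toy series.

```python
import pandas as pd

def lump_rare(s, min_count=100, other="other"):
    """Replace categories seen fewer than min_count times with `other`; NaN is left alone."""
    counts = s.value_counts()                 # NaN is excluded by default
    rare = counts[counts < min_count].index
    return s.where(~s.isin(rare), other)

# toy data, not taken from the real dataset
s = pd.Series(["sedans"] * 5 + ["coupe"] * 2 + ["other"] * 3 + [None])
lumped = lump_rare(s, min_count=3)
print(lumped.value_counts(dropna=False))
```

Missing values stay missing here, so they can still be imputed in a later step.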

In [10]:
df.body_type.value_counts(dropna=False)

sedans           7903
station wagon    3553
compact          3153
van               783
other             290
transporter        88
NaN                60
coupe              25
off-road           21
convertible         8
Name: body_type, dtype: int64

In [11]:
btv = df.body_type.value_counts()
btv

sedans           7903
station wagon    3553
compact          3153
van               783
other             290
transporter        88
coupe              25
off-road           21
convertible         8
Name: body_type, dtype: int64

In [12]:
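# lump every category with fewer than 100 rows into 'other'; NaN is left for imputation
rare = btv[btv < 100].index
df['body_type'] = df['body_type'].where(~df['body_type'].isin(rare), 'other')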

In [13]:
df.body_type.value_counts()

sedans           7903
station wagon    3553
compact          3153
van               783
other             432
Name: body_type, dtype: int64

In [14]:
df['body_type'].groupby(df['make_model']).value_counts(dropna=False)

make_model      body_type    
audi_a1         sedans           1538
                compact          1039
                station wagon      21
                other              15
                van                 1
audi_a3         sedans           2598
                station wagon     282
                compact           182
                other              28
                NaN                 7
opel_astra      station wagon    1211
                sedans           1053
                compact           185
                other              70
                NaN                 7
opel_corsa      compact          1230
                sedans            875
                other             110
                NaN                 2
                van                 2
opel_insignia   station wagon    1611
                sedans            900
                other              56
                compact            27
                NaN                 3
                van 

In [15]:
# check for error
df2 = fill(df,'body_type','make_model','mode')

In [16]:
df = df2.copy()
del df2

In [17]:
df['body_type'].groupby(df['make_model']).value_counts(dropna=False)

make_model      body_type    
audi_a1         sedans           1538
                compact          1039
                station wagon      21
                other              15
                van                 1
audi_a3         sedans           2605
                station wagon     282
                compact           182
                other              28
opel_astra      station wagon    1218
                sedans           1053
                compact           185
                other              70
opel_corsa      compact          1232
                sedans            875
                other             110
                van                 2
opel_insignia   station wagon    1614
                sedans            900
                other              56
                compact            27
                van                 1
renault_clio    sedans            933
                compact           484
                station wagon     337
                othe

### Impute km

* There are 1006 missing values
* first_registration would be a natural predictor, but it is itself missing for 820 of these 1006 rows
* Impute with the median km within groups defined by body_type (different kinds of cars are used in different ways) and type (new, pre_registered, etc.); the helper column for_km holds the group label


In [18]:
df.km.isna().sum()

1006

In [19]:
df.first_registration[df.km.isna()].value_counts(dropna=False)

NaN       820
2019.0    147
2018.0     38
2017.0      1
Name: first_registration, dtype: int64

In [20]:
df.type[df.km.isna()].value_counts(dropna=False)

new               831
pre_registered    118
demonstration      33
used               15
employees_car       8
NaN                 1
Name: type, dtype: int64

In [21]:
df.body_type[df.km.isna()].value_counts(dropna=False)

sedans           456
compact          260
station wagon    228
van               42
other             20
Name: body_type, dtype: int64

In [22]:
df.first_registration[df.km.isna()].value_counts(dropna=False)

NaN       820
2019.0    147
2018.0     38
2017.0      1
Name: first_registration, dtype: int64

In [23]:
df['for_km'] = df.groupby(['body_type', 'type']).ngroup()

In [24]:
df2 = fill(df,'km','for_km','median')

In [25]:
df = df2.copy()
del df2

In [26]:
df.km.isna().sum()

0

In [27]:
df.columns

Index(['make_model', 'body_type', 'price', 'km', 'prev_owner', 'hp', 'type',
       'first_registration', 'body_color', 'paint_type', 'nr_doors',
       'nr_seats', 'gearing_type', 'displacement', 'cylinders', 'weight',
       'drive_chain', 'fuel', 'co2_emission', 'comfort_convenience',
       'entertainment_media', 'extras', 'safety_security', 'gears',
       'country_version', 'warranty_mo', 'vat_deductible',
       'upholstery_material', 'upholstery_color', 'emission_class',
       'consumption_comb', 'consumption_city', 'consumption_country',
       'registration_continuous', 'for_km'],
      dtype='object')

In [28]:
df = df.drop(['for_km'], axis=1)

In [29]:
df.columns

Index(['make_model', 'body_type', 'price', 'km', 'prev_owner', 'hp', 'type',
       'first_registration', 'body_color', 'paint_type', 'nr_doors',
       'nr_seats', 'gearing_type', 'displacement', 'cylinders', 'weight',
       'drive_chain', 'fuel', 'co2_emission', 'comfort_convenience',
       'entertainment_media', 'extras', 'safety_security', 'gears',
       'country_version', 'warranty_mo', 'vat_deductible',
       'upholstery_material', 'upholstery_color', 'emission_class',
       'consumption_comb', 'consumption_city', 'consumption_country',
       'registration_continuous'],
      dtype='object')

### Impute 'prev_owner'

* as noted in 00_data_cleaning, a missing value means the car has no previous owner
* fill the missing values with 0

In [30]:
df.prev_owner.value_counts(dropna=False)

1.0    8293
NaN    6794
2.0     778
3.0      17
4.0       2
Name: prev_owner, dtype: int64

In [31]:
df['prev_owner'] = df['prev_owner'].fillna(0)

In [32]:
df.prev_owner.value_counts(dropna=False)

1.0    8293
0.0    6794
2.0     778
3.0      17
4.0       2
Name: prev_owner, dtype: int64

### Impute hp

* impute based on mode of groups by make_model and body_type

In [33]:
df.hp.value_counts(dropna=False)

85.0     2541
66.0     2122
81.0     1401
100.0    1308
110.0    1112
         ... 
84.0        1
195.0       1
44.0        1
239.0       1
9.0         1
Name: hp, Length: 81, dtype: int64

In [34]:
df['for_hp'] = df.groupby(['make_model','body_type']).ngroup()

In [35]:
df['for_hp'].unique()
# 32 groups

array([ 2,  3,  0,  1,  4,  6,  7,  5,  8, 12,  9, 11, 10, 15, 13, 14, 16,
       20, 19, 17, 18, 21, 22, 25, 24, 23, 26, 31, 30, 28, 29, 27],
      dtype=int64)

In [36]:
df2 = fill(df, 'hp','for_hp','mode')

In [37]:
df = df2.copy()

In [38]:
del df2

In [39]:
df.hp.isna().sum()

0

In [40]:
df = df.drop('for_hp', axis=1)

### Impute 'first_registration'

* when first_registration is missing, 'km' is very low in most cases
* I will assign 2019 as first_registration to all missing cases of all types. This will be right in all but 17 or so cases of type 'used' where mileage is higher than 500 km.
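
The rows for which a 2019 registration is doubtful can be counted directly. A small sketch on toy data, assuming the 500 km cut-off mentioned above; the toy values are made up for illustration.

```python
import pandas as pd

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "type": ["new", "new", "used", "used", "used"],
    "km": [10.0, 10.0, 5.0, 33124.0, 89982.0],
    "first_registration": [None, None, None, None, 2017.0],
})

missing_reg = toy["first_registration"].isna()
# rows with no registration year but a mileage that does not look like a new car
suspicious = toy[missing_reg & (toy["km"] > 500)]
print(suspicious.groupby("type").size())
```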



In [41]:
df['km'][df['first_registration'].isna()].describe()

count      1579.000000
mean        566.630146
std        6029.426399
min           0.000000
25%          10.000000
50%          10.000000
75%          10.000000
max      115137.000000
Name: km, dtype: float64

In [42]:
df[df['first_registration'].isna()].groupby('type')['km'].describe()

                 count       mean           std     min     25%     50%      75%      max
type
demonstration      5.0     6501.4   4232.293681  3000.0  3000.0  4307.0  11000.0  11200.0
employees_car      3.0     8700.0   8155.979402  3500.0  4000.0  4500.0  11300.0  18100.0
new             1529.0  11.238064      20.99759     0.0    10.0    10.0     10.0    500.0
pre_registered     6.0        8.5      3.674235     1.0    10.0    10.0     10.0     10.0
used              35.0    20106.6  29616.877021     5.0    10.0   497.0  33124.0  89982.0


In [43]:
df['first_registration'] = df['first_registration'].fillna(2019)

In [44]:
df['first_registration'].isna().sum()

0

### impute 'body_color'

* body_color is missing disproportionately often for type 'new' (426 of 597 missing values, about 70%, although 'new' is only about 10% of all rows) and for 'opel_insignia' and 'opel_astra' (357 of 597, about 60%)
* a quick search suggests that the most popular car color in Europe is 'black'
* 'black' is also the most frequent color in this dataset
* among 'new' cars, however, 'white' is the most frequent color for several make_models, which seems odd
* I will impute the missing body_color with 'black', which also counterbalances the surplus of 'white' among new cars
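
The shares quoted above can be computed directly instead of by hand, which avoids rounding slips. A sketch on toy data using `value_counts(normalize=True)`; the toy values are illustrative only.

```python
import pandas as pd

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "type": ["new", "new", "new", "used", "used", "used", "used", "used"],
    "body_color": [None, None, "white", None, "black", "grey", "black", "white"],
})

# share of each type among the rows with a missing color ...
share_in_missing = toy.loc[toy["body_color"].isna(), "type"].value_counts(normalize=True)
# ... compared with the share of each type in the whole table
share_overall = toy["type"].value_counts(normalize=True)

comparison = pd.DataFrame({"missing": share_in_missing, "overall": share_overall})
print(comparison)
```

A type whose share in the `missing` column is much larger than in the `overall` column is over-represented among the missing values.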

In [45]:
df.body_color.isna().sum()

597

In [46]:
df[df.body_color.isna()]['type'].value_counts()

new               426
used              140
employees_car      15
pre_registered     11
demonstration       5
Name: type, dtype: int64

In [47]:
425/600

0.7083333333333334

In [48]:
df['type'].value_counts()

used              11080
new                1632
pre_registered     1364
employees_car      1010
demonstration       796
Name: type, dtype: int64

In [49]:
df[df.type == 'new']['body_color'].value_counts()

white     357
black     294
grey      242
blue      144
red        65
silver     63
yellow     16
green      16
brown       6
beige       2
orange      1
Name: body_color, dtype: int64

In [50]:
df[df.body_color.isna()]['make_model'].value_counts()

opel_insignia     218
opel_astra        139
audi_a1            73
audi_a3            73
renault_clio       46
opel_corsa         28
renault_espace     20
Name: make_model, dtype: int64

In [51]:
(218 + 139) / 600

0.595

In [52]:
df['make_model'].value_counts()

audi_a3           3097
audi_a1           2614
opel_insignia     2598
opel_astra        2526
opel_corsa        2219
renault_clio      1839
renault_espace     991
Name: make_model, dtype: int64

In [53]:
df.body_color.value_counts()

black     3737
grey      3504
white     3380
silver    1647
blue      1431
red        957
brown      289
green      154
beige      108
yellow      51
violet      18
bronze       6
orange       3
gold         2
Name: body_color, dtype: int64

In [54]:
df[df.type=='new'].groupby('make_model')['body_color'].value_counts(dropna=False)

make_model      body_color
audi_a1         white         118
                black          75
                NaN            36
                grey           33
                red            20
                blue           19
                yellow         15
                green          11
                silver          1
audi_a3         white          87
                grey           77
                black          75
                NaN            46
                red            14
                silver         14
                blue           13
opel_astra      NaN           108
                black          38
                white          33
                grey           31
                blue           19
                red            11
                silver         11
                green           3
                brown           1
opel_corsa      blue           44
                grey           36
                silver         22
                black

In [55]:
df['body_color'] = df['body_color'].fillna('black')

In [56]:
df[df.type=='new'].groupby('make_model')['body_color'].value_counts(dropna=False)

make_model      body_color
audi_a1         white         118
                black         111
                grey           33
                red            20
                blue           19
                yellow         15
                green          11
                silver          1
audi_a3         black         121
                white          87
                grey           77
                red            14
                silver         14
                blue           13
opel_astra      black         146
                white          33
                grey           31
                blue           19
                red            11
                silver         11
                green           3
                brown           1
opel_corsa      blue           44
                grey           36
                black          35
                silver         22
                white          10
                red             7
                yello

### Impute 'paint_type'

* paint_type takes only three values: Metallic (about 97%), Uni/basic (about 3%), and Perl effect (rare)
* white, black, and grey make up about 80% of the body_colors where paint_type is missing
* Metallic is clearly the most common finish for these body_colors in every make_model
* imputing all missing paint_type with 'Metallic'

In [57]:
df.paint_type.isna().sum()

5755

In [58]:
df.paint_type.value_counts()/df.paint_type.count()

Metallic       0.966729
Uni/basic      0.032678
Perl effect    0.000592
Name: paint_type, dtype: float64

In [59]:
df[df.paint_type.isna()].body_color.value_counts()/df[df.paint_type.isna()].body_color.count()

white     0.413553
black     0.261512
grey      0.123892
red       0.070895
blue      0.070721
silver    0.033362
brown     0.009904
green     0.006429
beige     0.005560
yellow    0.002085
violet    0.000869
bronze    0.000521
orange    0.000348
gold      0.000348
Name: body_color, dtype: float64

In [60]:
df[(df.body_color=='white') | (df.body_color=='black') | (df.body_color=='grey')].groupby('make_model').paint_type\
.value_counts()

make_model      paint_type 
audi_a1         Metallic       1134
                Uni/basic        38
audi_a3         Metallic       1263
                Uni/basic        85
opel_astra      Metallic        965
                Uni/basic        21
                Perl effect       2
opel_corsa      Metallic        769
                Uni/basic        42
                Perl effect       1
opel_insignia   Metallic       1179
                Uni/basic        16
renault_clio    Metallic        544
                Uni/basic        38
renault_espace  Metallic        507
                Uni/basic        16
Name: paint_type, dtype: int64

In [61]:
df['paint_type'] = df['paint_type'].fillna('Metallic')

In [62]:
df.paint_type.value_counts(dropna=False)

Metallic       15547
Uni/basic        331
Perl effect        6
Name: paint_type, dtype: int64

### impute 'nr_doors'

* for every body_type the mode is 5 doors
* I will replace any missing value with 5 doors
* I will also reduce nr_doors to two classes: 5 or more and 4 or less
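
Filling and binning can be done in one short step with `numpy.where`. A sketch on a toy series; the labels `'<=4'` and `'>=5'` follow the two classes described above.

```python
import numpy as np
import pandas as pd

# toy data, not taken from the real dataset
doors = pd.Series([5.0, 4.0, 3.0, None, 7.0, 2.0])

doors = doors.fillna(5)  # fill with the mode first
binned = pd.Series(np.where(doors <= 4, "<=4", ">=5"), index=doors.index)
print(binned.tolist())
```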

In [63]:
df.nr_doors.isna().sum()

212

In [64]:
df.nr_doors.value_counts(dropna=False)/df.nr_doors.count()

5.0    0.736345
4.0    0.196465
3.0    0.053088
2.0    0.013974
NaN    0.013527
1.0    0.000064
7.0    0.000064
Name: nr_doors, dtype: float64

In [65]:
df.groupby('body_type').nr_doors.value_counts(dropna=False)/df.groupby('body_type').nr_doors.count()

body_type      nr_doors
compact        5.0         0.600449
               4.0         0.211168
               3.0         0.137677
               2.0         0.050385
               NaN         0.012516
               1.0         0.000321
other          5.0         0.847059
               3.0         0.065882
               4.0         0.061176
               2.0         0.025882
               NaN         0.016471
sedans         5.0         0.766017
               4.0         0.179780
               3.0         0.047796
               NaN         0.015505
               2.0         0.006407
station wagon  5.0         0.743553
               4.0         0.255596
               NaN         0.009634
               3.0         0.000567
               2.0         0.000283
van            5.0         0.885965
               4.0         0.112782
               NaN         0.013784
               7.0         0.001253
Name: nr_doors, dtype: float64

In [66]:
df['nr_doors'] = df['nr_doors'].fillna(5)

In [67]:
df.loc[df.nr_doors<=4, 'nr_doors'] = 4
df.loc[df.nr_doors>=5, 'nr_doors'] = 5

In [68]:
df.nr_doors.value_counts(dropna=False)/df.nr_doors.count()

5.0    0.739927
4.0    0.260073
Name: nr_doors, dtype: float64

In [69]:
df['nr_doors'] = ['<=4' if  x == 4 else '>=5' for x in df.nr_doors]

In [70]:
df.nr_doors.value_counts(dropna=False)/df.nr_doors.count()

>=5    0.739927
<=4    0.260073
Name: nr_doors, dtype: float64

### impute 'nr_seats'

* 977 missing values
* it makes sense to impute the number of seats by body_type
* for every body_type other than van, missing values can be imputed with 5 (the mode)
* renault_espace is the only true 'van' in this dataset; its seat count is 5 or 7 in a proportion of about 55:45
* missing values for vans are imputed by randomly assigning 5 or 7 in proportion 55:45
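
Random imputation gives different values on every run unless the generator is seeded. A minimal sketch with a dedicated `random.Random` instance; the seed 42 and the helper name `draw_seats` are arbitrary assumptions.

```python
from random import Random

rng = Random(42)  # fixed seed (arbitrary) so a rerun gives the same imputations

def draw_seats(n, options=(5, 7), weights=(55, 45)):
    """Draw n seat counts at once; the weights are the 55:45 split quoted above."""
    return rng.choices(options, weights=weights, k=n)

sample = draw_seats(1000)
print(sample.count(5) / len(sample))
```

Drawing all values in one call with `k=n` also replaces the per-row list comprehension.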

In [71]:
df.nr_seats.value_counts(dropna=False)

5.0    13301
4.0     1125
NaN      977
7.0      362
2.0      116
6.0        2
3.0        1
Name: nr_seats, dtype: int64

In [72]:
df.groupby('body_type').nr_seats.value_counts(dropna=False)/df.groupby('body_type').nr_seats.count()

body_type      nr_seats
compact        5.0         0.849158
               4.0         0.146801
               NaN         0.062290
               2.0         0.002694
               7.0         0.000673
               3.0         0.000337
               6.0         0.000337
other          5.0         0.831658
               2.0         0.087940
               NaN         0.085427
               7.0         0.055276
               4.0         0.025126
sedans         5.0         0.900805
               4.0         0.089128
               NaN         0.063758
               2.0         0.009530
               7.0         0.000537
station wagon  5.0         0.989666
               NaN         0.051963
               7.0         0.005610
               4.0         0.004133
               2.0         0.000295
               6.0         0.000295
van            5.0         0.548433
               7.0         0.448718
               NaN         0.152422
               2.0         0.001425
    

In [73]:
df[df.body_type=='van'].nr_seats.value_counts(dropna=False)

5.0    385
7.0    315
NaN    107
2.0      1
4.0      1
Name: nr_seats, dtype: int64

In [74]:
df[df.body_type=='van'].groupby('make_model').nr_seats.value_counts(dropna=False)

make_model      nr_seats
audi_a1         NaN           1
opel_corsa      NaN           2
opel_insignia   5.0           1
renault_clio    2.0           1
                5.0           1
renault_espace  5.0         383
                7.0         315
                NaN         104
                4.0           1
Name: nr_seats, dtype: int64

In [75]:
383/(383+315)

0.5487106017191977

In [76]:
df.loc[df.body_type != 'van', 'nr_seats'] = df.loc[df.body_type != 'van', 'nr_seats'].fillna(5)

In [77]:
df.loc[(df.body_type=='van') & (df.nr_seats.isna()),'nr_seats'] = \
    [choices((5,7), weights=(55, 45))[0] for \
     x in df.loc[(df.body_type=='van') & (df.nr_seats.isna()),'nr_seats']]

In [78]:
df[df.body_type=='van'].nr_seats.value_counts(dropna=False)

5.0    440
7.0    367
2.0      1
4.0      1
Name: nr_seats, dtype: int64

In [79]:
df.nr_seats.isna().sum()

0

### impute 'gearing_type'

* there are no missing values in this column, so nothing needs to be imputed

In [80]:
df.gearing_type.value_counts()

manual            8126
automatic         7289
semi-automatic     469
Name: gearing_type, dtype: int64

### impute 'displacement'

* missing in 495 rows
* displacement and cylinders are missing together in 448 cases
* the displacement medians by make_model show some trend with first_registration
* filling with the median within groups defined by 'make_model' and 'first_registration'

In [81]:
df.displacement.isna().sum()

495

In [82]:
df[df.displacement.isna()].cylinders.value_counts(dropna=False)

NaN    448
4.0     42
3.0      2
2.0      1
5.0      1
8.0      1
Name: cylinders, dtype: int64

In [83]:
df[df.displacement.isna()].make_model.value_counts(dropna=False)

renault_clio      91
opel_astra        89
renault_espace    87
opel_insignia     75
opel_corsa        74
audi_a3           50
audi_a1           29
Name: make_model, dtype: int64

In [84]:
df.groupby(['make_model','first_registration']).displacement.describe()

make_model  first_registration  count         mean         std    min     25%     50%     75%     max
audi_a1     2016.0              622.0  1283.254019  244.469974  999.0   999.0  1395.0  1422.0  1798.0
audi_a1     2017.0              424.0  1261.502358  248.676259  929.0   999.0  1395.0  1422.0  1798.0
audi_a1     2018.0              735.0  1236.282993  252.003312  929.0   999.0   999.0  1422.0  1798.0
audi_a1     2019.0              804.0  1033.807214  155.925339  995.0   999.0   999.0   999.0  1984.0
audi_a3     2016.0              791.0  1635.101138  213.117932  999.0  1598.0  1598.0  1598.0  2480.0
audi_a3     2017.0              666.0  1541.614114  268.762007  999.0  1598.0  1598.0  1598.0  2480.0
audi_a3     2018.0              768.0  1539.876302  265.731612  999.0  1598.0  1598.0  1598.0  2480.0
audi_a3     2019.0              822.0  1338.698297  296.402131  997.0   999.0  1498.0  1598.0  2480.0
opel_astra  2016.0              498.0  1534.534137  158.972176  998.0  1598.0  1598.0  1598.0  1686.0
opel_astra  2017.0              580.0  1480.041379  194.797486  998.0  1399.0  1598.0  1598.0  1696.0


In [85]:
df['for_displacement'] = df.groupby(['make_model','first_registration']).displacement.ngroup()

In [86]:
df2 = fill(df, 'displacement', 'for_displacement', 'median')

In [87]:
df = df2.copy()
del df2

In [88]:
df.displacement.isna().sum()

0

### impute 'cylinders'

In [89]:
df.cylinders.value_counts(dropna=False)

4.0    8072
NaN    5678
3.0    2104
5.0      22
6.0       3
8.0       2
2.0       2
1.0       1
Name: cylinders, dtype: int64

In [90]:
df.groupby(['make_model','first_registration']).cylinders.describe()

make_model  first_registration  count      mean       std  min  25%  50%  75%  max
audi_a1     2016.0              375.0     3.312  0.518368  3.0  3.0  3.0  4.0  8.0
audi_a1     2017.0              312.0  3.320513  0.467423  3.0  3.0  3.0  4.0  4.0
audi_a1     2018.0              534.0  3.280899   0.44986  3.0  3.0  3.0  4.0  4.0
audi_a1     2019.0              534.0   3.06367  0.244394  3.0  3.0  3.0  3.0  4.0
audi_a3     2016.0              548.0       4.0  0.085514  3.0  4.0  4.0  4.0  5.0
audi_a3     2017.0              521.0  3.871401   0.36264  3.0  4.0  4.0  4.0  5.0
audi_a3     2018.0              633.0  3.903633  0.353825  3.0  4.0  4.0  4.0  5.0
audi_a3     2019.0              592.0  3.621622  0.492315  3.0  3.0  4.0  4.0  5.0
opel_astra  2016.0              296.0  3.939189  0.239388  3.0  4.0  4.0  4.0  4.0
opel_astra  2017.0              383.0  3.911227  0.284788  3.0  4.0  4.0  4.0  4.0


In [91]:
df['for_cylinders'] = df.groupby(['make_model','first_registration']).cylinders.ngroup()

In [92]:
df2 = fill(df, 'cylinders', 'for_cylinders', 'median')

In [93]:
df = df2.copy()
del df2

In [94]:
df.cylinders.value_counts(dropna=False)

4.0    12673
3.0     3181
5.0       22
6.0        3
8.0        2
2.0        2
1.0        1
Name: cylinders, dtype: int64

In [95]:
df.cylinders = [3 if x <= 3 else 4 for x in df.cylinders]

In [96]:
df.cylinders = ['<=3' if x == 3 else '>=4' for x in df.cylinders]

### impute 'weight'

* imputing weight using make_model and first_registration

In [97]:
df.weight.isna().sum()

6939

In [98]:
df['for_weight'] = df.groupby(['make_model', 'first_registration']).ngroup()

In [99]:
df2 = fill(df, 'weight', 'for_weight', 'median')

In [100]:
df = df2.copy()
del df2

In [101]:
df.weight.isna().sum()

0

### impute 'drive_chain'

* a search shows that all the models in this dataset are also available in 4WD
* first treat the 4 'rear' entries in drive_chain as missing
* then impute by randomly assigning 'front' or '4WD' according to their proportions within each group of make_model and first_registration
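
Proportional random assignment within groups can be sketched as follows. The counts are put into a fixed label order so that each weight is guaranteed to belong to its label. `fill_proportional` is a hypothetical helper, the seed 0 is arbitrary, and the data are toy values.

```python
from random import Random
import pandas as pd

def fill_proportional(s, labels=("front", "4WD"), rng=None):
    """Fill NaN in s by sampling `labels` with weights equal to their observed counts."""
    rng = rng or Random(0)  # arbitrary seed, for reproducibility
    counts = s.value_counts().reindex(list(labels), fill_value=0)
    n_missing = int(s.isna().sum())
    if n_missing == 0 or counts.sum() == 0:
        return s  # nothing to fill, or nothing to learn the proportions from
    draws = rng.choices(list(labels), weights=counts.tolist(), k=n_missing)
    out = s.copy()
    out[out.isna()] = draws
    return out

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "make_model": ["a"] * 4 + ["b"] * 4,
    "drive_chain": ["front", "front", None, None, "4WD", "front", None, None],
})
toy["drive_chain"] = toy.groupby("make_model")["drive_chain"].transform(fill_proportional)
print(toy)
```

In group `a` only 'front' was observed, so its weight for '4WD' is zero and every gap becomes 'front'; in group `b` both labels can be drawn.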

In [102]:
df.drive_chain.isna().sum()

6826

In [103]:
df.drive_chain.value_counts()

front    8885
4WD       169
rear        4
Name: drive_chain, dtype: int64

In [104]:
df.drive_chain = [float('nan') if x=='rear' else x for x in df.drive_chain]

In [105]:
df.groupby('make_model').drive_chain.value_counts(dropna=False)

make_model      drive_chain
audi_a1         front          1693
                NaN             918
                4WD               3
audi_a3         front          2008
                NaN            1070
                4WD              19
opel_astra      front          1346
                NaN            1178
                4WD               2
opel_corsa      front          1170
                NaN            1049
opel_insignia   front          1409
                NaN            1095
                4WD              94
renault_clio    NaN            1033
                front           795
                4WD              11
renault_espace  NaN             487
                front           464
                4WD              40
Name: drive_chain, dtype: int64

In [106]:
df['for_drivechain'] = df.groupby(['make_model', 'first_registration']).ngroup()

In [107]:
for g in df.for_drivechain.unique():
    in_group = df.for_drivechain == g
    missing = in_group & df.drive_chain.isna()
    if not missing.any():
        continue
    # counts in a fixed (front, 4WD) order so each weight matches its label
    vc = df.loc[in_group, 'drive_chain'].value_counts().reindex(['front', '4WD'], fill_value=0)
    if vc['4WD'] == 0:
        # no 4WD observed in this group: fill with 'front'
        df.loc[missing, 'drive_chain'] = 'front'
    else:
        df.loc[missing, 'drive_chain'] = \
            choices(('front', '4WD'), weights=vc.tolist(), k=int(missing.sum()))

In [108]:
df.groupby('make_model').drive_chain.value_counts(dropna=False)

make_model      drive_chain
audi_a1         front          2609
                4WD               5
audi_a3         front          3069
                4WD              28
opel_astra      front          2521
                4WD               5
opel_corsa      front          2219
opel_insignia   front          2463
                4WD             135
renault_clio    front          1808
                4WD              31
renault_espace  front           923
                4WD              68
Name: drive_chain, dtype: int64

### impute fuel

* although there are no missing values, there are too many classes in this feature
* some research makes it possible to sort the fuel types into gas, diesel, low_emission, and other
* low_emission and other are very infrequent, so we can stick with only gas, diesel, and other
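
Mapping each entry to a single class, and sending mixed entries to 'other', can be done in one function. A sketch with a shortened, hypothetical lookup table (`FUEL_CLASS`) and a hypothetical helper (`classify_fuel`); a full mapping would need an entry for every fuel label in the data.

```python
# hypothetical, shortened lookup; a full mapping would list every fuel label
FUEL_CLASS = {
    "Super 95": "gas",
    "Gasoline": "gas",
    "Diesel": "diesel",
    "Diesel (Particulate Filter)": "diesel",
    "LPG": "other",
    "Electric": "other",
}

def classify_fuel(entry, lookup=FUEL_CLASS):
    """Map a comma-separated fuel string to one class; mixed or unknown entries become 'other'."""
    classes = {lookup.get(token.strip(), "other") for token in entry.split(",")}
    return classes.pop() if len(classes) == 1 else "other"

print(classify_fuel("Super 95,Gasoline"))  # gas
print(classify_fuel("Diesel,LPG"))         # other
```

Because the classes are collected in a set and reduced to one label immediately, the result does not depend on the order in which the set is iterated.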

In [109]:
df.fuel.value_counts()

Diesel (Particulate Filter)                                                                        4314
Super 95                                                                                           3338
Gasoline                                                                                           3141
Diesel                                                                                             2984
Super 95,Regular,Benzine 91                                                                         424
                                                                                                   ... 
Regular,Benzine 91,Super 95,Regular,Benzine E10 91                                                    1
Super Plus 98,Super E10 95                                                                            1
Regular,Benzine 91,Super 95,Regular,Benzine E10 91,Super E10 95,Super Plus 98,Super Plus E10 98       1
Regular,Benzine 91,Super Plus 98,Regular,Benzine E10 91,Super 95

In [110]:
fuels = set()
for entry in df.fuel:
    for fu in entry.split(','):
        fuels.add(fu)
print(fuels)

{'Electric', 'Gasoline', 'Liquid petroleum gas (LPG)', 'Super 95', 'LPG', 'Biodiesel', 'Benzine 91', 'Diesel (Particulate Filter)', 'Super Plus E10 98 (Particulate Filter)', 'Regular', 'Biogas', 'Benzine 91 (Particulate Filter)', 'Diesel', 'Domestic gas H', 'Super Plus 98', 'Benzine E10 91 (Particulate Filter)', 'Super E10 95', 'Gasoline (Particulate Filter)', 'Benzine E10 91', 'CNG (Particulate Filter)', 'CNG', 'Super E10 95 (Particulate Filter)', 'Super 95 (Particulate Filter)', 'Others', 'Super Plus 98 (Particulate Filter)', 'Others (Particulate Filter)', 'Super Plus E10 98'}


In [111]:
fuels_dict = {
    'CNG (Particulate Filter)': 'other', 
    'LPG':'other', 
    'Domestic gas H':'other', 
    'Super 95 (Particulate Filter)': 'gas', 
    'CNG':'other', 
    'Benzine 91': 'gas', 
    'Gasoline': 'gas', 
    'Super 95': 'gas', 
    'Diesel (Particulate Filter)': 'diesel', 
    'Super Plus E10 98 (Particulate Filter)': 'gas', 
    'Benzine E10 91 (Particulate Filter)': 'gas', 
    'Others':'other', 
    'Electric':'other', 
    'Liquid petroleum gas (LPG)':'other', 
    'Gasoline (Particulate Filter)': 'gas', 
    'Super E10 95': 'gas', 
    'Others (Particulate Filter)': 'other', 
    'Diesel': 'diesel', 
    'Biogas':'other', 
    'Super Plus 98 (Particulate Filter)': 'gas', 
    'Super E10 95 (Particulate Filter)': 'gas', 
    'Benzine 91 (Particulate Filter)': 'gas', 
    'Benzine E10 91': 'gas', 
    'Regular': 'gas', 
    'Biodiesel':'other', 
    'Super Plus 98': 'gas', 
    'Super Plus E10 98': 'gas'
}

In [112]:
# map every label in a comma-separated entry to its class and drop duplicate classes
df['fuel'] = [','.join(set([fuels_dict[x] for x in y.split(',')])) for y in df.fuel]

In [113]:
df.fuel.value_counts()

gas          8510
diesel       7298
other          71
other,gas       5
Name: fuel, dtype: int64

In [114]:
# entries that still mix several classes (they contain a comma) are counted as 'other'
df['fuel'] = ['other' if ',' in x else x for x in df.fuel]

In [115]:
df.fuel.value_counts(dropna=False)

gas       8510
diesel    7298
other       76
Name: fuel, dtype: int64
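The three steps above (split each entry, map every label to a class, collapse mixed entries to `other`) can be combined into one helper. This is a minimal sketch rather than the notebook's own code: `FUEL_MAP` is a hypothetical, shortened stand-in for `fuels_dict`, `collapse_fuel` is an illustrative name, and falling back to `other` for unknown labels is an assumption (the cells above would raise a `KeyError` instead).

```python
import pandas as pd

# Hypothetical, shortened mapping for illustration; fuels_dict above is the full version.
FUEL_MAP = {
    'Super 95': 'gas',
    'Regular': 'gas',
    'Gasoline': 'gas',
    'Diesel': 'diesel',
    'Diesel (Particulate Filter)': 'diesel',
    'LPG': 'other',
    'Electric': 'other',
}

def collapse_fuel(entry, mapping=FUEL_MAP):
    """Map a comma-separated fuel string to 'gas', 'diesel', or 'other'."""
    # Assumption: labels missing from the mapping are treated as 'other'.
    classes = {mapping.get(part.strip(), 'other') for part in entry.split(',')}
    # A single class is kept as is; entries mixing several classes become 'other'.
    return classes.pop() if len(classes) == 1 else 'other'

sample = pd.Series([
    'Super 95,Regular',
    'Diesel (Particulate Filter)',
    'LPG,Super 95',
    'Electric',
])
collapsed = sample.map(collapse_fuel)
print(collapsed.tolist())  # ['gas', 'diesel', 'other', 'other']
```

Doing the collapse in one pass avoids the intermediate `other,gas` class, whose label order depends on set iteration order.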

### impute 'co2_emission'

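* missing values are filled with the most frequent value (mode) among cars of the same make_model and first_registration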

In [116]:
df.co2_emission.isna().sum()

2434

In [117]:
df[df.co2_emission.isna()].fuel.value_counts()

diesel    1331
gas       1084
other       19
Name: fuel, dtype: int64

In [118]:
df.groupby(['make_model','first_registration']).co2_emission.describe()

| make_model | first_registration | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| audi_a1 | 2016.0 | 571.0 | 101.357268 | 11.726048 | 90.0 | 97.0 | 99.0 | 104.0 | 331.0 |
| audi_a1 | 2017.0 | 368.0 | 103.182065 | 8.403775 | 90.0 | 97.0 | 99.5 | 104.25 | 136.0 |
| audi_a1 | 2018.0 | 679.0 | 102.686303 | 5.930793 | 89.0 | 98.0 | 102.0 | 104.0 | 134.0 |
| audi_a1 | 2019.0 | 686.0 | 109.297376 | 5.653671 | 92.0 | 108.0 | 110.0 | 111.0 | 142.0 |
| audi_a3 | 2016.0 | 679.0 | 106.02651 | 10.044703 | 85.0 | 99.0 | 101.0 | 114.0 | 189.0 |
| audi_a3 | 2017.0 | 601.0 | 106.838602 | 8.835472 | 88.0 | 103.0 | 106.0 | 109.0 | 189.0 |
| audi_a3 | 2018.0 | 716.0 | 110.736034 | 50.036086 | 36.0 | 103.75 | 106.0 | 108.0 | 1060.0 |
| audi_a3 | 2019.0 | 752.0 | 113.513298 | 8.334059 | 95.0 | 106.75 | 114.0 | 117.0 | 194.0 |
| opel_astra | 2016.0 | 410.0 | 138.826829 | 591.75194 | 87.0 | 97.0 | 104.0 | 119.0 | 12087.0 |
| opel_astra | 2017.0 | 508.0 | 107.001969 | 13.476621 | 88.0 | 95.0 | 102.0 | 119.0 | 150.0 |


In [119]:
df['for_co2'] = df.groupby(['make_model', 'first_registration']).co2_emission.ngroup()

In [120]:
df2 = fill(df, 'co2_emission','for_co2', 'mode')

In [121]:
df = df2.copy()
del df2

In [122]:
df = df.drop('for_co2', axis=1)

In [123]:
df.co2_emission.isna().sum()

0
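The cells above fill each missing `co2_emission` with the mode of its (`make_model`, `first_registration`) group through the helper `fill` and a temporary group-id column. A compact alternative with `groupby(...).transform` is sketched below. It is not the notebook's implementation: `fill_with_group_mode` is a hypothetical name, the toy data is made up for illustration, and the sketch assumes the grouping columns themselves contain no missing values.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(df, target, by):
    """Return a copy of df with NaNs in `target` filled by the mode of each `by` group."""
    def _fill(s):
        modes = s.mode()  # mode() ignores NaN and is empty if the group has no observed value
        # With several equally frequent values, the smallest one is used.
        return s.fillna(modes.iloc[0]) if not modes.empty else s

    out = df.copy()
    out[target] = out.groupby(by)[target].transform(_fill)
    return out

# Made-up toy data with the same column names as the notebook.
toy = pd.DataFrame({
    'make_model': ['audi_a1'] * 4 + ['opel_astra'] * 3,
    'first_registration': [2016.0] * 4 + [2017.0] * 3,
    'co2_emission': [97.0, 97.0, 104.0, np.nan, 119.0, np.nan, 119.0],
})
filled = fill_with_group_mode(toy, 'co2_emission', ['make_model', 'first_registration'])
print(filled.co2_emission.tolist())  # [97.0, 97.0, 104.0, 97.0, 119.0, 119.0, 119.0]
```

Because the function works on a copy and groups directly on the key columns, no helper column such as `for_co2` has to be created and dropped afterwards.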

In [124]:
df.groupby(['make_model','first_registration']).co2_emission.describe()

| make_model | first_registration | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| audi_a1 | 2016.0 | 629.0 | 100.955485 | 11.242461 | 90.0 | 97.0 | 98.0 | 104.0 | 331.0 |
| audi_a1 | 2017.0 | 432.0 | 102.266204 | 8.060443 | 90.0 | 97.0 | 98.0 | 104.0 | 136.0 |
| audi_a1 | 2018.0 | 744.0 | 102.626344 | 5.668753 | 89.0 | 99.0 | 102.0 | 104.0 | 134.0 |
| audi_a1 | 2019.0 | 809.0 | 109.100124 | 5.226419 | 92.0 | 108.0 | 108.0 | 111.0 | 142.0 |
| audi_a3 | 2016.0 | 818.0 | 104.832518 | 9.523793 | 85.0 | 99.0 | 99.0 | 109.75 | 189.0 |
| audi_a3 | 2017.0 | 675.0 | 106.856296 | 8.336493 | 88.0 | 103.0 | 107.0 | 108.0 | 189.0 |
| audi_a3 | 2018.0 | 777.0 | 110.364221 | 48.046127 | 36.0 | 104.0 | 106.0 | 108.0 | 1060.0 |
| audi_a3 | 2019.0 | 827.0 | 113.829504 | 8.009601 | 95.0 | 108.0 | 117.0 | 117.0 | 194.0 |
| opel_astra | 2016.0 | 519.0 | 134.662813 | 525.881364 | 87.0 | 97.0 | 119.0 | 119.0 | 12087.0 |
| opel_astra | 2017.0 | 587.0 | 105.386712 | 13.188656 | 88.0 | 95.0 | 97.0 | 119.0 | 150.0 |


### impute

## Summary

<a id="3"></a>

## End of Data Imputation

<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:blue; background-color:#dfa8e4" data-toggle="popover">Content</a>

## Next: [Handling Outliers](02_handling_outliers.ipynb)