# Missing Values

👇 Import the dataset `cars.csv` located in the folder `data`

In [4]:
import pandas as pd
cars = pd.read_csv('../data/cars.csv')
cars.head()

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price
0,std,front,64.1,2548,dohc,four,2.68,5000,13495.0
1,std,front,64.1,2548,dohc,four,2.68,5000,16500.0
2,std,front,65.5,2823,ohcv,six,3.47,5000,16500.0
3,std,front,,2337,ohc,four,3.4,5500,13950.0
4,std,front,66.4,2824,ohc,five,3.4,5500,17450.0


Each row of the dataset represents a car that is described by a mix of ordinal, categorical, and continuous features. The target is the price of the car.

👇 Check for the presence of missing values in the dataset

In [6]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   aspiration      205 non-null    object 
 1   enginelocation  195 non-null    object 
 2   carwidth        203 non-null    object 
 3   curbweight      205 non-null    int64  
 4   enginetype      205 non-null    object 
 5   cylindernumber  205 non-null    object 
 6   stroke          205 non-null    float64
 7   peakrpm         205 non-null    int64  
 8   price           205 non-null    float64
dtypes: float64(2), int64(2), object(5)
memory usage: 14.5+ KB
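
As a complement to `info()`, a per-column count of missing values can be sketched with `isna().sum()`. The `toy` DataFrame below is a hypothetical stand-in for `cars`, not the real data. Note that a placeholder such as `'*'` is an ordinary string, so it is not counted as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real `cars` DataFrame
toy = pd.DataFrame({
    "enginelocation": pd.Series(["front", np.nan, "rear", "front"], dtype=object),
    "carwidth": pd.Series(["64.1", "*", np.nan, "65.5"], dtype=object),
    "curbweight": [2548, 2823, 2337, 2824],
})

# Count the NaN entries in each column
missing_counts = toy.isna().sum()
print(missing_counts)
```

This is why a column can still hide missing data behind a placeholder even when its NaN count looks small.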


## 1. `enginelocation`

👇 Check the unique categories of the column `enginelocation`

In [7]:
cars['enginelocation'].unique()

array(['front', nan, 'rear'], dtype=object)

👇 Check the count for each unique category

In [8]:
cars['enginelocation'].value_counts()

front    192
rear       3
Name: enginelocation, dtype: int64

👇 Using `SimpleImputer`, fill the missing values with the column's most frequent value.

[`SimpleImputer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

In [12]:
from sklearn.impute import SimpleImputer

# Instantiate an object of the class SimpleImputer
# Specify the desired parameters
imputer = SimpleImputer(strategy="most_frequent")

# Call the method "fit" on the object
imputer.fit(cars[['enginelocation']])

# Call the method "transform" on the object
cars['enginelocation'] = imputer.transform(cars[['enginelocation']])

# The most frequent value is stored in the transformer's memory
imputer.statistics_

# cars.info()

array(['front'], dtype=object)
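
The two steps above can also be combined with `fit_transform`. The following is a minimal sketch on a hypothetical toy column (not the real `cars` data). It also flattens the 2D array returned by the imputer with `ravel()` before assigning it back to the column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy column with two missing values
toy = pd.DataFrame({
    "enginelocation": pd.Series(
        ["front", np.nan, "rear", "front", np.nan], dtype=object
    )
})

imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns the most frequent value and fills the gaps in one call;
# the result is a 2D array of shape (n_rows, 1)
filled = imputer.fit_transform(toy[["enginelocation"]])
toy["enginelocation"] = filled.ravel()

print(imputer.statistics_)
print(toy["enginelocation"].tolist())
```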

## 2. `carwidth`

👇 Check for different representations of missing values in the column `carwidth` and replace them with `np.nan`

In [15]:
cars['carwidth'].unique()

array(['64.1', '65.5', nan, '66.4', '66.3', '71.4', '67.9', '64.8',
       '66.9', '70.9', '60.3', '*', '63.6', '63.8', '64.6', '63.9', '64',
       '65.2', '66', '61.8', '69.6', '70.6', '64.2', '65.7', '66.5',
       '66.1', '70.3', '71.7', '70.5', '72', '68', '64.4', '65.4', '68.4',
       '68.3', '65', '72.3', '66.6', '63.4', '65.6', '67.7', '67.2',
       '68.9', '68.8'], dtype=object)

In [26]:
import numpy as np

# Replace the placeholder '*' with a proper missing value
cars['carwidth'] = cars['carwidth'].replace('*', np.nan)
cars['carwidth'].unique()

array(['64.1', '65.5', nan, '66.4', '66.3', '71.4', '67.9', '64.8',
       '66.9', '70.9', '60.3', '63.6', '63.8', '64.6', '63.9', '64',
       '65.2', '66', '61.8', '69.6', '70.6', '64.2', '65.7', '66.5',
       '66.1', '70.3', '71.7', '70.5', '72', '68', '64.4', '65.4', '68.4',
       '68.3', '65', '72.3', '66.6', '63.4', '65.6', '67.7', '67.2',
       '68.9', '68.8'], dtype=object)
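
After the replacement the column still holds strings (`dtype=object`). One possible way to get a numeric column, sketched here on a hypothetical toy Series, is `pd.to_numeric` with `errors="coerce"`, which turns every unparsable entry into `NaN`. This is an alternative shown for illustration, not the approach used in this notebook, and it silently coerces any unexpected string, so inspecting `unique()` first remains useful.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values; '*' and '?' stand for placeholder strings
widths = pd.Series(["64.1", "*", np.nan, "65.5", "?"], dtype=object)

# Parse what can be parsed as a number, coerce everything else to NaN
numeric = pd.to_numeric(widths, errors="coerce")

print(numeric.dtype)
print(numeric.isna().sum())
```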

👇 Use the transformer `KNNImputer` to fill in the missing values.

[`KNNImputer` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html)

In [36]:
from sklearn.impute import KNNImputer

# Instantiate the imputer (n_neighbors defaults to 5)
knnImputer = KNNImputer()

# Call the method fit_transform
cars['carwidth'] = knnImputer.fit_transform(cars[['carwidth']])
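
A point worth keeping in mind: `KNNImputer` measures distances using the features that are present in a row. When it is given a single column, a row with a missing value has nothing to compare on, and the imputer falls back to the mean of the observed values. The sketch below illustrates this on hypothetical toy arrays (the numbers are made up), and contrasts it with a two-column case where neighbours can actually be found.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Single hypothetical feature: no other column to measure distances with
single = np.array([[1.0], [2.0], [np.nan], [4.0], [9.0]])
single_filled = KNNImputer(n_neighbors=2).fit_transform(single)
# The missing entry is expected to equal the mean of the observed values
print(single_filled[2, 0])

# Two hypothetical features: the second column drives the neighbour search
multi = np.array([
    [60.0, 2000.0],
    [62.0, 2100.0],
    [np.nan, 2050.0],
    [70.0, 3000.0],
    [72.0, 3100.0],
])
multi_filled = KNNImputer(n_neighbors=2).fit_transform(multi)
# The missing entry is averaged over the two rows closest in the second column
print(multi_filled[2, 0])
```

Passing several numeric columns to the imputer is therefore one option to consider if a neighbour-based estimate is wanted. The features should be on comparable scales in that case.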

👇 Export the dataframe as a CSV file under the name `cars_clean.csv` in the folder `data`. The following exercises will build on the work you started here.

In [40]:
# to_csv returns None when a path is given, so there is nothing to assign
cars.to_csv('../data/cars_clean.csv', index=False)
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   aspiration      205 non-null    object 
 1   enginelocation  205 non-null    object 
 2   carwidth        205 non-null    float64
 3   curbweight      205 non-null    int64  
 4   enginetype      205 non-null    object 
 5   cylindernumber  205 non-null    object 
 6   stroke          205 non-null    float64
 7   peakrpm         205 non-null    int64  
 8   price           205 non-null    float64
dtypes: float64(3), int64(2), object(4)
memory usage: 14.5+ KB


⚠️ Please push the exercise when completed. Thanks 🙃

🏁