# Car Prices

🎯 In this exercise, you will apply the data preparation and feature selection techniques you have learnt today to a new dataset.

👇 Download the `ML_Cars_dataset.csv` [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset.csv) and place it in the `data` folder. Load it into this notebook as a pandas dataframe named `df`, and display its first 5 rows.

In [3]:
!curl https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset.csv > data/ML_cars_dataset.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9671  100  9671    0     0  45191      0 --:--:-- --:--:-- --:--:-- 45191


In [12]:
import pandas as pd

df = pd.read_csv('data/ML_cars_dataset.csv')
df.head()

|   | aspiration | enginelocation | carwidth | curbweight | enginetype | cylindernumber | stroke | peakrpm | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | std | front | 64.1 | 2548 | dohc | four | 2.68 | 5000 | expensive |
| 1 | std | front | 64.1 | 2548 | dohc | four | 2.68 | 5000 | expensive |
| 2 | std | front | 65.5 | 2823 | ohcv | six | 3.47 | 5000 | expensive |
| 3 | std | front | NaN | 2337 | ohc | four | 3.4 | 5500 | expensive |
| 4 | std | front | 66.4 | 2824 | ohc | five | 3.4 | 5500 | expensive |


In [13]:
df.shape

(205, 9)

ℹ️ The description of the dataset is available [here](https://wagon-public-datasets.s3.amazonaws.com/Machine%20Learning%20Datasets/ML_Cars_dataset_description.txt). Make sure to refer to it throughout the exercise.

# Duplicates

👇 Remove the duplicates from the dataset if there are any. Overwrite the dataframe `df`.

In [14]:
df.duplicated().sum()

14

In [15]:
# YOUR CODE HERE
df = df.drop_duplicates()
df.shape

(191, 9)

# Missing values

👇 Locate missing values, investigate them, and apply the solutions below accordingly:

- Impute with most frequent
- Impute with median

Make changes effective in the dataset `df`.

In [17]:
# YOUR CODE HERE
df.isnull().sum().sort_values(ascending=False)/len(data)

enginelocation    0.048780
carwidth          0.009756
aspiration        0.000000
curbweight        0.000000
enginetype        0.000000
cylindernumber    0.000000
stroke            0.000000
peakrpm           0.000000
price             0.000000
dtype: float64

In [19]:
from sklearn.impute import SimpleImputer

## `carwidth`

<details>
    <summary> 💡 Hint </summary>
    <br>
    ℹ️ <code>carwidth</code> has multiple representations of missing values: some are <code>np.nan</code>, some are <code>*</code>. Once located, they can be imputed with the median value, since fewer than 30% of the values are missing.
</details> 

In [23]:
import numpy as np

In [24]:
df.carwidth.unique()

array(['64.1', '65.5', nan, '66.4', '66.3', '71.4', '67.9', '64.8',
       '66.9', '70.9', '60.3', '*', '63.6', '63.8', '64.6', '63.9', '64',
       '65.2', '66', '61.8', '69.6', '70.6', '64.2', '65.7', '66.5',
       '66.1', '70.3', '71.7', '70.5', '72', '68', '64.4', '65.4', '68.4',
       '68.3', '65', '72.3', '66.6', '63.4', '65.6', '67.7', '67.2',
       '68.9', '68.8'], dtype=object)

In [25]:
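# Assign the result back instead of using inplace=True on a column accessor,
# which may not update `df` in recent pandas versions
df['carwidth'] = df['carwidth'].replace('*', np.nan)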

In [26]:
# YOUR CODE HERE
imputer = SimpleImputer(strategy='median')
imputer.fit(df[['carwidth']])
df['carwidth'] = imputer.transform(df[['carwidth']])

In [28]:
df.carwidth.isnull().sum(), df.carwidth.unique()

(0,
 array([64.1, 65.5, 66.4, 66.3, 71.4, 67.9, 64.8, 66.9, 70.9, 60.3, 63.6,
        63.8, 64.6, 63.9, 64. , 65.2, 66. , 61.8, 69.6, 70.6, 64.2, 65.7,
        66.5, 66.1, 70.3, 71.7, 70.5, 72. , 68. , 64.4, 65.4, 68.4, 68.3,
        65. , 72.3, 66.6, 63.4, 65.6, 67.7, 67.2, 68.9, 68.8]))

## `enginelocation`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ Since <code>enginelocation</code> is a categorical feature and the vast majority of its values are <code>front</code>, impute with the most frequent value.
</details>

In [29]:
# YOUR CODE HERE
imputerC = SimpleImputer(strategy='most_frequent')
imputerC.fit(df[['enginelocation']])
df['enginelocation'] = imputerC.transform(df[['enginelocation']])

In [30]:
df.enginelocation.isnull().sum(), df.enginelocation.unique()

(0, array(['front', 'rear'], dtype=object))

### ☑️ Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('missing_values',
                         dataset = df)
result.write()
print(result.check())

# Scaling

👇 Investigate the numerical features for outliers and distribution, and apply the solutions below accordingly:
- Robust Scale
- Standard Scale

Replace the original columns with the transformed values.

## `peakrpm` , `carwidth` , & `stroke`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>peakrpm</code>, <code>carwidth</code>, & <code>stroke</code> have normal distributions and outliers. They must be Robust Scaled.
</details>
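
As a minimal sketch of robust scaling, the snippet below applies scikit-learn's `RobustScaler` to a toy column. The values in `toy` are made up for illustration and are not taken from the cars dataset; the same pattern can be adapted to the columns of `df`.

```python
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Toy data (hypothetical values, not from the cars dataset)
toy = pd.DataFrame({"peakrpm": [4800, 5000, 5200, 5500, 6600]})

# RobustScaler subtracts the median and divides by the interquartile range,
# so a single extreme value (6600 here) has little influence on the scaling
scaler = RobustScaler()
toy["peakrpm"] = scaler.fit_transform(toy[["peakrpm"]]).ravel()

print(toy)
```

After scaling, the median of the column is 0 and the outlier remains visible as a large value rather than compressing the other points.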

In [0]:
# YOUR CODE HERE

In [0]:
# YOUR CODE HERE

## `curbweight`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>curbweight</code> has a normal distribution and no outliers. It can be Standard Scaled.
</details>
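
A minimal sketch of standard scaling with scikit-learn's `StandardScaler`, again on made-up values (the `toy` dataframe is an assumption for illustration only):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not from the cars dataset)
toy = pd.DataFrame({"curbweight": [2000.0, 2300.0, 2500.0, 2700.0, 3000.0]})

# StandardScaler subtracts the mean and divides by the standard deviation
scaler = StandardScaler()
toy["curbweight"] = scaler.fit_transform(toy[["curbweight"]]).ravel()

print(toy["curbweight"].mean(), toy["curbweight"].std(ddof=0))
```

The scaled column should have a mean close to 0 and a population standard deviation close to 1.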

In [0]:
# YOUR CODE HERE

In [0]:
# YOUR CODE HERE

### ☑️ Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('scaling',
                         dataset = df
)

result.write()
print(result.check())

# Encoding

👇 Investigate the features that require encoding, and apply the following techniques accordingly:

- One hot encoding
- Manual ordinal encoding

In the dataframe, replace the original features with their encoded version(s).

## `aspiration` & `enginelocation`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>aspiration</code> and <code>enginelocation</code> are binary categorical features.
</details>
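
One possible way to encode a binary feature is `OneHotEncoder` with `drop="if_binary"`, which keeps a single 0/1 column instead of two redundant ones. The sketch below uses a toy column; the category `turbo` is an assumption for illustration, so check the actual categories in `df` first.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (the "turbo" category is assumed for illustration)
toy = pd.DataFrame({"aspiration": ["std", "turbo", "std", "std"]})

# drop="if_binary" drops the first category of any two-category feature,
# leaving one column: 0 for "std", 1 for "turbo"
encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(toy[["aspiration"]]).toarray()
toy["aspiration"] = encoded.ravel()

print(encoder.categories_)
print(toy)
```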

In [0]:
# YOUR CODE HERE

## `enginetype`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>enginetype</code> is a multicategorical feature and must be One hot encoded.
</details>
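
For a feature with more than two categories, one hot encoding creates one column per category. The following is a sketch on a toy column, not the full solution; the column naming scheme `enginetype_<category>` is a choice made here for readability.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data using a few engine types seen in the first rows of the dataset
toy = pd.DataFrame({"enginetype": ["dohc", "ohcv", "ohc", "ohc"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["enginetype"]]).toarray()

# Build readable column names from the categories learnt by the encoder
columns = [f"enginetype_{category}" for category in encoder.categories_[0]]
encoded_df = pd.DataFrame(encoded, columns=columns, index=toy.index)

# Replace the original column with its encoded versions
toy = pd.concat([toy.drop(columns="enginetype"), encoded_df], axis=1)

print(toy)
```

Passing `index=toy.index` keeps the encoded rows aligned with the original ones, which matters when the dataframe index is no longer contiguous (for example after dropping duplicates).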

In [0]:
# YOUR CODE HERE

## `cylindernumber`

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>cylindernumber</code> is an ordinal feature and must be manually encoded.
</details>
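
Manual ordinal encoding can be sketched with a plain dictionary and `Series.map`. The mapping below is an assumption: only `four`, `five` and `six` appear in the rows shown earlier, so verify the full list of categories with `df.cylindernumber.unique()` before relying on it.

```python
import pandas as pd

# Toy data using cylinder labels seen in the first rows of the dataset
toy = pd.DataFrame({"cylindernumber": ["four", "six", "five", "four"]})

# Hypothetical mapping from label to number of cylinders;
# any label missing from the dictionary would become NaN
cylinder_map = {"two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "eight": 8, "twelve": 12}
toy["cylindernumber"] = toy["cylindernumber"].map(cylinder_map)

print(toy)
```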

In [0]:
# YOUR CODE HERE

## `price`

👇 Encode the target `price`.

<details>
    <summary>💡 Hint </summary>
    <br>
    ℹ️ <code>price</code> is the target and must be Label encoded.
</details>
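
A minimal sketch of label encoding with scikit-learn's `LabelEncoder`. The toy target below assumes a second class named `cheap` purely for illustration; the actual classes are listed in the dataset description.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target (the "cheap" class is assumed for illustration)
toy = pd.DataFrame({"price": ["expensive", "cheap", "expensive", "cheap"]})

# LabelEncoder sorts the classes alphabetically and maps them to 0, 1, ...
encoder = LabelEncoder()
toy["price"] = encoder.fit_transform(toy["price"])

print(encoder.classes_)
print(toy)
```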

In [0]:
# YOUR CODE HERE

### ☑️ Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('encoding',
                         dataset = df)
result.write()
print(result.check())

# Collinearity

👇 Perform a collinearity investigation on the dataset and remove unnecessary features. Make changes effective in the dataframe `df`.

In [0]:
# YOUR CODE HERE

ℹ️ Out of the highly correlated feature pairs, remove the one with less granularity.
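
One way to sketch a collinearity investigation is to compute the correlation matrix and list feature pairs above a threshold. The data below is synthetic, with `curbweight` deliberately built from `carwidth`, and the 0.9 threshold is an arbitrary assumption rather than a rule.

```python
import numpy as np
import pandas as pd

# Synthetic data: curbweight is constructed to be strongly correlated with carwidth
rng = np.random.default_rng(0)
width = rng.normal(66, 2, size=100)
toy = pd.DataFrame({
    "carwidth": width,
    "curbweight": 40 * width + rng.normal(0, 5, size=100),
    "stroke": rng.normal(3.2, 0.3, size=100),
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once, without the diagonal
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().rename("correlation").reset_index()
pairs.columns = ["feature_1", "feature_2", "correlation"]

# Pairs whose absolute correlation exceeds the (assumed) threshold
high = pairs[pairs["correlation"].abs() > 0.9]
print(high)
```

For each pair listed in `high`, one of the two features is a candidate for removal with `drop(columns=...)`.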

### ☑️ Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('collinearity',
                         dataset = df)
result.write()
print(result.check())

# Base Modelling

👇 Cross-validate a logistic regression model. Save its score under variable name `base_model_score`.
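
As a sketch of the cross-validation pattern, the snippet below scores a `LogisticRegression` on synthetic data from `make_classification`. The variable name `toy_score`, the 5 folds and the accuracy metric are assumptions made here; for the exercise, use the features and target from `df`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic binary classification data (stand-in for the prepared features and target)
X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           random_state=0)

# max_iter is raised to reduce the chance of convergence warnings
model = LogisticRegression(max_iter=1000)
cv_results = cross_validate(model, X, y, cv=5, scoring="accuracy")

# Average the accuracy over the 5 folds
toy_score = cv_results["test_score"].mean()
print(toy_score)
```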

In [0]:
# YOUR CODE HERE

### ☑️ Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('base_model',
                         score = base_model_score
)

result.write()
print(result.check())

# Feature Selection

👇 Perform feature permutation to remove the weak features from the feature set. With that strong feature set, cross-validate a new model, and save its score under variable name `strong_model_score`.
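
A possible sketch of feature permutation uses `permutation_importance` from `sklearn.inspection`. The data is synthetic, the importances are computed on the training data for brevity, and the 0.01 cut-off for "weak" features is an arbitrary assumption to adjust on the real dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic data: with shuffle=False the first 2 features are informative, the rest are noise
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           shuffle=False, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(5)])

# Fit a model, then measure how much the score drops when each feature is shuffled
model = LogisticRegression(max_iter=1000).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
importances = pd.Series(result.importances_mean, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)

# Keep features whose mean score drop exceeds the (assumed) threshold
threshold = 0.01
strong_features = importances[importances > threshold].index.tolist()

# Cross-validate a fresh model on the reduced feature set
cv_results = cross_validate(LogisticRegression(max_iter=1000),
                            X[strong_features], y, cv=5)
toy_strong_score = cv_results["test_score"].mean()
print(strong_features, toy_strong_score)
```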

In [0]:
# YOUR CODE HERE

### ☑️ Test your code

In [0]:
from nbresult import ChallengeResult

result = ChallengeResult('strong_model',
                         score = strong_model_score
)

result.write()
print(result.check())

# 🏁