# Classification

👇 Import the dataset `cars_clean.csv`.

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
cars_clean = pd.read_csv('../data/cars_clean.csv')

Instead of modelling the exact price of the car, you are going to turn the dataset into a classification task.

## Binning

👇 Transform the continuous target `price` into two categories:
- **Cheap** for prices lower than the median value
- **Expensive** for prices equal to or higher than the median value

In [4]:
cars_clean['price'].max()

45400.0

In [14]:
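# pd.cut closes bins on the right by default: (0, median] -> 'cheap',
# (median, max] -> 'expensive', so a price equal to the median is 'cheap'.
cars_clean['price_cat'] = pd.cut(
    cars_clean['price'],
    bins=[0, cars_clean['price'].median(), cars_clean['price'].max()],
    labels=['cheap', 'expensive'])
cars_clean['price_cat'].value_counts()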

cheap        103
expensive    102
Name: price_cat, dtype: int64
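
`pd.cut` uses right-closed intervals unless told otherwise, so a price exactly equal to the median falls in the lower bin. The sketch below shows two ways to place the median itself in the upper class, as the task statement asks. It uses made-up toy prices (not values from `cars_clean.csv`), and the variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy prices for illustration only (not taken from cars_clean.csv)
prices = pd.Series([5000.0, 8000.0, 10000.0, 15000.0, 30000.0])
median = prices.median()

# Option 1: a direct comparison against the median
cat_where = pd.Series(np.where(prices < median, 'cheap', 'expensive'))

# Option 2: pd.cut with left-closed bins, so the median opens the upper bin
cat_cut = pd.cut(prices,
                 bins=[-np.inf, median, np.inf],
                 right=False,
                 labels=['cheap', 'expensive'])

print(cat_where.tolist())
print(cat_cut.tolist())
```

Both options assign the median price to `expensive`. Switching the notebook cell to either one may shift the class counts shown above by the number of rows priced exactly at the median.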

## Encoding

👇 Encode the newly created binary target. 

In [16]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
cars_clean['price_cat'] = le.fit_transform(cars_clean['price_cat'])
cars_clean

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price,aspiration_encoded,enginelocation_encoded,dohc_encoded,dohcv_encoded,l_encoded,ohc_encoded,ohcf_encoded,ohcv_encoded,rotor_encoded,price_cat
0,std,front,-0.608696,-0.014566,dohc,0.2,-2.033333,-0.285714,13495.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,std,front,-0.608696,-0.014566,dohc,0.2,-2.033333,-0.285714,16500.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,std,front,0.000000,0.514882,ohcv,0.4,0.600000,-0.285714,16500.0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
3,std,front,0.168669,-0.420797,ohc,0.2,0.366667,0.428571,13950.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
4,std,front,0.391304,0.516807,ohc,0.3,0.366667,0.428571,17450.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,std,front,1.478261,0.763241,ohc,0.2,-0.466667,0.285714,16845.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
201,turbo,front,1.434783,0.949992,ohc,0.2,-0.466667,0.142857,19045.0,1,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
202,std,front,1.478261,0.878757,ohcv,0.4,-1.400000,0.428571,21485.0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1
203,turbo,front,1.478261,1.273437,ohc,0.4,0.366667,-0.571429,22470.0,1,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1
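
To check which integer stands for which category, the fitted encoder can be inspected. This is a minimal sketch on toy labels (an assumption for illustration, not the notebook's data): `LabelEncoder` sorts the classes, so the position of a label in `classes_` is its code.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only
labels = ['cheap', 'expensive', 'expensive', 'cheap']

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ is sorted, so position i holds the label encoded as i
mapping = {label: int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original strings from the codes
decoded = le.inverse_transform(encoded)
print(list(decoded))
```

With these two labels, `cheap` maps to 0 and `expensive` to 1, which matches the `price_cat` values of 1 on the higher-priced rows displayed above.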


## Modelling

👇 Train and score two classification models: Logistic Regression and KNN.

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

log = LogisticRegression()
features = ['curbweight', 'cylindernumber', 'enginelocation_encoded', 'ohc_encoded', 'ohcv_encoded']
# Alternative feature set:
# features = ['carwidth', 'cylindernumber', 'enginelocation_encoded']
target_name = 'price_cat'

X = cars_clean[features]
y = cars_clean[target_name]

# Hold out 30% of the rows for testing (no random_state, so the split changes on every run)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit the model
log.fit(X_train, y_train)
# Score the model (mean accuracy on the test set)
log.score(X_test, y_test)


0.9193548387096774

In [51]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

knn = KNeighborsClassifier()
features = ['curbweight', 'cylindernumber', 'enginelocation_encoded', 'ohc_encoded', 'ohcv_encoded']
target_name = 'price_cat'

X = cars_clean[features]
y = cars_clean[target_name]

# New random 70/30 split (not the same rows as the logistic regression split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit the model
knn.fit(X_train, y_train)
# Score the model (mean accuracy on the test set)
knn.score(X_test, y_test)

0.8709677419354839
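
Because each cell above draws its own random split, the two accuracies are measured on different test rows. One way to make the comparison like-for-like is to share a single split fixed with `random_state` and, additionally, to average over cross-validation folds. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption, not the cars dataset); the dictionary names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the cars dataset)
X, y = make_classification(n_samples=205, n_features=5, random_state=0)

# One shared split, fixed by random_state, so both models see the same test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

models = {
    'logistic_regression': LogisticRegression(),
    'knn': KNeighborsClassifier(),
}

holdout_scores = {}
cv_scores = {}
for name, model in models.items():
    # Accuracy on the shared hold-out set
    holdout_scores[name] = model.fit(X_train, y_train).score(X_test, y_test)
    # Mean accuracy over 5 cross-validation folds on the full data
    cv_scores[name] = cross_val_score(model, X, y, cv=5).mean()

print(holdout_scores)
print(cv_scores)
```

On a dataset of about 200 rows, a single 30% hold-out contains roughly 60 samples, so a gap of a few points between two models can come from the split alone; the cross-validated mean is usually the steadier number to compare.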

❓ Which of the two models performs better?

⚠️ Please push the exercise when completed. Thanks 🙃

🏁