# Regression

👇 Import the dataset `cars_clean.csv` that you exported in the previous exercise.

In [13]:
import pandas as pd

car_clean = pd.read_csv("../data/cars_clean.csv")
car_clean.head(20)

Unnamed: 0,aspiration,enginelocation,carwidth,curbweight,enginetype,cylindernumber,stroke,peakrpm,price,aspiration_encoded,enginelocation_encoded,dohc_encoded,dohcv_encoded,l_encoded,ohc_encoded,ohcf_encoded,ohcv_encoded,rotor_encoded
0,std,front,-0.608696,-0.014566,dohc,0.2,-2.033333,-0.285714,13495.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
1,std,front,-0.608696,-0.014566,dohc,0.2,-2.033333,-0.285714,16500.0,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,std,front,0.0,0.514882,ohcv,0.4,0.6,-0.285714,16500.0,0,0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,std,front,0.168669,-0.420797,ohc,0.2,0.366667,0.428571,13950.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,std,front,0.391304,0.516807,ohc,0.3,0.366667,0.428571,17450.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
5,std,front,0.347826,-0.093502,ohc,0.3,0.366667,0.428571,15250.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
6,std,front,2.565217,0.555313,ohc,0.3,0.366667,0.428571,17710.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
7,std,front,0.168669,0.767092,ohc,0.3,0.366667,0.428571,18920.0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
8,turbo,front,2.565217,1.021227,ohc,0.3,0.366667,0.428571,23875.0,1,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9,turbo,front,1.043478,0.957693,ohc,0.3,0.366667,0.428571,17859.167,1,0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


## Modelling

👇 Train and score a Linear Regression model for which the target is the `price` of the car.

In [4]:
car_clean.columns

Index(['aspiration', 'enginelocation', 'carwidth', 'curbweight', 'enginetype',
       'cylindernumber', 'stroke', 'peakrpm', 'price', 'aspiration_encoded',
       'enginelocation_encoded', 'dohc_encoded', 'dohcv_encoded', 'l_encoded',
       'ohc_encoded', 'ohcf_encoded', 'ohcv_encoded', 'rotor_encoded'],
      dtype='object')

In [15]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

from sklearn.model_selection import train_test_split

features = ['carwidth', 'curbweight', 'cylindernumber',
            'stroke', 'peakrpm', 'aspiration_encoded',
            'enginelocation_encoded', 'dohc_encoded', 'dohcv_encoded', 'l_encoded',
            'ohc_encoded', 'ohcf_encoded', 'ohcv_encoded', 'rotor_encoded']

target_name = 'price'

# Define X and y
X = car_clean[features]
y = car_clean[target_name]
# Split into train and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Fit the model
model.fit(X_train, y_train)
# Score the model (R² on the test set)
model.score(X_test, y_test)

0.8089074339650647
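
Because `train_test_split` is called without a `random_state`, the score above changes from run to run. The following is a minimal sketch of how cross-validation can give a steadier estimate of R². It assumes synthetic data from `make_regression` as a stand-in, not the cars dataset, and the names `X_demo`, `y_demo` and `cv_scores` are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: NOT the cars dataset)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=0
)

# 5-fold cross-validation; for a regressor the default scoring is R²,
# the same metric returned by model.score
cv_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

# Mean and spread of the five fold scores
print(cv_scores.mean(), cv_scores.std())
```

The standard deviation across folds gives a rough idea of how much a single train/test split could mislead.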

## Feature Selection

❓Which are the 5 most important features to predict the price of the cars? Which is the least useful feature? Use Feature Permutation to answer the questions.

[Sklearn's `permutation_importance` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html)
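
As a rough illustration of what permutation importance measures, here is a sketch on synthetic data where only one column drives the target. The column names (`useful`, `noise_a`, `noise_b`) and the data are hypothetical and unrelated to the cars dataset.

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): the target depends only on "useful"
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "useful": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y_demo = 3.0 * X_demo["useful"] + rng.normal(scale=0.1, size=300)

demo_model = LinearRegression().fit(X_demo, y_demo)

# Shuffle each column n_repeats times and record the mean drop in R²
demo_perm = permutation_importance(
    demo_model, X_demo, y_demo, n_repeats=10, random_state=0
)

# Label the mean importances with the column names
demo_importances = pd.Series(demo_perm.importances_mean, index=X_demo.columns)
print(demo_importances.sort_values(ascending=False))
```

Shuffling `useful` should hurt the score badly, while shuffling either noise column should barely change it.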

In [25]:
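import numpy as np
from sklearn.inspection import permutation_importance

# Mean drop in R² over 100 shuffles of each feature
permutation_score = permutation_importance(model, X_train, y_train, n_repeats=100)

# Pair each feature name with its mean importance
result = np.vstack((X.columns, permutation_score.importances_mean)).T
result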

array([['carwidth', 0.045207875587863953],
       ['curbweight', 0.5174211694385044],
       ['cylindernumber', 0.6254916745771758],
       ['stroke', 0.016755831887580275],
       ['peakrpm', 0.008530977443255212],
       ['aspiration_encoded', 0.004150549011716449],
       ['enginelocation_encoded', 0.07960064140947926],
       ['dohc_encoded', 0.00017285847217338502],
       ['dohcv_encoded', 0.015428279656573169],
       ['l_encoded', 0.0013195379902332593],
       ['ohc_encoded', 0.030362822418476165],
       ['ohcf_encoded', 1.2543434345817505e-05],
       ['ohcv_encoded', 0.07118262964556511],
       ['rotor_encoded', 0.07439859259553547]], dtype=object)

In [37]:
result_df = pd.DataFrame(result, columns = ['feature', 'score'])
result_df.sort_values(by='score', ascending = False)

| | feature | score |
|---|---|---|
| 2 | cylindernumber | 0.625492 |
| 1 | curbweight | 0.517421 |
| 6 | enginelocation_encoded | 0.0796006 |
| 13 | rotor_encoded | 0.0743986 |
| 12 | ohcv_encoded | 0.0711826 |
| 0 | carwidth | 0.0452079 |
| 10 | ohc_encoded | 0.0303628 |
| 3 | stroke | 0.0167558 |
| 8 | dohcv_encoded | 0.0154283 |
| 4 | peakrpm | 0.00853098 |
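
The sorted table stops after ten rows, so the least useful feature is not visible in it. One way to read both answers directly, sketched here with the mean importances copied from the `result` array above (the names `scores`, `top_5` and `least_useful` are illustrative):

```python
import pandas as pd

# Mean importances copied from the permutation output above
scores = pd.Series({
    "carwidth": 0.045207875587863953,
    "curbweight": 0.5174211694385044,
    "cylindernumber": 0.6254916745771758,
    "stroke": 0.016755831887580275,
    "peakrpm": 0.008530977443255212,
    "aspiration_encoded": 0.004150549011716449,
    "enginelocation_encoded": 0.07960064140947926,
    "dohc_encoded": 0.00017285847217338502,
    "dohcv_encoded": 0.015428279656573169,
    "l_encoded": 0.0013195379902332593,
    "ohc_encoded": 0.030362822418476165,
    "ohcf_encoded": 1.2543434345817505e-05,
    "ohcv_encoded": 0.07118262964556511,
    "rotor_encoded": 0.07439859259553547,
})

# Five largest importances, in descending order
top_5 = scores.nlargest(5).index.tolist()
# Feature with the smallest importance
least_useful = scores.idxmin()

print(top_5)
print(least_useful)
```

With these values, the five most important features are `cylindernumber`, `curbweight`, `enginelocation_encoded`, `rotor_encoded` and `ohcv_encoded`, and the least useful is `ohcf_encoded`. The split is random, so the exact ranking may shift between runs.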


## Refined Modelling

👇 Train a new model using only the most useful features. It is up to you to choose a trade-off between model performance and complexity.
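
One possible way to explore this trade-off is to score the model while keeping only the `k` highest-ranked features, for increasing `k`. The sketch below assumes synthetic data from `make_regression` (not the cars dataset), and all variable names are illustrative. For simplicity it ranks features on the full data, which a stricter setup would do inside each fold.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 8 features, only 3 of them informative
X_demo, y_demo = make_regression(
    n_samples=300, n_features=8, n_informative=3, noise=5.0, random_state=0
)

# Rank features by permutation importance, most important first
full_model = LinearRegression().fit(X_demo, y_demo)
perm = permutation_importance(
    full_model, X_demo, y_demo, n_repeats=10, random_state=0
)
ranking = np.argsort(perm.importances_mean)[::-1]

# Cross-validated R² when keeping only the k most important features
scores_by_k = {}
for k in range(1, X_demo.shape[1] + 1):
    kept = ranking[:k]
    fold_scores = cross_val_score(LinearRegression(), X_demo[:, kept], y_demo, cv=5)
    scores_by_k[k] = fold_scores.mean()

for k, score in scores_by_k.items():
    print(k, round(score, 3))
```

A reasonable choice is the smallest `k` after which the score stops improving noticeably.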

In [38]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()

from sklearn.model_selection import train_test_split

features = ['curbweight', 'cylindernumber',
            'enginelocation_encoded',
            'ohc_encoded', 'ohcv_encoded']

target_name = 'price'

# Define X and y
X = car_clean[features]
y = car_clean[target_name]
# Split into train and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
# Fit the model
model.fit(X_train, y_train)
# Score the model (R² on the test set)
model.score(X_test, y_test)


⚠️ Please push the exercise when completed. Thanks 🙃

🏁