# Module 7: Exercise B

In this exercise, you will practice tree-based methods for regression.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.metrics import classification_report, mean_squared_error, r2_score

## Data Preprocessing

We will apply these methods to a car fuel-economy data set.

In [2]:
mpg = pd.read_csv('car_mpg.csv')
mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


__horsepower__ is listed as object instead of float or int. Usually this is caused by the column containing non-numeric values. Let's find out why.

>__Task 1__
>
>Convert the __horsepower__ column to numeric
>
>- Find the rows where __horsepower__ holds the unusual value `?`
>- Identify non-numeric values using `pd.to_numeric()` with the `errors='coerce'` argument
>- Drop the rows whose values cannot be converted (they become missing values after the conversion)

In [None]:
...
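
The conversion steps above can be sketched on a tiny, hypothetical frame (the values below are made up and are not taken from `car_mpg.csv`). With `errors='coerce'`, `pd.to_numeric` turns anything it cannot parse into `NaN`, which makes the offending rows easy to find and drop.

```python
import pandas as pd

# Hypothetical stand-in for the real data: horsepower is stored as strings,
# and "?" marks a missing value.
df = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0, 34.5],
    "horsepower": ["130", "?", "95", "88"],
})

# Values that cannot be parsed as numbers become NaN under errors="coerce".
converted = pd.to_numeric(df["horsepower"], errors="coerce")

# Show the original rows whose horsepower could not be converted.
bad_rows = df[converted.isna()]
print(bad_rows)

# Replace the column with its numeric version and drop the unconvertible rows.
df["horsepower"] = converted
df = df.dropna(subset=["horsepower"]).reset_index(drop=True)
print(df.dtypes)
```

Inspecting `bad_rows` before dropping anything is a cheap way to confirm that `?` is the only non-numeric marker in the column.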

Now, let's check for outliers by first looking at the quantiles of each column.

In [5]:
mpg.describe()

| | mpg | cylinders | displacement | horsepower | weight | acceleration | model year | origin |
|---|---|---|---|---|---|---|---|---|
| count | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 | 392.0 |
| mean | 23.445918 | 5.471939 | 194.41199 | 104.469388 | 2977.584184 | 15.541327 | 75.979592 | 1.576531 |
| std | 7.805007 | 1.705783 | 104.644004 | 38.49116 | 849.40256 | 2.758864 | 3.683737 | 0.805518 |
| min | 9.0 | 3.0 | 68.0 | 46.0 | 1613.0 | 8.0 | 70.0 | 1.0 |
| 25% | 17.0 | 4.0 | 105.0 | 75.0 | 2225.25 | 13.775 | 73.0 | 1.0 |
| 50% | 22.75 | 4.0 | 151.0 | 93.5 | 2803.5 | 15.5 | 76.0 | 1.0 |
| 75% | 29.0 | 8.0 | 275.75 | 126.0 | 3614.75 | 17.025 | 79.0 | 2.0 |
| max | 46.6 | 8.0 | 455.0 | 230.0 | 5140.0 | 24.8 | 82.0 | 3.0 |


>__Task 2__
>
>Use the "mean ± 2 SD" rule to check for outliers in __weight__ and display the rows that contain them
>
>How would you suggest handling the outliers at this point?

In [None]:
...
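
Plugging the __weight__ statistics from the summary table into the rule gives bounds of roughly 2977.6 - 2 × 849.4 ≈ 1278.8 and 2977.6 + 2 × 849.4 ≈ 4676.4, so the maximum of 5140 would be flagged while the minimum of 1613 would not. A minimal sketch of the same check, run on hypothetical weights rather than the real file:

```python
import pandas as pd

# Hypothetical weights; the last value is deliberately extreme.
df = pd.DataFrame(
    {"weight": [2100, 2300, 2500, 2600, 2800, 2900, 3000, 3100, 3300, 9000]}
)

# Bounds of the mean +/- 2 standard deviations rule
# (pandas .std() is the sample standard deviation, ddof=1).
mean, sd = df["weight"].mean(), df["weight"].std()
lower, upper = mean - 2 * sd, mean + 2 * sd

# Rows that fall outside the bounds on either side.
outliers = df[(df["weight"] < lower) | (df["weight"] > upper)]
print(f"bounds: [{lower:.1f}, {upper:.1f}]")
print(outliers)
```

Note that a single extreme value inflates both the mean and the standard deviation, so the rule is only a first screening step.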

>__Task 3__
>
>Plot pairwise relationships of all features

In [None]:
...

### Train/Test Split

>__Task 4__
>
>- Assign __mpg__ to `y` and the remaining columns (except __car name__) to `X`
>- Split with an 80 (train) : 20 (test) ratio and set `random_state=156`
>- Make sure the split returns `X_train`, `X_test`, `y_train`, `y_test`

In [None]:
...
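
One way to lay out the split, shown on a hypothetical 10-row frame that reuses a few of the data set's column names (the values are random, not real). `train_test_split` returns the four pieces in the order `X_train, X_test, y_train, y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 10-row frame with a subset of the data set's column names.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "mpg": rng.uniform(10, 45, 10),
    "weight": rng.integers(1600, 5200, 10),
    "horsepower": rng.uniform(46, 230, 10),
    "car name": [f"car {i}" for i in range(10)],
})

# Target is mpg; features are everything else except the free-text car name.
y = df["mpg"]
X = df.drop(columns=["mpg", "car name"])

# 80:20 split with a fixed seed so the result is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=156
)
print(X_train.shape, X_test.shape)
```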

## Decision Tree Regressor

>__Task 5__
>
>- Instantiate a decision-tree regressor with a tree depth of 2 and `random_state=156`, then fit it on the train set
>- Print a text report showing the rules of the tree
>- Plot the tree

In [None]:
...
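
A minimal sketch of a depth-2 regression tree and its text report. The data here are synthetic and hypothetical (mpg is generated to fall with weight), so the printed thresholds will not match the ones you get from `car_mpg.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Hypothetical data: mpg falls roughly linearly with weight; horsepower is noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.uniform(1600, 5200, 200),
    "horsepower": rng.uniform(46, 230, 200),
})
y = 50 - 0.008 * X["weight"] + rng.normal(0, 1, 200)

# Shallow tree: at most two levels of splits, hence at most four leaves.
tree = DecisionTreeRegressor(max_depth=2, random_state=156)
tree.fit(X, y)

# Text report of the learned split rules.
rules = export_text(tree, feature_names=list(X.columns))
print(rules)
```

For the plot, `sklearn.tree.plot_tree(tree, feature_names=list(X.columns), filled=True)` draws the same structure on a matplotlib axis.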

### Feature Importance

>__Task 6__
>
>- Map feature names to their importance scores
>- Print and plot feature importance

In [None]:
...
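
Feature names and importance scores can be paired in a pandas `Series`, which also makes sorting and plotting straightforward. The sketch below assumes synthetic, hypothetical data in which only weight drives the target.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: only weight carries signal; horsepower is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.uniform(1600, 5200, 200),
    "horsepower": rng.uniform(46, 230, 200),
})
y = 50 - 0.008 * X["weight"] + rng.normal(0, 1, 200)

tree = DecisionTreeRegressor(max_depth=2, random_state=156).fit(X, y)

# Map each feature name to its importance score, largest first.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The scores are normalized to sum to 1, and `importances.plot.barh()` gives a quick bar chart of them.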

### Performance Evaluation

>__Task 7__
>
>- Predict __mpg__ for the first 10 rows of the test set
>- Calculate the mean squared error (MSE) and R-squared of the model on the test set

In [None]:
...
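
A short sketch of the evaluation step, again on synthetic, hypothetical data. It predicts the first 10 test rows, then scores the full test set with `mean_squared_error` and `r2_score`, the two functions imported at the top of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data standing in for the mpg data set.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.uniform(1600, 5200, 200),
    "horsepower": rng.uniform(46, 230, 200),
})
y = 50 - 0.008 * X["weight"] + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=156
)
tree = DecisionTreeRegressor(max_depth=2, random_state=156).fit(X_train, y_train)

# Predictions for the first 10 test rows.
first_ten = tree.predict(X_test.iloc[:10])
print(first_ten)

# Metrics on the whole test set.
y_pred = tree.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}, R-squared: {r2:.3f}")
```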

### The Best Performing Depth

>__Task 8__
>
>Find the depth that minimizes the test MSE
>
>- Fill in a for loop that iterates over the `k` argument
>- Instantiate the model with `max_depth=k`
>- Fit the model on the train set
>- Predict on both the train and test sets
>- Calculate the MSE by comparing the predictions with `y_train` and `y_test` respectively
>- Plot the results
>- Print the best performing depth and its MSE

In [None]:
...
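
The loop described above might look like the sketch below. The depth range 1 to 10 is an assumption (the task does not fix it), and the data are synthetic and hypothetical. Collecting the results in a `DataFrame` indexed by depth makes both the plot and the lookup of the best depth one-liners.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data standing in for the mpg data set.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.uniform(1600, 5200, 200),
    "horsepower": rng.uniform(46, 230, 200),
})
y = 50 - 0.008 * X["weight"] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=156
)

rows = []
for k in range(1, 11):  # assumed depth range; adjust as needed
    model = DecisionTreeRegressor(max_depth=k, random_state=156)
    model.fit(X_train, y_train)
    rows.append({
        "depth": k,
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
    })

results = pd.DataFrame(rows).set_index("depth")

# The best depth is the one with the lowest error on the held-out test set.
best_depth = results["test_mse"].idxmin()
print(best_depth, results.loc[best_depth, "test_mse"])
```

`results.plot()` then shows the train and test curves together; the train MSE typically keeps falling with depth while the test MSE levels off or rises, which is the overfitting signal to look for.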

---

## Ensemble Methods

### Bagging

>__Task 9__
>
>Build a bagging model for the regression task
>
>- Set 100 base estimators and `random_state=156`
>- Calculate the MSE and R-squared on the test set

In [None]:
...
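
A minimal bagging sketch on synthetic, hypothetical data. It relies on the default base estimator of `BaggingRegressor`, which is a decision tree, so no base estimator is passed explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the mpg data set.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.uniform(1600, 5200, 200),
    "horsepower": rng.uniform(46, 230, 200),
})
y = 50 - 0.008 * X["weight"] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=156
)

# 100 trees, each fit on a bootstrap sample of the train set.
bag = BaggingRegressor(n_estimators=100, random_state=156)
bag.fit(X_train, y_train)

y_pred = bag.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}, R-squared: {r2:.3f}")
```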

>__Task 10__
>
>Fit different bagging models by changing `n_estimators` parameter in a loop
>
>- Set a range between 50 and 210 with a step of 10
>- Instantiate a bagging model with `n_estimators=n` and `random_state=156`
>- Fit the model on the train set
>- Predict on both the train and test sets
>- Add train MSE to `mse['train_mse']` and test MSE to `mse['test_mse']`
>- Print the number of estimators and its MSE

In [None]:
...
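
The loop over `n_estimators` can be sketched as follows, with the errors accumulated in a dictionary named `mse` as the task describes. The data are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the mpg data set.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.uniform(1600, 5200, 200),
    "horsepower": rng.uniform(46, 230, 200),
})
y = 50 - 0.008 * X["weight"] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=156
)

n_values = range(50, 210, 10)  # 50, 60, ..., 200
mse = {"train_mse": [], "test_mse": []}

for n in n_values:
    model = BaggingRegressor(n_estimators=n, random_state=156)
    model.fit(X_train, y_train)
    mse["train_mse"].append(mean_squared_error(y_train, model.predict(X_train)))
    mse["test_mse"].append(mean_squared_error(y_test, model.predict(X_test)))
    print(n, mse["test_mse"][-1])
```

Because `range` excludes its stop value, a stop of 210 is what makes 200 the largest ensemble size tried.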

### Random Forest

>__Task 11__
>
>Build a random forest model for the regression task
>
>- Set 100 base estimators, 30 features per split, and `random_state=156`
>- Calculate the MSE and R-squared on the test set

In [None]:
...

>__Task 12__
>
>Check feature importance of the random forest model and plot the results

In [None]:
...

>__Task 13__
>
>Tune the hyperparameters of the random forest model
>
>- Define the regressor with `random_state=156`
>- Define the parameter grid with:
>     - `max_depth` range `(5,30,5)` 
>     - `n_estimators` range `(50,210,50)`
>- Define the grid search with the parameter grid and set:
>     - `neg_mean_absolute_error` as the evaluation score
>     - `n_jobs=-1`
>     - 5-fold cross-validation
>     - `verbose=1`
>     - `return_train_score=True`
>- Fit the grid search to train set
>- Print the best resulting parameters

In [None]:
...
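
A sketch of how the grid search could be wired up. `build_search` is a hypothetical helper, not part of scikit-learn, and the data are synthetic. `task_grid` mirrors the ranges listed in the task; the demo call deliberately uses a much smaller grid and `n_jobs=1` so that it finishes quickly in any environment, which is an assumption made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def build_search(param_grid, n_jobs=-1):
    """Hypothetical helper: grid search over a random forest regressor."""
    rf = RandomForestRegressor(random_state=156)
    return GridSearchCV(
        rf,
        param_grid,
        scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
        n_jobs=n_jobs,
        cv=5,
        verbose=1,
        return_train_score=True,
    )

# The grid as described in the task.
task_grid = {
    "max_depth": list(range(5, 30, 5)),        # 5, 10, 15, 20, 25
    "n_estimators": list(range(50, 210, 50)),  # 50, 100, 150, 200
}

# Hypothetical data standing in for the train set.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "weight": rng.uniform(1600, 5200, 120),
    "horsepower": rng.uniform(46, 230, 120),
})
y = 50 - 0.008 * X["weight"] + rng.normal(0, 1, 120)

# Reduced grid for a quick demo; pass task_grid (and n_jobs=-1) for the full search.
demo_grid = {"max_depth": [5, 10], "n_estimators": [50]}
search = build_search(demo_grid, n_jobs=1).fit(X, y)
print(search.best_params_)
```

Since the score is the negated mean absolute error, `search.best_score_` is at most zero, and the refitted winner is available as `search.best_estimator_`.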

>__Task 14__
>
>Predict on test set using `.best_estimator_` and print MSE and R-squared of the tuned model

In [None]:
...

### Gradient Boosting

>__Task 15__
>
>Build a gradient boosting model for the regression task
>
>- Set a learning rate of 0.01, 30 base estimators, 5 features per split, a maximum depth of 5, and `random_state=156`
>- Calculate the MSE and R-squared on the test set

In [None]:
...
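
A minimal gradient boosting sketch with the settings from the task. The frame below is synthetic and hypothetical: it borrows the seven predictor names from the data set but fills them with standardized random values, which keeps `max_features=5` valid.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical data: seven predictors named as in the data set, random values.
cols = ["cylinders", "displacement", "horsepower", "weight",
        "acceleration", "model year", "origin"]
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 7)), columns=cols)
y = 23 - 5 * X["weight"] - 2 * X["horsepower"] + rng.normal(0, 1, 200)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=156
)

gbr = GradientBoostingRegressor(
    learning_rate=0.01,
    n_estimators=30,
    max_features=5,
    max_depth=5,
    random_state=156,
)
gbr.fit(X_train, y_train)

y_pred = gbr.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.3f}, R-squared: {r2:.3f}")
```

With a learning rate of 0.01 and only 30 stages, each stage can shrink the training residual by at most 1%, so under the default squared-error loss the training R-squared cannot exceed about 1 - 0.99^60 ≈ 0.45. A modest score here reflects the settings rather than a mistake, which is what motivates tuning the learning rate.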

>__Task 16__
>
>Find the best performing learning rate
>
>- Define the regressor with `random_state=156`
>- Define the parameter grid with:
>     - `max_depth` range `(5,30,5)` 
>     - `n_estimators` range `(50,210,50)`
>     - `learning_rate` range `(0.01,0.31,0.1)`
>- Define the grid search with the parameter grid and set:
>     - `neg_mean_absolute_error` as the evaluation score
>     - `n_jobs=-1`
>     - 5-fold cross-validation
>     - `verbose=1`
>     - `return_train_score=True`
>- Fit the grid search to train set
>- Print the best resulting parameters

In [None]:
...

>__Task 17__
>
>Predict on test set using `.best_estimator_` and print MSE and R-squared of the tuned model

In [None]:
...