A.S. Lundervold, v. 290822

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setup" data-toc-modified-id="Setup-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setup</a></span></li><li><span><a href="#Understand-the-problem-and-look-at-the-big-picture" data-toc-modified-id="Understand-the-problem-and-look-at-the-big-picture-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Understand the problem and look at the big picture</a></span><ul class="toc-item"><li><span><a href="#Frame-the-problem" data-toc-modified-id="Frame-the-problem-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Frame the problem</a></span></li><li><span><a href="#Select-performance-measures" data-toc-modified-id="Select-performance-measures-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Select performance measures</a></span></li></ul></li><li><span><a href="#Get-the-data" data-toc-modified-id="Get-the-data-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Get the data</a></span></li><li><span><a href="#Explore-the-data" data-toc-modified-id="Explore-the-data-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Explore the data</a></span><ul class="toc-item"><li><span><a href="#Feature-distributions" data-toc-modified-id="Feature-distributions-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Feature distributions</a></span></li><li><span><a href="#Converting-the-features'-data-types" data-toc-modified-id="Converting-the-features'-data-types-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Converting the features' data types</a></span></li><li><span><a href="#Feature-encoding" data-toc-modified-id="Feature-encoding-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Feature encoding</a></span></li><li><span><a href="#Setting-up-our-$f:-X-\to-y$" data-toc-modified-id="Setting-up-our-$f:-X-\to-y$-5.4"><span 
class="toc-item-num">5.4&nbsp;&nbsp;</span>Setting up our $f: X \to y$</a></span></li></ul></li><li><span><a href="#Create-training-and-test-sets" data-toc-modified-id="Create-training-and-test-sets-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Create training and test sets</a></span></li><li><span><a href="#Data-preprocessing:-Data-cleaning,-feature-scaling-and-imputing-missing-data" data-toc-modified-id="Data-preprocessing:-Data-cleaning,-feature-scaling-and-imputing-missing-data-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Data preprocessing: Data cleaning, feature scaling and imputing missing data</a></span></li><li><span><a href="#Training-a-regression-model" data-toc-modified-id="Training-a-regression-model-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Training a regression model</a></span></li><li><span><a href="#Evaluating-models-/-performance-measures" data-toc-modified-id="Evaluating-models-/-performance-measures-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Evaluating models / performance measures</a></span><ul class="toc-item"><li><span><a href="#Mean-absolute-error" data-toc-modified-id="Mean-absolute-error-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Mean absolute error</a></span></li><li><span><a href="#Mean-squared-error-and-root-mean-squared-error" data-toc-modified-id="Mean-squared-error-and-root-mean-squared-error-9.2"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>Mean squared error and root mean squared error</a></span></li><li><span><a href="#In-scikit-learn" data-toc-modified-id="In-scikit-learn-9.3"><span class="toc-item-num">9.3&nbsp;&nbsp;</span>In scikit learn</a></span></li></ul></li></ul></div>

# Introduction

This notebook reviews some core concepts related to **regression** in machine learning based on concrete examples. 

We'll use two data sets for this, of increasing complexity: _vehicles_ and _housing prices_. This notebook goes through the _vehicles_ example. You'll study the housing data in Assignment #1.

<img src="https://github.com/alu042/DAT158-2022/raw/main/notebooks/assets/cars.jpg">

We'll also look at strategies and techniques for evaluating models beyond those explored in previous notebooks. 

# Setup

In [None]:
# This is a quick check of whether the notebook is currently running on Google Colaboratory
# or on Kaggle, as that makes some difference for the code below.
# We'll do this in every notebook of the course.
try:
    import google.colab  # this package is only importable on Colab
    colab = True
except ImportError:
    colab = False

import os
kaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

In [None]:
# To display plots directly in the notebook:
%matplotlib inline

We import our standard framework:

In [None]:
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sklearn

In [None]:
# Set the directory in which to store data
NB_DIR = Path.cwd()     
DATA = NB_DIR/'data'/'vehicles'     

DATA.mkdir(parents=True, exist_ok=True)

# Understand the problem and look at the big picture

## Frame the problem

Our task will be to predict the price of a car from various descriptive features. One can imagine using this to determine whether an offered price is fair or, if you're a car dealer, to decide your sale price. Such a predictive model can also be used to see if there are interesting general trends linking the cost of the car to its features. 

## Select performance measures

If we imagine that our model will be used as part of a more comprehensive pricing system, the broader picture may influence the performance measures we'd like to use.

In this case, we keep things simple: we just want the predicted price to, on average, correspond to the actual sale price. 

Two widely used performance measures for regression problems are the ***Root Mean Squared Error*** (RMSE) and the ***Mean Absolute Error*** (MAE). 

We'll talk more about these later in the notebook. 

# Get the data

We'll use the data provided by Nehal Birla here: https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho. Store it in the `DATA` directory to continue. 

In [None]:
import shutil

In [None]:
if (colab or kaggle):
    !wget https://github.com/alu042/DAT158-2022/raw/main/notebooks/data/vehicles/archive.zip
    shutil.unpack_archive('archive.zip', extract_dir=DATA)
else:
    shutil.unpack_archive(DATA/'archive.zip', extract_dir=DATA)

There are three different data sets. Let's have a quick look at them to decide which one to use:

In [None]:
list(DATA.iterdir())

In [None]:
# Read the three CSV files from the DATA directory defined in the setup
car_data = pd.read_csv(DATA/'car data.csv')
car_details = pd.read_csv(DATA/'CAR DETAILS FROM CAR DEKHO.csv')
car_details_v3 = pd.read_csv(DATA/'Car details v3.csv')

In [None]:
car_data.head()

In [None]:
car_data.info()

In [None]:
car_details.head()

In [None]:
car_details.info()

In [None]:
car_details_v3.head()

In [None]:
car_details_v3.info()

Let's use the last one as it has the most features and instances. Note that there are some missing values in `mileage`, `engine`, `max_power`, `torque` and `seats` that we'll have to deal with.

In [None]:
df = car_details_v3.copy()

# Explore the data

In [None]:
df.head()

## Feature distributions

Here's a plot of the price distribution in our data:

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(df.selling_price, kde=True)
plt.show()

We see that there are some very expensive cars in the data set, but only a few.

How about the distribution of model years?

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(df.year, kde=True)
plt.show()

We note that the cars are quite new.

Is there a relationship between the model year and the price? Let's make a new categorical feature to investigate. Based on the above histogram, we say that cars from 2010 or earlier are "old", those from 2011 to 2015 are "medium", and those from 2016 to 2020 are "new".

In [None]:
df["age_cat"] = pd.cut(df["year"], bins=[1982, 2010, 2015, 2020],
                               labels=['old', 'medium', 'new'])



In [None]:
df.head()

In [None]:
df.age_cat.value_counts()
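
To make the bin edges concrete, here is a minimal sketch on a handful of made-up years (the `years` series is an illustrative assumption, not part of the data set). By default, `pd.cut` uses intervals that are open on the left and closed on the right, so 2010 lands in "old" and 2015 in "medium":

```python
import pandas as pd

# Made-up model years chosen to sit on and around the bin edges
years = pd.Series([2005, 2010, 2011, 2015, 2016, 2020])

# Same bins and labels as above: (1982, 2010], (2010, 2015], (2015, 2020]
age_cat = pd.cut(years, bins=[1982, 2010, 2015, 2020],
                 labels=['old', 'medium', 'new'])

print(list(age_cat))
```

A year equal to the leftmost edge (1982 here) would fall outside all bins and become `NaN` unless `include_lowest=True` is passed.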

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(data=df, x='selling_price', hue='age_cat')
plt.show()

We observe a tendency for newer cars to be more expensive than older ones. 

What about transmission type?

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(df.transmission)
plt.show()

Most are manual transmission. Is there a relationship between the price and the type of transmission?

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(data=df, x='selling_price', hue='transmission')
plt.show()

Seems like the automatic transmission cars are pricier. We can see this also by computing their mean prices:

In [None]:
# We find all the rows corresponding to automatic transmission, 
# extract their selling prices, and compute the mean
df.loc[df.transmission=='Automatic'].selling_price.mean()

In [None]:
df.loc[df.transmission=='Manual'].selling_price.mean()
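
The two means can also be computed in one go with `groupby`. The sketch below uses a tiny made-up frame (`toy` and its prices are assumptions for illustration) so that it runs on its own; on the real data, one would call the same method on `df`:

```python
import pandas as pd

# Made-up prices, two cars per transmission type
toy = pd.DataFrame({
    'transmission': ['Manual', 'Manual', 'Automatic', 'Automatic'],
    'selling_price': [300_000, 500_000, 900_000, 1_100_000],
})

# Mean selling price for each transmission type
mean_price = toy.groupby('transmission')['selling_price'].mean()
print(mean_price)
```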

How about the fuel and the price?

In [None]:
df.fuel.value_counts()

In [None]:
plt.figure(figsize=(14,8))
sns.histplot(data=df, x='selling_price', hue='fuel')
plt.show()

## Extra: further exploration using `pandas-profiling`

As data exploration is such a fundamental component of machine learning, many tools have been created to support the process. 

`pandas-profiling` is a convenient library to quickly get some insights into a data set:

In [None]:
if (colab or kaggle):
    !pip install --upgrade visions
    !pip install pandas-profiling

In [None]:
from pandas_profiling import ProfileReport

In [None]:
df.head()

In [None]:
ProfileReport(df, minimal=True)

## Converting the features' data types

There are many other features we could investigate in a similar way. But we're quickly faced with the problem that some of them, like the mileage, are numbers, but not stored as such:

In [None]:
df.head()

In [None]:
df.info()

We should convert some of the features stored as strings (`object`) to integers and floats. Specifically, the mileage, the engine size and the max power.

In [None]:
df.mileage.value_counts()

In [None]:
df.max_power.value_counts()

First we remove the units:

In [None]:
df['mileage'] = df.mileage.str.replace(' kmpl', '')
df['mileage'] = df.mileage.str.replace(' km/kg', '')

Then we convert to floats:

In [None]:
df['mileage'] = df['mileage'].astype(float)

Let's do similarly for the others:

In [None]:
df.engine = df.engine.str.replace(' CC', '').astype(float)

In [None]:
df.max_power = df.max_power.str.replace(' bhp', '')
df.max_power = df.max_power.replace('', np.nan)        # Empty strings replaced by NaNs
df.max_power = df.max_power.astype(float)
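
An alternative to removing each unit string by hand is to extract the leading number with a regular expression and let `pd.to_numeric` turn anything unparsable into `NaN`. A minimal sketch on made-up values (the `raw` series is an assumption; the real columns may contain formats not covered here):

```python
import pandas as pd

# Made-up raw strings, including a value with no number and a missing value
raw = pd.Series(['23.4 kmpl', '17.3 km/kg', ' bhp', None])

# Grab the first run of digits (with an optional decimal part)
number = raw.str.extract(r'(\d+(?:\.\d+)?)', expand=False)

# Convert to floats; entries without a number become NaN
values = pd.to_numeric(number, errors='coerce')
print(values)
```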

In [None]:
df.head()

Rather than dealing with the heterogeneity of the torque feature, we'll simply drop it (feel free to do otherwise on your own!)

In [None]:
df.drop('torque', axis=1, inplace=True)

We'll also drop the name of the car. This is to simplify things. A better idea would be to use it to extract information about the make and model of the car. 

In [None]:
df.drop('name', axis=1, inplace=True)

This is now our data set:

In [None]:
df.head()

## Feature encoding

For our machine learning models to work, we must represent the categorical features `fuel`, `transmission`, `seller_type`, and `owner` as numbers. 

How best to do such feature encoding is a relatively large topic. The short version is that we use a *one hot encoding* when the feature values are not related to each other in some *ordinal* way (i.e., there's no reason to treat one as "larger" than another), and an ordinal encoding otherwise.

In our case, `fuel`, `transmission`, and `seller_type` are not ordinal features, while `owner` is (as it reflects the number of previous owners).

We can use Pandas to do the one hot encoding:

In [None]:
one_hot = pd.get_dummies(df['fuel'])
df = df.join(one_hot)

In [None]:
one_hot = pd.get_dummies(df['transmission'])
df = df.join(one_hot)

In [None]:
one_hot = pd.get_dummies(df['seller_type'])
df = df.join(one_hot)

We get the following data frame:

In [None]:
df.head()

Now that we've stored the fuel, transmission, and seller type information in one hot encoded vectors, we can drop the original features:

In [None]:
df.drop(['fuel', 'transmission', 'seller_type'], axis=1, inplace=True)

In [None]:
df.head()
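
The join-then-drop steps above can also be done in a single call by passing the `columns` argument to `pd.get_dummies`, which replaces the listed columns with prefixed indicator columns. A minimal sketch on a made-up frame (the `toy` data is an assumption for illustration):

```python
import pandas as pd

# Made-up rows with two categorical features
toy = pd.DataFrame({
    'selling_price': [450_000, 370_000, 158_000],
    'fuel': ['Diesel', 'Petrol', 'Diesel'],
    'transmission': ['Manual', 'Manual', 'Automatic'],
})

# One hot encode both columns; the originals are dropped automatically
encoded = pd.get_dummies(toy, columns=['fuel', 'transmission'])
print(encoded.columns.tolist())
```

The prefixes keep the column names unambiguous when several categorical features are encoded.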

For `owner` we'd like to keep the ordinal relationship:

In [None]:
df.owner.value_counts()

In [None]:
# Map each owner category to an integer, touching only the `owner` column
owner_order = {'Test Drive Car': 0, 'First Owner': 1, 'Second Owner': 2,
               'Third Owner': 3, 'Fourth & Above Owner': 4}
df['owner'] = df['owner'].map(owner_order)

In [None]:
df.head()
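
scikit-learn's `OrdinalEncoder` offers another way to encode an ordered category, with the order stated explicitly. This is a sketch of an alternative, not what the notebook does above; the `toy` frame is made up for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The intended order, from fewest to most previous owners
order = ['Test Drive Car', 'First Owner', 'Second Owner',
         'Third Owner', 'Fourth & Above Owner']

# Made-up rows for illustration
toy = pd.DataFrame({'owner': ['Second Owner', 'First Owner',
                              'Fourth & Above Owner', 'Test Drive Car']})

# Each category is replaced by its position in `order`
enc = OrdinalEncoder(categories=[order])
codes = enc.fit_transform(toy[['owner']]).ravel()
print(codes)
```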

## Setting up our $f: X \to y$

We'll store the features in `X` and the labels in `y`. Our goal is to approximate the function mapping `X` to `y`, where `y` is the `selling_price`:

<img src="https://github.com/alu042/DAT158-2022/raw/main/notebooks/assets/f_xy.png">

In [None]:
X = df.drop('selling_price', axis=1)
y = df['selling_price']

In [None]:
X.head()

In [None]:
y.head()

# Create training and test sets

> To stress a point repeated multiple times already: We're not interested in how well our models perform on the training set; what we're after is how well they generalize to unseen data. 

The test set is meant to simulate unseen data (and should, therefore, not be touched when constructing and tuning our models). 

<img width=50% src="https://github.com/alu042/DAT158-2022/raw/main/notebooks/assets/testsplit.png"> 

It is vital to ensure that the test set is a representative sample of the data. In our case, we want to ensure that it contains cars of all prices.

We should base our decision on how to split the data on the explorations we've done above and on what the model is supposed to be used for (as that influences the kind of generalization estimate we want). For example, perhaps it is important to use the car's age as part of the decision.  Or the number of seats it has (maybe we find it important that the test set contains at least some two-seaters). And so on. 

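In our case, we'll keep things simple and perform a _stratified split_ on `age_cat`, our new categorical feature representing the cars' age. This gives the training and test sets similar age distributions and, since newer cars tend to be more expensive, helps the test set cover a range of prices.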

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=X.age_cat, random_state=42)

We now have 6096 instances for training and 2032 for testing:

In [None]:
len(X_train), len(X_test)

Their car age distributions are similar:

In [None]:
plt.hist(X_train.age_cat, alpha=0.5, label='train')
plt.hist(X_test.age_cat, alpha=0.5, label='test')
plt.legend(loc='upper right')
plt.show()
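
To see what `stratify` does, here is a small self-contained sketch with made-up labels (80 of one class, 20 of another; all names are assumptions for illustration). With the default 25% test size, the stratified test set keeps the 80/20 ratio:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 made-up instances: 80 labelled 'common', 20 labelled 'rare'
X_toy = np.arange(100).reshape(-1, 1)
labels = np.array(['common'] * 80 + ['rare'] * 20)

# Stratify on the labels so both splits keep the class proportions
X_tr, X_te, lab_tr, lab_te = train_test_split(
    X_toy, labels, stratify=labels, random_state=42)

n_rare_test = int((lab_te == 'rare').sum())
print(len(lab_te), n_rare_test)
```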

After making the split, we can drop the `age_cat` feature:

In [None]:
X_train = X_train.drop('age_cat', axis=1)
X_test = X_test.drop('age_cat', axis=1)

# Data preprocessing: Data cleaning, feature scaling and imputing missing data

Before we can use the data to train machine learning models, we need to make sure it is "clean", the features are scaled, and consider how to deal with missing data.

We know from earlier that we can scale the features using, for example, scikit-learn's `StandardScaler`. 

In [None]:
from sklearn.preprocessing import StandardScaler

A general strategy for dealing with missing data is to **impute**. In other words, insert data where there's none. This can be done in many ways, and a part of a comprehensive model selection design would, in practice, be dedicated to figuring out good imputing strategies. Sometimes, simply putting in the mean or median value calculated from all the instances having values for a given feature is an OK strategy. Other times, one should try to be a bit more clever and use characteristics of the instance to decide what to put in for a missing value. For example, try to find the most similar instances in terms of the other features, then put in the mean or median value computed only from those. Or perhaps train machine learning models to perform the imputation.
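
The "most similar instances" idea is available in scikit-learn as `KNNImputer`. A minimal sketch on made-up numbers (`X_toy` is an assumption for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data; the second row is missing its second feature
X_toy = np.array([[1.0, 2.0],
                  [1.1, np.nan],
                  [1.2, 2.2],
                  [8.0, 9.0],
                  [8.2, 9.1]])

# Fill the gap with the mean over the 2 nearest rows,
# where nearness is measured on the features that are present
knn_imp = KNNImputer(n_neighbors=2)
X_filled = knn_imp.fit_transform(X_toy)
print(X_filled[1])
```

Here the two nearest rows are the first and third, so the gap is filled with the mean of their second feature rather than the mean over all rows.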

In our case, we go for a simple strategy of imputing using the mean value:

In [None]:
from sklearn.impute import SimpleImputer

In [None]:
imp = SimpleImputer(strategy='mean')

In [None]:
X_train = imp.fit_transform(X_train)
X_test = imp.transform(X_test)

In [None]:
std = StandardScaler()
X_train_std = std.fit_transform(X_train)
X_test_std = std.transform(X_test)
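
The impute-then-scale steps can also be chained in a scikit-learn pipeline, which makes it harder to accidentally fit a preprocessing step on the test set. A minimal sketch on made-up arrays (the `_toy` and `_prep` names are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data with missing values
X_train_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan]])
X_test_toy = np.array([[np.nan, 15.0]])

# Mean imputation followed by standardization
prep = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())

X_train_prep = prep.fit_transform(X_train_toy)  # statistics are learned here
X_test_prep = prep.transform(X_test_toy)        # and reused here, not re-fitted
print(X_test_prep)
```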

> **Your turn!** Explore other scaling and imputation strategies available in scikit-learn. For imputation, try these as starting points: <br><br>
https://machinelearningmastery.com/statistical-imputation-for-missing-values-in-machine-learning/<br> https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779. 

# Training a regression model

As for classification, we have a lot of choices when building our model. For now, we'll use one of the standard built-in models in scikit-learn. 

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf_reg = RandomForestRegressor(random_state=42)

In [None]:
rf_reg.fit(X_train, y_train)

The model is now trained on the training data, and we can use it to make predictions for the test data:

In [None]:
y_pred = rf_reg.predict(X_test)

Here are some of the 2032 predictions from the Random Forest:

In [None]:
len(y_pred), y_pred[:10]

Here are some of the correct answers:

In [None]:
len(y_test), np.array(y_test)[:10]

Let's put them next to each other and print out the first few:

In [None]:
list(zip(y_test, y_pred))[:10] # "Zip" the two above arrays and display the first 10

We observe that the model is sometimes close to correct and at other times way off.

We can also make a scatter plot to compare the predicted prices against the actual prices:

In [None]:
plt.figure(figsize=(10,10))
sns.regplot(x=y_test, y=y_pred)
plt.show()

We see that at least the model isn't terrible.

> **But how good is it, really? Can we quantify its performance?** 

As we did for classification earlier, we need metrics that we can use to evaluate our models. Again, as before, we can use these to compare different models and choices of model parameters.

# Evaluating models / performance measures

First of all, as mentioned earlier, one should really ask, "*What is the end goal for my system?*" We're supposed to create systems that are useful in some context as part of a larger system, which typically has a higher-level goal that our system should aim to optimize. Perhaps it's worth sacrificing some predictive performance for speed, or for avoiding price suggestions that don't lead to sales?

However, we won't consider these broader context matters in these toy problems.

A basic idea for performance measures in regression is to compute the distances between the predicted and actual values for all instances.

In [None]:
plt.figure(figsize=(10,10))
sns.regplot(x=y_test, y=y_pred)
plt.show()

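That is, the vertical distances between the points in the above figure and the straight line $y=x$, on which every point would lie if the predictions were perfect. (The line drawn by `regplot` is a fitted regression line, which need not coincide with $y=x$.) In other words, the **errors** in the predictions.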

These distances are computed by taking the absolute value of the difference between the predicted and the true values, $|\hat{y}_i - y_i|$, where $\hat{y}_i$ is the predicted value for instance number $i$ and $y_i$ is the true value. Technically, this is the [**Euclidean distance**](https://en.wikipedia.org/wiki/Euclidean_distance) in one dimension.

If the predictions are perfect, all these distances will be zero. The larger the distance, the more severe the error. 

If we sum up all the distances and divide the result by the number of instances, we'll get the _average_ or _mean_ error. To avoid having the distances of the points above the line cancel out the distances of points below the line, we take the absolute value of each distance (to get a positive number). This gives us the following sum (where $n$ is the total number of instances):

$$(|\hat{y}_1 - y_1| + |\hat{y}_2 - y_2| + |\hat{y}_3 - y_3| + \dots + |\hat{y}_n - y_n|)\, / \, n$$

...which can be written as

$$\frac 1n \sum_{i=1}^n |\hat{y}_i - y_i| = MAE$$

## Mean absolute error

This is the so-called **mean absolute error**. 

Note again that if the predictions are perfect, this sum is zero.

Let's translate the above formula into code:

In [None]:
def mae(v1, v2):
    """
    Computes the mean absolute error between the two 
    vectors v1 and v2 (that are of equal length)
    """
    
    distance = 0
    # Add the absolute difference of all the elements 
    # in the vectors together
    
    for i in range(len(v1)):
        distance = distance + np.abs(v1[i]-v2[i])
        
    mean_distance = distance/len(v1)
    
    return mean_distance

In [None]:
v1 = [1, 2, 0.5]
v2 = [0.9, 2, 0.3]

In [None]:
mae(v1, v2)

Note that when computing with vectors one should **beware of for loops!** It is in general a bad idea to force the computer to do one thing at a time when it's very capable of doing computations in parallel.

Here's a _much_ faster version of the above function:

In [None]:
def mae(v1, v2):
    """
    Computes the mean absolute error between the two 
    vectors v1 and v2 (that are of equal length)
    """
    v1, v2 = np.array(v1), np.array(v2)
    
    distance = np.sum(np.abs(v1-v2))
    mean_distance = distance/len(v1)
    
    return mean_distance

In [None]:
mae(v1, v2)
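
To get a feel for the difference, the sketch below times a loop-based and a vectorized version on random data. The function names `mae_loop` and `mae_vec` are introduced here for illustration only, and the timings will vary from machine to machine:

```python
import timeit
import numpy as np

def mae_loop(v1, v2):
    """Mean absolute error, one element at a time."""
    total = 0.0
    for i in range(len(v1)):
        total += abs(v1[i] - v2[i])
    return total / len(v1)

def mae_vec(v1, v2):
    """Mean absolute error using whole-array operations."""
    return np.mean(np.abs(v1 - v2))

# Random test vectors (made up for the timing experiment)
rng = np.random.default_rng(0)
a, b = rng.random(100_000), rng.random(100_000)

# Time three calls of each version
t_loop = timeit.timeit(lambda: mae_loop(a, b), number=3)
t_vec = timeit.timeit(lambda: mae_vec(a, b), number=3)
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```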

We can use this to measure the performance of the above model:

In [None]:
mae(y_test, y_pred)

This is the mean absolute error achieved by our model. 

## Mean squared error and root mean squared error

Two similar measures that are widely used are **mean squared error**, which is the average of the _squared_ distances, and the **root mean squared error**, which is the square root of mean squared error. 

By squaring the distances between the points, instances where this distance is large become much more impactful on the performance measures (as the square of a large number is even larger). In other words, _outliers_ have more impact on one's performance measures, which is a good thing if outliers are important in the setting you find yourself in. 

Here they are as formulas:



$$MSE = \frac 1n \sum_{i=1}^n |\hat{y}_i - y_i|^2$$
$$RMSE = \sqrt{\frac 1n \sum_{i=1}^n |\hat{y}_i - y_i|^2}$$

And in code:

In [None]:
def mse(v1, v2):
    """
    Computes the mean squared error between the two
    vectors v1 and v2 (that are of equal length)
    """
    v1, v2 = np.array(v1), np.array(v2)
    
    squared_distance = np.sum(np.abs(v1-v2)**2)
    mean_squared_distance = squared_distance/len(v1)
    
    return mean_squared_distance

In [None]:
mse(y_test, y_pred)

In [None]:
def rmse(v1, v2):
    """
    Computes the root mean squared error between the two
    vectors v1 and v2 (that are of equal length)
    """
    v1, v2 = np.array(v1), np.array(v2)
    
    squared_distance = np.sum(np.abs(v1-v2)**2)
    root_mean_squared_distance = np.sqrt(squared_distance/len(v1))
    
    return root_mean_squared_distance

In [None]:
rmse(y_test, y_pred)
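
The effect of squaring on outliers can be seen in a small made-up example: two sets of predictions with the same MAE but different RMSE. The arrays and the helper `mae_rmse` below are assumptions chosen to make the arithmetic easy:

```python
import numpy as np

y_true = np.zeros(4)
pred_even = np.array([1.0, 1.0, 1.0, 1.0])     # four small errors
pred_outlier = np.array([0.0, 0.0, 0.0, 4.0])  # one large error

def mae_rmse(y, yhat):
    """Return (MAE, RMSE) for the given targets and predictions."""
    err = yhat - y
    return np.mean(np.abs(err)), np.sqrt(np.mean(err**2))

print(mae_rmse(y_true, pred_even))     # MAE 1.0, RMSE 1.0
print(mae_rmse(y_true, pred_outlier))  # MAE 1.0, RMSE 2.0
```

Both sets of predictions are off by 1 on average, but the single large error doubles the RMSE.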

## $R^2$: The coefficient of determination

The _coefficient of determination_ is another way to measure the distance between the predicted labels and the true labels in our test set. It is computed from two quantities: the *residual sum of squares* **$SS_{res}$** and the *total sum of squares* **$SS_{tot}$**:

$$
\begin{align}
SS_{res} &= \sum_i (y_i - \hat{y}_i)^2 \\
SS_{tot} &= \sum_i (y_i - \bar{y})^2,
\end{align}
$$
where $\bar{y}$ is the mean of the label values:
$$
\bar{y} = \frac1n \sum_i y_i
$$

The formula for $R^2$ is then:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Note that if the predictions agree perfectly with the labels then $R^2 = 1$. If a model always predicts $\bar{y}$, then it has $R^2 = 0$. A model that performs worse than this will have a negative $R^2$ value.

In [None]:
def r2(y, yhat):
    ymean = np.mean(y)
    ssres = np.sum((y - yhat)**2)
    sstot = np.sum((y - ymean)**2)
    return 1 - ssres/sstot

In [None]:
r2(y_test, y_pred)
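
The three cases mentioned above (perfect predictions, always predicting the mean, and something worse than the mean) can be checked on a small made-up label vector. The names `y_toy` and `r2_toy` are assumptions for illustration:

```python
import numpy as np

def r2_toy(y, yhat):
    """Coefficient of determination, following the formula above."""
    ss_res = np.sum((y - yhat)**2)
    ss_tot = np.sum((y - np.mean(y))**2)
    return 1 - ss_res / ss_tot

y_toy = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2_toy(y_toy, y_toy)                       # predictions equal labels
r2_mean = r2_toy(y_toy, np.full(4, y_toy.mean()))       # always predict the mean
r2_reversed = r2_toy(y_toy, y_toy[::-1])                # worse than the mean
print(r2_perfect, r2_mean, r2_reversed)
```

For the reversed predictions, $SS_{res} = 20$ and $SS_{tot} = 5$, giving $R^2 = 1 - 20/5 = -3$.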

## In scikit learn

These metrics are standard and are, of course, also available in scikit-learn, so there's no need to write your own implementations.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
# MAE
mean_absolute_error(y_test, y_pred)

In [None]:
# MSE
mean_squared_error(y_test, y_pred)

In [None]:
# RMSE (the square root of the MSE)
np.sqrt(mean_squared_error(y_test, y_pred))

In [None]:
# R2
r2_score(y_test, y_pred)

## Other models

Let's compare our RandomForestRegressor to some other models. Have a look at the scikit-learn documentation for more information (and inspiration): https://scikit-learn.org/stable/supervised_learning.html#supervised-learning 

### Linear regression

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(10,10))
sns.regplot(x=y_test, y=lr.predict(X_test))
plt.show()

In [None]:
# MAE
mean_absolute_error(y_test, lr.predict(X_test))

In [None]:
# MSE
mean_squared_error(y_test, lr.predict(X_test))

In [None]:
# RMSE (the square root of the MSE)
np.sqrt(mean_squared_error(y_test, lr.predict(X_test)))

In [None]:
# R2
r2_score(y_test, lr.predict(X_test))

### ElasticNet

In [None]:
from sklearn.linear_model import ElasticNet

In [None]:
en = ElasticNet()
en.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(10,10))
sns.regplot(x=y_test, y=en.predict(X_test))
plt.show()

In [None]:
# MAE
mean_absolute_error(y_test, en.predict(X_test))

In [None]:
# MSE
mean_squared_error(y_test, en.predict(X_test))

In [None]:
# RMSE (the square root of the MSE)
np.sqrt(mean_squared_error(y_test, en.predict(X_test)))

In [None]:
# R2
r2_score(y_test, en.predict(X_test))

### Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gb = GradientBoostingRegressor(n_estimators=500, max_depth=4, learning_rate=0.2)
gb.fit(X_train, y_train)

In [None]:
plt.figure(figsize=(10,10))
sns.regplot(x=y_test, y=gb.predict(X_test))
plt.show()

In [None]:
# MAE
mean_absolute_error(y_test, gb.predict(X_test))

In [None]:
# MSE
mean_squared_error(y_test, gb.predict(X_test))

In [None]:
# RMSE (the square root of the MSE)
np.sqrt(mean_squared_error(y_test, gb.predict(X_test)))

In [None]:
# R2
r2_score(y_test, gb.predict(X_test))
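
Since the same four metrics are computed for every model, the comparison can be collected in one table. The sketch below shows one possible way to do this; the helper `compare_models` is a hypothetical name introduced here, and the data is a synthetic stand-in from `make_regression` so that the block runs on its own:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def compare_models(models, X_tr, X_te, y_tr, y_te):
    """Fit each model and collect MAE, RMSE and R^2 on the test set."""
    rows = {}
    for name, model in models.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        rows[name] = {
            'MAE': mean_absolute_error(y_te, pred),
            'RMSE': np.sqrt(mean_squared_error(y_te, pred)),
            'R2': r2_score(y_te, pred),
        }
    # One row per model, one column per metric
    return pd.DataFrame(rows).T

# Synthetic stand-in data (not the vehicles data set)
X_syn, y_syn = make_regression(n_samples=200, n_features=5,
                               noise=10.0, random_state=42)
# train_test_split returns X_train, X_test, y_train, y_test, in that order
splits = train_test_split(X_syn, y_syn, random_state=42)

results = compare_models({
    'linear': LinearRegression(),
    'forest': RandomForestRegressor(n_estimators=50, random_state=42),
}, *splits)
print(results)
```

On the vehicles data, one would pass `X_train, X_test, y_train, y_test` and the models defined above instead of the synthetic stand-ins.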

# Wrapping up

Now you've seen some basic ideas for constructing and evaluating regression models. In Assignment #1, you'll go into greater detail while building your own models to predict housing prices.