# Modeling with Multiple Linear Regression

This modeling exercise is from the course Machine Learning with Python on Coursera. Here we explore multiple linear regression.

## 1. Import relevant libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

## 2. Load dataset

In [2]:
data_path = '/Users/danielchen/Desktop/Coding/Python/Coursera/Machine Learning Python/Data/FuelConsumptionCo2.csv'
data = pd.read_csv(data_path)
data

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG,CO2EMISSIONS
0,2014,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,8.5,33,196
1,2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2,2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,5.9,48,136
3,2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,11.1,25,255
4,2014,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,10.6,27,244
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1062,2014,VOLVO,XC60 AWD,SUV - SMALL,3.0,6,AS6,X,13.4,9.8,11.8,24,271
1063,2014,VOLVO,XC60 AWD,SUV - SMALL,3.2,6,AS6,X,13.2,9.5,11.5,25,264
1064,2014,VOLVO,XC70 AWD,SUV - SMALL,3.0,6,AS6,X,13.4,9.8,11.8,24,271
1065,2014,VOLVO,XC70 AWD,SUV - SMALL,3.2,6,AS6,X,12.9,9.3,11.3,25,260
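
The `Unnamed: 0` column in the output above is what pandas typically shows when a CSV was saved together with its row index. A minimal sketch, assuming that is the case here, reads such a file with `index_col=0` so the saved index is not treated as a data column. The two-row CSV text is a stand-in built from the first two rows shown above, not the course file itself.

```python
import io

import pandas as pd

# Two-row stand-in for a CSV that was written together with its index.
csv_text = (
    ",ENGINESIZE,CYLINDERS,CO2EMISSIONS\n"
    "0,2.0,4,196\n"
    "1,2.4,4,221\n"
)

# Default read: the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use that first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```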


## 3. Fitting the Model

The machine learning pipeline for the multivariable case follows exactly the same steps as simple linear regression. We split the data into training and test sets, fit the linear regression on the training data, and then predict and evaluate on the test data.

In [3]:
linreg = linear_model.LinearRegression()
X_train, X_test, y_train, y_test = train_test_split(
    data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']],
    data[['CO2EMISSIONS']],
    test_size=0.2,
    random_state=23
)

In [4]:
linreg.fit(X_train, y_train)

LinearRegression()

In [5]:
f'The coefficients of the model are {", ".join([str(coef) for coef in linreg.coef_[0]])}'

'The coefficients of the model are 10.990037482254897, 7.963658885721354, 9.482656568443034'
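
Because the target was passed as a one-column DataFrame, `coef_` is two-dimensional, which is why the cell above indexes `coef_[0]`. A small sketch on synthetic data (the feature names `x1`, `x2` and the generating coefficients are made up for illustration) pairs each coefficient with its feature name:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2*x1 + 3*x2 + 1 with no noise (illustrative only).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 2)), columns=["x1", "x2"])
y = pd.DataFrame({"y": 2 * X["x1"] + 3 * X["x2"] + 1})

model = LinearRegression().fit(X, y)

# A 2-D target gives coef_ of shape (n_targets, n_features).
print(model.coef_.shape)

# Pair each feature name with its fitted coefficient.
named_coefs = dict(zip(X.columns, model.coef_[0]))
print(named_coefs)
```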

Note that sklearn's `LinearRegression` determines the coefficients by ordinary least squares (OLS). Solving OLS directly relies on matrix operations, which can become time consuming on large datasets (the course gives roughly 10K rows as a rule of thumb). Iterative optimization algorithms such as gradient descent offer an alternative way to solve the problem, but those topics will not be covered here.
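
To make the matrix formulation concrete, here is a sketch that solves the least-squares problem directly with `np.linalg.lstsq` on synthetic data and compares the result with `LinearRegression`. The data and the generating coefficients are made up for illustration, and the direct solve is not necessarily the exact routine sklearn runs internally.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(23)
X = rng.normal(size=(200, 3))
true_coef = np.array([11.0, 8.0, 9.5])  # made-up coefficients
y = X @ true_coef + 65.0 + rng.normal(scale=0.5, size=200)

# Add a column of ones so the intercept is estimated along with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ~= y.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

sk = LinearRegression().fit(X, y)

print("lstsq intercept, slopes:", beta[0], beta[1:])
print("sklearn intercept, slopes:", sk.intercept_, sk.coef_)
```

Both approaches minimize the same sum of squared residuals, so the estimates should agree up to floating-point error.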

## 4. Prediction

Prediction works the same way as for the simple linear model. After fitting the model, we predict values of the target by passing the `X_test` data set to `predict`.

In [6]:
y_hat = linreg.predict(X_test)
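
For a linear model, `predict` returns the fitted intercept plus a weighted sum of the features. A minimal sketch on synthetic data (all numbers are illustrative assumptions) checks this against `predict`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

X_new = rng.normal(size=(5, 3))  # unseen rows with the same three features

# Library prediction.
y_hat = model.predict(X_new)

# The same prediction written out by hand.
y_manual = X_new @ model.coef_ + model.intercept_

print(np.allclose(y_hat, y_manual))
```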

## 5. Evaluation

Here we have two options for evaluation. The first is the mean squared error (MSE): the average of the squared differences between each predicted value and the actual value. Without the averaging step, the same quantity is the residual sum of squares.

In [7]:
f'The mean squared error is {np.round(np.mean(np.power(y_hat - y_test, 2).CO2EMISSIONS), 2)}'

'The mean squared error is 488.65'
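
A small sketch with made-up numbers shows how the residual sum of squares (RSS) and the mean squared error (MSE) relate: the MSE is the RSS divided by the number of observations.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

residuals = y_true - y_pred      # [1, 0, -2]
rss = np.sum(residuals ** 2)     # 1 + 0 + 4 = 5
mse = np.mean(residuals ** 2)    # 5 / 3

# sklearn's helper computes the same mean.
mse_sklearn = mean_squared_error(y_true, y_pred)

print(rss, mse, mse_sklearn)
```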

The second option is the R-squared score (the coefficient of determination), which is what `linreg.score` returns. In plain English, it measures how well our model accounts for the variance in the target. The best possible score is 1; on held-out data the score can fall below 0 when the model predicts worse than always guessing the mean. We should be skeptical of a score at or very near 1, because it often signals an overfit model: one that has learned this data set too well and likely won't perform well on outside data.

In [8]:
explained_variance = linreg.score(X_test, y_test)
f'The variance score is {np.round(explained_variance, 2)}'

'The variance score is 0.86'
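
`LinearRegression.score` returns the coefficient of determination (R-squared). scikit-learn also provides a separate `explained_variance_score`; the two agree when the residuals average to zero and differ otherwise. A sketch with made-up numbers shows the difference, and that R-squared can be negative:

```python
import numpy as np
from sklearn.metrics import explained_variance_score, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Predictions that are off by a constant +1: the residuals have zero variance.
y_biased = y_true + 1.0
ev_biased = explained_variance_score(y_true, y_biased)
r2_biased = r2_score(y_true, y_biased)

# Predictions in reverse order: worse than always predicting the mean.
y_reversed = y_true[::-1]
r2_reversed = r2_score(y_true, y_reversed)

print(ev_biased, r2_biased, r2_reversed)
```

The constant offset is invisible to the explained variance score but is penalized by R-squared.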

## 6. Model Two

For the second model, we keep `ENGINESIZE` and `CYLINDERS` and use `FUELCONSUMPTION_CITY` together with `FUELCONSUMPTION_COMB` to predict `CO2EMISSIONS`.

In [9]:
# Instantiate the class
reg = linear_model.LinearRegression()

In [10]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_COMB']],
    data[['CO2EMISSIONS']],
    test_size=.20,
    random_state=23
)

In [11]:
# Fit the model
reg.fit(X_train, y_train)

LinearRegression()

In [12]:
# Predict
y_hat = reg.predict(X_test)

In [13]:
# Mean squared error on the test set
np.round(np.mean(np.power(y_hat - y_test, 2).CO2EMISSIONS), 2)

491.86

In [14]:
# R-squared on the test set (what reg.score returns)
f'The explained variance is: {reg.score(X_test, y_test)}'

'The explained variance is: 0.8580731159241379'
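
Since both models repeat the same split, fit, predict, and evaluate steps, they could be wrapped in a helper. The sketch below is one possible way to do that; `fit_and_score` is a hypothetical name, and the data frame with columns `a`, `b`, `c`, `y` is synthetic stand-in data rather than the fuel consumption set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(df, features, target, test_size=0.2, random_state=23):
    """Split, fit a linear regression, and return (MSE, R-squared) on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=test_size, random_state=random_state
    )
    model = LinearRegression().fit(X_train, y_train)
    y_hat = model.predict(X_test)
    return mean_squared_error(y_test, y_hat), r2_score(y_test, y_hat)


# Synthetic stand-in data; the column names are made up.
rng = np.random.default_rng(23)
df = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
df["y"] = 3 * df["a"] - 2 * df["b"] + rng.normal(scale=0.5, size=300)

mse_ab, r2_ab = fit_and_score(df, ["a", "b"], "y")
mse_a, r2_a = fit_and_score(df, ["a"], "y")
print(mse_ab, r2_ab)
print(mse_a, r2_a)
```

Using the same `random_state` for every call keeps the train and test rows identical across feature sets, so the scores are directly comparable.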

## 7. Cross Validation

In [15]:
from sklearn.model_selection import cross_validate

In [35]:
metrics = cross_validate(
    estimator=linear_model.LinearRegression(),
    X=data[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']],
    y=data[['CO2EMISSIONS']],
    cv=10,
    scoring='explained_variance'
)

By default, `cross_validate` scores a regression model with its `score` method, which is R-squared for `LinearRegression`. That may not be appropriate for comparing models, since R-squared on the training data is weakly increasing in the number of explanatory variables: adding more independent variables never decreases it. Here we request `explained_variance` through the `scoring` argument instead.
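
The claim about R-squared can be checked on training data. A sketch with synthetic data (all values made up for illustration) fits one model, refits after appending a pure-noise column, and compares the in-sample R-squared:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=100)

# Append a column of pure noise that has no relationship with y.
X_plus_noise = np.column_stack([X, rng.normal(size=100)])

# Score each model on the same data it was fitted on (in-sample R-squared).
r2_base = LinearRegression().fit(X, y).score(X, y)
r2_more = LinearRegression().fit(X_plus_noise, y).score(X_plus_noise, y)

print(r2_base, r2_more)
```

The guarantee applies only in-sample; on held-out folds an irrelevant feature can lower the score.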

In [45]:
f'The mean explained variance of our model is: {np.mean(metrics["test_score"])}'

'The mean explained variance of our model is: 0.8642145363685009'
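
`cross_validate` also accepts a list of scorer names, which makes it possible to compare several metrics over the same folds. A minimal sketch on synthetic data (the data and generating coefficients are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

rng = np.random.default_rng(23)
X = rng.normal(size=(200, 3))
y = X @ np.array([11.0, 8.0, 9.5]) + rng.normal(scale=5.0, size=200)

scorers = ["r2", "explained_variance", "neg_mean_squared_error"]
results = cross_validate(
    estimator=LinearRegression(),
    X=X,
    y=y,
    cv=10,
    scoring=scorers,
)

# With a list of scorers, each result key is "test_<scorer name>".
for name in scorers:
    scores = results[f"test_{name}"]
    print(name, scores.mean(), scores.std())
```

The mean squared error is reported with a negative sign because scikit-learn scorers follow a "higher is better" convention.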