# Multiple Linear Regression

### Our Data:
For this notebook, we use a fuel consumption dataset. It contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for light-duty vehicles sold in Canada.

#### Our features include:
- <code>MODELYEAR</code>
- <code>MAKE</code>
- <code>MODEL</code>
- <code>VEHICLE CLASS</code>
- <code>ENGINE SIZE</code>
- <code>CYLINDERS</code>
- <code>TRANSMISSION</code>
- <code>FUEL CONSUMPTION in CITY (L/100KM)</code>
- <code>FUEL CONSUMPTION in HWY (L/100KM)</code>
- <code>FUEL CONSUMPTION COMB (L/100KM)</code>
- <code>CO2 EMISSIONS (g/km)</code>

#### Importing required packages:

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

#### Loading our dataset into dataframe:

In [2]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv')
print('data loaded!')

data loaded!


#### Selecting features we want to use for regression:

In [3]:
fts_df = df[['ENGINESIZE',
             'CYLINDERS',
             'FUELCONSUMPTION_CITY',
             'FUELCONSUMPTION_HWY',
             'FUELCONSUMPTION_COMB',
             'CO2EMISSIONS']]
fts_df.head()

| | ENGINESIZE | CYLINDERS | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | CO2EMISSIONS |
|---|---|---|---|---|---|---|
| 0 | 2.0 | 4 | 9.9 | 6.7 | 8.5 | 196 |
| 1 | 2.4 | 4 | 11.2 | 7.7 | 9.6 | 221 |
| 2 | 1.5 | 4 | 6.0 | 5.8 | 5.9 | 136 |
| 3 | 3.5 | 6 | 12.7 | 9.1 | 11.1 | 255 |
| 4 | 3.5 | 6 | 12.1 | 8.7 | 10.6 | 244 |


#### Splitting data into training and testing set:

In [4]:
# Roughly 80% training / 20% testing split (random mask, so the proportions are approximate)
mask = np.random.rand(len(df)) < 0.8
train = fts_df[mask]
test = fts_df[~mask]
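
The mask above is drawn from NumPy's global random state, so the split (and the scores further down) will differ between runs. A minimal sketch of a reproducible variant, assuming a seeded <code>np.random.default_rng</code> generator and a small stand-in dataframe (<code>demo_df</code> and the seed value 42 are illustrative and not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Stand-in data (hypothetical); in the notebook this would be fts_df.
demo_df = pd.DataFrame({
    'ENGINESIZE': np.linspace(1.0, 6.0, 100),
    'CO2EMISSIONS': np.linspace(120.0, 400.0, 100),
})

# A seeded generator makes the random mask identical on every run.
rng = np.random.default_rng(42)  # the seed value is an arbitrary assumption
demo_mask = rng.random(len(demo_df)) < 0.8

demo_train = demo_df[demo_mask]
demo_test = demo_df[~demo_mask]

# Every row lands in exactly one of the two sets.
print(len(demo_train), len(demo_test))
```

Because each row is assigned independently, the split is only approximately 80/20; the seed fixes which rows go where, not the exact proportions.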

## Build our models:

#### MLR Model 1:
Using <code>ENGINESIZE</code>, <code>CYLINDERS</code>, and <code>FUELCONSUMPTION_COMB</code> to predict <code>CO2EMISSIONS</code>

In [5]:
from sklearn import linear_model
rr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
rr.fit(x, y)
print('Coef: ', rr.coef_)
# scikit-learn estimates these coefficients from our data using Ordinary Least Squares (OLS).
# OLS can find the best parameters with a closed-form solution or
# with an iterative optimization algorithm.

Coef:  [[10.26847982  7.97934339  9.51341202]]
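
The comment in the cell above mentions the closed-form route. As a rough illustration, the sketch below solves the normal equations $\hat{\theta} = (X^{T}X)^{-1}X^{T}y$ with NumPy on synthetic data. The arrays <code>X_demo</code> and <code>y_demo</code>, and the coefficients used to generate them, are made up, and this is not necessarily how scikit-learn computes its solution internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 samples, 3 features, known coefficients.
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([10.0, 8.0, 9.5])
true_intercept = 65.0
y_demo = X_demo @ true_coef + true_intercept + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the intercept is estimated with the slopes.
X_aug = np.column_stack([np.ones(len(X_demo)), X_demo])

# Normal equations: solve (X^T X) theta = X^T y rather than inverting X^T X.
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y_demo)

intercept_hat, coef_hat = theta[0], theta[1:]
print('Intercept:', intercept_hat)
print('Coef:', coef_hat)
```

Solving the linear system directly is generally preferred over forming the explicit inverse, which is less stable numerically when features are strongly correlated.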


#### Predictions from Model 1:

In [6]:
x = test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']].values
y = test[['CO2EMISSIONS']].values
y_hat = rr.predict(x)
print('MSE : %.2f' % np.mean((y_hat-y)**2))

# score() returns the coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % rr.score(x, y))

MSE : 602.70
Variance score: 0.86


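$\texttt{explainedVariance}(y, \hat{y}) = 1 - \frac{\operatorname{Var}\{ y - \hat{y}\}}{\operatorname{Var}\{y\}}$

Note that <code>rr.score</code> reports the coefficient of determination, $R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$, which coincides with the explained variance when the residuals $y - \hat{y}$ have zero mean.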
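
The explained variance formula and the value printed by <code>score</code> are close relatives but not identical: scikit-learn's <code>LinearRegression.score</code> returns the coefficient of determination $R^2$. The sketch below computes both by hand with NumPy on made-up numbers (the helper names <code>explained_variance</code> and <code>r_squared</code> are illustrative). It shows that the two agree when the residuals average to zero and diverge when the predictions are systematically off.

```python
import numpy as np

def explained_variance(y_true, y_pred):
    """1 - Var(y - y_hat) / Var(y)."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, the coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration only.
y_true = np.array([200.0, 220.0, 140.0, 255.0, 245.0])
y_unbiased = y_true + np.array([5.0, -5.0, 5.0, -5.0, 0.0])  # residuals average to 0
y_biased = y_unbiased + 10.0                                  # every prediction 10 too high

ev_unbiased, r2_unbiased = explained_variance(y_true, y_unbiased), r_squared(y_true, y_unbiased)
ev_biased, r2_biased = explained_variance(y_true, y_biased), r_squared(y_true, y_biased)

print('Unbiased predictions: EV = %.4f, R^2 = %.4f' % (ev_unbiased, r2_unbiased))
print('Biased predictions:   EV = %.4f, R^2 = %.4f' % (ev_biased, r2_biased))
```

A constant offset leaves the variance of the residuals unchanged, so the explained variance does not penalize it, whereas $R^2$ does.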

#### MLR Model 2:
Using <code>ENGINESIZE</code>, <code>CYLINDERS</code>, <code>FUELCONSUMPTION_HWY</code>, and <code>FUELCONSUMPTION_CITY</code> to predict <code>CO2EMISSIONS</code>

Will using separate features for city and highway fuel consumption give us a more accurate model?

In [7]:
rr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_HWY','FUELCONSUMPTION_CITY']])
y = np.asanyarray(train[['CO2EMISSIONS']])
rr.fit(x, y)
print('Coef: ', rr.coef_)
# scikit-learn estimates these coefficients from our data using Ordinary Least Squares (OLS).
# OLS can find the best parameters with a closed-form solution or
# with an iterative optimization algorithm.

Coef:  [[10.33029931  7.6443349   3.11936038  6.10756951]]


#### Predictions from Model 2:

In [8]:
x = test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_HWY','FUELCONSUMPTION_CITY']].values
y = test[['CO2EMISSIONS']].values
y_hat = rr.predict(x)
print('MSE : %.2f' % np.mean((y_hat-y)**2))

# score() returns the coefficient of determination (R^2): 1 is perfect prediction
print('Variance score: %.2f' % rr.score(x, y))

MSE : 602.30
Variance score: 0.86


##### Inference:
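The use of <code>FUELCONSUMPTION_CITY</code> and <code>FUELCONSUMPTION_HWY</code> instead of <code>FUELCONSUMPTION_COMB</code> resulted in no meaningful change in accuracy: the test MSE moved from 602.70 to 602.30 and the variance score stayed at 0.86. This is what we would expect if the combined rating is derived from the city and highway ratings, since the two separate features then carry essentially the same information as the combined one.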
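
One plausible reason for the near-identical scores is that a combined rating built as a weighted average of the city and highway ratings adds no information beyond the two separate features. The sketch below illustrates this on synthetic data only. The 0.55/0.45 weighting, the generated values, and the helper <code>fit_mse</code> are assumptions made for the example and are not taken from the dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Synthetic consumption figures (hypothetical, not the real dataset).
city = rng.uniform(6.0, 20.0, size=n)
hwy = 0.7 * city + rng.normal(scale=0.5, size=n)
comb = 0.55 * city + 0.45 * hwy  # assumed weighting of city and highway
co2 = 23.0 * comb + rng.normal(scale=5.0, size=n)

def fit_mse(features, target):
    """Fit OLS with an intercept and return the in-sample MSE."""
    X = np.column_stack([np.ones(len(target)), features])
    theta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return np.mean((X @ theta - target) ** 2)

mse_comb = fit_mse(comb.reshape(-1, 1), co2)
mse_split = fit_mse(np.column_stack([hwy, city]), co2)

print('MSE with combined feature:  %.3f' % mse_comb)
print('MSE with separate features: %.3f' % mse_split)
```

Under these assumptions the two-feature model can reproduce any fit of the one-feature model, so its in-sample error is never higher, but the extra flexibility buys almost nothing when the target depends on the features only through their weighted average.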