<h1 style="background-color:Green;font-size:36pt;color:white">Multiple linear regression</h1>

## Giusseppe Benito Bervis Quintero
### León, Nicaragua - 2022

# Importing needed packages

In [1]:
import numpy as np  # useful for many scientific computing in Python
import pandas as pd # primary data structure library

# Uploading packages for plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.style.use(['ggplot']) # optional: for ggplot-like style

import pylab as pl # convenient for interactive calculations and plotting

from sklearn import linear_model # To model data
from sklearn.metrics import r2_score # To perform evaluation

# Downloading Data

To download the data, we will use !wget to download it from IBM Object Storage.

In [2]:
df = pd.read_csv('https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv')

# Take a look at the dataset
df.head()

Unnamed: 0,MODELYEAR,MAKE,MODEL,VEHICLECLASS,ENGINESIZE,CYLINDERS,TRANSMISSION,FUELTYPE,FUELCONSUMPTION_CITY,FUELCONSUMPTION_HWY,FUELCONSUMPTION_COMB,FUELCONSUMPTION_COMB_MPG,CO2EMISSIONS
0,2014,ACURA,ILX,COMPACT,2.0,4,AS5,Z,9.9,6.7,8.5,33,196
1,2014,ACURA,ILX,COMPACT,2.4,4,M6,Z,11.2,7.7,9.6,29,221
2,2014,ACURA,ILX HYBRID,COMPACT,1.5,4,AV7,Z,6.0,5.8,5.9,48,136
3,2014,ACURA,MDX 4WD,SUV - SMALL,3.5,6,AS6,Z,12.7,9.1,11.1,25,255
4,2014,ACURA,RDX AWD,SUV - SMALL,3.5,6,AS6,Z,12.1,8.7,10.6,27,244


# Understanding the Data

FuelConsumption.csv:
We have downloaded a fuel consumption dataset, FuelConsumption.csv, which contains model-specific fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale in Canada. 

The features are:
*   **MODELYEAR** e.g. 2014
*   **MAKE** e.g. Acura
*   **MODEL** e.g. ILX
*   **VEHICLE CLASS** e.g. SUV
*   **ENGINE SIZE** e.g. 4.7
*   **CYLINDERS** e.g 6
*   **TRANSMISSION** e.g. A6
*   **FUEL CONSUMPTION in CITY(L/100 km)** e.g. 9.9
*   **FUEL CONSUMPTION in HWY (L/100 km)** e.g. 8.9
*   **FUEL CONSUMPTION COMB (L/100 km)** e.g. 9.2
*   **CO2 EMISSIONS (g/km)** e.g. 182   --> low --> 0

In [3]:
# Taking out the needed features
cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]

# Take a look at the new dataset
cdf.head()

Unnamed: 0,ENGINESIZE,CYLINDERS,FUELCONSUMPTION_COMB,CO2EMISSIONS
0,2.0,4,8.5,196
1,2.4,4,9.6,221
2,1.5,4,5.9,136
3,3.5,6,11.1,255
4,3.5,6,10.6,244


## Creating train and test dataset

In [25]:
# Splitting the data into train and test sets
# Taking 80% as trainning set
random_sample = np.random.rand(len(df)) < 0.8
train = cdf[random_sample]
test = cdf[random_sample]

# Multiple Regression Model

We can perform Multiple Linear Regression to predict co2emission using the features FUELCONSUMPTION_COMB, EngineSize and Cylinders of cars.

In [26]:
# Creating an object from class linear regression
regr = linear_model.LinearRegression()

# Converting training set into arrays
x = np.array(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.array(train[['CO2EMISSIONS']])

# Performing Multiple Linear Regression
regr.fit(x, y)

# Printing the coefficients
print('Coefficients: ', regr.coef_[0])
print('Constant:', regr.intercept_)

Coefficients:  [11.50963657  7.60843286  9.33433597]
Constant: [65.364736]


Scikit-learn uses plain Ordinary Least Squares method to solve this problem.

## Prediction

In [27]:
# Performing some metrics to evaluate the multiple regression model
y_pred = regr.predict(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
x = np.array(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.array(test[['CO2EMISSIONS']])

print(f'Mean absolute error: {np.mean(np.absolute(y_pred - y)) :.2f}')
print(f'Residual sum of squares: {np.mean((y_pred - y) ** 2) :.2f}')

# Explained variance score: 1 is perfect prediction
print(f'Variance score: {regr.score(x, y) :.2f}')

Mean absolute error: 17.28
Residual sum of squares: 560.32
Variance score: 0.87


The best possible score for Variandec Score is 1.0, the lower values are worse.

We will use FUELCONSUMPTION_CITY and FUELCONSUMPTION_HWY instead of FUELCONSUMPTION_COMB, to check out if we get a better accuracy.

In [28]:
# Taking out the needed features
Cons_df = df[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY', 'CO2EMISSIONS']]

# Splitting the data into train and test sets
# Taking 80% as trainning set
rand_sample = np.random.rand(len(Cons_df)) < 0.8
train1 = Cons_df[rand_sample]
test1 = Cons_df[~rand_sample]

# Creating an object from class linear regression
regr1 = linear_model.LinearRegression()

# Performing Multiple Linear Regression
x = np.array(train1[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY']])
y = np.array(train1[['CO2EMISSIONS']])
regr1.fit(x, y)

# Performing some metrics to evaluate the multiple regression model
y_pred = regr1.predict(test1[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY']])
x_test = np.array(test1[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_CITY', 'FUELCONSUMPTION_HWY']])
y_test = np.array(test1[['CO2EMISSIONS']])

print(f'Mean absolute error: {np.mean(np.absolute(y_pred - y_test)) :.2f}')
print(f'Residual sum of squares: {np.mean((y_pred - y_test)**2) :.2f}')

# Explained variance score: 1 is perfect prediction
print(f'Variance score: {regr1.score(x_test, y_test) :.2f}')

Mean absolute error: 18.07
Residual sum of squares: 582.82
Variance score: 0.85


As we can see, we get a higher sum of square and variance score, too. It's better to take FUELCONSUMPTION_COMB instead the other two features.