# CO$_2$ emissions of new cars in Canada in 2023

<img src="https://media.greenmatters.com/brand-img/NTfo8bR6j/2160x1131/what-emissions-do-cars-produce2-1604596690492.jpg" width=1000/>

<a target="_blank" href="https://colab.research.google.com/github/concordia-grad-computing-seminars/data-engineering/blob/main/assignments/assignment2/ass2.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Instructions
- Complete this notebook as needed to answer the questions below.
- Submit your notebook on Moodle, together with a PDF or HTML copy of it (with the answers computed).
- Please submit a clean notebook (i.e. only the code needed to obtain the answers, without the debugging and trial code you wrote along the way).

## Libraries

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

## Data

Data source: https://open.canada.ca/data/en/dataset/98f1a129-f628-4ce4-b24d-6f16bf24dd64<br>
License: https://open.canada.ca/en/open-government-licence-canada

Dataset: 2023 Fuel Consumption Ratings 
```
Model
=====
4WD/4X4 = Four-wheel drive
AWD = All-wheel drive
FFV = Flexible-fuel vehicle
SWB = Short wheelbase
LWB = Long wheelbase
EWB = Extended wheelbase

Engine size
===========
Engine size is in liters

Transmission
============
A = automatic
AM = automated manual
AS = automatic with select shift
AV = continuously variable
M = manual
3 – 10 = Number of gears

Fuel type
=========
X = regular gasoline
Z = premium gasoline
D = diesel
E = ethanol (E85)
N = natural gas

Fuel consumption
================
City and highway fuel consumption are shown in liters per 100 kilometers (L/100 km)
Combined consumption (55% city, 45% hwy) is shown in liters per 100 kilometers (L/100 km)

CO2 emissions / rating
======================
CO2 emissions = the tailpipe emissions of carbon dioxide (in grams per kilometer) for combined city and highway driving
CO2 rating = the tailpipe emissions of carbon dioxide rated on a scale from 1 (worst) to 10 (best)
```

In [2]:
cars = pd.read_csv("MY2023_FuelConsumptionRatings.csv")
cars.head()

Unnamed: 0,Make,Model,Class,EngineSize,Cylinders,Transmission,FuelType,ConsumptionCity,ConsumptionHwy,ConsumptionComb,CO2_Emissions,CO2_Rating
0,Alfa Romeo,Giulia,Mid-size,2.0,4,A8,Z,10.0,7.2,8.7,205,5
1,Alfa Romeo,Giulia AWD,Mid-size,2.0,4,A8,Z,10.5,7.7,9.2,217,5
2,Alfa Romeo,Giulia Quadrifoglio,Mid-size,2.9,6,A8,Z,13.5,9.3,11.6,271,4
3,Alfa Romeo,Stelvio,SUV: Small,2.0,4,A8,Z,10.3,8.1,9.3,218,5
4,Alfa Romeo,Stelvio AWD,SUV: Small,2.0,4,A8,Z,10.8,8.3,9.6,226,5
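
As a quick sanity check of the 55% / 45% weighting given in the data description, the sketch below recomputes the combined consumption for the five rows displayed by `cars.head()`. The helper name `combined` is hypothetical and not part of the data set documentation. The published values are presumably derived from unrounded measurements, so agreement is only expected to within about 0.1 L/100 km.

```python
# (city, highway, listed combined) in L/100 km, copied from the cars.head() output
rows = [
    (10.0, 7.2, 8.7),
    (10.5, 7.7, 9.2),
    (13.5, 9.3, 11.6),
    (10.3, 8.1, 9.3),
    (10.8, 8.3, 9.6),
]


def combined(city, hwy):
    """Weighted average of city (55%) and highway (45%) consumption."""
    return 0.55 * city + 0.45 * hwy


# Absolute difference between the recomputed and the listed combined value
diffs = [abs(combined(city, hwy) - listed) for city, hwy, listed in rows]

for (city, hwy, listed), diff in zip(rows, diffs):
    print(f"city={city:5.1f}  hwy={hwy:4.1f}  listed={listed:5.1f}  "
          f"recomputed={combined(city, hwy):6.3f}  diff={diff:.3f}")
```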


## Questions

### 1. Exploration

In this question we want to get a high-level overview of the data set. Add some relevant graphs or tables here that help you understand what kind of data the set contains.

Some suggested graphs are mentioned below. You may add more if you feel they are useful.
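
One possible way to produce the suggested overview is sketched here. It runs on a small made-up DataFrame named `demo` (a hypothetical stand-in that only borrows some column names from `cars`), so the values and plots it produces say nothing about the real data. Swapping `demo` for `cars` should give the corresponding views of the actual set.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# Hypothetical stand-in for `cars`: same column names, made-up values
rng = np.random.default_rng(0)
n = 50
demo = pd.DataFrame({
    "EngineSize": rng.uniform(1.0, 6.0, n).round(1),
    "Cylinders": rng.choice([3, 4, 6, 8], n),
    "FuelType": rng.choice(["X", "Z", "D", "E"], n),
    "Class": rng.choice(["Mid-size", "SUV: Small", "Compact"], n),
})
demo["ConsumptionComb"] = (4.0 + 1.5 * demo["EngineSize"]
                           + rng.normal(0.0, 0.5, n)).round(1)

# Scatter matrix and histograms of the numerical columns
# (in a notebook the figures are rendered inline)
numeric = demo.select_dtypes(include="number")
axes = scatter_matrix(numeric, figsize=(8, 8), diagonal="hist")
hist_axes = numeric.hist(bins=15, figsize=(8, 6))

# Unique values of the categorical (non-numerical) columns
categorical = demo.select_dtypes(exclude="number")
uniques = {col: sorted(categorical[col].unique()) for col in categorical.columns}
for col, values in uniques.items():
    print(col, values)
```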

In [3]:
# Scatter-matrix of all numerical values


In [4]:
# Histograms of all numerical values


In [5]:
# Unique values of the categorical features


### 2. Predicting CO$_2$ emissions as a function of car fuel consumption

The aim is to build a linear model predicting `CO2_Emissions` as a function of `ConsumptionComb`:
$$
\text{CO2_Emissions} = \beta_0 + \beta_1 \cdot \text{ConsumptionComb}
$$

Steps:
* Build the features matrix and target values vector
* Split the set into a training and a test set (you can use [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html))
* Build and train the model using your training set (you can use [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html))
* Evaluate the trained model on your test set, providing graphs and computing the RMSE (you can use [`mean_squared_error`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html), but read the documentation carefully to make sure you actually obtain the RMSE and not the MSE, or build your own function)
* Estimate the error on the two model parameters $\beta_0$ and $\beta_1$ by performing non-parametric bootstrapping
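
The split, fit and evaluation steps listed above can be sketched as follows. The data are made up (the slope, intercept and noise level are arbitrary choices, not properties of the real set), and the target is kept 1-D for simplicity. Since `mean_squared_error` returns the MSE by default, the sketch takes the square root explicitly, which should behave the same across scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Made-up data: arbitrary linear relation plus Gaussian noise
rng = np.random.default_rng(0)
consumption = rng.uniform(4.0, 20.0, 200)                         # L/100 km
emissions = 2.0 + 23.0 * consumption + rng.normal(0.0, 5.0, 200)  # g/km

X = np.c_[consumption]   # features matrix, shape (200, 1)
y = emissions            # target vector, shape (200,)

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# RMSE = square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"beta_0 = {model.intercept_:.2f} g/km")
print(f"beta_1 = {model.coef_[0]:.2f} (g/km) per (L/100 km)")
print(f"RMSE   = {rmse:.2f} g/km")
```

With the 2-D `y` built by `np.c_` in the notebook template, `intercept_` and `coef_` come back as arrays rather than scalars, so they need to be indexed accordingly.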

In [6]:
# Features matrix X and target values y

X = np.c_[cars.ConsumptionComb]
y = np.c_[cars.CO2_Emissions]

In [7]:
# Training and test set


In [8]:
# Definition of the model and training


In [9]:
# Evaluation on the test set of the trained model
#
# Provide a plot of the predicted CO2_Emissions versus the actual CO2_Emissions


In [10]:
# Compute the root mean square error (RMSE) on the test set


In [11]:
# Plot on a graph the errors of each sample in the test set and add two horizontal lines showing +RMSE and -RMSE


In [12]:
# Sensitivity analysis (using non-parametric bootstrapping)
#
# Create a function to draw a random sample from the full data set
# You can simply use train_test_split for this operation

def bootstrap(X, y):
    """
    Returns a random sample (X_rdm, y_rdm) from the full data set (X, y)
    """
    pass

In [13]:
# By drawing 1000 random subsets, build two lists beta_0 and beta_1 which contain the model parameters computed for each drawn subset
# You can use the following template as a starting point

beta_0 = []
beta_1 = []
for i in range(1000):
    pass
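
For reference, a minimal sketch of the bootstrap loop on made-up data. Note one deliberate difference from the template: the classical non-parametric bootstrap resamples *with replacement*, which is what the hypothetical helper `bootstrap_sample` does here, whereas `train_test_split` draws a subset without replacement. Both give a spread of fitted parameters, but the resulting standard errors are not identical. The number of draws (500) and the synthetic coefficients are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)


def bootstrap_sample(X, y, rng):
    """Draw len(X) rows from (X, y) with replacement."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]


# Made-up data: arbitrary linear relation plus Gaussian noise
consumption = rng.uniform(4.0, 20.0, 200)                  # L/100 km
X = np.c_[consumption]
y = 2.0 + 23.0 * consumption + rng.normal(0.0, 5.0, 200)   # g/km

# Refit the model on each resampled data set and collect the parameters
beta_0, beta_1 = [], []
for _ in range(500):
    X_b, y_b = bootstrap_sample(X, y, rng)
    fit = LinearRegression().fit(X_b, y_b)
    beta_0.append(fit.intercept_)
    beta_1.append(fit.coef_[0])

# The standard error is the standard deviation of the bootstrap estimates
se_0 = np.std(beta_0, ddof=1)
se_1 = np.std(beta_1, ddof=1)

print(f"beta_0 = {np.mean(beta_0):.2f} +/- {se_0:.2f} g/km")
print(f"beta_1 = {np.mean(beta_1):.3f} +/- {se_1:.3f} (g/km) per (L/100 km)")
```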

In [14]:
# Compute and display the mean values and standard errors of the two model parameters beta_0 and beta_1
# Make sure to add the correct physical units to your printouts


In [15]:
# Give an appropriate visual representation of this result (e.g. plotting the histograms of beta_0 and beta_1)


### 3. Predicting the car fuel consumption as a function of car characteristics

The aim is to build a linear model predicting `ConsumptionComb` as a function of the car characteristics `EngineSize`, `Cylinders` and `FuelType`.

The model will be a linear model with polynomial basis functions (use [`PolynomialFeatures`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) to transform the features).

To encode the categorical feature `FuelType` we will use one-hot encoding with [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

Steps:
* Build the features matrix and target values vector
* Split the set into a training and a test set (you can use [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html))
* Build a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) which one-hot encodes the categorical feature `FuelType`, applies the `PolynomialFeatures` transformer to the numerical features `EngineSize` and `Cylinders`, and then performs linear regression with [`LinearRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).
* Using cross validation, find the best degree n of the polynomial for the `PolynomialFeatures` transformer
* Build and train the model using your training set and optimal degree n
* Evaluate the trained model on your test set, providing graphs and computing the RMSE (you can use [`mean_squared_error`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html), but read the documentation carefully to make sure you actually obtain the RMSE and not the MSE, or build your own function)
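
A possible shape for the pipeline and the cross-validated search over the degree is sketched below. Several things here are assumptions made for illustration: the function name `create_poly_model`, the use of a DataFrame with named columns instead of the `np.c_` array from the template, the fixed category list `FUEL_TYPES`, the range of degrees tried, and the made-up data. Passing `categories` explicitly keeps the number of encoded columns identical in every fold, even when a fold happens to miss one fuel type.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures

FUEL_TYPES = ["D", "E", "X", "Z"]  # assumed fixed list of fuel type codes


def create_poly_model(n):
    """Pipeline: polynomial features of degree n + one-hot fuel type -> linear regression."""
    features = ColumnTransformer([
        ("poly", PolynomialFeatures(degree=n, include_bias=False),
         ["EngineSize", "Cylinders"]),
        ("onehot", OneHotEncoder(categories=[FUEL_TYPES]), ["FuelType"]),
    ])
    return make_pipeline(features, LinearRegression())


# Made-up training data with a quadratic dependence on engine size
rng = np.random.default_rng(0)
n_rows = 300
demo = pd.DataFrame({
    "EngineSize": rng.uniform(1.0, 6.0, n_rows),
    "Cylinders": rng.choice([4, 6, 8], n_rows),
    "FuelType": rng.choice(FUEL_TYPES, n_rows),
})
y = (3.0 + 1.2 * demo["EngineSize"] + 0.3 * demo["EngineSize"] ** 2
     + rng.normal(0.0, 0.3, n_rows))

# 5-fold cross-validated RMSE for each candidate degree
scores = {}
for n in range(1, 5):
    cv = cross_validate(create_poly_model(n), demo, y, cv=5,
                        scoring="neg_root_mean_squared_error")
    scores[n] = -cv["test_score"].mean()

best_n = min(scores, key=scores.get)
print(scores)
print("best degree:", best_n)
```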

In [16]:
# Features matrix X and target values y

X = np.c_[cars.EngineSize, cars.Cylinders, cars.FuelType]
y = np.c_[cars.ConsumptionComb]

In [17]:
# Training and test set


In [18]:
# Pipeline to transform the features (PolynomialFeatures for numerical and OneHotEncoder for categorical features) and feed a LinearRegression regressor
# Use the template below
#
# Note: make sure to include the categories option in the OneHotEncoder as otherwise your cross validation will run into problems

def createModel(n):
    """
    Creates and returns pipeline for the model
    n = degree of polynomial of the PolynomialFeatures transformer
    """
    pass

![Pipeline](img/pipeline1.png "Visual representation of the pipeline")

In [19]:
# Use cross validation (using your training set) to find the best degree n of the polynomial of the PolynomialFeatures transformer
#

In [20]:
# Train your model with the optimal degree n using your training set


In [21]:
# Evaluation on the test set of the trained model
#
# Provide a plot of the predicted combined fuel consumption versus the actual combined fuel consumption


In [22]:
# Compute the root mean square error (RMSE) on the test set


In [23]:
# Plot on a graph the errors of each sample in the test set and add two horizontal lines showing +RMSE and -RMSE


### 4. Predicting the car fuel consumption as a function of car characteristics with KNN

The aim is to build a KNN regression model predicting `ConsumptionComb` as a function of the car characteristics `EngineSize`, `Cylinders`, `FuelType` and `Class`.

Recall that a KNN model requires scaling of the numerical features in order to bring them all onto a similar scale. In our case we will use the [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html).

To encode the categorical features `FuelType` and `Class` we will use one-hot encoding with [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).

Steps:
* Build the features matrix and target values vector
* Split the set into a training and a test set (you can use [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html))
* Build a [`Pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) which one-hot encodes the categorical features `FuelType` and `Class`, applies a `StandardScaler` transformer to the numerical features `EngineSize` and `Cylinders`, and then performs KNN regression with [`KNeighborsRegressor`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html).
* Using cross validation, find the best number K of nearest neighbors to use in the `KNeighborsRegressor`
* Build and train the model using your training set and the optimal number K of nearest neighbors
* Evaluate the trained model on your test set, providing graphs and computing the RMSE (you can use [`mean_squared_error`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html), but read the documentation carefully to make sure you actually obtain the RMSE and not the MSE, or build your own function)
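
The KNN variant of the steps above might look like the following sketch. As before, the function name `create_knn_model`, the category lists `FUEL_TYPES` and `CLASSES`, the candidate values of K and the made-up data are all assumptions chosen for illustration; the real class list in `cars` is longer and should be taken from the data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

FUEL_TYPES = ["D", "E", "X", "Z"]               # assumed fuel type codes
CLASSES = ["Compact", "Mid-size", "SUV: Small"]  # assumed (shortened) class list


def create_knn_model(k):
    """Pipeline: scaled numerical + one-hot categorical features -> KNN with k neighbors."""
    features = ColumnTransformer([
        ("scale", StandardScaler(), ["EngineSize", "Cylinders"]),
        ("onehot", OneHotEncoder(categories=[FUEL_TYPES, CLASSES]),
         ["FuelType", "Class"]),
    ])
    return make_pipeline(features, KNeighborsRegressor(n_neighbors=k))


# Made-up training data
rng = np.random.default_rng(0)
n_rows = 300
demo = pd.DataFrame({
    "EngineSize": rng.uniform(1.0, 6.0, n_rows),
    "Cylinders": rng.choice([4, 6, 8], n_rows),
    "FuelType": rng.choice(FUEL_TYPES, n_rows),
    "Class": rng.choice(CLASSES, n_rows),
})
y = 3.0 + 1.5 * demo["EngineSize"] + rng.normal(0.0, 0.5, n_rows)

# 5-fold cross-validated RMSE for each candidate number of neighbors
candidates = [1, 3, 5, 10, 20]
scores = {}
for k in candidates:
    cv = cross_validate(create_knn_model(k), demo, y, cv=5,
                        scoring="neg_root_mean_squared_error")
    scores[k] = -cv["test_score"].mean()

best_k = min(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```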

In [24]:
# Features matrix X and target values y

X = np.c_[cars.EngineSize, cars.Cylinders, cars.FuelType, cars.Class]
y = np.c_[cars.ConsumptionComb]

In [25]:
# Training and test set


In [26]:
# Pipeline to transform the features (StandardScaler for numerical and OneHotEncoder for categorical features) and feed a KNeighborsRegressor regressor
# Use the template below
#
# Note: make sure to include the categories option in the OneHotEncoder as otherwise your cross validation will run into problems

def createModel(n):
    """
    Creates and returns pipeline for the model
    n = number K of nearest neighbors of the KNeighborsRegressor
    """
    pass

![Pipeline](img/pipeline2.png "Visual representation of the pipeline")

In [27]:
# Use cross validation (using your training set) to find the number K of nearest neighbors to use in the KNeighborsRegressor
#

In [28]:
# Train your model with the optimal number K of nearest neighbors


In [29]:
# Evaluation on the test set of the trained model
#
# Provide a plot of the predicted combined fuel consumption versus the actual combined fuel consumption


In [30]:
# Compute the root mean square error (RMSE) on the test set
