# Modeling with Linear Regression

This modeling exercise comes from the Coursera course Machine Learning with Python. Here we explore simple linear regression.

## 1. Import relevant libraries

In [1]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

## 2. Load dataset

In [2]:
data_path = '/Users/danielchen/Desktop/Coding/Python/Coursera/Machine Learning Python/Data/FuelConsumptionCo2.csv'
data = pd.read_csv(data_path)
data

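| Unnamed: 0 | MODELYEAR | MAKE | MODEL | VEHICLECLASS | ENGINESIZE | CYLINDERS | TRANSMISSION | FUELTYPE | FUELCONSUMPTION_CITY | FUELCONSUMPTION_HWY | FUELCONSUMPTION_COMB | FUELCONSUMPTION_COMB_MPG | CO2EMISSIONS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2014 | ACURA | ILX | COMPACT | 2.0 | 4 | AS5 | Z | 9.9 | 6.7 | 8.5 | 33 | 196 |
| 1 | 2014 | ACURA | ILX | COMPACT | 2.4 | 4 | M6 | Z | 11.2 | 7.7 | 9.6 | 29 | 221 |
| 2 | 2014 | ACURA | ILX HYBRID | COMPACT | 1.5 | 4 | AV7 | Z | 6.0 | 5.8 | 5.9 | 48 | 136 |
| 3 | 2014 | ACURA | MDX 4WD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.7 | 9.1 | 11.1 | 25 | 255 |
| 4 | 2014 | ACURA | RDX AWD | SUV - SMALL | 3.5 | 6 | AS6 | Z | 12.1 | 8.7 | 10.6 | 27 | 244 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1062 | 2014 | VOLVO | XC60 AWD | SUV - SMALL | 3.0 | 6 | AS6 | X | 13.4 | 9.8 | 11.8 | 24 | 271 |
| 1063 | 2014 | VOLVO | XC60 AWD | SUV - SMALL | 3.2 | 6 | AS6 | X | 13.2 | 9.5 | 11.5 | 25 | 264 |
| 1064 | 2014 | VOLVO | XC70 AWD | SUV - SMALL | 3.0 | 6 | AS6 | X | 13.4 | 9.8 | 11.8 | 24 | 271 |
| 1065 | 2014 | VOLVO | XC70 AWD | SUV - SMALL | 3.2 | 6 | AS6 | X | 12.9 | 9.3 | 11.3 | 25 | 260 |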


## 3. Identify goals
The goal of this exercise is to fit a linear model that predicts carbon dioxide emissions from a single feature of the dataset.

### 3a. Carbon Dioxide as a function of engine size

First we'll need to split our data into training and testing sets.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(
    data.ENGINESIZE, 
    data.CO2EMISSIONS, 
    test_size=0.2, 
    random_state=23
)
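To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a hypothetical ten-row frame (the `toy` data is made up for illustration and is not the course dataset). It shows that 20% of the rows are held out and that features and targets keep matching row labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 10 rows, two columns.
toy = pd.DataFrame({
    "ENGINESIZE": np.linspace(1.0, 5.5, 10),
    "CO2EMISSIONS": np.linspace(150, 400, 10),
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.ENGINESIZE,
    toy.CO2EMISSIONS,
    test_size=0.2,    # 20% of the rows are held out for testing
    random_state=23,  # fixed seed so the split is reproducible
)

# Number of rows in the training and test sets.
print(len(X_tr), len(X_te))

# Features and targets carry the same row labels, so they stay aligned.
print(X_tr.index.equals(y_tr.index))
```

Because the split returns pandas objects with their original index, those row labels can be reused later to select other columns for the same observations.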

We'll then need to create an instance of the `LinearRegression` class from the `linear_model` module.

In [4]:
linreg = linear_model.LinearRegression()

Note that we'll need to reshape our data before fitting. The `X_train` object returned by `train_test_split` is a pandas Series.

In [5]:
type(X_train)

pandas.core.series.Series

We'll need to extract the underlying values and then reshape them so they're compatible with sklearn's linear regression model, which expects the features as a two-dimensional array. The attribute below returns a numpy array, which we can then call the `reshape` method on.

In [6]:
type(X_train.values)

numpy.ndarray
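As a small illustration of what `reshape(-1, 1)` does, the sketch below uses a hypothetical four-element Series in place of `X_train`. The `-1` asks NumPy to infer the number of rows, and the `1` requests a single column, giving the `(n_samples, n_features)` layout that scikit-learn estimators expect.

```python
import numpy as np
import pandas as pd

# Hypothetical Series standing in for X_train.
s = pd.Series([2.0, 2.4, 1.5, 3.5], name="ENGINESIZE")

one_d = s.values                 # 1-D array of the raw values
two_d = s.values.reshape(-1, 1)  # 2-D column; -1 lets NumPy infer the row count

print(one_d.shape)
print(two_d.shape)

# An alternative that avoids reshape: convert the Series to a one-column DataFrame.
frame = s.to_frame()
print(frame.shape)
```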

In [7]:
linreg.fit(X_train.values.reshape(-1, 1), y_train.values.reshape(-1, 1))

LinearRegression()

Let's return the coefficients of the model.

In [8]:
f'The coefficient of the model is {linreg.coef_[0][0]} and the intercept is {linreg.intercept_[0]}.'

'The coefficient of the model is 40.39907773043087 and the intercept is 121.20457424141983.'
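The coefficient and intercept define a straight line, so a prediction can be reproduced by hand. The sketch below copies the two numbers from the output above; the helper name `predict_co2` and the engine sizes passed to it are illustrative choices, not part of the original notebook.

```python
# Coefficient and intercept copied from the fitted model's output.
slope = 40.39907773043087
intercept = 121.20457424141983

def predict_co2(engine_size):
    """Predicted CO2 emissions for a given engine size, using the fitted line."""
    return intercept + slope * engine_size

# Hypothetical engine sizes chosen for illustration.
for size in (2.0, 3.5):
    print(size, round(predict_co2(size), 1))
```

This is the same arithmetic that `linreg.predict` performs for a single feature: intercept plus coefficient times the input.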

#### Make predictions

Now that we have a mathematical relationship to describe the model, we can make predictions using our test set.

In [9]:
y_predictions = linreg.predict(X_test.values.reshape(-1, 1))

#### Evaluate the model

Now that we have the predicted values, we can test how far off they are from the actual values in our data. We'll use the Mean Absolute Error here: the average absolute difference between each predicted data point and its corresponding actual value.

In [10]:
np.mean(np.absolute(y_predictions - y_test.values.reshape(-1, 1)))

23.592519183520775
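For comparison, scikit-learn provides `mean_absolute_error`, which should give the same number as the manual calculation. The sketch below uses hypothetical values rather than the notebook's data. It also shows why both arrays are reshaped to the same shape before subtracting: mixing an `(n, 1)` array with an `(n,)` array broadcasts to an `(n, n)` matrix.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical actual and predicted values, not taken from the dataset.
actual = np.array([196.0, 221.0, 136.0, 255.0])
predicted = np.array([202.0, 218.0, 182.0, 263.0])

# Manual computation: average of the absolute differences.
mae_manual = np.mean(np.abs(predicted - actual))

# The same quantity computed by scikit-learn.
mae_sklearn = mean_absolute_error(actual, predicted)

print(mae_manual, mae_sklearn)

# Shape pitfall: (n, 1) minus (n,) broadcasts to (n, n), so the mean would be wrong.
wrong = predicted.reshape(-1, 1) - actual
print(wrong.shape)
```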

### 3b. Carbon Dioxide as a function of fuel consumption

Engine size may not be the best predictor of carbon dioxide emissions. We can check whether there's a better one by fitting a model on another feature and comparing how close its predictions are to the actual values.

Since we already have the CO2 emissions saved in `y_train` and `y_test`, we only need new arrays for `X_train` and `X_test`. The new `X_train` will be used with `y_train` and the new `X_test` with the existing `y_test`.

In [11]:
X_train = np.asanyarray(data.loc[X_train.index.values, 'FUELCONSUMPTION_COMB']).reshape(-1, 1)

In [12]:
X_test = np.asanyarray(data.loc[X_test.index.values, 'FUELCONSUMPTION_COMB']).reshape(-1, 1)

The important point above is that the `FUELCONSUMPTION_COMB` values are selected using the row indices of the original `ENGINESIZE` split, so the new `X_train` and `X_test` contain the same observations as before. This lets us train against the same `y_train` outcomes and preserves our 80/20 train/test split.
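The index-reuse step can be sketched on a hypothetical ten-row frame (the values in `toy` are made up, and the names `X_tr_fuel` and `X_te_fuel` are illustrative). The row labels from the first split are used to select a second feature for the same observations.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature dataset reusing the notebook's column names.
toy = pd.DataFrame({
    "ENGINESIZE": [2.0, 2.4, 1.5, 3.5, 3.5, 3.0, 3.2, 3.0, 3.2, 2.0],
    "FUELCONSUMPTION_COMB": [8.5, 9.6, 5.9, 11.1, 10.6, 11.8, 11.5, 11.8, 11.3, 8.5],
    "CO2EMISSIONS": [196, 221, 136, 255, 244, 271, 264, 271, 260, 196],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy.ENGINESIZE, toy.CO2EMISSIONS, test_size=0.2, random_state=23
)

# Reuse the row labels from the first split to pull a different feature.
X_tr_fuel = toy.loc[X_tr.index, "FUELCONSUMPTION_COMB"].to_numpy().reshape(-1, 1)
X_te_fuel = toy.loc[X_te.index, "FUELCONSUMPTION_COMB"].to_numpy().reshape(-1, 1)

# The new features line up row for row with the existing targets.
print(X_tr_fuel.shape, y_tr.shape)
print(X_te_fuel.shape, y_te.shape)
```

An alternative design would be to split the whole DataFrame once and select columns from the resulting training and test frames, which avoids the index lookup entirely.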

In [13]:
linreg.fit(X_train, y_train)

LinearRegression()

Now we can predict again and evaluate. 

In [14]:
predicted_y_test = linreg.predict(X_test)

In [15]:
f'The mean absolute error of the new model is {np.mean(np.absolute(predicted_y_test - y_test))}'

'The mean absolute error of the new model is 19.5575279644988'

We can also evaluate our model with R-squared. Generally speaking, R-squared measures how well the model fits the data: it is the proportion of the variance in the outcome that the model explains.

In [16]:
(f'The R-squared score of our model of carbon dioxide as a function of fuel'
f' consumption is {r2_score(y_true=y_test, y_pred=predicted_y_test)}')

'The R-squared score of our model of carbon dioxide as a function of fuel consumption is 0.7845783391764188'

In [17]:
(f'The R-squared score of our model of carbon dioxide as a function of engine size is'
f' {r2_score(y_true=y_test, y_pred=y_predictions)}')

'The R-squared score of our model of carbon dioxide as a function of engine size is 0.7263801311946887'

We can see that the mean absolute error of the second model is lower (about 19.6 versus 23.6) and its R-squared value is higher (about 0.78 versus 0.73).
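For reference, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses hypothetical values (not the notebook's data) and checks the manual result against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical values for illustration only.
y_true = np.array([196.0, 221.0, 136.0, 255.0, 244.0])
y_pred = np.array([202.0, 218.0, 150.0, 263.0, 240.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A model that always predicted the mean of `y_true` would score 0 under this definition, and a model with no residual error would score 1.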

## 4. K-fold Cross Validation

In [18]:
from sklearn.model_selection import cross_val_score
lr = linear_model.LinearRegression()

In [19]:
cross_val_score(
    estimator=lr, 
    X=data.FUELCONSUMPTION_COMB.values.reshape(-1, 1),
    y=data.CO2EMISSIONS.values.reshape(-1, 1), 
    cv=10,
    scoring='r2'
)

array([0.85929272, 0.75859349, 0.59844487, 0.6934474 , 0.71552105,
       0.81053574, 0.82156464, 0.87783855, 0.83546748, 0.90312981])

In [20]:
cross_val_score(
    estimator=lr, 
    X=data.ENGINESIZE.values.reshape(-1, 1), 
    y=data.CO2EMISSIONS.values.reshape(-1, 1), 
    cv=10,
    scoring='r2'
)

array([0.76272315, 0.71586712, 0.7257285 , 0.81913386, 0.74170318,
       0.69006264, 0.66620099, 0.77980502, 0.83251248, 0.69577451])

We pass `scoring='r2'` explicitly here, although R-squared is also the default score for `LinearRegression`. By this metric, the model using fuel consumption scores higher on average across the ten folds than the model using engine size, although it does not score higher on every fold.

This makes sense intuitively given the fact that the more fuel you use, the more carbon dioxide will be produced. 
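To compare the two sets of fold scores more directly, one option is to summarize each by its mean and standard deviation. The sketch below copies the scores from the two `cross_val_score` outputs above; the variable names are illustrative.

```python
import numpy as np

# Fold scores copied from the two cross_val_score outputs.
fuel_scores = np.array([0.85929272, 0.75859349, 0.59844487, 0.6934474, 0.71552105,
                        0.81053574, 0.82156464, 0.87783855, 0.83546748, 0.90312981])
engine_scores = np.array([0.76272315, 0.71586712, 0.7257285, 0.81913386, 0.74170318,
                          0.69006264, 0.66620099, 0.77980502, 0.83251248, 0.69577451])

for name, scores in [("fuel consumption", fuel_scores), ("engine size", engine_scores)]:
    print(f"{name}: mean R-squared = {scores.mean():.3f}, std = {scores.std():.3f}")
```

Fuel consumption has the higher mean score, but its scores also vary more from fold to fold than those for engine size.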

Even so, we should be wary of very high R-squared values. A score that is close to perfect on the data used for training can mean the model is overfitted: it has learned that particular data set too closely and may perform worse on new data.

K-fold cross validation addresses the fact that an evaluation result depends on which rows happen to land in the training and test sets. It divides the data into k folds, trains on all but one of them, tests on the remaining fold, and repeats until each fold has served as the test set once. As the arrays above show, the scores vary from fold to fold.
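The procedure can be written out by hand with `KFold`. This is a sketch on synthetic data (the generated `X` and `y` are assumptions for illustration, not the fuel consumption dataset) that fits one model per fold and then checks the scores against `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (an assumption for illustration): a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 6.0, size=(100, 1))
y = 120.0 + 40.0 * X[:, 0] + rng.normal(0.0, 20.0, size=100)

kfold = KFold(n_splits=5)  # consecutive folds, no shuffling
scores = []
for train_idx, test_idx in kfold.split(X):
    # Fit on the training folds, score on the single held-out fold.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(np.round(scores, 3))

# For a regressor, an integer cv in cross_val_score uses the same unshuffled KFold.
auto_scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(np.allclose(scores, auto_scores))
```

Because the folds are consecutive blocks of rows, the order of the data matters. The table in section 2 appears to be ordered by make, so each fold may group similar vehicles together; passing `KFold(n_splits=10, shuffle=True, random_state=...)` as `cv` would be one way to test whether that ordering affects the scores.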

### 4a. `cross_validate`

An alternative to `cross_val_score` is `cross_validate`. It still lets us specify the number of folds, but it accepts multiple scoring metrics and returns a dictionary of arrays (fit times, score times, and test scores, plus training scores when `return_train_score=True`) rather than a single array of scores.


In [21]:
from sklearn.model_selection import cross_validate

In [22]:
cross_validation_scores = cross_validate(
    estimator=lr,
    X=data.FUELCONSUMPTION_COMB.values.reshape(-1, 1),
    y=data.CO2EMISSIONS.values.reshape(-1, 1),
    cv=10,
    return_train_score=True
)

In [23]:
cross_validation_scores.keys()

dict_keys(['fit_time', 'score_time', 'test_score', 'train_score'])

In [24]:
cross_validation_scores['test_score']

array([0.85929272, 0.75859349, 0.59844487, 0.6934474 , 0.71552105,
       0.81053574, 0.82156464, 0.87783855, 0.83546748, 0.90312981])
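The call above uses the estimator's default score. To request several metrics at once, `cross_validate` accepts a list of scorer names. The sketch below uses synthetic data (an assumption for illustration, not the fuel consumption dataset) and asks for both R-squared and mean absolute error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data (an assumption for illustration), not the fuel consumption dataset.
rng = np.random.default_rng(0)
X = rng.uniform(4.0, 20.0, size=(100, 1))
y = 70.0 + 16.0 * X[:, 0] + rng.normal(0.0, 25.0, size=100)

results = cross_validate(
    estimator=LinearRegression(),
    X=X,
    y=y,
    cv=5,
    scoring=["r2", "neg_mean_absolute_error"],  # several metrics at once
)

# With multiple metrics, the result keys are suffixed with the metric name.
print(sorted(results.keys()))

# scikit-learn negates error metrics so that higher is always better; flip the sign back.
print(-results["test_neg_mean_absolute_error"].mean())
```

With a list of metrics, the single `test_score` key is replaced by one key per metric, such as `test_r2`.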