[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/databyjp/AcademyXI_DA/blob/main/notebooks/AcademyXi_DA_Module_8_data_analysis_2.ipynb)

## AcademyXi Data Analysis - Data Analysis 2: 
### Workshop - Regression with Python
In this workshop module, we will show you examples of how Python can be used to build regression models. 

As we mentioned before, regression analysis and machine learning are each large, complex topics. We also appreciate that you will have varying degrees of familiarity with Python and its syntax, which we have not covered in this course.

**So treat this section as an introductory demonstration of what is possible with these techniques, rather than something you need to reproduce.**


Let's begin by importing a few libraries and modules to be used.

In [1]:
%%capture tmp
# The %%capture cell magic must be the first line of the cell
# Install additional libraries required (fsspec and s3fs) to load files through AWS S3
!pip install fsspec s3fs

# Import libraries to be used
import pandas as pd
import numpy as np
import plotly.express as px

## Linear regression with scikit-learn

[Scikit-learn](https://scikit-learn.org/stable/user_guide.html) may be the most commonly used machine learning library for Python. It enables the user to build all manner of machine learning models, from linear models to decision trees and even neural network models.

Its relative simplicity and comprehensive documentation make it not only a great tool but also a great learning resource.

Here, we will use it to develop a few simple linear regression models.

In [2]:
# Load data from S3
df = pd.read_csv("s3://databyjp/academyxi/auto-mpg.csv")

In [None]:
df.head()

We'll use the [auto-mpg dataset](https://archive.ics.uci.edu/ml/datasets/auto+mpg) again, which you have seen before.

The dataset includes a number of parameters of cars, which can be used to predict their fuel efficiency (mpg) characteristics. 

### My first regression model


To get started, let's build a regression model to predict fuel efficiency (mpg) from engine size (displacement).

First, we'll briefly visually inspect the data.

In [None]:
import plotly.express as px
fig = px.scatter(df, x="displacement", y="mpg")
fig.show()

The relationship here between displacement and mpg looks somewhat linear, so let's fit a simple [linear regression model](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html).

In [5]:
from sklearn import linear_model
reg = linear_model.LinearRegression()

Here, we create a linear regression model with its default settings. The `fit_intercept` argument defaults to `True`, so the model will estimate an intercept rather than force the line through zero.
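To illustrate what the `fit_intercept` argument controls, here is a minimal sketch on made-up data. The arrays and the names `with_intercept` and `no_intercept` are assumptions for illustration only and are not part of the workshop dataset.

```python
import numpy as np
from sklearn import linear_model

# Illustrative toy data (not from auto-mpg): y = 2 * x + 10 exactly
x = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2 * x.ravel() + 10

# Default behaviour: fit_intercept=True, so the intercept is estimated from the data
with_intercept = linear_model.LinearRegression().fit(x, y)

# fit_intercept=False forces the fitted line through the origin
no_intercept = linear_model.LinearRegression(fit_intercept=False).fit(x, y)

print(with_intercept.coef_, with_intercept.intercept_)
print(no_intercept.coef_, no_intercept.intercept_)
```

With an intercept, the model recovers the slope and offset used to build the data. Without one, the slope has to absorb the offset, so the line fits these points less well.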

In [None]:
reg.fit(df[["displacement"]], df["mpg"])

Here we provide the data used to fit the model: the "displacement" column as the independent variable and the "mpg" column as the dependent (target) variable. We can then print the parameters of the model:

In [None]:
print(reg.coef_, reg.intercept_)

The `reg.coef_` and `reg.intercept_` outputs are the coefficient(s) and intercept of the regression model, respectively.

In other words, the model predicts mpg as:

`reg.coef_` * x + `reg.intercept_`; that is:

-0.06 * (displacement) + 35.17 = (mpg)

Instead of applying this algebra manually, we can also use the model to make predictions:

In [None]:
reg.predict([[200]])

In other words, this model predicts that a 200 cubic inch (about 3.3 litres) engine will have a fuel efficiency of about 23 miles per gallon.

This seems a reasonable estimate based on the graph above and a cursory Google search on engine mileage.
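As a quick check that `predict` applies the same algebra as the formula above, the sketch below fits a model on made-up numbers and compares the two results. The data and the names `reg_demo`, `manual` and `predicted` are hypothetical and only for illustration.

```python
import numpy as np
from sklearn import linear_model

# Illustrative toy data (not auto-mpg values)
x = np.array([[100.0], [200.0], [300.0], [400.0]])
y = np.array([30.0, 24.0, 17.0, 12.0])

reg_demo = linear_model.LinearRegression().fit(x, y)

# A single new observation, shaped as one row with one feature
new_x = np.array([[250.0]])

# Manual calculation: coefficient * x + intercept
manual = reg_demo.coef_[0] * new_x[0, 0] + reg_demo.intercept_

# The same calculation done by the model
predicted = reg_demo.predict(new_x)[0]

print(manual, predicted)
```

The two printed values should agree up to floating-point rounding.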


#### How well does it fit the data?

We discussed the r-squared value earlier as a measure of model fit. We can obtain the r-squared value for our model as below:

In [None]:
reg.score(df[["displacement"]], df["mpg"])

Although this isn't awful, it's not a great score. Let's see how we might improve upon this.
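For reference, the value returned by `score` for a regression model is the r-squared value, which can also be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up data, and the names `reg_demo`, `r2_manual` and `r2_sklearn` are assumptions for illustration.

```python
import numpy as np
from sklearn import linear_model

# Illustrative toy data (not auto-mpg values)
x = np.array([[100.0], [200.0], [300.0], [400.0], [500.0]])
y = np.array([31.0, 24.0, 19.0, 13.0, 12.0])

reg_demo = linear_model.LinearRegression().fit(x, y)
y_pred = reg_demo.predict(x)

# Residual sum of squares: variation left unexplained by the model
ss_res = np.sum((y - y_pred) ** 2)
# Total sum of squares: variation of y around its mean
ss_tot = np.sum((y - y.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = reg_demo.score(x, y)

print(r2_manual, r2_sklearn)
```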

### Multiple regression model

The dataset also provides additional information for each car, such as the number of cylinders, weight and acceleration. We can add any number of these to our model, creating a multiple regression model.

For instance, we can add the weight parameter to our model:

In [None]:
reg.fit(df[["displacement", "weight"]], df["mpg"])
print(reg.coef_, reg.intercept_)
print(reg.score(df[["displacement", "weight"]], df["mpg"]))

Immediately we see that the fit has improved (the r-squared value is larger), which makes intuitive sense given that we are using additional information for our prediction.

To see the impact of weight on our prediction, let's use the same 200 cubic inch engine size, but vary the weight from 1000 to 3000 pounds:

In [None]:
# Predict mpg values at [displacement, weight] values pairs - e.g. [200 cubic inches, 1000 pounds]
reg.predict([[200, 1000], [200, 2000], [200, 3000]])  

You can now see how the predictions change as a function of weight. Again, this makes intuitive sense: the heavier the car, the more fuel it takes to move.
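When a model has been fitted on a DataFrame, recent scikit-learn versions may warn about missing feature names if `predict` is given a plain list. One way to avoid this, and to make the column order explicit, is to pass a DataFrame with the same column names. The sketch below assumes made-up values, and the names `toy_df`, `toy_reg` and `new_cars` are hypothetical.

```python
import pandas as pd
from sklearn import linear_model

# Hypothetical toy values for illustration only (not rows from auto-mpg)
toy_df = pd.DataFrame({
    "displacement": [100, 150, 200, 300, 400],
    "weight": [2000, 2400, 3000, 3800, 4500],
    "mpg": [32.0, 27.0, 22.0, 16.0, 12.0],
})

features = ["displacement", "weight"]
toy_reg = linear_model.LinearRegression().fit(toy_df[features], toy_df["mpg"])

# New inputs as a DataFrame with the same column names and order used in fit
new_cars = pd.DataFrame({
    "displacement": [200, 200, 200],
    "weight": [1000, 2000, 3000],
})
toy_preds = toy_reg.predict(new_cars[features])

# Show each input row next to its prediction
print(new_cars.assign(predicted_mpg=toy_preds))
```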

As we add more relevant variables to the model, its fit to this data will continue to improve:

In [None]:
reg.fit(df[["cylinders", "displacement", "acceleration", "weight"]], df["mpg"])
print(reg.coef_, reg.intercept_)
print(reg.score(df[["cylinders", "displacement", "acceleration", "weight"]], df["mpg"]))
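One caveat worth keeping in mind: for a linear regression with an intercept, the r-squared value measured on the same data used for fitting cannot decrease when a variable is added, even if that variable is unrelated to the target. The sketch below illustrates this with synthetic data. All values and names (`x1`, `noise`, `r2_base`, `r2_with_noise`) are assumptions for illustration.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data: y depends on x1 only; "noise" is unrelated to y by construction
rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, size=n)
y = 3 * x1 + rng.normal(0, 2, size=n)
noise = rng.normal(size=n)

X_base = x1.reshape(-1, 1)
X_with_noise = np.column_stack([x1, noise])

# Fit and score each model on the same data it was fitted to
r2_base = linear_model.LinearRegression().fit(X_base, y).score(X_base, y)
r2_with_noise = (
    linear_model.LinearRegression().fit(X_with_noise, y).score(X_with_noise, y)
)

print(r2_base, r2_with_noise)
```

A small increase here does not mean the extra column helps. Checking the score on data that was not used for fitting is a common way to tell the difference.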

#### Dealing with non-linearities

Another way to improve model performance is to use a different model type. Take a look at this graph again:

In [None]:
fig = px.scatter(df, x="displacement", y="mpg")
fig.show()

There is clearly a relationship between displacement and mpg here, but it isn't consistent across the range of displacements.

The mpg value appears to change faster at low displacements and to be relatively stable at high displacements. This suggests that there may be better fits available than a straight line.

We've seen earlier that non-linear models can be used in these situations. Another approach is to transform one or more of the variables while still using a linear model. Because the model is linear in the transformed variables, it effectively captures a non-linear relationship between the original ones.

In this case, let's try transforming our target variable so that instead of using miles per gallon, we'll use gallons per mile:

In [None]:
df = df.assign(gpm=1 / df["mpg"])
fig = px.scatter(df, x="displacement", y="gpm")
fig.show()

Although the values are widely spread for high-displacement vehicles, this relationship does look more linear overall. Let's fit a simple linear regression model again.



In [None]:
reg.fit(df[["displacement"]], df["gpm"])
print(reg.score(df[["displacement"]], df["gpm"]))

The r-squared value bears out our intuition: it is much higher than before. When we introduce the additional variables, the r-squared value becomes even higher, indicating an improved fit.

In [None]:
reg.fit(df[["cylinders", "displacement", "acceleration", "weight"]], df["gpm"])
print(reg.coef_, reg.intercept_)
print(reg.score(df[["cylinders", "displacement", "acceleration", "weight"]], df["gpm"]))

Predictions can be made as before:


In [None]:
# Values must follow the column order used in fit: [cylinders, displacement, acceleration, weight]
gpm_pred = reg.predict([[6, 200, 10, 2000]])

Keep in mind that the resulting value is now in gallons per mile, which can be inverted to produce miles per gallon.

In [None]:
mpg_pred = 1 / gpm_pred
print(mpg_pred)
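Because the two models predict different targets (mpg and gallons per mile), their r-squared values are not measured on the same scale. One way to compare them more directly is to invert the gallons-per-mile predictions and score them against the original mpg values using `sklearn.metrics.r2_score`. The sketch below uses synthetic data generated for illustration, so the numbers it prints say nothing about the auto-mpg dataset. All names in it are hypothetical.

```python
import numpy as np
from sklearn import linear_model
from sklearn.metrics import r2_score

# Synthetic data: gallons per mile is roughly linear in displacement
rng = np.random.default_rng(0)
disp = rng.uniform(70, 450, size=100)
gpm = 0.02 + 0.0001 * disp + rng.normal(0, 0.003, size=100)
mpg = 1 / gpm
X = disp.reshape(-1, 1)

# Model A: predict mpg directly
direct = linear_model.LinearRegression().fit(X, mpg)
r2_direct = direct.score(X, mpg)

# Model B: predict gpm, then invert the predictions back to mpg
inverse = linear_model.LinearRegression().fit(X, gpm)
mpg_from_gpm = 1 / inverse.predict(X)
r2_back_transformed = r2_score(mpg, mpg_from_gpm)

# Both scores are now measured against the same mpg values
print(r2_direct, r2_back_transformed)
```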

We hope that the above gives you an idea of how basic regression models can be built with Python using the scikit-learn library. 

Linear regression is a very useful technique to understand, but there is much more you can do with scikit-learn. If you are interested, here are some reading materials to get you started:

#### Further reading
- [An introduction to machine learning with scikit-learn](https://scikit-learn.org/stable/tutorial/basic/tutorial.html)
- [Linear Regression Example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html)
- [Choosing the right estimator](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)