<a href="https://colab.research.google.com/github/michalis0/Business-Intelligence-and-Analytics/blob/master/7%20-%20Regression/Exercise/Exercise_Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Exercise: Supervised Learning, Linear Regression

This exercise applies what you learned in the walkthrough. The following cell gathers the modules you need for this exercise (take a look at the sklearn library).

Some exercises ask you to fill in part of the code rather than write all of it. Replace each `'YOUR CODE HERE'` placeholder with your own code.

In [None]:
# Useful starting lines
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import matplotlib.pyplot as plt
#from matplotlib import collections  as mc
import pandas as pd 
import seaborn as sns
sns.set_style("darkgrid")

import warnings
warnings.filterwarnings('ignore')

# Sklearn import
from sklearn.preprocessing import MinMaxScaler # Normalization
from sklearn.linear_model import LinearRegression # Regression linear model
from sklearn.model_selection import train_test_split # Splitting the data set
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error # Metrics for errors
from sklearn.model_selection import KFold # Cross validation



## 1. Load the data
We are going to use an advertising dataset. The task is to figure out how different advertising channels influence the sales of a product.
     
Load the dataset from the given URL into a pandas DataFrame. Then display the first 5 rows. How many observations and columns do we have? Hint: use the `shape` attribute.

In [None]:
url = 'https://media.githubusercontent.com/media/michalis0/Business-Intelligence-and-Analytics/master/data/Advertising.csv'
# Load the data
ad_df = 'YOUR CODE HERE'
display('YOUR CODE HERE')

# Observations and columns (dimensions)
print("Number of observations", 'YOUR CODE HERE')
print("Number of dimensions", 'YOUR CODE HERE')
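
As a minimal sketch of the loading pattern, the snippet below reads a tiny inline CSV instead of the real URL. The column names mirror the exercise, but the values in `csv_text` are made up for illustration and `toy_df` is a hypothetical name.

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for a remote file (values are made up)
csv_text = """TV,Radio,Newspaper,Sales
100.0,20.0,30.0,15.0
50.0,10.0,5.0,8.0
10.0,40.0,60.0,6.0
"""

# pd.read_csv accepts a URL, a local path, or a file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default (here there are only 3)
print(toy_df.head())

# shape is a (rows, columns) tuple
n_rows, n_cols = toy_df.shape
print("Number of observations", n_rows)
print("Number of dimensions", n_cols)
```

The same call works unchanged when a URL string is passed in place of the `StringIO` object.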


In what follows we will try a simple linear regression using only one feature (univariate regression), that is, we want to predict `Sales` using only the `TV` advertisements.

To get a first sense of the relationship between the different variables, display the correlation table.

In [None]:
# Display the correlation table of ['Sales', 'TV', 'Radio', 'Newspaper']
'YOUR CODE HERE'
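
For reference, a correlation table can be obtained with `DataFrame.corr()`, which computes pairwise Pearson correlations by default. The sketch below uses a made-up frame with hypothetical columns `a`, `b`, and `c` so that the expected values are easy to check by eye.

```python
import pandas as pd

# Toy data (made up): b rises with a, c falls as a rises
toy_df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
})

# Select the columns of interest, then compute pairwise Pearson correlations
corr_table = toy_df[["a", "b", "c"]].corr()
print(corr_table)
```

Values close to 1 or -1 indicate a strong linear relationship; values close to 0 indicate a weak one.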

## 2. Using Sklearn
When using sklearn we don't need to add a column of ones to the data to obtain the constant parameter. Sklearn takes care of it: just set the `fit_intercept` argument to `True` (which is also its default value).
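
To illustrate the point about the column of ones, here is a small sketch on synthetic data (the data, seed, and variable names are assumptions, not part of the exercise). It fits the same line twice, once with `LinearRegression(fit_intercept=True)` and once by solving least squares on a matrix with an explicit column of ones, and the two sets of estimates should agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 3 + 0.5 * x + small noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(50, 1))
y = 3.0 + 0.5 * x[:, 0] + rng.normal(0, 0.1, size=50)

# sklearn handles the intercept internally
model = LinearRegression(fit_intercept=True).fit(x, y)

# Manual version: prepend a column of ones and solve ordinary least squares
x_ones = np.hstack([np.ones((50, 1)), x])
beta, *_ = np.linalg.lstsq(x_ones, y, rcond=None)

print("sklearn:", model.intercept_, model.coef_[0])
print("manual: ", beta[0], beta[1])
```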

1. From the advertising dataset, save the feature `TV` and the target `Sales` in two separate variables, X and y respectively, each as a pandas DataFrame rather than a Series (`X[['sth']]` instead of `X['sth']`).
2. Split the data into a train and a test set. The test set size should be 20% of the original data. Additionally, set the `random_state` to 0 and `shuffle` to `True`.
3. Create a new Linear model from the `LinearRegression` module of sklearn. Make sure it includes an intercept. Fit the model with the corresponding data.  
4. Print the values of the slope and the constant.  
5. Predict the sales for the TV value at row position 12 of X (hint: `iloc[[12]]`) with your model (i.e. `.predict()`) and compare it with the true value from y.
6. Compute the r2, MAE, and MSE.
7. Plot the regression.
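
The difference between single and double brackets in step 1, and the effect of `test_size` in step 2, can be sketched on a toy frame. The frame and the names `toy_df`, `feature`, and `target` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows (made up)
toy_df = pd.DataFrame({"feature": range(10), "target": range(0, 20, 2)})

as_series = toy_df["feature"]    # single brackets: 1-D Series
as_frame = toy_df[["feature"]]   # double brackets: 2-D DataFrame
print(as_series.shape, as_frame.shape)

# 20% of the rows go to the test set; random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    as_frame, toy_df[["target"]], test_size=0.2, random_state=0, shuffle=True
)
print(X_tr.shape, X_te.shape)

# A list inside iloc keeps the result 2-D, which is what .predict() expects
one_row = as_frame.iloc[[3]]
print(one_row.shape)
```

Sklearn estimators expect a 2-D feature matrix, which is why the DataFrame form is requested for X.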

In [None]:
# 1) Use the original dataframe
X = 'YOUR CODE HERE'
y = 'YOUR CODE HERE'

In [None]:
# 2) Do the train test split
X_train, X_test, y_train, y_test = 'YOUR CODE HERE'

We don't normalize the data here, but this is how it would be done:
```
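scaler = MinMaxScaler()
scaler.fit(X_train)  # fit on the training set only, so no test information leaks in
# Keep the original column names and index when rebuilding the DataFrames
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)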
```

In [None]:
# 3) Create the linear model
LR = 'YOUR CODE HERE'

# Fit the model using the training data (X_train and y_train)
'YOUR CODE HERE'

In [None]:
# 4) Model output
print("Slope: %.4f" % 'YOUR CODE HERE')
print("Constant (intercept): %.4f" % 'YOUR CODE HERE')

In [None]:
# 5) Predict the Sales 
print("y_pred: %.4f" % 'YOUR CODE HERE',  "y_true: ", 'YOUR CODE HERE')

In [None]:
# 6) First you have to make the predictions for the test set
prediction = 'YOUR CODE HERE'

# r2, MAE, and MSE 
print('R^2: %.2f' % 'YOUR CODE HERE')
print('Mean absolute error: %.2f' % 'YOUR CODE HERE')
print('Mean squared error: %.2f' % 'YOUR CODE HERE')
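
As a reminder of what the three metrics measure, the sketch below evaluates them on four hand-picked values (made up, not taken from the dataset) where the results can be verified by hand: the residuals are 1, 0, -1, 0, so MAE = 2/4, MSE = 2/4, and R^2 = 1 - 2/20.

```python
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Hand-picked toy values (made up)
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 9]

# All three functions take the true values first, then the predictions
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

print('R^2: %.2f' % r2)                   # 0.90
print('Mean absolute error: %.2f' % mae)  # 0.50
print('Mean squared error: %.2f' % mse)   # 0.50
```

Note that the argument order matters for `r2_score`: swapping the true and predicted values generally gives a different result.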

In [None]:
# 7) Plot of the regression
plt.scatter('YOUR CODE HERE')
plt.plot('YOUR CODE HERE')
plt.title('Sales Predicted by TV Ads')
plt.xlabel('TV')
plt.ylabel('Sales')
plt.show()
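
One possible way to draw a fitted line over a scatter plot is sketched below on synthetic data (the data, seed, and names are assumptions). The observations are drawn with `plt.scatter` and the model's predictions with `plt.plot`, after sorting by the feature so the line is drawn from left to right.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 5 + 0.1 * x + noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=(40, 1))
y = 5.0 + 0.1 * x[:, 0] + rng.normal(0, 1.0, size=40)

model = LinearRegression().fit(x, y)

# Sort by the feature so the line is drawn in order
order = np.argsort(x[:, 0])

plt.figure()
plt.scatter(x[:, 0], y, label="observations")
plt.plot(x[order, 0], model.predict(x[order]), color="red", label="fitted line")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
```

With `%matplotlib inline` the figure is rendered at the end of the cell; in a script, add `plt.show()`.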

When you use this single-variate model, you can simply switch the features (TV, Radio, Newspaper) to see which predicts the target variable (Sales) the best. Which feature predicts the sales best?

**Hint:** Simply change the feature variable (X) and re-run the cells above. Then compare the evaluation metrics (r2, MAE and MSE).
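
Instead of re-running the cells by hand, the comparison can also be written as a loop. The sketch below does this on synthetic data with two hypothetical features, `f1` and `f2`, where `f1` is constructed to carry most of the signal; it is an illustration of the pattern, not the answer for the advertising data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends strongly on f1 and weakly on f2
rng = np.random.default_rng(2)
toy_df = pd.DataFrame({"f1": rng.normal(size=200), "f2": rng.normal(size=200)})
toy_df["target"] = 4.0 * toy_df["f1"] + 0.5 * toy_df["f2"] + rng.normal(0, 0.5, size=200)

scores = {}
for name in ["f1", "f2"]:
    # Same split settings for every feature so the scores are comparable
    X_tr, X_te, y_tr, y_te = train_test_split(
        toy_df[[name]], toy_df["target"], test_size=0.2, random_state=0, shuffle=True
    )
    model = LinearRegression().fit(X_tr, y_tr)
    scores[name] = r2_score(y_te, model.predict(X_te))

best = max(scores, key=scores.get)
print(scores)
print("Best single feature:", best)
```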

## 3. Using more features for prediction
Let's try to use more features to predict the sales. For example, we can observe the effect of TV and Radio advertisements at the same time.

1. From the advertising dataset, save the features `TV` and `Radio` into X and the target `Sales` into y, each as a pandas DataFrame (`X[['sth']]` instead of `X['sth']`).
2. Split the data into a train and a test set. The test set size should be 20% of the original data. Additionally, set the `random_state` to 0 and `shuffle` to `True`.
3. Create a new Linear model from the `LinearRegression` module of sklearn. Fit the model with the corresponding data.
4. Print the parameters of the slope and the constant (intercept).
5. Predict the sales for the row at position 12 of X (hint: `iloc[[12]]`) with your model (i.e. `.predict()`) and compare it with the true value from y.  
6. Compute the r2 (`r2_score`), MAE (`mean_absolute_error`), and MSE (`mean_squared_error`).
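
With several features, `coef_` holds one slope per column, in column order, while `intercept_` remains a single constant. The sketch below shows this on noise-free synthetic data (the columns `a` and `b` and the coefficients 2, 3, and 1 are made up), so the fit should recover the coefficients almost exactly.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Noise-free synthetic data: y = 1 + 2 * a + 3 * b
rng = np.random.default_rng(3)
X_toy = pd.DataFrame({"a": rng.uniform(0, 10, 30), "b": rng.uniform(0, 10, 30)})
y_toy = 1.0 + 2.0 * X_toy["a"] + 3.0 * X_toy["b"]

model = LinearRegression().fit(X_toy, y_toy)
print("Parameters: ", model.coef_)       # one slope per column, in column order
print("Intercept: ", model.intercept_)

# Predict for a single row; iloc[[12]] keeps the row as a 2-D DataFrame
row = X_toy.iloc[[12]]
y_hat = model.predict(row)[0]
print("y_pred: %.4f" % y_hat, "y_true: ", y_toy.iloc[12])
```

Here y is a Series, so `coef_` is a 1-D array; when y is passed as a one-column DataFrame, `coef_` has shape `(1, n_features)` and `intercept_` is a length-1 array.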



In [None]:
# 1) Use the original dataframe
X = 'YOUR CODE HERE'
y = 'YOUR CODE HERE'

In [None]:
# 2) Do the train test split
X_train, X_test, y_train, y_test = 'YOUR CODE HERE'

We don't normalize the data here, but this is how it would be done:
```
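scaler = MinMaxScaler()
scaler.fit(X_train)  # fit on the training set only, so no test information leaks in
# Keep the original column names and index when rebuilding the DataFrames
X_train = pd.DataFrame(scaler.transform(X_train), columns=X_train.columns, index=X_train.index)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns, index=X_test.index)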
```

In [None]:
# 3) Create the linear model
LR = 'YOUR CODE HERE'

# Fit the model using the training data (X_train and y_train)
'YOUR CODE HERE'

In [None]:
# 4) Parameters and Intercept
print("Parameters: ", 'YOUR CODE HERE')
print("Intercept: ", 'YOUR CODE HERE')

In [None]:
# 5) Predict the sales
print("y_pred: %.4f" % 'YOUR CODE HERE',  "y_true: ", 'YOUR CODE HERE')

In [None]:
# 6) First you have to make the predictions for the test set
prediction = 'YOUR CODE HERE'

# r2, MAE, and MSE 
print('R^2: %.2f' % 'YOUR CODE HERE')
print('Mean absolute error: %.2f' % 'YOUR CODE HERE')
print('Mean squared error: %.2f' % 'YOUR CODE HERE')

Did the model metrics (r2, MAE, MSE) improve significantly when `Radio` was added to the features to predict `Sales`? Did the model become "better"?

In [None]:
# Compute by how much each of the metrics changed in comparison to the univariate model
'YOUR CODE HERE'
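
One simple way to quantify the change is the relative difference in percent. The sketch below uses a hypothetical helper `relative_change` and placeholder metric values that are not results from the advertising data; substitute the numbers you obtained.

```python
def relative_change(old, new):
    """Percentage change from old to new, relative to the magnitude of old."""
    return (new - old) / abs(old) * 100.0

# Placeholder values (made up); replace with your own metrics
single = {"r2": 0.50, "mae": 2.0, "mse": 8.0}
multi = {"r2": 0.75, "mae": 1.5, "mse": 4.0}

changes = {name: relative_change(single[name], multi[name]) for name in single}
for name, pct in changes.items():
    print("%s changed by %+.1f%%" % (name, pct))
```

Keep the direction in mind when reading the output: a better model has a higher r2 but a lower MAE and MSE.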

## 4. Using all features
Redo steps 1-6 with `TV`, `Radio`, and `Newspaper` as features predicting `Sales`.

In [None]:
# Find the features and the target
'YOUR CODE HERE'
# Split the data
'YOUR CODE HERE'
# The scaling would take place here
# Create and fit the linear regression model
'YOUR CODE HERE'
# Display the various parameters
'YOUR CODE HERE'
# Make a prediction for the observation at position 12
'YOUR CODE HERE'
# Compute the model metrics 
'YOUR CODE HERE'