# Lesson 2: Linear Regression

## Regression vs. Classification

Regression algorithms predict a continuous value from the input variables. The goal of a regression problem is to estimate a mapping function from the input variables to a continuous output variable. Classification algorithms instead approximate a mapping function from the input variables to discrete output variables, which can be labels or categories.


In [None]:
from IPython import display
display.Image("Image/ClassificationVSRegression.png")

### Regression metrics (for testing)

More details at https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Common metrics include:
- Mean absolute error (MAE)
- Mean squared error (MSE)
- Root mean squared error (RMSE)
- Root mean squared logarithmic error (RMSLE)
- Mean percentage error (MPE)
- Mean absolute percentage error (MAPE)
- R-squared ($R^2$)
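
The metrics listed above can be sketched on a few toy values. The numbers below are made up for illustration and do not come from the lesson's datasets. MAE, MSE and $R^2$ use scikit-learn functions, while RMSE, RMSLE, MPE and MAPE are written out with NumPy. The sign convention used for MPE (true minus predicted) is an assumption, since conventions vary.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values chosen for illustration only (not from the lesson's datasets)
y_true_toy = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_toy = np.array([2.5, 5.0, 4.0, 8.0])

mae = mean_absolute_error(y_true_toy, y_pred_toy)
mse = mean_squared_error(y_true_toy, y_pred_toy)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE

# RMSLE: RMSE computed on log(1 + y); only meaningful for non-negative values
rmsle = np.sqrt(np.mean((np.log1p(y_pred_toy) - np.log1p(y_true_toy)) ** 2))

# Relative errors, expressed as fractions (multiply by 100 for percentages)
relative_error = (y_true_toy - y_pred_toy) / y_true_toy
mpe = np.mean(relative_error)           # signed: errors can cancel out
mape = np.mean(np.abs(relative_error))  # absolute: errors cannot cancel out

r2 = r2_score(y_true_toy, y_pred_toy)

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "RMSLE:", rmsle)
print("MPE:", mpe, "MAPE:", mape, "R2:", r2)
```

Note that MAE and RMSE are expressed in the same unit as the output, whereas MSE is in squared units.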

In [None]:
display.Image("Image/MAE_Graph.png")

### Classification metrics

More details at https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Performance evaluation is mostly based on the confusion matrix. Common metrics include:
- Accuracy
- Precision (P)
- Recall (R)
- F1 score (F1)
- Area under the ROC (Receiver Operating Characteristic) curve or simply Area Under Curve (AUC)
- Matthews correlation coefficient (MCC)
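
As a minimal sketch, the classification metrics above can be computed with `sklearn.metrics` on a small binary example. The labels and scores below are invented for illustration. `y_score_toy` stands for the predicted probability of the positive class, which the AUC needs in place of hard predictions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy binary labels, hard predictions and scores (illustrative values only)
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_score_toy = [0.1, 0.2, 0.3, 0.6, 0.9, 0.8, 0.7, 0.4]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()
print("TN, FP, FN, TP:", tn, fp, fn, tp)

accuracy = accuracy_score(y_true_toy, y_pred_toy)
precision = precision_score(y_true_toy, y_pred_toy)
recall = recall_score(y_true_toy, y_pred_toy)
f1 = f1_score(y_true_toy, y_pred_toy)
mcc = matthews_corrcoef(y_true_toy, y_pred_toy)
auc = roc_auc_score(y_true_toy, y_score_toy)  # uses scores, not hard predictions

print("Accuracy:", accuracy, "Precision:", precision, "Recall:", recall)
print("F1:", f1, "MCC:", mcc, "AUC:", auc)
```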


In [None]:
display.Image("Image/ConfusionMatrix1.png")

## Regression Example

In [None]:
#necessary imports
from sklearn import datasets
import math
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### Linear Regression - Generated Dataset

In this part, all the code is already written. We ask you to understand deeply what it does and to play with the parameters.

It is highly recommended to read the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_regression.html#sklearn.datasets.make_regression

#### 1. Data generation

In [None]:
from sklearn.datasets import make_regression

In [None]:
X, y, coeff = make_regression(n_samples=1000, n_features=2, bias = 2.0, coef=True, noise=2, random_state=42)
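
One way to explore what `make_regression` returns is to switch the noise off and check that the output is an exact linear function of the input. The sketch below repeats the call above with `noise=0.0`. The names `X0`, `y0` and `coef0` are introduced here only to avoid overwriting `X`, `y` and `coeff`.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same parameters as in the cell above, except noise=0.0
X0, y0, coef0 = make_regression(n_samples=1000, n_features=2, bias=2.0,
                                coef=True, noise=0.0, random_state=42)

# One row per sample, one column per feature; one coefficient per feature
print(X0.shape, y0.shape, coef0.shape)

# Without noise, y should equal X @ coef + bias up to rounding errors
is_exactly_linear = np.allclose(y0, X0 @ coef0 + 2.0)
print("y == X @ coef + bias:", is_exactly_linear)
```

With `noise=2`, the same check should fail, because Gaussian noise is added to the output.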

**Question 1:** With the help of the documentation, explain the different parameters. Try with and without noise, with dimension 1, 2 and more.

**Your answer here**

#### 2. Visualization

The following cell displays the 2D input data (i.e. with 2 features). The color scale represents the output y values.

In [None]:
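# Build a 256-entry color lookup table from the nipy_spectral colormap
colors = [cm.nipy_spectral(float(i) / 255) for i in range(256)]

# Rescale y to integer indices in [0, 255] (names chosen to avoid shadowing the built-ins max and min)
y_max = np.max(y)
y_min = np.min(y)
ycol = 255 * (y - y_min) / (y_max - y_min)
ycol = ycol.astype('int')

col = [colors[yc] for yc in ycol]
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], color=col, marker="x")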

Let's visualize the relationship between each feature and the output.

In [None]:
# plot the first feature against the y values


In [None]:
# plot the second feature against the y values


What can you learn from these plots?

**Your answer here**

#### 3. Split dataset in Train and Test

In [None]:
display.Image("Image/Split Dataset.png")

In [None]:
# Splitting train and test datasets


Always check the size of the data

In [None]:
# Shape
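
A minimal sketch of the split and of the size check, under stated assumptions: `test_size=0.2`, `random_state=0` and the `_demo` variable names are choices made for this illustration, not values imposed by the lesson.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Regenerate the dataset so that this sketch is self-contained
X_demo, y_demo = make_regression(n_samples=1000, n_features=2, bias=2.0,
                                 noise=2, random_state=42)

# Hold out 20% of the samples for testing (assumed ratio)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The number of rows of X and y must match within each subset
print("Train:", X_train_demo.shape, y_train_demo.shape)
print("Test: ", X_test_demo.shape, y_test_demo.shape)
```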


#### 4. Estimation (training)

In [None]:
# Constructor call and training procedure


#### 5. Prediction

In [None]:
y_pred_train = lr.predict(X_train)

In [None]:
# Predict on Test dataset


#### 6. Testing: evaluation with regression metrics

Calculate $E_{in}$ (the error on the training set) and $E_{out}$ (estimated here by the error on the test set) with the same evaluation metric.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("Train MAE score: ", mean_absolute_error(y_train, y_pred_train))
print("Test MAE score: ", mean_absolute_error(y_test, y_pred_test))

In [None]:
print("Train MSE score: ", mean_squared_error(y_train, y_pred_train))
print("Test MSE score: ", mean_squared_error(y_test, y_pred_test))

print("Train R2 score: ", r2_score(y_train, y_pred_train))
print("Test R2 score: ", r2_score(y_test, y_pred_test))

**Question 2:** What do these scores represent? Are they good? (Try to answer using the different options proposed for the dataset creation.)

**Your answer here**

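Because the dataset was generated, we can also compare the true linear coefficients with the coefficients found by the linear regression. Note that `coeff` does not include the bias (2.0 here), which should be compared with `lr.intercept_`: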

In [None]:
print(coeff)

In [None]:
print(lr.coef_, lr.intercept_)
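
The comparison can be sketched end to end as follows. This is one possible version of the steps above, not the expected solution: the split parameters (`test_size=0.2`, `random_state=0`) and the names `lr_demo`, `true_coef` and `true_bias` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

true_bias = 2.0
X_g, y_g, true_coef = make_regression(n_samples=1000, n_features=2, bias=true_bias,
                                      coef=True, noise=2, random_state=42)
X_g_train, X_g_test, y_g_train, y_g_test = train_test_split(
    X_g, y_g, test_size=0.2, random_state=0
)

# Ordinary least squares with an intercept (scikit-learn default)
lr_demo = LinearRegression().fit(X_g_train, y_g_train)

print("True coefficients:     ", true_coef, "| true bias:", true_bias)
print("Estimated coefficients:", lr_demo.coef_, "| intercept:", lr_demo.intercept_)

# Largest absolute gap between estimated and true coefficients
max_coef_gap = np.max(np.abs(lr_demo.coef_ - true_coef))
print("Max coefficient gap:", max_coef_gap)
```

With a noise level of 2 and several hundred training samples, the estimates should land close to the true values. The gap should grow as the noise increases or as the number of samples decreases.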

## Example of under/overfitting

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Generate training samples
x_train = np.random.rand(100,1)
y_train = - x_train + 3 * (x_train ** 2) - 2 * (x_train ** 3) + 2 + np.random.rand(100,1) * 0.1

# Generate some outlier points in the dataset
x_train_noise = np.random.rand(10,1)
y_train_noise = - x_train_noise + 3 * (x_train_noise ** 2) - 2 * (x_train_noise ** 3) + 2 \
                + np.random.rand(10,1) * 0.5

# Combine 'normal' points and 'outlier' points to a single training set
x_train = np.concatenate((x_train, x_train_noise), axis=0)
y_train = np.concatenate((y_train, y_train_noise), axis=0)

# Generate test samples
x_test = np.random.rand(20,1)
y_test = - x_test + 3 * (x_test ** 2) - 2 * (x_test ** 3) + 2 + np.random.rand(20,1) * 0.1

In [None]:
# Plot training samples


##### Degree 1: Underfit

In [None]:
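# Generate polynomial features
polynomial_features = PolynomialFeatures(degree=1)
# [:, 1:] drops the constant column, since LinearRegression fits its own intercept
x_train_poly = polynomial_features.fit_transform(x_train)[:, 1:]
# Reuse the transformer fitted on the training data
x_test_poly = polynomial_features.transform(x_test)[:, 1:]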

# Create linear regression model


# Fit model to polynomial data


# Print fitted model: parameters, train MSE and test MSE
print('Coef:', model.coef_, 'Intercept:', model.intercept_)

print('Train MSE:', mean_squared_error(y_train, model.predict(x_train_poly)))
print('Test MSE:', mean_squared_error(y_test, model.predict(x_test_poly)))

In [None]:
# Plot the fitted line on the graph
idx = np.argsort(x_train, axis=0)[:,0]
plt.plot(x_train[idx], model.predict(x_train_poly)[idx], 'r', label='Fitting line')
plt.scatter(x_train,y_train, label='Training samples')
plt.scatter(x_test,y_test, label='Test samples')
plt.legend()

##### Degree 3: Correct Polynomial

In [None]:
# Generate polynomial features : Polynomial of degree 3


# Create linear regression model


# fit model to polynomial data


# print fitted model


In [None]:
# Plot


##### High Degree: Overfit

In [None]:
# Generate polynomial features : degree 30

# Create linear regression model


# fit model to polynomial data


# print fitted model


In [None]:
# Plot
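
To compare the three cases side by side, the sketch below fits polynomials of degree 1, 3 and 30 on data drawn from the same cubic function and prints the train and test MSE for each. It is a simplified variant of the cells above, under stated assumptions: a fixed seed is used for reproducibility, the outlier points are left out, and a scikit-learn pipeline replaces the manual feature transformation. The helper `cubic` and the names `x_tr`, `x_te` and `results` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility


def cubic(x):
    """Noise-free target used in this section."""
    return -x + 3 * x**2 - 2 * x**3 + 2


# Training and test samples with uniform noise in [0, 0.1)
x_tr = rng.random((110, 1))
y_tr = cubic(x_tr) + rng.random((110, 1)) * 0.1
x_te = rng.random((20, 1))
y_te = cubic(x_te) + rng.random((20, 1)) * 0.1

results = {}
for degree in (1, 3, 30):
    # Polynomial expansion followed by ordinary least squares
    model_d = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model_d.fit(x_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model_d.predict(x_tr))
    test_mse = mean_squared_error(y_te, model_d.predict(x_te))
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE = {train_mse:.5f}, test MSE = {test_mse:.5f}")
```

Underfitting typically shows up as a high error on both sets, whereas overfitting shows up as a low training error with a noticeably higher test error.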


### Linear Regression: Diabetes Dataset

In [None]:
# Load the dataset
diab = datasets.load_diabetes()
X = 
y = 

**Question 4:** How many data points are there?

In [None]:
# answer here


**Question 5:** What is the type of the data? What is its dimension? What is the type of the labels?

In [None]:
# answer here


**Question 5bis:** What are the features?

In [None]:
# answer here
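
As a hint for the questions above, here is one possible way to inspect the object returned by `load_diabetes`. The names `diab_demo`, `X_d` and `y_d` are introduced only for this sketch.

```python
from sklearn import datasets

diab_demo = datasets.load_diabetes()
X_d, y_d = diab_demo.data, diab_demo.target

# Number of data points and number of features
print("X shape:", X_d.shape, "| y shape:", y_d.shape)

# Types of the inputs and of the labels
print("X dtype:", X_d.dtype, "| y dtype:", y_d.dtype)

# Names of the features
print("Features:", diab_demo.feature_names)
```

The full description of the dataset is available in `diab_demo.DESCR`.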


**Question 6a:** Split train/test dataset

In [None]:
# answer here


**Question 6b:** How many training data points? How many test data points?

In [None]:
# answer here


**Question 7:** Linear regression. Create a linear regression model with default parameters and train it.

In [None]:
# answer here
# Create linear regression model


# fit model to training data


**Question 8a:** Print the scores. What do they represent?

In [None]:
# answer here


**Your answer here**

**Question 8b:** What are the MSE and RMSE values?

In [None]:
# answer here


**Question 9:** How could you test a non-linear regression, for example a second-degree polynomial?

In [None]:
# answer here
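
One possible approach, sketched under stated assumptions, is to expand the features with `PolynomialFeatures(degree=2)` and then fit an ordinary linear regression on the expanded features, as in the under/overfitting example. The split parameters (`test_size=0.2`, `random_state=0`) and the variable names below are choices made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X_db, y_db = datasets.load_diabetes(return_X_y=True)
X_db_train, X_db_test, y_db_train, y_db_test = train_test_split(
    X_db, y_db, test_size=0.2, random_state=0
)

# Degree-2 expansion (squares and pairwise products), then least squares
poly2 = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly2.fit(X_db_train, y_db_train)

n_poly_features = poly2.named_steps["polynomialfeatures"].n_output_features_
print("Number of polynomial features:", n_poly_features)

mse_train = mean_squared_error(y_db_train, poly2.predict(X_db_train))
mse_test = mean_squared_error(y_db_test, poly2.predict(X_db_test))
print("Train MSE:", mse_train, "| Train RMSE:", np.sqrt(mse_train))
print("Test MSE: ", mse_test, "| Test RMSE: ", np.sqrt(mse_test))
```

Comparing these values with those of the plain linear regression from Question 8b indicates whether the extra features help or whether the model starts to overfit.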
