# Linear Regression

In [None]:
Introduction

    The objective of this exercise is to become familiar with linear regression. Linear regression was one of the first predictive models to be studied and is today one of the most popular models for practical applications thanks to its simplicity.

Univariate Linear Regression

    In the univariate linear model, we have two variables: $y$, called the target variable, and $x$, called the explanatory variable. Linear regression consists in modeling the link between these two variables by an affine function. Thus, the formula of the univariate linear model is given by:

$$y \approx \beta_1 x + \beta_0$$

where:

        $y$ is the variable we want to predict.

        $x$ is the explanatory variable.

        $\beta_1$ and $\beta_0$ are the parameters of the affine function: $\beta_1$ defines its slope and $\beta_0$ defines its y-intercept (also called bias).

The goal of linear regression is to estimate the best parameters $\beta_0$ and $\beta_1$ to predict the variable $y$ from a given value of $x$.
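As a rough illustration of what estimating the slope and the intercept means, here is a minimal sketch that takes "best" to mean least squares. The data are synthetic and the true values (slope 2, intercept 1) are assumptions chosen for the example; they do not come from the exercise.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 2x + 1 plus Gaussian noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=200)

# Least-squares fit of a degree-1 polynomial.
# np.polyfit returns the coefficients from highest degree to lowest: (slope, intercept).
beta_1, beta_0 = np.polyfit(x, y, deg=1)

print("estimated slope beta_1:", beta_1)
print("estimated intercept beta_0:", beta_0)
```

The estimates should land close to the values used to generate the data, but not exactly on them because of the noise.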

To get a feel for Univariate Linear Regression, let us look at the interactive example below.

    (a) Run the next cell to display the interactive figure. In this figure, we have simulated a dataset by the relation $y = \alpha_1 x + \alpha_0$.

    (b) Use the sliders on the Regression tab to find the parameters $\beta_0$ and $\beta_1$ that best match all the points in the data set.

    (c) What is the effect of each of the parameters on the regression function?


In [None]:
from widgets import regression_widget

regression_widget()

In [None]:
Multivariate Linear Regression

    Multivariate linear regression consists in modeling a linear link between a target variable $y$ and several explanatory variables $x_1, x_2, \dots, x_p$, often called features:

$$y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p = \beta_0 + \sum_{j=1}^{p} \beta_j x_j$$

There are now $p + 1$ parameters $\beta_j$ to find.
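One way to see the $p + 1$ parameters concretely is to solve the least-squares problem directly with NumPy. The sketch below uses synthetic data, and the values in `true_beta` are assumptions made up for the example. It adds a column of ones to the features so that the intercept is estimated together with the other coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 3

# Synthetic features and target (assumed values, for illustration only)
X = rng.normal(size=(n, p))
true_beta = np.array([4.0, 1.5, -2.0, 0.5])  # beta_0, beta_1, ..., beta_p
y = true_beta[0] + X @ true_beta[1:] + rng.normal(0, 0.1, size=n)

# Prepend a column of ones so that beta_0 is estimated like any other parameter
X_design = np.column_stack([np.ones(n), X])

# Least-squares solution of X_design @ beta ~ y
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print("estimated parameters:", beta_hat)
```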

## 1. Using scikit-learn for linear regression

In [None]:



    We are now going to learn how to use the scikit-learn library in order to solve a Machine Learning problem with a linear regression.

    During the following exercises, the objective will be to predict the selling price of a car based on its characteristics.

Importing the dataset

    The dataset that we will use in the following contains many characteristics about different cars from 1985.

    For simplicity, only the numeric variables have been kept and the lines containing missing values have been deleted.

    (a) Import the pandas module under the alias pd.

    (b) In a DataFrame named df, import the automobiles.csv dataset using the read_csv function of pandas. This file is located in the same folder as the runtime environment of the notebook.

    (c) Display the first 5 lines of df to check if the import was successful.


In [None]:
# Insert your code here

import pandas as pd

df = pd.read_csv('automobiles.csv')

df.head()

In [None]:
        The symboling variable corresponds to the degree of risk with respect to the insurer (risk of accident, breakdown, etc.).

        The normalized_losses variable is the relative average cost per year of vehicle insurance. This value is normalized with respect to cars of the same type (SUV, utility, sports, etc.).

        The following 13 variables concern the technical characteristics of the cars, such as width, length, engine displacement, horsepower, etc.

        The last variable price corresponds to the selling price of the vehicle. This is the variable that we will try to predict.

Separation of the explanatory variables from the target variable

    We are now going to create two DataFrames, one containing the explanatory variables and another containing the target variable price.

    (d) In a DataFrame named X, make a copy of the explanatory variables of our data set, that is to say all the variables except price.

    (e) In a DataFrame named y, make a copy of the target variable price.

In [None]:
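# Insert your code here

# Option 1: positional slicing (assumes price is the last column)
X = df.iloc[:,:-1]
# or
# Option 2: drop the target column by name
X = df.drop(['price'], axis = 1)

# Option 1: positional slicing; the slice -1: returns a one-column DataFrame
y = df.iloc[:, -1:]
# or
# Option 2: select the column by name; this returns a Series
y = df['price']


# Original solution
# Explanatory variables
X = df.drop(['price'], axis = 1)

# Target variable
y = df['price']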

## Splitting of the data into training and test sets

In [None]:

    We are now going to split our dataset into two sets: a training set and a test set. This step is extremely important when doing Machine Learning.

    Indeed, as their names indicate:

            The training set is used to train the model, i.e. to find the optimal parameters $\beta_0, \dots, \beta_p$ for this dataset.

            The test set is used to test the trained model by evaluating its ability to generalize its predictions to data that it has never seen.

    A very useful function for doing this is the train_test_split function of the model_selection submodule of scikit-learn.

    (f) Run the following cell to import the train_test_split function.


from sklearn.model_selection import train_test_split


    This function is used as follows:

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

            X_train and y_train are the explanatory and target variables of the training dataset.

            X_test and y_test are the explanatory and target variables of the test dataset.

            The test_size parameter corresponds to the proportion of the dataset that we want to keep for the test set. In the previous example, this proportion corresponds to 20% of the initial dataset.
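The split is random, so two runs generally give different training and test sets. As a small sketch on toy arrays (the arrays and the seed value 42 are arbitrary choices for the example), the `random_state` parameter of `train_test_split` can be fixed to make the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 observations, 2 features (values are arbitrary)
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# 20% of the rows go to the test set; random_state fixes the shuffling
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Calling the function again with the same seed gives the same split
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
print("same test set:", np.array_equal(X_te, X_te_again))
```

Without `random_state`, the metrics computed later in the exercise will vary slightly from one run to the next.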

    (g) Using the train_test_split function, separate the dataset into a training set (X_train, y_train) and a test set (X_test, y_test) so that the test set contains 15% of the initial dataset.



In [None]:
# Insert your code here

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.15)

# Original solution
# Splitting the dataset into a training set (85%) and a test set (15%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)

## Training the regression model

In [None]:

    To train a linear regression model on this dataset, we will use the LinearRegression class contained in the linear_model submodule of scikit-learn.

    (h) Run the following cell to import the LinearRegression class.

from sklearn.linear_model import LinearRegression



    The scikit-learn API makes it easy to train and evaluate models. All scikit-learn model classes have the following two methods:

            fit: Train the model on the dataset given as input.

            predict: Make a prediction from a set of explanatory variables given as input.

    Below is an example of training a model with scikit-learn:

    # Instantiation of the model
    linreg = LinearRegression()

    # Training the model on the training set
    linreg.fit(X_train, y_train)

    # Prediction of the target variable for the test dataset. These predictions are stored in y_pred.
    y_pred = linreg.predict(X_test)
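    Once a LinearRegression model has been fitted, the estimated parameters can be read from its `intercept_` and `coef_` attributes. The sketch below uses noise-free synthetic data, with coefficients chosen arbitrarily for the example, so that the recovered values are easy to check:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (assumed relation): y = 3 + 2*x1 - 1*x2
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

model = LinearRegression()
model.fit(X_demo, y_demo)

# intercept_ holds beta_0, coef_ holds beta_1, ..., beta_p
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)

# predict expects a 2D input: one row per observation
print("prediction for (1, 1):", model.predict([[1.0, 1.0]]))
```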

    (i) Instantiate a LinearRegression model named lr.

    (j) Train lr on the training dataset.

    (k) Make a prediction on the training data. Store these predictions in y_pred_train.

    (l) Make a prediction on the test data. Store these predictions in y_pred_test.



In [None]:
# Insert your code here

lr = LinearRegression()

lr.fit(X_train, y_train)

y_pred_train = lr.predict(X_train)

y_pred_test = lr.predict(X_test)

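# Side-by-side comparison of the predictions and the true values on the training set
errors_train = y_pred_train - y_train

dict1 = {'y_pred_train' : y_pred_train,
         'y_train_true' : y_train,
         'y_train_difference' : errors_train,
         'difference_squared' : errors_train**2 }

df_train = pd.DataFrame(dict1)

# Round for display only: casting to int before averaging would truncate the values and bias the MSE
print(df_train.round(0))

print(y_pred_train.shape, y_pred_test.shape)

print("MSE train : ",df_train.difference_squared.mean())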

In [None]:
# Original solution
# Instantiation of the model
lr = LinearRegression()

# Training the model
lr.fit(X_train, y_train)

# Prediction of the target variable for the TRAIN dataset
y_pred_train = lr.predict(X_train)

# Prediction of the target variable for the TEST dataset
y_pred_test = lr.predict(X_test)

## Evaluation of the model's performance

In [None]:

    In order to evaluate the quality of the predictions of the model obtained thanks to the parameters $\beta_0, \dots, \beta_p$, there are several metrics already built into the scikit-learn library.

One of the most used metrics for regression is the Mean Squared Error (MSE) which is defined under the name of mean_squared_error in the metrics submodule of scikit-learn.

This function consists in calculating the average of the squared distances between the target variables and the predictions obtained thanks to the regression function.

The following interactive figure shows how this error is calculated according to $\beta_1$:

        The blue dots represent the dataset for which we want to evaluate the quality of the predictions. Usually this is the test dataset.

        The red line is the regression function configured by $\beta_1$. In this example, $\beta_0$ is set to 0 to simplify the illustration.

        The green lines are the distances between the target variable and the predictions obtained thanks to the regression function parameterized by $\beta_1$.

    The mean squared error is just the average of these squared distances.
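    Written as a formula, with notation assumed here for the example ($n$ observations, true values $y_i$, predictions $\hat{y}_i$), and using the fact that $\beta_0 = 0$ in the figure so that $\hat{y}_i = \beta_1 x_i$:

```latex
\mathrm{MSE}(\beta_1)
  = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
  = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \beta_1 x_i \right)^2
```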

    (m) Run the next cell to display the interactive figure.

    (n) Using the cursor below the figure, try to find a value of $\beta_1$ that minimizes the Mean Squared Error. Is this value unique?

In [None]:
from widgets import interactive_MSE

interactive_MSE()

In [None]:


    The mean_squared_error function of scikit-learn is used as follows:

        mean_squared_error(y_true, y_pred)

    where:

            y_true contains the true values of the target variable.
            y_pred contains the values predicted by our model for the same explanatory variables.
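    To make the definition concrete, the following sketch computes the MSE by hand on four made-up values and compares it with the result of `mean_squared_error`. The numbers are arbitrary and only serve the illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true values and predictions
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 12.0, 13.0, 24.0])

# Squared errors are 1, 0, 4 and 16, so their average is 21 / 4 = 5.25
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print("manual MSE: ", mse_manual)
print("sklearn MSE:", mse_sklearn)
```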

    (o) Import the mean_squared_error function from the sklearn.metrics submodule.

    (p) Evaluate the prediction quality of the model on training data. Store the result in a variable named mse_train.

    (q) Evaluate model prediction quality on test data. Store the result in a variable named mse_test.

    (r) Why is the MSE higher on the test dataset?



In [None]:
# Insert your code here

from sklearn.metrics import mean_squared_error

lr = LinearRegression()

lr.fit(X_train, y_train)

y_pred_train = lr.predict(X_train)

y_pred_test = lr.predict(X_test)

mse_train = mean_squared_error(y_train, y_pred_train)

mse_test = mean_squared_error(y_test, y_pred_test)

print(mse_train, 'MSE train')
print(mse_test, 'MSE test')

In [None]:
# Original solution
from sklearn.metrics import mean_squared_error

# Calculation of the MSE between the target variable and the predictions made on the training dataset
mse_train = mean_squared_error(y_train, y_pred_train)

# Calculation of the MSE between the target variable and the predictions made on the test dataset
mse_test = mean_squared_error(y_test, y_pred_test)

print("MSE train lr:", mse_train)
print("MSE test lr:", mse_test)

In [None]:
    The mean squared error you will find should be on the order of millions on the test data, which is difficult to interpret because it is expressed in squared units of the target variable.

    This is why we are going to use another metric, the Mean Absolute Error (MAE), which is expressed in the same units as the target variable.
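    As a small sketch with made-up prices (not taken from the dataset), the MAE is the average of the absolute errors, and dividing it by the mean of the true values gives a relative error that is easier to read:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up prices, for illustration only
prices_true = np.array([10000.0, 15000.0, 20000.0, 35000.0])
prices_pred = np.array([12000.0, 14000.0, 23000.0, 31000.0])

# Absolute errors are 2000, 1000, 3000 and 4000, so the MAE is 2500
mae_manual = np.mean(np.abs(prices_true - prices_pred))
mae_sklearn = mean_absolute_error(prices_true, prices_pred)

# Relative error: MAE divided by the mean true price (20000), i.e. 0.125
relative_error = mae_sklearn / prices_true.mean()

print("MAE:", mae_sklearn)
print("Relative error:", relative_error)
```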

    (s) Import the mean_absolute_error function from the sklearn.metrics submodule.

    (t) Evaluate the prediction quality on test and training data using the mean absolute error.

    (u) From the DataFrame df, calculate the average purchase price on all vehicles. Do the model's predictions seem reliable to you?


In [None]:
# Insert your code here

from sklearn.metrics import mean_absolute_error

mae_train = mean_absolute_error(y_train, y_pred_train)

mae_test = mean_absolute_error(y_test, y_pred_test)

print(mae_train, 'MAE train')
print(mae_test, 'MAE test')

print("Avg Price : ", df['price'].mean())

In [None]:
from sklearn.metrics import mean_absolute_error

# Calculation of the MAE between the target variable and the predictions made on the training dataset
mae_train = mean_absolute_error(y_train, y_pred_train)

# Calculation of the MAE between the target variable and the predictions made on the test dataset
mae_test = mean_absolute_error(y_test, y_pred_test)

print("MAE train lr:", mae_train)
print("MAE test lr:", mae_test)

mean_price = df['price'].mean()

print("\nRelative error", mae_test / mean_price)

# The mean absolute error is around 20% of the average price, which is not optimal
# but is still a good baseline for testing more advanced models.

## 2. Overfitting the data with another regression model

In [None]:

    We have just seen that with the LinearRegression class of scikit-learn, the model was able to learn on the training data and generalize on the test data with a relative error of about 20% on average.

    In what follows we will create another regression model that learns very well on training data but generalizes very poorly on test data: this is called overfitting.

    For this we will use a Machine Learning model called Gradient Boosting Regressor, known for its tendency to overfit.
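    Before turning to the car dataset, the effect can be sketched on synthetic data. Everything below is an assumption made for the illustration: the sine-shaped signal, the noise level and the hyperparameters. A model with many deep trees can fit the noise of the training set, so its training error is typically far below its test error.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic noisy data (assumed): y = sin(x) + noise
rng = np.random.default_rng(0)
X_syn = rng.uniform(-3, 3, size=(200, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(0, 0.5, size=200)

X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

# Many deep trees: enough capacity to memorize the training noise
deep_gbr = GradientBoostingRegressor(n_estimators=300, max_depth=8, random_state=0)
deep_gbr.fit(X_a, y_a)

mae_a = mean_absolute_error(y_a, deep_gbr.predict(X_a))
mae_b = mean_absolute_error(y_b, deep_gbr.predict(X_b))

print("MAE on training data:", mae_a)
print("MAE on test data:    ", mae_b)
```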

    (a) Run the following cell to import the GradientBoostingRegressor class contained in the ensemble submodule of scikit-learn and instantiate a GradientBoostingRegressor model named gbr.



In [None]:
from sklearn.ensemble import GradientBoostingRegressor

# These parameters have been chosen to overfit on purpose
# Do not use them in practice
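gbr = GradientBoostingRegressor(n_estimators = 1000,
                                max_depth = 10000,
                                max_features = 15)
# validation_fraction is left at its default: it only matters when early stopping
# (n_iter_no_change) is enabled, and recent scikit-learn versions reject a value of 0.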

In [None]:
    (b) Train the model gbr using its fit method.

    (c) Make predictions on the test and training datasets. Store these predictions in y_pred_test_gbr and y_pred_train_gbr.

In [None]:
# Insert your code here

gbr.fit(X_train, y_train)

y_pred_train_gbr = gbr.predict(X_train)

y_pred_test_gbr = gbr.predict(X_test)


# Original solution
# Training the model on the training dataset
gbr.fit(X_train, y_train)

# Prediction of the target variable for the TRAIN dataset
y_pred_train_gbr = gbr.predict(X_train)

# Prediction of the target variable for the TEST dataset
y_pred_test_gbr = gbr.predict(X_test)

In [None]:


    After instantiating our model, training it on the training data and making the predictions, we must then evaluate its performance.

    (d) Calculate the MSE on the training data and the test data using the mean_squared_error function then display the results.

    (e) Calculate the MAE for the training data and the test data using the mean_absolute_error function then display the results.

    (f) After having calculated the average of the price column, calculate the relative error of the model on the test set.



In [None]:
# Insert your code here
mse_train = mean_squared_error(y_train, y_pred_train_gbr)

mse_test = mean_squared_error(y_test, y_pred_test_gbr)

print(mse_train, "MSE train")
print(mse_test, "MSE test\n")

mae_train = mean_absolute_error(y_train, y_pred_train_gbr)

mae_test = mean_absolute_error(y_test, y_pred_test_gbr)

print(mae_train, "MAE train")
print(mae_test, "MAE test\n")

mean_price_gbr = df.price.mean()
print("Relative Error : ", mae_test / mean_price_gbr)

In [None]:
### MSE

# Calculation of the MSE between the target variable and the predictions made on the training dataset
mse_train_gbr = mean_squared_error(y_train, y_pred_train_gbr)

# Calculation of the MSE between the target variable and the predictions made on the test dataset
mse_test_gbr = mean_squared_error(y_test, y_pred_test_gbr)

print("MSE train gbr:", mse_train_gbr)
print("MSE test gbr:", mse_test_gbr, "\n")


### MAE

# Calculation of the MAE between the target variable and the predictions made on the training dataset
mae_train_gbr = mean_absolute_error(y_train, y_pred_train_gbr)

# Calculation of the MAE between the target variable and the predictions made on the test dataset
mae_test_gbr = mean_absolute_error(y_test, y_pred_test_gbr)

print("MAE train gbr:", mae_train_gbr)
print("MAE test gbr:", mae_test_gbr, "\n")

mean_price_gbr = df['price'].mean()

print("Relative error", mae_test_gbr / mean_price_gbr)

In [None]:
    Here is an example of results that we could obtain with these two models.

    For the linear regression with LinearRegression we had:

            MAE train lr = 1588.131591267774
            MAE test lr = 2105.5002712214014

    For the regression with GradientBoostingRegressor we have:

            MAE train gbr: 27.533333333339847
            MAE test gbr: 1393.013371545563

    The mean absolute error obtained on the training set by the GradientBoostingRegressor model is only 27.5 against 1588 for the linear regression. The GradientBoostingRegressor model is very powerful and is able to learn the training data almost "by heart" which explains this difference in performance.

    It is for this reason that the performance of the model should be evaluated on the test dataset. Indeed, the mean absolute error of the GradientBoostingRegressor model on the test set is 1393, about 50 times its error on the training data.

    This is an example of blatant overfitting. Even though the GradientBoostingRegressor outperforms the linear regression on the test data, you should always be wary of a training performance that looks too good.


## 3. Going further: Polynomial Regression

In [None]:

    In many cases, the relationship between the variables $x$ and $y$ is not linear, so a straight line cannot predict $y$ well. We could then propose a quadratic model such as:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2$$

    (a) Run the next cell to display the interactive figure.

    (b) Find the optimal parameters for 𝛽0, 𝛽1 and 𝛽2 that best approximate the data on the scatter plot.

    (c) Set 𝛽2 to 0 and vary 𝛽0 and 𝛽1. Which model do you recognize?

In [None]:
from widgets import polynomial_regression

polynomial_regression()

In [None]:
Polynomial regression is equivalent to performing a classical linear regression on polynomial functions of the explanatory variable, of arbitrary degree. It is much more flexible than classical linear regression because, with a high enough degree, it can approximate any continuous function on a bounded interval.

When we have several explanatory variables, the polynomial variables also include the products between the explanatory variables. For example, if we have three variables, then the second-order polynomial regression model becomes:

$$y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1^2 + \beta_5 x_2^2 + \beta_6 x_3^2 + \beta_7 x_1 x_2 + \beta_8 x_2 x_3 + \beta_9 x_1 x_3$$

    If we had more explanatory variables or wanted to increase the degree of the polynomial regression, the number of explanatory variables would explode, which could induce overfitting.
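    To give an idea of how fast this number grows: a full polynomial model of degree $d$ in $p$ variables has $\binom{p+d}{d}$ terms, constant included. The helper below is a hypothetical function written for this illustration. It evaluates that count for three variables, and for 15 variables, which is the number of explanatory variables described for the car dataset.

```python
from math import comb


def n_polynomial_features(p: int, degree: int) -> int:
    """Number of monomials of total degree <= `degree` in `p` variables,
    including the constant term."""
    return comb(p + degree, degree)


for p, d in [(3, 2), (15, 2), (15, 3)]:
    print(f"p={p}, degree={d}: {n_polynomial_features(p, d)} features")
```

With a dataset of a couple of hundred rows, a model with several hundred features has more parameters than observations, which is a typical setting for overfitting.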

    (d) Run the next cell to display the interactive figure.

    The scatter plot was generated with the same trend as the previous figure. The red line corresponds to the optimal polynomial regression function obtained on these data.

    (e) Taking into account the scatter plot in the previous figure, find the degree of the polynomial regression that best captures the trend of the data.

    (f) Set d to 20. Do you think this regression function would give good predictions on the scatter plot in the previous figure?


In [None]:
from widgets import polynomial_regression2

polynomial_regression2()

In [None]:
    To train a Polynomial regression model with scikit-learn, we must first calculate the polynomial variables from the data. This can be done using the PolynomialFeatures class of the preprocessing submodule:

    from sklearn.preprocessing import PolynomialFeatures

    poly_feature_extractor = PolynomialFeatures(degree = 2)

        The degree parameter defines the degree of the polynomial features to be calculated.

    The poly_feature_extractor object is not a prediction model. This type of object is called a Transformer and it can be used with the following two methods:

            fit: learns nothing from the data values in this case (it only records the number of input features). This method is generally used to calculate the parameters necessary to apply a transformation to the data.

            transform: Applies the transformation to the dataset. In this case, the method returns the polynomial features of the dataset.

    These two methods can be chained in a single call using the fit_transform method. We can compute the polynomial features on X_train and X_test as follows:

    X_train_poly = poly_feature_extractor.fit_transform(X_train)

    X_test_poly = poly_feature_extractor.transform(X_test)

    (g) Import the PolynomialFeatures class from the preprocessing submodule of sklearn.

    (h) Instantiate an object of class PolynomialFeatures with the argument degree = 3 and name it poly_feature_extractor.

    (i) Apply the transformation of poly_feature_extractor on X_train and X_test and store the results in X_train_poly and X_test_poly.



In [None]:
# Insert your code here

from sklearn.preprocessing import PolynomialFeatures

poly_feature_extractor = PolynomialFeatures(degree = 3)

X_train_poly = poly_feature_extractor.fit_transform(X_train)

X_test_poly = poly_feature_extractor.transform(X_test)

In [None]:
# Original solution
from sklearn.preprocessing import PolynomialFeatures

poly_feature_extractor = PolynomialFeatures(degree = 3)

# Applying the transformation on X_train and X_test
X_train_poly = poly_feature_extractor.fit_transform(X_train)
X_test_poly = poly_feature_extractor.transform(X_test)

In [None]:

    (j) Train a linear regression model on the data (X_train_poly, y_train).

    (k) Evaluate its performance on training data and test data (X_test_poly, y_test). Are we in an overfitting regime?



In [None]:
# Insert your code here

lr = LinearRegression()

lr.fit(X_train_poly, y_train)

y_pred_train = lr.predict(X_train_poly)

mae_train = mean_absolute_error(y_train, y_pred_train)

print("MAE train : ",mae_train)

y_pred_test = lr.predict(X_test_poly)

mae_test = mean_absolute_error(y_test, y_pred_test)

print("MAE test : ", mae_test)

In [None]:
# Original solution
# Instantiation of a linear regression model
polyreg = LinearRegression()

# Training of the model on polynomial features
polyreg.fit(X_train_poly, y_train)

# Evaluation of the model on the training data
y_pred_train = polyreg.predict(X_train_poly)
print("MAE Train:", mean_absolute_error(y_train, y_pred_train), '\n')


# Evaluation of the model on the test data
y_pred_test = polyreg.predict(X_test_poly)
print("MAE Test:", mean_absolute_error(y_test, y_pred_test), '\n')


print("We are absolutely in an overfitting regime.")
print("The polynomial regression model performs well on training data but not on test data.")
print("The third-order polynomial regression model performs significantly worse than a simple linear regression.")

## Conclusion and recap

In [None]:

    In this course, you have been introduced to solving a regression problem with machine learning.

    We used the scikit-learn library to instantiate regression models like LinearRegression or GradientBoostingRegressor and also apply transformations on the data like extracting polynomial features.

    The different steps that we have studied are the basis of any solution to a Machine Learning problem:

            The data is prepared by separating the explanatory variables from the target variable.

            We split the dataset into two sets (a training set and a test set) using the train_test_split function of the sklearn.model_selection submodule.

            We instantiate a model like LinearRegression or GradientBoostingRegressor using the class constructor.

            We train the model on the training dataset using the fit method.

            We perform a prediction on the test dataset using the predict method.

            We evaluate the performance of our model by calculating the error between these predictions and the true values of the target variable from the test data.

    The performance evaluation for a regression model is easily done using the mean_squared_error or mean_absolute_error functions of the metrics submodule of sklearn.



## RECAP Linear Regression 

In [None]:
# RECAP Linear Regression 
# Import Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error


# Load Data
df = pd.read_csv('automobiles.csv', index_col = 0)


# Explanatory variables
X = df.drop(['price'], axis = 1)

# Target variable (the price column, as in the exercises above)
y = df['price']



# Splitting the dataset into a training set (85%) and a test set (15%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)



# Instantiation of the model
lr = LinearRegression()

# Training the model
lr.fit(X_train, y_train)

# Prediction of the target variable for the TRAIN dataset
y_pred_train = lr.predict(X_train)

# Prediction of the target variable for the TEST dataset
y_pred_test = lr.predict(X_test)



# Calculation of the MSE between the target variable and the predictions made on the training dataset
mse_train = mean_squared_error(y_train, y_pred_train)

# Calculation of the MSE between the target variable and the predictions made on the test dataset
mse_test = mean_squared_error(y_test, y_pred_test)

print("MSE train lr:", mse_train)
print("MSE test lr:", mse_test)
print('\n')


# Calculation of the MAE between the target variable and the predictions made on the training dataset
mae_train = mean_absolute_error(y_train, y_pred_train)

# Calculation of the MAE between the target variable and the predictions made on the test dataset
mae_test = mean_absolute_error(y_test, y_pred_test)

print("MAE train lr:", mae_train)
print("MAE test lr:", mae_test)

mean_price = df['price'].mean()

print("\nRelative error", mae_test / mean_price)

# The mean absolute error is around 20% of the average price, which is not optimal
# but is still a good baseline for testing more advanced models.


Accuracy: the proportion of samples that a classification algorithm predicts correctly.
In other words, it is the number of correctly predicted samples divided by the total number of samples.
For example, if we use a classification model to diagnose a disease,
accuracy is the number of patients who receive the correct diagnosis (sick or healthy) divided by the total number of patients.

Recall (sensitivity): the proportion of actual positive samples that are predicted as positive.
In other words, among all the patients who really have the disease, it is the share that the model identifies as having it.
For example, if we use a classification model for cancer screening,
recall is the proportion of patients who really have cancer that the model detects.

Precision: the proportion of samples predicted as positive that are actually positive.
In other words, it is the number of correctly predicted positives divided by the total number of samples the model predicted as positive.
For example, if we use a classification model to build a spam filter,
precision is the proportion of messages flagged as spam that really are spam.
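
These three metrics can be checked on a small made-up example. The labels below are arbitrary, with 1 as the positive class. They give 3 true positives, 1 false negative, 2 false positives and 4 true negatives, so the three scores differ.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# TP = 3, FN = 1, FP = 2, TN = 4
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)   = 3 / 4
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)   = 3 / 5

print("accuracy: ", accuracy)
print("recall:   ", recall)
print("precision:", precision)
```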

# CLASSIFICATION

In [None]:
## Part II: Simple classification models

    For this second part of an introduction to the scikit-learn module, we will focus on the second type of problem in Machine Learning: the classification problem.

    The objective of this introduction is to:

            Introduce the classification problem.
            Learn to use the scikit-learn module to build a classification model, also known as a "classifier".
            Introduce metrics adapted to the evaluation of classification models.



## Introduction to classification

Objective of classification

    In supervised learning, the objective is to predict the value of a target variable from explanatory variables.

            In a regression problem, the target variable takes continuous values. These values are numerical: price of a house, quantity of oxygen in the air of a city, etc. The target variable can therefore take an infinity of values.

            In a classification problem, the target variable takes discrete values. These values can be numeric or literal, but in both cases the target variable takes a finite number of values.

    The different values taken by the target variable are called classes.

    The objective of classification therefore consists in predicting the class of an observation from its explanatory variables.

An example of classification

    We will look at a problem of a binary classification, i.e. where there are two classes. We are trying to determine whether the water in a stream is drinkable or not depending on its concentration of toxic substances and its mineral salts content.

    The two classes are therefore 'drinkable' and 'non-drinkable'.

<img src="Photos\sklearn_intro_classification_binaire_en.png" width="400" height="400">

    In the figure above, each point represents a stream whose position on the map is defined by its values for the concentration of toxic substances and the content of mineral salts.

    The objective will be to build a model capable of assigning one of the two classes ('drinkable' / 'non-drinkable') to a stream of which only these two variables are known.

    The figure above suggests the existence of two zones allowing easy classification of streams:

            An area where the streams are drinkable (top left).

            An area where the streams are not drinkable (bottom right).

    We would like to create a model capable of separating the dataset into two parts corresponding to these areas.

    A simple technique would be to separate the two areas using a line.

    (a) Run the next cell to display the interactive figure.

        The orange dots are the drinkable streams and the blue dots are the non-drinkable streams.

        The red arrow corresponds to a vector defined by $w = (w_1, w_2)$. The red line corresponds to the plane orthogonal (i.e. perpendicular) to $w$. You can change the coordinates of the vector $w$ in two ways:

                By moving the sliders of w_1 and w_2.
                By clicking on the values to the right of the sliders and typing the desired value.

    (b) Try to find a vector $w$ such that the plane orthogonal to $w$ perfectly separates the two stream classes.

    (c) A possible solution is given by the vector $w = (-1.47, 0.84)$. Does the vector $w = (1.47, -0.84)$ also give a solution?

In [None]:
from classification_widgets import linear_classification

linear_classification()

The classification we just performed is of linear type, that is to say that we used a flat plane (in two dimensions, a straight line) to separate our classes.

This plane was parametrized by the vector $w$. Thus, the objective of linear classification models is to find the vector $w$ allowing the best possible separation of the different classes. Each model of linear type has its own technique to find this vector.

There are also non-linear classification models, which we will see later.
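
A linear separation of this kind can be sketched in a few lines: a point is assigned to a class according to the sign of its dot product with the vector. The sketch rests on several assumptions made for the illustration. The separating line passes through the origin, the first coordinate is the concentration of toxic substances, the second is the mineral salts content, the two stream positions are invented, and the positive side is labeled 'drinkable'.

```python
import numpy as np

# Vector taken from the exercise above; the line orthogonal to it is the boundary
w = np.array([-1.47, 0.84])

# Hypothetical streams: (toxic concentration, mineral salts content)
streams = np.array([
    [1.0, 3.0],   # low toxicity, high mineral content (top left of the map)
    [3.0, 1.0],   # high toxicity, low mineral content (bottom right of the map)
])

# The sign of the dot product tells on which side of the boundary each point lies
scores = streams @ w
labels = np.where(scores > 0, "drinkable", "non-drinkable")

print(scores)
print(labels)
```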

<img src="Photos\sklearn_intro_classification_lin_non_lin_en.png" width="900" height="400">


## 1. Using scikit-learn for classification

    We will now introduce the main tools of the scikit-learn module for solving a classification problem.

    In this exercise we will use the Congressional Voting Records dataset, containing a number of votes cast by members of Congress of the United States House of Representatives.

    The objective of our classification problem will be to predict the political party ("Democrat" or "Republican") of the members of the House of Representatives according to their votes on subjects like education, health, budget, etc.

    The explanatory variables will therefore be the votes on various subjects and the target variable will be the "democrat" or "republican" political party.

    To solve this problem we will use:

            A non-linear classification model: K-Nearest Neighbors.

            A linear classification model: Logistic Regression.

Data preparation

    (a) Run the following cell to import the pandas and numpy modules needed for the exercise.


In [1]:
import pandas as pd
import numpy as np
%matplotlib inline

In [None]:
(b) Load the data contained in the file 'votes.csv' into a DataFrame named votes.

In [None]:
# Insert your code here

votes = pd.read_csv('votes.csv')
votes.head(3)

In [5]:
votes = pd.read_clipboard()
votes.head(3)

Unnamed: 0,party,infants,water,budget,physician,salvador,religious,satellite,aid,missile,immigration,synfuels,education,superfund,crime,duty_free_exports,eaa_rsa
0,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n
2,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n,n


In [None]:
In order to briefly visualize our data:

    (c) Display the number of rows and columns of votes.

    (d) Show a preview of the first 20 rows of votes.

# Shape of the DataFrame
print('The DataFrame has', votes.shape[0], 'rows and', votes.shape[1], 'columns.')

# Display the first 20 rows
votes.head(20)

In [None]:

        The first column "party" contains the name of the political party to which each member of the House of Representatives belongs. This is the target variable.

        The following 16 columns contain the votes of each member of Congress on legislative proposals:

            'y' indicates that the elected member voted for the bill.

            'n' indicates that the elected member voted against the bill.

    In order to use the data in a classification model, it is necessary to transform these columns into binary numeric values, i.e. either 0 or 1.

    (e) For each of the columns 1 to 16 (column 0 being our target variable), replace the values 'y' by 1 and 'n' by 0. To do so, we can use the replace method from the DataFrame class.

    (f) Display the first 10 rows of the modified DataFrame.



In [None]:
# Replacing the values
votes = votes.replace(('y', 'n'), (1, 0))

# Display the first 10 rows of the DataFrame
votes.head(10)

## Separation of the variables
    (g) In a DataFrame named X, store the explanatory variables of the dataset (all columns except 'party'). For this, you can use the drop method of a DataFrame.

    (h) In a DataFrame named y, store the target variable ('party').



In [None]:
# Separation of the variables

X = votes.drop(['party'], axis = 1)
y = votes['party']


## train_test_split
    As for the regression problem, we will have to split the data set into 2 sets: a training set and a test set. As a reminder:

            The training set is used to train the classification model, that is to say find the parameters of the model which best separate the classes.

            The test set is used to evaluate the model on data that it has never seen. This evaluation will allow us to judge the generalizability of the model.

    (i) Import the train_test_split function from the sklearn.model_selection submodule. Remember that this function is used as follows:

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

    (j) Split the data into a training set (X_train, y_train) and a test set (X_test, y_test) keeping 20% of the data for the test set.

    To eliminate the randomness of the train_test_split function, you can use the random_state parameter with an integer value (for example random_state = 2). Every time you call the function with random_state = 2 on the same data, the datasets produced will be the same.
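
The effect of random_state can be checked with a small sketch. The arrays X_demo and y_demo are hypothetical stand-ins for X and y.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y: 10 observations, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state return identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)        # True
print(split_a[1].shape)  # (2, 2): 20% of the 10 observations go to the test set
```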



In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2)

## Non-linear classification: K-Nearest Neighbors model

    In order to assign a class to an observation, the K-Nearest Neighbors algorithm considers, as its name suggests, the K nearest neighbors of the observation and determines the most represented class among these neighbors.

    Concretely, the algorithm is as follows:

            Suppose that K = 5.

            For an observation that we want to classify, we will look at the 5 points of the training set that are closest to our observation. The distance metric used is often the Euclidean distance.

            If among the 5 neighbors, the majority is "democrat", then the observation will be classified "democrat".
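
The steps above can be sketched from scratch with NumPy. This is an illustration of the idea, not the implementation used by scikit-learn. The function knn_predict and the toy arrays are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance between x_new and every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most represented class among the k neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical binary votes on 3 bills, with the party of each member.
X_toy = np.array([[1, 1, 0], [1, 1, 1], [1, 0, 0], [0, 0, 1],
                  [0, 1, 1], [0, 0, 0], [1, 1, 0]])
y_toy = np.array(['democrat', 'democrat', 'democrat', 'republican',
                  'republican', 'republican', 'democrat'])

print(knn_predict(X_toy, y_toy, np.array([1, 1, 0]), k=5))  # democrat
print(knn_predict(X_toy, y_toy, np.array([0, 0, 1]), k=3))  # republican
```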

    To train a K-Nearest Neighbors model for our problem, we'll use the KNeighborsClassifier class from the neighbors submodule of scikit-learn.

    The number K of neighbors to consider is entered using the parameter n_neighbors of the KNeighborsClassifier constructor.

    (k) Run the following cell to import the KNeighborsClassifier class.



In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:

    (l) Instantiate a KNeighborsClassifier model named knn which will consider the 6 nearest neighbors for classification.

    (m) Using the fit method, train the model knn on the training dataset.

    (n) Using the predict method, perform a prediction on the test dataset. Store these predictions in y_pred_test_knn and display the first 10 predictions.



In [None]:
# Insert your code here

knn = KNeighborsClassifier(n_neighbors = 6)

knn.fit(X_train, y_train)

y_pred_test_knn = knn.predict(X_test)

y_pred_test_knn[:10]

## Linear Classification: Logistic Regression

    The logistic regression model is closely related to the linear regression model seen in the previous notebook.

    They should not be confused since they do not solve the same types of problems:

            Logistic Regression is used for classification (predict classes).

            Linear regression is used for regression (predict a quantitative variable).

    The linear regression model was defined with the following formula:

    $$y \approx \beta_0 + \sum_{j=1}^{p} \beta_j x_j$$

    Logistic regression no longer estimates $y$ directly but the probability that $y$ is equal to 0 or 1. Thus, the model is defined by the formula:

    $$P(y = 1) = f\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\right)$$

    where

    $$f(x) = \frac{1}{1 + e^{-x}}$$

    The function $f$, often called the sigmoid or logistic function, transforms the linear combination $\beta_0 + \sum_{j=1}^{p} \beta_j x_j$ into a value between 0 and 1 that can be interpreted as a probability:

            If $\beta_0 + \sum_{j=1}^{p} \beta_j x_j$ is positive, then $P(y = 1) > 0.5$, so the predicted class of the observation will be 1.

            If $\beta_0 + \sum_{j=1}^{p} \beta_j x_j$ is negative, then $P(y = 1) < 0.5$, i.e. $P(y = 0) > 0.5$, so the predicted class of the observation will be 0.
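
The decision rule above can be illustrated with a few lines of NumPy. The coefficient values beta_0 and beta and the observation x are hypothetical, chosen only for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and observation (p = 2 explanatory variables).
beta_0 = -1.0
beta = np.array([2.0, -0.5])
x = np.array([1.0, 1.0])

z = beta_0 + beta @ x          # linear combination: -1 + 2 - 0.5 = 0.5
p = sigmoid(z)                 # estimated P(y = 1)
predicted_class = int(p > 0.5)

print(z, round(p, 4), predicted_class)  # 0.5 0.6225 1
```

Since the linear combination is positive, the probability is above 0.5 and the predicted class is 1.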

    (o) Import the LogisticRegression class from the linear_model submodule of scikit-learn.

    (p) Instantiate a LogisticRegression model named logreg without specifying constructor arguments.

    (q) Train the model on the training dataset.

    (r) Make a prediction on the test dataset. Store these predictions in y_pred_test_logreg and display the first 10 predictions.


In [None]:
# Insert your code here

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()

logreg.fit(X_train, y_train)

y_pred_test_logreg = logreg.predict(X_test)

y_pred_test_logreg[:10]
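
To connect the predictions with the probabilities estimated by the model, here is a small sketch using the predict_proba method and the classes_ attribute of LogisticRegression. The data X_demo and y_demo are hypothetical and much smaller than the votes dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: two binary votes per member and a party label.
X_demo = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]])
y_demo = np.array(['democrat', 'democrat', 'democrat',
                   'republican', 'republican', 'republican'])

logreg_demo = LogisticRegression().fit(X_demo, y_demo)

# One row per observation, one column per class, in the order of classes_.
proba = logreg_demo.predict_proba(X_demo)
print(logreg_demo.classes_)
print(proba.round(3))

# predict returns the class with the largest estimated probability.
labels_from_proba = logreg_demo.classes_[proba.argmax(axis=1)]
print((labels_from_proba == logreg_demo.predict(X_demo)).all())
```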

## 2. Evaluate the performance of a classification model

    There are different metrics to evaluate the performance of classification models such as:

            The accuracy.

            The precision and the recall.

            The F1-score.

    Each metric assesses the performance of the model with a different approach.

    In order to explain these concepts, we will introduce 4 very important terms.

    Arbitrarily, we will choose that the class 'republican' will be the positive class (1) and 'democrat' will be the negative class (0).

    Thus, we will call:

            True Positive (TP) an observation classified as positive ('republican') by the model which is indeed positive ('republican').

            False Positive (FP) an observation classified as positive ('republican') by the model which was actually negative ('democrat').

            True Negative (TN) an observation classified as negative ('democrat') by the model and which is indeed negative ('democrat').

            False Negative (FN) an observation classified as negative ('democrat') by the model which was actually positive ('republican').

<img src="Photos\sklearn_intro_positif_negatif_en.png" width="700" height="400">

The accuracy is the most common metric used to evaluate a model. It simply corresponds to the rate of correct predictions made by the model.

We suppose that we have $n$ observations. We denote by TP the number of True Positives and by TN the number of True Negatives. Then the accuracy is given by:

$$\text{accuracy} = \frac{TP + TN}{n}$$

The precision is a metric which answers the question: Among all the positive predictions of the model, how many are true positives? If we denote by FP the number of False Positives of the model, then the precision is given by:

$$\text{precision} = \frac{TP}{TP + FP}$$

A high precision score tells us the model does not blindly classify everyone as positive.

The recall is a metric that quantifies the proportion of truly positive observations that were correctly classified as positive by the model.

If we write FN as the number of False Negatives, then the recall is given by:

$$\text{recall} = \frac{TP}{TP + FN}$$

A high recall score tells us the model is able to properly detect the truly positive observations.
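
The four counts and the three formulas above can be sketched directly with NumPy. The label arrays y_true_demo and y_pred_demo are hypothetical.

```python
import numpy as np

# Hypothetical true labels and predictions for 6 observations.
y_true_demo = np.array(['republican', 'democrat', 'republican',
                        'democrat', 'democrat', 'republican'])
y_pred_demo = np.array(['republican', 'democrat', 'democrat',
                        'democrat', 'republican', 'republican'])
positive = 'republican'

# Count each of the four cases.
TP = int(np.sum((y_pred_demo == positive) & (y_true_demo == positive)))
FP = int(np.sum((y_pred_demo == positive) & (y_true_demo != positive)))
TN = int(np.sum((y_pred_demo != positive) & (y_true_demo != positive)))
FN = int(np.sum((y_pred_demo != positive) & (y_true_demo == positive)))

accuracy = (TP + TN) / len(y_true_demo)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print(TP, FP, TN, FN)  # 2 1 2 1
print(round(accuracy, 3), round(precision, 3), round(recall, 3))  # 0.667 0.667 0.667
```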

The confusion matrix counts the values of TP, TN, FP and FN for a set of predictions, which allows us to calculate the three previous metrics:

$$\text{ConfusionMatrix} = \begin{bmatrix} TN & FP \\ FN & TP \end{bmatrix}$$

    The confusion_matrix function of the sklearn.metrics submodule generates the confusion matrix from the predictions of a model:

    confusion_matrix(y_true, y_pred)

    As a reminder:

            y_true contains the true values of y.

            y_pred contains the values of y predicted by the model.
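
Here is a short sketch of confusion_matrix on hypothetical labels. Rows correspond to the true classes and columns to the predicted classes. The labels argument is used here to make the class order explicit: the negative class first, then the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for 5 observations.
y_true_demo = ['democrat', 'democrat', 'democrat', 'republican', 'republican']
y_pred_demo = ['democrat', 'republican', 'democrat', 'republican', 'republican']

# Explicit class order: index 0 = 'democrat' (negative), index 1 = 'republican' (positive).
cm_demo = confusion_matrix(y_true_demo, y_pred_demo, labels=['democrat', 'republican'])
print(cm_demo)
# [[2 1]
#  [0 2]]

# Flattening the matrix row by row gives TN, FP, FN, TP.
tn, fp, fn, tp = cm_demo.ravel()
print(tn, fp, fn, tp)  # 2 1 0 2
```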

    (a) Import the confusion_matrix function from sklearn.metrics.

    (b) Calculate the confusion matrix of the predictions produced by the model knn. These predictions were stored in y_pred_test_knn.

    (c) Display the confusion matrix. How many false positives have occurred? The positive class corresponds to 'republican'.

    (d) Using the formulas given above, calculate the accuracy, precision, and recall scores of the knn model on the test set. You can use tuple assignment to deconstruct the confusion matrix:

**(TN, FP), (FN, TP) = confusion_matrix(y_true, y_pred)**



In [None]:
# Insert your code here

# NOTE: confusion_matrix sorts the labels alphabetically, so
# 'democrat' is the negative class (index 0)
# 'republican' is the positive class (index 1)
from sklearn.metrics import confusion_matrix

# Confusion matrix of the KNN predictions on the test set
cm = confusion_matrix(y_test, y_pred_test_knn)
print('Confusion matrix:\n', cm)
print('First row (TN, FP):', cm[0])
print('True Negatives:', cm[0, 0])

# Unpack the four counts and apply the formulas
(TN, FP), (FN, TP) = cm
accuracy = (TP + TN) / len(y_test)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print('Accuracy :', accuracy, '\nPrecision :', precision, '\nRecall :', recall)

In [None]:
from sklearn.metrics import confusion_matrix

# Computation and display of the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_test_knn)
print("Confusion Matrix:\n",  conf_matrix)

print("\nThe knn model made", conf_matrix[0,1], "False Positives.")

# Computation of the accuracy, precision and recall
(TN, FP), (FN, TP) = confusion_matrix(y_test, y_pred_test_knn)
n = len(y_test)

print("\nKNN Accuracy:", (TP + TN) / n)

print("\nKNN Precision:", TP / (TP + FP))

print("\nKNN Recall:", TP / (TP + FN))

In [None]:


    The display of the confusion matrix can also be done with the pd.crosstab function as we had done in a previous notebook:

    pd.crosstab(y_test, y_pred_test_knn, rownames = ['Reality'], colnames = ['Prediction'])

    Which in our case will produce the following DataFrame:

    Prediction  democrat  republican
    Reality
    democrat          48           5
    republican         2          32

    For this dataset, the KNN model performs quite well. When the classes are balanced, i.e. there are about as many positives as there are negatives in the dataset, accuracy is a good enough metric to assess the performance.

    However, as you will see later, when a class is dominant, precision and recall are much more relevant metrics.
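
As a worked check, the counts of the table above (TN = 48, FP = 5, FN = 2, TP = 32, with 'republican' as the positive class and 87 observations in total) can be plugged into the three formulas:

$$\text{accuracy} = \frac{32 + 48}{87} \approx 0.920 \qquad \text{precision} = \frac{32}{32 + 5} \approx 0.865 \qquad \text{recall} = \frac{32}{32 + 2} \approx 0.941$$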

    If you think you cannot remember the formulas for the metrics of accuracy, precision and recall, do not worry! The sklearn.metrics submodule contains functions to calculate them quickly:

    accuracy_score(y_test, y_pred_test_knn)
    >>> 0.9195402298850575

    (e) Import the accuracy_score, precision_score and recall_score functions from the sklearn.metrics submodule.

    (f) Display the confusion matrix of the predictions made by the logreg model using pd.crosstab.

    (g) Calculate the accuracy, precision and recall of the predictions of the logreg model. To use the precision_score and recall_score metrics, you will need to fill in the argument pos_label = 'republican' in order to specify that the 'republican' class is the positive class.



In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Computation and display of the confusion matrix
# (print is needed because only the last expression of a cell is displayed automatically)
print(pd.crosstab(y_test, y_pred_test_logreg, rownames=['Reality'], colnames=['Prediction']))

# Computation of the accuracy, precision and recall
print("\nLogReg Accuracy:", accuracy_score(y_test, y_pred_test_logreg))

print("\nLogReg Precision:", precision_score(y_test, y_pred_test_logreg, pos_label = 'republican'))

print("\nLogReg Recall:", recall_score(y_test, y_pred_test_logreg, pos_label = 'republican'))

## Classification Report

    The classification_report function of the sklearn.metrics submodule displays all these metrics for each class.

    (h) Import the classification_report function from the sklearn.metrics submodule.

    (i) Using the print and classification_report functions, display the classification reports of the models logreg and knn on the test set.



In [None]:
from sklearn.metrics import classification_report

print("LogReg report:\n", classification_report(y_test, y_pred_test_logreg))

print("\n\n")

print("KNN report:\n", classification_report(y_test, y_pred_test_knn))
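
If the values of the report are needed in code rather than as printed text, classification_report also accepts the argument output_dict = True, which returns a dictionary. A minimal sketch on hypothetical labels:

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions for 5 observations.
y_true_demo = ['democrat', 'democrat', 'democrat', 'republican', 'republican']
y_pred_demo = ['democrat', 'republican', 'democrat', 'republican', 'republican']

report_dict = classification_report(y_true_demo, y_pred_demo, output_dict=True)

# One entry per class, each holding precision, recall, f1-score and support.
print(report_dict['republican']['precision'])
print(report_dict['republican']['recall'])
print(report_dict['accuracy'])
```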

## F1-Score

    The classification report is a little more complete than what we have done so far. It contains an additional metric: the F1-Score.

    The F1-Score is the harmonic mean of precision and recall, so it is high only when both are high. The F1-Score adapts very well to classification problems with balanced or unbalanced classes.

    For most classification problems, the model with the highest F1-Score will be considered the model whose recall and precision performances are the most balanced, and is therefore preferable to others.
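
The F1-Score is computed as 2 × precision × recall / (precision + recall). The sketch below applies this formula to the KNN counts of the crosstab shown earlier (TP = 32, FP = 5, FN = 2), then to a hypothetical unbalanced case to show how a low recall pulls the score down.

```python
def f1_from_scores(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Counts taken from the KNN crosstab ('republican' is the positive class).
TP, FP, FN = 32, 5, 2
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1_knn = f1_from_scores(precision, recall)
print(round(f1_knn, 4))  # 0.9014

# Hypothetical model with perfect precision but very low recall:
# the arithmetic mean would be 0.55, the F1-Score is much lower.
f1_unbalanced = f1_from_scores(1.0, 0.1)
print(round(f1_unbalanced, 4))  # 0.1818
```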

    (j) Import the f1_score function from the sklearn.metrics submodule.

    (k) Compare the F1-Scores of the models knn and logreg on the test set. Which model has the best performance? As for the recall and the precision, it will be necessary to fill in the argument pos_label = 'republican'.



In [None]:
from sklearn.metrics import f1_score

print("F1 KNN:", f1_score(y_test, y_pred_test_knn, pos_label = 'republican'))

print("F1 LogReg:", f1_score(y_test, y_pred_test_logreg, pos_label = 'republican'))

## Conclusion and recap

    Scikit-learn offers many classification models that can be grouped into two families:

            Linear models like LogisticRegression.

            Non-linear models like KNeighborsClassifier.

    The implementation of these models is done in the same way for all models of scikit-learn:

            Instantiation of the model.

            Training of the model: model.fit(X_train, y_train).

            Prediction: model.predict(X_test).

    The prediction on the test set allows us to evaluate the performance of the model thanks to suitable metrics.

    The metrics we have seen are used for binary classification and are calculated using 4 values:

            True Positives: Prediction = + | Reality = +

            True Negatives: Prediction = - | Reality = -

            False Positives: Prediction = + | Reality = -

            False Negatives: Prediction = - | Reality = +

    All these values can be calculated using the confusion matrix generated by the confusion_matrix function of the sklearn.metrics submodule or by the pd.crosstab function.

    Thanks to these values, we can calculate metrics like:

            Accuracy: The proportion of correctly classified observations.

            Precision: The proportion of true positives among all the positive predictions of the model.

            Recall: the proportion of truly positive observations that were correctly classified as positive by the model.

    All these metrics can be obtained using the classification_report function of the sklearn.metrics submodule.

    The F1-Score quantifies the balance between these metrics, which gives us a reliable criterion for choosing the model most suited to our problem.



## RECAP Classification KNN, LogisticRegression

In [None]:
# RECAP Classification: KNN and Logistic Regression

# Import Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.metrics import confusion_matrix, classification_report, f1_score


# Load Data (the votes dataset of this notebook) and encode 'y' as 1 and 'n' as 0
df = pd.read_csv('votes.csv')
df = df.replace(('y', 'n'), (1, 0))


# Explanatory variables
X = df.drop(['party'], axis = 1)


# Target variable
y = df['party']


# Splitting the dataset into a training set (85%) and a test set (15%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)


#--------- KNN
# Instantiation of the model
knn = KNeighborsClassifier(n_neighbors = 6)

# Training the model
knn.fit(X_train, y_train)

# Prediction of the target variable for the TEST dataset
y_pred_test_knn = knn.predict(X_test)


#---------LogisticRegression
# Instantiation of the model
logreg = LogisticRegression()

# Training the model
logreg.fit(X_train, y_train)

# Prediction of the target variable for the TEST dataset
y_pred_test_logreg = logreg.predict(X_test)


# -------- Confusion Matrix (KNN)
# Computation and display of the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_test_knn)
print("KNN Confusion Matrix:\n", conf_matrix)

# Computation of the accuracy, precision and recall from the four counts
(TN, FP), (FN, TP) = conf_matrix
n = len(y_test)

print("\nKNN Accuracy:", (TP + TN) / n)

print("\nKNN Precision:", TP / (TP + FP))

print("\nKNN Recall:", TP / (TP + FN))


# -------- Confusion Matrix (LogReg)
# Computation and display of the confusion matrix
print(pd.crosstab(y_test, y_pred_test_logreg, rownames=['Reality'], colnames=['Prediction']))

# Computation of the accuracy, precision and recall
print("\nLogReg Accuracy:", accuracy_score(y_test, y_pred_test_logreg))

print("\nLogReg Precision:", precision_score(y_test, y_pred_test_logreg, pos_label = 'republican'))

print("\nLogReg Recall:", recall_score(y_test, y_pred_test_logreg, pos_label = 'republican'))


# -------- Classification Report
print("LogReg report:\n", classification_report(y_test, y_pred_test_logreg))

print("\n\n")

print("KNN report:\n", classification_report(y_test, y_pred_test_knn))


# -------- F1-Score
print("F1 KNN:", f1_score(y_test, y_pred_test_knn, pos_label = 'republican'))

print("F1 LogReg:", f1_score(y_test, y_pred_test_logreg, pos_label = 'republican'))