# Logistic Regression

More information at: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression


## Classification metrics

More detail at (https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)

Performance evaluation is based on the confusion matrix. The most common metrics are:
- Accuracy
- Precision (P)
- Recall (R)
- F1 score (F1)
- Area under the ROC (Receiver Operating Characteristic) curve or simply Area Under Curve (AUC)
- Matthews Correlation Coefficient (MCC)
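
To make the list above concrete, here is a minimal sketch that derives most of these metrics by hand from the four cells of a binary confusion matrix and compares them with the scikit-learn functions. The labels `y_true` and `y_pred` are made-up toy values, not data from this tutorial.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             matthews_corrcoef, precision_score, recall_score)

# Toy labels (hypothetical): 4 positives and 6 negatives
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mcc = (tp * tn - fp * fn) / np.sqrt(
    float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
)

print("accuracy :", accuracy, accuracy_score(y_true, y_pred))
print("precision:", precision, precision_score(y_true, y_pred))
print("recall   :", recall, recall_score(y_true, y_pred))
print("f1       :", f1, f1_score(y_true, y_pred))
print("mcc      :", mcc, matthews_corrcoef(y_true, y_pred))
```

Each line prints the hand-computed value next to the library value, so the two columns should agree.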

In [None]:
from IPython import display
display.Image("Image/ConfusionMatrix1.png")

### Micro/Macro Metrics

For multiclass classification we use micro or macro averages. A macro-average computes the metric independently for each class and then takes the unweighted mean (hence treating all classes equally), whereas a micro-average aggregates the contributions of all classes (the pooled true positives, false positives and false negatives) before computing the metric. In a multi-class classification setup, micro-average is preferable if you suspect there might be class imbalance (i.e., you may have many more examples of one class than of other classes).
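
As a small illustration of the difference, the sketch below computes the recall with both averaging modes on made-up, imbalanced toy labels (not data from this tutorial). The `average` argument of `recall_score` selects the mode.

```python
from sklearn.metrics import recall_score

# Toy labels (hypothetical): class 0 is frequent, classes 1 and 2 are rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]

# Per-class recall: class 0 -> 8/8, class 1 -> 0/1, class 2 -> 1/1
per_class = recall_score(y_true, y_pred, average=None)

# Macro: unweighted mean of the per-class recalls
macro = recall_score(y_true, y_pred, average="macro")

# Micro: pool all samples first (9 correct out of 10)
micro = recall_score(y_true, y_pred, average="micro")

print("per class:", per_class)
print("macro    :", macro)
print("micro    :", micro)
```

The micro value follows the frequent class, while the macro value drops as soon as one rare class is missed entirely.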

Nice tutorial at (https://iamirmasoud.com/2022/06/19/understanding-micro-macro-and-weighted-averages-for-scikit-learn-metrics-in-multi-class-classification-with-example/)


## Simple logistic regression

Let's start by generating a sample with three intersecting clusters

In [None]:
# Do not modify
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
from sklearn.metrics import ConfusionMatrixDisplay

# we create three clusters of random points
n_samples = 1000
centers = 3
clusters_std = 4
X, y = make_blobs(
    n_samples=n_samples,
    n_features=2,
    centers=centers,
    cluster_std=clusters_std,
    random_state=42,
    shuffle=False,
)

Plot the samples

In [None]:
# plot the samples


Split the data into train and test

Fit the model and get the separating hyperplane: use the LogisticRegression class

Make the predictions on the test input data

Classification report and confusion matrix

In [None]:
# Do not modify
from sklearn.metrics import classification_report

# Complete here


In [None]:
# Complete here


## Balanced/Unbalanced Dataset

Let's now create two clusters with very different sample sizes

In [None]:
import matplotlib.pyplot as plt

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

# we create two clusters of random points with very different sample sizes
n_samples_1 = 10000
n_samples_2 = 1000
centers = [[0.0, 0.0], [3.0, 3.0]]
clusters_std = [1.5, 0.5]
X, y = make_blobs(
    n_samples=[n_samples_1, n_samples_2],
    centers=centers,
    cluster_std=clusters_std,
    random_state=42,
    shuffle=False,
)

Plot the samples

In [None]:
# plot the samples


Split train and test data

Fit the model

Make predictions

Evaluate the predictions with `accuracy_score` and `matthews_corrcoef`: what do you notice?

In [None]:
# Do not modify
from sklearn.metrics import accuracy_score, matthews_corrcoef

# Calculate using Accuracy


In [None]:
# Calculate using MCC


Different strategies to deal with this problem:
- Collecting more data
- Using the right evaluation metrics
- Under-sampling the majority class
- Over-sampling the minority class
- Including a cost (class weights) in your model definition
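
As a sketch of the last strategy, `LogisticRegression` accepts a `class_weight` argument; with `class_weight="balanced"` the errors on the minority class are weighted more heavily. The example below uses a smaller, freshly generated imbalanced sample (the sizes and the names `X_demo`, `y_demo` are illustrative choices, not the tutorial's data) and compares the recall on the minority class.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced toy sample: 2000 points in class 0, 200 in class 1
X_demo, y_demo = make_blobs(
    n_samples=[2000, 200],
    centers=[[0.0, 0.0], [3.0, 3.0]],
    cluster_std=[1.5, 0.5],
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Same model, without and with class weights
plain = LogisticRegression().fit(X_tr, y_tr)
weighted = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

# Recall on the minority class (label 1) and MCC for both models
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print("recall (no weights):", recall_plain)
print("recall (balanced)  :", recall_weighted)
print("MCC (no weights):", matthews_corrcoef(y_te, plain.predict(X_te)))
print("MCC (balanced)  :", matthews_corrcoef(y_te, weighted.predict(X_te)))
```

Weighting usually trades some precision on the minority class for a higher recall, so it is worth checking more than one metric.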

## Stratified Train/Test/Validation split

Stratified sampling divides the population into groups, called 'strata', based on a characteristic, and then draws the samples so that each stratum appears in the same proportion as in the population. Let's analyze a more complex dataset and make a stratified split.
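
A minimal sketch of the idea on made-up labels: `train_test_split` takes a `stratify` argument, and the class proportions are then preserved in both parts. The arrays `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 1000 samples, 10% of them in class 1
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 100 + [0] * 900)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# The fraction of class 1 stays close to 10% in both parts
print("train fraction of class 1:", y_tr.mean())
print("test fraction of class 1 :", y_te.mean())
```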

In [None]:
# Do not modify
# Load the aggregated dataset metadata
import json

with open('./Data/tutorial_metadata.json', 'r') as f:
    metadata = json.load(f)

print(metadata.keys())

In [None]:
# Do not modify
import pandas as pd
newdf = pd.DataFrame(metadata)
newdf.head()

The `covariates` column in this dataset contains a lot of useful information:

In [None]:
# Do not modify
newdf["covariates"][0]

Find a way to add each datum in the `covariates` column as a new column to the main dataframe.
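
One possible approach, sketched on a toy dataframe: if every entry of the column is a dictionary (an assumption about the data, to be checked against the output of the previous cell), `pd.json_normalize` turns the dictionaries into columns that can be concatenated to the rest. The names `toy`, `expanded` and `toy_flat` and all values are hypothetical.

```python
import pandas as pd

# Toy dataframe (hypothetical) with a column of dictionaries
toy = pd.DataFrame({
    "subject": ["a", "b", "c"],
    "covariates": [
        {"age": 40.0, "female": 1},
        {"age": 63.0, "female": 0},
        {"age": None, "female": 1},
    ],
})

# One column per dictionary key, one row per original row
expanded = pd.json_normalize(toy["covariates"].tolist())

# Attach the new columns and drop the original nested column
toy_flat = pd.concat([toy.drop(columns="covariates"), expanded], axis=1)
print(toy_flat)
```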

In [None]:
df.info()

Now let's focus on the `age`: delete the `NaN` values and create a new column `AgeGroup` to binarize the `age` datum into two classes: `'50-'` if 50 years or below, `'51+'` if 51 years or above. 
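
A possible sketch on toy values (the dataframe `toy` is hypothetical): `dropna` with `subset` removes the rows without an age, and `np.where` assigns one of the two labels.

```python
import numpy as np
import pandas as pd

# Toy ages (hypothetical), including a missing value
toy = pd.DataFrame({"age": [34.0, np.nan, 50.0, 51.0, 72.0]})

# Drop the rows where the age is missing
toy = toy.dropna(subset=["age"]).copy()

# '50-' for 50 years or below, '51+' otherwise
toy["AgeGroup"] = np.where(toy["age"] <= 50, "50-", "51+")
print(toy)
```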

In [None]:
import numpy as np

# Delete NaN


# Binarize Age Data


How populated are the two new labels of `AgeGroup`?

In [None]:
# Print the number of samples in each age group


Let's try and make a stratified split over three categorical variables: `outcome`, `female` and `AgeGroup`
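
One common way to stratify over several categorical variables at once is to combine them into a single key and pass that key to `stratify`. The sketch below does this on a balanced toy dataframe; the values are made up and only the column names mirror the ones in the text.

```python
import itertools

import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataframe (hypothetical): every combination of the three variables, 20 times
combos = list(itertools.product([0, 1], [0, 1], ["50-", "51+"]))
toy = pd.DataFrame(combos * 20, columns=["outcome", "female", "AgeGroup"])

# Single stratification key, e.g. "1_0_51+"
strata = (
    toy["outcome"].astype(str) + "_"
    + toy["female"].astype(str) + "_"
    + toy["AgeGroup"]
)

toy_train, toy_test = train_test_split(
    toy, test_size=0.25, stratify=strata, random_state=0
)

# Every combination should be represented in the test part
print(strata.loc[toy_test.index].value_counts())
```

For a train/validation/test split the same call can be applied twice, the second time on the training part with a key rebuilt from that part.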

In [None]:
# Do not modify
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets (Stratified Split)


Let's check whether we split the data uniformly with respect to the outcome

In [None]:
# Do not modify
# Create a Deliverable table
data = [[len(train_data)/len(df), len(train_data[train_data["outcome"] == 0])/len(train_data)],
        [len(val_data)/len(df), len(val_data[val_data["outcome"] == 0])/len(val_data)],
        [len(test_data)/len(df), len(test_data[test_data["outcome"] == 0])/len(test_data)]]

# Create the pandas DataFrame with explicit column names
df_Final = pd.DataFrame(data, columns=['Percentage of Patients', 'Percentage of Not Recovered'])
df_Final

Check the other two selected features:

## Cross Validation

Sometimes the dataset is too small to train a model and still obtain an unbiased estimate of its performance. In these cases, cross validation can prove very useful: the training and testing phases are repeated a number of times, and each time a different part of the dataset is held out as the test set.

In [None]:
display.Image("Image/CrossValidation.png")

Let's try with a very limited dataset

In [None]:
# Do not modify
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

X, y = make_classification(
    n_samples=100,
    n_features=8,
    n_informative=6,
    n_classes=2,
    random_state=42
)

In [None]:
# Do not modify
# Define classifier
clf = LogisticRegression(penalty=None)

In [None]:
# Do not modify
# Split data in Train and Test
X_, X_test, y_, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Stratified KFold
cv = StratifiedKFold(5, shuffle=True, random_state=42)

for cv_ind, (train, val) in enumerate(cv.split(X_, y_)):
    clf.fit(X_[train], y_[train])
    y_pred_val = clf.predict(X_[val])
    print("Validation Accuracy score: ", accuracy_score(y_[val], y_pred_val))


y_pred_test = clf.predict(X_test)
print("Test Accuracy score: ", accuracy_score(y_test, y_pred_test))

Depending on which part of the data is held out, the accuracy can vary greatly (from 0.42 to 0.85). Cross validation reduces this dependence on a single split.
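
The fold-by-fold loop can also be written with `cross_val_score`, which returns one score per fold so that the mean and the spread can be reported together. This is a sketch under a few assumptions: it regenerates a sample with the same `make_classification` settings, uses a default `LogisticRegression` with `max_iter=1000`, and runs on the whole sample rather than on a training split.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Small sample with the same generator settings as in the cells above
X_cv, y_cv = make_classification(
    n_samples=100, n_features=8, n_informative=6, n_classes=2, random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One accuracy value per fold
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_cv, y_cv, cv=cv, scoring="accuracy"
)

print("fold accuracies:", scores)
print("mean +/- std   :", scores.mean(), "+/-", scores.std())
```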

Is the obtained accuracy the best? We can make sure by doing a grid search in the space of the parameters.

In [None]:
# Do not modify
# Split data in Train and Test
X_, X_test, y_, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Stratified KFold
cv = StratifiedKFold(5, shuffle=True, random_state=42)

# Parameter for GridSearch
parameters = {'penalty':('l2', None)}

search = GridSearchCV(clf,
                   parameters,
                   cv=cv,
                   refit=False
                   )

# Run the grid search; with refit=False the best model is not refitted automatically
search.fit(X_, y_)

In [None]:
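# Do not modify
# Refit manually with the best parameters (this does the same as refit=True)
clf_final = clf.set_params(**search.best_params_)
clf_final.fit(X_, y_)
y_pred_test = clf_final.predict(X_test)
print("Test Accuracy score: ", accuracy_score(y_test, y_pred_test))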

## Iris dataset

Let's now apply these techniques to some classical datasets. The Iris dataset collects four features of the observations of three species of iris flower. The aim is to correctly classify the data points into the three species.

In [None]:
# Do not modify
# necessary imports
import time
import matplotlib.pyplot as plt
import numpy as np
from random import randint
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, f1_score, ConfusionMatrixDisplay
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [None]:
# Do not modify
# loading the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

Please check out the Pipeline class on sklearn

In [None]:
# Do not modify
myLogReg = Pipeline(steps=[
    ("StandardScaler", StandardScaler()),
    ("LogReg", LogisticRegression())
])

### Binary classification

Extract classes 0 and 1 from the dataset

Split train and test datasets

Fit the data

Show the confusion matrix and the F1 score

### Multinomial classification

Repeat, this time using all three classes

In [None]:
# split train/test dataset


In [None]:
# training

In [None]:
# evaluation


## MNIST dataset

The MNIST dataset is a set of 28x28 pixel images of handwritten digits from 0 to 9. Each image must thus be correctly classified into one of the 10 classes.

In [None]:
# Do not modify
# reading the dataset
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

In [None]:
# Do not modify
# display one randomly chosen picture

i = randint(0,len(y_train)-1)
plt.imshow(x_train[i],aspect="auto",cmap='gray')
plt.show()
print("The true class is : ", y_train[i])

In [None]:
# Do not modify
# reshaping the data (a 2d image is transformed to a 1d array)
#train
n = x_train.shape[0]
x_trainLin = x_train.reshape(n,-1)
#test
n = x_test.shape[0]
x_testLin = x_test.reshape(n,-1)

In [None]:
# Creation of Pipeline


In [None]:
# Case of binary classification: let's choose 2 classes among the 10 classes: the 4's and the 8's


In [None]:
# Fit the logistic regression


In [None]:
# Compute and display the f1 score and the confusion matrix


The logistic regression has been built with default parameters. Follow the recommendations and try different options:
- algorithm termination: maximum number of iterations, ...
- data preprocessing: standardisation (recommended)
- solver (liblinear, sag, saga, ...)
- regularisation

Did you get better results? Did some configurations converge faster?
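
A rough sketch of how such a comparison could be organised. To keep it quick it uses scikit-learn's small 8x8 digits dataset (`load_digits`) instead of MNIST, restricted to the 4's and the 8's; the solver list and `max_iter=500` are arbitrary choices for illustration.

```python
import time

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in for MNIST: keep only the 4's and the 8's
X_d, y_d = load_digits(return_X_y=True)
mask = (y_d == 4) | (y_d == 8)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d[mask], y_d[mask], test_size=0.3, stratify=y_d[mask], random_state=0
)

results = {}
for solver in ["lbfgs", "liblinear", "saga"]:
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver=solver, max_iter=500),
    )
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start
    # n_iter_ holds the number of iterations actually used by the solver
    n_iter = int(model[-1].n_iter_.max())
    results[solver] = (model.score(X_te, y_te), elapsed, n_iter)
    print(solver, "accuracy:", results[solver][0],
          "time (s):", round(elapsed, 4), "iterations:", n_iter)
```

Timings depend on the machine and on the data size, so conclusions drawn on this small dataset may not carry over to MNIST.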

### OneVsOne classification of the 10 classes

Now try and use the OneVsOneClassifier for classifying all 10 classes via the consensus of $10(10-1)/2 = 45$ binary classifiers
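
The count of binary classifiers can be checked on a smaller problem. The sketch below uses scikit-learn's 8x8 digits dataset (`load_digits`) as a quick stand-in for MNIST, which is an assumption made only to keep the run short; after fitting, the `estimators_` attribute of `OneVsOneClassifier` holds one fitted classifier per pair of classes.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsOneClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small 10-class dataset used here in place of MNIST
X_d, y_d = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_d, y_d, test_size=0.3, stratify=y_d, random_state=0
)

# Standardise, then train one binary logistic regression per pair of classes
ovo = make_pipeline(
    StandardScaler(),
    OneVsOneClassifier(LogisticRegression(max_iter=1000)),
)
ovo.fit(X_tr, y_tr)

n_binary = len(ovo[-1].estimators_)
macro_f1 = f1_score(y_te, ovo.predict(X_te), average="macro")
print("number of binary classifiers:", n_binary)
print("macro F1 on the test part   :", macro_f1)
```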

In [None]:
# Do not modify
from sklearn.multiclass import OneVsOneClassifier

# reading the dataset
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# reshaping the data (a 2d image is transformed to a 1d array)
#train
n = x_train.shape[0]
x_trainLin = x_train.reshape(n,-1)
#test
n = x_test.shape[0]
x_testLin = x_test.reshape(n,-1)

In [None]:
# Define your pipeline


In [None]:
# Fit the data


In [None]:
# Confusion matrix and F1 score


Lastly, do the same with the multinomial regression you used for Iris. Does it do better or worse?