# Task A: Logistic Regression

## 0. Load the preprocessed data
**We use the preprocessed image dataset and test set produced by image processing and PCA. Each sample is described by 200 features and belongs to one of 2 classes (tumor or no tumor).**

In [6]:
# Import necessary libraries
import pickle
import numpy as np
import pandas as pd

In [7]:
# Load preprocessed image dataset with help of pickle.
with open('DataAfterProcess/images_AfterProcess.pickle', 'rb') as handle:
    X = pickle.load(handle)
    
with open('DataAfterProcess/label_AfterProcess.pickle', 'rb') as handle:
    y = pickle.load(handle)
    
# Check result.
print(X.shape, y.shape) 

(3000, 200) (3000,)


In [8]:
# Load preprocessed test dataset with help of pickle.
with open('DataAfterProcess/test_images_AfterProcess.pickle', 'rb') as handle:
    X_test = pickle.load(handle)
    
with open('DataAfterProcess/test_label_AfterProcess.pickle', 'rb') as handle:
    y_test = pickle.load(handle)
    
# Check result.
print(X_test.shape, y_test.shape) 

(200, 200) (200,)


### Training-Validation-Test 
**Split the preprocessed image dataset into a training set (90%) and a validation set (10%). The test set is the dataset from AMLS-2021_test.zip.**

In [9]:
# Import necessary libraries
from sklearn.model_selection import train_test_split

In [10]:
# Split the dataset into a training set (90%) and a validation set (10%).
# All random states in this assignment are fixed to ensure reproducibility.
X_train, X_val, y_train, y_val = train_test_split(X, y,
                                                  test_size=0.1, random_state=2)

# Check result.
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape, X_test.shape, y_test.shape) 
print('train set: {} | val set: {} | test set: {}'.format(round(len(y_train)/len(X),3), 
                                                          round(len(y_val)/len(X),3),
                                                          round(len(y_test)/len(X),3)))

(2700, 200) (2700,) (300, 200) (300,) (200, 200) (200,)
train set: 0.9 | val set: 0.1 | test set: 0.067
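
The classes are imbalanced (the validation report later in this notebook lists 50 vs. 250 samples). One option, not used in the split above, is to pass `stratify=y` so that both subsets keep the same class ratio. A minimal sketch on synthetic labels, assuming a 1:5 imbalance; `X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 3000 samples, 200 features, assumed 1:5 class imbalance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3000, 200))
y_demo = np.array([0] * 500 + [1] * 2500)

# stratify=y_demo keeps the class ratio the same in both subsets.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=2, stratify=y_demo
)

# Class counts per subset (index 0 = class 0, index 1 = class 1).
print(np.bincount(y_tr), np.bincount(y_va))
```

Without `stratify`, the class counts in each subset depend on the random seed, which matters more as the minority class gets smaller.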


## 1. Train a logistic regression model without hyperparameter tuning
**First, we train a logistic regression model without hyperparameter tuning, as a baseline to compare against the model selected by GridSearchCV in the next step.**

In [23]:
# Import necessary libraries
from sklearn.linear_model import LogisticRegression

In [31]:
# Create the LogisticRegression() model from sklearn and fit it to the training data.
# All random states in this assignment are fixed to ensure reproducibility.
logreg = LogisticRegression(solver='lbfgs', C=0.001, max_iter=1000, random_state=0)
logreg.fit(X_train, y_train)

LogisticRegression(C=0.001, max_iter=1000, random_state=0)

### Model validation
**Evaluate the model on the validation set using the confusion matrix, classification report, accuracy score and Cohen's kappa score.**

In [32]:
# Import necessary libraries
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, cohen_kappa_score

In [33]:
# Get predictions from the model.
y_pred = logreg.predict(X_val)

# Print model performance: confusion matrix, accuracy score, classification report and Cohen's kappa score.
print('Confusion matrix: \n', confusion_matrix(y_val, y_pred), '\n')
print('Accuracy on validation set: ', accuracy_score(y_val, y_pred), '\n')
print('Classification report: \n', classification_report(y_val, y_pred), '\n') 
print('cohen_kappa score：',cohen_kappa_score(y_val, y_pred))

Confusion matrix: 
 [[  8  42]
 [  3 247]] 

Accuracy on validation set:  0.85 

Classification report: 
               precision    recall  f1-score   support

           0       0.73      0.16      0.26        50
           1       0.85      0.99      0.92       250

    accuracy                           0.85       300
   macro avg       0.79      0.57      0.59       300
weighted avg       0.83      0.85      0.81       300
 

cohen_kappa score： 0.21511627906976738
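
To see why an accuracy of 0.85 goes together with a kappa of only about 0.22, the two numbers can be recomputed by hand from the confusion matrix above. The sketch below uses only the reported counts. The majority-class baseline is an added point of comparison, not something computed in the original notebook:

```python
# Confusion matrix reported above (rows: true class, columns: predicted class).
cm = [[8, 42],
      [3, 247]]

n = sum(sum(row) for row in cm)
row_totals = [sum(row) for row in cm]        # samples per true class
col_totals = [sum(col) for col in zip(*cm)]  # samples per predicted class

# Observed agreement is the plain accuracy.
p_observed = (cm[0][0] + cm[1][1]) / n

# Accuracy obtained by always predicting the majority class.
baseline_accuracy = max(row_totals) / n

# Agreement expected by chance, given the row and column totals.
p_expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2

# Cohen's kappa: agreement beyond chance, scaled by the maximum possible.
kappa = (p_observed - p_expected) / (1 - p_expected)

print(f"accuracy={p_observed:.3f}, majority baseline={baseline_accuracy:.3f}, kappa={kappa:.4f}")
```

Since 250 of the 300 validation samples belong to class 1, always predicting class 1 would already reach about 0.833 accuracy. Kappa corrects for that, so it stays low here.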


## 2. Hyperparameter tuning: GridsearchCV 
**Let's tune the hyperparameters of logistic regression with GridSearchCV(). Here we tune the regularization parameter C (the inverse of the regularization strength) and the maximum number of iterations max_iter.**

### Implement GridSearchCV

In [34]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV

In [35]:
# Create a dictionary called param_grid with candidate values for C and max_iter.
# Note: 10 appears twice in the C list, so the search runs 15 candidates but only 12 distinct settings.
param_grid = {'C': [0.001,0.01,1,10,10],
              'max_iter': [200,500,1000]}
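
As a quick check of how many fits this grid produces, the candidates can be enumerated by hand. This is a sketch using only the standard library. `deduplicated_grid` is a hypothetical cleaned-up variant, not the grid used for the results in this notebook:

```python
from itertools import product

# Grid as written in the notebook; C=10 is listed twice.
param_grid = {'C': [0.001, 0.01, 1, 10, 10],
              'max_iter': [200, 500, 1000]}

# Every combination is a candidate, repeated values included.
candidates = list(product(param_grid['C'], param_grid['max_iter']))
distinct = set(candidates)

# Hypothetical variant: drop repeated values so each setting is fitted once.
deduplicated_grid = {name: sorted(set(values)) for name, values in param_grid.items()}

print(len(candidates), len(distinct))
print(deduplicated_grid)
```

With 5-fold cross-validation, 15 candidates give 75 fits, while the deduplicated grid would need 60.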

In [36]:
# Create a GridSearchCV() object and fit it to the training data.
# cv is left at its default, which gives 5-fold cross-validation (see the log below).
grid_LogiTaskA = GridSearchCV(LogisticRegression(solver='lbfgs', random_state=0),
                              param_grid=param_grid, refit=True, verbose=2)
grid_LogiTaskA.fit(X_train, y_train)

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV] END ..............................C=0.001, max_iter=200; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=200; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=200; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=200; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=200; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=500; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=500; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=500; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=500; total time=   0.0s
[CV] END ..............................C=0.001, max_iter=500; total time=   0.0s
[CV] END .............................C=0.001, max_iter=1000; total time=   0.0s
[CV] END .............................C=0.001, m

GridSearchCV(estimator=LogisticRegression(random_state=0),
             param_grid={'C': [0.001, 0.01, 1, 10, 10],
                         'max_iter': [200, 500, 1000]},
             verbose=2)

### Check best hyperparameters from GridSearchCV

In [37]:
# Print the best hyperparameters and the best mean cross-validation score.
print('Best hyperparameter by GridsearchCV():', grid_LogiTaskA.best_params_, '\n')
print('Best score by using this hyperparameter: %.4f' %(grid_LogiTaskA.best_score_), '\n\n')

# Print the mean and standard deviation of the cross-validation score for each candidate.
means_grid = grid_LogiTaskA.cv_results_['mean_test_score']
stds_grid = grid_LogiTaskA.cv_results_['std_test_score']
params_grid = grid_LogiTaskA.cv_results_['params']
    
for mean, stdev, param in zip(means_grid, stds_grid, params_grid):
    print('For parameter %r: Mean score %.4f (with std %.4f)\n'% (param, mean, stdev))

Best hyperparameter by GridsearchCV(): {'C': 1, 'max_iter': 200} 

Best score by using this hyperparameter: 0.9019 


For parameter {'C': 0.001, 'max_iter': 200}: Mean score 0.8641 (with std 0.0066)

For parameter {'C': 0.001, 'max_iter': 500}: Mean score 0.8641 (with std 0.0066)

For parameter {'C': 0.001, 'max_iter': 1000}: Mean score 0.8641 (with std 0.0066)

For parameter {'C': 0.01, 'max_iter': 200}: Mean score 0.8996 (with std 0.0087)

For parameter {'C': 0.01, 'max_iter': 500}: Mean score 0.8996 (with std 0.0087)

For parameter {'C': 0.01, 'max_iter': 1000}: Mean score 0.8996 (with std 0.0087)

For parameter {'C': 1, 'max_iter': 200}: Mean score 0.9019 (with std 0.0082)

For parameter {'C': 1, 'max_iter': 500}: Mean score 0.9019 (with std 0.0082)

For parameter {'C': 1, 'max_iter': 1000}: Mean score 0.9019 (with std 0.0082)

For parameter {'C': 10, 'max_iter': 200}: Mean score 0.9004 (with std 0.0086)

For parameter {'C': 10, 'max_iter': 500}: Mean score 0.9004 (with std 0.0086)
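
In the results listed above, the mean score depends only on C. For a given C it is identical for every max_iter, which suggests (an interpretation, not something the notebook verifies) that the solver already converges within 200 iterations. A small sketch that groups the printed scores by C to make this explicit; the values are copied from the output above:

```python
from collections import defaultdict

# Mean CV scores copied from the output above, keyed by (C, max_iter).
mean_scores = {
    (0.001, 200): 0.8641, (0.001, 500): 0.8641, (0.001, 1000): 0.8641,
    (0.01, 200): 0.8996, (0.01, 500): 0.8996, (0.01, 1000): 0.8996,
    (1, 200): 0.9019, (1, 500): 0.9019, (1, 1000): 0.9019,
    (10, 200): 0.9004, (10, 500): 0.9004,
}

# Collect the distinct scores seen for each value of C.
scores_by_C = defaultdict(set)
for (C, max_iter), score in mean_scores.items():
    scores_by_C[C].add(score)

for C, scores in sorted(scores_by_C.items()):
    print(f"C={C}: distinct scores across max_iter = {sorted(scores)}")

# The value of C with the highest mean score.
best_C = max(scores_by_C, key=lambda C: max(scores_by_C[C]))
print("best C:", best_C)
```

Under that reading, `max_iter=200` in the reported best parameters is presumably just the first of several tied candidates rather than a meaningful optimum.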

## 3. Model Evaluation
**Evaluate the chosen model with the best hyperparameters on the test set using the confusion matrix, classification report, accuracy score and Cohen's kappa score.**

In [38]:
# Get predictions from the chosen model (refit on the full training set with the best hyperparameters).
pred_grid_LogiTaskA = grid_LogiTaskA.predict(X_test)

# Print performance of the chosen model: confusion matrix, accuracy score, classification report and Cohen's kappa score.
print('Confusion matrix: \n', confusion_matrix(y_test, pred_grid_LogiTaskA), '\n')
print('Accuracy on test set: ', accuracy_score(y_test, pred_grid_LogiTaskA), '\n')
print('Classification report: \n', classification_report(y_test, pred_grid_LogiTaskA), '\n')
print('cohen_kappa score：',cohen_kappa_score(y_test, pred_grid_LogiTaskA))

Confusion matrix: 
 [[ 15  22]
 [  5 158]] 

Accuracy on test set:  0.865 

Classification report: 
               precision    recall  f1-score   support

           0       0.75      0.41      0.53        37
           1       0.88      0.97      0.92       163

    accuracy                           0.86       200
   macro avg       0.81      0.69      0.72       200
weighted avg       0.85      0.86      0.85       200
 

cohen_kappa score： 0.4556451612903225
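
Because the test set is imbalanced (37 vs. 163 samples), the per-class recall is more informative than the overall accuracy. The sketch below recomputes it from the confusion matrix above. Balanced accuracy is an extra metric added here for illustration and corresponds to the macro-averaged recall in the report:

```python
# Test-set confusion matrix reported above (rows: true class, columns: predicted class).
cm = [[15, 22],
      [5, 158]]

# Recall per class = correctly predicted samples / all samples of that class.
recall_per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recall_per_class) / len(recall_per_class)

# Plain accuracy for comparison.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

print([round(r, 2) for r in recall_per_class], round(balanced_accuracy, 2), accuracy)
```

The model recovers most class 1 samples but fewer than half of the class 0 samples, which the overall accuracy of 0.865 hides.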


## 4. Save model

In [39]:
# Save the chosen model (the fitted GridSearchCV object) with pickle.
with open('Model/LogRegTaskA.pickle', 'wb') as handle:
    pickle.dump(grid_LogiTaskA, handle)
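
Loading the model back mirrors the saving step. The round trip below is a minimal sketch that uses a temporary directory and a plain dictionary as a hypothetical stand-in for the fitted `grid_LogiTaskA` object, so it runs without the training data:

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted GridSearchCV object.
model_stand_in = {'best_params': {'C': 1, 'max_iter': 200}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'LogRegTaskA.pickle')

    # Save in binary mode, as in the cell above.
    with open(path, 'wb') as handle:
        pickle.dump(model_stand_in, handle)

    # Load the object back; 'rb' matches the binary 'wb' used for saving.
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

print(restored == model_stand_in)
```

With the real file, the restored object can be used like the original, for example by calling `predict` on `X_test`. Pickled scikit-learn models are best loaded with the same library version that saved them.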