# Task A: Logistic Regression

## 0. Load the preprocessed data
**We use the dataset produced by the preprocessing step (after PCA). Each sample is described by 200 features and belongs to one of two classes (tumor or no tumor).**

In [1]:
# Import necessary libraries
import pickle
import numpy as np
import pandas as pd

In [2]:
# Load the preprocessed data using pickle.
with open('DataAfterProcess/images_AfterProcess.pickle', 'rb') as handle:
    X = pickle.load(handle)
    
with open('DataAfterProcess/label_AfterProcess.pickle', 'rb') as handle:
    y = pickle.load(handle)
    
# Check result.
print(X.shape, y.shape) 

(3000, 200) (3000,)


### Train Test Split

**Split data into a training set (70%) and a test set (30%). Note that the test set here comes from the dataset.zip file, not the test.zip file.**

In [3]:
# Import necessary libraries
from sklearn.model_selection import train_test_split

In [4]:
# Split data into a training set and a test set (70% training and 30% testing data).
# Every random state in this assignment is set to 0 to ensure reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size=0.3, random_state=0)

# Check result.
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape) 
print('train set: {}  | test set: {}'.format(round(len(y_train)/len(X),3), 
                                             round(len(y_test)/len(X),3)))

(2100, 200) (2100,) (900, 200) (900,)
train set: 0.7  | test set: 0.3
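The labels in this dataset are imbalanced (the test-set support reported later in this notebook is 145 versus 755). The split above does not stratify, so the class ratio in the two parts can drift slightly. As a minimal sketch, assuming synthetic stand-in data rather than the real pickled arrays, the `stratify` argument of `train_test_split` keeps the class proportions equal in both parts. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data (assumption): 3000 samples,
# 200 features, and one sample in six belonging to class 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(3000, 200))
y_demo = np.array([0] * 500 + [1] * 2500)

# stratify=y_demo keeps the class ratio the same in the training and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print('class-0 share | train: {:.3f} | test: {:.3f}'.format(
    np.mean(y_tr == 0), np.mean(y_te == 0)))
```

Whether stratification changes the results on the real data is not something this sketch can show; it only illustrates the option.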


## 1. Train a logistic regression model
**First, we train a logistic regression model without hyperparameter tuning so that it can be compared with the model found by GridSearchCV in the next step.**

In [5]:
# Import necessary libraries
from sklearn.linear_model import LogisticRegression

In [6]:
# Create a LogisticRegression() model from sklearn and fit it to the training data.
# Every random state in this assignment is set to 0 to ensure reproducibility.
logreg = LogisticRegression(solver='lbfgs', C=10, max_iter=1000, random_state=0)
logreg.fit(X_train, y_train)

LogisticRegression(C=10, max_iter=1000, random_state=0)

### Model Evaluation
**Get predictions from the model, then create the confusion matrix and the classification report and check the accuracy score.**

In [7]:
# Import necessary libraries
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, cohen_kappa_score

In [8]:
# Get predictions from the model.
y_pred = logreg.predict(X_test)

# Print model performance: Confusion matrix, accuracy score and classification report.
print('Confusion matrix: \n', confusion_matrix(y_test, y_pred), '\n')
print('Accuracy on test set: ', accuracy_score(y_test, y_pred), '\n')
print('Classification report: \n', classification_report(y_test, y_pred), '\n') 
# print('cohen_kappa score:', cohen_kappa_score(y_test, y_pred))

Confusion matrix: 
 [[ 82  63]
 [ 27 728]] 

Accuracy on test set:  0.9 

Classification report: 
               precision    recall  f1-score   support

           0       0.75      0.57      0.65       145
           1       0.92      0.96      0.94       755

    accuracy                           0.90       900
   macro avg       0.84      0.76      0.79       900
weighted avg       0.89      0.90      0.89       900
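The figures in the classification report can be re-derived by hand from the confusion matrix, which makes it easier to see why class 0 scores lower than class 1. The sketch below uses the matrix printed above and the `confusion_matrix` convention that rows are true labels and columns are predicted labels. The variable names are illustrative only.

```python
# Confusion matrix printed above: rows = true labels, columns = predicted labels.
cm = [[82, 63],
      [27, 728]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Class 0: precision = correct class-0 predictions / all class-0 predictions,
#          recall    = correct class-0 predictions / all true class-0 samples.
precision_0 = cm[0][0] / (cm[0][0] + cm[1][0])
recall_0 = cm[0][0] / (cm[0][0] + cm[0][1])
f1_0 = 2 * precision_0 * recall_0 / (precision_0 + recall_0)

print(round(accuracy, 2), round(precision_0, 2), round(recall_0, 2), round(f1_0, 2))
# prints: 0.9 0.75 0.57 0.65
```

The low recall for class 0 comes from the 63 true class-0 samples that were predicted as class 1, which the overall accuracy of 0.9 does not reveal on its own.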
 



## 2. Hyperparameter tuning: GridSearchCV
**Let's tune the hyperparameters with the help of GridSearchCV(). Here we tune the regularization parameter C (the inverse of the regularization strength) and the maximum number of iterations max_iter.**

In [9]:
# Import necessary libraries
from sklearn.model_selection import GridSearchCV

In [10]:
# Create a dictionary called param_grid with candidate values for C and max_iter.
# Note: 10 is listed twice for C, so only two distinct values of C are searched.
param_grid = {'C': [1,10,10],
              'max_iter': [300,500,1000]}

In [11]:
# Create a GridSearchCV() object and fit it to the training data.
grid_LogiTaskA = GridSearchCV(LogisticRegression(solver='lbfgs', random_state=0), param_grid=param_grid, refit=True, verbose=2)
grid_LogiTaskA.fit(X_train, y_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits
[CV] END ..................................C=1, max_iter=300; total time=   0.0s
[CV] END ..................................C=1, max_iter=300; total time=   0.0s
[CV] END ..................................C=1, max_iter=300; total time=   0.1s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] END ..................................C=1, max_iter=300; total time=   0.1s
[CV] END ..................................C=1, max_iter=300; total time=   0.1s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] END ..................................C=1, max_iter=500; total time=   0.0s
[CV] END ..................................C=1, max_iter=500; total time=   0.0s
[CV] END ..................................C=1, max_iter=500; total time=   0.1s
[CV] END ..................................C=1, max_iter=500; total time=   0.0s
[CV] END ..................................C=1, max_iter=500; total time=   0.1s
[CV] END .................................C=1, max_iter=1000; total time=   0.0s
[CV] END .................................C=1, max_iter=1000; total time=   0.0s
[CV] END .................................C=1, max_iter=1000; total time=   0.1s
[CV] END .................................C=1, max_iter=1000; total time=   0.0s
[CV] END .................................C=1, max_iter=1000; total time=   0.1s
[CV] END .................................C=10, max_iter=300; total time=   0.0s
[CV] END .................................C=10, max_iter=300; total time=   0.0s
[CV] END ...................

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] END .................................C=10, max_iter=300; total time=   0.1s
[CV] END .................................C=10, max_iter=500; total time=   0.0s
[CV] END .................................C=10, max_iter=500; total time=   0.0s
[CV] END .................................C=10, max_iter=500; total time=   0.0s
[CV] END .................................C=10, max_iter=500; total time=   0.0s
[CV] END .................................C=10, max_iter=500; total time=   0.1s
[CV] END ................................C=10, max_iter=1000; total time=   0.0s
[CV] END ................................C=10, max_iter=1000; total time=   0.0s
[CV] END ................................C=10, max_iter=1000; total time=   0.0s
[CV] END ................................C=10, max_iter=1000; total time=   0.0s
[CV] END ................................C=10, max_iter=1000; total time=   0.1s
[CV] END .................................C=10, max_iter=300; total time=   0.0s
[CV] END ...................

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


[CV] END .................................C=10, max_iter=500; total time=   0.0s
[CV] END .................................C=10, max_iter=500; total time=   0.0s
[CV] END .................................C=10, max_iter=500; total time=   0.0s
[CV] END .................................C=10, max_iter=500; total time=   0.0s
[CV] END .................................C=10, max_iter=500; total time=   0.1s
[CV] END ................................C=10, max_iter=1000; total time=   0.0s
[CV] END ................................C=10, max_iter=1000; total time=   0.1s
[CV] END ................................C=10, max_iter=1000; total time=   0.0s
[CV] END ................................C=10, max_iter=1000; total time=   0.0s
[CV] END ................................C=10, max_iter=1000; total time=   0.1s


GridSearchCV(estimator=LogisticRegression(random_state=0),
             param_grid={'C': [1, 10, 10], 'max_iter': [300, 500, 1000]},
             verbose=2)
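The log line "Fitting 5 folds for each of 9 candidates, totalling 45 fits" follows from how the grid is expanded: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A small sketch using sklearn's `ParameterGrid` on the same dictionary shows the count and also that repeated values are not deduplicated. The names `param_grid_demo`, `candidates`, `distinct`, and `n_folds` are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook, including the repeated value 10 for C.
param_grid_demo = {'C': [1, 10, 10], 'max_iter': [300, 500, 1000]}

# ParameterGrid enumerates the Cartesian product of all listed values.
candidates = list(ParameterGrid(param_grid_demo))
distinct = {(c['C'], c['max_iter']) for c in candidates}

n_folds = 5  # default number of folds when cv is not passed to GridSearchCV
print(len(candidates), 'candidates,', len(distinct), 'distinct')
print(len(candidates) * n_folds, 'fits')
```

Three of the nine candidates are therefore repeats of the C=10 settings, which is why the log shows the C=10 block twice.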

### Check best hyperparameters

In [12]:
# Print best hyperparameters of grid model
print(grid_LogiTaskA.best_params_)

{'C': 1, 'max_iter': 300}
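The fit log contains lbfgs messages that the iteration limit was reached, and it suggests scaling the data. One possible way to follow that suggestion is to put a `StandardScaler` in front of the classifier inside a pipeline, so that scaling is refitted on each training fold. The sketch below runs on synthetic data from `make_classification`, not on the notebook's data, and the grid values are illustrative assumptions. It does not claim that scaling would change the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), small enough to run quickly.
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

# make_pipeline names each step after its lowercased class name,
# so the classifier's parameters are addressed as 'logisticregression__<name>'.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', max_iter=1000, random_state=0),
)
grid_demo = GridSearchCV(pipe, param_grid={'logisticregression__C': [0.1, 1, 10]}, cv=5)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_params_)
# n_iter_ holds the number of lbfgs iterations used by the refitted model.
n_iter_used = grid_demo.best_estimator_.named_steps['logisticregression'].n_iter_
print(n_iter_used)
```

Comparing `n_iter_` with `max_iter` is a quick check of whether the solver stopped because it converged or because it ran out of iterations.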


### Model Evaluation
**Get predictions from the grid model, then create the confusion matrix and the classification report and check the accuracy score.**

In [13]:
# Get predictions from the grid model.
pred_grid_LogiTaskA = grid_LogiTaskA.predict(X_test)

# Print grid model performance: Confusion matrix, accuracy score and classification report.
print('Confusion matrix: \n', confusion_matrix(y_test, pred_grid_LogiTaskA), '\n')
print('Accuracy on test set: ', accuracy_score(y_test, pred_grid_LogiTaskA), '\n')
print('Classification report: \n', classification_report(y_test, pred_grid_LogiTaskA), '\n')
# print('cohen_kappa score:', cohen_kappa_score(y_test, pred_grid_LogiTaskA))

Confusion matrix: 
 [[ 83  62]
 [ 27 728]] 

Accuracy on test set:  0.9011111111111111 

Classification report: 
               precision    recall  f1-score   support

           0       0.75      0.57      0.65       145
           1       0.92      0.96      0.94       755

    accuracy                           0.90       900
   macro avg       0.84      0.77      0.80       900
weighted avg       0.89      0.90      0.90       900
 



## 3. Save model

In [14]:
# Save the trained grid-search model using pickle.
with open('Model/LogRegTaskA.pickle', 'wb') as handle:
    pickle.dump(grid_LogiTaskA, handle)
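A saved model is only useful if it can be loaded again and gives the same predictions. The following round trip is a minimal sketch, assuming a small model trained on synthetic data and a temporary directory rather than the notebook's `Model/` folder. The file name `demo_model.pickle` and all variable names are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small synthetic problem (assumption) standing in for the real training data.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# Write the model to disk and read it back, as the notebook does with pickle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pickle')
    with open(path, 'wb') as handle:
        pickle.dump(model, handle)
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

# The restored model should reproduce the original predictions exactly.
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Two practical points apply to the cell above as well: `open(..., 'wb')` fails if the target directory does not exist, and a pickled sklearn model should be loaded with the same sklearn version that created it.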