### Objective
-	To understand and use K-Fold Cross Validation on the entire dataset
-	To use GridSearchCV to tune the hyperparameters of the logistic regression model and find the best-performing combination.




In [77]:
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import roc_auc_score
from sklearn.metrics import RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
from numpy import mean
from sklearn import metrics
from sklearn.preprocessing import binarize
from joblib import dump, load
import matplotlib.pyplot as plt

##### 1. Load the dataset that was cleaned (from the data directory) and see if it requires any more cleaning after reading it (hint: Check the first column). Feed the train data into a Logistic Regression model with an arbitrary random state. 
* Feel free to play around with the parameters of the LogisticRegression class.

In [3]:
# Read cleaned data training, test, labels
X_train = pd.read_pickle("../data/X_train.pkl")
X_test  = pd.read_pickle("../data/X_test.pkl")
y_train = pd.read_pickle("../data/y_train.pkl")
y_test  = pd.read_pickle("../data/y_test.pkl")

In [4]:
X_train.head()

| Unnamed: 0 | MinTemp | MaxTemp | Rainfall | Evaporation | Sunshine | WindGustSpeed | WindSpeed9am | WindSpeed3pm | Humidity9am | Humidity3pm | ... | WindDir3pm_NNW | WindDir3pm_NW | WindDir3pm_S | WindDir3pm_SE | WindDir3pm_SSE | WindDir3pm_SSW | WindDir3pm_SW | WindDir3pm_W | WindDir3pm_WNW | WindDir3pm_WSW |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.524941 | 0.517958 | 0.001198 | 0.143713 | 0.586207 | 0.466667 | 0.4 | 0.651163 | 0.65 | 0.55 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.726841 | 0.646503 | 0.011978 | 0.179641 | 0.434483 | 0.36 | 0.14 | 0.44186 | 0.71 | 0.59 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.425178 | 0.775047 | 0.0 | 0.341317 | 0.848276 | 0.333333 | 0.3 | 0.255814 | 0.06 | 0.02 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.68171 | 0.659735 | 0.0 | 0.263473 | 0.765517 | 0.413333 | 0.44 | 0.44186 | 0.59 | 0.53 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.593824 | 0.642722 | 0.0 | 0.143713 | 0.586207 | 0.44 | 0.0 | 0.162791 | 0.72 | 0.53 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [5]:
# Create Logistic Regression - Rain Prediction
log_regression = LogisticRegression(solver='liblinear', random_state=0)

In [6]:
# Train the model
log_regression.fit(X_train, y_train)

LogisticRegression(random_state=0, solver='liblinear')

##### 2. Use cross_validate from sklearn.model_selection to understand how cross validation works. 
* Instead of setting cv as an integer value, try using KFold (with >2 folds) from sklearn.model_selection as an alternative.
* You can use either the training set or concatenate the training set to the test set when using cross_validate in order to obtain the metrics (obtain accuracy and AUC score).
* Take the mean of all the values to obtain a single accuracy and single AUC score.


In [43]:
# Concatenate all values
X = pd.concat([X_train, X_test])
y = pd.concat([y_train, y_test])

In [19]:
# Kfold cross validation k=10
cv = KFold(n_splits=10, random_state=1, shuffle=True)
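
To make the splitting behaviour concrete, here is a minimal sketch on a toy array. The names `demo_data` and `demo_kfold` are illustrative and not part of the notebook. It shows that `KFold` partitions the row indices so that every sample lands in exactly one test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: ten samples, identified only by their index (illustrative, not the rain dataset)
demo_data = np.arange(10)

# Five shuffled folds; random_state makes the shuffle reproducible
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=1)

fold_sizes = []
seen_test_indices = []
for train_idx, test_idx in demo_kfold.split(demo_data):
    # Each iteration yields positional indices for the training and test portions
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test_indices.extend(test_idx.tolist())
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

With 10 samples and 5 folds, each fold trains on 8 samples and tests on 2.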

In [46]:
# Cross-validate with the 10-fold KFold splitter defined above
cv_results = cross_validate(log_regression, X, y, cv=cv, scoring='accuracy')

In [47]:
# Obtain accuracy by averaging all scores
accuracy = mean(cv_results['test_score'])
accuracy

0.8472639692200834
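
`cross_validate` can also report several metrics in one pass when `scoring` is given as a list. The sketch below assumes synthetic data from `make_classification` in place of the rain dataset, and the `demo_*` names are hypothetical. With a list of scorers, the result keys follow the pattern `test_<scorer name>`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary classification data (a stand-in for the notebook's X and y)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

demo_model = LogisticRegression(solver="liblinear", random_state=0)
demo_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# Request accuracy and ROC AUC together; "roc_auc" scores the model's continuous outputs
demo_results = cross_validate(
    demo_model, X_demo, y_demo, cv=demo_cv, scoring=["accuracy", "roc_auc"]
)

# One score per fold, averaged into a single number per metric
mean_accuracy = np.mean(demo_results["test_accuracy"])
mean_auc = np.mean(demo_results["test_roc_auc"])
print(f"Accuracy (average): {mean_accuracy:.4f}")
print(f"AUC (average): {mean_auc:.4f}")
```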

In [49]:
# Run cross-validation manually, tracking accuracy and AUC for each fold
accuracy_all = []
auc_all = []
for train, test in cv.split(X, y):
    log_regression.fit(X.iloc[train], y.iloc[train])
    y_prediction_test = log_regression.predict(X.iloc[test])
    test_accuracy_score = accuracy_score(y.iloc[test], y_prediction_test)
    accuracy_all.append(test_accuracy_score)
    # Note: AUC is computed here from hard 0/1 predictions, not predicted probabilities
    AUC_score = roc_auc_score(y.iloc[test], y_prediction_test)
    auc_all.append(AUC_score)


In [50]:
accuracy_all

[0.8453586497890295,
 0.8476793248945148,
 0.8431786216596343,
 0.8453477741050707,
 0.8524509459174344,
 0.8499894507349322,
 0.85069273507279,
 0.8461213868767142,
 0.8461917153105001,
 0.8456290878402138]

In [51]:
auc_all

[0.7211433512312738,
 0.7287229303570166,
 0.7201244174231188,
 0.7223996308666248,
 0.729182454365737,
 0.7323710429612815,
 0.7334408885328306,
 0.7270266863538999,
 0.72912810556455,
 0.7217109916456106]

In [52]:
final_accuracy = mean(accuracy_all)
final_auc = mean(auc_all)

In [54]:
print(f"Accuracy (average):{final_accuracy:.4f}")
print(f"AUC (average):{final_auc:.4f}")

Accuracy (average):0.8473
AUC (average):0.7265


The original logistic regression model had an accuracy of 0.8496, close to the cross-validated average of 0.8473.
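
The AUC values in this section are computed by passing the output of `predict` (hard 0/1 labels) to `roc_auc_score`. With hard labels the ROC curve has a single operating point, and the resulting area equals the balanced accuracy. Passing probabilities from `predict_proba` scores the model's ranking instead. The sketch below illustrates the difference on synthetic data; all names are illustrative and the numbers will not match the rain dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data (roughly 80% negatives), standing in for the rain data
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

clf = LogisticRegression(solver="liblinear", random_state=0).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)              # 0/1 predictions at the default 0.5 threshold
probabilities = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_proba = roc_auc_score(y_te, probabilities)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {balanced_acc:.4f}")
print(f"AUC from probabilities: {auc_from_proba:.4f}")
```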

##### 3. Use GridSearchCV on a logistic regression model to find the best parameters.
* As a minimum, find a suitable value of C along with the best solver. You are free to include other parameters in your parameter grid. Keep in mind the more parameters there are, the more models are iterated, leading to a longer time needed to compute.
* Do not forget to use the cv argument.
* Once GridSearchCV is instantiated, fit the classifier on the training data. This may take some time.
* Have a look at the best estimator and the parameters for the same once the data has been fit.

In [61]:
# Parameters used for grid search
# Note: lbfgs supports only the l2 penalty, so the l1/lbfgs candidates cannot be fit
gs_parameters = [{'penalty':['l1','l2'], 'C':[0.001, .009, 0.01, 0.9, 1, 10, 100, 1000], 'solver':['liblinear','lbfgs','saga']}]

In [62]:
# Instantiate gridsearch, use all available processors to speed up
grid_search = GridSearchCV(estimator = LogisticRegression(),  
                           param_grid = gs_parameters,
                           scoring = 'accuracy',
                           cv = 5, verbose=2, n_jobs=-1)

In [63]:
# Train
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 48 candidates, totalling 240 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 32 concurrent workers.
[Parallel(n_jobs=-1)]: Done  98 tasks      | elapsed:   24.1s
[Parallel(n_jobs=-1)]: Done 240 out of 240 | elapsed:  6.3min finished
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


GridSearchCV(cv=5, estimator=LogisticRegression(), n_jobs=-1,
             param_grid=[{'C': [0.001, 0.009, 0.01, 0.9, 1, 10, 100, 1000],
                          'penalty': ['l1', 'l2'],
                          'solver': ['liblinear', 'lbfgs', 'saga']}],
             scoring='accuracy', verbose=2)

In [64]:
# What are the best params
grid_search.best_params_

{'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
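
Not every solver supports every penalty: `lbfgs` handles `l2` but not `l1`, so a grid that crosses all solvers with all penalties includes candidates that cannot be fit. One possible way to avoid this is to pass `param_grid` as a list of dictionaries, one per solver family. The sketch below assumes synthetic data and a much smaller grid than the notebook's; the `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the search finishes quickly
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Each dictionary is expanded separately, so only valid solver/penalty pairs are tried
demo_grid = [
    {"solver": ["liblinear"], "penalty": ["l1", "l2"], "C": [0.01, 1, 100]},
    {"solver": ["lbfgs"], "penalty": ["l2"], "C": [0.01, 1, 100]},
]

demo_search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),  # higher max_iter to help convergence
    param_grid=demo_grid,
    scoring="accuracy",
    cv=5,
)
demo_search.fit(X_demo, y_demo)

# 2 penalties x 3 values of C for liblinear, plus 1 penalty x 3 values of C for lbfgs
n_candidates = len(demo_search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print(f"Best parameters: {demo_search.best_params_}")
```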

##### 4. Use this new logistic regression classifier to predict the labels for X_test and compute the accuracy, confusion matrix and classification report.
* Compare these values to the initial model we created.
* Change the threshold value to a suitable value to decrease the type 2 error.


In [65]:
y_predicted_gs_test = grid_search.predict(X_test)

In [67]:
# Print accuracy
print(f"Accuracy Score:{accuracy_score(y_test,y_predicted_gs_test)}")

Accuracy Score:0.8473926650022856


In [69]:
# Print precision, recall, f-score
precision_recall_fscore_support(y_test,y_predicted_gs_test, average='binary')

(0.770356572645024, 0.454331450094162, 0.5715695952615992, None)

In [75]:
# Obtain tn, fp, fn, tp
confusion_matrix(y_test,y_predicted_gs_test)

array([[21204,   863],
       [ 3477,  2895]])

In [74]:
print(classification_report(y_test, y_predicted_gs_test))

              precision    recall  f1-score   support

           0       0.86      0.96      0.91     22067
           1       0.77      0.45      0.57      6372

    accuracy                           0.85     28439
   macro avg       0.81      0.71      0.74     28439
weighted avg       0.84      0.85      0.83     28439



These numbers are comparable to those of the original model:

- Original accuracy: 0.8496
- Original precision: 0.7681
- Original recall: 0.4606
- Original f score: 0.5759

In [76]:
# Predict the probability of rain (column 1) once, reshaped to a column for binarize
y_rain_prediction = grid_search.predict_proba(X_test)[:,1].reshape(-1,1)
# Try thresholds 0.1 to 0.9 and compute the confusion matrix for each
for i in range(1,10):
    new_threshold = i/10
    print(f"\nThreshold:{new_threshold}")
    # binarize labels a sample 1 when its probability is strictly greater than the threshold
    y_new_rain_prediction = binarize(y_rain_prediction,threshold=new_threshold)
    new_confusion_m = confusion_matrix(y_test,y_new_rain_prediction)
    # Obtain tn, fp, fn, tp
    TN, FP, FN, TP = new_confusion_m.ravel()
    print(f"False Positives (Type I errors):{FP:,}")
    print(f"False Negatives (Type II errors):{FN:,}")
    accuracy = (TP+TN)/(TP+TN+FP+FN)
    precision = TP/(TP+FP)
    recall = TP/(TP+FN)
    print(f"Accuracy:{accuracy*100:.2f}%")
    print(f"Precision:{precision*100:.2f}%")
    print(f"Recall:{recall*100:.2f}%")


Threshold:0.1
False Positives (Type I errors):7,619
False Negatives (Type II errors):745
Accuracy:70.59%
Precision:42.48%
Recall:88.31%

Threshold:0.2
False Positives (Type I errors):3,854
False Negatives (Type II errors):1,563
Accuracy:80.95%
Precision:55.51%
Recall:75.47%

Threshold:0.3
False Positives (Type I errors):2,245
False Negatives (Type II errors):2,299
Accuracy:84.02%
Precision:64.47%
Recall:63.92%

Threshold:0.4
False Positives (Type I errors):1,380
False Negatives (Type II errors):2,888
Accuracy:84.99%
Precision:71.63%
Recall:54.68%

Threshold:0.5
False Positives (Type I errors):863
False Negatives (Type II errors):3,477
Accuracy:84.74%
Precision:77.04%
Recall:45.43%

Threshold:0.6
False Positives (Type I errors):497
False Negatives (Type II errors):4,044
Accuracy:84.03%
Precision:82.41%
Recall:36.53%

Threshold:0.7
False Positives (Type I errors):244
False Negatives (Type II errors):4,618
Accuracy:82.90%
Precision:87.79%
Recall:27.53%

Threshold:0.8
False Positives (Typ

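Type II errors (false negatives) rise steadily with the threshold, from 745 at 0.1 to 3,477 at the default of 0.5. At a threshold of 0.3 they drop to 2,299, roughly balanced against 2,245 Type I errors, while accuracy stays at 84.02%.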
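
The trade-off between the two error types can be reproduced on a tiny hand-made example. The sketch below uses made-up labels and probabilities (`y_true_demo` and `proba_demo` are illustrative, not model output) and a hypothetical helper `errors_at` that counts false positives and false negatives at a given threshold.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made ground truth and predicted probabilities of the positive class
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
proba_demo = np.array([0.05, 0.2, 0.35, 0.6, 0.3, 0.45, 0.8, 0.1, 0.55, 0.4])

def errors_at(threshold):
    """Return (false positives, false negatives) when predicting 1 above the threshold."""
    y_pred = (proba_demo > threshold).astype(int)
    # labels=[0, 1] fixes the matrix layout so ravel() yields tn, fp, fn, tp
    tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred, labels=[0, 1]).ravel()
    return int(fp), int(fn)

for threshold in (0.25, 0.5):
    fp, fn = errors_at(threshold)
    print(f"Threshold {threshold}: Type I errors={fp}, Type II errors={fn}")
```

Lowering the threshold from 0.5 to 0.25 removes both false negatives in this toy example, at the cost of two extra false positives.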

##### 5. Compare it to the initial model we created and state your inferences.

The original model had an accuracy of 0.8496. The tuned model reaches a similar accuracy at a threshold of 0.3 (84.02%) or 0.4 (84.99%), with fewer Type II errors than at the default threshold of 0.5.

##### 6.  Save the model as a pickle file using the joblib library. The model is now ready for deployment!

In [78]:
dump(grid_search,'../data/logistic_regression_model.joblib')

['logistic_regression_model.joblib']
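
A saved model can be restored with `joblib.load` and used for prediction straight away. The following is a minimal round-trip sketch that assumes a small synthetic model and a temporary directory rather than the notebook's `../data` path; the file name `demo_model.joblib` is hypothetical.

```python
import os
import tempfile

from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a small stand-in model on synthetic data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(solver="liblinear", random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(model, model_path)      # serialize the fitted estimator to disk
    restored = load(model_path)  # read it back into memory

# The restored estimator should reproduce the original predictions
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print(f"Predictions match after reload: {same_predictions}")
```

Files written by `joblib` are pickles, so they should only be loaded from trusted sources, ideally with the same scikit-learn version that produced them.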