## Assignment 1.b

## Model fitting 

### Importing the required libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from collections import Counter
from imblearn.under_sampling import RandomUnderSampler

np.random.seed(1)

## Loading the preprocessed data 

In [2]:
train_df=pd.read_csv("default_train_df.csv")
test_df=pd.read_csv("default_test_df.csv")
X_train=pd.read_csv("default_train_X.csv")
X_test=pd.read_csv("default_test_X.csv")
y_train=pd.read_csv("default_train_y.csv")
y_test=pd.read_csv("default_test_y.csv")

In [3]:
y_train.value_counts()

DEFAULT PAYMENT NEXT MONTH
0                             16364
1                              4636
dtype: int64

***The training data is heavily imbalanced (16,364 non-defaults versus 4,636 defaults), so we need to apply a class-balancing technique to counter biased results.***

## Handling the class imbalance

In [4]:
undersample = RandomUnderSampler(sampling_strategy='majority')

X_train, y_train = undersample.fit_resample(X_train, y_train)

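# Note: Counter over a DataFrame iterates its column labels, so this prints the
# column name rather than the class counts; value_counts() below gives the counts.
print(Counter(y_train))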

Counter({'DEFAULT PAYMENT NEXT MONTH': 1})


In [5]:
y_train.value_counts()

DEFAULT PAYMENT NEXT MONTH
0                             4636
1                             4636
dtype: int64
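
To make the balancing step concrete, here is a minimal library-free sketch of random undersampling. It is not the `imblearn` implementation; the helper name `random_undersample` and the toy data are assumptions made for illustration only.

```python
import random
from collections import Counter

def random_undersample(rows, labels, seed=1):
    """Randomly drop rows until every class has as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    minority_size = min(counts.values())
    kept = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Sample without replacement; the smallest class is kept in full.
        kept.extend(rng.sample(idx, minority_size))
    kept.sort()
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data (hypothetical): 8 non-defaults (0) and 2 defaults (1).
toy_labels = [0] * 8 + [1] * 2
toy_rows = [[i] for i in range(10)]

X_bal, y_bal = random_undersample(toy_rows, toy_labels)
print(Counter(y_bal))
```

The cost of this approach is that most majority-class rows are discarded, which is why the training set shrinks from 21,000 to 9,272 rows above.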

## Standardizing the variables

Here we standardize each feature to zero mean and unit variance so that features measured on different scales contribute comparably to the models. The scaler is fitted on the training set only and then applied to both the training and test sets.

In [6]:
scaler = StandardScaler()
scaler.fit(X_train)

# Transform the predictors of the training and test sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


In [7]:
X_train 

array([[-0.1594649 ,  1.59086926,  0.83370498, ...,  1.48177835,
        -0.22219406, -0.22989422],
       [ 0.87887396, -0.50438208,  0.83370498, ..., -0.10248862,
         0.01905733, -0.22452629],
       [-0.24172336, -0.42379549, -1.19946507, ..., -0.29645833,
        -0.07275223, -0.28920017],
       ...,
       [ 0.50824613,  0.22089723,  0.83370498, ..., -0.29645833,
        -0.28920834, -0.28920017],
       [-1.40542505, -0.42379549, -1.19946507, ..., -0.29645833,
        -0.24424176, -0.245804  ],
       [ 0.25914704,  0.14031064,  0.83370498, ..., -0.22405455,
        -0.28920834,  2.08297309]])
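
As a rough illustration of what the scaling step computes for a single feature, the sketch below learns the mean and population standard deviation from training values and reuses them on test values. The function names and numbers are made up for this example; `StandardScaler` does the same per column.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(x - mu) / sigma for x in column]

train_col = [10.0, 20.0, 30.0, 40.0]  # hypothetical training values
test_col = [25.0, 50.0]               # hypothetical test values

# Statistics come from the training data only, so no test information leaks in.
mu, sigma = fit_scaler(train_col)
train_scaled = transform(train_col, mu, sigma)
test_scaled = transform(test_col, mu, sigma)
print(train_scaled, test_scaled)
```

Note that the scaled test values need not have zero mean or unit variance, because they are scaled with the training statistics.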

### What is the best evaluation metric?

Our aim in this analysis is to accurately predict default payments. We therefore focus on both the true positives and the false negatives: true positives (TP) count the times the model correctly predicts a default payment, whereas false negatives (FN) count the times the model incorrectly predicts a non-default payment when the actual payment is a default. False negatives matter as much as true positives when judging whether this model is accurate.

Recall is a predictive metric that deals with both true positives and false negatives. The proportion of true positives among all actual positive observations is measured by recall. It indicates how well the model can identify positive cases.

Formula :
Recall = True Positives / (True Positives + False Negatives)
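
The formula can be checked with a small worked example. The counts below are hypothetical and are not taken from this dataset; the helper name `classification_metrics` is also an assumption. It computes recall alongside the other metrics reported later in this notebook.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the four summary metrics from confusion-matrix counts."""
    return {
        "Accuracy": (tp + tn) / (tp + tn + fp + fn),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts: 100 actual defaults, of which 60 are caught.
metrics = classification_metrics(tp=60, tn=300, fp=100, fn=40)
for name, value in metrics.items():
    print(f"{name}={value:.4f}")
```

With these counts, recall is 60 / (60 + 40), so 40 percent of the actual defaults would be missed even though accuracy looks higher.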

## Modelling the data with various modelling techniques

In [8]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

### Decision Trees
### Using Random search and grid search

In [10]:
score_measure = "recall"
kfolds = 5

param_grid = {
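    'min_samples_split': np.arange(2,100),  # an integer min_samples_split must be at least 2
    'min_samples_leaf': np.arange(1,100),
    'min_impurity_decrease': np.arange(0.0001, 0.0005, 0.0001),  # explicit step; the default step of 1 yields only 0.0001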
    'max_leaf_nodes': np.arange(5, 100), 
    'max_depth': np.arange(1,25), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

best_DTree = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best recall score is 0.65401857121601
... with parameters: {'min_samples_split': 41, 'min_samples_leaf': 85, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 84, 'max_depth': 20, 'criterion': 'entropy'}


In [11]:
score_measure = "recall"
kfolds = 5
min_samples_split = rand_search.best_params_['min_samples_split']
min_samples_leaf = rand_search.best_params_['min_samples_leaf']
min_impurity_decrease = rand_search.best_params_['min_impurity_decrease']
max_leaf_nodes = rand_search.best_params_['max_leaf_nodes']
max_depth = rand_search.best_params_['max_depth']
criterion = rand_search.best_params_['criterion']
#Using the best parameters from the Random Search to use as range for the parameters to do the grid search
param_grid = {
    'min_samples_split': np.arange(min_samples_split-2,min_samples_split+2),  
    'min_samples_leaf': np.arange(min_samples_leaf-2,min_samples_leaf+2),
    'min_impurity_decrease': np.arange(min_impurity_decrease-0.0001, min_impurity_decrease+0.0001, 0.00005),
    'max_leaf_nodes': np.arange(max_leaf_nodes-2,max_leaf_nodes+2), 
    'max_depth': np.arange(max_depth-2,max_depth+2), 
    'criterion': [criterion]
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

best_DTree = grid_search.best_estimator_

Fitting 5 folds for each of 1024 candidates, totalling 5120 fits
The best recall score is 0.65401857121601
... with parameters: {'criterion': 'entropy', 'max_depth': 18, 'max_leaf_nodes': 82, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 85, 'min_samples_split': 39}
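
The two-stage tuning used here (a wide random search, then an exhaustive grid in a narrow window around the random-search winner) can be sketched without scikit-learn. The function `toy_score` below is a made-up stand-in for a cross-validated recall score, and the ranges are assumptions chosen only to mirror the shape of the search above.

```python
import itertools
import random

def toy_score(depth, leaf):
    """Stand-in for a cross-validated score; it peaks at depth=12, leaf=40."""
    return -((depth - 12) ** 2) - ((leaf - 40) / 5) ** 2

rng = random.Random(1)

# Stage 1: random search over wide ranges.
candidates = [(rng.randrange(1, 25), rng.randrange(1, 100)) for _ in range(50)]
coarse_best = max(candidates, key=lambda p: toy_score(*p))

# Stage 2: exhaustive grid in a small window around the stage-1 winner.
depth0, leaf0 = coarse_best
window = itertools.product(
    range(max(1, depth0 - 2), depth0 + 3),
    range(max(1, leaf0 - 2), leaf0 + 3),
)
fine_best = max(window, key=lambda p: toy_score(*p))

print("random search winner:", coarse_best)
print("grid search winner:  ", fine_best)
```

Because the window contains the stage-1 winner, the second stage can only match or improve the score on this deterministic toy objective. With real cross-validation the refit scores are usually very close, as the identical best recall in the two outputs above suggests.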


In [12]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")
Recall_Dtree = TP/(TP+FN)  # store the recall as a float (the braces previously created a set)

Accuracy=0.7301111 Precision=0.4249738 Recall=0.6075000 F1=0.5001029


In [13]:
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [14]:
performance

           model  Accuracy  Precision  Recall        F1
0  Decision Tree  0.730111   0.424974  0.6075  0.500103


### Logistic Regression
### Using Random search and grid search

In [15]:
score_measure = "recall"
kfolds = 3

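param_grid = {'C':[0.01,0.1,1,2,10], # C is the inverse of the regularization strength
               # Not every penalty/solver pair is valid: liblinear supports only 'l1' and 'l2',
               # and 'elasticnet' also needs l1_ratio, so those candidates fail to fit
               'penalty':['l1', 'l2','elasticnet','none'],
              'solver':['saga','liblinear'],
              'max_iter': np.arange(250,500)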
                  
}

log_reg = LogisticRegression()
rand_search = RandomizedSearchCV(estimator =log_reg, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1  # n_jobs=-1 will utilize all available CPUs 
                                )

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

best_log_reg = rand_search.best_estimator_

Fitting 3 folds for each of 500 candidates, totalling 1500 fits


585 fits failed out of a total of 1500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
177 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\ajayk\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\ajayk\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py", line 1291, in fit
    fold_coefs_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose, prefer=prefer)(
  File "C:\Users\ajayk\anaconda3\lib\site-packages\sklearn\utils\parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "C:\Users\ajayk\anaconda3\lib\site-packages\joblib\parallel.py", 

The best recall score is 0.628124358926052
... with parameters: {'solver': 'saga', 'penalty': 'l2', 'max_iter': 334, 'C': 0.01}
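
The failed fits reported above are consistent with incompatible penalty and solver pairs in the search space. The sketch below enumerates the pairs and separates the ones that can fit, assuming the rules in the scikit-learn `LogisticRegression` documentation: `liblinear` supports only `l1` and `l2`, and `elasticnet` requires `saga` plus an `l1_ratio` value, which this search never sets. The `SUPPORTED` table is written by hand for this illustration, and newer scikit-learn releases spell the no-penalty option as `None` rather than `'none'`.

```python
import itertools

# Penalties each solver accepts (hand-written from the documentation; an assumption).
SUPPORTED = {
    "liblinear": {"l1", "l2"},
    "saga": {"l1", "l2", "elasticnet", "none"},
}

penalties = ["l1", "l2", "elasticnet", "none"]
solvers = ["saga", "liblinear"]

valid, invalid = [], []
for penalty, solver in itertools.product(penalties, solvers):
    if penalty not in SUPPORTED[solver]:
        invalid.append((penalty, solver))
    elif penalty == "elasticnet":
        # Supported by saga, but it also needs l1_ratio, which is never provided here.
        invalid.append((penalty, solver))
    else:
        valid.append((penalty, solver))

# A list of dictionaries is a form that GridSearchCV accepts for param_grid.
param_grid = [
    {"solver": [solver], "penalty": [penalty], "C": [0.01, 0.1, 1, 2, 10]}
    for penalty, solver in valid
]

print("valid:  ", valid)
print("invalid:", invalid)
```

Three of the eight pairs are invalid, which is roughly in line with 585 of the 1500 randomly sampled fits failing.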


In [16]:
score_measure = "recall"
kfolds = 3
best_penalty = rand_search.best_params_['penalty']
best_solver = rand_search.best_params_['solver']
best_C = rand_search.best_params_['C']
best_max_iter = rand_search.best_params_['max_iter']

# Use the best parameters from the random search to define the ranges for the grid search
param_grid = {
    # np.arange defaults to a step of 1, so this range contains only best_C
    'C': np.arange(best_C, best_C+0.5),
    'penalty': [best_penalty],
    'solver': [best_solver],
    'max_iter': np.arange(best_max_iter-300, best_max_iter+300)
}

log_reg =  LogisticRegression()
grid_search = GridSearchCV(estimator = log_reg, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,error_score='raise' # n_jobs=-1 will utilize all available CPUs 
                )

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

best_log_reg = grid_search.best_estimator_

Fitting 3 folds for each of 600 candidates, totalling 1800 fits
The best recall score is 0.6283401086563648
... with parameters: {'C': 0.01, 'max_iter': 34, 'penalty': 'l2', 'solver': 'saga'}




In [17]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")
Recall_logistic = TP/(TP+FN)  # store the recall as a float (the braces previously created a set)

Accuracy=0.7023333 Precision=0.3916374 Recall=0.6135000 F1=0.4780830


In [None]:
performance = pd.concat([performance, pd.DataFrame({'model':"logistic using random & grid search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

In [None]:
performance

### SVM Model

### using RandomSearch and Grid Search

In [None]:
score_measure = "recall"
kfolds = 3

param_grid = {'C':np.arange(0.1,100,10),  # regularization parameter
               'kernel':['linear', 'rbf','poly'],
              'gamma':['scale','auto'],
              'degree':np.arange(1,10), # degree applies only to the polynomial kernel
              'coef0':np.arange(1,10) # coef0 is used by the polynomial kernel
                  
}

svc = SVC()
rand_search = RandomizedSearchCV(estimator =svc, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1  # n_jobs=-1 will utilize all available CPUs 
                                )

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

best_svc = rand_search.best_estimator_

Fitting 3 folds for each of 500 candidates, totalling 1500 fits


In [None]:
score_measure = "recall"
kfolds = 3
best_kernel = rand_search.best_params_['kernel']
best_gamma = rand_search.best_params_['gamma']
best_C = rand_search.best_params_['C']
best_degree = rand_search.best_params_['degree']
best_coef0 = rand_search.best_params_['coef0']

# Use the best parameters from the random search to define the ranges for the grid search
param_grid = {
    # Starting at best_C-3 would go negative when best_C is 0.1, and SVC requires C > 0
    'C': np.arange(max(0.1, best_C-3), best_C+3),
    'kernel': [best_kernel],
    'gamma': [best_gamma],
    'degree': np.arange(best_degree-1, best_degree+1),
    'coef0': np.arange(best_coef0-3, best_coef0+3)
}

svm_grid =  SVC()
grid_search = GridSearchCV(estimator = svm_grid, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1 # n_jobs=-1 will utilize all available CPUs 
                )

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

best_svm = grid_search.best_estimator_

In [None]:
c_matrix = confusion_matrix(y_test, grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
print(f"Accuracy={(TP+TN)/(TP+TN+FP+FN):.7f} Precision={TP/(TP+FP):.7f} Recall={TP/(TP+FN):.7f} F1={2*TP/(2*TP+FP+FN):.7f}")
Recall_SVM = TP/(TP+FN)  # store the recall as a float (the braces previously created a set)

In [None]:
performance = pd.concat([performance, pd.DataFrame({'model':"svm using Random & Grid search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

## Conclusion

The test-set recall of the decision tree and of logistic regression, each tuned with random search followed by grid search, is 0.6075 and 0.6135 respectively, which is almost equal. For Support Vector Machines we were unable to obtain a result because of an incompatibility with the system used. In this business problem we focus mainly on true positives (TP), the number of times the model correctly predicts a default payment, and false negatives (FN), the number of times the model incorrectly predicts a non-default payment when the actual payment is a default.

Of the models that produced results, logistic regression is therefore the better choice for detecting default payments. On the same test set, its higher recall means slightly more true positives and fewer false negatives than the decision tree, although its accuracy and precision are lower.