# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original dataset).

## Questions
Is this a classification or regression task?  

- This is a classification task. We are trying to predict discrete categorical values: specifically, whether each transaction is fraudulent or non-fraudulent.

Are you predicting for multiple classes or binary classes?  

- We are predicting binary classes: 1 for fraudulent and 0 for non-fraudulent.

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

- Based on these observations, the three machine learning models I would choose are **KNN**, **SVM**, and **Random Forest**.

## First Model

Using the first model that you've chosen, implement the following steps.

# KNN Model

In [31]:
!pip install imblearn

Collecting imblearn
  Downloading imblearn-0.0-py2.py3-none-any.whl.metadata (355 bytes)
Collecting imbalanced-learn (from imblearn)
  Using cached imbalanced_learn-0.13.0-py3-none-any.whl.metadata (8.8 kB)
Collecting sklearn-compat<1,>=0.1 (from imbalanced-learn->imblearn)
  Using cached sklearn_compat-0.1.3-py3-none-any.whl.metadata (18 kB)
Downloading imblearn-0.0-py2.py3-none-any.whl (1.9 kB)
Using cached imbalanced_learn-0.13.0-py3-none-any.whl (238 kB)
Using cached sklearn_compat-0.1.3-py3-none-any.whl (18 kB)
Installing collected packages: sklearn-compat, imbalanced-learn, imblearn
Successfully installed imbalanced-learn-0.13.0 imblearn-0.0 sklearn-compat-0.1.3


In [69]:
# import models, metrics, and data tools
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    classification_report,
    mean_squared_error,
    r2_score,
)

from imblearn.over_sampling import SMOTE

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [70]:
# read in our csv file
transformed_transactions = pd.read_csv("../data/transformed_transactions.csv")
transformed_transactions.head()

|   | amount | oldbalanceOrg | newbalanceOrig | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud | CASH_IN | CASH_OUT | DEBIT | PAYMENT | TRANSFER |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 54152.03 | 42823.36 | 96975.39 | 11438021.32 | 10695480.59 | 0 | 0 | True | False | False | False | False |
| 1 | 279331.66 | 8385167.08 | 8664498.74 | 394276.66 | 114945.0 | 0 | 0 | True | False | False | False | False |
| 2 | 185673.97 | 0.0 | 0.0 | 396994.01 | 582667.97 | 0 | 0 | False | True | False | False | False |
| 3 | 128216.41 | 12158.0 | 0.0 | 17406313.64 | 17534530.05 | 0 | 0 | False | True | False | False | False |
| 4 | 17567.71 | 104890.0 | 87322.29 | 0.0 | 0.0 | 0 | 0 | False | False | False | True | False |


In [71]:
# shows the shape of our sampled dataset
transformed_transactions.shape

(10000, 12)

In [72]:
transformed_transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   amount          10000 non-null  float64
 1   oldbalanceOrg   10000 non-null  float64
 2   newbalanceOrig  10000 non-null  float64
 3   oldbalanceDest  10000 non-null  float64
 4   newbalanceDest  10000 non-null  float64
 5   isFraud         10000 non-null  int64  
 6   isFlaggedFraud  10000 non-null  int64  
 7   CASH_IN         10000 non-null  bool   
 8   CASH_OUT        10000 non-null  bool   
 9   DEBIT           10000 non-null  bool   
 10  PAYMENT         10000 non-null  bool   
 11  TRANSFER        10000 non-null  bool   
dtypes: bool(5), float64(5), int64(2)
memory usage: 595.8 KB


### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [73]:
# select your predictors & target
X = transformed_transactions.drop(columns=['isFraud', 'isFlaggedFraud'])
y = transformed_transactions['isFraud'] # This is our target variable

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
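
With so few fraud rows, a purely random split can leave the test set with only a handful of positives. The sketch below shows how the `stratify` argument of `train_test_split` keeps the class ratio roughly equal in both splits. The arrays `X_demo` and `y_demo` are made-up stand-in data, not the notebook's dataset, and whether stratification suits this project is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10,000 rows with 13 positives (about 0.13%).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 4))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:13] = 1

# stratify=y_demo keeps the positive rate nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("positives in train:", int(y_tr.sum()))
print("positives in test: ", int(y_te.sum()))
```

Applied to the notebook's data, this would change which rows land in the test set, so the metrics reported below would need to be recomputed.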

In [74]:
# train kNN on the imbalanced data
knn_imb = KNeighborsClassifier(n_neighbors=3)
knn_imb.fit(X_train, y_train)

yhat = knn_imb.predict(X_test)
baseline_acc = accuracy_score(y_test, yhat)

print("Baseline testing accuracy (imbalanced)", baseline_acc)

Baseline testing accuracy (imbalanced) 0.9996666666666667
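
A baseline accuracy this high is easier to interpret next to a trivial reference point. The short calculation below uses the test-set class counts shown in the classification reports later in this notebook (2,997 non-fraud rows, 3 fraud rows) to compute what a model that always predicts non-fraud would score.

```python
# Test-set class counts taken from the classification reports in this notebook.
n_legit, n_fraud = 2997, 3

# A "model" that always predicts non-fraud gets every legitimate row right.
majority_accuracy = n_legit / (n_legit + n_fraud)
majority_recall = 0 / n_fraud  # it never catches a fraud case

print(f"always-0 accuracy:     {majority_accuracy:.4f}")
print(f"always-0 fraud recall: {majority_recall:.2f}")
```

Since predicting non-fraud for every row already reaches 0.999 accuracy, recall and F1 on the fraud class are more informative here than accuracy alone.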


In [75]:
# apply SMOTE to rebalance the training set
smote = SMOTE(k_neighbors=2, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("Class distribution after SMOTE:")
print(y_train_smote.value_counts())

Class distribution after SMOTE:
isFraud
0    6990
1    6990
Name: count, dtype: int64
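
SMOTE balances the classes by creating synthetic minority rows that lie between an existing minority sample and one of its nearest minority neighbours. The sketch below illustrates only that interpolation idea with NumPy. It is not imbalanced-learn's implementation, and the function name `smote_like_sample` and the array `X_fraud` are hypothetical.

```python
import numpy as np

def smote_like_sample(X_min, k=2, n_new=5, seed=42):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise Euclidean distances between minority rows.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # a row is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # pick a minority row
        j = neighbours[i, rng.integers(k)]   # pick one of its k neighbours
        gap = rng.random()                   # position along the segment, in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

# Hypothetical minority-class rows with two features.
X_fraud = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
X_new = smote_like_sample(X_fraud)
print(X_new.shape)
```

Every synthetic row is a blend of two real fraud rows, which is why SMOTE should be fitted on the training set only, as done above, and never on the test set.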


### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [76]:
# search a grid of kNN hyperparameters (neighbors, weighting, distance metric)
params = {
        'n_neighbors': list(range(1, 31)),
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan']
}

knn = KNeighborsClassifier()
# set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    knn,
    param_grid=params,
    cv=5,
    scoring='accuracy'
)

# fit this model on your training data
grid_search.fit(X_train, y_train)
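
When one class is rare, selecting hyperparameters by accuracy makes most candidates look nearly identical. The sketch below shows one possible alternative: scoring the grid search by F1 on the positive class and then reading `best_params_` and `best_score_`. The data come from `make_classification` and are purely illustrative. The smaller grid and the `StratifiedKFold` setup are assumptions, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical imbalanced stand-in data (about 5% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=6, weights=[0.95, 0.05], random_state=42
)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="f1",  # scores the positive class instead of overall accuracy
)
search.fit(X_demo, y_demo)

print("best params:", search.best_params_)
print("best mean CV F1:", round(search.best_score_, 3))
```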

### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).

In [77]:
knn_smote = KNeighborsClassifier(n_neighbors=3)
knn_smote.fit(X_train_smote, y_train_smote)

yhat_pred = knn_smote.predict(X_test)
smote_acc = accuracy_score(y_test, yhat_pred)

print("Testing accuracy after applying SMOTE:", smote_acc)

Testing accuracy after applying SMOTE: 0.996


In [78]:
print(precision_score(y_test, yhat_pred))

0.15384615384615385
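
The accuracy of 0.996 and the precision of about 0.154 can be reconciled with a little arithmetic. The worked example below assumes the test set holds 3,000 rows with 3 fraud cases (the support values in the classification reports) and recovers the only confusion counts consistent with both reported numbers.

```python
from fractions import Fraction

n_test, n_fraud = 3000, 3          # test-set size and fraud support from the reports
accuracy = Fraction(996, 1000)     # reported accuracy after SMOTE
precision = Fraction(2, 13)        # 0.15384615... as reported

errors = n_test - accuracy * n_test            # misclassified rows
# TP cannot exceed the 3 fraud rows, so precision 2/13 means 2 hits among 13 flagged rows.
tp, flagged = precision.numerator, precision.denominator
fp = flagged - tp
fn = n_fraud - tp

print("errors:", errors, "| TP:", tp, "FP:", fp, "FN:", fn)
assert fp + fn == errors
```

Under these assumptions the SMOTE-trained kNN caught 2 of the 3 fraud cases but raised 11 false alarms, which is the usual trade-off when oversampling the minority class.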


In [79]:
best_knn = grid_search.best_estimator_

yhat = best_knn.predict(X_test) 


In [80]:
confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[2997    0]
 [   1    2]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2997
           1       1.00      0.67      0.80         3

    accuracy                           1.00      3000
   macro avg       1.00      0.83      0.90      3000
weighted avg       1.00      1.00      1.00      3000
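
The fraud-class row of the report can be reproduced by hand from the confusion matrix. The short calculation below uses the four counts printed above and the standard definitions of accuracy, precision, recall, and F1.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
tn, fp = 2997, 0
fn, tp = 1, 2

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # also called sensitivity
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.9997 precision=1.00 recall=0.67 f1=0.80
```

The one missed fraud case barely moves accuracy but drops recall to two thirds, which is why the report's fraud-class row matters more than its accuracy row.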



## Second Model

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

# SVM Model

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your second model.

In [84]:
param_grid = {
    'C': np.linspace(0.01, 10, 100),
    'penalty': ['l2'],
    'loss': ['squared_hinge'],
    'dual': [True, False]
}

svc = LinearSVC(max_iter=10000)

# set up RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(
    estimator=svc,
    param_distributions=param_grid,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=1,
)

# fit the search on the training data
random_search.fit(X_train, y_train)
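
The transaction amounts and balances span several orders of magnitude, and linear SVMs are sensitive to feature scale. The sketch below shows one way to fold scaling into the search by wrapping `StandardScaler` and `LinearSVC` in a `Pipeline`, so the scaler is refitted on each training fold. The synthetic data, the step names `scale` and `svc`, and the F1 scoring choice are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical stand-in data; the notebook would use X_train and y_train.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, weights=[0.9, 0.1], random_state=42
)

pipe = Pipeline([
    ("scale", StandardScaler()),            # refit on each training fold
    ("svc", LinearSVC(dual=False, max_iter=10000)),
])

search = RandomizedSearchCV(
    pipe,
    param_distributions={"svc__C": np.logspace(-2, 1, 20)},
    n_iter=5,
    cv=5,
    scoring="f1",
    random_state=42,
)
search.fit(X_demo, y_demo)

best_svc = search.best_estimator_            # pipeline refit on all of X_demo
print("best C:", search.best_params_["svc__C"])
```

Using `search.best_estimator_` for the final predictions ensures that the evaluated model is the one the search actually selected.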



In [85]:
# train a LinearSVC with a fixed C=1.0
# (the tuned model from the search above is random_search.best_estimator_)
lin_svc = LinearSVC(C=1.0, max_iter=10000, random_state=42)

# train the classifier on the training set
lin_svc.fit(X_train, y_train)

# make predictions on the held-out test set
yhat = lin_svc.predict(X_test)

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[2997    0]
 [   0    3]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2997
           1       1.00      1.00      1.00         3

    accuracy                           1.00      3000
   macro avg       1.00      1.00      1.00      3000
weighted avg       1.00      1.00      1.00      3000



### 3) Train your model
Select the model with the best hyperparameters and generate predictions on your test set. Evaluate your model's accuracy, precision, and recall (also called sensitivity).

## (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

# Random Forest Model

In [89]:
# NOTE: isFraud is a 0/1 label, so a regressor returns continuous scores rather
# than class predictions; RandomForestClassifier is the classification analogue
param_dist = {
    "criterion": ["squared_error", "absolute_error", "friedman_mse", "poisson"],
    "max_depth": range(5, 100, 5),
    "min_samples_split": range(2, 100),  # integer values must be at least 2
    "max_features": ["sqrt", "log2"],
}

# create an instance of the random forest regression model
rf = RandomForestRegressor()

# set up random search for hyperparameter tuning
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    cv=5
)

# train our model across a range of hyperparameters
random_search.fit(X_train, y_train)

# choose the best random forest found by the search
best_rf = random_search.best_estimator_

# make predictions using the best forest
y_pred_rf = best_rf.predict(X_test)

# calculate the error
mse_rf = mean_squared_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)

print(f"RandomizedSearchCV - Test MSE for RandomForest: {mse_rf:.2f}")
print(f"RandomizedSearchCV - Test R2 for RandomForest: {r2_rf:.2f}")

RandomizedSearchCV - Test MSE for RandomForest: 0.00
RandomizedSearchCV - Test R2 for RandomForest: 0.56


In [94]:
param_dist = {
    "criterion": ["squared_error", "absolute_error", "friedman_mse", "poisson"],
    "max_depth": range(5, 100, 5),
    "min_samples_split": range(2, 100),  # integer values must be at least 2
    "max_features": ["sqrt", "log2"]
}
rf = RandomForestRegressor()

random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_dist, cv=5)
random_search.fit(X_train, y_train)

# extract the best model from random search
best_rf = random_search.best_estimator_

yhat_rf = best_rf.predict(X_test)
mse_rf = mean_squared_error(y_test, yhat_rf)
r2_rf = r2_score(y_test, yhat_rf)

print(f"RandomizedSearchCV - Test MSE for RandomForest: {mse_rf:.2f}")
print(f"RandomizedSearchCV - Test R2 for RandomForest: {r2_rf:.2f}")

RandomizedSearchCV - Test MSE for RandomForest: 0.00
RandomizedSearchCV - Test R2 for RandomForest: 0.52


In [93]:
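# NOTE: grid_search is the kNN search from the first model, so this cell
# re-evaluates the tuned kNN rather than the random forest
best_rf = grid_search.best_estimator_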

yhat = best_rf.predict(X_test) 

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[2997    0]
 [   1    2]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2997
           1       1.00      0.67      0.80         3

    accuracy                           1.00      3000
   macro avg       1.00      0.83      0.90      3000
weighted avg       1.00      1.00      1.00      3000
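
Because the target is a 0/1 label, a classifier variant of the random forest search would produce class predictions that can be passed directly to `confusion_matrix` and `classification_report`. The sketch below is one possible setup, assuming `RandomForestClassifier` in place of the regressor and synthetic stand-in data. The parameter ranges, `n_estimators=50`, and the F1 scoring are illustrative choices, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Hypothetical stand-in data; the notebook would use its own train/test split.
X_demo, y_demo = make_classification(
    n_samples=800, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

param_dist = {
    "criterion": ["gini", "entropy"],       # classification split criteria
    "max_depth": range(5, 30, 5),
    "min_samples_split": range(2, 20),      # integer values must be >= 2
    "max_features": ["sqrt", "log2"],
}

rf_search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    scoring="f1",
    random_state=42,
)
rf_search.fit(X_tr, y_tr)

# Evaluate the forest returned by this search (not an earlier search object).
y_pred = rf_search.best_estimator_.predict(X_te)
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred, zero_division=0))
```

Giving each search object its own name (here `rf_search`) makes it harder to evaluate the wrong model by accident.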



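- KNN model: accuracy = 1.00 (2,999 of 3,000 test rows correct), fraud-class F1 score = 0.80 (macro-average F1 = 0.90)
- SVM model: accuracy = 1.00, fraud-class F1 score = 1.00
- Random Forest: the confusion matrix above was produced from `grid_search.best_estimator_`, which is the tuned KNN model, so it repeats the KNN figures (accuracy = 1.00, fraud-class F1 score = 0.80); the random forest regressor itself was scored only by R2 (0.52 to 0.56)

Of the three models chosen, the **SVM model** performed best on this test set. Because the test set contains only 3 fraud cases, this ranking rests on a single misclassified transaction.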