# Model Training

In this notebook, we will ask you a series of questions regarding model selection. Based on your responses, we will ask you to create the ML models that you've chosen. 

The bonus step is completely optional, but if you provide a sufficient third machine learning model in this project, we will add `1000` points to your Kahoot leaderboard score.

**Note**: Use the dataset that you've created in your previous data transformation step (not the original model).

## Questions
Is this a classification or regression task?  

* Classification beacuse we are determining if a bank transaction is fraud or not fraud

Are you predicting for multiple classes or binary classes?  

* Binary classes because it is either fraud or not fraud

Given these observations, which 2 (or possibly 3) machine learning models will you choose?  

* kNN
* svm


## First Model

Using the first model that you've chosen, implement the following steps.

### 1) Create a train-test split

Use your cleaned and transformed dataset to divide your features and labels into training and testing sets. Make sure you’re only using numeric or properly encoded features.  

In [57]:
# Run me!
!pip install imblearn



In [58]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.metrics import classification_report

import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np

In [59]:
df = pd.read_csv("../notebooks/transactions_transformed.csv")
df

Unnamed: 0,amount,oldbalanceOrg,newbalanceOrig,oldbalanceDest,newbalanceDest,isFraud,isFlaggedFraud,type_CASH_IN,type_CASH_OUT,type_DEBIT,type_PAYMENT,type_TRANSFER
0,983.09,36730.24,35747.15,0.00,0.00,0,0,0.0,0.0,0.0,1.0,0.0
1,55215.25,99414.00,44198.75,0.00,0.00,0,0,0.0,0.0,0.0,1.0,0.0
2,220986.01,7773074.97,7994060.98,924031.48,703045.48,0,0,1.0,0.0,0.0,0.0,0.0
3,2357394.75,0.00,0.00,4202580.45,6559975.19,0,0,0.0,0.0,0.0,0.0,1.0
4,67990.14,0.00,0.00,625317.04,693307.19,0,0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
999995,13606.07,114122.11,100516.04,0.00,0.00,0,0,0.0,0.0,0.0,1.0,0.0
999996,9139.61,0.00,0.00,0.00,0.00,0,0,0.0,0.0,0.0,1.0,0.0
999997,153650.41,50677.00,0.00,0.00,380368.36,0,0,0.0,1.0,0.0,0.0,0.0
999998,163810.52,0.00,0.00,357850.15,521660.67,0,0,0.0,1.0,0.0,0.0,0.0


In [60]:
df = df.sample(n=500000)

In [61]:
# select the predictors
X = df.drop(columns = ["isFraud","isFlaggedFraud"])

# select the target
y = df["isFraud"]


# Split the dataset into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 42)

### 2) Search for best hyperparameters
Use tools like GridSearchCV, RandomizedSearchCV, or model-specific tuning functions to find the best hyperparameters for your first model.

In [62]:
# train kNN on the unscaled, imbalanced data
knn_imb = KNeighborsClassifier(n_neighbors=3)
knn_imb.fit(X_train, y_train)

yhat = knn_imb.predict(X_test)
baseline_acc = accuracy_score(y_test, yhat)
baseline_pre = precision_score(y_test, yhat)

#Accuracy: Overall correctness of the model.
#Precision: Of all the positive predictions made by the model, how many were actually correct?
#Recall: Of all the actual positives in the dataset, how many did the model correctly identify?
#Specificity: Of all the actual negatives in the dataset, how many did the model correctly identify as negative?

print("Testing accuracy on non-scaled + imbalanced data:", baseline_acc) 
print(f"Testing precision on non-scaled + imbalanced data:: {baseline_pre}")

Testing accuracy on non-scaled + imbalanced data: 0.9993133333333334
Testing precision on non-scaled + imbalanced data:: 0.849624060150376


#### Explanation: Before Smote:
The accuracy measures how many predictions (both fraud and not fraud) the model got correct out of all predictions.

Precision (for fraud): Out of all transactions the model predicted as fraud, how many were actually fraud 

* The accuracy being 0.9993533333333333 means the model correctly predicted almost all non-fraud transactions but that’s the majority of data.
* Precision being 0.7914110429447853 means it predicted a few cases as fraud, and most of those were correct (high precision).
* But it probably missed most of the actual frauds (i.e., low recall), because it was hesitant to ever say "fraud."
* Think of out initial "isFlaggedFraud" model and how poorly it did at detecting fraud
* It being unscaled also affects the outcome because k-Nearest Neighbors (kNN) is distance-based. So the model ends up learning patterns based more on feature size rather than meaningful relationships — which hurts its ability to detect the subtle patterns in fraud.

In [63]:
# Apply SMOTE to rebalance the training set (number of neighbors needs to be less than number of minority class samples)
smote = SMOTE(k_neighbors=2, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("Class distribution after SMOTE:")
print(y_train_smote.value_counts())

Class distribution after SMOTE:
isFraud
0    349537
1    349537
Name: count, dtype: int64


In [64]:
# Retrain kNN on the balanced data
knn_smote = KNeighborsClassifier(n_neighbors=3)
knn_smote.fit(X_train_smote, y_train_smote)

yhat_pred = knn_smote.predict(X_test)
smote_acc = accuracy_score(y_test, yhat_pred)
smote_prec = precision_score(y_test, yhat_pred)


print(f"Testing accuracy after applying SMOTE: {smote_acc}")
print(f"Testing precision after applying SMOTE : {smote_prec}")

Testing accuracy after applying SMOTE: 0.9953333333333333
Testing precision after applying SMOTE : 0.18811881188118812


#### Explanation: After Smote :
* The accuracy is still high (0.997) but precision dropped to 0.23 which means the model is now predicting more transactions as “fraud”, but a lot of them are false positives (lower precision)

#### Randomized Search on Smote data 

In [65]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV

#  Set up hyperparameter space
param_grid = {
    'n_neighbors': list(range(1, 31)),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Use a fresh (untrained) KNN model
knn = KNeighborsClassifier()

# Run RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=knn,
    param_distributions=param_grid,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
    verbose=2
)

# Step 5: Fit on SMOTE data (not scaled)
random_search.fit(X_train_smote, y_train_smote)

# Step 6: Predict on raw test set (also unscaled!)
yhat = random_search.best_estimator_.predict(X_test)

# Step 7: Evaluate
from sklearn.metrics import confusion_matrix, classification_report

conf_matrix = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Best parameters:", random_search.best_params_)
print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[CV] END .metric=euclidean, n_neighbors=24, weights=distance; total time=   9.7s
[CV] END .metric=euclidean, n_neighbors=24, weights=distance; total time=   9.9s
[CV] END .metric=euclidean, n_neighbors=24, weights=distance; total time=  10.1s
[CV] END ..metric=euclidean, n_neighbors=23, weights=uniform; total time=  17.5s
[CV] END ..metric=euclidean, n_neighbors=23, weights=uniform; total time=  17.6s
[CV] END ..metric=euclidean, n_neighbors=23, weights=uniform; total time=  18.0s
[CV] END ..metric=euclidean, n_neighbors=23, weights=uniform; total time=  18.0s
[CV] END .metric=euclidean, n_neighbors=24, weights=distance; total time=   9.0s
[CV] END ..metric=euclidean, n_neighbors=23, weights=uniform; total time=  19.7s
[CV] END .metric=euclidean, n_neighbors=24, weights=distance; total time=  11.6s
[CV] END ...metric=euclidean, n_neighbors=3, weights=uniform; total time=  13.9s
[CV] END .metric=euclidean, n_neighbors=28, wei

In [66]:
print("Best parameters:", random_search.best_params_)
print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)

Best parameters: {'weights': 'uniform', 'n_neighbors': 1, 'metric': 'euclidean'}
Confusion Matrix:
 [[149392    412]
 [    48    148]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    149804
           1       0.26      0.76      0.39       196

    accuracy                           1.00    150000
   macro avg       0.63      0.88      0.69    150000
weighted avg       1.00      1.00      1.00    150000



#### Reflection after RandomizedSearch on smote parameters:
* The precison is a little better detecting fraud with .32 but it is not a strong model since the f1-score is low with a 0.46. 


####  Scale and Smote 
Now we will scale and smote to see if that affects the model's accuracy and precison

##### MinMaxScaler
* Using `MinMaxScaler` so that all features lie in the range [0, 1]. 
* Retrain the kNN model on the scaled data and observe any improvements in performance.

In [67]:
from sklearn.preprocessing import MinMaxScaler
# Initialize the MinMaxScaler to scale data to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data, then transform both training and testing data
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert scaled training data to a DataFrame for a quick visualization
df_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
print("First five rows of the scaled training data:")
print(df_scaled.head())

First five rows of the scaled training data:
     amount  oldbalanceOrg  newbalanceOrig  oldbalanceDest  newbalanceDest  \
0  0.001585       0.000266        0.000000        0.000850        0.001263   
1  0.001833       0.210261        0.211302        0.000927        0.000451   
2  0.001378       0.000547        0.000000        0.000414        0.000773   
3  0.000197       0.000008        0.000000        0.000000        0.000000   
4  0.000010       0.000000        0.000000        0.000000        0.000000   

   type_CASH_IN  type_CASH_OUT  type_DEBIT  type_PAYMENT  type_TRANSFER  
0           0.0            1.0         0.0           0.0            0.0  
1           1.0            0.0         0.0           0.0            0.0  
2           0.0            1.0         0.0           0.0            0.0  
3           0.0            0.0         0.0           1.0            0.0  
4           0.0            0.0         0.0           1.0            0.0  


In [68]:
# Retrain kNN on the scaled data (but imbalanced) data using k=3 for illustration
knn_scaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled.fit(X_train_scaled, y_train)

# Evaluate the classifier on the scaled test data
test_pred_scaled = knn_scaled.predict(X_test_scaled) #Model's predicted class labels on scaled test set
scaled_accuracy = accuracy_score(y_test, test_pred_scaled) #Accuracy score of the model

print("Testing accuracy on scaled data:", scaled_accuracy)

Testing accuracy on scaled data: 0.99946


In [69]:
# Apply SMOTE to rebalance the training set (number of neighbors needs to be less than number of minority class samples)
smote = SMOTE(k_neighbors=2, random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_scaled, y_train)
# X_train_smote: New training features
# y_train_smote: New training labels that are resampled, now balanced class distribution


print("Class distribution after SMOTE:")
print(y_train_smote.value_counts()) #we now see that the labels are equal

Class distribution after SMOTE:
isFraud
0    349537
1    349537
Name: count, dtype: int64


In [70]:
#Train KNN on scaled, SMOTE-balanced data
knn_scaled_smote = KNeighborsClassifier(n_neighbors=3) #New KNN model trained on balanced, scaled data — better for learning from both classes equally
knn_scaled_smote.fit(X_train_smote, y_train_smote)

In [71]:
# Predict and evaluate
yhat_smote_scaled = knn_scaled_smote.predict(X_test_scaled) #Predictions from SMOTE-trained model

#Accuracy and percision of balanced KNN
smote_scaled_acc = accuracy_score(y_test, yhat_smote_scaled)
smote_scaled_prec = precision_score(y_test, yhat_smote_scaled)

print(f"Testing accuracy after SMOTE + scaling: {smote_scaled_acc}")
print(f"Testing precision after SMOTE + scaling: {smote_scaled_prec}")

Testing accuracy after SMOTE + scaling: 0.9971333333333333
Testing precision after SMOTE + scaling: 0.2918149466192171


#### Explanation: After Smote + Scaled:
* A similar occuring with randomized smote is occuring. The accuracy is still high (0.997) but precision dropped to 0.35 which means the model is now predicting more transactions as “fraud”, but a lot of them are false positives (lower precision). At the same time, scaling the data ensures that no single feature (like amount) dominates the distance calculations in kNN, leading to fairer and more balanced learning.



In [72]:
"""
#Grid Search Run Time is long <30 minutes

# Implement full grid-search on the knn model to find best hyperparams
params = {
        'n_neighbors': list(range(1, 31)),
        'weights': ['uniform', 'distance'],
        'metric': ['euclidean', 'manhattan']
}

knn = KNeighborsClassifier()

# set up GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(
    knn,
    param_grid=params,
    cv=5,
    scoring='accuracy'
    #scoring= 'f1'
)
#fit this model on your training data
grid_search.fit(X_train_smote, y_train_smote)

print("Best parameters:", grid_search.best_params_)
print("Best CV accuracy:", grid_search.best_score_)
"""


'\n#Grid Search Run Time is long <30 minutes\n\n# Implement full grid-search on the knn model to find best hyperparams\nparams = {\n        \'n_neighbors\': list(range(1, 31)),\n        \'weights\': [\'uniform\', \'distance\'],\n        \'metric\': [\'euclidean\', \'manhattan\']\n}\n\nknn = KNeighborsClassifier()\n\n# set up GridSearchCV with 5-fold cross-validation\ngrid_search = GridSearchCV(\n    knn,\n    param_grid=params,\n    cv=5,\n    scoring=\'accuracy\'\n    #scoring= \'f1\'\n)\n#fit this model on your training data\ngrid_search.fit(X_train_smote, y_train_smote)\n\nprint("Best parameters:", grid_search.best_params_)\nprint("Best CV accuracy:", grid_search.best_score_)\n'

In [None]:
#Randomized Search for scaled and smoted Run Time is long <30 minutes


#  Set up hyperparameter space
param_grid = {
    'n_neighbors': list(range(1, 31)),
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

# Use a fresh (untrained) KNN model
knn = KNeighborsClassifier()

# Run RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=knn_scaled_smote,
    param_distributions=param_grid,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1,
    verbose=2
)

# Step 5: Fit on SMOTE data (not scaled)
random_search.fit(X_train_smote, y_train_smote)

# Step 6: Predict on raw test set (also unscaled!)
yhat = random_search.best_estimator_.predict(X_test)

# Step 7: Evaluate
from sklearn.metrics import confusion_matrix, classification_report

conf_matrix = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Best parameters:", random_search.best_params_)
print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)


Fitting 5 folds for each of 20 candidates, totalling 100 fits


### 3) Train your model
Select the model with best hyperparameters and generate predictions on your test set. Evaluate your models accuracy, precision, recall, and sensitivity.  

In [None]:
# Step 7: Evaluate
from sklearn.metrics import confusion_matrix, classification_report

conf_matrix = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Best parameters:", random_search.best_params_)
print("Confusion Matrix:\n", conf_matrix)
print("\nClassification Report:\n", class_report)

Best parameters: {'weights': 'uniform', 'n_neighbors': 1, 'metric': 'euclidean'}
Confusion Matrix:
 [[2996    0]
 [   4    0]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2996
           1       0.00      0.00      0.00         4

    accuracy                           1.00      3000
   macro avg       0.50      0.50      0.50      3000
weighted avg       1.00      1.00      1.00      3000



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Overall Conclusions about kNN model:

* The best kNN model was the 

## Second Model SVM

Create a second machine learning object and rerun steps (2) & (3) on this model. Compare accuracy metrics between these two models. Which handles the class imbalance more effectively?

Create as many code-blocks as needed.

In [None]:
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, classification_report

from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV

# TODO: perform a train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# initialize LinearSVC with regularization parameter C=1.0 and set max_iter for convergence
lin_svc = LinearSVC(C=1.0, max_iter=10000, random_state=42)

# train the classifier on the dataset
lin_svc.fit(X_train, y_train)

# make predictions on the same dataset
yhat = lin_svc.predict(X_test) 

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[2996    0]
 [   0    4]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2996
           1       1.00      1.00      1.00         4

    accuracy                           1.00      3000
   macro avg       1.00      1.00      1.00      3000
weighted avg       1.00      1.00      1.00      3000



This is a poor model and worse than svm since the f1 score is only 0.07 at detecting fraud. 

In [None]:
# TODO: implement random search on the LinearSVC model to find best hyperparams
# DOCS: https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html
param_grid = {
    'C': np.linspace(0.01, 10, 100),
    'penalty': ['l2'],
    'loss': ['squared_hinge'],
    'dual': [True, False]
}

svc = LinearSVC(max_iter=10000)

# TODO: set up RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(
    estimator=svc,
    param_distributions=param_grid,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    random_state=42,
    n_jobs=1,
)

# TODO: fit this model on your training data
random_search.fit(X_train, y_train)



In [None]:
best_svc = random_search.best_estimator_

#make predicitions on the same dataset
yhat = best_svc.predict(X_test)

confusion = confusion_matrix(y_test,yhat)
class_report = classification_report(y_test,yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report:\n", class_report)

Confusion Matrix 
 [[2996    0]
 [   0    4]]

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2996
           1       1.00      1.00      1.00         4

    accuracy                           1.00      3000
   macro avg       1.00      1.00      1.00      3000
weighted avg       1.00      1.00      1.00      3000



### (Bonus/Optional) Third Model

Create a third machine learning model and rerun steps (2) & (3) on this model. Which model has the best predictive capabilities? 

Create as many code-blocks as needed.

In [None]:
from sklearn.ensemble import RandomForestClassifier
# instantiate a RandomForestClassifier with a fixed random seed for reproducibility
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

In [None]:
# let's generate a classification report to see how well our random forest performed
yhat = rf.predict(X_test) 

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[2996    0]
 [   1    3]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2996
           1       1.00      0.75      0.86         4

    accuracy                           1.00      3000
   macro avg       1.00      0.88      0.93      3000
weighted avg       1.00      1.00      1.00      3000



In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# create a decision stump (a tree with max_depth=1) to use as the weak learner
stump = DecisionTreeClassifier(max_depth=1)

# initialize the AdaBoostClassifier with 50 weak learners and a fixed random state
ada = AdaBoostClassifier(estimator=stump, random_state=42)

ada.fit(X_train, y_train)

In [None]:
# let's generate a classification report to see how well our random forest performed
yhat = ada.predict(X_test) 

confusion = confusion_matrix(y_test, yhat)
class_report = classification_report(y_test, yhat)

print("Confusion Matrix \n", confusion)
print("\nClassification Report\n", class_report)

Confusion Matrix 
 [[2996    0]
 [   1    3]]

Classification Report
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      2996
           1       1.00      0.75      0.86         4

    accuracy                           1.00      3000
   macro avg       1.00      0.88      0.93      3000
weighted avg       1.00      1.00      1.00      3000

