# **1. Import Library**

*Library* [`pandas`](https://pandas.pydata.org): reads, writes, and processes tabular data using DataFrames.

*Library* [`sklearn.preprocessing`](https://scikit-learn.org/1.5/modules/preprocessing.html): normalizes numerical features so they are on a comparable scale and converts categorical variables into numeric form.

*Library* [`sklearn.model_selection`](https://scikit-learn.org/stable/model_selection.html): splits data into training and testing sets and runs the hyperparameter search.

*Library* [`sklearn.tree`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), [`sklearn.ensemble`](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) & [`sklearn.naive_bayes`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html): provide the machine learning models used for classification.

*Library* [`sklearn.metrics`](https://scikit-learn.org/1.5/api/sklearn.metrics.html): evaluates model performance using the confusion matrix, accuracy, precision, recall, and F1-score.



In [22]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.metrics import confusion_matrix, accuracy_score, f1_score, precision_score, recall_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# **2. Loading Dataset from Clustering Results**

In [23]:
data = pd.read_csv('clustered_transactions.csv')
data

Unnamed: 0,TransactionID,AccountID,TransactionAmount,TransactionDate,TransactionType,Location,DeviceID,IP Address,MerchantID,Channel,CustomerAge,CustomerOccupation,TransactionDuration,LoginAttempts,AccountBalance,PreviousTransactionDate,Cluster_KMeans,Cluster_DBSCAN
0,TX000001,AC00128,14.09,2023-04-11 16:29:14,Debit,San Diego,D000380,162.198.218.92,M015,ATM,70,Doctor,81,1,5112.21,2024-11-04 08:08:08,2,0
1,TX000002,AC00455,376.24,2023-06-27 16:44:19,Debit,Houston,D000051,13.149.61.4,M052,ATM,68,Doctor,141,1,13758.91,2024-11-04 08:09:35,2,0
2,TX000003,AC00019,126.29,2023-07-10 18:16:08,Debit,Mesa,D000235,215.97.143.157,M009,Online,19,Student,56,1,1122.35,2024-11-04 08:07:04,1,1
3,TX000004,AC00070,184.50,2023-05-05 16:32:11,Debit,Raleigh,D000187,200.13.225.150,M002,Online,26,Student,25,1,8569.06,2024-11-04 08:09:06,1,1
4,TX000005,AC00411,13.45,2023-10-16 17:51:24,Credit,Atlanta,D000308,65.164.3.100,M091,Online,26,Student,198,1,7429.40,2024-11-04 08:06:39,1,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2507,TX002508,AC00297,856.21,2023-04-26 17:09:36,Credit,Colorado Springs,D000625,21.157.41.17,M072,Branch,33,Doctor,109,1,12690.79,2024-11-04 08:11:29,2,18
2508,TX002509,AC00322,251.54,2023-03-22 17:36:48,Debit,Tucson,D000410,49.174.157.140,M029,Branch,48,Doctor,177,1,254.75,2024-11-04 08:11:42,2,12
2509,TX002510,AC00095,28.63,2023-08-21 17:08:50,Debit,San Diego,D000095,58.1.27.124,M087,Branch,56,Retired,146,1,3382.91,2024-11-04 08:08:39,3,11
2510,TX002511,AC00118,185.97,2023-02-24 16:24:46,Debit,Denver,D000634,21.190.11.223,M041,Online,23,Student,19,1,1776.91,2024-11-04 08:12:22,1,1


In [24]:
data.drop(columns=['TransactionID','AccountID','TransactionDate','Location','DeviceID','IP Address','MerchantID','PreviousTransactionDate','Cluster_DBSCAN'], inplace=True)
data.head()

Unnamed: 0,TransactionAmount,TransactionType,Channel,CustomerAge,CustomerOccupation,TransactionDuration,LoginAttempts,AccountBalance,Cluster_KMeans
0,14.09,Debit,ATM,70,Doctor,81,1,5112.21,2
1,376.24,Debit,ATM,68,Doctor,141,1,13758.91,2
2,126.29,Debit,Online,19,Student,56,1,1122.35,1
3,184.5,Debit,Online,26,Student,25,1,8569.06,1
4,13.45,Credit,Online,26,Student,198,1,7429.4,1


In [25]:
categorical_columns = data.select_dtypes(include=['object']).columns
for column in categorical_columns:
    label_encoder = LabelEncoder()
    data[column] = label_encoder.fit_transform(data[column])

In [26]:
numerical_columns = ['TransactionAmount', 'CustomerAge', 'TransactionDuration', 'LoginAttempts', 'AccountBalance']
scaler = MinMaxScaler()
data[numerical_columns] = scaler.fit_transform(data[numerical_columns])

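1. Drops columns that are not needed for classification (identifiers, timestamps, location, device, IP address, and merchant fields).
2. Removes Cluster_DBSCAN, since Cluster_KMeans is the classification target.
3. Converts categorical features into integer codes with LabelEncoder, which assigns codes in alphabetical order of the labels (e.g., "Doctor" → 0, "Engineer" → 1).
4. Scales numerical features to the range [0, 1] with MinMaxScaler so that features with large magnitudes do not dominate. Note that the scaler is fitted on the full dataset before the split, so the test rows influence the scaling parameters.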
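
The encoding and scaling steps above can be sketched on a tiny example. The frame below is hypothetical (`toy` and its values are assumptions made for illustration, not the notebook's data); it only shows what `LabelEncoder` and `MinMaxScaler` do to a column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy frame with the same kinds of columns as the notebook's data.
toy = pd.DataFrame({
    "CustomerOccupation": ["Doctor", "Student", "Engineer", "Doctor"],
    "TransactionAmount": [14.09, 376.24, 126.29, 184.50],
})

# LabelEncoder assigns integer codes in sorted (alphabetical) order of the labels.
encoder = LabelEncoder()
toy["CustomerOccupation"] = encoder.fit_transform(toy["CustomerOccupation"])
mapping = {label: code for code, label in enumerate(encoder.classes_)}

# MinMaxScaler maps each column to [0, 1] via (x - min) / (max - min).
scaler = MinMaxScaler()
toy[["TransactionAmount"]] = scaler.fit_transform(toy[["TransactionAmount"]])

print(mapping)  # {'Doctor': 0, 'Engineer': 1, 'Student': 2}
print(toy)
```

The integer codes carry an artificial ordering (Doctor < Engineer < Student). Tree-based models usually tolerate this, but it is worth keeping in mind for models that treat the codes as magnitudes.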

# **3. Data Splitting**

In [27]:
X = data.drop(columns=['Cluster_KMeans'])
y = data['Cluster_KMeans']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

1. X contains all features except the target (Cluster_KMeans).
2. y is the target variable (Cluster_KMeans).
3. train_test_split() splits data into 80% training and 20% testing.
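
As a minimal sketch of the 80/20 split, the example below uses hypothetical arrays (`X_demo`, `y_demo`) with imbalanced labels. The `stratify` argument is an optional addition that the notebook does not use; it is shown here because it keeps the class proportions identical in both subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, 4 imbalanced cluster labels.
rng = np.random.default_rng(42)
X_demo = rng.random((100, 3))
y_demo = np.repeat([0, 1, 2, 3], [40, 30, 20, 10])

# stratify=y_demo (not used in the notebook) preserves class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)  # (80, 3) (20, 3)
print(np.bincount(y_te))       # [8 6 4 2]
```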

# **4. Building Classification Models**


## **a. Building Classification Models**

In [28]:
DecisionTree = DecisionTreeClassifier().fit(X_train, y_train)
rf = RandomForestClassifier().fit(X_train, y_train)
gnb = GaussianNB().fit(X_train, y_train)

1. Decision Tree
    - A simple, rule-based model that splits the data with a sequence of if-else conditions on feature thresholds.
2. Random Forest Classifier
    - An ensemble that combines many decision trees to improve accuracy and reduce overfitting.
3. Gaussian Naive Bayes
    - A probabilistic model that classifies transactions using Bayes' Theorem, assuming each feature is normally distributed within a class.
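
To make the three descriptions concrete, here is a small sketch on stand-in data. The Iris dataset and the settings `max_depth=2` and `n_estimators=25` are assumptions chosen only to keep the output short; they are not the notebook's configuration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, export_text

# Stand-in data (Iris), used only so the sketch is self-contained.
X_demo, y_demo = load_iris(return_X_y=True)

# Decision tree: the fitted model is a set of nested threshold rules.
tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)
rules = export_text(tree)
print(rules)

# Random forest: an ensemble of trees whose predictions are combined
# (scikit-learn averages the class probabilities of the individual trees).
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)
print(len(forest.estimators_))  # 25 individual trees

# Gaussian Naive Bayes: per-class probabilities derived from Bayes' theorem.
nb = GaussianNB().fit(X_demo, y_demo)
proba = nb.predict_proba(X_demo[:1])
print(proba.round(3))
```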

## **b. Evaluation of Classification Models**

In [29]:
def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    results = {
        'Confusion Matrix': cm,
        'Accuracy': accuracy_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred, average='weighted'),
        'Recall': recall_score(y_test, y_pred, average='weighted'),
        'F1 - Score': f1_score(y_test, y_pred, average='weighted')
    }
    return results

results = {
    'Decision Tree': evaluate_model(DecisionTree, X_test, y_test),
    'Random Forest': evaluate_model(rf, X_test, y_test),
    'Gaussian Naive Bayes': evaluate_model(gnb, X_test, y_test)
}

# Collect one row of scalar metrics per model, then build the summary table.
rows = []
for model_name, metrics in results.items():
    rows.append({
        'Model': model_name,
        'Accuracy': metrics['Accuracy'],
        'Precision': metrics['Precision'],
        'Recall': metrics['Recall'],
        'F1-Score': metrics['F1 - Score']
    })
summary = pd.DataFrame(rows)
print(summary)

                  Model  Accuracy  Precision  Recall  F1-Score
0         Decision Tree       1.0        1.0     1.0       1.0
1         Random Forest       1.0        1.0     1.0       1.0
2  Gaussian Naive Bayes       1.0        1.0     1.0       1.0


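1. All three models achieved 100% accuracy, precision, recall, and F1-score on the test set.
2. This suggests that the clusters are highly separable based on the given features. A perfect score is plausible if the K-Means labels were derived from these same features, because the classifiers then only need to recover the clustering boundaries; on its own it is not evidence that the models generalize to other data.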
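
One way to see why such scores can arise is to reproduce the setup on synthetic data: cluster some features with K-Means, then train a classifier to predict the cluster labels from those same features. Everything in this sketch (the blob data, `n_clusters=4`, the variable names) is an assumption for illustration and says nothing about how the notebook's clusters were actually produced.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features; the true blob labels are discarded on purpose.
X_demo, _ = make_blobs(n_samples=600, centers=4, cluster_std=1.0, random_state=42)

# Labels come from clustering the very features the classifier will see.
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_demo)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, labels, test_size=0.2, random_state=42
)
clf = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print(f"Accuracy on KMeans-derived labels: {acc:.3f}")
```

In this kind of setup the classifier is approximating the K-Means assignment rule, so very high accuracy is expected and mostly reflects how the labels were generated.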

## **c. Tuning Classification Models**

In [30]:
rf = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'criterion': ['gini', 'entropy']
}
gridsearch = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, n_jobs=-1, verbose=1)
gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 162 candidates, totalling 810 fits


## **d. Evaluation of Classification Models after Tuning**

In [31]:
print("Best Parameters:", gridsearch.best_params_)
print("Best Accuracy:", gridsearch.best_score_)

Best Parameters: {'criterion': 'gini', 'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}
Best Accuracy: 1.0


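1. The best mean cross-validated accuracy is 100%, meaning the tuned model perfectly predicts Cluster_KMeans on every validation fold of the training set.
2. The reported best parameters are the first value of every list in param_grid. GridSearchCV breaks ties in favor of the earliest candidate, so this most likely means that many (possibly all 162) candidates also scored 1.0. A depth of max_depth=5 is therefore sufficient rather than necessarily optimal, which is consistent with well-separated classes.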
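
When the best score is perfect, it helps to check how many candidates share it before reading anything into `best_params_`. The sketch below does this through `cv_results_` on hypothetical, well-separated data with a deliberately small grid (`small_grid` and the blob data are assumptions, not the notebook's search).

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical, well-separated data so that several candidates can tie.
X_demo, y_demo = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=42)

small_grid = {"n_estimators": [10, 20], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=5, scoring="accuracy"
)
search.fit(X_demo, y_demo)

cv_table = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(cv_table)

# Number of candidates sharing the top rank; more than one means best_params_
# was chosen by grid order rather than by a performance difference.
n_tied = int((cv_table["rank_test_score"] == 1).sum())
print("Candidates tied for first place:", n_tied)
```

Applied to the notebook's search, the same table would show whether all 162 candidates reached a mean score of 1.0.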

## **e. Analysis of Classification Model Evaluation Results**

1. All models achieved 100% accuracy, both before tuning (on the test set) and after tuning (in 5-fold cross-validation on the training set).
2. Hyperparameter tuning did not change model performance, because there was no room for improvement.
3. The dataset is likely too easy to classify, meaning the clusters are well-separated.

Follow-up recommendations:

1. Remove highly correlated features and retrain the model.
2. Test the classifier on new, real-world transaction data.
3. Before deploying, validate the model on unseen data and ensure that data leakage is not affecting performance.
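
The first and third recommendations could be approached roughly as follows. This is a sketch on hypothetical data: the frame, the 0.9 correlation threshold, and the variable names are all assumptions. The scaler is placed inside a pipeline so that, during cross-validation, it is fitted on the training part of each fold only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical frame: 'b' is almost a copy of 'a', 'c' is independent.
rng = np.random.default_rng(42)
a = rng.normal(size=300)
frame = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=300),
    "c": rng.normal(size=300),
})
target = (frame["a"] + frame["c"] > 0).astype(int)

# Recommendation 1: drop one feature from each pair with |correlation| > 0.9
# (the 0.9 threshold is an arbitrary choice for illustration).
corr = frame.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = frame.drop(columns=to_drop)

# Recommendation 3: with the scaler inside the pipeline, each CV fold fits it
# on the training part only, so held-out rows never influence the scaling.
pipeline = make_pipeline(MinMaxScaler(), RandomForestClassifier(random_state=42))
scores = cross_val_score(pipeline, reduced, target, cv=5)

print("Dropped:", to_drop)
print("Mean CV accuracy:", scores.mean())
```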