# Name: Aindri Singh
# Roll Number: 102316039
# Batch: 3P12
# Assignment 2: Sampling Assignment

# Sampling Assignment: Credit Card Fraud Detection

## Part 1: Data Loading and Initial Exploration

In [20]:
# Part 1: Data Loading and Initial Exploration
# Importing required libraries
import pandas as pd
import numpy as np

# Loading dataset directly from GitHub
url = "https://raw.githubusercontent.com/AnjulaMehto/Sampling_Assignment/main/Creditcard_data.csv"
df = pd.read_csv(url)

print("First 5 rows of the dataset:")
display(df.head())

print("\nClass distribution before balancing:")
display(df['Class'].value_counts())

First 5 rows of the dataset:


,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,1
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0



Class distribution before balancing:


Class,count
0,763
1,9
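
The counts above imply a heavily skewed dataset. As a rough illustration (the variable names are hypothetical and not part of the notebook), the snippet below computes the fraud share and the accuracy that a trivial "always predict class 0" baseline would reach on the raw data, which is why balancing is applied before accuracy is compared.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 763, 1: 9}
total = sum(counts.values())

# Share of fraudulent transactions (class 1) in the raw data.
fraud_share = counts[1] / total * 100

# Accuracy of a hypothetical baseline that always predicts class 0.
baseline_accuracy = counts[0] / total * 100

print(f"Fraud share: {fraud_share:.2f}%")
print(f"Always-predict-0 accuracy: {baseline_accuracy:.2f}%")
```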


## Part 2: Feature-Target Separation and Dataset Balancing

In [21]:
# Part 2: Feature-Target Separation and Dataset Balancing
# Separating Features and Target
X = df.drop('Class', axis=1)
y = df['Class']

# Balancing the Dataset using RandomOverSampler
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X, y)

print("Class distribution after Random Over-sampling:")
display(y_balanced.value_counts())

Class distribution after Random Over-sampling:


Class,count
0,763
1,763


## Part 3: Creating Five Samples

In [22]:
# Part 3: Creating Five Samples
# Combine the balanced features and target once, then draw five 70% samples
balanced_df = pd.concat([X_balanced, y_balanced], axis=1)

samples = []
for i in range(5):
    # A different random_state per iteration yields a different 70% sample
    sample = balanced_df.sample(frac=0.7, random_state=i)
    samples.append(sample)

print(f"Created {len(samples)} samples, each with 70% of the balanced dataset.")
print(f"Example: First sample shape: {samples[0].shape}")

Created 5 samples, each with 70% of the balanced dataset.
Example: First sample shape: (1068, 31)
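
The balanced dataset has 1526 rows, so `frac=0.7` gives 1068 rows per sample. The sketch below repeats the same `sample(frac=0.7, random_state=i)` call on a small toy frame (hypothetical data, not the credit card dataset). It shows that each draw is without replacement, and that a random 70% draw is not guaranteed to keep the two classes exactly equal.

```python
import pandas as pd

# Toy stand-in for the balanced dataset: 50 rows per class (hypothetical data).
toy = pd.DataFrame({
    "feature": range(100),
    "Class": [0] * 50 + [1] * 50,
})

# Same pattern as the notebook: one 70% sample per seed.
toy_samples = [toy.sample(frac=0.7, random_state=seed) for seed in range(5)]

for seed, sample in enumerate(toy_samples):
    # Rows are drawn without replacement, so each index appears at most once.
    class_counts = sample["Class"].value_counts().to_dict()
    print(f"seed={seed}, rows={len(sample)}, unique index={sample.index.is_unique}, counts={class_counts}")
```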


## Part 4: Defining Sampling Techniques and Machine Learning Models

In [23]:
# Part 4: Define five different sampling techniques (Sampling1, Sampling2, Sampling3, Sampling4, Sampling5)
from imblearn.under_sampling import RandomUnderSampler, NearMiss, TomekLinks
from imblearn.over_sampling import SMOTE

sampling_methods = [
    RandomUnderSampler(random_state=42), # Sampling1 (RandomUnderSampler)
    RandomOverSampler(random_state=42),  # Sampling2 (RandomOverSampler)
    SMOTE(random_state=42),              # Sampling3 (SMOTE)
    NearMiss(),                          # Sampling4 (NearMiss)
    TomekLinks()                         # Sampling5 (TomekLinks)
]

print("Defined sampling methods:")
for i, sampler in enumerate(sampling_methods):
    print(f"Sampling{i+1}: {type(sampler).__name__}")

Defined sampling methods:
Sampling1: RandomUnderSampler
Sampling2: RandomOverSampler
Sampling3: SMOTE
Sampling4: NearMiss
Sampling5: TomekLinks


In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Part 4: Define five different ML models (M1, M2, M3, M4 and M5)
models = [
    LogisticRegression(max_iter=1000), # M1 (LogisticRegression)
    DecisionTreeClassifier(),           # M2 (DecisionTreeClassifier)
    RandomForestClassifier(),           # M3 (RandomForestClassifier)
    KNeighborsClassifier(),             # M4 (KNeighborsClassifier)
    GaussianNB()                        # M5 (GaussianNB)
]

print("Defined machine learning models:")
for i, model in enumerate(models):
    print(f"M{i+1}: {type(model).__name__}")

Defined machine learning models:
M1: LogisticRegression
M2: DecisionTreeClassifier
M3: RandomForestClassifier
M4: KNeighborsClassifier
M5: GaussianNB


## Part 5: Applying Sampling Techniques and Models, Determining Best Accuracy

In [25]:
# Part 5: Determine which sampling technique gives higher accuracy on which model.
results = []

for model in models:
    model_results = []
    for sampler in sampling_methods:
        # Apply the sampling technique to the balanced dataset
        X_s, y_s = sampler.fit_resample(X_balanced, y_balanced)

        # Split the resampled data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(
            X_s, y_s, test_size=0.3, random_state=42
        )

        # Train the model and make predictions
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Calculate accuracy and store the result
        acc = accuracy_score(y_test, y_pred) * 100
        model_results.append(round(acc, 2))

    results.append(model_results)

# Create a DataFrame to display the accuracy results
accuracy_table = pd.DataFrame(
    results,
    columns=["Sampling1 (RandomUnderSampler)", "Sampling2 (RandomOverSampler)", "Sampling3 (SMOTE)", "Sampling4 (NearMiss)", "Sampling5 (TomekLinks)"],
    index=["M1 (LogisticRegression)", "M2 (DecisionTreeClassifier)", "M3 (RandomForestClassifier)", "M4 (KNeighborsClassifier)", "M5 (GaussianNB)"]
)

print("Accuracy Table (%):")
display(accuracy_table)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Accuracy Table (%):


Model,Sampling1 (RandomUnderSampler),Sampling2 (RandomOverSampler),Sampling3 (SMOTE),Sampling4 (NearMiss),Sampling5 (TomekLinks)
M1 (LogisticRegression),91.27,91.7,91.7,91.27,91.7
M2 (DecisionTreeClassifier),100.0,98.69,98.47,100.0,98.47
M3 (RandomForestClassifier),100.0,99.78,100.0,100.0,100.0
M4 (KNeighborsClassifier),97.6,98.47,98.47,97.6,98.47
M5 (GaussianNB),65.94,78.17,78.17,67.03,78.17


### Analysis of Results:

From the `accuracy_table`, we can observe the performance of each model with different sampling techniques. To identify the best combination, we look for the highest accuracy scores.

*   **M1 (Logistic Regression):** Accuracy is nearly constant: 91.27% with Random Under-sampling and NearMiss, and 91.70% with Random Over-sampling, SMOTE, and TomekLinks. The convergence warning shows that the solver hit the iteration limit even with `max_iter=1000`, so the features likely need to be scaled, or `max_iter` raised further, for this model to converge.
*   **M2 (Decision Tree Classifier):** Achieves 100% accuracy with both Random Under-sampling and NearMiss, and between 98.47% and 98.69% with the other three techniques.
*   **M3 (Random Forest Classifier):** Achieves 100% accuracy with Random Under-sampling, SMOTE, NearMiss, and TomekLinks, and 99.78% with Random Over-sampling. It has the highest and most consistent accuracy in this comparison. Neither tree-based model sets `random_state`, so repeated runs may differ slightly.
*   **M4 (K-Neighbors Classifier):** Performs well, with accuracy ranging from 97.60% to 98.47%. Random Over-sampling, SMOTE, and TomekLinks yield slightly better results for this model.
*   **M5 (Gaussian Naive Bayes):** Shows the lowest accuracy among the models, ranging from 65.94% to 78.17%. Random Over-sampling, SMOTE, and TomekLinks provide a notable improvement compared to Random Under-sampling and NearMiss.
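
The scaling suggestion raised for Logistic Regression above could be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's pipeline: `X_toy`, `y_toy`, and `scaled_lr` are hypothetical names, and one feature is inflated on purpose to imitate unscaled columns such as `Time` or `Amount`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the credit card dataset).
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)
# Inflate one feature's scale to mimic unscaled columns.
X_toy[:, 0] *= 10_000

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42
)

# The scaler is fitted on the training split only, then applied to the test split.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

accuracy = scaled_lr.score(X_te, y_te) * 100
print(f"Accuracy with scaling: {accuracy:.2f}%")
```

Wrapping the scaler and the model in one pipeline keeps the scaling statistics from being computed on test rows.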

**Overall Best Performance:**

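*   **Random Forest Classifier (M3)** achieved 100% accuracy with four of the five techniques: **Random Under-sampling (Sampling1)**, **SMOTE (Sampling3)**, **NearMiss (Sampling4)**, and **TomekLinks (Sampling5)**.
*   **Decision Tree Classifier (M2)** also achieved 100% accuracy with **Random Under-sampling (Sampling1)** and **NearMiss (Sampling4)**.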

These results suggest that, for this specific dataset and experimental setup, tree-based models (Decision Tree and the Random Forest ensemble) yield the highest accuracy with every sampling technique. The choice of sampling technique matters most for Gaussian Naive Bayes, whose accuracy ranges from 65.94% to 78.17%. Logistic Regression varies by less than half a percentage point (91.27% to 91.70%), and K-Neighbors by less than one point.