Reference Chat - https://chat.openai.com/share/cb88c8d6-90d6-44da-ba5b-668bfcb096f7


We'll simulate a small dataset of transactions, each with a few basic features:

Transaction ID: A unique identifier for each transaction.

Amount: The amount of money involved in the transaction.

Type: The type of transaction (e.g., deposit, withdrawal).

Location: The geographic location of the transaction.

Time: The time at which the transaction occurred.

IsFraud: A flag indicating whether the transaction is fraudulent. We include it here as the training label; for new transactions in a real scenario this value is unknown and is what the model aims to predict.
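The pre-filtering code reads transactions.csv, so that file has to exist first. A minimal sketch of how such a dataset could be simulated is shown here. The column names follow the list above (plus the TransactionID spelling used later in the notebook); the row count, the uniform amount range, the 5% fraud rate, and the helper name make_sample_transactions are all assumptions made for illustration, not values taken from the original data.

```python
import numpy as np
import pandas as pd


def make_sample_transactions(n_rows=1000, seed=42):
    """Build a synthetic transactions table (all distributions are assumptions)."""
    rng = np.random.default_rng(seed)
    # Random minute offsets within one year, added to an arbitrary start date
    minute_offsets = rng.integers(0, 365 * 24 * 60, size=n_rows)
    return pd.DataFrame({
        "TransactionID": np.arange(1, n_rows + 1),
        "Amount": rng.uniform(1, 10_000, size=n_rows).round(2),
        "Type": rng.choice(["deposit", "withdrawal", "transfer"], size=n_rows),
        "Location": rng.choice(["Online", "Branch", "ATM"], size=n_rows),
        "Time": pd.Timestamp("2023-01-01") + pd.to_timedelta(minute_offsets, unit="min"),
        "IsFraud": rng.choice([0, 1], size=n_rows, p=[0.95, 0.05]),
    })


sample_df = make_sample_transactions()
print(sample_df.head())
```

Calling sample_df.to_csv('transactions.csv', index=False) afterwards would write a file in the shape the rest of the notebook expects.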

For the pre-filtering step, we'll perform:

Removal of outliers based on the transaction amount.
Filtering based on transaction type, keeping only the types we assume are more commonly fraudulent (here, withdrawals and transfers).
Basic cleaning, like filling in missing values if necessary (our sample will not include missing values for simplicity).
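The sample data has no missing values, but if it did, the cleaning step could look roughly like the sketch below. The toy frame and the choice of median and mode as fill values are assumptions; other strategies (dropping rows, a dedicated "unknown" category) may suit real data better.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps (made-up values, for illustration only)
toy = pd.DataFrame({
    "Amount": [120.0, np.nan, 560.0, 80.0],
    "Type": ["deposit", None, "transfer", "transfer"],
})

cleaned = toy.copy()
# Numeric column: fill with the median, which is robust to extreme amounts
cleaned["Amount"] = cleaned["Amount"].fillna(cleaned["Amount"].median())
# Categorical column: fill with the most frequent value
cleaned["Type"] = cleaned["Type"].fillna(cleaned["Type"].mode().iloc[0])

print(cleaned)
```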

In [None]:
import pandas as pd
import numpy as np

# Load the sample dataset
# (assumes transactions.csv already exists in the working directory)
df = pd.read_csv('transactions.csv')


# Pre-filtering steps
# 1. Removal of outliers in transaction amount
q1, q3 = np.percentile(df["Amount"], [25, 75])
iqr = q3 - q1
lower_bound = q1 - (1.5 * iqr)
upper_bound = q3 + (1.5 * iqr)

print(q1,q3)

filtered_df = df[(df["Amount"] > lower_bound) & (df["Amount"] < upper_bound)]

# 2. Filtering based on transaction type (e.g., focusing on 'withdrawal' and 'transfer' types)
filtered_df = filtered_df[filtered_df["Type"].isin(["withdrawal", "transfer"])]

# Save the filtered dataset for the next step
filtered_df.to_csv('filtered_transactions.csv', index=False)


2485.5075 7467.8475


In Step 2, Feature Extraction, the goal is to transform raw data into meaningful features that can be used by machine learning models to detect fraudulent activities in e-banking transactions. Given the sample dataset structure provided earlier, we will focus on extracting and engineering features that could be significant for identifying potential fraud.

Feature Engineering Ideas
Time Features: Fraudulent activities might occur more frequently during certain hours. Extract features like the hour of the day or the day of the week from the transaction time.

Amount Features: Since the transaction amount could be indicative of fraud, consider normalizing or binning this feature. For example, transactions can be categorized into small, medium, and large based on amount thresholds.

Location-Based Features: If certain locations are more prone to fraud, create dummy variables for each location type (e.g., Online, Branch, ATM).

Type-Based Features: Similarly, transaction types could be relevant. Create dummy variables for each transaction type (e.g., deposit, withdrawal, transfer).
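The amount idea above mentions normalizing as an alternative to binning. A minimal sketch of min-max scaling is given here; the toy amounts and the column name AmountScaled are assumptions. In the notebook, a step like this would have to run before the Amount column is dropped.

```python
import pandas as pd

# Toy amounts (made-up values, for illustration only)
amounts = pd.DataFrame({"Amount": [100.0, 2500.0, 4000.0, 9000.0]})

# Min-max scaling maps the smallest amount to 0 and the largest to 1
amount_min = amounts["Amount"].min()
amount_max = amounts["Amount"].max()
amounts["AmountScaled"] = (amounts["Amount"] - amount_min) / (amount_max - amount_min)

print(amounts)
```

When a model is trained afterwards, the minimum and maximum are usually computed on the training split only and then reused for the test split, so that no information from the test data leaks into the features.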

In [None]:

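# Load the filtered data and parse the timestamps
# (use df_filtered here, not the unfiltered df, so the rows line up)
df_filtered = pd.read_csv('filtered_transactions.csv')
df_filtered['Time'] = pd.to_datetime(df_filtered['Time'])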

# 1. Time Features
df_filtered['HourOfDay'] = df_filtered['Time'].dt.hour
df_filtered['DayOfWeek'] = df_filtered['Time'].dt.dayofweek  # Monday=0, Sunday=6

# 2. Amount Features: Example of binning into categories
amount_bins = [-1, 1000, 5000, np.inf]  # Bins for the transaction amount
amount_labels = ['small', 'medium', 'large']
df_filtered['AmountCategory'] = pd.cut(df_filtered['Amount'], bins=amount_bins, labels=amount_labels)

# 3. Location-Based Features: Creating dummy variables
location_dummies = pd.get_dummies(df_filtered['Location'], prefix='Loc')
df_filtered = pd.concat([df_filtered, location_dummies], axis=1)

# 4. Type-Based Features: Creating dummy variables
type_dummies = pd.get_dummies(df_filtered['Type'], prefix='Type')
df_filtered = pd.concat([df_filtered, type_dummies], axis=1)

# Dropping the original 'Time', 'Location', 'Type', 'Amount' columns if they are no longer needed
# This is optional and depends on whether you want to keep the original features alongside the engineered ones
df_filtered.drop(['Time', 'Location', 'Type', 'Amount'], axis=1, inplace=True)

# Save the DataFrame with the new features for the next step
df_filtered.to_csv('Feature_extraction_transactions.csv', index=False)


For Step 3, focusing on Machine Learning for Fraud Detection, we'll implement a simple model using classical machine learning techniques as a baseline. Although the ultimate goal is to incorporate Quantum Machine Learning (QML) methods, establishing a classical baseline is a useful first step for comparison and to ensure the feature engineering process is effective.

We'll use the processed transactions data with extracted features to train a machine learning model. For this example, let's use a Random Forest Classifier, known for its effectiveness in classification tasks, including fraud detection.

Load the Processed Data: Start by loading the processed data from the CSV file created in the previous step.

Prepare the Data: Split the data into features (X) and the target variable (y), which is the IsFraud column in this case.

Split the Data: Divide the dataset into training and testing sets to evaluate the model's performance.

Train the Model: Train a Random Forest Classifier on the training data.

Evaluate the Model: Assess the model's performance on the testing set using appropriate metrics like accuracy, precision, recall, and the F1 score.
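Fraud labels are typically rare, so a plain random split can leave the test set with very few fraud cases. One option, not used in the notebook code below, is a stratified split that keeps the fraud rate the same in both sets. The sketch here uses scikit-learn's train_test_split with its stratify parameter on made-up toy data; the variable names and the 10% fraud rate are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 10 of them fraud (made-up values, for illustration only)
toy_X = pd.DataFrame({"HourOfDay": np.arange(100) % 24})
toy_y = pd.Series([1] * 10 + [0] * 90, name="IsFraud")

# stratify=toy_y preserves the class proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print("train fraud rate:", y_tr.mean())
print("test fraud rate:", y_te.mean())
```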

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Step 1: Load the Processed Data
df = pd.read_csv('Feature_extraction_transactions.csv')  # Update this path

# Step 2: Prepare the Data
X = df.drop(['TransactionID', 'IsFraud'], axis=1)  # Features
y = df['IsFraud']  # Target variable

# One-hot encode any remaining categorical columns (e.g. AmountCategory)
X = pd.get_dummies(X, drop_first=True)

# Step 3: Split the Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Step 4: Train the Model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 5: Evaluate the Model
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
precision = precision_score(y_test, predictions)
recall = recall_score(y_test, predictions)
f1 = f1_score(y_test, predictions)

metrics = {
    "Accuracy": accuracy,
    "Precision": precision,
    "Recall": recall,
    "F1 Score": f1
}

metrics_df = pd.DataFrame([metrics])

metrics_df.to_csv('final_result.csv', index=False)

Accuracy is the fraction of all predictions that are correct. On imbalanced fraud data it can look high even when most fraud cases are missed.
Precision is the ratio of true positive predictions to all positive predictions (important for minimizing false positives, i.e. legitimate transactions flagged as fraud).
Recall is the ratio of true positive predictions to all actual positives (important for not missing actual fraud cases).
F1 Score is the harmonic mean of precision and recall, providing a balance between the two.
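To make these definitions concrete, here is a small worked example that computes the four metrics by hand from the confusion counts. The labels are made up for illustration and are not results from the model above.

```python
# Toy labels: 1 = fraud, 0 = legitimate (made-up values, for illustration only)
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # fraud correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # legitimate flagged as fraud
fn = sum(t == 1 and p == 0 for t, p in pairs)  # fraud that was missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # legitimate correctly passed

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Here two of three fraud cases are caught (recall 2/3), but half of the flagged transactions are legitimate (precision 1/2), which gives an F1 score of 4/7 while accuracy is 0.7.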