# UPI Fraud Detection using Machine Learning

This notebook builds a machine learning model to detect fraudulent UPI transactions. It follows these steps:

1. Data Loading and Preprocessing
2. Exploratory Data Analysis (EDA)
3. Feature Engineering
4. Feature Selection
5. Model Implementation
6. Hyperparameter Tuning
7. Model Comparison & Selection
8. Saving the Best Model for Deployment

## Importing Required Libraries

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Evaluation metrics
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve

# Feature processing
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2, RFE, SelectFromModel
from sklearn.impute import SimpleImputer

# Model building and evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import xgboost as xgb

# Other utilities
from sklearn.pipeline import Pipeline
import warnings
import pickle

# Ignore warnings
warnings.filterwarnings('ignore')

# Set plot style
plt.style.use('ggplot')
sns.set(style="whitegrid")

# For reproducibility
np.random.seed(42)

## 1. Data Loading and Preprocessing

In [None]:
# Load the dataset
df = pd.read_csv('../attached_assets/Upi_fraud_dataset-checkpoint.csv')

# Display the first few rows
df.head()

In [None]:
# Check the shape of the dataset
print(f"Dataset shape: {df.shape}")

In [None]:
# Check data types and information
df.info()

### 1.1 Initial Data Exploration

In [None]:
# Get summary statistics
df.describe()

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values in each column:")
for col, count in zip(missing_values.index, missing_values.values):
    if count > 0:
        print(f"{col}: {count} ({count/len(df)*100:.2f}%)")

In [None]:
# Check for duplicate records
duplicates = df.duplicated().sum()
print(f"Number of duplicate records: {duplicates} ({duplicates/len(df)*100:.2f}%)")

In [None]:
# Check the distribution of the target variable (FraudFlag)
fraud_distribution = df['FraudFlag'].value_counts(normalize=True) * 100
print("Distribution of fraud transactions:")
print(fraud_distribution)

### 1.2 Handle Missing Values

In [None]:
# Fill missing numerical values with the column median.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].median())

# Fill missing categorical values with the most frequent value (mode)
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols:
    if df[col].isnull().sum() > 0:
        df[col] = df[col].fillna(df[col].mode()[0])

In [None]:
# Verify that all missing values are handled
print(f"Remaining missing values: {df.isnull().sum().sum()}")

### 1.3 Handle Duplicate Records

In [None]:
# Remove duplicate records if any
if duplicates > 0:
    df = df.drop_duplicates()
    print(f"Shape after removing duplicates: {df.shape}")

### 1.4 Process Data Types

In [None]:
# Convert Timestamp to datetime
df['Timestamp'] = pd.to_datetime(df['Timestamp'])

In [None]:
# Extract TransactionFrequency numeric value
# Example format: '5/day', '3/day'
df['TransactionFrequencyValue'] = df['TransactionFrequency'].str.split('/').str[0].astype(int)

### 1.5 Handle Outliers

In [None]:
# Identify numeric columns for outlier detection (excluding ID columns and binary flags)
numeric_cols_for_outliers = ['Amount', 'Latitude', 'Longitude', 'AvgTransactionAmount', 
                            'TransactionFrequencyValue', 'FailedAttempts']

# Check for outliers using box plots
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols_for_outliers):
    plt.subplot(2, 3, i+1)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
# Adjust the layout once, after all subplots have been drawn
plt.tight_layout()
plt.show()

In [None]:
# Function to cap outliers using the IQR method
def cap_outliers(df, col):
    """Clip values of `col` to [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. Modifies `df` in place and returns it."""
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Series.clip applies both bounds in a single step
    df[col] = df[col].clip(lower=lower_bound, upper=upper_bound)

    return df

# Apply outlier capping to numeric columns
for col in numeric_cols_for_outliers:
    df = cap_outliers(df, col)
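
To make the capping rule concrete, the sketch below applies the same IQR bounds to two small, made-up Series (the values are illustrative and not taken from the dataset). The second Series mimics a count-like column where most values are zero: its IQR is 0, so both bounds collapse to 0 and every nonzero value is flattened. It may be worth checking whether a column such as `FailedAttempts` behaves this way before capping it.

```python
import pandas as pd

def iqr_bounds(series):
    """Return (lower, upper) capping bounds: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Illustrative values only (not from the UPI dataset)
amounts = pd.Series([10, 12, 14, 16, 18, 200])
low, high = iqr_bounds(amounts)
capped_amounts = amounts.clip(lower=low, upper=high)

# A count-like column dominated by zeros
attempts = pd.Series([0, 0, 0, 0, 0, 3])
attempts_low, attempts_high = iqr_bounds(attempts)
capped_attempts = attempts.clip(lower=attempts_low, upper=attempts_high)

print("Amount bounds:", low, high)
print("Capped amounts:", capped_amounts.tolist())
print("Attempt bounds:", attempts_low, attempts_high)
print("Capped attempts:", capped_attempts.tolist())
```

If the second pattern shows up in real data, leaving that column uncapped is one reasonable option, since the rare nonzero values may be exactly the signal a fraud model needs.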

In [None]:
# Verify outlier treatment
plt.figure(figsize=(15, 10))
for i, col in enumerate(numeric_cols_for_outliers):
    plt.subplot(2, 3, i+1)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col} (After Treatment)')
# Adjust the layout once, after all subplots have been drawn
plt.tight_layout()
plt.show()

## 2. Exploratory Data Analysis (EDA)

### 2.1 Distribution of Target Variable

In [None]:
# Plot the distribution of fraud vs non-fraud transactions
plt.figure(figsize=(10, 6))
sns.countplot(x='FraudFlag', data=df)
plt.title('Distribution of Fraud vs Non-Fraud Transactions')
plt.xlabel('Fraud Flag (1 = Fraud, 0 = Non-Fraud)')
plt.ylabel('Count')

# Add percentage labels
total = len(df)
for p in plt.gca().patches:
    percentage = f'{100 * p.get_height() / total:.1f}%'
    plt.gca().annotate(percentage, (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='bottom')
plt.show()

### 2.2 Transaction Amount Analysis

In [None]:
# Distribution of transaction amounts
plt.figure(figsize=(12, 6))

plt.subplot(1, 2, 1)
sns.histplot(df['Amount'], kde=True)
plt.title('Distribution of Transaction Amounts')
plt.xlabel('Amount')

plt.subplot(1, 2, 2)
sns.boxplot(x='FraudFlag', y='Amount', data=df)
plt.title('Transaction Amount by Fraud Status')
plt.xlabel('Fraud Flag (1 = Fraud, 0 = Non-Fraud)')
plt.ylabel('Amount')

plt.tight_layout()
plt.show()

In [None]:
# Compare average transaction amounts for fraud vs non-fraud
fraud_avg = df[df['FraudFlag'] == True]['Amount'].mean()
non_fraud_avg = df[df['FraudFlag'] == False]['Amount'].mean()

# UPI transactions are denominated in Indian rupees
print(f"Average amount for fraud transactions: ₹{fraud_avg:.2f}")
print(f"Average amount for non-fraud transactions: ₹{non_fraud_avg:.2f}")

### 2.3 Temporal Analysis

In [None]:
# Extract time components
df['Day'] = df['Timestamp'].dt.day
df['Month'] = df['Timestamp'].dt.month
df['Year'] = df['Timestamp'].dt.year
df['Hour'] = df['Timestamp'].dt.hour
df['DayOfWeek'] = df['Timestamp'].dt.dayofweek

In [None]:
# Analyze fraud by hour of day
plt.figure(figsize=(12, 6))

hourly_fraud = df.groupby(['Hour', 'FraudFlag']).size().unstack().fillna(0)
hourly_fraud_rate = (hourly_fraud[True] / (hourly_fraud[True] + hourly_fraud[False])) * 100

plt.subplot(1, 2, 1)
sns.countplot(x='Hour', hue='FraudFlag', data=df)
plt.title('Transactions by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Count')
plt.legend(title='Fraud Flag')

plt.subplot(1, 2, 2)
hourly_fraud_rate.plot(kind='line', marker='o')
plt.title('Fraud Rate by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Fraud Rate (%)')
plt.grid(True)

plt.tight_layout()
plt.show()
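
The fraud-rate calculation used in this section (group, count, unstack, then divide) recurs for every categorical breakdown, and indexing the unstacked table with `[True]` raises a `KeyError` when the data contains no fraud cases at all. A compact alternative, sketched here with a hypothetical helper name `fraud_rate_by` and a made-up toy frame, relies on the fact that the mean of a boolean column is the share of `True` values.

```python
import pandas as pd

def fraud_rate_by(df, group_col, target_col="FraudFlag"):
    """Percentage of fraudulent rows within each value of `group_col`."""
    # The mean of a boolean (or 0/1) column equals the fraction of positive rows
    return df.groupby(group_col)[target_col].mean() * 100

# Tiny made-up frame, for illustration only
toy = pd.DataFrame({
    "Hour": [0, 0, 1, 1, 2],
    "FraudFlag": [True, False, False, False, False],
})
rates = fraud_rate_by(toy, "Hour")
print(rates)
```

The same helper would apply to `DayName`, `TransactionType`, `MerchantCategory`, and the other grouping columns used below.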

In [None]:
# Analyze fraud by day of week
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['DayName'] = df['DayOfWeek'].map(lambda x: days[x])

plt.figure(figsize=(12, 6))

daily_fraud = df.groupby(['DayName', 'FraudFlag']).size().unstack().fillna(0)
daily_fraud = daily_fraud.reindex(days)  # Reorder by day of week
daily_fraud_rate = (daily_fraud[True] / (daily_fraud[True] + daily_fraud[False])) * 100

plt.subplot(1, 2, 1)
day_order = pd.CategoricalDtype(categories=days, ordered=True)
df['DayName'] = df['DayName'].astype(day_order)
sns.countplot(x='DayName', hue='FraudFlag', data=df)
plt.title('Transactions by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Fraud Flag')

plt.subplot(1, 2, 2)
daily_fraud_rate.plot(kind='bar')
plt.title('Fraud Rate by Day of Week')
plt.xlabel('Day of Week')
plt.ylabel('Fraud Rate (%)')
plt.grid(True)

plt.tight_layout()
plt.show()

### 2.4 Transaction Type and Merchant Category Analysis

In [None]:
# Analyze fraud by transaction type
plt.figure(figsize=(12, 6))

tx_type_fraud = df.groupby(['TransactionType', 'FraudFlag']).size().unstack().fillna(0)
tx_type_fraud_rate = (tx_type_fraud[True] / (tx_type_fraud[True] + tx_type_fraud[False])) * 100

plt.subplot(1, 2, 1)
sns.countplot(x='TransactionType', hue='FraudFlag', data=df)
plt.title('Transactions by Type')
plt.xlabel('Transaction Type')
plt.ylabel('Count')
plt.legend(title='Fraud Flag')

plt.subplot(1, 2, 2)
tx_type_fraud_rate.plot(kind='bar')
plt.title('Fraud Rate by Transaction Type')
plt.xlabel('Transaction Type')
plt.ylabel('Fraud Rate (%)')
plt.grid(True)

plt.tight_layout()
plt.show()

In [None]:
# Analyze fraud by merchant category
plt.figure(figsize=(15, 8))

merchant_fraud = df.groupby(['MerchantCategory', 'FraudFlag']).size().unstack().fillna(0)
merchant_fraud_rate = (merchant_fraud[True] / (merchant_fraud[True] + merchant_fraud[False])) * 100

plt.subplot(1, 2, 1)
sns.countplot(y='MerchantCategory', hue='FraudFlag', data=df)
plt.title('Transactions by Merchant Category')
plt.xlabel('Count')
plt.ylabel('Merchant Category')
plt.legend(title='Fraud Flag')

plt.subplot(1, 2, 2)
merchant_fraud_rate.sort_values().plot(kind='barh')
plt.title('Fraud Rate by Merchant Category')
plt.xlabel('Fraud Rate (%)')
plt.ylabel('Merchant Category')
plt.grid(True)

plt.tight_layout()
plt.show()

### 2.5 Location Analysis

In [None]:
# Plotting fraudulent vs. non-fraudulent transactions on a scatter plot based on location
plt.figure(figsize=(12, 10))

# Sample the data if it's too large for visualization
sample_size = min(5000, len(df))
df_sample = df.sample(sample_size, random_state=42)

# Create scatter plot with different colors for fraud vs non-fraud
fraud = df_sample[df_sample['FraudFlag'] == True]
non_fraud = df_sample[df_sample['FraudFlag'] == False]

plt.scatter(non_fraud['Longitude'], non_fraud['Latitude'], 
            alpha=0.6, c='blue', s=15, label='Non-Fraud')
plt.scatter(fraud['Longitude'], fraud['Latitude'], 
            alpha=0.6, c='red', s=30, label='Fraud')

plt.title('Geographical Distribution of Transactions')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.legend()
plt.grid(True)
plt.show()

In [None]:
# Analyze fraud for unusual locations
plt.figure(figsize=(10, 6))

unusual_loc_fraud = df.groupby(['UnusualLocation', 'FraudFlag']).size().unstack().fillna(0)
unusual_loc_fraud_rate = (unusual_loc_fraud[True] / (unusual_loc_fraud[True] + unusual_loc_fraud[False])) * 100

plt.subplot(1, 2, 1)
sns.countplot(x='UnusualLocation', hue='FraudFlag', data=df)
plt.title('Transactions by Location Status')
plt.xlabel('Unusual Location')
plt.ylabel('Count')
plt.legend(title='Fraud Flag')

plt.subplot(1, 2, 2)
unusual_loc_fraud_rate.plot(kind='bar')
plt.title('Fraud Rate by Location Status')
plt.xlabel('Unusual Location')
plt.ylabel('Fraud Rate (%)')
plt.grid(True)

plt.tight_layout()
plt.show()

### 2.6 Device Analysis

In [None]:
# Analyze fraud for new devices
plt.figure(figsize=(10, 6))

new_device_fraud = df.groupby(['NewDevice', 'FraudFlag']).size().unstack().fillna(0)
new_device_fraud_rate = (new_device_fraud[True] / (new_device_fraud[True] + new_device_fraud[False])) * 100

plt.subplot(1, 2, 1)
sns.countplot(x='NewDevice', hue='FraudFlag', data=df)
plt.title('Transactions by Device Status')
plt.xlabel('New Device')
plt.ylabel('Count')
plt.legend(title='Fraud Flag')

plt.subplot(1, 2, 2)
new_device_fraud_rate.plot(kind='bar')
plt.title('Fraud Rate by Device Status')
plt.xlabel('New Device')
plt.ylabel('Fraud Rate (%)')
plt.grid(True)

plt.tight_layout()
plt.show()

### 2.7 Correlation Analysis

In [None]:
# Convert boolean to numeric for correlation analysis
df_corr = df.copy()
bool_cols = ['UnusualLocation', 'UnusualAmount', 'NewDevice', 'FraudFlag']
for col in bool_cols:
    df_corr[col] = df_corr[col].astype(int)

# Select relevant numeric columns for correlation analysis
corr_cols = ['Amount', 'AvgTransactionAmount', 'TransactionFrequencyValue', 
             'UnusualLocation', 'UnusualAmount', 'NewDevice', 'FailedAttempts', 'FraudFlag',
             'Hour', 'DayOfWeek']

# Calculate correlation matrix
corr_matrix = df_corr[corr_cols].corr()

# Plot correlation heatmap
plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt=".2f", cmap='coolwarm', 
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.title('Correlation Matrix of Numeric Features')
plt.tight_layout()
plt.show()

### 2.8 Failed Attempts Analysis

In [None]:
# Analyze failed attempts relation to fraud
plt.figure(figsize=(12, 6))

# Count plot of failed attempts by fraud status
plt.subplot(1, 2, 1)
sns.countplot(x='FailedAttempts', hue='FraudFlag', data=df)
plt.title('Failed Attempts by Fraud Status')
plt.xlabel('Number of Failed Attempts')
plt.ylabel('Count')
plt.legend(title='Fraud Flag')

# Calculate fraud rate for each number of failed attempts
failed_attempts_fraud = df.groupby(['FailedAttempts', 'FraudFlag']).size().unstack().fillna(0)
failed_attempts_fraud_rate = (failed_attempts_fraud[True] / (failed_attempts_fraud[True] + failed_attempts_fraud[False])) * 100

plt.subplot(1, 2, 2)
failed_attempts_fraud_rate.plot(kind='bar')
plt.title('Fraud Rate by Number of Failed Attempts')
plt.xlabel('Number of Failed Attempts')
plt.ylabel('Fraud Rate (%)')
plt.grid(True)

plt.tight_layout()
plt.show()

### 2.9 Bank Analysis

In [None]:
# Analyze fraud by bank
plt.figure(figsize=(15, 8))

bank_fraud = df.groupby(['BankName', 'FraudFlag']).size().unstack().fillna(0)
bank_fraud_rate = (bank_fraud[True] / (bank_fraud[True] + bank_fraud[False])) * 100

plt.subplot(1, 2, 1)
sns.countplot(y='BankName', hue='FraudFlag', data=df)
plt.title('Transactions by Bank')
plt.xlabel('Count')
plt.ylabel('Bank Name')
plt.legend(title='Fraud Flag')

plt.subplot(1, 2, 2)
bank_fraud_rate.sort_values().plot(kind='barh')
plt.title('Fraud Rate by Bank')
plt.xlabel('Fraud Rate (%)')
plt.ylabel('Bank Name')
plt.grid(True)

plt.tight_layout()
plt.show()

## 3. Feature Engineering

### 3.1 Time-Based Features

In [None]:
# Create weekend flag (dayofweek: 5 = Saturday, 6 = Sunday)
df['IsWeekend'] = (df['DayOfWeek'] >= 5).astype(int)

# Create night-time flag (10 PM to 6 AM, wrapping past midnight)
df['IsNightTime'] = ((df['Hour'] >= 22) | (df['Hour'] < 6)).astype(int)

### 3.2 Amount-Based Features

In [None]:
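# Create amount ratio feature.
# A zero average would produce inf, which StandardScaler cannot handle, so zero averages
# are treated as missing and the ratio falls back to 1.0 (an assumed neutral value).
df['AmountRatio'] = (df['Amount'] / df['AvgTransactionAmount'].replace(0, np.nan)).fillna(1.0)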

# Create amount difference feature
df['AmountDiff'] = df['Amount'] - df['AvgTransactionAmount']

### 3.3 Risk Score Feature

In [None]:
# Create a risk score combining multiple risk factors
df['RiskScore'] = df['UnusualLocation'].astype(int) + \
                 df['UnusualAmount'].astype(int) + \
                 df['NewDevice'].astype(int) + \
                 df['FailedAttempts'] + \
                 df['IsNightTime']

# Analyze risk score effectiveness
plt.figure(figsize=(12, 6))
risk_score_fraud = df.groupby(['RiskScore', 'FraudFlag']).size().unstack().fillna(0)
if True in risk_score_fraud.columns:  # Check if there are fraud cases in the score ranges
    risk_score_fraud_rate = (risk_score_fraud[True] / (risk_score_fraud[True] + risk_score_fraud[False])) * 100

    plt.bar(risk_score_fraud_rate.index, risk_score_fraud_rate.values)
    plt.title('Fraud Rate by Risk Score')
    plt.xlabel('Risk Score')
    plt.ylabel('Fraud Rate (%)')
    plt.grid(True)
    plt.show()

### 3.4 Phone Number Features

In [None]:
# Clean phone numbers
df['PhoneNumber'] = df['PhoneNumber'].astype(str).str.replace(r'\D', '', regex=True)

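# Flag numbers that carry the +91 (India) country code.
# After stripping non-digits, such a number has 12 digits (assuming 10-digit national numbers);
# checking the length avoids flagging 10-digit numbers that merely begin with 91.
df['HasCountryCode'] = df['PhoneNumber'].apply(lambda x: 1 if len(x) == 12 and x.startswith('91') else 0)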
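
A prefix check on its own would also match a 10-digit national number that happens to start with 91. The sketch below shows a length-aware version on a few made-up numbers. The helper name `normalize_phone` is hypothetical, and the 10-digit national format is an assumption about the data.

```python
import re

def normalize_phone(raw):
    """Return (national_number, has_country_code) for a raw phone string.

    Assumes Indian numbers: 10 national digits, optionally prefixed by 91.
    """
    digits = re.sub(r"\D", "", str(raw))
    has_country_code = len(digits) == 12 and digits.startswith("91")
    national = digits[-10:] if len(digits) >= 10 else digits
    return national, int(has_country_code)

# Made-up numbers for illustration
with_code = normalize_phone("+91 98765 43210")
without_code = normalize_phone("9123456789")
print(with_code, without_code)
```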

### 3.5 IP Address Features

In [None]:
# Extract the first octet of the IPv4 address
df['IP_FirstOctet'] = df['IPAddress'].str.split('.').str[0].astype(int)

# Create High-Risk IP Flag (simplified heuristic)
# The first octets below overlap with private or reserved IPv4 blocks (e.g. 10.0.0.0/8,
# 172.16.0.0/12, 192.168.0.0/16, 198.18.0.0/15), but matching on the first octet alone
# also catches many ordinary public addresses.
# In production, you would use an IP reputation database instead.
high_risk_ranges = [(0, 10), (172, 172), (192, 192), (198, 198)]
df['HighRiskIP'] = df['IP_FirstOctet'].apply(
    lambda x: 1 if any(lower <= x <= upper for lower, upper in high_risk_ranges) else 0
)
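
Matching on the first octet is coarse: `172.32.0.1`, for example, is an ordinary public address even though it starts with 172. If the intent is to flag private or reserved addresses, Python's standard `ipaddress` module can test the full address. The following is a minimal sketch of that alternative, not the method used in this notebook. The helper name `is_non_public_ip` is hypothetical, and treating unparseable input as "not flagged" is an assumption.

```python
import ipaddress

def is_non_public_ip(value):
    """Return 1 if `value` parses as an IP address in a private or reserved block, else 0."""
    try:
        address = ipaddress.ip_address(value)
    except ValueError:
        # Unparseable input is treated as "not flagged" here; that choice is an assumption
        return 0
    return int(address.is_private or address.is_reserved)

examples = ["10.0.0.1", "192.168.1.20", "172.16.5.4", "172.32.0.1", "8.8.8.8"]
flags = {ip: is_non_public_ip(ip) for ip in examples}
print(flags)
```

On a DataFrame this could be applied with `df['IPAddress'].apply(is_non_public_ip)`. Whether private addresses actually indicate elevated fraud risk would still need to be checked against the data.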

## 4. Feature Selection

### 4.1 Prepare Data for Feature Selection

In [None]:
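# Separate features and target
# Identifier-like and raw columns left out of the model (listed for reference;
# X below is built from the explicit feature lists instead)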
exclude_cols = ['TransactionID', 'UserID', 'DeviceID', 'IPAddress', 'PhoneNumber', 
                'Timestamp', 'TransactionFrequency', 'DayName']

# Define categorical columns for encoding
categorical_cols = ['MerchantCategory', 'TransactionType', 'BankName']

# Define numerical columns to be scaled
numerical_cols = ['Amount', 'Latitude', 'Longitude', 'AvgTransactionAmount', 
                  'TransactionFrequencyValue', 'FailedAttempts', 'Day', 'Month', 'Year', 
                  'Hour', 'DayOfWeek', 'AmountRatio', 'AmountDiff', 'RiskScore', 'IP_FirstOctet']

# Define boolean flag columns (True/False or 0/1)
boolean_cols = ['UnusualLocation', 'UnusualAmount', 'NewDevice', 'IsWeekend', 
                'IsNightTime', 'HasCountryCode', 'HighRiskIP']

# All features
all_features = categorical_cols + numerical_cols + boolean_cols

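# Create X (features) and y (target)
X = df[all_features].copy()
# Cast boolean flags and the target to 0/1 integers so every estimator sees numeric input
X[boolean_cols] = X[boolean_cols].astype(int)
y = df['FraudFlag'].astype(int)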

In [None]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

### 4.2 Preprocess Features

In [None]:
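# Preprocess categorical features with one-hot encoding
# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on older versions use sparse=False
ohe = OneHotEncoder(sparse_output=False, drop='first', handle_unknown='ignore')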
cat_features_train = ohe.fit_transform(X_train[categorical_cols])
cat_features_test = ohe.transform(X_test[categorical_cols])

# Create DataFrame with encoded features
cat_feature_names = ohe.get_feature_names_out(categorical_cols)
cat_features_train_df = pd.DataFrame(cat_features_train, columns=cat_feature_names, index=X_train.index)
cat_features_test_df = pd.DataFrame(cat_features_test, columns=cat_feature_names, index=X_test.index)

# Preprocess numerical features with standard scaling
scaler = StandardScaler()
num_features_train = scaler.fit_transform(X_train[numerical_cols])
num_features_test = scaler.transform(X_test[numerical_cols])

# Create DataFrame with scaled features
num_features_train_df = pd.DataFrame(num_features_train, columns=numerical_cols, index=X_train.index)
num_features_test_df = pd.DataFrame(num_features_test, columns=numerical_cols, index=X_test.index)

# Combine all features
X_train_processed = pd.concat([num_features_train_df, cat_features_train_df, X_train[boolean_cols]], axis=1)
X_test_processed = pd.concat([num_features_test_df, cat_features_test_df, X_test[boolean_cols]], axis=1)
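
The manual encode, scale, and concatenate steps above can also be expressed as a single scikit-learn `ColumnTransformer` inside a `Pipeline`, which keeps fitting and transforming consistent between training and inference. The sketch below is one possible arrangement, not the approach this notebook takes. The toy data and the `P2P`/`P2M` category values are made up, and `build_pipeline` is a hypothetical helper name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def build_pipeline(categorical_cols, numerical_cols):
    """Scale numeric columns, one-hot encode categoricals, pass other columns through."""
    preprocessor = ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), numerical_cols),
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
        ],
        remainder="passthrough",  # 0/1 flag columns are left unchanged
    )
    return Pipeline([
        ("preprocess", preprocessor),
        ("model", LogisticRegression(max_iter=1000)),
    ])

# Made-up toy data, for illustration only
toy = pd.DataFrame({
    "Amount": [100.0, 250.0, 90.0, 4000.0, 120.0, 3800.0],
    "TransactionType": ["P2P", "P2M", "P2P", "P2M", "P2P", "P2M"],
    "NewDevice": [0, 0, 0, 1, 0, 1],
})
toy_y = [0, 0, 0, 1, 0, 1]

pipe = build_pipeline(["TransactionType"], ["Amount"])
pipe.fit(toy, toy_y)
toy_pred = pipe.predict(toy)
print(toy_pred)
```

A fitted pipeline like this can be pickled as one object, so the encoder, scaler, and model cannot drift apart at deployment time.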

### 4.3 Feature Selection Methods

In [None]:
# Method 1: Feature importance from Random Forest
rf_selector = RandomForestClassifier(n_estimators=100, random_state=42)
rf_selector.fit(X_train_processed, y_train)

# Get feature importances
feature_importances = pd.DataFrame({
    'Feature': X_train_processed.columns,
    'Importance': rf_selector.feature_importances_
})
feature_importances = feature_importances.sort_values('Importance', ascending=False)

# Plot top 20 features
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=feature_importances.head(20))
plt.title('Top 20 Features by Importance')
plt.tight_layout()
plt.show()

In [None]:
# Method 2: Select features using RFE with Logistic Regression
lr = LogisticRegression(max_iter=1000, random_state=42)
rfe_selector = RFE(estimator=lr, n_features_to_select=20, step=1)
rfe_selector.fit(X_train_processed, y_train)

# Get selected features
selected_features_rfe = X_train_processed.columns[rfe_selector.support_]
print("Features selected by RFE:")
print(selected_features_rfe.tolist())

In [None]:
# Method 3: Select features using XGBoost feature importance
xgb_selector = xgb.XGBClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
xgb_selector.fit(X_train_processed, y_train)

# Get feature importances
xgb_importances = pd.DataFrame({
    'Feature': X_train_processed.columns,
    'Importance': xgb_selector.feature_importances_
})
xgb_importances = xgb_importances.sort_values('Importance', ascending=False)

# Plot top 20 features
plt.figure(figsize=(12, 8))
sns.barplot(x='Importance', y='Feature', data=xgb_importances.head(20))
plt.title('Top 20 Features by XGBoost Importance')
plt.tight_layout()
plt.show()

### 4.4 Select Final Features for Modeling

In [None]:
# Combine results from different feature selection methods
top_rf_features = feature_importances.head(20)['Feature'].tolist()
top_xgb_features = xgb_importances.head(20)['Feature'].tolist()

# Find common features across methods (sorted so the column order is reproducible across runs)
common_features = sorted(set(top_rf_features) & set(top_xgb_features) & set(selected_features_rfe))
print(f"Number of common features across all methods: {len(common_features)}")

# If not enough common features, take union of top features from each method
if len(common_features) < 10:
    # Take top 15 from each method
    top_rf_features = feature_importances.head(15)['Feature'].tolist()
    top_xgb_features = xgb_importances.head(15)['Feature'].tolist()
    # RFE's support mask is unranked, so this keeps the first 15 selected columns in column order.
    # Converting to a list also keeps the concatenation below a plain list operation.
    selected_features_rfe_top = list(selected_features_rfe)[:15]

    # Combine all unique features
    final_features = sorted(set(top_rf_features + top_xgb_features + selected_features_rfe_top))
else:
    final_features = common_features

print(f"Final number of selected features: {len(final_features)}")
print("Selected features:")
print(final_features)
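
Between a strict intersection and a full union there is a middle ground: keep any feature chosen by at least two of the three methods. The sketch below illustrates that voting rule on made-up feature lists. It is an alternative to the logic above rather than a description of it, and `features_by_vote` is a hypothetical helper name.

```python
from collections import Counter

def features_by_vote(*feature_lists, min_votes=2):
    """Return features (sorted by name) that appear in at least `min_votes` of the lists."""
    votes = Counter()
    for features in feature_lists:
        votes.update(set(features))  # each method votes at most once per feature
    return sorted(name for name, count in votes.items() if count >= min_votes)

# Made-up selections, for illustration only
rf_top = ["Amount", "RiskScore", "Hour"]
xgb_top = ["RiskScore", "AmountRatio", "Amount"]
rfe_top = ["RiskScore", "NewDevice", "Hour"]

selected = features_by_vote(rf_top, xgb_top, rfe_top)
print(selected)
```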

In [None]:
# Create final datasets with selected features
X_train_final = X_train_processed[final_features]
X_test_final = X_test_processed[final_features]

print(f"Shape of final training dataset: {X_train_final.shape}")
print(f"Shape of final test dataset: {X_test_final.shape}")

## 5. Model Implementation

### 5.1 Model Training

In [None]:
# Function to evaluate model performance
def evaluate_model(model, X_train, X_test, y_train, y_test):
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    y_test_prob = model.predict_proba(X_test)[:, 1] if hasattr(model, 'predict_proba') else None
    
    # Calculate metrics
    train_accuracy = np.mean(y_train_pred == y_train)
    test_accuracy = np.mean(y_test_pred == y_test)
    
    # Print classification report
    print("Classification Report:")
    print(classification_report(y_test, y_test_pred))
    
    # Plot confusion matrix
    cm = confusion_matrix(y_test, y_test_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Non-Fraud', 'Fraud'],
                yticklabels=['Non-Fraud', 'Fraud'])
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()
    
    # Plot ROC curve if applicable
    if y_test_prob is not None:
        fpr, tpr, _ = roc_curve(y_test, y_test_prob)
        roc_auc = auc(fpr, tpr)
        
        plt.figure(figsize=(8, 6))
        plt.plot(fpr, tpr, lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
        plt.plot([0, 1], [0, 1], 'k--', lw=2)
        plt.xlim([0.0, 1.0])
        plt.ylim([0.0, 1.05])
        plt.xlabel('False Positive Rate')
        plt.ylabel('True Positive Rate')
        plt.title('Receiver Operating Characteristic (ROC)')
        plt.legend(loc="lower right")
        plt.show()
        
        # Plot precision-recall curve
        precision, recall, _ = precision_recall_curve(y_test, y_test_prob)
        plt.figure(figsize=(8, 6))
        plt.plot(recall, precision, lw=2)
        plt.xlabel('Recall')
        plt.ylabel('Precision')
        plt.title('Precision-Recall Curve')
        plt.show()
    
    return {
        'model': model,
        'train_accuracy': train_accuracy,
        'test_accuracy': test_accuracy,
        'confusion_matrix': cm,
        'y_test_pred': y_test_pred,
        'y_test_prob': y_test_prob
    }
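
Because fraud is typically the minority class, accuracy alone can look strong while most fraud goes undetected. The worked example below derives accuracy, precision, recall, and F1 from a made-up confusion matrix to show how far apart they can be. The counts are illustrative and are not results from this notebook.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive (fraud) class."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts: 1,000 transactions, 40 of them fraudulent
scores = metrics_from_confusion(tn=950, fp=10, fn=25, tp=15)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Here accuracy is 96.5% even though only 15 of the 40 fraudulent transactions are caught, which is why the later comparison ranks models by F1 rather than accuracy.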

### 5.2 Logistic Regression

In [None]:
print("Logistic Regression Model:\n")
lr_model = LogisticRegression(random_state=42, max_iter=1000, class_weight='balanced')
lr_results = evaluate_model(lr_model, X_train_final, X_test_final, y_train, y_test)

### 5.3 Decision Tree

In [None]:
print("Decision Tree Model:\n")
dt_model = DecisionTreeClassifier(random_state=42, class_weight='balanced')
dt_results = evaluate_model(dt_model, X_train_final, X_test_final, y_train, y_test)

### 5.4 Random Forest

In [None]:
print("Random Forest Model:\n")
rf_model = RandomForestClassifier(random_state=42, n_estimators=100, class_weight='balanced')
rf_results = evaluate_model(rf_model, X_train_final, X_test_final, y_train, y_test)

### 5.5 Support Vector Machine

In [None]:
print("Support Vector Machine Model:\n")
svm_model = SVC(random_state=42, probability=True, class_weight='balanced')
svm_results = evaluate_model(svm_model, X_train_final, X_test_final, y_train, y_test)

### 5.6 XGBoost

In [None]:
print("XGBoost Model:\n")
# Calculate scale_pos_weight for the imbalanced dataset: negatives / positives
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
xgb_model = xgb.XGBClassifier(random_state=42, n_estimators=100, scale_pos_weight=scale_pos_weight)
xgb_results = evaluate_model(xgb_model, X_train_final, X_test_final, y_train, y_test)
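
The `scale_pos_weight` value is simply the ratio of negative to positive training samples, so errors on the rare fraud class count proportionally more in the loss. A small sketch with made-up labels (the helper name `positive_class_weight` is hypothetical):

```python
def positive_class_weight(labels):
    """Ratio of negative to positive samples, a common heuristic for scale_pos_weight."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    if positives == 0:
        raise ValueError("no positive samples; the ratio is undefined")
    return negatives / positives

# Made-up labels: 950 legitimate and 50 fraudulent transactions
toy_labels = [0] * 950 + [1] * 50
weight = positive_class_weight(toy_labels)
print(weight)
```

With a 5% fraud rate, each fraud example is weighted 19 times as heavily as a legitimate one.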

## 6. Hyperparameter Tuning

### 6.1 Hyperparameter Tuning for XGBoost

In [None]:
# Tune XGBoost here; the model comparison in the next section shows whether it is actually the best performer
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 0.9, 1.0],
    'colsample_bytree': [0.8, 0.9, 1.0]
}

# Using RandomizedSearchCV to speed up the process
xgb_tuned = xgb.XGBClassifier(random_state=42, scale_pos_weight=scale_pos_weight)
random_search = RandomizedSearchCV(xgb_tuned, param_distributions=param_grid, n_iter=10, 
                                  scoring='f1', cv=3, random_state=42, n_jobs=-1)

random_search.fit(X_train_final, y_train)

print("Best parameters found:", random_search.best_params_)
print("Best score:", random_search.best_score_)
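
To put the randomized search in perspective, the arithmetic below counts how many parameter combinations the grid defines and how many model fits each strategy would need. It reuses the grid, `n_iter=10`, and `cv=3` from the cell above.

```python
import math

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1, 0.2],
    "subsample": [0.8, 0.9, 1.0],
    "colsample_bytree": [0.8, 0.9, 1.0],
}

# Every parameter has 3 candidate values, so the full grid has 3**5 combinations
total_combinations = math.prod(len(values) for values in param_grid.values())

n_iter, cv = 10, 3
random_search_fits = n_iter * cv            # fits during the randomized search
grid_search_fits = total_combinations * cv  # fits an exhaustive grid search would need
coverage = n_iter / total_combinations

print(f"Combinations in grid: {total_combinations}")
print(f"Fits: randomized={random_search_fits}, exhaustive={grid_search_fits}")
print(f"Share of grid sampled: {coverage:.1%}")
```

The randomized search samples roughly 4% of the grid, so the reported best parameters are the best among those sampled and not necessarily the best in the grid. Raising `n_iter` trades runtime for coverage.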

In [None]:
# Evaluate the model with best parameters
best_xgb_model = random_search.best_estimator_
print("XGBoost Model with Tuned Hyperparameters:\n")
xgb_tuned_results = evaluate_model(best_xgb_model, X_train_final, X_test_final, y_train, y_test)

## 7. Model Comparison & Selection

In [None]:
# Collect all results
models = [
    {'name': 'Logistic Regression', 'results': lr_results},
    {'name': 'Decision Tree', 'results': dt_results},
    {'name': 'Random Forest', 'results': rf_results},
    {'name': 'SVM', 'results': svm_results},
    {'name': 'XGBoost', 'results': xgb_results},
    {'name': 'XGBoost (Tuned)', 'results': xgb_tuned_results}
]

# Calculate precision, recall, and F1 score for each model
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

for model_info in models:
    results = model_info['results']
    y_pred = results['y_test_pred']

    # Calculate metrics
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # Calculate ROC-AUC if probabilities are available
    if results['y_test_prob'] is not None:
        roc_auc = roc_auc_score(y_test, results['y_test_prob'])
    else:
        roc_auc = None
    
    # Store metrics in results dictionary
    results['precision'] = precision
    results['recall'] = recall
    results['f1_score'] = f1
    results['roc_auc'] = roc_auc

In [None]:
# Create a comparison table
comparison_data = {
    'Model': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F1 Score': [],
    'ROC-AUC': []
}

for model_info in models:
    comparison_data['Model'].append(model_info['name'])
    comparison_data['Accuracy'].append(model_info['results']['test_accuracy'])
    comparison_data['Precision'].append(model_info['results']['precision'])
    comparison_data['Recall'].append(model_info['results']['recall'])
    comparison_data['F1 Score'].append(model_info['results']['f1_score'])
    comparison_data['ROC-AUC'].append(model_info['results']['roc_auc'])

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('F1 Score', ascending=False)

print("Model Comparison:")
comparison_df

In [None]:
# Visualize model comparison

# Prepare data for plotting
models_for_plot = comparison_df['Model']
metrics = ['Accuracy', 'Precision', 'Recall', 'F1 Score', 'ROC-AUC']
metrics_data = comparison_df[metrics].values

# Plot each metric as one bar per model, grouped by model
x = np.arange(len(models_for_plot))
width = 0.15

fig, ax = plt.subplots(figsize=(15, 8))

for i, (attribute, measurement) in enumerate(zip(metrics, metrics_data.T)):
    ax.bar(x + width * i, measurement, width, label=attribute)

# Add labels and title
ax.set_ylabel('Score')
ax.set_title('Model Performance Comparison')
ax.set_xticks(x + width * 2)
ax.set_xticklabels(models_for_plot, rotation=45, ha='right')
ax.legend(loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=5)
ax.set_ylim(0, 1)

plt.tight_layout()
plt.show()

## 8. Save Best Model for Deployment

In [None]:
# Find the best model based on F1 score
best_model_name = comparison_df.iloc[0]['Model']
best_model_index = [model_info['name'] for model_info in models].index(best_model_name)
best_model = models[best_model_index]['results']['model']

print(f"The best model is: {best_model_name}")

# Save preprocessing objects
preprocessing_objects = {
    'ohe': ohe,
    'scaler': scaler,
    'categorical_cols': categorical_cols,
    'numerical_cols': numerical_cols,
    'boolean_cols': boolean_cols,
    'final_features': final_features
}

# Save model and preprocessing objects to disk
import os

# Create the output directory if it does not exist yet
os.makedirs('../models', exist_ok=True)

# Save best model
with open('../models/best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)
    
# Save preprocessing objects
with open('../models/preprocessing_objects.pkl', 'wb') as f:
    pickle.dump(preprocessing_objects, f)

print("Model and preprocessing objects saved successfully.")
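
At inference time the model and the preprocessing objects must be loaded together and applied in the same order as during training. One way to keep them in sync is to store both in a single artifact. The round trip below sketches the idea with placeholder values standing in for the fitted objects. The file name `fraud_artifact.pkl` and the dictionary layout are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted model and preprocessing objects (plain Python values here)
artifact = {
    "model": {"kind": "placeholder-model"},
    "preprocessing": {"final_features": ["Amount", "RiskScore"]},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "fraud_artifact.pkl")

    # Write the bundle to disk
    with open(path, "wb") as f:
        pickle.dump(artifact, f)

    # Read it back, as a serving application would at startup
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored["preprocessing"]["final_features"])
```

Pickle files can execute arbitrary code when loaded, so a deployed application should only load artifacts it produced itself or otherwise trusts.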

## 9. Conclusion

In this notebook, we built an end-to-end fraud detection workflow for UPI transactions. We:

1. Preprocessed the data by handling missing values, duplicates, and outliers
2. Performed exploratory data analysis to understand patterns in fraudulent transactions
3. Created new features to improve model performance
4. Selected the most important features using various methods
5. Implemented and evaluated multiple classification models
6. Tuned hyperparameters to improve performance
7. Selected the best model based on performance metrics

The best-performing model can now be deployed in a web application built with Streamlit, which is implemented separately.