## Import Libraries

In [1]:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

## Load Dataset

In [3]:
# Load dataset
df = pd.read_csv('fraud_data.csv')
df.head(5)

| Unnamed: 0 | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V21 | V22 | V23 | V24 | V25 | V26 | V27 | V28 | Amount | Class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -2.312227 | 1.951992 | -1.609851 | 3.997906 | -0.522188 | -1.426545 | -2.537387 | 1.391657 | -2.770089 | -2.772272 | ... | 0.517232 | -0.035049 | -0.465211 | 0.320198 | 0.044519 | 0.17784 | 0.261145 | -0.143276 | 0.0 | 1 |
| 1 | -3.043541 | -3.157307 | 1.088463 | 2.288644 | 1.359805 | -1.064823 | 0.325574 | -0.067794 | -0.270953 | -0.838587 | ... | 0.661696 | 0.435477 | 1.375966 | -0.293803 | 0.279798 | -0.145362 | -0.252773 | 0.035764 | 529.0 | 1 |
| 2 | -2.30335 | 1.759247 | -0.359745 | 2.330243 | -0.821628 | -0.075788 | 0.56232 | -0.399147 | -0.238253 | -1.525412 | ... | -0.294166 | -0.932391 | 0.172726 | -0.08733 | -0.156114 | -0.542628 | 0.039566 | -0.153029 | 239.93 | 1 |
| 3 | -4.397974 | 1.358367 | -2.592844 | 2.679787 | -1.128131 | -1.706536 | -3.496197 | -0.248778 | -0.247768 | -4.801637 | ... | 0.573574 | 0.176968 | -0.436207 | -0.053502 | 0.252405 | -0.657488 | -0.827136 | 0.849573 | 59.0 | 1 |
| 4 | 1.234235 | 3.01974 | -4.304597 | 4.732795 | 3.624201 | -1.357746 | 1.713445 | -0.496358 | -1.282858 | -2.447469 | ... | -0.379068 | -0.704181 | -0.656805 | -1.632653 | 1.488901 | 0.566797 | -0.010016 | 0.146793 | 1.0 | 1 |


## Define X and y

In [4]:
X = df.drop('Class', axis=1)  # Features (note: this still includes the 'Unnamed: 0' index column)
y = df['Class']  # Target: 1 = fraud, 0 = non-fraud

## Split the dataset

In [5]:
# Split dataset into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

## Decision Tree Model

In [6]:
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)
predictions = dt_classifier.predict(X_test)
print("Initial Decision Tree Model Accuracy:", accuracy_score(y_test, predictions))

Initial Decision Tree Model Accuracy: 0.9991748885221726


The initial decision tree reaches 99.92% accuracy on the test set. That figure needs context: only 98 of the 56,962 test transactions are fraudulent (see the classification report below), so a model that labels every transaction as non-fraudulent would already score about 99.83%. Accuracy alone therefore says little about how well fraud is detected; precision and recall on class 1 are more informative. Unrestricted decision trees are also prone to overfitting, learning patterns specific to the training data that may not generalize, which the next section checks.
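
To put the accuracy figure in context, the sketch below computes what a trivial classifier that always predicts class 0 would score on this test split. The class counts are taken from the support column of the classification report printed later in this notebook; the variable names are illustrative and not part of the original analysis.

```python
# Test-set class counts, taken from the support column of the
# classification report later in this notebook.
n_legit = 56864   # class 0 (non-fraud)
n_fraud = 98      # class 1 (fraud)
n_total = n_legit + n_fraud

# A classifier that labels every transaction as non-fraud is right
# on every class 0 row and wrong on every class 1 row.
baseline_accuracy = n_legit / n_total
fraud_rate = n_fraud / n_total

print(f"Fraud share of test set: {fraud_rate:.4%}")
print(f"Always-predict-0 accuracy: {baseline_accuracy:.4%}")
```

Any model's accuracy on this split is best read against that baseline rather than against 0%.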

## Training and Testing Scores to Check for Overfitting

In [7]:
train_accuracy = accuracy_score(y_train, dt_classifier.predict(X_train))
test_accuracy = accuracy_score(y_test, predictions)
print("Training Accuracy:", train_accuracy)
print("Testing Accuracy:", test_accuracy)

Training Accuracy: 1.0
Testing Accuracy: 0.9991748885221726


The tree fits the training set perfectly (100% accuracy) and scores 99.92% on the test set. A perfect training score is expected from an unrestricted decision tree, which keeps splitting until its leaves are pure, so on its own it is not informative. The gap to the test score is small, but because both numbers are dominated by the non-fraudulent class, the gap may understate how much the tree has overfit on the fraud cases. Limiting the tree's complexity is a reasonable next step.

## Investigate Optimal Max_Depth

In [8]:
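# Candidate depths 1 through 19 (range excludes the upper bound)
param_grid = {'max_depth': range(1, 20)}
# 5-fold cross-validation; with no scoring argument, candidates are ranked by accuracy
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)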
grid_search.fit(X_train, y_train)
optimal_max_depth = grid_search.best_params_['max_depth']
print("Optimal Max Depth:", optimal_max_depth)

Optimal Max Depth: 5


The grid search selects a max depth of 5 from the candidate depths 1 through 19, using 5-fold cross-validation on the training set. Because no `scoring` argument is passed, candidates are ranked by mean cross-validated accuracy. A depth of 5 limits how finely the tree can partition the training data, which reduces the risk of overfitting, while the cross-validation scores indicate that it is still deep enough to capture the patterns that separate the two classes.
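
When `scoring` is omitted, `GridSearchCV` ranks candidates with the estimator's default scorer, which is accuracy for classifiers. On heavily imbalanced data it may be worth checking whether a different metric selects a different depth. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_classification` as a stand-in for `fraud_data.csv`, and the choice of F1 scoring, stratified folds, and a 1 to 10 depth range is illustrative, not what the notebook ran.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for fraud_data.csv (assumption):
# roughly 2% of rows belong to the positive class.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, n_informative=5,
    weights=[0.98, 0.02], random_state=42,
)

# Stratified folds keep the class ratio roughly the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": list(range(1, 11))},
    scoring="f1",  # F1 on the positive class instead of accuracy
    cv=cv,
)
search.fit(X_demo, y_demo)

print("Best max_depth by F1:", search.best_params_["max_depth"])
# Mean cross-validated F1 for each candidate depth
for depth, score in zip(search.cv_results_["param_max_depth"],
                        search.cv_results_["mean_test_score"]):
    print(depth, round(score, 3))
```

Printing the per-depth scores from `cv_results_` also shows how flat or sharp the optimum is, which a single best value hides.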

## Second Decision Tree Model with Optimal Max_Depth

In [9]:
dt_optimized = DecisionTreeClassifier(max_depth=optimal_max_depth, random_state=42)
dt_optimized.fit(X_train, y_train)
optimized_predictions = dt_optimized.predict(X_test)
optimized_accuracy = accuracy_score(y_test, optimized_predictions)
print("Optimized Decision Tree Model Accuracy:", optimized_accuracy)

Optimized Decision Tree Model Accuracy: 0.9994733330992591


With max depth set to 5, the decision tree reaches 99.95% accuracy on the test set, up from 99.92% for the unrestricted tree. On 56,962 test transactions that is roughly 30 misclassifications instead of 47. The improvement is consistent with the depth limit reducing overfitting, although a difference this small on a single train/test split should be read with caution.

## Ensemble Methods

### Random Forest

In [10]:
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)
rf_predictions = rf_classifier.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_predictions))

Random Forest Accuracy: 0.9995611109160493


The Random Forest reaches 99.96% accuracy on the test set, roughly 25 misclassified transactions compared with about 30 for the depth-limited decision tree. A Random Forest averages many decision trees, each trained on a bootstrap sample of the data with a random subset of features considered at each split, which reduces variance relative to a single tree. The gain here is small, so the result is consistent with that benefit rather than strong evidence of it.

### Gradient Boosting

In [11]:
gb_classifier = GradientBoostingClassifier(random_state=42)
gb_classifier.fit(X_train, y_train)
gb_predictions = gb_classifier.predict(X_test)
print("Gradient Boosting Accuracy:", accuracy_score(y_test, gb_predictions))

Gradient Boosting Accuracy: 0.9989817773252344


Gradient Boosting reaches 99.90% accuracy on the test set, the lowest of the four models and below even the unrestricted decision tree (99.92%). The model is used here with default hyperparameters, so the result may reflect a lack of tuning rather than a limitation of the method. It may also reflect how the two ensemble methods trade off bias and variance: Gradient Boosting builds trees sequentially, each one fitted to the errors of the ensemble so far, which can lead it to overemphasize hard-to-predict or noisy instances.
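
Because the four accuracy values differ only from the third or fourth decimal place onward, it can help to convert them into counts of misclassified test transactions. The sketch below does that arithmetic; the accuracies are copied from the cell outputs above, the test-set size comes from the support column of the classification report in the Best-Performing Model section, and the dictionary labels are illustrative.

```python
n_test = 56962  # test-set size, from the classification report's support column

# Test accuracies printed by the cells above
accuracies = {
    "Decision tree (unrestricted)": 0.9991748885221726,
    "Decision tree (max_depth=5)": 0.9994733330992591,
    "Random forest": 0.9995611109160493,
    "Gradient boosting": 0.9989817773252344,
}

# errors = (1 - accuracy) * number of test rows, rounded to a whole count
errors = {name: round((1 - acc) * n_test) for name, acc in accuracies.items()}

# List models from fewest to most misclassifications
for name, n_err in sorted(errors.items(), key=lambda item: item[1]):
    print(f"{name}: {n_err} misclassified of {n_test}")
```

Seen this way, the spread between the best and worst model is a few dozen transactions on a single split, which is worth keeping in mind when ranking them.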

## Best-Performing Model

In [12]:
best_model = rf_classifier
classification_rep = classification_report(y_test, rf_predictions)
print("Classification Report for Best Performing Model:\n", classification_rep)

Classification Report for Best Performing Model:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     56864
           1       0.96      0.78      0.86        98

    accuracy                           1.00     56962
   macro avg       0.98      0.89      0.93     56962
weighted avg       1.00      1.00      1.00     56962



The Random Forest is treated as the best-performing model because it has the highest test accuracy. On non-fraudulent transactions (class 0), precision, recall, and F1-score all round to 1.00. On fraudulent transactions (class 1), precision is 0.96: when the model flags a transaction as fraud, it is correct 96% of the time. Recall is 0.78: the model catches 78% of the 98 actual fraud cases and misses 22%. The F1-score of 0.86 is the harmonic mean of the two. The weighted averages are all 1.00 because class 0 makes up almost the entire test set; the macro averages, which weight both classes equally, expose the gap more clearly, with a macro recall of 0.89. In short, the model raises few false alarms but misses roughly one in five fraud cases, which matters if a missed fraud is costlier than a false positive.
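
The rates in the report can be turned back into approximate counts for class 1, which makes the trade-off between missed fraud and false alarms more concrete. This is a small sketch using the rounded precision, recall, and support values printed above, so the counts are reconstructions rather than values read from a confusion matrix; the variable names are illustrative.

```python
support_fraud = 98   # actual fraud cases in the test set
recall = 0.78        # rounded values from the classification report
precision = 0.96

# recall = TP / actual positives
true_pos = round(recall * support_fraud)
false_neg = support_fraud - true_pos

# precision = TP / predicted positives
predicted_pos = round(true_pos / precision)
false_pos = predicted_pos - true_pos

print("Fraud caught (TP):", true_pos)
print("Fraud missed (FN):", false_neg)
print("False alarms (FP):", false_pos)
```

The reconstructed false negatives and false positives sum to 25 errors, which agrees with the Random Forest accuracy of 0.99956 on 56,962 test rows.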

## Feature Importance for Best-Performing Model

In [13]:
feature_importances = pd.Series(best_model.feature_importances_, index=X.columns)
top_10_features = feature_importances.nlargest(10)
print("Top 10 Features with Best Predictive Power:\n", top_10_features)

Top 10 Features with Best Predictive Power:
 V17    0.133363
V12    0.130387
V14    0.104119
V16    0.089681
V10    0.086219
V11    0.076871
V9     0.040900
V7     0.035715
V4     0.033716
V18    0.027955
dtype: float64


The output above lists the ten most important features according to the Random Forest's `feature_importances_`. V17, V12, and V14 rank highest, followed by V16, V10, and V11; together these six account for about 62% of the total importance. V9, V7, V4, and V18 contribute between about 2.8% and 4.1% each. Two caveats apply. First, these are impurity-based importances computed from the training data, so they describe what the model relies on rather than a causal link to fraud. Second, the V columns carry no descriptive names in this dataset, so the ranking cannot be translated into concrete fraud-prevention measures without knowing what each column represents. The spread of importance across many features does suggest that no single variable separates fraud from non-fraud on its own.
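
Impurity-based importances are derived from the training data, so one common cross-check is permutation importance measured on held-out data. The following is a hedged sketch, not part of the original analysis: it uses synthetic data as a stand-in for `fraud_data.csv`, the column names `f0` to `f7` are hypothetical, and F1 is an assumed choice of scoring metric.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fraud data (assumption, not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=42,
)
cols = [f"f{i}" for i in range(X_demo.shape[1])]  # hypothetical column names
X_demo = pd.DataFrame(X_demo, columns=cols)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Impurity-based importances, computed from the training data
impurity = pd.Series(model.feature_importances_, index=cols)

# Permutation importances: mean drop in held-out F1 when one column is shuffled
perm = permutation_importance(
    model, X_te, y_te, scoring="f1", n_repeats=10, random_state=42
)
permutation = pd.Series(perm.importances_mean, index=cols)

comparison = pd.DataFrame({"impurity": impurity, "permutation": permutation})
print(comparison.sort_values("impurity", ascending=False))
```

If the two rankings broadly agree, that lends some support to the impurity-based list; large disagreements would be a reason to look more closely at the affected features.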