# Logistic Regression for Credit Card Fraud Detection

### Loading the data
Load the data from `fraud_data.csv`.

In [None]:
import numpy as np
import pandas as pd

data = pd.read_csv('fraud_data.csv')

# Split into features (every column except the last) and target (the last column)

X = data.iloc[:,:-1]
y = data.iloc[:,-1]

# Use X_train, X_test, y_train, y_test for all of the following questions
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Calculate and print the percentage of fraud observations in the dataset
percentage_fraud = (y.value_counts()[1] / y.count()) * 100
print(f"Percentage of fraud observations: {percentage_fraud:.2f}%")

Percentage of fraud observations: 1.64%


Only 1.64% of observations in the dataset are flagged as fraud, so legitimate transactions outnumber fraudulent ones by roughly 60 to 1. This kind of imbalance is typical of real fraud data, and it shapes how every metric below should be read.
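With so few positives, a random split can leave the train and test sets with slightly different fraud rates. The notebook uses a plain `train_test_split`; the sketch below shows one optional alternative, the `stratify` argument, on synthetic labels. The arrays `X_demo` and `y_demo` and the 2% fraud rate are made up for illustration and are not taken from `fraud_data.csv`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10,000 rows, 200 of them labelled as fraud (made-up numbers)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10_000, 3))
y_demo = np.zeros(10_000, dtype=int)
y_demo[:200] = 1

# stratify=y_demo asks for the same class ratio in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

print(f"train fraud rate: {y_tr.mean():.4f}")
print(f"test fraud rate:  {y_te.mean():.4f}")
```

Stratifying does not change the overall imbalance; it only keeps the split from making the test set more or less imbalanced than the training set.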

### Predictions using the majority class label

Using `X_train`, `X_test`, `y_train`, and `y_test` (as defined above), train a dummy classifier that classifies everything as the majority class of the training data.

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Instantiate a dummy classifier that always predicts the majority class
# of the training data (strategy='most_frequent')
dummy = DummyClassifier(strategy='most_frequent')

# Fit the dummy classifier to the training data
dummy.fit(X_train, y_train)

dummy_test_pred = dummy.predict(X_test)

# Measure test accuracy of the dummy classifier
dummy_test_acc = accuracy_score(y_test, dummy_test_pred)

print('Dummy classifier accuracy:', dummy_test_acc)

Dummy classifier accuracy: 0.9852507374631269


The dummy classifier's test accuracy is about 98.5%, which looks impressive but is misleading. The classifier always predicts the most common class, here non-fraudulent transactions, so it is right whenever a transaction is legitimate. It detects no fraud at all; its accuracy simply equals the share of legitimate transactions in the test set.

Recall exposes what accuracy hides. It measures the fraction of actual positive cases (fraudulent transactions) that the model correctly identifies. Because the dummy classifier never predicts fraud, its recall has to be 0.0, which the next cell confirms.

In [None]:
from sklearn.metrics import recall_score

# Measure test recall score of your dummy classifier
# Since it's a binary classification, we can use the default or set average='binary'
dummy_test_recall = recall_score(y_test, dummy_test_pred, average='binary')

print('Dummy classifier recall:', dummy_test_recall)


Dummy classifier recall: 0.0


A recall of zero means the classifier misses every fraudulent transaction. In fraud detection that makes it useless in practice, however high its accuracy appears on an imbalanced dataset.
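The arithmetic behind the two scores can be checked by hand. The sketch below uses hypothetical counts (985 legitimate and 15 fraudulent transactions, chosen only for illustration) for a classifier that labels everything as legitimate.

```python
# Hypothetical test set for a classifier that never predicts fraud
n_legit = 985   # made-up number of legitimate transactions
n_fraud = 15    # made-up number of fraudulent transactions

tp = 0          # fraud correctly flagged as fraud
fn = n_fraud    # fraud wrongly labelled legitimate
tn = n_legit    # legitimate correctly labelled legitimate
fp = 0          # legitimate wrongly flagged as fraud

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.3f}")
print(f"recall   = {recall:.3f}")
```

Accuracy equals the share of the majority class, while recall is zero regardless of how rare fraud is.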

### Training a logistic regression model

Train a logistic regression classifier using `X_train` and `y_train`. All parameters are left at their defaults except `max_iter`, which is raised to 1000 to help the solver converge.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score

# Instantiate a logistic regression model and fit it to the training data
logR = LogisticRegression(max_iter=1000) # Increase max_iter if needed for convergence
logR.fit(X_train, y_train)

# Make predictions on the test set using the logistic regression model
logR_test_pred = logR.predict(X_test)

# Measure test accuracy of the logistic regression model
logR_test_acc = accuracy_score(y_test, logR_test_pred)
print('Logistic classifier accuracy:', logR_test_acc)

# Measure test recall of the logistic regression model
logR_test_recall = recall_score(y_test, logR_test_pred, average='binary')
print('Logistic classifier recall:', logR_test_recall)


Logistic classifier accuracy: 0.9966814159292036
Logistic classifier recall: 0.8


The logistic regression model performs far better than the dummy classifier in terms of recall. It achieves a recall of 0.8, meaning it detects 80% of the fraudulent transactions in the test set, whereas the dummy classifier detects none.

Both classifiers have high accuracy (about 98.5% for the dummy and 99.7% for logistic regression), but recall is the more informative metric for fraud detection, where catching fraud usually matters more than avoiding the occasional false alarm on a legitimate transaction. The logistic regression model's recall shows that it is far more effective at detecting fraud despite the dataset's imbalance. The dummy classifier's accuracy, by contrast, reflects only the prevalence of the majority class and says nothing about its ability to detect fraud.
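Recall and wrongly flagged legitimate transactions can both be read off a confusion matrix. The sketch below uses ten hand-written labels (`y_true_demo` and `y_pred_demo` are illustrative, not outputs of the models above) to show how recall and precision come from the same four counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-written toy labels: 1 = fraud, 0 = legitimate (illustrative only)
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

recall_demo = recall_score(y_true_demo, y_pred_demo)        # tp / (tp + fn)
precision_demo = precision_score(y_true_demo, y_pred_demo)  # tp / (tp + fp)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"recall={recall_demo:.2f}, precision={precision_demo:.2f}")
```

Here three of the four fraud cases are caught (recall 0.75), at the price of two false alarms among five flagged transactions (precision 0.6).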

### Grid search for selecting hyperparameters for Logistic Regression

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the grid of logistic regression parameters
parameters = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller C = stronger regularization)
    'penalty': ['l1', 'l2'],  # Norm used in the penalization
    'solver': ['liblinear']  # Solver that supports both l1 and l2 penalties
}
model = LogisticRegression(max_iter=1000)

# Perform grid search with 5-fold cross-validation to find the best parameter setting
# (no scoring argument is given, so candidates are ranked by accuracy)
cmodel = GridSearchCV(model, parameters, cv=5)
cmodel.fit(X_train, y_train)  # y_train is already 1-D, so no ravel() is needed

# Fit logistic regression with the best parameters to the entire training data
best_params = cmodel.best_params_
model = LogisticRegression(**best_params, max_iter=1000)
model.fit(X_train, y_train)

logR_test_pred = model.predict(X_test)

# Measure test accuracy
logR_test_acc = accuracy_score(y_test, logR_test_pred)
print('Logistic classifier accuracy:', logR_test_acc)

# Measure test recall
logR_test_recall = recall_score(y_test, logR_test_pred, average='binary')
print('Logistic classifier recall:', logR_test_recall)

Logistic classifier accuracy: 0.9963126843657817
Logistic classifier recall: 0.775


On the test set, the logistic regression model fitted earlier (default settings apart from `max_iter=1000`) scored slightly higher than the model selected by grid search: about 99.67% accuracy and a recall of 0.8, versus about 99.63% accuracy and a recall of 0.775.

The higher recall means the earlier model caught slightly more of the fraudulent transactions in the test set. The gap is small, though, and with so few fraud cases in the test set it may not reflect a real difference between the two models. Which model to prefer ultimately depends on the cost of false negatives relative to false positives, that is, on the trade-off between recall and precision.
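One common way to move along the recall/precision trade-off is to change the probability threshold at which a transaction is flagged. The sketch below is an illustration on synthetic data from `make_classification` (about 5% positives); the dataset, the variable names, and the thresholds 0.5, 0.3 and 0.1 are assumptions chosen for the example, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 5% positives), for illustration only
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

recalls = []
for threshold in (0.5, 0.3, 0.1):
    # Flag a sample as positive when its probability reaches the threshold
    pred = (proba >= threshold).astype(int)
    rec = recall_score(y_te, pred)
    prec = precision_score(y_te, pred, zero_division=0)
    recalls.append(rec)
    print(f"threshold={threshold:.1f}  recall={rec:.3f}  precision={prec:.3f}")
```

Lowering the threshold can only add flagged samples, so recall never decreases; precision typically falls as more legitimate samples are flagged.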

This comparison shows why imbalanced classification needs more than one metric. `GridSearchCV` was run without a `scoring` argument, so it ranked candidates by cross-validated accuracy, the metric least sensitive to missed fraud. Grid search therefore does not guarantee better test performance than the defaults, particularly when the metric it optimizes differs from the practical objective, which here is recall.
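Since `GridSearchCV` scores classifiers by accuracy unless told otherwise, one option is to pass `scoring='recall'` so that the search optimizes the metric of interest. The sketch below tries this on synthetic data with a smaller grid over `C` only; the data, the grid, and the variable names are assumptions for illustration, and no claim is made about what this would yield on `fraud_data.csv`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic imbalanced data (roughly 5% positives), for illustration only
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo
)

# Reduced, illustrative grid: inverse regularization strength only
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}

# scoring='recall' ranks candidates by cross-validated recall instead of accuracy
search = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='recall'
)
search.fit(X_tr, y_tr)

# With the default refit=True, search.predict uses the best model refit on all of X_tr
test_recall = recall_score(y_te, search.predict(X_te))

print('Best parameters:', search.best_params_)
print('Best cross-validated recall:', search.best_score_)
print('Test recall:', test_recall)
```

Optimizing recall alone can favor models that flag too many legitimate transactions, so in practice it is worth inspecting precision alongside it.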