# Fraud Detection with Logistic Regression and Feature Engineering

# 1. Data Preparation:

a. Load the dataset, and provide an overview of the available features, including transaction details, customer information, and labels (fraudulent or non-fraudulent).
b. Describe the class distribution of fraudulent and non-fraudulent transactions and discuss the imbalance issue.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer


In [2]:
# Load the dataset (pandas is already imported as pd in the first cell)
data = pd.read_csv('Fraud_Detection.csv')
data.head()

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,Label,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount
0,T1,C5841053,10/1/1994,F,JAMSHEDPUR,17819.05,2/8/2016,143207,25.0
1,T2,C2142763,4/4/1957,M,JHAJJAR,2270.69,2/8/2016,141858,27999.0
2,T3,C4417068,26/11/96,F,MUMBAI,17874.44,2/8/2016,142712,459.0
3,T4,C5342380,14/9/73,F,MUMBAI,866503.21,2/8/2016,142714,2060.0
4,T5,C9031234,24/3/88,F,NAVI MUMBAI,6714.43,2/8/2016,181156,1762.5


In [3]:
data.isna().sum()

TransactionID            0
CustomerID               0
CustomerDOB           3397
Label                 1100
CustLocation           151
CustAccountBalance    2369
TransactionDate          0
TransactionTime          0
TransactionAmount        0
dtype: int64

In [4]:
# Back-fill missing values; DataFrame.bfill() replaces the deprecated fillna(method="bfill")
df = data.bfill()
df.isna().sum()

TransactionID         0
CustomerID            0
CustomerDOB           0
Label                 0
CustLocation          0
CustAccountBalance    0
TransactionDate       0
TransactionTime       0
TransactionAmount     0
dtype: int64
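
Back-filling copies the next row's value upward, which has two side effects worth checking: a gap at the end of a column stays empty, and the filled value comes from an unrelated transaction. A minimal sketch on a toy Series (hypothetical values, not the real dataset) that compares it with a median fill:

```python
import numpy as np
import pandas as pd

# Toy balances with a gap in the middle and a gap at the end (hypothetical values).
balance = pd.Series([1.0, np.nan, 3.0, np.nan], name="CustAccountBalance")

# Back-fill takes the next valid value; the trailing gap has nothing after it.
back_filled = balance.bfill()

# Median fill uses a statistic of the observed values instead of a neighbouring row.
median_filled = balance.fillna(balance.median())

print("Remaining NaN after bfill:", int(back_filled.isna().sum()))
print("Median-filled values:", median_filled.tolist())
```

Which strategy is appropriate depends on the column; the median is only one reasonable default for a skewed numeric feature such as an account balance.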

In [5]:
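# Encode Label as integers.
# Note: this encodes `data`, not the back-filled `df`, so rows with a missing Label
# are still present when the encoder is fit.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
data['Label'] = le.fit_transform(data['Label'])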
data.head()

Unnamed: 0,TransactionID,CustomerID,CustomerDOB,Label,CustLocation,CustAccountBalance,TransactionDate,TransactionTime,TransactionAmount
0,T1,C5841053,10/1/1994,0,JAMSHEDPUR,17819.05,2/8/2016,143207,25.0
1,T2,C2142763,4/4/1957,1,JHAJJAR,2270.69,2/8/2016,141858,27999.0
2,T3,C4417068,26/11/96,0,MUMBAI,17874.44,2/8/2016,142712,459.0
3,T4,C5342380,14/9/73,0,MUMBAI,866503.21,2/8/2016,142714,2060.0
4,T5,C9031234,24/3/88,0,NAVI MUMBAI,6714.43,2/8/2016,181156,1762.5
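
LabelEncoder assigns one integer to every distinct value it sees, so it helps to inspect `classes_` after fitting. The sketch below uses a toy frame (hypothetical rows) and assumes that rows with a missing label should be dropped before encoding; that is one possible choice, not necessarily the right one for this dataset:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels, including one missing value (hypothetical rows).
toy = pd.DataFrame({"Label": ["F", "M", np.nan, "F", "M"]})

# Assumption: unlabelled rows are dropped rather than imputed.
labelled = toy.dropna(subset=["Label"]).copy()

le = LabelEncoder()
labelled["Label"] = le.fit_transform(labelled["Label"])

# Show which integer each original value was mapped to.
mapping = {str(cls): code for code, cls in enumerate(le.classes_)}
print(mapping)
```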


In [6]:
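# Note: summing Label counts fraud cases only if Label is strictly 0/1.
# The classification report in section 2 lists a class 3, so confirm with data['Label'].value_counts().
fraudulent_count = data['Label'].sum()
non_fraudulent_count = len(data) - fraudulent_count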

# Calculate the proportion of fraudulent transactions
fraudulent_proportion = fraudulent_count / len(data)
non_fraudulent_proportion = non_fraudulent_count / len(data)

print("Class Distribution:")
print(f"Fraudulent Transactions: {fraudulent_count} ({fraudulent_proportion * 100:.2f}%)")
print(f"Non-Fraudulent Transactions: {non_fraudulent_count} ({non_fraudulent_proportion * 100:.2f}%)")

Class Distribution:
Fraudulent Transactions: 768832 (73.32%)
Non-Fraudulent Transactions: 279735 (26.68%)
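
`value_counts` lists every distinct label with its count and share, which makes unexpected classes or missing values visible before modelling. A minimal sketch on a toy Series (the values are hypothetical; F and M are taken from the preview above):

```python
import pandas as pd

# Toy stand-in for the raw Label column (hypothetical values).
labels = pd.Series(["F", "M", "M", None, "M", "F"], name="Label")

# Count each distinct value, keeping missing entries as their own row.
counts = labels.value_counts(dropna=False)
shares = labels.value_counts(normalize=True, dropna=False)

print(counts)
print(shares.round(3))
```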


# 2. Initial Logistic Regression Model:

a. Implement a basic logistic regression model using the raw dataset.
b. Evaluate the model's performance using standard metrics such as accuracy, precision, recall, and F1-score.

In [7]:
X = data[['TransactionTime', 'TransactionAmount']]  # Independent variables
y = data['Label']  # Dependent variable

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=2)
print('Training data -X - shape:\t',xtrain.shape)
print()
print('Training data -Y - shape:\t',ytrain.shape)
print()
print('Testing data shape\n')
print('testing data(x-input) shape :\t',xtest.shape)
print()
print('testing data(Y-input) shape :\t',ytest.shape)

Training data -X - shape:	 (786425, 2)

Training data -Y - shape:	 (786425,)

Testing data shape

testing data(x-input) shape :	 (262142, 2)

testing data(Y-input) shape :	 (262142,)
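
With imbalanced labels, a plain random split can shift the class proportions between the training and test sets. Passing `stratify=y` to `train_test_split` keeps them aligned. A small sketch on synthetic labels (an 80/20 toy split, not the real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 80 samples of class 0 and 20 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

# stratify=y preserves the 80/20 class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2
)

print("Minority share in train:", y_tr.mean())
print("Minority share in test:", y_te.mean())
```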


In [8]:
# Instantiate scikit-learn's logistic regression with default settings
log_reg = LogisticRegression()

# Train the model
print('Training the model\n')
log_reg.fit(xtrain, ytrain)

# Test the model
ypred = log_reg.predict(xtest)
print('Predicted Label for the input samples:\n', ypred)
print()
print('Testing is completed\n')
print('Testing Samples are : \t', len(ypred))

Training the model

Predicted Label for the input samples:
 [1 1 1 ... 1 1 1]

Testing is completed

Testing Samples are : 	 262142


In [9]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report



print('****Performance measures for Logistic regression****')
print()
print('='*80)
print('Accuracy Score:\t\n',accuracy_score(ytest,ypred))
print()
print('='*80)
print('Confusion matrix:\n',confusion_matrix(ytest,ypred))
print()
print('='*80)
print('Classification report: \n',classification_report(ytest,ypred))
print('='*80)

****Performance measures for Logistic regression****

Accuracy Score:	
 0.7299898528278567

Confusion matrix:
 [[     0  70516      0]
 [     0 191361      0]
 [     0    265      0]]

Classification report: 
               precision    recall  f1-score   support

           0       0.00      0.00      0.00     70516
           1       0.73      1.00      0.84    191361
           3       0.00      0.00      0.00       265

    accuracy                           0.73    262142
   macro avg       0.24      0.33      0.28    262142
weighted avg       0.53      0.73      0.62    262142



  _warn_prf(average, modifier, msg_start, len(result))
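
The per-class figures in the report follow directly from the confusion matrix: precision divides each diagonal entry by its column total (everything predicted as that class), and recall divides it by its row total (everything that truly belongs to that class). The sketch below recomputes them with NumPy from the matrix printed above. Returning 0 when a class is never predicted is a choice made here to match the zeros shown in the report.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([
    [0, 70516, 0],
    [0, 191361, 0],
    [0, 265, 0],
])

true_positives = np.diag(cm)
predicted_totals = cm.sum(axis=0)  # column sums
actual_totals = cm.sum(axis=1)     # row sums

# Divide only where the denominator is non-zero; leave 0.0 elsewhere.
precision = np.divide(true_positives, predicted_totals,
                      out=np.zeros(len(cm)), where=predicted_totals > 0)
recall = np.divide(true_positives, actual_totals,
                   out=np.zeros(len(cm)), where=actual_totals > 0)
accuracy = true_positives.sum() / cm.sum()

print("precision per class:", precision.round(2))
print("recall per class:   ", recall.round(2))
print("accuracy:", round(float(accuracy), 4))
```

Because every test sample is assigned to the middle class, accuracy equals that class's precision, and the other classes have zero recall.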


# 3. Feature Engineering:

a. Apply feature engineering techniques to enhance the predictive power of the model. These techniques may include:
- Creating new features.
- Scaling or normalizing features.
- Handling missing values.
- Encoding categorical variables.
b. Explain why each feature engineering technique is relevant for fraud detection.
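
As an illustration of creating new features, the sketch below derives three candidates from columns that appear in the dataset preview. It rests on assumptions that should be checked against the data: `TransactionTime` is read as HHMMSS, and the derived names (`TxnHour`, `LogAmount`, `AmountToBalance`) are hypothetical.

```python
import numpy as np
import pandas as pd

# First three rows of the relevant columns, taken from the preview in section 1.
toy = pd.DataFrame({
    "CustAccountBalance": [17819.05, 2270.69, 17874.44],
    "TransactionTime": [143207, 141858, 142712],
    "TransactionAmount": [25.0, 27999.0, 459.0],
})

features = toy.copy()

# Assumption: TransactionTime is encoded as HHMMSS, so integer division by 10000 gives the hour.
features["TxnHour"] = features["TransactionTime"] // 10000

# Log-transform the amount to reduce the effect of very large transactions.
features["LogAmount"] = np.log1p(features["TransactionAmount"])

# Amount relative to balance; +1 avoids division by zero for empty accounts.
features["AmountToBalance"] = (
    features["TransactionAmount"] / (features["CustAccountBalance"] + 1.0)
)

print(features[["TxnHour", "LogAmount", "AmountToBalance"]])
```

A transaction that exceeds the account balance (second row) stands out in `AmountToBalance`, which is the kind of signal a raw amount alone does not carry.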


In [10]:
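# Scaling
# Note: the scaler is fit on all of X here and sc_x is not used by the later cells;
# to avoid leakage, fit on xtrain only and apply the same transform to xtest.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
sc_x = sc.fit_transform(X)
sc_x.shape

# Handling missing values
# DataFrame.bfill() replaces the deprecated fillna(method="bfill")
df = data.bfill()
df.isna().sum()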

TransactionID         0
CustomerID            0
CustomerDOB           0
Label                 0
CustLocation          0
CustAccountBalance    0
TransactionDate       0
TransactionTime       0
TransactionAmount     0
dtype: int64
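
Imputation, scaling, and categorical encoding can be combined in one preprocessing object that is fit on the training rows only and then reused on the test rows. The sketch below is one way to do this with scikit-learn's `ColumnTransformer`; the toy rows, the choice of median and most-frequent imputation, and one-hot encoding for `CustLocation` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training and test rows with gaps (hypothetical values).
train = pd.DataFrame({
    "CustLocation": ["MUMBAI", "JHAJJAR", np.nan, "MUMBAI"],
    "CustAccountBalance": [17819.05, np.nan, 17874.44, 866503.21],
    "TransactionAmount": [25.0, 27999.0, 459.0, 2060.0],
}).astype({"CustLocation": object})  # plain object dtype so np.nan marks the gap
test = pd.DataFrame({
    "CustLocation": ["NAVI MUMBAI"],
    "CustAccountBalance": [6714.43],
    "TransactionAmount": [1762.5],
}).astype({"CustLocation": object})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),  # unseen locations become all zeros
])

preprocess = ColumnTransformer(
    [
        ("num", numeric, ["CustAccountBalance", "TransactionAmount"]),
        ("cat", categorical, ["CustLocation"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_train = preprocess.fit_transform(train)  # statistics learned from training rows only
X_test = preprocess.transform(test)        # same statistics applied to test rows
print(X_train.shape, X_test.shape)
```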

# 4. Handling Imbalanced Data:
    
a. Discuss the challenges associated with imbalanced datasets in the context of fraud detection.
b. Implement strategies to address class imbalance, such as:
- Oversampling the minority class.
- Undersampling the majority class.
- Using synthetic data generation techniques (e.g., SMOTE).
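
Random over- and undersampling can be sketched with `sklearn.utils.resample`, as below. The data is synthetic (180 majority rows, 20 minority rows), and resampling should be applied to the training split only so that the test set keeps its original distribution. SMOTE is provided by the separate imbalanced-learn package and is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.utils import resample

# Synthetic training data: 180 majority rows (0) and 20 minority rows (1).
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "TransactionAmount": rng.exponential(1000.0, size=200),
    "Label": [1] * 20 + [0] * 180,
})
majority = train[train["Label"] == 0]
minority = train[train["Label"] == 1]

# Oversampling: draw minority rows with replacement until both classes match.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=0)
oversampled = pd.concat([majority, minority_up])

# Undersampling: keep a random subset of the majority class.
majority_down = resample(majority, replace=False,
                         n_samples=len(minority), random_state=0)
undersampled = pd.concat([majority_down, minority])

print(oversampled["Label"].value_counts().to_dict())
print(undersampled["Label"].value_counts().to_dict())
```

Oversampling keeps all information but repeats minority rows, which can encourage overfitting; undersampling discards majority rows, which can lose signal.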

# 5. Logistic Regression with Feature-Engineered Data:

a. Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data.
b. Evaluate the model's performance using appropriate evaluation metrics.


In [11]:
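# Train a logistic regression model
# Note: xtrain/ytrain are the same raw split used in section 2, so the scaled features (sc_x) are not applied here.
lr_model_fe = LogisticRegression()
lr_model_fe.fit(xtrain, ytrain)

# Make predictions (stored under a new name so the target vector y is not overwritten)
ypred_fe = lr_model_fe.predict(xtest)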
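
For a model that actually differs from the baseline, scaling and imbalance handling have to be part of the fit. One possible setup, sketched on synthetic data from `make_classification` (roughly 90/10 class split, not the real dataset), chains `StandardScaler` with `LogisticRegression(class_weight="balanced")` in a pipeline. Class weighting is used here as a stand-in for resampling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for the engineered features.
X, y = make_classification(
    n_samples=2000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=2
)

# The pipeline fits the scaler on the training data only, then the classifier.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
minority_recall = recall_score(y_te, pred)  # recall for class 1
print(classification_report(y_te, pred, zero_division=0))
```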


# 6. Model Interpretation:

a. Interpret the coefficients of the logistic regression model and discuss which features have the most influence on fraud detection.
b. Explain how the logistic regression model can be used for decision-making in identifying potential fraud.

In [12]:
# Get coefficients and feature names
coef = lr_model_fe.coef_
feature_names = xtrain.columns

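# Rank features by absolute coefficient size.
# Note: coef[0] is only the first row of coef_; with more than two classes, coef_ has one row per class.
# The features are unscaled, so the magnitudes also reflect each feature's units.
feature_importance = pd.Series(coef[0], index=feature_names)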
feature_importance = feature_importance.abs().sort_values(ascending=False)
print(feature_importance)


TransactionAmount    0.001361
TransactionTime      0.000326
dtype: float64
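
Coefficients are easier to compare when the features are standardized, and easier to read as odds ratios: `exp(coef)` is the multiplicative change in the odds of the positive class for a one-standard-deviation increase in the feature. For decision-making, `predict_proba` gives a score that can be compared with a threshold chosen for the business cost of missed fraud. The sketch below uses synthetic binary data; the column names are borrowed as placeholders and the 0.3 threshold is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic binary data; column names are placeholders, not the real columns' behaviour.
X_arr, y = make_classification(
    n_samples=500, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(X_arr, columns=["TransactionAmount", "TransactionTime"])

# Standardize so coefficient sizes are comparable across features.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

coefs = pd.Series(model.coef_[0], index=X.columns)
odds_ratios = np.exp(coefs)  # odds multiplier per one standard deviation
print(odds_ratios.sort_values(ascending=False))

# Decision rule: flag a transaction when its predicted probability passes the threshold.
proba = model.predict_proba(X_scaled)[:, 1]
threshold = 0.3  # hypothetical; lower values flag more transactions for review
flagged = proba >= threshold
print("Flagged for review:", int(flagged.sum()), "of", len(flagged))
```

Lowering the threshold below 0.5 trades more false alarms for fewer missed cases, which is often the preferred direction in fraud screening.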


# 7. Model Comparison:

a. Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model.
b. Discuss the advantages and limitations of each approach.
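
A side-by-side table of metrics makes the comparison concrete. The sketch below uses toy predictions (hypothetical, with class 1 as the rare class) for a baseline that always predicts the majority class and for a second model that catches the rare cases at the cost of one false alarm.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy ground truth and predictions (hypothetical values).
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
predictions = {
    "baseline": [0, 0, 0, 0, 0, 0, 0, 0],  # always predicts the majority class
    "balanced": [0, 0, 0, 0, 1, 0, 1, 1],  # finds both rare cases, one false alarm
}

rows = {}
for name, y_pred in predictions.items():
    rows[name] = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

comparison = pd.DataFrame.from_dict(rows, orient="index")
print(comparison.round(3))
```

In this toy case the baseline reaches 0.75 accuracy with zero recall on the rare class, which is why accuracy alone is a poor basis for comparing fraud models.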