# Fraud Detection with Logistic Regression and Feature Engineering
You are a data scientist at a financial institution, and your primary task is to develop a fraud detection model using logistic regression. The dataset you have is highly imbalanced, with only a small fraction of transactions being fraudulent. Your objective is to create an effective model by implementing logistic regression and employing various feature engineering techniques to improve the model's performance:

1. Data Preparation:
    a. Load the dataset, and provide an overview of the available features, including transaction details, customer information, and labels (fraudulent or non-fraudulent).
    b. Describe the class distribution of fraudulent and non-fraudulent transactions and discuss the imbalance issue.

2. Initial Logistic Regression Model:
    a. Implement a basic logistic regression model using the raw dataset.
    b. Evaluate the model's performance using standard metrics like accuracy, precision, recall, and F1-score.

3. Feature Engineering:
    a. Apply feature engineering techniques to enhance the predictive power of the model. These techniques may include:
    - Creating new features.
    - Scaling or normalizing features.
    - Handling missing values.
    - Encoding categorical variables.
    b. Explain why each feature engineering technique is relevant for fraud detection.

4. Handling Imbalanced Data:
    a. Discuss the challenges associated with imbalanced datasets in the context of fraud detection.
    b. Implement strategies to address class imbalance, such as:
    - Oversampling the minority class.
    - Undersampling the majority class.
    - Using synthetic data generation techniques (e.g., SMOTE).

5. Logistic Regression with Feature-Engineered Data:
    a. Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data.
    b. Evaluate the model's performance using appropriate evaluation metrics.

6. Model Interpretation:
    a. Interpret the coefficients of the logistic regression model and discuss which features have the most influence on fraud detection.
    b. Explain how the logistic regression model can be used for decision-making in identifying potential fraud.

7. Model Comparison:
    a. Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model.
    b. Discuss the advantages and limitations of each approach.

8. Presentation and Recommendations:
    a. Prepare a presentation or report summarizing your analysis, results, and recommendations for the financial institution. Highlight the importance of feature engineering and handling imbalanced data in building an effective fraud detection system.

    In this case study, you are required to showcase your ability to preprocess data, implement logistic regression, apply feature engineering techniques, and address class imbalance to improve the model's performance. Your analysis should also demonstrate your understanding of the nuances of fraud detection in a financial context.


# 1. Data Preparation:
    a. Load the dataset, and provide an overview of the available features, including transaction details, customer information, and labels (fraudulent or non-fraudulent).
    b. Describe the class distribution of fraudulent and non-fraudulent transactions and discuss the imbalance issue.

In [None]:
# a. Load the dataset, and provide an overview of the available features, including transaction details, customer information, and labels (fraudulent or non-fraudulent).

    step: represents a unit of time where 1 step equals 1 hour
    type: type of online transaction
    amount: the amount of the transaction
    nameOrig: customer who initiates the transaction
    oldbalanceOrg: originator's balance before the transaction
    newbalanceOrig: originator's balance after the transaction
    nameDest: recipient of the transaction
    oldbalanceDest: recipient's balance before the transaction
    newbalanceDest: recipient's balance after the transaction
    isFraud: label; 1 if the transaction is fraudulent, 0 otherwise

In [65]:
#  b. Describe the class distribution of fraudulent and non-fraudulent transactions and discuss the imbalance issue.

import pandas as pd

# Load the dataset from the CSV file
data = pd.read_csv('financialfraud.csv')

# Describe the class distribution
class_distribution = data['isFraud'].value_counts()

# Print the class distribution
print("Class Distribution:")
print(class_distribution)


Class Distribution:
0    997
1      9
Name: isFraud, dtype: int64
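
To make the imbalance concrete, the counts above can be turned into a fraud rate and an imbalance ratio. The sketch below hard-codes the two counts printed above instead of reading the CSV, and the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above
counts = {0: 997, 1: 9}

total = sum(counts.values())
fraud_rate = counts[1] / total            # share of fraudulent transactions
imbalance_ratio = counts[0] / counts[1]   # non-fraud rows per fraud row

print(f"Total transactions: {total}")
print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Non-fraud to fraud ratio: {imbalance_ratio:.1f} : 1")
```

Fewer than 1% of the transactions are fraudulent, so a model that labels every transaction as non-fraudulent would still be right about 99% of the time while detecting no fraud at all.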


# 2. Initial Logistic Regression Model:
    a. Implement a basic logistic regression model using the raw dataset.
    b. Evaluate the model's performance using standard metrics like accuracy, precision, recall, and F1-score.

In [66]:
# a. Implement a basic logistic regression model using the raw dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression


# Drop non-numeric columns and columns not being used for analysis
data_cleaned = data.drop(columns=['nameOrig', 'nameDest'])

# Perform one-hot encoding on the 'type' column
data_encoded = pd.get_dummies(data_cleaned, columns=['type'])



# Split the data into features (X) and target variable (y) after encoding
X = data_encoded.drop(columns=['isFraud'])
y = data_encoded['isFraud']

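# Handle class imbalance by oversampling the minority class with SMOTE.
# Caveat: resampling before the split also places synthetic fraud samples in the
# test set, so the metrics below describe a balanced, partly synthetic test set
# and are likely optimistic for real, imbalanced traffic.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=2)
X_resampled, y_resampled = smote.fit_resample(X, y)

# Split the resampled data into training and testing sets
xtrain, xtest, ytrain, ytest = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=2)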

# Initialize and train the logistic regression model
logistic_model = LogisticRegression(random_state=2)
logistic_model.fit(xtrain, ytrain)


In [67]:
# b. Evaluate the model's performance using standard metrics like accuracy, precision, recall, and F1-score.

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predictions on the test set
ypred = logistic_model.predict(xtest)

# Evaluate the model's performance using standard metrics
accuracy = accuracy_score(ytest, ypred)
precision = precision_score(ytest, ypred)
recall = recall_score(ytest, ypred)
f1 = f1_score(ytest, ypred)

# Print the evaluation metrics
print("Model Performance Metrics:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")


Model Performance Metrics:
Accuracy: 0.87
Precision: 0.95
Recall: 0.79
F1-Score: 0.86


# 3. Feature Engineering:
    a. Apply feature engineering techniques to enhance the predictive power of the model. These techniques may include:
        - Creating new features.
        - Scaling or normalizing features.
        - Handling missing values.
        - Encoding categorical variables.
    b. Explain why each feature engineering technique is relevant for fraud detection.

In [69]:
# a. Apply feature engineering techniques to enhance the predictive power of the model. These techniques may include:
#       - Creating new features.
#       - Scaling or normalizing features.
#       - Handling missing values.
#       - Encoding categorical variables.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

# Load the dataset from the CSV file
data = pd.read_csv('financialfraud.csv')

# Drop non-numeric columns and columns not being used for analysis
data_cleaned = data.drop(columns=['nameOrig', 'nameDest'])

# Perform feature engineering: Creating new features
# (assumes 'amount' is never zero; a zero amount would produce inf or NaN here)
data_cleaned['amount_ratio'] = data_cleaned['oldbalanceOrg'] / data_cleaned['amount']

# Perform feature engineering: Scaling numerical features using StandardScaler
# Note: the scaler is fitted on the full dataset, before the train/test split,
# so test-set statistics influence the scaled training features.
scaler = StandardScaler()
data_cleaned[['oldbalanceOrg', 'amount', 'amount_ratio']] = scaler.fit_transform(data_cleaned[['oldbalanceOrg', 'amount', 'amount_ratio']])

# Perform feature engineering: Handling missing values (filling with mean)
# numeric_only=True skips the string column 'type', which recent pandas versions require
data_cleaned.fillna(data_cleaned.mean(numeric_only=True), inplace=True)

# Perform feature engineering: Encoding categorical variables (One-hot encoding 'type' column)
data_encoded = pd.get_dummies(data_cleaned, columns=['type'])

# Split the data into features (X) and target variable (y)
X = data_encoded.drop(columns=['isFraud'])
y = data_encoded['isFraud']

# Split the data into training and testing sets (80% training, 20% testing)
# Note: the split is not stratified, so with only 9 fraud rows the test set may contain very few or none.
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the logistic regression model
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(xtrain, ytrain)

# Predictions on the test set
ypred = logistic_model.predict(xtest)

# Evaluate the model's performance using standard metrics
accuracy = accuracy_score(ytest, ypred)
precision = precision_score(ytest, ypred)
recall = recall_score(ytest, ypred)
f1 = f1_score(ytest, ypred)

# Print the evaluation metrics
print("Model Performance Metrics after Feature Engineering:")
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")


Model Performance Metrics after Feature Engineering:
Accuracy: 1.00
Precision: 0.00
Recall: 0.00
F1-Score: 0.00


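Warnings from the recorded run (messages truncated in the export): one raised at the fillna mean-imputation line, and three from scikit-learn's _warn_prf, which reports precision, recall, and F-score as ill-defined. These warnings appear when a metric's denominator is zero, so they suggest that the model predicted no fraud on this test split and that the unstratified split may have contained no fraudulent transactions at all. The accuracy of 1.00 above should be read in that light.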
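
With only 9 fraud cases in 1,006 rows, an unstratified 20% split can leave the test set with very few fraud cases or none, which makes precision and recall undefined. The sketch below contrasts an unstratified split with one that passes `stratify` to `train_test_split`. It uses synthetic labels with the same class counts as the dataset; `X_demo` and `y_demo` are placeholder names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (997 non-fraud, 9 fraud)
y_demo = np.array([0] * 997 + [1] * 9)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Unstratified split: the number of fraud cases in the test set is left to chance
_, _, _, y_test_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Stratified split: class proportions are preserved in both partitions
_, _, _, y_test_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Fraud cases in test set (unstratified):", int(y_test_plain.sum()))
print("Fraud cases in test set (stratified):  ", int(y_test_strat.sum()))
```

Stratifying does not add information, but it does guarantee that the evaluation metrics are computed on a test set that contains both classes.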


In [None]:
# b. Explain why each feature engineering technique is relevant for fraud detection.
1. Creating New Features:
    Purpose: New features let a linear model such as logistic regression use relationships that the raw columns do not express directly, for example ratios between columns or aggregates over several transactions.
    Relevance for Fraud Detection: A ratio such as amount_ratio = oldbalanceOrg / amount relates the size of a transaction to the balance it is drawn from. A value close to 1 means the transaction empties the account, a pattern that neither the amount nor the balance shows on its own.

2. Scaling or Normalizing Features:
    Purpose: Scaling puts features on a comparable scale. For logistic regression this helps the solver converge and keeps the regularization penalty from weighing features differently just because of their units.
    Relevance for Fraud Detection: Transaction amounts and balances can differ by several orders of magnitude. After standardization, no feature dominates because of its scale, and the fitted coefficients become easier to compare.

3. Handling Missing Values:
    Purpose: scikit-learn's LogisticRegression cannot be fitted on data containing NaN, so missing values have to be imputed or the affected rows dropped.
    Relevance for Fraud Detection: Transaction records can be incomplete for various reasons. Mean imputation keeps those rows available for training, which matters when fraud cases are scarce, although it can blur the signal if values are missing for a systematic reason.

4. Encoding Categorical Variables:
    Purpose: Logistic regression requires numerical input, so categorical variables must be converted into numerical representations.
    Relevance for Fraud Detection: The transaction type ('type') is categorical. One-hot encoding gives each type its own coefficient, so the model can learn that some transaction types carry a higher fraud risk than others.
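
The four techniques above can also be combined into a single scikit-learn pipeline, so that imputation, scaling, and encoding are learned only from the rows passed to `fit`. The following is a minimal sketch under stated assumptions: the small data frame is made up for illustration (including the transaction type labels), and mapping zero amounts to NaN before dividing is a design choice of this sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up frame using the dataset's column names (values are illustrative only)
demo = pd.DataFrame({
    "type": ["TRANSFER", "PAYMENT", "CASH_OUT", "PAYMENT", "TRANSFER", "CASH_OUT"],
    "amount": [9000.0, 120.0, 5000.0, 0.0, 7000.0, 60.0],
    "oldbalanceOrg": [9000.0, 4000.0, 5000.0, 300.0, np.nan, 2500.0],
    "isFraud": [1, 0, 1, 0, 1, 0],
})

# New feature: balance-to-amount ratio, with zero amounts mapped to NaN instead of inf
demo["amount_ratio"] = demo["oldbalanceOrg"] / demo["amount"].replace(0, np.nan)

numeric_cols = ["amount", "oldbalanceOrg", "amount_ratio"]
categorical_cols = ["type"]

preprocess = ColumnTransformer([
    # Impute first, then scale, so the scaler never sees missing values
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric_cols),
    # One-hot encode transaction types; unseen types at predict time become all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([("preprocess", preprocess),
                  ("clf", LogisticRegression())])

X_demo = demo.drop(columns=["isFraud"])
y_demo = demo["isFraud"]
model.fit(X_demo, y_demo)  # preprocessing statistics come only from these rows

features = model.named_steps["preprocess"].transform(X_demo)
print(features.shape)
```

In a real workflow the pipeline would be fitted on the training split only and then applied unchanged to the test split.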

# 4. Handling Imbalanced Data:
    a. Discuss the challenges associated with imbalanced datasets in the context of fraud detection.
    b. Implement strategies to address class imbalance, such as:
        - Oversampling the minority class.
        - Undersampling the majority class.
        - Using synthetic data generation techniques (e.g., SMOTE).

In [None]:
# a. Discuss the challenges associated with imbalanced datasets in the context of fraud detection.

In fraud detection, dealing with imbalanced datasets poses several challenges:

    Bias in Model Training: Models tend to be biased towards the majority class (non-fraudulent transactions) because they have more samples to learn from. As a result, the model may struggle to identify patterns related to the minority class (fraudulent transactions).

    Inaccurate Evaluation: Traditional accuracy is not a reliable metric for imbalanced datasets. A model predicting all instances as non-fraudulent can achieve high accuracy but fail to identify any fraudulent transactions. Evaluation metrics like precision, recall, and F1-score are more informative but can still be affected by the class imbalance.

    Loss of Critical Information: In undersampling, removing majority class samples can lead to loss of valuable information, potentially ignoring genuine non-fraudulent patterns. Oversampling, on the other hand, can lead to overfitting if not done carefully.
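
The evaluation problem described above can be reproduced with a few lines. This sketch assumes labels with the same class counts as the dataset (997 non-fraud, 9 fraud) and a hypothetical "model" that never flags fraud; no real data is used.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Labels with the same class counts as the dataset: 997 non-fraud, 9 fraud
y_true = np.array([0] * 997 + [1] * 9)

# A "model" that never flags fraud
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
# zero_division=0 reports 0 instead of warning when nothing is predicted positive
precision = precision_score(y_true, y_pred, zero_division=0)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"Precision: {precision:.3f}")
```

Accuracy stays above 99% even though every fraudulent transaction is missed, which is why recall and precision on the fraud class are the metrics to watch.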
    

In [None]:
# b. Implement strategies to address class imbalance, such as:
    - Oversampling the minority class.
    - Undersampling the majority class.
    - Using synthetic data generation techniques (e.g., SMOTE).


In [70]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import OneHotEncoder

# Load the dataset from the CSV file
data = pd.read_csv('financialfraud.csv')

# Encode categorical variable 'type' using one-hot encoding
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(data[['type']])

# Concatenate the encoded features with other numeric features
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
X = pd.concat([pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(['type'])), 
               data[['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
                     'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud']]], axis=1)
# Target variable
y = data['isFraud']

# Split the data into training and testing sets (80% training, 20% testing)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=2)

# Apply SMOTE for oversampling the minority class
smote = SMOTE(random_state=2)
X_resampled, y_resampled = smote.fit_resample(xtrain, ytrain)

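# Apply random undersampling for the majority class
# (SMOTE with default settings has already balanced the classes, so this step leaves the class counts unchanged)
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X_resampled, y_resampled)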

# Initialize and train the logistic regression model on resampled data
logistic_model_resampled = LogisticRegression(random_state=2)
logistic_model_resampled.fit(X_resampled, y_resampled)

# Predictions on the test set
ypred_resampled = logistic_model_resampled.predict(xtest)

# Evaluate the model's performance using accuracy
accuracy_resampled = accuracy_score(ytest, ypred_resampled)

# Print the evaluation metric after resampling
print(f"Accuracy after SMOTE and random undersampling: {accuracy_resampled:.2f}")


Accuracy after SMOTE and random undersampling: 0.98
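
The cell above uses SMOTE followed by random undersampling. For comparison, plain random oversampling and random undersampling can be sketched with `sklearn.utils.resample`. This is an illustration on synthetic arrays, not the notebook's method, and the array names are hypothetical.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.RandomState(0)

# Synthetic training data: 200 non-fraud rows and 5 fraud rows, 3 features each
X_major = rng.normal(size=(200, 3))
X_minor = rng.normal(loc=2.0, size=(5, 3))

# Random oversampling: draw minority rows with replacement up to the majority count
X_minor_over = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

# Random undersampling: draw majority rows without replacement down to the minority count
X_major_under = resample(X_major, replace=False, n_samples=len(X_minor), random_state=0)

X_over = np.vstack([X_major, X_minor_over])
y_over = np.array([0] * len(X_major) + [1] * len(X_minor_over))

X_under = np.vstack([X_major_under, X_minor])
y_under = np.array([0] * len(X_major_under) + [1] * len(X_minor))

print("Oversampled class counts: ", np.bincount(y_over))
print("Undersampled class counts:", np.bincount(y_under))
```

Oversampling repeats the few fraud rows many times, which risks overfitting to them, while undersampling discards most of the non-fraud rows. Either way, resampling should be applied to the training partition only.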




# 5. Logistic Regression with Feature-Engineered Data:
    a. Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data.
    b. Evaluate the model's performance using appropriate evaluation metrics.

In [71]:
# a. Train a logistic regression model using the feature-engineered dataset and the methods for handling imbalanced data.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv('financialfraud.csv')

# Encode categorical variable 'type' using one-hot encoding
encoder = OneHotEncoder()
encoded_data = encoder.fit_transform(data[['type']])

# Concatenate the encoded features with other numeric features
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
# Note: this feature matrix does not include the scaled columns or amount_ratio from section 3.
X = pd.concat([pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out(['type'])), 
               data[['step', 'amount', 'oldbalanceOrg', 'newbalanceOrig',
                     'oldbalanceDest', 'newbalanceDest', 'isFlaggedFraud']]], axis=1)
# Target variable
y = data['isFraud']

# Split the data into training and testing sets (80% training, 20% testing)
xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2, random_state=2)

print("Training Logistic Regression Model using Feature-Engineered Data...")

Training Logistic Regression Model using Feature-Engineered Data...




In [72]:
# b. Evaluate the model's performance using appropriate evaluation metrics.

print("Evaluating the Model's Performance after Feature Engineering and Resampling...")

# Apply SMOTE for oversampling the minority class
smote = SMOTE(random_state=2)
X_resampled, y_resampled = smote.fit_resample(xtrain, ytrain)

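# Apply random undersampling for the majority class
# (SMOTE with default settings has already balanced the classes, so this step leaves the class counts unchanged)
undersampler = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = undersampler.fit_resample(X_resampled, y_resampled)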

# Initialize and train the logistic regression model on resampled data
logistic_model_resampled = LogisticRegression(random_state=2)
logistic_model_resampled.fit(X_resampled, y_resampled)

# Predictions on the test set
ypred_resampled = logistic_model_resampled.predict(xtest)

# Evaluate the model's performance using appropriate metrics
accuracy_resampled = accuracy_score(ytest, ypred_resampled)

# Print the evaluation metric after resampling
print(f"Accuracy after Feature Engineering and Resampling: {accuracy_resampled:.2f}")


Evaluating the Model's Performance after Feature Engineering and Resampling...
Accuracy after Feature Engineering and Resampling: 0.98
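
Accuracy alone says little on a test set where fraud is rare. A fuller evaluation could report the confusion matrix, per-class precision and recall, and average precision, as in the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the transactions, so the numbers it prints say nothing about the notebook's model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, classification_report,
                             confusion_matrix)
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives) standing in for the transactions
X_syn, y_syn = make_classification(n_samples=2000, n_features=6, n_informative=4,
                                   weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          stratify=y_syn, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)
fraud_scores = clf.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Per-class precision, recall, and F1
print(classification_report(y_te, y_hat, digits=3, zero_division=0))

# Area under the precision-recall curve, computed from scores rather than hard labels
ap = average_precision_score(y_te, fraud_scores)
print(f"Average precision: {ap:.3f}")
```

In the notebook, the same calls could be applied to `ytest`, `ypred_resampled`, and `logistic_model_resampled.predict_proba(xtest)[:, 1]`.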


# 6. Model Interpretation:
    a. Interpret the coefficients of the logistic regression model and discuss which features have the most influence on fraud detection.
    b. Explain how the logistic regression model can be used for decision-making in identifying potential fraud.

In [None]:
# a. Interpret the coefficients of the logistic regression model and discuss which features have the most influence on fraud detection.

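    In logistic regression, each coefficient is the change in the log-odds of fraud for a one-unit increase in that feature, with the other features held fixed. Exponentiating a coefficient gives the corresponding odds ratio.
    A positive coefficient (for example on 'amount') means that higher values raise the estimated fraud probability, while a negative coefficient lowers it. Coefficient magnitudes are only comparable across features when the features are on a common scale.

    Features expected to be influential include 'amount' (larger transfers carry more risk), the balance columns such as 'newbalanceOrig' (an account emptied by the transaction is a warning sign), and the one-hot encoded 'type' columns (some transaction types are more exposed to fraud). These expectations should be checked against the fitted values in logistic_model_resampled.coef_, which this notebook does not print.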
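
One way to inspect the coefficients is sketched below. It fits a model on synthetic, standardized data and lists the features by absolute coefficient together with their odds ratios. The feature names are borrowed from the dataset's columns purely as labels; the data is generated, so the ranking it prints is not a finding about the fraud dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical labels borrowed from the dataset's columns; the data itself is synthetic
feature_names = ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest"]
X_syn, y_syn = make_classification(n_samples=1000, n_features=4, n_informative=3,
                                   n_redundant=1, weights=[0.9, 0.1], random_state=0)

# Standardize so that coefficient magnitudes are comparable across features
X_std = StandardScaler().fit_transform(X_syn)
clf = LogisticRegression(max_iter=1000).fit(X_std, y_syn)

coefs = clf.coef_[0]          # change in log-odds per one standard deviation
odds_ratios = np.exp(coefs)   # multiplicative change in the odds

# List features from largest to smallest absolute coefficient
order = np.argsort(-np.abs(coefs))
for i in order:
    print(f"{feature_names[i]:<15} coef={coefs[i]:+.3f}  odds ratio={odds_ratios[i]:.3f}")
```

An odds ratio above 1 means the feature pushes a transaction toward the fraud class, and one below 1 pushes it away.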

In [None]:
# b. Explain how the logistic regression model can be used for decision-making in identifying potential fraud.

    The logistic regression model supports fraud identification by assigning each transaction an estimated probability of being fraudulent. 
    If a transaction's probability exceeds a chosen threshold, it is flagged as potential fraud; scikit-learn's predict() uses a threshold of 0.5. 
    Lowering the threshold catches more fraud at the cost of more false alarms, and raising it does the opposite. Ranking transactions by probability also lets investigators review the riskiest ones first.
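
The effect of the threshold can be sketched with `predict_proba`. The example below trains on synthetic imbalanced data and counts how many test transactions are flagged at three thresholds. The thresholds are illustrative values, not recommendations, and all names are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_syn, y_syn = make_classification(n_samples=2000, n_features=6, n_informative=4,
                                   weights=[0.95, 0.05], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25,
                                          stratify=y_syn, random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Estimated fraud probability for each test transaction
p_fraud = clf.predict_proba(X_te)[:, 1]

flag_counts = {}
for threshold in (0.5, 0.3, 0.1):   # illustrative thresholds, not recommendations
    flagged = (p_fraud >= threshold).astype(int)
    flag_counts[threshold] = int(flagged.sum())
    print(f"threshold={threshold:.1f}  flagged={flag_counts[threshold]:4d}  "
          f"precision={precision_score(y_te, flagged, zero_division=0):.2f}  "
          f"recall={recall_score(y_te, flagged):.2f}")
```

In practice the threshold would be chosen from the relative cost of a missed fraud and a false alarm, and from how many flagged transactions investigators can review.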

# 7. Model Comparison:
    a. Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model.
    b. Discuss the advantages and limitations of each approach.


In [None]:
# a. Compare the performance of the initial logistic regression model with the feature-engineered and balanced data model.

a. Model Comparison:

    Initial Logistic Regression Model:
        Accuracy: 0.87 (precision 0.95, recall 0.79, F1-score 0.86), measured on a test set drawn from the SMOTE-resampled data.
        Advantages: Simple, easy to implement, provides a baseline understanding of the data.
        Limitations: May not capture complex patterns, especially in imbalanced datasets.
    
    Logistic Regression Model with Feature-Engineered and Balanced Data:
        Accuracy: 0.98, measured on the original, imbalanced test split.
        Advantages: Combines feature engineering with resampling of the training data to address class imbalance.
        Limitations: Complexity might be higher, potentially leading to overfitting with insufficient data.

    Note: The two accuracy figures come from different test sets, so they are not directly comparable. On a test set with under 1% fraud, a model that flags nothing would also score about 0.99 accuracy, so precision and recall on the fraud class are needed for a fair comparison.

In [None]:
# b. Discuss the advantages and limitations of each approach.
Initial Logistic Regression Model:
    Advantages:
        Simple and interpretable, making it easy to understand the model's decisions.
        Faster training and prediction times, suitable for large datasets.

    Limitations:
        Limited ability to capture intricate relationships in the data, especially in the case of imbalanced classes.
        Might lead to biased predictions due to class imbalance.

Logistic Regression Model with Feature-Engineered and Balanced Data:
    Advantages:
        Captures complex patterns through feature engineering, enhancing predictive accuracy.
        Handles class imbalance, reducing the risk of biased predictions.
    Limitations:
        Increased complexity might require more data for robust training, guarding against overfitting.
        Longer training times and higher computational requirements.
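
A like-for-like comparison requires both models to be scored on the same untouched test set. The sketch below does this on synthetic data, with a baseline trained on the imbalanced training split and a second model trained on a balanced version of it. Random oversampling via `sklearn.utils.resample` is used here in place of SMOTE to keep the example dependency-free, and all names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

X_syn, y_syn = make_classification(n_samples=3000, n_features=6, n_informative=4,
                                   weights=[0.97, 0.03], random_state=2)
# One stratified split; the test partition is never resampled
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25,
                                          stratify=y_syn, random_state=2)

# Balance the training partition only, by random oversampling of the minority class
minority = X_tr[y_tr == 1]
majority = X_tr[y_tr == 0]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=2)
X_bal = np.vstack([majority, minority_up])
y_bal = np.array([0] * len(majority) + [1] * len(minority_up))

results = {}
for name, (X_fit, y_fit) in {"baseline": (X_tr, y_tr), "balanced": (X_bal, y_bal)}.items():
    clf = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
    y_hat = clf.predict(X_te)
    results[name] = {
        "precision": precision_score(y_te, y_hat, zero_division=0),
        "recall": recall_score(y_te, y_hat),
    }
    print(name, results[name])
```

Because both models see the same test rows, any difference in precision or recall reflects the training strategy and not the composition of the test set.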