# Logistic Regression of Selected Features from IEEE Data

### Load the Dataset

In [5]:
import statsmodels.api as sm
import pandas as pd

# load the train_selected_features.csv
folder_path = "./data/"
train_selected_features = pd.read_csv(f"{folder_path}train_selected_features.csv")

### Turn categorical variables into dummies

In [6]:
data = train_selected_features
# Convert categorical variables into dummy variables; drop_first=True drops one
# level per variable so that it serves as the reference category
categorical_cols = [
    "Card_Bank_Issuer",
    "Card_Bank_Type",
    "billing_region",
    "card_product_type",
]

data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)
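To illustrate what `drop_first=True` does, here is a minimal sketch on a toy frame. The values below are made up for illustration and are not taken from the dataset. `dtype=int` is an optional choice here to get 0/1 columns instead of booleans. Note that the predictor list in the following cell names only the numeric columns, so dummy columns would have to be added to `X` explicitly before they could enter the model; the last two lines show one hypothetical way to do that.

```python
import pandas as pd

# Toy data: values are invented for illustration, not from the real dataset
toy = pd.DataFrame(
    {
        "Amount": [25.0, 310.5, 12.0, 89.9],
        "card_product_type": ["credit", "debit", "credit", "charge"],
    }
)

# drop_first=True drops the first level in sorted order ("charge"),
# which then acts as the reference category; dtype=int yields 0/1 columns
toy_dummies = pd.get_dummies(
    toy, columns=["card_product_type"], drop_first=True, dtype=int
)
print(toy_dummies.columns.tolist())

# Hypothetical extension: gather the dummy columns by prefix so they can be
# placed alongside the numeric predictors
dummy_cols = [c for c in toy_dummies.columns if c.startswith("card_product_type_")]
X_toy = toy_dummies[["Amount"] + dummy_cols]
```

A row whose category is the dropped reference level ("charge" here) has zeros in every remaining dummy column.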

### Define Predictor Variables and Outcome Variable

In [7]:
# Define predictors (independent variables) and the target (dependent variable)
# Only the numeric predictors are listed here; the dummy columns created above are not included

X = data[
    [
        "Amount",
        "Count_number_transactions_last24",
        "avg_daily_tx_card_lastmonth",
        "Count_names_associated_card",
        "hour",
        "Count_Number_Devices_associated",
        "dayofweek",
        "Count_number_transactions_lasthour",
        "Count_other_cards_associated",
        "Count_products_associated_card",
    ]
]

y = data["isFraud"]

### Viewing and Interpreting Results

In [8]:
# Add a constant (intercept) to the predictors
X = sm.add_constant(X)

# Fit the logistic regression model
model = sm.Logit(y, X).fit()

# View the model summary
print(model.summary())

Optimization terminated successfully.
         Current function value: 0.144656
         Iterations 10
                           Logit Regression Results                           
Dep. Variable:                isFraud   No. Observations:               590540
Model:                          Logit   Df Residuals:                   590529
Method:                           MLE   Df Model:                           10
Date:                Wed, 04 Dec 2024   Pseudo R-squ.:                 0.04632
Time:                        21:44:18   Log-Likelihood:                -85425.
converged:                       True   LL-Null:                       -89574.
Covariance Type:            nonrobust   LLR p-value:                     0.000
                                         coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------
const                                 -3.1106      0.018   
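The header figures in the summary can be cross-checked by hand. The sketch below assumes the reported "Pseudo R-squ." is McFadden's pseudo R-squared (one minus the ratio of the model log-likelihood to the null log-likelihood) and that "Current function value" is the average negative log-likelihood per observation. The inputs are the rounded values printed above, so the results match the printed figures only approximately.

```python
# Values copied from the summary output (already rounded there)
log_likelihood = -85425.0  # Log-Likelihood of the fitted model
ll_null = -89574.0         # LL-Null (intercept-only model)
n_obs = 590540             # No. Observations

# McFadden's pseudo R-squared: 1 - LL_model / LL_null
pseudo_r2 = 1.0 - log_likelihood / ll_null

# Average negative log-likelihood per observation
avg_neg_ll = -log_likelihood / n_obs

print(f"Pseudo R-squared: {pseudo_r2:.5f}")
print(f"Average negative log-likelihood: {avg_neg_ll:.6f}")
```

Both results agree with the printed summary (0.04632 and 0.144656) up to the rounding of the inputs.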

## Interpretations

Based on the results of this model, we found the following:

### Key Insights for Each Predictor


The logistic regression model shows which factors are associated with credit card fraud. The intercept (-3.1106) is the baseline log-odds of fraud when every predictor equals zero, and its strongly negative value is consistent with fraudulent transactions being rare.

Several predictors are statistically significant. Higher transaction amounts slightly increase the likelihood of fraud. A higher average number of daily transactions over the last month reduces it, which suggests that frequent activity signals legitimate usage, and high transaction activity in the last hour is likewise associated with lower fraud risk. In contrast, the number of names, products, and other cards associated with a card all increase the likelihood of fraud, potentially indicating unusual account behavior. Timing also plays a role: transactions later in the day show a marginal decrease in fraud likelihood.

The number of transactions in the last 24 hours, the number of devices associated with the card, and the day of the week are not statistically significant in this model.

Overall, the model identifies patterns that help distinguish legitimate behavior from potentially fraudulent activity and can inform targeted monitoring and intervention, although the pseudo R-squared of 0.046 indicates that these predictors alone give only a modest improvement over an intercept-only model.
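Because logit coefficients are on the log-odds scale, converting them can make the interpretation more concrete. The sketch below turns the intercept reported in the summary (-3.1106) into baseline odds and a baseline probability. The second coefficient is hypothetical and is used only to show how an odds ratio is read, since the rest of the coefficient table is not shown above. With the fitted results object, something like `np.exp(model.params)` should give the odds ratios for all terms at once.

```python
import math


def log_odds_to_probability(log_odds: float) -> float:
    """Inverse logit: p = 1 / (1 + exp(-log_odds))."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# "const" coefficient from the summary output
intercept = -3.1106

baseline_odds = math.exp(intercept)
baseline_prob = log_odds_to_probability(intercept)
print(f"Baseline odds of fraud: {baseline_odds:.4f}")
print(f"Baseline probability of fraud: {baseline_prob:.4f}")

# Hypothetical coefficient, NOT taken from the fitted model
hypothetical_coef = 0.05
odds_ratio = math.exp(hypothetical_coef)
print(f"Odds ratio for a one-unit increase: {odds_ratio:.4f}")
```

The baseline probability comes out at roughly 4%, which applies to a transaction where every predictor is zero. An odds ratio above 1 means the odds of fraud rise with the predictor, and one below 1 means they fall.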