# Model Building
In this stage, you will build several machine learning models on the cleaned data set and attempt to train a model that performs better than baseline. Depending on your data set, this may mean different things.
## Imports

In [54]:
import os
import sys
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint, uniform as sp_uniform
import numpy as np
from lightgbm import LGBMClassifier
import joblib
from sklearn.ensemble import VotingClassifier

In [62]:
src_path = os.path.abspath('../src/')
sys.path.append(src_path)

from data_modeling import score_classification, downsample, scaled_model_search

## Functions
For your convenience, we have included a few pre-written functions, which you might find useful in your model building. They are by no means necessary, but feel free to use any or all of them.

### score_classification
score_classification takes a model's predictions for the training and test sets and scores them on several classification metrics. It also reports the confusion matrix.

Parameters:
- y_train: (1d array-like) The correct y values for the training data set
- y_train_pred: (1d array-like) The predicted y values from the training data set
- y_test: (1d array-like) The correct y values for the test data set
- y_test_pred: (1d array-like) The predicted y values from the test data set

This function uses [sklearn](https://scikit-learn.org/stable/modules/classes.html).metrics to calculate each score. The required functions are imported inside the function.
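
The source of score_classification is not shown in this notebook, so the following is only a minimal sketch of what a helper with the same four parameters might look like. The name `score_classification_sketch` and the particular metrics chosen are assumptions for illustration, not the real implementation in `data_modeling`.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def score_classification_sketch(y_train, y_train_pred, y_test, y_test_pred):
    """Hypothetical stand-in for score_classification (not the real helper)."""
    results = {}
    for split, y_true, y_pred in [("train", y_train, y_train_pred),
                                  ("test", y_test, y_test_pred)]:
        results[split] = {
            "accuracy": accuracy_score(y_true, y_pred),
            # zero_division=0 returns 0 instead of warning when a class is never predicted
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0),
            "confusion_matrix": confusion_matrix(y_true, y_pred),
        }
    return results


# Tiny made-up labels, just to show the call shape
scores = score_classification_sketch([0, 0, 1, 1], [0, 0, 1, 0],
                                     [0, 1, 1, 0], [0, 1, 0, 0])
print(scores["test"])
```

Scoring the training and test sets side by side makes overfitting easy to spot: a large gap between the two sets of numbers is usually the first warning sign.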

### downsample
Takes a DataFrame and the name (string) of its target column and [downsamples](https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data) the majority class to match the size of the minority class.

Parameters:
- df: a Pandas DataFrame containing the data to be downsampled
- target: string. The name of the target variable.

This function uses the Python libraries [Pandas](https://pandas.pydata.org/docs/reference/index.html) (pd), which has been imported above, and [resample](https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html) from the [sklearn](https://scikit-learn.org/stable/modules/classes.html) library, which is imported inside the function.
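
As a rough illustration of the idea, downsampling with `resample` might be sketched as below. This assumes a binary, imbalanced target. The function name `downsample_sketch`, its `random_state` argument, and the toy data are hypothetical and are not taken from `data_modeling`.

```python
import pandas as pd
from sklearn.utils import resample


def downsample_sketch(df, target, random_state=42):
    """Hypothetical stand-in for downsample (binary, imbalanced target assumed)."""
    counts = df[target].value_counts()
    minority = df[df[target] == counts.idxmin()]
    majority = df[df[target] == counts.idxmax()]
    # Draw, without replacement, as many majority rows as there are minority rows
    majority_down = resample(majority, replace=False,
                             n_samples=len(minority),
                             random_state=random_state)
    # Shuffle so the two classes are not stored in separate blocks
    return pd.concat([majority_down, minority]).sample(frac=1, random_state=random_state)


toy = pd.DataFrame({"x": range(10), "isFraud": [0] * 8 + [1] * 2})
balanced = downsample_sketch(toy, "isFraud")
print(balanced["isFraud"].value_counts())
```

If you downsample, apply it to the training split only, so that the validation and test sets keep the class balance the model will face in practice.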

### scaled_model_search 
Takes a list of scalers and models, along with test-train split data, and runs a search over every possible combination of scaler and model. It prints out the best result. Currently the metric used is accuracy, but it would be simple enough to change depending on the situation.

Parameters:
- scalers: a list of initialized scaler objects (ex: scalers = [StandardScaler(), RobustScaler(), QuantileTransformer(random_state = 42)])
- models: a list of initialized model objects (ex: models = [LogisticRegression(), ExtraTreesClassifier(random_state = 42), RandomForestClassifier(random_state = 42)])
- X_train: DataFrame containing the training data set without the target variable
- y_train: DataFrame containing the target variable for the training data.
- X_test: DataFrame containing the test data set without the target variable
- y_test: DataFrame containing the target variable for the test data.

This function uses the [sklearn](https://scikit-learn.org/stable/modules/classes.html) function [accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) as a metric to compare the models, and it has been imported inside the function. It also uses [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) from [sklearn](https://scikit-learn.org/stable/modules/classes.html), which has been imported inside the function.
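
The search described above could be sketched roughly as follows. This is not the code in `data_modeling`: the name `scaled_model_search_sketch`, the returned tuple, and the synthetic data are assumptions made for illustration.

```python
from itertools import product

from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier


def scaled_model_search_sketch(scalers, models, X_train, y_train, X_test, y_test):
    """Hypothetical sketch: try every scaler/model pair, keep the most accurate."""
    best_pipeline, best_score = None, -1.0
    for scaler, model in product(scalers, models):
        # clone() gives each pipeline fresh, unfitted copies of the inputs
        pipe = Pipeline([("scaler", clone(scaler)), ("model", clone(model))])
        pipe.fit(X_train, y_train)
        score = accuracy_score(y_test, pipe.predict(X_test))
        if score > best_score:
            best_pipeline, best_score = pipe, score
    return best_pipeline, best_score


# Synthetic data, only to make the sketch runnable
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
best_pipe, best_acc = scaled_model_search_sketch(
    [StandardScaler(), RobustScaler()],
    [LogisticRegression(), DecisionTreeClassifier(random_state=42)],
    X_tr, y_tr, X_te, y_te,
)
print(best_pipe.named_steps, best_acc)
```

Wrapping the scaler and model in a Pipeline keeps the scaler from ever being fitted on the test data. To compare on a different metric, swap `accuracy_score` for another scoring function.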

## Data
Read in the clean data set from your data_preparation notebook. It should be ready for some preliminary model-building by now, but you should consider your variables and decide if you want to use all of them to train a model. You should have a clear reason for excluding any variables. Also consider time-series data (if applicable to your set). If you have data from multiple years, should you train and test on each year individually? Train on one year and test on another?
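
If you decide to respect time order, one option is to train on earlier transactions and evaluate on later ones. The sketch below assumes a numeric time column such as TransactionDT. The helper name `time_ordered_split`, the 70% cutoff, and the toy values are illustrative assumptions, not part of this project's code.

```python
import pandas as pd


def time_ordered_split(df, time_col, train_frac=0.7):
    """Hypothetical helper: earliest rows go to train, the rest to evaluation."""
    ordered = df.sort_values(time_col)
    cutoff = int(len(ordered) * train_frac)
    return ordered.iloc[:cutoff], ordered.iloc[cutoff:]


# Made-up rows, deliberately out of order
toy = pd.DataFrame({
    "TransactionDT": [86506, 86400, 86499, 86401, 86469,
                      90000, 95000, 99000, 100000, 120000],
    "isFraud": [0, 0, 0, 0, 0, 1, 0, 0, 1, 0],
})
early, late = time_ordered_split(toy, "TransactionDT")
print(early["TransactionDT"].max(), "<=", late["TransactionDT"].min())
```

A chronological split mimics deployment, where the model only ever sees the past. The trade-off is that class proportions are no longer guaranteed to match across splits, as they are with a stratified random split.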

In [66]:
file_path = '../data/interim/combined_df.csv'
data = pd.read_csv(file_path)

In [9]:
data.head()

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,id_29,id_30,id_31,id_32,id_33,id_34,id_35,id_36,id_37,id_38
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,
2,2987002,0,86469,59.0,W,4663,490.0,150.0,visa,166.0,...,,,,,,,,,,
3,2987003,0,86499,50.0,W,18132,567.0,150.0,mastercard,117.0,...,,,,,,,,,,
4,2987004,0,86506,50.0,H,4497,514.0,150.0,mastercard,102.0,...,NotFound,Android 7.0,samsung browser 6.2,32.0,2220x1080,match_status:2,T,F,T,T


## Data Splitting
Once you have an idea of how you plan to use the data, split your data into train and test groups or, if you prefer a more complicated approach, multiple folds. 
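
For the multiple-fold option, a stratified K-fold keeps the fraud rate roughly constant across folds. The following is a minimal sketch on made-up labels with 5% positives. The fold count and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up imbalanced labels: 95 negatives, 5 positives
y = np.array([0] * 95 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_positive_counts = []
for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y)):
    positives = int(y[valid_idx].sum())
    fold_positive_counts.append(positives)
    print(f"fold {fold}: {len(valid_idx)} validation rows, {positives} positive")
```

With plain (unstratified) folds, a class this rare can end up missing from a fold entirely, which makes recall and precision undefined for that fold.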

In [67]:
# Ensure the target column is of type integer for stratification
data['isFraud'] = data['isFraud'].astype(int)

# First split: training (70%) and temp (30%)
train_data, temp_data = train_test_split(data, test_size=0.3, stratify=data['isFraud'], random_state=42)

# Second split: validation (15%) and test (15%) from temp (30%)
valid_data, test_data = train_test_split(temp_data, test_size=0.5, stratify=temp_data['isFraud'], random_state=42)

# Check the distribution of classes within each split
def check_class_distribution(data, dataset_name):
    class_distribution = data['isFraud'].value_counts(normalize=True) * 100
    print(f"Class distribution in {dataset_name}:\n{class_distribution}")

check_class_distribution(data, "Original Data")
check_class_distribution(train_data, "Training Data")
check_class_distribution(valid_data, "Validation Data")
check_class_distribution(test_data, "Test Data")

Class distribution in Original Data:
isFraud
0    96.500999
1     3.499001
Name: proportion, dtype: float64
Class distribution in Training Data:
isFraud
0    96.501023
1     3.498977
Name: proportion, dtype: float64
Class distribution in Validation Data:
isFraud
0    96.500378
1     3.499622
Name: proportion, dtype: float64
Class distribution in Test Data:
isFraud
0    96.501507
1     3.498493
Name: proportion, dtype: float64


## Dummy Model
Before anything else, let's build a baseline model. This will serve as a "sanity check" for everything that comes after. Choose a simplistic model and, without any preprocessing or tuning, train a model on the training set. How well does it perform on the test set?

In [36]:
X_train = train_data.drop('isFraud', axis=1)
y_train = train_data['isFraud']
X_valid = valid_data.drop('isFraud', axis=1)
y_valid = valid_data['isFraud']
X_test = test_data.drop('isFraud', axis=1)
y_test = test_data['isFraud']

In [37]:
dummy_clf = DummyClassifier(strategy='most_frequent')

dummy_clf.fit(X_train, y_train)

y_pred_valid = dummy_clf.predict(X_valid)

y_pred_test = dummy_clf.predict(X_test)

In [38]:
# zero_division=0 avoids undefined-metric warnings: the dummy model never predicts class 1
print("Validation Performance:")
print(classification_report(y_valid, y_pred_valid, zero_division=0))
print("Validation Accuracy:", accuracy_score(y_valid, y_pred_valid))

print("Test Performance:")
print(classification_report(y_test, y_pred_test, zero_division=0))
print("Test Accuracy:", accuracy_score(y_test, y_pred_test))

Validation Performance:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     85481
           1       0.00      0.00      0.00      3100

    accuracy                           0.97     88581
   macro avg       0.48      0.50      0.49     88581
weighted avg       0.93      0.97      0.95     88581

Validation Accuracy: 0.9650037818493807
Test Performance:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     85482
           1       0.00      0.00      0.00      3099

    accuracy                           0.97     88581
   macro avg       0.48      0.50      0.49     88581
weighted avg       0.93      0.97      0.95     88581

Test Accuracy: 0.96501507095201




## Baseline Model
With the dummy classifier as a floor (about 96.5% accuracy, but zero recall on the fraud class), train a few simple models with minimal preprocessing and no tuning. The models below use only the numeric columns, fill missing values with the training-set median, and are scored on the validation set.

In [39]:
# Select only numeric columns for the training data
train_data_numeric = train_data.select_dtypes(include=['number'])

# Select only numeric columns for the validation data
valid_data_numeric = valid_data.select_dtypes(include=['number'])

In [40]:
# Extract features and target from training data
X_train = train_data_numeric.drop('isFraud', axis=1)
y_train = train_data_numeric['isFraud']

# Extract features and target from validation data
X_valid = valid_data_numeric.drop('isFraud', axis=1)
y_valid = valid_data_numeric['isFraud']

In [15]:
# Print columns before imputation to confirm all are present
print("Columns before imputation:", X_train.columns.tolist())

Columns before imputation: ['TransactionID', 'TransactionDT', 'TransactionAmt', 'card1', 'card2', 'card3', 'card5', 'addr1', 'addr2', 'dist1', 'dist2', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11', 'C12', 'C13', 'C14', 'D1', 'D2', 'D3', 'D4', 'D5', 'D6', 'D7', 'D8', 'D9', 'D10', 'D11', 'D12', 'D13', 'D14', 'D15', 'V1', 'V2', 'V3', 'V4', 'v5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'V34', 'V35', 'V36', 'V37', 'V38', 'V39', 'V40', 'V41', 'V42', 'V43', 'V44', 'V45', 'V46', 'V47', 'V48', 'V49', 'V50', 'V51', 'V52', 'V53', 'V54', 'V55', 'V56', 'V57', 'V58', 'V59', 'V60', 'V61', 'V62', 'V63', 'V64', 'V65', 'V66', 'V67', 'V68', 'V69', 'V70', 'V71', 'V72', 'V73', 'V74', 'V75', 'V76', 'V77', 'V78', 'V79', 'V80', 'V81', 'V82', 'V83', 'V84', 'V85', 'V86', 'V87', 'V88', 'V89', 'V90', 'V91', 'V92', 'V93', 'V94', 'V95', 'V96', 

In [41]:
# Identify columns with all NaN values
all_nan_columns = X_train.columns[X_train.isna().all()]

# Drop columns that are entirely NaN
X_train = X_train.drop(columns=all_nan_columns)
X_valid = X_valid.drop(columns=all_nan_columns)

# Create an imputer object with a median filling strategy
imputer = SimpleImputer(strategy='median')

# Fit on the training data and transform it
X_train_imputed = imputer.fit_transform(X_train)
X_valid_imputed = imputer.transform(X_valid)

# Convert back to DataFrame using the columns that were actually imputed
X_train = pd.DataFrame(X_train_imputed, columns=X_train.columns)
X_valid = pd.DataFrame(X_valid_imputed, columns=X_valid.columns)

### Baseline Model 1: Random Forest

In [14]:
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

y_pred_valid = rf_clf.predict(X_valid)

print(classification_report(y_valid, y_pred_valid))
print("Accuracy:", accuracy_score(y_valid, y_pred_valid))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     85539
           1       0.85      0.31      0.45      3042

    accuracy                           0.97     88581
   macro avg       0.91      0.65      0.72     88581
weighted avg       0.97      0.97      0.97     88581

Accuracy: 0.9743850261342726


### Baseline Model 2: Extra Tree

In [18]:
et_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)  # You can adjust parameters
et_clf.fit(X_train, y_train)

In [19]:
y_pred_valid_et = et_clf.predict(X_valid)

print(classification_report(y_valid, y_pred_valid_et))
print("Accuracy:", accuracy_score(y_valid, y_pred_valid_et))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99     85539
           1       0.85      0.29      0.43      3042

    accuracy                           0.97     88581
   macro avg       0.91      0.65      0.71     88581
weighted avg       0.97      0.97      0.97     88581

Accuracy: 0.9739560402343618


### Baseline Model 3: Light GBM

In [42]:
# LGBMClassifier is imported directly above, so there is no `lgb` module alias to call through
lgb_clf = LGBMClassifier(n_estimators=100, random_state=42)

lgb_clf.fit(X_train, y_train)

y_pred_valid_lgb = lgb_clf.predict(X_valid)

print(classification_report(y_valid, y_pred_valid_lgb))
print("Accuracy:", accuracy_score(y_valid, y_pred_valid_lgb))

[LightGBM] [Info] Number of positive: 14464, number of negative: 398914
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.114255 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 38214
[LightGBM] [Info] Number of data points in the train set: 413378, number of used features: 400
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.034990 -> initscore=-3.317083
[LightGBM] [Info] Start training from score -3.317083
              precision    recall  f1-score   support

           0       0.98      1.00      0.99     85481
           1       0.88      0.42      0.57      3100

    accuracy                           0.98     88581
   macro avg       0.93      0.71      0.78     88581
weighted avg       0.98      0.98      0.97     88581

Accuracy: 0.9776475767941206


## Model Improvement
Now you can work on improving on the baseline. There's no linear approach to this process and the steps you take will depend on the data. Below are some steps that are commonly used in building robust models. You can use any, all, or only some of them, and you are encouraged to add your own steps for your specific data set.

As you go through this process, keep in mind all that you learned during the data understanding phase and consider the following questions:
- What sort of model should you train? (e.g., classification, regression? Neural network?)
- Given the distribution of your data, the presence or absence of missing data, and various other factors, is there a particular model (or ensemble) that you think will work well? (e.g., RandomForest, ExtraTrees, SVM...?)
- Depending on what sort of model you train and what your data look like, you may find different evaluation metrics useful. How can you be certain that you have a well-rounded view of how well your model is performing? What metric or metrics will best capture your model priorities (and what are your model priorities)?
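
To see why the choice of metric matters on imbalanced data, the sketch below scores a predictor that always answers "not fraud" on made-up labels with 4% positives. The labels and constant scores are illustrative assumptions. The pattern mirrors the dummy model's results above, where high accuracy came with zero recall on the fraud class.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             recall_score, roc_auc_score)

# Made-up labels: 96 legitimate transactions, 4 fraudulent ones
y_true = np.array([0] * 96 + [1] * 4)

# A "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, zero_division=0)

# A "model" that gives every row the same score cannot rank fraud above non-fraud
constant_scores = np.full(len(y_true), 0.5)
roc_auc = roc_auc_score(y_true, constant_scores)
avg_precision = average_precision_score(y_true, constant_scores)

print(f"accuracy={accuracy:.2f} recall={recall:.2f} "
      f"roc_auc={roc_auc:.2f} average_precision={avg_precision:.2f}")
```

Accuracy looks strong here only because negatives dominate. Recall, ROC AUC, and average precision each expose the problem from a different angle, which is an argument for reporting several metrics rather than one.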

#### **Observations and Ideas:**

**1. Conclusions from Distribution Data:** 

**Significant Variation in Distribution:** The proportion of each 'ProductCD' varies significantly between fraud and non-fraud cases. Product type 'W' makes up a large majority of non-fraud cases but a much smaller proportion of fraud cases. Conversely, 'C' constitutes a much larger fraction of fraud cases relative to its share in non-fraud cases. This indicates that transactions associated with 'ProductCD' 'C' have a higher relative risk of fraud.

**Relative Risk:** Product types H, R and S also show variations in their distributions between fraud and non-fraud cases, though not as pronounced as W and C. H and R are more common in fraud cases relative to their presence in non-fraud cases, suggesting a potentially higher risk of fraud.

**Vulnerability of Product Types:** C and to a lesser extent H and R seem more vulnerable to fraud, or at least, they are more frequently associated with fraud. W, despite being the most common product type in general transactions, has a lower proportional representation in fraud cases.
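
The two views discussed above (fraud rate within each product type, and each product type's share of fraud versus non-fraud cases) can be computed with a groupby and a crosstab. The sketch below uses a tiny made-up frame, so the numbers are illustrative only and are not the project's data.

```python
import pandas as pd

# Made-up rows: 6 'W' transactions (1 fraud) and 4 'C' transactions (2 fraud)
toy = pd.DataFrame({
    "ProductCD": ["W"] * 6 + ["C"] * 4,
    "isFraud": [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# Fraud rate within each product type
fraud_rate = toy.groupby("ProductCD")["isFraud"].mean()

# Share of each product type among non-fraud (column 0) and fraud (column 1) cases
share_within_class = pd.crosstab(toy["ProductCD"], toy["isFraud"], normalize="columns")

print(fraud_rate)
print(share_within_class)
```

Running the same two lines on the training split would show whether a product type is over-represented among fraud cases relative to its share of legitimate traffic.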

**2. Subject Matter Modeling/Analysis Based on ProductCD Types:**

Given the distinct differences in fraud incidence across different ProductCD types, it makes sense to consider separate modeling or analysis for each ProductCD. 

**Tailored Models:** Different product types might involve different mechanisms of fraud and might interact differently with other variables. Building separate models for each ProductCD could allow us to tailor the model to capture these unique dynamics.

**Feature Relevance:** The importance and relevance of certain features might vary by product type. For example, certain features might be strong predictors of fraud in ProductCD C but not in W. Segmenting by ProductCD allows for more precise feature selection and engineering.

**Interpretability and Focus:** By segmenting the data according to ProductCD, you can focus on the specific characteristics and trends of each segment. This focus can improve interpretability and actionable insights, which are crucial in operational settings like fraud detection.

**Improved Performance:** Different distributions of fraud prevalence suggest that combining all product types into a single model might dilute the predictive power of your features relevant to specific types. Separate models can potentially improve detection accuracy and reduce false positives/negatives within each product category.

### 1. Train on the Whole Dataset: 

Initially, train a model using the entire dataset without any segmentation based on ProductCD. Evaluate the performance of this single model to establish a baseline.

### 2. Train on Subsets: 

Next, divide the dataset into five subsets based on the unique values in ProductCD. Train separate models for each subset to capture the distinct characteristics of transactions within each product category. The following strategies will be applied:
   
   
   **A. Prediction Using Individual Models:** Use the individual models to predict fraud within their respective subsets and combine the predictions to form the final output.
   
   **B. Combining Predictions:** Aggregate predictions from all subset models for a given data point to determine the final classification.
   
   **C. Model Ensemble:** Implement ensemble techniques such as stacking or voting to combine the outputs of the subset models into a cohesive prediction.
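
Strategy A can be sketched as a dictionary of per-segment models plus a routing step that sends each row to the model for its own ProductCD. The helper names, the decision-tree stand-in model, and the synthetic data below are all assumptions for illustration. The sketch also assumes every segment contains both classes and that no unseen ProductCD value appears at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def fit_segment_models(df, segment_col, target_col, feature_cols):
    """Hypothetical helper: train one small model per segment value."""
    models = {}
    for code, part in df.groupby(segment_col):
        model = DecisionTreeClassifier(max_depth=3, random_state=42)
        models[code] = model.fit(part[feature_cols], part[target_col])
    return models


def predict_by_segment(models, df, segment_col, feature_cols):
    """Hypothetical helper: score each row with its own segment's model."""
    fraud_prob = pd.Series(np.nan, index=df.index, dtype=float)
    for code, part in df.groupby(segment_col):
        # Assumes this segment was seen in training and had both classes
        fraud_prob.loc[part.index] = models[code].predict_proba(part[feature_cols])[:, 1]
    return fraud_prob


# Synthetic data: two segments whose fraud rule differs
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ProductCD": rng.choice(["W", "C"], size=200),
    "TransactionAmt": rng.normal(size=200),
})
threshold = toy["ProductCD"].map({"W": 1.0, "C": 0.0})
toy["isFraud"] = (toy["TransactionAmt"] > threshold).astype(int)

models = fit_segment_models(toy, "ProductCD", "isFraud", ["TransactionAmt"])
fraud_prob = predict_by_segment(models, toy, "ProductCD", ["TransactionAmt"])
print(fraud_prob.head())
```

Because the combined predictions keep the original row index, they can be scored against the full validation target in one call, which makes a like-for-like comparison with the single whole-dataset model straightforward.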

### 3. Comparison of Approaches: 

Compare the performance of the above approaches based on two key criteria:

   **A. Performance (Scores):** Evaluate the models using metrics such as accuracy, precision, recall, F1-score, and ROC AUC.
   
   **B. Efficiency (Time and Resources):** Assess the computational resources and time required for training and prediction to determine the most efficient approach.

By comparing these **four different approaches**, the goal is to identify the best and most efficient method for credit card fraud detection.
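
For the efficiency criterion, wall-clock time for training and prediction can be recorded with `time.perf_counter`. The helper name `timed_fit_predict` and the synthetic data below are assumptions for illustration. Memory use and other resources would need separate tooling.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def timed_fit_predict(model, X_train, y_train, X_valid):
    """Hypothetical helper: return predictions plus fit/predict wall-clock times."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    preds = model.predict(X_valid)
    predict_seconds = time.perf_counter() - start
    return preds, {"fit_seconds": fit_seconds, "predict_seconds": predict_seconds}


# Synthetic data, only to make the sketch runnable
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2, random_state=42)
preds, timings = timed_fit_predict(
    RandomForestClassifier(n_estimators=20, random_state=42), X_tr, y_tr, X_va
)
print(timings)
```

When timing the per-subset approach, sum the times across all five subset models so the total is comparable with the single whole-dataset model.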

### **Creating 5 subsets based on ProductCD**

In [68]:
def convert_to_hour(transaction_dt):
    seconds_in_a_day = transaction_dt % 86400
    hour_of_the_day = seconds_in_a_day // 3600
    return hour_of_the_day

train_data['TransactionHour'] = train_data['TransactionDT'].apply(convert_to_hour)
valid_data['TransactionHour'] = valid_data['TransactionDT'].apply(convert_to_hour)
print(train_data[['TransactionDT', 'TransactionHour']].head())
print(valid_data[['TransactionDT', 'TransactionHour']].head())

        TransactionDT  TransactionHour
335947        8274219               18
334070        8214909                1
551523       14578069               17
159780        3359536               21
270154        6553025               20
        TransactionDT  TransactionHour
457522       11722566               16
380120        9504286                0
261473        6289526               19
9625           278904                5
465841       11994295               19


In [69]:
# Define a function to create subsets based on ProductCD for any given dataset
def create_product_subsets(df):
    product_subsets = {}
    product_codes = df['ProductCD'].unique()
    for code in product_codes:
        product_subsets[code] = df[df['ProductCD'] == code]
    return product_subsets

# Create subsets for train_data
train_product_subsets = create_product_subsets(train_data)

# Create subsets for valid_data
valid_product_subsets = create_product_subsets(valid_data)

# Now you can access each subset for train and validate sets as needed
W_train_subset = train_product_subsets['W']
C_train_subset = train_product_subsets['C']
H_train_subset = train_product_subsets['H']
R_train_subset = train_product_subsets['R']
S_train_subset = train_product_subsets['S']

W_valid_subset = valid_product_subsets['W']
C_valid_subset = valid_product_subsets['C']
H_valid_subset = valid_product_subsets['H']
R_valid_subset = valid_product_subsets['R']
S_valid_subset = valid_product_subsets['S']

### Check class distribution after separating the subsets

In [70]:
def check_class_distribution(df, subset_name):
    distribution = df['isFraud'].value_counts(normalize=True) * 100
    print(f"Class distribution in {subset_name}:")
    print(distribution)
    print()

# Check class distribution in training subsets
check_class_distribution(W_train_subset, 'W_train_subset')
check_class_distribution(C_train_subset, 'C_train_subset')
check_class_distribution(H_train_subset, 'H_train_subset')
check_class_distribution(R_train_subset, 'R_train_subset')
check_class_distribution(S_train_subset, 'S_train_subset')

# Check class distribution in validation subsets
check_class_distribution(W_valid_subset, 'W_valid_subset')
check_class_distribution(C_valid_subset, 'C_valid_subset')
check_class_distribution(H_valid_subset, 'H_valid_subset')
check_class_distribution(R_valid_subset, 'R_valid_subset')
check_class_distribution(S_valid_subset, 'S_valid_subset')

Class distribution in W_train_subset:
isFraud
0    97.959071
1     2.040929
Name: proportion, dtype: float64

Class distribution in C_train_subset:
isFraud
0    88.333057
1    11.666943
Name: proportion, dtype: float64

Class distribution in H_train_subset:
isFraud
0    95.332701
1     4.667299
Name: proportion, dtype: float64

Class distribution in R_train_subset:
isFraud
0    96.157919
1     3.842081
Name: proportion, dtype: float64

Class distribution in S_train_subset:
isFraud
0    94.24
1     5.76
Name: proportion, dtype: float64

Class distribution in W_valid_subset:
isFraud
0    97.984818
1     2.015182
Name: proportion, dtype: float64

Class distribution in C_valid_subset:
isFraud
0    88.131782
1    11.868218
Name: proportion, dtype: float64

Class distribution in H_valid_subset:
isFraud
0    95.222606
1     4.777394
Name: proportion, dtype: float64

Class distribution in R_valid_subset:
isFraud
0    96.37553
1     3.62447
Name: proportion, dtype: float64

Class distribution i

In [30]:
def process_subset(train_subset, valid_subset, subset_name):
    print(f"\nProcessing subset {subset_name}:")
    
    # 1. Select only numeric columns
    numeric_cols = train_subset.select_dtypes(include=['number']).columns
    train_subset_numeric = train_subset[numeric_cols]
    valid_subset_numeric = valid_subset[numeric_cols]
    
    # Helper function to print scores
    def print_scores(y_true, y_pred, y_prob):
        print("Precision:", precision_score(y_true, y_pred))
        print("Recall:", recall_score(y_true, y_pred))
        print("F1 Score:", f1_score(y_true, y_pred))
        print("ROC AUC Score:", roc_auc_score(y_true, y_prob))
        print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
        print("Classification Report:\n", classification_report(y_true, y_pred))
    
    # Helper function to train and evaluate the model
    def train_and_evaluate_model(X_train, y_train, X_valid, y_valid):
        model = LGBMClassifier(random_state=42)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_valid)
        y_prob = model.predict_proba(X_valid)[:, 1]
        print_scores(y_valid, y_pred, y_prob)
        return model

    # 2. Calculate the precision, recall, f1-score and roc-auc scores before any processing
    y_train = train_subset_numeric['isFraud']
    X_train = train_subset_numeric.drop(columns=['isFraud'])
    y_valid = valid_subset_numeric['isFraud']
    X_valid = valid_subset_numeric.drop(columns=['isFraud'])
    
    print("Scores before any processing:")
    model = train_and_evaluate_model(X_train, y_train, X_valid, y_valid)
    
    # 3. Drop columns with all NaN values
    train_subset_cleaned = train_subset_numeric.dropna(axis=1, how='all')
    valid_subset_cleaned = valid_subset_numeric[train_subset_cleaned.columns]
    
    # 4. Fill remaining NaN values with 1234567
    train_subset_cleaned = train_subset_cleaned.fillna(1234567)
    valid_subset_cleaned = valid_subset_cleaned.fillna(1234567)
    
    y_train = train_subset_cleaned['isFraud']
    X_train = train_subset_cleaned.drop(columns=['isFraud'])
    y_valid = valid_subset_cleaned['isFraud']
    X_valid = valid_subset_cleaned.drop(columns=['isFraud'])
    
    print("Scores after filling NaNs:")
    model = train_and_evaluate_model(X_train, y_train, X_valid, y_valid)
    
    # 6. Standard Scale both training and valid sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_valid_scaled = scaler.transform(X_valid)
    
    print("Scores after scaling:")
    model = train_and_evaluate_model(X_train_scaled, y_train, X_valid_scaled, y_valid)
    
    # 7. Run feature importances
    feature_importances = pd.DataFrame({
        'feature': X_train.columns,
        'importance': model.feature_importances_
    }).sort_values(by='importance', ascending=False).head(20)
    
    print("Top 20 Feature Importances:")
    print(feature_importances)
    
    # 8. Drop columns TransactionID and TransactionDT
    if 'TransactionID' in X_train.columns and 'TransactionDT' in X_train.columns:
        X_train = X_train.drop(columns=['TransactionID', 'TransactionDT'])
        X_valid = X_valid.drop(columns=['TransactionID', 'TransactionDT'])
    
    # Scale again after dropping columns
    X_train_scaled = scaler.fit_transform(X_train)
    X_valid_scaled = scaler.transform(X_valid)
    
    print("Scores after dropping TransactionID and TransactionDT:")
    model = train_and_evaluate_model(X_train_scaled, y_train, X_valid_scaled, y_valid)
    
    # 9. Run random search to find best parameters
    param_dist = {
        'n_estimators': sp_randint(100, 1000),
        'learning_rate': sp_uniform(0.01, 0.2),
        'num_leaves': sp_randint(20, 150),
        'boosting_type': ['gbdt', 'dart'],
        'max_depth': sp_randint(5, 50),
        'subsample': sp_uniform(0.5, 0.5),
        'colsample_bytree': sp_uniform(0.5, 0.5)
    }
    rs_model = LGBMClassifier(random_state=42)
    rs = RandomizedSearchCV(rs_model, param_dist, n_iter=10, scoring='roc_auc', cv=3, random_state=42)
    rs.fit(X_train_scaled, y_train)
    best_model = rs.best_estimator_
    
    y_pred = best_model.predict(X_valid_scaled)
    y_prob = best_model.predict_proba(X_valid_scaled)[:, 1]
    
    print("Best Model Scores after Hyperparameter Tuning:")
    print_scores(y_valid, y_pred, y_prob)

# Apply the processing function to each subset
process_subset(W_train_subset, W_valid_subset, 'W')
process_subset(C_train_subset, C_valid_subset, 'C')
process_subset(H_train_subset, H_valid_subset, 'H')
process_subset(R_train_subset, R_valid_subset, 'R')
process_subset(S_train_subset, S_valid_subset, 'S')


Processing subset W:
Scores before any processing:
[LightGBM] [Info] Number of positive: 6274, number of negative: 301135
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.040699 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 14382
[LightGBM] [Info] Number of data points in the train set: 307409, number of used features: 194
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.020409 -> initscore=-3.871145
[LightGBM] [Info] Start training from score -3.871145
Precision: 0.8672086720867209
Recall: 0.24060150375939848
F1 Score: 0.3766921718658034
ROC AUC Score: 0.9027604015218271
Confusion Matrix:
 [[64620    49]
 [ 1010   320]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99     64669
           1       0.87      0.24      0.38      1330

    accuracy                

In [71]:
def process_subset(train_subset, valid_subset, subset_name):
    print(f"\nProcessing subset {subset_name}:")
    
    # 1. Select only numeric columns
    numeric_cols = train_subset.select_dtypes(include=['number']).columns
    train_subset_numeric = train_subset[numeric_cols]
    valid_subset_numeric = valid_subset[numeric_cols]
    
    # Helper function to print scores
    def print_scores(y_true, y_pred, y_prob):
        print("Precision:", precision_score(y_true, y_pred))
        print("Recall:", recall_score(y_true, y_pred))
        print("F1 Score:", f1_score(y_true, y_pred))
        print("ROC AUC Score:", roc_auc_score(y_true, y_prob))
        print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
        print("Classification Report:\n", classification_report(y_true, y_pred))
    
    # Helper function to train and evaluate the model
    def train_and_evaluate_model(X_train, y_train, X_valid, y_valid, verbose=-1):
        model = LGBMClassifier(random_state=42, verbose=verbose)
        model.fit(X_train, y_train)
        y_pred = model.predict(X_valid)
        y_prob = model.predict_proba(X_valid)[:, 1]
        print_scores(y_valid, y_pred, y_prob)
        return model

    # 2. Calculate the precision, recall, f1-score and roc-auc scores before any processing
    y_train = train_subset_numeric['isFraud']
    X_train = train_subset_numeric.drop(columns=['isFraud'])
    y_valid = valid_subset_numeric['isFraud']
    X_valid = valid_subset_numeric.drop(columns=['isFraud'])
    
    print("Scores before any processing:")
    model = train_and_evaluate_model(X_train, y_train, X_valid, y_valid)
    
    # 3. Drop columns with all NaN values
    train_subset_cleaned = train_subset_numeric.dropna(axis=1, how='all')
    valid_subset_cleaned = valid_subset_numeric[train_subset_cleaned.columns]
    
    # 4. Fill remaining NaN values with 1234567
    train_subset_cleaned = train_subset_cleaned.fillna(1234567)
    valid_subset_cleaned = valid_subset_cleaned.fillna(1234567)
    
    y_train = train_subset_cleaned['isFraud']
    X_train = train_subset_cleaned.drop(columns=['isFraud'])
    y_valid = valid_subset_cleaned['isFraud']
    X_valid = valid_subset_cleaned.drop(columns=['isFraud'])
    
    print("Scores after filling NaNs:")
    model = train_and_evaluate_model(X_train, y_train, X_valid, y_valid)
    
    # 6. Standard Scale both training and valid sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_valid_scaled = scaler.transform(X_valid)
    
    print("Scores after scaling:")
    model = train_and_evaluate_model(X_train_scaled, y_train, X_valid_scaled, y_valid)
    
    # 7. Run feature importances
    feature_importances = pd.DataFrame({
        'feature': X_train.columns,
        'importance': model.feature_importances_
    }).sort_values(by='importance', ascending=False).head(20)
    
    print("Top 20 Feature Importances:")
    print(feature_importances)
    
    # 8. Drop columns TransactionID and TransactionDT
    if 'TransactionID' in X_train.columns and 'TransactionDT' in X_train.columns:
        X_train = X_train.drop(columns=['TransactionID', 'TransactionDT'])
        X_valid = X_valid.drop(columns=['TransactionID', 'TransactionDT'])
    
    # Scale again after dropping columns
    X_train_scaled = scaler.fit_transform(X_train)
    X_valid_scaled = scaler.transform(X_valid)
    
    print("Scores after dropping TransactionID and TransactionDT:")
    model = train_and_evaluate_model(X_train_scaled, y_train, X_valid_scaled, y_valid)
    
    # 9. Run random search to find best parameters
    param_dist = {
        'n_estimators': sp_randint(100, 1000),
        'learning_rate': sp_uniform(0.01, 0.2),
        'num_leaves': sp_randint(20, 150),
        'boosting_type': ['gbdt', 'dart'],
        'max_depth': sp_randint(5, 50),
        'subsample': sp_uniform(0.5, 0.5),
        'colsample_bytree': sp_uniform(0.5, 0.5)
    }
    rs_model = LGBMClassifier(random_state=42, verbose=-1)
    rs = RandomizedSearchCV(rs_model, param_dist, n_iter=10, scoring='roc_auc', cv=3, random_state=42)
    rs.fit(X_train_scaled, y_train)
    best_model = rs.best_estimator_
    
    # Save the best model (the fitted estimator carries its tuned hyperparameters)
    joblib.dump(best_model, f'best_model_{subset_name}.pkl')
    
    y_pred = best_model.predict(X_valid_scaled)
    y_prob = best_model.predict_proba(X_valid_scaled)[:, 1]
    
    print(f"Best Model Scores for {subset_name} after Hyperparameter Tuning:")
    print_scores(y_valid, y_pred, y_prob)

# Apply the processing function to each subset
process_subset(W_train_subset, W_valid_subset, 'W')
process_subset(C_train_subset, C_valid_subset, 'C')
process_subset(H_train_subset, H_valid_subset, 'H')
process_subset(R_train_subset, R_valid_subset, 'R')
process_subset(S_train_subset, S_valid_subset, 'S')


Processing subset W:
Scores before any processing:
Precision: 0.8672086720867209
Recall: 0.24060150375939848
F1 Score: 0.3766921718658034
ROC AUC Score: 0.9027604015218271
Confusion Matrix:
 [[64620    49]
 [ 1010   320]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99     64669
           1       0.87      0.24      0.38      1330

    accuracy                           0.98     65999
   macro avg       0.93      0.62      0.68     65999
weighted avg       0.98      0.98      0.98     65999

Scores after filling NaNs:
Precision: 0.877906976744186
Recall: 0.22706766917293233
F1 Score: 0.3608124253285544
ROC AUC Score: 0.8959613541577893
Confusion Matrix:
 [[64627    42]
 [ 1028   302]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      1.00      0.99     64669
           1       0.88      0.23      0.36      1330

    accuracy                           0.
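
`process_subset` writes only the tuned classifier to disk; the `StandardScaler` fitted on the training data and the list of training columns are not saved. One way to keep them together is sketched below, assuming a scikit-learn `Pipeline` is acceptable here. The data is synthetic, the column names and the file name `bundle_demo.pkl` are hypothetical, and `RandomForestClassifier` stands in for `LGBMClassifier` only to keep the example light.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one product subset (hypothetical columns).
rng = np.random.default_rng(42)
X_train = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y_train = (X_train["f1"] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# The scaler is fitted once, on training data, and travels with the model.
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=20, random_state=42)),
])
pipeline.fit(X_train, y_train)

# Save the fitted pipeline together with the training column order.
bundle = {"pipeline": pipeline, "columns": list(X_train.columns)}
bundle_path = os.path.join(tempfile.mkdtemp(), "bundle_demo.pkl")
joblib.dump(bundle, bundle_path)

# At prediction time: reload, select the training columns in order, predict.
loaded = joblib.load(bundle_path)
X_new = pd.DataFrame(rng.normal(size=(5, 3)), columns=["f3", "f1", "f2"])
X_new = X_new[loaded["columns"]]
probs = loaded["pipeline"].predict_proba(X_new)[:, 1]
print(probs)
```

Because the pipeline calls `transform` (not `fit_transform`) on new data, the test features are scaled with the statistics learned from the training set.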

In [61]:
# Function to create product subsets
def create_product_subsets(df):
    product_subsets = {}
    product_codes = df['ProductCD'].unique()
    for code in product_codes:
        product_subsets[code] = df[df['ProductCD'] == code]
    return product_subsets

# Function to add TransactionHour
def add_transaction_hour(df):
    df['TransactionHour'] = ((df['TransactionDT'] % 86400) // 3600).astype(int)
    return df

# Function to prepare test data
def prepare_test_data(test_subset):
    # Select only numeric columns
    numeric_cols = test_subset.select_dtypes(include=['number']).columns
    if 'isFraud' in numeric_cols:
        numeric_cols = numeric_cols.drop('isFraud')
    test_subset_numeric = test_subset[numeric_cols]

    # Drop columns with all NaN values
    test_subset_cleaned = test_subset_numeric.dropna(axis=1, how='all')

    # Fill remaining NaN values
    test_subset_filled = test_subset_cleaned.fillna(1234567)
    
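    # Standard Scale the test data
    # NOTE: this fits a new scaler on the test subset instead of reusing the
    # scaler fitted on the training data, so the features are not scaled the
    # way the saved model saw them during training.
    scaler = StandardScaler()
    test_subset_scaled = scaler.fit_transform(test_subset_filled)
    
    # Drop columns TransactionID and TransactionDT
    if 'TransactionID' in test_subset_cleaned.columns and 'TransactionDT' in test_subset_cleaned.columns:
        test_subset_cleaned = test_subset_cleaned.drop(columns=['TransactionID', 'TransactionDT'])
    
    # Scale again after dropping columns
    # NOTE: this overwrites the result above and scales test_subset_cleaned
    # (NaNs still present), not test_subset_filled, so the 1234567 fill is lost.
    test_subset_scaled = scaler.fit_transform(test_subset_cleaned)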
    
    return test_subset_scaled


# Add TransactionHour to the test data
test_data = add_transaction_hour(test_data)

# Create subsets based on product type
test_product_subsets = create_product_subsets(test_data)

# Load the best models for each subset
W_model = joblib.load('best_model_W.pkl')
C_model = joblib.load('best_model_C.pkl')
H_model = joblib.load('best_model_H.pkl')
R_model = joblib.load('best_model_R.pkl')
S_model = joblib.load('best_model_S.pkl')

# Prepare the test data for prediction
W_test_subset = prepare_test_data(test_product_subsets['W'])
C_test_subset = prepare_test_data(test_product_subsets['C'])
H_test_subset = prepare_test_data(test_product_subsets['H'])
R_test_subset = prepare_test_data(test_product_subsets['R'])
S_test_subset = prepare_test_data(test_product_subsets['S'])

# Make predictions using the best models for each subset
W_pred = W_model.predict_proba(W_test_subset)[:, 1]
C_pred = C_model.predict_proba(C_test_subset)[:, 1]
H_pred = H_model.predict_proba(H_test_subset)[:, 1]
R_pred = R_model.predict_proba(R_test_subset)[:, 1]
S_pred = S_model.predict_proba(S_test_subset)[:, 1]

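# Aggregate predictions from all subsets
# NOTE: np.concatenate orders the scores by subset (W, C, H, R, S), not by the
# row order of test_data, so the assignment below pairs scores with the wrong
# rows unless test_data happens to be sorted by product in that order.
combined_preds = np.concatenate([W_pred, C_pred, H_pred, R_pred, S_pred])

# Store the combined predicted probabilities
test_data['isFraud_pred'] = combined_preds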

# Evaluate the combined model
y_test = test_data['isFraud']
y_pred = (test_data['isFraud_pred'] > 0.5).astype(int)

print("ROC AUC Score:", roc_auc_score(y_test, test_data['isFraud_pred']))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

ROC AUC Score: 0.49924355452884717
Confusion Matrix:
 [[81492  3990]
 [ 2949   150]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.95      0.96     85482
           1       0.04      0.05      0.04      3099

    accuracy                           0.92     88581
   macro avg       0.50      0.50      0.50     88581
weighted avg       0.93      0.92      0.93     88581
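
The near-chance ROC AUC above is consistent with scores being paired with the wrong rows: `np.concatenate` stacks the predictions subset by subset, while `test_data` keeps its original row order. Below is a minimal sketch of index-aligned aggregation on a toy frame. The toy data and the per-subset scores are made up for illustration, and this is one possible fix rather than the notebook's own method.

```python
import numpy as np
import pandas as pd

# Toy test set with interleaved product codes (hypothetical data).
test_df = pd.DataFrame({
    "ProductCD": ["W", "C", "W", "H", "C", "W"],
    "isFraud":   [0,   1,   0,   0,   1,   1],
})
codes = ["W", "C", "H"]

# Per-product subsets, filtered the same way as create_product_subsets.
subsets = {code: test_df[test_df["ProductCD"] == code] for code in codes}

# Made-up "perfect" scores: 0.9 for fraud rows, 0.1 otherwise.
preds = {code: np.where(sub["isFraud"] == 1, 0.9, 0.1) for code, sub in subsets.items()}

# Concatenating raw arrays loses the link between a score and its row.
misaligned = np.concatenate([preds[code] for code in codes])

# Wrapping each array in a Series with the subset's index keeps that link.
aligned = pd.concat(
    [pd.Series(preds[code], index=subsets[code].index) for code in codes]
).reindex(test_df.index)

test_df["isFraud_pred"] = aligned
print(test_df)
print("misaligned order:", misaligned)
```

In the toy frame the aligned scores match the labels row by row, whereas the concatenated array follows the order W, W, W, C, C, H.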



In [57]:
# Define a function to preprocess the test data
def preprocess_test_data(test_data):
    # Create the 'TransactionHour' column
    test_data['TransactionHour'] = (test_data['TransactionDT'] % 86400) // 3600
    
    # Create subsets based on 'ProductCD'
    product_subsets = create_product_subsets(test_data)
    
    # Preprocessing steps
    def prepare_test_data(test_subset):
        # Select only numeric columns
        numeric_cols = test_subset.select_dtypes(include=['number']).columns
        test_subset_numeric = test_subset[numeric_cols]
        
        # Drop columns with all NaN values
        test_subset_cleaned = test_subset_numeric.dropna(axis=1, how='all')
        
        # Drop target column if present
        if 'isFraud' in test_subset_cleaned.columns:
            test_subset_cleaned = test_subset_cleaned.drop(columns=['isFraud'])
        
        # Fill remaining NaN values
        test_subset_filled = test_subset_cleaned.fillna(1234567)
        
        # Standard scale
        scaler = StandardScaler()
        test_subset_scaled = scaler.fit_transform(test_subset_filled)
        
        # Drop 'TransactionID' and 'TransactionDT' columns
        if 'TransactionID' in test_subset_cleaned.columns:
            test_subset_cleaned = test_subset_cleaned.drop(columns=['TransactionID', 'TransactionDT'])
        
        # Scale again after dropping columns
        test_subset_final_scaled = scaler.fit_transform(test_subset_cleaned)
        
        return test_subset_final_scaled
    
    W_test_subset = prepare_test_data(product_subsets['W'])
    C_test_subset = prepare_test_data(product_subsets['C'])
    H_test_subset = prepare_test_data(product_subsets['H'])
    R_test_subset = prepare_test_data(product_subsets['R'])
    S_test_subset = prepare_test_data(product_subsets['S'])
    
    return W_test_subset, C_test_subset, H_test_subset, R_test_subset, S_test_subset

# Preprocess the test data
W_test_subset, C_test_subset, H_test_subset, R_test_subset, S_test_subset = preprocess_test_data(test_data)

# Load the best models for each subset
W_model = joblib.load('best_model_W.pkl')
C_model = joblib.load('best_model_C.pkl')
H_model = joblib.load('best_model_H.pkl')
R_model = joblib.load('best_model_R.pkl')
S_model = joblib.load('best_model_S.pkl')

# Create a voting classifier
voting_clf = VotingClassifier(
    estimators=[
        ('W', W_model),
        ('C', C_model),
        ('H', H_model),
        ('R', R_model),
        ('S', S_model)
    ],
    voting='soft'  # 'soft' voting uses predicted probabilities, 'hard' voting uses predicted class labels
)

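# Dummy fit on all-zero data. VotingClassifier.fit clones each estimator and refits the
# clones on the data passed here, so the pre-trained models are not reused; voting_clf
# is also not used for the predictions below.
voting_clf.fit(np.zeros((len(W_test_subset), len(W_test_subset[0]))), np.zeros(len(W_test_subset)))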

# Make predictions using the best models for each subset
W_pred = W_model.predict_proba(W_test_subset)[:, 1]
C_pred = C_model.predict_proba(C_test_subset)[:, 1]
H_pred = H_model.predict_proba(H_test_subset)[:, 1]
R_pred = R_model.predict_proba(R_test_subset)[:, 1]
S_pred = S_model.predict_proba(S_test_subset)[:, 1]

# Aggregate predictions from all subsets
combined_preds = np.concatenate([W_pred, C_pred, H_pred, R_pred, S_pred])

# Calculate the final predicted probabilities
test_data['isFraud_pred'] = combined_preds

# Evaluate the combined model
y_test = test_data['isFraud']
y_pred = (test_data['isFraud_pred'] > 0.5).astype(int)

print("ROC AUC Score:", roc_auc_score(y_test, test_data['isFraud_pred']))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

ROC AUC Score: 0.49924355452884717
Confusion Matrix:
 [[81492  3990]
 [ 2949   150]]
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.95      0.96     85482
           1       0.04      0.05      0.04      3099

    accuracy                           0.92     88581
   macro avg       0.50      0.50      0.50     88581
weighted avg       0.93      0.92      0.93     88581
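
Each `prepare_test_data` variant above calls `fit_transform` on the test subset, which re-estimates the mean and standard deviation from the test data. A small sketch of why that matters, using a synthetic single feature whose test distribution is shifted on purpose (the numbers are illustrative assumptions, not values from this data set):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic feature: the test split has a higher mean than the training split.
X_train = rng.normal(loc=100.0, scale=10.0, size=(1000, 1))
X_test = rng.normal(loc=130.0, scale=10.0, size=(1000, 1))

# Fit once, on the training data only.
scaler = StandardScaler().fit(X_train)

# Reusing the fitted scaler keeps the shift visible to the model.
X_test_reused = scaler.transform(X_test)

# Refitting on the test data re-centres it and hides the shift.
X_test_refit = StandardScaler().fit_transform(X_test)

print("mean with reused scaler:", X_test_reused.mean())
print("mean with refit scaler: ", X_test_refit.mean())
```

A model trained on features scaled with the training statistics expects the same mapping at prediction time; refitting the scaler silently changes what each scaled value means.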



In [65]:
# Load the best model for the W subset
W_model = joblib.load('best_model_W.pkl')

# Prepare the W subset from the test data
def prepare_W_test_subset(test_data):
    # Create TransactionHour column
    test_data['TransactionHour'] = (test_data['TransactionDT'] // 3600) % 24

    # Select only numeric columns
    numeric_cols = test_data.select_dtypes(include=['number']).columns
    W_test_subset_numeric = test_data[numeric_cols]

    # Drop columns with all NaN values
    W_test_subset_cleaned = W_test_subset_numeric.dropna(axis=1, how='all')

    # Drop target column
    W_test_subset_cleaned = W_test_subset_cleaned.drop(columns=['isFraud'])

    # Fill remaining NaN values
    W_test_subset_filled = W_test_subset_cleaned.fillna(1234567)

    # Standard Scale
    scaler = StandardScaler()
    W_test_subset_scaled = scaler.fit_transform(W_test_subset_filled.drop(columns=['TransactionID', 'TransactionDT']))

    return W_test_subset_scaled

# Filter the W subset in test data
W_test_subset = test_data[test_data['ProductCD'] == 'W']

# Prepare the W test subset
W_test_subset_prepared = prepare_W_test_subset(W_test_subset)

# Predict using the best model for the W subset
W_test_subset['isFraud_pred'] = W_model.predict_proba(W_test_subset_prepared)[:, 1]

# Evaluate the model (the metric functions are already imported at the top of the notebook)

y_test_W = W_test_subset['isFraud']
y_pred_W = (W_test_subset['isFraud_pred'] > 0.5).astype(int)

print("ROC AUC Score:", roc_auc_score(y_test_W, W_test_subset['isFraud_pred']))
print("Confusion Matrix:\n", confusion_matrix(y_test_W, y_pred_W))
print("Classification Report:\n", classification_report(y_test_W, y_pred_W))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_data['TransactionHour'] = (test_data['TransactionDT'] // 3600) % 24


ROC AUC Score: 0.7466468505376314
Confusion Matrix:
 [[62728  2169]
 [ 1138   227]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.97      0.97     64897
           1       0.09      0.17      0.12      1365

    accuracy                           0.95     66262
   macro avg       0.54      0.57      0.55     66262
weighted avg       0.96      0.95      0.96     66262



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  W_test_subset['isFraud_pred'] = W_model.predict_proba(W_test_subset_prepared)[:, 1]
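
Both warnings above come from assigning new columns to `W_test_subset`, which is a filtered slice of `test_data`. A minimal sketch of the usual remedy, taking an explicit copy before adding columns; the toy frame below is hypothetical:

```python
import pandas as pd

# Toy stand-in for test_data (hypothetical values).
test_df = pd.DataFrame({
    "ProductCD": ["W", "C", "W", "H"],
    "TransactionDT": [86400, 90000, 3600, 7200],
})

# .copy() makes the subset an independent DataFrame, so adding columns
# to it is unambiguous and does not touch test_df.
W_subset = test_df[test_df["ProductCD"] == "W"].copy()
W_subset["TransactionHour"] = (W_subset["TransactionDT"] // 3600) % 24

print(W_subset)
print("TransactionHour" in test_df.columns)
```

The original frame is left unchanged, and later assignments such as a prediction column can be made on the copy in the same way.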


In [74]:
# Define helper functions
def convert_to_hour(df):
    df['TransactionHour'] = (df['TransactionDT'] // 3600) % 24
    return df

def create_product_subsets(df):
    product_subsets = {}
    product_codes = df['ProductCD'].unique()
    for code in product_codes:
        product_subsets[code] = df[df['ProductCD'] == code]
    return product_subsets

def check_class_distribution(df):
    return df['isFraud'].value_counts(normalize=True) * 100

def prepare_test_data(test_subset):
    # Select only numeric columns
    numeric_cols = test_subset.select_dtypes(include=['number']).columns
    test_subset_numeric = test_subset[numeric_cols]

    # Drop columns with all NaN values
    test_subset_cleaned = test_subset_numeric.dropna(axis=1, how='all')

    # Fill remaining NaN values
    test_subset_filled = test_subset_cleaned.fillna(1234567)

    # Drop columns TransactionID and TransactionDT
    if 'TransactionID' in test_subset_filled.columns and 'TransactionDT' in test_subset_filled.columns:
        test_subset_filled = test_subset_filled.drop(columns=['TransactionID', 'TransactionDT'])

    # Separate features and target
    X_test = test_subset_filled.drop(columns=['isFraud'])
    y_test = test_subset['isFraud']

    # Standard Scale
    scaler = StandardScaler()
    X_test_scaled = scaler.fit_transform(X_test)

    return X_test_scaled, y_test

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print("ROC AUC Score:", roc_auc_score(y_test, y_prob))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

# Load the best models for each subset
W_model = joblib.load('best_model_W.pkl')
C_model = joblib.load('best_model_C.pkl')
H_model = joblib.load('best_model_H.pkl')
R_model = joblib.load('best_model_R.pkl')
S_model = joblib.load('best_model_S.pkl')

# Prepare the test data
test_data = convert_to_hour(test_data)
test_product_subsets = create_product_subsets(test_data)

# Process each subset and evaluate
for subset in ['W', 'C', 'H', 'R', 'S']:
    print(f"\nProcessing and evaluating subset {subset}:")
    X_test, y_test = prepare_test_data(test_product_subsets[subset])
    model = joblib.load(f'best_model_{subset}.pkl')
    evaluate_model(model, X_test, y_test)


Processing and evaluating subset W:
ROC AUC Score: 0.7466468505376314
Confusion Matrix:
 [[62728  2169]
 [ 1138   227]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.97      0.97     64897
           1       0.09      0.17      0.12      1365

    accuracy                           0.95     66262
   macro avg       0.54      0.57      0.55     66262
weighted avg       0.96      0.95      0.96     66262


Processing and evaluating subset C:
ROC AUC Score: 0.8510247986545483
Confusion Matrix:
 [[8777  146]
 [ 812  359]]
Classification Report:
               precision    recall  f1-score   support

           0       0.92      0.98      0.95      8923
           1       0.71      0.31      0.43      1171

    accuracy                           0.91     10094
   macro avg       0.81      0.65      0.69     10094
weighted avg       0.89      0.91      0.89     10094


Processing and evaluating subset H:
ROC AUC Score: 0.837441
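
`prepare_test_data` runs `dropna(axis=1, how='all')` on the test subset independently of the training subset, so the two can end up with different columns. A minimal sketch of aligning the test features to the training columns instead; the toy frames are hypothetical, and `1234567` is the fill value already used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frames: column "c" is all-NaN in train, column "b" is all-NaN in test.
train_X = pd.DataFrame({"a": [1.0, 2.0], "b": [np.nan, 3.0], "c": [np.nan, np.nan]})
test_X = pd.DataFrame({"a": [4.0, 5.0], "b": [np.nan, np.nan], "c": [6.0, np.nan]})

# Columns the model was trained on.
train_cols = train_X.dropna(axis=1, how="all").columns

# Dropping all-NaN columns independently on the test set keeps a different set.
test_cols_independent = test_X.dropna(axis=1, how="all").columns

# Reindexing to the training columns keeps the feature layout identical.
test_aligned = test_X.reindex(columns=train_cols).fillna(1234567)

print(list(train_cols), list(test_cols_independent))
print(test_aligned)
```

With the independent drop, the model would receive a feature matrix whose columns no longer line up with the ones it was fitted on.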

In [75]:
input_loc = '../data/raw/'

test_transaction = pd.read_csv(input_loc + 'test_transaction.csv', low_memory=False)
test_identity = pd.read_csv(input_loc + 'test_identity.csv', low_memory=False)

test_identity_sorted = test_identity.sort_values(by="TransactionID", ascending=True)
test_identity_pivoted = test_identity_sorted.pivot(index='TransactionID', columns='variable', values='value')

combined_test_df = pd.merge(test_transaction, test_identity_pivoted, on='TransactionID', how='left')

print(combined_test_df.head())

   TransactionID  TransactionDT  TransactionAmt ProductCD  card1  card2  \
0        3663549       18403224           31.95         W  10409  111.0   
1        3663550       18403263           49.00         W   4272  111.0   
2        3663551       18403310          171.00         W   4476  574.0   
3        3663552       18403310          284.95         W  10989  360.0   
4        3663553       18403317           67.95         W  18018  452.0   

   card3       card4  card5  card6  ...  id_29  id_30  id_31  id_32 id_33  \
0  150.0        visa  226.0  debit  ...    NaN    NaN    NaN    NaN   NaN   
1  150.0        visa  226.0  debit  ...    NaN    NaN    NaN    NaN   NaN   
2  150.0        visa  226.0  debit  ...    NaN    NaN    NaN    NaN   NaN   
3  150.0        visa  166.0  debit  ...    NaN    NaN    NaN    NaN   NaN   
4  150.0  mastercard  117.0  debit  ...    NaN    NaN    NaN    NaN   NaN   

  id_34  id_35  id_36  id_37  id_38  
0   NaN    NaN    NaN    NaN    NaN  
1   NaN   

In [76]:
# Load the combined test data
input_loc = '../data/raw/'
test_transaction = pd.read_csv(input_loc + 'test_transaction.csv', low_memory=False)
test_identity = pd.read_csv(input_loc + 'test_identity.csv', low_memory=False)
combined_test_df = pd.merge(test_transaction, test_identity, on='TransactionID', how='left')

# Define helper functions
def convert_to_hour(df):
    df['TransactionHour'] = (df['TransactionDT'] // 3600) % 24
    return df

def create_product_subsets(df):
    product_subsets = {}
    product_codes = df['ProductCD'].unique()
    for code in product_codes:
        product_subsets[code] = df[df['ProductCD'] == code]
    return product_subsets

def check_class_distribution(df):
    return df['isFraud'].value_counts(normalize=True) * 100

def prepare_test_data(test_subset):
    # Select only numeric columns
    numeric_cols = test_subset.select_dtypes(include=['number']).columns
    test_subset_numeric = test_subset[numeric_cols]

    # Drop columns with all NaN values
    test_subset_cleaned = test_subset_numeric.dropna(axis=1, how='all')

    # Fill remaining NaN values
    test_subset_filled = test_subset_cleaned.fillna(1234567)

    # Drop columns TransactionID and TransactionDT
    if 'TransactionID' in test_subset_filled.columns and 'TransactionDT' in test_subset_filled.columns:
        test_subset_filled = test_subset_filled.drop(columns=['TransactionID', 'TransactionDT'])

    # Separate features and target
    X_test = test_subset_filled.drop(columns=['isFraud'])
    y_test = test_subset['isFraud']

    # Standard Scale
    scaler = StandardScaler()
    X_test_scaled = scaler.fit_transform(X_test)

    return X_test_scaled, y_test

def evaluate_model(model, X_test, y_test):
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    print("ROC AUC Score:", roc_auc_score(y_test, y_prob))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

# Load the best models for each subset
W_model = joblib.load('best_model_W.pkl')
C_model = joblib.load('best_model_C.pkl')
H_model = joblib.load('best_model_H.pkl')
R_model = joblib.load('best_model_R.pkl')
S_model = joblib.load('best_model_S.pkl')

# Prepare the test data
combined_test_df = convert_to_hour(combined_test_df)
test_product_subsets = create_product_subsets(combined_test_df)

# Process each subset and evaluate
for subset in ['W', 'C', 'H', 'R', 'S']:
    print(f"\nProcessing and evaluating subset {subset}:")
    X_test, y_test = prepare_test_data(test_product_subsets[subset])
    model = joblib.load(f'best_model_{subset}.pkl')
    evaluate_model(model, X_test, y_test)


Processing and evaluating subset W:


KeyError: "['isFraud'] not found in axis"
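
The `KeyError` indicates that the frame merged from `test_transaction.csv` and `test_identity.csv` has no `isFraud` column, so `prepare_test_data` can neither drop it nor return it as `y_test`. A minimal sketch of a split that tolerates unlabeled data, assuming the goal in that case is to produce predictions only; the helper `split_features_target` and the toy frames are hypothetical:

```python
import pandas as pd

def split_features_target(df, target="isFraud"):
    """Return (X, y); y is None when the target column is absent."""
    y = df[target] if target in df.columns else None
    # errors="ignore" skips the drop when the column does not exist.
    X = df.drop(columns=[target], errors="ignore")
    return X, y

labeled = pd.DataFrame({"TransactionAmt": [31.95, 49.0], "isFraud": [0, 1]})
unlabeled = pd.DataFrame({"TransactionAmt": [171.0, 284.95]})

X_lab, y_lab = split_features_target(labeled)
X_unl, y_unl = split_features_target(unlabeled)

print(X_lab.columns.tolist(), y_lab.tolist())
print(X_unl.columns.tolist(), y_unl)
```

When `y` is `None`, scoring steps such as `roc_auc_score` would be skipped and only the `predict_proba` output kept.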

### Additional Tuning, Processing, or Model-Improvement
What else can you do to improve your model from the baseline?

## Outcome
At the end of this notebook, you should have a model that performs better than the baseline model. You should be able to explain which steps you took to train it and why you chose each one.