<a href="https://www.kaggle.com/code/annettazheng/eda-rf-xgb-lgbm-prediction?scriptVersionId=143049125" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# American Express Credit Card Default

Predict whether a customer will default in the future, for the [American Express - Default Prediction](https://www.kaggle.com/competitions/amex-default-prediction/overview) challenge.

### A. [Preparation](#Preparation)
* Import Library
* Define Global Variables and Functions

### B. [Exploratory Data Analysis](#Overview)
* [Overview](#Overview)
* [Visualizations](#Visualizations)

### C. [Feature Engineering](#Data-Processing)
* [Metrics Selection](#metric-selection)
* [Explore metrics with Logistic Regression](#Logistic-Regression)
    0. Train and Evaluate
    1. Confusion Matrix
    2. Visualize the Feature Correlation

### D. [Model Training and Evaluation](#Model-Building)
   **Steps:**
   
   0. Get X, y and split train and validation 
   1. Build a Classifier
   2. Hyperparameter Tuning using RandomizedSearchCV
   3. Build the final Classifier Model
   4. Evaluate with Accuracy and Cross Validation 
   5. Visualize the Difference in Actuals and Predictions
    
**Classifiers:**
* [Random Forest](#Random-Forest)
* [XGBoost](#XGB)
* [LightGBM](#LGBM)

### E. [Comparison and Selection](#Comparison)
* Compare and Contrast the performance of Three Classifiers with [Plots](#performance-visualization)

<!--    1. Random Forest
   2. XGBoost
   3. LightGBM -->
   
### F. [Predict for Test Results](#Test-Results)
* Get submission.csv for y_test in chunks

# Preparation

Import libraries.

In [None]:
import os
import warnings
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, mean_absolute_error
from sklearn.model_selection import cross_val_score, TimeSeriesSplit
from sklearn.impute import SimpleImputer
from sklearn.metrics import f1_score, accuracy_score, ConfusionMatrixDisplay

# import tensorflow_decision_forests as tfdf

# !pip install dexplot
# import dexplot as dxp
import seaborn as sns
sns.set(style = "whitegrid", 
        color_codes = True,
        font_scale = 1.5)

warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from xgboost.sklearn import XGBClassifier
import xgboost as xgb
import gc
import pickle
import lightgbm as lgb
from lightgbm import LGBMClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

Define global variables.

In [None]:
TRAIN_DATA_PATH = "../input/amex-default-prediction/train_data.csv"
TRAIN_LABELS_PATH = "../input/amex-default-prediction/train_labels.csv"
TEST_DATA_PATH = "../input/amex-default-prediction/test_data.csv"

Build the results DataFrame and the `amex_metric` function used for model evaluation.

In [None]:
result_model_df = pd.DataFrame(columns = ['model','train_accuracy','valid_accuracy','train_amex_metric','valid_amex_metric'])
def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:

    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
        
    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)
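
To make the second component of `amex_metric` more concrete, here is a minimal NumPy sketch of the top-4% capture rate on toy data. It is an illustration only: the array names and the toy scores are assumptions, and the function mirrors `top_four_percent_captured` above rather than replacing it.

```python
import numpy as np

def top_four_percent_captured(target: np.ndarray, prediction: np.ndarray) -> float:
    """Share of all defaulters found in the top 4% (by weight) of ranked predictions."""
    order = np.argsort(-prediction, kind="stable")   # highest scores first
    target = target[order]
    weight = np.where(target == 0, 20, 1)            # each negative counts 20 times
    cutoff = int(0.04 * weight.sum())
    in_top = np.cumsum(weight) <= cutoff
    return float(target[in_top].sum() / target.sum())

# Toy data (hypothetical): 10 defaulters followed by 90 non-defaulters.
toy_target = np.array([1] * 10 + [0] * 90)
perfect_scores = np.linspace(1.0, 0.0, num=100)   # defaulters ranked first
reversed_scores = perfect_scores[::-1]            # defaulters ranked last

capture_perfect = top_four_percent_captured(toy_target, perfect_scores)
capture_reversed = top_four_percent_captured(toy_target, reversed_scores)
print(capture_perfect, capture_reversed)
```

The metric only depends on how the predictions order the customers, so it is most informative when it receives continuous scores rather than 0/1 labels.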

# Overview

The dataset is large, so we load only the first 40,000 rows of the training data and of the labels.

In [None]:
train_X = pd.read_csv(TRAIN_DATA_PATH, parse_dates=['S_2'], nrows=40000)
train_Y = pd.read_csv(TRAIN_LABELS_PATH, nrows=40000)
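
One side effect of `nrows` is worth keeping in mind: the training file holds one row per customer per statement, so a row limit can cut the last customer's history short. The sketch below shows this on a tiny in-memory CSV; the data and variable names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical miniature of train_data.csv: one row per customer per statement.
csv_text = """customer_ID,S_2,P_2
a,2017-03-01,0.9
a,2017-04-01,0.8
a,2017-05-01,0.7
b,2017-03-01,0.5
b,2017-04-01,0.4
b,2017-05-01,0.3
"""

full = pd.read_csv(io.StringIO(csv_text), parse_dates=["S_2"])
head = pd.read_csv(io.StringIO(csv_text), parse_dates=["S_2"], nrows=4)

# Compare the number of statements per customer with and without the row limit.
rows_full = full.groupby("customer_ID").size()
rows_head = head.groupby("customer_ID").size()
truncated = rows_head[rows_head < rows_full.loc[rows_head.index]].index.tolist()
print(truncated)
```

Note also that the labels file has one row per customer, so the same `nrows` value covers more customers there than in the statement data.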

Let's look at one customer

In [None]:
example_customer_id = "000f1c950ae4e388f44e9ba96dd6334dfa85d8be0416d9d0d30228301f2e4cc4"

In [None]:
customer_data_ex = train_X[train_X["customer_ID"] == example_customer_id]
customer_data_ex

Features are anonymized and normalized. There are 188 variables in total, falling into the following general categories:
- D_* = 96 Delinquency variables
- S_* = 21 Spend variables
- P_* = 3 Payment variables
- B_* = 40 Balance variables
- R_* = 28 Risk variables


with the following features being categorical:
['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']
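
As a quick check of the category counts above, column names can be grouped by their prefix. This is a small sketch on a hypothetical subset of names; on the real frame the same expression would run over `train_X.columns`.

```python
from collections import Counter

# Hypothetical subset of column names; the real frame has 188 feature columns.
columns = ["customer_ID", "S_2", "P_2", "D_39", "B_1", "B_2", "R_1", "S_3", "D_41"]
non_features = {"customer_ID", "S_2"}   # identifier and statement date

# Count features per category by the part of the name before the underscore.
prefix_counts = Counter(
    name.split("_")[0] for name in columns if name not in non_features
)
print(dict(prefix_counts))
```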

In [None]:
all_cols = list(customer_data_ex.columns)
print(all_cols)

In [None]:
b_cols = list(filter(lambda x: x.startswith("B_"), all_cols))
print(b_cols)

Check whether this customer defaults on a future payment.

In [None]:
train_Y[train_Y["customer_ID"] == example_customer_id]

In [None]:
customer_data_ex.loc[:, "S_2"] = pd.to_datetime(customer_data_ex["S_2"])

In [None]:
plt.figure(figsize=(16, 5))
sns.lineplot(data=customer_data_ex, x="S_2", y="P_2")
plt.title("P_2", fontsize=16)
plt.xlabel("date", fontsize=14)
plt.ylabel("P_2", fontsize=14);

## Visualizations

Show only the first 10 customers.

In [None]:
ex_customer_ids = train_Y.iloc[:10]["customer_ID"].tolist()
ex_customer_data = train_X[train_X["customer_ID"].isin(ex_customer_ids)]

In [None]:
ex_customer_data = pd.merge(ex_customer_data, train_Y.iloc[:10], on="customer_ID")
ex_customer_data["S_2"] = pd.to_datetime(ex_customer_data["S_2"])

In [None]:
ex_customer_data.head()

What their feature time series look like:

In [None]:
plt.figure(figsize=(16, 5))
for _, group in ex_customer_data.groupby("customer_ID"):
    sns.lineplot(data=group, x="S_2", y="P_2", label=group["target"].max())
plt.title("P_2", fontsize=16)
plt.xlabel("date", fontsize=14)
plt.ylabel("P_2", fontsize=14);

In [None]:
plt.figure(figsize=(16, 5))
for _, group in ex_customer_data.groupby("customer_ID"):
    sns.lineplot(data=group, x="S_2", y="B_1", label=group["target"].max())
plt.title("B_1", fontsize=16)
plt.xlabel("date", fontsize=14)
plt.ylabel("B_1", fontsize=14);

In [None]:
plt.figure(figsize=(16, 5))
for _, group in ex_customer_data.groupby("customer_ID"):
    sns.lineplot(data=group, x="S_2", y="B_2", label=group["target"].max())
plt.title("B_2", fontsize=16)
plt.xlabel("date", fontsize=14)
plt.ylabel("B_2", fontsize=14);

Let's take 1,000 customers and plot feature histograms for target (default) and non-target customers.

In [None]:
ex_customer_ids = train_Y.iloc[:1000]["customer_ID"].tolist()
ex_customer_data = train_X[train_X["customer_ID"].isin(ex_customer_ids)]
ex_customer_data = pd.merge(ex_customer_data, train_Y.iloc[:1000], on="customer_ID")
ex_customer_data["S_2"] = pd.to_datetime(ex_customer_data["S_2"])

ex_customer_data.shape

We have 735 non-target customers and 265 target customers.

In [None]:
plt.figure(figsize=(16, 5))
sns.countplot(y=ex_customer_data.groupby("customer_ID")["target"].max())
plt.title("Class distribution", fontsize=16)
plt.xlabel("count", fontsize=14)
plt.ylabel("target", fontsize=14);

In [None]:
plt.figure(figsize=(16, 8))
df = ex_customer_data.groupby(["customer_ID"]).size().to_frame(name='counts').reset_index()
df = pd.merge(df, train_Y.iloc[:1000], on="customer_ID", how = 'inner')
res = df.groupby(["counts","target"]).size().to_frame('occurrences').reset_index()
sns.barplot(x= 'counts', y = 'occurrences', hue="target", data=res)
plt.title("Distribution of the number of records per customer", fontsize=16)
plt.xlabel("n_records", fontsize=14)
plt.ylabel("n_customers", fontsize=14);
# px.scatter(ex_customer_data, x="date", y="amount_per_item", color="shop_id",
#            hover_data=["total_items"], title = "Order Amount per Item by Shop")

In [None]:
plt.figure(figsize=(16, 8))

sns.histplot(data=ex_customer_data, x="S_2", bins=100, hue = 'target')
plt.title("Distribution of records by time", fontsize=16)
plt.xlabel("date", fontsize=14)
plt.ylabel("n_records", fontsize=14);

In [None]:
def sort_f(x):
    try:
        a, b = x.split("_")
        return a, int(b)
    except ValueError:
        return "0", 0

all_cols = sorted(all_cols, key=sort_f)

In [None]:
categorical_cols = [
    'B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 
    'D_126', 'D_63', 'D_64', 'D_66', 'D_68',
]

In [None]:
ind = 0
for col in categorical_cols:
    if ind % 4 == 0:
        plt.figure(figsize=(16, 3))
    plt.subplot(1, 4, ind % 4 + 1)
    
    sns.countplot(data=ex_customer_data, x=col, hue="target")
    plt.ylabel("")
    
    if ind % 4 == 3:
        plt.show()
    ind += 1

In [None]:
ind = 0
for col in all_cols:
    if col in ["S_2", "customer_ID", "target"] + categorical_cols:
        continue
    
    if ind % 4 == 0:
        plt.figure(figsize=(16, 4))
    plt.subplot(1, 4, ind % 4 + 1)
    
    sns.histplot(data=ex_customer_data, x=col, hue="target", bins=20)
    plt.ylabel("")
    
    if ind % 4 == 3:
        plt.show()
    
    ind += 1

### Show metrics summary

In [None]:
ex_customer_data[ex_customer_data["target"] == 0][b_cols[:]].describe()

In [None]:
ex_customer_data[ex_customer_data["target"] == 1][b_cols[:]].describe()

In [None]:
d_cols = list(filter(lambda x: x.startswith("D_"), all_cols))
ex_customer_data[ex_customer_data["target"] == 0][d_cols[:]].describe()

In [None]:
ex_customer_data[ex_customer_data["target"] == 1][d_cols[:]].describe()

In [None]:
r_cols = list(filter(lambda x: x.startswith("R_"), all_cols))
ex_customer_data[ex_customer_data["target"] == 0][r_cols[:]].describe()

In [None]:
ex_customer_data[ex_customer_data["target"] == 1][r_cols[:]].describe()

# Data Processing

### Metric Selection

In [None]:
X_mean_cols = [
    "B_37", "B_27",
    "D_42", "D_45", "D_46", "D_47", "D_48", "D_52",
    "D_54", "D_55", "D_59", "D_61", "D_62", "D_96",
    "D_105", "D_112", "D_115", "D_118", "B_32",
    "D_121", "D_122", "D_124", "D_142",
    "P_2", "P_3", 
    "S_3", "S_6", "S_7", "S_11", "S_15", "S_16", "S_17","S_19",
    "S_20", "S_22","S_23", "S_26", "S_27",
    "R_1", "R_3", "R_13", "R_18", "R_27", "R_28"
]
X_last_cols = [
    "B_2", "B_3", "B_4", "B_5", "B_7", "B_9",
    "B_18", "B_20", "B_23", "B_39",
    "D_87", "D_88","D_110", "D_111", "D_119", 
    "D_134", "D_135", "D_136", "D_137", "D_138",
]
categorical_cols = [
    'B_30', 'B_38',
    'D_114', 'D_116', 'D_117', 'D_120', 
    'D_126', 'D_63', 'D_64', 'D_66', 'D_68',
]

### Feature Engineering

In [None]:
def process_cat(df, col):
    one_hot = pd.get_dummies(df[col], prefix = col)
    df = df.drop(col, axis = 1).join(one_hot)
    return df
def get_time(df):
    # Derive calendar features from the statement date S_2.
    df['Date'] = pd.to_datetime(df['S_2'])
    df['month'] = df['Date'].dt.month
    # DatetimeIndex.weekofyear was removed in pandas 2.0; isocalendar() replaces it.
    df['weekofyear'] = df['Date'].dt.isocalendar().week.astype(int)
    df['weekday'] = df['Date'].dt.weekday
    return df
def getX(df, _X_cols = []):
    df = get_time(df)
    _chunk_mean = df.groupby("customer_ID")[X_mean_cols].mean().reset_index()
    _chunk_last = df.groupby("customer_ID")[X_last_cols].last().reset_index()
    _chunk = pd.merge(
        left=_chunk_mean, 
        right=_chunk_last, 
        how="inner",
        on="customer_ID",
        suffixes=("_mean", "_last"),
    )
    _chunk = _chunk.ffill().bfill()
    
    _chunk_dt = df.loc[:, ["customer_ID", 'weekday', 'month']]
    _chunk_dt = process_cat(_chunk_dt, 'weekday')
    _chunk_dt = process_cat(_chunk_dt, 'month')
    _chunk = pd.merge(_chunk, _chunk_dt, on="customer_ID", how="inner")
    
    _chunk_cat = df.loc[:, ["customer_ID"] + categorical_cols]
    _chunk_cat = process_cat(_chunk_cat, 'D_64')
    _chunk_cat = process_cat(_chunk_cat, 'D_63')
    _chunk = pd.merge(_chunk, _chunk_cat, on="customer_ID", how="inner")
    
    if len(_X_cols) > 0:
        missing_cols = set(_X_cols) - set( _chunk.columns )
        for c in missing_cols:
            _chunk[c] = 0
        _chunk = _chunk[_X_cols]
    _chunk = _chunk.fillna(0)
    return _chunk    
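
The core of `getX` is a per-customer aggregation: some columns are summarised by their mean over all statements, others by their last observed value. The following sketch shows that pattern on a toy frame; the values are hypothetical and only two columns are used.

```python
import pandas as pd

# Hypothetical statements: two customers, rows already sorted by date.
statements = pd.DataFrame({
    "customer_ID": ["a", "a", "a", "b", "b"],
    "P_2": [0.2, 0.4, 0.6, 0.9, 0.7],   # summarised by its mean
    "B_2": [1.0, 0.5, 0.0, 0.3, 0.8],   # summarised by its last value
})

grouped = statements.groupby("customer_ID")
features = pd.merge(
    grouped[["P_2"]].mean().reset_index(),
    grouped[["B_2"]].last().reset_index(),
    on="customer_ID",
    how="inner",
)
print(features)
```

The result has exactly one row per customer, which is the shape the aggregated part of the feature table has before the date and categorical dummies are merged in.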

# Logistic Regression

### In-Sample Evaluation

In [None]:
train_df = getX(train_X)
train_df = pd.merge(train_df, train_Y, on="customer_ID", how="left")

X_train = train_df.iloc[:, 1:-1]
Y_train = train_df.iloc[:, -1]
model = LogisticRegression(solver = 'lbfgs')
model.fit(X_train, Y_train)

predictions = model.predict(X_train)  # in-sample predictions on the training rows

print(f"Training F1 Score: {np.mean(f1_score(Y_train, predictions)).round(4)}")
print(f"Training Accuracy: {np.mean(accuracy_score(Y_train, predictions)).round(4)}")
print(cross_val_score(model, X_train, Y_train, cv = TimeSeriesSplit()))

### Confusion Matrix

In [None]:
disp = ConfusionMatrixDisplay.from_estimator(model, X_train, Y_train,
                                             display_labels=['Absent', 'Present'],
                                             cmap='Blues') 
disp.figure_.set_size_inches((7, 7))
disp.ax_.set_title('Model Level 1: Logistic Regression\nIn-Sample Results')
plt.show()
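
For readers who prefer numbers to a plot, the four cells of a binary confusion matrix can be read directly from `confusion_matrix`. This is a small sketch with made-up labels, not the notebook's actual results.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions for six customers.
y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
```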

### Correlation Map

In [None]:
plt.rcParams["figure.figsize"] = (80,80)
corr = train_df.corr(numeric_only=True)  # skip the string customer_ID column
matrix = np.triu(corr)
sns.heatmap(corr, center = 0, annot=True, fmt='.1g', cmap="YlGnBu", mask = matrix)
plt.title("Correlation of the Metrics")


# Model Building 

#### get X, y and split

In [None]:
_X_cols = train_df.columns[1:-1]
print(_X_cols)

X_train, X_valid, y_train, y_valid = train_test_split(
    train_df[_X_cols], train_df["target"], test_size=0.3, 
    random_state = 42, stratify=train_df["target"],
)

X_train.shape, X_valid.shape, y_train.shape, y_valid.shape
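
Because `getX` merges per-statement date dummies back onto the per-customer aggregates, `train_df` may contain several rows per customer, and a row-wise split can then place the same customer in both sets. One possible alternative, shown here only as a sketch on synthetic data, is to split by customer with `GroupShuffleSplit`; the arrays below are assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical layout: 20 customers with 5 statement rows each.
customer_ids = np.repeat(np.arange(20), 5)
X = np.random.default_rng(0).normal(size=(len(customer_ids), 3))
y = (customer_ids % 4 == 0).astype(int)

# Split on customers, not rows, so no customer appears on both sides.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, valid_idx = next(splitter.split(X, y, groups=customer_ids))

shared = set(customer_ids[train_idx]) & set(customer_ids[valid_idx])
print(len(train_idx), len(valid_idx), len(shared))
```

Unlike `train_test_split(..., stratify=...)`, this splitter does not stratify by target, so class balance across the two sets should be checked separately.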

<!-- # model-jump -->

#### Build the model and tune hyperparameters with RandomizedSearchCV; return the fitted search object

In [None]:
def bestModel(classifier, parameters, cv = 5, verbose = 3, n_jobs=-1, random_state=42):
    # Randomized search over `parameters`, fitted on the global X_train / y_train.
    CV = RandomizedSearchCV(classifier, param_distributions = parameters, 
                            cv = cv, verbose = verbose, n_jobs= n_jobs, random_state=random_state)
    CV.fit(X_train, y_train)
    return CV
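
To show what the returned search object exposes, here is a minimal self-contained sketch of `RandomizedSearchCV` on synthetic data. The dataset, grid values, and `n_iter=4` are illustrative assumptions; `bestModel` above keeps the default `n_iter=10`, so it samples only ten parameter combinations from each grid.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic binary classification problem (hypothetical data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

param_distributions = {
    "n_estimators": [10, 20, 30],
    "max_depth": [2, 3, 4],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,          # number of sampled parameter combinations
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# best_params_ holds the winning combination; best_estimator_ is refit on all data.
print(search.best_params_, round(search.best_score_, 3))
```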

#### Evaluate each model with accuracy and amex_metric, and record the results for comparison

In [None]:
def evalModel(model_name, best_model, cv = 2):
    print('Result for ', model_name)
    score = cross_val_score(best_model, X_train, y_train, cv = cv)
    print('Cross Validate Score: ', score)
    
    # Hard class predictions; the same steps apply to every model.
    y_pred_train = best_model.predict(X_train)
    y_pred_valid = best_model.predict(X_valid)
    print("Accuracy for train data: ", accuracy_score(y_train, y_pred_train))
    print("Accuracy for validation data: ", accuracy_score(y_valid, y_pred_valid))

    # Alternative (disabled): score class-1 probabilities instead of hard labels.
    # y_pred_train = best_model.predict_proba(X_train)[:, 1]
    # y_pred_valid = best_model.predict_proba(X_valid)[:, 1]
    # print("f1_score for train:", f1_score(y_train, y_pred_train >= 0.5))
    # print("f1_score for valid:", f1_score(y_valid, y_pred_valid >= 0.5))


    train_amex_metric = amex_metric(
        pd.DataFrame({"target": y_train}).reset_index(drop=True), 
        pd.DataFrame({"prediction": y_pred_train}).reset_index(drop=True),
    )
    valid_amex_metric = amex_metric(
        pd.DataFrame({"target": y_valid}).reset_index(drop=True), 
        pd.DataFrame({"prediction": y_pred_valid}).reset_index(drop=True),
    )
    print('amex_metric ', 'for train', train_amex_metric)
    print('amex_metric ', 'for valid', valid_amex_metric)
    
#     'random forest'
    r_list = [model_name, accuracy_score(y_train, y_pred_train), accuracy_score(y_valid, y_pred_valid),
              train_amex_metric,valid_amex_metric]
    result_model_df.loc[len(result_model_df)] = r_list
    

    fig, axs = plt.subplots(1, 2, figsize=(18, 6))
    fig.suptitle('y_pred train vs valid')
    axs[0].hist(y_pred_train, bins=50)
    axs[1].hist(y_pred_valid, bins=50)
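
`evalModel` passes hard class labels from `predict` to both accuracy and `amex_metric`. Ranking metrics generally carry more information when given continuous scores, so the sketch below contrasts `predict` with `predict_proba` on synthetic data. The model and data here are stand-ins chosen for illustration, not the notebook's classifiers.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the training table (hypothetical data).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

hard_labels = clf.predict(X_demo)            # values in {0, 1}
scores = clf.predict_proba(X_demo)[:, 1]     # probability of class 1

# A rank-based metric sees only two distinct values with hard labels.
auc_labels = roc_auc_score(y_demo, hard_labels)
auc_scores = roc_auc_score(y_demo, scores)
print(len(np.unique(hard_labels)), len(np.unique(scores)))
```

The test-set section later in the notebook already uses `predict_proba(...)[:, 1]`, so scoring validation data the same way would keep the two stages consistent.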

# Random Forest

In [None]:
rfc = RandomForestClassifier()

params = {
    "n_estimators": [50, 100, 150, 200, 250], 
    "max_depth": [3, 5, 8, 10, 12],
#     'min_sample_leaf':[1, 50, 100, 200, 400, 500],
#     'max_terminal_nodes':[0, 5, 10, 15, 20, 25],
    'class_weight':['balanced', 'balanced_subsample'],
    'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
    'criterion': ['gini','entropy','log_loss'],
    'random_state':[42]
}

CV_rfc = bestModel(rfc, params)
print(CV_rfc.best_params_)
best_rfc = CV_rfc.best_estimator_

In [None]:
evalModel('random forest', best_rfc)

In [None]:
# rfc = RandomForestClassifier(**CV_rfc.best_estimator_)
# rfc0 = time.time()
# rfc.fit(X_train, y_train)
# rfc1 = time.time()-rfc0
# # print("Training time:", rfc1)

# pred_train = rfc.predict(X_train)
# print("Accuracy for Random Forest on train data: ", accuracy_score(y_train, pred_train))
# pred_valid = rfc.predict(X_valid)
# print("Accuracy for Random Forest on validation data: ", accuracy_score(y_valid, pred_valid))

In [None]:
# y_pred_train = rfc.predict_proba(X_train)[:, 1]
# y_pred_valid = rfc.predict_proba(X_valid)[:, 1]
# print("f1_score for train:", f1_score(y_train, y_pred_train >= 0.5))
# print("f1_score for valid:", f1_score(y_train, y_pred_valid >= 0.5))

In [None]:
# rfc_train_amex_metric = amex_metric(
#     pd.DataFrame({"target": y_train}).reset_index(drop=True), 
#     pd.DataFrame({"prediction": y_pred_train}).reset_index(drop=True),
# )
# rfc_valid_amex_metric = amex_metric(
#     pd.DataFrame({"target": y_valid}).reset_index(drop=True), 
#     pd.DataFrame({"prediction": y_pred_valid}).reset_index(drop=True),
# )
# print('amex_metric', 'for train', rfc_train_amex_metric)
# print('amex_metric', 'for valid', rfc_valid_amex_metric)

# rfc_list = ['random forest', accuracy_score(y_train, pred_train), accuracy_score(y_valid, pred_valid),
#             rfc_train_amex_metric,rfc_valid_amex_metric, rfc1]
# result_model_df.loc[len(result_model_df)] = rfc_list

# fig, axs = plt.subplots(1, 2, figsize=(18, 6))
# fig.suptitle('y_pred train vs valid')
# axs[0].hist(y_pred_train, bins=50)
# axs[1].hist(y_pred_valid, bins=50)

# XGB

In [None]:
# Use a distinct name so the classifier does not shadow the `xgb` module import.
xgb_clf = XGBClassifier()
params = {
    'n_estimators' : [50, 100, 150, 200, 250], 
    'learning_rate' : [0.01, 0.1, 0.25, 0.5, 1],
    'max_depth' : [ 3, 4, 5, 6, 8, 10, 12, 15],
    'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0],
    # 'min_child_samples' and 'num_leaves' are LightGBM parameter names, so they are omitted here.
    'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0], 
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'gamma': [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
    'colsample_bytree' : [ 0.3, 0.4, 0.5 , 0.7 ],
    'random_state' : [42]
}
CV_xgb = bestModel(xgb_clf, params)
print(CV_xgb.best_params_)
best_xgb = CV_xgb.best_estimator_

In [None]:
evalModel('xgboost', best_xgb)

In [None]:
# xgb = XGBClassifier(**CV_xgb.best_params_)
# xgb0 = time.time()
# xgb.fit(X_train, y_train, **opt_params)
# xg1 = time.time()-xgb0

# y_pred_train = xgb.predict(X_train, num_iteration = xgb.best_iteration_)
# y_pred_valid = xgb.predict(X_valid, num_iteration = xgb.best_iteration_)
# print("Accuracy for xgboost on train data: ", accuracy_score(y_train, y_pred_train))
# print("Accuracy for xgboost on validation data: ", accuracy_score(y_valid, y_pred_valid))

In [None]:
# train_amex_metric = amex_metric(
#     pd.DataFrame({"target": y_train}).reset_index(drop=True), 
#     pd.DataFrame({"prediction": y_pred_train}).reset_index(drop=True),
# )
# valid_amex_metric = amex_metric(
#     pd.DataFrame({"target": y_valid}).reset_index(drop=True), 
#     pd.DataFrame({"prediction": y_pred_valid}).reset_index(drop=True),
# )
# print('amex_metric', 'for train', train_amex_metric)
# print('amex_metric', 'for valid', valid_amex_metric)

# xgb_list = ['xgboost', accuracy_score(y_train, y_pred_train), accuracy_score(y_valid, y_pred_valid),
#             train_amex_metric,valid_amex_metric, xgb1]
# result_model_df.loc[len(result_model_df)] = xgb_list

# fig, axs = plt.subplots(1, 2, figsize=(18, 6))
# fig.suptitle('y_pred train vs valid')
# axs[0].hist(y_pred_train, bins=50)
# axs[1].hist(y_pred_valid, bins=50)

# LGBM

In [None]:
gbm = LGBMClassifier()
params = {
    'n_estimators' : [50, 100, 150, 200, 250], 
    'learning_rate' : [0.01, 0.1, 0.25, 0.5, 1],
    'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0],
    'min_child_samples':[20, 200, 1000, 2000, 2500],
    'num_leaves':[31, 61, 91, 121],
    'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0], 
    'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'max_bins':[100, 200, 500, 750, 1000],
    'random_state' : [42]
}
CV_gbm = bestModel(gbm, params)
print(CV_gbm.best_params_)
best_gbm = CV_gbm.best_estimator_

In [None]:
evalModel('lightgbm', best_gbm)

In [None]:
# gbm = LGBMClassifier()
# params = {
#     'n_estimators' : [50, 100, 150, 200, 250], 
#     'learning_rate' : [0.01, 0.1, 0.25, 0.5, 1],
#     'reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0, 100.0],
#     'min_child_samples':[20, 200, 1000, 2000, 2500],
#     'num_leaves':[31, 61, 91, 121],
#     'min_child_weight': [0.5, 1.0, 3.0, 5.0, 7.0, 10.0], 
#     'subsample': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
#     'max_bins':[100, 200, 500, 750, 1000],
#     'random_state' : [42]
# }
# CV_gbm = RandomizedSearchCV(gbm, params, verbose = 3, cv = 2, n_jobs=-1, random_state=42)
# CV_gbm.fit(X_valid, y_valid)

In [None]:
# gbm = LGBMClassifier(random_state=42, subsample = 0.7)

# params = {
#     'n_estimators' : [100, 250], 
#     'learning_rate' : [0.25, 1],
#     'reg_lambda': [5.0, 50.0],
#     'min_child_samples':[200, 2500],
#     'num_leaves':[121, 200],
#     'min_child_weight': [0.5, 1.0], 
#     'max_bins':[100, 500]
# }
# CV_gbm = RandomizedSearchCV(gbm, params, cv = 2, verbose = 3)
# CV_gbm.fit(X_train, y_train)

In [None]:
# best_gbm = CV_gbm.best_estimator_

In [None]:

# score=cross_val_score(best_gbm, X_valid, y_valid, cv=10)
# print(score)

In [None]:
# gbm = LGBMClassifier(**CV_gbm.best_params_)
# #     n_estimators=100,
# #                      learning_rate=1, 
# #                      reg_lambda=50,
# #                      min_child_samples=2400,
# #                      num_leaves=95,
# #                      min_child_weight=0.5,
# #                      max_bins=511, 
# #                      random_state=42)

# gbm0 = time.time()
# gbm.fit(X_valid, y_valid)
# gbm1 = time.time()-gbm0

In [None]:
# gbm = LGBMClassifier(**CV_gbm.best_params_)
# #     n_estimators=100,
# #                      learning_rate=1, 
# #                      reg_lambda=50,
# #                      min_child_samples=2400,
# #                      num_leaves=95,
# #                      min_child_weight=0.5,
# #                      max_bins=511, 
# #                      random_state=42)

# gbm0 = time.time()
# gbm.fit(X_train, y_train)
# gbm1 = time.time()-gbm0

In [None]:
# y_pred_train = best_gbm.predict(X_train, num_iteration = gbm.best_iteration_)
# y_pred_valid = best_gbm.predict(X_valid, num_iteration = gbm.best_iteration_)
# print("Accuracy for lightgbm on train data: ", accuracy_score(y_train, y_pred_train))
# print("Accuracy for lightgbm on validation data: ", accuracy_score(y_valid, y_pred_valid))

In [None]:
# gbm_train_amex_metric = amex_metric(
#     pd.DataFrame({"target": y_train}).reset_index(drop=True), 
#     pd.DataFrame({"prediction": y_pred_train}).reset_index(drop=True),
# )
# gbm_valid_amex_metric = amex_metric(
#     pd.DataFrame({"target": y_valid}).reset_index(drop=True), 
#     pd.DataFrame({"prediction": y_pred_valid}).reset_index(drop=True),
# )
# print('amex_metric', 'for train', gbm_train_amex_metric)
# print('amex_metric', 'for valid', gbm_valid_amex_metric)

# gbm_list = ['lightgbm', accuracy_score(y_train, y_pred_train), accuracy_score(y_valid, y_pred_valid),
#             rfc_train_amex_metric,rfc_valid_amex_metric, gbm1]
# result_model_df.loc[len(result_model_df)] = gbm_list

# fig, axs = plt.subplots(1, 2, figsize=(18, 6))
# fig.suptitle('y_pred train vs valid')
# axs[0].hist(y_pred_train, bins=50)
# axs[1].hist(y_pred_valid, bins=50)

# Comparison

#### view the performance_df

In [None]:
result_model_df

In [None]:
# saved_model = '/kaggle/input/eval-models-df-saved-from-v12/saved.csv'
# result_model_df = pd.read_csv(saved_model)
# result_model_df

In [None]:
result_model_df.to_csv("save_model_result.csv", index=False)
# res_df.head()

#### performance visualization

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))        
result_model_df.plot.bar(ax=ax, x='model', title='performance by model')   
# x='model',
#         kind='bar',
#         stacked=False,
#         )
plt.ylim([0.988, 1.001])
plt.show()

In [None]:
# fig, axs = plt.subplots(1, 2,figsize=(40,10))
# e_range = [['train_accuracy','valid_accuracy'],['train_amex_metric','valid_amex_metric']]
# range_p = 2

# for i in range(range_p):
#     axs[k].bar(x + offset, measurement, width, label=attribute)

In [None]:
# fig, axs = plt.subplots(1, 2,figsize=(40,10))
# md = result_model_df['model'].values

# e_range = [['train_accuracy','valid_accuracy'],['train_amex_metric','valid_amex_metric']]
# k = 0
# for i in e_range:
#     x = np.arange(len(i))  # the label locations
#     width = 0.1  # the width of the bars
#     multiplier = 0
#     means = result_model_df[i]
    
#     for attribute, measurement in means.items():
#         print(attribute, measurement)
#         offset = width * multiplier
#         rects = axs[k].bar(x + offset, measurement, width, label=attribute)
#         axs[k].bar_label(rects, padding=5)
#         multiplier += 1
    
#     axs[k].set_ylabel(i)
#     axs[k].set_title(i[0][6:] + ' by models')
# #     axs[k].set_xticks(x * width)
#     axs[k].legend(loc = 'upper right')
#     k+=1
# plt.show()

In [None]:
# print(result_model_df.columns)
# print([result_model_df[i].values for i in result_c])

In [None]:
# #define points values by group
# # A = result_model_df.loc[result_model_df['model'] == 'lightgbm', ['train_accuracy','valid_accuracy']]
# # B = result_model_df.loc[result_model_df['model'] == 'lightgbm', ['train_amex_metric','valid_amex_metric']]
# # B = result_model_df.loc[result_model_df['model'] == 'random forest', ['train_accuracy','valid_accuracy']]
# result_c = ['train_accuracy', 'valid_accuracy', 'train_amex_metric','valid_amex_metric', 'train_time']
# #add three histograms to one plot
# plt.bar(result_c, [result_model_df[i] for i in result_c],alpha=0.5, label=result_model_df['model'])
# # plt.hist(B, alpha=0.5, label='random forest')

# #add plot title and axis labels
# plt.title('Points Distribution by Team')
# plt.xlabel('Points')
# plt.ylabel('Frequency')

# #add legend
# plt.legend(title='Team')

# #display plot
# plt.show()

In [None]:
# result_model_df.hist(['train_accuracy', 'valid_accuracy'],by='model', legend=True, figsize=(40,20))

In [None]:
# test_df = pd.read_csv(TEST_DATA_PATH,
#                       usecols=["customer_ID"] + ['S_2'] + X_mean_cols + X_last_cols + categorical_cols)

# feature_importances = model.best_estimator_.feature_importances_
# vis_indexes = list(range(len(feature_importances)))
# vis_indexes = sorted(vis_indexes, key=lambda x: -feature_importances[x])
# plt.figure(figsize=(20, 10))
# sn.barplot(
#     x=feature_importances[vis_indexes], 
#     y=_X_cols[vis_indexes],
# )
# plt.yticks(fontsize=14);
# using the feature importance variable
# feature_imp = pd.Series(rfc.best_estimator_.feature_importances_, 
#                         index = _X_cols).sort_values(ascending = False)
# feature_imp.show()


# Test Results

We choose LightGBM because it has the highest accuracy and the highest amex_metric.

In [None]:
model = best_gbm

Iterate over chunks of test data and make predictions for them

In [None]:
chunksize = 100000
test_df = pd.read_csv(TEST_DATA_PATH, 
                      chunksize=chunksize, 
                      parse_dates=['S_2'],
                      usecols=["customer_ID"] + ['S_2'] + X_mean_cols + X_last_cols + categorical_cols)

In [None]:
_index = []
_vals = []

for chunk in test_df:
    chunk['month'] = pd.DatetimeIndex(chunk['S_2']).month
    chunk['weekday'] = pd.DatetimeIndex(chunk['S_2']).weekday

    _chunk_mean = chunk.groupby("customer_ID")[X_mean_cols].mean().reset_index()
    _chunk_last = chunk.groupby("customer_ID")[X_last_cols].last().reset_index()
    _chunk = pd.merge(
        left=_chunk_mean, 
        right=_chunk_last, 
        how="inner",
        on="customer_ID",
        suffixes=("_mean", "_last"),
    )
    _chunk = _chunk.ffill().bfill()
    
    _chunk_dt = chunk.loc[:, ["customer_ID", 'weekday', 'month']]
    _chunk_dt = process_cat(_chunk_dt, 'weekday')
    _chunk_dt = process_cat(_chunk_dt, 'month')
    _chunk = pd.merge(_chunk, _chunk_dt, on="customer_ID", how="inner")
    
    _chunk_cat = chunk.loc[:, ["customer_ID"] + categorical_cols]
    _chunk_cat = process_cat(_chunk_cat, 'D_64')
    _chunk_cat = process_cat(_chunk_cat, 'D_63')
    _chunk = pd.merge(_chunk, _chunk_cat, on="customer_ID", how="inner")
    
    missing_cols = set(_X_cols) - set(_chunk.columns)
    for c in missing_cols:
        _chunk[c] = 0
    
    X_test = _chunk[_X_cols]
    X_test = X_test.fillna(0)
    y_test_pred = model.predict_proba(X_test)[:, 1]
    _index.extend(_chunk["customer_ID"])
    _vals.extend(y_test_pred)
    print(len(_index))

In [None]:
res_df = pd.DataFrame(
    {"customer_ID": _index, "prediction": _vals}
).groupby("customer_ID")["prediction"].mean().reset_index()

print(res_df.shape)
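
The final `groupby(...).mean()` matters because a customer's statements can straddle a chunk boundary, or produce several feature rows, and so receive more than one prediction. The sketch below illustrates the averaging step with made-up per-chunk outputs.

```python
import pandas as pd

# Hypothetical per-chunk predictions; customer "b" straddles the chunk boundary.
chunk_predictions = [
    pd.DataFrame({"customer_ID": ["a", "b"], "prediction": [0.10, 0.40]}),
    pd.DataFrame({"customer_ID": ["b", "c"], "prediction": [0.60, 0.90]}),
]

# Stack all chunks, then average so each customer appears exactly once.
combined = pd.concat(chunk_predictions, ignore_index=True)
submission = combined.groupby("customer_ID")["prediction"].mean().reset_index()
print(submission)
```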

In [None]:
res_df.to_csv("submission.csv", index=False)
res_df.head()