### In this notebook, we use pandas to create a bunch of useful features and train XGB models. The entire pipeline is slow due to CPU.

### What you might find useful from this notebook:
### - After-pay features. It makes intuitive sense that subtracting the payments from balance/spend etc provides new information about the users' behavior.
### - Feature selection and hyperparameter tuning. Hundreds of GPU hours are burned to get these numbers.

In [1]:
import cudf
cudf.__version__

'24.08.03'

In [2]:
#%load_ext cudf.pandas

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:

import pandas as pd
from time import time
import xgboost as xgb
import numpy as np
from tqdm import tqdm
from collections import Counter, defaultdict

### Data Overview

In [5]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/train.parquet
/kaggle/input/train_labels.csv
/kaggle/input/test.parquet


```
The objective of this competition is to predict the probability that a customer does not pay back their credit card balance amount in the future based on their monthly customer profile. The target binary variable is calculated by observing 18 months performance window after the latest credit card statement, and if the customer does not pay due amount in 120 days after their latest statement date it is considered a default event.
```

```
Your task is to predict, for each customer_ID, the probability of a future payment default (target = 1).

Note that the negative class has been subsampled for this dataset at 5%, and thus receives a 20x weighting in the scoring metric.
```

```
Evaluation
The evaluation metric,
𝑀, for this competition is the mean of two measures of rank ordering: Normalized Gini Coefficient,
𝐺, and default rate captured at 4%,
𝐷, The default rate captured at 4% is the percentage of the positive labels (defaults) captured within the highest-ranked 4% of the predictions, and represents a Sensitivity/Recall statistic.

𝑀=0.5⋅(𝐺+𝐷)

The default rate captured at 4% is the percentage of the positive labels (defaults) captured within the highest-ranked 4% of the predictions, and represents a Sensitivity/Recall statistic.
```

In [6]:
train_label = pd.read_csv('/kaggle/input/train_labels.csv')
train_label.head()

Unnamed: 0,customer_ID,target
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,0
1,00000fd6641609c6ece5454664794f0340ad84dddce9a2...,0
2,00001b22f846c82c51f6e3958ccd81970162bae8b007e8...,0
3,000041bdba6ecadd89a52d11886e8eaaec9325906c9723...,0
4,00007889e4fcd2614b6cbe7f8f3d2e5c728eca32d9eb8a...,0


In [7]:
train_label.target.value_counts()

target
0    340085
1    118828
Name: count, dtype: int64

```
The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

D_* = Delinquency variables
S_* = Spend variables
P_* = Payment variables
B_* = Balance variables
R_* = Risk variables
```

In [8]:
%%time
df = pd.read_parquet('/kaggle/input/train.parquet')

CPU times: user 12.7 s, sys: 24.7 s, total: 37.3 s
Wall time: 2.96 s


In [9]:
df.head()

Unnamed: 0,customer_ID,S_2,P_2,D_39,B_1,B_2,R_1,S_3,D_41,B_3,...,D_136,D_137,D_138,D_139,D_140,D_141,D_142,D_143,D_144,D_145
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-03-09,0.938469,0,0.008724,1.006838,0.009228,0.124035,0.0,0.004709,...,-1,-1,-1,0,0,0.0,,0,0.00061,0
1,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-04-07,0.936665,0,0.004923,1.000653,0.006151,0.12675,0.0,0.002714,...,-1,-1,-1,0,0,0.0,,0,0.005492,0
2,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-05-28,0.95418,3,0.021655,1.009672,0.006815,0.123977,0.0,0.009423,...,-1,-1,-1,0,0,0.0,,0,0.006986,0
3,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-06-13,0.960384,0,0.013683,1.0027,0.001373,0.117169,0.0,0.005531,...,-1,-1,-1,0,0,0.0,,0,0.006527,0
4,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2017-07-16,0.947248,0,0.015193,1.000727,0.007605,0.117325,0.0,0.009312,...,-1,-1,-1,0,0,0.0,,0,0.008126,0


### Select two columns from each group

In [10]:
tag_groups = defaultdict(list)
lst = df.columns
cols = [s for s in lst if len(tag_groups[s.split('_')[0]]) < 2 and not tag_groups[s.split('_')[0]].append(s)]
cols = sorted(cols, reverse=True)
df.head()[cols]

Unnamed: 0,customer_ID,S_3,S_2,R_2,R_1,P_3,P_2,D_41,D_39,B_2,B_1
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,0.124035,2017-03-09,0,0.009228,0.736463,0.938469,0.0,0,1.006838,0.008724
1,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,0.12675,2017-04-07,0,0.006151,0.720886,0.936665,0.0,0,1.000653,0.004923
2,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,0.123977,2017-05-28,0,0.006815,0.738044,0.95418,0.0,3,1.009672,0.021655
3,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,0.117169,2017-06-13,0,0.001373,0.741813,0.960384,0.0,0,1.0027,0.013683
4,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,0.117325,2017-07-16,0,0.007605,0.691986,0.947248,0.0,0,1.000727,0.015193


### Feature Engineering

#### Groupby-Aggregation-Merge feature engineering

<a href="link.com"><img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/03.08-split-apply-combine.png" alt="Alt text"></a>

#### Difference between two features

```python
# compute "after pay" features
        for bcol in [f'B_{i}' for i in [11,14,17]]+['D_39','D_131']+[f'S_{i}' for i in [16,23]]:
            for pcol in ['P_2','P_3']:
                if bcol in df.columns:
                    df[f'{bcol}-{pcol}'] = df[bcol] - df[pcol]
```

#### Found by random search with cross-validation 

```
I launch a random search of many trials, where in each trial a small number of non-payment numeric features are randomly selected to subtract payment features and run the cross-validation again along with the original features. If the cross-validation score beats the baseline and the new after-pay features end up in the top 10 most important features, they are marked as good features for subtracting payments. After running hundreds of trials, only a handful of such good features met this criterion. I just took all the good features and put them together in a list and train the model again as the notebook does.
```

#### Cross validation

<a href="link.com"><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c7/LOOCV.gif/800px-LOOCV.gif?20200304222121" alt="Alt text"></a>

In [11]:
def get_not_used():
    # cid is the label encode of customer_ID
    # row_id indicates the order of rows
    return ['row_id', 'customer_ID', 'target', 'cid', 'S_2']
    
def preprocess(df, FEATURE_ENGINEERING):
    df['row_id'] = np.arange(df.shape[0])
    not_used = get_not_used()
    cat_cols = ['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120',
                'D_126', 'D_63', 'D_64', 'D_66', 'D_68']

    if FEATURE_ENGINEERING=='on':
        for col in df.columns:
            if col not in not_used+cat_cols:
                df[col] = df[col].round(2)
    
        # compute "after pay" features
        for bcol in [f'B_{i}' for i in [11,14,17]]+['D_39','D_131']+[f'S_{i}' for i in [16,23]]:
            for pcol in ['P_2','P_3']:
                if bcol in df.columns:
                    df[f'{bcol}-{pcol}'] = df[bcol] - df[pcol]

    df['S_2'] = pd.to_datetime(df['S_2'])
    df['cid'], _ = df.customer_ID.factorize()
        
    num_cols = [col for col in df.columns if col not in cat_cols+not_used]
    
    dgs = add_stats_step(df, num_cols)
        
    # merge might change row orders
    # restore the original row order by sorting row_id
    df = df.sort_values('row_id')
    df = df.drop(['row_id'],axis=1)
    return df, dgs

def add_stats_step(df, cols):
    n = 50
    dgs = []
    for i in range(0,len(cols),n):
        s = i
        e = min(s+n, len(cols))
        dg = add_stats_one_shot(df, cols[s:e])
        dgs.append(dg)
    return dgs

def add_stats_one_shot(df, cols):
    stats = ['mean','std']
    dg = df.groupby('customer_ID').agg({col:stats for col in cols})
    out_cols = []
    for col in cols:
        out_cols.extend([f'{col}_{s}' for s in stats])
    dg.columns = out_cols
    dg = dg.reset_index()
    return dg

def process_data(df, FEATURE_ENGINEERING):
    df,dgs = preprocess(df, FEATURE_ENGINEERING)
    df = df.drop_duplicates('customer_ID',keep='last')
    if FEATURE_ENGINEERING == 'on':
        for dg in dgs:
            df = df.merge(dg, on='customer_ID', how='left')
        diff_cols = [col for col in df.columns if col.endswith('_diff')]
        df = df.drop(diff_cols,axis=1)
    return df

def load_train(path, FEATURE_ENGINEERING):
    train = pd.read_parquet(f'{path}/train.parquet')
    
    train = process_data(train, FEATURE_ENGINEERING)
    trainl = pd.read_csv(f'{path}/train_labels.csv')
    train = train.merge(trainl, on='customer_ID', how='left')
    return train


def bold_print(x):
    print(f"\033[1m{x}\033[0m")

### XGB Params and utility functions

In [12]:
def xgb_train(x, y, xt, yt):
    bold_print(f"# of features: {x.shape[1]}")
    assert x.shape[1] == xt.shape[1]
    dtrain = xgb.DMatrix(data=x, label=y)
    dvalid = xgb.DMatrix(data=xt, label=yt)
    params = {
            'objective': 'binary:logistic', 
            'tree_method': 'hist', 
            'max_depth': 7,
            'subsample':0.88,
            'colsample_bytree': 0.5,
            'gamma':1.5,
            'min_child_weight':8,
            'lambda':70,
            'eta':0.1,
    }
    watchlist = [(dtrain, 'train'), (dvalid, 'eval')]
    bst = xgb.train(params, dtrain=dtrain,
                num_boost_round=50,evals=watchlist,
                early_stopping_rounds=500, feval=xgb_amex, maximize=True,
                verbose_eval=50)
    #return _,bst
    print('best ntree_limit:', bst.best_iteration)
    print('best score:', bst.best_score)
    return bst.predict(dvalid, iteration_range=(0,bst.best_iteration)), bst

#### Metrics

In [13]:
def xgb_amex(y_pred, y_true):
    return 'amex', amex_metric_np(y_pred,y_true.get_label())

# Created by https://www.kaggle.com/yunchonggan
# https://www.kaggle.com/competitions/amex-default-prediction/discussion/328020
def amex_metric_np(preds: np.ndarray, target: np.ndarray) -> float:
    indices = np.argsort(preds)[::-1]
    preds, target = preds[indices], target[indices]

    weight = 20.0 - target * 19.0
    cum_norm_weight = (weight / weight.sum()).cumsum()
    four_pct_mask = cum_norm_weight <= 0.04
    d = np.sum(target[four_pct_mask]) / np.sum(target)

    weighted_target = target * weight
    lorentz = (weighted_target / weighted_target.sum()).cumsum()
    gini = ((lorentz - cum_norm_weight) * weight).sum()

    n_pos = np.sum(target)
    n_neg = target.shape[0] - n_pos
    gini_max = 10 * n_neg * (n_pos + 20 * n_neg - 19) / (n_pos + 20 * n_neg)

    g = gini / gini_max
    return 0.5 * (g + d)

# we still need the official metric since the faster version above is slightly off
import pandas as pd
def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:

    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        four_pct_cutoff = int(0.04 * df['weight'].sum())
        df['weight_cumsum'] = df['weight'].cumsum()
        df_cutoff = df.loc[df['weight_cumsum'] <= four_pct_cutoff]
        return (df_cutoff['target'] == 1).sum() / (df['target'] == 1).sum()
        
    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = (pd.concat([y_true, y_pred], axis='columns')
              .sort_values('prediction', ascending=False))
        df['weight'] = df['target'].apply(lambda x: 20 if x==0 else 1)
        df['random'] = (df['weight'] / df['weight'].sum()).cumsum()
        total_pos = (df['target'] * df['weight']).sum()
        df['cum_pos_found'] = (df['target'] * df['weight']).cumsum()
        df['lorentz'] = df['cum_pos_found'] / total_pos
        df['gini'] = (df['lorentz'] - df['random']) * df['weight']
        return df['gini'].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={'target': 'prediction'})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)

### Train XGB in K-folds

In [14]:
def run_cv(FEATURE_ENGINEERING):
    start = time()
    
    path = '/kaggle/input'
    train = load_train(path, FEATURE_ENGINEERING)
    
    bold_print(f'Feature engineering time: {time()-start:.1f} seconds')
    
    not_used = get_not_used()
    not_used = [i for i in not_used if i in train.columns]
    msgs = {}
    folds = 4
    score = 0
    
    for i in range(folds):
        mask = train['cid']%folds == i
        tr,va = train[~mask], train[mask]
        
        x, y = tr.drop(not_used, axis=1), tr['target']
        xt, yt = va.drop(not_used, axis=1), va['target']
        yp, bst = xgb_train(x, y, xt, yt)
        #break
        bst.save_model(f'xgb_{i}.json')
        amex_score = amex_metric(pd.DataFrame({'target':yt.values}), 
                                        pd.DataFrame({'prediction':yp}))
        msg = f"Fold {i} amex {amex_score:.4f}"
        print(msg)
        score += amex_score
        
    score /= folds
    bold_print(f"Average amex score: {score:.4f}")
    return train, score

In [15]:
%%time

FEATURE_ENGINEERING = 'off'
train, score = run_cv(FEATURE_ENGINEERING)

[1mFeature engineering time: 41.7 seconds[0m
[1m# of features: 188[0m
[0]	train-logloss:0.52394	train-amex:0.67976	eval-logloss:0.52317	eval-amex:0.67673
[49]	train-logloss:0.22623	train-amex:0.78367	eval-logloss:0.23315	eval-amex:0.77259
best ntree_limit: 49
best score: 0.772593
Fold 0 amex 0.7722
[1m# of features: 188[0m
[0]	train-logloss:0.52396	train-amex:0.67839	eval-logloss:0.52332	eval-amex:0.67390
[49]	train-logloss:0.22661	train-amex:0.78237	eval-logloss:0.23168	eval-amex:0.77280
best ntree_limit: 49
best score: 0.772795
Fold 1 amex 0.7730
[1m# of features: 188[0m
[0]	train-logloss:0.52302	train-amex:0.67830	eval-logloss:0.52556	eval-amex:0.67761
[49]	train-logloss:0.22571	train-amex:0.78483	eval-logloss:0.23441	eval-amex:0.76998
best ntree_limit: 49
best score: 0.769981
Fold 2 amex 0.7692
[1m# of features: 188[0m
[0]	train-logloss:0.52370	train-amex:0.68243	eval-logloss:0.52368	eval-amex:0.67857
[49]	train-logloss:0.22580	train-amex:0.78412	eval-logloss:0.23402	eval

In [16]:
%%time

FEATURE_ENGINEERING = 'on'
train_fea, score = run_cv(FEATURE_ENGINEERING)

[1mFeature engineering time: 53.0 seconds[0m
[1m# of features: 584[0m
[0]	train-logloss:0.52404	train-amex:0.68922	eval-logloss:0.52335	eval-amex:0.68280
[49]	train-logloss:0.22169	train-amex:0.79070	eval-logloss:0.23075	eval-amex:0.77852
best ntree_limit: 49
best score: 0.778523
Fold 0 amex 0.7775
[1m# of features: 584[0m
[0]	train-logloss:0.52404	train-amex:0.69214	eval-logloss:0.52346	eval-amex:0.69063
[49]	train-logloss:0.22225	train-amex:0.78978	eval-logloss:0.22926	eval-amex:0.77570
best ntree_limit: 49
best score: 0.7757
Fold 1 amex 0.7761
[1m# of features: 584[0m
[0]	train-logloss:0.52291	train-amex:0.69673	eval-logloss:0.52542	eval-amex:0.69564
[49]	train-logloss:0.22120	train-amex:0.79168	eval-logloss:0.23150	eval-amex:0.77364
best ntree_limit: 48
best score: 0.773962
Fold 2 amex 0.7730
[1m# of features: 584[0m
[0]	train-logloss:0.52387	train-amex:0.68499	eval-logloss:0.52406	eval-amex:0.67763
[49]	train-logloss:0.22134	train-amex:0.79007	eval-logloss:0.23114	eval-a

In [17]:
train.head()

Unnamed: 0,customer_ID,S_2,P_2,D_39,B_1,B_2,R_1,S_3,D_41,B_3,...,D_138,D_139,D_140,D_141,D_142,D_143,D_144,D_145,cid,target
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2018-03-13,0.934745,0,0.009382,1.007647,0.006104,0.135021,0.0,0.007174,...,-1,0,0,0.0,,0,0.00297,0,0,0
1,00000fd6641609c6ece5454664794f0340ad84dddce9a2...,2018-03-25,0.880519,6,0.034684,1.004028,0.006911,0.165509,0.0,0.005068,...,-1,0,0,0.0,,0,0.003169,0,1,0
2,00001b22f846c82c51f6e3958ccd81970162bae8b007e8...,2018-03-12,0.880875,0,0.004284,0.812649,0.00645,,0.0,0.007196,...,-1,0,0,0.0,,0,0.000834,0,2,0
3,000041bdba6ecadd89a52d11886e8eaaec9325906c9723...,2018-03-29,0.621776,0,0.012564,1.006183,0.007829,0.287766,0.0,0.009937,...,-1,0,0,0.0,,0,0.00556,0,3,0
4,00007889e4fcd2614b6cbe7f8f3d2e5c728eca32d9eb8a...,2018-03-30,0.8719,0,0.007679,0.815746,0.001247,,0.0,0.005528,...,-1,0,0,0.0,,0,0.006944,0,4,0


In [18]:
train_fea.head()

Unnamed: 0,customer_ID,S_2,P_2,D_39,B_1,B_2,R_1,S_3,D_41,B_3,...,D_131-P_3_std,S_16-P_2_mean,S_16-P_2_std,S_16-P_3_mean,S_16-P_3_std,S_23-P_2_mean,S_23-P_2_std,S_23-P_3_mean,S_23-P_3_std,target
0,0000099d6bd597052cdcda90ffabf56573fe9d7c79be5f...,2018-03-13,0.93,0,0.01,1.01,0.01,0.14,0.0,0.01,...,0.051066,-0.93,0.025495,-0.676923,0.049897,-0.798462,0.023397,-0.545385,0.050102,0
1,00000fd6641609c6ece5454664794f0340ad84dddce9a2...,2018-03-25,0.88,6,0.03,1.0,0.01,0.17,0.0,0.01,...,0.037055,-0.896154,0.024337,-0.563846,0.037314,-0.763077,0.023939,-0.430769,0.038397,0
2,00001b22f846c82c51f6e3958ccd81970162bae8b007e8...,2018-03-12,0.88,0,0.0,0.81,0.01,,0.0,0.01,...,0.076141,-0.875385,0.027269,-0.616154,0.075337,-0.744615,0.028756,-0.485385,0.077955,0
3,000041bdba6ecadd89a52d11886e8eaaec9325906c9723...,2018-03-29,0.62,0,0.01,1.01,0.01,0.29,0.0,0.01,...,0.088318,-0.592308,0.019215,-0.604615,0.086083,-0.460769,0.0206,-0.473077,0.090405,0
4,00007889e4fcd2614b6cbe7f8f3d2e5c728eca32d9eb8a...,2018-03-30,0.87,0,0.01,0.82,0.0,,0.0,0.01,...,0.089615,-0.89,0.044535,-0.525385,0.091889,-0.757692,0.042847,-0.393077,0.090497,0
