# American Express - Default Prediction
Predict if a customer will default in the future

Whether out at a restaurant or buying tickets to a concert, modern life counts on the convenience of a credit card to make daily purchases. It saves us from carrying large amounts of cash and also can advance a full purchase that can be paid over time. How do card issuers know we’ll pay back what we charge? That’s a complex problem with many existing solutions—and even more potential improvements, to be explored in this competition.

Credit default prediction is central to managing risk in a consumer lending business. It allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics. Models already exist to help manage risk, but it is possible to build better ones that outperform those currently in use.

American Express is a globally integrated payments company. The largest payment card issuer in the world, they provide customers with access to products, insights, and experiences that enrich lives and build business success.

In this competition, you’ll apply your machine learning skills to predict credit default. Specifically, you will leverage an industrial scale data set to build a machine learning model that challenges the current model in production. Training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information. You're free to explore any technique to create the most powerful model, from creating features to using the data in a more organic way within a model.

If successful, you'll help create a better customer experience for cardholders by making it easier to be approved for a credit card. Top solutions could challenge the credit default prediction model used by the world's largest payment card issuer—earning you cash prizes, the opportunity to interview with American Express, and potentially a rewarding new career.

## Data Description

The objective of this competition is to predict the probability that a customer will not pay back their credit card balance in the future, based on their monthly customer profile. The binary target variable is calculated by observing an 18-month performance window after the latest credit card statement: if the customer does not pay the amount due within 120 days of their latest statement date, it is considered a default event.

The dataset contains aggregated profile features for each customer at each statement date. Features are anonymized and normalized, and fall into the following general categories:

*   D\_\* = Delinquency variables
*   S\_\* = Spend variables
*   P\_\* = Payment variables
*   B\_\* = Balance variables
*   R\_\* = Risk variables

with the following features being categorical:

`['B_30', 'B_38', 'D_114', 'D_116', 'D_117', 'D_120', 'D_126', 'D_63', 'D_64', 'D_66', 'D_68']`

Your task is to predict, for each `customer_ID`, the probability of a future payment default (`target = 1`).

Note that the negative class has been subsampled for this dataset at 5%, and thus receives a 20x weighting in the scoring metric.
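To make the 20x weighting concrete, here is a minimal sketch on made-up labels (the `toy_target` values are hypothetical, not taken from the competition data). It weights the negatives in the same way as the scoring code used later in this notebook and compares the raw and weighted default rates.

```python
import pandas as pd

# Hypothetical toy labels: 1 = default, 0 = no default.
toy_target = pd.Series([1, 0, 0, 1, 0], name="target")

# Negatives were subsampled to 5%, so each one stands in for 20 customers.
weight = toy_target.map({0: 20, 1: 1})

raw_rate = toy_target.mean()
weighted_rate = (toy_target * weight).sum() / weight.sum()

print(f"raw default rate:      {raw_rate:.3f}")
print(f"weighted default rate: {weighted_rate:.3f}")
```

The default rate seen in the subsampled data is therefore much higher than the rate implied by the weighted metric.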

## Files

*   **train\_data.csv** - training data with multiple statement dates per `customer_ID`
*   **train\_labels.csv** - `target` label for each `customer_ID`
*   **test\_data.csv** - corresponding test data; your objective is to predict the `target` label for each `customer_ID`
*   **sample\_submission.csv** - a sample submission file in the correct format
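As a rough sketch of how the statement-level data and the per-customer labels fit together, the example below joins two tiny in-memory stand-ins for `train_data.csv` and `train_labels.csv`. The rows are made up and only three columns are shown. Whether the sample file loaded later in this notebook was built this way is an assumption.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for train_data.csv and train_labels.csv (made-up rows).
train_data_csv = io.StringIO(
    "customer_ID,S_2,P_2\n"
    "a,2017-03-01,0.9\n"
    "a,2017-04-01,0.8\n"
    "b,2017-03-15,0.2\n"
)
train_labels_csv = io.StringIO("customer_ID,target\na,0\nb,1\n")

data = pd.read_csv(train_data_csv, parse_dates=["S_2"])
labels = pd.read_csv(train_labels_csv)

# Each customer has one label, which is repeated on every statement row.
merged = data.merge(labels, on="customer_ID", how="left", validate="many_to_one")
print(merged)
```

The `validate="many_to_one"` argument makes pandas raise an error if a customer ever appears twice in the labels table.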

Link: https://www.kaggle.com/competitions/amex-default-prediction/overview

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from tqdm.notebook import tqdm
import tsfresh
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import accuracy_score

In [2]:
%load_ext nb_black

In [3]:
le = preprocessing.LabelEncoder()

In [4]:
df = pd.read_csv(
    "../../data/amex-default-prediction/sample_train_data_30k.csv", delimiter="\t"
).set_index("customer_ID")
df

customer_ID,S_2,P_2,D_39,B_1,B_2,R_1,S_3,D_41,B_3,D_42,...,D_137,D_138,D_139,D_140,D_141,D_142,D_143,D_144,D_145,target
567bb352c1ec9fe043c64dd1c5e075189c30612b456a76cbeaf0cb34341d3a4f,2017-12-29,0.392097,0.002878,0.040471,0.810208,0.004398,0.244489,0.004483,0.007960,0.270353,...,,,,0.002539,,,,0.006251,,1
567bb352c1ec9fe043c64dd1c5e075189c30612b456a76cbeaf0cb34341d3a4f,2018-01-19,0.391909,0.036332,0.057248,0.816525,0.006929,0.234017,0.005245,0.001577,0.269885,...,,,0.004141,0.000521,0.005122,,0.008661,0.004245,0.004008,1
567bb352c1ec9fe043c64dd1c5e075189c30612b456a76cbeaf0cb34341d3a4f,2018-02-08,0.361950,0.623582,0.089909,0.818877,0.007582,0.332134,0.002409,0.002631,0.274014,...,,,0.007783,0.006298,0.008711,,0.006385,0.006506,0.002321,1
567bb352c1ec9fe043c64dd1c5e075189c30612b456a76cbeaf0cb34341d3a4f,2018-03-28,0.240591,0.300403,0.106174,0.054358,0.508895,0.451140,0.007622,0.169807,0.269052,...,,,0.005840,0.006598,0.003271,,0.001816,0.003071,0.002218,1
443132d9591d20ee365043b961a8d3d67d96daab5d75469b03d4bd64a4f31b8f,2017-03-18,0.275838,0.648851,0.035793,0.841584,0.751161,0.073957,0.538993,0.026267,,...,,,0.000245,0.000904,0.002443,,0.003759,0.000124,0.002175,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9a7e60eee07bcb770a14e8b9dca634b62e2eda22947dd01fece4880fe9a6200c,2017-11-11,0.409957,0.003579,0.568713,0.011909,0.006221,0.132613,0.009133,0.439427,0.177524,...,,,0.005938,0.005980,0.000688,,0.005720,0.004623,0.001788,1
9a7e60eee07bcb770a14e8b9dca634b62e2eda22947dd01fece4880fe9a6200c,2017-12-16,0.393445,0.002969,0.572830,0.004148,0.005036,0.129955,0.001435,0.437734,0.174953,...,,,0.005850,0.000524,0.008980,,0.007779,0.006147,0.001454,1
9a7e60eee07bcb770a14e8b9dca634b62e2eda22947dd01fece4880fe9a6200c,2018-01-19,0.321152,0.035907,0.599499,0.002268,0.004242,0.130136,0.001243,0.448089,0.170339,...,0.003078,0.506428,0.003302,0.003661,0.005257,,0.006914,0.007320,0.006374,1
9a7e60eee07bcb770a14e8b9dca634b62e2eda22947dd01fece4880fe9a6200c,2018-02-18,0.329993,0.090337,0.604726,0.007973,0.000571,0.133941,0.005195,0.454299,0.169984,...,0.008086,0.507849,0.002103,0.005741,0.002001,,0.004232,0.009550,0.003749,1


In [5]:
# Number of distinct customers in the sample.
df.index.nunique()

29136

# Feature engineering

In [6]:
# Order each customer's statements chronologically (S_2 is the statement date).
df.sort_values(["customer_ID", "S_2"], inplace=True)

In [7]:
# The date itself is not used as a feature, only the row order it induces.
df.drop("S_2", axis=1, inplace=True)

## Categorical

In [8]:
cat_columns = [
    "B_30",
    "B_38",
    "D_114",
    "D_116",
    "D_117",
    "D_120",
    "D_126",
    "D_63",
    "D_64",
    "D_66",
    "D_68",
]

In [9]:
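# Replace each categorical column with integer codes (the encoder is refit per column).
for cat_column in tqdm(cat_columns):
    df[cat_column] = le.fit_transform(df[cat_column])

# Collapse to one row per customer by keeping the largest code seen across statements.
categorical_df = df[cat_columns].groupby("customer_ID").max().copy()
categorical_df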

customer_ID,B_30,B_38,D_114,D_116,D_117,D_120,D_126,D_63,D_64,D_66,D_68
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,1,4,1,0,6,0,2,1,1,1,6
00055add5eaee481b27e40af3f47b8b24e82c1e550f6ab010000e7685692f281,0,0,0,0,3,1,1,1,3,2,4
0005e52a3fa31b7eed49ceb576f011433ee2578833cd3f9d51c9dd9448a198ff,2,3,2,2,7,2,3,1,4,2,7
0008c2f297e1b00bf567c0d2c25f3e3b356f9a3088d2bf47aaaa724d26df8787,1,6,1,0,4,1,2,1,1,2,6
000ee46c042bfab551c28d92c93969f8a3539fe1e9fc9cd2d2e188f838f7d8ba,0,1,1,0,0,0,2,1,3,2,5
...,...,...,...,...,...,...,...,...,...,...,...
fff5d7b60546129d6a8329f6d7ae6fa18e5ebae6e2ce445f96eb933f4d62e347,0,1,2,2,7,2,3,2,4,2,7
fff7c9afcb6185ebe6152587453d9bd1eb04c458005d64c6afba25f32faa51f2,0,1,1,0,5,0,2,2,4,2,6
fffce0cbbd3dc853a970145cc1ebd44c716af633214cd0b54d18b01fbd31b626,1,4,1,0,0,0,2,1,3,2,5
fffec7d7e1ca804c86f1ffdaac389c33f8039ed35bf412b12d2e3548e49d54fa,0,0,0,0,5,0,2,1,2,2,6
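The cell above keeps the largest encoded value per customer, which is not necessarily the most recent one. As a hedged alternative, the sketch below encodes made-up categorical values with pandas category codes (missing values become -1) and keeps the code from each customer's last statement. The toy values and the choice of "last statement" are assumptions for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up statements for two customers, already sorted by date.
toy = pd.DataFrame(
    {
        "customer_ID": ["a", "a", "b", "b"],
        "D_63": ["CO", "CR", "CO", "CO"],
        "D_64": ["O", np.nan, "U", "R"],
    }
).set_index("customer_ID")

encoded = toy.copy()
for col in ["D_63", "D_64"]:
    # Codes follow the sorted category order; NaN is encoded as -1.
    encoded[col] = toy[col].astype("category").cat.codes

# Keep the most recent statement's code for each customer.
latest = encoded.groupby("customer_ID").tail(1)
print(latest)
```

On this toy input, customer `b` ends with `D_64` code 1, whereas taking the maximum would give 2.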


## Numerical

In [10]:
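# Everything that is neither the target nor categorical is treated as numerical.
# Note: the set difference does not preserve the original column order.
num_columns = list(set(df.drop("target", axis=1).columns) - set(cat_columns))
len(num_columns)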

177

In [11]:
# Missing numerical values are replaced with 0 (fillna returns a new frame).
df1 = df[num_columns].fillna(0)
df1

customer_ID,D_108,D_75,B_7,D_69,R_17,D_56,R_11,B_33,R_2,D_113,...,S_7,D_51,D_112,D_111,P_2,D_81,R_24,B_15,B_14,R_14
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,0.0,0.134335,0.203534,0.007214,0.001430,0.780770,0.000061,1.003039,0.008457,0.007752,...,0.084461,0.333458,1.006342,0.0,0.840182,0.001664,0.009399,0.007800,0.011894,0.002122
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,0.0,0.141054,0.216366,0.006088,0.001841,0.779804,0.007217,1.007165,0.009267,0.006612,...,0.092281,0.337286,1.003119,0.0,0.862529,0.004318,0.009699,0.002514,0.064276,0.000139
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,0.0,0.203625,0.107817,0.002976,0.004458,0.813773,0.009705,0.006053,0.006245,0.007384,...,0.100649,0.333364,1.009727,0.0,0.845870,0.003415,0.003843,0.003168,0.093861,0.006505
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,0.0,0.142603,0.346959,0.004231,0.009495,0.810733,0.006223,0.003549,0.007705,0.001500,...,0.094569,0.338232,1.006872,0.0,0.948166,0.007226,0.008246,0.004135,0.116708,0.003699
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,0.0,0.069216,0.040135,0.009159,0.006521,0.810838,0.006680,0.006056,0.004294,0.002769,...,0.096498,0.342288,1.009859,0.0,0.945046,0.008658,0.002318,0.003647,0.143971,0.008192
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fffec7d7e1ca804c86f1ffdaac389c33f8039ed35bf412b12d2e3548e49d54fa,0.0,0.002959,0.006707,0.002533,0.007587,0.000000,0.007649,1.000950,0.009855,0.004828,...,0.000000,0.008935,1.000198,0.0,0.703207,0.002972,0.001910,0.007673,0.006004,0.006418
fffec7d7e1ca804c86f1ffdaac389c33f8039ed35bf412b12d2e3548e49d54fa,0.0,0.000541,0.008323,0.004468,0.004944,0.000000,0.009381,1.007176,0.002231,0.009226,...,0.000000,0.006894,1.004583,0.0,0.703939,0.007425,0.007407,0.003287,0.004503,0.004783
fffec7d7e1ca804c86f1ffdaac389c33f8039ed35bf412b12d2e3548e49d54fa,0.0,0.008775,0.001540,0.005052,0.000400,0.000000,0.000023,1.000124,0.005981,0.000607,...,0.000000,0.008490,1.003093,0.0,0.710213,0.009112,0.000109,0.002357,0.004866,0.007875
fffee056e120fb326c9413fca5a7ab6618cc49be9bb6b19c34f3fb13fd50fdd3,0.0,0.009093,0.047381,0.000000,0.000828,0.000000,0.002974,1.009476,0.004490,0.000000,...,0.109090,0.007494,1.001887,0.0,0.839199,0.003138,0.005547,0.009128,0.018631,0.007964
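Filling with 0 makes a missing value indistinguishable from a true zero. One possible refinement, shown here only as a sketch on made-up data (the `D_42_missing` column name is hypothetical), is to record a missingness flag before filling so that the information survives the per-customer aggregation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {"customer_ID": ["a", "a", "b"], "D_42": [0.1, np.nan, np.nan]}
).set_index("customer_ID")

# Record missingness before it is overwritten by the fill value.
toy["D_42_missing"] = toy["D_42"].isna().astype(int)
toy["D_42"] = toy["D_42"].fillna(0)

# After averaging, the flag becomes the share of missing statements per customer.
per_customer = toy.groupby("customer_ID").mean()
print(per_customer)
```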


In [12]:
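# Per-customer time-series aggregates. With MinimalFCParameters this run produced
# 10 simple statistics per column (sum, median, mean, length, ...): 177 x 10 = 1770.
df2 = tsfresh.extract_features(
    df1.reset_index(),
    column_id="customer_ID",
    default_fc_parameters=tsfresh.feature_extraction.MinimalFCParameters(),
)
df2.shape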

Feature Extraction: 100%|████████████████████████████████████████████████████████████████| 30/30 [09:34<00:00, 19.17s/it]


(29136, 1770)

In [13]:
df2

customer_ID,D_108__sum_values,D_108__median,D_108__mean,D_108__length,D_108__standard_deviation,D_108__variance,D_108__root_mean_square,D_108__maximum,D_108__absolute_maximum,D_108__minimum,...,R_14__sum_values,R_14__median,R_14__mean,R_14__length,R_14__standard_deviation,R_14__variance,R_14__root_mean_square,R_14__maximum,R_14__absolute_maximum,R_14__minimum
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,0.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.066939,0.005634,0.005149,13.0,0.002895,0.000008,0.005907,0.009107,0.009107,0.000042
00055add5eaee481b27e40af3f47b8b24e82c1e550f6ab010000e7685692f281,0.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.066821,0.006737,0.005140,13.0,0.003392,0.000012,0.006158,0.009692,0.009692,0.000225
0005e52a3fa31b7eed49ceb576f011433ee2578833cd3f9d51c9dd9448a198ff,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.045032,0.006816,0.005629,8.0,0.003261,0.000011,0.006505,0.009698,0.009698,0.000590
0008c2f297e1b00bf567c0d2c25f3e3b356f9a3088d2bf47aaaa724d26df8787,0.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.068960,0.005444,0.005305,13.0,0.003186,0.000010,0.006188,0.009674,0.009674,0.000235
000ee46c042bfab551c28d92c93969f8a3539fe1e9fc9cd2d2e188f838f7d8ba,0.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.080886,0.006859,0.006222,13.0,0.002109,0.000004,0.006570,0.009246,0.009246,0.002203
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fff5d7b60546129d6a8329f6d7ae6fa18e5ebae6e2ce445f96eb933f4d62e347,0.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.061630,0.005917,0.004741,13.0,0.002742,0.000008,0.005476,0.009725,0.009725,0.000059
fff7c9afcb6185ebe6152587453d9bd1eb04c458005d64c6afba25f32faa51f2,0.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.057200,0.004007,0.004400,13.0,0.002892,0.000008,0.005266,0.009383,0.009383,0.000923
fffce0cbbd3dc853a970145cc1ebd44c716af633214cd0b54d18b01fbd31b626,0.0,0.0,0.0,13.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.060434,0.004999,0.004649,13.0,0.004048,0.000016,0.006164,0.009962,0.009962,0.000180
fffec7d7e1ca804c86f1ffdaac389c33f8039ed35bf412b12d2e3548e49d54fa,0.0,0.0,0.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.050865,0.007875,0.007266,7.0,0.001999,0.000004,0.007536,0.009851,0.009851,0.004257
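The feature matrix built in the next section uses only per-customer means of `df1`, so the tsfresh features in `df2` do not reach the model. The pandas-only sketch below shows, on made-up data, how per-customer aggregates of this kind could be joined onto a mean-based feature frame. The column names and the `X_toy` / `X_wide` frames are hypothetical, and the naming only loosely follows the tsfresh convention.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "customer_ID": ["a", "a", "a", "b", "b"],
        "P_2": [0.9, 0.8, 0.7, 0.2, 0.4],
    }
)

# A few per-customer aggregates, named loosely in the tsfresh style.
agg = toy.groupby("customer_ID")["P_2"].agg(["sum", "median", "mean", "min", "max"])
agg.columns = [f"P_2__{name}" for name in agg.columns]

# Plain per-customer means, standing in for a mean-based feature frame.
X_toy = toy.groupby("customer_ID").mean().add_suffix("__plain_mean")

# Both frames are indexed by customer_ID, so an index join lines them up.
X_wide = X_toy.join(agg)
print(X_wide)
```

If `df2` is indexed by the same customer IDs, an index join of this form would presumably work there too, but that has not been checked against this notebook's data.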


# Data Preprocessing

In [14]:
# The target is constant across a customer's statements, so max() just picks it.
y = df.groupby("customer_ID")[["target"]].max()
y

customer_ID,target
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,0
00055add5eaee481b27e40af3f47b8b24e82c1e550f6ab010000e7685692f281,0
0005e52a3fa31b7eed49ceb576f011433ee2578833cd3f9d51c9dd9448a198ff,1
0008c2f297e1b00bf567c0d2c25f3e3b356f9a3088d2bf47aaaa724d26df8787,1
000ee46c042bfab551c28d92c93969f8a3539fe1e9fc9cd2d2e188f838f7d8ba,0
...,...
fff5d7b60546129d6a8329f6d7ae6fa18e5ebae6e2ce445f96eb933f4d62e347,0
fff7c9afcb6185ebe6152587453d9bd1eb04c458005d64c6afba25f32faa51f2,0
fffce0cbbd3dc853a970145cc1ebd44c716af633214cd0b54d18b01fbd31b626,0
fffec7d7e1ca804c86f1ffdaac389c33f8039ed35bf412b12d2e3548e49d54fa,0


In [15]:
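# Attach the per-customer categorical codes to every statement row, then average
# each customer's statements into a single feature row.
# The tsfresh features in df2 are not part of X.
X = df1.join(categorical_df).groupby("customer_ID").mean()
X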

customer_ID,D_108,D_75,B_7,D_69,R_17,D_56,R_11,B_33,R_2,D_113,...,B_38,D_114,D_116,D_117,D_120,D_126,D_63,D_64,D_66,D_68
000201146e53cacdde1c7e9d29f4d3c46fd4d9231a3744aa39fb9c6afa79b708,0.0,0.097966,0.105303,0.005224,0.005308,0.839536,0.005522,0.235784,0.005353,0.004682,...,4.0,1.0,0.0,6.0,0.0,2.0,1.0,1.0,1.0,6.0
00055add5eaee481b27e40af3f47b8b24e82c1e550f6ab010000e7685692f281,0.0,0.004751,0.031515,0.005346,0.004706,0.000000,0.043364,1.004807,0.006171,0.204706,...,0.0,0.0,0.0,3.0,1.0,1.0,1.0,3.0,2.0,4.0
0005e52a3fa31b7eed49ceb576f011433ee2578833cd3f9d51c9dd9448a198ff,0.0,0.240284,0.346963,0.002343,0.005722,0.000000,0.005409,0.381506,0.003682,0.003686,...,3.0,2.0,2.0,7.0,2.0,3.0,1.0,4.0,2.0,7.0
0008c2f297e1b00bf567c0d2c25f3e3b356f9a3088d2bf47aaaa724d26df8787,0.0,0.369202,0.662566,0.005216,0.005706,0.172676,0.043989,0.004500,0.157674,0.005334,...,6.0,1.0,0.0,4.0,1.0,2.0,1.0,1.0,2.0,6.0
000ee46c042bfab551c28d92c93969f8a3539fe1e9fc9cd2d2e188f838f7d8ba,0.0,0.020253,0.044891,0.005056,0.003777,0.000000,0.004229,1.003396,0.005507,0.005421,...,1.0,1.0,0.0,0.0,0.0,2.0,1.0,3.0,2.0,5.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
fff5d7b60546129d6a8329f6d7ae6fa18e5ebae6e2ce445f96eb933f4d62e347,0.0,0.003862,0.022724,0.001441,0.004535,0.000000,0.006535,1.005859,0.003913,0.000566,...,1.0,2.0,2.0,7.0,2.0,3.0,2.0,4.0,2.0,7.0
fff7c9afcb6185ebe6152587453d9bd1eb04c458005d64c6afba25f32faa51f2,0.0,0.005477,0.017089,0.005578,0.005078,0.485051,0.044587,0.927955,0.005472,0.004829,...,1.0,1.0,0.0,5.0,0.0,2.0,2.0,4.0,2.0,6.0
fffce0cbbd3dc853a970145cc1ebd44c716af633214cd0b54d18b01fbd31b626,0.0,0.620016,0.524040,0.005258,0.005299,0.174931,0.005547,0.082335,0.005569,0.097790,...,4.0,1.0,0.0,0.0,0.0,2.0,1.0,3.0,2.0,5.0
fffec7d7e1ca804c86f1ffdaac389c33f8039ed35bf412b12d2e3548e49d54fa,0.0,0.005356,0.004414,0.004280,0.005317,0.000000,0.005501,1.004108,0.006280,0.005160,...,0.0,0.0,0.0,5.0,0.0,2.0,1.0,2.0,2.0,6.0


In [16]:
# Hold out 10% of the customers as a local test set (named X_true / y_true here).
X_train, X_true, y_train, y_true = train_test_split(
    X, y, test_size=0.1, random_state=42
)
X_train.shape, X_true.shape, y_train.shape, y_true.shape

((26222, 188), (2914, 188), (26222, 1), (2914, 1))

In [17]:
# Carve a validation set out of the remaining training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.1, random_state=42
)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((23599, 188), (2623, 188), (23599, 1), (2623, 1))
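The splits above are purely random. With an imbalanced target it can help to keep the class ratio identical across splits; `train_test_split` has a `stratify` argument for that purpose. The pandas-only sketch below illustrates the idea on made-up labels (the `y_toy` frame is hypothetical).

```python
import pandas as pd

# Made-up labels: 20 customers, 25% positives.
y_toy = pd.DataFrame(
    {"target": [1] * 5 + [0] * 15},
    index=pd.Index([f"c{i}" for i in range(20)], name="customer_ID"),
)

# Sample 20% of each class so the holdout keeps the overall class ratio.
holdout = y_toy.groupby("target").sample(frac=0.2, random_state=42)
train = y_toy.drop(holdout.index)

print(train["target"].mean(), holdout["target"].mean())
```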

# Train

In [18]:
model = CatBoostClassifier()
model

<catboost.core.CatBoostClassifier at 0x7f46002dd3d0>

In [19]:
# Train on the training split and track the loss on the validation split.
model.fit(
    Pool(X_train, y_train),
    eval_set=Pool(X_val, y_val),
    verbose=False,
    plot=True,
)

<catboost.core.CatBoostClassifier at 0x7f46002dd3d0>

In [20]:
model.best_score_

{'learn': {'Logloss': 0.07585202295613266},
 'validation': {'Logloss': 0.23115103662981876}}

# Test

In [36]:
# predict() returns hard class labels, not probabilities.
y_preds = model.predict(X_true)
y_preds

array([0, 0, 0, ..., 0, 1, 1])

In [25]:
accuracy_score(y_true, y_preds)

0.8901853122855182
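Accuracy is easier to interpret next to the score of a trivial model. The sketch below, on made-up labels (`y_toy` is hypothetical), computes the accuracy obtained by always predicting the majority class; the same calculation on `y_true` would give the floor against which the accuracy above can be compared.

```python
import numpy as np

# Made-up holdout labels with one positive in four.
y_toy = np.array([0, 0, 0, 1, 0, 0, 1, 0])

# Always predicting the majority class sets the accuracy floor.
majority_class = int(y_toy.mean() >= 0.5)
baseline_accuracy = (y_toy == majority_class).mean()
print(baseline_accuracy)
```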

In [22]:
# https://www.kaggle.com/code/inversion/amex-competition-metric-python


def amex_metric(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
    def top_four_percent_captured(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = pd.concat([y_true, y_pred], axis="columns").sort_values(
            "prediction", ascending=False
        )
        df["weight"] = df["target"].apply(lambda x: 20 if x == 0 else 1)
        four_pct_cutoff = int(0.04 * df["weight"].sum())
        df["weight_cumsum"] = df["weight"].cumsum()
        df_cutoff = df.loc[df["weight_cumsum"] <= four_pct_cutoff]
        return (df_cutoff["target"] == 1).sum() / (df["target"] == 1).sum()

    def weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        df = pd.concat([y_true, y_pred], axis="columns").sort_values(
            "prediction", ascending=False
        )
        df["weight"] = df["target"].apply(lambda x: 20 if x == 0 else 1)
        df["random"] = (df["weight"] / df["weight"].sum()).cumsum()
        total_pos = (df["target"] * df["weight"]).sum()
        df["cum_pos_found"] = (df["target"] * df["weight"]).cumsum()
        df["lorentz"] = df["cum_pos_found"] / total_pos
        df["gini"] = (df["lorentz"] - df["random"]) * df["weight"]
        return df["gini"].sum()

    def normalized_weighted_gini(y_true: pd.DataFrame, y_pred: pd.DataFrame) -> float:
        y_true_pred = y_true.rename(columns={"target": "prediction"})
        return weighted_gini(y_true, y_pred) / weighted_gini(y_true, y_true_pred)

    g = normalized_weighted_gini(y_true, y_pred)
    d = top_four_percent_captured(y_true, y_pred)

    return 0.5 * (g + d)
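
To illustrate one of the two components of the metric, here is a NumPy sketch of the "top four percent captured" part, applied to made-up labels and scores. It is a simplified re-implementation for illustration, not the official scoring code, and it uses a stable sort, so results may differ from the pandas version when predictions are tied.

```python
import numpy as np


def top_four_percent_captured(target, prediction):
    """Share of positives ranked within the top 4% of total weight."""
    target = np.asarray(target)
    prediction = np.asarray(prediction)
    # Rank customers from highest to lowest predicted probability.
    order = np.argsort(-prediction, kind="stable")
    ranked_target = target[order]
    # Negatives count 20 times to undo the 5% subsampling.
    weight = np.where(ranked_target == 0, 20, 1)
    cutoff = int(0.04 * weight.sum())
    in_top = np.cumsum(weight) <= cutoff
    return (ranked_target[in_top] == 1).sum() / (target == 1).sum()


toy_target = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
good_scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
mixed_scores = [0.9, 0.3, 0.8, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

captured_good = top_four_percent_captured(toy_target, good_scores)
captured_mixed = top_four_percent_captured(toy_target, mixed_scores)
print(captured_good, captured_mixed)
```

Because each negative carries weight 20, a single negative ranked above a positive is enough to push that positive out of the top 4% in this toy example.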

In [41]:
# Despite the name, this scores the local holdout set, not the Kaggle test data.
submission = pd.DataFrame(
    {
        "customer_ID": X_true.index,
        "prediction": model.predict_proba(X_true)[:, 1],
    }
).set_index("customer_ID")
submission

customer_ID,prediction
1d0e1bfb13c92134fb74d0dea267b8be76d39a255ccf37923e072ceba92921c4,0.011207
453d6a1185af510532421445ad192f150303f420ffc259f45f2efb41b370f123,0.027708
905a6ca1469d829e91217e0849f779478e150f7de427ba156d84c135978f6a7c,0.158952
3c30945bece00d124d711ced494cd21f204d40d7e48b72fdc5e4bc203cc2a054,0.462226
990ed4865b52d9febd9bc9a3d9847280c7902f416a1bc6f768fcfc68922be9cc,0.050254
...,...
7601f997a134164b2af9008b6b3c4b5c6df7698b31d1b45482dd48c5d9882769,0.002527
0d476719100794b3fb774411e28eca823d20f9bfc33666064e2eadf470c37967,0.113718
1e67f54a90ef89cdf63c32952884932b5a62b375c57bb58f8a32955fb096dfd2,0.005297
7fb7394e08909c0dc1780f5101268387682c6587fb57982f38755a0a817bca51,0.516394


In [42]:
amex_metric(y_true, submission)

0.729563465404534
