# Santander Customer Transaction Prediction
Can you identify who will make a transaction?

![atms](https://storage.googleapis.com/kaggle-media/competitions/santander/atm_image.png)

At [Santander](https://www.santanderbank.com) our mission is to help people and businesses prosper. We are always looking for ways to help our customers understand their financial health and identify which products and services might help them achieve their monetary goals.

Our data science team is continually challenging our machine learning algorithms, working with the global data science community to find more accurate ways to solve our most common challenge: binary classification problems such as "Is a customer satisfied?", "Will a customer buy this product?", and "Can a customer pay this loan?"

In this challenge, we invite Kagglers to help us identify which customers will make a specific transaction in the future, irrespective of the amount of money transacted. The data provided for this competition has the same structure as the real data we have available to solve this problem.

Link: https://www.kaggle.com/competitions/santander-customer-transaction-prediction/overview

In [1]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
%load_ext nb_black

In [3]:
train_df = pd.read_csv(
    "../../data/santander-customer-transaction-prediction/train.csv"
).set_index("ID_code")
train_df

ID_code,target,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
train_0,0,8.9255,-6.7863,11.9081,5.0930,11.4607,-9.2834,5.1187,18.6266,-4.9200,...,4.4354,3.9642,3.1364,1.6910,18.5227,-2.3978,7.8784,8.5635,12.7803,-1.0914
train_1,0,11.5006,-4.1473,13.8588,5.3890,12.3622,7.0433,5.6208,16.5338,3.1468,...,7.6421,7.7214,2.5837,10.9516,15.4305,2.0339,8.1267,8.7889,18.3560,1.9518
train_2,0,8.6093,-2.7457,12.0805,7.8928,10.5825,-9.0837,6.9427,14.6155,-4.9193,...,2.9057,9.7905,1.6704,1.6858,21.6042,3.1417,-6.5213,8.2675,14.7222,0.3965
train_3,0,11.0604,-2.1518,8.9522,7.1957,12.5846,-1.8361,5.8428,14.9250,-5.8609,...,4.4666,4.7433,0.7178,1.4214,23.0347,-1.2706,-2.9275,10.2922,17.9697,-8.9996
train_4,0,9.8369,-1.4834,12.8746,6.6375,12.2772,2.4486,5.9405,19.2514,6.2654,...,-1.4905,9.5214,-0.1508,9.1942,13.2876,-1.5121,3.9267,9.5031,17.9974,-8.8104
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
train_199995,0,11.4880,-0.4956,8.2622,3.5142,10.3404,11.6081,5.6709,15.1516,-0.6209,...,6.1415,13.2305,3.9901,0.9388,18.0249,-1.7939,2.1661,8.5326,16.6660,-17.8661
train_199996,0,4.9149,-2.4484,16.7052,6.6345,8.3096,-10.5628,5.8802,21.5940,-3.6797,...,4.9611,4.6549,0.6998,1.8341,22.2717,1.7337,-2.1651,6.7419,15.9054,0.3388
train_199997,0,11.2232,-5.0518,10.5127,5.6456,9.3410,-5.4086,4.5555,21.5571,0.1202,...,4.0651,5.4414,3.1032,4.8793,23.5311,-1.5736,1.2832,8.7155,13.8329,4.1995
train_199998,0,9.7148,-8.6098,13.6104,5.7930,12.5173,0.5339,6.0479,17.0152,-2.1926,...,2.6840,8.6587,2.7337,11.1178,20.4158,-0.0786,6.7980,10.0342,15.5289,-13.9001


In [4]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200000 entries, train_0 to train_199999
Columns: 201 entries, target to var_199
dtypes: float64(200), int64(1)
memory usage: 308.2+ MB


In [5]:
test_df = pd.read_csv(
    "../../data/santander-customer-transaction-prediction/test.csv"
).set_index("ID_code")
test_df

ID_code,var_0,var_1,var_2,var_3,var_4,var_5,var_6,var_7,var_8,var_9,...,var_190,var_191,var_192,var_193,var_194,var_195,var_196,var_197,var_198,var_199
test_0,11.0656,7.7798,12.9536,9.4292,11.4327,-2.3805,5.8493,18.2675,2.1337,8.8100,...,-2.1556,11.8495,-1.4300,2.4508,13.7112,2.4669,4.3654,10.7200,15.4722,-8.7197
test_1,8.5304,1.2543,11.3047,5.1858,9.1974,-4.0117,6.0196,18.6316,-4.4131,5.9739,...,10.6165,8.8349,0.9403,10.1282,15.5765,0.4773,-1.4852,9.8714,19.1293,-20.9760
test_2,5.4827,-10.3581,10.1407,7.0479,10.2628,9.8052,4.8950,20.2537,1.5233,8.3442,...,-0.7484,10.9935,1.9803,2.1800,12.9813,2.1281,-7.1086,7.0618,19.8956,-23.1794
test_3,8.5374,-1.3222,12.0220,6.5749,8.8458,3.1744,4.9397,20.5660,3.3755,7.4578,...,9.5702,9.0766,1.6580,3.5813,15.1874,3.1656,3.9567,9.2295,13.0168,-4.2108
test_4,11.7058,-0.1327,14.1295,7.7506,9.1035,-8.5848,6.8595,10.6048,2.9890,7.1437,...,4.2259,9.1723,1.2835,3.3778,19.5542,-0.2860,-5.1612,7.2882,13.9260,-9.1846
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
test_199995,13.1678,1.0136,10.4333,6.7997,8.5974,-4.1641,4.8579,14.7625,-2.7239,6.9937,...,2.0544,9.6849,4.6734,-1.3660,12.8721,1.2013,-4.6195,9.1568,18.2102,4.8801
test_199996,9.7171,-9.1462,7.3443,9.1421,12.8936,3.0191,5.6888,18.8862,5.0915,6.3545,...,5.0071,6.6548,1.8197,2.4104,18.9037,-0.9337,2.9995,9.1112,18.1740,-20.7689
test_199997,11.6360,2.2769,11.2074,7.7649,12.6796,11.3224,5.3883,18.3794,1.6603,5.7341,...,5.1536,2.6498,2.4937,-0.0637,20.0609,-1.1742,-4.1524,9.1933,11.7905,-22.2762
test_199998,13.5745,-0.5134,13.6584,7.4855,11.2241,-11.3037,4.1959,16.8280,5.3208,8.9032,...,3.4259,8.5012,2.2713,5.7621,17.0056,1.1763,-2.3761,8.1079,8.7735,-0.2122


In [6]:
sample_submission_df = pd.read_csv(
    "../../data/santander-customer-transaction-prediction/sample_submission.csv"
).set_index("ID_code")
sample_submission_df

ID_code,target
test_0,0
test_1,0
test_2,0
test_3,0
test_4,0
...,...
test_199995,0
test_199996,0
test_199997,0
test_199998,0


# Data analysis

In [7]:
train_df["target"].value_counts(normalize=True)

0    0.89951
1    0.10049
Name: target, dtype: float64
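
Only about 10% of the rows are positive, so plain accuracy can look good for a model that never predicts a transaction. The competition is scored on ROC AUC, which is less forgiving of that. The sketch below illustrates the gap on synthetic stand-in labels (`y_demo` is generated here and is not the competition data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic stand-in labels with about 10% positives, mimicking the
# class balance printed above (assumption: not the competition data).
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.10).astype(int)

# A trivial "model" that always predicts the majority class 0.
always_zero = np.zeros_like(y_demo)
baseline_accuracy = accuracy_score(y_demo, always_zero)

# Used as a score, the same constant output carries no ranking information.
baseline_auc = roc_auc_score(y_demo, always_zero.astype(float))

print(f"accuracy: {baseline_accuracy:.3f}, ROC AUC: {baseline_auc:.3f}")
```

The constant predictor reaches roughly 90% accuracy while its ROC AUC sits at chance level, which is a useful yardstick when reading accuracy numbers on this dataset.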

In [8]:
(train_df.isna().sum() / len(train_df)).sort_values(ascending=False)

target     0.0
var_137    0.0
var_127    0.0
var_128    0.0
var_129    0.0
          ... 
var_69     0.0
var_70     0.0
var_71     0.0
var_72     0.0
var_199    0.0
Length: 201, dtype: float64

# Train

In [9]:
X = train_df.drop("target", axis=1)
y = train_df[["target"]]

X.shape, y.shape

((200000, 200), (200000, 1))

In [10]:
# Train on GPU; devices="0:1" selects GPU devices 0 and 1.
model = CatBoostClassifier(task_type="GPU", devices="0:1")

In [11]:
# Hold out 20% of the labelled data as a final test split (X_true, y_true).
X_train, X_true, y_train, y_true = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train.shape, X_true.shape, y_train.shape, y_true.shape

((160000, 200), (40000, 200), (160000, 1), (40000, 1))

In [12]:
# Split the remaining 80% again: 80% for training, 20% for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)
X_train.shape, X_val.shape, y_train.shape, y_val.shape

((128000, 200), (32000, 200), (128000, 1), (32000, 1))
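
The two splits above are purely random. With a roughly 10% positive class, one option is to pass `stratify` so every split keeps about the same class ratio. A minimal sketch on synthetic data (`X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in features and imbalanced labels (assumption: demo only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 5))
y_demo = (rng.random(10_000) < 0.10).astype(int)

# stratify=y_demo keeps the positive rate nearly identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"train positive rate: {y_tr.mean():.4f}")
print(f"test positive rate:  {y_te.mean():.4f}")
```

In this notebook the equivalent change would be adding `stratify=y` to the first call and `stratify=y_train` to the second. Whether it changes the final score here is not something this sketch establishes.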

In [13]:
model.fit(Pool(X_train, y_train), eval_set=Pool(X_val, y_val), verbose=False, plot=True)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

<catboost.core.CatBoostClassifier at 0x7f574e4da6a0>

In [14]:
model.get_best_score()

{'learn': {'Logloss': 0.18645701599121095},
 'validation': {'Logloss': 0.2106449279785156}}

In [15]:
y_preds = model.predict(X_true)
y_preds

array([0, 0, 0, ..., 0, 0, 0])

In [16]:
accuracy_score(y_true, y_preds)

0.9201

In [17]:
(y_true["target"] == y_preds).sum() / len(y_true)

0.9201
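
Accuracy is computed from hard 0/1 labels, while the competition metric (ROC AUC) is computed from predicted probabilities. The sketch below shows both side by side on synthetic data, with a scikit-learn `LogisticRegression` standing in for the CatBoost model. All `*_demo` names and the data-generating rule are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data with a rare positive class driven by the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(5_000, 4))
logits = 2.0 * X_demo[:, 0] - 3.0
y_demo = (rng.random(5_000) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Stand-in classifier; the notebook itself uses CatBoostClassifier.
clf = LogisticRegression().fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)              # 0/1 labels, used for accuracy
pos_proba = clf.predict_proba(X_te)[:, 1]    # P(target == 1), used for ROC AUC

demo_accuracy = accuracy_score(y_te, hard_labels)
demo_auc = roc_auc_score(y_te, pos_proba)
print(f"accuracy: {demo_accuracy:.3f}, ROC AUC: {demo_auc:.3f}")
```

For the model trained above, the analogous call would be `roc_auc_score(y_true, model.predict_proba(X_true)[:, 1])`, with `roc_auc_score` imported from `sklearn.metrics`.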

# Test

In [18]:
submission = pd.DataFrame(
    {"ID_code": test_df.index, "target": model.predict_proba(test_df)[:, 1]}
).set_index("ID_code")
submission

ID_code,target
test_0,0.079745
test_1,0.229619
test_2,0.178451
test_3,0.106693
test_4,0.048290
...,...
test_199995,0.061879
test_199996,0.009531
test_199997,0.007123
test_199998,0.087950


In [19]:
# 0.63096
submission.to_csv("../../data/santander-customer-transaction-prediction/submission.csv")
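
Before uploading, it can help to check the submission against the sample file. The sketch below uses a hypothetical helper, `check_submission`, and miniature stand-in frames (`sample_demo`, `submission_demo`) rather than the real files:

```python
import pandas as pd

# Miniature stand-ins for sample_submission_df and submission (assumption).
ids = [f"test_{i}" for i in range(5)]
sample_demo = pd.DataFrame({"ID_code": ids, "target": 0}).set_index("ID_code")
submission_demo = pd.DataFrame(
    {"ID_code": ids, "target": [0.08, 0.23, 0.18, 0.11, 0.05]}
).set_index("ID_code")


def check_submission(sub: pd.DataFrame, sample: pd.DataFrame) -> bool:
    """Hypothetical helper: compare a submission with the sample file."""
    same_ids = sub.index.equals(sample.index)            # same IDs, same order
    same_columns = list(sub.columns) == list(sample.columns)
    no_missing = not sub["target"].isna().any()
    valid_proba = bool(sub["target"].between(0.0, 1.0).all())
    return same_ids and same_columns and no_missing and valid_proba


print(check_submission(submission_demo, sample_demo))
```

In the notebook, the same helper could be called as `check_submission(submission, sample_submission_df)`.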
