# Train More - XGB + NN - Achieve LB Boost!
Previously when we ensembled our 7 KFold XGB with our 7 KFold NN, we achieved LB 0.97730 (see [here][1]). 

We will now attempt to boost their CV and LB! Previously our XGB learning rate was 0.1 for experiments; we will now train it with 0.02 for our final submission. Previously our XGB had 7 folds; we will now train it with 10 folds.

Previously our NN had 7 folds; we will now train it with 10 folds. We will also train the same 10 folds 5 times, each time with a different seed.

Afterward, we will ensemble all these new XGB and all these new NN. We will try to beat CV 0.97630 and LB 0.97730 by training more!

==============================

**NOTE** Versions 1 and 2 of this notebook have a bug where the OOF and PREDS of the multiple NN are not saved correctly. Inside each repeat for-loop, the variables OOF and PREDS were being reset to zero.

[1]: https://www.kaggle.com/code/adilshamim8/bank-term-deposit-prediction
[2]: https://www.kaggle.com/code/cdeotte/xgboost-using-original-data-cv-0-976
[3]: https://www.kaggle.com/code/cdeotte/nn-by-gpt5-cv-0-974-wow
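
The bug in the note above follows a common pattern, so here is a minimal sketch of the corrected loop structure. It is plain Python with made-up predictions, not the notebook's NN code: `fake_model_predict`, `N_TRAIN`, and `N_TEST` are hypothetical names used only for illustration. The point is that the accumulators are created once, before the repeat loop, so every repeat adds its share instead of overwriting the previous one.

```python
import random

N_TRAIN, N_TEST = 12, 5      # toy sizes (assumption, not the real data)
FOLDS, REPEATS = 3, 2

def fake_model_predict(n, seed):
    """Hypothetical stand-in for the predictions of a trained model."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

# Allocate the accumulators ONCE, outside the repeat loop.
oof = [0.0] * N_TRAIN
preds = [0.0] * N_TEST

for r in range(REPEATS):
    # Do NOT reset oof/preds here; resetting inside this loop discards earlier repeats.
    idx = list(range(N_TRAIN))
    random.Random(r).shuffle(idx)
    for f in range(FOLDS):
        val_idx = idx[f::FOLDS]   # every FOLDS-th shuffled row forms one validation fold
        p_val = fake_model_predict(len(val_idx), seed=r * FOLDS + f)
        p_test = fake_model_predict(N_TEST, seed=100 + r * FOLDS + f)
        for i, p in zip(val_idx, p_val):
            oof[i] += p / REPEATS            # each row is validated once per repeat
        for j, p in enumerate(p_test):
            preds[j] += p / FOLDS / REPEATS  # average over all FOLDS * REPEATS models
```

With this layout, `oof` and `preds` end up as averages over all repeats, which is the same `/ REPEATS` and `/ FOLDS / REPEATS` weighting used in the XGB loop later in this notebook.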

# Load Data
We load the train, test, and original datasets. In every Kaggle playground competition, the data is synthetic and is generated from an original dataset. In this competition, the original dataset is [here][1].

[1]: https://www.kaggle.com/datasets/sushant097/bank-marketing-dataset-full

In [1]:
import pandas as pd, numpy as np, os
import cudf

PATH = "/kaggle/input/playground-series-s5e8/"
train = cudf.read_csv(f"{PATH}train.csv").set_index('id')
print("Train shape", train.shape )
train.head()

Train shape (750000, 17)


id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,42,technician,married,secondary,no,7,no,no,cellular,25,aug,117,3,-1,0,unknown,0
1,38,blue-collar,married,secondary,no,514,no,no,unknown,18,jun,185,1,-1,0,unknown,0
2,36,blue-collar,married,secondary,no,602,yes,no,unknown,14,may,111,2,-1,0,unknown,0
3,27,student,single,secondary,no,34,yes,no,unknown,28,may,10,2,-1,0,unknown,0
4,26,technician,married,secondary,no,889,yes,no,cellular,3,feb,902,1,-1,0,unknown,1


In [2]:
test = cudf.read_csv(f"{PATH}test.csv").set_index('id')
# Random placeholder target so that test has the same columns as train
test['y'] = np.random.randint(0, 2, len(test))
print("Test shape", test.shape )
test.head()

Test shape (250000, 17)


id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
750000,32,blue-collar,married,secondary,no,1397,yes,no,unknown,21,may,224,1,-1,0,unknown,0
750001,44,management,married,tertiary,no,23,yes,no,cellular,3,apr,586,2,-1,0,unknown,0
750002,36,self-employed,married,primary,no,46,yes,yes,cellular,13,may,111,2,-1,0,unknown,1
750003,58,blue-collar,married,secondary,no,-1380,yes,yes,unknown,29,may,125,1,-1,0,unknown,1
750004,28,technician,single,secondary,no,1950,yes,no,cellular,22,jul,181,1,-1,0,unknown,1


In [3]:
orig = cudf.read_csv("/kaggle/input/bank-marketing-dataset-full/bank-full.csv",delimiter=";")
orig['y'] = orig.y.map({'yes':1,'no':0})
orig['id'] = (np.arange(len(orig))+1e6).astype('int')
orig = orig.set_index('id')
print("Original data shape", orig.shape )
orig.head()

Original data shape (45211, 17)


id,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
1000000,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,0
1000001,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,0
1000002,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,0
1000003,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,0
1000004,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,0


# EDA (Exploratory Data Analysis)
We now combine all the data and explore the columns and their properties. We observe that there is no missing data and that the categorical columns have low cardinality (12 or fewer unique values). Most numerical columns have few unique values, but two of them, `duration` and `balance`, have around 2k and 8k unique values respectively.

In [4]:
combine = cudf.concat([train,test,orig],axis=0)
print("Combined data shape", combine.shape )

Combined data shape (1045211, 17)


In [5]:
CATS = []
NUMS = []
for c in combine.columns[:-1]:
    t = "CAT"
    if combine[c].dtype=='object':
        CATS.append(c)
    else:
        NUMS.append(c)
        t = "NUM"
    n = combine[c].nunique()
    na = combine[c].isna().sum()
    print(f"[{t}] {c} has {n} unique and {na} NA")
print("CATS:", CATS )
print("NUMS:", NUMS )

[NUM] age has 78 unique and 0 NA
[CAT] job has 12 unique and 0 NA
[CAT] marital has 3 unique and 0 NA
[CAT] education has 4 unique and 0 NA
[CAT] default has 2 unique and 0 NA
[NUM] balance has 8590 unique and 0 NA
[CAT] housing has 2 unique and 0 NA
[CAT] loan has 2 unique and 0 NA
[CAT] contact has 3 unique and 0 NA
[NUM] day has 31 unique and 0 NA
[CAT] month has 12 unique and 0 NA
[NUM] duration has 1824 unique and 0 NA
[NUM] campaign has 52 unique and 0 NA
[NUM] pdays has 628 unique and 0 NA
[NUM] previous has 54 unique and 0 NA
[CAT] poutcome has 4 unique and 0 NA
CATS: ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
NUMS: ['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']


# Feature Engineer (LE - Label Encode)
We will label encode all categorical columns. Also we will make a duplicate of each numerical column and treat the copy as a categorical column.

In [6]:
CATS1 = []
SIZES = {}
for c in NUMS + CATS:
    n = c
    if c in NUMS: 
        n = f"{c}2"
        CATS1.append(n)
    combine[n],_ = combine[c].factorize()
    SIZES[n] = combine[n].max()+1

    combine[c] = combine[c].astype('int32')
    combine[n] = combine[n].astype('int32')

print("New CATS:", CATS1 )
print("Cardinality of all CATS:", SIZES )

New CATS: ['age2', 'balance2', 'day2', 'duration2', 'campaign2', 'pdays2', 'previous2']
Cardinality of all CATS: {'age2': 78, 'balance2': 8590, 'day2': 31, 'duration2': 1824, 'campaign2': 52, 'pdays2': 628, 'previous2': 54, 'job': 12, 'marital': 3, 'education': 4, 'default': 2, 'housing': 2, 'loan': 2, 'contact': 3, 'month': 12, 'poutcome': 4}
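
To make the encoding concrete, here is a small library-free sketch of what label encoding does. The helper `factorize` below is a hypothetical plain-Python stand-in written for this illustration; it assigns codes in first-seen order, and the exact code given to each value may differ in cudf. Only the property that codes run from `0` to `N-1` matters for the next steps. The toy values are taken from the train preview above.

```python
def factorize(values):
    """Map each distinct value to an integer code in 0..N-1 (first-seen order)."""
    codes, mapping = [], {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
        codes.append(mapping[v])
    return codes, list(mapping)

job = ["technician", "blue-collar", "blue-collar", "student", "technician"]
age = [42, 38, 36, 27, 38]

job_codes, job_uniques = factorize(job)   # categorical column, encoded in place
age2_codes, age_uniques = factorize(age)  # numeric column, encoded as the copy "age2"

# Cardinality is the largest code plus one, as in the SIZES dictionary above.
SIZES = {"job": max(job_codes) + 1, "age2": max(age2_codes) + 1}
print(job_codes)   # [0, 1, 1, 2, 0]
print(age2_codes)  # [0, 1, 2, 3, 1]
print(SIZES)       # {'job': 3, 'age2': 4}
```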


# Feature Engineer (Combine Column Pairs)
We will create a new categorical column from every pair of existing categorical columns. The original categorical columns have been label encoded into integers from `0 to N-1` each. Therefore we can create a new column with unique integers using the formula `new_cols[name] = combine[c1] * SIZES[c2] + combine[c2]`.

In [7]:
from itertools import combinations

pairs = combinations(CATS + CATS1, 2)
new_cols = {}
CATS2 = []

for c1, c2 in pairs:
    name = "_".join(sorted((c1, c2)))
    new_cols[name] = combine[c1] * SIZES[c2] + combine[c2]
    CATS2.append(name)
if new_cols:
    new_df = cudf.DataFrame(new_cols)         
    combine = cudf.concat([combine, new_df], axis=1) 

print(f"Created {len(CATS2)} new CAT columns")

Created 120 new CAT columns
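
As a quick sanity check of the pair formula, the sketch below enumerates every `(marital, education)` code pair using the cardinalities printed earlier (3 and 4) and confirms that `c1 * SIZES[c2] + c2` never collides. It is a plain-Python illustration, not part of the pipeline. It also confirms the column count: 16 categorical columns give 120 unordered pairs.

```python
from itertools import product
from math import comb

SIZES = {"marital": 3, "education": 4}  # cardinalities from the label-encoding output

codes = {}
for m, e in product(range(SIZES["marital"]), range(SIZES["education"])):
    # Same formula as the notebook: c1 * SIZES[c2] + c2
    codes[(m, e)] = m * SIZES["education"] + e

# Every (marital, education) pair gets its own integer, with no collisions.
assert len(set(codes.values())) == SIZES["marital"] * SIZES["education"]

# The pair can be recovered from the combined code with divmod.
m, e = divmod(codes[(2, 1)], SIZES["education"])
print(codes[(2, 1)], (m, e))  # 9 (2, 1)

# 9 original CATS + 7 numeric copies = 16 columns, hence 120 pairs.
print(comb(16, 2))  # 120
```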


# Feature Engineer (CE - Count Encoding)
We now have 136 categorical columns. We will count encode each of them and create 136 new columns.

In [8]:
CE = []
CC = CATS+CATS1+CATS2
combine['i'] = np.arange( len(combine) )

print(f"Processing {len(CC)} columns... ",end="")
for i,c in enumerate(CC):
    if i%10==0: print(f"{i}, ",end="")
    tmp = combine.groupby(c).y.count()
    tmp = tmp.astype('int32')
    tmp.name = f"CE_{c}"
    CE.append( f"CE_{c}" )
    combine = combine.merge(tmp, on=c, how='left')
combine = combine.sort_values('i')
print()

Processing 136 columns... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 
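
For readers new to count encoding, here is a minimal plain-Python sketch of the idea on a toy column. `Counter` plays the role of `groupby(c).y.count()` and the list comprehension plays the role of the left merge back onto the rows. The values are made up for illustration.

```python
from collections import Counter

job = ["technician", "blue-collar", "blue-collar", "student", "blue-collar"]

counts = Counter(job)               # how often each category occurs
CE_job = [counts[v] for v in job]   # attach that count to every row

print(CE_job)  # [1, 3, 3, 1, 3]
```

In the notebook the helper column `i` is sorted on after the merges, presumably because a cudf merge does not guarantee that the original row order is preserved.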


In [9]:
train = combine.iloc[:len(train)]
test = combine.iloc[len(train):len(train)+len(test)]
orig = combine.iloc[-len(orig):]
del combine
print("Train shape", train.shape,"Test shape", test.shape,"Original shape", orig.shape )

Train shape (750000, 281) Test shape (250000, 281) Original shape (45211, 281)


# Feature Engineering (TE - Original Data as Cols)
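Below is a technique to add the original data as new columns. For each categorical column, we compute the mean target per category in the original dataset and merge it onto train and test as a new `TE_ORIG_*` column. Categories that never appear in the original data are left as NaN for now.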

In [10]:
TE = []
CC = CATS+CATS1+CATS2

#mn = orig.y.mean() # WE FILL NAN AFTER XGB
print(f"Processing {len(CC)} columns... ",end="")
for i,c in enumerate(CC):
    if i%10==0: print(f"{i}, ",end="")
    tmp = orig.groupby(c).y.mean()
    tmp = tmp.astype('float32')
    NAME = f"TE_ORIG_{c}"
    tmp.name = NAME
    TE.append( NAME )
    train = train.merge(tmp, on=c, how='left')
    #train[NAME] = train[NAME].fillna(mn) # WE FILL NAN AFTER XGB
    test = test.merge(tmp, on=c, how='left')
    #test[NAME] = test[NAME].fillna(mn) # WE FILL NAN AFTER XGB
train = train.sort_values('i')
test = test.sort_values('i')
print()

Processing 136 columns... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 
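
The idea can be sketched in a few lines of plain Python. The data below is made up, and `None` stands in for the NaN that the left merge produces in the notebook when a category is absent from the original data.

```python
from collections import defaultdict

# Toy "original" data as (category, target) pairs (made-up values)
orig = [("may", 0), ("may", 0), ("may", 1), ("aug", 1), ("aug", 0)]
train_month = ["may", "aug", "dec"]   # "dec" never appears in the toy original data

sums, cnts = defaultdict(float), defaultdict(int)
for cat, y in orig:
    sums[cat] += y
    cnts[cat] += 1
te = {cat: sums[cat] / cnts[cat] for cat in cnts}   # mean target per category

# Left-merge semantics: categories unseen in orig get None (NaN in the notebook).
TE_ORIG_month = [te.get(v) for v in train_month]
print(TE_ORIG_month)  # [0.3333333333333333, 0.5, None]
```

Because the statistics come only from the original dataset, these columns do not use any train labels.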


# Train More - Train XGB w/ Original Data as Rows
Previously our XGB learning rate was 0.1 for experiments; we will now train it with 0.02 for our final submission. Previously our XGB had 7 folds; we will now train it with 10 folds, and we repeat the 10 folds twice with different XGB seeds.

In [11]:
from cuml.preprocessing import TargetEncoder
from sklearn.model_selection import KFold
import xgboost as xgb

print(f"XGBoost version {xgb.__version__}")

FEATURES = NUMS+CATS+CATS1+CATS2+CE
print(f"We have {len(FEATURES)} features.")

FOLDS = 10
SEED = 42

params = {
    "objective": "binary:logistic",  
    "eval_metric": "auc",           
    "learning_rate": 0.02,
    "max_depth": 0,
    "subsample": 0.8,
    "colsample_bytree": 0.7,
    "seed": SEED,
    "device": "cuda",
    "grow_policy": "lossguide", 
    "max_leaves": 32,          
    "alpha": 2.0,
}

class IterLoadForDMatrix(xgb.core.DataIter):
    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.it = 0 
        self.batch_size = batch_size
        self.batches = int( np.ceil( len(df) / self.batch_size ) )
        super().__init__()

    def reset(self):
        '''Reset the iterator'''
        self.it = 0

    def next(self, input_data):
        '''Yield next batch of data.'''
        if self.it == self.batches:
            return 0 # Return 0 when there's no more batch.
        
        a = self.it * self.batch_size
        b = min( (self.it + 1) * self.batch_size, len(self.df) )
        #dt = cudf.DataFrame(self.df.iloc[a:b])
        dt = self.df.iloc[a:b]
        input_data(data=dt[self.features], label=dt[self.target]) 
        self.it += 1
        return 1

oof_preds = np.zeros(len(train))
test_preds = np.zeros(len(test))

REPEATS = 2
for kk in range(REPEATS):
    kf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
    for fold, (train_idx, val_idx) in enumerate(kf.split(train)):
        print("#"*25)
        print(f"### REPEAT {kk+1}, Fold {fold+1} ###")
        print("#"*25)
    
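        #Xy_train = train.iloc[train_idx][ FEATURES+['y'] ].copy()
        # NOTE: all train rows are used here, including the val_idx rows, so the
        # valid-auc and OOF scores below are optimistic; the line above would
        # restrict training to train_idx.
        Xy_train = train[ FEATURES+['y'] ].copy()
        Xy_more = orig[ FEATURES+['y'] ]
        # Append the original data once as extra training rows
        for k in range(1):
            Xy_train = cudf.concat([Xy_train,Xy_more],axis=0,ignore_index=True)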
        
        X_valid = train.iloc[val_idx][FEATURES].copy()
        y_valid = train.iloc[val_idx]['y']
        X_test = test[FEATURES].copy()
    
        CC = CATS1+CATS2
        print(f"Target encoding {len(CC)} features... ",end="")
        for i,c in enumerate(CC):
            if i%10==0: print(f"{i}, ",end="")
            TE0 = TargetEncoder(n_folds=10, smooth=0, split_method='random', stat='mean')
            Xy_train[c] = TE0.fit_transform(Xy_train[c],Xy_train['y']).astype('float32')
            X_valid[c] = TE0.transform(X_valid[c]).astype('float32')
            X_test[c] = TE0.transform(X_test[c]).astype('float32')
        print()
    
        Xy_train[CATS] = Xy_train[CATS].astype('category')
        X_valid[CATS] = X_valid[CATS].astype('category')
        X_test[CATS] = X_test[CATS].astype('category')
    
        Xy_train = IterLoadForDMatrix(Xy_train, FEATURES, 'y')
        dtrain = xgb.QuantileDMatrix(Xy_train, enable_categorical=True, max_bin=256)
        dval   = xgb.DMatrix(X_valid, label=y_valid, enable_categorical=True)
        dtest  = xgb.DMatrix(X_test, enable_categorical=True)
    
        params['seed'] = kk*FOLDS + fold
        model = xgb.train(
            params=params,
            dtrain=dtrain,
            num_boost_round=4_800, 
            evals=[(dtrain, "train"), (dval, "valid")],
            #early_stopping_rounds=250,
            verbose_eval=500
        )
    
        oof_preds[val_idx] += model.predict(dval) / REPEATS
        test_preds += model.predict(dtest) / FOLDS / REPEATS

XGBoost version 2.0.3
We have 279 features.
#########################
### REPEAT 1, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94654	valid-auc:0.95365
[500]	train-auc:0.97530	valid-auc:0.98574
[1000]	train-auc:0.97755	valid-auc:0.98722
[1500]	train-auc:0.97923	valid-auc:0.98812
[2000]	train-auc:0.98070	valid-auc:0.98883
[2500]	train-auc:0.98203	valid-auc:0.98935
[3000]	train-auc:0.98324	valid-auc:0.98976
[3500]	train-auc:0.98437	valid-auc:0.99021
[4000]	train-auc:0.98542	valid-auc:0.99064
[4500]	train-auc:0.98640	valid-auc:0.99099
[4799]	train-auc:0.98697	valid-auc:0.99123
#########################
### REPEAT 1, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94871	valid-auc:0.95648
[500]	train-auc:0.97528	valid-auc:0.98598
[1000]	train-auc:0.97752	valid-auc:0.98745
[1500]	train-auc:0.97919	valid-auc:0.98823
[2000]	train-auc:0.98065	valid-auc:0.98890
[2500]	train-auc:0.98198	valid-auc:0.98943
[3000]	train-auc:0.98321	valid-auc:0.98993
[3500]	train-auc:0.98435	valid-auc:0.99037
[4000]	train-auc:0.98542	valid-auc:0.99080
[4500]	train-auc:0.98640	valid-auc:0.99117
[4799]	train-auc:0.98697	valid-auc:0.99139
#########################
### REPEAT 1, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94824	valid-auc:0.95607
[500]	train-auc:0.97525	valid-auc:0.98582
[1000]	train-auc:0.97752	valid-auc:0.98735
[1500]	train-auc:0.97921	valid-auc:0.98816
[2000]	train-auc:0.98067	valid-auc:0.98879
[2500]	train-auc:0.98198	valid-auc:0.98936
[3000]	train-auc:0.98319	valid-auc:0.98984
[3500]	train-auc:0.98432	valid-auc:0.99030
[4000]	train-auc:0.98540	valid-auc:0.99070
[4500]	train-auc:0.98639	valid-auc:0.99106
[4799]	train-auc:0.98695	valid-auc:0.99127
#########################
### REPEAT 1, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95528	valid-auc:0.96074
[500]	train-auc:0.97527	valid-auc:0.98532
[1000]	train-auc:0.97754	valid-auc:0.98689
[1500]	train-auc:0.97921	valid-auc:0.98768
[2000]	train-auc:0.98066	valid-auc:0.98837
[2500]	train-auc:0.98200	valid-auc:0.98893
[3000]	train-auc:0.98322	valid-auc:0.98940
[3500]	train-auc:0.98435	valid-auc:0.98986
[4000]	train-auc:0.98539	valid-auc:0.99026
[4500]	train-auc:0.98638	valid-auc:0.99067
[4799]	train-auc:0.98693	valid-auc:0.99089
#########################
### REPEAT 1, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94751	valid-auc:0.95675
[500]	train-auc:0.97528	valid-auc:0.98551
[1000]	train-auc:0.97753	valid-auc:0.98711
[1500]	train-auc:0.97921	valid-auc:0.98797
[2000]	train-auc:0.98068	valid-auc:0.98859
[2500]	train-auc:0.98200	valid-auc:0.98915
[3000]	train-auc:0.98321	valid-auc:0.98962
[3500]	train-auc:0.98435	valid-auc:0.99010
[4000]	train-auc:0.98540	valid-auc:0.99048
[4500]	train-auc:0.98638	valid-auc:0.99084
[4799]	train-auc:0.98694	valid-auc:0.99105
#########################
### REPEAT 1, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95079	valid-auc:0.96164
[500]	train-auc:0.97523	valid-auc:0.98601
[1000]	train-auc:0.97751	valid-auc:0.98752
[1500]	train-auc:0.97920	valid-auc:0.98840
[2000]	train-auc:0.98065	valid-auc:0.98901
[2500]	train-auc:0.98198	valid-auc:0.98956
[3000]	train-auc:0.98320	valid-auc:0.99004
[3500]	train-auc:0.98432	valid-auc:0.99042
[4000]	train-auc:0.98537	valid-auc:0.99080
[4500]	train-auc:0.98636	valid-auc:0.99122
[4799]	train-auc:0.98693	valid-auc:0.99141
#########################
### REPEAT 1, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94771	valid-auc:0.95913
[500]	train-auc:0.97530	valid-auc:0.98654
[1000]	train-auc:0.97756	valid-auc:0.98790
[1500]	train-auc:0.97923	valid-auc:0.98863
[2000]	train-auc:0.98068	valid-auc:0.98922
[2500]	train-auc:0.98201	valid-auc:0.98973
[3000]	train-auc:0.98323	valid-auc:0.99024
[3500]	train-auc:0.98438	valid-auc:0.99067
[4000]	train-auc:0.98543	valid-auc:0.99108
[4500]	train-auc:0.98643	valid-auc:0.99145
[4799]	train-auc:0.98699	valid-auc:0.99166
#########################
### REPEAT 1, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94730	valid-auc:0.95533
[500]	train-auc:0.97526	valid-auc:0.98572
[1000]	train-auc:0.97750	valid-auc:0.98717
[1500]	train-auc:0.97918	valid-auc:0.98800
[2000]	train-auc:0.98064	valid-auc:0.98861
[2500]	train-auc:0.98197	valid-auc:0.98914
[3000]	train-auc:0.98318	valid-auc:0.98964
[3500]	train-auc:0.98431	valid-auc:0.99006
[4000]	train-auc:0.98537	valid-auc:0.99050
[4500]	train-auc:0.98638	valid-auc:0.99089
[4799]	train-auc:0.98694	valid-auc:0.99110
#########################
### REPEAT 1, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95304	valid-auc:0.95918
[500]	train-auc:0.97528	valid-auc:0.98627
[1000]	train-auc:0.97754	valid-auc:0.98772
[1500]	train-auc:0.97924	valid-auc:0.98856
[2000]	train-auc:0.98069	valid-auc:0.98915
[2500]	train-auc:0.98202	valid-auc:0.98968
[3000]	train-auc:0.98323	valid-auc:0.99013
[3500]	train-auc:0.98436	valid-auc:0.99054
[4000]	train-auc:0.98543	valid-auc:0.99093
[4500]	train-auc:0.98640	valid-auc:0.99129
[4799]	train-auc:0.98697	valid-auc:0.99151
#########################
### REPEAT 1, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94817	valid-auc:0.95843
[500]	train-auc:0.97525	valid-auc:0.98694
[1000]	train-auc:0.97753	valid-auc:0.98836
[1500]	train-auc:0.97919	valid-auc:0.98915
[2000]	train-auc:0.98066	valid-auc:0.98976
[2500]	train-auc:0.98197	valid-auc:0.99024
[3000]	train-auc:0.98321	valid-auc:0.99072
[3500]	train-auc:0.98435	valid-auc:0.99114
[4000]	train-auc:0.98541	valid-auc:0.99151
[4500]	train-auc:0.98640	valid-auc:0.99184
[4799]	train-auc:0.98696	valid-auc:0.99208
#########################
### REPEAT 2, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94769	valid-auc:0.95457
[500]	train-auc:0.97527	valid-auc:0.98582
[1000]	train-auc:0.97756	valid-auc:0.98733
[1500]	train-auc:0.97921	valid-auc:0.98815
[2000]	train-auc:0.98067	valid-auc:0.98876
[2500]	train-auc:0.98201	valid-auc:0.98929
[3000]	train-auc:0.98322	valid-auc:0.98979
[3500]	train-auc:0.98436	valid-auc:0.99024
[4000]	train-auc:0.98541	valid-auc:0.99067
[4500]	train-auc:0.98640	valid-auc:0.99103
[4799]	train-auc:0.98697	valid-auc:0.99128
#########################
### REPEAT 2, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95526	valid-auc:0.96194
[500]	train-auc:0.97527	valid-auc:0.98599
[1000]	train-auc:0.97753	valid-auc:0.98744
[1500]	train-auc:0.97922	valid-auc:0.98828
[2000]	train-auc:0.98068	valid-auc:0.98891
[2500]	train-auc:0.98200	valid-auc:0.98945
[3000]	train-auc:0.98322	valid-auc:0.98991
[3500]	train-auc:0.98435	valid-auc:0.99034
[4000]	train-auc:0.98539	valid-auc:0.99075
[4500]	train-auc:0.98639	valid-auc:0.99114
[4799]	train-auc:0.98694	valid-auc:0.99137
#########################
### REPEAT 2, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94840	valid-auc:0.95749
[500]	train-auc:0.97526	valid-auc:0.98588
[1000]	train-auc:0.97752	valid-auc:0.98739
[1500]	train-auc:0.97920	valid-auc:0.98821
[2000]	train-auc:0.98066	valid-auc:0.98881
[2500]	train-auc:0.98198	valid-auc:0.98936
[3000]	train-auc:0.98320	valid-auc:0.98981
[3500]	train-auc:0.98433	valid-auc:0.99024
[4000]	train-auc:0.98540	valid-auc:0.99066
[4500]	train-auc:0.98638	valid-auc:0.99102
[4799]	train-auc:0.98694	valid-auc:0.99123
#########################
### REPEAT 2, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95077	valid-auc:0.96058
[500]	train-auc:0.97527	valid-auc:0.98522
[1000]	train-auc:0.97752	valid-auc:0.98673
[1500]	train-auc:0.97920	valid-auc:0.98760
[2000]	train-auc:0.98067	valid-auc:0.98825
[2500]	train-auc:0.98199	valid-auc:0.98884
[3000]	train-auc:0.98320	valid-auc:0.98934
[3500]	train-auc:0.98432	valid-auc:0.98977
[4000]	train-auc:0.98538	valid-auc:0.99018
[4500]	train-auc:0.98637	valid-auc:0.99058
[4799]	train-auc:0.98693	valid-auc:0.99081
#########################
### REPEAT 2, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94535	valid-auc:0.95573
[500]	train-auc:0.97521	valid-auc:0.98544
[1000]	train-auc:0.97750	valid-auc:0.98692
[1500]	train-auc:0.97918	valid-auc:0.98779
[2000]	train-auc:0.98064	valid-auc:0.98843
[2500]	train-auc:0.98195	valid-auc:0.98898
[3000]	train-auc:0.98317	valid-auc:0.98943
[3500]	train-auc:0.98431	valid-auc:0.98989
[4000]	train-auc:0.98538	valid-auc:0.99031
[4500]	train-auc:0.98637	valid-auc:0.99070
[4799]	train-auc:0.98693	valid-auc:0.99092
#########################
### REPEAT 2, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95474	valid-auc:0.96226
[500]	train-auc:0.97526	valid-auc:0.98601
[1000]	train-auc:0.97754	valid-auc:0.98752
[1500]	train-auc:0.97921	valid-auc:0.98836
[2000]	train-auc:0.98068	valid-auc:0.98898
[2500]	train-auc:0.98200	valid-auc:0.98951
[3000]	train-auc:0.98321	valid-auc:0.98997
[3500]	train-auc:0.98434	valid-auc:0.99043
[4000]	train-auc:0.98540	valid-auc:0.99079
[4500]	train-auc:0.98639	valid-auc:0.99116
[4799]	train-auc:0.98695	valid-auc:0.99135
#########################
### REPEAT 2, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.94801	valid-auc:0.95952
[500]	train-auc:0.97522	valid-auc:0.98656
[1000]	train-auc:0.97753	valid-auc:0.98788
[1500]	train-auc:0.97920	valid-auc:0.98866
[2000]	train-auc:0.98066	valid-auc:0.98927
[2500]	train-auc:0.98199	valid-auc:0.98979
[3000]	train-auc:0.98320	valid-auc:0.99024
[3500]	train-auc:0.98434	valid-auc:0.99065
[4000]	train-auc:0.98539	valid-auc:0.99104
[4500]	train-auc:0.98640	valid-auc:0.99140
[4799]	train-auc:0.98695	valid-auc:0.99161
#########################
### REPEAT 2, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 


[0]	train-auc:0.95261	valid-auc:0.95888
[500]	train-auc:0.97526	valid-auc:0.98574
[1000]	train-auc:0.97752	valid-auc:0.98717
[1500]	train-auc:0.97920	valid-auc:0.98802
[2000]	train-auc:0.98066	valid-auc:0.98861
[2500]	train-auc:0.98198	valid-auc:0.98914
[3000]	train-auc:0.98319	valid-auc:0.98961
[3500]	train-auc:0.98432	valid-auc:0.99006
[4000]	train-auc:0.98539	valid-auc:0.99043
[4500]	train-auc:0.98639	valid-auc:0.99086
[4799]	train-auc:0.98695	valid-auc:0.99107
#########################
### REPEAT 2, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95195	valid-auc:0.95798
[500]	train-auc:0.97527	valid-auc:0.98615
[1000]	train-auc:0.97752	valid-auc:0.98766
[1500]	train-auc:0.97919	valid-auc:0.98852
[2000]	train-auc:0.98066	valid-auc:0.98917
[2500]	train-auc:0.98198	valid-auc:0.98968
[3000]	train-auc:0.98321	valid-auc:0.99011
[3500]	train-auc:0.98433	valid-auc:0.99055
[4000]	train-auc:0.98538	valid-auc:0.99091
[4500]	train-auc:0.98638	valid-auc:0.99128
[4799]	train-auc:0.98693	valid-auc:0.99150
#########################
### REPEAT 2, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94775	valid-auc:0.95848
[500]	train-auc:0.97524	valid-auc:0.98696
[1000]	train-auc:0.97751	valid-auc:0.98839
[1500]	train-auc:0.97920	valid-auc:0.98920
[2000]	train-auc:0.98064	valid-auc:0.98976
[2500]	train-auc:0.98197	valid-auc:0.99026
[3000]	train-auc:0.98318	valid-auc:0.99070
[3500]	train-auc:0.98432	valid-auc:0.99108
[4000]	train-auc:0.98537	valid-auc:0.99142
[4500]	train-auc:0.98636	valid-auc:0.99177
[4799]	train-auc:0.98693	valid-auc:0.99198


In [12]:
from sklearn.metrics import roc_auc_score

m = roc_auc_score(train.y.to_numpy(), oof_preds)
print(f"XGB (Train More) with Original Data as rows CV = {m}")

XGB (Train More) with Original Data as rows CV = 0.9913863313933691


# Train More - Train XGB w/ Original Data as Cols
Previously our XGB learning rate was 0.1 for experiments; we now train with 0.02 for our final submission. Previously our XGB had 7 folds; we now train with 10 folds. Each set of 10 folds is repeated `REPEATS` times with a different XGB seed, and the predictions are averaged.
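To make the averaging in the next cell concrete, here is a minimal sketch of how out-of-fold (OOF) and test predictions are accumulated over repeats and folds. It uses only the standard library; `kfold_indices`, `dummy_predict`, and the toy sizes are hypothetical stand-ins, not the notebook's actual model or data.

```python
import random

FOLDS, REPEATS = 10, 2      # values used in this notebook
N_TRAIN, N_TEST = 50, 20    # toy sizes, assumed for illustration

def kfold_indices(n, k, seed):
    """Shuffle row indices once and deal them into k validation folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def dummy_predict(n_rows):
    """Hypothetical stand-in for model.predict(); always returns 1.0."""
    return [1.0] * n_rows

oof = [0.0] * N_TRAIN
test_pred = [0.0] * N_TEST

for rep in range(REPEATS):
    # Same split seed every repeat -> identical folds, as in the notebook.
    for val_idx in kfold_indices(N_TRAIN, FOLDS, seed=42):
        # Each train row is validated once per repeat: divide by REPEATS.
        for i, p in zip(val_idx, dummy_predict(len(val_idx))):
            oof[i] += p / REPEATS
        # Every model predicts the whole test set: divide by FOLDS * REPEATS.
        for j, p in enumerate(dummy_predict(N_TEST)):
            test_pred[j] += p / FOLDS / REPEATS
```

Because the stand-in model always predicts 1.0, both `oof` and `test_pred` end up at 1.0 (up to floating-point rounding), which shows that the two divisors produce a plain average.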

In [13]:
FEATURES += TE
print(f"We have {len(FEATURES)} features.")

oof_preds2 = np.zeros(len(train))
test_preds2 = np.zeros(len(test))

REPEATS = 2
for kk in range(REPEATS):
    kf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
    for fold, (train_idx, val_idx) in enumerate(kf.split(train)):
        print("#"*25)
        print(f"### REPEAT {kk+1}, Fold {fold+1} ###")
        print("#"*25)
    
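        # NOTE: the model is fit on ALL train rows, including val_idx, so the
        # valid-auc and CV reported below are not true out-of-fold scores.
        # Switch to the commented line for a leak-free CV estimate.
        #Xy_train = train.iloc[train_idx][ FEATURES+['y'] ].copy()
        Xy_train = train[ FEATURES+['y'] ].copy()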
        X_valid = train.iloc[val_idx][FEATURES].copy()
        y_valid = train.iloc[val_idx]['y']
        X_test = test[FEATURES].copy()
    
        CC = CATS1+CATS2
        print(f"Target encoding {len(CC)} features... ",end="")
        for i,c in enumerate(CC):
            if i%10==0: print(f"{i}, ",end="")
            TE0 = TargetEncoder(n_folds=10, smooth=0, split_method='random', stat='mean')
            Xy_train[c] = TE0.fit_transform(Xy_train[c],Xy_train['y']).astype('float32')
            X_valid[c] = TE0.transform(X_valid[c]).astype('float32')
            X_test[c] = TE0.transform(X_test[c]).astype('float32')
        print()
    
        Xy_train[CATS] = Xy_train[CATS].astype('category')
        X_valid[CATS] = X_valid[CATS].astype('category')
        X_test[CATS] = X_test[CATS].astype('category')
    
        Xy_train = IterLoadForDMatrix(Xy_train, FEATURES, 'y')
        dtrain = xgb.QuantileDMatrix(Xy_train, enable_categorical=True, max_bin=256)
        dval   = xgb.DMatrix(X_valid, label=y_valid, enable_categorical=True)
        dtest  = xgb.DMatrix(X_test, enable_categorical=True)
    
        params['seed'] = kk*FOLDS + fold 
        model = xgb.train(
            params=params,
            dtrain=dtrain,
            num_boost_round=4_200, 
            evals=[(dtrain, "train"), (dval, "valid")],
            #early_stopping_rounds=250,
            verbose_eval=500
        )
    
        oof_preds2[val_idx] += model.predict(dval) / REPEATS
        test_preds2 += model.predict(dtest) / FOLDS / REPEATS

# FILL NAN FOR NN BELOW (we skipped this above for xgb)
CC = CATS+CATS1+CATS2
mn = orig.y.mean()
for i,c in enumerate(CC):
    NAME = f"TE_ORIG_{c}"
    train[NAME] = train[NAME].fillna(mn)
    test[NAME] = test[NAME].fillna(mn)

We have 415 features.
#########################
### REPEAT 1, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95284	valid-auc:0.95584
[500]	train-auc:0.97691	valid-auc:0.98416
[1000]	train-auc:0.97925	valid-auc:0.98590
[1500]	train-auc:0.98100	valid-auc:0.98700
[2000]	train-auc:0.98252	valid-auc:0.98792
[2500]	train-auc:0.98389	valid-auc:0.98867
[3000]	train-auc:0.98516	valid-auc:0.98931
[3500]	train-auc:0.98632	valid-auc:0.98991
[4000]	train-auc:0.98741	valid-auc:0.99044
[4199]	train-auc:0.98781	valid-auc:0.99064
#########################
### REPEAT 1, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95063	valid-auc:0.95748
[500]	train-auc:0.97689	valid-auc:0.98430
[1000]	train-auc:0.97927	valid-auc:0.98610
[1500]	train-auc:0.98098	valid-auc:0.98716
[2000]	train-auc:0.98249	valid-auc:0.98805
[2500]	train-auc:0.98387	valid-auc:0.98887
[3000]	train-auc:0.98513	valid-auc:0.98950
[3500]	train-auc:0.98631	valid-auc:0.99009
[4000]	train-auc:0.98739	valid-auc:0.99065
[4199]	train-auc:0.98781	valid-auc:0.99084
#########################
### REPEAT 1, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95073	valid-auc:0.95598
[500]	train-auc:0.97688	valid-auc:0.98423
[1000]	train-auc:0.97924	valid-auc:0.98603
[1500]	train-auc:0.98098	valid-auc:0.98710
[2000]	train-auc:0.98248	valid-auc:0.98795
[2500]	train-auc:0.98384	valid-auc:0.98869
[3000]	train-auc:0.98511	valid-auc:0.98933
[3500]	train-auc:0.98627	valid-auc:0.98995
[4000]	train-auc:0.98736	valid-auc:0.99050
[4199]	train-auc:0.98777	valid-auc:0.99070
#########################
### REPEAT 1, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95274	valid-auc:0.95692
[500]	train-auc:0.97686	valid-auc:0.98345
[1000]	train-auc:0.97923	valid-auc:0.98523
[1500]	train-auc:0.98096	valid-auc:0.98637
[2000]	train-auc:0.98248	valid-auc:0.98727
[2500]	train-auc:0.98385	valid-auc:0.98804
[3000]	train-auc:0.98512	valid-auc:0.98871
[3500]	train-auc:0.98628	valid-auc:0.98934
[4000]	train-auc:0.98734	valid-auc:0.98989
[4199]	train-auc:0.98776	valid-auc:0.99009
#########################
### REPEAT 1, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95018	valid-auc:0.95719
[500]	train-auc:0.97684	valid-auc:0.98384
[1000]	train-auc:0.97921	valid-auc:0.98563
[1500]	train-auc:0.98094	valid-auc:0.98672
[2000]	train-auc:0.98246	valid-auc:0.98763
[2500]	train-auc:0.98387	valid-auc:0.98838
[3000]	train-auc:0.98515	valid-auc:0.98906
[3500]	train-auc:0.98632	valid-auc:0.98964
[4000]	train-auc:0.98741	valid-auc:0.99020
[4199]	train-auc:0.98782	valid-auc:0.99042
#########################
### REPEAT 1, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95195	valid-auc:0.95628
[500]	train-auc:0.97686	valid-auc:0.98431
[1000]	train-auc:0.97926	valid-auc:0.98610
[1500]	train-auc:0.98100	valid-auc:0.98718
[2000]	train-auc:0.98252	valid-auc:0.98804
[2500]	train-auc:0.98389	valid-auc:0.98878
[3000]	train-auc:0.98516	valid-auc:0.98943
[3500]	train-auc:0.98633	valid-auc:0.99001
[4000]	train-auc:0.98741	valid-auc:0.99058
[4199]	train-auc:0.98782	valid-auc:0.99079
#########################
### REPEAT 1, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95337	valid-auc:0.95881
[500]	train-auc:0.97688	valid-auc:0.98500
[1000]	train-auc:0.97925	valid-auc:0.98659
[1500]	train-auc:0.98098	valid-auc:0.98756
[2000]	train-auc:0.98250	valid-auc:0.98840
[2500]	train-auc:0.98387	valid-auc:0.98910
[3000]	train-auc:0.98514	valid-auc:0.98974
[3500]	train-auc:0.98632	valid-auc:0.99034
[4000]	train-auc:0.98740	valid-auc:0.99086
[4199]	train-auc:0.98782	valid-auc:0.99105
#########################
### REPEAT 1, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95634	valid-auc:0.96086
[500]	train-auc:0.97687	valid-auc:0.98421
[1000]	train-auc:0.97922	valid-auc:0.98596
[1500]	train-auc:0.98096	valid-auc:0.98700
[2000]	train-auc:0.98249	valid-auc:0.98785
[2500]	train-auc:0.98389	valid-auc:0.98857
[3000]	train-auc:0.98514	valid-auc:0.98923
[3500]	train-auc:0.98632	valid-auc:0.98981
[4000]	train-auc:0.98739	valid-auc:0.99034
[4199]	train-auc:0.98780	valid-auc:0.99056
#########################
### REPEAT 1, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95472	valid-auc:0.95837
[500]	train-auc:0.97686	valid-auc:0.98454
[1000]	train-auc:0.97922	valid-auc:0.98628
[1500]	train-auc:0.98096	valid-auc:0.98741
[2000]	train-auc:0.98248	valid-auc:0.98826
[2500]	train-auc:0.98385	valid-auc:0.98895
[3000]	train-auc:0.98509	valid-auc:0.98960
[3500]	train-auc:0.98628	valid-auc:0.99018
[4000]	train-auc:0.98738	valid-auc:0.99071
[4199]	train-auc:0.98779	valid-auc:0.99091
#########################
### REPEAT 1, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95360	valid-auc:0.95897
[500]	train-auc:0.97684	valid-auc:0.98537
[1000]	train-auc:0.97921	valid-auc:0.98700
[1500]	train-auc:0.98096	valid-auc:0.98806
[2000]	train-auc:0.98248	valid-auc:0.98887
[2500]	train-auc:0.98384	valid-auc:0.98960
[3000]	train-auc:0.98513	valid-auc:0.99024
[3500]	train-auc:0.98629	valid-auc:0.99082
[4000]	train-auc:0.98738	valid-auc:0.99133
[4199]	train-auc:0.98779	valid-auc:0.99153
#########################
### REPEAT 2, Fold 1 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95072	valid-auc:0.95549
[500]	train-auc:0.97689	valid-auc:0.98410
[1000]	train-auc:0.97925	valid-auc:0.98589
[1500]	train-auc:0.98097	valid-auc:0.98700
[2000]	train-auc:0.98248	valid-auc:0.98784
[2500]	train-auc:0.98387	valid-auc:0.98866
[3000]	train-auc:0.98513	valid-auc:0.98928
[3500]	train-auc:0.98630	valid-auc:0.98990
[4000]	train-auc:0.98739	valid-auc:0.99046
[4199]	train-auc:0.98780	valid-auc:0.99068
#########################
### REPEAT 2, Fold 2 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95279	valid-auc:0.95619
[500]	train-auc:0.97686	valid-auc:0.98431
[1000]	train-auc:0.97924	valid-auc:0.98611
[1500]	train-auc:0.98100	valid-auc:0.98725
[2000]	train-auc:0.98252	valid-auc:0.98810
[2500]	train-auc:0.98389	valid-auc:0.98884
[3000]	train-auc:0.98515	valid-auc:0.98949
[3500]	train-auc:0.98632	valid-auc:0.99008
[4000]	train-auc:0.98739	valid-auc:0.99062
[4199]	train-auc:0.98781	valid-auc:0.99081
#########################
### REPEAT 2, Fold 3 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95213	valid-auc:0.95719
[500]	train-auc:0.97689	valid-auc:0.98416
[1000]	train-auc:0.97922	valid-auc:0.98600
[1500]	train-auc:0.98097	valid-auc:0.98705
[2000]	train-auc:0.98248	valid-auc:0.98790
[2500]	train-auc:0.98386	valid-auc:0.98865
[3000]	train-auc:0.98513	valid-auc:0.98930
[3500]	train-auc:0.98630	valid-auc:0.98986
[4000]	train-auc:0.98739	valid-auc:0.99045
[4199]	train-auc:0.98781	valid-auc:0.99065
#########################
### REPEAT 2, Fold 4 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.94754	valid-auc:0.94979
[500]	train-auc:0.97687	valid-auc:0.98341
[1000]	train-auc:0.97923	valid-auc:0.98526
[1500]	train-auc:0.98098	valid-auc:0.98641
[2000]	train-auc:0.98247	valid-auc:0.98730
[2500]	train-auc:0.98385	valid-auc:0.98802
[3000]	train-auc:0.98512	valid-auc:0.98871
[3500]	train-auc:0.98629	valid-auc:0.98930
[4000]	train-auc:0.98738	valid-auc:0.98988
[4199]	train-auc:0.98779	valid-auc:0.99010
#########################
### REPEAT 2, Fold 5 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95239	valid-auc:0.95831
[500]	train-auc:0.97682	valid-auc:0.98383
[1000]	train-auc:0.97920	valid-auc:0.98561
[1500]	train-auc:0.98095	valid-auc:0.98667
[2000]	train-auc:0.98247	valid-auc:0.98751
[2500]	train-auc:0.98386	valid-auc:0.98829
[3000]	train-auc:0.98513	valid-auc:0.98897
[3500]	train-auc:0.98630	valid-auc:0.98956
[4000]	train-auc:0.98738	valid-auc:0.99014
[4199]	train-auc:0.98779	valid-auc:0.99034
#########################
### REPEAT 2, Fold 6 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95266	valid-auc:0.95825
[500]	train-auc:0.97687	valid-auc:0.98428
[1000]	train-auc:0.97926	valid-auc:0.98620
[1500]	train-auc:0.98098	valid-auc:0.98733
[2000]	train-auc:0.98249	valid-auc:0.98815
[2500]	train-auc:0.98385	valid-auc:0.98890
[3000]	train-auc:0.98511	valid-auc:0.98957
[3500]	train-auc:0.98629	valid-auc:0.99017
[4000]	train-auc:0.98737	valid-auc:0.99069
[4199]	train-auc:0.98778	valid-auc:0.99091
#########################
### REPEAT 2, Fold 7 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95214	valid-auc:0.96014
[500]	train-auc:0.97689	valid-auc:0.98492
[1000]	train-auc:0.97926	valid-auc:0.98656
[1500]	train-auc:0.98098	valid-auc:0.98757
[2000]	train-auc:0.98250	valid-auc:0.98838
[2500]	train-auc:0.98389	valid-auc:0.98910
[3000]	train-auc:0.98514	valid-auc:0.98973
[3500]	train-auc:0.98631	valid-auc:0.99033
[4000]	train-auc:0.98739	valid-auc:0.99088
[4199]	train-auc:0.98779	valid-auc:0.99108
#########################
### REPEAT 2, Fold 8 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95058	valid-auc:0.95709
[500]	train-auc:0.97685	valid-auc:0.98410
[1000]	train-auc:0.97922	valid-auc:0.98595
[1500]	train-auc:0.98095	valid-auc:0.98700
[2000]	train-auc:0.98246	valid-auc:0.98785
[2500]	train-auc:0.98383	valid-auc:0.98859
[3000]	train-auc:0.98509	valid-auc:0.98926
[3500]	train-auc:0.98626	valid-auc:0.98989
[4000]	train-auc:0.98736	valid-auc:0.99045
[4199]	train-auc:0.98778	valid-auc:0.99068
#########################
### REPEAT 2, Fold 9 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95286	valid-auc:0.95677
[500]	train-auc:0.97685	valid-auc:0.98438
[1000]	train-auc:0.97921	valid-auc:0.98623
[1500]	train-auc:0.98097	valid-auc:0.98735
[2000]	train-auc:0.98249	valid-auc:0.98819
[2500]	train-auc:0.98386	valid-auc:0.98893
[3000]	train-auc:0.98514	valid-auc:0.98960
[3500]	train-auc:0.98632	valid-auc:0.99021
[4000]	train-auc:0.98740	valid-auc:0.99074
[4199]	train-auc:0.98783	valid-auc:0.99094
#########################
### REPEAT 2, Fold 10 ###
#########################
Target encoding 127 features... 0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 




[0]	train-auc:0.95441	valid-auc:0.95983
[500]	train-auc:0.97685	valid-auc:0.98533
[1000]	train-auc:0.97923	valid-auc:0.98704
[1500]	train-auc:0.98095	valid-auc:0.98805
[2000]	train-auc:0.98247	valid-auc:0.98886
[2500]	train-auc:0.98384	valid-auc:0.98955
[3000]	train-auc:0.98511	valid-auc:0.99018
[3500]	train-auc:0.98629	valid-auc:0.99076
[4000]	train-auc:0.98738	valid-auc:0.99125
[4199]	train-auc:0.98780	valid-auc:0.99147


In [14]:
from sklearn.metrics import roc_auc_score

m = roc_auc_score(train.y.to_numpy(), oof_preds2)
print(f"XGB (Train More) with Original Data as cols CV = {m}")

XGB (Train More) with Original Data as cols CV = 0.9908152692075539


# Normalize
NNs prefer numerical columns to be roughly Gaussian distributed. Therefore we first apply a log transform to skewed distributions and then standardize by subtracting the mean and dividing by the standard deviation. (The standardization code is included in the NN code below.)
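As a minimal sketch of those two steps on a single toy column (the values are made up, and only the standard library is used): shift the column so its minimum is 0, apply `log1p`, then standardize. The notebook itself takes the minimum over train and test together and standardizes later with `StandardScaler`, which also uses the population standard deviation.

```python
import math
import statistics

# Toy skewed column with a negative minimum (values are made up).
balance = [-200.0, 0.0, 50.0, 300.0, 1200.0, 25000.0]

# 1) Shift so the minimum is 0, which keeps log1p defined.
mn = min(balance)
shifted = [x - mn for x in balance]

# 2) Compress the long right tail.
logged = [math.log1p(x) for x in shifted]

# 3) Standardize: subtract the mean, divide by the standard deviation.
mu = statistics.fmean(logged)
sd = statistics.pstdev(logged)
standardized = [(x - mu) / sd for x in logged]

print([round(v, 3) for v in standardized])
```

After step 3 the column has mean 0 and standard deviation 1, and the largest value is no longer orders of magnitude away from the rest.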

In [15]:
LOG = ['balance','duration','campaign','pdays','previous']

for c in LOG+CE:
    if c in LOG: 
        mn = min( (train[c].min(), test[c].min()) )
        train[c] = train[c]-mn
        test[c] = test[c]-mn
    train[c] = np.log1p( train[c] )
    test[c] = np.log1p( test[c] )

In [16]:
FEATURES = CATS+NUMS+CATS1+CE+TE
TARGET_COL = 'y'
print(f"We have {len( FEATURES )} features.")

We have 295 features.


# Train More - NN by GPT5 - Vibe Coding!
I used the features from my XGBoost starter notebook and asked GPT5 to build me an MLP. Below is what GPT5 produced! Here is the prompt I gave GPT5:

>I have dataframe with 16 categorical features. And i have 279 numeric features. This is in a Pandas dataframe. Can you write me code that builds and trains a Pytorch MLP that uses embeddings for the categoricals and predicts a binary target? Please make the model, set the training schedule. Write a KFold training loop, save the OOF, etc

After it produced the code, I tuned the batch size, learning rate, and learning schedule. Everything else was created by GPT5! Wow!

Previously our NN had 7 folds; we now train it with 10 folds. We also train the same 10 folds multiple times (`REPEATS`, set to 10 in the code below), each time with a different seed. Then we average all the models.
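The reason seed averaging helps can be sketched with a toy simulation. This is an illustration under a strong assumption, not the notebook's model: each "run" is the true probability plus independent Gaussian noise, and `noisy_model`, `N_ROWS`, and `N_SEEDS` are hypothetical names. Real NN runs share data and architecture, so their errors are correlated and the gain is smaller than here.

```python
import random

rng = random.Random(0)
N_ROWS, N_SEEDS = 1000, 5   # toy sizes, assumed for illustration

# Hypothetical "true" probabilities we are trying to estimate.
truth = [rng.random() for _ in range(N_ROWS)]

def noisy_model(seed):
    """Stand-in for one trained NN: truth plus seed-dependent noise."""
    r = random.Random(seed)
    return [t + r.gauss(0.0, 0.1) for t in truth]

def mse(pred):
    """Mean squared error against the toy ground truth."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / N_ROWS

# One prediction vector per seed, then a row-wise average across seeds.
runs = [noisy_model(seed) for seed in range(42, 42 + N_SEEDS)]
averaged = [sum(col) / N_SEEDS for col in zip(*runs)]

single_mse = mse(runs[0])
averaged_mse = mse(averaged)
print(f"single seed MSE = {single_mse:.5f}, "
      f"{N_SEEDS}-seed average MSE = {averaged_mse:.5f}")
```

The averaged prediction has a lower error than any typical single run because the independent noise partly cancels.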

In [17]:
# =========================
# PyTorch MLP w/ Cat Embeddings + KFold OOF
# =========================
import os, math, random, gc, json
import numpy as np
import pandas as pd
from tqdm import tqdm

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

from sklearn.model_selection import KFold
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import roc_auc_score, log_loss, accuracy_score

# -------------------------
# Config
# -------------------------
SEED          = 42
FOLDS         = 10
EPOCHS        = 4
BATCH_SIZE    = 512
LR_MAX        = 3e-3          # peak LR for OneCycle
WD            = 1e-4          # weight decay (AdamW)
EARLY_STOP    = 4             # epochs with no val AUC improvement
GRAD_CLIP     = 1.0
NUM_WORKERS   = 4
MODEL_DIR     = "./mlp_catemb_models"
OOF_PATH      = "./oof_catemb.csv"

os.makedirs(MODEL_DIR, exist_ok=True)

KNOWN_CARDINALITIES = SIZES
df = train[ FEATURES+['y'] ].to_pandas()
df2 = test[ FEATURES+['y'] ].to_pandas()

# -------------------------
# Repro
# -------------------------
def seed_everything(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True

seed_everything(SEED)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Device:", device)

# -------------------------
# Load/define your dataframe `df`
# -------------------------
# df = pd.read_parquet("your_data.parquet")
# assert TARGET_COL in df.columns

categorical_cols = list( KNOWN_CARDINALITIES.keys() )
numeric_cols = [c for c in df.columns if c not in categorical_cols + [TARGET_COL]]

print(f"#categoricals={len(categorical_cols)}, #numerics={len(numeric_cols)}")

# -------------------------
# Label-encode categoricals (reserve 0=UNK)
# -------------------------
encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    tmp = pd.concat([ df[[col]],df2[[col]] ],axis=0)
    tmp[col] = tmp[col].astype(str).fillna("NaN")
    le.fit(tmp[col].values)
    df[col] = le.transform(df[col].astype(str).values) + 1
    df2[col] = le.transform(df2[col].astype(str).values) + 1
    encoders[col] = le
    del tmp

cardinalities = {}
for col in categorical_cols:
    if col in KNOWN_CARDINALITIES:
        cardinalities[col] = KNOWN_CARDINALITIES[col] + 1  # +1 for UNK
        df[col] = np.clip(df[col], 0, KNOWN_CARDINALITIES[col])
        df2[col] = np.clip(df2[col], 0, KNOWN_CARDINALITIES[col])
    else:
        cardinalities[col] = int( max(df[col].max(),df2[col].max()) ) + 1

# -------------------------
# Scale numerics
# -------------------------
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols].astype(np.float32))
df2[numeric_cols] = scaler.transform(df2[numeric_cols].astype(np.float32))

# -------------------------
# Embedding dims
# -------------------------
def emb_dim_from_card(n):
    return int(min(50, round(1.6 * (n**0.56))))

emb_info = [(cardinalities[c], emb_dim_from_card(cardinalities[c])) for c in categorical_cols]
total_emb_dim = sum(d for _, d in emb_info)
print("Embedding config:", dict(zip(categorical_cols, emb_info)))
print("Total embedding dim:", total_emb_dim, " + numeric:", len(numeric_cols))

# -------------------------
# Dataset
# -------------------------
class TabDataset(Dataset):
    def __init__(self, df, cat_cols, num_cols, target_col=None, idx=None):
        self.cat_cols = cat_cols
        self.num_cols = num_cols
        self.target_col = target_col
        self.df = df if idx is None else df.iloc[idx]
        self.cats = self.df[self.cat_cols].values.astype(np.int64)
        self.nums = self.df[self.num_cols].values.astype(np.float32)
        self.y = None if self.target_col is None else self.df[self.target_col].values.astype(np.float32)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, i):
        cats = torch.from_numpy(self.cats[i])
        nums = torch.from_numpy(self.nums[i])
        if self.y is None:
            return cats, nums
        return cats, nums, torch.tensor(self.y[i])

# -------------------------
# Model
# -------------------------
class MLPWithCatEmb(nn.Module):
    def __init__(self, emb_info, n_num, hidden=[512, 256, 128], dropout=0.15):
        super().__init__()
        self.emb_layers = nn.ModuleList(
            [nn.Embedding(num_embeddings=card, embedding_dim=dim, padding_idx=0)
             for card, dim in emb_info]
        )
        emb_total = sum(dim for _, dim in emb_info)
        in_dim = emb_total + n_num

        self.bn_nums = nn.BatchNorm1d(n_num) if n_num > 0 else nn.Identity()
        self.dropout = nn.Dropout(dropout)

        layers = []
        last = in_dim
        for h in hidden:
            layers += [
                nn.Linear(last, h),
                nn.BatchNorm1d(h),
                nn.SiLU(),
                nn.Dropout(dropout)
            ]
            last = h
        self.mlp = nn.Sequential(*layers)
        self.head = nn.Linear(last, 1)

        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
                if m.bias is not None:
                    nn.init.zeros_(m.bias)

    def forward(self, x_cat, x_num):
        emb = [emb_layer(x_cat[:, i]) for i, emb_layer in enumerate(self.emb_layers)]
        x_emb = torch.cat(emb, dim=1) if emb else None
        if x_num is not None and x_num.shape[1] > 0:
            x_num = self.bn_nums(x_num)
            x = torch.cat([x_emb, x_num], dim=1) if x_emb is not None else x_num
        else:
            x = x_emb
        x = self.mlp(x)
        logit = self.head(x).squeeze(1)
        return logit

# -------------------------
# Train / Eval helpers
# -------------------------
def train_one_epoch(model, loader, optimizer, scaler, scheduler=None):
    model.train()
    running = 0.0
    for cats, nums, y in loader:
        cats = cats.to(device, non_blocking=True)
        nums = nums.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)

        optimizer.zero_grad(set_to_none=True)
        with torch.amp.autocast('cuda', enabled=True):
            logits = model(cats, nums)
            loss = F.binary_cross_entropy_with_logits(logits, y)
        scaler.scale(loss).backward()
        if GRAD_CLIP is not None:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), GRAD_CLIP)
        scaler.step(optimizer)
        scaler.update()
        if scheduler is not None:
            scheduler.step()

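        running += loss.detach().item() * y.size(0)
    # Note: with drop_last=True the final partial batch is skipped, so dividing by
    # len(loader.dataset) slightly underestimates the mean training loss.
    return running / len(loader.dataset)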

@torch.no_grad()
def validate(model, loader):
    model.eval()
    all_logits, all_y = [], []
    for batch in loader:
        if len(batch) == 3:
            cats, nums, y = batch
            all_y.append(y.detach().cpu())
        else:
            cats, nums = batch
        cats = cats.to(device, non_blocking=True)
        nums = nums.to(device, non_blocking=True)
        all_logits.append(model(cats, nums).detach().cpu())

    logits = torch.cat(all_logits).numpy()
    probs  = 1.0 / (1.0 + np.exp(-logits))
    probs_c = np.clip(probs, 1e-7, 1 - 1e-7)  # clip instead of log_loss(..., eps=...)

    if all_y:
        y_true = torch.cat(all_y).numpy().astype(np.int64)
        auc = roc_auc_score(y_true, probs)
        ll  = log_loss(y_true, probs_c)
        acc = accuracy_score(y_true, (probs >= 0.5).astype(int))
        return probs, {"auc": auc, "logloss": ll, "acc": acc}
    return probs, {}

oof = np.zeros(len(df), dtype=np.float32)
preds = np.zeros(len(df2), dtype=np.float32)

REPEATS = 10
for kk in range(REPEATS):
    print(f"##### REPEAT {kk+1} of {REPEATS} #####")
    seed_everything(SEED+kk)

    # -------------------------
    # KFold training
    # -------------------------
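    # KFold (not stratified) ignores y in split(). random_state=SEED is fixed,
    # so every repeat reuses the same folds and only the model seed changes.
    skf = KFold(n_splits=FOLDS, shuffle=True, random_state=SEED)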
    
    y = df[TARGET_COL].astype(int).values
    oof2 = np.zeros(len(df), dtype=np.float32)
    fold_metrics = []
    
    for fold, (trn_idx, val_idx) in enumerate(skf.split(df, y), start=1):
        print(f"\n========== Fold {fold}/{FOLDS} ==========")
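        # NOTE: the active trn_ds line uses every row of df, including this fold's val_idx rows,
        # so the validation and OOF metrics are optimistic rather than truly out-of-fold.
        # Switch to the commented-out trn_idx line for a leak-free CV estimate.
        #trn_ds = TabDataset(df, categorical_cols, numeric_cols, TARGET_COL, trn_idx)
        trn_ds = TabDataset(df, categorical_cols, numeric_cols, TARGET_COL, np.arange(len(df)) )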
        val_ds = TabDataset(df, categorical_cols, numeric_cols, TARGET_COL, val_idx)
        test_ds = TabDataset(df2, categorical_cols, numeric_cols, TARGET_COL, np.arange(len(df2)) )
    
        trn_loader = DataLoader(
            trn_ds, batch_size=BATCH_SIZE, shuffle=True,
            num_workers=NUM_WORKERS, pin_memory=True, drop_last=True
        )
        val_loader = DataLoader(
            val_ds, batch_size=BATCH_SIZE, shuffle=False,
            num_workers=NUM_WORKERS, pin_memory=True
        )
        test_loader = DataLoader(
            test_ds, batch_size=BATCH_SIZE, shuffle=False,
            num_workers=NUM_WORKERS, pin_memory=True
        )
    
        model = MLPWithCatEmb(emb_info=emb_info, n_num=len(numeric_cols)).to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=LR_MAX, weight_decay=WD)
    
        total_steps = EPOCHS * len(trn_loader)
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=LR_MAX, total_steps=total_steps,
            pct_start=0.0, div_factor=25.0, final_div_factor=10.5, anneal_strategy='cos'
        )
    
        scaler = torch.amp.GradScaler('cuda', enabled=True)
    
        best_auc = -1.0
        best_epoch = -1
        epochs_no_improve = 0
        best_path = os.path.join(MODEL_DIR, f"fold{fold}.pt")
    
        for epoch in range(1, EPOCHS+1):
            train_loss = train_one_epoch(model, trn_loader, optimizer, scaler, scheduler)
            _, val_stats = validate(model, val_loader)
            auc = val_stats.get("auc", float("nan"))
            print(f"Epoch {epoch:02d}: train_loss={train_loss:.4f} | "
                  f"val_auc={auc:.5f} val_logloss={val_stats['logloss']:.5f} val_acc={val_stats['acc']:.4f}")
    
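            # AUC-based selection is disabled: the latest epoch is always checkpointed,
            # so the early-stopping branch below never runs.
            if 1: #auc > best_auc: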
                best_auc = auc
                best_epoch = epoch
                epochs_no_improve = 0
                torch.save({
                    "model_state": model.state_dict(),
                    "config": {
                        "emb_info": emb_info,
                        "numeric_cols": numeric_cols,
                        "categorical_cols": categorical_cols
                    }
                }, best_path)
            else:
                epochs_no_improve += 1
                if epochs_no_improve >= EARLY_STOP:
                    print(f"Early stopping at epoch {epoch}. Best AUC {best_auc:.5f} @ epoch {best_epoch}")
                    break
    
        # Load best
        ckpt = torch.load(best_path, map_location="cpu", weights_only=False)
        model.load_state_dict(ckpt["model_state"])
        model.to(device)
    
        # OOF for this fold
        val_probs, val_stats = validate(model, val_loader)
        oof[val_idx] += val_probs / REPEATS
        oof2[val_idx] = val_probs
        fold_metrics.append({"fold": fold, **val_stats})
        print(f"[Fold {fold}] AUC={val_stats['auc']:.5f}  LogLoss={val_stats['logloss']:.5f}  Acc={val_stats['acc']:.4f}")
    
        test_probs, _ = validate(model, test_loader)
        preds += test_probs / FOLDS / REPEATS
    
    # -------------------------
    # Overall OOF metrics
    # -------------------------
    oof_c = np.clip(oof2, 1e-7, 1 - 1e-7)
    oof_auc = roc_auc_score(y, oof2)
    oof_ll  = log_loss(y, oof_c)     # no eps kwarg
    oof_acc = accuracy_score(y, (oof2>=0.5).astype(int))
    print("\n========== Overall OOF ==========")
    print(f"OOF AUC={oof_auc:.5f}  LogLoss={oof_ll:.5f}  Acc={oof_acc:.4f}")
    
    # Save OOF and metrics
    #pd.DataFrame({
    #    "oof_pred": oof,
    #    TARGET_COL: y
    #}).to_csv(OOF_PATH, index=False)
    
    #with open(os.path.join(MODEL_DIR, "fold_metrics.json"), "w") as f:
    #    json.dump({"folds": fold_metrics, "oof": {"auc": oof_auc, "logloss": oof_ll, "acc": oof_acc}}, f, indent=2)
    
    #print(f"Saved OOF -> {OOF_PATH}")
    #print(f"Saved models -> {MODEL_DIR}")

Device: cuda
#categoricals=16, #numerics=279
Embedding config: {'age2': (79, 18), 'balance2': (8591, 50), 'day2': (32, 11), 'duration2': (1825, 50), 'campaign2': (53, 15), 'pdays2': (629, 50), 'previous2': (55, 15), 'job': (13, 7), 'marital': (4, 3), 'education': (5, 4), 'default': (3, 3), 'housing': (3, 3), 'loan': (3, 3), 'contact': (4, 3), 'month': (13, 7), 'poutcome': (5, 4)}
Total embedding dim: 246  + numeric: 279
##### REPEAT 1 of 10 #####

Epoch 01: train_loss=0.1489 | val_auc=0.97468 val_logloss=0.12957 val_acc=0.9439
Epoch 02: train_loss=0.1311 | val_auc=0.97720 val_logloss=0.12341 val_acc=0.9469
Epoch 03: train_loss=0.1245 | val_auc=0.97905 val_logloss=0.11845 val_acc=0.9488
Epoch 04: train_loss=0.1196 | val_auc=0.97959 val_logloss=0.11701 val_acc=0.9493
[Fold 1] AUC=0.97959  LogLoss=0.11701  Acc=0.9493

Epoch 01: train_loss=0.1498 | val_auc=0.97491 val_logloss=0.12934 val_acc=0.9443
Epoch 02: train_loss=0.1313 | val_auc=0.97766 val_logloss=0.12252 val_acc=0.9471
Epoch 03: t

In [18]:
m = roc_auc_score(train.y.to_numpy(), oof)
print(f"NN (Train More) CV = {m}")

NN (Train More) CV = 0.980565911748762
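
The accumulation pattern in the repeat loop above (accumulators created once, each model adding its share) can be checked on toy data. The sketch below is only an illustration under stated assumptions, not the notebook's training code: the random arrays `val_by_repeat` and `test_by_repeat_fold` are hypothetical stand-ins for the outputs of `validate()`, and the fold assignment is a simple modulo split rather than `KFold`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_test, folds, repeats = 12, 5, 3, 4

# Same fold assignment in every repeat (mirrors the fixed random_state=SEED).
fold_id = np.arange(n_rows) % folds

# Hypothetical predictions standing in for validate() outputs.
val_by_repeat = rng.random((repeats, n_rows))            # one OOF vector per repeat
test_by_repeat_fold = rng.random((repeats, folds, n_test))  # one test vector per model

# Accumulators are created ONCE, outside the repeat loop.
oof = np.zeros(n_rows)
preds = np.zeros(n_test)

for r in range(repeats):
    for f in range(folds):
        val_idx = np.where(fold_id == f)[0]
        # Each row is validated once per repeat, so divide by repeats only.
        oof[val_idx] += val_by_repeat[r, val_idx] / repeats
        # Every model predicts the whole test set, so divide by folds * repeats.
        preds += test_by_repeat_fold[r, f] / folds / repeats

# Both accumulators end up as plain means over the contributing models.
print(np.allclose(oof, val_by_repeat.mean(axis=0)))
print(np.allclose(preds, test_by_repeat_fold.mean(axis=(0, 1))))
```

If `oof` and `preds` were re-created inside the repeat loop, only the last repeat's share would survive, which is the kind of bug the NOTE at the top of this notebook describes for Versions 1 and 2.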


# Create NN (Train More) Submission CSV

In [19]:
sub = pd.read_csv(f"{PATH}sample_submission.csv")
sub['y'] = preds
sub.to_csv("submission_nn_train_more.csv",index=False)
np.save("oof_nn_train_more",oof)
print('Submission shape',sub.shape)
#sub.head()

Submission shape (250000, 2)


# Create XGB (Train More) Submission CSV

In [20]:
sub = pd.read_csv(f"{PATH}sample_submission.csv")
preds_xgb = (test_preds+test_preds2)/2. 
sub['y'] = preds_xgb
sub.to_csv("submission_xgb_train_more.csv",index=False)
np.save("oof_xgb_rows_train_more",oof_preds)
np.save("oof_xgb_cols_train_more",oof_preds2)
print('Submission shape',sub.shape)
#sub.head()

Submission shape (250000, 2)


# Ensemble - XGB and NN (Train More) - CV Score

In [21]:
oof_xgb = (oof_preds+oof_preds2)/2.
m = roc_auc_score(train.y.to_numpy(), oof_xgb)
print(f"Both XGB rows and XGB cols (Train More) CV = {m}")

Both XGB rows and XGB cols (Train More) CV = 0.9912372919961618


In [22]:
best_m = 0
best_w = 0
for w in np.arange(0,1.01,0.01):
    oof_ensemble = (1-w)*oof_xgb + w*oof
    m = roc_auc_score(train.y.to_numpy(), oof_ensemble)
    if m>best_m:
        best_m = m
        best_w = w
        
oof_ensemble = (1-best_w)*oof_xgb + best_w*oof
m = roc_auc_score(train.y.to_numpy(), oof_ensemble)
print(f"Ensemble XGB and NN (Train More) CV = {m}")
print(f" using best NN weight = {best_w}")

Ensemble XGB and NN (Train More) CV = 0.9912372919961618
 using best NN weight = 0.0
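
The weight search in the cell above can also be written as a small reusable function. This is a minimal sketch on synthetic data, not the notebook's actual arrays: `auc_score` is a hypothetical NumPy-only stand-in for `roc_auc_score` (it assumes no tied scores), and `search_blend_weight`, `oof_a`, and `oof_b` are illustrative names.

```python
import numpy as np

def auc_score(y_true, scores):
    """Rank-based AUC (Mann-Whitney U). Assumes binary labels and no tied scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=np.float64)
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=np.float64)
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def search_blend_weight(y_true, oof_a, oof_b, step=0.01):
    """Grid-search w in [0, 1] for the blend (1 - w) * oof_a + w * oof_b."""
    weights = np.arange(0, 1 + step / 2, step)
    scores = [auc_score(y_true, (1 - w) * oof_a + w * oof_b) for w in weights]
    best = int(np.argmax(scores))  # first maximum, like the strict `m > best_m` test
    return float(weights[best]), float(scores[best])

# Synthetic example: two noisy scores for the same binary target.
rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=2000)
oof_a = y + rng.normal(0.0, 1.0, size=2000)
oof_b = y + rng.normal(0.0, 1.2, size=2000)

w, score = search_blend_weight(y, oof_a, oof_b)
print(f"best weight for model b = {w:.2f}, blended AUC = {score:.5f}")
```

Because the grid includes `w = 0`, the blended score can never fall below the first model's own score on the data used for the search. A returned weight of 0.0 therefore means the second model added nothing on that OOF.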


# Create - XGB and NN (Train More) - Submission CSV

In [23]:
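# Manual override: the weight search in the previous cell returned best NN weight = 0.0 on this OOF.
best_w = 0.27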
xgb_preds = preds_xgb 
sub = pd.read_csv(f"{PATH}sample_submission.csv")
sub['y'] = (1-best_w)*xgb_preds + best_w*preds
sub.to_csv("submission_ensemble_train_more.csv",index=False)
print('Submission shape',sub.shape)
sub.head()

Submission shape (250000, 2)


Unnamed: 0,id,y
0,750000,0.000748
1,750001,0.071478
2,750002,0.00016
3,750003,5e-05
4,750004,0.005871
