### Baseline solution

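- Missing values are filled with out-of-range sentinel values (e.g. `'missing_nom'` for string columns, `99` or `9` for numeric columns)
- All features, categorical or numeric, are label-encoded
- The model is LightGBM (LGBM)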
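The first two steps can be tried on a toy frame before running them on the full data. The sketch below is illustrative only: the column names mirror the dataset, but the values are invented, and the real notebook applies the same idea to the concatenated train and test frames.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the competition data (values are invented).
toy = pd.DataFrame({
    "bin_0": [0.0, 1.0, None, 0.0],
    "nom_0": ["Red", None, "Blue", "Red"],
    "ord_0": [1.0, 3.0, 2.0, None],
})

# Step 1: replace missing values with sentinels that do not collide with real values.
toy["bin_0"] = toy["bin_0"].fillna(9)
toy["nom_0"] = toy["nom_0"].fillna("missing_nom")
toy["ord_0"] = toy["ord_0"].fillna(99)

# Step 2: label-encode every column; each sentinel becomes one more integer class.
encoded = toy.copy()
for name in encoded.columns:
    encoded[name] = LabelEncoder().fit_transform(encoded[name].to_numpy())

print(encoded)
```

After encoding, "missing" is just another category code, so the model can learn from missingness itself rather than from an imputed guess.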

In [1]:
import sys

import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder

sys.path.insert(0, "/home/jupyter/kaggle/cat_in_dat_2/kaggle_cat_in_dat_2/src")
import utility

DATA_DIR = '/home/jupyter/kaggle/cat_in_dat_2/kaggle_cat_in_dat_2/data/read_only'
SEED = 42


utility.set_seed(SEED)

# Read the data files
train, test, submission = utility.read_files(DATA_DIR, index_col='id')

combined_df = pd.concat([train.drop('target', axis=1), test])
print(f'Shape of the combined DF {combined_df.shape}')

train_index = train.shape[0]

# Collect the feature names by type (nominal, binary, ordinal)
nom_features = utility.get_fetaure_names(train, 'nom')
print(f'Number of nominal features {len(nom_features)}')
print(f'Nominal Features : {nom_features}')

binary_features = utility.get_fetaure_names(train, 'bin')
print(f'Number of binary features {len(binary_features)}')
print(f'Binary Features : {binary_features}')

ordinal_features = utility.get_fetaure_names(train, 'ord')
print(f'Number of ordinal features {len(ordinal_features)}')
print(f'Ordinal Features : {ordinal_features}')

# Fill missing values in the nominal features with a sentinel string
combined_df[nom_features] = combined_df[nom_features].fillna('missing_nom')
# ord_0 apparently holds integer values, so use a numeric sentinel
combined_df['ord_0'] = combined_df['ord_0'].fillna(99)
# The other ordinal features hold strings, so use a string sentinel
combined_df[['ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5']] = combined_df[['ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5']].fillna('missing_ord')
# day and month are numeric
combined_df['day'] = combined_df['day'].fillna(99)
combined_df['month'] = combined_df['month'].fillna(99)
# bin_3 and bin_4 hold strings; bin_0 to bin_2 are numeric
combined_df[['bin_3', 'bin_4']] = combined_df[['bin_3', 'bin_4']].fillna('missing_binary')
combined_df[['bin_0', 'bin_1', 'bin_2']] = combined_df[['bin_0', 'bin_1', 'bin_2']].fillna(9)

# Convert every column to the category dtype
for name in combined_df.columns:
    combined_df[name] = combined_df[name].astype('category')

# Label-encode every column; fitting on train + test together gives both the same mapping
for name in combined_df.columns:
    lb = LabelEncoder()
    combined_df[name] = lb.fit_transform(combined_df[name])
    
train_Y = train.target
train_X = combined_df[:train_index]
test_X = combined_df[train_index:]

print(train_X.shape)
print(test_X.shape)
print(train_Y.shape)

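lgb_params = {
    'objective': 'binary',
    'boosting_type': 'gbdt',
    'metric': 'auc',
    'n_jobs': -1,
    'verbose': -1,
    'seed': SEED,
    'num_trees': 10000,            # upper bound; early stopping picks the actual count
    'early_stopping_rounds': 100,  # stop after 100 rounds without validation AUC improvement
}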

Loading Data...
Shape of train.csv : (600000, 24)
Shape of test.csv : (400000, 23)
Shape of sample_submission.csv : (400000, 2)
Data Loaded...
Shape of the combined DF (1000000, 23)
Number of nominal features 10
Nominal Features : ['nom_0', 'nom_1', 'nom_2', 'nom_3', 'nom_4', 'nom_5', 'nom_6', 'nom_7', 'nom_8', 'nom_9']
Number of binary features 5
Binary Features : ['bin_0', 'bin_1', 'bin_2', 'bin_3', 'bin_4']
Number of ordinal features 6
Ordinal Features : ['ord_0', 'ord_1', 'ord_2', 'ord_3', 'ord_4', 'ord_5']
(600000, 23)
(400000, 23)
(600000,)
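The training logs in this notebook repeat the messages "Found `num_trees` in params" and "Found `early_stopping_rounds` in params". Both keys are LightGBM aliases (for `num_iterations` and `early_stopping_round`), and LightGBM reports that the values in `params` take precedence over the corresponding function arguments. One possible way to make the intent explicit is to separate these training controls from the model hyperparameters before training. The helper below is hypothetical and not part of `utility`; how `utility.make_prediction` actually calls LightGBM is not shown here, so treat this as a sketch.

```python
# Hypothetical helper (an assumption, not part of `utility`): split the
# training-control entries out of a LightGBM-style params dict.
TREE_COUNT_ALIASES = ("num_iterations", "num_trees", "num_boost_round", "n_estimators")
EARLY_STOP_ALIASES = ("early_stopping_round", "early_stopping_rounds")


def split_training_controls(params, default_rounds=100, default_patience=None):
    """Return (model_params, num_rounds, patience) without mutating `params`."""
    model_params = dict(params)  # shallow copy so the caller's dict is untouched
    num_rounds = default_rounds
    patience = default_patience
    for key in TREE_COUNT_ALIASES:
        if key in model_params:
            num_rounds = model_params.pop(key)
    for key in EARLY_STOP_ALIASES:
        if key in model_params:
            patience = model_params.pop(key)
    return model_params, num_rounds, patience


lgb_params = {
    "objective": "binary",
    "boosting_type": "gbdt",
    "metric": "auc",
    "n_jobs": -1,
    "verbose": -1,
    "seed": 42,
    "num_trees": 10000,
    "early_stopping_rounds": 100,
}

model_params, num_rounds, patience = split_training_controls(lgb_params)
print(model_params)
print(num_rounds, patience)
```

The returned `num_rounds` and `patience` could then be passed to the training call directly (for example as the boosting-round count and an early-stopping setting), which should avoid the override messages.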


## Stratified 5-fold cross-validation

In [2]:
result_dict = utility.make_prediction(train_X, train_Y, test_X, params=lgb_params, n_splits=5, seed=SEED)

fold 1 of 5



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752872	valid_1's auc: 0.746997
[100]	training's auc: 0.766817	valid_1's auc: 0.758228
[150]	training's auc: 0.77573	valid_1's auc: 0.764322
[200]	training's auc: 0.782054	valid_1's auc: 0.76746
[250]	training's auc: 0.787363	valid_1's auc: 0.769147
[300]	training's auc: 0.791631	valid_1's auc: 0.769861
[350]	training's auc: 0.79581	valid_1's auc: 0.770528
[400]	training's auc: 0.799554	valid_1's auc: 0.770746
[450]	training's auc: 0.803446	valid_1's auc: 0.77096
[500]	training's auc: 0.80704	valid_1's auc: 0.770986
[550]	training's auc: 0.810387	valid_1's auc: 0.771003
[600]	training's auc: 0.813865	valid_1's auc: 0.77125
[650]	training's auc: 0.817067	valid_1's auc: 0.771284
[700]	training's auc: 0.820194	valid_1's auc: 0.771338
Early stopping, best iteration is:
[624]	training's auc: 0.815426	valid_1's auc: 0.771383
CV OOF Score for fold 1 is 0.77138314373759
fold 2 of 5



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.753036	valid_1's auc: 0.747343
[100]	training's auc: 0.767346	valid_1's auc: 0.758771
[150]	training's auc: 0.775843	valid_1's auc: 0.76458
[200]	training's auc: 0.78192	valid_1's auc: 0.767096
[250]	training's auc: 0.78726	valid_1's auc: 0.768671
[300]	training's auc: 0.791569	valid_1's auc: 0.769409
[350]	training's auc: 0.795797	valid_1's auc: 0.770009
[400]	training's auc: 0.799844	valid_1's auc: 0.770532
[450]	training's auc: 0.803655	valid_1's auc: 0.771067
[500]	training's auc: 0.807274	valid_1's auc: 0.771309
[550]	training's auc: 0.810747	valid_1's auc: 0.771398
[600]	training's auc: 0.814034	valid_1's auc: 0.771215
Early stopping, best iteration is:
[539]	training's auc: 0.809985	valid_1's auc: 0.771432
CV OOF Score for fold 2 is 0.7714319004102621
fold 3 of 5



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.75359	valid_1's auc: 0.742898
[100]	training's auc: 0.768322	valid_1's auc: 0.755093
[150]	training's auc: 0.77696	valid_1's auc: 0.761322
[200]	training's auc: 0.782808	valid_1's auc: 0.763824
[250]	training's auc: 0.787847	valid_1's auc: 0.76561
[300]	training's auc: 0.792367	valid_1's auc: 0.766581
[350]	training's auc: 0.796585	valid_1's auc: 0.767592
[400]	training's auc: 0.800289	valid_1's auc: 0.767578
[450]	training's auc: 0.803988	valid_1's auc: 0.76796
[500]	training's auc: 0.807432	valid_1's auc: 0.768171
[550]	training's auc: 0.810975	valid_1's auc: 0.768419
[600]	training's auc: 0.814297	valid_1's auc: 0.768448
[650]	training's auc: 0.817497	valid_1's auc: 0.768566
[700]	training's auc: 0.820701	valid_1's auc: 0.768516
[750]	training's auc: 0.82384	valid_1's auc: 0.76851
[800]	training's auc: 0.826811	valid_1's auc: 0.768438
Early stopping, best iteration is:
[728]	training's auc: 0.822485


Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752602	valid_1's auc: 0.747662
[100]	training's auc: 0.765767	valid_1's auc: 0.757815
[150]	training's auc: 0.775159	valid_1's auc: 0.764495
[200]	training's auc: 0.781419	valid_1's auc: 0.767529
[250]	training's auc: 0.786812	valid_1's auc: 0.769616
[300]	training's auc: 0.791335	valid_1's auc: 0.770759
[350]	training's auc: 0.795443	valid_1's auc: 0.771272
[400]	training's auc: 0.799178	valid_1's auc: 0.771492
[450]	training's auc: 0.802921	valid_1's auc: 0.771652
[500]	training's auc: 0.806569	valid_1's auc: 0.772041
[550]	training's auc: 0.810261	valid_1's auc: 0.772586
[600]	training's auc: 0.813473	valid_1's auc: 0.77244
[650]	training's auc: 0.816749	valid_1's auc: 0.772717
[700]	training's auc: 0.819821	valid_1's auc: 0.772621
[750]	training's auc: 0.82306	valid_1's auc: 0.772628
Early stopping, best iteration is:
[653]	training's auc: 0.816954	valid_1's auc: 0.772754
CV OOF Score for fold 4 is


Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.751895	valid_1's auc: 0.750255
[100]	training's auc: 0.765501	valid_1's auc: 0.761657
[150]	training's auc: 0.775143	valid_1's auc: 0.76858
[200]	training's auc: 0.781781	valid_1's auc: 0.77196
[250]	training's auc: 0.78651	valid_1's auc: 0.772959
[300]	training's auc: 0.790923	valid_1's auc: 0.773781
[350]	training's auc: 0.795098	valid_1's auc: 0.774132
[400]	training's auc: 0.79897	valid_1's auc: 0.774519
[450]	training's auc: 0.80271	valid_1's auc: 0.774928
[500]	training's auc: 0.80624	valid_1's auc: 0.774939
[550]	training's auc: 0.809541	valid_1's auc: 0.775034
[600]	training's auc: 0.812842	valid_1's auc: 0.775008
[650]	training's auc: 0.816161	valid_1's auc: 0.775134
[700]	training's auc: 0.819306	valid_1's auc: 0.775
Early stopping, best iteration is:
[609]	training's auc: 0.813532	valid_1's auc: 0.77519
CV OOF Score for fold 5 is 0.7751897836695921
Combined OOF score : 0.77186
Average of 5 f
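The logs above report a score per fold, a combined out-of-fold (OOF) score, and an average across folds. The source of `utility.make_prediction` is not shown, so the following is only a minimal sketch of how such numbers are commonly produced. It makes several assumptions: the function name `oof_cv` and every dictionary key except `prediction` are invented, scikit-learn's `LogisticRegression` stands in for LightGBM to keep the example light, and the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def oof_cv(X, y, X_test, n_splits=5, seed=42):
    """Stratified K-fold CV returning OOF predictions and fold-averaged test predictions."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(y))
    test_pred = np.zeros(len(X_test))
    fold_scores = []
    for fold, (train_idx, valid_idx) in enumerate(skf.split(X, y), start=1):
        # Stand-in model; the notebook trains LightGBM here instead.
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])
        # Each training row is predicted exactly once, by a model that never saw it.
        oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
        # Test predictions are averaged over the fold models.
        test_pred += model.predict_proba(X_test)[:, 1] / n_splits
        fold_scores.append(roc_auc_score(y[valid_idx], oof[valid_idx]))
        print(f"CV OOF score for fold {fold} is {fold_scores[-1]:.5f}")
    return {
        "oof": oof,
        "prediction": test_pred,
        "fold_scores": fold_scores,
        "combined_oof_score": roc_auc_score(y, oof),  # one AUC over all OOF rows
        "avg_fold_score": float(np.mean(fold_scores)),
        "std_fold_score": float(np.std(fold_scores)),
    }


# Synthetic data: 500 rows for training, 100 held out as a pretend test set.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
result = oof_cv(X[:500], y[:500], X[500:], n_splits=5)
print(f"Combined OOF score : {result['combined_oof_score']:.5f}")
print(f"Average of 5 folds OOF score {result['avg_fold_score']:.5f}")
```

The combined score and the fold average usually differ slightly: AUC is rank-based, so scoring all OOF predictions together also compares rows that were predicted by different fold models.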

## Stratified 10-fold cross-validation

In [4]:
result_dict = utility.make_prediction(train_X, train_Y, test_X, params=lgb_params, n_splits=10, seed=SEED)

fold 1 of 10



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752405	valid_1's auc: 0.745551
[100]	training's auc: 0.765956	valid_1's auc: 0.757162
[150]	training's auc: 0.774735	valid_1's auc: 0.763724
[200]	training's auc: 0.780904	valid_1's auc: 0.767034
[250]	training's auc: 0.785686	valid_1's auc: 0.768623
[300]	training's auc: 0.789869	valid_1's auc: 0.76967
[350]	training's auc: 0.793841	valid_1's auc: 0.770547
[400]	training's auc: 0.797318	valid_1's auc: 0.770736
[450]	training's auc: 0.800745	valid_1's auc: 0.771197
[500]	training's auc: 0.803925	valid_1's auc: 0.771184
[550]	training's auc: 0.807129	valid_1's auc: 0.771226
Early stopping, best iteration is:
[456]	training's auc: 0.801155	valid_1's auc: 0.771377
CV OOF Score for fold 1 is 0.7713773403548505
fold 2 of 10



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752333	valid_1's auc: 0.74862
[100]	training's auc: 0.766344	valid_1's auc: 0.759875
[150]	training's auc: 0.774467	valid_1's auc: 0.765279
[200]	training's auc: 0.780352	valid_1's auc: 0.768324
[250]	training's auc: 0.784668	valid_1's auc: 0.769347
[300]	training's auc: 0.789331	valid_1's auc: 0.770981
[350]	training's auc: 0.79314	valid_1's auc: 0.771697
[400]	training's auc: 0.796875	valid_1's auc: 0.772293
[450]	training's auc: 0.800222	valid_1's auc: 0.772474
[500]	training's auc: 0.803583	valid_1's auc: 0.772639
[550]	training's auc: 0.806656	valid_1's auc: 0.772607
[600]	training's auc: 0.809751	valid_1's auc: 0.772681
[650]	training's auc: 0.812768	valid_1's auc: 0.77277
[700]	training's auc: 0.815582	valid_1's auc: 0.772704
[750]	training's auc: 0.818505	valid_1's auc: 0.772653
Early stopping, best iteration is:
[658]	training's auc: 0.813206	valid_1's auc: 0.772816
CV OOF Score for fold 2 is 


Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752342	valid_1's auc: 0.746467
[100]	training's auc: 0.766133	valid_1's auc: 0.758628
[150]	training's auc: 0.774748	valid_1's auc: 0.764801
[200]	training's auc: 0.7803	valid_1's auc: 0.767415
[250]	training's auc: 0.78536	valid_1's auc: 0.769582
[300]	training's auc: 0.789876	valid_1's auc: 0.770881
[350]	training's auc: 0.793554	valid_1's auc: 0.771503
[400]	training's auc: 0.797117	valid_1's auc: 0.772118
[450]	training's auc: 0.80063	valid_1's auc: 0.77241
[500]	training's auc: 0.804011	valid_1's auc: 0.772593
[550]	training's auc: 0.807146	valid_1's auc: 0.772563
[600]	training's auc: 0.810121	valid_1's auc: 0.772536
[650]	training's auc: 0.81305	valid_1's auc: 0.7728
[700]	training's auc: 0.815963	valid_1's auc: 0.772694
[750]	training's auc: 0.818833	valid_1's auc: 0.772956
[800]	training's auc: 0.821586	valid_1's auc: 0.772907
[850]	training's auc: 0.824329	valid_1's auc: 0.772819
Early stoppi


Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752296	valid_1's auc: 0.748307
[100]	training's auc: 0.765623	valid_1's auc: 0.758661
[150]	training's auc: 0.774892	valid_1's auc: 0.765363
[200]	training's auc: 0.780695	valid_1's auc: 0.767853
[250]	training's auc: 0.785313	valid_1's auc: 0.769246
[300]	training's auc: 0.789717	valid_1's auc: 0.770298
[350]	training's auc: 0.793561	valid_1's auc: 0.770949
[400]	training's auc: 0.797215	valid_1's auc: 0.771513
[450]	training's auc: 0.800628	valid_1's auc: 0.771847
[500]	training's auc: 0.803785	valid_1's auc: 0.772157
[550]	training's auc: 0.806975	valid_1's auc: 0.772128
[600]	training's auc: 0.810003	valid_1's auc: 0.771994
Early stopping, best iteration is:
[523]	training's auc: 0.805315	valid_1's auc: 0.772342
CV OOF Score for fold 4 is 0.7723419150430715
fold 5 of 10



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752445	valid_1's auc: 0.74363
[100]	training's auc: 0.765904	valid_1's auc: 0.755277
[150]	training's auc: 0.775028	valid_1's auc: 0.762165
[200]	training's auc: 0.781047	valid_1's auc: 0.765707
[250]	training's auc: 0.785884	valid_1's auc: 0.767905
[300]	training's auc: 0.789639	valid_1's auc: 0.768347
[350]	training's auc: 0.793554	valid_1's auc: 0.769228
[400]	training's auc: 0.797096	valid_1's auc: 0.769644
[450]	training's auc: 0.800411	valid_1's auc: 0.769899
[500]	training's auc: 0.80356	valid_1's auc: 0.769972
[550]	training's auc: 0.806784	valid_1's auc: 0.770427
[600]	training's auc: 0.809942	valid_1's auc: 0.770331
Early stopping, best iteration is:
[542]	training's auc: 0.806291	valid_1's auc: 0.770515
CV OOF Score for fold 5 is 0.7705146220245773
fold 6 of 10



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.75268	valid_1's auc: 0.741418
[100]	training's auc: 0.766339	valid_1's auc: 0.753064
[150]	training's auc: 0.774816	valid_1's auc: 0.759227
[200]	training's auc: 0.781067	valid_1's auc: 0.762922
[250]	training's auc: 0.785689	valid_1's auc: 0.764397
[300]	training's auc: 0.789904	valid_1's auc: 0.765882
[350]	training's auc: 0.793724	valid_1's auc: 0.766521
[400]	training's auc: 0.797265	valid_1's auc: 0.766939
[450]	training's auc: 0.800752	valid_1's auc: 0.767165
[500]	training's auc: 0.803853	valid_1's auc: 0.767217
[550]	training's auc: 0.807079	valid_1's auc: 0.767458
[600]	training's auc: 0.81004	valid_1's auc: 0.767494
[650]	training's auc: 0.813053	valid_1's auc: 0.767496
Early stopping, best iteration is:
[589]	training's auc: 0.80939	valid_1's auc: 0.767588
CV OOF Score for fold 6 is 0.7675882991234502
fold 7 of 10



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752112	valid_1's auc: 0.747814
[100]	training's auc: 0.765821	valid_1's auc: 0.758949
[150]	training's auc: 0.774433	valid_1's auc: 0.76533
[200]	training's auc: 0.780643	valid_1's auc: 0.768549
[250]	training's auc: 0.785064	valid_1's auc: 0.76995
[300]	training's auc: 0.789235	valid_1's auc: 0.771123
[350]	training's auc: 0.793121	valid_1's auc: 0.772017
[400]	training's auc: 0.79668	valid_1's auc: 0.772626
[450]	training's auc: 0.800186	valid_1's auc: 0.77306
[500]	training's auc: 0.803427	valid_1's auc: 0.773274
[550]	training's auc: 0.806555	valid_1's auc: 0.7732
[600]	training's auc: 0.809716	valid_1's auc: 0.773572
[650]	training's auc: 0.812685	valid_1's auc: 0.773546
[700]	training's auc: 0.815496	valid_1's auc: 0.773596
[750]	training's auc: 0.818298	valid_1's auc: 0.773595
[800]	training's auc: 0.821089	valid_1's auc: 0.773718
[850]	training's auc: 0.823722	valid_1's auc: 0.773603
[900]	trai


Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752079	valid_1's auc: 0.747101
[100]	training's auc: 0.76623	valid_1's auc: 0.758924
[150]	training's auc: 0.774194	valid_1's auc: 0.764747
[200]	training's auc: 0.780051	valid_1's auc: 0.767921
[250]	training's auc: 0.784947	valid_1's auc: 0.76973
[300]	training's auc: 0.789026	valid_1's auc: 0.770845
[350]	training's auc: 0.792641	valid_1's auc: 0.77128
[400]	training's auc: 0.796186	valid_1's auc: 0.771626
[450]	training's auc: 0.799848	valid_1's auc: 0.772245
[500]	training's auc: 0.803033	valid_1's auc: 0.772399
[550]	training's auc: 0.806172	valid_1's auc: 0.772648
[600]	training's auc: 0.809197	valid_1's auc: 0.772698
[650]	training's auc: 0.812194	valid_1's auc: 0.772839
[700]	training's auc: 0.815135	valid_1's auc: 0.772774
[750]	training's auc: 0.817991	valid_1's auc: 0.772925
[800]	training's auc: 0.820737	valid_1's auc: 0.772742
Early stopping, best iteration is:
[749]	training's auc: 0.817


Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.752148	valid_1's auc: 0.75044
[100]	training's auc: 0.765837	valid_1's auc: 0.76185
[150]	training's auc: 0.774332	valid_1's auc: 0.767492
[200]	training's auc: 0.780473	valid_1's auc: 0.77068
[250]	training's auc: 0.785298	valid_1's auc: 0.772153
[300]	training's auc: 0.789493	valid_1's auc: 0.77318
[350]	training's auc: 0.793295	valid_1's auc: 0.773714
[400]	training's auc: 0.796827	valid_1's auc: 0.773981
[450]	training's auc: 0.800155	valid_1's auc: 0.774106
[500]	training's auc: 0.803487	valid_1's auc: 0.77422
[550]	training's auc: 0.806705	valid_1's auc: 0.774255
[600]	training's auc: 0.809676	valid_1's auc: 0.774078
Early stopping, best iteration is:
[528]	training's auc: 0.805361	valid_1's auc: 0.774312
CV OOF Score for fold 9 is 0.774311543678508
fold 10 of 10



Found `num_trees` in params. Will use it instead of argument


Found `early_stopping_rounds` in params. Will use it instead of argument



Training until validation scores don't improve for 100 rounds
[50]	training's auc: 0.751989	valid_1's auc: 0.75028
[100]	training's auc: 0.766454	valid_1's auc: 0.76312
[150]	training's auc: 0.774022	valid_1's auc: 0.768548
[200]	training's auc: 0.780636	valid_1's auc: 0.772335
[250]	training's auc: 0.785016	valid_1's auc: 0.773496
[300]	training's auc: 0.789442	valid_1's auc: 0.774891
[350]	training's auc: 0.79321	valid_1's auc: 0.775138
[400]	training's auc: 0.796805	valid_1's auc: 0.775541
[450]	training's auc: 0.80017	valid_1's auc: 0.775881
[500]	training's auc: 0.803407	valid_1's auc: 0.775875
[550]	training's auc: 0.806603	valid_1's auc: 0.775976
[600]	training's auc: 0.809645	valid_1's auc: 0.775942
[650]	training's auc: 0.812737	valid_1's auc: 0.775882
Early stopping, best iteration is:
[565]	training's auc: 0.80755	valid_1's auc: 0.776092
CV OOF Score for fold 10 is 0.7760915132261278
Combined OOF score : 0.77247
Average of 10 folds OOF score 0.77248
std of 10 folds OOF score

In [None]:
submission.head()

In [None]:
submission['target'] = result_dict['prediction']
submission.to_csv('submission_1.csv', index=False)

In [None]:
submission.head()

In [None]:
# ! kaggle competitions submit -c cat-in-the-dat-ii -f submission_1.csv -m "Baseline solution"