Thsi notebook will first serve as a baseline LightGBM run, which I will type by hand - based on "LightGBM Starter" by firefliesqn. Once I've established this sort of baseline, I'll branch out on my own and experiment with some additional FE and maybe some other models. This notebook will probably NOT include any NN; I might come back to this later. 

I would ideally fork the notebook, but I would like to practice by typing out all the code myself.

In [21]:
#The initial dependencies; will be updated if needed


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
import datatable as dt
import optuna

from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.preprocessing import RobustScaler #I'll look into other scaling methods in the future versions
from sklearn.metrics import roc_auc_score

import lightgbm as lgb

import warnings
warnings.simplefilter('ignore')


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/tabular-playground-series-sep-2021/train.csv
/kaggle/input/tabular-playground-series-sep-2021/test.csv
/kaggle/input/tabular-playground-series-sep-2021/sample_solution.csv


In [2]:
train_df = dt.fread('/kaggle/input/tabular-playground-series-sep-2021/train.csv').to_pandas()
test_df = dt.fread('/kaggle/input/tabular-playground-series-sep-2021/test.csv').to_pandas()
sample_df = dt.fread('/kaggle/input/tabular-playground-series-sep-2021/sample_solution.csv').to_pandas()

Datasets imported, let's now check the shape of those datasets:

In [3]:
print(f'Shape of train_df: {train_df.shape}')
print(f'Shape of test_df: {test_df.shape}')
print(f'Shape of sample_df: {sample_df.shape}')

Shape of train_df: (957919, 120)
Shape of test_df: (493474, 119)
Shape of sample_df: (493474, 2)


In [4]:
train_df.head()

Unnamed: 0,id,f1,f2,f3,f4,f5,f6,f7,f8,f9,...,f110,f111,f112,f113,f114,f115,f116,f117,f118,claim
0,0,0.10859,0.004314,-37.566,0.017364,0.28915,-10.251,135.12,168900.0,399240000000000.0,...,-12.228,1.7482,1.9096,-7.1157,4378.8,1.2096,861340000000000.0,140.1,1.0177,True
1,1,0.1009,0.29961,11822.0,0.2765,0.4597,-0.83733,1721.9,119810.0,3874100000000000.0,...,-56.758,4.1684,0.34808,4.142,913.23,1.2464,7575100000000000.0,1861.0,0.28359,False
2,2,0.17803,-0.00698,907.27,0.27214,0.45948,0.17327,2298.0,360650.0,12245000000000.0,...,-5.7688,1.2042,0.2629,8.1312,45119.0,1.1764,321810000000000.0,3838.2,0.4069,True
3,3,0.15236,0.007259,780.1,0.025179,0.51947,7.4914,112.51,259490.0,77814000000000.0,...,-34.858,2.0694,0.79631,-16.336,4952.4,1.1784,4533000000000.0,4889.1,0.51486,True
4,4,0.11623,0.5029,-109.15,0.29791,0.3449,-0.40932,2538.9,65332.0,1907200000000000.0,...,-13.641,1.5298,1.1464,-0.43124,3856.5,1.483,-8991300000000.0,,0.23049,True


Now we can split the datasets into the appropriate Xs and ys:

In [5]:
X_train = train_df.drop(['id', 'claim'], axis=1)
y_train = train_df['claim'].copy()

X_test = test_df.drop('id', axis=1)

Soo... Browsing through the discussions, I've learned that a big lesson to be learned from this dataset is that the missing values are not put there at random. Or, put another way, it is a feature itself which may help establish if there was a claim or not. Well, let's add this feature as well as the standard deviation to our dataset to help us make a better prediction:

In [6]:
for column in X_train.columns:
    print(f'{column}: {X_train[column].isna().sum()}')

f1: 15247
f2: 15190
f3: 15491
f4: 15560
f5: 15405
f6: 15521
f7: 15504
f8: 15373
f9: 15249
f10: 15223
f11: 15425
f12: 15593
f13: 15464
f14: 15222
f15: 15509
f16: 15444
f17: 15427
f18: 15325
f19: 15474
f20: 15455
f21: 15454
f22: 15278
f23: 15356
f24: 15630
f25: 15506
f26: 15358
f27: 15444
f28: 15265
f29: 15415
f30: 15392
f31: 15678
f32: 15529
f33: 15492
f34: 15248
f35: 15334
f36: 15363
f37: 15312
f38: 15434
f39: 15559
f40: 15368
f41: 15396
f42: 15416
f43: 15455
f44: 15463
f45: 15483
f46: 15633
f47: 15523
f48: 15446
f49: 15380
f50: 15562
f51: 15432
f52: 15315
f53: 15462
f54: 15425
f55: 15422
f56: 15467
f57: 15573
f58: 15455
f59: 15400
f60: 15560
f61: 15431
f62: 15518
f63: 15410
f64: 15578
f65: 15414
f66: 15414
f67: 15486
f68: 15619
f69: 15552
f70: 15262
f71: 15482
f72: 15219
f73: 15537
f74: 15570
f75: 15456
f76: 15569
f77: 15251
f78: 15449
f79: 15378
f80: 15320
f81: 15346
f82: 15485
f83: 15627
f84: 15385
f85: 15449
f86: 15523
f87: 15316
f88: 15548
f89: 15445
f90: 15482
f91: 15507
f92: 154

Approximately 15k out of nearly 1M rows; I think it's safe to impute these values with the median value and not affect the data too severely, but at least get rid of the NaNs

In [7]:
X_train['n_miss'] = X_train.isna().sum(axis=1)
X_test['n_miss'] = X_test.isna().sum(axis=1)

#Now I realize why there was a list of columns, if we compute the standard deviation as is, 'n_miss' will be also taken into account
#I'll stick to using a slice of the dataframe:
X_train['std'] = X_train[:-1].std(axis=1)
X_test['std'] = X_test[:-1].std(axis=1)

In [8]:
X_train.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f111,f112,f113,f114,f115,f116,f117,f118,n_miss,std
0,0.10859,0.004314,-37.566,0.017364,0.28915,-10.251,135.12,168900.0,399240000000000.0,86.489,...,1.7482,1.9096,-7.1157,4378.8,1.2096,861340000000000.0,140.1,1.0177,1,1008624000000000.0
1,0.1009,0.29961,11822.0,0.2765,0.4597,-0.83733,1721.9,119810.0,3874100000000000.0,9953.6,...,4.1684,0.34808,4.142,913.23,1.2464,7575100000000000.0,1861.0,0.28359,0,2933337000000000.0
2,0.17803,-0.00698,907.27,0.27214,0.45948,0.17327,2298.0,360650.0,12245000000000.0,15827.0,...,1.2042,0.2629,8.1312,45119.0,1.1764,321810000000000.0,3838.2,0.4069,5,30921070000000.0
3,0.15236,0.007259,780.1,0.025179,0.51947,7.4914,112.51,259490.0,77814000000000.0,-36.837,...,2.0694,0.79631,-16.336,4952.4,1.1784,4533000000000.0,4889.1,0.51486,2,175554200000000.0
4,0.11623,0.5029,-109.15,0.29791,0.3449,-0.40932,2538.9,65332.0,1907200000000000.0,144.12,...,1.5298,1.1464,-0.43124,3856.5,1.483,-8991300000000.0,,0.23049,8,977199300000000.0


In [9]:
#Now in the original author's notebook, the NaN's were imputed with mean values, but I would like to stick to median values:
X_train = X_train.fillna(X_train.median())
X_test = X_test.fillna(X_test.median())

print(f'NaNs in X_train: {X_train.isna().sum().sum()}')
print(f'NaNs in X_test: {X_test.isna().sum().sum()}')

NaNs in X_train: 0
NaNs in X_test: 0


Great! Now that the missing values have been taken care of, we can handle some scaling efforts:

In [10]:
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

In [11]:
'''A function to reduce the amount of memory taken up by each feature by compressing it to the appropriate datatype
verbose parameter is used to output a message regarding the exact memory usage reduction'''
def reduce_memory_usage(df, verbose=True):
    numerics = ['int8', 'int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024 ** 2 #initial memory usage to compare to
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            #extract the min and max values
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                else:
                #elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2 #new memory_usage
    if verbose:
        print(
            "Memory usage decreased to: {:.2f} Mb - {:.1f}% reduction".format(
                end_mem, 100 * (start_mem - end_mem) / start_mem
                
                )
            )
    return df

In [12]:
X_train = pd.DataFrame(X_train)
X_test = pd.DataFrame(X_test)

In [13]:
print("X_train redution:")
X_train = reduce_memory_usage(X_train)
print("X_test reduction:")
X_test = reduce_memory_usage(X_test)

X_train redution:
Memory usage decreased to: 219.25 Mb - 75.0% reduction
X_test reduction:
Memory usage decreased to: 112.95 Mb - 75.0% reduction


In [14]:
#X_train_df = pd.DataFrame(X_train)
#X_train_df.hist(bins=50, figsize=(20,15))
#plt.show()

Lots of different distributions, some normal, some bi- or multimodal; tough luck... Probably the best solution would be to apply a transformation across the board.


Below are some initial params of the LightGBM algorhithm; I'll write them down for now. But probably I'll extend them to be lists of hyperparameters to tune:

The initial n_estimators did not result in early stopping, so it's probably wise to continue with the estimators - Early stopping will help us achieve this goal

In [15]:
x_tra, x_val, y_tra, y_val = train_test_split(X_train, y_train, test_size=0.2, shuffle=True, random_state=42)

In [30]:
def objective(trial):
    num_leaves = trial.suggest_int("num_leaves", 20, 40)
    n_estimators = trial.suggest_int("n_estimators", 500, 2000)
    max_depth = trial.suggest_int('max_depth', 3, 8)
    min_child_samples = trial.suggest_int('min_child_samples', 200, 750)
    learning_rate = trial.suggest_uniform('learning_rate', 0.10, 0.30)
    bagging_fraction = trial.suggest_uniform('bagging_fraction', 0.50, 1.0)
    colsample_bytree = trial.suggest_uniform('colsample_bytree', 0.50, 1.0)
    
    model = lgb.LGBMClassifier(
        objective='binary',
        metric='auc',
        num_leaves=num_leaves,
        n_estimators=n_estimators, 
        max_depth=max_depth, 
        min_child_samples=min_child_samples, 
        learning_rate=learning_rate,
        colsample_bytree=colsample_bytree,
        random_state=42,
    )
    
    model.fit(x_tra, y_tra)
    #see link in markdown above for this next line
    score = roc_auc_score(y_val, model.predict_proba(x_val)[:,1])
    return score

In [31]:
study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
params = study.best_params #getting best params from study

[32m[I 2021-09-30 18:50:31,913][0m A new study created in memory with name: no-name-7d57ea69-beb5-46b7-a29c-87980b0f3fef[0m
[32m[I 2021-09-30 18:55:04,911][0m Trial 0 finished with value: 0.8014599383458805 and parameters: {'num_leaves': 33, 'n_estimators': 1512, 'max_depth': 8, 'min_child_samples': 549, 'learning_rate': 0.27673148003635156, 'bagging_fraction': 0.6906505015910112, 'colsample_bytree': 0.593482370921205}. Best is trial 0 with value: 0.8014599383458805.[0m
[32m[I 2021-09-30 18:59:10,705][0m Trial 1 finished with value: 0.8140771976000376 and parameters: {'num_leaves': 32, 'n_estimators': 1418, 'max_depth': 3, 'min_child_samples': 729, 'learning_rate': 0.1524242132456086, 'bagging_fraction': 0.8207267301624811, 'colsample_bytree': 0.7754814041080009}. Best is trial 1 with value: 0.8140771976000376.[0m
[32m[I 2021-09-30 19:03:17,183][0m Trial 2 finished with value: 0.8082687435976919 and parameters: {'num_leaves': 23, 'n_estimators': 1116, 'max_depth': 7, 'min_ch

In [None]:
lgb_params = {
    'objective': 'binary',
    'n_estimators': 20000, #worth tuning
    'random_state': 42,
    'learning_rate': 4e-3, #worth tuning
    'subsample': 0.6,
    'subsample_freq': 1,
    'colsample_bytree': 0.4,
    'reg_alpha': 12.0,
    'reg_lambda': 1e-1,
    'min_child_weight': 256,
    'min_child_samples': 20,
}

In [32]:
#lowercase and shortened to distinguish from the 'original' train sets
#x_tra, x_val, y_tra, y_val = train_test_split(X_train, y_train, test_size=0.2, shuffle=True, random_state=42)

lgb_classifier = lgb.LGBMClassifier(**params)

lgb_classifier.fit(x_tra, y_tra, eval_set=[(x_val, y_val)],
                  eval_metric='auc', early_stopping_rounds=200,
                  verbose=500,
                  )
y_pred = lgb_classifier.predict_proba(X_test)

Training until validation scores don't improve for 200 rounds
[500]	valid_0's auc: 0.813253	valid_0's binary_logloss: 0.509945
[1000]	valid_0's auc: 0.814254	valid_0's binary_logloss: 0.50938
Early stopping, best iteration is:
[1042]	valid_0's auc: 0.814317	valid_0's binary_logloss: 0.509341


In [33]:
sample_df['claim'] = y_pred[:,1]

In [34]:
sample_df.to_csv('submission_8.csv', index=False)