### LightGBM and CatBoost hybrid

A simple walkthrough of the essential steps in creating a first set of predictions for a competition. Each step normally involves more work (filling NaNs, feature engineering, parameter tuning, validation, ensembling) before moving on to the next one, but here we go through each step only briefly.

In [None]:
import pandas as pd
import numpy as np
from catboost import CatBoostClassifier
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import tqdm

In [None]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

In [None]:
# fraction of missing values per column
print(train.isnull().sum(axis=0) / train.shape[0])

# impute missing values (assign the result back instead of using inplace on a column)
train['siteid'] = train['siteid'].fillna(-999)
test['siteid'] = test['siteid'].fillna(-999)

train['browserid'] = train['browserid'].fillna("None")
test['browserid'] = test['browserid'].fillna("None")

train['devid'] = train['devid'].fillna("None")
test['devid'] = test['devid'].fillna("None")

In [None]:
# set datetime dtype to the column
train['datetime'] = pd.to_datetime(train['datetime'])
test['datetime'] = pd.to_datetime(test['datetime'])

Note: A better way to do the above is to pass the list of datetime column names as the `parse_dates` argument of `pd.read_csv` when loading the CSV.
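
As a minimal sketch of that note, the snippet below reads a tiny in-memory CSV with `parse_dates`. The CSV content is made up for illustration; only the `datetime` column name is taken from the cells above.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for train.csv (hypothetical values).
csv_text = (
    "ID,datetime,click\n"
    "a1,2017-01-10 10:14:00,0\n"
    "a2,2017-01-11 22:05:00,1\n"
)

# parse_dates converts the listed columns to datetime while reading,
# so no separate pd.to_datetime call is needed afterwards.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"])

print(df.dtypes)
print(df["datetime"].dt.hour.tolist())
```

With the real files this would amount to passing `parse_dates=['datetime']` to the two `pd.read_csv` calls.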

In [None]:
# derive weekday, hour and minute features from the timestamp
train['tweekday'] = train['datetime'].dt.weekday
train['thour'] = train['datetime'].dt.hour
train['tminute'] = train['datetime'].dt.minute

test['tweekday'] = test['datetime'].dt.weekday
test['thour'] = test['datetime'].dt.hour
test['tminute'] = test['datetime'].dt.minute

In [None]:
cols = ['siteid', 'offerid', 'category', 'merchant']

for x in cols:
    train[x] = train[x].astype('object')
    test[x] = test[x].astype('object')

In [None]:
cols_to_use = list(set(train.columns) - set(['ID', 'datetime', 'click']))


### Model 1 - CatBoost


In [None]:
# treat the first 10 feature columns (by position) as categorical
cat_cols = list(range(10))

models = []
np.random.seed(42)
for i in tqdm.tqdm_notebook(range(10)):
    # sample 1M rows with replacement for this model
    rows = np.random.choice(train.index.values, int(1e6))
    sampled_train = train.loc[rows]
    trainX = sampled_train[cols_to_use]
    trainY = sampled_train['click']
    X_train, X_test, y_train, y_test = train_test_split(trainX, trainY, test_size=0.5)

    # slightly randomized parameters so that the models differ from each other
    cat_boost_params = {
        'depth': 13 + np.random.randint(4),  # 13 to 16; CatBoost documents 16 as the CPU maximum
        'iterations': 50 + np.random.randint(10),
        'learning_rate': 0.1 + (np.random.rand() * 1e-1),
        'eval_metric': 'AUC',
        'random_seed': np.random.randint(2 ** 31 - 1),  # keep the seed within the int32 range
        'verbose': True
    }
    model = CatBoostClassifier(**cat_boost_params)
    model.fit(X_train, y_train, cat_features=cat_cols, eval_set=(X_test, y_test), use_best_model=True)
    models.append(model)

Almost all of these parameter values are overkill for baseline predictions; they are listed mainly as a reminder that there are more parameters to try if we need them.

In general, `learning_rate` and `depth` matter the most. Consult the library docs for the default values.
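
The randomized parameter sets used in the training loop can be sketched as a small helper. This is only an illustration: `sample_catboost_params` is a hypothetical name, the ranges mirror the ones in the cell above, and the cap of 16 on `depth` assumes CatBoost's documented CPU limit.

```python
import numpy as np


def sample_catboost_params(rng):
    """Draw one randomized parameter set (hypothetical helper, not a CatBoost API)."""
    return {
        "depth": int(rng.integers(13, 17)),          # 13..16 inclusive
        "iterations": int(rng.integers(50, 60)),     # 50..59 inclusive
        "learning_rate": float(0.1 + rng.random() * 0.1),  # in [0.1, 0.2)
        "eval_metric": "AUC",
        "random_seed": int(rng.integers(0, 2**31 - 1)),
    }


# A seeded generator makes the sampled parameter sets reproducible.
rng = np.random.default_rng(42)
param_sets = [sample_catboost_params(rng) for _ in range(10)]

for params in param_sets[:3]:
    print(params)
```

Each dictionary could then be passed as `CatBoostClassifier(**params)`, which keeps the sampling logic separate from the training loop.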

In [None]:
print(models[0])
models[0].save_model('first_model', format='coreml')


### Model 2 - LightGBM


In [None]:
cat_cols = cols + ['countrycode', 'browserid', 'devid']

# Label-encode the categorical columns once, before the training loop.
# Note: this overwrites the columns of train and test in place, so after this
# cell the CatBoost models no longer see the raw category values in `test`.
for col in cat_cols:
    lbl = LabelEncoder()
    lbl.fit(list(train[col].values) + list(test[col].values))
    train[col] = lbl.transform(list(train[col].values))
    test[col] = lbl.transform(list(test[col].values))

lgbm_cols_to_use = list(set(train.columns) - set(['ID', 'datetime', 'click']))

models2 = []
np.random.seed(42)
for i in tqdm.tqdm_notebook(range(10)):
    X_train, X_test, y_train, y_test = train_test_split(train[lgbm_cols_to_use], train['click'], test_size=0.5)
    dtrain = lgb.Dataset(X_train, y_train)
    dval = lgb.Dataset(X_test, y_test, reference=dtrain)

    # slightly randomized parameters so that the models differ from each other
    lightgbm_params = {
        'seed': np.random.randint(2 ** 31 - 1),  # LightGBM seeds must fit in int32
        'num_leaves': 384 + np.random.randint(0, 128),
        'learning_rate': 0.05 + (np.random.rand() * 1e-2),
        'metric': 'auc',
        'objective': 'binary',
        'early_stopping_round': 40,
        'max_depth': 12 + np.random.randint(0, 10),
        'bagging_fraction': 0.5,
        'feature_fraction': 0.6,
        'bagging_seed': 2017,
        'feature_fraction_seed': 2017,
        'verbose': 1,
        'boosting': 'goss'
    }
    # log the validation metric every 20 rounds (callback form; verbose_eval was removed in LightGBM 4)
    model = lgb.train(lightgbm_params, dtrain, num_boost_round=500, valid_sets=[dval],
                      callbacks=[lgb.log_evaluation(period=20)])
    models2.append(model)

Again, these parameters are overkill, just as with the CatBoost model.

### Average Ensemble



In [None]:
predictions = []

for _model in tqdm.tqdm_notebook(models):
    predictions.append(_model.predict_proba(test[cols_to_use])[:, 1])

predictions = np.vstack(predictions).T

In [None]:
predictions2 = []

for _model in tqdm.tqdm_notebook(models2):
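    # for a binary objective, Booster.predict already returns a 1-D array of positive-class probabilities
    predictions2.append(_model.predict(test[lgbm_cols_to_use]))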

predictions2 = np.vstack(predictions2).T

In [None]:
prediction = np.hstack([predictions, predictions2]).mean(axis=1)
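
The averaging step can be checked on a toy example. The sketch below uses made-up probabilities for two models per library; rows are test samples and columns are models, matching the layout produced by `np.vstack(...).T` above.

```python
import numpy as np

# Hypothetical predicted probabilities: one row per sample, one column per model.
preds_a = np.array([[0.10, 0.20],
                    [0.80, 0.60]])  # e.g. two CatBoost models
preds_b = np.array([[0.30, 0.40],
                    [0.70, 0.90]])  # e.g. two LightGBM models

# Put all model columns side by side, then average across models for each sample.
stacked = np.hstack([preds_a, preds_b])
blended = stacked.mean(axis=1)

print(stacked.shape)
print(blended)
```

Because every column gets the same weight, a library that contributes more models has more influence on the average; here both contribute the same number, so the two libraries are weighted equally.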

In [None]:
sub = pd.DataFrame({'ID': test['ID'], 'click': prediction})
sub.to_csv('prediction.csv', index=False)