<a href="https://colab.research.google.com/github/afdebbas/DataScience/blob/master/Classification_Pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# https://gitlab.com/ppleskov/kaggle-days-dubai

# Ultimate Binary Classification Pipeline

## by Pavel Pleskov - Data Scientist at Point API (https://pointapi.com/)


https://hackernoon.com/interview-with-kaggle-grandmaster-data-scientist-at-point-api-pavel-pleskov-cc8ca67de249

## Example

- Humpback Whale Identification https://www.kaggle.com/c/humpback-whale-identification/discussion/82430#latest-482102
- Gendered Pronoun Resolution https://www.kaggle.com/c/gendered-pronoun-resolution/discussion/90417#latest-522210

## Motivation

- sklearn and other libraries offer a lot of algorithms; how to choose among them?
- trade-off between computation time and quality/diversity
- proper parameter tuning
- blending/stacking

## What will not be covered

- Naive Bayes, FFM
- imbalanced classes
- feature engineering
- setting up GPU for boosters

In [0]:
import pandas as pd
import numpy as np
import lightgbm as lgb
import xgboost as xgb
import catboost as cb
from catboost import CatBoostClassifier, Pool
import random 

from os import listdir
from tqdm import tqdm
from os.path import isfile

import sklearn
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, MinMaxScaler 
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn import preprocessing
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
from sklearn.decomposition import TruncatedSVD

from bayes_opt import BayesianOptimization
from bayes_opt.observer import JSONLogger
from bayes_opt.event import Events
from bayes_opt.util import load_logs

print("pandas:", pd.__version__)
print("numpy:", np.__version__)
print("sklearn:", sklearn.__version__)
print()
print("lightgbm:", lgb.__version__)
print("xgboost:", xgb.__version__)
print("catboost:", cb.__version__)

# pip install bayesian-optimization 
# pip install catboost

## Downloading the data

Homesite Quote Conversion

https://www.kaggle.com/c/homesite-quote-conversion/data

Please follow the link and accept the rules!

In [0]:
!unzip input/homesite-quote-conversion.zip

## Preprocessing

In [0]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [0]:
train.tail()

### 1. How to use a 10% subsample of the train set?

In [0]:
# PUT YOUR CODE HERE
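
One possible answer to question 1, as a minimal sketch on synthetic data. The toy frame and its `feature` column are assumptions; only `QuoteConversion_Flag` mirrors the notebook. Resetting the index matters here because later cells write predictions with `train.loc[valid_index, arch]`, which expects a 0..n-1 index.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train.csv (assumption); only the target name is from the notebook.
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "QuoteConversion_Flag": rng.integers(0, 2, size=1000),
})

# Plain random 10% subsample; reset_index keeps positional indexing valid later on.
sub = toy.sample(frac=0.1, random_state=42).reset_index(drop=True)

# Stratified variant: draw 10% inside each class so the class ratio is preserved.
strat = (toy.groupby("QuoteConversion_Flag")
            .sample(frac=0.1, random_state=42)
            .reset_index(drop=True))

print(len(sub), len(strat))
```

In the notebook itself the same call would be applied to `train` right after `pd.read_csv`, before the target is split off.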

Dataset seems to be balanced enough

In [0]:
train.QuoteConversion_Flag.value_counts()

Dealing with dates and categorical variables

In [0]:
# https://www.kaggle.com/tunguz/xgboost-benchmark-1

y = train.QuoteConversion_Flag.values
train = train.drop(['QuoteNumber', 'QuoteConversion_Flag'], axis=1)
test = test.drop('QuoteNumber', axis=1)

# Let's play with some dates
train['Date'] = pd.to_datetime(pd.Series(train['Original_Quote_Date']))
train = train.drop('Original_Quote_Date', axis=1)

test['Date'] = pd.to_datetime(pd.Series(test['Original_Quote_Date']))
test = test.drop('Original_Quote_Date', axis=1)

train['Year'] = train['Date'].dt.year
train['Month'] = train['Date'].dt.month
train['weekday'] = train['Date'].dt.dayofweek

test['Year'] = test['Date'].dt.year
test['Month'] = test['Date'].dt.month
test['weekday'] = test['Date'].dt.dayofweek

train = train.drop('Date', axis=1)
test = test.drop('Date', axis=1)

train = train.fillna(-1)
test = test.fillna(-1)

for f in tqdm(train.columns):
    if train[f].dtype=='object':
        #print(f)
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train[f].values) + list(test[f].values))
        train[f] = lbl.transform(list(train[f].values))
        test[f] = lbl.transform(list(test[f].values))
        
# train-test discrepancy analysis is skipped

In [0]:
train.tail()

In [0]:
cols = list(train.columns)
len(cols), cols

## Reducing dimensionality

In [0]:
# PUT YOUR CODE HERE

Using a column subsample in order to speed the calculation up

In [0]:
# Naive approach
# random.seed(42)
# cols = random.sample(cols, 50) 

### 2. Why is the explained variance always close to 100%?

In [0]:
N = 50

svd = TruncatedSVD(n_components=N, random_state=42)
X = svd.fit_transform(train[cols], y)  
svd.explained_variance_ratio_.sum()
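
A hedged illustration for question 2: one plausible reason is that the features are not scaled, so a few columns with a very large variance dominate the decomposition. The sketch below uses synthetic data (an assumption, not the Homesite columns) where a single column is 1000 times larger than the rest, and compares the ratio before and after `StandardScaler`.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 20))
X_toy[:, 0] *= 1000.0  # one column on a much larger scale (assumption)

# Raw data: the large-scale column carries almost all of the total variance.
svd_raw = TruncatedSVD(n_components=2, random_state=42).fit(X_toy)
ratio_raw = svd_raw.explained_variance_ratio_.sum()

# Scaled data: every column has unit variance, so two components explain little.
X_scaled = StandardScaler().fit_transform(X_toy)
svd_scaled = TruncatedSVD(n_components=2, random_state=42).fit(X_scaled)
ratio_scaled = svd_scaled.explained_variance_ratio_.sum()

print(round(ratio_raw, 4), round(ratio_scaled, 4))
```

Whether this is the whole story for the Homesite data is not confirmed by the notebook; checking `train[cols].var()` would show how uneven the column scales actually are.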

In [0]:
df = pd.DataFrame()
df["target"] = y

for i in range(N):
    df[i] = X[:,i]
    
df.to_csv("partial_train_"+str(N)+".csv", index=False)

df.tail()

## LogReg

### 3. What is wrong with this code?

In [0]:
%%time

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
arch = "reg"

train[arch] = 0

for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
    X_train = X[train_index]
    X_valid = X[valid_index]

    y_train = y[train_index]
    y_valid = y[valid_index]
    
    reg = LogisticRegression().fit(X_train, y_train) 
    
    y_pred = reg.predict_proba(X_valid)[:,1]
    train.loc[valid_index, arch] = y_pred
    print(i, "ROC AUC:", round(roc_auc_score(y_valid, y_pred), 5))

print()
print("OOF ROC AUC:", round(roc_auc_score(y, train[arch]), 5))
print()
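
One possible reading of question 3, sketched on synthetic data: the SVD features are fed to `LogisticRegression` unscaled and with default settings. `StandardScaler` is imported at the top of the notebook but not used here, which hints at this answer, although the notebook does not confirm it. The sketch fits the scaler inside each fold through a pipeline and collects predictions in a float array; `max_iter=1000` is an assumed value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the SVD features (assumption).
X_toy, y_toy = make_classification(n_samples=1000, n_features=20, random_state=42)
X_toy[:, 0] *= 1000.0  # mimic badly scaled input

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(y_toy))  # float array, so no integer column has to be upcast

for train_index, valid_index in skf.split(X_toy, y_toy):
    # The scaler is fitted on the training part of the fold only.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    pipe.fit(X_toy[train_index], y_toy[train_index])
    oof[valid_index] = pipe.predict_proba(X_toy[valid_index])[:, 1]

oof_auc = roc_auc_score(y_toy, oof)
print("OOF ROC AUC:", round(oof_auc, 5))
```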

## SVM

### 4. What is wrong with this algorithm? (at least two things)

In [0]:
# https://github.com/Xtra-Computing/thundergbm
# pip install thundersvm-cpu-0.2.0-py3-none-linux_x86_64.whl
# from thundersvm import SVC

In [0]:
%%time

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
arch = "svc"

train[arch] = 0

for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
    X_train = X[train_index]
    X_valid = X[valid_index]

    y_train = y[train_index]
    y_valid = y[valid_index]
    
    svc = SVC().fit(X_train, y_train) 
    
    y_pred = svc.predict_proba(X_valid)[:,1]
    train.loc[valid_index, arch] = y_pred
    print(i, "ROC AUC:", round(roc_auc_score(y_valid, y_pred), 5))

print()
print("OOF ROC AUC:", round(roc_auc_score(y, train[arch]), 5))
print()
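
A hedged note on question 4. Two things that are certain about scikit-learn's `SVC`: with the default `probability=False` it does not expose `predict_proba`, and kernel SVM training scales poorly with the number of rows. The sketch below (synthetic data, an assumption) shows the first point and uses `decision_function` instead, which is enough for ROC AUC because only the ranking matters.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_toy, y_toy = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=42)

# Default SVC: asking for probabilities fails.
plain = SVC().fit(X_tr, y_tr)
try:
    plain.predict_proba(X_va)
    proba_available = True
except AttributeError:
    proba_available = False

# Scaled inputs plus decision_function: a ranking score without Platt scaling.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_tr, y_tr)
scores = model.decision_function(X_va)
auc = roc_auc_score(y_va, scores)
print(proba_available, round(auc, 5))
```

Passing `probability=True` is the other option, at the cost of an extra internal cross-validation during `fit`.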

## KNN

### 5. How can we produce more models with KNN? 

In [0]:
%%time

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
arch = "nei"

train[arch] = 0

for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
    X_train = X[train_index]
    X_valid = X[valid_index]

    y_train = y[train_index]
    y_valid = y[valid_index]
    
    nei = KNeighborsClassifier(p=1, n_jobs=-1).fit(X_train, y_train) 
    
    y_pred = nei.predict_proba(X_valid)[:,1]
    train.loc[valid_index, arch] = y_pred
    print(i, "ROC AUC:", round(roc_auc_score(y_valid, y_pred), 5))
    
print()
print("OOF ROC AUC:", round(roc_auc_score(y, train[arch]), 5))
print()
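
For question 5, a minimal sketch of one way to get several KNN models: vary `n_neighbors` and the Minkowski power `p`, and keep the out-of-fold predictions of each variant as a separate column for blending. The grid values and the `nei_k..._p...` names are assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=42)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

oof_by_model = {}
for k in (5, 25, 125):      # number of neighbours (assumed grid)
    for p in (1, 2):        # Manhattan vs. Euclidean distance
        name = f"nei_k{k}_p{p}"
        knn = KNeighborsClassifier(n_neighbors=k, p=p)
        # Out-of-fold probabilities of the positive class.
        oof_by_model[name] = cross_val_predict(
            knn, X_toy, y_toy, cv=skf, method="predict_proba")[:, 1]
        print(name, round(roc_auc_score(y_toy, oof_by_model[name]), 5))
```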

## Random Forest

### 6. Which optimization metric is used by RFC?

In [0]:
%%time

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
arch = "rfc"

train[arch] = 0
test[arch] = 0

for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
    X_train = X[train_index]
    X_valid = X[valid_index]

    y_train = y[train_index]
    y_valid = y[valid_index]
    
    rfc = RandomForestClassifier(n_estimators=100,
                                 n_jobs=-1).fit(X_train, y_train) 
    
    y_pred = rfc.predict_proba(X_valid)[:,1]
    train.loc[valid_index, arch] = y_pred
    print(i, "ROC AUC:", round(roc_auc_score(y_valid, y_pred), 5))

print()
print("OOF ROC AUC:", round(roc_auc_score(y, train[arch]), 5))
print()

## LGBM

### 7. How to save an LGBM model? What is it anyway?

In [0]:
%%time

arch = "lgb"

train[arch] = 0

rounds = 10000
early_stop_rounds = 300

params = {'objective': 'binary',
          'boosting_type': 'gbrt',
          'metric': 'auc',
          'seed': 42,
          'max_depth': -1,
          'verbose': -1,
          'n_jobs': -1}

# best_params = {'feature_fraction': 0.8752556106728574,
#               'lambda_l1': 1.3735569040447826,
#               'lambda_l2': 8.04774809406042,
#               'learning_rate': 0.024553401275571943,
#               'min_data_in_leaf': 16.31456193667883,
#               'min_sum_hessian_in_leaf': 10.489617646270466,
#               'num_leaves': 14.623398745206696}

# best_params['num_leaves'] = int(best_params['num_leaves'])
# best_params['min_data_in_leaf'] = int(best_params['min_data_in_leaf'])
# params.update(best_params)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
    X_train = X[train_index]
    X_valid = X[valid_index]

    y_train = y[train_index]
    y_valid = y[valid_index]

    d_train = lgb.Dataset(X_train, y_train)
    d_valid = lgb.Dataset(X_valid, y_valid)    

    model = lgb.train(params,
                      d_train,
                      num_boost_round=rounds,
                      valid_sets=[d_train, d_valid],
                      valid_names=['train','valid'],
                      early_stopping_rounds=early_stop_rounds,
                      verbose_eval=0) 
    
    # PUT YOUR CODE HERE

    y_pred = model.predict(X_valid)
    train.loc[valid_index, arch] = y_pred
    auc = roc_auc_score(y_valid, y_pred)
    print(i, "ROC AUC:", round(auc, 5))

print()
print("OOF ROC AUC:", round(roc_auc_score(y, train[arch]), 5))
print()

## Feature importance

### 8. How to load the model?

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

# PUT YOUR CODE HERE

fig, ax = plt.subplots(figsize=(15, 10))
lgb.plot_importance(model, max_num_features=50, ax=ax)
plt.title("Light GBM Feature Importance")

## HPO

In [0]:
def evaluate(**new_params):

    rounds = 10000
    early_stop_rounds = 300
    
    new_params['num_leaves'] = int(new_params['num_leaves'])
    new_params['min_data_in_leaf'] = int(new_params['min_data_in_leaf'])
    
    params.update(new_params)
    print(new_params)
    
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    oof = np.zeros(len(train))
    
    for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
        X_train = X[train_index]
        X_valid = X[valid_index]

        y_train = y[train_index]
        y_valid = y[valid_index]

        d_train = lgb.Dataset(X_train, y_train)
        d_valid = lgb.Dataset(X_valid, y_valid)    

        model = lgb.train(params,
                          d_train,
                          num_boost_round=rounds,
                          valid_sets=[d_train, d_valid],
                          valid_names=['train','valid'],
                          early_stopping_rounds=early_stop_rounds,
                          verbose_eval=0) 
        
        oof[valid_index] = model.predict(X_valid)
        auc = roc_auc_score(y[valid_index], oof[valid_index])
        print(i, "ROC AUC:", round(auc, 5))
    
    auc = roc_auc_score(y, oof)
    print()
    print("ROC AUC VALID:", round(auc, 5))
    print()
        
    return auc

### 9. Which parameters to optimize?

In [0]:
# https://github.com/fmfn/BayesianOptimization

params = {'objective': 'binary',
          'boosting_type': 'gbrt',
          'metric': 'auc',
          'bagging_freq': 1,
          'bagging_fraction': 0.9,
          'bagging_seed': 42,
          'seed': 42,
          'max_bin': 1023, #255
          'max_depth': -1,
          'verbose': -1,
          'n_jobs': -1}

bounds = {'num_leaves': (4,32), 
          'learning_rate': (0.01,0.05)
          # PUT YOUR CODE HERE
         }

bo = BayesianOptimization(evaluate, pbounds=bounds)

log_file = "./output/hpo_lgbm_logs.json"
logger = JSONLogger(path=log_file)
bo.subscribe(Events.OPTMIZATION_STEP, logger)

bo.maximize(init_points=2, n_iter=8)
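
A hedged sketch for question 9. The parameter names below are the ones listed in the commented-out `best_params` of the LGBM cell; the ranges are assumptions for illustration only. Note that `evaluate()` casts `min_data_in_leaf` to `int`, so that key has to be present in `bounds` or the call raises a `KeyError`. The helper `cast_params` is hypothetical and only mirrors that casting step.

```python
# Hypothetical search ranges; only the parameter names come from the notebook.
bounds = {
    "num_leaves": (4, 32),
    "learning_rate": (0.01, 0.05),
    "feature_fraction": (0.5, 1.0),
    "lambda_l1": (0.0, 10.0),
    "lambda_l2": (0.0, 10.0),
    "min_data_in_leaf": (5, 100),
    "min_sum_hessian_in_leaf": (1e-3, 20.0),
}

# The optimizer samples floats, so integer-valued parameters must be cast.
INT_PARAMS = ("num_leaves", "min_data_in_leaf")


def cast_params(sampled):
    """Return a copy of the sampled parameters with integer ones cast to int."""
    out = dict(sampled)
    for name in INT_PARAMS:
        out[name] = int(out[name])
    return out


example = cast_params(
    {"num_leaves": 14.6, "min_data_in_leaf": 16.3, "learning_rate": 0.02})
print(example)
```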

In [0]:
bo.max

## XGB

### 10. Pros and cons of using XGB?

In [0]:
%%time

arch = "xgb"

train[arch] = 0

rounds = 10000
early_stop_rounds = 100

params = {'eval_metric': 'auc',
          'booster': 'gbtree',
          'tree_method': 'hist',
          'objective': 'binary:logistic',
          'subsample': 0.9,
          'colsample_bytree': 0.3,
          'eta': 0.1,
          'max_depth': 4,
          'seed': 42,
          'verbosity': 0}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
    X_train = X[train_index]
    X_valid = X[valid_index]

    y_train = y[train_index]
    y_valid = y[valid_index]

    d_train = xgb.DMatrix(X_train, y_train)
    d_valid = xgb.DMatrix(X_valid, y_valid)

    model = xgb.train(params,
                      d_train,
                      rounds,
                      [(d_train, 'train'), (d_valid, 'eval')],
                      early_stopping_rounds=early_stop_rounds,
                      verbose_eval=0)
    
    #joblib.dump(model, f"./output/xgb_{i}.pkl")

    best = model.best_iteration + 1

    y_pred = model.predict(d_valid, ntree_limit=best)
    train.loc[valid_index, arch] = y_pred
    auc = roc_auc_score(y_valid, y_pred)
    print(i, "ROC AUC:", round(auc, 5))

print()
print("OOF ROC AUC:", round(roc_auc_score(y, train[arch]), 5))
print()

## CatBoost

### 11. Pros and cons of using CatBoost?

In [0]:
%%time

arch = "cat"

train[arch] = 0

rounds = 10000
early_stop_rounds = 100

params = {'task_type': 'CPU', #GPU
          'iterations': rounds,
          'loss_function': 'Logloss',
          'eval_metric':'AUC',
          'random_seed': 42,
          'learning_rate': 0.5,
          'depth': 2}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
    X_train = X[train_index]
    X_valid = X[valid_index]

    y_train = y[train_index]
    y_valid = y[valid_index]
    
    trn_data = Pool(X_train, y_train)
    val_data = Pool(X_valid, y_valid)
    
    clf = CatBoostClassifier(**params)
    clf.fit(trn_data,
            eval_set=val_data,
            use_best_model=True,
            early_stopping_rounds=early_stop_rounds,
            verbose=0)
    
    y_pred = clf.predict_proba(X_valid)[:, 1]
    train.loc[valid_index, arch] = y_pred
    auc = roc_auc_score(y_valid, y_pred)
    print(i, "ROC AUC:", round(auc, 5))

print()
print("OOF ROC AUC:", round(roc_auc_score(y, train[arch]), 5))
print()

## NN

### 12. What is wrong with this code? (answer also applies to all boosters)

In [0]:
import fastai
import torch

from fastai.basic_data import load_data
from fastai.tabular import *

print("fastai:", fastai.__version__)
print("torch:", torch.__version__)

In [0]:
path = Path('../output/')
bs = 2048

procs = [FillMissing, Categorify, Normalize]

dep_var = "target"

cat_names = []
cont_names = []

df = pd.read_csv("partial_train_"+str(N)+".csv")
cont_names = list(df.columns)
cont_names.remove('target')

In [0]:
def auc_score(y_pred, y_true, tens=True):
    score = roc_auc_score(y_true, torch.sigmoid(y_pred)[:, 1])
    # fastai metrics expect a tensor; return the plain float otherwise
    return tensor(score) if tens else score

random_seed = 42

np.random.seed(random_seed)
random.seed(random_seed)
torch.manual_seed(random_seed)
torch.cuda.manual_seed(random_seed)

In [0]:
%%time

arch = "nn"

train[arch] = 0

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for i, (train_index, valid_index) in enumerate(skf.split(X, y)):
    
    data = (TabularList.from_df(df, path=path, cat_names=cat_names,
                                cont_names=cont_names, procs=procs)
                       .split_by_idx(valid_index)
                       .label_from_df(cols=dep_var)
                       .databunch(bs=bs))
    
    learn = tabular_learner(data, 
                            layers=[20, 10],
                            ps=0.5,
                            emb_drop=0.5,
                            y_range=[0, 1],
                            use_bn=True,
                            metrics=[auc_score])
    lr = 1e-2
    
    learn.fit_one_cycle(10, lr, moms=(0.8, 0.7))
    learn.fit_one_cycle(10, lr/2, moms=(0.8, 0.7))
        
    preds = learn.get_preds(ds_type=DatasetType.Valid)
    y_pred = [float(preds[0][i][1]) for i in range(len(preds[0]))]
    
    train.loc[valid_index, arch] = y_pred
    auc = roc_auc_score(y[valid_index], y_pred)
    print(i, "ROC AUC:", round(auc, 5))

print()
print("OOF ROC AUC:", round(roc_auc_score(y, train[arch]), 5))
print()

## Correlation

### 13. How to fix this code?

In [0]:
models = ["reg", "nei", "rfc", "lgb", "xgb", "cat", "nn"]

# PUT YOUR CODE HERE

train[models].corr()

In [0]:
train[models].tail()

## Blending

In [0]:
for arch in models:
    print(arch, round(roc_auc_score(y, train[arch]), 5))

Let's try the traditional arithmetic mean

In [0]:
train["avg"] = train[models].mean(axis=1)
print("avg", round(roc_auc_score(y, train["avg"]), 5))

Now some more advanced means

In [0]:
from scipy.stats.mstats import gmean

def power_mean(x, p=1):
    if p==0:
        return gmean(x, axis=1)
    return np.power(np.mean(np.power(x,p), axis=1), 1/p)

### 14. Which powers to check?

In [0]:
# PUT YOUR CODE HERE
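
For question 14, a minimal sketch that scans a handful of powers with the `power_mean` function defined above. The grid of powers and the synthetic predictions are assumptions. Negative powers and the geometric mean (p = 0) need strictly positive inputs, hence the clipping.

```python
import numpy as np
from scipy.stats import gmean
from sklearn.metrics import roc_auc_score


def power_mean(x, p=1):
    # Same definition as in the notebook: p=0 is the geometric mean.
    if p == 0:
        return gmean(x, axis=1)
    return np.power(np.mean(np.power(x, p), axis=1), 1 / p)


# Three synthetic "models" with increasing noise (assumption).
rng = np.random.default_rng(42)
y_toy = rng.integers(0, 2, size=2000)
preds = np.column_stack([
    0.5 + 0.3 * (y_toy - 0.5) + rng.normal(0, s, size=2000)
    for s in (0.2, 0.3, 0.4)
])
preds = np.clip(preds, 1e-6, 1 - 1e-6)  # keep values strictly inside (0, 1)

results = {}
for p in (-2, -1, 0, 0.5, 1, 2, 3):
    results[p] = roc_auc_score(y_toy, power_mean(preds, p))
    print(p, round(results[p], 5))
```

In the notebook the call would be `power_mean(train[models].values, p)`, assuming every model column holds probabilities.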

### 15. How to choose weights?

In [0]:
w = [1,1,1,1,1,1,1,1] # CHANGE YOUR CODE HERE

train["w_avg"] = 0
for i, model in enumerate(models):
    train["w_avg"] += train[model] * w[i]

print("w_avg", round(roc_auc_score(y, train["w_avg"]), 5))
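
One way to approach question 15, as a sketch under stated assumptions: treat the weights as free parameters and maximize the OOF ROC AUC with a derivative-free optimizer. Nelder-Mead from `scipy.optimize` is a swapped-in choice here, not something the notebook prescribes, and the predictions are synthetic. There has to be exactly one weight per entry in `models`.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.metrics import roc_auc_score

# Three synthetic "models" with increasing noise (assumption).
rng = np.random.default_rng(42)
y_toy = rng.integers(0, 2, size=2000)
preds = np.column_stack([
    0.5 + 0.3 * (y_toy - 0.5) + rng.normal(0, s, size=2000)
    for s in (0.2, 0.3, 0.4)
])


def neg_auc(w):
    # abs() keeps the weights non-negative; AUC is invariant to their overall scale.
    return -roc_auc_score(y_toy, preds @ np.abs(w))


w0 = np.ones(preds.shape[1]) / preds.shape[1]  # start from equal weights
base_auc = -neg_auc(w0)

res = minimize(neg_auc, w0, method="Nelder-Mead")
best_w = np.abs(res.x) / np.abs(res.x).sum()   # normalise for readability
best_auc = -res.fun

print("equal:", round(base_auc, 5), "tuned:", round(best_auc, 5), best_w.round(3))
```

Weights tuned on the same OOF predictions they are scored on can overfit, so a held-out check is advisable.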

## Stacking

### 16. How hard could it be?

In [0]:
# PUT YOUR CODE HERE

### 17. How to get weights?

In [0]:
# PUT YOUR CODE HERE
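
A minimal sketch for questions 16 and 17, assuming a logistic regression as the meta-model (the notebook does not name one). The level-one features are the OOF predictions of the base models; here they are synthetic and the column names are borrowed from `models` for illustration. The fitted coefficients can be read as blending weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic OOF predictions of three base models (assumption).
rng = np.random.default_rng(42)
y_toy = rng.integers(0, 2, size=2000)
names = ["reg", "rfc", "lgb"]
oof_preds = np.column_stack([
    0.5 + 0.3 * (y_toy - 0.5) + rng.normal(0, s, size=2000)
    for s in (0.2, 0.3, 0.4)
])

# Score the meta-model out-of-fold as well, so the estimate is not optimistic.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
meta = LogisticRegression()
stack_oof = cross_val_predict(
    meta, oof_preds, y_toy, cv=skf, method="predict_proba")[:, 1]
stack_auc = roc_auc_score(y_toy, stack_oof)

# Fit on everything to read off one coefficient per base model.
meta.fit(oof_preds, y_toy)
weights = dict(zip(names, meta.coef_[0]))

print("stack ROC AUC:", round(stack_auc, 5))
print(weights)
```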

## H2O

### 18. What is the point anyway? 

In [0]:
# pip install h2o
# https://github.com/h2oai/h2o-tutorials/blob/master/h2o-world-2017/automl/Python/automl_binary_classification_product_backorders.ipynb

import h2o
from h2o.automl import H2OAutoML
h2o.init()

In [0]:
data_path = "partial_train_"+str(N)+".csv"
df = h2o.import_file(data_path)

In [0]:
df.describe()

In [0]:
target = "C1"
x = df.columns
x.remove(target)

df[target] = df[target].asfactor() # important line

In [0]:
aml = H2OAutoML(max_models = 10, seed = 1)
aml.train(x = x, y = target, training_frame = df)

In [0]:
lb = aml.leaderboard
lb.head()

In [0]:
# Get model ids for all models in the AutoML Leaderboard
model_ids = list(aml.leaderboard['model_id'].as_data_frame().iloc[:,0])
# Get the "All Models" Stacked Ensemble model
se = h2o.get_model([mid for mid in model_ids if "StackedEnsemble_AllModels" in mid][0])
# Get the Stacked Ensemble metalearner model
metalearner = h2o.get_model(se.metalearner()['name'])

In [0]:
metalearner.coef_norm()

In [0]:
%matplotlib inline
metalearner.std_coef_plot()