# Part 2. Comparing the speed and accuracy of LGBM, XGB and Catboost

&nbsp;

This notebook compares the `LightGBM`, `XGBoost` and `Catboost` algorithms on a real dataset from the Kaggle competition [Google Analytics Customer Revenue Prediction](https://www.kaggle.com/c/google-analytics-customer-revenue-prediction). The table below reports the results obtained on the full dataset; the cells in this notebook are run on the first 100,000 rows only.


&nbsp;

| Model        | Rounds | Train RMSE           | Validation RMSE | Train time | Public Score|
| ------------- |------:|-----:|-----:| -----:| -----:|
| `LightGBM`      | 5000| 1.505 | <span style='color:green'>1.60372 </span> | 7min 48s | <span style='color:green'>1.6717</span> |
| `XGBoost`      | 2000| 1.568 | 1.64924 | <span style='color:red'>54min 54s </span> | 1.6946 |
| `Catboost`      | 1000| 1.52184 | 1.61231  | <span style='color:green'>2min 24s</span> | 1.6722 |
| `Ensemble`      | -- | --| -- | -- | <span style='color:green'>1.6677</span>|




Notebook outline:

1. [Preprocessing](#preprocessing)
2. [Models](#models)
  - 2.1. [LightGBM](#lightgbm)
  - 2.2. [XGBoost](#xgboost)
  - 2.3. [Catboost](#catboost)
3. [Ensemble and submissions](#ensemble)
4. [Conclusions](#conclusions)
5. [References](#references)

<a id='preprocessing'></a>
## 1. Preprocessing

&nbsp;

The preprocessing is taken from [LGBM (RF) starter [LB: 1.70]](https://www.kaggle.com/fabiendaniel/lgbm-rf-starter-lb-1-70).

Mount Google Drive:

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [35]:
INPUT_TRAIN = "/content/drive/My Drive/Sber/lecture_15/data/train.csv"
INPUT_TEST = "/content/drive/My Drive/Sber/lecture_15/data/test.csv"

# Paths where the preprocessed datasets will be saved
TRAIN='train-processed.csv'
TEST='test-processed.csv'
Y='y.csv'

In [6]:
import os
import gc
import json
import time
from datetime import datetime
import numpy as np
import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; the old pandas.io.json import path is deprecated

import warnings
warnings.filterwarnings('ignore')

# Reference: https://www.kaggle.com/julian3833/1-quick-start-read-csv-and-flatten-json-fields
def load_df(csv_path=INPUT_TRAIN, nrows=None):
    print(f"Loading {csv_path}")
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    
    df = pd.read_csv(csv_path, 
                     converters={column: json.loads for column in JSON_COLUMNS}, 
                     dtype={'fullVisitorId': 'str'}, # Important!!
                     nrows=nrows)
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df


# This function is just a packaged version of this kernel:
# https://www.kaggle.com/fabiendaniel/lgbm-rf-starter-lb-1-70
def process_dfs(train_df, test_df):
    print("Processing dfs...")
    print("Dropping repeated columns...")
    columns = [col for col in train_df.columns if train_df[col].nunique() > 1]
    
    train_df = train_df[columns]
    test_df = test_df[columns]

    trn_len = train_df.shape[0]
    merged_df = pd.concat([train_df, test_df])

    merged_df['diff_visitId_time'] = merged_df['visitId'] - merged_df['visitStartTime']
    merged_df['diff_visitId_time'] = (merged_df['diff_visitId_time'] != 0).astype(int)
    del merged_df['visitId']

    del merged_df['sessionId']

    print("Generating date columns...")
    format_str = '%Y%m%d' 
    merged_df['formated_date'] = merged_df['date'].apply(lambda x: datetime.strptime(str(x), format_str))
    merged_df['WoY'] = merged_df['formated_date'].apply(lambda x: x.isocalendar()[1])
    merged_df['month'] = merged_df['formated_date'].apply(lambda x:x.month)
    merged_df['quarter_month'] = merged_df['formated_date'].apply(lambda x:x.day//8)
    merged_df['weekday'] = merged_df['formated_date'].apply(lambda x:x.weekday())

    del merged_df['date']
    del merged_df['formated_date']

    merged_df['formated_visitStartTime'] = merged_df['visitStartTime'].apply(
        lambda x: time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(x)))
    merged_df['formated_visitStartTime'] = pd.to_datetime(merged_df['formated_visitStartTime'])
    merged_df['visit_hour'] = merged_df['formated_visitStartTime'].apply(lambda x: x.hour)

    del merged_df['visitStartTime']
    del merged_df['formated_visitStartTime']

    print("Encoding columns with pd.factorize()")
    for col in merged_df.columns:
        if col in ['fullVisitorId', 'month', 'quarter_month', 'weekday', 'visit_hour', 'WoY']: continue
        if merged_df[col].dtypes == object or merged_df[col].dtypes == bool:
            merged_df[col], indexer = pd.factorize(merged_df[col])

    print("Splitting back...")
    train_df = merged_df[:trn_len]
    test_df = merged_df[trn_len:]
    return train_df, test_df

def preprocess():
    train_df = load_df()
    test_df = load_df(INPUT_TEST)

    target = train_df['totals.transactionRevenue'].fillna(0).astype(float)
    target = target.apply(lambda x: np.log1p(x))
    del train_df['totals.transactionRevenue']

    train_df, test_df = process_dfs(train_df, test_df)
    train_df.to_csv(TRAIN, index=False)
    test_df.to_csv(TEST, index=False)
    target.to_csv(Y, index=False)
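
As a small illustration of what `load_df` does to the JSON columns, here is a sketch on a two-row toy frame. The toy data and the `device` values are made up for illustration, and the sketch assumes pandas >= 1.0, where `json_normalize` is available as `pd.json_normalize`.

```python
import json
import pandas as pd

# Toy stand-in for the raw CSV: one JSON-encoded column, as in train.csv
raw = pd.DataFrame({
    "fullVisitorId": ["0001", "0002"],
    "device": ['{"browser": "Chrome", "isMobile": false}',
               '{"browser": "Firefox", "isMobile": true}'],
})

# Parse the JSON strings, then expand each key into its own column
parsed = raw["device"].apply(json.loads)
flat = pd.json_normalize(parsed.tolist())
flat.columns = [f"device.{sub}" for sub in flat.columns]

# Replace the JSON column by the flattened columns (index-aligned merge)
toy_df = raw.drop("device", axis=1).merge(flat, left_index=True, right_index=True)
print(toy_df)
```

Each key of the JSON object becomes a column named `<column>.<key>`, which is where names such as `totals.pageviews` in the preprocessed data come from.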


In [76]:
original_df = pd.read_csv(INPUT_TRAIN )
preprocessed_df = pd.read_csv('train-processed.csv')
y_res = pd.read_csv('y.csv')

In [73]:
original_df

Unnamed: 0,channelGrouping,date,device,fullVisitorId,geoNetwork,sessionId,socialEngagementType,totals,trafficSource,visitId,visitNumber,visitStartTime
0,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",1131660440785968503,"{""continent"": ""Asia"", ""subContinent"": ""Western...",1131660440785968503_1472830385,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472830385,1,1472830385
1,Organic Search,20160902,"{""browser"": ""Firefox"", ""browserVersion"": ""not ...",377306020877927890,"{""continent"": ""Oceania"", ""subContinent"": ""Aust...",377306020877927890_1472880147,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472880147,1,1472880147
2,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",3895546263509774583,"{""continent"": ""Europe"", ""subContinent"": ""South...",3895546263509774583_1472865386,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472865386,1,1472865386
3,Organic Search,20160902,"{""browser"": ""UC Browser"", ""browserVersion"": ""n...",4763447161404445595,"{""continent"": ""Asia"", ""subContinent"": ""Southea...",4763447161404445595_1472881213,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472881213,1,1472881213
4,Organic Search,20160902,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",27294437909732085,"{""continent"": ""Europe"", ""subContinent"": ""North...",27294437909732085_1472822600,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""1"", ""pageviews"": ""1"",...","{""campaign"": ""(not set)"", ""source"": ""google"", ...",1472822600,2,1472822600
...,...,...,...,...,...,...,...,...,...,...,...,...
903648,Social,20170104,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",5123779100307500332,"{""continent"": ""Americas"", ""subContinent"": ""Car...",5123779100307500332_1483554750,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""17"", ""pageviews"": ""15...","{""referralPath"": ""/yt/about/"", ""campaign"": ""(n...",1483554750,1,1483554750
903649,Social,20170104,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",7231728964973959842,"{""continent"": ""Asia"", ""subContinent"": ""Souther...",7231728964973959842_1483543798,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""18"", ""pageviews"": ""13...","{""referralPath"": ""/yt/about/"", ""campaign"": ""(n...",1483543798,1,1483543798
903650,Social,20170104,"{""browser"": ""Android Webview"", ""browserVersion...",5744576632396406899,"{""continent"": ""Asia"", ""subContinent"": ""Eastern...",5744576632396406899_1483526434,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""24"", ""pageviews"": ""21...","{""referralPath"": ""/yt/about/ko/"", ""campaign"": ...",1483526434,1,1483526434
903651,Social,20170104,"{""browser"": ""Chrome"", ""browserVersion"": ""not a...",2709355455991750775,"{""continent"": ""Asia"", ""subContinent"": ""Southea...",2709355455991750775_1483592857,Not Socially Engaged,"{""visits"": ""1"", ""hits"": ""24"", ""pageviews"": ""22...","{""referralPath"": ""/l.php"", ""campaign"": ""(not s...",1483592857,1,1483592864


In [74]:
preprocessed_df

Unnamed: 0,channelGrouping,fullVisitorId,visitNumber,device.browser,device.operatingSystem,device.isMobile,device.deviceCategory,geoNetwork.continent,geoNetwork.subContinent,geoNetwork.country,geoNetwork.region,geoNetwork.metro,geoNetwork.city,geoNetwork.networkDomain,totals.hits,totals.pageviews,trafficSource.campaign,trafficSource.source,trafficSource.medium,trafficSource.keyword,trafficSource.referralPath,trafficSource.adwordsClickInfo.page,trafficSource.adwordsClickInfo.slot,trafficSource.adwordsClickInfo.gclId,trafficSource.adwordsClickInfo.adNetworkType,trafficSource.adContent,diff_visitId_time,WoY,month,quarter_month,weekday,visit_hour
0,0,1131660440785968503,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,0,35,9,0,4,15
1,0,377306020877927890,1,1,1,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,0,35,9,0,4,5
2,0,3895546263509774583,1,0,0,0,0,2,2,2,2,0,2,2,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,0,35,9,0,4,1
3,0,4763447161404445595,1,2,2,0,0,0,3,3,1,1,1,2,0,0,0,0,0,1,-1,-1,-1,-1,-1,-1,0,35,9,0,4,5
4,0,27294437909732085,2,0,3,1,1,2,4,4,1,1,1,2,0,0,0,0,0,0,-1,-1,-1,-1,-1,-1,0,35,9,0,4,13
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
903648,6,5123779100307500332,1,0,0,0,0,3,15,55,1,1,1,1242,16,14,0,25,1,-1,61,-1,-1,-1,-1,-1,0,1,1,0,2,18
903649,6,7231728964973959842,1,0,3,1,1,0,5,45,1,1,1,2,17,12,0,25,1,-1,61,-1,-1,-1,-1,-1,0,1,1,0,2,15
903650,6,5744576632396406899,1,16,3,1,1,0,8,26,15,0,16,2,23,20,0,25,1,-1,34,-1,-1,-1,-1,-1,0,1,1,0,2,10
903651,6,2709355455991750775,1,0,0,0,0,0,3,3,1,1,1,2,23,22,0,27,1,-1,77,-1,-1,-1,-1,-1,1,1,1,0,2,5


In [7]:
%%time
preprocess()

Loading /content/drive/My Drive/Sber/lecture_15/data/train.csv
Loaded train.csv. Shape: (903653, 55)
Loading /content/drive/My Drive/Sber/lecture_15/data/test.csv
Loaded test.csv. Shape: (804684, 53)
Processing dfs...
Dropping repeated columns...
Generating date columns...
Encoding columns with pd.factorize()
Splitting back...
CPU times: user 6min 25s, sys: 18.2 s, total: 6min 43s
Wall time: 6min 55s
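
The date features built in `process_dfs` can be checked by hand on a single value. A minimal sketch for the date `20160902` from the first row of the raw data (the helper name `date_features` is introduced here and is not part of the notebook):

```python
from datetime import datetime

def date_features(yyyymmdd):
    """Return (WoY, month, quarter_month, weekday) for an integer date like 20160902."""
    d = datetime.strptime(str(yyyymmdd), "%Y%m%d")
    return (
        d.isocalendar()[1],  # ISO week of the year
        d.month,
        d.day // 8,          # 0..3: rough quarter of the month
        d.weekday(),         # Monday = 0
    )

features = date_features(20160902)
print(features)  # (35, 9, 0, 4)
```

These values agree with the `WoY`, `month`, `quarter_month` and `weekday` columns of the first row of `preprocessed_df` shown above.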


<a id='models'></a>
## 2. Models

&nbsp;


```python
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
```

In addition to importing the required models, we define the quality metric `rmse`, based on sklearn's `mean_squared_error`, and a helper function for loading the preprocessed tabular data, `load_preprocessed_dfs()`.


In [9]:
!pip install catboost

Collecting catboost
  Downloading https://files.pythonhosted.org/packages/20/37/bc4e0ddc30c07a96482abf1de7ed1ca54e59bba2026a33bca6d2ef286e5b/catboost-0.24.4-cp36-none-manylinux1_x86_64.whl (65.7MB)
Installing collected packages: catboost
Successfully installed catboost-0.24.4


In [59]:
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return round(np.sqrt(mean_squared_error(y_true, y_pred)), 5)

def load_preprocessed_dfs(drop_full_visitor_id=True):
    """
    Loads files `TRAIN`, `TEST` and `Y` generated by preprocess() into variables.
    Only the first 100000 rows of each file are used.
    """
    # fullVisitorId must be read as a string, as in load_df()
    X_train = pd.read_csv(TRAIN, dtype={'fullVisitorId': 'str'})[:100000]
    X_test = pd.read_csv(TEST, dtype={'fullVisitorId': 'str'})[:100000]
    y_train = pd.read_csv(Y)[:100000]
    y_train = y_train.T.squeeze()
    
    # This is the only `object` column, we drop it for train and evaluation
    if drop_full_visitor_id: 
        X_train.drop(['fullVisitorId'], axis = 1, inplace = True)
        X_test.drop(['fullVisitorId'], axis = 1, inplace = True)
    return X_train, np.array(y_train), X_test
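
Because the target is `log1p` of the revenue, the RMSE here measures error in log space. The sketch below uses a plain NumPy equivalent of the `rmse` helper (`rmse_np` is a name introduced here, and the revenue values are invented) to show the metric on a toy example:

```python
import numpy as np

def rmse_np(y_true, y_pred):
    """Root mean squared error, rounded to 5 decimals like rmse() above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return round(float(np.sqrt(np.mean((y_true - y_pred) ** 2))), 5)

# Invented raw revenues: most sessions bring no revenue at all
revenue = np.array([0.0, 0.0, 0.0, 1_000_000.0])
y_true = np.log1p(revenue)  # same transform as in preprocess()

# A constant all-zeros prediction as a naive baseline
baseline = np.zeros_like(y_true)
print(rmse_np(y_true, baseline))

# expm1 inverts log1p and recovers the raw revenue
recovered = np.expm1(y_true)
```

A single large purchase dominates the error even after the log transform, which is worth keeping in mind when reading the RMSE values below.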

In [60]:
X, y, X_test = load_preprocessed_dfs()
print(X.shape, y.shape, X_test.shape)

(100000, 31) (100000,) (100000, 31)


In [61]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=1)

print(f"Train shape: {X_train.shape}")
print(f"Validation shape: {X_val.shape}")
print(f"Test (submit) shape: {X_test.shape}")

Train shape: (80000, 31)
Validation shape: (20000, 31)
Test (submit) shape: (100000, 31)


<a id='lightgbm'></a>
### 2.1. LightGBM


In [62]:
def run_lgb(X_train, y_train, X_val, y_val, X_test):
    
    params = {
        "objective" : "regression",
        "metric" : "rmse",
        "num_leaves" : 40,
        "learning_rate" : 0.005,
        "bagging_fraction" : 0.6,
        "feature_fraction" : 0.6,
        "bagging_frequency" : 6,  # NOTE: LightGBM documents this parameter as "bagging_freq"; key kept as in the recorded run
        "bagging_seed" : 42,
        "verbosity" : -1,
        "seed": 42
    }
    
    lgb_train_data = lgb.Dataset(X_train, label=y_train)
    lgb_val_data = lgb.Dataset(X_val, label=y_val)

    model = lgb.train(params, lgb_train_data, 
                      num_boost_round=5000,
                      valid_sets=[lgb_train_data, lgb_val_data],
                      early_stopping_rounds=100,
                      verbose_eval=500)

    y_pred_train = model.predict(X_train, num_iteration=model.best_iteration)
    y_pred_val = model.predict(X_val, num_iteration=model.best_iteration)
    y_pred_submit = model.predict(X_test, num_iteration=model.best_iteration)

    print(f"LGBM: RMSE val: {rmse(y_val, y_pred_val)}  - RMSE train: {rmse(y_train, y_pred_train)}")
    return y_pred_submit, model

In [63]:
%%time
# Train LGBM and generate predictions
lgb_preds, lgb_model = run_lgb(X_train, y_train, X_val, y_val, X_test)

Training until validation scores don't improve for 100 rounds.
[500]	training's rmse: 1.594	valid_1's rmse: 1.73742
[1000]	training's rmse: 1.4735	valid_1's rmse: 1.71885
Early stopping, best iteration is:
[1114]	training's rmse: 1.4549	valid_1's rmse: 1.71847
LGBM: RMSE val: 1.71847  - RMSE train: 1.4549
CPU times: user 33.8 s, sys: 282 ms, total: 34.1 s
Wall time: 17.4 s
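
The `early_stopping_rounds=100` argument stops training once the validation metric has not improved for 100 consecutive rounds, and the best iteration is then used for prediction. A library-independent sketch of that rule, assuming we already have a list of per-round validation scores (the function name and the toy scores are hypothetical, and this is not LightGBM's actual implementation):

```python
def early_stopping_best_round(val_scores, patience):
    """Return (best_round, best_score) for a lower-is-better metric.

    Scanning stops as soon as `patience` consecutive rounds
    fail to improve on the best score seen so far.
    """
    best_round, best_score = 0, float("inf")
    for i, score in enumerate(val_scores):
        if score < best_score:
            best_round, best_score = i, score
        elif i - best_round >= patience:
            break  # no improvement for `patience` rounds
    return best_round, best_score

# Toy validation curve: improves, plateaus, then improves once more
toy_scores = [2.10, 1.90, 1.75, 1.72, 1.73, 1.74, 1.71, 1.80]
print(early_stopping_best_round(toy_scores, patience=2))  # (3, 1.72)
print(early_stopping_best_round(toy_scores, patience=3))  # (6, 1.71)
```

With a patience of 2 the scan stops before the later improvement at round 6 is reached, which is the trade-off behind choosing a fairly generous value such as 100.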


In [64]:
print("LightGBM features importance...")
gain = lgb_model.feature_importance('gain')
featureimp = pd.DataFrame({'feature': lgb_model.feature_name(), 
                   'split': lgb_model.feature_importance('split'), 
                   'gain': 100 * gain / gain.sum()}).sort_values('gain', ascending=False)
print(featureimp[:10])

LightGBM features importance...
                    feature  split       gain
14         totals.pageviews   5417  31.626183
13              totals.hits   5804  17.792763
1               visitNumber   3577   9.903403
26                      WoY   4102   5.061798
30               visit_hour   3830   4.315512
10         geoNetwork.metro   2042   3.487535
6      geoNetwork.continent    402   3.424348
16     trafficSource.source   1559   2.726150
11          geoNetwork.city   2449   2.624966
7   geoNetwork.subContinent    557   2.311796


Further reading on `LightGBM`:
* [What is LightGBM, How to implement it? How to fine tune the parameters?](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc)
* [Documentation](https://lightgbm.readthedocs.io/en/latest/).
   - [Python Quick Start](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html)
   - [Python API](https://lightgbm.readthedocs.io/en/latest/Python-API.html)
   - [Parameters section](https://lightgbm.readthedocs.io/en/latest/Parameters.html)

<a id='xgboost'></a>
### 2.2. XGBoost

&nbsp;

`XGBoost` stands for `Extreme Gradient Boosting`, which is a kind of `sklearn.ensemble` `GradientBoostingRegressor` on steroids.


> "XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable."


In [65]:
def run_xgb(X_train, y_train, X_val, y_val, X_test):
    params = {'objective': 'reg:squarederror',  # current name of the deprecated 'reg:linear'
              'eval_metric': 'rmse',
              'eta': 0.001,
              'max_depth': 10,
              'subsample': 0.6,
              'colsample_bytree': 0.6,
              'alpha':0.001,
              'random_state': 42,
              'verbosity': 0}  # replaces the deprecated 'silent': True

    xgb_train_data = xgb.DMatrix(X_train, y_train)
    xgb_val_data = xgb.DMatrix(X_val, y_val)
    xgb_submit_data = xgb.DMatrix(X_test)

    model = xgb.train(params, xgb_train_data, 
                      num_boost_round=2000, 
                      evals= [(xgb_train_data, 'train'), (xgb_val_data, 'valid')],
                      early_stopping_rounds=100, 
                      verbose_eval=500
                     )

    y_pred_train = model.predict(xgb_train_data, ntree_limit=model.best_ntree_limit)
    y_pred_val = model.predict(xgb_val_data, ntree_limit=model.best_ntree_limit)
    y_pred_submit = model.predict(xgb_submit_data, ntree_limit=model.best_ntree_limit)

    print(f"XGB : RMSE val: {rmse(y_val, y_pred_val)}  - RMSE train: {rmse(y_train, y_pred_train)}")
    return y_pred_submit, model

In [66]:
%%time
xgb_preds, xgb_model = run_xgb(X_train, y_train, X_val, y_val, X_test)

[0]	train-rmse:2.11289	valid-rmse:2.11689
Multiple eval metrics have been passed: 'valid-rmse' will be used for early stopping.

Will train until valid-rmse hasn't improved in 100 rounds.
[500]	train-rmse:1.80742	valid-rmse:1.91666
[1000]	train-rmse:1.60255	valid-rmse:1.81377
[1500]	train-rmse:1.45802	valid-rmse:1.76286
[1999]	train-rmse:1.35183	valid-rmse:1.73787
XGB : RMSE val: 1.7379  - RMSE train: 1.35184
CPU times: user 13min 43s, sys: 987 ms, total: 13min 44s
Wall time: 6min 58s


Further reading on `XGBoost`:
* [Documentation](https://xgboost.readthedocs.io/en/latest/), in particular:
 - [Python Intro](https://xgboost.readthedocs.io/en/latest/python/python_intro.html)
* [A Gentle Introduction to XGBoost for Applied Machine Learning](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)

<a id='catboost'></a>
### 2.3. Catboost

&nbsp;

> `CatBoost` is a state-of-the-art open-source gradient boosting on decision trees library.



`lgb` and `xgb` also provide a "Scikit-learn API": see [here](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api) for `LightGBM` and [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn) for `XGBoost`.

In [67]:
def run_catboost(X_train, y_train, X_val, y_val, X_test):
    model = CatBoostRegressor(iterations=1000,
                             learning_rate=0.05,
                             depth=10,
                             eval_metric='RMSE',
                             random_seed = 42,
                             bagging_temperature = 0.2,
                             od_type='Iter',
                             metric_period = 50,
                             od_wait=20)
    model.fit(X_train, y_train,
              eval_set=(X_val, y_val),
              use_best_model=True,
              verbose=True)
    
    y_pred_train = model.predict(X_train)
    y_pred_val = model.predict(X_val)
    y_pred_submit = model.predict(X_test)

    print(f"CatB: RMSE val: {rmse(y_val, y_pred_val)}  - RMSE train: {rmse(y_train, y_pred_train)}")
    return y_pred_submit, model

In [68]:
%%time
# Train Catboost and generate predictions
cat_preds, cat_model = run_catboost(X_train, y_train, X_val, y_val,  X_test)



0:	learn: 2.0668713	test: 2.0732701	best: 2.0732701 (0)	total: 236ms	remaining: 3m 55s
50:	learn: 1.6157053	test: 1.7354951	best: 1.7354951 (50)	total: 4.4s	remaining: 1m 21s
100:	learn: 1.5239498	test: 1.7238999	best: 1.7238999 (100)	total: 8.57s	remaining: 1m 16s
150:	learn: 1.4639515	test: 1.7147455	best: 1.7147455 (150)	total: 12.8s	remaining: 1m 11s
200:	learn: 1.4045678	test: 1.7074302	best: 1.7072383 (199)	total: 17s	remaining: 1m 7s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 1.705449882
bestIteration = 214

Shrink model to first 215 iterations.
CatB: RMSE val: 1.70545  - RMSE train: 1.38615
CPU times: user 37 s, sys: 1.86 s, total: 38.8 s
Wall time: 20.3 s


Further reading on `Catboost`:
* [Documentation](https://tech.yandex.com/catboost/doc/dg/concepts/about-docpage/):
  - [Python Quickstart](https://tech.yandex.com/catboost/doc/dg/concepts/python-quickstart-docpage/)
* [CatBoost: A machine learning library to handle categorical (CAT) data automatically](https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/)

<a id='ensemble'></a>
## 3. Ensemble and submissions

In this section we build a trivial linear ensemble using the weights 70/30/0 (`LightGBM`/`Catboost`/`XGBoost`), along with a 70/25/5 variant.

In [75]:
ensemble_preds_70_30_00 = 0.7 * lgb_preds + 0.3 * cat_preds + 0.0 * xgb_preds 
ensemble_preds_70_25_05 = 0.7 * lgb_preds + 0.25 * cat_preds + 0.05 * xgb_preds 
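
A linear ensemble is just a weighted average of the per-model predictions, with weights that sum to 1. A minimal sketch with invented prediction arrays (the `blend` helper and the `toy_*` arrays are hypothetical and only illustrate the arithmetic):

```python
import numpy as np

def blend(preds, weights):
    """Weighted average of equally shaped prediction arrays."""
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights should sum to 1")
    stacked = np.stack([np.asarray(p, dtype=float) for p in preds])
    # Contract the model axis: sum_k weights[k] * preds[k]
    return np.tensordot(weights, stacked, axes=1)

# Invented predictions for three rows from three models
toy_lgb = np.array([0.0, 1.0, 2.0])
toy_cat = np.array([0.0, 2.0, 4.0])
toy_xgb = np.array([1.0, 1.0, 1.0])

blended = blend([toy_lgb, toy_cat, toy_xgb], [0.7, 0.25, 0.05])
print(blended)
```

Checking that the weights sum to 1 keeps the blend on the same scale as the individual predictions.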

In [71]:
def submit(predictions, filename='submit.csv'):
    """
    Takes a 1d-array of predictions, one per row of the test set returned by
    load_preprocessed_dfs() ((804684,) for the full test set), and generates a submission file named filename
    """
    _, _, X_submit = load_preprocessed_dfs(drop_full_visitor_id=False)
    submission = X_submit[['fullVisitorId']].copy()
    
    submission.loc[:, 'PredictedLogRevenue'] = predictions
    grouped_test = submission[['fullVisitorId', 'PredictedLogRevenue']].groupby('fullVisitorId').sum().reset_index()
    grouped_test.to_csv(filename,index=False)

submit(lgb_preds, "submit-lgb.csv")
# Note: XGB can be disabled (here and in the ensembles above) to make the notebook run faster
submit(xgb_preds, "submit-xgb.csv")
submit(cat_preds, "submit-cat.csv")
submit(ensemble_preds_70_30_00, "submit-ensemble-70_30_00.csv")
submit(ensemble_preds_70_25_05, "submit-ensemble-70_25_05.csv")

ensemble_preds_70_30_00_pos = np.where(ensemble_preds_70_30_00 < 0, 0, ensemble_preds_70_30_00)
submit(ensemble_preds_70_30_00_pos, "submit-ensemble-70_30_00-positive.csv")

ensemble_preds_70_25_05_pos = np.where(ensemble_preds_70_25_05 < 0, 0, ensemble_preds_70_25_05)
submit(ensemble_preds_70_25_05_pos, "submit-ensemble-70_25_05-positive.csv")
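
The `submit` helper sums the per-session predictions for each visitor, and the `np.where` calls above clip negative predictions to zero beforehand, since a `log1p` revenue cannot be negative. A toy sketch of both steps (the IDs and values are invented):

```python
import numpy as np
import pandas as pd

# Invented per-session predictions; IDs are kept as strings
sessions = pd.DataFrame({
    "fullVisitorId": ["0001", "0002", "0001", "0003"],
    "PredictedLogRevenue": [0.5, -0.2, 1.5, 0.0],
})

# Clip negative predictions to zero
sessions["PredictedLogRevenue"] = np.where(
    sessions["PredictedLogRevenue"] < 0, 0, sessions["PredictedLogRevenue"]
)

# One row per visitor: sum of that visitor's session predictions
per_visitor = (
    sessions.groupby("fullVisitorId")["PredictedLogRevenue"].sum().reset_index()
)
print(per_visitor)
```

Keeping `fullVisitorId` as a string preserves leading zeros, so the grouped IDs still match the ones in the raw files.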

<a id='conclusions'></a>
## 4. Conclusions

On the FULL dataset, the following results were obtained:

| Model        | Rounds | Train RMSE           | Validation RMSE | Train time | Submit Score|
| ------------- |------:|-----:|-----:| -----:| -----:|
| `LightGBM`      | 5000| 1.505 | <span style='color:green'>1.60372 </span> | 7min 48s | 1.6717 |
| `XGBoost`      | 2000| 1.568 | 1.64924 | <span style='color:red'>54min 54s </span> | 1.6946|
| `Catboost`      | 1000| 1.52184 | 1.61231  | <span style='color:green'>2min 24s</span> | 1.6722|
| `Ensemble`      | -- | --| -- | -- | 1.6677|

&nbsp;

`LightGBM` achieved the best train, validation and public scores among the individual models, while `Catboost` had the shortest training time and a very competitive result. `XGBoost`, on the other hand, took far longer to train (close to an hour rather than minutes) and did not show particularly good results.
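
To put the training times from the table in proportion, the small calculation below converts them to seconds and compares them with the fastest model (the `to_seconds` helper is introduced here for illustration only):

```python
import re

def to_seconds(text):
    """Convert a string like '54min 54s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)min (\d+)s", text).groups()
    return int(minutes) * 60 + int(seconds)

# Training times copied from the results table
train_time = {
    "LightGBM": to_seconds("7min 48s"),
    "XGBoost": to_seconds("54min 54s"),
    "Catboost": to_seconds("2min 24s"),
}

fastest = min(train_time.values())
for name, secs in train_time.items():
    print(f"{name}: {secs} s, {secs / fastest:.1f}x the fastest")
```

On these numbers `XGBoost` took roughly 23 times longer than `Catboost` and about 7 times longer than `LightGBM`.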



<a id='references'></a>
## 5. References

#### Kernels
* [LGBM (RF) starter [LB: 1.70]](https://www.kaggle.com/fabiendaniel/lgbm-rf-starter-lb-1-70)
* [LightGBM + XGBoost + Catboost](https://www.kaggle.com/samratp/lightgbm-xgboost-catboost) 

#### Articles
* [What is LightGBM, How to implement it? How to fine tune the parameters?](https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc)
* [A Gentle Introduction to XGBoost for Applied Machine Learning](https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/)
* [CatBoost: A machine learning library to handle categorical (CAT) data automatically](https://www.analyticsvidhya.com/blog/2017/08/catboost-automated-categorical-data/)
* [CatBoost vs. Light GBM vs. XGBoost](https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db?gi=4e06b8e37886)
* [Machine Learning Challenge Winning Solutions](https://github.com/Microsoft/LightGBM/blob/master/examples/README.md#machine-learning-challenge-winning-solutions) - a list of challenges won by some version of LightGBM.

#### Documentation
* [LightGBM Documentation: Python Quick Start](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html)
* [LightGBM Documentation: Python API](https://lightgbm.readthedocs.io/en/latest/Python-API.html)
* [LightGBM Documentation: Parameters section](https://lightgbm.readthedocs.io/en/latest/Parameters.html)
* [XGBoost Documentation: Python Intro](https://xgboost.readthedocs.io/en/latest/python/python_intro.html)
* [Catboost Documentation: Python Quickstart](https://tech.yandex.com/catboost/doc/dg/concepts/python-quickstart-docpage/)
* [LightGBM Documentation: Scikit-learn API](https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api)
* [XGBoost Documentation: Scikit-learn API](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn)


In [None]:
# Delete the files created by catboost. 
!rm -rf ./catboost_info