TODO:
- Try click rate encoding of device_model and C14.
- Try without interactions of device_id and site/app_id.
    Encode them separately.
- With or without click rates of site/app_id.

Columns to add:
- site/app_category cols (Done. Easy.)
- high-cardinality categoricals. site/app/device_id, C14, and device_model. No problem adding site/app_id (does this scale for a larger data set?).
- hour
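
For reference, the click rate encoding mentioned in the TODO can be sketched in plain pandas. This is a minimal illustration, not the `ClickRateEncoder` from `models.base`: the function names `fit_click_rates` and `apply_click_rates` and the global-rate fallback for unseen keys are assumptions made for the example.

```python
import pandas as pd

def fit_click_rates(df, key_cols, target="click"):
    """Compute the mean click rate per key combination, plus the global rate."""
    rates = (
        df.groupby(key_cols)[target]
        .mean()
        .rename("click_rate")
        .reset_index()
    )
    return rates, df[target].mean()

def apply_click_rates(df, rates, global_rate, key_cols):
    """Look up the click rate for each row; unseen keys get the global rate (assumed fallback)."""
    merged = df.merge(rates, on=key_cols, how="left")
    return merged["click_rate"].fillna(global_rate)

# Toy data: rates are learned on the training rows only.
train = pd.DataFrame({
    "site_id": ["s1", "s1", "s2", "s2"],
    "device_id": ["d1", "d1", "d1", "d2"],
    "click": [1, 1, 0, 0],
})
keys = ["site_id", "device_id"]
rates, global_rate = fit_click_rates(train, keys)

# One known key pair and one unseen key pair.
new_rows = pd.DataFrame({"site_id": ["s1", "s9"], "device_id": ["d1", "d9"]})
encoded = apply_click_rates(new_rows, rates, global_rate, keys)
print(encoded.tolist())
```

Encoding site_id and device_id separately, as the TODO suggests, would amount to calling the same functions once per column with a single-element `key_cols`.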

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from models.base import ClickRateEncoder, tune_logistic_regression_pipeline

## Stats

Question: which of ['site_id', 'app_id', 'device_id', 'device_model', 'C14'] to include, and whether to include their click rates.

- site/app_id click rate, w/o device_id, device_model, C14:
        Tuning time:  324.64227628707886 for 8 parameters
        Best C:  {'logistic_regression__C': 0.00026826957952797245}
        {1e-05: -0.4487456132140043, 1.9306977288832496e-05: -0.4402482522196934, 3.727593720314938e-05: -0.4332713236725724, 7.196856730011514e-05: -0.42758329625340535, 0.00013894954943731373: -0.4236208752579353, 0.00026826957952797245: -0.4219944456061655, 0.0005179474679231213: -0.4229531107039379, 0.001: -0.4263861627269825}
        Test score:  -0.4176864114518759
        
- no click rate, w/o device_id, device_model, C14:  
        Tuning time:  1298.9161353111267
        Best C:  {'logistic_regression__C': 0.6309573444801934}
        {0.1: -0.4058935483163846, 0.251188643150958: -0.40519699039169943, 0.6309573444801934: -0.40499613789519184, 1.584893192461114: -0.4052634663570519, 3.981071705534973: -0.40601006659652167, 10.0: -0.4072469227171066}
        Test score:  -0.4043248005586979
- site/app_id click rate, w/o device_model, C14:
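
The two completed runs above can be compared directly. A small sketch, assuming the values are mean cross-validation scores where higher is better (which matches the reported best C in both runs); the variable names are illustrative only:

```python
# Scores copied from the two completed runs above.
scores_with_click_rate = {
    1e-05: -0.4487456132140043,
    1.9306977288832496e-05: -0.4402482522196934,
    3.727593720314938e-05: -0.4332713236725724,
    7.196856730011514e-05: -0.42758329625340535,
    0.00013894954943731373: -0.4236208752579353,
    0.00026826957952797245: -0.4219944456061655,
    0.0005179474679231213: -0.4229531107039379,
    0.001: -0.4263861627269825,
}
scores_no_click_rate = {
    0.1: -0.4058935483163846,
    0.251188643150958: -0.40519699039169943,
    0.6309573444801934: -0.40499613789519184,
    1.584893192461114: -0.4052634663570519,
    3.981071705534973: -0.40601006659652167,
    10.0: -0.4072469227171066,
}

def best_entry(scores):
    """Return (C, score) for the highest-scoring C."""
    best_c = max(scores, key=scores.get)
    return best_c, scores[best_c]

best_with = best_entry(scores_with_click_rate)
best_without = best_entry(scores_no_click_rate)

# Positive gap: the run without click rates scored higher in cross-validation.
gap = best_without[1] - best_with[1]
print("with click rate:   ", best_with)
print("without click rate:", best_without)
print("gap:", gap)
```

The two runs searched different ranges of C, so the comparison is between each run's best setting rather than between equal values of C.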

In [2]:
categorical_features = ['C1',
                        'banner_pos',
                        'app_id',
                        'site_id',
                        'app_category',
                        'site_category',
                        #'device_id',
                        #'device_model',
                        'device_type',
                        'device_conn_type',
                        #'C14',
                        'C15',
                        'C16',
                        'C17',
                        'C18',
                        'C19',
                        'C20',
                        'C21']

# Columns passed to each click-rate encoder: the target plus the grouping keys.
cr_site_cols = ['click', 'site_id', 'device_id']
cr_app_cols = ['click', 'app_id', 'device_id']

def get_model_three_plus_pipeline():
    # Click-rate encoders for the site_id x device_id and app_id x device_id
    # interactions. They are only used when the matching ColumnTransformer
    # entries below are uncommented.
    cr_site_encoder = ClickRateEncoder(['site_id', 'device_id'], 'click_rate_by_site_id')
    cr_app_encoder = ClickRateEncoder(['app_id', 'device_id'], 'click_rate_by_app_id')
    oh_encoder = OneHotEncoder(handle_unknown='ignore')
    preprocessor = ColumnTransformer([
        ('one_hot_encoding', oh_encoder, categorical_features),
        #('click_rate_encoding_site', cr_site_encoder, cr_site_cols),
        #('click_rate_encoding_app', cr_app_encoder, cr_app_cols)
    ])

    lg = LogisticRegression(solver='liblinear')
    pipeline = Pipeline([
        ('preprocessing', preprocessor),
        ('logistic_regression', lg)
    ])
    return pipeline

def tune_model_three_plus(df, params):
    
    pipeline = get_model_three_plus_pipeline()

    return tune_logistic_regression_pipeline(df, pipeline, params)
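
`tune_logistic_regression_pipeline` lives in `models.base` and is not shown in this notebook. The sketch below is one plausible shape for such a helper, not its actual implementation: it assumes a grid search over `logistic_regression__C` scored by negative log loss, a held-out test split, and the four return values unpacked in cell 4. The name `tune_pipeline_sketch`, the `cv`, `test_size`, and `random_state` arguments, and the toy data are all hypothetical. It also drops the `click` column from the features, so it fits the one-hot-only pipeline rather than the click-rate encoders, which take `click` as an input column.

```python
import time

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def tune_pipeline_sketch(df, pipeline, c_values, target="click",
                         cv=3, test_size=0.2, random_state=0):
    """Grid-search C on a training split and score the best model on a test split."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y)

    grid = GridSearchCV(
        pipeline,
        {"logistic_regression__C": list(c_values)},
        scoring="neg_log_loss",  # assumed metric; higher (closer to 0) is better
        cv=cv,
    )
    start = time.time()
    grid.fit(X_train, y_train)
    print("Tuning time: ", time.time() - start)

    # With `scoring` set, grid.score applies the same scorer to the test split.
    test_score = grid.score(X_test, y_test)
    return (grid.best_params_, grid.cv_results_["params"],
            grid.cv_results_["mean_test_score"], test_score)

# Toy usage with synthetic data (hypothetical; not the notebook's data set).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "banner_pos": rng.integers(0, 3, size=200),
    "device_type": rng.integers(0, 4, size=200),
    "click": rng.integers(0, 2, size=200),
})
toy_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("one_hot_encoding", OneHotEncoder(handle_unknown="ignore"),
         ["banner_pos", "device_type"]),
    ])),
    ("logistic_regression", LogisticRegression(solver="liblinear")),
])
toy_best_C, toy_params, toy_scores, toy_test_score = tune_pipeline_sketch(
    toy, toy_pipeline, np.logspace(-1, 1, num=3))
print("Best C: ", toy_best_C)
```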

In [3]:
df_train = pd.read_csv('data/train_small.csv')
params = np.logspace(-1, 1, num=6)
params

array([ 0.1       ,  0.25118864,  0.63095734,  1.58489319,  3.98107171,
       10.        ])

In [4]:
best_C, params_dict_ls, scores, test_score = tune_model_three_plus(df_train, params)
print('Best C: ', best_C)
print(dict(zip(params, scores)))
print('Test score: ', test_score)

Tuning time:  1298.9161353111267
Best C:  {'logistic_regression__C': 0.6309573444801934}
{0.1: -0.4058935483163846, 0.251188643150958: -0.40519699039169943, 0.6309573444801934: -0.40499613789519184, 1.584893192461114: -0.4052634663570519, 3.981071705534973: -0.40601006659652167, 10.0: -0.4072469227171066}
Test score:  -0.4043248005586979
