Goal: use high-cardinality features (device_model, C14) and medium-cardinality features (C17, C20).
    
TODO:
- See whether one-hot encoding them just works. -> Uses too much space.
    Time is okay (31.946413278579712 on small).
    Logistic Regression has been dealing with site/app_id just fine.
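
To see where the space goes, one option is to compare the sparse storage that `OneHotEncoder` returns by default with the size of an equivalent dense array. This is a minimal sketch on synthetic data, not the real training set; the row count, the number of categories, and the `df_demo` name are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for a high-cardinality column (not the real data).
rng = np.random.default_rng(0)
n_rows, n_categories = 10_000, 2_000
df_demo = pd.DataFrame(
    {"device_model": rng.integers(0, n_categories, size=n_rows).astype(str)}
)

encoder = OneHotEncoder(handle_unknown="ignore")  # returns a sparse matrix by default
X_sparse = encoder.fit_transform(df_demo)

# CSR storage: non-zero values + their column indices + row pointers.
sparse_bytes = (
    X_sparse.data.nbytes + X_sparse.indices.nbytes + X_sparse.indptr.nbytes
)
# A dense float64 array needs one 8-byte cell per (row, category) pair.
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * 8

print(X_sparse.shape, sparse_bytes, dense_bytes)
```

If the sparse size is acceptable, the memory pressure may come from a later step (for example an accidental `.toarray()`), which is worth ruling out before dropping the columns.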
    

Columns:
- id : Don't use.
- click : Using.
- hour : TODO.
- C1 : Using.
- banner_pos : Using.
- site_id : Using. (categorical + click rate)
- site_domain : Not Using. Interchangeable with site_id.
- site_category : Using. (categorical + click rate)
- app_id : Using. (categorical + click rate)
- app_domain : Not Using. Interchangeable with app_id.
- app_category : Using. (categorical + click rate)
- device_id : Using (click rate)
- device_ip : Not Using. Interchangeable with device_id.
- device_model : TODO.
- device_type : Using.
- device_conn_type : Using.
- C14 : TODO
- C15 : Using.
- C16 : Using.
- C17 : TODO
- C18 : Using.
- C19 : Using.
- C20 : TODO
- C21 : Using.
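
The "click rate" entries above refer to replacing a category with the fraction of clicks observed for it. `ClickRateEncoder` lives in `models.base` and is not shown here, so the following is only a sketch of the general idea; the `click_rate_encode` helper and the fallback to the global rate for unseen categories are assumptions, not the project's implementation.

```python
import pandas as pd


def click_rate_encode(train, column, target="click"):
    """Hypothetical helper: map each category to its mean click rate in `train`."""
    global_rate = train[target].mean()
    rates = train.groupby(column)[target].mean()

    def transform(frame):
        # Categories never seen in `train` fall back to the overall click rate.
        return frame[column].map(rates).fillna(global_rate)

    return transform


toy = pd.DataFrame({
    "site_id": ["a", "a", "b", "b", "b", "c"],
    "click":   [1,   0,   0,   0,   1,   0],
})
encode_site = click_rate_encode(toy, "site_id")
encoded = encode_site(pd.DataFrame({"site_id": ["a", "b", "z"]}))
print(encoded.tolist())
```

The rates should be computed on training folds only; computing them on rows that are later scored leaks the target into the feature.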

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from models.base import tune_logistic_regression_pipeline, ClickRateEncoder

from tools.kaggle_tools import predict_on_test

## Stats

Question: which of ['site_id', 'app_id', 'device_id', 'device_model', 'C14'] to include.

Results:
- Models with site/app_id do well
- Models without C14 and device_model 
- Larger C (= weaker regularization) for models with more columns?

- Without device_model and C14 (liblinear):
        Tuning time:  1628.0956852436066 for 4 parameters.
        {'logistic_regression__C': 1.0},
        {0.01: -0.41000354517306076,
        0.1: -0.4057393021637744,
        1.0: -0.404804207471588,
        10.0: -0.40949817838128577},
        -0.40384372144399805
        
        Private 0.3978821 (1015th); Public 0.3999804
        Tuning time:  1126.0188071727753 for 4 parameters (desktop).
        {'logistic_regression__C': 0.6812920690579611},
         {0.03162277660168379: -0.40740833606066024,
          0.14677992676220694: -0.40537205603120113,
          0.6812920690579611: -0.404751139389108,
          3.1622776601683795: -0.4057957629490777},
         -0.40381649825841553

- Without device_model, C14, device_id (liblinear):
        Tuning time:  656.0594809055328 for 4 params (desktop)
        {'logistic_regression__C': 0.6812920690579611},
        {0.03162277660168379: -0.4075474720449697,
        0.14677992676220694: -0.40554311638919466,
        0.6812920690579611: -0.4050002236865907,
        3.1622776601683795: -0.40578508440354477},
        -0.40432593157007696
- Without device_model, C14, device_id, site_id, app_id (liblinear):
        Tuning time:  403.57573318481445 for 4 params (desktop)
        {'logistic_regression__C': 0.03162277660168379},
        {0.03162277660168379: -0.42208023606423384,
        0.14677992676220694: -0.4223748616833095,
        0.6812920690579611: -0.4227947131874169,
        3.1622776601683795: -0.4233896409530987},
        -0.42471286321920937
- Without device_id, site_id, app_id (liblinear):
        Tuning time:  620.0898985862732 for 4 params (desktop)
        {'logistic_regression__C': 0.03162277660168379},
        {0.03162277660168379: -0.4197374721267071,
        0.14677992676220694: -0.42015415864452665,
        0.6812920690579611: -0.42143679419507213,
        3.1622776601683795: -0.424161398647938},
        -0.4211988564711934
        
        Tuning time:  679.7000195980072 for 8 params (desktop)
        {'logistic_regression__C': 0.019306977288832496},
        {0.001: -0.4228964316565532,
        0.0026826957952797246: -0.4212029612290653,
        0.0071968567300115215: -0.42021321233701725,
        0.019306977288832496: -0.4197774854342116,
        0.0517947467923121: -0.4197868807900707,
        0.13894954943731375: -0.4201248109950096,
        0.3727593720314938: -0.4207986482164309,
        1.0: -0.4219660459812828},
        -0.4211098860786709

- Without site_id, app_id (liblinear):
        Tuning time:  805.5255260467529 for 6 params (desktop)
        {'logistic_regression__C': 0.0630957344480193},
        {0.001: -0.42282137497017536,
        0.003981071705534973: -0.4205917176025631,
        0.015848931924611134: -0.4196376654494177,
        0.0630957344480193: -0.41957091292859994,
        0.25118864315095796: -0.42003867241233384,
        1.0: -0.4212132084789877},
        -0.4213360171849131
- Without device_model and C14 (saga, sag): Doesn't converge.
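
Each result above lists the best `C`, the score per candidate `C`, and a final held-out score. `tune_logistic_regression_pipeline` is defined in `models.base` and not shown, so the sketch below is only one plausible way to produce numbers of that shape. It assumes the scores are negative log loss (consistent with their sign), a 3-fold cross-validation, and an 80/20 hold-out split; `tune_sketch` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def tune_sketch(df, pipeline, c_values, target="click"):
    """Hypothetical stand-in for tune_logistic_regression_pipeline."""
    X, y = df.drop(columns=[target]), df[target]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y
    )
    grid = {"logistic_regression__C": list(c_values)}
    search = GridSearchCV(pipeline, grid, scoring="neg_log_loss", cv=3)
    search.fit(X_tr, y_tr)
    # Negate the held-out log loss so it is on the same scale as the CV scores.
    proba = search.predict_proba(X_te)[:, 1]
    test_score = -log_loss(y_te, proba, labels=[0, 1])
    cv = search.cv_results_
    return search.best_params_, cv["params"], cv["mean_test_score"], test_score


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "C1": rng.integers(0, 5, size=600).astype(str),
    "banner_pos": rng.integers(0, 3, size=600).astype(str),
    "click": rng.integers(0, 2, size=600),
})
toy_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("one_hot_encoding", OneHotEncoder(handle_unknown="ignore"),
         ["C1", "banner_pos"]),
    ])),
    ("logistic_regression", LogisticRegression(solver="liblinear")),
])
c_grid = np.logspace(-3, 0, num=4)
best_C, params_list, scores, test_score = tune_sketch(toy, toy_pipeline, c_grid)
print(best_C, dict(zip(c_grid, scores)), test_score)
```

In scikit-learn, `C` is the inverse of the regularization strength, so a smaller best `C` means the model benefits from stronger regularization.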

In [2]:
model_one_plus_cols = [
    'C1',
    'banner_pos',
    # 'site_id',
    'site_category',
    # 'app_id',
    'app_category',
    'device_id',
    'device_model',
    'device_type',
    'device_conn_type',
    'C14',
    'C15',
    'C16',
    'C17',
    'C18',
    'C19',
    'C20',
    'C21',
]

In [3]:
oh_encoder = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer([
    ('one_hot_encoding', oh_encoder, model_one_plus_cols)
])

lg = LogisticRegression(solver='liblinear')
pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('logistic_regression', lg),
])
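
With high-cardinality columns such as `device_model`, the test set will likely contain categories that never appear in training, which is what `handle_unknown='ignore'` is for. The toy frames below are made up for illustration; they show that an unseen category is encoded as an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, for illustration only.
train = pd.DataFrame({"device_model": ["m1", "m2", "m1"]})
test = pd.DataFrame({"device_model": ["m2", "m9"]})  # "m9" never seen in training

enc = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = enc.transform(test).toarray()  # dense only because the toy data is tiny
print(encoded)
```

For such rows the logistic regression falls back to the intercept plus the contributions of the other columns.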

In [4]:
df_train = pd.read_csv('data/train_small.csv')

In [5]:
params = np.logspace(-3, 0, num=6)
params

array([0.001     , 0.00398107, 0.01584893, 0.06309573, 0.25118864,
       1.        ])

In [6]:
best_C, params_dict_ls, scores, test_score = \
    tune_logistic_regression_pipeline(df_train, pipeline, params)

Tuning time:  805.5255260467529


In [7]:
best_C, dict(zip(params, scores)), test_score

({'logistic_regression__C': 0.0630957344480193},
 {0.001: -0.42282137497017536,
  0.003981071705534973: -0.4205917176025631,
  0.015848931924611134: -0.4196376654494177,
  0.0630957344480193: -0.41957091292859994,
  0.25118864315095796: -0.42003867241233384,
  1.0: -0.4212132084789877},
 -0.4213360171849131)

In [8]:
raise  # bare raise, apparently a stop marker so "Run All" halts before the test-set cells

RuntimeError: No active exception to reraise

In [None]:
df_test = pd.read_csv('data/test_tiny.csv', dtype={'id': 'uint64'})
df_test.id.dtype
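
The `id` column is read as `uint64`, presumably because the ids are too large for a signed 64-bit integer. The check below uses a made-up 20-digit id (an assumption, not a value taken from the data) to show why `int64` and `float64` would both be unsafe for values of that size.

```python
import numpy as np

# Hypothetical 20-digit id, for illustration only.
example_id = 10000174058809263569

fits_int64 = example_id <= np.iinfo(np.int64).max
fits_uint64 = example_id <= np.iinfo(np.uint64).max
# float64 has a 53-bit mantissa, so integers this large are rounded.
survives_float64 = int(float(example_id)) == example_id

print(fits_int64, fits_uint64, survives_float64)
```

A rounded or overflowed id would no longer match the ids expected in a submission file, so pinning the dtype at read time is the safer choice.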

In [None]:
clicks = df_train.click
#param = {'logistic_regression__C': 0.021544346900318832}
predict_on_test(df_train, clicks, pipeline, best_C, df_test, fname=None)