# DMIA sport intro: How frequent is this password?
### X - word, y - number, metric - RMSLE
#### Participant: Тарасов Андрей (tarrapid)


Competition metric:

$RMSLE = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\log(1+y_i) - \log(1+y^{pr}_i))^2}$

where $y_i$ is the true count and $y^{pr}_i$ is the predicted count for password $i$.
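
A minimal NumPy sketch of the metric, using the standard definition with the mean over the N samples. The function name `rmsle` and the toy values are illustrative and not part of the competition code.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log(1 + x) accurately for small x
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Toy values, not competition data
print(rmsle([1, 2, 3], [1, 2, 3]))  # a perfect prediction gives 0.0
print(rmsle([1, 1], [1, 3]))
```

Because the error is measured on `log(1 + y)`, predicting 3 instead of 1 costs about as much as predicting 7 instead of 3: the metric penalizes relative rather than absolute deviations.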

In [2]:
import pandas as pd 
import numpy as np 
import datetime 

import xgboost as xgb
from sklearn.ensemble import RandomForestRegressor
from catboost import CatBoostRegressor, cv, Pool

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin, BaseEstimator

### Loading the competition data


In [3]:
df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/Xtest.csv')

In [4]:
df_train.head()

| | Password | Times |
|---|---|---|
| 0 | 631XniVx2lS5I | 2 |
| 1 | LEGIT747 | 1 |
| 2 | 742364es | 1 |
| 3 | 3846696477 | 1 |
| 4 | laurahop | 2 |


In [4]:
df_train.shape

(4151496, 2)

The top 10 consists of passwords one would expect to find among the most popular ones.

In [11]:
df_train.sort_values(by='Times', ascending=False).head(10)

| | Password | Times |
|---|---|---|
| 2715397 | 123456 | 55893 |
| 3136279 | qwerty | 13137 |
| 1175081 | 123456789 | 11696 |
| 2363307 | 12345 | 10938 |
| 2988373 | 1234 | 6432 |
| 2307329 | 111111 | 5682 |
| 3348280 | 1234567 | 4796 |
| 1795496 | dragon | 3927 |
| 3336606 | 123123 | 3845 |
| 3798071 | baseball | 3565 |


In [12]:
df_test.head()

| | Id | Password |
|---|---|---|
| 0 | 0 | ThaisCunha |
| 1 | 1 | 697775113 |
| 2 | 2 | 922a16922a |
| 3 | 3 | andy74 |
| 4 | 4 | joemack |


In [13]:
df_test.shape

(1037875, 2)

### Loading external sources
Only sources listed in the competition rules are used.

In [15]:
df_top_password = pd.read_csv('../data/10-million-password-list-top-1000000.txt')
df_top_password.drop(index=282774, inplace=True)

Build a feature from the password's position in the top list.

In [16]:
df_top_password['top_password'] = df_top_password['top_password'].astype(str)
df_top_password['RankPassword_TopRate'] = df_top_password.index

In [17]:
df_top_password.head()

| | top_password | RankPassword_TopRate |
|---|---|---|
| 0 | 123456 | 0 |
| 1 | password | 1 |
| 2 | 12345678 | 2 |
| 3 | qwerty | 3 |
| 4 | 123456789 | 4 |


### Data transformation

__Transformation steps:__

Target: shift by one and take the logarithm, `log(1 + y)`
+ Matches the training loss to the evaluation metric: RMSE on the log target is RMSLE on the original scale
+ Dampens the influence of outliers

Features:
+ Join the password's rank from the top list
+ Passwords absent from the list get the maximum rank + 1
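
The feature steps above can be sketched on a toy example. The frames `top` and `data` below are hypothetical miniature stand-ins for the top list and the competition data. The fill value is taken from the top list itself, as the description says.

```python
import pandas as pd

# Hypothetical miniature top list; the rank is the row position
top = pd.DataFrame({'top_password': ['123456', 'password', 'qwerty']})
top['RankPassword_TopRate'] = top.index

# Hypothetical passwords to featurize; 'laurahop' is not in the top list
data = pd.DataFrame({'Password': ['qwerty', 'laurahop', '123456']})

# Left join keeps every input row; unmatched passwords get a NaN rank
merged = pd.merge(data, top, how='left',
                  left_on='Password', right_on='top_password')

# Unranked passwords get the maximum rank + 1
fill_rank = top['RankPassword_TopRate'].max() + 1
merged['RankPassword_TopRate'] = merged['RankPassword_TopRate'].fillna(fill_rank)

features = merged[['RankPassword_TopRate']]
print(features['RankPassword_TopRate'].tolist())  # [2.0, 3.0, 0.0]
```

The ranks come out as floats because the column briefly holds NaN after the join. Tree-based models are not affected by this.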

In [18]:
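def transform(X, y=None):
    """Return the rank feature frame and, if y is given, the log(1 + y) target."""
    data = X.copy()
    target = None

    if y is not None:
        # Shift by one and take the log so that squared error matches RMSLE
        target = np.log1p(y.copy())

    data['Password'] = data['Password'].astype(str)
    # Left join keeps every input row; unmatched passwords get a NaN rank
    data = pd.merge(data, df_top_password, how='left',
                    left_on=['Password'], right_on=['top_password'])
    # Use the maximum rank of the top list (not of this frame) so that
    # train and test receive the same fill value
    fill_rank = df_top_password['RankPassword_TopRate'].max() + 1
    data['RankPassword_TopRate'] = data['RankPassword_TopRate'].fillna(fill_rank)
    data = data.drop(columns=['Password', 'top_password'])

    return data, target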

### Grid search for the best hyperparameters

In [10]:
param_grid = {
    'n_estimators' : [5, 10],
    'max_depth': [2, 3],
    'learning_rate' : [0.1, 0.3]
}

In [19]:
df_train_tf, y_tr_tf = transform(df_train.drop(columns=['Times']), df_train['Times'])

In [20]:
X_train, X_test, y_train, y_test = train_test_split(df_train_tf, y_tr_tf, test_size=0.3, random_state=0)

In [13]:
xgb_mod = xgb.XGBRegressor()
clf = GridSearchCV(xgb_mod, param_grid, verbose=20, scoring='neg_mean_squared_error', cv = 2)
clf.fit(X_train, y_train)

Fitting 2 folds for each of 8 candidates, totalling 16 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV]  learning_rate=0.1, max_depth=2, n_estimators=5, score=-0.09926896616858434, total=   5.0s
[CV]  learning_rate=0.1, max_depth=2, n_estimators=5, score=-0.09944386059780846, total=   4.8s
[CV]  learning_rate=0.1, max_depth=2, n_estimators=10, score=-0.03765117782135696, total=   9.0s
[CV]  learning_rate=0.1, max_depth=2, n_estimators=10, score=-0.037957857152822606, total=   7.9s
[CV]  learning_rate=0.1, max_depth=3, n_estimators=5, score=-0.09484417091733735, total=   5.2s
[CV]  learning_rate=0.1, max_depth=3, n_estimators=5, score=-0.09490401637936718, total=   5.1s
[CV]  learning_rate=0.1, max_depth=3, n_estimators=10, score=-0.033602807337646624, total=   9.7s
[CV]  learning_rate=0.1, max_depth=3, n_estimators=10, score=-0.03379687213527693, total=   8.7s
[CV]  learning_rate=0.3, max_depth=2, n_estimators=5, score=-0.010937386299054058, total=   4.8s
[CV]  learning_rate=0.3, max_depth=2, n_estimators=5, score=-0.011241397185423602, total=   4.8s
[CV]  learning_rate=0.3, max_depth=2, n_estimators=10, score=-0.0007454710278736513, total=   7.8s
[CV]  learning_rate=0.3, max_depth=2, n_estimators=10, score=-0.0008290945041609998, total=   7.8s
[CV]  learning_rate=0.3, max_depth=3, n_estimators=5, score=-0.008098243002086883, total=   5.1s
[CV]  learning_rate=0.3, max_depth=3, n_estimators=5, score=-0.008227650976046625, total=   5.2s
[CV]  learning_rate=0.3, max_depth=3, n_estimators=10, score=-0.0003258325824669023, total=   8.7s
[CV]  learning_rate=0.3, max_depth=3, n_estimators=10, score=-0.00036408343044114645, total=   8.7s
[Parallel(n_jobs=1)]: Done  16 out of  16 | elapsed:  1.9min finished

GridSearchCV(cv=2, error_score='raise-deprecating',
       estimator=XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'n_estimators': [5, 10], 'max_depth': [2, 3], 'learning_rate': [0.1, 0.3]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='neg_mean_squared_error', verbose=20)
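
The notebook does not print the winning combination. On the fitted object it is available as `clf.best_params_`. As a cross-check, the sketch below averages the two fold scores per candidate, with values copied from the `[CV]` lines above. The variable names are illustrative.

```python
from collections import defaultdict

# (learning_rate, max_depth, n_estimators, score) copied from the [CV] log lines
fold_scores = [
    (0.1, 2, 5, -0.09926896616858434), (0.1, 2, 5, -0.09944386059780846),
    (0.1, 2, 10, -0.03765117782135696), (0.1, 2, 10, -0.037957857152822606),
    (0.1, 3, 5, -0.09484417091733735), (0.1, 3, 5, -0.09490401637936718),
    (0.1, 3, 10, -0.033602807337646624), (0.1, 3, 10, -0.03379687213527693),
    (0.3, 2, 5, -0.010937386299054058), (0.3, 2, 5, -0.011241397185423602),
    (0.3, 2, 10, -0.0007454710278736513), (0.3, 2, 10, -0.0008290945041609998),
    (0.3, 3, 5, -0.008098243002086883), (0.3, 3, 5, -0.008227650976046625),
    (0.3, 3, 10, -0.0003258325824669023), (0.3, 3, 10, -0.00036408343044114645),
]

# Group the fold scores by parameter combination
by_candidate = defaultdict(list)
for lr, depth, n_trees, score in fold_scores:
    by_candidate[(lr, depth, n_trees)].append(score)

mean_scores = {params: sum(s) / len(s) for params, s in by_candidate.items()}

# Scores are negative MSE, so the best candidate has the largest mean
best = max(mean_scores, key=mean_scores.get)
print(best)  # (0.3, 3, 10)
```

Every candidate improves as `n_estimators` and `learning_rate` grow, which suggests that the grid's upper bounds, rather than the model, limit the score.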

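### Training the model with the best parameters

The grid search favours `max_depth=3` and `learning_rate=0.3`; `n_estimators` is raised to 150, beyond the grid's maximum of 10.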

In [21]:
xg = xgb.XGBRegressor(n_estimators=150, max_depth=3, learning_rate=0.3)
xg.fit(X_train, y_train)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.3, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=150,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

__Error on train__

In [22]:
xg_pr_tr = xg.predict(X_train)
mean_squared_error(y_train, xg_pr_tr) ** 0.5

0.001542848204580708

__Error on test (hold-out split of the training data)__

In [24]:
xg_pr = xg.predict(X_test)
mean_squared_error(y_test, xg_pr) ** 0.5

0.0008548330633353787
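
Both error cells compute RMSE on the log-transformed target. The sketch below, with toy numbers rather than competition data, illustrates why this equals RMSLE on the original scale once the predictions are mapped back with `expm1`.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 1.0, 55893.0])   # toy counts on the original scale
z_true = np.log1p(y_true)                      # target after the log transform
z_pred = z_true + np.array([0.1, -0.2, 0.0, 0.3])  # toy model outputs in log space

# RMSE in log space, the quantity mean_squared_error(...) ** 0.5 returns
rmse_log = np.sqrt(np.mean((z_true - z_pred) ** 2))

# RMSLE on the original scale after inverting the transform
y_pred = np.expm1(z_pred)
rmsle = np.sqrt(np.mean((np.log(1 + y_true) - np.log(1 + y_pred)) ** 2))

print(np.isclose(rmse_log, rmsle))  # True
```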

### Submission

In [25]:
df_test_tf, _ = transform(df_test.drop(columns=['Id']))

In [26]:
xg_pr = xg.predict(df_test_tf)

__Inverse transform of the target__

In [27]:
df_test_new = df_test['Id'].to_frame()
df_test_new['Times'] = np.exp(xg_pr) - 1

In [28]:
df_test_new.head()

| | Id | Times |
|---|---|---|
| 0 | 0 | 1.000011 |
| 1 | 1 | 1.000011 |
| 2 | 2 | 1.000011 |
| 3 | 3 | 1.000011 |
| 4 | 4 | 2.999594 |
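
A regressor trained in log space can occasionally output slightly negative values, which map to negative counts after the inverse transform. The sketch below shows one possible safeguard on toy values. Clipping at zero is an assumption added here and is not part of the original submission code.

```python
import numpy as np

# Toy model outputs in log(1 + y) space; the second one is slightly negative
log_pred = np.array([0.6931, -0.01, 1.3862])

times = np.expm1(log_pred)        # inverse of log1p, same as exp(x) - 1
times = np.clip(times, 0, None)   # assumption: counts cannot be negative

print(times)
```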


In [29]:
def save_submission(df):
   
    df.to_csv('../res/sub_' + datetime.datetime.today().strftime("%Y-%m-%d %H:%M:%S") +'.csv', 
                       encoding='utf-8', index=False)

In [30]:
save_submission(df_test_new)