## Спортивный анализ данных. Платформа Kaggle
### Feature Engineering, Feature Selection

### Практическое задание 4.

* __Задание 0:__ выбрать любую модель машнного обучения и зафиксировать любой тип валидации. Обучить базовую модель и зафиксировать базовое качество модели. В каждом следующем задании нужно будет обучить выбранную модель и оценивать ее качество на зафиксированной схеме валидации. После каждого задания, требуется сделать вывод о достигаемом качестве модели, по сравнению с качестом из предыдущего шага.

* __Задание 1:__ признак TransactionDT - это смещение в секундах относительно базовой даты. Базовая дата - 2017-12-01, преобразовать признак TransactionDT в datetime, прибавив к базовой дате исходное значение признака. Из полученного признака выделить год, месяц, день недели, час, день.

* __Задание 2:__ сделать конкатенацию признаков
* card1 + card2;
* card1 + card2 + card_3 + card_5;
* card1 + card2 + card_3 + card_5 + addr1 + addr2

* Рассматривать их как категориальных признаки.

* __Задание 3:__ Сделать FrequencyEncoder для признаков card1 - card6, addr1, addr2.

* __Задание 4:__ Создать признаки на основе отношения: TransactionAmt к вычисленной статистике. Статистика - среднее значение / стандартное отклонение TransactionAmt, сгруппированное по card1 - card6, addr1, addr2, и по признакам, созданным в задании 2.

* __Задание 5:__ Создать признаки на основе отношения: D15 к вычисленной статистике. Статистика - среднее значение / стандартное отклонение D15, сгруппированное по card1 - card6, addr1, addr2, и по признакам, созданным в задании 2.

* __Задание 6:__ выделить дробную часть и целую часть признака TransactionAmt в два отдельных признака. После создать отдельных признак - логарифм от TransactionAmt

* __Задание 7 (опция):__ выполнить предварительную подготовку / очистку признаков P_emaildomain и R_emaildomain (что и как делать - остается на ваше усмотрение) и сделать Frequency Encoding для очищенных признаков.


Ссылка на данные - https://drive.google.com/file/d/1GN6d4_QTYWY-qFdjz_TqxFHIJRi_oTRP/view?usp=sharing

In [1]:
import numpy as np
import pandas as pd

from datetime import datetime, timedelta
from time import mktime

import scipy.stats as st
from scipy.stats import probplot, ks_2samp
from sklearn.metrics import roc_auc_score, roc_curve, auc
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_is_fitted

import xgboost as xgb

import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import warnings
warnings.filterwarnings('ignore')

#### Пути к директориям и файлам

In [3]:
PATH = 'C:/Users/ASER/Desktop/GeekBrains/Kaggle/Lesson_2/data/'
TRAIN_DATASET_PATH = PATH + 'train.csv'
TEST_DATASET_PATH = PATH + 'test.csv'

#### Загрузка данных

In [4]:
train = pd.read_csv(TRAIN_DATASET_PATH)
print('train.shape', train.shape)
train.head(2)

train.shape (180000, 394)


Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,...,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,...,,,,,,,,,,


In [5]:
test = pd.read_csv(TEST_DATASET_PATH)
print('test.shape', test.shape)
test.head(2)

test.shape (100001, 394)


Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,3287000,1,7415038,226.0,W,12473,555.0,150.0,visa,226.0,...,,,,,,,,,,
1,3287001,0,7415054,3072.0,W,15651,417.0,150.0,visa,226.0,...,,,,,,,,,,


In [6]:
TARGET_NAME = 'isFraud'

#### Обзор распределения целевой переменной

In [7]:
train[TARGET_NAME].value_counts()

0    174859
1      5141
Name: isFraud, dtype: int64

#### Классификация признаков

In [8]:
INDEPENDENT_VARIABLE_NAMES = train.columns.to_list()[2:]
INDEPENDENT_VARIABLE_NAMES[:3]

['TransactionDT', 'TransactionAmt', 'ProductCD']

In [9]:
NUMERICAL_FEATURE_NAMES = train[INDEPENDENT_VARIABLE_NAMES].select_dtypes(include=[np.number]).columns.to_list()
CATEGORICAL_FEATURE_NAMES = train[INDEPENDENT_VARIABLE_NAMES].select_dtypes(include=[np.object]).columns.to_list()

print(f'count of numerical features {len(NUMERICAL_FEATURE_NAMES)}')
print(f'count of categorical features {len(CATEGORICAL_FEATURE_NAMES)}')

count of numerical features 378
count of categorical features 14


#### Обработка категориальных признаков

In [10]:
class FeatureGenerator:
    def __init__(self, CATEGORICAL_FEATURE_NAMES):
        self.CATEGORICAL_FEATURE_NAMES = CATEGORICAL_FEATURE_NAMES
        self.NEW_CATEGORICAL_FEATURE_NAMES = []
        self.LGB_CATEGORICAL_FEATURE_NAMES = []
        self.target_encodings = dict()
        self.ordinal_encoding = dict()
        
        
    def fit(self, train):
        df = train.copy()
        for feature in self.CATEGORICAL_FEATURE_NAMES: 
            new_feature = feature + '_'
            lgb_feature = feature + 'lgb'
            self.NEW_CATEGORICAL_FEATURE_NAMES.append(new_feature)
            self.LGB_CATEGORICAL_FEATURE_NAMES.append(lgb_feature)            
            self.target_encodings[feature] = {}
            self.ordinal_encoding[feature] = {}
            for ind, level in enumerate(df[feature].unique()):
                level_value = df.loc[df[feature]==level, TARGET_NAME].mean()
                self.target_encodings[feature][level] = level_value
                self.ordinal_encoding[feature][level] = ind
                
                
    def transform(self, df):
        for feature in self.CATEGORICAL_FEATURE_NAMES: 
            for level in self.target_encodings[feature].keys():
                new_feature = feature + '_'
                lgb_feature = feature + 'lgb'
                df.loc[df[feature] == level, new_feature] = self.target_encodings[feature][level]
                df.loc[df[feature] == level, lgb_feature] = self.ordinal_encoding[feature][level]
                
        df[CATEGORICAL_FEATURE_NAMES] = df[CATEGORICAL_FEATURE_NAMES].astype(str)  
        
        return df

#### Train_test_split

In [11]:
df = train.sort_values(by=['TransactionDT'])
df_train = df.loc[:125999, :]
df_valid = df.loc[126000:152999, :]
df_test = df.loc[153000:, :]

#### Генерация признаков

* a) oбучающий датасет + отложеная выборка

In [12]:
features = FeatureGenerator(CATEGORICAL_FEATURE_NAMES)
features.fit(df_train)
df_train = features.transform(df_train)
df_valid = features.transform(df_valid)
df_test = features.transform(df_test)

* b) тестовый датасет

In [13]:
features = FeatureGenerator(CATEGORICAL_FEATURE_NAMES)
features.fit(train)
train = features.transform(train)
test = features.transform(test)

#### Модель

In [14]:
SELECTED_FEATURE_NAMES = NUMERICAL_FEATURE_NAMES + features.NEW_CATEGORICAL_FEATURE_NAMES

In [15]:
params_xgb = {"booster": "gbtree", 
              "objective": "binary:logistic", 
              "eval_metric": "auc", 
              "learning_rate": 0.2,               
              "reg_lambda": 100, 
              "max_depth": 4, 
              "gamma": 10, 
              "nthread": 6, 
              "seed": 27}

In [16]:
def run_model():
    dtrain = xgb.DMatrix(data=df_train[SELECTED_FEATURE_NAMES], label=df_train[TARGET_NAME])
    dvalid = xgb.DMatrix(data=df_valid[SELECTED_FEATURE_NAMES], label=df_valid[TARGET_NAME])
    dtest = xgb.DMatrix(data=df_test[SELECTED_FEATURE_NAMES], label=df_test[TARGET_NAME])

    model_xgb = xgb.train(params=params_xgb,
                          dtrain=dtrain,
                          num_boost_round=1000,
                          early_stopping_rounds=50,
                          evals=[(dtrain, "train"), (dvalid, "valid"), (dtest, "test")],
                          verbose_eval=50,
                          maximize=True)

    dtest_final = xgb.DMatrix(data=test[SELECTED_FEATURE_NAMES])

    y_pred = model_xgb.predict(dtest_final, ntree_limit=model_xgb.best_ntree_limit)
    score = roc_auc_score(test[TARGET_NAME], y_pred)
    print(f'roc-auc score of prediction: {round(score, 5)}')

#### 0. Базовая модель

In [17]:
run_model()

[0]	train-auc:0.60237	valid-auc:0.61787	test-auc:0.62465
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 50 rounds.
[50]	train-auc:0.91110	valid-auc:0.88956	test-auc:0.87570
[100]	train-auc:0.91960	valid-auc:0.89530	test-auc:0.88086
Stopping. Best iteration:
[86]	train-auc:0.91863	valid-auc:0.89538	test-auc:0.88172

roc-auc score of prediction: 0.86895


#### 1. Признак TransactionDT

In [18]:
def transform_TransactionDT(data):
    t = t0 + pd.to_timedelta(data['TransactionDT'], unit='s')
    data.loc[:, 'year'] = t.dt.year
    data.loc[:, 'month'] = t.dt.month
    data.loc[:, 'weekday'] = t.dt.weekday
    data.loc[:, 'hour'] = t.dt.hour
    data.loc[:, 'day'] = t.dt.day 
    
    return data

In [19]:
t0 = datetime.strptime('2017-12-01', "%Y-%m-%d")

df_list = [df_train, df_valid, df_test, test]
for df in df_list:
    df = transform_TransactionDT(df)
    
TIME_FEATURE_NAMES = ['year', 'month', 'weekday', 'hour', 'day']

In [20]:
SELECTED_FEATURE_NAMES.remove('TransactionDT')
SELECTED_FEATURE_NAMES += TIME_FEATURE_NAMES    
run_model()

# roc-auc score of prediction base: 0.86895

[0]	train-auc:0.60237	valid-auc:0.61787	test-auc:0.62465
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 50 rounds.
[50]	train-auc:0.91000	valid-auc:0.88901	test-auc:0.87229
[100]	train-auc:0.91854	valid-auc:0.89632	test-auc:0.88231
Stopping. Best iteration:
[81]	train-auc:0.91854	valid-auc:0.89632	test-auc:0.88231

roc-auc score of prediction: 0.86449


__Вывод:__ Замена признака TransactionDT на признаки year, month, weekday, hour, day не улучшила качество предсказания на тестовой выборке.

#### 2. Соединение признаков
* card1 + card2;

* card1 + card2 + card_3 + card_5;

* card1 + card2 + card_3 + card_5 + addr1 + addr2

In [21]:
def merge_features(data, concat_lists, NEW_FEATURE_NAMES):
    for i, cols in enumerate(concat_lists):
        if i > 0:        
            cols = [NEW_FEATURE_NAMES[i - 1]] + cols
        data.loc[:, NEW_FEATURE_NAMES[i]] = data.loc[:, cols].\
                    apply(lambda x: ''.join(str(x.values))[1:-1].replace("'", ""), axis=1)
    return data

In [22]:
class FrequencyEncoder: 
    def __init__(self, features_list):
        self.features_list = features_list
        self.NEW_FEATURE_NAMES = []
        self.freq = {}
        
        
    def fit_transform(self, data):
        for feature in self.features_list:
            name_str = feature + '_freq'
            self.NEW_FEATURE_NAMES.append(name_str)
            
            self.freq[feature] = data[feature].value_counts() 
            
            for i, value in enumerate(self.freq[feature].keys()):
                data.loc[data[feature] == value, name_str] = self.freq[feature].values[i]               
    
        return data
    
    def transform(self, data):
        for feature in self.features_list:
            name_str = feature + '_freq'
            for i, value in enumerate(self.freq[feature].keys()):                
                data.loc[data[feature] == value, name_str] = self.freq[feature].values[i] 
                
        return data

In [23]:
concat_lists = [['card1', 'card2'], ['card3', 'card5'], ['addr1', 'addr2']]
NEW_FEATURE_NAMES = ['card1_2', 'card1_2_3_5', 'card1_2_3_5_addr1_2']

for df in df_list:
    df = merge_features(df, concat_lists, NEW_FEATURE_NAMES)

In [24]:
freq_encoder = FrequencyEncoder(NEW_FEATURE_NAMES)
df_train = freq_encoder.fit_transform(df_train)
for df in df_list[1:]:
    df = freq_encoder.transform(df)

In [25]:
SELECTED_FEATURE_NAMES += freq_encoder.NEW_FEATURE_NAMES
run_model()

[0]	train-auc:0.60237	valid-auc:0.61787	test-auc:0.62465
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 50 rounds.
[50]	train-auc:0.91029	valid-auc:0.89099	test-auc:0.87708
[100]	train-auc:0.91803	valid-auc:0.89608	test-auc:0.88284
Stopping. Best iteration:
[77]	train-auc:0.91756	valid-auc:0.89619	test-auc:0.88311

roc-auc score of prediction: 0.86478


__Вывод:__ Соединение признаков card1 + card2; card1 + card2 + card_3 + card_5; card1 + card2 + card_3 + card_5 + addr1 + addr2
особо не добавляет точности предсказания на тестовой выборке.

#### 3. Frequency encoding - признаки card1 - card6, addr1, addr2.

In [26]:
features_list = ['card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'addr1', 'addr2']

freq_encoder = FrequencyEncoder(features_list)
df_train = freq_encoder.fit_transform(df_train)
for df in df_list[1:]:
    df = freq_encoder.transform(df)

In [27]:
SELECTED_FEATURE_NAMES += freq_encoder.NEW_FEATURE_NAMES
run_model()

[0]	train-auc:0.60237	valid-auc:0.61787	test-auc:0.62465
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 50 rounds.
[50]	train-auc:0.91299	valid-auc:0.89223	test-auc:0.87639
[100]	train-auc:0.92198	valid-auc:0.89860	test-auc:0.88215
Stopping. Best iteration:
[81]	train-auc:0.92174	valid-auc:0.89855	test-auc:0.88258

roc-auc score of prediction: 0.86795


__Вывод:__ Frequency encoding признаков card1 - card6, addr1, addr2 улушает качество модели

#### 4. TransactionAmt

Создать признаки на основе отношения: TransactionAmt к вычисленной статистике. Статистика - среднее значение / стандартное отклонение TransactionAmt, сгруппированное по card1 - card6, addr1, addr2, и по признакам, созданным в задании 2.

In [28]:
class CompositFeatures:
    def __init__(self, base_feature, features_list):
        self.base_feature = base_feature
        self.features_list = features_list
        self.NEW_FEATURE_NAMES = []
        self.stat = {}
        
        
    def fit_transform(self, data):
        for feature in self.features_list:
            df = data[[self.base_feature, feature]]            
            self.stat[feature] = df.groupby(feature).mean() / df.groupby(feature).std()
            
            name_str_1 = 'stat_' + feature
            for i in self.stat[feature].index:            
                df.loc[df[feature] == i, name_str_1] = self.stat[feature].loc[i, base_feature]

            name_str_2 =  base_feature + '_' + name_str_1
            self.NEW_FEATURE_NAMES.append(name_str_2)
            
            data[name_str_2] = df[base_feature] / df[name_str_1]

        return data
    
    
    def transform(self, data):
        for feature in self.features_list:
            df = data[[self.base_feature, feature]]
            
            name_str_1 = 'stat_' + feature
            for i in self.stat[feature].index:            
                    df.loc[df[feature] == i, name_str_1] = self.stat[feature].loc[i, base_feature]
                    
            name_str_2 =  base_feature + '_' + name_str_1            
            data[name_str_2] = df[base_feature] / df[name_str_1]
            
        return data

In [29]:
features_list += ['card1_2', 'card1_2_3_5', 'card1_2_3_5_addr1_2']

In [30]:
base_feature = 'TransactionAmt'
comp_features = CompositFeatures(base_feature, features_list)
df_train = comp_features.fit_transform(df_train)
for df in df_list[1:]:
    df = comp_features.transform(df)

In [31]:
SELECTED_FEATURE_NAMES += comp_features.NEW_FEATURE_NAMES
run_model()

[0]	train-auc:0.60237	valid-auc:0.61788	test-auc:0.62471
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 50 rounds.
[50]	train-auc:0.91376	valid-auc:0.89168	test-auc:0.87725
[100]	train-auc:0.92004	valid-auc:0.89508	test-auc:0.88188
Stopping. Best iteration:
[71]	train-auc:0.92004	valid-auc:0.89508	test-auc:0.88188

roc-auc score of prediction: 0.86443


__Вывод:__ Новые признаки на базе TransactionAmt не улучшают точность предсказания на тестовой выборке.

#### 5. D15
Создать признаки на основе отношения: D15 к вычисленной статистике. Статистика - среднее значение / стандартное отклонение D15, сгруппированное по card1 - card6, addr1, addr2, и по признакам, созданным в задании 2.

In [32]:
base_feature = 'D15'
comp_features = CompositFeatures(base_feature, features_list)
df_train = comp_features.fit_transform(df_train)
for df in df_list[1:]:
    df = comp_features.transform(df)
    
NEW_FEATURES_NAMES = [base_feature + '_stat_' + i for i in features_list]

In [33]:
SELECTED_FEATURE_NAMES += comp_features.NEW_FEATURE_NAMES
run_model()

[0]	train-auc:0.60237	valid-auc:0.61788	test-auc:0.62471
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 50 rounds.
[50]	train-auc:0.91326	valid-auc:0.89220	test-auc:0.87540
[100]	train-auc:0.92273	valid-auc:0.89602	test-auc:0.88306
Stopping. Best iteration:
[82]	train-auc:0.92273	valid-auc:0.89602	test-auc:0.88306

roc-auc score of prediction: 0.8666


__Вывод:__ Новые признаки на базе D15 улучшают точность предсказания на тестовой выборке.

#### 6. TransactionAmt - целая и дробная часть, логарифм
Выделить дробную часть и целую часть признака TransactionAmt в два отдельных признака. После создать отдельных признак - логарифм от TransactionAmt

In [34]:
def transform_feature(data, name_str):
    name_1 = name_str + '_int'
    name_2 = name_str + '_frac'
    name_3 = name_str + '_log'
    data[name_1] = data[name_str].astype(int)
    data[name_2] = data[name_str] - data[name_1]
    data[name_3] = np.log(data[name_str])
    
    return data

In [35]:
name_str = 'TransactionAmt'
for df in df_list:
    df = transform_feature(df, name_str)

In [36]:
SELECTED_FEATURE_NAMES.remove('TransactionAmt')
SELECTED_FEATURE_NAMES += ['TransactionAmt_int', 'TransactionAmt_frac', 'TransactionAmt_log']
run_model()

[0]	train-auc:0.60237	valid-auc:0.61788	test-auc:0.62471
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 50 rounds.
[50]	train-auc:0.91353	valid-auc:0.89170	test-auc:0.87702
[100]	train-auc:0.92317	valid-auc:0.89719	test-auc:0.88417
Stopping. Best iteration:
[82]	train-auc:0.92282	valid-auc:0.89712	test-auc:0.88446

roc-auc score of prediction: 0.86543


__Вывод:__ Новые признаки не улучшают точность предсказания на тестовой выборке.

#### 7. Frequency Encoding P_emaildomain и R_emaildomain
Выполнить предварительную подготовку / очистку признаков P_emaildomain и R_emaildomain (что и как делать - остается на ваше усмотрение) и сделать Frequency Encoding для очищенных признаков.

In [37]:
def classify_email_address(data, features_list, add_email):
    for feature in features_list:
        a = list(data[feature].value_counts().keys())
        gmail = [i for i in a if i[:5]=='gmail']
        yahoo = [i for i in a if i[:5]=='yahoo']
        hotmail = [i for i in a if i[:7]=='hotmail']
        live = [i for i in a if i[:4]=='live']            
        e_mail_list = gmail + yahoo + hotmail + live + add_email
        name_str = feature + '_'
        data.loc[data[feature].isin(e_mail_list), name_str] = 1
        data.loc[data[feature].isnull(), name_str] = -9
        
    return data

In [38]:
features_list = ['P_emaildomain', 'R_emaildomain']
add_email = ['aol.com', 'anonymous.com'] 
for df in df_list:
    df = classify_email_address(df, features_list, add_email)

In [39]:
features_list = ['P_emaildomain_', 'R_emaildomain_']

freq_encoder = FrequencyEncoder(features_list)
df_train = freq_encoder.fit_transform(df_train)
for df in df_list[1:]:
    df = freq_encoder.transform(df)

In [40]:
SELECTED_FEATURE_NAMES += freq_encoder.NEW_FEATURE_NAMES
run_model()

[0]	train-auc:0.60237	valid-auc:0.61788	test-auc:0.62471
Multiple eval metrics have been passed: 'test-auc' will be used for early stopping.

Will train until test-auc hasn't improved in 50 rounds.
[50]	train-auc:0.91051	valid-auc:0.88766	test-auc:0.86895
[100]	train-auc:0.91771	valid-auc:0.89242	test-auc:0.87688
Stopping. Best iteration:
[72]	train-auc:0.91753	valid-auc:0.89265	test-auc:0.87711

roc-auc score of prediction: 0.8654


__Вывод:__ Новые признаки не улучшают точность предсказания на тестовой выборке.