### Key idea:
The data is a time series, but the challenge is to build a model that predicts unseen clients (credit cards), not unseen time. We can turn the task of detecting fraudulent transactions into detecting fraudulent clients (credit cards): once a client has fraud, their entire account is considered untrustworthy.

### How to find client (UID)
The training and test data contain different sets of clients (some clients appear in both, some do not), so we can find which columns help differentiate clients by performing adversarial validation: mix all the training and test data together, add a new boolean column "is_this_transaction_in_test_data?", and train a model to classify whether a transaction comes from the test data or the train data. If you do this on just the first 53 columns after transforming the D columns, you see AUC = 0.999 and these features as important:

D10n, D1n, D15n, C13, D4n, card1, D2n, card2, addr1, TransactionAmt, and dist1. These are the columns we must use to find the clients.
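
The adversarial validation step can be sketched on synthetic data as follows. This is only an illustration: the frames `toy_train` and `toy_test`, the column names `client_like` and `noise`, and the choice of a random forest are assumptions made for the example, not the model or columns used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins: "client_like" shifts between train and test, "noise" does not.
toy_train = pd.DataFrame({"client_like": rng.normal(0.0, 1.0, 500),
                          "noise": rng.normal(0.0, 1.0, 500)})
toy_test = pd.DataFrame({"client_like": rng.normal(3.0, 1.0, 500),
                         "noise": rng.normal(0.0, 1.0, 500)})

# Mix both sets and label each row by its origin (1 = test, 0 = train).
mixed = pd.concat([toy_train, toy_test], ignore_index=True)
is_test = np.r_[np.zeros(len(toy_train)), np.ones(len(toy_test))].astype(int)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
# Out-of-fold probabilities avoid scoring the model on rows it was fit on.
oof = cross_val_predict(clf, mixed, is_test, cv=3, method="predict_proba")[:, 1]
adv_auc = roc_auc_score(is_test, oof)

# Refit on everything to read off which columns separate train from test.
clf.fit(mixed, is_test)
importance = pd.Series(clf.feature_importances_, index=mixed.columns)
print(f"adversarial AUC = {adv_auc:.3f}")
print(importance.sort_values(ascending=False))
```

A high AUC here means the two sets are easy to tell apart, and the columns with the highest importance are the ones that carry the difference, which in this competition points at client-identifying columns.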


* Key data:
    * card1 - card6: payment card information, such as card type, card category, issuing bank, country, etc.
    * addr1, addr2: address
    * D1-D15: timedelta, such as days since the previous transaction, etc.
    * P_emaildomain and R_emaildomain: purchaser and recipient email domain

* Other data:
    * C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
    * M1-M9: match, such as names on card and address, etc.
    * Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.


### Prevent overfitting
We can't use the UID directly as a feature, since many clients are unseen and appear only in the test data. But we can use aggregated group features (e.g. df.groupby('uid')[CM_columns].agg(['mean'])). The model can then classify clients that it has never seen before.

We can't say X is a client. We have to say someone with height 170cm and weight 120kg, so that the model is forced to search for hidden patterns and connections. M/D mean encoding does exactly this: it describes the client to the model in a general way.

Group aggregations with standard deviations (specifically of the normalized D columns) allow your model to find clients, and group aggregations with means (and/or std) allow your model to classify clients.

Consider a group whose rows all share the same uid = card1_addr1_D1n, where D1n = day - D1. This group may contain multiple clients (credit cards). The features D4n, D10n, and D15n are more specific than D1n and better at finding individual clients. Therefore a card1_addr1_D1n group will often contain more than one unique value of D15n, suggesting multiple clients, while some groups contain only one unique value of D15n, suggesting a single client. When you do df.groupby('uid').D15n.agg(['std']) you will get std=0 if there is only one D15n value inside, and your model will be more confident that the uid is a single client (credit card).
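
A minimal sketch of this std signal, using a hypothetical toy frame (`toy`, with made-up uid labels and D15n values) rather than the competition data:

```python
import pandas as pd

# Hypothetical toy rows: uid "A" holds one D15n value, uid "B" holds two.
toy = pd.DataFrame({
    "uid":  ["A", "A", "A", "B", "B", "B"],
    "D15n": [-40.0, -40.0, -40.0, -12.0, -75.0, -12.0],
})

# std == 0 means every row of the group shares one D15n value.
stats = toy.groupby("uid")["D15n"].agg(["std", "nunique"])

# Map the group statistic back onto each transaction as a feature.
toy["uid_D15n_std"] = toy["uid"].map(stats["std"])
print(stats)
```

Note that pandas computes the sample standard deviation, so a uid with a single row gets NaN rather than 0; a tree model can treat that as its own case.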

The M columns are very predictive features. For example, if you train a model on the first month of train using just M4, it reaches train AUC = 0.7 and predicts the last month of train with AUC = 0.7. So when you use df.groupby('uid').M4.agg(['mean']) (after mapping the M4 categories to integers), your model can use this feature to classify clients: all uids with M4_mean = 2 will be split one way in a tree and all with M4_mean = 0 another way.

### EDA
There are too many columns to analyze one by one, so we need to reduce them. First find columns that share a similar NaN structure and group them. Then choose columns from each group; possible choices are

1. apply PCA on each group
2. select subset of uncorrelated columns from each group
3. replace the entire group with all columns' average.

Then continue with feature selection.
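
As a rough sketch of the grouping step, the snippet below groups columns whose missing-value masks are identical. The frame `toy` and its V columns are made up for illustration; on the real data a cheaper key such as the NaN count per column may be preferable, which is an assumption on my part rather than something this notebook shows.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "V1": [1.0, np.nan, 3.0, np.nan],
    "V2": [0.5, np.nan, 0.1, np.nan],   # same missing rows as V1
    "V3": [np.nan, 2.0, 2.0, 2.0],
    "V4": [np.nan, 7.0, 1.0, 4.0],      # same missing rows as V3
    "V5": [1.0, 2.0, 3.0, 4.0],         # never missing
})

# Columns whose NaN masks are identical end up under the same key.
nan_groups = {}
for col in toy.columns:
    mask_key = tuple(toy[col].isna())
    nan_groups.setdefault(mask_key, []).append(col)

groups = list(nan_groups.values())
print(groups)

# Option 3 from the list above: replace a group with its row-wise average.
toy["V1_V2_mean"] = toy[["V1", "V2"]].mean(axis=1)
```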

### Feature selection

1. forward feature selection
2. recursive feature elimination
3. permutation importance
4. adversarial validation
5. correlation analysis
6. time consistency

"Time consistency" means training a single model on a single feature (or a small group of features) using the first month of the train dataset and predicting isFraud for the last month of the train dataset. This evaluates whether a feature by itself is consistent over time.

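The time consistency check can be sketched like this. Everything here is synthetic: the frame `toy`, the features `stable` and `drifting`, and the helper `time_consistency` are hypothetical, and a shallow scikit-learn decision tree stands in for the LightGBM model used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 4000
toy = pd.DataFrame({"DT_M": rng.integers(12, 18, n)})   # hypothetical month index
toy["isFraud"] = rng.integers(0, 2, n)
# "stable" tracks the label in every month; "drifting" flips its meaning in the last month.
toy["stable"] = toy["isFraud"] + rng.normal(0, 1, n)
sign = np.where(toy["DT_M"] == toy["DT_M"].max(), -1, 1)
toy["drifting"] = sign * toy["isFraud"] + rng.normal(0, 1, n)

first = toy[toy["DT_M"] == toy["DT_M"].min()]
last = toy[toy["DT_M"] == toy["DT_M"].max()]

def time_consistency(feature):
    """Fit on the first month with one feature, return AUC on the first and last month."""
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    model.fit(first[[feature]], first["isFraud"])
    auc_first = roc_auc_score(first["isFraud"], model.predict_proba(first[[feature]])[:, 1])
    auc_last = roc_auc_score(last["isFraud"], model.predict_proba(last[[feature]])[:, 1])
    return auc_first, auc_last

results = {f: time_consistency(f) for f in ["stable", "drifting"]}
for name, (auc_first, auc_last) in results.items():
    print(f"{name}: first month AUC = {auc_first:.3f}, last month AUC = {auc_last:.3f}")
```

A feature whose last-month AUC drops below 0.5 has learned a pattern that no longer holds later in time and is a candidate for removal.
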
### Validation Strategy
The local validation scheme was to train a model on the first 75% of rows and predict isFraud on the last 25% of rows. (This is approximately training on the first 4.5 months and predicting the last 1.5 months.)

When engineering a new feature (or group of features), evaluate whether it increases this local validation AUC. Other tests, such as comparing train.csv/test.csv distributions, time consistency, and correlation redundancy, indicate possibly "bad" features. I would then remove these features and evaluate whether local validation AUC increased or decreased.

When local validation AUC increased, the GroupKFold CV AUC usually increased too. When they did not agree, I trusted the local holdout AUC more because it is a forward-in-time prediction, whereas GroupKFold CV includes some backward-in-time folds.

Use GroupKFold grouped by transaction month, since this is a time series dataset and there is a dependency between the monthly data.
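
Both schemes can be sketched on a toy frame. The frame `toy` (6 months with 20 time-ordered rows each) is hypothetical; only the `DT_M` column name is borrowed from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Hypothetical toy frame: 6 months with 20 time-ordered rows each.
toy = pd.DataFrame({"DT_M": np.repeat(np.arange(12, 18), 20)})
toy["x"] = np.arange(len(toy))

# Forward-in-time holdout: first 75% of rows for training, last 25% for validation.
cut = int(len(toy) * 0.75)
holdout_train, holdout_valid = toy.iloc[:cut], toy.iloc[cut:]

# GroupKFold by month: every fold withholds whole months.
gkf = GroupKFold(n_splits=6)
withheld = []
for train_idx, valid_idx in gkf.split(toy, groups=toy["DT_M"]):
    withheld.append(sorted(toy["DT_M"].iloc[valid_idx].unique()))
    # No month appears on both sides of a split.
    assert set(toy["DT_M"].iloc[train_idx]).isdisjoint(toy["DT_M"].iloc[valid_idx])
print(withheld)
```

With six months and six splits, each fold withholds exactly one month, so most folds train on months that come after the validation month. That is the backward-in-time effect mentioned above.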

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook
from sklearn.metrics import roc_auc_score
import gc

from sklearn.preprocessing import LabelEncoder

import matplotlib.pyplot as plt
import seaborn as sns
import datetime
sns.set()
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import KFold, GroupKFold
import lightgbm as lgb
import pickle

In [31]:

print('Loading data...')
folder_path = ''

train_identity = pd.read_csv(f'{folder_path}train_identity.csv', index_col='TransactionID')
print('\tSuccessfully loaded train_identity!')

train_transaction = pd.read_csv(f'{folder_path}train_transaction.csv', index_col='TransactionID')
print('\tSuccessfully loaded train_transaction!')

test_identity = pd.read_csv(f'{folder_path}test_identity.csv', index_col='TransactionID')
print('\tSuccessfully loaded test_identity!')

test_transaction = pd.read_csv(f'{folder_path}test_transaction.csv', index_col='TransactionID')
print('\tSuccessfully loaded test_transaction!')

sub = pd.read_csv(f'{folder_path}sample_submission.csv')
print('\tSuccessfully loaded sample_submission!')

print('Data was successfully loaded!\n')

Loading data...
	Successfully loaded train_identity!
	Successfully loaded train_transaction!
	Successfully loaded test_identity!
	Successfully loaded test_transaction!
	Successfully loaded sample_submission!
Data was successfully loaded!



### Remove V columns
Some highly correlated V columns are removed.

In [None]:
# remove v_col
v =  [2,5,7,9,10]
v += [12,15,16,18,19,21,22,24,25,28,29,31]
v += [32,33,34,35,38,39,42,43,45,46]
v += [49,50,51,52,53,55,57,58,60,61,63,64,66,69,71]
v += [72,73,74,75,77,79,81,83,84,85,87,90,92]
v += [93,94,95,97,100,101,102,103,105,106]
v += [109,110,112,113,114,116,118,119,125,126]
v += [128,131,132,133,134,135,137]
v += [140,141,143,144,145,146,148,149,150,151,152,153,154,155,157,158,159,161,163]
v += [164,167,168,170,172,174,177,179,181,183]
v += [184,186,189,190,191,192,193,194,195,196,197,199,200,201,202,204,206,208,211]
v += [212,213,214,216,217]
v += [219,222,225,227,230,231,232,233,236]
v += [237,239,241,242,243,244,245,246,247,248,249,251,254,255,256,259,262]
v += [263,265,268,269,270,272,273,275,276,278,279,280,282,287,288,290]
v += [292,293,295,298,299,300,302]
v += [304,306,308,311,312,313,315,316,317,318,319,321]
v += [322,323,324,326,327,328,329,330,331,333,334,336,337,339]
cols_rm = ['V'+str(x) for x in v]
train_transaction.drop(cols_rm, axis = 1, inplace = True)
test_transaction.drop(cols_rm, axis = 1, inplace = True)
print(f'Train dataset has {train_transaction.shape[0]} rows and {train_transaction.shape[1]} columns.')
print(f'Test dataset has {test_transaction.shape[0]} rows and {test_transaction.shape[1]} columns.')

In [43]:
test_transaction.shape

(506691, 182)

In [44]:
def minify_identity_df(df):

    df['id_12'] = df['id_12'].map({'Found':1, 'NotFound':0})
    df['id_15'] = df['id_15'].map({'New':2, 'Found':1, 'Unknown':0})
    df['id_16'] = df['id_16'].map({'Found':1, 'NotFound':0})

    df['id_23'] = df['id_23'].map({'TRANSPARENT':4, 'IP_PROXY':3, 'IP_PROXY:ANONYMOUS':2, 'IP_PROXY:HIDDEN':1})

    df['id_27'] = df['id_27'].map({'Found':1, 'NotFound':0})
    df['id_28'] = df['id_28'].map({'New':2, 'Found':1})

    df['id_29'] = df['id_29'].map({'Found':1, 'NotFound':0})

    df['id_35'] = df['id_35'].map({'T':1, 'F':0})
    df['id_36'] = df['id_36'].map({'T':1, 'F':0})
    df['id_37'] = df['id_37'].map({'T':1, 'F':0})
    df['id_38'] = df['id_38'].map({'T':1, 'F':0})

    df['id_34'] = df['id_34'].fillna(':0')
    df['id_34'] = df['id_34'].apply(lambda x: x.split(':')[1]).astype(np.int8)
    df['id_34'] = np.where(df['id_34']==0, np.nan, df['id_34'])
    
    df['id_33'] = df['id_33'].fillna('0x0')
    df['id_33_0'] = df['id_33'].apply(lambda x: x.split('x')[0]).astype(int)
    df['id_33_1'] = df['id_33'].apply(lambda x: x.split('x')[1]).astype(int)
    df['id_33'] = np.where(df['id_33']=='0x0', np.nan, df['id_33'])

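    # NOTE: the mapped result is not assigned back, so DeviceType stays a string
    # column here and is label-encoded later together with the other object columns.
    df['DeviceType'].map({'desktop':1, 'mobile':0})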
    return df

train_identity = minify_identity_df(train_identity)
test_identity = minify_identity_df(test_identity)

for col in ['id_33']:
    train_identity[col] = train_identity[col].fillna('unseen_before_label')
    test_identity[col]  = test_identity[col].fillna('unseen_before_label')



In [45]:
print('Merging data...')
train = train_transaction.merge(train_identity, how='left', left_index=True, right_index=True)
test = test_transaction.merge(test_identity, how='left', left_index=True, right_index=True)
y = train['isFraud']
train.drop('isFraud', axis = 1, inplace = True)

print('Data was successfully merged!\n')

del train_identity, train_transaction, test_identity, test_transaction

print(f'Train dataset has {train.shape[0]} rows and {train.shape[1]} columns.')
print(f'Test dataset has {test.shape[0]} rows and {test.shape[1]} columns.')

gc.collect()

Merging data...
Data was successfully merged!

Train dataset has 590540 rows and 224 columns.
Test dataset has 506691 rows and 224 columns.


3147

## Feature engineering
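The D columns are time deltas measured from some point in the past. Convert each D column to the corresponding point in the past, for example:

D15n = Transaction_Day - D15, where Transaction_Day = TransactionDT/(24*60*60)

The normalization cell (In [46]) computes D - Transaction_Day instead, which is the same quantity with the opposite sign, and leaves D1, D2, D3, D5, D9, and D15 unchanged.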
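
A small worked example of why this helps, under the assumption that D1 counts days since the client's first transaction (the column meanings are masked, so this reading is a guess). The frame `toy` and its values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical client who started on day 10 and transacts on days 15, 40 and 41.
toy = pd.DataFrame({"TransactionDT": [15 * 86400 + 3600, 40 * 86400 + 7200, 41 * 86400 + 60]})
toy["day"] = np.floor(toy["TransactionDT"] / (24 * 60 * 60))
toy["D1"] = toy["day"] - 10           # days since the first transaction (assumed meaning)
toy["D1n"] = toy["day"] - toy["D1"]   # normalized: the same value on every row
print(toy)
```

The raw D1 grows with every transaction, while D1n stays at 10, so it can act as part of a client identifier.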

In [46]:
# normalize D column
for i in range(1,16):
    if i in [1,2,3,5,9,15]: continue
    train['D'+str(i)] =  train['D'+str(i)] - train.TransactionDT/np.float32(24*60*60)
    test['D'+str(i)] = test['D'+str(i)] - test.TransactionDT/np.float32(24*60*60) 

Encode the M columns, then combine pairs of columns into interaction features.

In [48]:
# encoding M
for col in ['M1','M2','M3','M5','M6','M7','M8','M9']:
    train[col] = train[col].map({'T':1, 'F':0})
    test[col]  = test[col].map({'T':1, 'F':0})

for col in ['M4']:
    print('Encoding', col)
    temp = pd.concat([train[[col]], test[[col]]])
    col_encoded = temp[col].value_counts().to_dict()   
    train[col] = train[col].map(col_encoded)
    test[col]  = test[col].map(col_encoded)
#     print(col_encoded)
# Some arbitrary features interaction

for feature in ['id_02__id_20', 'id_02__D8', 'D11__DeviceInfo', 'DeviceInfo__P_emaildomain', 'P_emaildomain__C2', 'card1__dist1',
                'card2__dist1', 'card1__card5', 'card2__id_20', 'card5__P_emaildomain', 'addr1__card2','addr1__card1']:

    f1, f2 = feature.split('__')
    train[feature] = train[f1].astype(str) + '_' + train[f2].astype(str)
    test[feature] = test[f1].astype(str) + '_' + test[f2].astype(str)
for fe in ['addr1__card1/P_emaildomain', 'addr1__card2/P_emaildomain']:
    f1, f2 = fe.split('/')
    train[fe] = train[f1].astype(str) + '_' + train[f2].astype(str)
    test[fe] = test[f1].astype(str) + '_' + test[f2].astype(str)
    
#     le = LabelEncoder()
#     le.fit(list(train[feature].astype(str).values) + list(test[feature].astype(str).values))
#     train[feature] = le.transform(list(train[feature].astype(str).values))
#     test[feature] = le.transform(list(test[feature].astype(str).values))
print('after Transaction, time and interaction')
print('Train dataset has {} rows and {} columns.'.format(train.shape[0], train.shape[1]))
print('Test dataset has {} rows and {} columns.'.format(test.shape[0], test.shape[1]))

Encoding M4
after Transaction, time and interaction
Train dataset has 590540 rows and 238 columns.
Test dataset has 506691 rows and 238 columns.


Detecting fraudulent transactions one by one is not effective. Once a client (credit card) has fraud, their entire account can be seen as "untrusted". Therefore we predict fraudulent clients (credit cards) instead.
**Create a unique identifier (UID)** that helps the model find credit cards/clients. This UID is not perfect, since one UID may contain several clients. This is handled by aggregating features over the UID with mean and std, so that an imperfect UID can then be split further by the trees.

In [49]:
# find uid, 
# most important columns to identify users are D10, D1, D15, C13, D4, card1, D2, card2, addr1, TransactionAmt, and dist1.
def add_uid(df):
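    # uid3 = addr1 + card1 + D1n, where D1n = transaction day - D1.
    # np.floor matches the commented-out variant below; without it the fractional part
    # of the day makes D1n differ between transactions that belong to the same client.
    D1n = np.floor(df['TransactionDT'] / (24*60*60)) - df['D1']
    df['uid3'] = df['addr1__card1'].astype(str)+'_'+ D1n.astype(str)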

    df['uid'] = df['card1'].astype(str)+'_'+df['card2'].astype(str)
    
    # uid1 = cd1 + cd2 + cd3 + cd5
    df['uid1'] = df['uid'].astype(str)+'_'+df['card3'].astype(str)+'_'+df['card5'].astype(str)

    # D9 is replaced by a flag: 1 if D9 was present, 0 if it was missing
    df['D9'] = np.where(df['D9'].isna(),0,1)

    # uid1 + addr1 + addr2
    df['uid2'] = df['uid1']+'_'+df['addr1'].astype(str)+'_'+df['addr2'].astype(str)

#     df['D1n'] = np.floor(df.TransactionDT / (24*60*60)) - df.D1
#     df['D10n'] = np.floor(df.TransactionDT / (24*60*60)) - df.D10
#     df['D15n'] = np.floor(df.TransactionDT / (24*60*60)) - df.D15
#     df['D4n'] = np.floor(df.TransactionDT / (24*60*60)) - df.D4
#     df['D2n'] = np.floor(df.TransactionDT / (24*60*60)) - df.D2

# for col in ['D1','D10','D15']:
#     new_col = str(col) + 'n'

#     # card1 + full_addr + date
#     df[new_col+'_cUID_1'] = df['card1'].astype(str)+'_'+df['addr1'].astype(str)+'_'+df['addr2'].astype(str)+'_'+df[new_col].astype(str)
    
#     # uid1 + full_addr + date
#     df[new_col+'_cUID_2'] = df['uid2'].astype(str)+'_'+df[new_col].astype(str)
    
#     df[new_col+'_cUID_1'] = np.where(df[col].isna(), np.nan, df[new_col+'_cUID_1'])
#     df[new_col+'_cUID_2'] = np.where(df[col].isna(), np.nan, df[new_col+'_cUID_2'])
    
#     df[new_col+'_cUID_1'] = np.where(df['addr1'].isna()&df['addr2'].isna(), np.nan, df[new_col+'_cUID_1'])
    
#     df[new_col+'_cUID_1'] = le.fit_transform(df[new_col+'_cUID_1'].astype(str).values)
#     df[new_col+'_cUID_2'] = le.fit_transform(df[new_col+'_cUID_2'].astype(str).values)

    return df


train = add_uid(train)
test = add_uid(test)
print('uid added')
print(f'Train dataset has {train.shape[0]} rows and {train.shape[1]} columns.')
print(f'Test dataset has {test.shape[0]} rows and {test.shape[1]} columns.')

# for feature in ['uid1', 'uid2']:
#     le = LabelEncoder()
#     df[feature] = le.fit_transform(df[feature].astype(str).values)

uid added
Train dataset has 590540 rows and 242 columns.
Test dataset has 506691 rows and 242 columns.


In [50]:
# New feature - decimal part (cents) of the raw transaction amount.
# It is computed before the log transform below; taking it afterwards would give
# the fractional part of log(amount) rather than the cents.
# train['TransactionAmt_decimal'] = ((train['TransactionAmt'] - train['TransactionAmt'].astype(int)) * 1000).astype(int)
# test['TransactionAmt_decimal'] = ((test['TransactionAmt'] - test['TransactionAmt'].astype(int)) * 1000).astype(int)
train['cents'] = (train['TransactionAmt'] - np.floor(train['TransactionAmt'])).astype('float32')
test['cents'] = (test['TransactionAmt'] - np.floor(test['TransactionAmt'])).astype('float32')
# Replace the transaction amount with its log.
train['TransactionAmt'] = np.log(train['TransactionAmt'])
test['TransactionAmt'] = np.log(test['TransactionAmt'])

In [51]:
# time feature
START_DATE = datetime.datetime.strptime('2017-11-30', '%Y-%m-%d')
def set_time(df):
    df['TransactionDT'] = df['TransactionDT'].fillna(df['TransactionDT'].median())
    df['DT'] = df['TransactionDT'].apply(lambda x: (START_DATE + datetime.timedelta(seconds = x)))
    df['DT_M'] = (df['DT'].dt.year-2017)*12 + df['DT'].dt.month
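    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week returns the same ISO week number
    df['DT_W'] = (df['DT'].dt.year-2017)*52 + df['DT'].dt.isocalendar().week.astype(int)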
    df['DT_D'] = (df['DT'].dt.year-2017)*365 + df['DT'].dt.dayofyear
    return df
    
train=set_time(train)
test=set_time(test)


rm_col = ['TransactionDT','DT']

In [52]:
# frequency encoding: replace each value with its count over train + test
i_cols = ['card1','card2','card3','card5',
          'D1','D2','D3','D4','D5','D6','D7','D8',
          'addr1','addr2',
          'dist1','dist2',
          'P_emaildomain', 'R_emaildomain',
          'id_30','id_33',
          'uid','uid1','uid2','uid3','card1__dist1',
          'card2__dist1', 'addr1__card2','addr1__card1', 
          'addr1__card1/P_emaildomain', 'addr1__card2/P_emaildomain'
         ]
for col in i_cols:
    temp = pd.concat([train[[col]], test[[col]]])
    fq_encode = temp[col].value_counts(dropna=False).to_dict()   
    train[col+'_fq_enc'] = train[col].map(fq_encode)
    test[col+'_fq_enc']  = test[col].map(fq_encode)
    
periods = ['DT_M','DT_W','DT_D']
for col in periods:
    temp = pd.concat([train[[col]], test[[col]]])
    fq_encode = temp[col].value_counts().to_dict()
            
    train[col+'_total'] = train[col].map(fq_encode)
    test[col+'_total']  = test[col].map(fq_encode)
print('frequency encoded')
print('Train dataset has {} rows and {} columns.'.format(train.shape[0], train.shape[1]))
print('Test dataset has {} rows and {} columns.'.format(test.shape[0], test.shape[1]))

frequency encoded
Train dataset has 590540 rows and 280 columns.
Test dataset has 506691 rows and 280 columns.


**Aggregation features**: We cannot add the UID as a new column because clients in the private test dataset are not in the training dataset. Instead we must create aggregated group features. For example, we can take all the C and M columns and compute new_features = df.groupby('uid')[CM_columns].agg(['mean']), then delete the uid column. Now our model has the ability to classify clients that it has never seen before.

Some UIDs contain many clients; use std to split them. If std = 0 there is only one client, and if std > 0 there are more (e.g. calculate std over the date of the first transaction).
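
The pattern can be sketched with a small helper. `add_group_stats`, `toy_train`, and `toy_test` are hypothetical names introduced for this example; the cells below implement the same idea inline with explicit loops.

```python
import pandas as pd

def add_group_stats(train_df, test_df, key, value, aggs=("mean", "std")):
    """Hypothetical helper: aggregate `value` per `key` over train + test and map it back."""
    combined = pd.concat([train_df[[key, value]], test_df[[key, value]]])
    stats = combined.groupby(key)[value].agg(list(aggs))
    for agg in aggs:
        name = f"{key}_{value}_{agg}"
        train_df[name] = train_df[key].map(stats[agg])
        test_df[name] = test_df[key].map(stats[agg])
    return train_df, test_df

toy_train = pd.DataFrame({"uid": ["a", "a", "b"], "M4": [2, 2, 0]})
toy_test = pd.DataFrame({"uid": ["b", "c", "c"], "M4": [1, 0, 0]})   # uid "c" is unseen in train

toy_train, toy_test = add_group_stats(toy_train, toy_test, "uid", "M4")

# Drop the raw identifier so the model only sees the group description.
toy_train = toy_train.drop(columns="uid")
toy_test = toy_test.drop(columns="uid")
print(toy_test)
```

The unseen uid "c" still receives meaningful feature values because the statistics are computed over train and test together and the identifier itself never reaches the model.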

In [53]:
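# encode_AG2: number of unique values of main_column within each uid group (used in a later cell);
# the loops below add mean and std aggregations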
def encode_AG2(main_columns, uids, train_df=train, test_df=test):
    for main_column in main_columns:  
        for col in uids:
            comb = pd.concat([train_df[[col]+[main_column]],test_df[[col]+[main_column]]],axis=0)
            mp = comb.groupby(col)[main_column].agg(['nunique'])['nunique'].to_dict()
            train_df[col+'_'+main_column+'_ct'] = train_df[col].map(mp).astype('float32')
            test_df[col+'_'+main_column+'_ct'] = test_df[col].map(mp).astype('float32')
            print(col+'_'+main_column+'_ct, ',end='')

i_cols = ['card1','card2','addr1__card1','card1__dist1',
        'card2__dist1', 'addr1__card2', 'addr1__card1/P_emaildomain', 'addr1__card2/P_emaildomain']
agg_col = ['TransactionAmt','D9','D10']
for col in i_cols:
    for agg_type in ['mean', 'std']:
        for a_col in agg_col:
            new_col_name = col+'_'+a_col+'_'+agg_type
            temp = pd.concat([train[[col, a_col]], test[[col,a_col]]])
            #temp.loc[temp[a_col]==-1,a_col] = np.nan
            temp = temp.groupby(col)[a_col].agg([agg_type]).rename(columns={agg_type: new_col_name})
            temp = temp[new_col_name].to_dict()  
            train[new_col_name] = train[col].map(temp)
            test[new_col_name]  = test[col].map(temp)
            #train[new_col_name].fillna(-1,inplace=True)
            #test[new_col_name].fillna(-1,inplace=True)
print('AG mean std added')
print('Train dataset has {} rows and {} columns.'.format(train.shape[0], train.shape[1]))
print('Test dataset has {} rows and {} columns.'.format(test.shape[0], test.shape[1]))

AG mean std added
Train dataset has 590540 rows and 328 columns.
Test dataset has 506691 rows and 328 columns.


In [54]:
i_cols =['uid','uid1','uid2','uid3']
agg_col = ['TransactionAmt','D4','D9','D10','D15','C7','C3','C4','C5','C8','C9',
           'C1','C2','C10','C11','C12','C6','C13','C14', 'M1', 'M2', 'M3',
          'M4', 'M5', 'M6','M7', 'M8', 'M9']

for col in i_cols:
    print(col+' start...')
    for agg_type in ['mean', 'std']:
        for a_col in agg_col:
            new_col_name = col+'_'+a_col+'_'+agg_type
            temp = pd.concat([train[[col, a_col]], test[[col,a_col]]])
            #temp.loc[temp[a_col]==-1,a_col] = np.nan
            temp = temp.groupby(col)[a_col].agg([agg_type]).rename(columns={agg_type: new_col_name})
            temp = temp[new_col_name].to_dict()  
            train[new_col_name] = train[col].map(temp)
            test[new_col_name]  = test[col].map(temp)
            #train[new_col_name].fillna(-1,inplace=True)
            #test[new_col_name].fillna(-1,inplace=True)
print('AG mean std added')
print('Train dataset has {} rows and {} columns.'.format(train.shape[0], train.shape[1]))
print('Test dataset has {} rows and {} columns.'.format(test.shape[0], test.shape[1]))

uid start...
uid1 start...
uid2 start...
uid3 start...
AG mean std added
Train dataset has 590540 rows and 552 columns.
Test dataset has 506691 rows and 552 columns.


In [55]:
encode_AG2(['P_emaildomain','dist1','DT_M','id_02','cents','C13',
            'V314','V127','V136','V309','V307','V320'], ['uid','uid1','uid2','uid3'], train_df=train, test_df=test)

# train['outsider15'] = (np.abs(train.D1-train.D15)>3).astype('int8')
# test['outsider15'] = (np.abs(test.D1-test.D15)>3).astype('int8')

train = train.replace(np.inf,999)
test = test.replace(np.inf,999)


print('added mean and std')
print('Train dataset has {} rows and {} columns.'.format(train.shape[0], train.shape[1]))
print('Test dataset has {} rows and {} columns.'.format(test.shape[0], test.shape[1]))

uid_P_emaildomain_ct, uid1_P_emaildomain_ct, uid2_P_emaildomain_ct, uid3_P_emaildomain_ct, uid_dist1_ct, uid1_dist1_ct, uid2_dist1_ct, uid3_dist1_ct, uid_DT_M_ct, uid1_DT_M_ct, uid2_DT_M_ct, uid3_DT_M_ct, uid_id_02_ct, uid1_id_02_ct, uid2_id_02_ct, uid3_id_02_ct, uid_cents_ct, uid1_cents_ct, uid2_cents_ct, uid3_cents_ct, uid_C13_ct, uid1_C13_ct, uid2_C13_ct, uid3_C13_ct, uid_V314_ct, uid1_V314_ct, uid2_V314_ct, uid3_V314_ct, uid_V127_ct, uid1_V127_ct, uid2_V127_ct, uid3_V127_ct, uid_V136_ct, uid1_V136_ct, uid2_V136_ct, uid3_V136_ct, uid_V309_ct, uid1_V309_ct, uid2_V309_ct, uid3_V309_ct, uid_V307_ct, uid1_V307_ct, uid2_V307_ct, uid3_V307_ct, uid_V320_ct, uid1_V320_ct, uid2_V320_ct, uid3_V320_ct, added mean and std
Train dataset has 590540 rows and 601 columns.
Test dataset has 506691 rows and 601 columns.


In [56]:
# clean email feature
emails = {'gmail': 'google', 'att.net': 'att', 'twc.com': 'spectrum', 'scranton.edu': 'other', 'optonline.net': 'other',
          'hotmail.co.uk': 'microsoft', 'comcast.net': 'other', 'yahoo.com.mx': 'yahoo', 'yahoo.fr': 'yahoo',
          'yahoo.es': 'yahoo', 'charter.net': 'spectrum', 'live.com': 'microsoft', 'aim.com': 'aol', 'hotmail.de': 'microsoft',
          'centurylink.net': 'centurylink', 'gmail.com': 'google', 'me.com': 'apple', 'earthlink.net': 'other', 
          'gmx.de': 'other', 'web.de': 'other', 'cfl.rr.com': 'other', 'hotmail.com': 'microsoft', 'protonmail.com': 'other',
          'hotmail.fr': 'microsoft', 'windstream.net': 'other', 'outlook.es': 'microsoft', 'yahoo.co.jp': 'yahoo',
          'yahoo.de': 'yahoo', 'servicios-ta.com': 'other', 'netzero.net': 'other', 'suddenlink.net': 'other',
          'roadrunner.com': 'other', 'sc.rr.com': 'other', 'live.fr': 'microsoft', 'verizon.net': 'yahoo',
          'msn.com': 'microsoft', 'q.com': 'centurylink', 'prodigy.net.mx': 'att', 'frontier.com': 'yahoo',
          'anonymous.com': 'other', 'rocketmail.com': 'yahoo', 'sbcglobal.net': 'att', 'frontiernet.net': 'yahoo',
          'ymail.com': 'yahoo', 'outlook.com': 'microsoft', 'mail.com': 'other', 'bellsouth.net': 'other',
          'embarqmail.com': 'centurylink', 'cableone.net': 'other', 'hotmail.es': 'microsoft', 'mac.com': 'apple',
          'yahoo.co.uk': 'yahoo', 'netzero.com': 'other', 'yahoo.com': 'yahoo', 'live.com.mx': 'microsoft', 'ptd.net': 'other',
          'cox.net': 'other', 'aol.com': 'aol', 'juno.com': 'other', 'icloud.com': 'apple'}
us_emails = ['gmail', 'net', 'edu']

for c in ['P_emaildomain', 'R_emaildomain']:
    train[c + '_bin'] = train[c].map(emails)
    test[c + '_bin'] = test[c].map(emails)
    
    train[c + '_suffix'] = train[c].map(lambda x: str(x).split('.')[-1])
    test[c + '_suffix'] = test[c].map(lambda x: str(x).split('.')[-1])
    
    train[c + '_suffix'] = train[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')
    test[c + '_suffix'] = test[c + '_suffix'].map(lambda x: x if str(x) not in us_emails else 'us')

In [57]:
p = 'P_emaildomain'
r = 'R_emaildomain'
uknown = 'email_not_provided'

def setDomain(df):
    df[p] = df[p].fillna(uknown)
    df[r] = df[r].fillna(uknown)
    
    # Check if P_emaildomain matches R_emaildomain
    df['email_check'] = np.where((df[p]==df[r])&(df[p]!=uknown),1,0)

    df[p+'_prefix'] = df[p].apply(lambda x: x.split('.')[0])
    df[r+'_prefix'] = df[r].apply(lambda x: x.split('.')[0])
    
    return df
    
train=setDomain(train)
test=setDomain(test)

rm_col += ['P_emaildomain', 'R_emaildomain']
print('cleared email')
print('Train dataset has {} rows and {} columns.'.format(train.shape[0], train.shape[1]))
print('Test dataset has {} rows and {} columns.'.format(test.shape[0], test.shape[1]))

cleared email
Train dataset has 590540 rows and 608 columns.
Test dataset has 506691 rows and 608 columns.


In [58]:
# clean device
def id_split(dataframe):
    dataframe['device_name'] = dataframe['DeviceInfo'].str.split('/', expand=True)[0]
    dataframe['device_version'] = dataframe['DeviceInfo'].str.split('/', expand=True)[1]
    dataframe['OS_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[0]
    dataframe['version_id_30'] = dataframe['id_30'].str.split(' ', expand=True)[1]
    dataframe['browser_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[0]
    dataframe['version_id_31'] = dataframe['id_31'].str.split(' ', expand=True)[1]
    
    dataframe.loc[dataframe['device_name'].str.contains('SM', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('SAMSUNG', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('GT-', na=False), 'device_name'] = 'Samsung'
    dataframe.loc[dataframe['device_name'].str.contains('Moto G', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('Moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('moto', na=False), 'device_name'] = 'Motorola'
    dataframe.loc[dataframe['device_name'].str.contains('LG-', na=False), 'device_name'] = 'LG'
    dataframe.loc[dataframe['device_name'].str.contains('rv:', na=False), 'device_name'] = 'RV'
    dataframe.loc[dataframe['device_name'].str.contains('HUAWEI', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('ALE-', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('-L', na=False), 'device_name'] = 'Huawei'
    dataframe.loc[dataframe['device_name'].str.contains('Blade', na=False), 'device_name'] = 'ZTE'
    dataframe.loc[dataframe['device_name'].str.contains('BLADE', na=False), 'device_name'] = 'ZTE'
    dataframe.loc[dataframe['device_name'].str.contains('Linux', na=False), 'device_name'] = 'Linux'
    dataframe.loc[dataframe['device_name'].str.contains('XT', na=False), 'device_name'] = 'Sony'
    dataframe.loc[dataframe['device_name'].str.contains('HTC', na=False), 'device_name'] = 'HTC'
    dataframe.loc[dataframe['device_name'].str.contains('ASUS', na=False), 'device_name'] = 'Asus'

    dataframe.loc[dataframe.device_name.isin(dataframe.device_name.value_counts()[dataframe.device_name.value_counts() < 200].index), 'device_name'] = "Others"
    dataframe['had_id'] = 1
    gc.collect()
    
    return dataframe
rm_col += ['DeviceInfo', 'id_30', 'id_31']
train = id_split(train)
test = id_split(test)
print('cleared device')
print('Train dataset has {} rows and {} columns.'.format(train.shape[0], train.shape[1]))
print('Test dataset has {} rows and {} columns.'.format(test.shape[0], test.shape[1]))

cleared device
Train dataset has 590540 rows and 615 columns.
Test dataset has 506691 rows and 615 columns.


In [59]:
# drop columns that are mostly null or dominated by a single value
def get_too_many_null_attr(data):
    many_null_cols = [col for col in data.columns if data[col].isnull().sum() / data.shape[0] > 0.9]
    return many_null_cols

def get_too_many_repeated_val(data):
    big_top_value_cols = [col for col in data.columns if data[col].value_counts(dropna=False, normalize=True).values[0] > 0.95]
    return big_top_value_cols

def get_useless_columns(data):
    too_many_null = get_too_many_null_attr(data)
    print("More than 90% null: " + str(len(too_many_null)))
    too_many_repeated = get_too_many_repeated_val(data)
    # note: get_too_many_repeated_val uses a 0.95 threshold, although the message below says 90%
    print("More than 90% repeated value: " + str(len(too_many_repeated)))
    cols_to_drop = list(set(too_many_null + too_many_repeated))
    #cols_to_drop.remove('isFraud')
    return cols_to_drop
cols_to_drop = get_useless_columns(train)
rm_col += cols_to_drop
rm_col += ['uid', 'uid1','uid2','uid3']
train = train.drop(rm_col, axis=1)
test = test.drop(rm_col, axis=1)

print('cleared useless features')
print('{} columns are removed'.format(len(rm_col)))
print('Train dataset has {} rows and {} columns.'.format(train.shape[0], train.shape[1]))
print('Test dataset has {} rows and {} columns.'.format(test.shape[0], test.shape[1]))

More than 90% null: 43
More than 90% repeated value: 65
cleared useless features
78 columns are removed
Train dataset has 590540 rows and 537 columns.
Test dataset has 506691 rows and 537 columns.


In [60]:
# label encoding
numerical_cols = train.select_dtypes(exclude = 'object').columns
categorical_cols = train.select_dtypes(include = 'object').columns
print('numerical columns: {}'.format(len(numerical_cols)))
print('categorical columns: {}'.format(len(categorical_cols)))
categorical_cols[:10]

numerical columns: 503
categorical columns: 34


Index(['ProductCD', 'card4', 'card6', 'id_33', 'DeviceType', 'id_02__id_20',
       'id_02__D8', 'D11__DeviceInfo', 'DeviceInfo__P_emaildomain',
       'P_emaildomain__C2'],
      dtype='object')

In [61]:
%%time
for f in train.columns:
    if train[f].dtype.name =='object':
        le = LabelEncoder()
        le.fit(list(train[f].astype(str).values) + list(test[f].astype(str).values))
        train[f] = le.transform(list(train[f].astype(str).values))
        test[f] = le.transform(list(test[f].astype(str).values))

CPU times: user 1min 51s, sys: 13.8 s, total: 2min 5s
Wall time: 1min 31s


In [62]:
X = train
X_test = test
print(X.shape)
print(y.shape)
print(X_test.shape)
#del train, test
gc.collect()

(590540, 537)
(590540,)
(506691, 537)


47

In [None]:
params = {'num_leaves': 491,
          'min_child_weight': 0.03454472573214212,
          'feature_fraction': 0.3797454081646243,
          'bagging_fraction': 0.4181193142567742,
          'min_data_in_leaf': 106,
          'objective': 'binary',
          'max_depth': -1,
          'learning_rate': 0.007,
          "boosting_type": "gbdt",
          "bagging_seed": 11,
          "metric": 'auc',
          "verbosity": -1,
          'reg_alpha': 0.3899927210061127,
          'reg_lambda': 0.6485237330340494,
          'random_state': 47,
         }

In [63]:
params = {'num_leaves': 546,
          'min_child_weight': 0.03454472573214212,
          'feature_fraction': 0.1797454081646243,
          'bagging_fraction': 0.2181193142567742,
          'min_data_in_leaf': 106,
          'objective': 'binary',
          'max_depth': -1,
          'learning_rate': 0.005883242363721497,
          "boosting_type": "gbdt",
          "bagging_seed": 11,
          "metric": 'auc',
          "verbosity": -1,
          'reg_alpha': 0.3299927210061127,
          'reg_lambda': 0.3885237330340494,
          'random_state': 42,
}

In [None]:
params = {'objective': 'binary',
          'boosting_type': 'gbdt',
          'metric': 'auc',
          'n_jobs': -1,
          'learning_rate': 0.007,
          'num_leaves': 2**8,
          'max_depth': -1,
          'tree_learner': 'serial',
          'colsample_bytree': 0.7,
          'subsample_freq': 1,
          'subsample': 0.7,
          'n_estimators': 10000,
          'max_bin': 255,
          'verbose': -1,
          'seed': 42,
          'early_stopping_rounds': 100,
         }

In [None]:
%%time

NFOLDS = 6
folds = KFold(NFOLDS)
#folds = GroupKFold(n_splits=NFOLDS)
split_groups = X['DT_M']

columns = X.columns
splits = folds.split(X,y)
#splits = folds.split(X, y, groups = split_groups)
y_preds = np.zeros(X_test.shape[0])
y_oof = np.zeros(X.shape[0])
score = 0

feature_importances = pd.DataFrame()
feature_importances['feature'] = columns

for fold_n, (train_index, valid_index) in enumerate(splits):
    X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
    dtrain = lgb.Dataset(X_train, label=y_train)
    dvalid = lgb.Dataset(X_valid, label=y_valid)
    clf = lgb.train(params, dtrain, 10000, valid_sets = [dtrain, dvalid], verbose_eval=200, early_stopping_rounds=100)
#     month = X_valid['DT_M'].iloc[0]
#     print('Fold',fold_n + 1,' valid withholding month',month)
    
    feature_importances[f'fold_{fold_n + 1}'] = clf.feature_importance()
    
    y_pred_valid = clf.predict(X_valid)
    y_oof[valid_index] = y_pred_valid
    print(f"Fold {fold_n + 1} | AUC: {roc_auc_score(y_valid, y_pred_valid)}")
    
    score += roc_auc_score(y_valid, y_pred_valid) / NFOLDS
    y_preds += clf.predict(X_test) / NFOLDS
    
    del X_train, X_valid, y_train, y_valid
    gc.collect()
    
print(f"\nMean AUC = {score}")
print(f"Out of folds AUC = {roc_auc_score(y, y_oof)}")
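
The cell above keeps a `GroupKFold` split on `DT_M` as a commented-out alternative to plain `KFold`. A small sketch of what that alternative does, on toy data: every validation fold holds out whole months, so no month is shared between the training and validation parts. The frame `toy`, its month values, and the list `held_out` are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy data: 6 hypothetical month ids with 4 rows each.
toy = pd.DataFrame({"DT_M": np.repeat(np.arange(12, 18), 4),
                    "x": np.arange(24)})
toy_y = pd.Series(np.arange(24) % 2)

gkf = GroupKFold(n_splits=6)
held_out = []
for fold_n, (tr_idx, va_idx) in enumerate(gkf.split(toy, toy_y, groups=toy["DT_M"])):
    train_months = set(toy["DT_M"].iloc[tr_idx])
    valid_months = set(toy["DT_M"].iloc[va_idx])
    # A month never appears on both sides of the split.
    assert train_months.isdisjoint(valid_months)
    held_out.append(sorted(valid_months))
    print(f"Fold {fold_n + 1} validates on month(s) {sorted(valid_months)}")
```

With plain `KFold` and no shuffling, folds are contiguous row blocks, which only approximates a month-wise split when the rows are sorted by time.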

In [23]:
sub['isFraud'] = y_preds
sub.to_csv("sub_M_ol_normD_encode_id_rm.csv", index=False) # normD_M

In [None]:
%%time
# Xgb
import xgboost as xgb
NFOLDS = 6
# folds = KFold(NFOLDS)
folds = GroupKFold(n_splits=NFOLDS)
split_groups = X['DT_M']

columns = X.columns
splits = folds.split(X, y, groups = split_groups)
y_preds = np.zeros(X_test.shape[0])
y_oof = np.zeros(X.shape[0])
score = 0

feature_importances = pd.DataFrame()
feature_importances['feature'] = columns

for fold_n, (train_index, valid_index) in enumerate(splits):
    X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
    y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]
#     month = X_valid['DT_M'].iloc[0]
#     print('Fold',fold_n + 1,' valid withholding month',month)
    
    clf = xgb.XGBClassifier(
            n_estimators=5000,
            max_depth=8,
            learning_rate=0.02,
            subsample=0.8,
            colsample_bytree=0.4,
            missing=np.nan,
            eval_metric='auc',
            # USE CPU
            nthread=4,
            tree_method='hist'
            # USE GPU
            #tree_method='gpu_hist' 
        ) 
    h = clf.fit(X_train, y_train, 
                eval_set=[(X_valid,y_valid)], verbose=100, early_stopping_rounds=200)
    
    
    feature_importances[f'fold_{fold_n + 1}'] = clf.feature_importances_
    
    y_pred_valid = clf.predict_proba(X_valid)[:,1]
    y_oof[valid_index] = y_pred_valid
    print(f"Fold {fold_n + 1} | AUC: {roc_auc_score(y_valid, y_pred_valid)}")
    
    score += roc_auc_score(y_valid, y_pred_valid) / NFOLDS
    y_preds += clf.predict_proba(X_test)[:,1] / NFOLDS

    del X_train, X_valid, y_train, y_valid, h, clf
    gc.collect()
    
print(f"\nMean AUC = {score}")
print(f"Out of folds AUC = {roc_auc_score(y, y_oof)}")

In [None]:
# groupkfold ADD not sure
# delete columns ADD work

# outliers ADD
# mean encoding using M columns ADD
# Normalize D ADD

# Mark card columns "outliers"
# more uid? +D
# Find nunique dates per client for uid D

# XGB ADD
# post process ADD not work

# work
# for col in ['card4', 'card6', 'ProductCD']:   
#     print('Encoding', col)
#     temp_df = pd.concat([train_df[[col]], test_df[[col]]])
#     col_encoded = temp_df[col].value_counts().to_dict()   
#     train_df[col] = train_df[col].map(col_encoded)
#     test_df[col]  = test_df[col].map(col_encoded)
#     print(col_encoded)

# for col in ['M1','M2','M3','M5','M6','M7','M8','M9']:
#     train_df[col] = train_df[col].map({'T':1, 'F':0})
#     test_df[col]  = test_df[col].map({'T':1, 'F':0})

# for col in ['M4']:
#     print('Encoding', col)
#     temp_df = pd.concat([train_df[[col]], test_df[[col]]])
#     col_encoded = temp_df[col].value_counts().to_dict()   
#     train_df[col] = train_df[col].map(col_encoded)
#     test_df[col]  = test_df[col].map(col_encoded)
#     print(col_encoded)
''' 
def minify_identity_df(df):  # ADD

    df['id_12'] = df['id_12'].map({'Found':1, 'NotFound':0})
    df['id_15'] = df['id_15'].map({'New':2, 'Found':1, 'Unknown':0})
    df['id_16'] = df['id_16'].map({'Found':1, 'NotFound':0})

    df['id_23'] = df['id_23'].map({'TRANSPARENT':4, 'IP_PROXY':3, 'IP_PROXY:ANONYMOUS':2, 'IP_PROXY:HIDDEN':1})

    df['id_27'] = df['id_27'].map({'Found':1, 'NotFound':0})
    df['id_28'] = df['id_28'].map({'New':2, 'Found':1})

    df['id_29'] = df['id_29'].map({'Found':1, 'NotFound':0})

    df['id_35'] = df['id_35'].map({'T':1, 'F':0})
    df['id_36'] = df['id_36'].map({'T':1, 'F':0})
    df['id_37'] = df['id_37'].map({'T':1, 'F':0})
    df['id_38'] = df['id_38'].map({'T':1, 'F':0})

    df['id_34'] = df['id_34'].fillna(':0')
    df['id_34'] = df['id_34'].apply(lambda x: x.split(':')[1]).astype(np.int8)
    df['id_34'] = np.where(df['id_34']==0, np.nan, df['id_34'])
    
    df['id_33'] = df['id_33'].fillna('0x0')
    df['id_33_0'] = df['id_33'].apply(lambda x: x.split('x')[0]).astype(int)
    df['id_33_1'] = df['id_33'].apply(lambda x: x.split('x')[1]).astype(int)
    df['id_33'] = np.where(df['id_33']=='0x0', np.nan, df['id_33'])

    df['DeviceType'] = df['DeviceType'].map({'desktop':1, 'mobile':0})
    return df

train_identity = minify_identity_df(train_identity)
test_identity = minify_identity_df(test_identity)

for col in ['id_33']:
    train_identity[col] = train_identity[col].fillna('unseen_before_label')
    test_identity[col]  = test_identity[col].fillna('unseen_before_label')
'''   

# with open('model.pkl', 'wb') as fout:
#     pickle.dump(clf, fout)
# # load model with pickle to predict
# with open('model.pkl', 'rb') as fin:
#     pkl_bst = pickle.load(fin)
# # can predict with any iteration when loaded in pickle way
# # y_pred = pkl_bst.predict(X_test)
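
The commented-out loops over `card4`, `card6`, `ProductCD` and `M4` in the notes above replace each category with its count over train and test combined (frequency encoding). A minimal runnable sketch on toy data follows; `toy_train`, `toy_test` and `frequency_encode` are hypothetical names introduced here.

```python
import pandas as pd

# Toy frames standing in for train_df/test_df (hypothetical values).
toy_train = pd.DataFrame({"card4": ["visa", "visa", "mastercard"]})
toy_test = pd.DataFrame({"card4": ["visa", "discover"]})


def frequency_encode(train_df, test_df, col):
    """Map each category to its number of occurrences in train and test combined."""
    counts = pd.concat([train_df[col], test_df[col]]).value_counts().to_dict()
    return train_df[col].map(counts), test_df[col].map(counts)


train_fe, test_fe = frequency_encode(toy_train, toy_test, "card4")
print(list(train_fe), list(test_fe))
```

Missing values are not counted by `value_counts()` by default, so they stay missing after the mapping.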

In [None]:
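# 'average' must exist before sorting on it, so compute it here from the per-fold columns
feature_importances['average'] = feature_importances.filter(like='fold_').mean(axis=1)
rm = feature_importances.sort_values(by = 'average').feature[:100]
list(rm)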

In [None]:
feature_importances['average'] = feature_importances[['fold_{}'.format(fold + 1) for fold in range(folds.n_splits)]].mean(axis=1)
feature_importances.to_csv('feature_importances.csv')

plt.figure(figsize=(16, 16))
sns.barplot(data=feature_importances.sort_values(by='average', ascending=True).head(100), x='average', y='feature');
plt.title('100 bottom feature importance over {} folds average'.format(folds.n_splits));



In [None]:
plt.figure(figsize=(16, 16))
sns.barplot(data=feature_importances.sort_values(by='average', ascending=False).head(50), x='average', y='feature');
plt.title('50 top feature importance over {} folds average'.format(folds.n_splits));


In [25]:
sub['isFraud'] = y_preds
sub.to_csv("sub_xgb_M_ol_normD_encode_id_rm.csv", index=False)

In [94]:
# post process
# UIDs, We believe each to be an individual client (credit card). 
# Analysis shows us that all transactions from a single client (one of Konstantin's UIDs) 
# are either all isFraud=0 or all isFraud=1. In other words, all their predictions are the same.
# Therefore our post process is to replace all predictions from one client with 
# their average prediction including the isFraud values from the train dataset. 
# We have two slightly different versions so we apply them sequentially.
                                                              
# #X_test['isFraud'] = sample_submission.isFraud.values
# X_test['isFraud'] = y_preds
# X['isFraud'] = y.values
# comb = pd.concat([X[['isFraud']],X_test[['isFraud']]],axis=0)

# #uids = pd.read_csv('X_tr_tt_uids.csv',usecols=['TransactionID','uid2','uid3'])
# uids = pd.concat([X[['uid','uid1']], X_test[['uid','uid1']]], axis = 0)
# comb = comb.merge(uids,left_index = True,right_index=True,how='left')
# mp = comb.groupby('uid').isFraud.agg(['mean'])
# comb.loc[comb.uid>0,'isFraud'] = comb.loc[comb.uid>0].uid.map(mp['mean'])

# mp = comb.groupby('uid1').isFraud.agg(['mean'])
# comb.loc[comb.uid1>0,'isFraud'] = comb.loc[comb.uid1>0].uid1.map(mp['mean'])   
# sub.isFraud = comb.iloc[len(X):].isFraud.values
# sub.to_csv('PP_uid_uid1.csv',index=False)
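
The post-processing described in the comments above can be sketched on toy data as follows: the known labels from train and the model predictions for test are pooled, averaged per UID, and the test rows are overwritten with their client's average. This is an illustration under stated assumptions, not the notebook's exact code: the frames, the helper `smooth_by_uid`, and the convention that `uid <= 0` means "no client identified" (taken from the `comb.uid>0` filter above) are all assumptions.

```python
import pandas as pd

# Train rows carry true labels; test rows carry model predictions (toy values).
toy_train = pd.DataFrame({"uid": [1, 1, 2], "isFraud": [1.0, 1.0, 0.0]})
toy_test = pd.DataFrame({"uid": [1, 2, 2, 0], "isFraud": [0.4, 0.2, 0.1, 0.7]})


def smooth_by_uid(train_df, test_df, uid_col="uid"):
    """Replace each test prediction with the mean isFraud of its client."""
    comb = pd.concat([train_df, test_df], ignore_index=True)
    known = comb[uid_col] > 0  # rows without an identified client are left alone
    means = comb.loc[known].groupby(uid_col)["isFraud"].mean()
    comb.loc[known, "isFraud"] = comb.loc[known, uid_col].map(means)
    # Only the test part is returned, in its original row order.
    return comb["isFraud"].iloc[len(train_df):].to_numpy()


smoothed = smooth_by_uid(toy_train, toy_test)
print(smoothed)
```

Applying the same function a second time with a different UID column would mirror the sequential `uid` then `uid1` passes in the commented-out code.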