#### High-level problem statement
E-commerce websites often transact huge amounts of money. Whenever a huge amount of money is moved, there is a high risk of users performing fraudulent activities, e.g. using stolen credit cards, laundering money, etc. 
#### Objective
The goal of this challenge is to build a machine learning model that predicts the probability that the first transaction of a new user is fraudulent.
Electronica is an e-commerce site that sells wholesale electronics. You have been contracted to build a model that predicts whether a given transaction is fraudulent or not. You only have information about each user’s first transaction on Electronica’s website. If you fail to identify a fraudulent transaction, Electronica loses money equivalent to the price of the fraudulently purchased product. If you incorrectly flag a real transaction as fraudulent, it inconveniences the Electronica customers whose valid transactions are flagged, a cost your client values at $8.
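
The two costs above imply a per-transaction decision rule. The following is a minimal sketch, not part of the original analysis: it assumes the model outputs a reasonably calibrated fraud probability `p`, and the helper names (`flag_threshold`, `flag_transactions`, `total_cost`) are hypothetical. Not flagging costs `p * purchase_value` in expectation and flagging costs `(1 - p) * 8`, so flagging is the cheaper choice when `p > 8 / (8 + purchase_value)`.

```python
import numpy as np

FALSE_POSITIVE_COST = 8.0  # cost of flagging a legitimate transaction (from the problem statement)


def flag_threshold(purchase_value):
    """Probability above which flagging has lower expected cost than not flagging.

    Not flagging costs p * purchase_value in expectation; flagging costs
    (1 - p) * FALSE_POSITIVE_COST. The two are equal at the value returned here.
    """
    purchase_value = np.asarray(purchase_value, dtype=float)
    return FALSE_POSITIVE_COST / (FALSE_POSITIVE_COST + purchase_value)


def flag_transactions(fraud_probability, purchase_value):
    """Return a boolean array that is True where the transaction should be flagged."""
    fraud_probability = np.asarray(fraud_probability, dtype=float)
    return fraud_probability > flag_threshold(purchase_value)


def total_cost(y_true, flagged, purchase_value):
    """Realized cost of a set of decisions, given the true labels."""
    y_true = np.asarray(y_true).astype(bool)
    flagged = np.asarray(flagged).astype(bool)
    purchase_value = np.asarray(purchase_value, dtype=float)
    missed_fraud = (y_true & ~flagged) * purchase_value      # fraud that was not flagged
    false_alarm = (~y_true & flagged) * FALSE_POSITIVE_COST  # legitimate but flagged
    return float(missed_fraud.sum() + false_alarm.sum())
```

Under this rule the threshold depends on the purchase value: a $42 purchase would be flagged above a fraud probability of 0.16, while an $11 purchase would need a probability above roughly 0.42.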

In [41]:
# Data Importing

import os
import pandas as pd
import numpy as np

## Import Excel file
ip_country = pd.read_excel('Candidate_tech_evaluation_candidate_copy_datascience_IpAddress_to_Country.xlsx')
####

## Import CSV file
fraud = pd.read_csv('Candidate_tech_evaluation_candidate_copy_data science_fraud.csv')
no_label = pd.read_csv('fraud_holdout_no_label_1.csv')


### How to find whether a transaction is fraudulent

There are three files.
- ipaddress_to_country
- fraud data
- fraud data without label

I will use the fraud data as the training set and run the model on the no_label data. Many algorithms could work here, such as random forest or SVM. For this use case, I will consider two gradient boosting models (XGBoost and LightGBM) because they train quickly and handle complex problems well. A neural network is unnecessary because the dataset is not large.

In [2]:
no_label

Unnamed: 0.1,Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address
0,6,159135,2015-05-21 6:03,2015-07-09 8:05,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2.809315e+09
1,7,50116,2015-08-01 22:40,2015-08-27 3:37,11,IWKVZHJOCLPUR,Ads,Chrome,F,19,3.987484e+09
2,10,182338,2015-01-25 17:49,2015-03-23 23:05,62,NRFFPPHZYFUVC,Ads,IE,M,31,3.416747e+08
3,11,199700,2015-07-11 18:26,2015-10-28 21:59,13,TEPSJVVXGNTYR,Ads,Safari,F,35,1.819009e+09
4,12,73884,2015-05-29 16:22,2015-06-16 5:45,58,ZTZZJUCRDOCJZ,Direct,Chrome,M,32,4.038285e+09
...,...,...,...,...,...,...,...,...,...,...,...
31107,151090,330979,2015-08-12 17:21,2015-12-03 0:00,43,WFYRTUQZQFSQJ,SEO,IE,M,34,1.966801e+09
31108,151094,25306,2015-05-15 20:09,2015-07-08 14:21,38,GFIVGRRVFBFOF,SEO,Chrome,M,31,3.097327e+09
31109,151097,27502,2015-04-14 23:43,2015-06-24 10:42,43,PYCNPZMYIETTA,Direct,Opera,F,30,4.202836e+09
31110,151100,115473,2015-01-01 7:26,2015-01-01 7:26,61,ZRHCEVZHNIBJH,Direct,IE,M,24,3.003296e+09


## Exploratory Data Analysis

In [3]:
n=fraud.nunique(axis=0)
print(n)
fraud

Unnamed: 0        120000
user_id           120000
signup_time       109111
purchase_time     100058
purchase_value       120
device_id         110599
source                 3
browser                5
sex                    2
age                   57
ip_address        114134
class                  2
dtype: int64


Unnamed: 0.1,Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
0,149671,285108,7/15/2015 4:36,9/10/2015 14:17,31,HZAKVUFTDOSFD,Direct,Chrome,M,49,2.818400e+09,0
1,15611,131009,1/24/2015 12:29,4/13/2015 4:53,31,XGQAJSOUJIZCC,SEO,IE,F,21,3.251268e+09,0
2,73178,328855,3/11/2015 0:54,4/5/2015 12:23,16,VCCTAYDCWKZIY,Direct,IE,M,26,2.727760e+09,0
3,84546,229053,1/7/2015 13:19,1/9/2015 10:12,29,MFFIHYNXCJLEY,SEO,Chrome,M,34,2.083420e+09,0
4,35978,108439,2/8/2015 21:11,4/9/2015 14:26,26,WMSXWGVPNIFBM,Ads,FireFox,M,33,3.207913e+09,0
...,...,...,...,...,...,...,...,...,...,...,...,...
119995,13862,116698,2/26/2015 11:51,4/16/2015 22:57,46,UJYRDGZXTFFJG,Ads,Chrome,M,18,2.509395e+09,0
119996,122655,122699,8/1/2015 18:40,8/25/2015 7:56,26,EMMTCPTUYQYPX,Ads,IE,F,36,2.946612e+09,0
119997,125965,115120,7/25/2015 12:50,9/3/2015 4:10,41,YSZGGEARGETEU,SEO,Chrome,M,31,5.570629e+08,0
119998,31108,87098,4/2/2015 21:11,6/22/2015 16:51,50,BJDWRJULJZNOV,SEO,Chrome,F,43,2.687887e+09,0


In [4]:
## need to first find what data types they are
def get_var_category(series):
    unique_count = series.nunique(dropna=False)
    total_count = len(series)
    if pd.api.types.is_numeric_dtype(series):
        return 'Numerical'
    elif pd.api.types.is_datetime64_dtype(series):
        return 'Date'
    elif unique_count==total_count:
        return 'Text (Unique)'
    else:
        return 'Categorical'

def print_categories(df):
    for column_name in df.columns:
        print(column_name, ": ", get_var_category(df[column_name]))

print_categories(fraud) 


Unnamed: 0 :  Numerical
user_id :  Numerical
signup_time :  Categorical
purchase_time :  Categorical
purchase_value :  Numerical
device_id :  Categorical
source :  Categorical
browser :  Categorical
sex :  Categorical
age :  Numerical
ip_address :  Numerical
class :  Numerical


In [5]:
# There are a few problems:
## 1) 'Unnamed: 0' should be the index. 2) The time columns are read as text (reported as Categorical), not datetime. 3) IP addresses are floats shown in scientific notation (e+09).

#1
fraud = fraud.set_index('Unnamed: 0')

#2
fraud['purchase_time'] = pd.to_datetime(fraud['purchase_time'])
fraud['signup_time'] = pd.to_datetime(fraud['signup_time'])

#3
fraud['ip_address'] = fraud['ip_address'].apply(int)

fraud.head()

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class
149671,285108,2015-07-15 04:36:00,2015-09-10 14:17:00,31,HZAKVUFTDOSFD,Direct,Chrome,M,49,2818400139,0
15611,131009,2015-01-24 12:29:00,2015-04-13 04:53:00,31,XGQAJSOUJIZCC,SEO,IE,F,21,3251268287,0
73178,328855,2015-03-11 00:54:00,2015-04-05 12:23:00,16,VCCTAYDCWKZIY,Direct,IE,M,26,2727760440,0
84546,229053,2015-01-07 13:19:00,2015-01-09 10:12:00,29,MFFIHYNXCJLEY,SEO,Chrome,M,34,2083419526,0
35978,108439,2015-02-08 21:11:00,2015-04-09 14:26:00,26,WMSXWGVPNIFBM,Ads,FireFox,M,33,3207912664,0


In [6]:
ip_country

Unnamed: 0,lower_bound_ip_address,upper_bound_ip_address,country
0,16777216,16777471,Australia
1,16777472,16777727,China
2,16777728,16778239,China
3,16778240,16779263,Australia
4,16779264,16781311,China
...,...,...,...
138841,3758092288,3758093311,Hong Kong
138842,3758093312,3758094335,India
138843,3758095360,3758095871,China
138844,3758095872,3758096127,Singapore


In [7]:
## First, add a country column based on ip_address.
## find_country looks up the country for a given IP address in a reference dataframe of IP ranges.
## It returns the country name, or "Others" if the IP address falls in no range.

def find_country(ip,df):
    # Use vectorized boolean indexing for efficiency
    matches = (df['lower_bound_ip_address'] <= ip) & (df['upper_bound_ip_address'] >= ip)
    if matches.any():
        return df[matches]['country'].iloc[0]
    else:
        return "Others"
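
`find_country` scans the whole range table once per transaction. As a hedged sketch of a faster alternative (not the method used in this notebook), the lookup can be done for all IP addresses at once with `np.searchsorted` on the sorted lower bounds. The function name `find_countries` is hypothetical, and the sketch assumes the ranges do not overlap.

```python
import numpy as np
import pandas as pd


def find_countries(ips, ranges, default="Others"):
    """Vectorized range lookup: map each IP to the country whose range contains it.

    Assumes the ranges in `ranges` do not overlap (an assumption, not checked here).
    """
    ranges = ranges.sort_values("lower_bound_ip_address")
    lower = ranges["lower_bound_ip_address"].to_numpy()
    upper = ranges["upper_bound_ip_address"].to_numpy()
    country = ranges["country"].to_numpy()

    ips = np.asarray(ips)
    # Index of the last range whose lower bound is <= ip (-1 if there is none).
    idx = np.searchsorted(lower, ips, side="right") - 1
    # Clip so that idx == -1 can still be used for indexing; `inside` masks it out.
    safe_idx = np.clip(idx, 0, len(lower) - 1)
    inside = (idx >= 0) & (ips <= upper[safe_idx])
    return np.where(inside, country[safe_idx], default)
```

With the notebook's frames, the call would look like `fraud['country'] = find_countries(fraud['ip_address'], ip_country)`.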
    

In [8]:
fraud['country'] = fraud['ip_address'].apply(lambda ip: find_country(ip, ip_country))
fraud

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,country
149671,285108,2015-07-15 04:36:00,2015-09-10 14:17:00,31,HZAKVUFTDOSFD,Direct,Chrome,M,49,2818400139,0,United States
15611,131009,2015-01-24 12:29:00,2015-04-13 04:53:00,31,XGQAJSOUJIZCC,SEO,IE,F,21,3251268287,0,United Kingdom
73178,328855,2015-03-11 00:54:00,2015-04-05 12:23:00,16,VCCTAYDCWKZIY,Direct,IE,M,26,2727760440,0,United States
84546,229053,2015-01-07 13:19:00,2015-01-09 10:12:00,29,MFFIHYNXCJLEY,SEO,Chrome,M,34,2083419526,0,Korea Republic of
35978,108439,2015-02-08 21:11:00,2015-04-09 14:26:00,26,WMSXWGVPNIFBM,Ads,FireFox,M,33,3207912664,0,Brazil
...,...,...,...,...,...,...,...,...,...,...,...,...
13862,116698,2015-02-26 11:51:00,2015-04-16 22:57:00,46,UJYRDGZXTFFJG,Ads,Chrome,M,18,2509395150,0,Netherlands
122655,122699,2015-08-01 18:40:00,2015-08-25 07:56:00,26,EMMTCPTUYQYPX,Ads,IE,F,36,2946611970,0,China
125965,115120,2015-07-25 12:50:00,2015-09-03 04:10:00,41,YSZGGEARGETEU,SEO,Chrome,M,31,557062884,0,United States
31108,87098,2015-04-02 21:11:00,2015-06-22 16:51:00,50,BJDWRJULJZNOV,SEO,Chrome,F,43,2687886786,0,Switzerland


In [9]:
## I did not drop the rows with no matching country because 1) there are a lot of them (~10%) and 2) bucketing them as "Others" might also provide insight.
fraud.query('country == "Others"')

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,country
71773,323775,2015-06-30 07:34:00,2015-09-05 16:59:00,28,DLOOEWQCUQRKZ,SEO,Safari,M,47,4075993923,0,Others
51038,391908,2015-02-02 06:28:00,2015-05-28 11:10:00,32,OQSADNKPZYRIJ,SEO,Safari,F,24,3893641630,0,Others
140099,330545,2015-03-29 17:31:00,2015-07-05 07:46:00,14,QNEIKYWIQWHGH,Ads,Chrome,M,46,4111380696,0,Others
99494,83048,2015-01-26 12:53:00,2015-03-24 23:07:00,68,MVSCLPNPUCJOI,SEO,IE,M,25,3913632723,0,Others
35684,232877,2015-07-29 03:51:00,2015-07-29 09:56:00,73,DZYZRANMIRILR,SEO,Safari,M,28,4212155543,0,Others
...,...,...,...,...,...,...,...,...,...,...,...,...
112611,173124,2015-01-22 05:04:00,2015-05-14 09:46:00,46,BKLQTHLSBBFNT,SEO,FireFox,M,24,3825334549,0,Others
109225,167468,2015-07-02 21:05:00,2015-08-25 13:29:00,44,MQYOCEZHCTIIJ,SEO,Chrome,F,35,169743783,0,Others
68344,379065,2015-03-14 19:43:00,2015-05-06 15:09:00,48,TPVDXCUWUGJFV,Ads,Chrome,M,23,3778213441,1,Others
119879,19916,2015-03-15 09:51:00,2015-05-23 01:37:00,24,KTXGKQGOVLTAR,SEO,Chrome,M,32,4270132350,0,Others


In [10]:
# In addition, the datetime values need to be broken up.
# For a time like 2015-06-30 07:34:00, here are a few considerations:
## Roughly, a timestamp can be broken into four parts: year, month, day, and hour.
## Year does not matter; month might matter for seasonality; day matters mostly as weekday vs. weekend; hour matters.
### I will use the day of the week, the week of the year, and the hour.
def detail_date(df):
    df['signup_day'] = pd.to_datetime(df['signup_time']).dt.day_name()
    df['purchase_day'] = pd.to_datetime(df['purchase_time']).dt.day_name()
    df['signup_week'] = pd.to_datetime(df['signup_time']).dt.isocalendar().week
    df['purchase_week'] = pd.to_datetime(df['purchase_time']).dt.isocalendar().week
    df['signup_hour'] = pd.to_datetime(df['signup_time']).dt.hour
    df['purchase_hour'] = pd.to_datetime(df['purchase_time']).dt.hour
    return df

df = detail_date(fraud)
df

Unnamed: 0,user_id,signup_time,purchase_time,purchase_value,device_id,source,browser,sex,age,ip_address,class,country,signup_day,purchase_day,signup_week,purchase_week,signup_hour,purchase_hour
149671,285108,2015-07-15 04:36:00,2015-09-10 14:17:00,31,HZAKVUFTDOSFD,Direct,Chrome,M,49,2818400139,0,United States,Wednesday,Thursday,29,37,4,14
15611,131009,2015-01-24 12:29:00,2015-04-13 04:53:00,31,XGQAJSOUJIZCC,SEO,IE,F,21,3251268287,0,United Kingdom,Saturday,Monday,4,16,12,4
73178,328855,2015-03-11 00:54:00,2015-04-05 12:23:00,16,VCCTAYDCWKZIY,Direct,IE,M,26,2727760440,0,United States,Wednesday,Sunday,11,14,0,12
84546,229053,2015-01-07 13:19:00,2015-01-09 10:12:00,29,MFFIHYNXCJLEY,SEO,Chrome,M,34,2083419526,0,Korea Republic of,Wednesday,Friday,2,2,13,10
35978,108439,2015-02-08 21:11:00,2015-04-09 14:26:00,26,WMSXWGVPNIFBM,Ads,FireFox,M,33,3207912664,0,Brazil,Sunday,Thursday,6,15,21,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13862,116698,2015-02-26 11:51:00,2015-04-16 22:57:00,46,UJYRDGZXTFFJG,Ads,Chrome,M,18,2509395150,0,Netherlands,Thursday,Thursday,9,16,11,22
122655,122699,2015-08-01 18:40:00,2015-08-25 07:56:00,26,EMMTCPTUYQYPX,Ads,IE,F,36,2946611970,0,China,Saturday,Tuesday,31,35,18,7
125965,115120,2015-07-25 12:50:00,2015-09-03 04:10:00,41,YSZGGEARGETEU,SEO,Chrome,M,31,557062884,0,United States,Saturday,Thursday,30,36,12,4
31108,87098,2015-04-02 21:11:00,2015-06-22 16:51:00,50,BJDWRJULJZNOV,SEO,Chrome,F,43,2687886786,0,Switzerland,Thursday,Monday,14,26,21,16


In [11]:
## Last, let's drop columns we are not going to use
df = df.drop(['signup_time','purchase_time'], axis=1) 

In [12]:
df

Unnamed: 0,user_id,purchase_value,device_id,source,browser,sex,age,ip_address,class,country,signup_day,purchase_day,signup_week,purchase_week,signup_hour,purchase_hour
149671,285108,31,HZAKVUFTDOSFD,Direct,Chrome,M,49,2818400139,0,United States,Wednesday,Thursday,29,37,4,14
15611,131009,31,XGQAJSOUJIZCC,SEO,IE,F,21,3251268287,0,United Kingdom,Saturday,Monday,4,16,12,4
73178,328855,16,VCCTAYDCWKZIY,Direct,IE,M,26,2727760440,0,United States,Wednesday,Sunday,11,14,0,12
84546,229053,29,MFFIHYNXCJLEY,SEO,Chrome,M,34,2083419526,0,Korea Republic of,Wednesday,Friday,2,2,13,10
35978,108439,26,WMSXWGVPNIFBM,Ads,FireFox,M,33,3207912664,0,Brazil,Sunday,Thursday,6,15,21,14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13862,116698,46,UJYRDGZXTFFJG,Ads,Chrome,M,18,2509395150,0,Netherlands,Thursday,Thursday,9,16,11,22
122655,122699,26,EMMTCPTUYQYPX,Ads,IE,F,36,2946611970,0,China,Saturday,Tuesday,31,35,18,7
125965,115120,41,YSZGGEARGETEU,SEO,Chrome,M,31,557062884,0,United States,Saturday,Thursday,30,36,12,4
31108,87098,50,BJDWRJULJZNOV,SEO,Chrome,F,43,2687886786,0,Switzerland,Thursday,Monday,14,26,21,16


In [13]:
## Now run the same process on the no-label (test) dataset

#1
no_label = no_label.set_index('Unnamed: 0')

#2
no_label['purchase_time'] = pd.to_datetime(no_label['purchase_time'])
no_label['signup_time'] = pd.to_datetime(no_label['signup_time'])

#3
no_label['ip_address'] = no_label['ip_address'].apply(int)

#country
no_label['country'] = no_label['ip_address'].apply(lambda ip: find_country(ip, ip_country))

# date
test_df = detail_date(no_label)
test_df = test_df.drop(['signup_time','purchase_time'], axis=1) 
test_df

Unnamed: 0,user_id,purchase_value,device_id,source,browser,sex,age,ip_address,country,signup_day,purchase_day,signup_week,purchase_week,signup_hour,purchase_hour
6,159135,42,ALEYXFXINSXLZ,Ads,Chrome,M,18,2809315200,Canada,Thursday,Thursday,21,28,6,8
7,50116,11,IWKVZHJOCLPUR,Ads,Chrome,F,19,3987484329,Others,Saturday,Thursday,31,35,22,3
10,182338,62,NRFFPPHZYFUVC,Ads,IE,M,31,341674739,United States,Sunday,Monday,4,13,17,23
11,199700,13,TEPSJVVXGNTYR,Ads,Safari,F,35,1819008578,United States,Saturday,Wednesday,28,44,18,21
12,73884,58,ZTZZJUCRDOCJZ,Direct,Chrome,M,32,4038284553,Others,Friday,Tuesday,22,25,16,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151090,330979,43,WFYRTUQZQFSQJ,SEO,IE,M,34,1966801114,China,Wednesday,Thursday,33,49,17,0
151094,25306,38,GFIVGRRVFBFOF,SEO,Chrome,M,31,3097327010,United States,Friday,Wednesday,20,28,20,14
151097,27502,43,PYCNPZMYIETTA,Direct,Opera,F,30,4202835677,Others,Tuesday,Wednesday,16,26,23,10
151100,115473,61,ZRHCEVZHNIBJH,Direct,IE,M,24,3003295709,Chile,Thursday,Thursday,1,1,7,7


## Model

In [14]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer

import xgboost as xgb
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

In [15]:
# Encoding - need to change categorical variables to numerical.
## device_id :  Categorical
## source :  Categorical
## browser :  Categorical
## sex :  Categorical
## country

### I used label encoding (LE) because one-hot encoding (OHE) created a very large number of columns, mainly because of device_id.

categorical_columns = ['device_id', 'source', 'browser', 'sex', 'country','signup_day','purchase_day']


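# Label-encode each categorical column on a copy of the input frame.
# Note: a fresh LabelEncoder is fit on every call, so the integer codes
# assigned to the train and test frames are not guaranteed to match.
def label_encode_columns(df, categorical_columns):
    df_encoded = df.copy()
    for col in categorical_columns:
        le = LabelEncoder()
        df_encoded[col] = le.fit_transform(df_encoded[col])
    return df_encoded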


# Display the resulting DataFrame
df_encoded = label_encode_columns(df,categorical_columns)

test_encoded = label_encode_columns(test_df,categorical_columns)
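
Because `label_encode_columns` is called separately on `df` and `test_df`, each call fits its own encoders, so the same category (for example a country or a browser) can receive different integer codes in the two frames. The following is a minimal sketch of one way to keep the codes consistent, assuming the categorical columns contain no missing values. It uses `pd.Categorical` instead of `LabelEncoder`, and the helper names `fit_category_maps` and `apply_category_maps` are hypothetical.

```python
import pandas as pd


def fit_category_maps(train_df, categorical_columns):
    """Record the sorted unique values of each categorical column in the training frame."""
    return {col: sorted(train_df[col].unique()) for col in categorical_columns}


def apply_category_maps(df, category_maps):
    """Encode columns with the training categories; values unseen in training become -1."""
    encoded = df.copy()
    for col, categories in category_maps.items():
        encoded[col] = pd.Categorical(encoded[col], categories=categories).codes
    return encoded
```

With the notebook's frames this would be `maps = fit_category_maps(df, categorical_columns)`, followed by `apply_category_maps(df, maps)` and `apply_category_maps(test_df, maps)`. Values such as a `device_id` that appears only in the test data are mapped to -1 rather than to an arbitrary training code.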

In [16]:
# For modeling, I did not split the data: the original labeled dataset is the training set and the no-label dataset is the test set.
X_train = df_encoded.drop('class', axis=1)
y_train = df_encoded['class']

# Extract features from the testing dataset
X_test = test_encoded

### To choose which model to use, I created two functions.
def train_and_evaluate_model(model, x_train, y_train):
    # Train the model
    model.fit(x_train, y_train)

    # Predict on training data
    y_pred_train = model.predict(x_train)

    # Evaluate and return metrics on training data
    accuracy_train = accuracy_score(y_train, y_pred_train)
    precision_train = precision_score(y_train, y_pred_train)
    recall_train = recall_score(y_train, y_pred_train)
    f1_train = f1_score(y_train, y_pred_train)
    auc_train = roc_auc_score(y_train, y_pred_train)
    cm_train = confusion_matrix(y_train, y_pred_train)

    metrics_train = {
        'accuracy': accuracy_train,
        'precision': precision_train,
        'recall': recall_train,
        'f1': f1_train,
        'auc': auc_train,
        'confusion_matrix': cm_train
    }

    return metrics_train, model, y_pred_train

def compare_models(x_train, y_train):
    classifiers = {
        'MLP': MLPClassifier(solver='adam', alpha=1e-4, hidden_layer_sizes=(50, 50), random_state=1),
        'LinearDiscriminant': LinearDiscriminantAnalysis(),
        'RandomForest': RandomForestClassifier(n_estimators=200, max_depth=10, criterion='entropy', max_features='sqrt',
                                               min_samples_split=5, random_state=42, n_jobs=-1),
        'GaussianNB': GaussianNB(),
        'LogisticRegression': LogisticRegression(C=1.0, penalty='l2'),
        'XGBoost': xgb.XGBClassifier(max_depth=5, learning_rate=0.01, n_estimators=200, objective='binary:logistic',
                                      booster='gbtree', reg_alpha=0.1, reg_lambda=1, random_state=42, n_jobs=4),
        'LightGBM': LGBMClassifier(num_leaves=50, learning_rate=0.1, n_estimators=150)
    }

    metrics_results = {}
    models = {}

    for name, model in classifiers.items():
        metrics_train, trained_model, pred_train = train_and_evaluate_model(model, x_train, y_train)
        metrics_results[name] = metrics_train
        models[name] = trained_model

    metrics_df = pd.DataFrame(metrics_results).transpose()
    return metrics_df, models
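
The two functions above score each model on the same rows it was trained on, and they pass hard 0/1 predictions to `roc_auc_score`. As a hedged sketch of a complementary check, part of the labeled data can be held out and ROC AUC computed from predicted probabilities. `evaluate_on_holdout` is a hypothetical helper; the synthetic data and `LogisticRegression` below are stand-ins so the example runs on its own, not the notebook's data or chosen model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate_on_holdout(model, X, y, test_size=0.2, random_state=42):
    """Fit on one part of the labeled data and score ROC AUC on the held-out part."""
    X_fit, X_val, y_fit, y_val = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=random_state
    )
    model.fit(X_fit, y_fit)
    # Column 1 of predict_proba is the estimated probability of the positive class.
    val_proba = model.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, val_proba)


# Synthetic, imbalanced stand-in data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
demo_auc = evaluate_on_holdout(LogisticRegression(max_iter=1000), X_demo, y_demo)
print(f"hold-out ROC AUC: {demo_auc:.3f}")
```

Any classifier with `fit` and `predict_proba`, including the ones listed in `compare_models`, could be passed in place of the stand-in model.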


In [17]:
metrics_df, trained_models = compare_models(X_train, y_train)
print(metrics_df)


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


[LightGBM] [Info] Number of positive: 11265, number of negative: 108735
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002137 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1244
[LightGBM] [Info] Number of data points in the train set: 120000, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.093875 -> initscore=-2.267213
[LightGBM] [Info] Start training from score -2.267213
                    accuracy precision    recall        f1       auc  \
MLP                 0.905492  0.047619  0.000355  0.000705   0.49981   
LinearDiscriminant  0.910208  0.995951  0.043675  0.083681  0.521828   
RandomForest        0.955183  0.976217  0.535641  0.691734  0.767145   
GaussianNB          0.906125       0.0       0.0       0.0       0.5   
LogisticRegression  0.906125       0.0       0.0       0.0       0.5   
XGBoost 

### Based on these metrics, I am going to use LightGBM as the model.

In [18]:
# Initialize the LightGBM model
model = LGBMClassifier(num_leaves=50, learning_rate=0.1, n_estimators=150)

# Train the model
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

[LightGBM] [Info] Number of positive: 11265, number of negative: 108735
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001185 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1244
[LightGBM] [Info] Number of data points in the train set: 120000, number of used features: 15
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.093875 -> initscore=-2.267213
[LightGBM] [Info] Start training from score -2.267213


In [44]:
np.unique(y_pred, return_counts=True)


(array([0, 1]), array([29508,  1604]))

In [19]:
test_encoded['class'] = y_pred

In [39]:
test_encoded

Unnamed: 0,user_id,purchase_value,device_id,source,browser,sex,age,ip_address,country,signup_day,purchase_day,signup_week,purchase_week,signup_hour,purchase_hour,class
6,159135,42,469,0,0,1,18,2809315200,26,4,4,21,28,6,8,0
7,50116,11,10111,0,0,0,19,3987484329,100,2,4,31,35,22,3,0
10,182338,62,15515,0,2,1,31,341674739,141,3,1,4,13,17,23,0
11,199700,13,21976,0,4,0,35,1819008578,141,2,6,28,44,18,21,0
12,73884,58,29560,1,0,1,32,4038284553,100,0,5,22,25,16,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
151090,330979,43,25427,2,2,1,34,1966801114,29,6,4,33,49,17,0,0
151094,25306,38,7094,2,0,1,31,3097327010,141,0,6,20,28,20,14,0
151097,27502,43,18152,1,3,0,30,4202835677,100,5,6,16,26,23,10,0
151100,115473,61,29417,1,2,1,24,3003295709,28,4,4,1,1,7,7,1


### What next?

1) This exercise had some limitations. For the sake of the model, I included all features. For a more in-depth analysis, we may have to scrutinize each feature to see which ones work best for the model.
2) Many IP addresses could not be matched to a country. As a next step, we can look into these IP addresses and try to figure out why there are so many of them.
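
As a starting point for the second item, here is a minimal sketch (the function name `summarize_unmatched` is hypothetical) that groups the "Others" rows by where their IP address falls relative to the range table and reports the fraud rate in each group. In the sample of "Others" rows displayed earlier, most `ip_address` values are larger than the upper bound in the last displayed row of the range table (3758096127), so this split may explain a large share of them; that is an observation from the displayed sample only, not a verified result.

```python
import numpy as np
import pandas as pd


def summarize_unmatched(df, ranges):
    """Count unmatched rows and their fraud rate by position relative to the range table."""
    unmatched = df[df["country"] == "Others"]
    lowest = ranges["lower_bound_ip_address"].min()
    highest = ranges["upper_bound_ip_address"].max()
    # Anything that is neither below the first range nor above the last one
    # must sit in a gap between two ranges.
    position = np.select(
        [unmatched["ip_address"] < lowest, unmatched["ip_address"] > highest],
        ["below first range", "above last range"],
        default="gap between ranges",
    )
    return (
        unmatched.assign(position=position)
        .groupby("position")["class"]
        .agg(rows="count", fraud_rate="mean")
    )
```

With the notebook's frames this would be called as `summarize_unmatched(fraud, ip_country)`, before the `country` column is label-encoded.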
