In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
%matplotlib inline

In [2]:
df = pd.read_csv('train_sample.csv')

In [6]:
df.shape

(100000, 8)

### Introduction
The problem is to classify fraudulent ads. This fits an almost standard classification task in ML, where the data are sampled iid from some distribution, the features are real valued and the labels are in {0,1}.
Our goal is to find a predictor that minimizes a measure of the misclassification error. Due to its "nice" properties (convexity and smoothness), a common choice is the logistic loss.
An important aspect of any classification problem is having enough samples of both classes. In problems such as fraud detection this is often not the case. Ignoring the imbalance will most likely produce a classifier that favors the majority class: it reaches very high accuracy while missing most of the minority class. One way to spot this issue is to look at the false negatives, namely the positive samples that the classifier labels as negative (in hypothesis-testing terms, failing to reject the null when the alternative is true).
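
To make the accuracy trap concrete, here is a minimal sketch in plain Python. The label vector is made up (1,000 samples with 10 positives) and is not taken from the dataset; the "classifier" simply predicts the majority class everywhere.

```python
# Hypothetical labels: 1,000 samples, only 10 positives (1% minority class).
y_true = [1] * 10 + [0] * 990

# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: fraction of samples labelled correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the positive class: TP / (TP + FN).
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
false_neg = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = true_pos / (true_pos + false_neg)

print(f"accuracy={accuracy:.3f} recall={recall:.3f} false_negatives={false_neg}")
```

Accuracy looks excellent while every positive sample is missed, which is why the false negatives (or the recall) are the quantity to watch.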

I have treated problems like this in the past, and I am comfortable with a library called imblearn, which provides effective methods for dealing with an under-represented class, along with convenient reporting metrics.
I'll get back to this later. Now let's look at the data.

This is a sample dataset with 8 columns: 6 of them are features, 'is_attributed' is the target label, and 'attributed_time' is non-null only when 'is_attributed' is 1. I will remove 'attributed_time' to avoid data leakage.

In [12]:
df.isna().sum(), df.loc[df.is_attributed ==1, 'is_attributed'].count(), (df.is_attributed == 1).sum()

(ip                     0
 app                    0
 device                 0
 os                     0
 channel                0
 click_time             0
 attributed_time    99773
 is_attributed          0
 dtype: int64, 227, 227)

In [13]:
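# Drop the leaky column. Note: index=1 additionally drops the row labelled 1
# (visible in df.head() below); omit it to remove only the column.
df.drop(columns='attributed_time', inplace=True, index=1)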

Now, click_time is a date stored as a string. The other columns are categorical variables that take unordered integer values. Such values can be hard for a classifier to digest, so I convert click_time to Unix time and rescale every column to [0, 1] with min-max scaling (minmax_scale), which keeps the ordering and relative spacing of the original integer codes.

In [14]:
df.head()

       ip  app  device  os  channel           click_time  is_attributed
0   87540   12       1  13      497  2017-11-07 09:30:38              0
2  101424   12       1  19      212  2017-11-07 18:05:24              0
3   94584   13       1  13      477  2017-11-07 04:58:08              0
4   68413   12       1   1      178  2017-11-09 09:00:09              0
5   93663    3       1  17      115  2017-11-09 01:22:13              0


In [17]:
df.click_time = pd.to_datetime(df.click_time).astype(np.int64)// 10**9

In [18]:
from sklearn.preprocessing import minmax_scale

In [26]:
x,y = df.iloc[:, :-1], df.iloc[:, -1]

In [27]:
x = x.apply(minmax_scale)
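
As a sanity check of what the rescaling does to a single column, here is a plain-Python sketch of the min-max formula x' = (x - min) / (max - min). The helper `min_max` is a hypothetical stand-in written for illustration, not scikit-learn's implementation; the input values are the channel codes from the df.head() output.

```python
def min_max(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map everything to 0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Channel codes of the five rows shown by df.head().
channel_codes = [497, 212, 477, 178, 115]
scaled = min_max(channel_codes)

for code, value in zip(channel_codes, scaled):
    print(f"{code:4d} -> {value:.3f}")
```

The largest code maps to 1, the smallest to 0, and the ordering of the codes is unchanged, so the scaled column still treats the categorical codes as if they were ordered numbers.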

Before diving into the model construction, let's check two things:
1. Are the classes balanced?
2. Is there any correlation among the features?
The second question matters because highly correlated features make the fit of a linear classifier ill-conditioned: the coefficients become unstable and the separating boundary may be poorly estimated. A common remedy is regularization.
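
The effect of correlation and of regularization can be sketched with NumPy on synthetic data. This is an illustration on a least-squares style Gram matrix, under the assumption that the same intuition carries over to an L2-penalized logistic regression (where the penalty also adds a multiple of the identity to the curvature). The data, the noise level and the strength `lam` are all made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two almost identical (highly correlated) synthetic features.
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Gram matrix of the features: nearly singular when columns are collinear.
gram = X.T @ X

# L2 (ridge) regularization adds lam * I, lifting the small eigenvalue.
lam = 1.0  # hypothetical regularization strength
gram_ridge = gram + lam * np.eye(X.shape[1])

cond_plain = np.linalg.cond(gram)
cond_ridge = np.linalg.cond(gram_ridge)
print(f"condition number without penalty: {cond_plain:.3e}")
print(f"condition number with L2 penalty: {cond_ridge:.3e}")
```

A large condition number means small changes in the data can move the fitted coefficients a lot; the penalty brings it down by several orders of magnitude in this toy setting.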

In [32]:
x[y==1].corr()

                  ip       app    device        os   channel  click_time
ip          1.000000  0.023187  0.047121  0.175618 -0.066273    0.393515
app         0.023187  1.000000 -0.067034  0.118395 -0.061293   -0.050037
device      0.047121 -0.067034  1.000000  0.003587  0.034337    0.024414
os          0.175618  0.118395  0.003587  1.000000  0.013170    0.073776
channel    -0.066273 -0.061293  0.034337  0.013170  1.000000    0.080859
click_time  0.393515 -0.050037  0.024414  0.073776  0.080859    1.000000


As the count below shows, the classes are heavily unbalanced: only 227 samples out of roughly 100,000 are positive. To treat this I will resort to one of imblearn's combined methods. In particular I will use SMOTEENN, which oversamples the positive (minority) class with SMOTE and then removes noisy samples using Edited Nearest Neighbours, a cleaning rule based on a nearest-neighbour classifier.

Let me also prepare for the classification task with a train/test split.
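
The oversampling half of this idea can be sketched in a few lines of NumPy. This is a simplified, SMOTE-like interpolation written for illustration only: the function `smote_like`, its parameters and the toy points are hypothetical, it is not imblearn's implementation, and the nearest-neighbour cleaning step is not shown.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest neighbours of every sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base sample and one of its neighbours for each new point.
    base = rng.integers(0, len(minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position along the connecting segment.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[picked] - minority[base])

# Toy minority-class points in a 2-D feature space scaled to [0, 1].
minority = np.array([[0.10, 0.20], [0.15, 0.25], [0.30, 0.10],
                     [0.20, 0.40], [0.25, 0.30]])
synthetic = smote_like(minority, n_new=10)
print(synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority samples, the new points stay inside the region already occupied by the minority class.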

In [42]:
(y==1).sum()

227

In [35]:
from imblearn.combine import SMOTEENN
from imblearn.metrics import classification_report_imbalanced
from sklearn.model_selection import train_test_split, KFold

In [36]:
x_tr, x_ts, y_tr, y_ts = train_test_split(x,y, test_size=0.3)

In [46]:
y_tr.sum(),y_ts.sum()

(147, 80)

In [47]:
sampler = SMOTEENN()

In [49]:
x_tr, y_tr = sampler.fit_resample(x_tr, y_tr)

The fit_resample method added synthetic samples to the training set until the ratio between the two classes reached the desired one.
We can now use this set to train our classifiers. I will use three different models: a logistic regression trained via a quasi-Newton method, a logistic regression trained via SGD, and a tree-based method.

In [51]:
y_tr.mean()

0.5067515572140026

In [52]:
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

In [63]:
clf = LogisticRegression(solver='lbfgs').fit(x_tr, y_tr)
y_hat = clf.predict(x_ts)
lg = clf.coef_
print(classification_report_imbalanced(y_ts, y_hat))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.81      0.80      0.90      0.81      0.65     29920
          1       0.01      0.80      0.81      0.02      0.81      0.65        80

avg / total       1.00      0.81      0.80      0.89      0.81      0.65     30000
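
To read the columns of a report like the one above, it helps to recompute them from confusion-matrix counts. The counts below are hypothetical: they were chosen to be of the same magnitude as the class-1 row but are not taken from the notebook's run. The sketch covers pre, rec, spe, f1 and geo (taken here as the geometric mean of recall and specificity); iba is left out.

```python
import math

# Hypothetical confusion-matrix counts for the positive class.
tp, fn = 64, 16        # 80 positive test samples
tn, fp = 24235, 5685   # 29,920 negative test samples

precision = tp / (tp + fp)      # how many predicted positives are real
recall = tp / (tp + fn)         # how many real positives are found
specificity = tn / (tn + fp)    # how many real negatives are found
f1 = 2 * precision * recall / (precision + recall)
geo = math.sqrt(recall * specificity)

print(f"pre={precision:.2f} rec={recall:.2f} spe={specificity:.2f} "
      f"f1={f1:.2f} geo={geo:.2f}")
```

With so few positives, even a moderate false-positive rate on the negative class swamps the true positives, which is how recall can be high while precision stays close to zero.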



In [64]:
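# Note: SGDClassifier defaults to loss='hinge' (a linear SVM); pass loss='log_loss'
# (loss='log' in older scikit-learn releases) to fit a logistic regression.
clf = SGDClassifier().fit(x_tr, y_tr)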
y_hat = clf.predict(x_ts)
sgd = clf.coef_
print(classification_report_imbalanced(y_ts, y_hat))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.82      0.76      0.90      0.79      0.63     29920
          1       0.01      0.76      0.82      0.02      0.79      0.62        80

avg / total       1.00      0.82      0.76      0.90      0.79      0.63     30000



In [65]:
clf = GradientBoostingClassifier().fit(x_tr, y_tr)
y_hat = clf.predict(x_ts)
gb = clf.feature_importances_
print(classification_report_imbalanced(y_ts, y_hat))

                   pre       rec       spe        f1       geo       iba       sup

          0       1.00      0.97      0.89      0.99      0.93      0.87     29920
          1       0.08      0.89      0.97      0.14      0.93      0.85        80

avg / total       1.00      0.97      0.89      0.98      0.93      0.87     30000



As I said in the introduction, I'm interested in a classifier that has high accuracy and high recall for the under-represented class. Both linear models perform well on recall, despite the correlation between features that we observed before (both models use L2 regularization). The best performing model is gradient boosting.
Nevertheless, the fact that I can train a linear classifier via SGD is a big plus given the huge amount of data this algorithm will eventually have to handle.
Before moving on to the construction of the pipeline, let's check which features have been used by the classifiers.

In [104]:
feature_table = pd.DataFrame(np.vstack([lg, sgd, gb[None,:]]).T, index=df.columns[:-1], columns=['lg','sgd','gb'])

In [113]:
feature_table.T

           ip        app    device        os   channel  click_time
lg   3.511568  42.453719 -1.557906 -2.775102 -1.483510   -0.461967
sgd  3.196274  23.950105 -0.311267 -1.600711 -1.173156   -0.716618
gb   0.143240   0.519748  0.137280  0.049512  0.141638    0.008583


Despite the difference in scale, all three classifiers agree on the two most important features: app first, then ip. The ranking of the remaining features differs from model to model.
This part is done. I'll move on to implementing the pipeline for training over the entire dataset.
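
To check the ordering explicitly, the rows of the table can be ranked by magnitude. This is a small plain-Python sketch using the values printed above; ranking linear coefficients by absolute value is an assumption made here for comparison with the (non-negative) gradient boosting importances.

```python
features = ["ip", "app", "device", "os", "channel", "click_time"]

# Values copied from the feature_table.T output.
table = {
    "lg":  [3.511568, 42.453719, -1.557906, -2.775102, -1.48351, -0.461967],
    "sgd": [3.196274, 23.950105, -0.311267, -1.600711, -1.173156, -0.716618],
    "gb":  [0.14324, 0.519748, 0.13728, 0.049512, 0.141638, 0.008583],
}

# Rank features from most to least important by absolute value.
rankings = {
    model: [name for _, name in
            sorted(zip(map(abs, values), features), reverse=True)]
    for model, values in table.items()
}

for model, order in rankings.items():
    print(f"{model:>3}: {', '.join(order)}")
```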