## Exploration

The data comes without any documentation, so we have no intuition about how it was collected or which heuristics might lead to an accurate method. The variables therefore need to be investigated.

The CSV files use a semicolon as the separator and European-style decimal numbers with a comma. We will load them into a DataFrame and do some exploration.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

In [2]:
df = pd.read_csv('../data/training.csv', sep=';', decimal=",")
df_valid = pd.read_csv('../data/validation.csv', sep=';', decimal=",")

In [3]:
df.describe()

| | v2 | v3 | v8 | v11 | v14 | v15 | v17 | v19 |
|---|---|---|---|---|---|---|---|---|
| count | 3661.0 | 3700.0 | 3700.0 | 3700.0 | 3600.0 | 3700.0 | 3600.0 | 3700.0 |
| mean | 32.820713 | 0.000585 | 3.439496 | 4.16 | 162.695 | 2246.705946 | 1626950.0 | 0.925405 |
| std | 12.666181 | 0.00054 | 4.335229 | 6.750553 | 156.045682 | 8708.571126 | 1560457.0 | 0.262772 |
| min | 13.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 23.0 | 0.00015 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 28.67 | 0.000425 | 1.75 | 2.0 | 120.0 | 113.0 | 1200000.0 | 1.0 |
| 75% | 40.83 | 0.000963 | 5.0 | 6.0 | 280.0 | 1059.75 | 2800000.0 | 1.0 |
| max | 80.25 | 0.0028 | 28.5 | 67.0 | 1160.0 | 100000.0 | 11600000.0 | 1.0 |


`describe()` only summarizes the numeric columns. Of these, v19 is actually a boolean variable; the remaining columns, not shown here, are discrete.

* **Discrete:** v1 v4 v5 v6 v7 v9(bool) v10(bool) v12(bool) v13 v18(NaN) v19(bool)
* **Continuous:** v2 v3 v8 v11 v14 v15 v17

There are also some missing values to deal with. Common approaches include:

* discarding (drop the column or row with NaN)
* imputation (fill with a constant)
* using methods that can take missing values as input (ANN, the Gradient Boosting framework, ...)
* ...
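
The first two options can be sketched with pandas on a toy frame. The values below are made up for illustration, and the threshold of 3 non-null values and the use of the median are assumptions, not choices made in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in a numeric and a categorical column.
toy = pd.DataFrame({
    "v2": [23.0, np.nan, 40.5, 28.0],
    "v18": ["t", None, None, "f"],
})

# 1) Discarding: drop rows with any NaN, or drop columns that are mostly empty.
rows_kept = toy.dropna(axis=0)
cols_kept = toy.dropna(axis=1, thresh=3)  # keep columns with at least 3 non-null values

# 2) Imputation: fill the numeric gaps with a constant or a column statistic.
filled_const = toy["v2"].fillna(0)
filled_median = toy["v2"].fillna(toy["v2"].median())

print(rows_kept.shape, list(cols_kept.columns))
print(filled_const.tolist(), filled_median.tolist())
```

Dropping rows is only reasonable when few rows are affected; a column such as v18, which is empty in most rows, would cost far too much data that way.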

Here is the number of `NaN` values per field:

In [4]:
not_null = []
for c in df.columns:
    nulls = df[c].isna().sum()
    if nulls==0:
        not_null.append(c)
    print(f'{c}: {nulls}')

not_null

v1: 39
v2: 39
v3: 0
v4: 64
v5: 64
v6: 66
v7: 66
v8: 0
v9: 0
v10: 0
v11: 0
v12: 0
v13: 0
v14: 100
v15: 0
v17: 100
v18: 2145
v19: 0
classLabel: 0


['v3', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v15', 'v19', 'classLabel']

Another thing to take into account is that the class labels are imbalanced: the vast majority of rows are labelled `yes.`. _Accuracy_ alone is therefore not a good performance metric for this problem. Since the ultimate objective is not specified, we will use the F1 score.

In [5]:
df['classLabel'].value_counts()

yes.    3424
no.      276
Name: classLabel, dtype: int64
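
To see why accuracy is misleading here, the following sketch works out by hand what a trivial classifier that always answers `yes.` would score on the counts above. It is plain arithmetic and does not touch the data.

```python
# Class counts taken from the value_counts() output above.
n_yes, n_no = 3424, 276
total = n_yes + n_no

# A trivial classifier that always predicts "yes." is right on every "yes." row.
accuracy = n_yes / total

# F1 with "yes." as the positive class: every row is predicted positive.
precision = n_yes / total   # TP / (TP + FP)
recall = 1.0                # TP / (TP + FN), no positives are missed
f1_yes = 2 * precision * recall / (precision + recall)

# F1 with "no." as the positive class: it is never predicted, so TP = 0.
f1_no = 0.0

print(f"accuracy={accuracy:.4f}  F1(yes)={f1_yes:.4f}  F1(no)={f1_no:.4f}")
```

On these counts the trivial classifier reaches about 92.5% accuracy. F1 with the majority class as positive is also high (about 0.96), so it may be worth reporting a score for the minority class as well.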

## Binary Classification

Before doing anything fancy, it is better to create a simple baseline for comparison. For this purpose, we will create a simple logistic regression classifier.

Continuous, non-null fields do not require any preprocessing. Let's check how a simple method performs on the training dataset with 10-fold cross-validation.

In [6]:
seed = 42
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

In [7]:
fields = ['v3', 'v8', 'v11', 'v15', 'v19']

X = df[fields].values
y = (df['classLabel'] == 'yes.').astype(int).values

X_valid = df_valid[fields].values
y_valid = (df_valid['classLabel'] == 'yes.').astype(int).values

In [8]:
estimator = LogisticRegression(solver='lbfgs', max_iter=500, random_state=seed)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)

results = cross_val_score(estimator, X, y, cv=kfold, scoring='f1')
print('Results: %.2f%% (%.2f%%)' % (results.mean()*100, results.std()*100))

Results: 100.00% (0.00%)


In [9]:
sum(df['v19'] != (df['classLabel'] == 'yes.').astype(int))

0

Apparently the v19 field is identical to the label in the training set. How does the model score on the validation set?

In [10]:
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
y_valid_hat = model.predict(X_valid)
print(f1_score(y_valid, y_valid_hat))

0.48453608247422686


Despite the perfect score on the training set, the model performs poorly on the validation set. The training set is strongly biased relative to the validation set, which suggests that v19 does not track the label there the way it does in training.
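
One way to probe this kind of mismatch would be to measure, for each split, how often a candidate column coincides with the label. The sketch below uses a hypothetical helper `label_agreement` and made-up toy frames, not the real CSV files.

```python
import pandas as pd

def label_agreement(frame, column, label="classLabel", positive="yes."):
    """Fraction of rows where a 0/1 column equals the binarized label."""
    target = (frame[label] == positive).astype(int)
    return float((frame[column] == target).mean())

# Hypothetical toy splits, not the real data.
toy_train = pd.DataFrame({"v19": [1, 1, 0, 1],
                          "classLabel": ["yes.", "yes.", "no.", "yes."]})
toy_valid = pd.DataFrame({"v19": [1, 0, 1, 0],
                          "classLabel": ["yes.", "yes.", "no.", "no."]})

train_agreement = label_agreement(toy_train, "v19")
valid_agreement = label_agreement(toy_valid, "v19")
print(train_agreement, valid_agreement)
```

A value of 1.0 in one split and a much lower value in the other would indicate that the column leaks the label in only one of them.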

## Balancing Train and Test

A possible reason is that the second CSV file was collected in a different setting. This resembles the **MNIST** vs **NIST** (the original dataset) case, where the samples were collected from different places. Thus, we need to merge the two sets and normalize them before training our model.

We will conduct another experiment with the simple classifier. First, the discrete variables need to be encoded, either as dummy variables or as a single 1/0 variable. Since the logistic regression estimator cannot work with `NaN` values, they will be filled with zero.

In [11]:
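def prepare_discrete_vars(df):
    """Encode the discrete columns as 0/1 indicators."""
    df = df.copy()  # leave the caller's frame untouched

    # non-null binary string columns become a single 1/0 value
    # ('t' maps to 1, anything else to 0)
    bin_strs = ['v9', 'v10', 'v12']
    for col in bin_strs:
        df[col] = (df[col] == 't').astype(int)

    # remaining discrete columns get one dummy column per category;
    # rows with NaN end up with zeros in all of that column's dummies
    other_discrete_vars = ['v1', 'v4', 'v5', 'v6', 'v7', 'v13', 'v18']
    for col in other_discrete_vars:
        dummies = pd.get_dummies(df[col], prefix=col)
        df = pd.concat([df.drop([col], axis=1), dummies], axis=1)

    return df

# merge training and validation rows; ignore_index avoids duplicate index labels
df = pd.concat([df, df_valid], ignore_index=True)
df = prepare_discrete_vars(df)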
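
To make the encoding concrete, here is a small sketch on a made-up three-row frame. The column names mirror the dataset, but the values are hypothetical. It shows how a missing category is represented after `pd.get_dummies`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows: one binary string column, one multi-valued column with a gap.
toy = pd.DataFrame({"v9": ["t", "f", "t"], "v13": ["g", "p", np.nan]})

# Binary string column collapses to a single 0/1 indicator.
toy["v9"] = (toy["v9"] == "t").astype(int)

# Multi-valued column becomes one indicator column per category;
# a NaN row gets zeros in every indicator (dummy_na defaults to False).
dummies = pd.get_dummies(toy["v13"], prefix="v13").astype(int)
encoded = pd.concat([toy.drop(columns=["v13"]), dummies], axis=1)
print(encoded)
```

The last row, whose `v13` is missing, ends up with zeros in both `v13_g` and `v13_p`, so the missing value is implicitly treated as "none of the known categories".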

In [12]:
X = df.drop(['classLabel'], axis=1).fillna(0).values
y = (df['classLabel'] == 'yes.').astype(int).values

estimator = LogisticRegression(solver='lbfgs', max_iter=500, random_state=seed)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, y, cv=kfold, scoring='f1')
print('Results: %.2f%% (%.2f%%)' % (results.mean()*100, results.std()*100))

Results: 94.89% (0.11%)


Since the biased validation set makes up only about 5% of the merged dataset, a score of roughly 95% is not quite satisfactory: a model could plausibly reach that level while still failing on most of the validation rows. Better performance can be achieved with a more robust method that can also take `NaN` values as input: gradient boosting.

In [13]:
len(df_valid)/len(df)

0.05128205128205128

In [14]:
from xgboost import XGBClassifier

fields = ['v2', 'v3', 'v8', 'v9', 'v10', 'v11', 'v12', 'v14', 'v15', 'v17', 'v19',
       'v1_a', 'v1_b', 'v4_l', 'v4_u', 'v4_y', 'v5_g', 'v5_gg',
       'v5_p', 'v6_W', 'v6_aa', 'v6_c', 'v6_cc', 'v6_d', 'v6_e', 'v6_ff',
       'v6_i', 'v6_j', 'v6_k', 'v6_m', 'v6_q', 'v6_r', 'v6_x', 'v7_bb',
       'v7_dd', 'v7_ff', 'v7_h', 'v7_j', 'v7_n', 'v7_o', 'v7_v', 'v7_z',
       'v13_g', 'v13_p', 'v13_s', 'v18_f', 'v18_t']

X = df[fields].fillna(0).values
y = (df['classLabel'] == 'yes.').astype(int).values

estimator = XGBClassifier()
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(estimator, X, y, cv=kfold, scoring='f1')
print('Results: %.2f%% (%.2f%%)' % (results.mean()*100, results.std()*100))

Results: 99.15% (0.27%)


Quite satisfactory! 