## Features:

### Client:

- Client_id: Unique id for client
- District: District where the client is
- Client_catg: Category client belongs to
- Region: Area where the client is
- Creation_date: Date client joined
- Target: fraud: 1, not fraud: 0


### Invoice data

- Client_id: Unique id for the client
- Invoice_date: Date of the invoice
- Tarif_type: Type of tax
- Counter_number:
- Counter_statue: Counter status; takes up to 5 values such as working fine, not working, on hold, etc.
- Counter_code:
- Reading_remarque: Notes that the STEG agent takes during a visit to the client (e.g., if the counter shows something wrong, the agent gives a bad score)
- Counter_coefficient: An additional coefficient to be added when standard consumption is exceeded
- Consommation_level_1: Consumption_level_1
- Consommation_level_2: Consumption_level_2
- Consommation_level_3: Consumption_level_3
- Consommation_level_4: Consumption_level_4
- Old_index: Old index
- New_index: New index
- Months_number: Month number
- Counter_type: Type of counter


## Some findings

- the "test" folder is the "competition" set (no targets there), so the "train" set must itself be split into train/test sets

- do not aggregate the invoices into a shorter per-customer table; predict on TRANSACTIONS instead. (df.groupby('client_id').nunique())

- the proportion of positives is higher in the merged "transactions" table than in the "clients" table (0.08 vs. 0.06). This can be treated as a "perturbation" of positives that increases their number where they are scarce, which is one more argument in favour of predicting on transactions and then aggregating them to get a prediction for a particular customer.

- the 'months_number' column does not contain actual months, and its values do not correspond to the 'creation_date' or 'invoice_date' columns. Either keep this column without any transformation or scaling, or delete it completely, because the test set contains the same kind of weird values.

- the features ['consommation_level_1', 'consommation_level_2', 'consommation_level_3', 'consommation_level_4'] are not very promising (in terms of building a univariate logistic regression on each of them)

- the 'counter_statue' column is supposed to hold integers [0-5] but is of mixed type (object) with some bogus values. Convert it to int and drop the rows with values > 5, because the test set doesn't have any bad values in this column, only the valid integers from 0 to 5

- the search for a decent baseline model didn't give decent results. Try a non-deterministic baseline model based on the prior (i.e. the proportion of positives in the population)

- eventually agreed to predict transactions (and not fraudulent clients)

- rule-based baseline model on two rules (2005, higher consumption)
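
The prior-based, non-deterministic baseline mentioned in the findings can be sketched as follows. This is only an illustration on synthetic labels: the 8% positive rate, the seeds, and the helper name `prior_baseline_predict` are assumptions made for the example, not values taken from the notebook's data.

```python
import numpy as np

def prior_baseline_predict(y_train, n_samples, seed=0):
    """Draw random 0/1 predictions with P(1) equal to the training prior."""
    rng = np.random.default_rng(seed)
    prior = float(np.mean(y_train))              # proportion of positives
    preds = (rng.random(n_samples) < prior).astype(int)
    return preds, prior

# synthetic labels with roughly 8% positives (assumed, for illustration only)
rng = np.random.default_rng(42)
y_train = (rng.random(100_000) < 0.08).astype(int)
y_test = (rng.random(20_000) < 0.08).astype(int)

y_pred, prior = prior_baseline_predict(y_train, len(y_test))

# precision and recall of the positive class, computed by hand
tp = int(np.sum((y_pred == 1) & (y_test == 1)))
precision = tp / max(int(y_pred.sum()), 1)
recall = tp / max(int(y_test.sum()), 1)
print(f"prior={prior:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Because the predictions are independent of the labels, both precision and recall of such a baseline are expected to sit near the prior, which gives a floor that any trained model should clear. scikit-learn's `DummyClassifier(strategy='stratified')` offers a comparable baseline.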


In [31]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, classification_report
from sklearn.metrics import fbeta_score, confusion_matrix, recall_score, precision_score
from sklearn.metrics import make_scorer

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score

from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import SMOTENC

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

In [32]:
from warnings import filterwarnings
filterwarnings('ignore')

In [33]:
# TODO: random seed
...

In [34]:
# feature preprocessing functions

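def preprocess(feature, data):
    """Dispatch to the preprocessing function registered for `feature`.

    `feature` is a column name or, for a DataFrame, a column position.
    """
    functions = {'counter_statue': preprocess_counter_statue}
    if not isinstance(feature, str):
        feature = data.name if isinstance(data, pd.Series) else data.columns[feature]
    function = functions[feature]
    return function(data)


# preprocess 'counter_statue' (modifies `data` in place, returns None)
def preprocess_counter_statue(data):
    col = 'counter_statue'
    sr = data[col].astype(str)
    mask = sr.isin(list("012345"))           # valid values are the digits 0..5
    sr[~mask] = sr[mask].mode().values[0]    # bogus values -> mode of the valid ones
    data[col] = sr.astype(int)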
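
A small sketch on made-up values shows what this cleaning does to a mixed-type column. The toy values and the function name `clean_counter_statue` are hypothetical; unlike the in-place function above, this variant returns a new Series.

```python
import pandas as pd

def clean_counter_statue(series):
    """Return an int Series; anything outside '0'..'5' becomes the mode of the valid values."""
    sr = series.astype(str)
    valid = sr.isin(list("012345"))
    mode = sr[valid].mode().iloc[0]
    return sr.where(valid, mode).astype(int)

# mixed-type toy column: ints, numeric strings and bogus entries
raw = pd.Series([0, "0", 1, "5", "A", 46, 0, "269375"], dtype=object)
cleaned = clean_counter_statue(raw)
print(cleaned.tolist())
```

Filling with the mode keeps every row; dropping the invalid rows instead would be `raw[raw.astype(str).isin(list("012345"))]`.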


In [35]:
path1 = "data/train/client_train.csv"
path2 = "data/train/invoice_train.csv"

path3 = "data/test/client_test.csv"
path4 = "data/test/invoice_test.csv"

In [36]:
# load the data

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2, low_memory=False)     # low_memory=False: avoid chunk-wise dtype inference on mixed-type columns

df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)

In [37]:
# join tables

# data from the "train" folder (will have to be split into train/test)
df_entire = df1.merge(df2, on='client_id', how='outer')

# data from the "test" folder (doesn't contain targets)
df_test_zindi = df3.merge(df4, on='client_id', how='outer')
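
The findings note that the share of positives is higher in the merged table than in the clients table. A toy sketch of why a join can shift that share is given below; the four clients and six invoices are invented for illustration, and only the column names `client_id`, `target` and `consommation_level_1` come from the data description.

```python
import pandas as pd

# one fraudulent client out of four (made-up data)
clients = pd.DataFrame({
    "client_id": ["a", "b", "c", "d"],
    "target": [1, 0, 0, 0],
})

# the fraudulent client "a" happens to have more invoices than the others
invoices = pd.DataFrame({
    "client_id": ["a", "a", "a", "b", "c", "d"],
    "consommation_level_1": [10, 20, 30, 5, 7, 9],
})

merged = clients.merge(invoices, on="client_id", how="outer")

client_rate = clients["target"].mean()    # share of positives per client
invoice_rate = merged["target"].mean()    # share of positives per invoice row
print(client_rate, invoice_rate)
```

The row-level share exceeds the client-level share whenever positive clients have more invoices on average than negative ones.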


In [38]:
# converts all values to int, fills bad values with the mode
preprocess('counter_statue', df_entire)

In [39]:
# feature engineering 

df_entire['year_created'] = pd.to_datetime(df_entire['creation_date'],
                                           format="%d/%m/%Y").dt.year
dates = pd.to_datetime(df_entire['invoice_date'])
df_entire['invoice_year'] = dates.dt.year
df_entire['invoice_month'] = dates.dt.month
df_entire['invoice_weekday'] = dates.dt.weekday
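
To make the date features concrete, here is a minimal sketch on two made-up rows. The example dates and the ISO format of `invoice_date` are assumptions chosen for illustration; only the column names come from the cell above.

```python
import pandas as pd

# two made-up rows; creation_date is day/month/year, invoice_date is assumed ISO formatted
toy = pd.DataFrame({
    "creation_date": ["31/12/1994", "29/05/2002"],
    "invoice_date": ["2014-03-24", "2013-03-29"],
})

toy["year_created"] = pd.to_datetime(toy["creation_date"], format="%d/%m/%Y").dt.year
dates = pd.to_datetime(toy["invoice_date"], format="%Y-%m-%d")
toy["invoice_year"] = dates.dt.year
toy["invoice_month"] = dates.dt.month
toy["invoice_weekday"] = dates.dt.weekday   # Monday = 0 ... Sunday = 6

print(toy[["year_created", "invoice_year", "invoice_month", "invoice_weekday"]])
```

Passing an explicit `format` avoids ambiguity between day-first and month-first strings.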

In [40]:
# drop the observations before 2005

YEAR = 2005
df_entire.drop(df_entire.index[df_entire['invoice_year'] < YEAR], axis=0, inplace=True)

# reset index
df_entire.reset_index(drop=True, inplace=True)

In [41]:
# label-encode 'counter_type' as integer codes

df_entire['counter_type'], _ = pd.factorize(df_entire['counter_type'])

In [42]:
# drop these columns

cols = [
    'disrict', 'client_id', 'creation_date', 'invoice_date', 'old_index', 'months_number',
    'client_catg', 'counter_statue', 'reading_remarque', 'counter_coefficient','counter_type'
        ]

df_entire.drop(cols, axis=1, inplace=True)

## Separate X, y

In [43]:
y_entire = df_entire.pop('target')

## Encode

In [44]:
categoricals = [col for col in df_entire.columns if df_entire[col].nunique() < 20]
non_cats = [col for col in df_entire.columns if df_entire[col].nunique() >= 20]

nd_cats_encoded = OrdinalEncoder().fit_transform(df_entire[categoricals]).astype(int)
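
The cardinality-based split and the ordinal encoding can be illustrated on a tiny frame. The values and the threshold `MAX_LEVELS = 4` are assumptions for the toy example (the cell above uses 20); the sketch only shows that each category is replaced by its rank among the sorted unique values.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

toy = pd.DataFrame({
    "tarif_type": [11, 40, 11, 15],
    "new_index": [14384, 13678, 14747, 14849],
})

MAX_LEVELS = 4   # hypothetical threshold, scaled down for the toy data
categoricals = [c for c in toy.columns if toy[c].nunique() < MAX_LEVELS]
non_cats = [c for c in toy.columns if toy[c].nunique() >= MAX_LEVELS]

# each category is mapped to its position in the sorted list of unique values
encoded = OrdinalEncoder().fit_transform(toy[categoricals]).astype(int)
print(categoricals, non_cats)
print(encoded[:, 0])
```

This is presumably why low-cardinality columns such as `invoice_year` show up as small integers rather than calendar years in the outputs that follow.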

In [45]:
df_entire = pd.concat(
    [pd.DataFrame(nd_cats_encoded, columns=categoricals, index=df_entire.index),
     df_entire[non_cats]],
    axis=1)

df_entire.head()

Unnamed: 0,client_catg,tarif_type,counter_statue,reading_remarque,counter_coefficient,counter_type,invoice_year,invoice_month,invoice_weekday,region,counter_number,counter_code,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,new_index,year_created
0,0,3,0,3,1,0,9,2,0,101,1335667,203,82,0,0,0,14384,1994
1,0,3,0,1,1,0,8,2,4,101,1335667,203,1200,184,0,0,13678,1994
2,0,3,0,3,1,0,10,2,0,101,1335667,203,123,0,0,0,14747,1994
3,0,3,0,3,1,0,10,6,0,101,1335667,207,102,0,0,0,14849,1994
4,0,3,0,4,1,0,11,10,3,101,1335667,207,572,0,0,0,15638,1994


## Split the data

In [46]:
df_train, df_test, y_train, y_test = train_test_split(df_entire, y_entire, 
                                                      test_size=0.2, stratify=y_entire)
Xtrain = df_train.values
Xtest = df_test.values
ytrain = y_train.values
ytest = y_test.values
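
The split above has no fixed seed (the random-seed cell is still a TODO), so results change between runs. A minimal sketch of a reproducible stratified split, on synthetic data: the value `SEED = 42`, the array shapes and the 8% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42   # hypothetical seed value
rng = np.random.default_rng(SEED)
X = rng.normal(size=(10_000, 3))
y = (rng.random(10_000) < 0.08).astype(int)   # imbalanced synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED)

# a second call with the same seed returns the same partition
_, _, _, y_te_again = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED)

print(y_tr.mean(), y_te.mean())   # stratification keeps both rates close
```

Passing the same `random_state` to the later `train_test_split` call and to the models would make the whole notebook repeatable.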

## Upsample data with SMOTE

### sample before SMOTE

In [47]:
train_size = 0.2    # 0.99    for final training

df_sample, df_val, y_sample, y_val = train_test_split(df_train, y_train, 
                                                      train_size=train_size, stratify=y_train)

In [48]:
df_sample.head()

Unnamed: 0,client_catg,tarif_type,counter_statue,reading_remarque,counter_coefficient,counter_type,invoice_year,invoice_month,invoice_weekday,region,counter_number,counter_code,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,new_index,year_created
3760902,0,3,0,1,1,0,6,10,0,302,252704,203,325,0,0,0,12606,1998
3507255,0,13,0,4,1,1,4,11,3,101,169636,5,184,0,0,0,402,2008
3073669,0,13,0,4,1,1,11,0,1,101,4868738,5,6,0,0,0,33,1985
3689928,0,13,0,3,1,1,9,4,3,106,20850259,10,3297,0,0,0,77077,2009
2113945,0,6,0,4,1,0,14,3,1,307,890484,413,746,0,0,0,9397,1986


In [49]:
sm = SMOTENC(categorical_features=list(range(len(categoricals))),
             k_neighbors=3,
             sampling_strategy=0.9)
df_smote, y_smote = sm.fit_resample(df_sample, y_sample)

df_smote.tail(5)

Unnamed: 0,client_catg,tarif_type,counter_statue,reading_remarque,counter_coefficient,counter_type,invoice_year,invoice_month,invoice_weekday,region,counter_number,counter_code,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,new_index,year_created
1246856,0,12,0,1,1,0,9,4,6,105,5641,410,1,0,0,0,12353,2008
1246857,0,13,0,4,1,1,13,6,1,166,27624,65,503,0,0,0,11046,1998
1246858,0,13,0,3,1,1,11,7,0,311,105514,5,150,0,0,0,5006,1988
1246859,0,3,0,4,1,0,11,10,1,231,56233,338,470,0,0,0,13822,2008
1246860,0,13,0,3,1,1,12,4,4,106,132162,5,160,0,0,0,4997,1998


In [50]:
# proportion of positives after upsampling
y_smote.sum() / len(y_smote)

0.4736839150474672
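
This value follows from `sampling_strategy=0.9`: with a float, imbalanced-learn treats it as the target ratio of minority to majority samples after resampling. A quick arithmetic check (the helper name `positive_share` is made up for this sketch):

```python
def positive_share(minority_to_majority_ratio):
    """Share of positives in the resampled set, given n_minority / n_majority."""
    return minority_to_majority_ratio / (1.0 + minority_to_majority_ratio)

share = positive_share(0.9)
print(round(share, 4))
```

The result, 0.9 / 1.9, agrees with the printed proportion to about six decimal places; the small remaining gap is presumably due to sample counts being whole numbers.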

In [51]:
# re-assign for convenience
df, y = df_smote, y_smote
X,y = df_smote.values, y_smote.values

## Feature Selection

### features to use:

In [52]:
"""
region                  0.230020
counter_number          0.184581
year_created            0.068360
interaction             0.061893
ratio                   0.057657
new_index               0.055945
counter_code            0.040542
consommation_level_1    0.038644
invoice_year            0.037021
invoice_month           0.029143
invoice_weekday         0.020443
tarif_type              0.019677
""";

## Feature engineering

In [57]:

def feature_engineer(df):
    """Add 'ratio' and 'interaction' columns in place; NaN and +/-inf become 0."""
    infs = (-np.inf, np.inf)
    df['ratio'] = (df['counter_number'] / df['new_index']).fillna(0).replace(infs, 0)
    product = (df['counter_number'] * df['new_index']).fillna(0).replace(infs, 0)
    df['interaction'] = np.log(product).fillna(0).replace(infs, 0)   # log(0) = -inf -> 0
    return df
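
The edge cases handled by the two features are easier to see on a toy frame. The sketch below is a standalone variant written for illustration: the function name `safe_ratio_and_log_product` and the three rows are assumptions, and it works on a copy instead of modifying its input.

```python
import numpy as np
import pandas as pd

def safe_ratio_and_log_product(df):
    """Return a copy with 'ratio' and 'interaction'; non-finite results are set to 0."""
    out = df.copy()
    ratio = out["counter_number"] / out["new_index"]
    out["ratio"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
    product = (out["counter_number"] * out["new_index"]).astype(float)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_product = np.log(product)          # log(0) gives -inf
    out["interaction"] = log_product.replace([np.inf, -np.inf], np.nan).fillna(0)
    return out

toy = pd.DataFrame({
    "counter_number": [1335667, 0, 5],
    "new_index": [14384, 10, 0],     # last row would divide by zero
})
res = safe_ratio_and_log_product(toy)
print(res)
```

Rows with a zero in either column end up with 0 for both features, so a model cannot tell them apart from genuinely small values; a separate indicator column may be worth considering.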

In [58]:
df = feature_engineer(df)
df_val = feature_engineer(df_val)
df_test = feature_engineer(df_test)

X = df.values

df_test.head()

Unnamed: 0,client_catg,tarif_type,counter_statue,reading_remarque,counter_coefficient,counter_type,invoice_year,invoice_month,invoice_weekday,region,...,new_index,year_created,ratio,interaction,ratio14,ratio41,ratio23,ratio32,interaction14,interaction23
2904543,0,3,0,4,1,0,10,3,1,103,...,641,2014,280.734789,18.563469,0.0,0.0,0.0,0.0,0,0
4151256,0,3,0,4,1,0,13,1,6,301,...,23823,2014,47.766822,24.023145,0.0,0.0,0.0,0.0,0,0
1378330,0,3,0,1,1,0,7,2,2,101,...,99251,2001,1.042408,23.052348,0.0,0.0,0.0,0.0,0,0
435619,0,3,0,1,1,0,3,6,5,104,...,9756,2006,15.444034,21.108498,0.0,0.0,0.0,0.0,0,0
3776864,0,13,0,1,1,1,6,4,3,101,...,1866,2008,1609.677385,22.446894,0.0,0.0,0.0,0.0,0,0


In [60]:
# function printing model report
def report(model, Xtest, ytest):
    print("Test set:")
    ypred = model.predict(Xtest)
    print(classification_report(ytest, ypred))

    # probability of the positive class (second column of predict_proba)
    probs = model.predict_proba(Xtest)[:, 1]
    auc = roc_auc_score(ytest, probs)
    print("AUC =", round(auc, 2))
 

## Decision Tree

In [61]:
from sklearn.tree import DecisionTreeClassifier

md = DecisionTreeClassifier(max_depth=40)
md.fit(X,y)

pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)

counter_number          0.218179
region                  0.207302
consommation_level_1    0.095769
year_created            0.088821
interaction             0.068108
ratio                   0.058956
new_index               0.058266
counter_code            0.049340
invoice_year            0.039486
invoice_month           0.022808
consommation_level_2    0.021376
tarif_type              0.019255
invoice_weekday         0.015927
reading_remarque        0.013093
client_catg             0.009755
ratio32                 0.002250
counter_statue          0.002070
ratio23                 0.001857
counter_type            0.001688
consommation_level_4    0.001297
interaction23           0.001093
ratio14                 0.001090
ratio41                 0.000987
consommation_level_3    0.000853
interaction14           0.000377
counter_coefficient     0.000000
dtype: float64

In [62]:
report(md, df_test, y_test)

Test set:
              precision    recall  f1-score   support

         0.0       0.96      0.90      0.93    820306
         1.0       0.31      0.51      0.39     70622

    accuracy                           0.87    890928
   macro avg       0.63      0.70      0.66    890928
weighted avg       0.90      0.87      0.89    890928

AUC = 0.71


## Random Forest

In [63]:
md = RandomForestClassifier(n_estimators=100, max_depth=50)
md.fit(X, y)

pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)

region                  0.165482
counter_number          0.124430
consommation_level_1    0.099388
year_created            0.085877
interaction             0.085241
ratio                   0.083255
new_index               0.081270
counter_code            0.054922
invoice_year            0.053862
invoice_month           0.039174
invoice_weekday         0.029999
consommation_level_2    0.028103
reading_remarque        0.015996
tarif_type              0.015725
client_catg             0.007210
consommation_level_3    0.004971
interaction23           0.003683
interaction14           0.003572
ratio23                 0.003511
ratio32                 0.003121
counter_type            0.002963
counter_statue          0.002588
ratio14                 0.002136
consommation_level_4    0.001757
ratio41                 0.001727
counter_coefficient     0.000038
dtype: float64

In [64]:
report(md, df_test, y_test)

Test set:
              precision    recall  f1-score   support

         0.0       0.94      0.97      0.96    820306
         1.0       0.47      0.31      0.37     70622

    accuracy                           0.92    890928
   macro avg       0.70      0.64      0.66    890928
weighted avg       0.90      0.92      0.91    890928

AUC = 0.81
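
The findings argue for predicting on transactions and then aggregating to a customer-level prediction. A minimal sketch of that last step on made-up scores: it assumes the `client_id` of each row was kept aside before the column was dropped, and the mean aggregation and the 0.5 threshold are illustrative choices, not something the notebook prescribes.

```python
import pandas as pd

# made-up per-transaction fraud probabilities, with client ids kept aside
scored = pd.DataFrame({
    "client_id": ["a", "a", "a", "b", "b", "c"],
    "proba": [0.9, 0.7, 0.8, 0.1, 0.3, 0.6],
})

# one row per client: average and maximum score, plus the number of invoices
client_scores = scored.groupby("client_id")["proba"].agg(["mean", "max", "count"])

THRESHOLD = 0.5   # hypothetical decision threshold
client_pred = (client_scores["mean"] >= THRESHOLD).astype(int)
print(client_scores)
print(client_pred)
```

The mean smooths out single noisy invoices, while the maximum flags a client as soon as one invoice looks suspicious; which one fits better depends on how fraud shows up in the data.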



before new engineered features (class 1.0: precision, recall, f1-score)

Decision Tree: 0.31      0.51      0.39

Random Forest: 0.47      0.31      0.38


after new engineered features + old deleted
