## Features

### Client data

- Client_id: Unique id for client
- District: District where the client is
- Client_catg: Category client belongs to
- Region: Area where the client is
- Creation_date: Date client joined
- Target: 1 = fraud, 0 = not fraud


### Invoice data

- Client_id: Unique id for the client
- Invoice_date: Date of the invoice
- Tarif_type: Type of tax
- Counter_number:
- Counter_statue: takes up to 5 values such as working fine, not working, on hold status, etc.
- Counter_code:
- Reading_remarque: notes that the STEG agent takes during a visit to the client (e.g. if the counter shows something wrong, the agent gives a bad score)
- Counter_coefficient: An additional coefficient to be added when standard consumption is exceeded
- Consommation_level_1: Consumption_level_1
- Consommation_level_2: Consumption_level_2
- Consommation_level_3: Consumption_level_3
- Consommation_level_4: Consumption_level_4
- Old_index: Old index
- New_index: New index
- Months_number: Month number
- Counter_type: Type of counter


## Some findings

- the "test" folder is the "competition" set (no targets there), so the "train" set must be split into train/test sets

- must not aggregate the invoices into a shorter per-customer table; instead predict on TRANSACTIONS (see `df.groupby('client_id').nunique()`)

- the proportion of positives is higher in the merged "transactions" table than in the "clients" table (about 0.08 versus 0.06). This can be treated as a "perturbation" of positives that increases their number where they are scarce, i.e. it is one more argument in favour of predicting on transactions and then aggregating them to get a prediction for a particular customer

- the 'months_number' column does not contain actual months, and its values do not correspond to the 'creation_date' or 'invoice_date' columns. Either keep this column without any transformation or scaling, or delete it completely, because the test set contains the same kind of weird values

- the features ['consommation_level_1', 'consommation_level_2', 'consommation_level_3', 'consommation_level_4'] are not very promising (in terms of univariate logistic regressions built on them)

- the 'counter_statue' column is supposed to hold integers [0-5] but is of mixed type (object) with some bogus values. Convert it to int and drop the rows with values > 5, because the test set doesn't have any bad values in this column, only the valid integers from 0 to 5

- the search for a decent baseline model didn't give decent results. Try a non-deterministic baseline model based on the prior (i.e. the proportion of positives in the population)

- eventually agreed to predict transactions (and not fraudulent clients)

- rule-based baseline model on two rules (2005, higher consumption)
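
The prior-based, non-deterministic baseline mentioned in the findings can be sketched with scikit-learn's `DummyClassifier(strategy="stratified")`, which draws random labels following the class proportions seen during `fit`. The synthetic labels and the 8% positive rate below are assumptions for illustration only, not the competition data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the transactions table: ~8% positives (assumed rate)
rng = np.random.default_rng(0)
y_demo = (rng.random(10_000) < 0.08).astype(int)
X_demo = np.zeros((len(y_demo), 1))   # the dummy model ignores the features

# 'stratified' predicts random labels that follow the training class distribution
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_demo, y_demo)
y_pred = baseline.predict(X_demo)

predicted_rate = y_pred.mean()
baseline_f1 = f1_score(y_demo, y_pred)
print(f"share of predicted positives: {predicted_rate:.3f}")
print(f"F1 of the prior-based baseline: {baseline_f1:.3f}")
```

Any real model should beat the scores of such a baseline by a clear margin.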


In [131]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, classification_report
from sklearn.metrics import fbeta_score, confusion_matrix, recall_score, precision_score
from sklearn.metrics import make_scorer

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score

from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import SMOTENC

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier

In [135]:
from warnings import filterwarnings
filterwarnings('ignore')

In [98]:
# TODO: random seed
...
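
A minimal sketch of what the random-seed TODO could look like: one constant passed as `random_state` wherever randomness is involved (`train_test_split`, `SMOTENC` and the scikit-learn estimators all accept this parameter). The name `SEED`, its value and the toy arrays are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42  # arbitrary value, chosen only for illustration

# Toy data standing in for df_entire / y_entire
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 90 + [1] * 10)

# Passing the same random_state reproduces the same stratified split
split_a = train_test_split(X_demo, y_demo, test_size=0.2,
                           stratify=y_demo, random_state=SEED)
split_b = train_test_split(X_demo, y_demo, test_size=0.2,
                           stratify=y_demo, random_state=SEED)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
```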

In [99]:
# feature preprocessing functions

def preprocess(feature, data):
    functions = {'counter_statue': preprocess_counter_statue}
    feature = feature if type(feature) is str else data.name if type(data) is pd.Series else data.columns[feature]
    function = functions[feature]
    return function(data)

    
# preprocess 'counter_statue'
def preprocess_counter_statue(data):
    col = 'counter_statue'
    sr = data[col].astype(str)
    mask = sr.isin(list("012345"))
    sr[~mask] = sr[mask].mode().values[0]
    data[col] = sr.astype(int)
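
For comparison, a sketch of the same cleaning with `pd.to_numeric`, showing both options discussed in this notebook: filling invalid entries with the mode (as the function above does) and dropping them (as the findings suggest). The toy values are made up. Unlike the string-based check above, this variant would also accept entries such as `"1.0"`.

```python
import pandas as pd

# Toy column with the kind of mixed values described in the findings (made up)
toy = pd.DataFrame({"counter_statue": [0, "1", "5", "A", 769, "0", 1, "0"]})

# Non-numeric entries become NaN instead of raising
codes = pd.to_numeric(toy["counter_statue"], errors="coerce")
valid = codes.between(0, 5)   # NaN compares as False

# Option 1: replace invalid entries with the mode of the valid ones
filled = codes.where(valid, codes[valid].mode().iloc[0]).astype(int)

# Option 2: drop the invalid rows altogether
dropped = toy.loc[valid].assign(counter_statue=codes[valid].astype(int))

print(filled.tolist())
print(len(dropped))
```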


In [100]:
path1 = "data/train/client_train.csv"
path2 = "data/train/invoice_train.csv"

path3 = "data/test/client_test.csv"
path4 = "data/test/invoice_test.csv"

In [101]:
# load the data

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2, low_memory=False)     # read in one pass so mixed-type columns get a single dtype

df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)

In [102]:
# join tables

# data from the "train" folder (will have to be split into train/test)
df_entire = df1.merge(df2, on='client_id', how='outer')

# data from the "test" folder (doesn't contain targets)
df_test_zindi = df3.merge(df4, on='client_id', how='outer')
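
An outer join silently keeps clients without invoices and invoices without clients. A possible sanity check, sketched on invented toy tables, uses the `validate` and `indicator` parameters of `DataFrame.merge`; whether the real tables contain unmatched rows is not known from this notebook.

```python
import pandas as pd

# Toy client and invoice tables (values invented for illustration)
clients = pd.DataFrame({"client_id": ["a", "b", "c"], "target": [0, 1, 0]})
invoices = pd.DataFrame({"client_id": ["a", "a", "b", "d"],
                         "new_index": [10, 20, 5, 7]})

# validate raises MergeError if client_id is duplicated in the client table;
# indicator adds a '_merge' column telling which table each row came from
merged = clients.merge(invoices, on="client_id", how="outer",
                       validate="one_to_many", indicator=True)

unmatched = merged[merged["_merge"] != "both"]
print(merged["_merge"].value_counts())
print(unmatched)
```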


In [103]:
# converts all values to int, fills bad values with the mode
preprocess('counter_statue', df_entire)

In [104]:
# feature engineering 

df_entire['year_created'] = pd.to_datetime(df_entire['creation_date'],
                                           format="%d/%m/%Y").dt.year
dates = pd.to_datetime(df_entire['invoice_date'])
df_entire['invoice_year'] = dates.dt.year
df_entire['invoice_month'] = dates.dt.month
df_entire['invoice_weekday'] = dates.dt.weekday
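
The date features can be illustrated on a toy frame. Note that `dt.weekday` counts from Monday = 0. The ISO format used for `invoice_date` here is an assumption (the cell above lets pandas infer it), and the dates themselves are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "creation_date": ["31/12/1994", "01/02/2003"],
    "invoice_date": ["2014-03-24", "2013-03-29"],   # ISO format is an assumption
})

created = pd.to_datetime(toy["creation_date"], format="%d/%m/%Y")
# errors='coerce' turns unparseable dates into NaT instead of raising
invoiced = pd.to_datetime(toy["invoice_date"], format="%Y-%m-%d", errors="coerce")

features = pd.DataFrame({
    "year_created": created.dt.year,
    "invoice_year": invoiced.dt.year,
    "invoice_month": invoiced.dt.month,
    "invoice_weekday": invoiced.dt.weekday,   # Monday = 0, Sunday = 6
})
print(features)
```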

In [105]:
# drop the observations before 2005

YEAR = 2005
df_entire.drop(df_entire.index[df_entire['invoice_year'] < YEAR], axis=0, inplace=True)

# reset index
df_entire.reset_index(drop=True, inplace=True)

In [106]:
# label-encode 'counter_type'

df_entire['counter_type'], _ = pd.factorize(df_entire['counter_type'])

In [107]:
# drop these columns

cols = [
    'disrict', 'client_id', 'creation_date', 'invoice_date', 'old_index', 'months_number',
    # drop more
    'client_catg', 'counter_statue', 'reading_remarque', 'counter_coefficient','counter_type'
        ]

df_entire.drop(cols, axis=1, inplace=True)

## Separate X, y

In [108]:
y_entire = df_entire.pop('target')

## Encode

In [109]:
categoricals = [col for col in df_entire.columns if df_entire[col].nunique() < 20]
non_cats = [col for col in df_entire.columns if df_entire[col].nunique() >= 20]

In [110]:
#enc = OrdinalEncoder().fit(df_entire[categoricals])
nd_cats_encoded = OrdinalEncoder().fit_transform(df_entire[categoricals]).astype(int)
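
The commented-out line hints at keeping the fitted encoder. A sketch of why that matters: a stored encoder can apply the same mapping to new data (for example the competition set), and categories unseen during `fit` can be mapped to a sentinel. The toy values and the sentinel `-1` are assumptions; `handle_unknown="use_encoded_value"` requires scikit-learn 0.24 or later.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns (values invented for illustration)
train_cats = pd.DataFrame({"tarif_type": [11, 40, 11, 15],
                           "counter_code": [203, 207, 203, 5]})
new_cats = pd.DataFrame({"tarif_type": [40, 99],      # 99 was never seen in fit
                         "counter_code": [5, 203]})

# Unseen categories are encoded as -1 instead of raising an error
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
enc.fit(train_cats)

encoded_new = enc.transform(new_cats).astype(int)
print(encoded_new)
```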

In [111]:
df_entire = pd.concat([pd.DataFrame(nd_cats_encoded,
                                    columns=categoricals,
                                    index=df_entire.index),
                       df_entire[non_cats]], axis=1)

df_entire.head()

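| | tarif_type | invoice_year | invoice_month | invoice_weekday | region | counter_number | counter_code | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | new_index | year_created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 9 | 2 | 0 | 101 | 1335667 | 203 | 82 | 0 | 0 | 0 | 14384 | 1994 |
| 1 | 3 | 8 | 2 | 4 | 101 | 1335667 | 203 | 1200 | 184 | 0 | 0 | 13678 | 1994 |
| 2 | 3 | 10 | 2 | 0 | 101 | 1335667 | 203 | 123 | 0 | 0 | 0 | 14747 | 1994 |
| 3 | 3 | 10 | 6 | 0 | 101 | 1335667 | 207 | 102 | 0 | 0 | 0 | 14849 | 1994 |
| 4 | 3 | 11 | 10 | 3 | 101 | 1335667 | 207 | 572 | 0 | 0 | 0 | 15638 | 1994 |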


## Split the data

In [112]:
df_train, df_test, y_train, y_test = train_test_split(df_entire, y_entire, 
                                                      test_size=0.2, stratify=y_entire)
Xtrain = df_train.values
Xtest = df_test.values
ytrain = y_train.values
ytest = y_test.values


## Upsample data with SMOTE

### sample before SMOTE

In [113]:
train_size = 0.2    # 0.99    for final training

df_sample, df_val, y_sample, y_val = train_test_split(df_train, y_train, 
                                                      train_size=train_size, stratify=y_train)

In [114]:
df_sample.head()

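| | tarif_type | invoice_year | invoice_month | invoice_weekday | region | counter_number | counter_code | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | new_index | year_created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3584144 | 13 | 14 | 1 | 4 | 306 | 339511 | 5 | 75 | 0 | 0 | 0 | 1419 | 2001 |
| 3931611 | 13 | 11 | 6 | 2 | 103 | 4883508 | 5 | 0 | 0 | 0 | 0 | 1 | 2015 |
| 2173263 | 3 | 7 | 1 | 4 | 303 | 79645 | 413 | 342 | 0 | 0 | 0 | 8992 | 2003 |
| 2860252 | 3 | 12 | 10 | 3 | 301 | 2168701822800 | 413 | 155 | 0 | 0 | 0 | 155 | 2017 |
| 3809100 | 13 | 13 | 5 | 2 | 313 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 2012 |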


In [115]:
sm = SMOTENC(categorical_features=list(range(len(categoricals))),
             k_neighbors=3,
             sampling_strategy=0.9)
df_smote, y_smote = sm.fit_resample(df_sample, y_sample)

df_smote.tail(5)

| | tarif_type | invoice_year | invoice_month | invoice_weekday | region | counter_number | counter_code | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | new_index | year_created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 738269 | 2 | 2 | 11 | 4 | 101 | 9522292 | 202 | 200 | 341 | 0 | 0 | 51419 | 1978 |
| 738270 | 3 | 13 | 11 | 3 | 225 | 19701 | 203 | 800 | 338 | 208 | 0 | 11667 | 1983 |
| 738271 | 3 | 9 | 5 | 4 | 301 | 553643 | 206 | 435 | 0 | 0 | 0 | 16555 | 2001 |
| 738272 | 3 | 11 | 11 | 4 | 313 | 306583 | 203 | 192 | 0 | 0 | 0 | 3412 | 2012 |
| 738273 | 3 | 13 | 4 | 2 | 324 | 2175702237817 | 205 | 499 | 0 | 0 | 0 | 3957 | 1993 |


In [116]:
y_smote.sum() / len(y_smote)

0.3333328818297814
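
For a binary target, imbalanced-learn reads a float `sampling_strategy` as the desired ratio of minority to majority samples after resampling, so the resulting share of positives is r / (1 + r). The helper below (a hypothetical name, not part of any library) just does that arithmetic: a share of about 0.333 corresponds to a ratio of 0.5, while 0.9 would give about 0.474, so the printed value may come from a run with a different setting.

```python
def positive_share(sampling_strategy: float) -> float:
    """Share of positives when minority / majority == sampling_strategy."""
    return sampling_strategy / (1.0 + sampling_strategy)


for ratio in (0.5, 0.9, 1.0):
    print(f"sampling_strategy={ratio}: positive share = {positive_share(ratio):.4f}")
```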

In [117]:
# re-assign for convenience
df = df_smote
X, y = df_smote.values, y_smote.values

## Feature Selection

### features to use:

In [118]:
"""
region                  0.230020
counter_number          0.184581
year_created            0.068360
interaction             0.061893
ratio                   0.057657
new_index               0.055945
counter_code            0.040542
consommation_level_1    0.038644
invoice_year            0.037021
invoice_month           0.029143
invoice_weekday         0.020443
tarif_type              0.019677
""";

## Feature engineering

In [119]:

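def feature_engineer(df):
    df = df.copy()  # avoid mutating slices returned by train_test_split
    inf = [-np.inf, np.inf]
    # counter_number / new_index; results of division by zero (inf, NaN) become 0
    df['ratio'] = (df['counter_number'] / df['new_index']).replace(inf, 0).fillna(0)
    # log of the product; log(0) = -inf and log of a negative = NaN become 0
    df['interaction'] = np.log(df['counter_number'] * df['new_index']).replace(inf, 0).fillna(0)
    return df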
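
A quick check of the two engineered features on a toy frame. The first row reuses values from this notebook's test-set preview; the zero rows are made up to show how division by zero and `log(0)` end up as 0. `np.errstate` is used here only to silence the expected NumPy warnings.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"counter_number": [102094, 0, 500],
                    "new_index": [1771, 0, 0]})
inf = [-np.inf, np.inf]

with np.errstate(divide="ignore", invalid="ignore"):
    # 0/0 gives NaN and 500/0 gives inf; both are mapped to 0
    ratio = (toy["counter_number"] / toy["new_index"]).replace(inf, 0).fillna(0)
    # log(0) gives -inf, which is mapped to 0 as well
    interaction = np.log(toy["counter_number"] * toy["new_index"]).replace(inf, 0).fillna(0)

print(ratio.round(4).tolist())
print(interaction.round(4).tolist())
```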

In [120]:
df = feature_engineer(df)
df_val = feature_engineer(df_val)
df_test = feature_engineer(df_test)

X = df.values

df_test.head()



| | tarif_type | invoice_year | invoice_month | invoice_weekday | region | counter_number | counter_code | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | new_index | year_created | ratio | interaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3580164 | 13 | 4 | 5 | 0 | 101 | 102094 | 5 | 138 | 0 | 0 | 0 | 1771 | 1988 | 57.647657 | 19.012949 |
| 1238594 | 13 | 10 | 8 | 4 | 313 | 6983911 | 5 | 55 | 0 | 0 | 0 | 1187 | 2004 | 5883.665543 | 22.838304 |
| 4231640 | 2 | 10 | 10 | 3 | 101 | 474973 | 202 | 140 | 0 | 0 | 0 | 9110 | 2001 | 52.137541 | 22.188141 |
| 2520407 | 2 | 6 | 4 | 3 | 104 | 9532931 | 202 | 400 | 1367 | 0 | 0 | 57578 | 1978 | 165.565511 | 27.031159 |
| 128750 | 13 | 5 | 7 | 0 | 301 | 658598 | 5 | 27 | 0 | 0 | 0 | 3927 | 2009 | 167.710211 | 21.6735 |


In [132]:
# function printing model report
def report(model, X,y, Xtest=None, ytest=None):
    CV=3
    ytrue = y
    
    """
    ypred = cross_val_predict(model, X,y, cv=CV)
    print("Train set:")
    print(classification_report(ytrue, ypred))
    """
    
    # f1
    #f1 = cross_val_score(model, X,y, scoring='f1', cv=CV)
    #print("f1:", f1.round(3), f1.mean().round(3))

    # f2
    scorer = make_scorer(fbeta_score, beta=2)
    #f2 = cross_val_score(model, X,y, scoring=scorer, cv=CV)
    #print("f2:", f2.round(3), f2.mean().round(3))

    # AUC
    #ppred = cross_val_predict(model, X,y, cv=CV, method='predict_proba')[:,-1]
    #auc = roc_auc_score(ytrue, ppred)
    #print("AUC =", auc.round(2))
    #print("---"*20)
    
    if Xtest is not None and ytest is not None:
        print("\nTest set:")
        ypred = model.predict(Xtest)
        print(classification_report(ytest, ypred))
        
        probs = model.predict_proba(Xtest)[:,-1]
        auc = roc_auc_score(ytest, probs)
        print("AUC =", auc.round(2))
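
The F2 scorer created inside `report` is not used by the active code. A sketch of how it plugs into `cross_val_score`, on synthetic data from `make_classification`; the dataset, the model settings and the variable names are assumptions for illustration, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import cross_val_score

# Imbalanced synthetic data: roughly 10% positives
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.9, 0.1], random_state=0)

# beta=2 weights recall higher than precision
f2_scorer = make_scorer(fbeta_score, beta=2)

model = RandomForestClassifier(n_estimators=20, random_state=0)
f2 = cross_val_score(model, X_demo, y_demo, scoring=f2_scorer, cv=3)
print("f2:", f2.round(3), f2.mean().round(3))
```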
 

## AdaBoost

In [133]:
md = AdaBoostClassifier()
md.fit(X,y)

sr_imp = pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)
sr_imp

region                  0.46
counter_code            0.26
consommation_level_1    0.08
consommation_level_2    0.06
year_created            0.06
tarif_type              0.02
invoice_weekday         0.02
consommation_level_3    0.02
interaction             0.02
invoice_year            0.00
invoice_month           0.00
counter_number          0.00
consommation_level_4    0.00
new_index               0.00
ratio                   0.00
dtype: float64

In [134]:
report(md, X, y, df_test, y_test)


Test set:




              precision    recall  f1-score   support

         0.0       0.92      0.98      0.95    820306
         1.0       0.21      0.05      0.08     70622

    accuracy                           0.91    890928
   macro avg       0.57      0.52      0.52    890928
weighted avg       0.87      0.91      0.88    890928





AUC = 0.61


## Gradient Boosting

In [122]:
md = GradientBoostingClassifier()
md.fit(X,y)

sr_imp = pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)
sr_imp

region                  0.559402
counter_code            0.105649
consommation_level_1    0.090039
year_created            0.072453
tarif_type              0.064410
consommation_level_2    0.053056
counter_number          0.017650
invoice_year            0.012173
consommation_level_3    0.011123
interaction             0.007027
invoice_weekday         0.002604
new_index               0.001954
consommation_level_4    0.001740
invoice_month           0.000525
ratio                   0.000197
dtype: float64

In [126]:
report(md, X, y, df_test, y_test)

------------------------------------------------------------

Test set:




              precision    recall  f1-score   support

         0.0       0.92      0.99      0.96    820306
         1.0       0.28      0.05      0.08     70622

    accuracy                           0.91    890928
   macro avg       0.60      0.52      0.52    890928
weighted avg       0.87      0.91      0.89    890928





AUC = 0.65


## Decision Tree

In [127]:
from sklearn.tree import DecisionTreeClassifier

md = DecisionTreeClassifier(max_depth=40)
md.fit(X,y)

sr_imp = pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)
sr_imp

region                  0.250786
counter_number          0.193765
consommation_level_1    0.094208
year_created            0.078354
interaction             0.065468
counter_code            0.056551
ratio                   0.055268
new_index               0.053965
invoice_year            0.041285
invoice_month           0.030478
tarif_type              0.028351
consommation_level_2    0.023563
invoice_weekday         0.020981
consommation_level_3    0.004265
consommation_level_4    0.002711
dtype: float64

In [128]:
report(md, X, y, df_test, y_test)

------------------------------------------------------------

Test set:




              precision    recall  f1-score   support

         0.0       0.95      0.91      0.93    820306
         1.0       0.30      0.44      0.35     70622

    accuracy                           0.87    890928
   macro avg       0.62      0.67      0.64    890928
weighted avg       0.90      0.87      0.88    890928





AUC = 0.68


## Random Forest

In [129]:
md = RandomForestClassifier(n_estimators=100)
md.fit(X, y)

sr_imp = pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)
sr_imp

region                  0.187411
counter_number          0.115868
consommation_level_1    0.100112
year_created            0.082788
interaction             0.081581
ratio                   0.079442
new_index               0.078699
counter_code            0.070905
invoice_year            0.053993
invoice_month           0.041858
consommation_level_2    0.032000
invoice_weekday         0.031809
tarif_type              0.026031
consommation_level_3    0.012160
consommation_level_4    0.005346
dtype: float64

In [130]:
report(md, X, y, df_test, y_test)

------------------------------------------------------------

Test set:




              precision    recall  f1-score   support

         0.0       0.93      0.98      0.96    820306
         1.0       0.48      0.20      0.28     70622

    accuracy                           0.92    890928
   macro avg       0.71      0.59      0.62    890928
weighted avg       0.90      0.92      0.90    890928





AUC = 0.77
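
The models above score individual invoices. To get the per-client prediction planned in the findings, the invoice scores still have to be aggregated by client. One possible aggregation, sketched on made-up scores, is the mean predicted probability per `client_id`; the column name `proba` and the 0.5 threshold are assumptions, and `client_id` would have to be kept aside before it is dropped from the feature table.

```python
import pandas as pd

# Toy invoice-level scores; client ids and probabilities are invented
scores = pd.DataFrame({
    "client_id": ["a", "a", "a", "b", "b", "c"],
    "proba":     [0.20, 0.90, 0.70, 0.05, 0.15, 0.40],
})

# Mean predicted fraud probability per client (one possible aggregation)
client_scores = scores.groupby("client_id")["proba"].mean()

# Hypothetical decision threshold on the aggregated score
THRESHOLD = 0.5
client_pred = (client_scores >= THRESHOLD).astype(int)
print(client_pred)
```

Other aggregations (maximum, share of invoices above a threshold) are equally plausible and would need to be compared on validation data.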
