## Features:

### Client:

- Client_id: Unique id for client
- District: District where the client is
- Client_catg: Category client belongs to
- Region: Area where the client is
- Creation_date: Date client joined
- Target: fraud:1 , not fraud: 0


### Invoice data

- Client_id: Unique id for the client
- Invoice_date: Date of the invoice
- Tarif_type: Type of tax
- Counter_number:
- Counter_statue: Counter status; takes up to 5 values such as working fine, not working, on hold, etc.
- Counter_code:
- Reading_remarque: Notes that the STEG agent takes during a visit to the client (e.g. if the counter shows something wrong, the agent gives a bad score)
- Counter_coefficient: An additional coefficient to be added when standard consumption is exceeded
- Consommation_level_1: Consumption_level_1
- Consommation_level_2: Consumption_level_2
- Consommation_level_3: Consumption_level_3
- Consommation_level_4: Consumption_level_4
- Old_index: Old index
- New_index: New index
- Months_number: Month number
- Counter_type: Type of counter


## Some findings

- The "test" folder is the "competition" set (no targets there). Therefore the "train" set must itself be split into train/test sets.

- Do not aggregate the invoices into a shorter table with one row per customer. Instead, predict on TRANSACTIONS. (df.groupby('client_id').nunique())

- The proportion of positives is higher in the merged "transactions" table than in the "clients" table (about 0.08 vs. 0.06). This can be treated as a "perturbation" of the positives that increases their number where they are scarce. It is one more argument in favour of predicting on transactions and then aggregating the predictions to get a prediction for a particular customer.

- The 'months_number' column does not contain actual months, and its values do not correspond to the 'creation_date' or 'invoice_date' columns. Either keep this column without any transformation or scaling, or delete it completely, because the test set contains the same kind of odd values.

- The features ['consommation_level_1', 'consommation_level_2', 'consommation_level_3', 'consommation_level_4'] are not very promising (in terms of building a univariate logistic regression on each of them).

- The 'counter_statue' column is supposed to hold integers [0-5] but is of mixed type (object) with some bogus values. Convert to int and drop the rows with values > 5, because the test set doesn't have any bad values in this column, only the valid integers from 0 to 5.

- The search for a decent baseline model didn't give decent results. Try a non-deterministic baseline model based on the prior (i.e. the proportion of positives in the population).

- Eventually agreed to predict fraudulent transactions (and not fraudulent clients).

- Rule-based baseline model on two rules (2005, higher consumption).
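
The prior-based, non-deterministic baseline mentioned in the findings can be sketched with scikit-learn's `DummyClassifier(strategy="stratified")`, which draws random labels according to the class proportions seen during fitting. This is a minimal sketch on synthetic labels: the 8% positive rate, the array names `X_toy`/`y_toy`, and the seed are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score, recall_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the transaction targets: roughly 8% positives (assumption).
y_toy = (rng.random(10_000) < 0.08).astype(int)
X_toy = np.zeros((y_toy.size, 1))  # the dummy model ignores the features

# "stratified" predicts labels at random, following the training class prior.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_toy, y_toy)
y_pred = baseline.predict(X_toy)

print("share of positives in y:     ", round(float(y_toy.mean()), 3))
print("share of predicted positives:", round(float(y_pred.mean()), 3))
print("f1:    ", round(float(f1_score(y_toy, y_pred)), 3))
print("recall:", round(float(recall_score(y_toy, y_pred)), 3))
```

Any real model should beat the F1 and recall of this baseline by a clear margin; otherwise it has learned little beyond the class prior.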


In [1]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, classification_report
from sklearn.metrics import fbeta_score, confusion_matrix, recall_score, precision_score
from sklearn.metrics import make_scorer

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score

from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import SMOTENC

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [2]:
from warnings import filterwarnings
filterwarnings('ignore')

In [3]:
# TODO: random seed
...
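
The seed cell above is still a placeholder, and later cells reference `CV` and `N_ITER`, which are not defined anywhere in this section. A minimal sketch of what such a setup cell could look like follows; `SEED`, its value, the number of folds, and the value of `N_ITER` are all assumptions, not the notebook's actual settings.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

SEED = 42     # hypothetical seed value
N_ITER = 10   # hypothetical number of random-search draws

# Seed NumPy's global generator; estimators and splitters still need
# random_state=SEED passed explicitly to be reproducible.
np.random.seed(SEED)

# Hypothetical CV splitter; stratification keeps the fraud rate equal across folds.
CV = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
```

With this in place, `train_test_split(..., random_state=SEED)` and `DecisionTreeClassifier(random_state=SEED)` would make reruns comparable.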

In [4]:
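# feature preprocessing functions

def preprocess(feature, data):
    """Dispatch to the preprocessing function registered for `feature`.

    `feature` may be a column name or a positional column index.
    The selected function modifies `data` in place.
    """
    functions = {'counter_statue': preprocess_counter_statue}
    if not isinstance(feature, str):
        feature = data.name if isinstance(data, pd.Series) else data.columns[feature]
    return functions[feature](data)


# preprocess 'counter_statue'
def preprocess_counter_statue(data):
    """Coerce 'counter_statue' to int in place; invalid values become the mode."""
    col = 'counter_statue'
    sr = data[col].astype(str)
    mask = sr.isin(list("012345"))           # valid codes are the single digits 0..5
    sr[~mask] = sr[mask].mode().values[0]    # replace bogus values with the most frequent valid code
    data[col] = sr.astype(int)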
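
The findings suggest dropping rows with invalid `counter_statue` values, while the function above fills them with the mode. The sketch below shows both options side by side on a toy frame. The helper name `clean_counter_statue`, its `how` parameter, and the toy values are hypothetical; unlike the notebook's function, it returns a copy instead of modifying its input.

```python
import pandas as pd


def clean_counter_statue(df, col="counter_statue", how="mode"):
    """Return a copy of df with `col` coerced to int codes 0..5.

    how="mode": replace invalid values with the most frequent valid code.
    how="drop": remove rows with invalid values.
    """
    out = df.copy()
    as_str = out[col].astype(str)
    valid = as_str.isin(list("012345"))
    if how == "drop":
        out = out.loc[valid].copy()
        out[col] = as_str[valid].astype(int)
    else:
        fill = as_str[valid].mode().iloc[0]
        out[col] = as_str.where(valid, fill).astype(int)
    return out


# Toy column with mixed types and two bogus entries ("A" and 769).
toy = pd.DataFrame({"counter_statue": [0, "1", "A", 5, 769, "0", 0]})

filled = clean_counter_statue(toy)["counter_statue"].tolist()
dropped = clean_counter_statue(toy, how="drop")["counter_statue"].tolist()
print(filled)    # [0, 1, 0, 5, 0, 0, 0]
print(dropped)   # [0, 1, 5, 0, 0]
```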


In [5]:
path1 = "data/train/client_train.csv"
path2 = "data/train/invoice_train.csv"

path3 = "data/test/client_test.csv"
path4 = "data/test/invoice_test.csv"

In [6]:
# load the data

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2, low_memory=False)     # read in one pass so mixed-type columns get a single dtype

df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)

In [7]:
# join tables

# data from the "train" folder (will have to be split into train/test)
df_entire = df1.merge(df2, on='client_id', how='outer')

# data from the "test" folder (doesn't contain targets)
df_test_zindi = df3.merge(df4, on='client_id', how='outer')
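
The join is one client to many invoices, which is why the share of positives differs between the clients table and the merged transactions table. The sketch below illustrates this on toy frames; the toy values are made up, and the `validate` and `indicator` arguments are optional safety checks that the original cell does not use.

```python
import pandas as pd

# Toy stand-ins for client_train.csv and invoice_train.csv (values are invented).
clients = pd.DataFrame({
    "client_id": ["c1", "c2", "c3", "c4"],
    "target": [0, 0, 0, 1],
})
invoices = pd.DataFrame({
    "client_id": ["c1", "c2", "c2", "c3", "c4", "c4", "c4"],
    "consommation_level_1": [82, 120, 95, 40, 300, 280, 310],
})

# validate raises if client_id is duplicated on the clients side;
# indicator adds a "_merge" column showing where each row came from.
merged = clients.merge(invoices, on="client_id", how="outer",
                       validate="one_to_many", indicator=True)

client_rate = clients["target"].mean()    # share of fraudulent clients
invoice_rate = merged["target"].mean()    # share of invoices belonging to fraudulent clients
unmatched = int((merged["_merge"] != "both").sum())

print(client_rate, round(invoice_rate, 3), unmatched)
```

Here the fraudulent client has more invoices than the others, so the positive rate rises from 0.25 per client to 3/7 per invoice.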


In [8]:
# converts all values to int, fills bad values with the mode
preprocess('counter_statue', df_entire)

In [9]:
# feature engineering 

df_entire['year_created'] = pd.to_datetime(df_entire['creation_date'],
                                           format="%d/%m/%Y").dt.year
dates = pd.to_datetime(df_entire['invoice_date'])
df_entire['invoice_year'] = dates.dt.year
df_entire['invoice_month'] = dates.dt.month
df_entire['invoice_weekday'] = dates.dt.weekday

In [10]:
# drop the observations before 2005

YEAR = 2005
df_entire.drop(df_entire.index[df_entire['invoice_year'] < YEAR], axis=0, inplace=True)

# reset index
df_entire.reset_index(drop=True, inplace=True)

In [11]:
# label-encode 'counter_type' as integer codes

df_entire['counter_type'], _ = pd.factorize(df_entire['counter_type'])

In [12]:
# drop these columns

cols = [
    'disrict', 'client_id', 'creation_date', 'invoice_date', 'old_index', 'months_number',
    'client_catg', 'counter_statue', 'reading_remarque', 'counter_coefficient','counter_type'
        ]

df_entire.drop(cols, axis=1, inplace=True)

## Separate X, y

In [13]:
y_entire = df_entire.pop('target')

## Encode

In [14]:
categoricals = [col for col in df_entire.columns if df_entire[col].nunique() < 20]
non_cats = [col for col in df_entire.columns if df_entire[col].nunique() >= 20]

nd_cats_encoded = OrdinalEncoder().fit_transform(df_entire[categoricals]).astype(int)

In [15]:
df_entire = pd.concat([pd.DataFrame(nd_cats_encoded, 
                                           columns=categoricals,
                                          index=df_entire.index), 
                              df_entire[non_cats]], axis=1)

df_entire.head()

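|   | tarif_type | invoice_year | invoice_month | invoice_weekday | region | counter_number | counter_code | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | new_index | year_created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 9 | 2 | 0 | 101 | 1335667 | 203 | 82 | 0 | 0 | 0 | 14384 | 1994 |
| 1 | 3 | 8 | 2 | 4 | 101 | 1335667 | 203 | 1200 | 184 | 0 | 0 | 13678 | 1994 |
| 2 | 3 | 10 | 2 | 0 | 101 | 1335667 | 203 | 123 | 0 | 0 | 0 | 14747 | 1994 |
| 3 | 3 | 10 | 6 | 0 | 101 | 1335667 | 207 | 102 | 0 | 0 | 0 | 14849 | 1994 |
| 4 | 3 | 11 | 10 | 3 | 101 | 1335667 | 207 | 572 | 0 | 0 | 0 | 15638 | 1994 |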
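
The encoder above is fitted and then discarded, so the same category-to-integer mapping cannot be reapplied to the competition set later. A minimal sketch of keeping the fitted `OrdinalEncoder` and reusing it on new rows follows. The toy values are invented, and mapping unseen categories to `-1` is an assumption about how one might want to handle them.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy categorical columns (invented values).
train_cats = pd.DataFrame({"tarif_type": [11, 40, 11, 15],
                           "invoice_year": [2005, 2010, 2005, 2019]})
new_cats = pd.DataFrame({"tarif_type": [40, 99],          # 99 never appears in training
                         "invoice_year": [2019, 2005]})

# Unseen categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train_cats).astype(int)

# Reuse the fitted mapping; do not refit on the new data.
new_enc = encoder.transform(new_cats).astype(int)

print(train_enc.tolist())   # [[0, 0], [2, 1], [0, 0], [1, 2]]
print(new_enc.tolist())     # [[2, 2], [-1, 0]]
```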


## Split the data

In [16]:
df_train, df_test, y_train, y_test = train_test_split(df_entire, y_entire, 
                                                      test_size=0.2, stratify=y_entire)
Xtrain = df_train.values
Xtest = df_test.values
ytrain = y_train.values
ytest = y_test.values

## Upsample data with SMOTE

### sample before SMOTE

In [17]:
# DO NOT DO THIS SPLIT, 

train_size = 0.9999    # 0.99    for final training

df_sample, df_val, y_sample, y_val = train_test_split(df_train, y_train, 
                                                      train_size=train_size, stratify=y_train)

In [37]:
train_encoded_path = "data/train/train_encoded.csv"
test_encoded_path = "data/test/test_encoded.csv"

In [42]:
# save to csv
df_encoded_train_upsampled = pd.concat([df_smote, y_smote.astype(int)], axis=1)
df_encoded_train_upsampled.to_csv(train_encoded_path)

df_encoded_test = pd.concat([df_test, y_test.astype(int)], axis=1)
df_encoded_test.to_csv(test_encoded_path)

In [19]:
sm = SMOTENC(categorical_features=list(range(len(categoricals))),
             k_neighbors=3,
            sampling_strategy=0.9)
df_smote, y_smote = sm.fit_resample(df_sample, y_sample)

df_smote.tail(5)

|   | tarif_type | invoice_year | invoice_month | invoice_weekday | region | counter_number | counter_code | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | new_index | year_created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6233686 | 13 | 1 | 4 | 0 | 311 | 8713 | 5 | 68 | 0 | 0 | 0 | 1914 | 1997 |
| 6233687 | 3 | 4 | 4 | 0 | 177 | 631260 | 203 | 837 | 248 | 497 | 388 | 64668 | 1996 |
| 6233688 | 3 | 10 | 6 | 0 | 281 | 550985 | 203 | 533 | 10 | 0 | 0 | 8919 | 2001 |
| 6233689 | 3 | 11 | 3 | 2 | 307 | 216663 | 203 | 419 | 0 | 0 | 0 | 4502 | 1995 |
| 6233690 | 3 | 9 | 3 | 4 | 332 | 921896 | 203 | 503 | 0 | 0 | 0 | 4104 | 1999 |


In [166]:
# provisional split   !!!


df_smote = df_smote.iloc[:,:13]

SMALL_PORTION_FROM_UPSAMPLED_DATA = 0.9999

df, _, y, _ = train_test_split(df_smote, y_smote, 
            train_size=SMALL_PORTION_FROM_UPSAMPLED_DATA, stratify=y_smote)


###df,y = df_smote, y_smote

X,y = df.values, y.values

df.shape, X.shape, y.shape

((6233067, 13), (6233067, 13), (6233067,))

In [167]:
df_smote.columns

Index(['tarif_type', 'invoice_year', 'invoice_month', 'invoice_weekday',
       'region', 'counter_number', 'counter_code', 'consommation_level_1',
       'consommation_level_2', 'consommation_level_3', 'consommation_level_4',
       'new_index', 'year_created'],
      dtype='object')

## Feature Selection

### features to use:

In [22]:
"""
region                  0.230020
counter_number          0.184581
year_created            0.068360
interaction             0.061893
ratio                   0.057657
new_index               0.055945
counter_code            0.040542
consommation_level_1    0.038644
invoice_year            0.037021
invoice_month           0.029143
invoice_weekday         0.020443
tarif_type              0.019677
""";

## Feature engineering

In [168]:

def feature_engineer(df):
    #df['ratio'] = (df['counter_number'] / df['region']).fillna(0).replace((-np.inf, np.inf), 0)
    #df['interaction'] = np.log( (df['counter_number'] * df['region']).fillna(0).replace((-np.inf, np.inf), 0)).fillna(0).replace((-np.inf, np.inf), 0)
    return df

In [169]:
df_test = df_test.iloc[:, :13]

In [170]:
df = feature_engineer(df)
df_val = feature_engineer(df_val)
df_test = feature_engineer(df_test)

X = df.values



In [171]:
X.shape, df.shape, df_test.shape

((6233067, 13), (6233067, 13), (890928, 13))

In [172]:
df.columns, df_test.columns

(Index(['tarif_type', 'invoice_year', 'invoice_month', 'invoice_weekday',
        'region', 'counter_number', 'counter_code', 'consommation_level_1',
        'consommation_level_2', 'consommation_level_3', 'consommation_level_4',
        'new_index', 'year_created'],
       dtype='object'),
 Index(['tarif_type', 'invoice_year', 'invoice_month', 'invoice_weekday',
        'region', 'counter_number', 'counter_code', 'consommation_level_1',
        'consommation_level_2', 'consommation_level_3', 'consommation_level_4',
        'new_index', 'year_created'],
       dtype='object'))

In [83]:
# function printing model report
def print_report(model, Xtrain, ytrain, Xtest, ytest):
    print("Train set:")
    ypred = model.predict(Xtrain)
    print(classification_report(ytrain, ypred))

    print("Test set:")
    ypred = model.predict(Xtest)
    print(classification_report(ytest, ypred))

    probs = model.predict_proba(Xtest)[:,-1]
    auc = roc_auc_score(ytest, probs)
    print("AUC =", round(auc, 2))
    
    print(confusion_matrix(ytest, ypred))
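
Several later cells call `report(model, df_test, y_test)`, which is not defined in this section. Judging by the printed outputs (a test-set classification report followed by the AUC), it might look roughly like the sketch below. This is a hypothetical reconstruction, not the original helper, and the toy data at the end is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def report(model, X_test, y_test):
    """Hypothetical stand-in: print the test-set report and AUC, return the AUC."""
    print("Test set:")
    print(classification_report(y_test, model.predict(X_test)))
    probs = model.predict_proba(X_test)[:, -1]   # probability of the positive class
    auc = roc_auc_score(y_test, probs)
    print("AUC =", round(auc, 2))
    return auc


# Toy usage on synthetic, imbalanced data (not the STEG data).
X_toy, y_toy = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)
toy_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
auc_toy = report(toy_model, X_te, y_te)
```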
    


## Fit a single Decision Tree

In [60]:
from sklearn.tree import DecisionTreeClassifier

md = DecisionTreeClassifier(max_depth=None)
md.fit(X,y)

pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)

counter_number          0.366394
region                  0.168340
year_created            0.090586
consommation_level_1    0.077741
interaction             0.056648
new_index               0.052626
counter_code            0.052248
ratio                   0.048736
invoice_year            0.030172
consommation_level_2    0.016977
invoice_month           0.012896
tarif_type              0.012756
invoice_weekday         0.008052
consommation_level_3    0.003874
consommation_level_4    0.001952
dtype: float64

In [61]:
report(md, df_test, y_test)

Test set:
              precision    recall  f1-score   support

         0.0       0.98      0.94      0.96    820306
         1.0       0.54      0.78      0.64     70622

    accuracy                           0.93    890928
   macro avg       0.76      0.86      0.80    890928
weighted avg       0.95      0.93      0.94    890928

AUC = 0.87


## Fit a random forest

In [62]:
md = RandomForestClassifier(n_estimators=25, criterion='gini')
md.fit(X,y)

RandomForestClassifier(n_estimators=25)

In [63]:
pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)

counter_number          0.176064
region                  0.139613
interaction             0.095499
consommation_level_1    0.094773
year_created            0.094501
ratio                   0.092018
new_index               0.088609
invoice_year            0.054832
counter_code            0.049154
invoice_month           0.036751
invoice_weekday         0.027161
consommation_level_2    0.025487
tarif_type              0.012834
consommation_level_3    0.009146
consommation_level_4    0.003558
dtype: float64

In [64]:
report(md, df_test, y_test)

Test set:
              precision    recall  f1-score   support

         0.0       0.96      0.97      0.96    820306
         1.0       0.60      0.56      0.58     70622

    accuracy                           0.94    890928
   macro avg       0.78      0.76      0.77    890928
weighted avg       0.93      0.94      0.93    890928

AUC = 0.91


## Decision Tree (grid search for f1)

In [177]:
md = DecisionTreeClassifier(criterion='gini')

params = {'max_depth': list(np.arange(30, 120, 20)) + [None]}  # None was the best in an earlier run; try depths above 60

gs = GridSearchCV(md, params, cv=CV, scoring='f1').fit(X,y)

In [178]:
best_tree = gs.best_estimator_

best_params_f1 = gs.best_params_
print("best params for decision tree f1:", best_params_f1, sep="\n")

best params for decision tree f1:
{'max_depth': 70}


In [179]:
pd.Series(best_tree.feature_importances_, index=df.columns).sort_values(ascending=False)

counter_number          0.418780
region                  0.164412
new_index               0.102772
year_created            0.094421
consommation_level_1    0.078255
counter_code            0.050733
invoice_year            0.032516
consommation_level_2    0.015953
invoice_month           0.014162
tarif_type              0.012978
invoice_weekday         0.009137
consommation_level_3    0.003927
consommation_level_4    0.001953
dtype: float64

In [180]:
report(best_tree, df_test, y_test)

Test set:
              precision    recall  f1-score   support

         0.0       0.98      0.96      0.97    820306
         1.0       0.63      0.80      0.71     70622

    accuracy                           0.95    890928
   macro avg       0.81      0.88      0.84    890928
weighted avg       0.95      0.95      0.95    890928

AUC = 0.88


In [183]:
print_report(best_tree, X, y, Xtest, ytest)

Train set:
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00   3280562
         1.0       1.00      1.00      1.00   2952505

    accuracy                           1.00   6233067
   macro avg       1.00      1.00      1.00   6233067
weighted avg       1.00      1.00      1.00   6233067

Test set:
              precision    recall  f1-score   support

         0.0       0.98      0.96      0.97    820306
         1.0       0.63      0.80      0.71     70622

    accuracy                           0.95    890928
   macro avg       0.81      0.88      0.84    890928
weighted avg       0.95      0.95      0.95    890928

AUC = 0.88
[[786951  33355]
 [ 13913  56709]]
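
The positive-class scores in the test report can be recomputed from the confusion matrix above. scikit-learn lays the matrix out as `[[TN, FP], [FN, TP]]`, with true labels in rows and predictions in columns. A short worked check:

```python
# Counts copied from the confusion matrix printed above.
tn, fp = 786951, 33355
fn, tp = 13913, 56709

precision = tp / (tp + fp)   # of the invoices flagged as fraud, how many are fraud
recall = tp / (tp + fn)      # of the fraudulent invoices, how many are flagged
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))   # 0.63 0.8 0.71
```

The perfect train-set scores next to a test F1 of 0.71 indicate that the fully grown tree memorises the training data.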


## Random Forest (grid search for f1)

In [None]:
md = RandomForestClassifier(n_estimators=100, criterion='gini', bootstrap=True, oob_score=True)

params = {
          'max_depth': [None, 40, 60],
          'min_samples_split': [2, 5],
          'min_samples_leaf': [1, 3]}

gs = GridSearchCV(md, params, cv=CV, scoring='f1').fit(X,y)


In [None]:
best_rf = gs.best_estimator_

best_params_f1 = gs.best_params_
print("best params for random forest f1:", best_params_f1, sep="\n")

In [None]:
pd.Series(best_rf.feature_importances_, index=df.columns).sort_values(ascending=False)

In [None]:
report(best_rf, df_test, y_test)

### Randomized Search

In [27]:

md = DecisionTreeClassifier(criterion='gini')

params = {'criterion': ['gini', 'entropy', 'log_loss'],
          'max_depth': np.arange(10, 100),
          'min_samples_split': np.arange(2, 9),
          'min_samples_leaf': np.arange(1, 5)}

gs_f1 = RandomizedSearchCV(md, params, cv=CV, scoring='f1', n_iter=N_ITER).fit(X,y)
gs_recall = RandomizedSearchCV(md, params, cv=CV, scoring='recall', n_iter=N_ITER).fit(X,y)


KeyboardInterrupt: 
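
The cell above runs two full randomized searches, one per metric, which doubles the fitting cost. `RandomizedSearchCV` can instead score every candidate on several metrics in a single run. The sketch below uses synthetic data; the small grid, `n_iter=5`, `cv=3`, and the seeds are placeholder assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for X, y.
X_toy, y_toy = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

params = {"max_depth": np.arange(2, 10),
          "min_samples_leaf": np.arange(1, 5)}

# One search, two metrics; refit="f1" decides which metric picks best_estimator_.
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0), params,
    scoring=["f1", "recall"], refit="f1",
    n_iter=5, cv=3, random_state=0,
).fit(X_toy, y_toy)

results = search.cv_results_
best_recall_idx = int(np.argmax(results["mean_test_recall"]))

print("best params for f1:    ", search.best_params_)
print("best params for recall:", results["params"][best_recall_idx])
```

The recall-optimal candidate is read from `cv_results_`; if a fitted model is needed for it, it has to be refitted separately with those parameters.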

In [None]:
best_tree = gs_f1.best_estimator_

best_params_f1 = gs_f1.best_params_
print("best params for decision tree f1:", best_params_f1, sep="\n")

In [None]:
tree_recall = gs_recall.best_estimator_

best_params_recall = gs_recall.best_params_
print("best params for decision tree recall:", best_params_recall, sep="\n")

In [None]:
pd.Series(best_tree.feature_importances_, index=df.columns).sort_values(ascending=False)

In [None]:
report(best_tree, df_test, y_test)

## Random Forest

In [None]:
md = RandomForestClassifier(criterion='gini', bootstrap=True, oob_score=True)

params = {'n_estimators': np.arange(50, 250, 20),
          'criterion': ['gini', 'entropy', 'log_loss'],
          'max_depth': np.arange(10, 100),
          'min_samples_split': np.arange(2, 9),
          'min_samples_leaf': np.arange(1, 5),
          'oob_score': [False, True]}

gs_f1 = RandomizedSearchCV(md, params, cv=CV, scoring='f1', n_iter=N_ITER).fit(X,y)
gs_recall = RandomizedSearchCV(md, params, cv=CV, scoring='recall', n_iter=N_ITER).fit(X,y)

In [None]:
best_rf = gs_f1.best_estimator_

best_params_f1 = gs_f1.best_params_
print("best params for random forest f1:", best_params_f1, sep="\n")

In [None]:
best_rf_recall = gs_recall.best_estimator_

best_params_recall = gs_recall.best_params_
print("best params for random forest recall:", best_params_recall, sep="\n")

In [None]:
pd.Series(best_rf.feature_importances_, index=df.columns).sort_values(ascending=False)

In [None]:
report(best_rf, df_test, y_test)

In [128]:
# prec 0.54, rec 0.78, f1 0.64, auc 0.87

# prec 0.59, rec 0.81, f1 0.68, auc 0.89    (max_depth=40 + 2 engineered features)

In [None]:
# prec 0.55, rec 0.80, f1 0.65, auc 0.88    (no engineered features)