## Features:

### Client:

- Client_id: Unique id for client
- District: District where the client is
- Client_catg: Category client belongs to
- Region: Area where the client is
- Creation_date: Date client joined
- Target: 1 = fraud, 0 = not fraud


### Invoice data

- Client_id: Unique id for the client
- Invoice_date: Date of the invoice
- Tarif_type: Type of tax
- Counter_number:
- Counter_statue: takes up to 5 values such as working fine, not working, on hold status, etc.
- Counter_code:
- Reading_remarque: notes that the STEG agent takes during a visit to the client (e.g., if the counter shows something wrong, the agent gives a bad score)
- Counter_coefficient: An additional coefficient to be added when standard consumption is exceeded
- Consommation_level_1: Consumption_level_1
- Consommation_level_2: Consumption_level_2
- Consommation_level_3: Consumption_level_3
- Consommation_level_4: Consumption_level_4
- Old_index: Old index
- New_index: New index
- Months_number: Month number
- Counter_type: Type of counter


In [2]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, classification_report
from sklearn.metrics import fbeta_score, confusion_matrix, recall_score, precision_score
from sklearn.metrics import make_scorer

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score

from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import SMOTENC

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [3]:
from warnings import filterwarnings
filterwarnings('ignore')

In [5]:
path1 = "data/train/client_train.csv"
path2 = "data/train/invoice_train.csv"

path3 = "data/test/client_test.csv"
path4 = "data/test/invoice_test.csv"

In [9]:
# load the data

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2, low_memory=False)     # read in one pass to avoid mixed-dtype inference across chunks

df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)

In [10]:
# join tables

# data from the "train" folder (will have to be split into train/test)
df_entire = df1.merge(df2, on='client_id', how='outer')

# data from the "test" folder (doesn't contain targets)
df_test_zindi = df3.merge(df4, on='client_id', how='outer')
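A quick sanity check on the join can catch duplicated clients or invoices without a matching client. The sketch below uses two tiny hypothetical frames (not the real CSVs) together with the `validate` and `indicator` options of `merge`; the column values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy tables standing in for the client and invoice CSVs
clients = pd.DataFrame({"client_id": ["c1", "c2", "c3"], "target": [0, 1, 0]})
invoices = pd.DataFrame({"client_id": ["c1", "c1", "c2", "c4"],
                         "new_index": [100, 150, 30, 7]})

# validate raises MergeError if client_id is not unique on the client side;
# indicator adds a "_merge" column telling where each row came from
merged = clients.merge(invoices, on="client_id", how="outer",
                       validate="one_to_many", indicator=True)

print(merged["_merge"].value_counts())
```

Rows marked `left_only` are clients without invoices and rows marked `right_only` are invoices without a client record; with `how='outer'` both kinds stay in the table with missing values.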


## Preprocess the data / feature engineering

In [4]:
# feature preprocessing functions

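def preprocess(feature, data):
    """Dispatch to the preprocessing function registered for `feature`.

    `feature` is a column name or a positional column index; `data` is
    modified in place.
    """
    functions = {'counter_statue': preprocess_counter_statue}
    if not isinstance(feature, str):
        # resolve a Series name or a positional index to a column name
        feature = data.name if isinstance(data, pd.Series) else data.columns[feature]
    return functions[feature](data)


# preprocess 'counter_statue'
def preprocess_counter_statue(data):
    """Cast 'counter_statue' to int, replacing invalid codes with the mode (in place)."""
    col = 'counter_statue'
    sr = data[col].astype(str)
    mask = sr.isin(list("012345"))      # valid single-digit codes
    sr[~mask] = sr[mask].mode().values[0]
    data[col] = sr.astype(int)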


In [11]:
# converts all values to int, fills bad values with the mode
preprocess('counter_statue', df_entire)
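To make the cleaning rule concrete, here is a minimal sketch of the same idea on a made-up series: values outside the single-digit codes 0 to 5 are replaced by the most frequent valid code. The sample values are hypothetical and the sketch returns a new series instead of modifying a frame in place.

```python
import pandas as pd

# Hypothetical sample of raw 'counter_statue' values with mixed types
raw = pd.Series([0, "0", 1, "A", 0, 769, "5"])

as_str = raw.astype(str)
valid = as_str.isin(list("012345"))          # single-digit codes 0..5
mode = as_str[valid].mode().iloc[0]          # most frequent valid code
cleaned = as_str.where(valid, mode).astype(int)

print(cleaned.tolist())
```

Here `"A"` and `769` are both replaced by `0`, the mode of the valid entries.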

In [12]:
# feature engineering 

df_entire['year_created'] = pd.to_datetime(df_entire['creation_date'],
                                           format="%d/%m/%Y").dt.year
dates = pd.to_datetime(df_entire['invoice_date'])
df_entire['invoice_year'] = dates.dt.year
df_entire['invoice_month'] = dates.dt.month
df_entire['invoice_weekday'] = dates.dt.weekday
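The cell above passes an explicit `format` for `creation_date` but lets pandas infer the format of `invoice_date`. The sketch below, on hypothetical strings, shows why that matters for day/month/year data: without a format, pandas reads such strings month-first by default. If `invoice_date` is stored in an unambiguous ISO form this does not apply; that is an assumption worth checking on the real file.

```python
import pandas as pd

raw = pd.Series(["03/04/2010", "11/12/2014"])  # hypothetical day/month/year strings

month_first = pd.to_datetime(raw)                    # default parsing reads 03/04 as March 4
day_first = pd.to_datetime(raw, format="%d/%m/%Y")   # explicit format reads it as 3 April

print(month_first.dt.month.tolist(), day_first.dt.month.tolist())
```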

In [13]:
# drop the observations before 2005

YEAR = 2005
df_entire.drop(df_entire.index[df_entire['invoice_year'] < YEAR], axis=0, inplace=True)

# reset index
df_entire.reset_index(drop=True, inplace=True)

In [14]:
# label-encode 'counter_type'

df_entire['counter_type'], _ = pd.factorize(df_entire['counter_type'])
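As a small illustration of what `pd.factorize` returns, the sketch below encodes a made-up label column. The label strings are hypothetical; the point is that codes follow order of first appearance and missing values get the sentinel `-1`.

```python
import pandas as pd

labels = pd.Series(["ELEC", "GAZ", "ELEC", None])  # hypothetical values

codes, uniques = pd.factorize(labels)

print(codes)          # integer code per row; missing values get -1
print(list(uniques))  # the label behind each non-negative code
```

Keeping `uniques` (discarded as `_` in the cell above) would allow the same mapping to be reapplied to the test data.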

In [15]:
# drop these columns

cols = [
    'disrict', 'creation_date', 'invoice_date', 'old_index',  # drop
    'client_id', 'months_number',                             # maybe keep?
    'client_catg', 'counter_statue', 'reading_remarque', 'counter_coefficient','counter_type'
        ]

df_entire.drop(cols, axis=1, inplace=True)

## Separate X, y

In [16]:
y_entire = df_entire.pop('target')

## Encode (for upcoming upsampling with SMOTE)

In [17]:
categoricals = [col for col in df_entire.columns if df_entire[col].nunique() < 20]
non_cats = [col for col in df_entire.columns if df_entire[col].nunique() >= 20]

nd_cats_encoded = OrdinalEncoder().fit_transform(df_entire[categoricals]).astype(int)

In [18]:
df_entire = pd.concat([pd.DataFrame(nd_cats_encoded, 
                                           columns=categoricals,
                                          index=df_entire.index), 
                              df_entire[non_cats]], axis=1)

df_entire.head()

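| | tarif_type | invoice_year | invoice_month | invoice_weekday | region | counter_number | counter_code | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | new_index | year_created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 9 | 2 | 0 | 101 | 1335667 | 203 | 82 | 0 | 0 | 0 | 14384 | 1994 |
| 1 | 3 | 8 | 2 | 4 | 101 | 1335667 | 203 | 1200 | 184 | 0 | 0 | 13678 | 1994 |
| 2 | 3 | 10 | 2 | 0 | 101 | 1335667 | 203 | 123 | 0 | 0 | 0 | 14747 | 1994 |
| 3 | 3 | 10 | 6 | 0 | 101 | 1335667 | 207 | 102 | 0 | 0 | 0 | 14849 | 1994 |
| 4 | 3 | 11 | 10 | 3 | 101 | 1335667 | 207 | 572 | 0 | 0 | 0 | 15638 | 1994 |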
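The encoded values in the output above are positions in the sorted list of unique values per column, which is how `OrdinalEncoder` works. A minimal sketch on a hypothetical frame shows the mapping and how `inverse_transform` recovers the original values; under the assumption that every year from 2005 onward is present, an `invoice_year` code of 9 would correspond to 2014.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical low-cardinality columns
toy = pd.DataFrame({"invoice_year": [2014, 2013, 2015, 2014],
                    "tarif_type": [11, 40, 11, 10]})

enc = OrdinalEncoder()
codes = enc.fit_transform(toy).astype(int)

print(enc.categories_)   # sorted unique values per column
print(codes)             # position of each value in its column's category list

original = enc.inverse_transform(codes)  # maps codes back to the raw values
```

Keeping the fitted encoder in a variable, rather than discarding it as in the cell above, would let the same mapping be applied to the Zindi test data.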


## Split the data

In [19]:
df_train, df_test, y_train, y_test = train_test_split(df_entire, y_entire, 
                                                      test_size=0.2, stratify=y_entire)
#X_train = df_train.values
#X_test = df_test.values
#y_train = y_train.values
#y_test = y_test.values
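The `stratify` argument keeps the fraud rate the same in both parts of the split, which matters with roughly 8% positives. The sketch below checks this on hypothetical labels; it also sets `random_state`, which the cell above does not, so that the split is reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 8% positives
y = np.array([1] * 80 + [0] * 920)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

print(y_tr.mean(), y_te.mean())  # positive rate in train and test
```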

## Upsample data with SMOTE

In [None]:
# run the upsampling procedure or load upsampled data from disk?
UPSAMPLE = False  # False if you have run this already and saved upsampled data on disk

In [20]:
train_encoded_path = "data/train/train_encoded.csv"
test_encoded_path = "data/test/test_encoded.csv"

In [19]:
if UPSAMPLE:
    sm = SMOTENC(categorical_features=list(range(len(categoricals))),
                 k_neighbors=3,
                sampling_strategy=0.9)

    df_smote, y_smote = sm.fit_resample(df_train, y_train)

| | tarif_type | invoice_year | invoice_month | invoice_weekday | region | counter_number | counter_code | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | new_index | year_created |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6233686 | 13 | 1 | 4 | 0 | 311 | 8713 | 5 | 68 | 0 | 0 | 0 | 1914 | 1997 |
| 6233687 | 3 | 4 | 4 | 0 | 177 | 631260 | 203 | 837 | 248 | 497 | 388 | 64668 | 1996 |
| 6233688 | 3 | 10 | 6 | 0 | 281 | 550985 | 203 | 533 | 10 | 0 | 0 | 8919 | 2001 |
| 6233689 | 3 | 11 | 3 | 2 | 307 | 216663 | 203 | 419 | 0 | 0 | 0 | 4502 | 1995 |
| 6233690 | 3 | 9 | 3 | 4 | 332 | 921896 | 203 | 503 | 0 | 0 | 0 | 4104 | 1999 |
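With a float `sampling_strategy`, the resampler targets a minority-to-majority ratio after resampling. A quick arithmetic check, using the class supports from the train-set classification reports later in this notebook, shows that the upsampled data is consistent with the requested 0.9.

```python
# Class supports taken from the train-set reports in this notebook
n_majority = 3_280_562   # class 0
n_minority = 2_952_505   # class 1 after SMOTENC

ratio = n_minority / n_majority
print(round(ratio, 4))
```

The counts come from the 99.99% stratified sample of the upsampled data, so the ratio is approximate rather than exact.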


In [42]:
# save to csv

if UPSAMPLE:
    df_encoded_train_upsampled = pd.concat([df_smote, y_smote.astype(int)], axis=1)
    df_encoded_train_upsampled.to_csv(train_encoded_path, header=True, index=False)

    df_encoded_test = pd.concat([df_test, y_test.astype(int)], axis=1)
    df_encoded_test.to_csv(test_encoded_path, header=True, index=False)

In [25]:
# load the data
df_encoded_train_upsampled = pd.read_csv(train_encoded_path)
df_encoded_test = pd.read_csv(test_encoded_path)

In [27]:
# Provisional measure
#df_encoded_train_upsampled.drop('Unnamed: 0', axis=1, inplace=True)
#df_encoded_test.drop('Unnamed: 0', axis=1, inplace=True)

In [28]:
# split X,y
y_train_upsample = df_encoded_train_upsampled.pop('target')
df_train_upsample = df_encoded_train_upsampled

y_test = df_encoded_test.pop('target')
df_test = df_encoded_test

In [30]:
# provisional measure due to the saved SMOTE data
#df_train_upsample = df_train_upsample.iloc[:,:13]   
#df_test = df_test.iloc[:, :13]

In [31]:
# take a small sample from the train set (optional)

PORTION_FROM_UPSAMPLED_DATA = 0.9999   # 0.9999 = almost all the upsampled observations

df_train, _, y_train, _ = train_test_split(df_train_upsample, y_train_upsample, 
            train_size=PORTION_FROM_UPSAMPLED_DATA, stratify=y_train_upsample)

X_train, y_train = df_train.values, y_train.values

X_train.shape, df_train.shape, y_train.shape

((6233067, 13), (6233067, 13), (6233067,))

## Feature engineering

In [32]:

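def feature_engineer(df):
    """Add ratio and log-interaction features of counter_number and region (modifies df in place)."""
    inf = (-np.inf, np.inf)
    product = df['counter_number'] * df['region']
    # zeros give inf/NaN in the ratio and -inf in the log; all of them are mapped to 0
    df['ratio'] = (df['counter_number'] / df['region']).fillna(0).replace(inf, 0)
    df['interaction'] = np.log(product).fillna(0).replace(inf, 0)
    return df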

In [33]:
df_train = feature_engineer(df_train)
df_test = feature_engineer(df_test)

X_train = df_train.values
X_test = df_test.values
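The two engineered features rely on replacing the infinities that zeros produce. The sketch below reproduces that handling on a hypothetical three-row frame so the edge cases are visible; it is an illustration of the idea, not a replacement for `feature_engineer`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, including zeros in both columns
toy = pd.DataFrame({"counter_number": [10, 0, 5], "region": [2, 3, 0]})

# Silence the divide-by-zero and log(0) warnings for this block
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = toy["counter_number"] / toy["region"]
    interaction = np.log(toy["counter_number"] * toy["region"])

# Replace the infinities and NaNs produced by zeros with 0
ratio = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
interaction = interaction.replace([np.inf, -np.inf], np.nan).fillna(0)

print(ratio.tolist())
print(interaction.round(4).tolist())
```

One consequence of this choice is that a product of 0 and a product of 1 both end up with an interaction value of 0, since log(1) = 0.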

In [37]:
def print_report(model, Xtrain, ytrain, Xtest, ytest):
    print("Train set:")
    ypred = model.predict(Xtrain)
    print(classification_report(ytrain, ypred))

    print("Test set:")
    ypred = model.predict(Xtest)
    print(classification_report(ytest, ypred))

    probs = model.predict_proba(Xtest)[:,-1]
    auc = roc_auc_score(ytest, probs)
    print("AUC =", auc.round(2))
    print(confusion_matrix(ytest, ypred))
    

## Fit a single Decision Tree

In [41]:
from sklearn.tree import DecisionTreeClassifier

md = DecisionTreeClassifier(max_depth=70)   # 70 = grid search result
md.fit(X_train, y_train)

pd.Series(md.feature_importances_, index=df_train.columns).sort_values(ascending=False)

counter_number          0.155252
region                  0.151652
ratio                   0.151206
interaction             0.144448
new_index               0.096039
year_created            0.092097
consommation_level_1    0.074388
counter_code            0.050354
invoice_year            0.030313
consommation_level_2    0.015549
invoice_month           0.012590
tarif_type              0.012581
invoice_weekday         0.007872
consommation_level_3    0.003743
consommation_level_4    0.001916
dtype: float64

In [42]:
print_report(md, X_train, y_train, X_test, y_test)

Train set:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   3280562
           1       1.00      1.00      1.00   2952505

    accuracy                           1.00   6233067
   macro avg       1.00      1.00      1.00   6233067
weighted avg       1.00      1.00      1.00   6233067

Test set:
              precision    recall  f1-score   support

           0       0.98      0.96      0.97    820306
           1       0.65      0.82      0.73     70622

    accuracy                           0.95    890928
   macro avg       0.82      0.89      0.85    890928
weighted avg       0.96      0.95      0.95    890928

AUC = 0.89
[[789880  30426]
 [ 12987  57635]]
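The class-1 row of the test report can be re-derived from the confusion matrix printed above, which is a useful check on how to read it (rows are true classes, columns are predicted classes).

```python
# Confusion matrix printed above: rows = true class, columns = predicted class
tn, fp = 789_880, 30_426
fn, tp = 12_987, 57_635

precision = tp / (tp + fp)   # share of predicted frauds that are real
recall = tp / (tp + fn)      # share of real frauds that are caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.65 0.82 0.73
```

In absolute terms the tree raises 30,426 false alarms and misses 12,987 of the 70,622 fraud rows.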


In [None]:
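# precision 0.65, recall 0.82, f1 0.73, AUC 0.89  (engineered features, max_depth=70)

In [None]:
# precision 0.59, recall 0.81, f1 0.68, AUC 0.89  (engineered features, max_depth=40)

In [None]:
# precision 0.63, recall 0.80, f1 0.71, AUC 0.88  (no engineered features, max_depth=70)

In [None]:
# precision 0.84, recall 0.79, f1 0.81, AUC 0.98  (random forest, engineered features, 25 estimators)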

## Fit a random forest

In [46]:
md = RandomForestClassifier(n_estimators=100, criterion='gini')
md.fit(X_train, y_train)

pd.Series(md.feature_importances_, index=df_train.columns).sort_values(ascending=False)

counter_number          0.145563
ratio                   0.144313
interaction             0.142769
region                  0.124144
new_index               0.091114
year_created            0.088620
consommation_level_1    0.081906
counter_code            0.047769
invoice_year            0.040571
invoice_month           0.026413
consommation_level_2    0.022839
invoice_weekday         0.019229
tarif_type              0.012397
consommation_level_3    0.008151
consommation_level_4    0.004202
dtype: float64

In [47]:
print_report(md, X_train, y_train, X_test, y_test)

Train set:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   3280562
           1       1.00      1.00      1.00   2952505

    accuracy                           1.00   6233067
   macro avg       1.00      1.00      1.00   6233067
weighted avg       1.00      1.00      1.00   6233067

Test set:
              precision    recall  f1-score   support

           0       0.98      0.99      0.99    820306
           1       0.87      0.78      0.82     70622

    accuracy                           0.97    890928
   macro avg       0.92      0.89      0.90    890928
weighted avg       0.97      0.97      0.97    890928

AUC = 0.98
[[811902   8404]
 [ 15244  55378]]
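The forest's AUC of 0.98 is computed from `predict_proba`, while the report uses the default 0.5 cut-off. One possible way to trade precision for recall is to lower that cut-off. The sketch below illustrates the idea on synthetic data from `make_classification`; the data, the 0.3 threshold, and the model settings are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real features
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
probs = rf.predict_proba(X_te)[:, 1]   # probability of the positive class

results = {}
for threshold in (0.5, 0.3):
    pred = (probs >= threshold).astype(int)
    results[threshold] = (precision_score(y_te, pred, zero_division=0),
                          recall_score(y_te, pred))
    print(threshold, results[threshold])
```

Lowering the threshold can only add predicted positives, so recall never decreases; how much precision is lost depends on the data.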


In [None]:
# precision 0.84, recall 0.79, f1 0.81, AUC 0.98  (25 estimators)

In [None]:
# precision 0.87, recall 0.78, f1 0.82, AUC 0.98  (50 estimators)

## Decision Tree (grid search for f1)

In [None]:
N_ITER = 30 
CV = 7

In [65]:
md = DecisionTreeClassifier(criterion='gini')

params = {'max_depth': [50, 70]}  # the best so far was None; try depths above 60

# TIMEIT!
gs = GridSearchCV(md, params, cv=CV, scoring='f1').fit(X_train, y_train)

In [66]:
best_tree = gs.best_estimator_

best_params_f1 = gs.best_params_
print("best params for decision tree f1:", best_params_f1, sep="\n")

best params for decision tree f1:
{'max_depth': None}
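`best_params_` reports only the winner, and it can only be a value from the grid, so the `None` printed above presumably comes from a run whose grid included `None`. To see how close the candidates are, the full `cv_results_` table is more informative. A minimal sketch on synthetic data (the data and the grid values are assumptions for illustration):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for X_train, y_train
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

params = {"max_depth": [3, 10, None]}
gs = GridSearchCV(DecisionTreeClassifier(random_state=0), params,
                  cv=5, scoring="f1")
gs.fit(X, y)

# One row per candidate with its mean and spread of the F1 score across folds
summary = pd.DataFrame(gs.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]]
print(summary.sort_values("rank_test_score"))
print(gs.best_params_)
```

If the mean scores differ by less than their standard deviation, the shallower tree may be the safer choice.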


In [67]:
pd.Series(best_tree.feature_importances_, index=df_train.columns).sort_values(ascending=False)

counter_number          0.372326
region                  0.163611
year_created            0.089843
consommation_level_1    0.077164
interaction             0.056707
new_index               0.052362
counter_code            0.050505
ratio                   0.050066
invoice_year            0.030793
consommation_level_2    0.016331
invoice_month           0.013517
tarif_type              0.012542
invoice_weekday         0.008649
consommation_level_3    0.003720
consommation_level_4    0.001863
dtype: float64

In [68]:
report(best_tree, df_test, y_test)

Test set:
              precision    recall  f1-score   support

         0.0       0.98      0.96      0.97    820306
         1.0       0.61      0.78      0.69     70622

    accuracy                           0.94    890928
   macro avg       0.80      0.87      0.83    890928
weighted avg       0.95      0.94      0.95    890928

AUC = 0.87


## Random Forest (grid search for f1)

In [None]:
md = RandomForestClassifier(n_estimators=100, criterion='gini', bootstrap=True, oob_score=True)

params = {
          'max_depth': [None, 40, 60],
          'min_samples_split': [2, 5],
          'min_samples_leaf': [1, 3]}

gs = GridSearchCV(md, params, cv=CV, scoring='f1').fit(X_train, y_train)


In [None]:
best_rf = gs.best_estimator_

best_params_f1 = gs.best_params_
print("best params for random forest f1:", best_params_f1, sep="\n")

In [None]:
pd.Series(best_rf.feature_importances_, index=df_train.columns).sort_values(ascending=False)

In [None]:
print_report(best_rf, X_train, y_train, X_test, y_test)

### Randomized Search

In [27]:

md = DecisionTreeClassifier(criterion='gini')

params = {'criterion': ['gini', 'entropy', 'log_loss'],
          'max_depth': np.arange(10, 100),
          'min_samples_split': np.arange(2, 9),
          'min_samples_leaf': np.arange(1, 5)}

gs_f1 = RandomizedSearchCV(md, params, cv=CV, scoring='f1', n_iter=N_ITER).fit(X_train, y_train)
gs_recall = RandomizedSearchCV(md, params, cv=CV, scoring='recall', n_iter=N_ITER).fit(X_train, y_train)


KeyboardInterrupt: 
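Running two separate searches doubles the fitting cost, which may be part of why the cell above was interrupted. scikit-learn's search classes accept several metrics at once through a `scoring` dict plus `refit`, so both F1 and recall can be collected from a single pass over the same candidates. A minimal sketch on synthetic data (data, ranges, `n_iter` and `cv` are assumptions kept small for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for X_train, y_train
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

params = {"max_depth": np.arange(2, 20),
          "min_samples_leaf": np.arange(1, 5)}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0), params,
    n_iter=5, cv=3,
    scoring={"f1": "f1", "recall": "recall"},  # both metrics per candidate
    refit="f1",                                # best_estimator_ is chosen by F1
    random_state=0)
search.fit(X, y)

best_f1_params = search.best_params_
# The best-recall candidate is read from cv_results_; it is not refitted
best_recall_idx = int(np.argmax(search.cv_results_["mean_test_recall"]))
best_recall_params = search.cv_results_["params"][best_recall_idx]

print(best_f1_params, best_recall_params)
```

Only the `refit` metric gets a fitted `best_estimator_`; a model for the best-recall parameters would have to be fitted separately.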

In [None]:
best_tree = gs_f1.best_estimator_

best_params_f1 = gs_f1.best_params_
print("best params for decision tree f1:", best_params_f1, sep="\n")

In [None]:
tree_recall = gs_recall.best_estimator_

best_params_recall = gs_recall.best_params_
print("best params for decision tree recall:", best_params_recall, sep="\n")

In [None]:
pd.Series(best_tree.feature_importances_, index=df_train.columns).sort_values(ascending=False)

In [None]:
print_report(best_tree, X_train, y_train, X_test, y_test)

## Random Forest

In [None]:
md = RandomForestClassifier(criterion='gini', bootstrap=True, oob_score=True)

params = {'n_estimators': np.arange(50, 250, 20),
          'criterion': ['gini', 'entropy', 'log_loss'],
          'max_depth': np.arange(10, 100),
          'min_samples_split': np.arange(2, 9),
          'min_samples_leaf': np.arange(1, 5),
          'oob_score': [False, True]}

gs_f1 = RandomizedSearchCV(md, params, cv=CV, scoring='f1', n_iter=N_ITER).fit(X_train, y_train)
gs_recall = RandomizedSearchCV(md, params, cv=CV, scoring='recall', n_iter=N_ITER).fit(X_train, y_train)

In [None]:
best_rf = gs_f1.best_estimator_

best_params_f1 = gs_f1.best_params_
print("best params for random forest f1:", best_params_f1, sep="\n")

In [None]:
best_rf_recall = gs_recall.best_estimator_

best_params_recall = gs_recall.best_params_
print("best params for random forest recall:", best_params_recall, sep="\n")

In [None]:
pd.Series(best_rf.feature_importances_, index=df_train.columns).sort_values(ascending=False)

In [None]:
print_report(best_rf, X_train, y_train, X_test, y_test)

In [None]:
#### the last one

In [164]:
from sklearn.tree import DecisionTreeClassifier

md = DecisionTreeClassifier(max_depth=40)
md.fit(X,y)

pd.Series(md.feature_importances_, index=df.columns).sort_values(ascending=False)

region                  0.155287
counter_number          0.139989
ratio                   0.133872
interaction             0.128010
year_created            0.088485
consommation_level_1    0.074590
counter_code            0.051127
new_index               0.046064
ratio2                  0.042776
interaction2_r          0.029570
interaction2            0.028571
invoice_year            0.028164
consommation_level_2    0.016565
tarif_type              0.012512
invoice_month           0.011706
invoice_weekday         0.007055
consommation_level_3    0.003723
consommation_level_4    0.001934
dtype: float64

In [165]:
print_report(md, X,y, df_test, y_test)

Train set:
              precision    recall  f1-score   support

         0.0       0.99      0.98      0.99   3280562
         1.0       0.98      0.99      0.99   2952505

    accuracy                           0.99   6233067
   macro avg       0.99      0.99      0.99   6233067
weighted avg       0.99      0.99      0.99   6233067

Test set:
              precision    recall  f1-score   support

         0.0       0.98      0.95      0.96    820306
         1.0       0.56      0.80      0.66     70622

    accuracy                           0.94    890928
   macro avg       0.77      0.87      0.81    890928
weighted avg       0.95      0.94      0.94    890928

AUC = 0.88
[[776751  43555]
 [ 14243  56379]]


In [128]:
# precision 0.54, recall 0.78, f1 0.64, AUC 0.87

# precision 0.59, recall 0.81, f1 0.68, AUC 0.89  (max_depth=40 + 2 engineered features)

In [None]:
# precision 0.55, recall 0.80, f1 0.65, AUC 0.88  (no engineered features)