## Features:

### Client:

- Client_id: Unique id for client
- District: District where the client is
- Client_catg: Category client belongs to
- Region: Area where the client is
- Creation_date: Date client joined
- Target: fraud: 1, not fraud: 0


### Invoice data

- Client_id: Unique id for the client
- Invoice_date: Date of the invoice
- Tarif_type: Type of tax
- Counter_number:
- Counter_statue: takes up to 5 values such as working fine, not working, on hold status, etc.
- Counter_code:
- Reading_remarque: notes that the STEG agent takes during his visit to the client (e.g., if the counter shows something wrong, the agent gives a bad score)
- Counter_coefficient: An additional coefficient to be added when standard consumption is exceeded
- Consommation_level_1: Consumption_level_1
- Consommation_level_2: Consumption_level_2
- Consommation_level_3: Consumption_level_3
- Consommation_level_4: Consumption_level_4
- Old_index: Old index
- New_index: New index
- Months_number: Month number
- Counter_type: Type of counter


## Some findings

- the "test" set is the "competition" set (it has no targets). Therefore the "train" set must be split into train/test sets

- do not aggregate into a shorter table of customers. Instead, predict on TRANSACTIONS. (df.groupby('client_id').nunique())

- the proportion of positives is higher in the merged "transactions" table than in the "clients" table (about 0.08 vs. 0.06). This can be treated as a "perturbation" of the positives that increases their number (where they are scarce). It is one more argument in favour of predicting on transactions and then aggregating the predictions to get a prediction for a particular customer.

- the 'months_number' column does not contain actual months, and its values do not correspond to the 'creation_date' or 'invoice_date' columns. Either keep this column without any transformation or scaling, or delete it completely, because the test set contains this kind of weird values too.

- the features ['consommation_level_1', 'consommation_level_2', 'consommation_level_3', 'consommation_level_4'] are not very promising (in terms of building univariate logistic regressions on them)

- the 'counter_statue' column is supposed to hold integers [0-5] but is of mixed type (object) with some bogus values. Convert to int and drop the rows with values > 5, because the test set doesn't have any bad values in this column, only the valid integers from 0 to 5

- the search for a decent baseline model didn't give decent results. Try a non-deterministic baseline model based on the prior (i.e. the proportion of positives in the population)

- eventually agreed to predict transactions (and not fraudulent clients)

- rule-based baseline model on two rules (2005, higher consumption)
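
Predicting on transactions, as argued in the findings above, still needs a last step that turns per-invoice scores into one score per client. A minimal sketch of that step, assuming a hypothetical `proba` column of predicted fraud probabilities, the mean as the aggregation rule and a 0.5 cut-off (none of these choices is fixed by the notebook):

```python
import pandas as pd

# Toy invoice-level predictions; `proba` is a hypothetical column holding
# the model's predicted fraud probability for each invoice.
invoices = pd.DataFrame({
    "client_id": ["c1", "c1", "c1", "c2", "c2", "c3"],
    "proba":     [0.10, 0.20, 0.60, 0.70, 0.90, 0.05],
})

# One score per client: here the mean of the client's invoice scores.
# (max or a quantile are alternatives; the mean is an assumption.)
client_scores = invoices.groupby("client_id")["proba"].mean()

# Flag a client when the aggregated score crosses the (assumed) 0.5 cut-off.
client_pred = (client_scores >= 0.5).astype(int)
print(client_pred)
```

In this toy example only `c2` is flagged, because its mean score (0.8) is the only one above the cut-off.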


In [29]:
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, classification_report
from sklearn.metrics import fbeta_score, confusion_matrix, recall_score, precision_score
from sklearn.metrics import make_scorer

from sklearn.model_selection import train_test_split, cross_val_predict, cross_val_score

from sklearn.linear_model import LogisticRegression

In [2]:
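# feature preprocessing functions

def preprocess(feature, data):
    """Run the preprocessing function registered for `feature` (modifies `data` in place).

    `feature` is a column name or a positional column index.
    """
    functions = {'counter_statue': preprocess_counter_statue}
    if not isinstance(feature, str):
        feature = data.name if isinstance(data, pd.Series) else data.columns[feature]
    function = functions[feature]
    return function(data)


# preprocess 'counter_statue'
def preprocess_counter_statue(data):
    col = 'counter_statue'
    sr = data[col].astype(str)
    # the valid codes are the single digits 0-5; everything else is bogus
    mask = sr.isin(list("012345"))
    # replace the bogus values with the most frequent valid code
    sr[~mask] = sr[mask].mode().values[0]
    data[col] = sr.astype(int)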


In [3]:
path1 = "data/train/client_train.csv"
path2 = "data/train/invoice_train.csv"

path3 = "data/test/client_test.csv"
path4 = "data/test/invoice_test.csv"

In [4]:
# load the data

df1 = pd.read_csv(path1)
df2 = pd.read_csv(path2)   # low_memory=False

df3 = pd.read_csv(path3)
df4 = pd.read_csv(path4)


In [5]:
# join tables

# data from the "train" folder (will have to be split into train/test)
df = df1.merge(df2, on='client_id', how='outer')

# data from the "test" folder (doesn't contain targets)
df_test_zindi = df3.merge(df4, on='client_id', how='outer')


In [6]:
# not 1000% correct - must use the mode of the train set on the test set
preprocess('counter_statue', df)
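
The comment in the cell above points at a leak: the replacement value should be learned from the training rows only and then reused on the test rows. A sketch of that idea, assuming two hypothetical helpers, `fit_counter_statue_mode` and `apply_counter_statue`, that are not part of this notebook:

```python
import pandas as pd

VALID_CODES = list("012345")  # valid counter_statue codes, as strings

def fit_counter_statue_mode(train):
    """Most frequent valid code in the training data (hypothetical helper)."""
    sr = train["counter_statue"].astype(str)
    return sr[sr.isin(VALID_CODES)].mode().iloc[0]

def apply_counter_statue(data, fill_value):
    """Copy of `data` with bogus codes replaced by `fill_value`, cast to int."""
    out = data.copy()
    sr = out["counter_statue"].astype(str)
    out["counter_statue"] = sr.where(sr.isin(VALID_CODES), fill_value).astype(int)
    return out

# Toy mixed-type columns standing in for the real data.
train = pd.DataFrame({"counter_statue": [0, "0", "1", "A", 769, 0]})
test = pd.DataFrame({"counter_statue": ["5", "x", 1]})

mode = fit_counter_statue_mode(train)          # learned from the train rows only
train_clean = apply_counter_statue(train, mode)
test_clean = apply_counter_statue(test, mode)  # reuses the train mode
```

Unlike `preprocess_counter_statue`, these helpers return a cleaned copy instead of modifying the frame in place.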

In [7]:
# quick and dirty feature engineering

# 'creation_date' is stored as DD/MM/YYYY (e.g. 31/12/1994), so parse it day-first
df['year_created'] = pd.to_datetime(df['creation_date'], dayfirst=True).dt.year
dates = pd.to_datetime(df['invoice_date'])
df['invoice_year'] = dates.dt.year
df['invoice_month'] = dates.dt.month
df['invoice_weekday'] = dates.dt.weekday




In [8]:
"""
# interaction between the 4 features
df['mult'] = df[['consommation_level_1', 'consommation_level_2',
                 'consommation_level_3', 'consommation_level_4']].apply(np.multiply.reduce, axis=1)
"""

In [9]:
# split the data

# make random indices/mask
m = len(df)
p = .80   # for the train set
nx = np.random.permutation(m)[:int(m*p)]
mask = np.zeros(m).astype(bool)
mask[nx] = True

# split
df_train = df[mask]
df_test = df[~mask]
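
The split above is made row by row, so invoices of the same client can land in both parts. If that overlap is a concern, one alternative is a split grouped by `client_id`. A sketch using scikit-learn's `GroupShuffleSplit` on toy data (the toy frame and the `random_state` are assumptions, and this is not the split used in this notebook):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy stand-in for the merged table: 10 clients with 5 invoices each.
toy = pd.DataFrame({
    "client_id": np.repeat([f"client_{i}" for i in range(10)], 5),
    "new_index": np.arange(50),
})

# Keep roughly 80% of the clients (not of the rows) for training.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["client_id"]))

toy_train = toy.iloc[train_idx]
toy_test = toy.iloc[test_idx]

# No client appears on both sides of the split.
shared = set(toy_train["client_id"]) & set(toy_test["client_id"])
print(len(toy_train), len(toy_test), shared)
```

The trade-off is that the train/test proportions hold for clients, so the row counts can drift from 80/20 when clients have very different numbers of invoices.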


## TO DISCUSS: Which features to use 

In [10]:
df_train.head()

| | disrict | client_id | client_catg | region | creation_date | target | invoice_date | tarif_type | counter_number | counter_statue | ... | consommation_level_3 | consommation_level_4 | old_index | new_index | months_number | counter_type | year_created | invoice_year | invoice_month | invoice_weekday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | train_Client_0 | 11 | 101 | 31/12/1994 | 0.0 | 2014-03-24 | 11 | 1335667 | 0 | ... | 0 | 0 | 14302 | 14384 | 4 | ELEC | 1994 | 2014 | 3 | 0 |
| 1 | 60 | train_Client_0 | 11 | 101 | 31/12/1994 | 0.0 | 2013-03-29 | 11 | 1335667 | 0 | ... | 0 | 0 | 12294 | 13678 | 4 | ELEC | 1994 | 2013 | 3 | 4 |
| 3 | 60 | train_Client_0 | 11 | 101 | 31/12/1994 | 0.0 | 2015-07-13 | 11 | 1335667 | 0 | ... | 0 | 0 | 14747 | 14849 | 4 | ELEC | 1994 | 2015 | 7 | 0 |
| 4 | 60 | train_Client_0 | 11 | 101 | 31/12/1994 | 0.0 | 2016-11-17 | 11 | 1335667 | 0 | ... | 0 | 0 | 15066 | 15638 | 12 | ELEC | 1994 | 2016 | 11 | 3 |
| 5 | 60 | train_Client_0 | 11 | 101 | 31/12/1994 | 0.0 | 2017-07-17 | 11 | 1335667 | 0 | ... | 0 | 0 | 15638 | 15952 | 8 | ELEC | 1994 | 2017 | 7 | 0 |


### features to use:

* client_catg  - yes?
* region
* tarif_type

* counter_number - ?
* counter_statue
* counter_code - definitely
* reading_remarque - yes
* counter_coefficient - maybe

* consommation_level_1 ... consommation_level_4 - YES (baseline model)
* new_index - ?
* months_number - ? (did a lot of exploration - difficult to decide)
* counter_type - probably yes (common sense / domain knowledge)


### features to omit:

* disrict (this information is dependent on 'region' and is "stored" there already)
* client_id
* creation_date - ?
* invoice_date (feature engineer month / year?)
* old_index (almost perfectly correlates with 'new_index')
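
The note on `old_index` and `new_index` can be checked with a plain Pearson correlation. A sketch on synthetic meter readings (the numbers are made up for illustration; on the real table the equivalent call would be `df[['old_index', 'new_index']].corr()`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic readings: the new index is the old index plus a comparatively
# small consumption, which is what makes the two columns nearly collinear.
old_index = rng.integers(0, 50_000, size=1_000)
new_index = old_index + rng.integers(0, 500, size=1_000)
readings = pd.DataFrame({"old_index": old_index, "new_index": new_index})

# Pearson correlation between the two columns.
corr = readings["old_index"].corr(readings["new_index"])
print(round(corr, 4))
```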


### features to engineer:

* month
* year ?

## The first quick and dirty model (logistic regression)

In [11]:
df.columns

Index(['disrict', 'client_id', 'client_catg', 'region', 'creation_date',
       'target', 'invoice_date', 'tarif_type', 'counter_number',
       'counter_statue', 'counter_code', 'reading_remarque',
       'counter_coefficient', 'consommation_level_1', 'consommation_level_2',
       'consommation_level_3', 'consommation_level_4', 'old_index',
       'new_index', 'months_number', 'counter_type', 'year_created',
       'invoice_year', 'invoice_month', 'invoice_weekday'],
      dtype='object')

In [12]:
# treat these as continuous (i.e. scale them) or bin them (?)
discrete_features = ['counter_number', 'consommation_level_1', 'consommation_level_2',
       'consommation_level_3', 'consommation_level_4', 'new_index', 'months_number']

for col in discrete_features:
    print(col, df[col].nunique())

counter_number 201893
consommation_level_1 8295
consommation_level_2 12576
consommation_level_3 2253
consommation_level_4 12075
new_index 157980
months_number 1370


In [13]:
# these categorical features have 43 unique values or fewer (dummify them)
cat_features = ['client_catg', 'region', 'tarif_type',
       'counter_statue', 'counter_code', 'reading_remarque',
       'counter_coefficient', 'counter_type', 'year_created',   # counter_type = object?
       'invoice_year', 'invoice_month', 'invoice_weekday']

for col in cat_features:
    print(col, df[col].nunique())

client_catg 3
region 25
tarif_type 17
counter_statue 6
counter_code 42
reading_remarque 8
counter_coefficient 16
counter_type 2
year_created 43
invoice_year 43
invoice_month 12
invoice_weekday 7


In [14]:
# scale
df_scaled = (df[discrete_features] - df[discrete_features].mean(axis=0)) / df[discrete_features].std(axis=0)

In [15]:
# dummify
df_dummy = pd.get_dummies(df[cat_features], drop_first=True, columns=cat_features)

In [16]:
# concatenate
df_prep = pd.concat([df_scaled, df_dummy], axis=1)

In [17]:
# get the X,y for sklearn input
X = df_prep.values
y = df['target'].values.astype(int)

In [37]:
# build a model


######################################
# random sample
m = 100_000
nx = np.random.permutation(len(y))[:m]

X = df_prep.values
y = df['target'].values.astype(int)

X = X[nx]
y = ytrue = y[nx]
####################################


md = LogisticRegression(solver='newton-cholesky', max_iter=500)

ypred = cross_val_predict(md, X,y, cv=5)


# f1
f1 = cross_val_score(md, X,y, scoring='f1', cv=5)
print("f1:", f1.round(3), f1.mean().round(3))

# f2
scorer = make_scorer(fbeta_score, beta=2)
f2 = cross_val_score(md, X,y, scoring=scorer, cv=5)
print("f2:", f2.round(3), f2.mean().round(3))

# AUC
ppred = cross_val_predict(md, X,y, cv=5, method='predict_proba')[:,-1]
auc = roc_auc_score(ytrue, ppred)
print("AUC =", auc)


f1: [0.128 0.124 0.139 0.124 0.111] 0.125
f2: [0.085 0.082 0.093 0.082 0.073] 0.083
AUC = 0.7480235750605515
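
For context, the prior-based non-deterministic baseline mentioned in the findings can be sketched with scikit-learn's `DummyClassifier(strategy="stratified")`, which draws labels at random according to the class proportions seen in training. The synthetic labels below (about 8% positives) and the `random_state` are assumptions, so the printed score is not a result from this dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic labels with roughly 8% positives; the features are irrelevant here.
y_toy = (rng.random(20_000) < 0.08).astype(int)
X_toy = np.zeros((len(y_toy), 1))

# Predict 1 with probability equal to the training share of positives.
baseline = DummyClassifier(strategy="stratified", random_state=0)
y_base = cross_val_predict(baseline, X_toy, y_toy, cv=5)

print("share predicted positive:", y_base.mean().round(3))
print("f1:", round(f1_score(y_toy, y_base), 3))
```

With a positive rate p, such a baseline has expected precision and recall both close to p, so its F1 also sits near p. That gives a floor against which the logistic regression scores above can be read.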


In [38]:
print(classification_report(ytrue, ypred))

              precision    recall  f1-score   support

           0       0.92      1.00      0.96     91824
           1       0.83      0.07      0.13      8176

    accuracy                           0.92    100000
   macro avg       0.88      0.53      0.54    100000
weighted avg       0.92      0.92      0.89    100000
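
The report shows high precision but a recall of only 0.07 for the positive class at the default 0.5 cut-off. Since F2 weights recall more heavily than precision, one option is to choose the cut-off on the predicted probabilities instead. A sketch on synthetic scores (the data, and the idea of tuning the threshold at all, are assumptions rather than something done in this notebook):

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_recall_curve

rng = np.random.default_rng(0)

# Synthetic labels (~8% positives) and noisy scores that are higher for positives.
y_toy = (rng.random(5_000) < 0.08).astype(int)
p_toy = np.clip(0.15 + 0.25 * y_toy + rng.normal(0, 0.15, len(y_toy)), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_toy, p_toy)

# F-beta for every candidate threshold (the last precision/recall pair has
# no threshold attached, hence the [:-1]).
beta = 2
num = (1 + beta**2) * precision[:-1] * recall[:-1]
den = beta**2 * precision[:-1] + recall[:-1]
f2 = np.divide(num, den, out=np.zeros_like(num), where=den > 0)

best = thresholds[f2.argmax()]
f2_default = fbeta_score(y_toy, (p_toy >= 0.5).astype(int), beta=2)
f2_tuned = fbeta_score(y_toy, (p_toy >= best).astype(int), beta=2)
print(round(best, 3), round(f2_default, 3), round(f2_tuned, 3))
```

In practice the threshold would have to be picked on validation folds, not on the data used for the final score, otherwise the tuned F2 is optimistic.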



In [39]:
confusion_matrix(ytrue, ypred)

array([[91710,   114],
       [ 7622,   554]])
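
The headline numbers can be recomputed by hand from this matrix (rows are true classes, columns are predicted classes, as returned by scikit-learn's `confusion_matrix`). A short worked check:

```python
import numpy as np

# Confusion matrix from the cell above: rows = true class, columns = predicted.
cm = np.array([[91710, 114],
               [7622, 554]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # 554 / 668
recall = tp / (tp + fn)      # 554 / 8176

def fbeta(p, r, beta):
    """F-beta score from precision p and recall r."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)

f1_check = fbeta(precision, recall, 1)
f2_check = fbeta(precision, recall, 2)
print(f"precision={precision:.2f} recall={recall:.2f}")
print(f"f1={f1_check:.3f} f2={f2_check:.3f}")
```

This gives precision 0.83, recall 0.07, F1 0.125 and F2 0.083, in line with the classification report and with the mean cross-validated F1 and F2 printed earlier.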