<img src="https://raw.githubusercontent.com/imgremlin/Photos/master/electricity.jpg" width="1000px"> 
# Fraud Detection in Electricity and Gas Consumption Challenge
**by team GORNYAKI (Tsepa Oleksii and Samoshin Andriy [Ukraine, KPI, IASA])**

Thanks to the organizers for this [challenge](https://zindi.africa/competitions/ai-hack-tunisia-4-predictive-analytics-challenge-1) and everyone for participating! In this notebook you will find:

* importing libraries
* basic EDA
* feature engineering
* modelling
* prediction 
* submission

<h2>Importing libraries</h2>

In [3]:
#pip install bayesian-optimization



In [4]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from lightgbm import LGBMClassifier
import time
from bayes_opt import BayesianOptimization

seed=47

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [5]:
invoice_test = pd.read_csv('invoice_test.csv',low_memory=False)
invoice_train = pd.read_csv('invoice_train.csv',low_memory=False)
client_test = pd.read_csv('client_test.csv',low_memory=False)
client_train = pd.read_csv('client_train.csv',low_memory=False)
sample_submission = pd.read_csv('submission_fraud-3.csv',low_memory=False)

<h2>Basic EDA</h2>

We won't show the full EDA; we only want to draw your attention to the observations that helped us reach a good score.

The next cells show value counts for the target and the categorical client columns, along with missing-value and unique-value counts. We use this information in feature engineering.

In [None]:
ds = client_train.groupby(['target'])['client_id'].count()
plt.bar(x=ds.index, height=ds.values, tick_label =[0,1])
plt.title('target distribution')
plt.show()

In [None]:
for col in ['disrict','region','client_catg']:
    ds = client_train.groupby([col])['client_id'].count()
    plt.bar(x=ds.index, height=ds.values)
    plt.title(col+' distribution')
    plt.show()

In [None]:
print('Number of missing values in invoice_train:',invoice_train.isna().sum().sum())
print('Number of missing values in invoice_test:',invoice_test.isna().sum().sum(),'\n')
print('Number of missing values in client_train:',client_train.isna().sum().sum())
print('Number of missing values in client_test:',client_test.isna().sum().sum())
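
Note that `isna().sum().sum()` counts missing cells, not rows. A minimal sketch on a made-up frame (the `toy` data is an assumption, not the competition data) shows the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the real invoice/client tables are not used here.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [np.nan, np.nan, 6.0],
})

missing_cells = int(toy.isna().sum().sum())       # every NaN cell counted once
missing_rows = int(toy.isna().any(axis=1).sum())  # rows with at least one NaN

print(missing_cells, missing_rows)
```

If the two numbers differ a lot, the missing values are concentrated in a few rows.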

In [None]:
print('Number of unique values in invoice_train:')
for col in invoice_train.columns:
    print(f"{col} - {invoice_train[col].nunique()}")

<h2>Feature engineering</h2>

In this part we explain the most impactful part of our notebook: feature creation.

In [6]:
def feature_change(cl, inv):

    cl['client_catg'] = cl['client_catg'].astype('category')
    cl['disrict'] = cl['disrict'].astype('category')
    cl['region'] = cl['region'].astype('category')
    cl['region_group'] = cl['region'].apply(lambda x: 100 if x<100 else 300 if x>300 else 200)
    cl['creation_date'] = pd.to_datetime(cl['creation_date'])
    
    cl['coop_time'] = (2019 - cl['creation_date'].dt.year)*12 - cl['creation_date'].dt.month

    inv['counter_type'] = inv['counter_type'].map({"ELEC":1,"GAZ":0})
    inv['counter_statue'] = inv['counter_statue'].map({0:0,1:1,2:2,3:3,4:4,5:5,769:5,'0':0,'5':5,'1':1,'4':4,'A':0,618:5,269375:5,46:5,420:5})
    
    inv['invoice_date'] = pd.to_datetime(inv['invoice_date'], dayfirst=True)
    inv['invoice_month'] = inv['invoice_date'].dt.month
    inv['invoice_year'] = inv['invoice_date'].dt.year
    inv['is_weekday'] = ((pd.DatetimeIndex(inv.invoice_date).dayofweek) // 5 == 1).astype(float)
    inv['delta_index'] = inv['new_index'] - inv['old_index']
    
    return cl, inv

* 'client_catg', 'disrict' and 'region' were cast to the category dtype so that LightGBM uses them as categorical features (in our experience, LightGBM's default handling of categorical features works slightly better than other encoders such as CatBoost or target encoding)
* 'region_group' was created by splitting 'region' into 3 groups (we assumed that the region codes weren't assigned randomly)
* 'coop_time' is the time since account creation in months, counted up to the start of 2019
* 'counter_type' was binary encoded
* 'counter_statue' was cleaned of mislabeled values
* month and year were extracted from 'invoice_date'; we also added the binary feature 'is_weekday' (as computed, it equals 1 for invoices dated on a Saturday or Sunday)
* we are not sure 'delta_index' has a clear logical meaning, but it improved the score
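
A small sketch of the date features above, on hypothetical dates (the `toy_inv` frame and the `creation` timestamp are illustrative assumptions). It shows which days the `dayofweek // 5 == 1` expression flags and how the `coop_time` formula works out for one account:

```python
import pandas as pd

# Hypothetical invoice dates: Fri 2019-03-01 through Mon 2019-03-04.
toy_inv = pd.DataFrame({
    "invoice_date": pd.to_datetime(
        ["2019-03-01", "2019-03-02", "2019-03-03", "2019-03-04"]
    )
})

# dayofweek: Monday=0 ... Sunday=6, so integer division by 5 gives 1
# only for Saturday (5) and Sunday (6).
toy_inv["is_weekday"] = (toy_inv["invoice_date"].dt.dayofweek // 5 == 1).astype(float)

# Same coop_time formula as in feature_change, for an account created in May 2017.
creation = pd.Timestamp("2017-05-10")
coop_time = (2019 - creation.year) * 12 - creation.month

print(toy_inv["is_weekday"].tolist(), coop_time)
```

So the flag marks weekend invoices, and the account above gets 19 months of cooperation.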

In [7]:
client_train1, invoice_train1 = feature_change(client_train, invoice_train)
client_test1, invoice_test1 = feature_change(client_test, invoice_test)



In [8]:
def agg_feature(invoice, client_df, agg_stat):
    
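    # Days since the client's previous invoice. The result keeps the original row index,
    # so the assignment aligns each gap with its own invoice row.
    invoice['delta_time'] = invoice.sort_values(['client_id','invoice_date']).groupby('client_id')['invoice_date'].diff().dt.days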
    agg_trans = invoice.groupby('client_id')[agg_stat+['delta_time']].agg(['mean','std','min','max'])
    
    agg_trans.columns = ['_'.join(col).strip() for col in agg_trans.columns.values]
    agg_trans.reset_index(inplace=True)

    df = invoice.groupby('client_id').size().reset_index(name='transactions_count')
    agg_trans = pd.merge(df, agg_trans, on='client_id', how='left')
    
    weekday_avg = invoice.groupby('client_id')[['is_weekday']].agg(['mean'])
    weekday_avg.columns = ['_'.join(col).strip() for col in weekday_avg.columns.values]
    weekday_avg.reset_index(inplace=True)
    client_df = pd.merge(client_df, weekday_avg, on='client_id', how='left')
    
    full_df = pd.merge(client_df, agg_trans, on='client_id', how='left')
    
    full_df['invoice_per_cooperation'] = full_df['transactions_count'] / full_df['coop_time']
    
    return full_df

* created aggregation features (min/max/mean/std) over the numeric invoice columns for every client
* added 'delta_time', the number of days between consecutive invoices of each client
* created 'invoice_per_cooperation', the number of invoices per month of cooperation
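
The aggregation steps can be sketched on a tiny, made-up invoice table (the `toy` frame and its values are assumptions for illustration only). It computes the per-client gap between invoices and flattens the two-level column index produced by `agg`:

```python
import pandas as pd

# Hypothetical invoices, deliberately not sorted by client or date.
toy = pd.DataFrame({
    "client_id": ["b", "a", "a", "b", "a"],
    "invoice_date": pd.to_datetime(
        ["2019-01-10", "2019-01-01", "2019-01-31", "2019-01-20", "2019-03-02"]
    ),
    "delta_index": [5, 10, 30, 15, 20],
})

# Days since the client's previous invoice. The result keeps the original row
# labels, so assigning it back aligns each gap with the right invoice.
toy["delta_time"] = (
    toy.sort_values(["client_id", "invoice_date"])
       .groupby("client_id")["invoice_date"]
       .diff()
       .dt.days
)

# Per-client statistics; the two-level column index is flattened to
# names such as 'delta_index_mean'.
agg = toy.groupby("client_id")[["delta_index", "delta_time"]].agg(["mean", "max"])
agg.columns = ["_".join(col) for col in agg.columns]
agg = agg.reset_index()

print(agg)
```

The first invoice of each client has no predecessor, so its gap is NaN and is skipped by the aggregations.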

In [9]:
agg_stat_columns = [
 'tarif_type',
 'counter_number',
 'counter_statue',
 'counter_code',
 'reading_remarque',
 'consommation_level_1',
 'consommation_level_2',
 'consommation_level_3',
 'consommation_level_4',
 'old_index',
 'new_index',
 'months_number',
 'counter_type',
 'invoice_month',
 'invoice_year',
 'delta_index'
]

train_df1 = agg_feature(invoice_train1, client_train1, agg_stat_columns)
test_df1 = agg_feature(invoice_test1, client_test1, agg_stat_columns)

In [10]:
def new_features(df):
    
    for col in agg_stat_columns:
        df[col+'_range'] = df[col+'_max'] - df[col+'_min']
        df[col+'_max_mean'] = df[col+'_max']/df[col+'_mean']
    
    return df

We also created 'max_mean' (max divided by mean) and 'range' (max minus min) features, which noticeably improved the score.
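
A minimal sketch of these two features on hypothetical aggregates for a single column `x`. The clean-up step at the end is an assumption and not part of the original pipeline; it only illustrates what happens when the mean is zero:

```python
import numpy as np
import pandas as pd

# Hypothetical per-client aggregates for a single column 'x'.
toy = pd.DataFrame({
    "x_min":  [0.0, 2.0, 0.0],
    "x_max":  [10.0, 8.0, 0.0],
    "x_mean": [4.0, 5.0, 0.0],
})

toy["x_range"] = toy["x_max"] - toy["x_min"]
toy["x_max_mean"] = toy["x_max"] / toy["x_mean"]   # 0/0 yields NaN, x/0 yields inf

# Optional clean-up (an assumption, not in the original notebook):
# turn infinities into NaN so they are treated as ordinary missing values.
toy["x_max_mean"] = toy["x_max_mean"].replace([np.inf, -np.inf], np.nan)

print(toy)
```

LightGBM handles NaN natively, so missing ratios do not need to be imputed before training.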

In [11]:
train_df2 = new_features(train_df1)
test_df2 = new_features(test_df1)

Now let's review how many features we created:

In [12]:
print('Initial number of columns: ', len(client_train.columns)+len(invoice_train.columns))
print('Number of columns now: ', len(train_df2.columns))

Initial number of columns:  29
Number of columns now:  111


In [13]:
def drop(df):

    col_drop = ['client_id', 'creation_date']
    for col in col_drop:
        df.drop([col], axis=1, inplace=True)
    return df

* we created a lot of features and, of course, not all of them were useful, so we dropped some unnecessary columns in the next few cells
* the 'drop_col' list was produced by our own backward feature selection function
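
The selection function itself is not included in this notebook. As a rough illustration only, a generic greedy backward elimination could look like the sketch below; `backward_select` and `toy_score` are hypothetical names, and in practice the scorer would be something like a cross-validated ROC AUC on the chosen columns:

```python
from typing import Callable, List, Sequence, Tuple

def backward_select(
    features: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> Tuple[List[str], List[str]]:
    """Greedy backward elimination.

    Repeatedly drops the single feature whose removal improves the score
    the most, and stops when no removal helps. Returns (kept, dropped).
    """
    kept = list(features)
    dropped: List[str] = []
    best = score_fn(kept)
    while len(kept) > 1:
        # Score every candidate set with exactly one feature removed.
        trials = [(score_fn([f for f in kept if f != c]), c) for c in kept]
        top_score, top_col = max(trials)
        if top_score <= best:
            break
        best = top_score
        kept.remove(top_col)
        dropped.append(top_col)
    return kept, dropped


# Toy scorer standing in for cross-validated ROC AUC: two columns are "noise"
# and each costs 0.01, every other column adds 0.1.
noise = {"n1", "n2"}
def toy_score(cols: List[str]) -> float:
    return sum(-0.01 if c in noise else 0.1 for c in cols)

kept, dropped = backward_select(["a", "n1", "b", "n2"], toy_score)
print(kept, sorted(dropped))
```

Each round costs one model evaluation per remaining feature, so with around a hundred features and 5-fold cross-validation this kind of search is slow but straightforward.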

In [14]:
train_df = drop(train_df2)
test_df = drop(test_df2)

In [15]:
y = train_df['target']
X = train_df.drop('target',axis=1)

feature_name = X.columns.tolist()

In [16]:
drop_col=['reading_remarque_max','counter_statue_min','counter_type_min','counter_type_max','counter_type_range',
          'tarif_type_max', 'delta_index_min', 'consommation_level_4_mean']

X = X.drop(drop_col, axis=1)
test_df = test_df.drop(drop_col, axis=1)

In [17]:
# Define the objective function for Bayesian Optimization
def lgbm_cv(n_estimators, num_leaves, max_depth, learning_rate, min_split_gain, feature_fraction, bagging_freq):
    params = {
        'n_estimators': int(n_estimators),
        'num_leaves': int(num_leaves),
        'max_depth': int(max_depth),
        'learning_rate': learning_rate,
        'min_split_gain': min_split_gain,
        'feature_fraction': feature_fraction,
        'bagging_freq': int(bagging_freq),
        'verbose': -1,
        'random_state': seed
    }

    # Perform cross-validation
    stkfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, valid_idx in stkfold.split(X, y):
        X_train, X_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y[train_idx], y[valid_idx]
        
        model = LGBMClassifier(**params)
        model.fit(X_train, y_train)
        preds = model.predict_proba(X_valid)[:, 1]
        score = roc_auc_score(y_valid, preds)
        scores.append(score)

    return np.mean(scores)

In [18]:
# Define the parameter space for Bayesian Optimization
pbounds = {
    'n_estimators': (200, 1000),
    'num_leaves': (2, 512),
    'max_depth': (2, 128),
    'learning_rate': (0.001, 0.15),
    'min_split_gain': (0.001, 0.1),
    'feature_fraction': (0.1, 1.0),
    'bagging_freq': (1, 10)
}

In [19]:
# Perform Bayesian Optimization
optimizer = BayesianOptimization(
    f=lgbm_cv,
    pbounds=pbounds,
    random_state=seed,
    verbose=2
)

In [31]:
optimizer.maximize(init_points=5, n_iter=15)

|   iter    |  target   | baggin... | featur... | learni... | max_depth | min_sp... | n_esti... | num_le... |
-------------------------------------------------------------------------------------------------------------
| 7         | 0.8799    | 9.48      | 0.7705    | 0.04081   | 47.59     | 0.05311   | 637.5     | 133.9     |
| 8         | 0.884     | 2.572     | 0.4246    | 0.02189   | 51.02     | 0.04772   | 975.1     | 76.24     |
| 9         | 0.8819    | 5.628     | 0.5749    | 0.04648   | 22.12     | 0.06009   | 283.0     | 291.7     |
| 10        | 0.8749    | 4.477     | 0.1762    | 0.08474   | 83.66     | 0.06653   | 351.4     | 488.6     |
| 11        | 0.8737    | 1.531

In [32]:
# Retrieve the best hyperparameters
best_params = optimizer.max['params']
print("Best hyperparameters:", best_params)

Best hyperparameters: {'bagging_freq': 6.468928523306753, 'feature_fraction': 0.3712885964713344, 'learning_rate': 0.018241924681123255, 'max_depth': 105.182533635031, 'min_split_gain': 0.04956066394995558, 'n_estimators': 816.9058482514012, 'num_leaves': 89.0885086859329}


In [None]:
{
    'bagging_freq': 6.767701100293832, 'feature_fraction': 0.3899871532625311, 
    'learning_rate': 0.028638626573307676, 'max_depth': 117.56638717872684, 
    'min_split_gain': 0.02782115953410223, 'n_estimators': 418.838313277975, 
    'num_leaves': 488.75046802325676
}

In [33]:
# Convert certain hyperparameters to integer type
best_params['n_estimators'] = int(best_params['n_estimators'])
best_params['num_leaves'] = int(best_params['num_leaves'])
best_params['max_depth'] = int(best_params['max_depth'])
best_params['bagging_freq'] = int(best_params['bagging_freq'])

print("Best hyperparameters:", best_params)

Best hyperparameters: {'bagging_freq': 6, 'feature_fraction': 0.3712885964713344, 'learning_rate': 0.018241924681123255, 'max_depth': 105, 'min_split_gain': 0.04956066394995558, 'n_estimators': 816, 'num_leaves': 89}


In [34]:
# Define the model using the best hyperparameters
model = LGBMClassifier(**best_params)

In [35]:
stkfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

In [36]:
def calc(X, y, model, cv):
    res=[]
    local_probs=pd.DataFrame()
    probs = pd.DataFrame()

    for i, (tdx, vdx) in enumerate(cv.split(X, y)):
        X_train, X_valid, y_train, y_valid = X.iloc[tdx], X.iloc[vdx], y[tdx], y[vdx]
        model.fit(X_train, y_train,
                 eval_set=[(X_train, y_train), (X_valid, y_valid)])
        
        preds = model.predict_proba(X_valid)
        oof_predict = model.predict_proba(test_df)
        local_probs['fold_%i'%i] = oof_predict[:,1]
        res.append(roc_auc_score(y_valid, preds[:,1]))

    print('ROC AUC:', round(np.mean(res), 6))    
    local_probs['res'] = local_probs.mean(axis=1)
    probs['target'] = local_probs['res']
    
    return probs
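
The blending step inside `calc` is a plain row-wise mean of the test-set probabilities predicted by each fold's model. A minimal sketch with made-up numbers (the `fold_preds` values are assumptions, not real model output):

```python
import numpy as np
import pandas as pd

# Hypothetical test-set probabilities from three fold models (4 test rows).
fold_preds = [
    np.array([0.10, 0.80, 0.30, 0.50]),
    np.array([0.20, 0.70, 0.30, 0.40]),
    np.array([0.30, 0.90, 0.30, 0.60]),
]

local_probs = pd.DataFrame({f"fold_{i}": p for i, p in enumerate(fold_preds)})

# Row-wise mean across folds is the blended prediction for each test row.
blended = local_probs.mean(axis=1)

print(blended.round(2).tolist())
```

Averaging over folds usually gives a more stable prediction than any single fold model, at the cost of fitting the model once per fold.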

In [37]:
%%time
probs = calc(X, y, model, stkfold)

[LightGBM] [Info] Number of positive: 6053, number of negative: 102341
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.120325 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 20092
[LightGBM] [Info] Number of data points in the train set: 108394, number of used features: 100
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.055843 -> initscore=-2.827756
[LightGBM] [Info] Start training from score -2.827756
[LightGBM] [Info] Number of positive: 6053, number of negative: 102341
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.125552 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 20076
[LightGBM] [Info] Number of data points in the train set: 108394, number of used features: 100
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.055843 -> initscore=-2.827756
[LightGBM] [Info] Start training from score -2.827756

<h2>Modelling</h2>

* we used [optuna](https://optuna.org/) for hyperparameter tuning
* tuning was performed with 5-fold StratifiedKFold cross-validation
* you can check the tuned parameters and their final values in the cells below
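
The cross-validation scheme used as the tuning objective can be sketched as follows. This is an illustration under stated assumptions: the data is synthetic (`make_classification`), and `LogisticRegression` is only a stand-in for `LGBMClassifier` so that the example runs with scikit-learn alone:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data (about 10% positives) as a stand-in.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=47
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=47)
fold_auc = []
for train_idx, valid_idx in cv.split(X_toy, y_toy):
    clf = LogisticRegression(max_iter=1000)  # stand-in for LGBMClassifier
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    proba = clf.predict_proba(X_toy[valid_idx])[:, 1]
    fold_auc.append(roc_auc_score(y_toy[valid_idx], proba))

# Stratification keeps the positive rate nearly identical in every fold.
pos_rates = [y_toy[v].mean() for _, v in cv.split(X_toy, y_toy)]
print(round(float(np.mean(fold_auc)), 4), np.round(pos_rates, 3))
```

With a rare positive class, stratified splitting matters: a plain shuffle could leave some folds with very few fraud cases and make the per-fold AUC noisy.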

In [None]:
from optuna import Trial
import gc
import optuna
from sklearn.model_selection import train_test_split
import lightgbm as lgb

category_cols = ['disrict', 'client_catg', 'region']

def objective(trial:Trial):
    
    gc.collect()
    models=[]
    validScore=0
   
    model,log = fitLGBM(trial,X,y)
    
    models.append(model)
    gc.collect()
    validScore+=log
    validScore/=len(models)
    
    return validScore

In [None]:
def fitLGBM(trial,X, y):
    
    params={
      'n_estimators':trial.suggest_int('n_estimators', 0, 1000), 
      'num_leaves':trial.suggest_int('num_leaves', 2, 512),
      'max_depth':trial.suggest_int('max_depth', 2, 128),
      # suggest_float replaces the deprecated suggest_loguniform / suggest_uniform
      'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.15, log=True),
      'min_split_gain': trial.suggest_float('min_split_gain', 0.001, 0.1, log=True),
      'feature_fraction':trial.suggest_float('feature_fraction', 0.1, 1.0),
      # suggest_int expects integer bounds
      'bagging_freq':trial.suggest_int('bagging_freq', 1, 10),
      'verbosity': -1,
      'random_state':seed
            }
    stkfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    model = LGBMClassifier(**params)
    
    res=[]
    for i, (tdx, vdx) in enumerate(stkfold.split(X, y)):
        X_train, X_valid, y_train, y_valid = X.iloc[tdx], X.iloc[vdx], y[tdx], y[vdx]
        model.fit(X_train, y_train,
                 eval_set=[(X_train, y_train), (X_valid, y_valid)])
        preds = model.predict_proba(X_valid)
        res.append(roc_auc_score(y_valid, preds[:,1]))
    
    err = np.mean(res)
    
    return model, err

In [None]:
# hyperparameter tuning with Optuna
study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner(n_warmup_steps=5))
study.optimize(objective, timeout=60*60*2)

In [None]:
# Retrieve the best hyperparameters
best_params = study.best_params

In [None]:
# Define the model using the best hyperparameters
model = LGBMClassifier(**best_params)

In [None]:
'''model = LGBMClassifier(random_state=seed, n_estimators=830,num_leaves=454, max_depth=61,
                       learning_rate=0.006910869038433314, min_split_gain=0.00667926424629105, 
                       feature_fraction=0.3764303138879782, bagging_freq=8, early_stopping_rounds=30,
                 verbose=-1)'''

stkfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)

def calc(X, y, model, cv):
    res=[]
    local_probs=pd.DataFrame()
    probs = pd.DataFrame()

    for i, (tdx, vdx) in enumerate(cv.split(X, y)):
        X_train, X_valid, y_train, y_valid = X.iloc[tdx], X.iloc[vdx], y[tdx], y[vdx]
        model.fit(X_train, y_train,
                 eval_set=[(X_train, y_train), (X_valid, y_valid)])
        
        preds = model.predict_proba(X_valid)
        oof_predict = model.predict_proba(test_df)
        local_probs['fold_%i'%i] = oof_predict[:,1]
        res.append(roc_auc_score(y_valid, preds[:,1]))

    print('ROC AUC:', round(np.mean(res), 6))    
    local_probs['res'] = local_probs.mean(axis=1)
    probs['target'] = local_probs['res']
    
    return probs

<h2>Prediction and submission</h2>

In the next few cells you can see our local cross-validation score, which almost matches the LB score.

In [None]:
%%time
probs = calc(X, y, model, stkfold)

In [None]:
submission = pd.DataFrame({
        "client_id": sample_submission["client_id"],
        "target": probs['target']
    })
submission.to_csv('submission-1.csv', index=False)
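
The cell above pairs predictions with `client_id` by row position, which works as long as the test rows follow the same order as the sample submission. A hedged alternative, sketched on made-up ids and scores (`test_ids`, `scores` and `sample` are hypothetical stand-ins for the notebook's objects), joins on `client_id` instead:

```python
import pandas as pd

# Hypothetical ids and scores; in the notebook these would come from
# client_test and from calc().
test_ids = pd.Series(["c3", "c1", "c2"], name="client_id")
scores = pd.Series([0.30, 0.10, 0.20], name="target")
sample = pd.DataFrame({"client_id": ["c1", "c2", "c3"]})

# Attach ids to predictions, then join on client_id so row order no longer matters.
pred = pd.concat([test_ids, scores], axis=1)
submission_toy = sample.merge(pred, on="client_id", how="left")

# Sanity checks before writing the file.
assert submission_toy["target"].notna().all()
assert submission_toy["target"].between(0, 1).all()

print(submission_toy)
```

The two checks catch the most common submission problems: ids without a prediction and values outside the probability range.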

To sum up, at the time of publication of this notebook, we held 4th place in this competition! Thank you for reading, we look forward to your comments!

<img src="https://raw.githubusercontent.com/imgremlin/Photos/master/lb.png" width="700"> 