## Context

Aggressive driving behavior is the leading factor in traffic accidents. As reported by the AAA Foundation for Traffic Safety, 106,727 fatal crashes (55.7% of the total) during a recent four-year period involved drivers who committed one or more aggressive driving actions. So how can dangerous driving behavior be predicted quickly and accurately?

## Solution approach

Aggressive driving includes speeding, sudden braking, and sudden left or right turns. All of these events are reflected in accelerometer and gyroscope data. Since almost everyone now owns a smartphone with a wide range of sensors, we used data from an Android data-collection application based on the accelerometer and gyroscope sensors to classify driver behavior with the following classifiers:
    
   * CatBoost
   * LightGBM
   * XGBoost
   * Ensemble of the classifiers above

This project uses feature selection, a dimensionality-reduction technique that aims to choose a subset of relevant features from the original ones by removing irrelevant, redundant, or noisy features. Feature selection can often lead to better learning performance, higher accuracy, lower computational cost, and better model interpretability. This work uses the Binary Fish School Search (BFSS) algorithm for feature selection, with the goal of increasing classifier accuracy while reducing the number of attributes used in the classification task.
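
To make the binary encoding concrete, here is a minimal sketch of how a 0/1 position vector could be mapped to a feature subset. The gyroscope column names and the `select_features` helper are assumptions made for illustration; this is not the internals of the `BFSS` module.

```python
# Illustrative only: the gyroscope column names are assumed.
FEATURES = ["AccX", "AccY", "AccZ", "GyroX", "GyroY", "GyroZ"]

def select_features(position, features=FEATURES):
    """Return the features whose bit is 1 in a binary position vector."""
    if len(position) != len(features):
        raise ValueError("position and features must have the same length")
    return [name for bit, name in zip(position, features) if bit == 1]

mask = [1, 1, 1, 0, 0, 0]  # keep the three accelerometer axes
selected = select_features(mask)
print(selected)
```

Each candidate solution in a binary search of this kind is just such a mask, so the search space for six features has 2^6 = 64 possible subsets (including the empty one).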

Finding the best model for a problem is not only a matter of choosing the model type. It is also necessary to find the parameters that let each model perform best on the given dataset. This is known as hyperparameter search (or tuning). In this work we use the following algorithms for that purpose:
    

   * Particle Swarm Optimization (PSO)
   * Genetic Algorithm

Accuracy, AUC, precision, and recall are considered in the fitness evaluation. The optimization algorithms were evaluated using the classifiers listed above. The Driving Behavior dataset was used to train and evaluate the algorithms. The results show that the method is useful for reducing training time and increasing accuracy.
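
For readers unfamiliar with macro averaging, the following sketch computes accuracy, macro precision, and macro recall by hand on made-up labels. The `macro_scores` helper and the toy predictions are illustrative assumptions; the notebook itself relies on `sklearn.metrics`.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus unweighted (macro) mean of per-class precision and recall."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)

# Made-up labels using the three classes of the dataset
y_true = ["SLOW", "SLOW", "NORMAL", "NORMAL", "AGGRESSIVE", "AGGRESSIVE"]
y_pred = ["SLOW", "NORMAL", "NORMAL", "NORMAL", "AGGRESSIVE", "SLOW"]
accuracy, precision, recall = macro_scores(y_true, y_pred)
print(accuracy, precision, recall)
```

Macro averaging gives every class the same weight regardless of how many samples it has, which matters when the class counts differ between splits.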


Driving Behavior: https://www.kaggle.com/datasets/outofskills/driving-behavior


In [2]:
import pandas as pd
import numpy as np
import pickle
from category_encoders.cat_boost import CatBoostEncoder
from category_encoders.target_encoder import TargetEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from catboost import CatBoostClassifier, Pool
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
import joblib
from BFSS import Fish, BFSS
from integer_pso import IntOptimizerPSO

2022-08-11 22:41:06,588 - numexpr.utils - INFO - NumExpr defaulting to 2 threads.
  import pandas.util.testing as tm


In [3]:
df_train = pd.read_csv('/data/train_motion_data.csv')
df_test = pd.read_csv('/data/test_motion_data.csv')

In [4]:
df_train['Class'].value_counts()

SLOW          1331
NORMAL        1200
AGGRESSIVE    1113
Name: Class, dtype: int64

In [5]:
X_test = df_test.drop(columns = ['Class', 'Timestamp'])
y_test = df_test['Class']

X_train = df_train.drop(columns = ['Class', 'Timestamp'])
y_train = df_train['Class']

In [6]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 42)

In [7]:
print(y_train.value_counts())
print(y_val.value_counts())
print(y_test.value_counts())

SLOW          1082
NORMAL         931
AGGRESSIVE     902
Name: Class, dtype: int64
NORMAL        269
SLOW          249
AGGRESSIVE    211
Name: Class, dtype: int64
SLOW          1273
NORMAL         997
AGGRESSIVE     814
Name: Class, dtype: int64


In [8]:
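def normalize(X, features, tipo_dado, name_norm="norm.pkl"):
    """Min-max scale `features`; fit on training data, reuse the saved scaler otherwise."""
    X = X.copy()  # work on a copy so the caller's frame is not modified in place
    if tipo_dado == 'Treinamento':
        # Fit the scaler on the training split only and persist it to disk
        scaler = MinMaxScaler().fit(X[features])
        X[features] = scaler.transform(X[features])
        joblib.dump(scaler, name_norm)
    elif tipo_dado == 'Teste':
        # Reuse the scaler fitted on the training split
        scaler = joblib.load(name_norm)
        X[features] = scaler.transform(X[features])
    return X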

In [9]:
X_train = normalize(X_train, X_train.columns, 'Treinamento')
X_val = normalize(X_val, X_train.columns, 'Teste')
X_test = normalize(X_test, X_train.columns, 'Teste')
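
The idea behind fitting the scaler on the training split and reusing it elsewhere can be sketched without any library. The helper names and the sample values below are assumptions for illustration only.

```python
def fit_min_max(column):
    """Learn the scaling range from the training data only."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Scale values with a previously learned range."""
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in column]

train = [2.0, 4.0, 6.0]
test = [1.0, 5.0, 8.0]

lo, hi = fit_min_max(train)
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)
print(train_scaled, test_scaled)
```

Because the range comes from the training data, validation and test values can fall outside [0, 1]; that is expected and avoids leaking information from those splits into the scaler.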

In [12]:
import random
class FitnessFuncion(object):
    # Read by BFSS; presumably the problem dimension and the min/max number of features
    dim = 6
    minf = 2
    maxf = 6

    @staticmethod
    def evaluate(position):
        # An all-zero mask selects nothing; fall back to two randomly chosen features
        if (position == [0]*6).all():
            position = np.array([1,1,0,0,0,0])
            np.random.shuffle(position)
        # Map the binary mask to column names
        features = X_train.columns
        idx_selected = np.where(position == 1)[0]
        selected_features = [features[idx] for idx in idx_selected]

        X_train_alpha = X_train[selected_features]
        X_val_alpha = X_val[selected_features]

        catboost = CatBoostClassifier(verbose=False, max_depth=4)
        train_data = Pool(data=X_train_alpha,label=y_train)
        eval_data = Pool(data=X_val_alpha,label=y_val)
        catboost.fit(X=train_data, eval_set=eval_data,plot=False)

        # Despite the acc_* names, these are macro one-vs-one ROC AUC scores
        y_train_catboost = catboost.predict_proba(X_train_alpha)
        y_val_catboost = catboost.predict_proba(X_val_alpha)
        acc_train = metrics.roc_auc_score(y_train,y_train_catboost, multi_class = 'ovo', average = 'macro')
        acc_val = metrics.roc_auc_score(y_val,y_val_catboost, multi_class = 'ovo', average = 'macro')
        cost = acc_val

        return cost, acc_val, acc_train, selected_features


## BFSS for Feature Selection

In [12]:
bfss = BFSS(FitnessFuncion, 20, 20, 1, 1.2, 0.8, 0.7, 0.2, 0.6)

In [13]:
bfss.optimize()

In [15]:
bfss.best_agent.features

['AccX', 'AccY', 'AccZ']

In [16]:
bfss.best_agent.cost

0.6041160831890905

In [11]:
X_train = X_train[['AccX', 'AccY', 'AccZ']]
X_val = X_val[['AccX', 'AccY', 'AccZ']]
X_test = X_test[['AccX', 'AccY', 'AccZ']]

### Creating DataFrame to Compare

In [71]:
df_comparison = pd.DataFrame()

## PSO for Hyperparameter Tuning

### CatBoost

#### Before Optimization

In [72]:
catboost = CatBoostClassifier(random_state = 42, verbose = 0)

train_data = Pool(data=X_train,label=y_train)
eval_data = Pool(data=X_val,label=y_val)
catboost.fit(X=train_data, eval_set=eval_data,plot=False)

y_test_catboost = catboost.predict_proba(X_test)
pred_test_catboost = catboost.predict(X_test)

In [73]:
accuracy = metrics.accuracy_score(y_test, pred_test_catboost)
auc = metrics.roc_auc_score(y_test, y_test_catboost, multi_class = 'ovo', average = 'macro')
precision = metrics.precision_score(y_test, pred_test_catboost, average = 'macro')
recall = metrics.recall_score(y_test, pred_test_catboost, average = 'macro')

df_comparison = pd.concat([df_comparison, pd.DataFrame({
    'Model': ['CatBoost Plain'],
    'Accuracy': [accuracy],
    'AUC': [auc],
    'Precision': [precision],
    'Recall': [recall]
})])

print(f'Accuracy: {accuracy} \t AUC: {auc} \t Precision: {precision} \t Recall: {recall}')

Accuracy: 0.4477950713359274 	 AUC: 0.5842604123237461 	 Precision: 0.4108903357780771 	 Recall: 0.4200217527473451


In [74]:
df_comparison

|   | Model | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|---|
| 0 | CatBoost Plain | 0.447795 | 0.58426 | 0.41089 | 0.420022 |


#### Optimization

In [16]:
import tqdm.notebook as tq
from itertools import product
import statistics
from collections import Counter
import pyswarms as ps

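@ps.cost
def parab(X):
    # X is one particle's position: [max_depth, l2_leaf_reg]
    md = X[0]
    l2 = X[1]

    print(f"\n MD: {md}, l2: {l2}")

    # int() truncates toward zero; abs() guards against negative positions
    catboost = CatBoostClassifier(learning_rate=0.03,
                                  random_seed=42,
                                  max_depth=abs(int(md)),
                                  num_trees=50,
                                  l2_leaf_reg=abs(int(l2)),
                                  verbose=0)
    train_data = Pool(data=X_train,label=y_train)
    eval_data = Pool(data=X_val,label=y_val)
    catboost.fit(X=train_data, eval_set=eval_data,plot=False)
    y_val_catboost = catboost.predict_proba(X_val)
    # Despite the name, acc_val is the macro one-vs-one ROC AUC on the validation split
    acc_val = metrics.roc_auc_score(y_val,y_val_catboost, multi_class = 'ovo', average = 'macro')

    # The optimizer minimizes, so return 1 - AUC
    return 1-acc_val


# Lower and upper bounds for [max_depth, l2_leaf_reg]
constraints = (np.array([0, 0]),
               np.array([16, 16]))
# 50 particles in a 2-dimensional search space
opt = IntOptimizerPSO(50,
                      2,
                      {"c1": 0.5, "c2": 0.3, "w": 1.9},
                      bounds = constraints)

# Run 10 iterations; c is the best cost and p the best position
c, p = opt.optimize(parab, 10)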



integer_pso:   0%|          |0/10


 MD: 11.893058444781385, l2: 6.27121744100832

 MD: 12.814231410204604, l2: 5.341442232200604

 MD: 1.4655334972095293, l2: 10.713131085077785

 MD: 7.7803333579104645, l2: 3.4456308760773506

 MD: 3.8154589914039345, l2: 3.0755822956673757

 MD: 1.4808015087984145, l2: 5.5500213418105115

 MD: 2.2320566443873133, l2: 11.668265164704854

 MD: 11.232557686903716, l2: 14.08449630963425

 MD: 13.514970708999133, l2: 14.029650366959533

 MD: 15.412204058806866, l2: 0.07591458742906099

 MD: 7.282882236836333, l2: 14.16232638590292

 MD: 4.972149926579066, l2: 12.864284056301457

 MD: 10.82984552512682, l2: 2.756603739026634

 MD: 10.161318558312484, l2: 4.08590179791962

 MD: 15.962545845725517, l2: 0.9900051990077632

 MD: 5.459321375640732, l2: 5.359565513381897

 MD: 3.0575748004637493, l2: 15.31204015761711

 MD: 4.075253399094111, l2: 0.007861166272535414

 MD: 13.259447745865883, l2: 3.404322262928293

 MD: 14.010975582302292, l2: 10.30706493289643

 MD: 15.40980721071841, l2: 6.555

integer_pso:  10%|█         |1/10, best_cost=0.391


 MD: 2.8254754744517, l2: 7.272540573211016

 MD: 11, l2: 6

 MD: 11, l2: 5

 MD: 3, l2: 10

 MD: 8, l2: 4

 MD: 3, l2: 3

 MD: 3, l2: 5

 MD: 3, l2: 11

 MD: 11, l2: 14

 MD: 13, l2: 14

 MD: 0, l2: 1

 MD: 8, l2: 14

 MD: 5, l2: 12

 MD: 10, l2: 3

 MD: 11, l2: 5

 MD: 0, l2: 0

 MD: 6, l2: 6

 MD: 5, l2: 13

 MD: 5, l2: 2

 MD: 13, l2: 4

 MD: 15, l2: 11

 MD: 0, l2: 7

 MD: 13, l2: 4

 MD: 2, l2: 3

 MD: 3, l2: 2

 MD: 12, l2: 12

 MD: 15, l2: 5

 MD: 8, l2: 7

 MD: 11, l2: 8

 MD: 9, l2: 14

 MD: 6, l2: 1

 MD: 12, l2: 2

 MD: 9, l2: 5

 MD: 14, l2: 1

 MD: 12, l2: 10

 MD: 8, l2: 6

 MD: 3, l2: 5

 MD: 7, l2: 2

 MD: 8, l2: 11

 MD: 12, l2: 0

 MD: 3, l2: 4

 MD: 6, l2: 1

 MD: 3, l2: 11

 MD: 6, l2: 12

 MD: 7, l2: 1

 MD: 11, l2: 4

 MD: 9, l2: 1

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 13, l2: 8


integer_pso:  20%|██        |2/10, best_cost=0.391


 MD: 2, l2: 7

 MD: 11, l2: 6

 MD: 9, l2: 5

 MD: 6, l2: 10

 MD: 9, l2: 5

 MD: 3, l2: 3

 MD: 7, l2: 5

 MD: 5, l2: 10

 MD: 11, l2: 12

 MD: 13, l2: 13

 MD: 6, l2: 2

 MD: 9, l2: 12

 MD: 7, l2: 12

 MD: 10, l2: 4

 MD: 12, l2: 6

 MD: 8, l2: 0

 MD: 8, l2: 7

 MD: 9, l2: 9

 MD: 7, l2: 5

 MD: 13, l2: 5

 MD: 15, l2: 11

 MD: 4, l2: 8

 MD: 14, l2: 4

 MD: 6, l2: 6

 MD: 3, l2: 2

 MD: 12, l2: 11

 MD: 14, l2: 5

 MD: 9, l2: 7

 MD: 11, l2: 8

 MD: 9, l2: 14

 MD: 6, l2: 3

 MD: 12, l2: 2

 MD: 9, l2: 5

 MD: 14, l2: 3

 MD: 12, l2: 10

 MD: 9, l2: 6

 MD: 3, l2: 6

 MD: 7, l2: 2

 MD: 8, l2: 11

 MD: 12, l2: 0

 MD: 3, l2: 4

 MD: 6, l2: 3

 MD: 7, l2: 10

 MD: 8, l2: 10

 MD: 8, l2: 1

 MD: 11, l2: 4

 MD: 10, l2: 3

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 13, l2: 9


integer_pso:  30%|███       |3/10, best_cost=0.391


 MD: 3, l2: 7

 MD: 11, l2: 6

 MD: 6, l2: 5

 MD: 11, l2: 10

 MD: 10, l2: 6

 MD: 3, l2: 3

 MD: 14, l2: 5

 MD: 9, l2: 8

 MD: 11, l2: 7

 MD: 13, l2: 11

 MD: 1, l2: 3

 MD: 10, l2: 9

 MD: 10, l2: 11

 MD: 10, l2: 5

 MD: 12, l2: 6

 MD: 7, l2: 0

 MD: 11, l2: 8

 MD: 16, l2: 1

 MD: 11, l2: 10

 MD: 13, l2: 6

 MD: 13, l2: 9

 MD: 15, l2: 8

 MD: 14, l2: 4

 MD: 13, l2: 11

 MD: 3, l2: 2

 MD: 12, l2: 9

 MD: 12, l2: 5

 MD: 10, l2: 7

 MD: 11, l2: 7

 MD: 9, l2: 13

 MD: 6, l2: 6

 MD: 12, l2: 2

 MD: 9, l2: 5

 MD: 14, l2: 6

 MD: 12, l2: 10

 MD: 10, l2: 6

 MD: 3, l2: 7

 MD: 7, l2: 2

 MD: 8, l2: 10

 MD: 12, l2: 0

 MD: 4, l2: 4

 MD: 6, l2: 6

 MD: 14, l2: 7

 MD: 11, l2: 6

 MD: 9, l2: 1

 MD: 11, l2: 4

 MD: 11, l2: 6

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 13, l2: 10

 MD: 6, l2: 7


integer_pso:  40%|████      |4/10, best_cost=0.391


 MD: 11, l2: 6

 MD: 2, l2: 5

 MD: 4, l2: 9

 MD: 10, l2: 7

 MD: 4, l2: 3

 MD: 8, l2: 5

 MD: 16, l2: 3

 MD: 11, l2: 1

 MD: 13, l2: 8

 MD: 7, l2: 4

 MD: 11, l2: 5

 MD: 15, l2: 8

 MD: 10, l2: 6

 MD: 11, l2: 5

 MD: 3, l2: 0

 MD: 15, l2: 8

 MD: 9, l2: 3

 MD: 1, l2: 16

 MD: 13, l2: 7

 MD: 10, l2: 6

 MD: 3, l2: 7

 MD: 14, l2: 4

 MD: 9, l2: 3

 MD: 3, l2: 2

 MD: 12, l2: 5

 MD: 8, l2: 5

 MD: 11, l2: 7

 MD: 11, l2: 5

 MD: 9, l2: 10

 MD: 6, l2: 9

 MD: 12, l2: 2

 MD: 9, l2: 5

 MD: 14, l2: 10

 MD: 12, l2: 10

 MD: 11, l2: 6

 MD: 4, l2: 8

 MD: 7, l2: 2

 MD: 8, l2: 8

 MD: 12, l2: 0

 MD: 6, l2: 4

 MD: 6, l2: 9

 MD: 10, l2: 1

 MD: 15, l2: 15

 MD: 9, l2: 1

 MD: 11, l2: 4

 MD: 11, l2: 10

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 12, l2: 10

 MD: 11, l2: 7


integer_pso:  50%|█████     |5/10, best_cost=0.391


 MD: 11, l2: 6

 MD: 14, l2: 5

 MD: 6, l2: 8

 MD: 10, l2: 7

 MD: 6, l2: 3

 MD: 10, l2: 5

 MD: 12, l2: 11

 MD: 11, l2: 7

 MD: 13, l2: 2

 MD: 1, l2: 5

 MD: 11, l2: 0

 MD: 5, l2: 3

 MD: 10, l2: 6

 MD: 9, l2: 3

 MD: 9, l2: 0

 MD: 2, l2: 8

 MD: 9, l2: 9

 MD: 15, l2: 9

 MD: 13, l2: 8

 MD: 5, l2: 0

 MD: 12, l2: 5

 MD: 12, l2: 4

 MD: 15, l2: 2

 MD: 4, l2: 2

 MD: 12, l2: 14

 MD: 1, l2: 5

 MD: 11, l2: 7

 MD: 11, l2: 2

 MD: 9, l2: 5

 MD: 6, l2: 9

 MD: 12, l2: 2

 MD: 9, l2: 5

 MD: 13, l2: 16

 MD: 12, l2: 10

 MD: 11, l2: 6

 MD: 6, l2: 9

 MD: 7, l2: 2

 MD: 8, l2: 5

 MD: 12, l2: 0

 MD: 9, l2: 4

 MD: 6, l2: 12

 MD: 0, l2: 6

 MD: 3, l2: 15

 MD: 9, l2: 1

 MD: 11, l2: 4

 MD: 10, l2: 14

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 10, l2: 10


integer_pso:  60%|██████    |6/10, best_cost=0.391


 MD: 3, l2: 7

 MD: 11, l2: 6

 MD: 4, l2: 5

 MD: 8, l2: 5

 MD: 10, l2: 6

 MD: 10, l2: 3

 MD: 11, l2: 5

 MD: 0, l2: 11

 MD: 11, l2: 2

 MD: 13, l2: 11

 MD: 2, l2: 6

 MD: 10, l2: 12

 MD: 2, l2: 13

 MD: 10, l2: 6

 MD: 5, l2: 0

 MD: 15, l2: 0

 MD: 9, l2: 8

 MD: 7, l2: 5

 MD: 5, l2: 9

 MD: 13, l2: 9

 MD: 15, l2: 8

 MD: 9, l2: 1

 MD: 8, l2: 4

 MD: 7, l2: 15

 MD: 6, l2: 2

 MD: 12, l2: 15

 MD: 7, l2: 5

 MD: 9, l2: 7

 MD: 11, l2: 13

 MD: 9, l2: 12

 MD: 6, l2: 5

 MD: 12, l2: 2

 MD: 9, l2: 5

 MD: 10, l2: 8

 MD: 12, l2: 9

 MD: 11, l2: 6

 MD: 10, l2: 9

 MD: 7, l2: 2

 MD: 8, l2: 15

 MD: 11, l2: 0

 MD: 14, l2: 4

 MD: 6, l2: 13

 MD: 0, l2: 15

 MD: 12, l2: 15

 MD: 8, l2: 1

 MD: 11, l2: 4

 MD: 8, l2: 3

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 7, l2: 9


integer_pso:  70%|███████   |7/10, best_cost=0.391


 MD: 3, l2: 7

 MD: 11, l2: 6

 MD: 4, l2: 5

 MD: 8, l2: 0

 MD: 10, l2: 3

 MD: 1, l2: 3

 MD: 7, l2: 5

 MD: 8, l2: 11

 MD: 11, l2: 11

 MD: 13, l2: 12

 MD: 12, l2: 6

 MD: 8, l2: 4

 MD: 13, l2: 0

 MD: 10, l2: 5

 MD: 15, l2: 12

 MD: 15, l2: 0

 MD: 5, l2: 7

 MD: 0, l2: 2

 MD: 0, l2: 5

 MD: 13, l2: 10

 MD: 2, l2: 8

 MD: 11, l2: 10

 MD: 1, l2: 4

 MD: 3, l2: 0

 MD: 10, l2: 2

 MD: 12, l2: 13

 MD: 5, l2: 5

 MD: 5, l2: 7

 MD: 11, l2: 1

 MD: 9, l2: 6

 MD: 6, l2: 12

 MD: 11, l2: 2

 MD: 9, l2: 5

 MD: 5, l2: 5

 MD: 12, l2: 7

 MD: 11, l2: 6

 MD: 1, l2: 9

 MD: 7, l2: 2

 MD: 8, l2: 0

 MD: 10, l2: 0

 MD: 6, l2: 4

 MD: 6, l2: 11

 MD: 12, l2: 16

 MD: 10, l2: 1

 MD: 6, l2: 1

 MD: 11, l2: 4

 MD: 5, l2: 12

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 3, l2: 7

 MD: 2, l2: 6


integer_pso:  80%|████████  |8/10, best_cost=0.391


 MD: 11, l2: 6

 MD: 9, l2: 5

 MD: 1, l2: 8

 MD: 10, l2: 14

 MD: 4, l2: 3

 MD: 9, l2: 5

 MD: 3, l2: 14

 MD: 11, l2: 12

 MD: 13, l2: 16

 MD: 15, l2: 6

 MD: 5, l2: 5

 MD: 14, l2: 10

 MD: 10, l2: 3

 MD: 2, l2: 4

 MD: 9, l2: 0

 MD: 13, l2: 5

 MD: 14, l2: 7

 MD: 4, l2: 10

 MD: 12, l2: 10

 MD: 14, l2: 9

 MD: 15, l2: 9

 MD: 6, l2: 4

 MD: 6, l2: 14

 MD: 1, l2: 2

 MD: 11, l2: 9

 MD: 5, l2: 5

 MD: 14, l2: 7

 MD: 11, l2: 11

 MD: 9, l2: 12

 MD: 6, l2: 9

 MD: 10, l2: 2

 MD: 9, l2: 5

 MD: 12, l2: 14

 MD: 12, l2: 5

 MD: 10, l2: 6

 MD: 1, l2: 9

 MD: 7, l2: 2

 MD: 8, l2: 6

 MD: 8, l2: 0

 MD: 6, l2: 4

 MD: 6, l2: 6

 MD: 9, l2: 16

 MD: 4, l2: 12

 MD: 3, l2: 1

 MD: 11, l2: 4

 MD: 2, l2: 10

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 13, l2: 4

 MD: 15, l2: 4


integer_pso:  90%|█████████ |9/10, best_cost=0.391


 MD: 11, l2: 6

 MD: 6, l2: 5

 MD: 10, l2: 7

 MD: 10, l2: 1

 MD: 9, l2: 3

 MD: 1, l2: 5

 MD: 3, l2: 9

 MD: 11, l2: 2

 MD: 13, l2: 9

 MD: 4, l2: 5

 MD: 1, l2: 8

 MD: 9, l2: 15

 MD: 10, l2: 0

 MD: 10, l2: 5

 MD: 8, l2: 0

 MD: 9, l2: 2

 MD: 13, l2: 2

 MD: 5, l2: 12

 MD: 11, l2: 9

 MD: 6, l2: 12

 MD: 8, l2: 5

 MD: 2, l2: 4

 MD: 15, l2: 9

 MD: 15, l2: 2

 MD: 9, l2: 7

 MD: 10, l2: 5

 MD: 15, l2: 7

 MD: 11, l2: 14

 MD: 9, l2: 10

 MD: 6, l2: 1

 MD: 8, l2: 2

 MD: 9, l2: 5

 MD: 10, l2: 10

 MD: 12, l2: 1

 MD: 7, l2: 6

 MD: 1, l2: 8

 MD: 7, l2: 2

 MD: 8, l2: 4

 MD: 5, l2: 0

 MD: 4, l2: 4

 MD: 6, l2: 13

 MD: 1, l2: 5

 MD: 7, l2: 8

 MD: 0, l2: 1

 MD: 11, l2: 4

 MD: 0, l2: 2

 MD: 12, l2: 3

 MD: 7, l2: 4

 MD: 1, l2: 15

 MD: 14, l2: 2


integer_pso: 100%|██████████|10/10, best_cost=0.391
2022-08-11 23:52:05,533 - integer_pso - INFO - Optimization finished | best cost: 0.3911385542286602, best pos: [7.94070879 2.90106392]
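
The reported best position is a pair of floats, while CatBoost needs integers, so it is worth seeing how the conversion affects the result. The sketch below compares the `abs(int(...))` cast used inside the objective with rounding to the nearest integer; whether `IntOptimizerPSO` rounds positions internally before calling the objective is not shown in this notebook, so treat this as an illustration only.

```python
best_pos = [7.94070879, 2.90106392]  # best position reported in the log above

truncated = [abs(int(x)) for x in best_pos]  # the cast applied inside the objective
rounded = [round(x) for x in best_pos]       # nearest integer

print(truncated, rounded)
```

The two conversions give different hyperparameters here, so it helps to be explicit about which one is used when the final model is retrained.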


#### After Optimization

In [94]:
# Best position [7.94, 2.90] from the PSO run, rounded to the nearest integers
catboost = CatBoostClassifier(learning_rate=0.03,
                              random_seed=42,
                              max_depth=8,
                              num_trees=50,
                              l2_leaf_reg=3,
                              verbose=0)

train_data = Pool(data=X_train,label=y_train)
eval_data = Pool(data=X_val,label=y_val)
catboost.fit(X=train_data, eval_set=eval_data,plot=False)

y_test_catboost = catboost.predict_proba(X_test)
pred_test_catboost = catboost.predict(X_test)


In [76]:
accuracy = metrics.accuracy_score(y_test, pred_test_catboost)
auc = metrics.roc_auc_score(y_test, y_test_catboost, multi_class = 'ovo', average = 'macro')
precision = metrics.precision_score(y_test, pred_test_catboost, average = 'macro')
recall = metrics.recall_score(y_test, pred_test_catboost, average = 'macro')

df_comparison = pd.concat([df_comparison, pd.DataFrame({
    'Model': ['CatBoost PSO Optimized'],
    'Accuracy': [accuracy],
    'AUC': [auc],
    'Precision': [precision],
    'Recall': [recall]
})])

print(f'Accuracy: {accuracy} \t AUC: {auc} \t Precision: {precision} \t Recall: {recall}')

Accuracy: 0.45298313878080415 	 AUC: 0.5879301668497614 	 Precision: 0.4153991696443171 	 Recall: 0.4187158740772449


In [77]:
df_comparison


|   | Model | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|---|
| 0 | CatBoost Plain | 0.447795 | 0.58426 | 0.41089 | 0.420022 |
| 0 | CatBoost PSO Optimized | 0.452983 | 0.58793 | 0.415399 | 0.418716 |


### LightGBM

#### Before Optimization

In [78]:
lgbm = LGBMClassifier(random_state=42)

lgbm.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = 0)

y_test_lgbm = lgbm.predict_proba(X_test)
pred_test_lgbm = lgbm.predict(X_test)

In [79]:
accuracy = metrics.accuracy_score(y_test, pred_test_lgbm)
auc = metrics.roc_auc_score(y_test, y_test_lgbm, multi_class = 'ovo', average = 'macro')
precision = metrics.precision_score(y_test, pred_test_lgbm, average = 'macro')
recall = metrics.recall_score(y_test, pred_test_lgbm, average = 'macro')

df_comparison = pd.concat([df_comparison, pd.DataFrame({
    'Model': ['LGBM Plain'],
    'Accuracy': [accuracy],
    'AUC': [auc],
    'Precision': [precision],
    'Recall': [recall]
})])

print(f'Accuracy: {accuracy} \t AUC: {auc} \t Precision: {precision} \t Recall: {recall}')

Accuracy: 0.4079118028534371 	 AUC: 0.5721934193588355 	 Precision: 0.39563584351419556 	 Recall: 0.39859094717030663


In [80]:
df_comparison

|   | Model | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|---|
| 0 | CatBoost Plain | 0.447795 | 0.58426 | 0.41089 | 0.420022 |
| 0 | CatBoost PSO Optimized | 0.452983 | 0.58793 | 0.415399 | 0.418716 |
| 0 | LGBM Plain | 0.407912 | 0.572193 | 0.395636 | 0.398591 |


#### Optimization

In [25]:
import tqdm.notebook as tq
from itertools import product
import statistics
from collections import Counter

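@ps.cost
def parab(X):
    # X is one particle's position: [max_depth, learning-rate step]
    md = X[0]
    lr = X[1]

    # The learning rate is searched on a grid of sixteenths: lr / 16
    lgbm = LGBMClassifier(n_estimators=50,
                          learning_rate=abs(lr/16),
                          max_depth=abs(int(md)),
                          random_state=42)
    lgbm.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = 0)
    y_val_lgbm = lgbm.predict_proba(X_val)
    # Despite the name, acc_val is the macro one-vs-one ROC AUC on the validation split
    acc_val = metrics.roc_auc_score(y_val,y_val_lgbm, multi_class = 'ovo', average = 'macro')

    # The optimizer minimizes, so return 1 - AUC
    return 1-acc_val


# Lower and upper bounds for [max_depth, learning-rate step]
constraints = (np.array([0, 1]),
               np.array([16, 16]))
# 50 particles in a 2-dimensional search space
opt = IntOptimizerPSO(50,
                      2,
                      {"c1": 0.5, "c2": 0.3, "w": 1.9},
                      bounds = constraints)

# Run 10 iterations; c is the best cost and p the best position
c, p = opt.optimize(parab, 10)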


integer_pso: 100%|██████████|10/10, best_cost=0.408
2022-08-12 00:02:27,307 - integer_pso - INFO - Optimization finished | best cost: 0.4079355530616272, best pos: [4. 1.]
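
The best position is expressed in the optimizer's integer search space, not directly in LightGBM's parameters. As a small sketch, the hypothetical helper below (not part of the notebook) decodes a position the same way the objective above does, dividing the second component by 16 to obtain the learning rate.

```python
def decode_lgbm_position(position):
    """Map a PSO position [md, lr] to LightGBM keyword arguments (hypothetical helper)."""
    md, lr = position
    return {"max_depth": abs(int(md)), "learning_rate": abs(lr / 16)}

params = decode_lgbm_position([4.0, 1.0])  # best position reported in the log above
print(params)
```

Searching the learning rate as an integer step keeps every dimension on the same integer grid, at the price of a coarse resolution of 1/16.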


#### After Optimization

In [93]:
# Best position [4, 1] from the PSO run: max_depth = 4, learning_rate = 1/16
lgbm = LGBMClassifier(n_estimators=50,
                      learning_rate=1/16,
                      max_depth=4,
                      random_state=42)

lgbm.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = 0)

y_test_lgbm = lgbm.predict_proba(X_test)
pred_test_lgbm = lgbm.predict(X_test)

In [82]:
accuracy = metrics.accuracy_score(y_test, pred_test_lgbm)
auc = metrics.roc_auc_score(y_test, y_test_lgbm, multi_class = 'ovo', average = 'macro')
precision = metrics.precision_score(y_test, pred_test_lgbm, average = 'macro')
recall = metrics.recall_score(y_test, pred_test_lgbm, average = 'macro')

df_comparison = pd.concat([df_comparison, pd.DataFrame({
    'Model': ['LGBM PSO Optimized'],
    'Accuracy': [accuracy],
    'AUC': [auc],
    'Precision': [precision],
    'Recall': [recall]
})])

print(f'Accuracy: {accuracy} \t AUC: {auc} \t Precision: {precision} \t Recall: {recall}')

Accuracy: 0.4588197146562905 	 AUC: 0.5957447121995362 	 Precision: 0.424718930795255 	 Recall: 0.4224707515648682


In [83]:
df_comparison

|   | Model | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|---|
| 0 | CatBoost Plain | 0.447795 | 0.58426 | 0.41089 | 0.420022 |
| 0 | CatBoost PSO Optimized | 0.452983 | 0.58793 | 0.415399 | 0.418716 |
| 0 | LGBM Plain | 0.407912 | 0.572193 | 0.395636 | 0.398591 |
| 0 | LGBM PSO Optimized | 0.45882 | 0.595745 | 0.424719 | 0.422471 |


### XGBoost

#### Before Optimization

In [84]:
xgb = XGBClassifier()

xgb.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = 0)

y_test_xgb = xgb.predict_proba(X_test)
pred_test_xgb = xgb.predict(X_test)

In [85]:
accuracy = metrics.accuracy_score(y_test, pred_test_xgb)
auc = metrics.roc_auc_score(y_test, y_test_xgb, multi_class = 'ovo', average = 'macro')
precision = metrics.precision_score(y_test, pred_test_xgb, average = 'macro')
recall = metrics.recall_score(y_test, pred_test_xgb, average = 'macro')

df_comparison = pd.concat([df_comparison, pd.DataFrame({
    'Model': ['XGBoost Plain'],
    'Accuracy': [accuracy],
    'AUC': [auc],
    'Precision': [precision],
    'Recall': [recall]
})])

print(f'Accuracy: {accuracy} \t AUC: {auc} \t Precision: {precision} \t Recall: {recall}')

Accuracy: 0.44293125810635536 	 AUC: 0.5835852706811223 	 Precision: 0.4103295918494496 	 Recall: 0.41502299140138027


In [86]:
df_comparison

|   | Model | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|---|
| 0 | CatBoost Plain | 0.447795 | 0.58426 | 0.41089 | 0.420022 |
| 0 | CatBoost PSO Optimized | 0.452983 | 0.58793 | 0.415399 | 0.418716 |
| 0 | LGBM Plain | 0.407912 | 0.572193 | 0.395636 | 0.398591 |
| 0 | LGBM PSO Optimized | 0.45882 | 0.595745 | 0.424719 | 0.422471 |
| 0 | XGBoost Plain | 0.442931 | 0.583585 | 0.41033 | 0.415023 |


#### Optimization

In [37]:
import tqdm.notebook as tq
from itertools import product
import statistics
from collections import Counter


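@ps.cost
def parab(X):
    # X is one particle's position: [learning-rate step, max_depth, min_child_weight]
    lr = X[0]
    md = X[1]
    mcw = X[2]

    # The learning rate is searched on a grid of sixteenths: lr / 16
    xgb = XGBClassifier(learning_rate=abs(lr/16),
                        n_estimators=50,
                        min_child_weight=abs(int(mcw)),
                        max_depth=abs(int(md)),
                        verbosity=0,
                        n_jobs=-1)
    xgb.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = False)
    y_val_xgb = xgb.predict_proba(X_val)
    # Despite the name, acc_val is the macro one-vs-one ROC AUC on the validation split
    acc_val = metrics.roc_auc_score(y_val,y_val_xgb, multi_class = 'ovo', average = 'macro')

    # The optimizer minimizes, so return 1 - AUC
    return 1-acc_val


# Lower and upper bounds for [learning-rate step, max_depth, min_child_weight]
constraints = (np.array([1, 0, 0]),
               np.array([16, 16, 16]))
# 50 particles in a 3-dimensional search space
opt = IntOptimizerPSO(50,
                      3,
                      {"c1": 0.5, "c2": 0.3, "w": 1.9},
                      bounds = constraints)

# Run 20 iterations; c is the best cost and p the best position
c, p = opt.optimize(parab, 20)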

integer_pso: 100%|██████████|20/20, best_cost=0.41
2022-08-12 00:23:09,718 - integer_pso - INFO - Optimization finished | best cost: 0.4103705412864599, best pos: [ 4.  2. 13.]


#### After Optimization

In [92]:
# Best position [4, 2, 13] from the PSO run: learning_rate = 4/16, max_depth = 2, min_child_weight = 13
xgb = XGBClassifier(learning_rate=4/16,
                    n_estimators=50,
                    min_child_weight=13,
                    max_depth=2,
                    verbosity=0,
                    n_jobs=-1)

xgb.fit(X_train, y_train, eval_set = [(X_val, y_val)], verbose = 0)

y_test_xgb = xgb.predict_proba(X_test)
pred_test_xgb = xgb.predict(X_test)

In [88]:
accuracy = metrics.accuracy_score(y_test, pred_test_xgb)
auc = metrics.roc_auc_score(y_test, y_test_xgb, multi_class = 'ovo', average = 'macro')
precision = metrics.precision_score(y_test, pred_test_xgb, average = 'macro')
recall = metrics.recall_score(y_test, pred_test_xgb, average = 'macro')

df_comparison = pd.concat([df_comparison, pd.DataFrame({
    'Model': ['XGBoost PSO Optimized'],
    'Accuracy': [accuracy],
    'AUC': [auc],
    'Precision': [precision],
    'Recall': [recall]
})])

print(f'Accuracy: {accuracy} \t AUC: {auc} \t Precision: {precision} \t Recall: {recall}')

Accuracy: 0.45460440985732814 	 AUC: 0.5878284963328784 	 Precision: 0.4259883772187241 	 Recall: 0.42679591272542156


In [89]:
df_comparison

|   | Model | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|---|
| 0 | CatBoost Plain | 0.447795 | 0.58426 | 0.41089 | 0.420022 |
| 0 | CatBoost PSO Optimized | 0.452983 | 0.58793 | 0.415399 | 0.418716 |
| 0 | LGBM Plain | 0.407912 | 0.572193 | 0.395636 | 0.398591 |
| 0 | LGBM PSO Optimized | 0.45882 | 0.595745 | 0.424719 | 0.422471 |
| 0 | XGBoost Plain | 0.442931 | 0.583585 | 0.41033 | 0.415023 |
| 0 | XGBoost PSO Optimized | 0.454604 | 0.587828 | 0.425988 | 0.426796 |


### Ensemble Voting Classifier

In [90]:
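from sklearn.ensemble import VotingClassifier

# Soft voting averages the predicted class probabilities of the three tuned models
vc = VotingClassifier([('catboost', catboost), ('lgbm', lgbm), ('xgb', xgb)], voting = 'soft', weights=[1,1,1])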
vc.fit(X_train, y_train)
y_pred_vc_test = vc.predict_proba(X_test)
pred_test_vc = vc.predict(X_test)

In [91]:
accuracy = metrics.accuracy_score(y_test, pred_test_vc)
auc = metrics.roc_auc_score(y_test, y_pred_vc_test, multi_class = 'ovo', average = 'macro')
precision = metrics.precision_score(y_test, pred_test_vc, average = 'macro')
recall = metrics.recall_score(y_test, pred_test_vc, average = 'macro')

df_comparison = pd.concat([df_comparison, pd.DataFrame({
    'Model': ['Ensemble Plain'],
    'Accuracy': [accuracy],
    'AUC': [auc],
    'Precision': [precision],
    'Recall': [recall]
})])

print(f'Accuracy: {accuracy} \t AUC: {auc} \t Precision: {precision} \t Recall: {recall}')

Accuracy: 0.4591439688715953 	 AUC: 0.5928584757768697 	 Precision: 0.42488413590610835 	 Recall: 0.4261984019654517
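
Soft voting amounts to a weighted average of each model's class probabilities. The sketch below shows the arithmetic for a single sample; the `soft_vote` helper and the probability values are made up for illustration and do not come from the trained models.

```python
def soft_vote(probas, weights):
    """Weighted average of per-model class probabilities for one sample."""
    total = sum(weights)
    n_classes = len(probas[0])
    return [sum(w * p[k] for w, p in zip(weights, probas)) / total
            for k in range(n_classes)]

classes = ["AGGRESSIVE", "NORMAL", "SLOW"]
probas = [
    [0.5, 0.3, 0.2],  # model A (made-up values)
    [0.2, 0.5, 0.3],  # model B
    [0.3, 0.3, 0.4],  # model C
]

equal = soft_vote(probas, [1, 1, 1])
skewed = soft_vote(probas, [6, 1, 1])

label_equal = classes[equal.index(max(equal))]
label_skewed = classes[skewed.index(max(skewed))]
print(label_equal, label_skewed)
```

With equal weights the ensemble follows the majority of the probability mass, while a heavily weighted model can pull the decision toward its own prediction. This is why the choice of weights is worth optimizing.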


#### Genetic Algorithm to find best params

In [99]:
from sklearn.ensemble import VotingClassifier
import pygad

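desired_output = 0.99

# Fitness of a candidate weight vector: macro one-vs-one ROC AUC of the
# soft-voting ensemble on the validation split (higher is better).
# Note: PyGAD 2.20.0 and later expect fitness_function(ga_instance, solution, solution_idx).
def fitness_function(solution, solution_idx):
    vc = VotingClassifier([('catboost', catboost), ('lgbm', lgbm), ('xgb', xgb)], voting = 'soft', weights=[solution[0], solution[1], solution[2]])
    vc.fit(X_train, y_train)
    y_pred_vc_val = vc.predict_proba(X_val)
    output = metrics.roc_auc_score(y_val, y_pred_vc_val, multi_class = 'ovo', average = 'macro')
    # fitness = 1.0 / np.abs(output - desired_output)

    return output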

gene_space = {"low": 1, "high": 8, "step": 1}
gene_type=int
num_generations = 100
num_parents_mating = 10

sol_per_pop = 20
num_genes = 3

parent_selection_type = "tournament"
keep_parents = 4

crossover_type = "single_point"

mutation_type = "random"
mutation_percent_genes = 40

ga_instance = pygad.GA(num_generations=num_generations,
                       num_parents_mating=num_parents_mating,
                       fitness_func=fitness_function,
                       sol_per_pop=sol_per_pop,
                       num_genes=num_genes,
                       parent_selection_type=parent_selection_type,
                       keep_parents=keep_parents,
                       crossover_type=crossover_type,
                       mutation_type=mutation_type,
                       mutation_percent_genes=mutation_percent_genes,
                       gene_space=gene_space,
                       gene_type=gene_type)
					   
ga_instance.run()
solution, solution_fitness, solution_idx = ga_instance.best_solution()
print("Parameters of the best solution : {solution}".format(solution=solution))
print("Fitness value of the best solution = {solution_fitness}".format(solution_fitness=solution_fitness))

Parameters of the best solution : [6 1 1]
Fitness value of the best solution = 0.6039769288751404
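
For intuition about what the genetic algorithm does with the three integer weights, here is a stripped-down sketch in pure Python with elitism, tournament selection, single-point crossover, and random mutation. It is not PyGAD's implementation: the function names, the inclusive gene range, and the toy fitness (which simply rewards weights close to a made-up target mix) are assumptions.

```python
import random

def run_ga(fitness, num_genes=3, low=1, high=8, pop_size=20,
           generations=30, keep=4, mutation_rate=0.4, seed=0):
    """Tiny GA sketch over integer genes in [low, high] (inclusive)."""
    rng = random.Random(seed)
    pop = [[rng.randint(low, high) for _ in range(num_genes)]
           for _ in range(pop_size)]
    best_history = []

    def tournament(scored, k=3):
        # Pick k random individuals and return the fittest one
        return max(rng.sample(scored, k), key=lambda s: s[0])[1]

    for _ in range(generations):
        scored = sorted(((fitness(ind), ind) for ind in pop),
                        key=lambda s: s[0], reverse=True)
        best_history.append(scored[0][0])
        next_pop = [ind[:] for _, ind in scored[:keep]]  # elitism
        while len(next_pop) < pop_size:
            a, b = tournament(scored), tournament(scored)
            cut = rng.randint(1, num_genes - 1)          # single-point crossover
            child = a[:cut] + b[cut:]
            child = [rng.randint(low, high) if rng.random() < mutation_rate else g
                     for g in child]                      # random mutation
            next_pop.append(child)
        pop = next_pop

    best = max(pop, key=fitness)
    return best, fitness(best), best_history

def toy_fitness(weights):
    # Made-up objective: prefer weight shares close to (0.75, 0.125, 0.125)
    total = sum(weights)
    target = (0.75, 0.125, 0.125)
    return -sum((w / total - t) ** 2 for w, t in zip(weights, target))

best, best_fit, best_history = run_ga(toy_fitness)
print(best, best_fit)
```

In the notebook's setting, the toy fitness would be replaced by the validation AUC of the weighted ensemble, which is far more expensive to evaluate because each call refits the three models.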


#### Results

In [97]:
vc = VotingClassifier([('catboost', catboost), ('lgbm', lgbm), ('xgb', xgb)], voting = 'soft', weights=[solution[0],solution[1],solution[2]])
vc.fit(X_train, y_train)
y_pred_vc_test = vc.predict_proba(X_test)
pred_test_vc = vc.predict(X_test)

In [98]:
accuracy = metrics.accuracy_score(y_test, pred_test_vc)
auc = metrics.roc_auc_score(y_test, y_pred_vc_test, multi_class = 'ovo', average = 'macro')
precision = metrics.precision_score(y_test, pred_test_vc, average = 'macro')
recall = metrics.recall_score(y_test, pred_test_vc, average = 'macro')

print(f'Accuracy: {accuracy} \t AUC: {auc} \t Precision: {precision} \t Recall: {recall}')

Accuracy: 0.45395590142671854 	 AUC: 0.5916323174665366 	 Precision: 0.416714044844178 	 Recall: 0.42069066322462806


In [70]:
df_comparison

|   | Model | Accuracy | AUC | Precision | Recall |
|---|---|---|---|---|---|
| 0 | CatBoost Plain | 0.447795 | 0.58426 | 0.41089 | 0.420022 |
| 0 | CatBoost PSO Optimized | 0.452983 | 0.58793 | 0.415399 | 0.418716 |
| 0 | LGBM Plain | 0.407912 | 0.572193 | 0.395636 | 0.398591 |
| 0 | LGBM PSO Optimized | 0.45882 | 0.595745 | 0.424719 | 0.422471 |
| 0 | XGBoost Plain | 0.442931 | 0.583585 | 0.41033 | 0.415023 |
| 0 | XGBoost PSO Optimized | 0.454604 | 0.587828 | 0.425988 | 0.426796 |
| 0 | Ensemble GA Optimized | 0.457847 | 0.592578 | 0.422925 | 0.424566 |
| 0 | Ensemble Plain | 0.459144 | 0.592858 | 0.424884 | 0.426198 |
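
To read the comparison more easily, the test AUC gain from PSO tuning can be computed per model. The snippet below copies the AUC values from the table above; the dictionary layout and variable names are illustrative.

```python
# Test AUC values copied from the comparison table
results = {
    "CatBoost": {"plain": 0.58426, "optimized": 0.58793},
    "LGBM": {"plain": 0.572193, "optimized": 0.595745},
    "XGBoost": {"plain": 0.583585, "optimized": 0.587828},
}

# Absolute AUC gain of the PSO-tuned model over the default one
gains = {name: round(r["optimized"] - r["plain"], 6) for name, r in results.items()}
best_model = max(gains, key=gains.get)

for name, gain in gains.items():
    print(f"{name}: {gain:+.6f}")
print("Largest gain:", best_model)
```

All three gains are positive but small, with LightGBM benefiting the most from tuning on this test split.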


In [None]:
# Defining the objective function with the cost decorator lets it be written
# for a single particle
@ps.cost
def parab(X):
    cost = X[0]**2 + X[1]**2
    return cost

initpos = np.array([[-7,4],[12,23],[-4,8],[-9,-7],[-3,2]]) # Initial positions of the particles
opt = IntOptimizerPSO(5,
                      2,
                      {"c1": 0.5, "c2": 0.3, "w": 1.9},
                      initpos=initpos)

c, p = opt.optimize(parab, 20)
