<br>

## __Exercise: Anomaly Detection__

<br>

__1:__

Using the DetectorAnomalias class built throughout the module, __let's evaluate an anomaly detector.__

The dataset can be loaded with the getData function.

This dataset has 6 explanatory variables, $X_1, \dots, X_6$, and one variable that marks whether each instance is an anomaly.

Using the __methodology__ discussed throughout the module, __test different models (varying the threshold $\epsilon$)__ to find the one that __best fits the data.__

Justify your choices of $\epsilon$ and the performance metrics you used.

<br>

In [243]:
import pandas as pd 
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

In [244]:
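class DetectorAnomalias():
    """Flags x as an anomaly when the product of per-feature Gaussian densities is below epsilon."""
    
    def __init__(self, epsilon):
        self.epsilon = epsilon  # density threshold below which a point counts as an anomaly
        
    def fit(self, X):
        # Fit one independent univariate Gaussian per column of X
        medias = X.mean(axis = 0)
        desvios = X.std(axis = 0)
        gaussianas = [st.norm(loc = m, scale = d) for m, d in zip(medias, desvios)]  
        self.gaussianas = gaussianas
        self.X = X
        
    def prob(self, x):
        # Joint density of x, assuming independent features: product of the marginal densities
        p = 1
        for i in range(self.X.shape[1]):
            gaussiana_i = self.gaussianas[i]
            x_i = x[i]
            p *= gaussiana_i.pdf(x_i)
        return p
    
    def isAnomaly(self, x):
        # 1 if the density is below the threshold, 0 otherwise
        return int(np.where(self.prob(x) < self.epsilon, 1, 0))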
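The class above scores one row at a time and multiplies the densities directly, i.e. it flags $x$ when $\prod_i p_i(x_i) < \epsilon$. A minimal sketch of the same rule in vectorized, log-space form follows. It is only an illustration, not the module's implementation: the function names `fit_gaussians`, `log_prob` and `is_anomaly` and the synthetic data are assumptions made for this example. Summing log-densities avoids numerical underflow when there are many features and scores every row in one call.

```python
import numpy as np

def fit_gaussians(X):
    """Per-feature mean and standard deviation (same statistics as DetectorAnomalias.fit)."""
    return X.mean(axis=0), X.std(axis=0)

def log_prob(X, mu, sigma):
    """Sum of per-feature Gaussian log-densities for every row of X."""
    z = (X - mu) / sigma
    log_pdf = -0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    return log_pdf.sum(axis=1)

def is_anomaly(X, mu, sigma, epsilon):
    """1 where p(x) < epsilon, with the comparison done in log space."""
    return (log_prob(X, mu, sigma) < np.log(epsilon)).astype(int)

# Hypothetical data: two features, roughly on the scale of x1 and x2
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[10.0, 20.0], scale=[1.5, 1.75], size=(1000, 2))
mu, sigma = fit_gaussians(X_demo)

# One point near the mean, one point far away from it
points = np.array([[10.0, 20.0], [25.0, 40.0]])
flags = is_anomaly(points, mu, sigma, epsilon=1e-6)
print(flags)
```

The comparison `log p(x) < log(epsilon)` is equivalent to `p(x) < epsilon` because the logarithm is strictly increasing, so the decisions match those of `isAnomaly` whenever the product does not underflow.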

In [245]:
def getData():
    return pd.read_csv("dataframe_anomalias_exercicio.csv")

In [246]:
df = getData()
df

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,7.731153,23.299155,-0.367453,4.715372,9.306179,16.780965,0.0
1,11.466833,16.943695,-0.245131,7.060311,10.462826,19.821289,0.0
2,11.501272,20.196011,1.206049,-4.957189,7.771262,19.100079,0.0
3,10.893921,16.072385,2.738045,-3.684228,7.373334,23.225524,0.0
4,10.091706,19.253894,0.996895,-9.504052,8.883988,17.903298,0.0
...,...,...,...,...,...,...,...
10095,11.192286,18.451987,-0.953650,-14.362996,10.875826,17.056541,0.0
10096,12.014177,19.461815,1.985099,-7.119190,11.079922,17.582755,0.0
10097,10.745460,18.175951,0.206037,-1.897015,9.888329,17.963324,0.0
10098,9.893969,22.333270,-1.465981,4.137382,7.690620,21.570097,0.0


In [247]:
df.anomalia.value_counts()

0.0    10046
1.0       54
Name: anomalia, dtype: int64

In [248]:
df.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,7.731153,23.299155,-0.367453,4.715372,9.306179,16.780965,0.0
1,11.466833,16.943695,-0.245131,7.060311,10.462826,19.821289,0.0
2,11.501272,20.196011,1.206049,-4.957189,7.771262,19.100079,0.0
3,10.893921,16.072385,2.738045,-3.684228,7.373334,23.225524,0.0
4,10.091706,19.253894,0.996895,-9.504052,8.883988,17.903298,0.0


In [249]:
df1 = df[(df.anomalia!=0)]
df2 = df[(df.anomalia!=1)]

In [250]:
treino = df2[0:6046]
val = pd.concat([df2[6046:8046], df1[0:27]])
test = pd.concat([df2[8046:10046], df1[27:54]])
print(treino.shape, val.shape, test.shape)

(6046, 7) (2027, 7) (2027, 7)


In [251]:
df1[0:27].shape, df2[6046:8046].shape

((27, 7), (2000, 7))

In [252]:
val['anomalia'].value_counts()

0.0    2000
1.0      27
Name: anomalia, dtype: int64

In [253]:
treino.head()

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
0,7.731153,23.299155,-0.367453,4.715372,9.306179,16.780965,0.0
1,11.466833,16.943695,-0.245131,7.060311,10.462826,19.821289,0.0
2,11.501272,20.196011,1.206049,-4.957189,7.771262,19.100079,0.0
3,10.893921,16.072385,2.738045,-3.684228,7.373334,23.225524,0.0
4,10.091706,19.253894,0.996895,-9.504052,8.883988,17.903298,0.0


In [254]:
xtreino=treino.drop('anomalia', axis=1).values
ytreino=treino.anomalia.values

xval=val.drop('anomalia', axis=1).values
yval=val.anomalia.values

xtest=test.drop('anomalia', axis=1).values
ytest=test.anomalia.values

In [255]:
ann = DetectorAnomalias(epsilon = 1.0000000000000004e-06) # i.e. approximately 10 ** -6
ann.fit(xtreino)

In [256]:
xtreino

array([[ 7.73115287, 23.29915461, -0.36745342,  4.71537151,  9.30617937,
        16.78096518],
       [11.46683276, 16.94369515, -0.24513144,  7.06031139, 10.46282586,
        19.82128898],
       [11.50127248, 20.19601072,  1.20604852, -4.95718899,  7.77126188,
        19.10007872],
       ...,
       [10.12829266, 18.07314536, -1.95158758,  0.4080624 ,  8.53990017,
        18.05747396],
       [12.36248333, 17.76752237,  0.42317785, -6.97956567, 10.34695581,
        21.63124586],
       [ 8.68285285, 19.63280664,  0.64989408,  1.47898733, 11.63585945,
        20.83097871]])

In [257]:
# quick sanity checks
x = xtreino[0]
x

array([ 7.73115287, 23.29915461, -0.36745342,  4.71537151,  9.30617937,
       16.78096518])

In [258]:
ann.prob(xtreino[0])

4.1112641769790377e-07

In [259]:
ann.isAnomaly(xtreino[0])

1

In [260]:
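# All metrics used in the cells below; only roc_auc_score was imported originally
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)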

With epsilon = 1.0000000000000004e-06, the first training row is already flagged as an anomaly, even though it is labeled as normal. Let's sweep a range of epsilon values in code to see how the predictions change.

In [261]:
eps = [ 1 * 0.1 ** (n - 1) for n in range(1, 11) ]

for p in eps:
    yvalpred=[]
    detect = DetectorAnomalias(p)
    detect.fit(xtreino)

    for t in range(len(val)):
        yvalpred.append(detect.isAnomaly(xval[t]))

    print('|epsilon: '+str(p))
    print('|Confusion Matrix \n',confusion_matrix(yval, yvalpred))
    print('Accuracy:', accuracy_score(yval, yvalpred))
    print('F1 Score: ', f1_score(y_true = yval, y_pred = yvalpred))
    print('Precision', precision_score(y_true = yval, y_pred = yvalpred))
    print('Recall', recall_score(y_true = yval, y_pred = yvalpred))
    print("roc:", roc_auc_score(y_true = yval, y_score = yvalpred))
    print('======================= \n')
    

|epsilon: 1.0
|Confusion Matrix 
 [[   0 2000]
 [   0   27]]
Accuracy: 0.013320177602368031
F1 Score:  0.02629016553067186
Precision 0.013320177602368031
Recall 1.0
roc: 0.5

|epsilon: 0.1
|Confusion Matrix 
 [[   0 2000]
 [   0   27]]
Accuracy: 0.013320177602368031
F1 Score:  0.02629016553067186
Precision 0.013320177602368031
Recall 1.0
roc: 0.5

|epsilon: 0.010000000000000002
|Confusion Matrix 
 [[   0 2000]
 [   0   27]]
Accuracy: 0.013320177602368031
F1 Score:  0.02629016553067186
Precision 0.013320177602368031
Recall 1.0
roc: 0.5

|epsilon: 0.0010000000000000002
|Confusion Matrix 
 [[   0 2000]
 [   0   27]]
Accuracy: 0.013320177602368031
F1 Score:  0.02629016553067186
Precision 0.013320177602368031
Recall 1.0
roc: 0.5

|epsilon: 0.00010000000000000002
|Confusion Matrix 
 [[   0 2000]
 [   0   27]]
Accuracy: 0.013320177602368031
F1 Score:  0.02629016553067186
Precision 0.013320177602368031
Recall 1.0
roc: 0.5

|epsilon: 1.0000000000000003e-05
|Confusion Matrix 
 [[ 637

All the metrics indicate that the best value of epsilon is 1.0000000000000005e-08.
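The threshold sweep above can also be written as a vectorized search that scores the validation set once and then picks the epsilon with the highest F1. The sketch below is hedged: it runs on synthetic data (an assumption, since the exercise CSV is not reproduced here), and the helpers `log_prob` and `f1` are hypothetical names introduced only for this example. F1 is used as the selection metric because, with anomalies this rare, accuracy stays high even for a detector that never flags anything.

```python
import numpy as np

def log_prob(X, mu, sigma):
    """Sum of per-feature Gaussian log-densities for every row of X."""
    z = (X - mu) / sigma
    return (-0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)).sum(axis=1)

def f1(y_true, y_pred):
    """F1 score for the positive (anomaly) class; 0.0 when undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Synthetic stand-in for the exercise data: fit on normal rows only,
# validate on normal rows plus a few anomalies far from the mean
rng = np.random.default_rng(42)
X_train = rng.normal(size=(5000, 3))
X_val = np.vstack([rng.normal(size=(2000, 3)),
                   rng.normal(loc=6.0, size=(20, 3))])
y_val = np.r_[np.zeros(2000, dtype=int), np.ones(20, dtype=int)]

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
val_logp = log_prob(X_val, mu, sigma)       # computed once, reused for every epsilon

eps_grid = np.logspace(-20, 0, 41)          # candidate thresholds from 1e-20 to 1
scores = [f1(y_val, (val_logp < np.log(eps)).astype(int)) for eps in eps_grid]

best_idx = int(np.argmax(scores))
best_eps, best_f1 = eps_grid[best_idx], scores[best_idx]
print("best epsilon:", best_eps, "| validation F1:", best_f1)
```

Only the validation set is used to choose epsilon here, so the test set remains untouched for the final evaluation.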

Now let's look at the results of this epsilon on the test data:

In [262]:
p=1.0000000000000005e-08
detect = DetectorAnomalias(p)
detect.fit(xtreino)

ytestpred=[]
for t in range(len(test)):
    ytestpred.append(detect.isAnomaly(xtest[t]))

print('|epsilon: '+str(p))
print('|Confusion Matrix \n',confusion_matrix(ytest, ytestpred))
print('|Accuracy:', accuracy_score(ytest, ytestpred))
print('|F1 Score: ', f1_score(y_true = ytest, y_pred = ytestpred))
print('|Precision', precision_score(y_true = ytest, y_pred = ytestpred))
print('|Recall', recall_score(y_true = ytest, y_pred = ytestpred))
print("|roc:", roc_auc_score(y_true = ytest, y_score = ytestpred))
print('======================= \n')


|epsilon: 1.0000000000000005e-08
|Confusion Matrix 
 [[1996    4]
 [   0   27]]
|Accuracy: 0.9980266403552047
|F1 Score:  0.9310344827586207
|Precision 0.8709677419354839
|Recall 1.0
|roc: 0.999



The metrics remain very good on the test set for the selected epsilon: all 27 anomalies are detected, with only 4 false positives among 2000 normal instances.
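The `roc` value printed above is computed from hard 0/1 predictions, so it describes a single operating point rather than the whole ROC curve. A hedged sketch of the alternative follows: passing a continuous anomaly score (here the negative log-density) to `roc_auc_score` gives a threshold-independent AUC. The data is synthetic and the helper `log_prob` is a hypothetical name used only for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def log_prob(X, mu, sigma):
    """Sum of per-feature Gaussian log-densities for every row of X."""
    z = (X - mu) / sigma
    return (-0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)).sum(axis=1)

# Synthetic stand-in: normal rows for fitting, normal rows plus anomalies for evaluation
rng = np.random.default_rng(1)
X_train = rng.normal(size=(5000, 3))
X_eval = np.vstack([rng.normal(size=(2000, 3)),
                    rng.normal(loc=6.0, size=(20, 3))])
y_eval = np.r_[np.zeros(2000, dtype=int), np.ones(20, dtype=int)]

mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
logp = log_prob(X_eval, mu, sigma)

# AUC from hard predictions at one (deliberately loose) threshold
hard_pred = (logp < np.log(1e-4)).astype(int)
auc_hard = roc_auc_score(y_eval, hard_pred)

# AUC from the continuous score: lower density means "more anomalous",
# so the sign is flipped to make higher scores indicate anomalies
auc_scores = roc_auc_score(y_eval, -logp)

print("AUC (hard predictions):", auc_hard)
print("AUC (continuous scores):", auc_scores)
```

The score-based AUC does not depend on epsilon, which makes it useful for comparing models before a threshold is chosen.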


__2:__ 

Approach the problem as a supervised learning task, that is, train binary classification models whose goal is to detect anomalies.

Compare the results of the two methodologies.

In [263]:
df2 = df.copy()

In [264]:
X=df2.drop('anomalia', axis=1).values
y=df2.anomalia.values

In [291]:
# 8080 instances for training / 2020 for testing

Xtrain, Xtest, ytrain, ytest = X[:8080], X[8080:], y[:8080], y[8080:]


In [266]:
Xtrain.shape

(8080, 6)

In [267]:
df2.describe()

Unnamed: 0,x1,x2,x3,x4,x5,x6,anomalia
count,10100.0,10100.0,10100.0,10100.0,10100.0,10100.0,10100.0
mean,10.013737,20.023104,-0.004408,0.026576,10.008379,20.023252,0.005347
std,1.508796,1.750035,1.517486,4.987716,1.506807,1.73176,0.072928
min,4.374777,13.84815,-6.294481,-20.056818,3.599944,13.409739,0.0
25%,8.992216,18.852003,-1.044131,-3.31714,8.970816,18.867652,0.0
50%,9.994226,20.04379,0.000206,0.046865,10.006417,20.032764,0.0
75%,11.036389,21.217542,1.018719,3.345901,11.030485,21.186058,0.0
max,15.279412,26.248279,5.623409,19.868016,15.467202,26.089327,1.0


In [268]:
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
import time

In [289]:
kf = KFold(n_splits = 5)
t0 = time.time()


classif__ = LogisticRegression() 
lista_acuracia_treino = []
lista_acuracia_validacao = []

for train_index, val_index in kf.split(Xtrain, ytrain):
    
    Xtrain_folds = Xtrain[train_index]
    ytrain_folds = ytrain[train_index]
    Xval_fold = Xtrain[val_index]
    yval_fold = ytrain[val_index]
    
    classif__.fit(Xtrain_folds, ytrain_folds)
    
    pred_treino = classif__.predict(Xtrain_folds)
    pred_validacao = classif__.predict(Xval_fold)
    
    lista_acuracia_treino.append(accuracy_score(y_true = ytrain_folds, y_pred = pred_treino))
    lista_acuracia_validacao.append(accuracy_score(y_true = yval_fold, y_pred = pred_validacao))
    
    
print("training accuracies: \n", lista_acuracia_treino, " \n| mean: ", np.mean(lista_acuracia_treino))
print()
print("validation accuracies: \n", lista_acuracia_validacao, " \n| mean: ", np.mean(lista_acuracia_validacao))


t1 = time.time()
print("execution time (in seconds): ", np.round(t1-t0,2))

# Note: the metrics below use the training portion of the last fold only
print('|Confusion Matrix \n',confusion_matrix(ytrain_folds, pred_treino))
print('Accuracy:', accuracy_score(ytrain_folds, pred_treino))
print('F1 Score: ', f1_score(y_true = ytrain_folds, y_pred = pred_treino))
print('Precision', precision_score(y_true = ytrain_folds, y_pred = pred_treino))
print('Recall', recall_score(y_true = ytrain_folds, y_pred = pred_treino))
print("roc:", roc_auc_score(y_true = ytrain_folds, y_score = pred_treino))
print('======================= \n')

training accuracies: 
 [0.9942759900990099, 0.9939665841584159, 0.994894801980198, 0.9945853960396039, 0.994430693069307]  
| mean:  0.9944306930693069

validation accuracies: 
 [0.995049504950495, 0.9962871287128713, 0.9925742574257426, 0.9938118811881188, 0.994430693069307]  
| mean:  0.994430693069307
execution time (in seconds):  0.2
|Confusion Matrix 
 [[6428    0]
 [  36    0]]
Accuracy: 0.994430693069307
F1 Score:  0.0
Precision 0.0
Recall 0.0
roc: 0.5





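The supervised model did not give a good result. On the training portion of the last fold, the logistic regression predicted the normal class for every instance, so F1 score, precision and recall were all zero and the ROC AUC was 0.5. The accuracy of about 0.994 is misleading, because it only reflects the proportion of normal instances. For this dataset it is better to fit the data with the anomaly detector, which achieved much better results (F1 of about 0.93 on the test set).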

Note: if there is a way to make the supervised model work, I would like to know how it is done, since I was not able to check this with the mentor.
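One possible direction for the question above, offered as a hedged sketch rather than a definitive answer: a plain logistic regression has a linear decision boundary and treats all errors equally, so on data where anomalies are rare and lie far from the mean in several directions it tends to predict "normal" everywhere. Two common remedies are `class_weight="balanced"`, which makes errors on the rare class cost more, and squared/interaction features, which let the boundary curve around the normal region. The example below runs on synthetic data, which is an assumption: the structure of the anomalies in the exercise dataset has not been verified, so results on the real data may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in (assumption): normal rows around the origin,
# anomalies about 5 standard deviations away with random signs per feature
rng = np.random.default_rng(0)
X_norm = rng.normal(size=(3000, 3))
signs = rng.choice([-1.0, 1.0], size=(30, 3))
X_anom = signs * rng.normal(loc=5.0, scale=0.5, size=(30, 3))
X = np.vstack([X_norm, X_anom])
y = np.r_[np.zeros(3000, dtype=int), np.ones(30, dtype=int)]

# Stratified, shuffled folds keep some anomalies in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Baseline: the same kind of model used in the cell above
baseline = LogisticRegression(max_iter=1000)
base_pred = cross_val_predict(baseline, X, y, cv=cv)

# Remedy: standardize, add degree-2 terms, and reweight the classes
model = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
y_pred = cross_val_predict(model, X, y, cv=cv)

base_recall = recall_score(y, base_pred)
recall = recall_score(y, y_pred)
f1 = f1_score(y, y_pred)
print("baseline recall:", base_recall)
print("quadratic + balanced | recall:", recall, "| F1:", f1)
```

The metrics here come from out-of-fold predictions, so they measure performance on rows the model did not see during fitting. Recall and F1 are reported instead of accuracy for the same reason as in the detector evaluation: with so few anomalies, accuracy is dominated by the normal class.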