# **Outlier Detection Methods**

# 1. Imports

In [1]:
import numpy as np
import yfinance as yf
from utils import adf_test
from detect_outliers import statistical_method, isolation_forest_method, LSTMAutoencoder
from plots import time_series_plot


# 2. Data

In [2]:
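# Load the data; .copy() makes df an independent frame rather than a slice of the
# downloaded one, so adding columns later does not trigger SettingWithCopyWarning
df = yf.download("^NDX", period="2y", interval="1h")[["Adj Close"]].copy()

# Reset the index so the date becomes a regular column
df.reset_index(inplace=True)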



In [3]:
# df.to_csv("df.csv", index=False)

In [4]:
time_series_plot(
    df=df,
    target_col="Adj Close",
    # title="Nasdaq-100 Adjusted Close Price Over 2 Years (Hourly)",
    width=1500,
    height=600
)

### 2.1. Computing the returns

In [5]:
# Simple return: P_t / P_{t-1} - 1
df["returns"] = df["Adj Close"] / df["Adj Close"].shift(1) - 1
# Drop the first row (it has no previous price); .copy() keeps df an independent frame
df = df.dropna().copy()






In [6]:
df

|      | Datetime                  | Adj Close    | returns   |
|------|---------------------------|--------------|-----------|
| 1    | 2022-06-21 10:30:00-04:00 | 11609.618164 | 0.001055  |
| 2    | 2022-06-21 11:30:00-04:00 | 11568.227539 | -0.003565 |
| 3    | 2022-06-21 12:30:00-04:00 | 11579.245117 | 0.000952  |
| 4    | 2022-06-21 13:30:00-04:00 | 11561.703125 | -0.001515 |
| 5    | 2022-06-21 14:30:00-04:00 | 11575.915039 | 0.001229  |
| ...  | ...                       | ...          | ...       |
| 3490 | 2024-06-18 11:30:00-04:00 | 19888.472656 | 0.000982  |
| 3491 | 2024-06-18 12:30:00-04:00 | 19927.224609 | 0.001948  |
| 3492 | 2024-06-18 13:30:00-04:00 | 19898.933594 | -0.001420 |
| 3493 | 2024-06-18 14:30:00-04:00 | 19906.302734 | 0.000370  |
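
The `returns` column is the simple (arithmetic) return, `P_t / P_{t-1} - 1`. As a small sketch on made-up prices (the values below are illustrative, not taken from the Nasdaq-100 data), the manual formula matches pandas' built-in `pct_change`. Log returns, a common alternative that this notebook does not use, are close to simple returns when price changes are small:

```python
import numpy as np
import pandas as pd

# Illustrative prices only; not values from the downloaded data.
prices = pd.Series([100.0, 101.0, 99.99, 100.5])

# Simple return, as computed in the notebook: P_t / P_{t-1} - 1
simple_manual = prices / prices.shift(1) - 1

# Equivalent pandas built-in
simple_builtin = prices.pct_change()

# Log return: ln(P_t / P_{t-1})
log_returns = np.log(prices / prices.shift(1))

# The first row is NaN in all three columns because there is no previous price.
returns = pd.DataFrame(
    {"simple": simple_manual, "pct_change": simple_builtin, "log": log_returns}
).dropna()
print(returns)
```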


In [7]:
# Plot the data using Plotly
time_series_plot(
    df=df,
    target_col="returns",
    title="Nasdaq-100 Returns Over 2 Years (Hourly)",
)

### 2.2. Stationarity test

In [8]:
adf_test(df["Adj Close"])
adf_test(df["returns"])

p-value (Adj Close): 0.981365
We fail to reject the null hypothesis that the Adj Close series has a unit root. Therefore, the Adj Close series is not stationary.
p-value (returns): 0.000000
We reject the null hypothesis that the returns series has a unit root. Therefore, the returns series is stationary.


# 3. Outlier Detection - Method 1 - Statistical

In [9]:
statistical_method(df=df, target_col="returns", threshold=3)

In [10]:
# px.histogram(df["log_return"])

Problems with this method:
- Extreme outliers can inflate the standard deviation, making other outliers harder to detect.
- It does not account for temporal dependence.
- It does not adapt to changes in volatility over time. In high-volatility periods, more points may be wrongly classified as outliers, while in low-volatility periods fewer outliers may be detected.

In short, it is a very simple method for more complex data such as financial time series, but it can still be useful for handling outliers in more traditional supervised problems, especially with simpler, structured data.
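
The source of `statistical_method` is not shown. Given `threshold=3`, a plausible reading is a z-score rule, which is what the sketch below assumes. The function names, the 120-point window and the synthetic data are hypothetical. The rolling variant is one possible way to address the volatility issue listed above, not something this notebook implements:

```python
import numpy as np
import pandas as pd


def zscore_outliers(series: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag points more than `threshold` global standard deviations from the global mean."""
    z = (series - series.mean()) / series.std()
    return z.abs() > threshold


def rolling_zscore_outliers(
    series: pd.Series, window: int = 120, threshold: float = 3.0
) -> pd.Series:
    """Same rule, but with mean and std taken from the previous `window` points only."""
    # shift(1) keeps the current point out of its own reference statistics
    mean = series.rolling(window).mean().shift(1)
    std = series.rolling(window).std().shift(1)
    z = (series - mean) / std
    # The first `window` points have no reference statistics (NaN) and are never flagged
    return z.abs() > threshold


# Synthetic returns: a calm regime, then a volatile regime, plus one injected spike
rng = np.random.default_rng(42)
values = np.concatenate([rng.normal(0, 0.001, 500), rng.normal(0, 0.005, 500)])
values[250] = 0.05
returns = pd.Series(values)

global_flags = zscore_outliers(returns)
rolling_flags = rolling_zscore_outliers(returns)
print("global:", int(global_flags.sum()), "rolling:", int(rolling_flags.sum()))
```

The rolling version still has the first weakness: once the spike enters the window it inflates the local standard deviation for the next `window` points.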

# 4. Outlier Detection - Method 2 - Isolation Forest

In [11]:
isolation_forest_method(
    df=df, target_col="returns", n_estimators=500, contamination=0.001
)

Pros of this method:

- Robust to extreme outliers, since it is an ensemble method.
- Can detect more complex patterns in the data than the first method.

Cons of this method:

- It still does not account for temporal dependence, i.e. it is not specific to time series, even though its results here are more interesting. In general, though, it tends to work very well on traditional tabular data, precisely because more variables are available.

# 5. Outlier Detection - Method 3 - Autoencoder

In [12]:
autoencoder = LSTMAutoencoder(
    df=df,
    target_col="returns",
    sequence_length=24,
    lstm_layers=64,
    latent_dim=4,
    dropout=0.2,
    epochs=300,
    batch_size=32,
    threshold_quantile=0.99,
)

In [4]:
# import tensorflow as tf
# print("Available GPUs: ", tf.config.list_physical_devices('GPU'))

Available GPUs:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]



In [13]:
X = autoencoder.preprocessing()

In [14]:
X.shape

(3470, 24, 1)
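
The source of `preprocessing()` is not shown, but the shape is consistent with overlapping windows of length 24 taken with stride 1 over the 3493 rows of `df`: 3493 - 24 + 1 = 3470. A minimal sketch of that windowing follows. `make_sequences` is a hypothetical name, and any scaling the real method may apply is left out:

```python
import numpy as np


def make_sequences(values: np.ndarray, sequence_length: int) -> np.ndarray:
    """Stack overlapping windows (stride 1) into shape (n_windows, sequence_length, 1)."""
    windows = np.lib.stride_tricks.sliding_window_view(values, sequence_length)
    # Copy so the result is a regular writable array, then add the feature axis
    return windows.copy()[..., np.newaxis]


# Placeholder series with as many rows as df; the values themselves are arbitrary
values = np.arange(3493, dtype=np.float32)
X_demo = make_sequences(values, sequence_length=24)
print(X_demo.shape)  # (3470, 24, 1)
```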

In [13]:
model = autoencoder.build_model(X=X)




In [14]:
history = autoencoder.train(model, X)

Epoch 1/300




Epoch 2/300
Epoch 3/300
...
Epoch 78/300

In [15]:
df_anomalies = autoencoder.get_anomalies(X, model)
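
The source of `get_anomalies` is not shown. A common approach, assumed here, is to score each window by its reconstruction error and flag those above the `threshold_quantile` (0.99 in the constructor call) of all errors. In the sketch below, `flag_by_reconstruction_error` is a hypothetical name, mean absolute error is an assumed choice of metric, and `X_hat` stands in for the model's reconstruction of `X`:

```python
import numpy as np


def flag_by_reconstruction_error(X, X_reconstructed, threshold_quantile=0.99):
    """Return per-window errors, the quantile threshold, and a boolean anomaly mask."""
    # Mean absolute error per window, averaged over time steps and features
    errors = np.mean(np.abs(X - X_reconstructed), axis=(1, 2))
    threshold = np.quantile(errors, threshold_quantile)
    return errors, threshold, errors > threshold


rng = np.random.default_rng(0)
X_demo = rng.normal(0, 0.001, size=(1000, 24, 1))
# Stand-in for the model output: a slightly noisy copy, with one window reconstructed badly
X_hat = X_demo + rng.normal(0, 0.0001, size=X_demo.shape)
X_hat[123] += 0.01

errors, threshold, is_anomaly = flag_by_reconstruction_error(X_demo, X_hat)
print(f"threshold={threshold:.6f}, flagged={int(is_anomaly.sum())}")
```

A quantile threshold flags about 1% of the windows by construction, whether or not they are truly anomalous, so `threshold_quantile` plays a role similar to `contamination` in the Isolation Forest.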



In [16]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 24, 1)]           0         
                                                                 
 bidirectional (Bidirection  (None, 128)               33792     
 al)                                                             
                                                                 
 dropout (Dropout)           (None, 128)               0         
                                                                 
 dense (Dense)               (None, 4)                 516       
                                                                 
 repeat_vector (RepeatVecto  (None, 24, 4)             0         
 r)                                                              
                                                                 
 bidirectional_1 (Bidirecti  (None, 24, 128)           35328 
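
The summary above is cut off, but the visible parameter counts can be checked by hand. They are consistent with reading `lstm_layers=64` as 64 LSTM units per direction and `latent_dim=4` as the width of the dense bottleneck. The helper names below are hypothetical. The formulas are the standard ones for an LSTM layer with a single bias vector per gate and for a dense layer:

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Parameters of one LSTM layer: 4 gates, each with input weights,
    recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def bidirectional_lstm_params(input_dim: int, units: int) -> int:
    """A bidirectional wrapper holds two independent LSTMs (forward and backward)."""
    return 2 * lstm_params(input_dim, units)


def dense_params(input_dim: int, units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return input_dim * units + units


encoder = bidirectional_lstm_params(input_dim=1, units=64)   # input has 1 feature
bottleneck = dense_params(input_dim=128, units=4)            # 2 * 64 concatenated outputs
decoder = bidirectional_lstm_params(input_dim=4, units=64)   # fed by the repeated latent vector
print(encoder, bottleneck, decoder)  # 33792 516 35328
```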

In [17]:
autoencoder.plot_anomalies()