# Exploring the TEP dataset

Each **.RData** file is an external representation of an R dataframe. To load the *dataset* in Python, we use the **pyreadr** module, which can be installed with the following command:

```bash 
pip install pyreadr
```

The loaded variables are named 'fault_free_training', 'fault_free_testing', 'faulty_testing', and 'faulty_training', matching the RData files. Each dataframe contains 55 columns:

- Column 1 ('faultNumber') ranges from 1 to 20 in the "Faulty" datasets and represents the fault type in the TEP. The "FaultFree" datasets contain only fault 0 (that is, normal operating conditions).

- Column 2 ('simulationRun') ranges from 1 to 500 and represents a different random number generator state from which a full TEP dataset was generated (Note: the actual seeds used to generate the training and testing datasets do not overlap).

- Column 3 ('sample') ranges from 1 to 500 ("Training" datasets) or from 1 to 960 ("Testing" datasets). The TEP variables (columns 4 to 55) were sampled every 3 minutes for a total duration of 25 hours and 48 hours, respectively. Note that the faults were introduced 1 hour and 8 hours into the Faulty Training and Faulty Testing datasets, respectively.

- Columns 4 to 55 contain the process variables; the column names retain the original variable names.
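The sample counts and durations in the list above are tied together by the 3-minute sampling interval. The arithmetic can be sketched as follows. The constant and function names are illustrative and not part of the dataset, and the onset index assumes that sample k is taken k × 3 minutes into the run, which the dataset description does not state explicitly.

```python
SAMPLING_INTERVAL_MIN = 3  # the TEP variables are sampled every 3 minutes


def duration_hours(n_samples: int) -> float:
    """Total simulated time covered by n_samples samples, in hours."""
    return n_samples * SAMPLING_INTERVAL_MIN / 60


def onset_sample(onset_hours: float) -> int:
    """Sample index at which the elapsed time reaches onset_hours.

    Assumes sample k is taken k * 3 minutes into the run (an assumption).
    """
    return int(onset_hours * 60 / SAMPLING_INTERVAL_MIN)


training_hours = duration_hours(500)   # 25.0
testing_hours = duration_hours(960)    # 48.0
training_onset = onset_sample(1)       # 20
testing_onset = onset_sample(8)        # 160

# Expected row counts: runs x samples (x fault types for the faulty sets)
normal_training_rows = 500 * 500        # 250000
faulty_training_rows = 20 * 500 * 500   # 5000000
```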

```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import pyreadr

from tqdm.notebook import tqdm
```

## Loading the training data

```python
# read_r returns a dict-like object keyed by the name of each R object stored in the file
normal = pyreadr.read_r('.\\data\\TEP_FaultFree_Training.RData')['fault_free_training']
faulty = pyreadr.read_r('.\\data\\TEP_Faulty_Training.RData')['faulty_training']

# Store the identifier columns as integers
normal[['faultNumber', 'simulationRun']] = normal[['faultNumber', 'simulationRun']].astype('int32')
faulty[['faultNumber', 'simulationRun']] = faulty[['faultNumber', 'simulationRun']].astype('int32')

# Add a per-sample label: 0 (normal) before sample 20, the fault number from then on
faulty.insert(1, 'fault', np.where(faulty['sample'] < 20, 0, faulty['faultNumber']))

# Cache both dataframes in a single HDF5 file (to_hdf requires PyTables)
normal.to_hdf('.\\data\\training.h5', key='normal')
faulty.to_hdf('.\\data\\training.h5', key='faulty')

print(f'Normal data: {normal.shape}')
print(f'Faulty data: {faulty.shape}')
```

```text
Normal data: (250000, 55)
Faulty data: (5000000, 56)
```
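The `fault` column built in the loading step gives every row its own label: rows before the onset threshold are marked 0, and the rest carry the run's fault number. The sketch below reproduces that rule on a tiny synthetic dataframe so the effect is easy to inspect. `add_fault_label`, `toy`, and the `onset_sample` parameter are hypothetical names introduced here, and the default threshold of 20 simply mirrors the value used in the loading code.

```python
import numpy as np
import pandas as pd


def add_fault_label(df: pd.DataFrame, onset_sample: int = 20) -> pd.DataFrame:
    """Return a copy of df with a per-sample 'fault' column inserted at position 1."""
    labeled = df.copy()
    # 0 before the onset threshold, the run's fault number from the threshold on
    labels = np.where(labeled['sample'] < onset_sample, 0, labeled['faultNumber'])
    labeled.insert(1, 'fault', labels)
    return labeled


# Synthetic stand-in for part of one faulty run (fault 2), samples 18 to 22
toy = pd.DataFrame({
    'faultNumber': [2] * 5,
    'simulationRun': [1] * 5,
    'sample': [18, 19, 20, 21, 22],
})
labeled_toy = add_fault_label(toy)
print(labeled_toy[['sample', 'fault']])
```

The testing files, where the faults start 8 hours in, would need a different `onset_sample`; it is left as a parameter here rather than assumed.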


## Processing the training datasets

```python
normal.head()
```

| | faultNumber | simulationRun | sample | xmeas_1 | xmeas_2 | xmeas_3 | xmeas_4 | xmeas_5 | xmeas_6 | xmeas_7 | ... | xmv_2 | xmv_3 | xmv_4 | xmv_5 | xmv_6 | xmv_7 | xmv_8 | xmv_9 | xmv_10 | xmv_11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 0.25038 | 3674.0 | 4529.0 | 9.232 | 26.889 | 42.402 | 2704.3 | ... | 53.744 | 24.657 | 62.544 | 22.137 | 39.935 | 42.323 | 47.757 | 47.51 | 41.258 | 18.447 |
| 1 | 0 | 1 | 2 | 0.25109 | 3659.4 | 4556.6 | 9.4264 | 26.721 | 42.576 | 2705.0 | ... | 53.414 | 24.588 | 59.259 | 22.084 | 40.176 | 38.554 | 43.692 | 47.427 | 41.359 | 17.194 |
| 2 | 0 | 1 | 3 | 0.25038 | 3660.3 | 4477.8 | 9.4426 | 26.875 | 42.07 | 2706.2 | ... | 54.357 | 24.666 | 61.275 | 22.38 | 40.244 | 38.99 | 46.699 | 47.468 | 41.199 | 20.53 |
| 3 | 0 | 1 | 4 | 0.24977 | 3661.3 | 4512.1 | 9.4776 | 26.758 | 42.063 | 2707.2 | ... | 53.946 | 24.725 | 59.856 | 22.277 | 40.257 | 38.072 | 47.541 | 47.658 | 41.643 | 18.089 |
| 4 | 0 | 1 | 5 | 0.29405 | 3679.0 | 4497.0 | 9.3381 | 26.889 | 42.65 | 2705.1 | ... | 53.658 | 28.797 | 60.717 | 21.947 | 39.144 | 41.955 | 47.645 | 47.346 | 41.507 | 18.461 |


```python
faulty.head()
```

| | faultNumber | fault | simulationRun | sample | xmeas_1 | xmeas_2 | xmeas_3 | xmeas_4 | xmeas_5 | xmeas_6 | ... | xmv_2 | xmv_3 | xmv_4 | xmv_5 | xmv_6 | xmv_7 | xmv_8 | xmv_9 | xmv_10 | xmv_11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 1 | 0.25038 | 3674.0 | 4529.0 | 9.232 | 26.889 | 42.402 | ... | 53.744 | 24.657 | 62.544 | 22.137 | 39.935 | 42.323 | 47.757 | 47.51 | 41.258 | 18.447 |
| 1 | 1 | 0 | 1 | 2 | 0.25109 | 3659.4 | 4556.6 | 9.4264 | 26.721 | 42.576 | ... | 53.414 | 24.588 | 59.259 | 22.084 | 40.176 | 38.554 | 43.692 | 47.427 | 41.359 | 17.194 |
| 2 | 1 | 0 | 1 | 3 | 0.25038 | 3660.3 | 4477.8 | 9.4426 | 26.875 | 42.07 | ... | 54.357 | 24.666 | 61.275 | 22.38 | 40.244 | 38.99 | 46.699 | 47.468 | 41.199 | 20.53 |
| 3 | 1 | 0 | 1 | 4 | 0.24977 | 3661.3 | 4512.1 | 9.4776 | 26.758 | 42.063 | ... | 53.946 | 24.725 | 59.856 | 22.277 | 40.257 | 38.072 | 47.541 | 47.658 | 41.643 | 18.089 |
| 4 | 1 | 0 | 1 | 5 | 0.29405 | 3679.0 | 4497.0 | 9.3381 | 26.889 | 42.65 | ... | 53.658 | 28.797 | 60.717 | 21.947 | 39.144 | 41.955 | 47.645 | 47.346 | 41.507 | 18.461 |


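```python
# Select a single simulation run of fault type 2
data = faulty[(faulty['simulationRun'] == 1) & (faulty['faultNumber'] == 2)]

# One trace per process variable; the first four columns hold identifiers and labels
fig = go.Figure([
    go.Scatter(x=data.index, y=data[col], name=col) for col in data.columns.tolist()[4:]
])
fig.show()
```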
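The plot above picks the process variables by position (`[4:]`), which depends on how many identifier columns come first: `normal` has three, while `faulty` has four once `fault` is inserted. Selecting by name avoids that dependency. The sketch below is one possible approach; `process_columns` is a hypothetical helper, and it assumes every process variable name starts with `xmeas_` or `xmv_`, as the columns visible in the `head()` output do.

```python
from typing import Iterable, List

# Assumed prefixes, taken from the column names visible in the head() output
PROCESS_PREFIXES = ('xmeas_', 'xmv_')


def process_columns(columns: Iterable[str]) -> List[str]:
    """Return the column names that look like TEP process variables, in order."""
    return [col for col in columns if col.startswith(PROCESS_PREFIXES)]


# Example with a shortened, illustrative column list
example_columns = ['faultNumber', 'fault', 'simulationRun', 'sample',
                   'xmeas_1', 'xmeas_2', 'xmv_10', 'xmv_11']
selected = process_columns(example_columns)
print(selected)
```

In the plotting code this could replace the positional slice, for example `for col in process_columns(data.columns)`.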