# 01 - Data preprocessing

This notebook reads the original data, cleans it, and exports the dataset that will be used for the analyses and for model training.

## Imports

In [1]:
# Standard library
import json
import pickle

# Third-party utility libraries
import numpy as np
import pandas as pd
from scipy.io import arff

## Constants and settings

In [2]:
pd.set_option("future.no_silent_downcasting", True)  # Opt in to pandas' future no-downcasting behavior, which suppresses the related FutureWarning

## Scripts

### Obtaining the data

For this project we use the [steel-plates-fault](https://www.openml.org/search?type=data&status=active&id=1504&sort=runs) dataset, which contains information about defects in steel plates.

In [3]:
# Use a context manager so the file handle is closed after loading
with open('../data/raw/steel-plates-fault.arff', 'r', encoding='utf-8') as file:
    arff_data = arff.loadarff(file)
df = pd.DataFrame(arff_data[0])

df

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V25,V26,V27,V28,V29,V30,V31,V32,V33,Class
0,42.0,50.0,270900.0,270944.0,267.0,17.0,44.0,24220.0,76.0,108.0,...,0.8182,-0.2913,0.5822,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
1,645.0,651.0,2538079.0,2538108.0,108.0,10.0,30.0,11397.0,84.0,123.0,...,0.7931,-0.1756,0.2984,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
2,829.0,835.0,1553913.0,1553931.0,71.0,8.0,19.0,7972.0,99.0,125.0,...,0.6667,-0.1228,0.2150,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
3,853.0,860.0,369370.0,369415.0,176.0,13.0,45.0,18996.0,99.0,126.0,...,0.8444,-0.1568,0.5212,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
4,1289.0,1306.0,498078.0,498335.0,2409.0,60.0,260.0,246930.0,37.0,126.0,...,0.9338,-0.1992,1.0000,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1936,249.0,277.0,325780.0,325796.0,273.0,54.0,22.0,35033.0,119.0,141.0,...,-0.4286,0.0026,0.7254,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1937,144.0,175.0,340581.0,340598.0,287.0,44.0,24.0,34599.0,112.0,133.0,...,-0.4516,-0.0582,0.8173,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1938,145.0,174.0,386779.0,386794.0,292.0,40.0,22.0,37572.0,120.0,140.0,...,-0.4828,0.0052,0.7079,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1939,137.0,170.0,422497.0,422528.0,419.0,97.0,47.0,52715.0,117.0,140.0,...,-0.0606,-0.0171,0.9919,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
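
The `Class` column shows values such as `b'1'` because `scipy.io.arff.loadarff` returns nominal attributes as byte strings. The sketch below reproduces this with a tiny in-memory ARFF file. The relation name, the two attributes and the two rows are made up for illustration and are not the real dataset.

```python
import io

import pandas as pd
from scipy.io import arff

# Hypothetical two-row ARFF file, used only to illustrate the loader's output
toy_arff = io.StringIO(
    "@relation toy\n"
    "@attribute V1 numeric\n"
    "@attribute Class {1,2}\n"
    "@data\n"
    "42.0,1\n"
    "645.0,2\n"
)

data, meta = arff.loadarff(toy_arff)
toy_df = pd.DataFrame(data)

# Nominal values arrive as bytes, numeric values as floats
raw_class_values = toy_df['Class'].tolist()

# Optional step: decode the bytes into regular strings
decoded = toy_df['Class'].str.decode('utf-8')
```

Decoding is optional here, since the notebook later recodes the byte values directly.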


To make the data easier to read, let's rename the columns to the original feature names:

In [4]:
with open('../data/raw/dataset_description.json', 'r') as file:
    col_desc = json.load(file)

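# Take the second '###' section of the description and split it into one bullet per column
col_desc = col_desc['data_set_description']['description'].split('###')[1].split('\n*')
# Drop the section title and strip surrounding whitespace
col_desc = [t.strip() for t in col_desc if 'Attribute Information' not in t]
# Each remaining entry looks like "V1: X_Minimum"
col_desc = dict(t.split(': ') for t in col_desc)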

col_desc

{'V1': 'X_Minimum',
 'V2': 'X_Maximum',
 'V3': 'Y_Minimum',
 'V4': 'Y_Maximum',
 'V5': 'Pixels_Areas',
 'V6': 'X_Perimeter',
 'V7': 'Y_Perimeter',
 'V8': 'Sum_of_Luminosity',
 'V9': 'Minimum_of_Luminosity',
 'V10': 'Maximum_of_Luminosity',
 'V11': 'Length_of_Conveyer',
 'V12': 'TypeOfSteel_A300',
 'V13': 'TypeOfSteel_A400',
 'V14': 'Steel_Plate_Thickness',
 'V15': 'Edges_Index',
 'V16': 'Empty_Index',
 'V17': 'Square_Index',
 'V18': 'Outside_X_Index',
 'V19': 'Edges_X_Index',
 'V20': 'Edges_Y_Index',
 'V21': 'Outside_Global_Index',
 'V22': 'LogOfAreas',
 'V23': 'Log_X_Index',
 'V24': 'Log_Y_Index',
 'V25': 'Orientation_Index',
 'V26': 'Luminosity_Index',
 'V27': 'SigmoidOfAreas',
 'V28': 'Pastry',
 'V29': 'Z_Scratch',
 'V30': 'K_Scatch',
 'V31': 'Stains',
 'V32': 'Dirtiness',
 'V33': 'Bumps',
 'Class': 'Other_Faults'}
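
The split-based parsing above depends on the exact layout of the description text. As a possible alternative, the same mapping could be extracted with a regular expression. This is only a sketch: `sample_description` is a hypothetical excerpt written to resemble the bullet list, and `parse_column_names` is an illustrative helper, not part of the notebook.

```python
import re

# Hypothetical excerpt; the real text lives in dataset_description.json
sample_description = (
    "Steel plates faults.\n"
    "### Attribute Information\n"
    "* V1: X_Minimum\n"
    "* V2: X_Maximum\n"
    "* Class: Other_Faults\n"
)


def parse_column_names(description):
    """Map generic column names (V1, V2, ...) to the original feature names."""
    # One bullet per line, in the form "* <generic name>: <feature name>"
    pattern = re.compile(r"^\*\s*(\w+):\s*(\w+)\s*$", flags=re.MULTILINE)
    return dict(pattern.findall(description))


column_names = parse_column_names(sample_description)
print(column_names)
```

Lines that do not match the bullet pattern are simply ignored, so the section title does not need to be filtered out separately.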

In [5]:
df.rename(columns=col_desc, inplace=True)
df

Unnamed: 0,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,Maximum_of_Luminosity,...,Orientation_Index,Luminosity_Index,SigmoidOfAreas,Pastry,Z_Scratch,K_Scatch,Stains,Dirtiness,Bumps,Other_Faults
0,42.0,50.0,270900.0,270944.0,267.0,17.0,44.0,24220.0,76.0,108.0,...,0.8182,-0.2913,0.5822,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
1,645.0,651.0,2538079.0,2538108.0,108.0,10.0,30.0,11397.0,84.0,123.0,...,0.7931,-0.1756,0.2984,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
2,829.0,835.0,1553913.0,1553931.0,71.0,8.0,19.0,7972.0,99.0,125.0,...,0.6667,-0.1228,0.2150,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
3,853.0,860.0,369370.0,369415.0,176.0,13.0,45.0,18996.0,99.0,126.0,...,0.8444,-0.1568,0.5212,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
4,1289.0,1306.0,498078.0,498335.0,2409.0,60.0,260.0,246930.0,37.0,126.0,...,0.9338,-0.1992,1.0000,1.0,0.0,0.0,0.0,0.0,0.0,b'1'
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1936,249.0,277.0,325780.0,325796.0,273.0,54.0,22.0,35033.0,119.0,141.0,...,-0.4286,0.0026,0.7254,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1937,144.0,175.0,340581.0,340598.0,287.0,44.0,24.0,34599.0,112.0,133.0,...,-0.4516,-0.0582,0.8173,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1938,145.0,174.0,386779.0,386794.0,292.0,40.0,22.0,37572.0,120.0,140.0,...,-0.4828,0.0052,0.7079,0.0,0.0,0.0,0.0,0.0,0.0,b'2'
1939,137.0,170.0,422497.0,422528.0,419.0,97.0,47.0,52715.0,117.0,140.0,...,-0.0606,-0.0171,0.9919,0.0,0.0,0.0,0.0,0.0,0.0,b'2'


In addition, let's convert the `Other_Faults` column to an integer type, recoding the raw value `b'1'` as 0 and `b'2'` as 1.

In [6]:
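# Recode the byte labels and cast explicitly, since replace() alone keeps the object dtype
df['Other_Faults'] = df['Other_Faults'].replace({b'1': 0, b'2': 1}).astype(int)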
df

Unnamed: 0,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,Maximum_of_Luminosity,...,Orientation_Index,Luminosity_Index,SigmoidOfAreas,Pastry,Z_Scratch,K_Scatch,Stains,Dirtiness,Bumps,Other_Faults
0,42.0,50.0,270900.0,270944.0,267.0,17.0,44.0,24220.0,76.0,108.0,...,0.8182,-0.2913,0.5822,1.0,0.0,0.0,0.0,0.0,0.0,0
1,645.0,651.0,2538079.0,2538108.0,108.0,10.0,30.0,11397.0,84.0,123.0,...,0.7931,-0.1756,0.2984,1.0,0.0,0.0,0.0,0.0,0.0,0
2,829.0,835.0,1553913.0,1553931.0,71.0,8.0,19.0,7972.0,99.0,125.0,...,0.6667,-0.1228,0.2150,1.0,0.0,0.0,0.0,0.0,0.0,0
3,853.0,860.0,369370.0,369415.0,176.0,13.0,45.0,18996.0,99.0,126.0,...,0.8444,-0.1568,0.5212,1.0,0.0,0.0,0.0,0.0,0.0,0
4,1289.0,1306.0,498078.0,498335.0,2409.0,60.0,260.0,246930.0,37.0,126.0,...,0.9338,-0.1992,1.0000,1.0,0.0,0.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1936,249.0,277.0,325780.0,325796.0,273.0,54.0,22.0,35033.0,119.0,141.0,...,-0.4286,0.0026,0.7254,0.0,0.0,0.0,0.0,0.0,0.0,1
1937,144.0,175.0,340581.0,340598.0,287.0,44.0,24.0,34599.0,112.0,133.0,...,-0.4516,-0.0582,0.8173,0.0,0.0,0.0,0.0,0.0,0.0,1
1938,145.0,174.0,386779.0,386794.0,292.0,40.0,22.0,37572.0,120.0,140.0,...,-0.4828,0.0052,0.7079,0.0,0.0,0.0,0.0,0.0,0.0,1
1939,137.0,170.0,422497.0,422528.0,419.0,97.0,47.0,52715.0,117.0,140.0,...,-0.0606,-0.0171,0.9919,0.0,0.0,0.0,0.0,0.0,0.0,1
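
One way to make the integer conversion explicit, and to catch unexpected labels, is sketched below with `Series.map` followed by a cast. The toy series is hypothetical and only mimics the byte labels seen in the real column.

```python
import pandas as pd

# Hypothetical sample mimicking the raw byte labels of the column
raw = pd.Series([b'1', b'2', b'1', b'2'], name='Other_Faults')
mapping = {b'1': 0, b'2': 1}

# map() yields NaN for any label missing from the mapping
encoded = raw.map(mapping)
if encoded.isna().any():
    raise ValueError("Found labels that are not covered by the mapping")

# Cast explicitly so the column ends up with an integer dtype
encoded = encoded.astype('int64')
print(encoded.dtype)
```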


With that, let's create a dictionary that maps each integer label to its target (the fault type). It will be used when training the model:

In [7]:
target_cols = ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']
target_maps = dict(enumerate(target_cols))
target_maps

{0: 'Pastry',
 1: 'Z_Scratch',
 2: 'K_Scatch',
 3: 'Stains',
 4: 'Dirtiness',
 5: 'Bumps',
 6: 'Other_Faults'}
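
As an illustration of how this dictionary might be used later, the sketch below decodes integer labels back into fault names and builds the inverse lookup. The `predicted_ids` list is a made-up example, not model output.

```python
target_maps = {
    0: 'Pastry',
    1: 'Z_Scratch',
    2: 'K_Scatch',
    3: 'Stains',
    4: 'Dirtiness',
    5: 'Bumps',
    6: 'Other_Faults',
}

# Hypothetical integer labels, e.g. as a classifier might return them
predicted_ids = [0, 2, 6]
predicted_names = [target_maps[i] for i in predicted_ids]

# Inverse lookup: fault name -> integer label
name_to_id = {name: idx for idx, name in target_maps.items()}
print(predicted_names, name_to_id['Bumps'])
```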

In [8]:
# For each row, store the position (within target_cols) of the largest fault flag
df['target'] = df.apply(lambda s: np.argmax(s[target_cols]), axis=1)
df

Unnamed: 0,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,Maximum_of_Luminosity,...,Luminosity_Index,SigmoidOfAreas,Pastry,Z_Scratch,K_Scatch,Stains,Dirtiness,Bumps,Other_Faults,target
0,42.0,50.0,270900.0,270944.0,267.0,17.0,44.0,24220.0,76.0,108.0,...,-0.2913,0.5822,1.0,0.0,0.0,0.0,0.0,0.0,0,0
1,645.0,651.0,2538079.0,2538108.0,108.0,10.0,30.0,11397.0,84.0,123.0,...,-0.1756,0.2984,1.0,0.0,0.0,0.0,0.0,0.0,0,0
2,829.0,835.0,1553913.0,1553931.0,71.0,8.0,19.0,7972.0,99.0,125.0,...,-0.1228,0.2150,1.0,0.0,0.0,0.0,0.0,0.0,0,0
3,853.0,860.0,369370.0,369415.0,176.0,13.0,45.0,18996.0,99.0,126.0,...,-0.1568,0.5212,1.0,0.0,0.0,0.0,0.0,0.0,0,0
4,1289.0,1306.0,498078.0,498335.0,2409.0,60.0,260.0,246930.0,37.0,126.0,...,-0.1992,1.0000,1.0,0.0,0.0,0.0,0.0,0.0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1936,249.0,277.0,325780.0,325796.0,273.0,54.0,22.0,35033.0,119.0,141.0,...,0.0026,0.7254,0.0,0.0,0.0,0.0,0.0,0.0,1,6
1937,144.0,175.0,340581.0,340598.0,287.0,44.0,24.0,34599.0,112.0,133.0,...,-0.0582,0.8173,0.0,0.0,0.0,0.0,0.0,0.0,1,6
1938,145.0,174.0,386779.0,386794.0,292.0,40.0,22.0,37572.0,120.0,140.0,...,0.0052,0.7079,0.0,0.0,0.0,0.0,0.0,0.0,1,6
1939,137.0,170.0,422497.0,422528.0,419.0,97.0,47.0,52715.0,117.0,140.0,...,-0.0171,0.9919,0.0,0.0,0.0,0.0,0.0,0.0,1,6
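
Note that `np.argmax` returns 0 when every flag in a row is zero and the first position when several flags are set, so the labels are only meaningful if each row has exactly one active flag. A minimal sketch of such a check follows, assuming a shortened, hypothetical set of three flag columns and a toy frame.

```python
import pandas as pd

# Shortened, hypothetical version of the fault flag columns
flag_cols = ['Pastry', 'Z_Scratch', 'K_Scatch']
toy = pd.DataFrame({
    'Pastry':    [1.0, 0.0, 0.0],
    'Z_Scratch': [0.0, 1.0, 0.0],
    'K_Scatch':  [0.0, 0.0, 1.0],
})

flags = toy[flag_cols].astype(int)

# Every row should have exactly one active flag for argmax to be meaningful
one_hot_ok = bool((flags.sum(axis=1) == 1).all())

# Vectorized equivalent of the row-wise argmax
target = flags.to_numpy().argmax(axis=1)
print(one_hot_ok, target)
```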


In [9]:
df['target'].map(target_maps).value_counts()

target
Other_Faults    673
Bumps           402
K_Scatch        391
Z_Scratch       190
Pastry          158
Stains           72
Dirtiness        55
Name: count, dtype: int64

Since our main goal is to identify known problems, let's discard, for now, the records whose fault is classified as "Other":

In [10]:
df = df[df['target'].map(target_maps) != 'Other_Faults']
df['target'].map(target_maps).value_counts()

target
Bumps        402
K_Scatch     391
Z_Scratch    190
Pastry       158
Stains        72
Dirtiness     55
Name: count, dtype: int64

In [11]:
df = df.drop(columns=target_cols)
df

Unnamed: 0,X_Minimum,X_Maximum,Y_Minimum,Y_Maximum,Pixels_Areas,X_Perimeter,Y_Perimeter,Sum_of_Luminosity,Minimum_of_Luminosity,Maximum_of_Luminosity,...,Edges_X_Index,Edges_Y_Index,Outside_Global_Index,LogOfAreas,Log_X_Index,Log_Y_Index,Orientation_Index,Luminosity_Index,SigmoidOfAreas,target
0,42.0,50.0,270900.0,270944.0,267.0,17.0,44.0,24220.0,76.0,108.0,...,0.4706,1.0000,1.0,2.4265,0.9031,1.6435,0.8182,-0.2913,0.5822,0
1,645.0,651.0,2538079.0,2538108.0,108.0,10.0,30.0,11397.0,84.0,123.0,...,0.6000,0.9667,1.0,2.0334,0.7782,1.4624,0.7931,-0.1756,0.2984,0
2,829.0,835.0,1553913.0,1553931.0,71.0,8.0,19.0,7972.0,99.0,125.0,...,0.7500,0.9474,1.0,1.8513,0.7782,1.2553,0.6667,-0.1228,0.2150,0
3,853.0,860.0,369370.0,369415.0,176.0,13.0,45.0,18996.0,99.0,126.0,...,0.5385,1.0000,1.0,2.2455,0.8451,1.6532,0.8444,-0.1568,0.5212,0
4,1289.0,1306.0,498078.0,498335.0,2409.0,60.0,260.0,246930.0,37.0,126.0,...,0.2833,0.9885,1.0,3.3818,1.2305,2.4099,0.9338,-0.1992,1.0000,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1263,221.0,242.0,3948212.0,3948253.0,519.0,33.0,41.0,48309.0,65.0,124.0,...,0.6364,1.0000,1.0,2.7152,1.3222,1.6128,0.4878,-0.2728,0.9765,5
1264,1111.0,1121.0,4032298.0,4032320.0,110.0,20.0,22.0,12351.0,100.0,127.0,...,0.5000,1.0000,1.0,2.0414,1.0000,1.3424,0.5454,-0.1228,0.3663,5
1265,995.0,1006.0,4085316.0,4085344.0,140.0,25.0,28.0,16076.0,103.0,132.0,...,0.4400,1.0000,1.0,2.1461,1.0414,1.4472,0.6071,-0.1029,0.5096,5
1266,396.0,418.0,4116853.0,4116868.0,231.0,26.0,16.0,25096.0,56.0,141.0,...,0.8461,0.9375,0.0,2.3636,1.3424,1.1761,-0.3182,-0.1512,0.5461,5


We can now confirm that the constraints set for the project are met:

- [X] There are 6 classes (the problem was required to have at least 3)
- [X] There are 27 attributes (at least 10 were required)
- [X] There are 1268 instances (at least 1000 were required)
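
These figures can be cross-checked with a little arithmetic. The sketch below uses the class counts printed after the filtering step and the feature names from the renaming dictionary (`V1` to `V27`); the thresholds are the ones stated in the checklist.

```python
# Class counts as printed after dropping "Other_Faults"
class_counts = {
    'Bumps': 402, 'K_Scatch': 391, 'Z_Scratch': 190,
    'Pastry': 158, 'Stains': 72, 'Dirtiness': 55,
}

# Feature columns left after dropping the fault flags (V1 to V27)
feature_names = [
    'X_Minimum', 'X_Maximum', 'Y_Minimum', 'Y_Maximum', 'Pixels_Areas',
    'X_Perimeter', 'Y_Perimeter', 'Sum_of_Luminosity', 'Minimum_of_Luminosity',
    'Maximum_of_Luminosity', 'Length_of_Conveyer', 'TypeOfSteel_A300',
    'TypeOfSteel_A400', 'Steel_Plate_Thickness', 'Edges_Index', 'Empty_Index',
    'Square_Index', 'Outside_X_Index', 'Edges_X_Index', 'Edges_Y_Index',
    'Outside_Global_Index', 'LogOfAreas', 'Log_X_Index', 'Log_Y_Index',
    'Orientation_Index', 'Luminosity_Index', 'SigmoidOfAreas',
]

n_classes = len(class_counts)
n_attributes = len(feature_names)
n_instances = sum(class_counts.values())

# Minimum requirements stated for the project
assert n_classes >= 3
assert n_attributes >= 10
assert n_instances >= 1000
print(n_classes, n_attributes, n_instances)
```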

Finally, before exporting the data, let's make sure that no record is incomplete:

In [12]:
df.isna().sum()

X_Minimum                0
X_Maximum                0
Y_Minimum                0
Y_Maximum                0
Pixels_Areas             0
X_Perimeter              0
Y_Perimeter              0
Sum_of_Luminosity        0
Minimum_of_Luminosity    0
Maximum_of_Luminosity    0
Length_of_Conveyer       0
TypeOfSteel_A300         0
TypeOfSteel_A400         0
Steel_Plate_Thickness    0
Edges_Index              0
Empty_Index              0
Square_Index             0
Outside_X_Index          0
Edges_X_Index            0
Edges_Y_Index            0
Outside_Global_Index     0
LogOfAreas               0
Log_X_Index              0
Log_Y_Index              0
Orientation_Index        0
Luminosity_Index         0
SigmoidOfAreas           0
target                   0
dtype: int64
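
If this check should fail loudly instead of relying on a visual inspection, it could be wrapped in a small helper. This is a sketch only: `assert_no_missing` is a hypothetical function and both toy frames are invented for the example.

```python
import numpy as np
import pandas as pd


def assert_no_missing(frame):
    """Raise a ValueError listing the columns that contain missing values."""
    missing = frame.isna().sum()
    missing = missing[missing > 0]
    if not missing.empty:
        raise ValueError(f"Columns with missing values: {missing.to_dict()}")


# Hypothetical complete frame: the check passes silently
complete = pd.DataFrame({'X_Minimum': [42.0, 645.0], 'target': [0, 0]})
assert_no_missing(complete)

# Hypothetical incomplete frame: the check raises
incomplete = pd.DataFrame({'X_Minimum': [42.0, np.nan], 'target': [0, 0]})
try:
    assert_no_missing(incomplete)
    raised = False
except ValueError:
    raised = True
print(raised)
```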

We can then export the DataFrame and the target map:

In [13]:
df.to_pickle('../data/processed/steel-plates-fault.pkl')

with open('../data/processed/target_maps.pkl', 'wb') as file:
    pickle.dump(target_maps, file)
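
A later notebook would need to read these files back. The sketch below shows a round trip for the target map only, written to a temporary directory so that it does not touch the project's `data/processed` folder; the temporary path is an assumption made for the example.

```python
import pickle
import tempfile
from pathlib import Path

target_maps = {
    0: 'Pastry', 1: 'Z_Scratch', 2: 'K_Scatch', 3: 'Stains',
    4: 'Dirtiness', 5: 'Bumps', 6: 'Other_Faults',
}

# Write to a throwaway directory instead of the project's data folder
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / 'target_maps.pkl'

    with open(path, 'wb') as file:
        pickle.dump(target_maps, file)

    with open(path, 'rb') as file:
        restored = pickle.load(file)

print(restored == target_maps)
```

The DataFrame itself can be read back with `pd.read_pickle`. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.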