# Preparing the training and test files

In this notebook we process the files `transact_train.csv`, `transact_class.csv`, and `realclass_t1.csv`.

The goal is to prepare the data as CSV files for use in the next phase: choosing the parameters of the classification algorithms.

At the end of the process, 3 pairs of files will be generated, each pair corresponding to one strategy for handling missing values.

### Import the required libraries

In [2]:
import pandas as pd
import numpy as np

### Read the CSV files

In [5]:
transact_train = pd.read_csv('data/transact_train.csv', sep='|', na_values='?')
transact_class = pd.read_csv('data/transact_class.csv', sep='|', na_values='?')
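As a quick check of what these options do, the sketch below parses a tiny in-memory sample with the same `sep='|'` and `na_values='?'` arguments. The sample rows are invented for illustration and are not taken from the real files; the point is only that `?` placeholders come back as `NaN`.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same pipe-separated layout.
sample = io.StringIO(
    "sessionNo|startHour|onlineStatus\n"
    "1|18|y\n"
    "1|18|?\n"
    "2|19|?\n"
)

# '?' is declared as a missing-value marker, so it is parsed as NaN.
sample_df = pd.read_csv(sample, sep='|', na_values='?')

# Count the missing values per column.
missing_per_column = sample_df.isna().sum()
print(missing_per_column)
```

Counting missing values per column this way on the real frames is a cheap way to see which columns will need imputation later.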

### Reduce the granularity of the data

In [13]:
def reduceGranularity(data):
    """Keep only the last row of each session, indexed by sessionNo."""
    # Assumes rows of one session are contiguous and the index is the default 0..n-1.
    ant = data['sessionNo'].iloc[0]
    indexes = []
    for index, row in data.iterrows():
        # A change in sessionNo means the previous row closed a session.
        if row['sessionNo'] != ant:
            indexes.append(index - 1)
        ant = row['sessionNo']
    # The last row of the file closes the final session.
    indexes.append(len(data) - 1)
    return data.iloc[indexes].set_index('sessionNo')
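
The loop in `reduceGranularity` keeps the last row of each run of equal `sessionNo` values. On large files, `iterrows` can be slow, and the same selection can probably be expressed without a Python loop. A minimal sketch, assuming a comparison of each row with the next one via `Series.shift`; the name `reduce_granularity_vectorized` and the toy data are hypothetical:

```python
import pandas as pd

def reduce_granularity_vectorized(data):
    # A row closes a session when the next row has a different sessionNo.
    # shift(-1) leaves NaN after the final row, so that row is always kept.
    is_last = data['sessionNo'] != data['sessionNo'].shift(-1)
    return data[is_last].set_index('sessionNo')

# Toy data: session 1 has three rows, session 2 has two.
toy = pd.DataFrame({
    'sessionNo': [1, 1, 1, 2, 2],
    'duration': [10.0, 25.0, 40.0, 5.0, 12.0],
})
toy_reduced = reduce_granularity_vectorized(toy)
print(toy_reduced)
```

Like the loop version, this relies on the rows of a session being contiguous in the file.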


In [16]:
training_data = reduceGranularity(transact_train)
testing_data = reduceGranularity(transact_class)

In [18]:
testing_data

| sessionNo | startHour | startWeekday | duration | cCount | cMinPrice | cMaxPrice | cSumPrice | bCount | bMinPrice | bMaxPrice | ... | onlineStatus | availability | customerNo | maxVal | customerScore | accountLifetime | payments | age | address | lastOrder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 18 | 7 | 624.606 | 11 | 16.99 | 39.99 | 207.91 | 5 | 16.99 | 39.99 | ... | y | completely orderable | 25039.0 | 1300.0 | 489.0 | 188.0 | 5.0 | 49.0 | 1.0 | 65.0 |
| 2 | 18 | 7 | 2804.705 | 16 | 34.99 | 34.99 | 174.95 | 2 | 34.99 | 34.99 | ... | y | completely orderable | 25040.0 | 1200.0 | 543.0 | 43.0 | 5.0 | 29.0 | 2.0 | 184.0 |
| 3 | 18 | 7 | 7401.384 | 119 | 7.99 | 59.95 | 3263.57 | 12 | 12.49 | 39.95 | ... | y | completely orderable | 25041.0 | 600.0 | 552.0 | 17.0 | 4.0 | 37.0 | 2.0 | 107.0 |
| 4 | 18 | 7 | 2853.550 | 152 | 3.99 | 239.99 | 5642.50 | 4 | 9.99 | 14.99 | ... | NaN | NaN | 25042.0 | 8500.0 | 535.0 | 226.0 | 19.0 | 49.0 | 2.0 | 17.0 |
| 5 | 18 | 7 | 48.145 | 2 | 29.99 | 29.99 | 59.98 | 1 | 29.99 | 29.99 | ... | y | completely orderable | 25043.0 | 600.0 | 543.0 | 39.0 | 2.0 | 53.0 | 2.0 | 234.0 |
| 6 | 18 | 7 | 3464.238 | 51 | 7.99 | 39.99 | 449.34 | 3 | 7.99 | 10.99 | ... | NaN | NaN | 25044.0 | 4000.0 | 513.0 | 352.0 | 9.0 | 82.0 | 1.0 | 28.0 |
| 7 | 18 | 7 | 482.112 | 8 | 14.99 | 19.99 | 129.92 | 2 | 14.99 | 14.99 | ... | y | completely orderable | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 18 | 7 | 1844.763 | 40 | 8.99 | 99.99 | 648.74 | 2 | 12.99 | 14.99 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 9 | 18 | 7 | 68.599 | 4 | 59.99 | 79.99 | 299.96 | 1 | 79.99 | 79.99 | ... | NaN | NaN | 25045.0 | 1500.0 | 433.0 | 73.0 | 14.0 | 65.0 | 2.0 | 4.0 |
| 10 | 18 | 7 | 5852.879 | 149 | 5.00 | 39.99 | 891.32 | 5 | 7.99 | 19.99 | ... | y | completely orderable | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [None]:
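for col in training_data:
    if training_data[col].dtype == 'object':
        # Categorical column: fill missing values with the most frequent value.
        training_data[col] = training_data[col].fillna(training_data[col].mode().iloc[0])
        print("Object", col)
    else:
        # Numeric column: fill missing values with the column mean.
        training_data[col] = training_data[col].fillna(training_data[col].mean())
training_data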

In [None]:
training_data.to_csv('training_data_mean.csv')
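
The cells above impute and save only `training_data`, while the introduction speaks of pairs of files. One way the same mean/mode fill might be extended to a train/test pair is sketched below. It assumes the fill values are computed on the training set and reused on the test set, which the notebook does not state; the helper name `fill_pair_with_train_stats` and the toy frames are hypothetical.

```python
import pandas as pd

def fill_pair_with_train_stats(train, test):
    # Hypothetical helper: fill values are taken from the training set only
    # and reused on the test set (an assumption, not the notebook's stated method).
    train, test = train.copy(), test.copy()
    for col in train.columns:
        if pd.api.types.is_numeric_dtype(train[col]):
            fill_value = train[col].mean()          # numeric column: mean
        else:
            fill_value = train[col].mode().iloc[0]  # categorical column: most frequent value
        train[col] = train[col].fillna(fill_value)
        if col in test.columns:
            test[col] = test[col].fillna(fill_value)
    return train, test

# Toy frames with one numeric and one categorical column.
toy_train = pd.DataFrame({'age': [20.0, None, 40.0],
                          'onlineStatus': ['y', 'y', None]})
toy_test = pd.DataFrame({'age': [None, 50.0],
                         'onlineStatus': [None, 'n']})

filled_train, filled_test = fill_pair_with_train_stats(toy_train, toy_test)
print(filled_train)
print(filled_test)
```

Working on copies leaves the original frames untouched, so the other missing-value strategies could start from the same unfilled data.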