### Preprocessing COVID-19 data for Piauí

#### Activity (1.0 pt): Preprocessing COVID-19 data for Piauí - Submit a Jupyter notebook that generates 4 pickles, X_train.pickle, y_train.pickle, X_test.pickle and y_test.pickle, containing the preprocessed data from the dataset of COVID-19 cases in the state of Piauí. The target (y) of the dataset is the number-of-deaths attribute (deaths).

### Importing the libraries

In [1]:
import pandas as pd
import numpy as np
import os
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pickle

### Loading the data

In [2]:
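# Local folder and file name of the CSV export (machine-specific path)
caminho = '/home/daniel/Área de Trabalho/Sistemas Inteligentes/Sistemas-Inteligentes/Atv 2'
nomeArq = 'meuCSV.csv'

def carregarDados(dataset_path=caminho, dataset_name=nomeArq):
    """Read the CSV file at dataset_path/dataset_name into a DataFrame."""
    csv_path = os.path.join(dataset_path, dataset_name)
    return pd.read_csv(csv_path)

# Despite its name, `housing` holds the COVID-19 data
housing = carregarDados(caminho, nomeArq)
housing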

Unnamed: 0,city,city_ibge_code,confirmed,confirmed_per_100k_inhabitants,date,death_rate,deaths,estimated_population,estimated_population_2019,is_last,order_for_place,place_type,state
0,Acauã,2200053,228,3210.36328,2022-03-26,0.0088,2,7102,7084,True,658,city,PI
1,Agricolândia,2200103,829,16156.69460,2022-03-26,0.0121,10,5131,5139,True,650,city,PI
2,Água Branca,2200202,1702,9742.41557,2022-03-26,0.0452,77,17470,17411,True,699,city,PI
3,Alagoinha do Piauí,2200251,402,5244.61840,2022-03-26,0.0323,13,7665,7651,True,664,city,PI
4,Alegrete do Piauí,2200277,553,11244.40830,2022-03-26,0.0163,9,4918,4915,True,646,city,PI
...,...,...,...,...,...,...,...,...,...,...,...,...,...
220,Várzea Grande,2211407,561,12790.69767,2022-03-26,0.0018,1,4386,4391,True,697,city,PI
221,Vera Mendes,2211506,400,12987.01299,2022-03-26,0.0075,3,3080,3077,True,681,city,PI
222,Vila Nova do Piauí,2211605,349,11822.49322,2022-03-26,0.0086,3,2952,2971,True,689,city,PI
223,Wall Ferraz,2211704,335,7492.73093,2022-03-26,0.0090,3,4471,4462,True,642,city,PI


### Handling missing data

In [3]:
# Drop the rows whose 'city' value is null
dataset = housing.dropna(subset=["city"])
dataset.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 224 entries, 0 to 223
Data columns (total 13 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   city                            224 non-null    object 
 1   city_ibge_code                  224 non-null    int64  
 2   confirmed                       224 non-null    int64  
 3   confirmed_per_100k_inhabitants  224 non-null    float64
 4   date                            224 non-null    object 
 5   death_rate                      224 non-null    float64
 6   deaths                          224 non-null    int64  
 7   estimated_population            224 non-null    int64  
 8   estimated_population_2019       224 non-null    int64  
 9   is_last                         224 non-null    bool   
 10  order_for_place                 224 non-null    int64  
 11  place_type                      224 non-null    object 
 12  state                           224 
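
Before dropping rows, it can help to count the nulls in each column and confirm how many rows `dropna` removes. The following is a minimal sketch on a toy frame; the third row and the names `toy`, `missing_per_column` and `cleaned` are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the real CSV; the last row is invented
# to mimic a record with no city name.
toy = pd.DataFrame({
    "city": ["Acauã", "Agricolândia", None],
    "deaths": [2, 10, 12],
})

# Number of null values in each column
missing_per_column = toy.isna().sum()

# Keep only the rows where 'city' is present
cleaned = toy.dropna(subset=["city"])

print(missing_per_column)
print(len(toy), "->", len(cleaned))
```

Restricting `dropna` to `subset=["city"]` keeps rows that are null in other columns, which matters if those columns are to be imputed later.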

### Handling categorical data

In [4]:
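# Drop the listed columns (axis=1 drops column labels rather than row labels).
# Note: this starts from `housing`, not from the `dataset` filtered in the previous cell,
# so the row with a null 'city' is still present (225 entries below instead of 224).
dataset = housing.drop(['state', 'place_type', 'is_last', 'date', 'city'], axis=1)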
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 225 entries, 0 to 224
Data columns (total 8 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   city_ibge_code                  225 non-null    int64  
 1   confirmed                       225 non-null    int64  
 2   confirmed_per_100k_inhabitants  225 non-null    float64
 3   death_rate                      225 non-null    float64
 4   deaths                          225 non-null    int64  
 5   estimated_population            225 non-null    int64  
 6   estimated_population_2019       225 non-null    int64  
 7   order_for_place                 225 non-null    int64  
dtypes: float64(2), int64(6)
memory usage: 14.2 KB
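
One way to decide whether a non-numeric column can be dropped rather than encoded is to count its distinct values: a column with a single value carries no information for a model. The sketch below uses a toy frame built from the rows displayed earlier. That `state`, `place_type` and `is_last` are constant across the whole file is an assumption, and the names `toy`, `non_numeric` and `constant` are illustrative.

```python
import pandas as pd

# Toy frame with the non-numeric columns of the dataset
toy = pd.DataFrame({
    "city": ["Acauã", "Agricolândia", "Água Branca"],
    "state": ["PI", "PI", "PI"],
    "place_type": ["city", "city", "city"],
    "is_last": [True, True, True],
    "deaths": [2, 10, 77],
})

# Select every column that is not numeric (strings and booleans)
non_numeric = toy.select_dtypes(exclude="number")

# Columns with a single distinct value are constant
constant = [c for c in non_numeric.columns if non_numeric[c].nunique() == 1]

print(non_numeric.nunique())
print(constant)
```

A column such as `city`, which has one distinct value per row, behaves like an identifier, so it is usually dropped as well instead of being one-hot encoded.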


### Splitting the data into training and test sets

#### Separating the targets

In [5]:
dataset_target = dataset['deaths'].copy()
# Drop the target column (axis=1 drops a column label), leaving only the features
dataset = dataset.drop('deaths', axis=1)
dataset

Unnamed: 0,city_ibge_code,confirmed,confirmed_per_100k_inhabitants,death_rate,estimated_population,estimated_population_2019,order_for_place
0,2200053,228,3210.36328,0.0088,7102,7084,658
1,2200103,829,16156.69460,0.0121,5131,5139,650
2,2200202,1702,9742.41557,0.0452,17470,17411,699
3,2200251,402,5244.61840,0.0323,7665,7651,664
4,2200277,553,11244.40830,0.0163,4918,4915,646
...,...,...,...,...,...,...,...
220,2211407,561,12790.69767,0.0018,4386,4391,697
221,2211506,400,12987.01299,0.0075,3080,3077,681
222,2211605,349,11822.49322,0.0086,2952,2971,689
223,2211704,335,7492.73093,0.0090,4471,4462,642


In [6]:
# Split the dataset into training and test subsets.
# test_size = fraction (or absolute number) of samples placed in the test set.
# random_state = seed that controls the shuffling, making the split reproducible.
# shuffle = whether the data is shuffled before the split is applied.

X_train, X_test, Y_train, Y_test = train_test_split(dataset, dataset_target, test_size = 0.2, random_state=1, shuffle=True)


In [7]:
X_train.shape


(180, 7)

In [8]:
X_test.shape


(45, 7)

In [9]:
224 * 0.8


179.20000000000002
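
The shapes above can be reproduced by hand. As a sketch, assuming `train_test_split` rounds a fractional `test_size` up to a whole number of test rows and gives the remaining rows to the training set when `train_size` is not set (the helper name `split_sizes` is hypothetical):

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # test rows are rounded up
    n_train = n_samples - n_test               # the rest go to training
    return n_train, n_test

print(split_sizes(225, 0.2))  # (180, 45)
print(split_sizes(224, 0.2))  # (179, 45)
```

Under that assumption, the 180 training rows reported above correspond to the 225-row frame, while a 224-row frame would yield 179.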

### Feature Scaling

In [10]:
# z = (x - u) / s
sc = StandardScaler()  # Instantiate the scaler
X_train = sc.fit_transform(X_train)  # Learn the mean and standard deviation from the training data, then standardize it
# Standardize the test data using the parameters learned from the training data
X_test = sc.transform(X_test)
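
To make the formula z = (x - u) / s concrete, the standardization can be sketched by hand with NumPy. The arrays below are made-up values standing in for the real features, and the names `train`, `test`, `mean` and `std` are illustrative. The point is that the mean and standard deviation come from the training data only and are then reused on the test data.

```python
import numpy as np

# Toy data standing in for X_train / X_test (values are made up)
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
test = np.array([[2.0, 30.0]])

# Per-column statistics computed from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# Apply z = (x - u) / s with the same parameters to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately 0 per column
print(train_scaled.std(axis=0))   # approximately 1 per column
print(test_scaled)
```

Reusing the training statistics on the test set avoids leaking information from the test data into the preprocessing step.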

In [11]:
# Serialize the preprocessed data to pickle files so it can be loaded back later.
# The `with` block closes each file as soon as it has been written.
for name, obj in [('X_train', X_train), ('X_test', X_test),
                  ('y_train', Y_train), ('y_test', Y_test)]:
    with open(f'{name}.pickle', 'wb') as f:
        pickle.dump(obj, f)
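
A quick way to check that the files were written correctly is to load them back and compare them with the original objects. The sketch below does this round trip in a temporary directory with made-up values; the names `data` and `restored` are illustrative, and in the notebook the objects would be the four arrays generated above.

```python
import os
import pickle
import tempfile

# Made-up values standing in for the preprocessed arrays
data = {"X_train": [[0.1, -1.2], [0.4, 0.9]], "y_train": [2, 10]}

with tempfile.TemporaryDirectory() as tmp:
    restored = {}
    for name, obj in data.items():
        path = os.path.join(tmp, f"{name}.pickle")
        # Write the object to disk
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        # Read it back
        with open(path, "rb") as f:
            restored[name] = pickle.load(f)

print(restored == data)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.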