# Pré-processamento e Transformação

## Introdução

Vamos pré-processar, transformar e salvar os dados para a etapa de modelagem.  Os dados que usaremos estão em um arquivo chamado  `Orange_Telecom_Churn_Data.csv` disponível no  [GitHub repository](https://github.com/rosalvoneto/InteligenciaComputacional).

## Questão 1

* Importe os dados;
* Examine as colunas.

In [1]:
# Importar os dados
import pandas as pd

churn_data = pd.read_csv('data/Orange_Telecom_Churn_Data.csv')

churn_data.head()

Unnamed: 0,state,account_length,area_code,phone_number,intl_plan,voice_mail_plan,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,...,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,churned
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


## Questão 2

* Elimine as variáveis irrelevantes: 'phone_number', 'area_code', 'state'.

In [2]:
# Remove phone_number column

churn_data.drop(['phone_number', 'area_code', 'state'], inplace=True, axis=1)

In [11]:
churn_data.columns

Index(['account_length', 'intl_plan', 'voice_mail_plan',
       'number_vmail_messages', 'total_day_minutes', 'total_day_calls',
       'total_day_charge', 'total_eve_minutes', 'total_eve_calls',
       'total_eve_charge', 'total_night_minutes', 'total_night_calls',
       'total_night_charge', 'total_intl_minutes', 'total_intl_calls',
       'total_intl_charge', 'number_customer_service_calls', 'churned'],
      dtype='object')

## Questão 3

* Separate os dados em dois: X_data (variáveis de entrada) e y_data (variável alvo `churned`);

* Dividir os dados em treinamento e teste usando train_test_split do sklearn.model_selection: X_train, X_test, y_train, y_test

In [12]:
# Separate os dados: X_data e  y_data
y_data = churn_data.churned
X_data = churn_data.drop(['churned'], axis=1)

print(X_data.head())
print(y_data.head())

# Dividir os dados em treinamento e teste

from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X_data, test_size=0.3)
y_train, y_test = train_test_split(y_data, test_size=0.3)

   account_length intl_plan voice_mail_plan  number_vmail_messages  \
0             128        no             yes                     25   
1             107        no             yes                     26   
2             137        no              no                      0   
3              84       yes              no                      0   
4              75       yes              no                      0   

   total_day_minutes  total_day_calls  total_day_charge  total_eve_minutes  \
0              265.1              110             45.07              197.4   
1              161.6              123             27.47              195.5   
2              243.4              114             41.38              121.2   
3              299.4               71             50.90               61.9   
4              166.7              113             28.34              148.3   

   total_eve_calls  total_eve_charge  total_night_minutes  total_night_calls  \
0               99            

## Questão 4

* Crie variáveis Dummies;
* Converta a variável alvo em numérica.

In [13]:
# Dummies

X_train = pd.get_dummies(X_train, prefix_sep='_', drop_first=True)
X_test = pd.get_dummies(X_test, prefix_sep='_', drop_first=True)

X_train.head()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,intl_plan_yes,voice_mail_plan_yes
157,139,23,157.6,129,26.79,247.0,96,21.0,259.2,112,11.66,13.7,2,3.7,0,0,1
4578,116,0,85.8,88,14.59,115.8,112,9.84,195.9,91,8.82,11.0,2,2.97,1,0,0
4037,69,0,245.6,128,41.75,173.0,101,14.71,111.8,118,5.03,14.2,7,3.83,2,0,0
1351,13,0,58.4,121,9.93,262.2,64,22.29,159.0,115,7.15,11.9,5,3.21,1,0,0
1350,55,0,285.7,124,48.57,230.9,106,19.63,230.7,140,10.38,14.8,7,4.0,0,0,0


In [15]:
# Converter a variável alvo

y_train = y_train.astype(int)
y_test = y_test.astype(int)

y_train.head()

830     1
3528    0
1169    0
2090    0
3955    0
Name: churned, dtype: int64

## Questão 5

* Normalize X_train -> X_train_norm e X_test -> X_test_norm: use MinMaxScaler do sklearn.preprocessing;OBS: MinMaxScaler retorna um numpy.ndarray e não um DataFrame do pandas;
* Atualize X_train e X_test com os dados normalizados.

In [17]:
# Normalizar

from sklearn.preprocessing import MinMaxScaler
import numpy as np

scaler = MinMaxScaler()

X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.fit_transform(X_test)

np.take(X_train_norm, range(3), axis=1)

array([[0.59482759, 0.44230769, 0.44836415],
       [0.49568966, 0.        , 0.24409673],
       [0.29310345, 0.        , 0.69871977],
       ...,
       [0.56465517, 0.19230769, 0.52034139],
       [0.61206897, 0.        , 0.47965861],
       [0.65086207, 0.        , 0.52403983]])

In [26]:
# Atualizar X_train com os dados normalizado

X_train = pd.DataFrame(dict(zip(X_train.columns.values, X_train_norm.T)))

In [27]:
# Atualizar/Criar X_train com os dados normalizado
X_test = pd.DataFrame(dict(zip(X_test.columns.values, X_test_norm.T)))

## Questão 6

* Examine os Dataframes;
* Salve os quatro arquivos.


In [28]:
# Examinando X_train 

X_train.head()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,intl_plan_yes,voice_mail_plan_yes
0,0.594828,0.442308,0.448364,0.781818,0.448293,0.658172,0.531646,0.658394,0.63475,0.613497,0.634788,0.685,0.1,0.685185,0.0,0.0,1.0
1,0.49569,0.0,0.244097,0.533333,0.244143,0.273872,0.632911,0.273699,0.464497,0.484663,0.465033,0.55,0.1,0.55,0.111111,0.0,0.0
2,0.293103,0.0,0.69872,0.775758,0.698628,0.441418,0.563291,0.441572,0.2383,0.650307,0.238494,0.71,0.35,0.709259,0.222222,0.0,0.0
3,0.051724,0.0,0.166145,0.733333,0.166165,0.702695,0.329114,0.702861,0.36525,0.631902,0.365212,0.595,0.25,0.594444,0.111111,0.0,0.0
4,0.232759,0.0,0.812802,0.751515,0.812751,0.611013,0.594937,0.611169,0.558096,0.785276,0.558279,0.74,0.35,0.740741,0.0,0.0,0.0


In [29]:
# Examinando X_test

X_test.head()

Unnamed: 0,account_length,number_vmail_messages,total_day_minutes,total_day_calls,total_day_charge,total_eve_minutes,total_eve_calls,total_eve_charge,total_night_minutes,total_night_calls,total_night_charge,total_intl_minutes,total_intl_calls,total_intl_charge,number_customer_service_calls,intl_plan_yes,voice_mail_plan_yes
0,0.413223,0.0,0.434233,0.403226,0.434291,0.665529,0.542683,0.66544,0.514009,0.466667,0.513671,0.588832,0.117647,0.588346,0.444444,0.0,0.0
1,0.471074,0.0,0.224584,0.33871,0.224662,0.556598,0.554878,0.556373,0.497774,0.4,0.497382,0.593909,0.352941,0.593985,0.111111,0.0,0.0
2,0.293388,0.0,0.556864,0.395161,0.556926,0.451081,0.786585,0.450987,0.705158,0.715152,0.705061,0.345178,0.176471,0.345865,0.0,0.0,0.0
3,0.5,0.0,0.134693,0.604839,0.134797,0.608931,0.597561,0.608899,0.405604,0.727273,0.405468,0.507614,0.235294,0.507519,0.0,1.0,0.0
4,0.330579,0.0,0.410684,0.16129,0.410642,0.818828,0.79878,0.818668,0.475779,0.733333,0.475858,0.467005,0.235294,0.466165,0.222222,0.0,0.0


In [30]:
# Salvando os dados em arquivos

X_train.to_csv('data/Orange_Telecom_Churn_Data_X_train.csv')
X_test.to_csv('data/Orange_Telecom_Churn_Data_X_test.csv')

y_train.to_csv('data/Orange_Telecom_Churn_Data_y_train.csv')
y_test.to_csv('data/Orange_Telecom_Churn_Data_y_test.csv')