
# <font color='green'>Studying Encoding Strategies as Preparation for Predictive Modeling</font>


Encoding techniques transform a variable into a format suited to the goal of the analysis.
In this study project, the categorical variables were converted to numeric form, since the end goal was to evaluate these techniques when building a predictive model for rental prices.

The project covers *One-Hot Encoding* and *Label Encoding*, and closes with an example of *Count/Frequency Encoding*.

The dataset used in this study is available at: https://www.kaggle.com/datasets/rubenssjr/brasilian-houses-to-rent

In [1]:
# Loading the packages

import pandas as pd
import numpy as np
from sklearn import linear_model
import warnings
warnings.filterwarnings("ignore")

## Loading the Data
Dataset of houses for rent in Brazil (2020).

In [2]:
df = pd.read_csv('houses_to_rent_v2.csv')

In [3]:
df.shape

(10692, 13)

In [4]:
df

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$)
0,São Paulo,70,2,1,1,7,acept,furnished,2065,3300,211,42,5618
1,São Paulo,320,4,4,0,20,acept,not furnished,1200,4960,1750,63,7973
2,Porto Alegre,80,1,1,1,6,acept,not furnished,1000,2800,0,41,3841
3,Porto Alegre,51,2,1,0,2,acept,not furnished,270,1112,22,17,1421
4,São Paulo,25,1,1,0,1,not acept,not furnished,0,800,25,11,836
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10687,Porto Alegre,63,2,1,1,5,not acept,furnished,402,1478,24,22,1926
10688,São Paulo,285,4,4,4,17,acept,not furnished,3100,15000,973,191,19260
10689,Rio de Janeiro,70,3,3,0,8,not acept,furnished,980,6000,332,78,7390
10690,Rio de Janeiro,120,2,2,2,8,acept,furnished,1585,12000,279,155,14020


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10692 entries, 0 to 10691
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   city                 10692 non-null  object
 1   area                 10692 non-null  int64 
 2   rooms                10692 non-null  int64 
 3   bathroom             10692 non-null  int64 
 4   parking spaces       10692 non-null  int64 
 5   floor                10692 non-null  object
 6   animal               10692 non-null  object
 7   furniture            10692 non-null  object
 8   hoa (R$)             10692 non-null  int64 
 9   rent amount (R$)     10692 non-null  int64 
 10  property tax (R$)    10692 non-null  int64 
 11  fire insurance (R$)  10692 non-null  int64 
 12  total (R$)           10692 non-null  int64 
dtypes: int64(9), object(4)
memory usage: 1.1+ MB


## Aplicando One-Hot-Encoding

This technique encodes each categorical variable as a set of boolean variables (also called dummy variables), one per category, which take the value 0 or 1 to indicate whether that category is present in a given row. It is widely used in Natural Language Processing.

In [6]:
df.city.value_counts()

São Paulo         5887
Rio de Janeiro    1501
Belo Horizonte    1258
Porto Alegre      1193
Campinas           853
Name: city, dtype: int64

In [7]:
# Variables chosen for dummy encoding: city, animal and furniture. I will apply it to one at a time.
df_dummies = pd.get_dummies(df['city'])

In [8]:
df_dummies

Unnamed: 0,Belo Horizonte,Campinas,Porto Alegre,Rio de Janeiro,São Paulo
0,0,0,0,0,1
1,0,0,0,0,1
2,0,0,1,0,0
3,0,0,1,0,0
4,0,0,0,0,1
...,...,...,...,...,...
10687,0,0,1,0,0
10688,0,0,0,0,1
10689,0,0,0,1,0
10690,0,0,0,1,0


In [9]:
# Concatenating the dataframes step by step
df_conc = pd.concat([df, df_dummies], axis = 'columns')

In [10]:
df_dummies2 = pd.get_dummies(df['animal'])

In [11]:
df_dummies2

Unnamed: 0,acept,not acept
0,1,0
1,1,0
2,1,0
3,1,0
4,0,1
...,...,...
10687,0,1
10688,1,0
10689,0,1
10690,1,0


In [12]:
# Concatenating the dataframes
df_conc2 = pd.concat([df_conc, df_dummies2], axis = 'columns')

In [13]:
df_conc2

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$),Belo Horizonte,Campinas,Porto Alegre,Rio de Janeiro,São Paulo,acept,not acept
0,São Paulo,70,2,1,1,7,acept,furnished,2065,3300,211,42,5618,0,0,0,0,1,1,0
1,São Paulo,320,4,4,0,20,acept,not furnished,1200,4960,1750,63,7973,0,0,0,0,1,1,0
2,Porto Alegre,80,1,1,1,6,acept,not furnished,1000,2800,0,41,3841,0,0,1,0,0,1,0
3,Porto Alegre,51,2,1,0,2,acept,not furnished,270,1112,22,17,1421,0,0,1,0,0,1,0
4,São Paulo,25,1,1,0,1,not acept,not furnished,0,800,25,11,836,0,0,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10687,Porto Alegre,63,2,1,1,5,not acept,furnished,402,1478,24,22,1926,0,0,1,0,0,0,1
10688,São Paulo,285,4,4,4,17,acept,not furnished,3100,15000,973,191,19260,0,0,0,0,1,1,0
10689,Rio de Janeiro,70,3,3,0,8,not acept,furnished,980,6000,332,78,7390,0,0,0,1,0,0,1
10690,Rio de Janeiro,120,2,2,2,8,acept,furnished,1585,12000,279,155,14020,0,0,0,1,0,1,0


In [14]:
df_dummies3 = pd.get_dummies(df['furniture'])

In [15]:
df_dummies3

Unnamed: 0,furnished,not furnished
0,1,0
1,0,1
2,0,1
3,0,1
4,0,1
...,...,...
10687,1,0
10688,0,1
10689,1,0
10690,1,0


In [16]:
# Concatenating the dataframes
df_conc3 = pd.concat([df_conc2, df_dummies3], axis = 'columns')

In [17]:
df_conc3

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),...,total (R$),Belo Horizonte,Campinas,Porto Alegre,Rio de Janeiro,São Paulo,acept,not acept,furnished,not furnished
0,São Paulo,70,2,1,1,7,acept,furnished,2065,3300,...,5618,0,0,0,0,1,1,0,1,0
1,São Paulo,320,4,4,0,20,acept,not furnished,1200,4960,...,7973,0,0,0,0,1,1,0,0,1
2,Porto Alegre,80,1,1,1,6,acept,not furnished,1000,2800,...,3841,0,0,1,0,0,1,0,0,1
3,Porto Alegre,51,2,1,0,2,acept,not furnished,270,1112,...,1421,0,0,1,0,0,1,0,0,1
4,São Paulo,25,1,1,0,1,not acept,not furnished,0,800,...,836,0,0,0,0,1,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10687,Porto Alegre,63,2,1,1,5,not acept,furnished,402,1478,...,1926,0,0,1,0,0,0,1,1,0
10688,São Paulo,285,4,4,4,17,acept,not furnished,3100,15000,...,19260,0,0,0,0,1,1,0,0,1
10689,Rio de Janeiro,70,3,3,0,8,not acept,furnished,980,6000,...,7390,0,0,0,1,0,0,1,1,0
10690,Rio de Janeiro,120,2,2,2,8,acept,furnished,1585,12000,...,14020,0,0,0,1,0,1,0,1,0


In [18]:
# Dropping the original object-typed columns so the data is not duplicated
newdf =  df_conc3.drop(['furniture', 'animal', 'city'], axis = 1)

In [19]:
newdf

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$),Belo Horizonte,Campinas,Porto Alegre,Rio de Janeiro,São Paulo,acept,not acept,furnished,not furnished
0,70,2,1,1,7,2065,3300,211,42,5618,0,0,0,0,1,1,0,1,0
1,320,4,4,0,20,1200,4960,1750,63,7973,0,0,0,0,1,1,0,0,1
2,80,1,1,1,6,1000,2800,0,41,3841,0,0,1,0,0,1,0,0,1
3,51,2,1,0,2,270,1112,22,17,1421,0,0,1,0,0,1,0,0,1
4,25,1,1,0,1,0,800,25,11,836,0,0,0,0,1,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10687,63,2,1,1,5,402,1478,24,22,1926,0,0,1,0,0,0,1,1,0
10688,285,4,4,4,17,3100,15000,973,191,19260,0,0,0,0,1,1,0,0,1
10689,70,3,3,0,8,980,6000,332,78,7390,0,0,0,1,0,0,1,1,0
10690,120,2,2,2,8,1585,12000,279,155,14020,0,0,0,1,0,1,0,1,0


In [20]:
# Renaming the columns so they are easier to read later in the predictive model
newdf.rename(columns={'acept': 'animal acept', 'not acept': 'animal not acept' }, inplace=True)

In [21]:
newdf

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$),Belo Horizonte,Campinas,Porto Alegre,Rio de Janeiro,São Paulo,animal acept,animal not acept,furnished,not furnished
0,70,2,1,1,7,2065,3300,211,42,5618,0,0,0,0,1,1,0,1,0
1,320,4,4,0,20,1200,4960,1750,63,7973,0,0,0,0,1,1,0,0,1
2,80,1,1,1,6,1000,2800,0,41,3841,0,0,1,0,0,1,0,0,1
3,51,2,1,0,2,270,1112,22,17,1421,0,0,1,0,0,1,0,0,1
4,25,1,1,0,1,0,800,25,11,836,0,0,0,0,1,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10687,63,2,1,1,5,402,1478,24,22,1926,0,0,1,0,0,0,1,1,0
10688,285,4,4,4,17,3100,15000,973,191,19260,0,0,0,0,1,1,0,0,1
10689,70,3,3,0,8,980,6000,332,78,7390,0,0,0,1,0,0,1,1,0
10690,120,2,2,2,8,1585,12000,279,155,14020,0,0,0,1,0,1,0,1,0
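
The three `get_dummies` / `concat` / `drop` / `rename` steps above can also be done in one call. The following is only a minimal sketch on a toy dataframe (`toy` and its values are illustrative, not the real dataset), assuming a pandas version in which `get_dummies` accepts the `columns` and `dtype` arguments:

```python
import pandas as pd

# Toy frame with the same three categorical columns as the dataset (illustrative values).
toy = pd.DataFrame({
    "city": ["São Paulo", "Porto Alegre", "São Paulo"],
    "animal": ["acept", "acept", "not acept"],
    "furniture": ["furnished", "not furnished", "not furnished"],
    "area": [70, 80, 25],
})

# Encode all listed columns at once. The original columns are removed and each
# dummy is prefixed with the name of the column it came from.
encoded = pd.get_dummies(toy, columns=["city", "animal", "furniture"], dtype=int)

print(encoded.columns.tolist())
```

Because the prefix comes from the source column, no manual renaming is needed to tell `animal_acept` apart from the other dummies. Passing `drop_first=True` would additionally drop one dummy per variable, which avoids perfectly collinear columns in a linear model with an intercept.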


In [22]:
newdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10692 entries, 0 to 10691
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   area                 10692 non-null  int64 
 1   rooms                10692 non-null  int64 
 2   bathroom             10692 non-null  int64 
 3   parking spaces       10692 non-null  int64 
 4   floor                10692 non-null  object
 5   hoa (R$)             10692 non-null  int64 
 6   rent amount (R$)     10692 non-null  int64 
 7   property tax (R$)    10692 non-null  int64 
 8   fire insurance (R$)  10692 non-null  int64 
 9   total (R$)           10692 non-null  int64 
 10  Belo Horizonte       10692 non-null  uint8 
 11  Campinas             10692 non-null  uint8 
 12  Porto Alegre         10692 non-null  uint8 
 13  Rio de Janeiro       10692 non-null  uint8 
 14  São Paulo            10692 non-null  uint8 
 15  animal acept         10692 non-null  uint8 
 16  anim

In [23]:
# To train the model, the "floor" variable first has to be converted to int.
# The variable has no null values, but it contains the character '-', which makes pandas read it as object.
newdf['floor'].value_counts()

-      2461
1      1081
2       985
3       931
4       748
5       600
6       539
7       497
8       490
9       369
10      357
11      303
12      257
13      200
14      170
15      147
16      109
17       96
18       75
19       53
20       44
21       42
25       25
23       25
22       24
26       20
24       19
27        8
28        6
29        5
32        2
35        1
46        1
301       1
51        1
Name: floor, dtype: int64

This study focuses on encoding strategies, so I will simply turn the rows that contain '-' into NaN and then drop the null rows.


In [24]:
newdf.floor.isna().sum()

0

In [25]:
# Turning the '-' entries into NaN
newdf.floor = newdf.floor.replace({'-': np.nan})

In [26]:
# Checking the null values
newdf.floor.isna().sum()

2461

In [27]:
# Dropping the null values
newdf.dropna(subset = ['floor'], inplace= True)

In [28]:
# Confirming the removal
newdf.floor.isna().sum()

0

In [29]:
newdf.isna().any()

area                   False
rooms                  False
bathroom               False
parking spaces         False
floor                  False
hoa (R$)               False
rent amount (R$)       False
property tax (R$)      False
fire insurance (R$)    False
total (R$)             False
Belo Horizonte         False
Campinas               False
Porto Alegre           False
Rio de Janeiro         False
São Paulo              False
animal acept           False
animal not acept       False
furnished              False
not furnished          False
dtype: bool
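
After the replace and drop steps, the remaining `floor` values are still stored as text. One possible way to get actual integers is sketched below on a small illustrative series (`floor` here is a stand-in, not the real column), using `pd.to_numeric` with `errors="coerce"`:

```python
import pandas as pd

# Illustrative values only: '-' marks a missing floor, as in the dataset.
floor = pd.Series(["7", "20", "-", "1"], name="floor")

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
floor_num = pd.to_numeric(floor, errors="coerce")

# Drop the missing entries and cast what is left to integers.
floor_int = floor_num.dropna().astype(int)

print(floor_int.tolist())  # [7, 20, 1]
```

This handles the '-' marker and the type conversion in one pass, at the cost of silently turning any other unexpected string into NaN as well.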

In [30]:
# Defining X: dropping the target variable, which will be used as y
x= newdf.drop('total (R$)', axis= 1)
x

Unnamed: 0,area,rooms,bathroom,parking spaces,floor,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),Belo Horizonte,Campinas,Porto Alegre,Rio de Janeiro,São Paulo,animal acept,animal not acept,furnished,not furnished
0,70,2,1,1,7,2065,3300,211,42,0,0,0,0,1,1,0,1,0
1,320,4,4,0,20,1200,4960,1750,63,0,0,0,0,1,1,0,0,1
2,80,1,1,1,6,1000,2800,0,41,0,0,1,0,0,1,0,0,1
3,51,2,1,0,2,270,1112,22,17,0,0,1,0,0,1,0,0,1
4,25,1,1,0,1,0,800,25,11,0,0,0,0,1,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10686,150,3,3,2,8,0,13500,0,172,0,0,0,0,1,0,1,1,0
10687,63,2,1,1,5,402,1478,24,22,0,0,1,0,0,0,1,1,0
10688,285,4,4,4,17,3100,15000,973,191,0,0,0,0,1,1,0,0,1
10689,70,3,3,0,8,980,6000,332,78,0,0,0,1,0,0,1,1,0


In [31]:
# Defining y for predictive modeling
y = newdf['total (R$)']
y

0         5618
1         7973
2         3841
3         1421
4          836
         ...  
10686    13670
10687     1926
10688    19260
10689     7390
10690    14020
Name: total (R$), Length: 8231, dtype: int64

In [32]:
# Creating the model
model_h1 = linear_model.LinearRegression()

In [33]:
# Training the model
model_h1.fit(x, y)

In [34]:
# Model predictions
model_h1.predict(x)

array([ 5618.29770906,  7973.18041748,  3841.1803818 , ...,
       19264.21042159,  7389.7344188 , 14018.89372217])

In [35]:
# A prediction with arbitrary parameters to test the model
model_h1.predict([[70, 3, 1, 0, 6, 1200, 2800, 22, 11, 0, 0, 0, 0, 1, 1, 0, 1, 0]])

array([4033.08069158])

In [36]:
# Computing the model's score (R², the coefficient of determination) on the training data
model_h1.score(x, y)

0.9999998613575312
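
The score above is computed on the same rows used for training. In the rows displayed earlier, `total (R$)` matches or nearly matches the sum of `hoa (R$)`, `rent amount (R$)`, `property tax (R$)` and `fire insurance (R$)`, all of which are in `x`, so a near-perfect fit is to be expected. The sketch below uses synthetic stand-in data (not the real dataset; `X_demo`, `y_demo` and `model_demo` are hypothetical names) to show how a held-out evaluation with `train_test_split` could look:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-ins for the four cost columns (not the real data).
X_demo = rng.integers(0, 5000, size=(200, 4)).astype(float)

# Target built as the exact sum of the four columns, an assumption that
# mirrors how the total relates to its parts in the displayed rows.
y_demo = X_demo.sum(axis=1)

# Hold out 25% of the rows so the score is measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model_demo = LinearRegression().fit(X_train, y_train)
r2_test = model_demo.score(X_test, y_test)

print(round(r2_test, 6))
print(model_demo.coef_)
```

When the target is a sum of the features, the fitted coefficients come out close to 1 and the held-out R² stays near 1 as well, which says more about how the target is constructed than about the encoding technique.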

## Applying Label Encoding

This technique replaces each category with a corresponding integer code. It is often used when the number of categories is small (around 10 to 15).

In [37]:
from sklearn.preprocessing import LabelEncoder 

In [38]:
# Copy of the original dataframe, unaffected by the previous technique
dados = df.copy()

In [39]:
dados

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$)
0,São Paulo,70,2,1,1,7,acept,furnished,2065,3300,211,42,5618
1,São Paulo,320,4,4,0,20,acept,not furnished,1200,4960,1750,63,7973
2,Porto Alegre,80,1,1,1,6,acept,not furnished,1000,2800,0,41,3841
3,Porto Alegre,51,2,1,0,2,acept,not furnished,270,1112,22,17,1421
4,São Paulo,25,1,1,0,1,not acept,not furnished,0,800,25,11,836
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10687,Porto Alegre,63,2,1,1,5,not acept,furnished,402,1478,24,22,1926
10688,São Paulo,285,4,4,4,17,acept,not furnished,3100,15000,973,191,19260
10689,Rio de Janeiro,70,3,3,0,8,not acept,furnished,980,6000,332,78,7390
10690,Rio de Janeiro,120,2,2,2,8,acept,furnished,1585,12000,279,155,14020


In [40]:
# Creating the encoder object
le = LabelEncoder()

In [41]:
# Applying the encoder to the object-typed variables
dados['city'] = le.fit_transform(dados['city'])
dados['animal'] = le.fit_transform(dados['animal'])
dados['furniture'] = le.fit_transform(dados['furniture'])

In [42]:
dados

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$)
0,4,70,2,1,1,7,0,0,2065,3300,211,42,5618
1,4,320,4,4,0,20,0,1,1200,4960,1750,63,7973
2,2,80,1,1,1,6,0,1,1000,2800,0,41,3841
3,2,51,2,1,0,2,0,1,270,1112,22,17,1421
4,4,25,1,1,0,1,1,1,0,800,25,11,836
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10687,2,63,2,1,1,5,1,0,402,1478,24,22,1926
10688,4,285,4,4,4,17,0,1,3100,15000,973,191,19260
10689,3,70,3,3,0,8,1,0,980,6000,332,78,7390
10690,3,120,2,2,2,8,0,0,1585,12000,279,155,14020
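
`LabelEncoder` assigns codes following the sorted order of the categories, which is why `São Paulo` became 4 and `Porto Alegre` became 2 above. Since the cell above reuses a single `le` for three columns, only the mapping of the last column fitted remains available afterwards. A minimal sketch, on illustrative toy data, of keeping one encoder per column so each mapping can be inspected or reversed later (`encoders` and `city_mapping` are hypothetical names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data with two of the categorical columns (illustrative values).
toy = pd.DataFrame({
    "city": ["São Paulo", "Porto Alegre", "Rio de Janeiro", "São Paulo"],
    "animal": ["acept", "acept", "not acept", "acept"],
})

# One fitted encoder per column, stored by column name.
encoders = {}
for col in ["city", "animal"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ holds the categories in sorted order; the position is the code.
city_mapping = {label: code for code, label in enumerate(encoders["city"].classes_)}
print(city_mapping)

# The stored encoder can translate codes back into the original labels.
print(encoders["city"].inverse_transform([2]))
```

Note that these integer codes carry an order (0 < 1 < 2) that a linear model will treat as numeric, even though cities have no natural ordering.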


In [43]:
# Replacing the '-' values in 'floor' again, for the same reason given in the previous technique.
dados.floor = dados.floor.replace({'-': np.nan})

In [44]:
# Dropping the null values
dados.dropna(subset = ['floor'], inplace= True)

In [45]:
# Checking the removal
dados.floor.isna().any()

False

In [46]:
# Defining X for predictive modeling: dropping the target variable, which will be used as y
x = dados.drop(['total (R$)'] , axis= 1)

In [47]:
# Defining y for predictive modeling
y = dados['total (R$)']
y

0         5618
1         7973
2         3841
3         1421
4          836
         ...  
10686    13670
10687     1926
10688    19260
10689     7390
10690    14020
Name: total (R$), Length: 8231, dtype: int64

In [48]:
# Creating version 2 of the model
model_h2 = linear_model.LinearRegression()

In [49]:
# Training the model
model_h2.fit(x, y)

In [50]:
# Making predictions
model_h2.predict(x)

array([ 5618.17586752,  7973.04616516,  3841.34355289, ...,
       19264.24993004,  7389.94946031, 14019.15121231])

In [51]:
# A prediction with arbitrary parameters to test the model
model_h2.predict([[4, 80, 1, 1, 7, 1, 1, 3000, 4000, 100, 40, 5000]])

array([9695.52886781])

In [52]:
# Computing the model's score (R², the coefficient of determination) on the training data
model_h2.score(x, y)

0.9999998612773692

### Saving the Trained Model

In [53]:
import joblib

In [54]:
# Saving (dumping) the model to disk
#joblib.dump(model_h2, 'meu-projeto/modelo.pkl')
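
The dump call above is commented out. As a rough sketch of the full save and reload cycle, the example below trains a tiny stand-in model (not `model_h2`) and writes it to a temporary directory; the file name `modelo.pkl` and the names `model_demo` and `restored` are assumptions for illustration only:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny stand-in model (y = 2x + 1); in the notebook this would be model_h2.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2 * X_demo.ravel() + 1
model_demo = LinearRegression().fit(X_demo, y_demo)

# Write the model to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modelo.pkl")  # hypothetical file name
    joblib.dump(model_demo, path)
    restored = joblib.load(path)

# The reloaded model should reproduce the original predictions.
same_predictions = np.allclose(model_demo.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

A reloaded model expects the same columns, in the same order and with the same encoding, that were used at training time.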

## Example of the Count/Frequency Encoding Technique

This technique replaces each category with its count or its relative frequency (a decimal number) in the dataset.

In [55]:
# Loading the data again
df2 = pd.read_csv('houses_to_rent_v2.csv')

In [56]:
# Number of unique values in the 'city' column
len(df2.city.unique())

5

In [57]:
# Count of each category in the 'city' column
frequency =  df2.city.value_counts().to_dict()
frequency

{'São Paulo': 5887,
 'Rio de Janeiro': 1501,
 'Belo Horizonte': 1258,
 'Porto Alegre': 1193,
 'Campinas': 853}

In [58]:
# Replacing each category in the 'city' column with its count
df2.city = df2.city.map(frequency)
df2

Unnamed: 0,city,area,rooms,bathroom,parking spaces,floor,animal,furniture,hoa (R$),rent amount (R$),property tax (R$),fire insurance (R$),total (R$)
0,5887,70,2,1,1,7,acept,furnished,2065,3300,211,42,5618
1,5887,320,4,4,0,20,acept,not furnished,1200,4960,1750,63,7973
2,1193,80,1,1,1,6,acept,not furnished,1000,2800,0,41,3841
3,1193,51,2,1,0,2,acept,not furnished,270,1112,22,17,1421
4,5887,25,1,1,0,1,not acept,not furnished,0,800,25,11,836
...,...,...,...,...,...,...,...,...,...,...,...,...,...
10687,1193,63,2,1,1,5,not acept,furnished,402,1478,24,22,1926
10688,5887,285,4,4,4,17,acept,not furnished,3100,15000,973,191,19260
10689,1501,70,3,3,0,8,not acept,furnished,980,6000,332,78,7390
10690,1501,120,2,2,2,8,acept,furnished,1585,12000,279,155,14020
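
The cells above use the count variant. The frequency variant (a decimal between 0 and 1) can be sketched the same way by passing `normalize=True` to `value_counts`. The series below is a small illustrative stand-in, not the real `city` column:

```python
import pandas as pd

# Illustrative values only.
city = pd.Series(["São Paulo", "São Paulo", "Campinas", "São Paulo"], name="city")

# normalize=True returns each category's share of the rows instead of its count.
rel_freq = city.value_counts(normalize=True).to_dict()

# Replace each category with its relative frequency.
city_encoded = city.map(rel_freq)

print(city_encoded.tolist())  # [0.75, 0.75, 0.25, 0.75]
```

With either variant, two categories that occur equally often receive the same encoded value and can no longer be told apart.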


### Conclusion

This project documents my learning on the topic of encoding. As seen with the techniques applied above, both models achieved a very high score (R²), though it was measured on the same data used for training. Choosing the best predictive model would still require trying other parameters and techniques, as well as working on data cleaning and preprocessing, which was not the focus of this study project.

Looking only at the models created here, either could serve the original purpose: predicting the total rental cost in the chosen region.