# Samambaia House Price Prediction

Here are the steps used in this preprocessing file:

1.  Check Dataset House Locations
2.  Mark Samambaia Houses
3.  Filter Samambaia Houses
4.  Create new columns
5.  Transform existing columns to improve data analysis
6.  Final dataset

##### Loading libraries

In [56]:
# Import libraries
import pandas as pd
import numpy as np

##### Loading data to pandas

In [57]:
# Reading data from excel file
df_houses = pd.read_excel('./data/houses.xlsx', index_col=[0])

# Showing first rows of the dataset
df_houses.head()

Unnamed: 0,house_name,house_price,house_description,house_location,house_hypterlink
0,Samambaia - Apartamento Padrão - Samambaia Sul...,R$ 152.000,\n1 quarto\n38m²\nCondomínio: R$ 329,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...
1,Samambaia - Apartamento Padrão - Samambaia Sul...,R$ 408.000,\n2 quartos\n65m²\nCondomínio: R$ 432\n1 vaga,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...
2,Samambaia 2 qts! Desocupado! mude já! Próx. Me...,R$ 145.000,\n2 quartos\n63m²\nCondomínio: R$ 10,"Brasília, Samambaia Norte (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...
3,"Samambaia sul, qr 120, apto 2 qtos,varanda, ar...",R$ 190.000,\n2 quartos\n55m²\nCondomínio: R$ 350\n1 vaga,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...
4,Samambaia Norte - QR 402 - Casa de 2 Quartos -...,R$ 290.000,\n2 quartos\n105m²\n1 vaga,"Brasília, Samambaia Norte (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...


## 1. Check Dataset House Locations

##### First, check whether every row really is a Samambaia listing. The command below lists the distinct values of the house_location column:

In [58]:
df_houses['house_location'].unique()

array(['Brasília, Samambaia Sul (Samambaia) - DDD 61',
       'Brasília, Samambaia Norte (Samambaia) - DDD 61',
       'Brasília, Candangolândia - DDD 61',
       'Valparaíso de Goiás, Parque Esplanada V - DDD 61',
       'Brasília, Ceilândia Sul (Ceilândia) - DDD 61',
       'Brasília, Setor Habitacional Arniqueira (Águas Claras) - DDD 61',
       'Brasília, Setor Habitacional Vicente Pires - Trecho 3 - DDD 61',
       'Brasília, St H Arniqueiras - DDD 61',
       'Brasília, Guará II - DDD 61',
       'Brasília, Setor Habitacional Vicente Pires - DDD 61',
       'Brasília, Ceilândia Norte (Ceilândia) - DDD 61',
       'Brasília, Sul (Águas Claras) - DDD 61',
       'Brasília, Ceilândia Centro (Ceilândia) - DDD 61',
       'Brasília, Cond P L Roriz - DDD 61',
       'Brasília, Vila São José (Vicente Pires) - DDD 61',
       'Brasília, Riacho Fundo II - DDD 61',
       'Brasília, Setor Habitacional Sol Nascente (Ceilândia) - DDD 61',
       'Brasilia, Riacho Fundo I - DDD 61',
       'V

##### Only the four location strings below refer to Samambaia addresses.

In [59]:
houses_location = [
        'Brasília, Samambaia Sul (Samambaia) - DDD 61',
        'Brasília, Samambaia Norte (Samambaia) - DDD 61',
        'Brasília, Samambaia Sul - DDD 61',
        'Brasília, Samambaia Norte - DDD 61'
]
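
A hand-written list like the one above can be cross-checked with a substring match. The snippet below is a minimal sketch on a few sample strings (the `locations` series is hypothetical stand-in data, not the real dataset); it assumes every Samambaia address contains the word "Samambaia" in house_location.

```python
import pandas as pd

# Hypothetical sample of strings in the same format as house_location
locations = pd.Series([
    'Brasília, Samambaia Sul (Samambaia) - DDD 61',
    'Brasília, Samambaia Norte - DDD 61',
    'Brasília, Guará II - DDD 61',
    'Brasília, Ceilândia Sul (Ceilândia) - DDD 61',
])

# Flag every location whose text mentions Samambaia (plain substring, no regex)
mentions_samambaia = locations.str.contains('Samambaia', regex=False)

# Distinct matching strings, to compare against the hand-written list
matching_locations = sorted(locations[mentions_samambaia].unique())
print(matching_locations)
```

If the distinct strings found this way differ from houses_location, the list is probably missing a variant.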

## 2. Mark Samambaia Houses

##### Create a column that indicates whether each row is a Samambaia house

In [60]:
# isin returns True for rows whose location is in the list, False otherwise
df_houses['Samambaia'] = df_houses['house_location'].isin(houses_location)

##### There are 2926 rows where Samambaia is True, so the dataset contains 2926 Samambaia houses and apartments.

In [61]:
# The boolean column can be used directly as a mask
df_houses[df_houses['Samambaia']].info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2926 entries, 0 to 3461
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   house_name         2926 non-null   object
 1   house_price        2926 non-null   object
 2   house_description  2923 non-null   object
 3   house_location     2926 non-null   object
 4   house_hypterlink   2926 non-null   object
 5   Samambaia          2926 non-null   bool  
dtypes: bool(1), object(5)
memory usage: 140.0+ KB
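
The row count can also be read straight from the boolean column, without calling info(). This is a small sketch on made-up rows (`df_demo` and `n_samambaia` are hypothetical names); it relies on True counting as 1 when a boolean column is summed.

```python
import pandas as pd

# Hypothetical rows in the same format as the real dataset
df_demo = pd.DataFrame({'house_location': [
    'Brasília, Samambaia Sul (Samambaia) - DDD 61',
    'Brasília, Guará II - DDD 61',
    'Brasília, Samambaia Norte - DDD 61',
]})

houses_location = [
    'Brasília, Samambaia Sul (Samambaia) - DDD 61',
    'Brasília, Samambaia Norte (Samambaia) - DDD 61',
    'Brasília, Samambaia Sul - DDD 61',
    'Brasília, Samambaia Norte - DDD 61',
]

df_demo['Samambaia'] = df_demo['house_location'].isin(houses_location)

# Summing booleans counts the True values
n_samambaia = int(df_demo['Samambaia'].sum())

# value_counts shows both the True and the False totals
counts = df_demo['Samambaia'].value_counts()
print(n_samambaia)
print(counts)
```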


## 3. Filter Samambaia Houses

Save the Samambaia houses in the df_samambaia variable, as shown below:

In [62]:
df_samambaia = df_houses.loc[df_houses['Samambaia'] == True].copy()

In [63]:
df_samambaia.head()

Unnamed: 0,house_name,house_price,house_description,house_location,house_hypterlink,Samambaia
0,Samambaia - Apartamento Padrão - Samambaia Sul...,R$ 152.000,\n1 quarto\n38m²\nCondomínio: R$ 329,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True
1,Samambaia - Apartamento Padrão - Samambaia Sul...,R$ 408.000,\n2 quartos\n65m²\nCondomínio: R$ 432\n1 vaga,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True
2,Samambaia 2 qts! Desocupado! mude já! Próx. Me...,R$ 145.000,\n2 quartos\n63m²\nCondomínio: R$ 10,"Brasília, Samambaia Norte (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True
3,"Samambaia sul, qr 120, apto 2 qtos,varanda, ar...",R$ 190.000,\n2 quartos\n55m²\nCondomínio: R$ 350\n1 vaga,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True
4,Samambaia Norte - QR 402 - Casa de 2 Quartos -...,R$ 290.000,\n2 quartos\n105m²\n1 vaga,"Brasília, Samambaia Norte (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True


## 4. Create new columns

Here, we split the house_description column into several new columns:
* n_rooms
* has_condominium
* value_condominium
* has_parking
* n_parking
* house_size

###### The first part of this code defines functions that extract information from the house_description column, so we can create the new columns from it.

In [115]:
def get_number_rooms(house_description):
    '''
        From the description column, which packs several
        pieces of information together, get the number
        of rooms and return it as an integer.
    '''
    
    key_word = 'quarto'
    
    # Cast the parameter to string: missing descriptions are read as float NaN
    house_description = str(house_description)
    
    if key_word in house_description:
        
        # get 'quarto' or 'quartos' string
        rooms_text = ''
        for t in house_description.split('\n'):
            if key_word in t:
                rooms_text = t
                break
                
        # after that, get the number of rooms from string
        n_rooms = int(rooms_text.split()[0])
        
        return n_rooms
    return 0
    
def has_condominium(house_description):
    
    key_word = 'Condomínio'
    house_description = str(house_description)
    
    if key_word in house_description:
        return 1
    return 0

def get_condominium_value(house_description):
    
    key_word = 'Condomínio'
    house_description = str(house_description)
    
    if key_word in house_description:
        
        #get 'Condomínio' string
        condominium_text = ''
        for t in house_description.split('\n'):
            if key_word in t:
                condominium_text = t
                break
        
        #after that, get the condominium value from string
        condominium_value = float(condominium_text.split()[2].replace('.',''))
        
        return condominium_value
#     return np.nan
    return 0

def has_parking(house_description):
    
    key_word = 'vaga'
    house_description = str(house_description)
    
    if key_word in house_description:
        return 1
    return 0

def get_n_parking(house_description):
    
    key_word = 'vaga'
    house_description = str(house_description)
    
    if key_word in house_description:
        
        # Get 'vaga' string
        parking_text = ''
        for t in house_description.split('\n'):
            if key_word in t:
                parking_text = t
                break
                
        # Get the number of parking spaces
        parking_value = int(parking_text.split()[0])
        return parking_value
    return 0

def get_house_size(house_description):
    
    key_word = 'm²'
    house_description = str(house_description)
    
    if key_word in house_description:
        
        house_size_text = ''
        
        for t in house_description.split('\n'):
            if key_word in t:
                house_size_text = t
                break
        
        house_size = float(house_size_text.replace('m²',''))
        return house_size
    return 0
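
The six functions above all follow the same pattern: find the line that contains a keyword, then read a number from it. The sketch below shows one possible way to condense that pattern with regular expressions. It is not the notebook's method; `parse_description` is a hypothetical helper, and it assumes the condominium fee always appears as "Condomínio: R$ <value>", as in the sample rows.

```python
import re

import pandas as pd


def parse_description(house_description):
    """Return the six derived fields as a dict; 0 when a field is absent."""
    # Cast to string so that missing descriptions (NaN) do not raise
    text = str(house_description)

    # First number on the line that mentions the keyword
    rooms = re.search(r'(\d+)[^\n]*quarto', text)
    parking = re.search(r'(\d+)[^\n]*vaga', text)
    size = re.search(r'(\d+)\s*m²', text)
    # Assumed format: "Condomínio: R$ 1.200" (dot as thousands separator)
    condo = re.search(r'Condomínio:\s*R\$\s*([\d.]+)', text)

    return {
        'n_rooms': int(rooms.group(1)) if rooms else 0,
        'has_condominium': int(condo is not None),
        'value_condominium': float(condo.group(1).replace('.', '')) if condo else 0.0,
        'has_parking': int(parking is not None),
        'n_parking': int(parking.group(1)) if parking else 0,
        'house_size': float(size.group(1)) if size else 0.0,
    }


# Usage on two descriptions copied from the sample rows
descriptions = pd.Series([
    '\n2 quartos\n65m²\nCondomínio: R$ 432\n1 vaga',
    '\n2 quartos\n105m²\n1 vaga',
])
parsed = pd.DataFrame(descriptions.map(parse_description).tolist(),
                      index=descriptions.index)
print(parsed)
```

One difference to keep in mind: here has_condominium is 1 only when a fee value is found, whereas the original function checks only for the word "Condomínio".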

###### Now we apply each function to create the new columns

In [65]:
df_samambaia['n_rooms'] = df_samambaia['house_description'].apply(get_number_rooms)
df_samambaia['has_condominium'] = df_samambaia['house_description'].apply(has_condominium)
df_samambaia['value_condominium'] = df_samambaia['house_description'].apply(get_condominium_value)
df_samambaia['has_parking'] = df_samambaia['house_description'].apply(has_parking)
df_samambaia['n_parking'] = df_samambaia['house_description'].apply(get_n_parking)
df_samambaia['house_size'] = df_samambaia['house_description'].apply(get_house_size)
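
The info() output earlier shows 2923 non-null descriptions out of 2926 rows, so three rows receive 0 in every derived column because the description is missing, not because the listing has zero rooms or zero area. A minimal sketch of how such rows could be flagged follows; `df_demo` and the `description_missing` column are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the middle one has no description at all
df_demo = pd.DataFrame({'house_description': [
    '\n1 quarto\n38m²\nCondomínio: R$ 329',
    np.nan,
    '\n2 quartos\n105m²\n1 vaga',
]})

# True where the description is missing, so zeros in derived columns
# can be told apart from values parsed from real text
df_demo['description_missing'] = df_demo['house_description'].isna()

n_missing = int(df_demo['description_missing'].sum())
print(n_missing)
```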

###### Here is what the dataset looks like:

In [66]:
df_samambaia.head()

Unnamed: 0,house_name,house_price,house_description,house_location,house_hypterlink,Samambaia,n_rooms,has_condominium,value_condominium,has_parking,n_parking,house_size
0,Samambaia - Apartamento Padrão - Samambaia Sul...,R$ 152.000,\n1 quarto\n38m²\nCondomínio: R$ 329,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True,1,1,329.0,0,0,38.0
1,Samambaia - Apartamento Padrão - Samambaia Sul...,R$ 408.000,\n2 quartos\n65m²\nCondomínio: R$ 432\n1 vaga,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True,2,1,432.0,1,1,65.0
2,Samambaia 2 qts! Desocupado! mude já! Próx. Me...,R$ 145.000,\n2 quartos\n63m²\nCondomínio: R$ 10,"Brasília, Samambaia Norte (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True,2,1,10.0,0,0,63.0
3,"Samambaia sul, qr 120, apto 2 qtos,varanda, ar...",R$ 190.000,\n2 quartos\n55m²\nCondomínio: R$ 350\n1 vaga,"Brasília, Samambaia Sul (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True,2,1,350.0,1,1,55.0
4,Samambaia Norte - QR 402 - Casa de 2 Quartos -...,R$ 290.000,\n2 quartos\n105m²\n1 vaga,"Brasília, Samambaia Norte (Samambaia) - DDD 61",https://df.olx.com.br/distrito-federal-e-regia...,True,2,0,0.0,1,1,105.0


## 5. Transform existing columns to improve data analysis

###### Now we create the new_house_price and new_house_location columns from house_price and house_location, plus an Is_samambaia_norte flag, so that these values are in a format we can manipulate later on.

In [129]:
df_samambaia['new_house_price'] = df_samambaia['house_price'].apply(lambda x: float(x.split()[1].replace('.','')))
df_samambaia['new_house_location'] = df_samambaia['house_location']\
            .apply(lambda x: 'Samambaia norte' if 'Norte' in x else 'Samambaia sul')

df_samambaia['Is_samambaia_norte'] = df_samambaia['house_location']\
            .apply(lambda x: 1 if 'Norte' in x else 0)
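
The price conversion can also be written with vectorized string methods instead of a lambda. The sketch below runs on a few sample strings (the `prices` series is hypothetical) and assumes prices are whole reais with a dot as thousands separator and no cents, as in the rows shown.

```python
import pandas as pd

# Hypothetical prices in the same format as house_price
prices = pd.Series(['R$ 152.000', 'R$ 1.250.000', 'R$ 290.000'])

# Drop everything that is not a digit: currency symbol, space, dots
digits = prices.str.replace(r'[^\d]', '', regex=True)

# Values that cannot be parsed become NaN instead of raising
new_house_price = pd.to_numeric(digits, errors='coerce').astype(float)
print(new_house_price)
```

Compared with the lambda, a malformed price string here produces NaN rather than stopping the cell with an exception, which may or may not be the behaviour you want.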

In [130]:
df_samambaia[['house_price', 'new_house_price', 'house_location', 'new_house_location', 'Is_samambaia_norte']].head()

Unnamed: 0,house_price,new_house_price,house_location,new_house_location,Is_samambaia_norte
0,R$ 152.000,152000.0,"Brasília, Samambaia Sul (Samambaia) - DDD 61",Samambaia sul,0
1,R$ 408.000,408000.0,"Brasília, Samambaia Sul (Samambaia) - DDD 61",Samambaia sul,0
2,R$ 145.000,145000.0,"Brasília, Samambaia Norte (Samambaia) - DDD 61",Samambaia norte,1
3,R$ 190.000,190000.0,"Brasília, Samambaia Sul (Samambaia) - DDD 61",Samambaia sul,0
4,R$ 290.000,290000.0,"Brasília, Samambaia Norte (Samambaia) - DDD 61",Samambaia norte,1


## 6. Final dataset

##### These are the final columns of this file for now

In [133]:
# Columns kept in the final dataset
final_columns = ['house_name', 'new_house_price', 'Is_samambaia_norte', 'n_rooms', 'has_condominium',
                 'value_condominium', 'has_parking', 'n_parking', 'house_size']
df_samambaia[final_columns]

Unnamed: 0,house_name,new_house_price,Is_samambaia_norte,n_rooms,has_condominium,value_condominium,has_parking,n_parking,house_size
0,Samambaia - Apartamento Padrão - Samambaia Sul...,152000.0,0,1,1,329.0,0,0,38.0
1,Samambaia - Apartamento Padrão - Samambaia Sul...,408000.0,0,2,1,432.0,1,1,65.0
2,Samambaia 2 qts! Desocupado! mude já! Próx. Me...,145000.0,1,2,1,10.0,0,0,63.0
3,"Samambaia sul, qr 120, apto 2 qtos,varanda, ar...",190000.0,0,2,1,350.0,1,1,55.0
4,Samambaia Norte - QR 402 - Casa de 2 Quartos -...,290000.0,1,2,0,0.0,1,1,105.0
...,...,...,...,...,...,...,...,...,...
3455,Apartamento 2 qrts conjugado com até zero de e...,168000.0,1,2,1,0.0,0,0,33.0
3456,Apartamento para venda com 2 quartos últimas u...,168000.0,1,2,1,0.0,1,1,33.0
3457,Apartamento para venda com 2 quartos até 100% ...,168000.0,1,2,1,0.0,1,1,33.0
3460,Apartamento 2 quartos com praticidade e conforto,212060.0,0,2,1,0.0,1,1,42.0
