# Samambaia House Price Prediction

## Table of contents

 1. Problem definition
 2. Two
 
 
 
 
## 1. Problem Definition

### 1.1 General objectives

My family wants to buy a new house in Distrito Federal, Brazil. My uncle Fabio (fictitious name) has said that Samambaia is the best place for us to live. So, to help my family buy a good house in Samambaia, I will explore house listings on OLX and help them make the best decision.

For my family, the two most important features of a house or apartment are the number of bedrooms (3 or more) and the price (between 180k and 300k). Another thing I consider important is choosing a good neighborhood in Samambaia. A better neighborhood probably has a higher average house value, but it is often worth it for several reasons: there are more stores nearby, the train station is nearby, the house is easier to sell in the future, and so on. So, if we have to choose among houses of the same price, I will probably prefer the one in the neighborhood with the higher average house value.
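
The two hard criteria above translate directly into a boolean filter. The sketch below is only an illustration on a tiny made-up DataFrame: the rows are hypothetical, and it assumes that the `n_rooms` column used later in this notebook counts bedrooms and that prices are in the same unit as the 180k to 300k range.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    'n_rooms': [2, 3, 4, 3],
    'new_house_price': [170_000.0, 250_000.0, 320_000.0, 180_000.0],
})

MIN_BEDROOMS = 3
PRICE_RANGE = (180_000, 300_000)

# Series.between() is inclusive on both ends by default
matches = sample[
    (sample['n_rooms'] >= MIN_BEDROOMS)
    & sample['new_house_price'].between(*PRICE_RANGE)
]
print(matches)
```

Keeping the limits in named constants makes it easy to relax one criterion (for example, a slightly higher price ceiling) and see how many more listings qualify.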


### 1.2 Specific objectives

* Plot Samambaia Norte vs. Samambaia Sul average prices;
* Plot Samambaia average prices per category;
* Plot Samambaia average prices per neighborhood;
* Plot Samambaia average prices for 3 or more bedrooms;
* Plot Samambaia prices vs. house size.
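
Most of these objectives come down to averaging the price within groups before plotting. As a rough sketch of that step, assuming the column names used later in this notebook and using hypothetical sample rows, the average price per category could be computed like this:

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    'house_category': ['Apartamentos', 'Casas', 'Apartamentos', 'Casas'],
    'new_house_price': [150_000.0, 300_000.0, 250_000.0, 400_000.0],
})

# Mean price and number of listings per category, most expensive first
avg_by_category = (
    sample.groupby('house_category')['new_house_price']
    .agg(['mean', 'count'])
    .sort_values('mean', ascending=False)
)
print(avg_by_category)
```

Reporting the count next to the mean helps to spot groups whose average rests on only a handful of listings.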


## Import Libraries

In [62]:
import pandas as pd
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Column names used throughout the notebook
HOUSE_PRICE = 'new_house_price'
HOUSE_CATEGORY = 'house_category'
HOUSE_SIZE = 'house_size'
HOUSE_N_ROOMS = 'n_rooms'
HOUSE_REGION = 'Is_samambaia_norte'
HOUSE_HAS_CONDOMI = 'has_condominium'
HOUSE_CONDOMI_VALUE = 'value_condominium'
HOUSE_N_PARKING = 'n_parking'
HOUSE_HAS_PARKING = 'has_parking'
HOUSE_N_BATH = 'n_bathrooms'
HOUSE_CEP = 'CEP'
HOUSE_LOGRADOURO = 'Logradouro'

## Reading the dataset

In [64]:
df_samambaia = pd.read_csv('./data/samambaia_houses.csv', index_col=[0])
df_samambaia.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2926 entries, 0 to 2925
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   new_house_price     2926 non-null   float64
 1   Is_samambaia_norte  2926 non-null   int64  
 2   n_rooms             2926 non-null   int64  
 3   has_condominium     2926 non-null   int64  
 4   value_condominium   2926 non-null   float64
 5   has_parking         2926 non-null   int64  
 6   n_parking           2926 non-null   int64  
 7   house_size          2926 non-null   float64
 8   house_hyperlink     2926 non-null   object 
 9   house_category      2498 non-null   object 
 10  n_bathrooms         2476 non-null   object 
 11  CEP                 2498 non-null   float64
 12  Logradouro          2436 non-null   object 
dtypes: float64(4), int64(5), object(4)
memory usage: 320.0+ KB
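
The `info()` output shows two things worth checking. Four columns have missing values (428 rows lack `house_category` and `CEP`, 450 lack `n_bathrooms`, and 490 lack `Logradouro`), and `n_bathrooms` is stored as `object` rather than as a number. The sketch below shows one way to inspect both issues. The sample rows are hypothetical, and the `'5 or more'` label is only an assumed example of why a count column might end up as text.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    'house_category': ['Apartamentos', 'Casas', None],
    'n_bathrooms': ['1', '5 or more', None],  # assumed non-numeric label
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Convert to numbers; anything that cannot be parsed becomes NaN
n_bath_numeric = pd.to_numeric(sample['n_bathrooms'], errors='coerce')
print(n_bath_numeric)
```

Comparing the number of NaN values before and after the conversion shows how many entries were non-numeric text rather than truly missing.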


## Filtering out overly expensive houses

#### The first thing to do is to look at the variables using the describe method.

In [41]:
df_samambaia.describe()

| | new_house_price | Is_samambaia_norte | n_rooms | has_condominium | value_condominium | has_parking | n_parking | house_size | CEP |
|---|---|---|---|---|---|---|---|---|---|
| count | 2926.0 | 2926.0 | 2926.0 | 2926.0 | 2926.0 | 2926.0 | 2926.0 | 2926.0 | 2498.0 |
| mean | 698515.3 | 0.569036 | 2.3838 | 0.824334 | 341.821941 | 0.775461 | 1.3838 | 126.509228 | 72316570.0 |
| std | 6946811.0 | 0.495296 | 0.935162 | 0.380601 | 7724.161203 | 0.417349 | 1.187397 | 1071.153251 | 12523.34 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72161200.0 |
| 25% | 165000.0 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 34.0 | 72309300.0 |
| 50% | 209000.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 54.0 | 72318210.0 |
| 75% | 285000.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 | 2.0 | 126.0 | 72321010.0 |
| max | 150000000.0 | 1.0 | 5.0 | 1.0 | 280000.0 | 1.0 | 5.0 | 47600.0 | 72660310.0 |


#### Taking a closer look at new_house_price, the mean (about 698,515) is more than three times the median (209,000), and the standard deviation (about 6.9 million) is far larger than both. This usually happens when a few houses are far more expensive than the rest of the dataset; such values are known as outliers. So, we will apply a threshold and see what the data looks like:

In [45]:
# Keep only the listings priced below 1,000,000
threshold = 10**6
df_samambaia_filtered = df_samambaia[df_samambaia[HOUSE_PRICE] < threshold]
df_samambaia_filtered.describe()

| | new_house_price | Is_samambaia_norte | n_rooms | has_condominium | value_condominium | has_parking | n_parking | house_size | CEP |
|---|---|---|---|---|---|---|---|---|---|
| count | 2889.0 | 2889.0 | 2889.0 | 2889.0 | 2889.0 | 2889.0 | 2889.0 | 2889.0 | 2467.0 |
| mean | 231244.67982 | 0.571478 | 2.375562 | 0.823122 | 345.419522 | 0.775701 | 1.371409 | 121.765317 | 72316580.0 |
| std | 117986.812973 | 0.49495 | 0.925836 | 0.381631 | 7773.407465 | 0.417192 | 1.168089 | 1074.196268 | 12566.12 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 72161200.0 |
| 25% | 165000.0 | 0.0 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 33.0 | 72309300.0 |
| 50% | 209000.0 | 1.0 | 2.0 | 1.0 | 0.0 | 1.0 | 1.0 | 54.0 | 72318210.0 |
| 75% | 280000.0 | 1.0 | 3.0 | 1.0 | 0.0 | 1.0 | 2.0 | 126.0 | 72321010.0 |
| max | 990000.0 | 1.0 | 5.0 | 1.0 | 280000.0 | 1.0 | 5.0 | 47600.0 | 72660310.0 |
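
The 10**6 cutoff is a manual choice. A common alternative, which this notebook does not use, is Tukey's rule: treat anything above Q3 + 1.5 × IQR as an outlier. With the quartiles of the unfiltered data (Q1 = 165,000 and Q3 = 285,000), the IQR is 120,000 and the upper fence would be 285,000 + 180,000 = 465,000, which is stricter than the manual threshold. The sketch below applies the rule to a small hypothetical price series chosen to have the same quartiles.

```python
import pandas as pd

# Hypothetical prices, chosen so that Q1 and Q3 match the describe() output
prices = pd.Series([150_000.0, 165_000.0, 209_000.0, 285_000.0, 150_000_000.0])

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# Tukey's upper fence: values above it are flagged as outliers
upper_fence = q3 + 1.5 * iqr
filtered = prices[prices <= upper_fence]

print(upper_fence)
print(filtered)
```

Whether the stricter fence is preferable depends on the goal: it would also drop houses well above my family's budget, but those are legitimate listings rather than data errors.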


#### It looks much better now: the mean dropped to about 231,245 and the standard deviation to about 117,987. Just to make sure we didn't lose too much of the dataset, let's see how many rows we have left:

In [57]:
total_size = df_samambaia.shape[0]
new_size = df_samambaia_filtered.shape[0]
percent = new_size/total_size
diff = total_size - new_size

print(f'The filtered dataset has {diff} fewer instances than the original dataset of {total_size} instances.')
print(f'The filtered dataset keeps {percent:.2%} of the original data. So, it was worth it!')

The filtered dataset has 37 fewer instances than the original dataset of 2926 instances.
The filtered dataset keeps 98.74% of the original data. So, it was worth it!


#### So, we will work on the filtered dataset, df_samambaia_filtered.

## Samambaia Norte vs. Samambaia Sul

In [65]:
fig = px.box(data_frame=df_samambaia_filtered, x=HOUSE_CATEGORY, y=HOUSE_PRICE)

fig.show()
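
The box plot above is grouped by `house_category`. To compare the two regions this section is named after, the same price column can be grouped by the `Is_samambaia_norte` flag instead (in Plotly, by passing `x=HOUSE_REGION` to `px.box`). The sketch below computes the quartiles that such a box plot would draw. The sample rows are hypothetical, and it assumes that 1 means Samambaia Norte and 0 means Samambaia Sul.

```python
import pandas as pd

HOUSE_PRICE = 'new_house_price'
HOUSE_REGION = 'Is_samambaia_norte'

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    HOUSE_REGION: [1, 1, 1, 0, 0, 0],
    HOUSE_PRICE: [145_000.0, 168_000.0, 290_000.0,
                  152_000.0, 190_000.0, 408_000.0],
})

# Assumption: 1 means Samambaia Norte and 0 means Samambaia Sul
region_label = sample[HOUSE_REGION].map({1: 'Samambaia Norte', 0: 'Samambaia Sul'})

# Per-region summary statistics of the price
summary = sample.groupby(region_label)[HOUSE_PRICE].describe()
print(summary[['count', '25%', '50%', '75%']])
```

Comparing medians rather than means keeps the comparison robust to the few very expensive listings that remain after filtering.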

# Also remove the very cheap houses. They are probably wrong!
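
The filtered data still has a minimum price of 0.0, so a lower bound is needed as well. A minimal sketch of a two-sided filter follows. The `MIN_PRICE` value is a hypothetical placeholder, not a value derived from the data, and the sample rows are made up for illustration.

```python
import pandas as pd

HOUSE_PRICE = 'new_house_price'
MIN_PRICE = 50_000   # hypothetical lower bound, to be tuned after inspecting the data
MAX_PRICE = 10**6    # same upper threshold used earlier in the notebook

# Hypothetical sample rows, for illustration only
sample = pd.DataFrame({
    HOUSE_PRICE: [0.0, 1_500.0, 165_000.0, 209_000.0, 990_000.0, 150_000_000.0],
})

# Keep prices at or above the lower bound and strictly below the upper one
in_range = (sample[HOUSE_PRICE] >= MIN_PRICE) & (sample[HOUSE_PRICE] < MAX_PRICE)
cleaned = sample[in_range]
print(cleaned)
```

Before fixing the lower bound, it is worth listing the cheapest rows (for example with `sort_values`) to check whether they are typos, rentals, or something else.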

In [5]:
df_samambaia['new_house_price'].mean()

698515.3376623377

In [27]:
df_samambaia

| | new_house_price | Is_samambaia_norte | n_rooms | has_condominium | value_condominium | has_parking | n_parking | house_size | house_hyperlink | house_category | n_bathrooms | CEP | Logradouro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 152000.0 | 0 | 1 | 1 | 329.0 | 0 | 0 | 38.0 | https://df.olx.com.br/distrito-federal-e-regia... | Apartamentos | 1 | 72302705.0 | QR 116 Conjunto 4-A Comércio |
| 1 | 408000.0 | 0 | 2 | 1 | 432.0 | 1 | 1 | 65.0 | https://df.olx.com.br/distrito-federal-e-regia... | Apartamentos | 2 | 72300533.0 | Quadra 301 Conjunto 2 |
| 2 | 145000.0 | 1 | 2 | 1 | 10.0 | 0 | 0 | 63.0 | https://df.olx.com.br/distrito-federal-e-regia... | Apartamentos | 1 | 72316080.0 | QR 204 |
| 3 | 190000.0 | 0 | 2 | 1 | 350.0 | 1 | 1 | 55.0 | https://df.olx.com.br/distrito-federal-e-regia... | Apartamentos | 1 | 72304051.0 | QN 120 Conjunto 1 |
| 4 | 290000.0 | 1 | 2 | 0 | 0.0 | 1 | 1 | 105.0 | https://df.olx.com.br/distrito-federal-e-regia... | Casas | 1 | 72318030.0 | QR 402 Conjunto 29 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2921 | 168000.0 | 1 | 2 | 1 | 0.0 | 0 | 0 | 33.0 | https://df.olx.com.br/distrito-federal-e-regia... | Apartamentos | 1 | 72321000.0 | QR 407 |
| 2922 | 168000.0 | 1 | 2 | 1 | 0.0 | 1 | 1 | 33.0 | https://df.olx.com.br/distrito-federal-e-regia... | Apartamentos | 1 | 72321000.0 | QR 407 |
| 2923 | 168000.0 | 1 | 2 | 1 | 0.0 | 1 | 1 | 33.0 | https://df.olx.com.br/distrito-federal-e-regia... | Apartamentos | 1 | 72321000.0 | QR 407 |
| 2924 | 212060.0 | 0 | 2 | 1 | 0.0 | 1 | 1 | 42.0 | https://df.olx.com.br/distrito-federal-e-regia... | | | | |
