# Madrid Houses Hackathon

In [None]:
import pandas as pd
import numpy as np

In [None]:
try:
    import folium
except ImportError:
    # Only install when the import itself fails
    !pip install folium

### (!) Action Required - Upload the data

- Activate the empty cell below (cursor should blink in the empty cell below).
- Click the data symbol on the right
- Find your data set > Insert to code > Insert pandas DataFrame

Make sure to upload & import **madrid_train.csv** and **madrid_test.csv**.

For importing **madrid_test.csv**, you should create a new cell by going to `Insert` > `Insert Cell Below`
    
![as](https://i.imgur.com/mafVWHP.png)


### (!) Action Required - Rename the DataFrame below to the freshly imported df_data_NN

Most likely DSX has imported the data as `df_data_5`, `df_data_6` or similar. It is good practice to rename the data in the next cell and continue from there.

* Store the `df_data_X` (where X is a number), from **madrid_train.csv** in `madrid_train`.
* Store the `df_data_X` (where X is a number), from **madrid_test.csv** in `madrid_test`.

In [None]:
madrid_train = df_data_7
madrid_test = df_data_8

## Visualization Example - Folium

You are free to install packages like Folium. Folium (built on Leaflet) is a very nice geodata visualisation tool. Use `!pip install [package]` to install other packages.

https://folium.readthedocs.io/en/latest/

https://github.com/python-visualization/folium


Using folium, and the coordinates of Cybele Palace as an indicator for the centre of Madrid, let's look at the density of houses for sale on a map.

In less than 10 lines of code, we have an interactive heatmap showing the popular places where houses are for sale.

In [None]:
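import folium
from folium.plugins import HeatMap

# Cybele Palace (lat, lng) serves as the centre of Madrid
cybele_palace = (40.418906, -3.692084)
centre_madrid = cybele_palace

# One (lat, lng, weight) triple per house; every house gets the same weight
house_density = [ (lat, lng, 0.3) for (lat, lng) in zip(madrid_train.lat, madrid_train.lng) ]

# Note: the 'stamentoner' tiles may no longer be bundled with recent folium releases;
# if the map fails to render, try tiles='cartodbpositron' instead
map_with_houses = folium.Map(location=centre_madrid, tiles='stamentoner', zoom_start=11)

HeatMap(house_density).add_to(map_with_houses)
map_with_houses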

## A Classic Linear Regression on the size of property to predict the price

http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html


In [None]:
# Build two data frames: square metres as X, price as y
import numpy as np
from sklearn import linear_model

x = madrid_train["mts2"].to_frame()
y = madrid_train["price"].to_frame()

## Matplotlib visualization

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline  

regr = linear_model.LinearRegression()
regr.fit(x, y)

# plot it as in the example at http://scikit-learn.org/
plt.figure(figsize=(17, 9))
title = "Linear Regression shows, each square-m is worth {0:.2f} euro".format( regr.coef_.flatten()[0] )
plt.title(title, fontsize=28) 
plt.scatter(madrid_train.mts2, madrid_train.price,  color='black', alpha=0.7)
plt.xticks((np.arange(100,2000,100)))
plt.yticks((np.arange(20000,10000000,1000000)))
plt.xlabel('Area', fontsize=20, color='green')
plt.ylabel('Price', fontsize=20, color='green')
plt.plot(x, regr.predict(x), color='blue', linewidth=3)

plt.show()

In [None]:
print( "Intercept: ", regr.intercept_)
print( "Coef: ", regr.coef_)
print( "")
# predict() needs a 2-D input (one row per house), not a bare number
print("A house of 100 square m, in Madrid, according to the model will cost about:", regr.predict(pd.DataFrame({"mts2": [100.0]})))
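
To make the printed numbers concrete: a linear model's prediction is just the intercept plus the coefficient times the area. The sketch below checks that by hand on made-up toy data (not the Madrid set); the names `toy` and `toy_regr` are illustrative only.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data, made up for illustration: price = 50000 + 3000 * area, no noise
toy = pd.DataFrame({"mts2": [50.0, 80.0, 120.0, 200.0]})
toy["price"] = 50000 + 3000 * toy["mts2"]

# The target is a Series here, so intercept_ is a scalar and coef_ is 1-D
toy_regr = LinearRegression().fit(toy[["mts2"]], toy["price"])

# predict() expects one row per house
area = pd.DataFrame({"mts2": [100.0]})
by_model = toy_regr.predict(area)[0]
by_hand = toy_regr.intercept_ + toy_regr.coef_[0] * 100.0

print(by_model, by_hand)
```

Both values should agree up to floating point rounding, which is a quick sanity check when reading `intercept_` and `coef_`.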

## What is the model Performance?

### For today - a house is well predicted if the predicted price is less than 10% off the true price

To test this, we extend the pandas DataFrame with a column `prediction`. The function `percentage_quite_well_predicted()` will later return the model performance.

For your own model, repeat these steps to find your model performance:

1. Create a new DataFrame from `madrid_test`, and extend it with your predictions in a new column `prediction`
2. Call `percentage_quite_well_predicted()` on the newly created test DataFrame
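
Before applying the metric to a real model, it may help to see it on four made-up houses. The prices and predictions below are invented purely for illustration.

```python
import pandas as pd

# Made-up prices and predictions, purely for illustration
toy = pd.DataFrame({
    "price":      [100000, 200000, 300000, 400000],
    "prediction": [105000, 150000, 329000, 400000],
})

# Relative error: how far off the prediction is, as a fraction of the true price
relative_error = (toy["prediction"] - toy["price"]).abs() / toy["price"]

# A house counts as well predicted when it is strictly less than 10% off
well_predicted = relative_error < 0.10

print(relative_error.tolist())
print(well_predicted.mean())
```

Three of the four toy houses are within 10% (the second one is 25% off), so the proportion is 0.75.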

Example for the Linear Model

In [None]:
preds = regr.predict(x)

test = madrid_test.copy()

prediction_test = regr.predict(test.mts2.to_frame() )
test["prediction"] = prediction_test

test.head()

In [None]:
def percentage_quite_well_predicted( dataframe_with_predictions ):
    """
        How many houses can we predict well? 
        
        Input: Data frame, with a "price" column, and a "prediction" column.
        
        Output: Proportion of houses predicted within MAXIMUM_RELATIVE_ERROR of the true price.
        
        E.g., an output of 0.2 tells us 20% is well predicted.
        E.g., an output of 0.99 tells us 99% is well predicted.
    """
    MAXIMUM_RELATIVE_ERROR = 0.1
    
    assert isinstance(dataframe_with_predictions, pd.DataFrame), "Please provide a DataFrame as argument..."
    assert "prediction" in dataframe_with_predictions.columns, "Make sure your predictions are in the 'prediction' column..."
    assert "price" in dataframe_with_predictions.columns, "Make sure the true price is in the 'price' column..."
    
    # Use the frame passed in as argument, not the global `test`
    df = dataframe_with_predictions
    relative_error = np.abs(df.prediction - df.price) / df.price
    proportion_well_predicted = np.mean( relative_error < MAXIMUM_RELATIVE_ERROR )
    return proportion_well_predicted

In [None]:
percentage_quite_well_predicted( test )

In [None]:
prediction_test = regr.predict(test.mts2.to_frame() )
test["prediction"] = prediction_test

"With a linear model- we are able to predict {0:.2f}% of the houses well!".format( percentage_quite_well_predicted(test)*100 )

As you can see, a modest regression on square-m alone predicts only about 7% of the houses well.

That 7% is not an error rate: it is the share of houses whose predicted price is less than 10% off the true price.

## Example with Random Forest

---



In [None]:
COLUMNS_TO_INDEX_AS_CATEGORIES = ["property", "property_state", "district"]

# Work on copies so the original frames stay untouched
madrid_one_hot_encoded = madrid_train.copy()
madrid_one_hot_encoded_test = madrid_test.copy()

for col in COLUMNS_TO_INDEX_AS_CATEGORIES:
    if col in madrid_one_hot_encoded.columns:
        # Replace the categorical column by one 0/1 column per category (train)
        temp_res = pd.get_dummies(  madrid_one_hot_encoded[ col ], prefix=col )
        madrid_one_hot_encoded[ temp_res.columns ] = temp_res
        madrid_one_hot_encoded = madrid_one_hot_encoded.drop(columns=col)
        
        # Same treatment for the test set
        temp_res = pd.get_dummies(  madrid_one_hot_encoded_test[ col ], prefix=col )
        madrid_one_hot_encoded_test[ temp_res.columns ] = temp_res
        madrid_one_hot_encoded_test = madrid_one_hot_encoded_test.drop(columns=col)
        
madrid_one_hot_encoded.head()
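
One thing to watch when encoding train and test separately: a category that appears in only one of the two files produces a dummy column that the other frame lacks. A possible way to handle this, sketched on made-up toy frames (the district values here are only examples), is to reindex the test frame to the training columns.

```python
import pandas as pd

# Made-up frames: "Centro" only appears in train, "Retiro" only in test
toy_train = pd.DataFrame({"mts2": [60, 90], "district": ["Centro", "Salamanca"]})
toy_test = pd.DataFrame({"mts2": [75, 110], "district": ["Salamanca", "Retiro"]})

train_encoded = pd.get_dummies(toy_train, columns=["district"])
test_encoded = pd.get_dummies(toy_test, columns=["district"])

# Give the test frame exactly the training columns: dummies missing in test
# are filled with 0, dummies unseen during training are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))
```

Whether dropping unseen categories is acceptable depends on your model; this is one option, not the only one.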

### Displaying all new columns

In [None]:
for col in madrid_one_hot_encoded.columns[1:]:
    print( col, end=", ")

### Feature selection - example code (not a recommendation for a good model)

In [None]:
"""
   Edit the line below to change the features used by the model.
   
   FEATURE_SELECTION = [c for c in madrid_one_hot_encoded.columns if c != "price"]
   
   will use ALL features (the target "price" must stay out of the feature list)
   
"""
SELECTED_FEATURE_KEYWORDS = ["distance_to_centre", "property_state", "district", "mts2", "sauna"]

FEATURE_SELECTION = []

for feature in SELECTED_FEATURE_KEYWORDS:
    for potential_feature in list(madrid_one_hot_encoded.columns):
        if feature in potential_feature:
            FEATURE_SELECTION.append(potential_feature)
            
for feat in FEATURE_SELECTION:
    print(feat, end=", ")
            

## Test / Train set creation

In [None]:
msk = np.random.rand(len(madrid_one_hot_encoded)) < 0.8

train = madrid_one_hot_encoded[msk]
test = madrid_one_hot_encoded[~msk]
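
The random mask above gives a split of roughly 80/20 that changes on every run. If you prefer a reproducible split, scikit-learn's `train_test_split` with a fixed `random_state` is one alternative. The sketch below uses a made-up frame standing in for `madrid_one_hot_encoded`; the names `toy_train` and `toy_holdout` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame standing in for madrid_one_hot_encoded
toy = pd.DataFrame({"mts2": np.arange(100), "price": np.arange(100) * 3000})

# random_state fixes the shuffle, so every run yields the same split
toy_train, toy_holdout = train_test_split(toy, test_size=0.2, random_state=42)

print(len(toy_train), len(toy_holdout))
```

With 100 rows and `test_size=0.2`, this yields exactly 80 training rows and 20 hold-out rows.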

In [None]:
from sklearn.ensemble import RandomForestRegressor

"""
    Change the code below to make the model perform better, and use the information in the data as well as possible
"""
clf = RandomForestRegressor(max_depth=25, n_estimators=5)

target_t = train[ "price" ]
features_t = train[ FEATURE_SELECTION ]

clf.fit(  features_t, target_t )

features_t.head()

### The following cell attempts to identify features which influence the property price

In [None]:
feature_imp_dict = {}
importance_list = []
for (imp, label) in list( zip( list(clf.feature_importances_), FEATURE_SELECTION ) ):
    feature_imp_dict[label] = imp
    importance_list.append(imp)

# Keep only the two most important features
threshold = sorted(importance_list)[-2] 
important_features = {}

for key in feature_imp_dict:
    if feature_imp_dict[key] >= threshold:
        important_features[key] = feature_imp_dict[key]
        
for k,v in sorted(important_features.items(), key=lambda x:-x[1]):
    print(k, v)
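
The same ranking can be written more compactly with a pandas Series. The sketch below is an alternative formulation on made-up data, not the notebook's model: `toy_features`, `toy_price` and `toy_forest` are hypothetical names, and the price is constructed to depend on `mts2` only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Made-up data: the price depends on mts2 only, "noise" is irrelevant
rng = np.random.RandomState(0)
toy_features = pd.DataFrame({
    "mts2": rng.uniform(30, 300, 200),
    "noise": rng.uniform(0, 1, 200),
})
toy_price = 3000 * toy_features["mts2"]

toy_forest = RandomForestRegressor(n_estimators=10, random_state=0)
toy_forest.fit(toy_features, toy_price)

# Label each importance with its feature name and sort, largest first
importances = pd.Series(toy_forest.feature_importances_, index=toy_features.columns)
top = importances.sort_values(ascending=False)
print(top)
```

Feature importances sum to 1, so each value can be read as a share of the total; `top.head(2)` would select the two most important features.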


### Feature Importance Plot

In [None]:
%matplotlib inline

import matplotlib.pyplot as plt

fig = plt.figure()
fig.set_size_inches(35, 10.5)

D = important_features

# Truncate long feature names so the tick labels stay readable
keys = [ s[:20] for s in D.keys()]

# Convert the dict view to a list before handing it to matplotlib
plt.bar(range(len(D)), list(D.values()), align='center')
plt.xticks(range(len(D)), keys, fontsize=14)
plt.show()

### Test model performance on Test set (unseen during model training)

In [None]:
test = madrid_one_hot_encoded_test.copy()


preds = clf.predict(test[ FEATURE_SELECTION ])
test["prediction"] = preds


print("{0:.2f}% is well predicted".format(100 * percentage_quite_well_predicted(test)))