# Regression Challenge

The selling price of a residential property depends on a number of factors, including the property's age, the availability of local amenities, and its location.

In this challenge, you will use a dataset of real estate sales transactions to predict the price-per-unit of a property based on its features. The price-per-unit in this data is based on a unit measurement of 3.3 square meters.

> **Citation**: The data used in this exercise originates from the following study:
>
> *Yeh, I. C., & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.*
>
> It was obtained from the UCI dataset repository (Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science).

## Review the Data

Run the following cell to load the data and view the first few rows.

In [2]:
import pandas as pd

# load the training dataset
data = pd.read_csv('data/real_estate.csv')
print(data.head())
data.describe()

   transaction_date  house_age  transit_distance  local_convenience_stores  \
0          2012.917       32.0          84.87882                        10   
1          2012.917       19.5         306.59470                         9   
2          2013.583       13.3         561.98450                         5   
3          2013.500       13.3         561.98450                         5   
4          2012.833        5.0         390.56840                         5   

   latitude  longitude  price_per_unit  
0  24.98298  121.54024            37.9  
1  24.98034  121.53951            42.2  
2  24.98746  121.54391            47.3  
3  24.98746  121.54391            54.8  
4  24.97937  121.54245            43.1  


| | transaction_date | house_age | transit_distance | local_convenience_stores | latitude | longitude | price_per_unit |
| ----- | ---------------- | --------- | ---------------- | ------------------------ | -------- | --------- | -------------- |
| count | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 | 414.0 |
| mean | 2013.148971 | 17.71256 | 1083.885689 | 4.094203 | 24.96903 | 121.533361 | 37.980193 |
| std | 0.281967 | 11.392485 | 1262.109595 | 2.945562 | 0.01241 | 0.015347 | 13.606488 |
| min | 2012.667 | 0.0 | 23.38284 | 0.0 | 24.93207 | 121.47353 | 7.6 |
| 25% | 2012.917 | 9.025 | 289.3248 | 1.0 | 24.963 | 121.528085 | 27.7 |
| 50% | 2013.167 | 16.1 | 492.2313 | 4.0 | 24.9711 | 121.53863 | 38.45 |
| 75% | 2013.417 | 28.15 | 1454.279 | 6.0 | 24.977455 | 121.543305 | 46.6 |
| max | 2013.583 | 43.8 | 6488.021 | 10.0 | 25.01459 | 121.56627 | 117.5 |


The data consists of the following variables:

- **transaction_date** - the transaction date as a fractional year (for example, 2013.250 = March 2013, 2013.500 = June 2013, etc.)
- **house_age** - the house age (in years)
- **transit_distance** - the distance to the nearest light rail station (in meters)
- **local_convenience_stores** - the number of convenience stores within walking distance
- **latitude** - the geographic coordinate, latitude
- **longitude** - the geographic coordinate, longitude
- **price_per_unit** - the house price per unit area (one unit is 3.3 square meters)
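
One way to start exploring how these variables relate to the label is to rank them by their correlation with **price_per_unit**. The sketch below is self-contained, so it rebuilds only the five rows shown in the output above; with so few rows the numbers are not meaningful, and in the notebook you would run the same call on the full `data` frame instead.

```python
import pandas as pd

# First five rows, copied from the data.head() output above.
# In the notebook, replace `sample` with the full `data` DataFrame.
sample = pd.DataFrame({
    "transaction_date": [2012.917, 2012.917, 2013.583, 2013.500, 2012.833],
    "house_age": [32.0, 19.5, 13.3, 13.3, 5.0],
    "transit_distance": [84.87882, 306.59470, 561.98450, 561.98450, 390.56840],
    "local_convenience_stores": [10, 9, 5, 5, 5],
    "latitude": [24.98298, 24.98034, 24.98746, 24.98746, 24.97937],
    "longitude": [121.54024, 121.53951, 121.54391, 121.54391, 121.54245],
    "price_per_unit": [37.9, 42.2, 47.3, 54.8, 43.1],
})

label = "price_per_unit"

# Pearson correlation of each feature with the label,
# sorted by absolute strength (strongest first)
correlations = (
    sample.corr()[label]
    .drop(label)
    .sort_values(key=abs, ascending=False)
)
print(correlations)
```

Correlation only captures linear relationships, so a scatter plot of each feature against the label is a useful complement.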

## Train a Regression Model

Your challenge is to explore and prepare the data, identify features that help predict the **price_per_unit** label, and train a regression model with the lowest Root Mean Square Error (RMSE) you can achieve when evaluated against a test subset of the data. The RMSE must be less than **7**.
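
For reference, RMSE is the square root of the mean squared difference between the predicted and actual values. The following is a minimal sketch; the helper name `rmse` and the numbers are made up for illustration and are not drawn from the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical actual and predicted prices, for illustration only
y_true = [40.0, 30.0, 50.0]
y_pred = [43.0, 26.0, 50.0]

# Residuals are -3, 4 and 0, so the mean squared error is 25/3
print(rmse(y_true, y_pred))
```

Because RMSE is expressed in the same units as the label, the target of 7 corresponds to a typical error of roughly 7 price units per 3.3 square meters.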

Add markdown and code cells as required to create your solution.

> **Note**: There is no single "correct" solution. A sample solution is provided in [02 - Real Estate Regression Solution.ipynb](02%20-%20Real%20Estate%20Regression%20Solution.ipynb).

In [24]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Separate the label from the features; only transit_distance, latitude,
# and longitude are kept as features
y = data['price_per_unit']
X = data.drop(columns=['price_per_unit', 'house_age', 'transaction_date', 'local_convenience_stores'])

# Normalize numerical data: transit_distance is divided by its maximum,
# latitude and longitude are min-max scaled.
# Note: these statistics come from the full dataset, before the train/test split.
X['transit_distance'] = X['transit_distance'] / X['transit_distance'].abs().max()
X['latitude'] = (X['latitude'] - X['latitude'].min()) / (X['latitude'].max() - X['latitude'].min())
X['longitude'] = (X['longitude'] - X['longitude'].min()) / (X['longitude'].max() - X['longitude'].min())

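val_size = 0.15  # fraction of the training data held out for validation

# Hold out 25% of the data for testing, then split a validation set off the remainder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=val_size, random_state=0)

# Baseline fit with default hyperparameters (replaced by the tuning loops below)
model = GradientBoostingRegressor().fit(X_train, y_train)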

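# Hyperparameter tuning: search the learning rate first, then the number of
# estimators, keeping whichever model has the lowest validation MSE

learning_rates = [0.001, 0.02, 0.05, 0.075, 0.09, 0.1, 0.2, 0.3, 0.4]
n_estimators = [10, 50, 200, 500]

best_lr = 0
best_mse = float('inf')
best_model = None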
for lr in learning_rates:
    model = GradientBoostingRegressor(learning_rate=lr).fit(X_train,y_train)
    y_hat = model.predict(X_val)
    mse = mean_squared_error(y_val, y_hat)
    if mse < best_mse:
        best_lr = lr
        best_mse = mse
        best_model = model
        print(f"New best mse score: {mse} using learning rate: {lr}")

best_n_estimator = 100  # scikit-learn default, used during the learning-rate search
for n_est in n_estimators:
    model = GradientBoostingRegressor(learning_rate=best_lr, n_estimators=n_est).fit(X_train,y_train)
    y_hat = model.predict(X_val)
    mse = mean_squared_error(y_val, y_hat)
    if mse < best_mse:
        best_n_estimator = n_est
        best_mse = mse
        best_model = model
        print(f"New best mse score: {mse} using {n_est} estimators")




New best mse score: 147.08276691365398 using learning rate: 0.001
New best mse score: 43.36316795693412 using learning rate: 0.02
New best mse score: 34.81326014244254 using learning rate: 0.05
New best mse score: 33.665779078011234 using learning rate: 0.075
New best mse score: 33.118881683821314 using learning rate: 0.1
New best mse score: 32.60116755537857 using learning rate: 0.2
New best mse score: 32.55775861330839 using 50 estimators
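
The loops above tune the learning rate and the number of estimators one after the other, against a single validation split, and the features are scaled with statistics from the full dataset. One alternative, sketched below, searches both hyperparameters jointly with cross-validation and places the scaler inside a `Pipeline` so it is fitted only on each training fold. This is an assumption-laden illustration, not the notebook's method: the data is a synthetic stand-in and the grid values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the three features used above (NOT the real data)
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "transit_distance": rng.uniform(20, 6500, n),
    "latitude": rng.uniform(24.93, 25.02, n),
    "longitude": rng.uniform(121.47, 121.57, n),
})
# Hypothetical label: price falls with distance to transit, plus noise
y = 50 - 0.005 * X["transit_distance"] + rng.normal(0, 3, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler lives inside the pipeline, so it only ever sees training folds
pipeline = Pipeline([
    ("scale", MinMaxScaler()),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])

# Arbitrary illustrative grid; every combination is evaluated
param_grid = {
    "gbr__learning_rate": [0.05, 0.1, 0.2],
    "gbr__n_estimators": [50, 100],
}

search = GridSearchCV(
    pipeline, param_grid, scoring="neg_root_mean_squared_error", cv=3
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

After fitting, `search.best_estimator_` is the pipeline refitted on all of the training data with the best parameters, and it can be evaluated once on the held-out test set.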


## Use the Trained Model

Save your trained model, and then use it to predict the price-per-unit for the following real estate transactions:

| transaction_date | house_age | transit_distance | local_convenience_stores | latitude | longitude |
| ---------------- | --------- | ---------------- | ------------------------ | -------- | --------- |
|2013.167|16.2|289.3248|5|24.98203|121.54348|
|2013.000|13.6|4082.015|0|24.94155|121.50381|
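
A minimal sketch of the save-and-predict step follows. It assumes the model was trained on `transit_distance`, `latitude`, and `longitude` only, and that scaling is bundled with the model in a `Pipeline` so new rows are transformed the same way as the training data. The training data here is a synthetic stand-in, and the file name `real_estate_model.pkl` is hypothetical; only the two transactions are taken from the table above.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

features = ["transit_distance", "latitude", "longitude"]

# Synthetic stand-in for the training data (NOT the real dataset)
rng = np.random.default_rng(0)
n = 200
X_train = pd.DataFrame({
    "transit_distance": rng.uniform(20, 6500, n),
    "latitude": rng.uniform(24.93, 25.02, n),
    "longitude": rng.uniform(121.47, 121.57, n),
})
y_train = 50 - 0.005 * X_train["transit_distance"] + rng.normal(0, 3, n)

# Scaling and regression are saved together as one object
model = Pipeline([
    ("scale", MinMaxScaler()),
    ("gbr", GradientBoostingRegressor(random_state=0)),
]).fit(X_train, y_train)

# Save the trained model, then load it back (hypothetical file name)
model_path = os.path.join(tempfile.gettempdir(), "real_estate_model.pkl")
joblib.dump(model, model_path)
loaded = joblib.load(model_path)

# The two transactions from the table above
X_new = pd.DataFrame({
    "transaction_date": [2013.167, 2013.000],
    "house_age": [16.2, 13.6],
    "transit_distance": [289.3248, 4082.015],
    "local_convenience_stores": [5, 0],
    "latitude": [24.98203, 24.94155],
    "longitude": [121.54348, 121.50381],
})

# Select the same columns, in the same order, that the model was trained on
predictions = loaded.predict(X_new[features])
for price in predictions:
    print(f"Predicted price-per-unit: {price:.2f}")
```

If the features are scaled by hand instead, as in the training cell earlier, the same minimum and maximum values from the training data must be applied to the new rows before calling `predict`.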

In [25]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Evaluate the tuned model on the held-out test set
y_hat = best_model.predict(X_test)

mse = mean_squared_error(y_test, y_hat)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_hat)

print(f"MSE: {mse}\nRMSE: {rmse}\nr2: {r2}")


MSE: 56.59136541814433
RMSE: 7.522723271405398
r2: 0.6631444672994229
