<div>
<img src="https://drive.google.com/uc?export=view&id=1vK33e_EqaHgBHcbRV_m38hx6IkG0blK_" width="350"/>
</div> 

# **Artificial Intelligence - MSc**
## ET5003 - MACHINE LEARNING APPLICATIONS

### Instructor: Enrique Naredo
### ET5003_Etivity-2

In [1]:
#@title Current Date
Today = '2021-09-27' #@param {type:"date"}


In [2]:
#@markdown ---
#@markdown ### Enter your details here:
Student_ID = "20197772" #@param {type:"string"}
Student_full_name = "Barry Lawton" #@param {type:"string"}
#@markdown ---

In [3]:
#@title Notebook information
Notebook_type = 'Example' #@param ["Example", "Lab", "Practice", "Etivity", "Assignment", "Exam"]
Version = 'Draft' #@param ["Draft", "Final"] {type:"raw"}
Submission = False #@param {type:"boolean"}

# INTRODUCTION

**Piecewise regression**, extract from [Wikipedia](https://en.wikipedia.org/wiki/Segmented_regression):

Segmented regression, also known as piecewise regression or broken-stick regression, is a method in regression analysis in which the independent variable is partitioned into intervals and a separate line segment is fit to each interval. 

* Segmented regression analysis can also be performed on multivariate data by partitioning the various independent variables. 
* Segmented regression is useful when the independent variables, clustered into different groups, exhibit different relationships between the variables in these regions. 

* The boundaries between the segments are breakpoints.

* Segmented linear regression is segmented regression whereby the relations in the intervals are obtained by linear regression. 
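
As a minimal sketch of the idea described above, the following fits a separate least-squares line on each side of a single breakpoint. The synthetic data, the breakpoint at `x = 5`, and the helper name `fit_segments` are assumptions made for illustration; they are not part of the house price dataset.

```python
import numpy as np

def fit_segments(x, y, breakpoints):
    """Fit an independent least-squares line to each interval of x.

    Returns a list of (lower, upper, slope, intercept) tuples.
    """
    edges = [-np.inf, *sorted(breakpoints), np.inf]
    segments = []
    for lower, upper in zip(edges[:-1], edges[1:]):
        mask = (x >= lower) & (x < upper)
        # polyfit with deg=1 returns [slope, intercept]
        slope, intercept = np.polyfit(x[mask], y[mask], deg=1)
        segments.append((lower, upper, slope, intercept))
    return segments

# Synthetic data (assumption): slope 1 below x = 5, slope 3 above it
rng = np.random.default_rng(11)
x = np.linspace(0, 10, 200)
y = np.where(x < 5, 1.0 * x, 3.0 * x - 10.0) + rng.normal(0, 0.1, x.size)

segments = fit_segments(x, y, breakpoints=[5.0])
for lower, upper, slope, intercept in segments:
    print(f"[{lower}, {upper}): slope={slope:.2f}, intercept={intercept:.2f}")
```

Here the breakpoint is fixed by hand; in the rest of this notebook the partition is meant to come from clustering instead.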

***The goal is to use advanced Machine Learning methods to predict house prices.***

## Imports

In [1]:
# Suppressing Warnings:
import warnings
warnings.filterwarnings("ignore")

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import arviz as az
from sklearn.preprocessing import StandardScaler

In [3]:
# to plot
import matplotlib.colors
from mpl_toolkits.mplot3d import Axes3D

# to generate classification, regression and clustering datasets
import sklearn.datasets as dt

# to create data frames
from pandas import DataFrame

# to generate data from an existing dataset
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

In [4]:
# Define the seed so that results can be reproduced
seed = 11
rand_state = 11

# Define the color maps for plots
color_map = plt.get_cmap('RdYlBu')  # plt.cm.get_cmap is removed in recent Matplotlib releases
color_map_discrete = matplotlib.colors.LinearSegmentedColormap.from_list("", ["red","cyan","magenta","blue"])

# DATASET

Extract from this [paper](https://ieeexplore.ieee.org/document/9300074):

* House prices are a significant impression of the economy, and its value ranges are of great concerns for the clients and property dealers. 

* Housing price escalate every year that eventually reinforced the need of strategy or technique that could predict house prices in future. 

* There are certain factors that influence house prices including physical conditions, locations, number of bedrooms and others.


1. [Download the dataset](https://github.com/UL-ET5003/ET5003_SEM1_2021-2/tree/main/Week-3). 

2. Upload the dataset into your folder.



The challenge is to predict the final price of each house.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
# training dataset: 
#training_file = syntPath+filename1
# test dataset: 
#testing_file = syntPath+filename2
# cost dataset: 
#cost_file = syntPath+filename3

In [25]:
# Path from my google Drive

Path = '/content/drive/My Drive/Data/ET5003_Etivity_2/'
# house data
train_data = Path + 'house_train.csv'
test_data = Path + 'house_test.csv'
true_price = Path + 'true_price.csv'

# train data
df_train = pd.read_csv(train_data)
print("Training dataset shape : ",df_train.shape)

# test data
df_test = pd.read_csv(test_data)
print("Test dataset shape : ",df_test.shape)

# true price data
df_cost = pd.read_csv(true_price)
print("True Price dataset shape : ",df_cost.shape) 

Training dataset shape :  (2982, 17)
Test dataset shape :  (500, 16)
True Price dataset shape :  (500, 2)


In [18]:
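# Make a copy of the original datasets in case I need them later
# (.copy() creates an independent DataFrame; plain assignment would only alias it)
df_train_copy = df_train.copy()
df_test_copy = df_test.copy()
df_cost_copy = df_cost.copy()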

In [50]:
df_train.columns

Index(['ad_id', 'area', 'bathrooms', 'beds', 'ber_classification', 'county',
       'description_block', 'environment', 'facility', 'features', 'latitude',
       'longitude', 'no_of_units', 'price', 'property_category',
       'property_type', 'surface'],
      dtype='object')

In [51]:
df_test.columns

Index(['ad_id', 'area', 'bathrooms', 'beds', 'ber_classification', 'county',
       'description_block', 'environment', 'facility', 'features', 'latitude',
       'longitude', 'no_of_units', 'property_category', 'property_type',
       'surface'],
      dtype='object')

In [26]:
df_cost.columns

Index(['Id', 'Expected'], dtype='object')

In [27]:
# Let's make the column names consistent with df_train and df_test
df_cost = df_cost.rename(columns={'Id':'ad_id','Expected':'price'})

In [28]:
df_train.sample(3)

Unnamed: 0,ad_id,area,bathrooms,beds,ber_classification,county,description_block,environment,facility,features,latitude,longitude,no_of_units,price,property_category,property_type,surface
2519,12414389,Finglas,3.0,2.0,B3,Dublin,MOVEHOME ESTATE AGENTS... are delighted to pre...,prod,,Spacious Two Bedroom apartment\nThree Bathroom...,53.40576,-6.284561,,225000.0,sale,apartment,88.0
554,12209906,Baldoyle,1.0,1.0,B2,Dublin,No. 45 Myrtle House is a fantastic one bedroom...,prod,,,53.398305,-6.147146,,220000.0,sale,apartment,59.0
2641,12417278,Ongar,3.0,3.0,C1,Dublin,The Property Shop are delighted to offer this ...,prod,,Gas Fired Central Heating\nDouble Glazed uPVC ...,53.39613,-6.442373,,325000.0,sale,terraced,129.0


In [29]:
df_test.head(3)

Unnamed: 0,ad_id,area,bathrooms,beds,ber_classification,county,description_block,environment,facility,features,latitude,longitude,no_of_units,property_category,property_type,surface
0,12373510,Skerries,2.0,4.0,G,Dublin,"It's all in the name ""Island View"";. Truly won...",prod,"Parking,Alarm,Oil Fired Central Heating",Breath-taking panoramic views radiate from thi...,53.566881,-6.101148,,sale,bungalow,142.0
1,12422623,Lucan,2.0,3.0,C1,Dublin,REA McDonald - Lucan' s longest established es...,prod,,Gas fired central heating.\nDouble glazed wind...,53.362992,-6.452909,,sale,terraced,114.0
2,12377408,Swords,3.0,4.0,B3,Dublin,REA Grimes are proud to present to the market ...,prod,,Pristine condition throughout\nHighly sought-a...,53.454198,-6.262964,,sale,semi-detached,172.0


In [30]:
df_cost.sample(5)

Unnamed: 0,ad_id,price
220,12382907,390000.0
67,12092890,1495000.0
87,12412824,175000.0
456,12390545,325000.0
306,12223134,250000.0


In [31]:
df_train.dtypes
#df_test.dtypes
#df_cost.dtypes

ad_id                   int64
area                   object
bathrooms             float64
beds                  float64
ber_classification     object
county                 object
description_block      object
environment            object
facility               object
features               object
latitude              float64
longitude             float64
no_of_units           float64
price                 float64
property_category      object
property_type          object
surface               float64
dtype: object

In [32]:
df_train.isna().sum()
#df_test.isna().sum()

ad_id                    0
area                     0
bathrooms               51
beds                    51
ber_classification     677
county                   0
description_block        0
environment              0
facility              2017
features                 0
latitude                 0
longitude                0
no_of_units           2923
price                   90
property_category        0
property_type           51
surface                551
dtype: int64

Now let's explore each of the variables and decide which features to extract.

## Numerical features

In [33]:
df_train.describe()

Unnamed: 0,ad_id,bathrooms,beds,latitude,longitude,no_of_units,price,surface
count,2982.0,2931.0,2931.0,2982.0,2982.0,59.0,2892.0,2431.0
mean,12240650.0,1.998635,2.979188,53.355991,-6.257175,7.440678,532353.6,318.851787
std,579303.7,1.291875,1.468408,0.086748,0.141906,8.937081,567814.8,4389.423136
min,996887.0,0.0,0.0,51.458439,-6.521183,0.0,19995.0,3.4
25%,12268130.0,1.0,2.0,53.298929,-6.314064,2.0,280000.0,74.1
50%,12377580.0,2.0,3.0,53.345497,-6.252254,3.0,380000.0,100.0
75%,12402940.0,3.0,4.0,53.388845,-6.196049,8.0,575000.0,142.0
max,12428360.0,18.0,27.0,53.630588,-1.744995,36.0,9995000.0,182108.539008


`ad_id`: An identifier for each advert. It is not necessary for training the ML model, so it will be dropped.  
`bathrooms`: Number of bathrooms in a house. This could be a good indicator of house price; however, some listings have $0$ bathrooms and others have $18$, so I will need to check for outliers.  
`beds`: Number of beds in a house. This is intuitively a good indicator of house price, but outliers will need to be addressed, especially listings with $0$.  
`latitude` and `longitude`: Geographical coordinates of a house's location. These could also be a good indicator of a house's price.  
`no_of_units`: Number of units, presumably for bulk buys. Only $59$ of the $2982$ adverts have a value for this attribute, so I will drop it.  
`price`: Price on an advert. This is the variable we are trying to predict; observations without a price will be dropped.  
`surface`: Surface area of a property, intuitively a good indicator of a property's price; however, outliers will need to be investigated.
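
The outlier checks mentioned above could be sketched as follows. The interquartile-range rule, the factor of 1.5, the helper name `iqr_bounds`, and the toy values are all assumptions; the notebook does not specify how outliers are to be handled.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) limits outside of which values count as outliers."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy stand-in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    "beds": [2, 3, 3, 4, 2, 27],
    "surface": [74.0, 100.0, 95.0, 142.0, 80.0, 182108.5],
})

# Flag a row if any of the checked columns falls outside its IQR limits
outlier_mask = pd.Series(False, index=toy.index)
for col in ["beds", "surface"]:
    low, high = iqr_bounds(toy[col])
    outlier_mask |= ~toy[col].between(low, high)

toy_clean = toy[~outlier_mask]
print(toy_clean)
```

Note that `between` returns `False` for `NaN`, so rows with missing values in the checked columns would also be flagged by this sketch and may need separate treatment.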




In [39]:
remove_cols = ['ad_id','no_of_units']
df_train.drop(remove_cols, axis=1, inplace = True)
df_test.drop(remove_cols, axis = 1, inplace = True)


## Categorical features

In [42]:
df_train.describe(include=object)

Unnamed: 0,area,ber_classification,county,description_block,environment,facility,features,property_category,property_type
count,2982,2305,2982,2982,2982,965,2982.0,2982,2931
unique,156,16,1,2978,1,34,1882.0,2,10
top,Finglas,D1,Dublin,LEONARD WILSON KEENAN ESTATE &amp; LETTING AGE...,prod,"Parking,Gas Fired Central Heating",,sale,apartment
freq,94,283,2982,2,2982,184,1095.0,2923,759


`area`: Geographical area by name. It could be beneficial to include, although `latitude` and `longitude` should provide the same information.  
`ber_classification`: The BER rating of a house; this should be a good indicator for the model.  
`county`: County where the property is listed. The only observed value is $Dublin$, so this feature will be dropped.  
`description_block`: Free-text description of the property, unique for almost every advert. This feature cannot be encoded easily, so it will be dropped.  
`environment`: This feature has only one observed value, so it will be dropped.  
`facility`: A list of facilities a property has. Only $965$ examples have a value here, so this needs to be investigated further.  
`features`: A free-text description of the features a property has. It cannot be encoded easily, so this feature will be dropped.  
`property_category`: Category of the listing. There are two observed values, and `sale` accounts for $2923$ of the $2982$ adverts.
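
One possible way to make `ber_classification` usable by a regression model is an ordinal encoding that follows the BER scale from A1 (best) to G (worst). This is a sketch under assumptions: the notebook does not state how the rating is encoded, the toy values are made up, and any label outside the listed scale is treated as missing.

```python
import pandas as pd

# BER scale ordered from most to least energy efficient
BER_ORDER = ["A1", "A2", "A3", "B1", "B2", "B3", "C1", "C2", "C3",
             "D1", "D2", "E1", "E2", "F", "G"]

# Toy stand-in for the column in df_train
toy = pd.DataFrame({"ber_classification": ["B3", "G", None, "C1", "A2"]})

ber_cat = pd.Categorical(toy["ber_classification"],
                         categories=BER_ORDER, ordered=True)

# codes: 0 for A1 up to 14 for G; missing or unlisted ratings get -1
toy["ber_code"] = ber_cat.codes
print(toy)
```

The `-1` codes still have to be imputed or dropped before modelling, since they do not represent a position on the scale.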

Looking at the variables which have missing values, `facility` and `no_of_units` have a high frequency of `NaN` in both the training and testing datasets, so I will drop those columns.

In the training dataset, the variables `bathrooms`, `beds`, `ber_classification`, `property_type` and `surface` also have some missing values, so I will need to look into them further before deciding how to handle them.

I will remove any observations in the training dataset which have a missing value for `price`, given that this is our target variable.
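
The two clean-up steps described above (dropping sparse columns and dropping rows without a target) might be sketched like this. The toy frame and the 50% threshold for a "high frequency" of missing values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_train
toy = pd.DataFrame({
    "beds": [2.0, 3.0, np.nan, 4.0],
    "facility": [np.nan, "Parking", np.nan, np.nan],
    "price": [225000.0, np.nan, 325000.0, 380000.0],
})

# Share of missing values per column
missing_share = toy.isna().mean()

# Drop columns where more than half of the values are missing (assumed threshold)
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
toy = toy.drop(columns=sparse_cols)

# Drop rows without a target value
toy = toy.dropna(subset=["price"])
print(sparse_cols)
print(toy)
```

Using `dropna(subset=["price"])` keeps rows that are missing other values, so those can still be imputed later rather than discarded.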

In [None]:
#for col in remove_cols:
#  df_train = df_train.drop(col, axis=1)
#  df_test =  df_test.drop(col, axis=1)

In [9]:
# split data into training and test
from sklearn.model_selection import train_test_split
y = df_train['price']  # assumption: the target is the advert price column
# training: 70% (0.7), test: 30% (0.3) 
# you could try any other combination 
# but consider 50% of training as the low boundary
#X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.3)

## Training & Test Data

### Train dataset

In [None]:
# show first data frame rows 
df_train.head()

In [None]:
# Generate descriptive statistics
df_train.describe()

### Test dataset

In [None]:
# show first data frame rows 
df_test.head()

In [None]:
# Generate descriptive statistics
df_test.describe()

### Expected Cost dataset

In [None]:
# Generate descriptive statistics
df_cost.describe()

# PIECEWISE REGRESSION

## Full Model

In [None]:
# select some features columns just for the baseline model
# assume not all of the features are informative or useful
# in this exercise you could try all of them if possible

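# NOTE: 'feature_1', 'feature_2' and 'feature_3' are placeholders;
# replace them with real column names from df_train
featrain = ['feature_1','feature_2','feature_3','price']
# dropna: remove missing values
df_subset_train = df_train[featrain].dropna(axis=0)

featest = ['feature_1','feature_2','feature_3']
df_subset_test  =  df_test[featest].dropna(axis=0)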

# cost
df_cost = df_cost[df_cost.index.isin(df_subset_test.index)]

In [None]:
# model
with pm.Model() as model:
    #prior over the parameters of linear regression
    alpha = pm.Normal('alpha', mu=0, sigma=30)
    #we have one beta for each column of Xn
    beta = pm.Normal('beta', mu=0, sigma=30, shape=Xn_train.shape[1])
    #prior over the standard deviation of the noise
    sigma = pm.HalfCauchy('sigma_n', 5)
    #linear regression model in matrix form
    mu = alpha + pm.math.dot(beta, Xn_train.T)
    #likelihood, be sure that observed is a 1d vector
    like = pm.Normal('like', mu=mu, sigma=sigma, observed=yn_train[:,0])
    

In [None]:
# prediction
ll=np.mean(posterior['alpha']) + np.dot(np.mean(posterior['beta'],axis=0), Xn_test.T)
y_pred_BLR = np.exp(yscaler.inverse_transform(ll.reshape(-1,1)))[:,0]
print("MAE = ",(np.mean(abs(y_pred_BLR - y_test))))
print("MAPE = ",(np.mean(abs(y_pred_BLR - y_test) / y_test)))
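
The prediction cell above implies that the target was log-transformed and then standardized before modelling, since it undoes both steps with `np.exp(yscaler.inverse_transform(...))`. A minimal sketch of that round trip and of the two error metrics, using made-up prices, might look like this. The variable names mirror the notebook, but how `yscaler` is fitted here is an assumption, and `mae` and `mape` are hypothetical helper names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up prices standing in for the training target
y_train = np.array([225000.0, 220000.0, 325000.0, 380000.0, 575000.0])

# Forward transform: log, then zero mean and unit variance
yscaler = StandardScaler()
yn_train = yscaler.fit_transform(np.log(y_train).reshape(-1, 1))

# Inverse transform, as used in the prediction cell
y_back = np.exp(yscaler.inverse_transform(yn_train))[:, 0]

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the target."""
    return np.mean(np.abs(y_pred - y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction of the true value."""
    return np.mean(np.abs(y_pred - y_true) / y_true)

y_pred = y_train * 1.10  # pretend every prediction is 10% too high
print("MAE  =", mae(y_train, y_pred))
print("MAPE =", mape(y_train, y_pred))
```

Reporting MAPE alongside MAE is useful here because house prices span a wide range, so an absolute error of a given size matters more for a cheap property than for an expensive one.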

## Clustering

### Full Model

In [None]:
# training gaussian mixture model 
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=4)
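
A minimal sketch of how the mixture model could be fitted and then used to assign training and test points to clusters is shown below. The synthetic two-group data, the use of two components instead of four, and the fixed `random_state` are assumptions chosen to keep the example small and reproducible.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(11)

# Two well-separated synthetic groups standing in for scaled house features
X_train = np.vstack([
    rng.normal(loc=-3.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=3.0, scale=0.5, size=(100, 2)),
])
X_test = np.array([[-3.1, -2.9], [2.8, 3.2]])

# Fit on the training features only
gmm = GaussianMixture(n_components=2, random_state=11)
gmm.fit(X_train)

# Hard cluster assignments for both sets, using the same fitted model
clusters_train = gmm.predict(X_train)
clusters_test = gmm.predict(X_test)
print(np.bincount(clusters_train), clusters_test)
```

Fitting on the training set and only calling `predict` on the test set keeps the test data out of the clustering step.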


### Clusters

In [None]:
# train clusters



In [None]:
# test clusters


## Piecewise Model

In [None]:
# model_0
with pm.Model() as model_0:
  # prior over the parameters of linear regression
  alpha = pm.Normal('alpha', mu=0, sigma=30)
  # we have a beta for each column of Xn0
  beta = pm.Normal('beta', mu=0, sigma=30, shape=Xn0.shape[1])
  # prior over the standard deviation of the noise
  sigma = pm.HalfCauchy('sigma_n', 5)
  # linear regression relationship
  #linear regression model in matrix form
  mu = alpha + pm.math.dot(beta, Xn0.T)
  # likelihood, be sure that observed is a 1d vector
  like = pm.Normal('like', mu=mu, sigma=sigma, observed=yn0[:,0])
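
The piecewise idea is to fit one regression per cluster and to route each point to the model of its own cluster. The sketch below uses ordinary least squares via `np.linalg.lstsq` as a simple stand-in for the Bayesian model in `model_0`; the data and cluster labels are synthetic, and the helper names `fit_per_cluster` and `predict_per_cluster` are hypothetical.

```python
import numpy as np

def fit_per_cluster(X, y, labels):
    """Return {cluster: coefficient vector}, with the intercept first."""
    models = {}
    for c in np.unique(labels):
        Xc = X[labels == c]
        A = np.column_stack([np.ones(len(Xc)), Xc])
        coef, *_ = np.linalg.lstsq(A, y[labels == c], rcond=None)
        models[int(c)] = coef
    return models

def predict_per_cluster(models, X, labels):
    """Predict each row with the model fitted for its cluster."""
    A = np.column_stack([np.ones(len(X)), X])
    y_hat = np.empty(len(X))
    for c, coef in models.items():
        mask = labels == c
        y_hat[mask] = A[mask] @ coef
    return y_hat

# Synthetic data (assumption): two clusters with different linear relationships
rng = np.random.default_rng(11)
X = rng.normal(size=(200, 2))
labels = (X[:, 0] > 0).astype(int)
true_coefs = {0: np.array([1.0, 2.0, -1.0]), 1: np.array([-2.0, 0.5, 3.0])}
y = np.empty(200)
for c, w in true_coefs.items():
    m = labels == c
    y[m] = w[0] + X[m] @ w[1:] + rng.normal(0, 0.05, m.sum())

models = fit_per_cluster(X, y, labels)
y_hat = predict_per_cluster(models, X, labels)
print({c: np.round(w, 2) for c, w in models.items()})
```

This sketch assumes every label passed to `predict_per_cluster` was seen during fitting; a test point assigned to an unseen cluster would need a fallback such as the full model.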



## Simulations

### Only Cluster 0

## Overall

## Test set performance

### PPC on the Test set



# SUMMARY