# AIAP Housing Price Assignment

# Part 2 - Machine Learning Pipeline
this version does not use sklearn pipeline()

* Ingest data
* data cleanse/prep
* Select Algo
* Tune
* Eval

Also,

* use Fns
* output relevant training/eval metrics 
* Create config files for easy experimentation of different algo and parameters
* create bash script run.sh at base folder
* create requirements.txt 
* create readme.md that explains program design and usage




# Imports

In [1]:
# numerical libs and data handing
import numpy as np
import pandas as pd

# visualisation
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import seaborn as sns
from sklearn.manifold import TSNE

# Data Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer
# QuantitleTransformer can force any arbitrary distribution into a gaussian, 
# provided that there are enough training samples (thousands). 
# Because it is a non-parametric method, it is harder to interpret than 
# the parametric ones (Box-Cox and Yeo-Johnson).


import geohash2 # for latitude/longtitude feature engineering
# https://anaconda.org/conda-forge/geohash2

# Dimensionality Reduction
from sklearn.decomposition import PCA

# ML Model
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold
from sklearn.cluster import KMeans
import xgboost as xgb
#import sklearn.linear_model as linear_model
# neural networks?
# K-Nearest Neighbors vs KMeans
# random forest?

# Hyperparameter Tuning
from sklearn.model_selection import GridSearchCV

# Pipeline
from sklearn.pipeline import Pipeline

# Evaluate Model
from sklearn.metrics import mean_squared_error

# ipython display
%matplotlib inline
#pd.options.display.max_rows = 1000
#pd.options.display.max_columns = 20

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)
#warnings.filterwarnings("ignore")

# Globals

In [2]:
# setting the number of cross validations used in the Model part 
cvNum = 5

# on/off switch for using log values for House Price and features     
use_logvals = 1    

# target used for correlation 
target = 'Y house price of unit area'

# attributes shortcuts
x1 = 'X1 transaction date'
x2 = 'X2 house age'
x3 = 'X3 distance to the nearest MRT station'
x4 = 'X4 number of convenience stores'
x5 = 'X5 latitude'
x6 = 'X6 longitude'

# only columns with correlation above this threshold value are used for the ML Regressors 
corr_Floor = 0.4    
    
# on/off switch for dropping columns that are similar to others already used and show a high correlation to these     
drop_similar = 1

# 2.1 Load Data, transform and split into train/test

In [3]:
df = pd.read_csv('https://aisgaiap.blob.core.windows.net/aiap4-assessment/real_estate.csv')

# load actions to load into pipeline()
# drop id, drop others?
# transform g
### do the geohash

# split
X_train, X_test, y_train, y_test = train_test_split(df.drop([target],axis=1), df[target], test_size=0.33, random_state=88)


# 2.2 Data Preprocessing
Feature Engineering and Selection

To prep data for ingestion into ML model


### 2.2.1 Drop Unique Identifer columns

In [4]:
if 'No' in X_train:
    X_train.drop(['No'],axis=1,inplace=True)

if 'No' in X_test:
    X_test.drop(['No'],axis=1,inplace=True)

### 2.2.2 Drop Constant or Quasi-Constant Features

In [5]:
print('Features that are constant or quasi-constant have no or very little informational value to the ML model.')
print('Rather than suffer additional computational complexity, it is better to drop those attributes.')

features_const = [ feat for feat in X_test.columns if X_test[feat].std() ==0]

print('\nConstant Features is \n{}'.format(features_const))

print('\nThere are no constant features')



Features that are constant or quasi-constant have no or very little informational value to the ML model.
Rather than suffer additional computational complexity, it is better to drop those attributes.

Constant Features is 
[]

There are no constant features


### 2.2.3 Select Features based on correlation to Target

In [6]:
X_test = X_test.drop([x1,x2,x5,x6],axis=1)
print()
print(X_test.columns)
print()

X_train = X_train.drop([x1,x2,x5,x6],axis=1)


Index(['X3 distance to the nearest MRT station', 'X4 number of convenience stores'], dtype='object')



### 2.2.4 Distribution of Target and Features

In [7]:
# https://scikit-learn.org/stable/auto_examples/preprocessing/plot_map_data_to_normal.html
# On “small” datasets (less than a few hundred points), the quantile transformer is prone to overfitting. 
# The use of the power transform is then recommended.

#bc = PowerTransformer(method='box-cox') # can only be used for +ve values
#X_trans_bc = bc.fit(X_train).transform(X_test)

#yj = PowerTransformer(method='yeo-johnson') 
#X_trans_yj = yj.fit(X_train).transform(X_test)
# error: Input contains infinity or a value too large for dtype('float64').

#qt = QuantileTransformer(output_distribution='normal', random_state=88)
#X_trans_qt = qt.fit(X_train).transform(X_test)
#type(X_trans_qt)
#dfa = pd.DataFrame(X_trans_qt,columns=[x2,x3])

print('Currently, this is not used.')

Currently, this is not used.


### 2.2.4 Dimensionality Reduction

As I have only two attributes, it is not necessary to do dimensionality reduction using algorithms such as PCA

# 2.3 Select Machine Learn Algorithm and GridSearch Params

In [8]:
# choose ElasticNet as default

# if others, load from config file

# 2.4 Load Pipeline()

In [9]:
# https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

steps = [('scaler', StandardScaler()),
         ('elasticnet', ElasticNet())]
         
# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# 2.4 Hyperparameter Tuning

In [10]:
# https://scikit-learn.org/stable/modules/grid_search.html

def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn


# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio':np.linspace(0,1,30)}

# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, parameters)

# Fit to the training set
gm_cv.fit(X_train, y_train)

# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet Alpha: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))

Tuned ElasticNet Alpha: {'elasticnet__l1_ratio': 1.0}
Tuned ElasticNet R squared: 0.481090617451612


# 2.5 Predict and Evaluate Model -ElasticNet

In [11]:
pred_test = gm_cv.predict(X_test)

In [12]:
# Compute and print R^2 and RMSE

print("R^2: {}".format(gm_cv.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, pred_test))
print("Root Mean Squared Error: {}".format(rmse))


R^2: 0.481090617451612
Root Mean Squared Error: 8.621735231318153


# Linear Regression


In [13]:
# Create the regressor: reg
reg = LinearRegression()

# Fit the regressor to the training data
reg.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))


R^2: 0.48025659735708476
Root Mean Squared Error: 8.628661116328413


# GridSearch Ridge Regression

In [14]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=dict(alpha=alphas))

grid.fit(X_test, y_test)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print()
print(grid.best_estimator_.alpha)

print('\n')
y_pred = grid.predict(X_test)
# Compute and print R^2 and RMSE
print("R^2: {}".format(reg.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))


GridSearchCV(cv='warn', error_score='raise-deprecating',
       estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'alpha': array([1.e+00, 1.e-01, 1.e-02, 1.e-03, 1.e-04, 0.e+00])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)
0.5074324734475906

1.0


R^2: 0.48025659735708476
Root Mean Squared Error: 8.217553206553424
