<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 2: Singapore Housing Data and Kaggle Challenge

## Problem Statement

When it comes to property prices in Singapore, the sentiments we often hear on the ground are "so expensive", "so far" and "is it worth the price?"

Yet, there still seems to be a preference among some buyers to stay in the central region because of its convenience and general accessibility. In our study, we examine how the prices of resale HDB flats, particularly in the central region, are influenced by factors such as floor area, age of the HDB flat, maximum floor level, proximity to amenities and public transport connectivity.

Ultimately, we aim to address the following question: **"Are resale prices of central region HDBs influenced primarily by their location?"**

In doing so, we hope to empower our target audience to make more calculated and informed housing decisions, whether they are young couples buying their first flats or older families looking to sell their flats.

## Importing Libraries

In [10]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
from pycaret.regression import setup, compare_models, pull
from scipy import stats
from sklearn.linear_model import LinearRegression, Ridge, Lasso, LassoCV, RidgeCV, ElasticNet, ElasticNetCV
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error, r2_score
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn import metrics
from sklearn import datasets
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

import pickle
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-pastel')
pd.set_option('display.max_rows', 200)

## Import EDA and Visualised Datasets

In [2]:
train_edavis = pd.read_csv('../datasets/train_edavis.csv')
test_edavis = pd.read_csv('../datasets/test_edavis.csv')

## Feature Filtration

Here, we explain why we drop the following features before creating the models:

|Dropped Feature|Justification|
|---|---|
|tranc_yearmonth, tranc_year, tranc_month|Datetime fields, not suitable for modelling|
|lease_commence_date, year_completed|Redundant columns, already represented by `hdb_age`|
|block, street_name, address, postal_code|Redundant for modelling, as this model is based on planning area|
|full_flat_type|Combination of `flat_type` and `flat_model`|
|storey_range, upper, mid, lower|Represented by `mid_storey`|
|floor_area_sqm|Same information as `floor_area_sqft`|
|price_per_sqft|Price should not be used to predict `resale_price`|
|residential|All values are 1, so no correlation can be identified|
|commercial|Irrelevant to modelling|
|total_dwelling_units|Total number of units; the model uses the more detailed breakdown by number of 1-room, 2-room, 3-room, etc. units instead|
|1room_rental, 2room_rental, 3room_rental, other_room_rental|Irrelevant for predicting `resale_price`|
|hawker_nearest_dist, hawker_within_500m, hawker_within_1km|The analysis focuses on the number of hawker centres within 2km, which already covers those within 500m and 1km|
|all latitude, longitude|Can be represented by the nearest-distance features|
|mrt_name, bus_stop_name|Not modelled in this project|
|all mall columns|Based on domain knowledge, malls should not have an impact on HDB prices|
|market_hawker, multistorey_carpark, precint_pavilion, bus_interchange, mrt_interchange, bus_stop_nearest_distance, vacancy|Very low correlation with `resale_price`|
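
As a minimal sketch of how the drops in the table above could be applied, the snippet below removes a list of columns with `DataFrame.drop`. The toy DataFrame, its values and the shortened `cols_to_drop` list are illustrative assumptions, not the notebook's actual code.

```python
import pandas as pd

# Toy frame standing in for train_edavis (values are made up for illustration)
toy = pd.DataFrame({
    "floor_area_sqm": [90.0, 110.0],
    "floor_area_sqft": [968.76, 1184.04],
    "price_per_sqft": [500.0, 450.0],
    "hdb_age": [15, 34],
    "resale_price": [484380.0, 532818.0],
})

# Hypothetical subset of the dropped features listed in the table
cols_to_drop = ["floor_area_sqm", "price_per_sqft", "postal_code"]

# errors="ignore" skips names that are not present (here: postal_code)
toy_filtered = toy.drop(columns=cols_to_drop, errors="ignore")
print(list(toy_filtered.columns))
```

Using `errors="ignore"` lets the same drop list be reused on the train and test frames even when a column exists in only one of them.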

## Data Pre-Processing

Data pre-processing supports both our analysis of the data and the subsequent modelling. We pre-process the categorical variables on 3 fronts:
* Convert the `pri_sch_name` and `sec_sch_name` school names into binary top-school indicators
* Binarise `planning_area` (Central vs. Non-Central regions)
* Binarise `town` (Mature vs. Non-Mature)

#### Convert categorical values `pri_sch_name` and `sec_sch_name` to numerical values

We binarise the schools into the top 10 schools and others (those that are not in the top 10) as follows:
* Top schools = 1
* Other schools = 0

The definition of the top schools is based on the following sources:
* [`2022 Primary School Ranking`](https://schoolbell.sg/primary-school-ranking/)
* [`2022 Secondary School Ranking`](https://schoolbell.sg/secondary-school-ranking/)

In [3]:
# create top school list for primary school and secondary school
top_pri_sch = ['Rosyth School','Nan Hua Primary School',"Saint Hilda's Primary School",
               'Catholic High School','Henry Park Primary School','Nanyang Primary School',
               'Tao Nan School',"Raffles Girls' School"]

top_sec_sch = ["Raffles Girls' School", 'Raffles Institution',"CHIJ Saint Nicholas Girls' School",
               'Anglo-Chinese School',"Methodist Girls' School",'Dunman High School','Catholic High School',
               "Cedar Girls' Secondary School",'Temasek Junior College','River Valley High School']

# convert school name to numerical value based on top school or not
# convert train datasets 
train_edavis['pri_sch_name'] = train_edavis['pri_sch_name'].apply(lambda x: 1 if x in top_pri_sch else 0)
train_edavis['sec_sch_name'] = train_edavis['sec_sch_name'].apply(lambda x: 1 if x in top_sec_sch else 0)

# convert test datasets
test_edavis['pri_sch_name'] = test_edavis['pri_sch_name'].apply(lambda x: 1 if x in top_pri_sch else 0)
test_edavis['sec_sch_name'] = test_edavis['sec_sch_name'].apply(lambda x: 1 if x in top_sec_sch else 0)
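
The same binarisation can also be written without `apply`, using pandas' vectorised `isin`. This is only a sketch under assumptions: the school list is shortened, the toy rows are made up, and the `top_pri_sch` column name is a hypothetical choice that keeps the original school name alongside the flag.

```python
import pandas as pd

# Shortened, illustrative top-school list
top_schools = ["Rosyth School", "Nanyang Primary School"]

# Toy rows, including a missing school name
toy = pd.DataFrame({"pri_sch_name": ["Rosyth School", "Some Other School", None]})

# isin() returns booleans; astype(int) turns them into the 1/0 flag
toy["top_pri_sch"] = toy["pri_sch_name"].isin(top_schools).astype(int)
print(toy["top_pri_sch"].tolist())
```

Writing the flag to a new column rather than overwriting `pri_sch_name` keeps the original names available for later checks.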

#### Create new column to binarise `planning_area`:
* Central Region (cr) = 1
* Non-Central Region (ncr) = 0

Source: [`Definition of CCR, RCR and NCR`](https://www.propertyguru.com.sg/property-guides/ccr-ocr-rcr-region-singapore-ura-map-21045). This article explains which region each `planning_area` belongs to.

Note: The source divides Singapore into 3 regions: CCR (Core Central Region), RCR (Rest of Central Region) and OCR (Outside Central Region). For this project, CCR and RCR are combined and defined as `Central Region`, while anything outside of `Central Region` is considered `Non-Central Region`.

In [4]:
#create list of central region 
cr = ['Bukit Timah','Downtown Core','Novena','Tanglin','Bishan',
      'Geylang','Kallang','Marine Parade','Outram','Queenstown',
      'Rochor','Toa Payoh', 'Bukit Merah']

#create list of non-central region
ncr = ['Ang Mo Kio', 'Bedok','Bukit Batok','Bukit Panjang', 'Changi', 'Choa Chu Kang',
       'Clementi','Hougang','Jurong East','Jurong West','Pasir Ris','Punggol','Sembawang',
       'Sengkang','Serangoon','Tampines','Western Water Catchment','Woodlands','Yishun']

#create empty list of region
region_train = []
for place in train_edavis['planning_area']:
    if place in cr:
        region_train.append(1)
    else:
        region_train.append(0)

#convert from list to dataframe
region_train = pd.DataFrame(region_train, columns = ['region'])

#combine to train datasets 
train_clean = pd.concat([train_edavis,region_train], axis=1)

#create region dataset for test model
region_test = []
for place in test_edavis['planning_area']:
    if place in cr:
        region_test.append(1)
    else:
        region_test.append(0)

#convert from list to dataframe
region_test = pd.DataFrame(region_test, columns = ['region'])

#combine to test datasets 
test_edavis = pd.concat([test_edavis,region_test], axis=1)
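
Because the loops above send every planning area that is not in `cr` to 0, a misspelt or unlisted area would silently be treated as non-central. A minimal sketch of a coverage check is shown below; the shortened lists and the toy DataFrame are illustrative assumptions.

```python
import pandas as pd

cr = ["Bishan", "Kallang"]        # shortened central list (illustrative)
ncr = ["Yishun", "Bukit Batok"]   # shortened non-central list (illustrative)

# Toy data containing one area that appears in neither list
toy = pd.DataFrame({"planning_area": ["Bishan", "Yishun", "Bukit Batok", "Tengah"]})

# Any planning area missing from both lists would otherwise default to 0
known = set(cr) | set(ncr)
unmatched = sorted(set(toy["planning_area"]) - known)
print(unmatched)
```

An empty `unmatched` list would confirm that every planning area in the data was classified deliberately rather than by default.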

In [5]:
train_edavis['region'] = train_edavis['region'].apply(lambda x: 1 if x == 'Central Region' else 0)

# Display the first few rows to verify the transformation
print(train_edavis.head())

       id             town  flat_type storey_range  flat_model  resale_price  \
0   88471  KALLANG/WHAMPOA     4 ROOM     10 TO 12     Model A      680000.0   
1  122598           BISHAN     5 ROOM     07 TO 09    Improved      665000.0   
2  170897      BUKIT BATOK  EXECUTIVE     13 TO 15   Apartment      838000.0   
3   86070           BISHAN     4 ROOM     01 TO 05     Model A      550000.0   
4  153632           YISHUN     4 ROOM     01 TO 03  Simplified      298000.0   

   tranc_year  mid_storey  lower  upper  ...  pri_sch_nearest_distance  \
0        2016          11     10     12  ...               1138.633422   
1        2012           8      7      9  ...                415.607357   
2        2013          14     13     15  ...                498.849039   
3        2012           3      1      5  ...                389.515528   
4        2017           2      1      3  ...                401.200584   

   pri_sch_name  vacancy  sec_sch_nearest_dist  sec_sch_name  cutoff_point

#### Convert `town` from categorical to numerical value by:

* Mature Estate = 1
* Non-Mature Estate = 0 

Source: [`Non-Mature and Mature Estates`](https://www.propertyguru.com.sg/property-guides/non-mature-vs-mature-bto-55760). This article explains which towns are classified as mature and non-mature estates.

In [6]:
# create mature estate list 
mature_estate_list = ['ANG MO KIO','BEDOK','BISHAN','BUKIT MERAH',
                      'BUKIT TIMAH','CENTRAL AREA','CLEMENTI',
                      'GEYLANG','KALLANG/WHAMPOA','MARINE PARADE',
                      'PASIR RIS','QUEENSTOWN','SERANGOON',
                      'TAMPINES','TOA PAYOH']

# binarise town in the train dataset
train_edavis['town'] = train_edavis['town'].apply(lambda x: 1 if x in mature_estate_list else 0)

# binarise town in the test dataset
test_edavis['town'] = test_edavis['town'].apply(lambda x: 1 if x in mature_estate_list else 0)

### **Model Creation**

Based on domain knowledge and the correlation coefficients, below is the list of selected features:

**Selected Features**

|Feature|Type|Remarks|
|---|---|---|
|town|*integer*|Binarised: mature estate = 1, non-mature estate = 0|
|floor_area_sqft|*float*|No changes from original dataset|
|hdb_age|*integer*|No changes from original dataset|
|max_floor_lvl|*integer*|No changes from original dataset|
|mid_storey|*integer*|No changes from original dataset|
|1room_sold|*integer*|No changes from original dataset|
|2room_sold|*integer*|No changes from original dataset|
|3room_sold|*integer*|No changes from original dataset|
|4room_sold|*integer*|No changes from original dataset|
|5room_sold|*integer*|No changes from original dataset|
|exec_sold|*integer*|No changes from original dataset|
|multigen_sold|*integer*|No changes from original dataset|
|studio_apartment_sold|*integer*|No changes from original dataset|
|cutoff_point|*integer*|No changes from original dataset|
|affiliation|*integer*|No changes from original dataset|
|sec_sch_name|*integer*|Binarised: top secondary school = 1, other secondary school = 0|
|pri_sch_name|*integer*|Binarised: top primary school = 1, other primary school = 0|
|mrt_nearest_distance|*float*|No changes from original dataset|
|pri_sch_nearest_distance|*float*|No changes from original dataset|
|sec_sch_nearest_dist|*float*|No changes from original dataset|
|hawker_within_2km|*float*|No changes from original dataset|
|region|*integer*|Added feature: Central Region = 1, Non-Central Region = 0|
|planning_area|*str*|No changes from original dataset|
|flat_type|*str*|No changes from original dataset|
|flat_model|*str*|No changes from original dataset|

#### Features Selection

#### Dummifying Variables

In [7]:
categorical_features = ['town', 'flat_type', 'flat_model', 'storey_range', 'planning_area', 'mrt_name', 'pri_sch_name', 'sec_sch_name', 'Region']
numeric_features = ['floor_area_sqft', 'max_floor_lvl', 'total_dwelling_units', '1room_sold', '2room_sold', '3room_sold', '4room_sold', 
                    '5room_sold', 'exec_sold', 'multigen_sold', 'studio_apartment_sold', 'mall_nearest_distance', 'mall_within_500m', 
                    'mall_within_1km', 'mall_within_2km', 'hawker_nearest_distance', 'hawker_within_500m', 'hawker_within_1km', 
                    'hawker_within_2km', 'hawker_food_stalls', 'hawker_market_stalls', 'mrt_nearest_distance', 'bus_interchange', 
                    'mrt_interchange', 'bus_stop_nearest_distance', 'pri_sch_nearest_distance', 'sec_sch_nearest_dist', 'vacancy', 
                    'cutoff_point', 'affiliation', 'central']

In [8]:
#dummify planning area
planning_area_dummies = pd.get_dummies(train_edavis['planning_area'], drop_first = True, dtype = int)

#dummify flat type
flat_type_dummies = pd.get_dummies(train_edavis['flat_type'], drop_first = True, dtype = int)

#dummify flat model
flat_model_dummies = pd.get_dummies(train_edavis['flat_model'], drop_first = True, dtype = int)

#### Baseline Model Comparison

In [11]:
# Load the dataset
df = pd.read_csv('../datasets/train_edavis.csv')

# Initialise the PyCaret setup with resale_price as the target
reg = setup(data=df, target='resale_price', session_id=123)

# Compare baseline models
best_model = compare_models()

# Print the results
results = pull()
print(results)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,resale_price
2,Target type,Regression
3,Original data shape,"(150634, 51)"
4,Transformed data shape,"(150634, 100)"
5,Transformed train set shape,"(105443, 100)"
6,Transformed test set shape,"(45191, 100)"
7,Numeric features,40
8,Categorical features,10
9,Preprocess,True


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
rf,Random Forest Regressor,18105.4253,632018365.1788,25138.5227,0.9691,0.0544,0.041,22.223
et,Extra Trees Regressor,18594.2108,669886299.5076,25880.4261,0.9673,0.0558,0.042,18.143
xgboost,Extreme Gradient Boosting,19105.507,671437011.2,25908.3936,0.9672,0.0564,0.0433,0.69
lightgbm,Light Gradient Boosting Machine,21012.7882,811894194.7603,28491.8428,0.9603,0.0615,0.0474,0.678
dt,Decision Tree Regressor,24493.4319,1184275981.5193,34408.969,0.9421,0.0747,0.0554,0.637
gbr,Gradient Boosting Regressor,28972.3255,1588343071.8924,39849.7808,0.9224,0.0831,0.0645,7.431
knn,K Neighbors Regressor,33201.3705,2168727795.2,46561.3602,0.894,0.1031,0.0765,1.755
br,Bayesian Ridge,40613.8984,2840599439.3314,53295.1595,0.8612,0.1191,0.0935,0.566
ridge,Ridge Regression,40613.5542,2840586635.0706,53295.0442,0.8612,0.1191,0.0935,0.256
lasso,Lasso Regression,40614.5759,2840666573.6782,53295.7822,0.8612,0.1191,0.0935,2.92


                                    Model           MAE           MSE  \
rf                Random Forest Regressor  1.810543e+04  6.320184e+08   
et                  Extra Trees Regressor  1.859421e+04  6.698863e+08   
xgboost         Extreme Gradient Boosting  1.910551e+04  6.714370e+08   
lightgbm  Light Gradient Boosting Machine  2.101279e+04  8.118942e+08   
dt                Decision Tree Regressor  2.449343e+04  1.184276e+09   
gbr           Gradient Boosting Regressor  2.897233e+04  1.588343e+09   
knn                 K Neighbors Regressor  3.320137e+04  2.168728e+09   
br                         Bayesian Ridge  4.061390e+04  2.840599e+09   
ridge                    Ridge Regression  4.061355e+04  2.840587e+09   
lasso                    Lasso Regression  4.061458e+04  2.840667e+09   
lr                      Linear Regression  4.061327e+04  2.840677e+09   
llar         Lasso Least Angle Regression  4.152032e+04  2.995973e+09   
en                            Elastic Net  4.368226

### Modeling

Here we improve on the HDB project by adding a second model. Based on the PyCaret comparison above, we use a Random Forest Regressor to predict HDB prices, and then compare it with our pre-existing model (Linear Regression).

In [9]:
# Load your dataset
df = pd.read_csv('../datasets/train_edavis.csv')  

# Define the features
categorical_features = ['town', 'flat_type', 'flat_model', 'storey_range', 'planning_area', 'mrt_name', 'pri_sch_name', 'sec_sch_name', 'Region']
numeric_features = ['floor_area_sqft', 'max_floor_lvl', 'total_dwelling_units', '1room_sold', '2room_sold', '3room_sold', '4room_sold', 
                    '5room_sold', 'exec_sold', 'multigen_sold', 'studio_apartment_sold', 'mall_nearest_distance', 'mall_within_500m', 
                    'mall_within_1km', 'mall_within_2km', 'hawker_nearest_distance', 'hawker_within_500m', 'hawker_within_1km', 
                    'hawker_within_2km', 'hawker_food_stalls', 'hawker_market_stalls', 'mrt_nearest_distance', 'bus_interchange', 
                    'mrt_interchange', 'bus_stop_nearest_distance', 'pri_sch_nearest_distance', 'sec_sch_nearest_dist', 'vacancy', 
                    'cutoff_point', 'affiliation', 'central']

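# Creating a new feature: transaction year minus hdb_age
# (despite its name, this yields a year-like value rather than an age)
df['age_of_flat'] = df['tranc_year'] - df['hdb_age']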
numeric_features.append('age_of_flat')

# Define preprocessing steps
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
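
To make the behaviour of the preprocessor concrete, here is a small self-contained sketch on made-up data. The toy column values and the `sparse_threshold=0` setting (used only to force a dense array that is easy to inspect) are assumptions for illustration and are not part of the notebook's pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: one numeric and one categorical feature
toy = pd.DataFrame({
    "floor_area_sqft": [700.0, 1000.0, 1300.0],
    "flat_type": ["3 ROOM", "4 ROOM", "5 ROOM"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["floor_area_sqft"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["flat_type"]),
    ],
    sparse_threshold=0,  # always return a dense array (illustration only)
)

# 1 scaled numeric column + 3 one-hot columns
encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)

# A category unseen during fit is encoded as all zeros instead of raising
unseen = pd.DataFrame({"floor_area_sqft": [1000.0], "flat_type": ["EXECUTIVE"]})
unseen_encoded = toy_preprocessor.transform(unseen)
print(unseen_encoded)
```

With `handle_unknown='ignore'`, a flat type that appears only in the test data does not break prediction, although the model receives no information about that category.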

#### Model #1: Linear Regression (Ridge)

In [13]:
# Define the model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RidgeCV(alphas=np.linspace(0.1, 5, 100), cv=5))
])

# Splitting data into features (X) and target (y)
X = df.drop(['resale_price', 'id'], axis=1)
y = df['resale_price']

# Ensure 'age_of_flat' is in the DataFrame before splitting
print(X.columns)

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# Fit the model
model_pipeline.fit(X_train, y_train)

# Predictions
y_train_preds = model_pipeline.predict(X_train)
y_test_preds = model_pipeline.predict(X_test)

# Cross-validation scores
cv_train_score = cross_val_score(model_pipeline, X_train, y_train, cv=5).mean()
cv_test_score = cross_val_score(model_pipeline, X_test, y_test, cv=5).mean()

# Print results
print(f'Ridge Regression cross_val_score on train: {cv_train_score}')
print(f'Ridge Regression cross_val_score on test: {cv_test_score}')
print(f"The model r_square value is: {metrics.r2_score(y_train, y_train_preds)}")
print(f"The model mean absolute percentage error is: {metrics.mean_absolute_percentage_error(y_train, y_train_preds)}")
print(f"The model mean squared error is: {metrics.mean_squared_error(y_train, y_train_preds)}")

Index(['town', 'flat_type', 'storey_range', 'flat_model', 'tranc_year',
       'mid_storey', 'lower', 'upper', 'mid', 'floor_area_sqft', 'hdb_age',
       'max_floor_lvl', 'market_hawker', 'multistorey_carpark',
       'total_dwelling_units', '1room_sold', '2room_sold', '3room_sold',
       '4room_sold', '5room_sold', 'exec_sold', 'multigen_sold',
       'studio_apartment_sold', 'planning_area', 'mall_nearest_distance',
       'mall_within_500m', 'mall_within_1km', 'mall_within_2km',
       'hawker_nearest_distance', 'hawker_within_500m', 'hawker_within_1km',
       'hawker_within_2km', 'hawker_food_stalls', 'hawker_market_stalls',
       'mrt_nearest_distance', 'mrt_name', 'bus_interchange',
       'mrt_interchange', 'bus_stop_nearest_distance',
       'pri_sch_nearest_distance', 'pri_sch_name', 'vacancy',
       'sec_sch_nearest_dist', 'sec_sch_name', 'cutoff_point', 'affiliation',
       'Region', 'central', 'region', 'age_of_flat'],
      dtype='object')
Ridge Regression cross_val_

#### Model #2: Random Forest Regressor

In [11]:
# Define the model pipeline
model_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(n_estimators=100, random_state=42))
])

# Splitting data into features (X) and target (y)
X = df.drop(['resale_price', 'id'], axis=1)
y = df['resale_price']

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

# Fit the model
model_pipeline.fit(X_train, y_train)

# Predictions
y_train_preds = model_pipeline.predict(X_train)
y_test_preds = model_pipeline.predict(X_test)

# Cross-validation scores
cv_train_score = cross_val_score(model_pipeline, X_train, y_train, cv=5).mean()
cv_test_score = cross_val_score(model_pipeline, X_test, y_test, cv=5).mean()

# Print results
print(f'Random Forest Regression cross_val_score on train: {cv_train_score}')
print(f'Random Forest Regression cross_val_score on test: {cv_test_score}')
print(f"The model r_square value is: {metrics.r2_score(y_train, y_train_preds)}")
print(f"The model mean absolute percentage error is: {metrics.mean_absolute_percentage_error(y_train, y_train_preds)}")
print(f"The model mean squared error is: {metrics.mean_squared_error(y_train, y_train_preds)}")

Random Forest Regression cross_val_score on train: 0.9635133784298381
Random Forest Regression cross_val_score on test: 0.9467493935805542
The model r_square value is: 0.9943061009083732
The model mean absolute percentage error is: 0.01797736227439258
The model mean squared error is: 117141628.12350309
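
The R², MAPE and MSE printed above are computed from `y_train` and `y_train_preds`, so they describe the fit on the training split. As a sketch, the same three metrics are written out below as plain NumPy functions that could equally be applied to `y_test` and `y_test_preds`. The function names and the numbers are made up for illustration; for strictly positive targets the formulas mirror scikit-learn's `r2_score`, `mean_absolute_percentage_error` and `mean_squared_error`.

```python
import numpy as np

def r2(y_true, y_pred):
    # 1 - (residual sum of squares / total sum of squares)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    # mean of absolute errors relative to the true values
    return np.mean(np.abs((y_true - y_pred) / y_true))

def mse(y_true, y_pred):
    # mean of squared errors
    return np.mean((y_true - y_pred) ** 2)

# Made-up held-out prices and predictions
y_held_out = np.array([400000.0, 500000.0, 600000.0])
y_held_out_preds = np.array([410000.0, 490000.0, 600000.0])

print(r2(y_held_out, y_held_out_preds))
print(mape(y_held_out, y_held_out_preds))
print(mse(y_held_out, y_held_out_preds))
```

Comparing these values between the training and held-out splits gives a direct view of how much the model overfits.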


In [12]:
# Save the model using pickle
with open('../model.pkl', 'wb') as file:
    pickle.dump(model_pipeline, file)
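
For completeness, a minimal sketch of the save-and-load round trip is shown below. A plain dictionary stands in for the fitted pipeline, and the file is written to a temporary directory, so every name and number here is a hypothetical placeholder.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted pipeline (a real run would pickle model_pipeline)
stand_in_model = {"intercept": 100000.0, "coef_per_sqft": 400.0}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save, mirroring the cell above
    with open(path, "wb") as file:
        pickle.dump(stand_in_model, file)

    # Load the object back from disk
    with open(path, "rb") as file:
        loaded_model = pickle.load(file)

# Use the loaded object; with the real pipeline this would be loaded_model.predict(X_new)
predicted = loaded_model["intercept"] + loaded_model["coef_per_sqft"] * 1000.0
print(predicted)
```

When loading a pickled scikit-learn pipeline, it is safest to use the same scikit-learn version that produced the file, and to unpickle only files from a trusted source.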

##### **Insights**

The Linear Regression model does not fit the data well. Since we already knew that Ridge achieves a higher cross-validation score than Lasso, we dropped Lasso from this comparison entirely.

Random Forest Regressor has a higher cross-validation score on both training (0.964) and testing (0.947) datasets compared to Ridge Regression, implying it performs consistently better across different subsets of data.

##### **Model Evaluation**

For the reasons below, we will proceed with deploying the predictive model based on the Random Forest Regressor. You may find the code in the notebook [here](asda).

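* Model Selection: The Random Forest Regressor significantly outperforms Ridge Regression across all evaluation metrics. It provides a much higher R² score, lower MAPE, and lower MSE, indicating that it is better at capturing the underlying patterns in the data and making accurate predictions. Note that the R², MAPE and MSE reported above are computed on the training split, so the cross-validation scores give the more reliable comparison.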

* Generalization: The cross-validation scores for the Random Forest Regressor indicate that it generalizes well to unseen data, making it a reliable choice for predicting resale prices of HDB flats in central Singapore.

* Business Impact: The high accuracy and low error rates of the Random Forest Regressor make it a valuable tool for stakeholders looking to make informed decisions regarding HDB resale prices. The ability to accurately predict prices can help buyers, sellers, and policymakers understand market trends and make data-driven decisions.

## **Conclusion**

Key Insights: 
- Yes, resale prices of central region HDBs seem to be influenced primarily by their location.
- Secondarily, buyers may be drawn to the generally larger floor areas of central HDBs. However, this may also be found in non-central mature estates.
- Other factors such as proximity to top schools, malls, hawker centres and connectivity via public transport don’t seem to affect resale prices in the central region as much.

Recommendation to Client:
- Client may consider towns that border the central region
    - More budget-friendly (resale price generally below $500k)
    - Towns such as Serangoon and Ang Mo Kio also offer good public transport connectivity
        - Serangoon has a bus and an MRT interchange
        - AMK has a bus interchange, and is one MRT stop away from an MRT interchange (Bishan)

Future Recommendation for Modelling:

|Model Limitation| Possible Solutions|
|---|---|
|Latest transactions in the dataset took place in 2021|Include transactions till 2023 to reflect the latest resale prices|
|Improved infrastructure such as new MRT lines and URA urban planning within certain towns may affect resale prices|Include “developing_towns” as an additional feature in dataset|
|Transaction volume alone may not be a sufficient indicator of supply & demand|Include “time_taken_to_sell” as an additional feature in dataset|

### Kaggle Submission of Predictive Model

In [None]:
test_edavis.head()

Unnamed: 0,id,town,flat_type,storey_range,flat_model,tranc_year,mid_storey,lower,upper,mid,...,mrt_interchange,bus_stop_nearest_distance,pri_sch_nearest_distance,pri_sch_name,vacancy,sec_sch_nearest_dist,sec_sch_name,cutoff_point,affiliation,region
0,114982,0,4 ROOM,07 TO 09,Simplified,2012,8,7,9,8,...,0,75.683952,426.46791,0,92,156.322353,0,218,0,0
1,95653,0,5 ROOM,04 TO 06,Premium Apartment,2019,5,4,6,5,...,0,88.993058,439.756851,0,45,739.371688,0,199,0,0
2,40303,1,3 ROOM,07 TO 09,New Generation,2013,8,7,9,8,...,0,86.303575,355.882207,0,36,305.071191,0,245,0,0
3,109506,0,4 ROOM,01 TO 03,New Generation,2017,2,1,3,2,...,0,108.459039,929.744711,0,54,433.454591,0,188,0,0
4,100149,0,4 ROOM,16 TO 18,Model A,2016,17,16,18,17,...,0,113.645431,309.926934,0,40,217.295361,0,223,0,0


In [None]:
#dummify planning_area
test_planning_area_dummy = pd.get_dummies(test_edavis['planning_area'], drop_first=True, dtype= int)

#dummify flat_type
test_flat_type_dummy = pd.get_dummies(test_edavis['flat_type'], drop_first = True, dtype = int)

#dummify flat_model
test_flat_model_dummy = pd.get_dummies(test_edavis['flat_model'], drop_first = True, dtype = int)

In [None]:
x_test_df = pd.concat([test_edavis[features],test_planning_area_dummy,test_flat_type_dummy,test_flat_model_dummy],axis =1)

In [None]:
X.shape

(150634, 78)

In [None]:
x_test_df.shape

(16737, 77)

As one column is missing from the test dataset (77 columns versus 78 in the training features), we need to add the missing column to the Kaggle test dataset and re-arrange the columns to match the training order.

In [None]:
#identify missing columns:
missing = []
for name in X.columns:
    if name not in x_test_df:
        missing.append(name)

In [None]:
#add to test dataframe 
x_test_df[missing] = 0

In [None]:
#re-arrange test columns to match the train column order
x_test_df = x_test_df[X.columns]
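
The three steps above (find the missing columns, fill them with 0, re-order) can also be expressed in a single call to `DataFrame.reindex`. The sketch below uses made-up column names and values as assumptions to show the idea.

```python
import pandas as pd

# Column order of the training features (illustrative)
train_cols = ["floor_area_sqft", "4 ROOM", "5 ROOM", "EXECUTIVE"]

# Toy test frame: different column order and no EXECUTIVE dummy
toy_test = pd.DataFrame({
    "5 ROOM": [0, 1],
    "floor_area_sqft": [900.0, 1200.0],
    "4 ROOM": [1, 0],
})

# reindex adds missing columns filled with 0 and applies the training order
aligned = toy_test.reindex(columns=train_cols, fill_value=0)
print(aligned.columns.tolist())
print(aligned["EXECUTIVE"].tolist())
```

One difference from the manual approach is that `reindex` also drops any test-only columns that are not in the training list.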

Scaling the test features

In [None]:
kaggle_X_test = x_test_df
kaggle_X_test_ss = ss.transform(kaggle_X_test)

Predict on the Test Dataset (Ridge Regression)

In [None]:
kaggle_y_pred = ridge.predict(kaggle_X_test_ss)

Create csv file for kaggle submission

In [None]:
#create dataframe for kaggle_y_pred
df_kaggle_y_pred = pd.DataFrame(kaggle_y_pred, columns=['Predicted'] )

In [None]:
#create csv file for kaggle submission
df_kaggle_csv = pd.concat([test_edavis['id'],df_kaggle_y_pred],axis=1)

In [None]:
#rename Id based on kaggle requirement
df_kaggle_csv = df_kaggle_csv.rename(columns={'id':'Id'})

In [None]:
df_kaggle_csv

Unnamed: 0,Id,Predicted
0,114982,337320.535102
1,95653,522952.238502
2,40303,344528.849175
3,109506,286539.612963
4,100149,457762.286174
...,...,...
16732,23347,352624.470409
16733,54003,513245.578587
16734,128921,417384.356268
16735,69352,490348.402881


In [None]:
# Exporting the kaggle file to CSV
df_kaggle_csv.to_csv('kaggle/kaggle-submission.csv', index = False)
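
Since `pd.concat(..., axis=1)` aligns rows by index, a few sanity checks before exporting can catch misaligned or missing predictions. The sketch below runs such checks on a made-up three-row submission; the `checks` dictionary and its keys are hypothetical names chosen for illustration.

```python
import pandas as pd

# Made-up ids and predictions standing in for the real submission
ids = pd.Series([114982, 95653, 40303], name="Id")
preds = pd.Series([337320.5, 522952.2, 344528.8], name="Predicted")
submission = pd.concat([ids, preds], axis=1)

checks = {
    "columns_ok": list(submission.columns) == ["Id", "Predicted"],
    "no_missing": not submission.isna().any().any(),
    "ids_unique": submission["Id"].is_unique,
    "row_count_ok": len(submission) == len(ids),
}
print(checks)
```

If `no_missing` or `row_count_ok` were False, the usual cause would be that the two objects being concatenated do not share the same index.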

#### Proof of Submission

![Kaggle Submission](kaggle/group-kaggle-submission.png)