In [None]:
import numpy as np
import pandas as pd
from geopy.distance import distance

## 6. Feature Engineering - Other Features <a name='eng-2'></a>
[Back to top](#Contents)<br>

From the coordinates of each apartment, we can engineer several location-based features to help our models predict prices.

### 6.1 Evaluate the number of MRT stations and primary schools near each listing <a name='eng-21'></a>

We obtained two files, <b>school.csv</b> and <b>mrt.csv</b>.

They contain the coordinates of primary schools and MRT stations respectively. From these two files, we can get the number of MRT stations and primary schools near each listing, using the <b>geopy</b> package.

The source for <b>mrt.csv</b> is <a href="https://www.kaggle.com/yxlee245/singapore-train-station-coordinates">https://www.kaggle.com/yxlee245/singapore-train-station-coordinates</a>.
The file <b>school.csv</b> was provided courtesy of Cassandra.

In [70]:
house = pd.read_csv('../data/data_cleaned.csv')

In [71]:
# Helper function: for each listing, count the facilities within `dis` metres.
# Uses explicit loops because geopy's `distance` handles one pair of points at a time.
def facilities(coord, house_coord, dis):
    nearby = np.zeros(len(house_coord))[:, None]
    for i in range(len(house_coord)):
        count = 0
        for j in range(len(coord)):
            cal = distance(house_coord[i], coord[j]).m
            if cal <= dis:
                count += 1
        nearby[i] = count
    return nearby

For MRT stations, we define "nearby" as within 500 m, a reasonable walking distance for a commuter.

For primary schools, we use a 1 km radius instead. This matters for primary schools with more registrants than vacancies, where priority admission is based on the child's citizenship and home-to-school distance. First priority goes to students living within 1 km of the school.
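
The two radius rules can also be illustrated without geopy. The sketch below is not the notebook's method: it swaps geopy's geodesic distance for the haversine (spherical) formula, which is cheaper to compute but can differ slightly from the geodesic result. The names `haversine_m` and `count_within` and the sample coordinates are hypothetical and chosen only for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def count_within(listing, points, radius_m):
    """Count the points whose distance to the listing is at most radius_m."""
    lat, lon = listing
    return sum(haversine_m(lat, lon, p_lat, p_lon) <= radius_m for p_lat, p_lon in points)

# Illustrative coordinates only, not taken from mrt.csv or school.csv
listing = (1.3000, 103.8000)
stations = [(1.3030, 103.8000), (1.3100, 103.8000)]  # roughly 334 m and 1112 m north

print(count_within(listing, stations, 500))   # 500 m rule used for MRT stations
print(count_within(listing, stations, 1000))  # 1 km rule used for primary schools
```

The same formula carries over to NumPy arrays, which would avoid the nested Python loops at the cost of the small spherical approximation error.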

In [72]:
mrt = pd.read_csv('../data/mrt.csv')
school = pd.read_csv('../data/school.csv')

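# keep only the coordinate columns
school.drop(columns=['name'], inplace=True)
mrt.drop(columns=['station_name', 'type'], inplace=True)

# geopy expects (latitude, longitude) pairs; this assumes the remaining
# columns of mrt and school are stored in that order
house_coord = np.array(house[['latitude', 'longitude']])
mrt_coord = np.array(mrt)
sch_coord = np.array(school)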

nearby_mrt = facilities(mrt_coord,house_coord,500)
nearby_sch = facilities(sch_coord, house_coord, 1000)

### 6.2 Evaluate the age of each apartment <a name='eng-22'></a>

We use the age of each building, measured relative to 2020, instead of its year of construction.
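
As a minimal sketch of this conversion, the function below computes an age from a construction year. The `reference_year` parameter is hypothetical: the notebook hard-codes 2020, and the parameter is shown only to make that assumption explicit.

```python
def building_age(built_year, reference_year=2020):
    """Age in years relative to reference_year (hypothetical parameter; the notebook uses 2020)."""
    return reference_year - int(built_year)

print(building_age(1995))  # 25
```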

In [73]:
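built_on = house['Built'].astype(int)
# age relative to 2020; kept as a 1-D array because `Series[:, None]`
# indexing is not supported in recent pandas versions
age_of_building = (2020 - built_on).to_numpy()
house.drop(columns=['Built'], inplace=True)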

### 6.3 Other apartment-related features <a name='eng-23'></a>

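We define a few more apartment-level features: the estimated total bathroom area and total bedroom area (taken as 5% and 12.5% of the floor area per room, respectively), the remaining living area, a convenience score (the number of nearby MRT stations plus the number of nearby schools), and the distance from a central reference point at Orchard. We then add the engineered features to the housing data.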
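
To make the area split concrete, here is a small worked sketch using the per-room coefficients from the code (0.05 per bathroom, 0.125 per bedroom). The function name `area_features` and its keyword parameters are hypothetical. The input values are those of the first listing in the output table (1270 sq ft, 2 bathrooms, 3 bedrooms).

```python
def area_features(area, bathrooms, bedrooms, bath_share=0.05, bed_share=0.125):
    """Split a floor area into estimated bathroom, bedroom and remaining area."""
    bathroom_area = bath_share * bathrooms * area
    bedroom_area = bed_share * bedrooms * area
    remaining_area = area - bathroom_area - bedroom_area
    return bathroom_area, bedroom_area, remaining_area

bath, bed, rest = area_features(1270.0, 2, 3)
print(round(bath, 2), round(bed, 2), round(rest, 2))  # 127.0 476.25 666.75
```

Because the shares scale with the room counts, the remaining area shrinks as rooms are added and would turn negative once 0.05 × bathrooms + 0.125 × bedrooms exceeds 1, so it is worth checking this column for very large units.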

In [74]:
# Helper function to calculate the distance (in km) from the Orchard reference point to a listing
def dist_bet_coords(x):
    orchard = (1.303991, 103.831782)
    place2_coords = (x.latitude, x.longitude)
    return distance(orchard, place2_coords).km


house['school'] = nearby_sch
house['mrt'] = nearby_mrt
house['age_of_building'] = age_of_building
house['bathroom_area'] = 0.05 * house['bathroom'] * house['area']
house['bedroom_area'] = 0.125 * house['bedroom'] * house['area']
house['remaining_area'] = house['area'] - house['bathroom_area'] - house['bedroom_area']
house['convenience'] = house['mrt'] + house['school']
house['distance_from_central'] = house.apply(dist_bet_coords, axis = 1)

In [75]:
house = house[[c for c in house if c not in ['price_sqft']] + ['price_sqft']]
print("Number of features: ", house.shape[1])
house.head()

Number of features:  15


|   | area | bathroom | bedroom | Type | latitude | longitude | school | mrt | age_of_building | bathroom_area | bedroom_area | remaining_area | convenience | distance_from_central | price_sqft |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1270.0 | 2 | 3 | HDB | 1.345383 | 103.746047 | 3.0 | 0.0 | 25 | 127.0 | 476.25 | 666.75 | 3.0 | 10.582454 | 429.92 |
| 1 | 1066.0 | 2 | 2 | Condo | 1.386702 | 103.743679 | 3.0 | 1.0 | 16 | 106.6 | 266.5 | 692.9 | 4.0 | 13.408199 | 919.32 |
| 2 | 926.0 | 2 | 2 | Condo | 1.295316 | 103.827096 | 1.0 | 0.0 | 6 | 92.6 | 231.5 | 601.9 | 1.0 | 1.091837 | 2699.78 |
| 3 | 668.0 | 1 | 1 | Condo | 1.280772 | 103.85266 | 0.0 | 3.0 | 12 | 33.4 | 83.5 | 551.1 | 3.0 | 3.462779 | 1871.26 |
| 4 | 1959.0 | 4 | 4 | Condo | 1.313388 | 103.827361 | 1.0 | 0.0 | 20 | 391.8 | 979.5 | 587.7 | 1.0 | 1.14965 | 2118.43 |


In [76]:
house.to_csv('../data/data_eng.csv', index=False)