# ReadMe: Get Location Data
The code here is part of Phase 2 of the Singapore HDB Resale Prices project: <https://carolynkpi.shinyapps.io/app_hdb/>. 

The code gets location data for each HDB block. The data includes:

1. Nearest MRT/LRT station
2. Radial distance to nearest MRT station. 
3. Walking distance to nearest MRT station. 

4. Number of MRT/LRT stations nearby.
5. Number of bus stations nearby.
6. Number of schools nearby.
7. Number of food places nearby. 
8. Number of shopping places nearby. 

I am using 3 Google APIs. Each API requires a separate key.  

1. Google Maps Geocoding (Location data 1)
2. Google Maps Distance Matrix (Location data 3)
3. Google Places (Location data 4-8)

While I really **love** Google because it is **free**, there is a daily usage limit on each API.  

1. Google Maps Geocoding - 2500 requests per day, 50 requests per second
2. Google Maps Distance Matrix - 2500 elements per day, 100 elements per second. 
3. Google Places - 1000 requests per day by default, upgradable for free to 150,000 requests per day if billing is enabled. 

Given this constraint, the code is designed to: 
- run several times across several days (to keep within usage limits)
- save each data point as it is retrieved (so that returned data points are not lost if the code stops unexpectedly)
- log failed requests separately for retrying later (to make sure retries do not exceed the usage limits)  
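
The three design goals above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: `run_batch`, `load_done`, the `fetch` callable and the CSV layout are all hypothetical names and choices made for this example.

```python
import csv
import os


def load_done(path):
    """Return the set of addresses already recorded in a CSV file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="") as f:
        return {row[0] for row in csv.reader(f) if row}


def run_batch(addresses, fetch, results_path, failed_path, daily_limit):
    """Query up to daily_limit new addresses, saving each result immediately.

    `fetch` is a hypothetical callable that takes an address and returns
    one value, raising an exception when the request fails.
    """
    done = load_done(results_path)
    used = 0
    for addr in addresses:
        if addr in done:
            continue  # already retrieved in an earlier run
        if used >= daily_limit:
            break  # stop before exceeding the quota; rerun another day
        used += 1
        try:
            value = fetch(addr)
        except Exception as exc:
            # Log the failure separately instead of retrying right away
            with open(failed_path, "a", newline="") as f:
                csv.writer(f).writerow([addr, str(exc)])
            continue
        # Append right away so a crash does not lose this data point
        with open(results_path, "a", newline="") as f:
            csv.writer(f).writerow([addr, value])
    return used
```

In this sketch a failed address is not written to the results file, so a later run picks it up again while still counting it against that run's limit.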

While the project is done mostly in R, this data acquisition via API is done in Python. 
There is no special reason for that; it is simply more convenient because I am more familiar with Python when it comes to APIs. 

## Import Libraries

In [37]:
import pandas as pd
import requests
import json
import time

## Review Data
Let us first review our HDB data [Source: <https://data.gov.sg/dataset/resale-flat-prices>].

In [53]:
dataFile = 'resale_prices_2015_.csv'
raw = pd.read_csv(dataFile)
print(raw.head(), '\n')
address = 'Blk ' + raw['block'] + ' ' + raw['street_name'] 
print('Number of data points: ', len(address))
print('Number of unique data points:', len(address.unique()))


     month        town flat_type block        street_name storey_range  \
0  2015-01  ANG MO KIO    3 ROOM   174   ANG MO KIO AVE 4     07 TO 09   
1  2015-01  ANG MO KIO    3 ROOM   541  ANG MO KIO AVE 10     01 TO 03   
2  2015-01  ANG MO KIO    3 ROOM   163   ANG MO KIO AVE 4     01 TO 03   
3  2015-01  ANG MO KIO    3 ROOM   446  ANG MO KIO AVE 10     01 TO 03   
4  2015-01  ANG MO KIO    3 ROOM   557  ANG MO KIO AVE 10     07 TO 09   

   floor_area_sqm      flat_model  lease_commence_date  remaining_lease  \
0            60.0        Improved                 1986               70   
1            68.0  New Generation                 1981               65   
2            69.0  New Generation                 1980               64   
3            68.0  New Generation                 1979               63   
4            68.0  New Generation                 1980               64   

   resale_price  
0      255000.0  
1      275000.0  
2      285000.0  
3      290000.0  
4      290000.0  

To eliminate redundant API requests, each address (i.e. each HDB block) should be queried only once for each type of request. 

Now let us look at the list of Singapore MRT/LRT stations [Source: <https://en.wikipedia.org/wiki/List_of_Singapore_MRT_stations>]. The data is obtained from Wikipedia, but I have done some manual editing to the data using Excel. 

In [57]:
mrtFile = 'MRT.csv'
mrt = pd.read_csv(mrtFile)
print(mrt.head(), '\n')
print('Number of MRT/LRT stations: ', len(mrt))

  Station Code Station Name  Planning Area      Region     NS     EW     CG  \
0         NS10    Admiralty      Woodlands       NORTH   True  False  False   
1          EW9     Aljunied        Geylang     CENTRAL  False   True  False   
2         NS16   Ang Mo Kio     Ang Mo Kio  NORTH-EAST   True  False  False   
3          SE3        Bakau       Sengkang  NORTH-EAST  False  False  False   
4          BP9      Bangkit  Bukit Panjang        WEST  False  False  False   

      NE     CC     DT                 ...                  LRT/MRT  \
0  False  False  False                 ...                      MRT   
1  False  False  False                 ...                      MRT   
2  False  False  False                 ...                      MRT   
3  False  False  False                 ...                      LRT   
4  False  False  False                 ...                      LRT   

   No of lines  No of lines opened  Opening 1 Opening 2  Opening 3  \
0            1              

## Data Acquisition Strategy
Finding the nearest MRT/LRT station and its distance is a little tricky. We could use the Google Maps Distance Matrix API to get the walking distance from every HDB block to every MRT/LRT station, but that would result in 8260 HDB x 190 stations = 1,569,400 requests, requiring 628 days = 1.7 years given the free usage limit of 2500 per day! 

To work around this problem, I have taken the following Steps 1-4: 

1. **Get the coordinates for each HDB block and each MRT/LRT station using the Google Maps Geocoding API.** 
There will be 8260 HDB + 190 MRT = 8450 requests, requiring 3.4 days. 
2. **Calculate the radial distance from each HDB block to each MRT/LRT station.**
There will be 8260 HDB x 190 MRT = 1,569,400 distances to calculate locally. I am using the Vincenty distance from the geopy library. This task runs quickly.
3. **Find the nearest MRT/LRT station to each HDB block by finding the minimal radial distance.** 
4. **Get the walking distance from each HDB block to the nearest MRT/LRT station found in Step 3, using the Google Maps Distance Matrix API.**
There will be 8260 requests, requiring 3.3 days.

Steps 1-4 will get us Location Data 1 - 3. 
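
Steps 2 and 3 can be sketched as below. Note the assumptions: this sketch swaps in the haversine (great-circle) formula from the standard library instead of the Vincenty distance from geopy that the project uses, so distances will differ slightly; the function names are hypothetical, and the coordinates are rough illustrative values, not geocoded results.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0088  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def nearest_station(block, stations):
    """Return (station name, radial distance in km) for the closest station.

    block: (lat, lon) tuple; stations: dict mapping name -> (lat, lon).
    """
    name = min(stations, key=lambda s: haversine_km(*block, *stations[s]))
    return name, haversine_km(*block, *stations[name])


# Illustrative, approximate coordinates only (assumed for this example)
stations = {
    "Ang Mo Kio": (1.3700, 103.8495),
    "Admiralty": (1.4406, 103.8010),
}
block = (1.3750, 103.8370)
print(nearest_station(block, stations))
```

Only the single nearest station per block is then passed on to the Distance Matrix API in Step 4, which is what keeps the request count at one per block.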

Location Data 4-8 requires querying the Google Places API directly (Step 5). 
Google classifies each place according to the types here: <https://developers.google.com/places/web-service/supported_types>. 
A close look at the different types tells us that we need to make at least 8 different types of queries to collect Location Data 4-8. Results are returned in batches of 20, up to 3 pages. Hence, there could be up to 8260 HDB x 8 types x 3 pages = 198,240 requests, requiring 1.3 days if we enable billing for the Google API (*update:* we will see in Step 5's code that this is not true, because the requests become limited by the inter-query sleep time instead of by the Google API). 
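
A minimal sketch of counting places across result pages is shown below. It is not the project's Step 5 code: `count_places` and `http_fetch` are hypothetical helpers, the standard-library `urllib` is used here instead of `requests` to keep the example dependency-free, and the 2-second pause is an assumed value. The sketch relies on the Nearby Search response carrying a `results` list and, when more pages exist, a `next_page_token` that is passed back as `pagetoken`.

```python
import json
import time
import urllib.parse
import urllib.request

NEARBY_URL = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"


def http_fetch(params):
    """Send one Nearby Search request and return the parsed JSON response."""
    url = NEARBY_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def count_places(params, fetch=http_fetch, max_pages=3, sleep_s=2.0):
    """Count results over up to max_pages pages of one query."""
    total = 0
    page_params = dict(params)
    for page in range(max_pages):
        data = fetch(page_params)
        total += len(data.get("results", []))
        token = data.get("next_page_token")
        if not token or page == max_pages - 1:
            break
        # Assumed pause: the next-page token needs a short delay before use
        time.sleep(sleep_s)
        page_params = {**params, "pagetoken": token}
    return total
```

A call would look something like `count_places({"location": "1.3750,103.8370", "radius": 500, "type": "school", "key": "YOUR_KEY"})`, where the radius and key are placeholders. With a pause between pages, the time per block is dominated by sleeping rather than by the API quota.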

Completing the location data acquisition took me approximately 2 weeks. 
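
The request budgets quoted in this section can be re-derived with a few lines of arithmetic. The counts and limits below are taken from the text above; the variable and function names are just for this example.

```python
def days_needed(requests, per_day):
    """Days of quota consumed by a given number of requests."""
    return requests / per_day


N_HDB = 8260           # unique HDB blocks
N_MRT = 190            # MRT/LRT stations
FREE_LIMIT = 2500      # Geocoding / Distance Matrix free quota per day
PLACES_LIMIT = 150000  # Places quota per day with billing enabled

budget = {
    "brute_force": days_needed(N_HDB * N_MRT, FREE_LIMIT),  # every block x every station
    "step_1": days_needed(N_HDB + N_MRT, FREE_LIMIT),       # geocoding
    "step_4": days_needed(N_HDB, FREE_LIMIT),               # walking distance
    "step_5": days_needed(N_HDB * 8 * 3, PLACES_LIMIT),     # 8 types x 3 pages
}
for name, days in budget.items():
    print(f"{name}: {days:.1f} days")
```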

## Codes
The code is written in a modular way, i.e. each step can be run separately and its results saved progressively. This minimizes repeat runs, as the process is computationally expensive. 

**Files to be imported:** 

myFunctions.py: <link for myFunctions>. This file contains the common functions needed to query Google's APIs, including tryGET, tryGETp, getParamString and getKeys. 

placesTracker.py: <link for placesTracker>. This file implements a tracker for usage of the Google Places API. Billing is enabled on the Places API, and I wanted to make sure that I did not accidentally exceed the free limit of 150,000 requests. It turned out not to be necessary, since the requests ended up being limited by the inter-query sleep time. This file is imported by myFunctions.py. It relies on a log file called 'PlacesTracker.txt' residing in the same project directory. I have included a 'PlacesTracker - Master.txt' file for starting off a new project. 
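
For readers who want a feel for what such a tracker does, here is a purely hypothetical sketch. It is not the contents of placesTracker.py, and the real format of 'PlacesTracker.txt' is not shown in this ReadMe; this example simply assumes a text file holding a single running count.

```python
import os


class UsageTracker:
    """Hypothetical file-backed counter that refuses to pass a limit."""

    def __init__(self, path, limit):
        self.path = path    # text file holding the running count (assumed format)
        self.limit = limit  # maximum number of requests allowed

    def used(self):
        """Return the count stored in the file, or 0 if there is none yet."""
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            text = f.read().strip()
        return int(text) if text else 0

    def record(self, n=1):
        """Add n requests to the count, raising if the limit would be exceeded."""
        total = self.used() + n
        if total > self.limit:
            raise RuntimeError("usage limit would be exceeded")
        with open(self.path, "w") as f:
            f.write(str(total))
        return total
```

Calling `record()` before each Places request would make the script stop with an error instead of silently going past the limit.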

**Files for each step:**

Step 1: <link for MRT> <link for HDB>

Step 2: <link>

Step 3: <link> 

Step 4: <link>

Step 5: <link>

Because some steps can be split into several separate runs, this file is used to combine their outputs. <br>
Combine_Files: <link>


## Integrating Results from Each Step
Finally, the results from each step are integrated to produce one final output csv file for subsequent analysis in the project. 
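
As a rough illustration of this integration, the per-block location data can be joined back onto the resale records by address. The sketch below assumes pandas; the location column names (`nearest_mrt`, `n_schools`) and all values in the location table are placeholders invented for this example, as is the third resale row.

```python
import pandas as pd

# A few resale records (the third row is made up to show a repeated block)
raw = pd.DataFrame({
    "block": ["174", "541", "174"],
    "street_name": ["ANG MO KIO AVE 4", "ANG MO KIO AVE 10", "ANG MO KIO AVE 4"],
    "resale_price": [255000.0, 275000.0, 260000.0],
})
raw["address"] = "Blk " + raw["block"] + " " + raw["street_name"]

# One row per unique address, as produced by the earlier steps
# (column names and values are placeholders)
location = pd.DataFrame({
    "address": ["Blk 174 ANG MO KIO AVE 4", "Blk 541 ANG MO KIO AVE 10"],
    "nearest_mrt": ["Ang Mo Kio", "Ang Mo Kio"],
    "n_schools": [3, 5],
})

# Left join keeps every resale record; validate guards against
# duplicated addresses in the location table
final = raw.merge(location, on="address", how="left", validate="many_to_one")
print(final[["address", "resale_price", "nearest_mrt", "n_schools"]])
```

The merged frame can then be written out with `final.to_csv(...)` for the R side of the project.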