# **Part I: Data collection**

## Note: This part (Part I) is kept in a separate file from the later parts (Part II - Part VI) because collecting the data takes a long time.

### After running the code in this file, we obtain data with the following features:


*  **Up to date**: The collection period runs from April 1 to June 17, 2021, covering roughly the latest two and a half months, sampled every 15 minutes (**7393 timestamps**).
* **Large volume**: The raw data collected for all parks over this period contains **902338** rows, which helps us improve the model accuracy. (The data used for our vacancy prediction for Lee Garden One Car Park contains **6890** rows.)
*  **Requested types**: As described in the problem statement, we only collect data from car parks with “type” value “P” and “vacancy_type” value “A”.
* **Different parks**: Although we focus on the basic problem, this data collection part gathers data for all the parks.

### We organize the data into three files:


*   Data_Onepark.csv: This *.csv* document is the one we used to solve our basic problem for the given Lee Garden One Car Park.
*   Data_Multiparks.csv: This *.csv* document is the one that can be used to solve the advanced problem, with information on all parks included.
*   Data_Basic.csv: This *.csv* document contains the basic information of each park, e.g. geographic information, telephone number, address, district, etc.


### **Step 1: Import the libraries for data collection**

In [1]:
import json
import requests
import pandas as pd
import numpy as np

### **Step 2: Set intervals for the time period we choose for our data collection.**

In [2]:
data_interval = pd.date_range(start="2021/4/1", end="2021/6/17", freq="15T")

In [3]:
data_interval[36]

Timestamp('2021-04-01 09:00:00', freq='15T')

In [4]:
data_interval2 = list(data_interval)

In [5]:
data_interval2[-1]

Timestamp('2021-06-17 00:00:00', freq='15T')

In [6]:
# https://api.data.gov.hk/v1/historical-archive/get-file?url=https%3A%2F%2Fresource.data.one.gov.hk%2Ftd%2Fcarpark%2Fvacancy_all.json&time=
# 0210309-0900

data_interval2[-1].strftime('%Y%m%d-%H%M')

'20210617-0000'

### **Step 3: Time format conversion**

In [7]:
# convert every timestamp to the 'YYYYMMDD-HHMM' form used in the URL
converted_dates = [date.strftime('%Y%m%d-%H%M') for date in data_interval2]

len(converted_dates)

7393
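
The count of 7393 can be cross-checked by hand: the range spans 77 days, each day contributes 96 quarter-hour steps, and one more timestamp is added because both endpoints are included. A minimal sketch of that arithmetic follows; it uses the alias `"15min"`, which is assumed here to be interchangeable with the `"15T"` used above.

```python
import pandas as pd

start = pd.Timestamp("2021-04-01")
end = pd.Timestamp("2021-06-17")

days = (end - start).days             # whole days between the two endpoints
steps_per_day = 24 * 60 // 15         # quarter-hour steps in one day
expected = days * steps_per_day + 1   # +1 because both endpoints are included

timestamps = pd.date_range(start=start, end=end, freq="15min")
print(days, steps_per_day, expected, len(timestamps))
```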

In [8]:
converted_dates[-1]

'20210617-0000'

In [9]:
base_url = 'https://api.data.gov.hk/v1/historical-archive/get-file?url=https%3A%2F%2Fresource.data.one.gov.hk%2Ftd%2Fcarpark%2Fvacancy_all.json&time=' 

def combine_url(date):
  return base_url+date

map(combine_url, converted_dates)

<map at 0x7f47389a3690>

In [10]:
# the url for each time interval
url_list = list(map(combine_url, converted_dates))
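
The `url=` parameter inside `base_url` is the percent-encoded address of the vacancy feed. As an illustrative sketch (the helper name `archive_url` and the two constants are assumptions, not part of the notebook), the same request URL can be assembled with the standard library so the encoding does not have to be typed by hand:

```python
from urllib.parse import quote

ARCHIVE_ENDPOINT = "https://api.data.gov.hk/v1/historical-archive/get-file"
VACANCY_FEED = "https://resource.data.one.gov.hk/td/carpark/vacancy_all.json"

def archive_url(time_code: str) -> str:
    """Hypothetical helper: build the archive URL for a 'YYYYMMDD-HHMM' code."""
    # safe="" makes quote() percent-encode ':' and '/' as well
    encoded_feed = quote(VACANCY_FEED, safe="")
    return f"{ARCHIVE_ENDPOINT}?url={encoded_feed}&time={time_code}"

print(archive_url("20210617-0000"))
```

The result should match `base_url + '20210617-0000'` character for character, so either construction can feed `url_list`.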

### **Step 4: The core algorithm for collecting data from the government API**

In [None]:
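result_df_list = []

for url in url_list:
  # download and parse the archived vacancy snapshot for this timestamp
  json_data = requests.get(url).content
  data_CarPark = json.loads(json_data)
  # one row per (car park, vehicle type), with park_id carried along
  CarPark_typeP = pd.json_normalize(data_CarPark['car_park'], record_path=['vehicle_type'], meta=['park_id'])
  # keep only type 'P', then drop the now-constant 'type' column
  CarPark_P = CarPark_typeP[CarPark_typeP['type'] == 'P']
  CarPark_notype = CarPark_P.drop(['type'], axis=1)
  # expand the nested 'service_category' list into one row per category
  CarPark_d2 = CarPark_notype.to_dict('records')
  CarPark_d3 = pd.json_normalize(CarPark_d2, record_path=['service_category'], meta=['park_id'])
  # keep only vacancy_type 'A', as required by the problem statement
  Car_vacancy = CarPark_d3[CarPark_d3['vacancy_type'] == 'A']
  result_df_list.append(Car_vacancy)

# stack the per-timestamp frames into a single DataFrame
vacancy_df = pd.concat(result_df_list)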
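
Because the loop issues thousands of requests, a single failed download would stop the whole run. The sketch below is one possible way to make the collection more tolerant, not the method used in this notebook: the flattening logic is wrapped in a function, and the download step is passed in as a `fetch` callable so that failures are recorded instead of raised. The names `extract_vacancy`, `collect`, `fake_fetch`, and the `vacancy` field in the synthetic payload are assumptions made for illustration.

```python
import pandas as pd

def extract_vacancy(payload: dict) -> pd.DataFrame:
    """Flatten one parsed snapshot; keep type 'P' and vacancy_type 'A'."""
    by_type = pd.json_normalize(
        payload["car_park"], record_path=["vehicle_type"], meta=["park_id"])
    type_p = by_type[by_type["type"] == "P"].drop(columns=["type"])
    by_category = pd.json_normalize(
        type_p.to_dict("records"), record_path=["service_category"], meta=["park_id"])
    return by_category[by_category["vacancy_type"] == "A"]

def collect(urls, fetch):
    """Hypothetical loop: fetch(url) returns a parsed dict or raises."""
    frames, failed = [], []
    for url in urls:
        try:
            frames.append(extract_vacancy(fetch(url)))
        except Exception:
            failed.append(url)  # remember the URL so it can be retried later
    return pd.concat(frames, ignore_index=True), failed

# Synthetic payload; the 'vacancy' field is an assumption for illustration.
sample = {"car_park": [
    {"park_id": "10", "vehicle_type": [
        {"type": "P", "service_category": [
            {"vacancy_type": "A", "vacancy": 5},
            {"vacancy_type": "B", "vacancy": 1}]},
        {"type": "M", "service_category": [
            {"vacancy_type": "A", "vacancy": 2}]}]}]}

def fake_fetch(url):
    """Stand-in for a real download; fails for the URL 'bad'."""
    if url == "bad":
        raise ValueError("simulated download failure")
    return sample

df, failed = collect(["ok-1", "bad", "ok-2"], fake_fetch)
print(df)
print(failed)
```

With the real API, `fetch` could be something like `lambda url: requests.get(url, timeout=30).json()`, and the URLs left in `failed` could be passed through `collect` again once the run finishes.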

### **Step 5: Save the data to a .csv file for later use**

In [None]:
vacancy_df.to_csv(r'/content/sample_data/Data_Multiparks.csv', index = False)


### **Step 6: Read the file for the basic information**

In [None]:
# read the basic-information document and get the geographical information
url_Basic = 'https://resource.data.one.gov.hk/td/carpark/basic_info_all.json'
data_Basic= json.loads(requests.get(url_Basic).content)
data_dis = pd.json_normalize(data_Basic['car_park'])

In [None]:
# show the data
data_dis

In [None]:
# drop the columns that are not needed in later parts
data_dis = data_dis.drop(['name_tc',
               'name_sc',
               'displayAddress_en',
               'displayAddress_tc',
               'displayAddress_sc',
               'district_tc',
               'district_sc',
               'contactNo',
               'height',
               'remark_en',
               'remark_tc',
               'remark_sc',
               'website_en',
               'website_tc',
               'website_sc',
               'carpark_photo'],axis=1)
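
Most of the dropped columns are the `_tc` and `_sc` variants of fields that also exist in `_en` form. As a sketch on a toy frame (the values below are made up), those variants can be selected by suffix with `DataFrame.filter(regex=...)`, leaving only the remaining columns to be listed explicitly:

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the cell above.
toy = pd.DataFrame({
    "park_id": ["10"],
    "name_en": ["Example Car Park"],
    "name_tc": ["x"],
    "name_sc": ["x"],
    "district_en": ["Example District"],
    "district_tc": ["x"],
    "district_sc": ["x"],
    "contactNo": ["0000"],
})

# columns whose names end in _tc or _sc
localized = toy.filter(regex=r"_(tc|sc)$").columns
trimmed = toy.drop(columns=list(localized) + ["contactNo"])
print(list(trimmed.columns))
```

Note that this pattern would keep `displayAddress_en`, `remark_en`, and `website_en`, which the explicit list above removes, so those would still need to be named.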

In [None]:
# check opening_status
data_dis['opening_status'].value_counts()

In [None]:
# keep only the rows whose opening_status is 'OPEN' or 'open'
data_diso = data_dis[(data_dis['opening_status'] == 'OPEN') | (data_dis['opening_status'] == 'open')]
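
The filter above matches two exact spellings. A possible alternative, shown on a toy frame with made-up rows, is to normalize the case before comparing. Unlike the original condition, this also accepts mixed-case spellings such as 'Open', so it is a slightly broader filter rather than an exact equivalent.

```python
import pandas as pd

# Toy data; the status values are invented to exercise each case.
toy = pd.DataFrame({
    "park_id": ["1", "2", "3", "4"],
    "opening_status": ["OPEN", "open", "CLOSED", None],
})

# upper-case the status, then compare; missing values compare as False
is_open = toy["opening_status"].str.upper().eq("OPEN")
open_parks = toy[is_open]
print(open_parks["park_id"].tolist())
```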

In [None]:
# drop column 'opening_status' 
data_geo = data_diso.drop(['opening_status'], axis=1)

In [None]:
# first check how many distinct park_id values there are
len(data_geo['park_id'].unique()) 
# so all the parks are different 
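
The distinct count only demonstrates uniqueness when it is compared with the number of rows. A short sketch of a more direct check, on a toy frame that deliberately contains a repeated id:

```python
import pandas as pd

# Toy frame with one deliberately repeated park_id
toy = pd.DataFrame({"park_id": ["10", "12", "12", "15"]})

# True only when no park_id value is repeated
print(toy["park_id"].is_unique)

# list every row that shares its park_id with another row
duplicates = toy[toy["park_id"].duplicated(keep=False)]
print(duplicates)
```

Applied to `data_geo`, `data_geo['park_id'].is_unique` would confirm the statement above in a single expression.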

In [None]:
data_geo['district_en'].value_counts()

In [None]:
data_geo.to_csv(r'/content/sample_data/Data_Basic.csv', index = False)