
## In this project, we will do the following:

1. Call the taxi availability API from Data.gov.sg to collect taxi data (Part I)
2. Perform data cleaning (Part II)
3. Perform exploratory data analysis (Part III)
4. Train a machine learning model (Part IV)

# Introduction

In Singapore, all taxis are connected to a central system that tracks their positions at all times. Better still, the Singaporean government collects this data, and anyone, including you, can obtain it for analysis.

We will collect the data for one month in 2019, analyse it, and then build a model on it.

In this notebook, you will do the following:
1. Import the pandas and requests libraries
2. Call the Taxi Availability API from Data.gov.sg
3. Organize the JSON data
4. Export your DataFrame as a CSV file 

### Step 1: Import the following libraries
- pandas
- requests

In [None]:
# Step 1: Import the libraries you need
import pandas as pd
import requests

### Step 2: Visit the API website
Data from https://data.gov.sg/dataset/taxi-availability 


### Step 3: Test a single API call with requests
Now that we've seen how it's done in the browser, we will use Python to make an API call.

Here is what I did:
1. use requests to get the response from the API URL found on the website in Step 2
2. save the response in a variable
3. use .json() to get the JSON data


In [None]:
# Step 5a: use requests to make a GET API call at the URL and assign it to a variable
import datetime

url = "https://api.data.gov.sg/v1/transport/taxi-availability"

# Format the timestamp as YYYY-MM-DDTHH:MM:SS for the date_time parameter
date = datetime.datetime(2019, 2, 1, 14, 30, 0)
params = {"date_time": date.strftime("%Y-%m-%dT%H:%M:%S")}
print(params["date_time"])
data = requests.get(url, params=params)

# Step 5b: declare another variable, and save the JSON in it
import json  # used in a later cell for json.dumps
data_json = data.json()

# Step 5c: peek at your JSON
data_json
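
Before going further, it can help to confirm what a single response contains. The sketch below is not part of the original exercise: `summarise_snapshot` is a hypothetical helper, and the field names it reads (`features`, `geometry.coordinates`, `properties.timestamp`, `properties.taxi_count`) are taken from the response printed in this notebook. The `sample` dictionary is a hand-made, shortened stand-in for `data_json`.

```python
def summarise_snapshot(payload):
    """Return the timestamp, reported taxi count and number of coordinate pairs."""
    # The responses shown in this notebook hold a single feature
    feature = payload["features"][0]
    properties = feature["properties"]
    return {
        "timestamp": properties["timestamp"],
        "taxi_count": properties["taxi_count"],
        "n_points": len(feature["geometry"]["coordinates"]),
    }


# Hand-made sample with the same layout as the API response (values shortened)
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "MultiPoint",
                "coordinates": [[103.6282, 1.31345], [103.64017, 1.33163]],
            },
            "properties": {
                "timestamp": "2019-02-01T14:29:58+08:00",
                "taxi_count": 2,
                "api_info": {"status": "healthy"},
            },
        }
    ],
}

summary = summarise_snapshot(sample)
print(summary)
```

Calling `summarise_snapshot(data_json)` on the live response would let you compare `n_points` with the reported `taxi_count`.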

In [None]:
# Spot check of one formatted timestamp (requires date_range_string from Step 7)
date_range_string[700].strip()

'2019-01-03T0:20:00'

### Step 4: Turn the JSON response into a DataFrame
We'll practise turning a JSON response into a DataFrame directly first.

![JSONtoDataFrame.png](attachment:JSONtoDataFrame.png)


In [None]:
json.dumps(data_json)

'{"type": "FeatureCollection", "crs": {"type": "link", "properties": {"href": "http://spatialreference.org/ref/epsg/4326/ogcwkt/", "type": "ogcwkt"}}, "features": [{"type": "Feature", "geometry": {"type": "MultiPoint", "coordinates": [[103.622996833333, 1.2750035], [103.6282, 1.31345], [103.6282, 1.31349], [103.6395715, 1.33025033333333], [103.64017, 1.33163], [103.64043, 1.33667], [103.6454, 1.3244], [103.65026, 1.32023], [103.65201, 1.32936], [103.65693, 1.32447], [103.66125, 1.32051], [103.6656, 1.30536], [103.66564, 1.30481], [103.66797, 1.31188], [103.67052, 1.3199], [103.67778, 1.34732], [103.678230833333, 1.32755733333333], [103.6791, 1.31465], [103.6824, 1.3439], [103.689151, 1.3402765], [103.691898666667, 1.34645383333333], [103.692461333333, 1.34280316666667], [103.69317, 1.37397], [103.694473666667, 1.34603716666667], [103.696197833333, 1.34168483333333], [103.696216, 1.34558], [103.69646, 1.32717], [103.69655, 1.34173], [103.696945, 1.347927], [103.697, 1.3558], [103.697741

In [None]:
!pip install fsspec

Collecting fsspec
  Downloading fsspec-2022.1.0-py3-none-any.whl (133 kB)

In [None]:
# Step 6: Turn the JSON response directly into a DataFrame
# record_path=['features'] makes each entry of the 'features' list one row
df_nested_list = pd.json_normalize(data_json, record_path=['features'])
df_nested_list

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.timestamp,properties.taxi_count,properties.api_info.status
0,Feature,MultiPoint,"[[103.622996833333, 1.2750035], [103.6282, 1.3...",2019-02-01T14:29:58+08:00,2549,healthy
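
To see what `record_path` does without calling the API, here is a small sketch on a hand-made dictionary shaped like the response above (the values are shortened and purely illustrative). Nested keys are flattened into dotted column names, which is where `geometry.coordinates` and `properties.taxi_count` come from.

```python
import pandas as pd

# Illustrative stand-in for data_json; not real API output
sample = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "MultiPoint",
                "coordinates": [[103.6282, 1.31345], [103.64017, 1.33163]],
            },
            "properties": {
                "timestamp": "2019-02-01T14:29:58+08:00",
                "taxi_count": 2,
                "api_info": {"status": "healthy"},
            },
        }
    ],
}

# One row per entry in 'features'; nested dictionaries become dotted columns
df_sample = pd.json_normalize(sample, record_path=["features"])
print(df_sample.columns.tolist())
print(df_sample.shape)
```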


### Step 5: Get the JSON's "features" only


In [None]:
# Step 7a: Declare a new variable that contains only your 'features' from the JSON
features = data_json['features']

# Step 7b: Turn it into a DataFrame
df = pd.json_normalize(features)
df.head()

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.timestamp,properties.taxi_count,properties.api_info.status
0,Feature,MultiPoint,"[[103.622996833333, 1.2750035], [103.6282, 1.3...",2019-02-01T14:29:58+08:00,2549,healthy
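
The `geometry.coordinates` cell holds every taxi position for that snapshot as a list of pairs. As a sketch of how one might unpack it (this is not one of the original steps), the pairs can be turned into their own DataFrame. GeoJSON stores each point as [longitude, latitude]; the column names and the `coords_to_frame` helper are hypothetical.

```python
import pandas as pd


def coords_to_frame(coordinates):
    """Turn a list of [longitude, latitude] pairs into a two-column DataFrame."""
    return pd.DataFrame(coordinates, columns=["longitude", "latitude"])


# The first three pairs from the response printed earlier in this notebook
coordinates = [
    [103.622996833333, 1.2750035],
    [103.6282, 1.31345],
    [103.6282, 1.31349],
]
points = coords_to_frame(coordinates)
print(points)
```

With the live data, this could be called as `coords_to_frame(df.loc[0, "geometry.coordinates"])`.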



### Step 6: Dissect the API call to get a pattern
Now that we have successfully turned the JSON into a DataFrame containing one row, we can proceed to call the API for January 2019.

We want to be granular, but not too granular, so we will request data at 5-minute intervals. For example:
1. Start at 2019-01-01T00:00:00
2. The next one is 2019-01-01T00:05:00
3. Continue until 2019-01-31T00:00:00


### Step 9: Generate the date range
We are going to create a list containing every date and time at 5-minute intervals between 2019-01-01T00:00:00 and 2019-01-31T00:00:00.


In [None]:
# Step 9: Generate a date range in 5-min intervals
# ISO dates avoid day/month ambiguity; "5min" is the 5-minute frequency alias
date_range = pd.date_range(start='2019-01-01', end='2019-01-31', freq="5min")
date_range

DatetimeIndex(['2019-01-01 00:00:00', '2019-01-01 00:05:00',
               '2019-01-01 00:10:00', '2019-01-01 00:15:00',
               '2019-01-01 00:20:00', '2019-01-01 00:25:00',
               '2019-01-01 00:30:00', '2019-01-01 00:35:00',
               '2019-01-01 00:40:00', '2019-01-01 00:45:00',
               ...
               '2019-01-30 23:15:00', '2019-01-30 23:20:00',
               '2019-01-30 23:25:00', '2019-01-30 23:30:00',
               '2019-01-30 23:35:00', '2019-01-30 23:40:00',
               '2019-01-30 23:45:00', '2019-01-30 23:50:00',
               '2019-01-30 23:55:00', '2019-01-31 00:00:00'],
              dtype='datetime64[ns]', length=8641, freq='5T')
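
The length of 8641 can be checked by hand: 30 full days at 288 five-minute steps per day, plus the closing timestamp at midnight on 31 January. A minimal sketch of that arithmetic, using only the standard library:

```python
from datetime import datetime, timedelta

start = datetime(2019, 1, 1)
end = datetime(2019, 1, 31)
step = timedelta(minutes=5)

# Both endpoints are included, hence the + 1
n_timestamps = (end - start) // step + 1
print(n_timestamps)  # 8641
```

Note that the range stops at 00:00 on 31 January, so the rest of that day is not covered.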

### Step 7: Generate a list of datetimes in the proper format for the API
As you may have noticed, the values in the date range are not yet in a format suitable for calling the API.

You'll need a list containing properly formatted date and time strings.


1. Create a list containing your date, along with the string "T"
2. Create a list containing your hour
3. Create a list containing your minute

In [None]:
# Step 10a: Format each timestamp as YYYY-MM-DDTHH:MM:SS
# Step 10b: no manual %3A is needed; requests URL-encodes the ":" when the string is passed via params
date_range_string = date_range.strftime("%Y-%m-%dT%H:%M:%S").tolist()
date_range_string

['2019-01-01T00:00:00',
 '2019-01-01T00:05:00',
 '2019-01-01T00:10:00',
 '2019-01-01T00:15:00',
 '2019-01-01T00:20:00',
 '2019-01-01T00:25:00',
 '2019-01-01T00:30:00',
 '2019-01-01T00:35:00',
 '2019-01-01T00:40:00',
 '2019-01-01T00:45:00',
 '2019-01-01T00:50:00',
 '2019-01-01T00:55:00',
 '2019-01-01T01:00:00',
 '2019-01-01T01:05:00',
 '2019-01-01T01:10:00',
 '2019-01-01T01:15:00',
 '2019-01-01T01:20:00',
 '2019-01-01T01:25:00',
 '2019-01-01T01:30:00',
 '2019-01-01T01:35:00',
 '2019-01-01T01:40:00',
 '2019-01-01T01:45:00',
 '2019-01-01T01:50:00',
 '2019-01-01T01:55:00',
 '2019-01-01T02:00:00',
 '2019-01-01T02:05:00',
 '2019-01-01T02:10:00',
 '2019-01-01T02:15:00',
 '2019-01-01T02:20:00',
 '2019-01-01T02:25:00',
 '2019-01-01T02:30:00',
 '2019-01-01T02:35:00',
 '2019-01-01T02:40:00',
 '2019-01-01T02:45:00',
 '2019-01-01T02:50:00',
 '2019-01-01T02:55:00',
 '2019-01-01T03:00:00',
 '2019-01-01T03:05:00',
 '2019-01-01T03:10:00',
 '2019-01-01T03:15:00',
 '2019-01-01T03:20:00',
 '2019-01-01T03:
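
A note on the `%3A` mentioned in the Step 10b comment: `%3A` is the percent-encoded form of a colon. The sketch below uses the standard library's `urllib.parse.urlencode` to show what a timestamp looks like once it is encoded into a query string. When the value is passed through the `params` argument of `requests.get`, as in this notebook, the encoding is handled for you, so the plain string is enough.

```python
from urllib.parse import urlencode

timestamp = "2019-01-01T00:05:00"

# urlencode percent-encodes reserved characters such as ":" in the value
query = urlencode({"date_time": timestamp})
print(query)  # date_time=2019-01-01T00%3A05%3A00
```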

### Step 8: Make your API calls for the entire duration (takes 1-2 hours)

This is the sequence of events:
1. declare your base URL string
2. declare variable containing an empty list
3. use a for loop to loop through the list of strings containing dates
4. in each loop, combine the base URL string with the date
5. perform the API call
6. get the response and extract only the features
7. turn the features into a DataFrame
8. append the DataFrame into the list you initialized
9. after the entire loop, concatenate all of the DataFrames you have in the list into a combined DataFrame

In [None]:
# Step 11: Make your API calls and collect the raw JSON responses
url = "https://api.data.gov.sg/v1/transport/taxi-availability"
data_list = []

# enumerate gives a running count that doubles as a progress indicator
for count, timestamp in enumerate(date_range_string, start=1):
    params = {"date_time": timestamp}
    response = requests.get(url, params=params)
    data_list.append(response.json())
    print(count)
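
The loop above stores each raw JSON response. A variant that follows the listed sequence more literally (extract the features, build one DataFrame per call, concatenate at the end) might look like the sketch below. `fetch_json` and `collect_snapshots` are hypothetical names, and the 30-second timeout and the `raise_for_status()` check are assumptions added for robustness, not something the original notebook does. The `fetch` parameter exists only so the loop can be tried with a stand-in function instead of the live API.

```python
import pandas as pd
import requests

URL = "https://api.data.gov.sg/v1/transport/taxi-availability"


def fetch_json(timestamp, timeout=30):
    """Request one snapshot and return its JSON; raises on HTTP errors."""
    response = requests.get(URL, params={"date_time": timestamp}, timeout=timeout)
    response.raise_for_status()
    return response.json()


def collect_snapshots(timestamps, fetch=fetch_json):
    """Fetch every timestamp and combine the 'features' into one DataFrame."""
    frames = []
    for timestamp in timestamps:
        payload = fetch(timestamp)
        frame = pd.json_normalize(payload["features"])
        frame["time"] = timestamp  # remember which request produced the row
        frames.append(frame)
    return pd.concat(frames, ignore_index=True)


def fake_fetch(timestamp):
    """Stand-in for the API so the sketch runs offline (illustrative values)."""
    return {
        "features": [
            {
                "type": "Feature",
                "geometry": {"type": "MultiPoint", "coordinates": [[103.63, 1.31]]},
                "properties": {"timestamp": timestamp, "taxi_count": 1},
            }
        ]
    }


combined = collect_snapshots(
    ["2019-01-01T00:00:00", "2019-01-01T00:05:00"], fetch=fake_fetch
)
print(combined[["properties.taxi_count", "time"]])
```

For the real run, `collect_snapshots(date_range_string)` would use `fetch_json` by default.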

In [None]:
# Load the saved API responses; the 'data' column holds one JSON response per row
data = pd.read_json("/content/drive/MyDrive/taxi data/data.txt")
data.head()

In [None]:
data["data"][1]

{'crs': {'properties': {'href': 'http://spatialreference.org/ref/epsg/4326/ogcwkt/',
   'type': 'ogcwkt'},
  'type': 'link'},
 'features': [{'geometry': {'coordinates': [[103.63213, 1.31121],
     [103.63766, 1.30045],
     [103.65474, 1.31342],
     [103.68523, 1.34969],
     [103.68578, 1.35084],
     [103.6873, 1.32112],
     [103.69082, 1.34283],
     [103.69117, 1.34623],
     [103.69163, 1.34406],
     [103.69169, 1.34408],
     [103.69214, 1.34748],
     [103.69563, 1.34215],
     [103.69903, 1.34562],
     [103.6996, 1.3453300000000001],
     [103.69969, 1.3443800000000001],
     [103.69995, 1.33939],
     [103.70076, 1.3362],
     [103.7016, 1.3223500000000001],
     [103.70181, 1.35985],
     [103.70182, 1.33987],
     [103.70218, 1.3420800000000002],
     [103.70382, 1.34216],
     [103.7039, 1.34074],
     [103.70391, 1.33576],
     [103.70508, 1.35244],
     [103.70626, 1.34429],
     [103.70899, 1.3403],
     [103.71033, 1.34625],
     [103.71039, 1.34737],
     [103.7107

In [None]:


# # use a for loop in the list you got from Step 10

#     # combine the base_url and the current date in the for loop
    
#     # make a get request using the combined URL
    
#     # get the JSON from the response of the get request
    
#     # declare a variable which contains only the 'features' part of the JSON response
    
#     # turn the variable into a DataFrame
    
#     # append the dataframe into the empty list above
    

# # concatenate all of the dataframes you appended into the empty list


In [None]:
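# Flatten the 'features' of every saved response into one DataFrame per API call
dataframes = []
for response_json in data["data"]:
    df = pd.json_normalize(response_json['features'])
    dataframes.append(df)

# Concatenate all of the per-call DataFrames into one combined DataFrame
final_data = pd.concat(dataframes)
final_data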

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.timestamp,properties.taxi_count,properties.api_info.status
0,Feature,MultiPoint,"[[103.6267, 1.307992], [103.63226, 1.30884], [...",2018-12-31T23:59:44+08:00,5887,healthy
0,Feature,MultiPoint,"[[103.63213, 1.31121], [103.63766, 1.30045], [...",2019-01-01T00:04:44+08:00,4001,healthy
0,Feature,MultiPoint,"[[103.63145, 1.31125], [103.6376, 1.3002479999...",2019-01-01T00:09:44+08:00,5981,healthy
0,Feature,MultiPoint,"[[103.63132, 1.3216], [103.63314, 1.32474], [1...",2019-01-01T00:14:45+08:00,5461,healthy
0,Feature,MultiPoint,"[[103.628, 1.31262], [103.63714, 1.29914], [10...",2019-01-01T00:19:45+08:00,5003,healthy
...,...,...,...,...,...,...
0,Feature,MultiPoint,"[[103.62689, 1.31369], [103.62953, 1.30178], [...",2019-01-30T23:39:40+08:00,5782,healthy
0,Feature,MultiPoint,"[[103.615898833333, 1.27034783333333], [103.62...",2019-01-30T23:44:40+08:00,5843,healthy
0,Feature,MultiPoint,"[[103.624481, 1.30293333333333], [103.62871, 1...",2019-01-30T23:49:40+08:00,5825,healthy
0,Feature,MultiPoint,"[[103.62935, 1.2973], [103.62964, 1.29373], [1...",2019-01-30T23:54:40+08:00,5783,healthy


### Step 12: Create a new column "time" in the new DataFrame
Well done! Hope that didn't take long. 

Now that we have this exciting new DataFrame, there is one more thing to do: create a new column called time containing the date and time of each request. Just use the date range that you generated in Step 9.

![CombinedDataFrameAPIFinalExpectation.png](attachment:CombinedDataFrameAPIFinalExpectation.png)

Once you're done, check that the DataFrame has:
1. 7 columns
2. 8,641 rows

In [None]:
# Step 12: create the 'time' column and assign it with the list from Step 9
final_data["time"]=date_range
final_data


Unnamed: 0,type,geometry.type,geometry.coordinates,properties.timestamp,properties.taxi_count,properties.api_info.status,time
0,Feature,MultiPoint,"[[103.6267, 1.307992], [103.63226, 1.30884], [...",2018-12-31T23:59:44+08:00,5887,healthy,2019-01-01 00:00:00
0,Feature,MultiPoint,"[[103.63213, 1.31121], [103.63766, 1.30045], [...",2019-01-01T00:04:44+08:00,4001,healthy,2019-01-01 00:05:00
0,Feature,MultiPoint,"[[103.63145, 1.31125], [103.6376, 1.3002479999...",2019-01-01T00:09:44+08:00,5981,healthy,2019-01-01 00:10:00
0,Feature,MultiPoint,"[[103.63132, 1.3216], [103.63314, 1.32474], [1...",2019-01-01T00:14:45+08:00,5461,healthy,2019-01-01 00:15:00
0,Feature,MultiPoint,"[[103.628, 1.31262], [103.63714, 1.29914], [10...",2019-01-01T00:19:45+08:00,5003,healthy,2019-01-01 00:20:00
...,...,...,...,...,...,...,...
0,Feature,MultiPoint,"[[103.62689, 1.31369], [103.62953, 1.30178], [...",2019-01-30T23:39:40+08:00,5782,healthy,2019-01-30 23:40:00
0,Feature,MultiPoint,"[[103.615898833333, 1.27034783333333], [103.62...",2019-01-30T23:44:40+08:00,5843,healthy,2019-01-30 23:45:00
0,Feature,MultiPoint,"[[103.624481, 1.30293333333333], [103.62871, 1...",2019-01-30T23:49:40+08:00,5825,healthy,2019-01-30 23:50:00
0,Feature,MultiPoint,"[[103.62935, 1.2973], [103.62964, 1.29373], [1...",2019-01-30T23:54:40+08:00,5783,healthy,2019-01-30 23:55:00
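
The two checks listed above (7 columns, 8,641 rows) can also be automated. The helper below is a hypothetical sketch, not part of the original notebook; it is demonstrated on a tiny stand-in frame so that it runs on its own.

```python
import pandas as pd


def check_final_frame(frame, expected_rows=8641, expected_columns=7):
    """Raise AssertionError if the shape is unexpected or 'time' is missing."""
    rows, columns = frame.shape
    assert rows == expected_rows, f"expected {expected_rows} rows, got {rows}"
    assert columns == expected_columns, (
        f"expected {expected_columns} columns, got {columns}"
    )
    assert "time" in frame.columns, "missing 'time' column"
    return True


# Tiny stand-in frame with two rows and two columns
demo = pd.DataFrame(
    {
        "properties.taxi_count": [5887, 4001],
        "time": pd.date_range("2019-01-01", periods=2, freq="5min"),
    }
)
print(check_final_frame(demo, expected_rows=2, expected_columns=2))
```

On the real data, `check_final_frame(final_data)` uses the default expectations of 8,641 rows and 7 columns.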


### Step 13: Export your DataFrame as CSV 

In [None]:
# Step 13: Export your DataFrame to CSV
final_data.to_csv("/content/drive/MyDrive/taxi data/taxi_jan2019_data.csv")
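
One thing to be aware of when reading the CSV back in later parts: list-valued cells such as `geometry.coordinates` are written as text, so they come back as strings, and the DataFrame index is written as an extra unnamed column unless `index=False` is passed. The sketch below demonstrates both on a tiny stand-in frame, writing to an in-memory buffer rather than to Google Drive. Parsing the string with `json.loads` is one possible way to recover the list; it is not something the original notebook does.

```python
import io
import json

import pandas as pd

# Stand-in frame with one list-valued cell (illustrative values)
demo = pd.DataFrame(
    {
        "geometry.coordinates": [[[103.6267, 1.307992], [103.63226, 1.30884]]],
        "properties.taxi_count": [2],
    }
)

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # index=False avoids an extra 'Unnamed: 0' column
buffer.seek(0)

restored = pd.read_csv(buffer)
raw_cell = restored.loc[0, "geometry.coordinates"]
print(type(raw_cell))  # the list was written as text, so this is a str

# Parse the text back into nested Python lists
coordinates = json.loads(raw_cell)
print(coordinates[0])
```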