### NOTE
To get a complete view of the notebook (including the interactive maps), please copy the [GitHub link](https://github.com/chengzwk/Porto-taxi/blob/main/exploratory_analysis.ipynb) of this notebook and paste it into [nbviewer](https://nbviewer.org), as folium maps are not rendered natively on GitHub.

# Exploratory analysis on the original data

As the first step of the project, we perform an exploratory analysis on the original data. 

### Dataset
- Basic information about the dataset: it covers a complete year (from 01/07/2013 to 30/06/2014) of trajectories for all 442 taxis running in the city of Porto, Portugal, and is provided as a single CSV file named "train.csv".
- Detailed information can be found on the [dataset page](https://www.kaggle.com/datasets/crailtap/taxi-trajectory/data).

### Load the dataset
Opening "train.csv" in a text editor (in my case TextEdit on macOS) shows that the file already contains a header row and that the fields are separated by commas.

In [2]:
import pandas as pd

raw_data = pd.read_csv('train.csv', sep=',')
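The header check can also be done without leaving Python. The following is a minimal sketch: `peek_csv` is a hypothetical helper (not part of the original notebook), and the sample text is a stand-in built from the column names and the first row shown later in this notebook, with the POLYLINE shortened to two points for illustration.

```python
import csv
import io
from itertools import islice


def peek_csv(stream, n_rows=1):
    """Return the first row (assumed to be the header) and the next n_rows rows."""
    reader = csv.reader(stream)
    header = next(reader)
    rows = list(islice(reader, n_rows))
    return header, rows


# Stand-in for the top of train.csv (assumption: the real file may quote
# its fields differently); POLYLINE is shortened to its first two points.
sample = io.StringIO(
    'TRIP_ID,CALL_TYPE,ORIGIN_CALL,ORIGIN_STAND,TAXI_ID,TIMESTAMP,'
    'DAY_TYPE,MISSING_DATA,POLYLINE\n'
    '1372636858620000589,C,,,20000589,1372636858,A,False,'
    '"[[-8.618643,41.141412],[-8.618499,41.141376]]"\n'
)
header, rows = peek_csv(sample)
print(header)       # the 9 column names
print(rows[0][-1])  # the quoted POLYLINE field, commas preserved
```

With the real file, the same helper could be called as `with open('train.csv', newline='') as f: header, rows = peek_csv(f)`.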



### Inspect the data
We can inspect the data in the following way:

In [3]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1710670 entries, 0 to 1710669
Data columns (total 9 columns):
 #   Column        Dtype  
---  ------        -----  
 0   TRIP_ID       int64  
 1   CALL_TYPE     object 
 2   ORIGIN_CALL   float64
 3   ORIGIN_STAND  float64
 4   TAXI_ID       int64  
 5   TIMESTAMP     int64  
 6   DAY_TYPE      object 
 7   MISSING_DATA  bool   
 8   POLYLINE      object 
dtypes: bool(1), float64(2), int64(3), object(3)
memory usage: 106.0+ MB


In [4]:
raw_data.describe()

Unnamed: 0,TRIP_ID,ORIGIN_CALL,ORIGIN_STAND,TAXI_ID,TIMESTAMP
count,1710670.0,364770.0,806579.0,1710670.0,1710670.0
mean,1.388622e+18,24490.363018,30.272381,20000350.0,1388622000.0
std,9180944000000000.0,19624.290043,17.74784,211.2405,9180944.0
min,1.372637e+18,2001.0,1.0,20000000.0,1372637000.0
25%,1.380731e+18,6593.0,15.0,20000170.0,1380731000.0
50%,1.388493e+18,18755.0,27.0,20000340.0,1388493000.0
75%,1.39675e+18,40808.0,49.0,20000520.0,1396750000.0
max,1.404173e+18,63884.0,63.0,20000980.0,1404173000.0


In [5]:
raw_data.head()
# raw_data.tail()

Unnamed: 0,TRIP_ID,CALL_TYPE,ORIGIN_CALL,ORIGIN_STAND,TAXI_ID,TIMESTAMP,DAY_TYPE,MISSING_DATA,POLYLINE
0,1372636858620000589,C,,,20000589,1372636858,A,False,"[[-8.618643,41.141412],[-8.618499,41.141376],[..."
1,1372637303620000596,B,,7.0,20000596,1372637303,A,False,"[[-8.639847,41.159826],[-8.640351,41.159871],[..."
2,1372636951620000320,C,,,20000320,1372636951,A,False,"[[-8.612964,41.140359],[-8.613378,41.14035],[-..."
3,1372636854620000520,C,,,20000520,1372636854,A,False,"[[-8.574678,41.151951],[-8.574705,41.151942],[..."
4,1372637091620000337,C,,,20000337,1372637091,A,False,"[[-8.645994,41.18049],[-8.645949,41.180517],[-..."


We can see that there are 9 columns in total (TRIP_ID, CALL_TYPE, ORIGIN_CALL, ORIGIN_STAND, TAXI_ID, TIMESTAMP, DAY_TYPE, MISSING_DATA, POLYLINE) and 1710670 entries. The minimum and maximum timestamp values are 1.372637e+09 and 1.404173e+09, respectively. The timestamps are Unix timestamps (in seconds), which we can convert to a readable format by:

In [6]:
import datetime

dt_start = datetime.datetime.fromtimestamp(1.372637e+09)
dt_end = datetime.datetime.fromtimestamp(1.404173e+09)
print("start time is {}, end time is {}".format(dt_start, dt_end))

start time is 2013-07-01 08:03:20, end time is 2014-07-01 08:03:20


We can see that the data starts on July 1st 2013 and ends on July 1st 2014, in agreement with the dataset description. By looking at the summary of the 'TIMESTAMP' column, we can also see that the quartiles are roughly evenly spaced between the minimum and the maximum, so the data is roughly evenly distributed throughout the year.
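Two details of this step can be made explicit. `datetime.fromtimestamp` without a `tz` argument returns local time, so the hour printed above depends on the machine running the notebook. The claim about an even spread can be checked from the quartile gaps. The sketch below uses only the rounded values printed by `raw_data.describe()`, so the results are approximate.

```python
from datetime import datetime, timezone

# Rounded values copied from the TIMESTAMP column of raw_data.describe()
quantiles = {
    "min": 1372637000,
    "25%": 1380731000,
    "50%": 1388493000,
    "75%": 1396750000,
    "max": 1404173000,
}

# Time-zone-aware conversion: the result does not depend on the local machine
utc_times = {
    name: datetime.fromtimestamp(ts, tz=timezone.utc)
    for name, ts in quantiles.items()
}
for name, t in utc_times.items():
    print(name, t.isoformat())

# Days between consecutive quantiles; similar gaps suggest an even spread
values = list(quantiles.values())
gaps_in_days = [round((b - a) / 86400, 1) for a, b in zip(values, values[1:])]
print(gaps_in_days)  # [93.7, 89.8, 95.6, 85.9]
```

Each quarter of the trips spans roughly 86 to 96 days, close to a quarter of a year.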

We can check the number of different taxis included in this dataset by:

In [7]:
raw_data['TAXI_ID'].nunique()

448

As we can see, there are 448 taxi IDs in this dataset, which differs from the 442 taxis stated in the dataset description. We can get a sorted list of all taxi IDs by:

In [8]:
# Get all taxi IDs and put them in a list, then sort the list ascending
taxi_ids = sorted(raw_data['TAXI_ID'].unique().tolist())

By inspecting the list, we can see that it contains no invalid taxi IDs. There are indeed 448 taxis in the dataset.

The total number of trips for each taxi is about $1710670 / 448 \approx 3818$ on average.
Taking 260 operating days per year (roughly the number of weekdays in a year), we can estimate the number of trips per day for each taxi as $ n = 1710670 / 448 / 260 \approx 15 $.
So in our dataset, each taxi makes about 15 trips per day, which agrees with common sense (if we estimate that each taxi trip takes about half an hour on average).
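The estimate can be reproduced in a few lines. This is only a sketch of the arithmetic: the 365-day variant is added here for comparison and is not part of the original estimate.

```python
n_trips = 1710670    # rows reported by raw_data.info()
n_taxis = 448        # result of raw_data['TAXI_ID'].nunique()
days_per_year = 260  # divisor used in the estimate above

trips_per_taxi = n_trips / n_taxis
trips_per_day = trips_per_taxi / days_per_year
print(round(trips_per_taxi), round(trips_per_day, 1))  # 3818 14.7

# For comparison (assumption: the taxi operates every calendar day)
trips_per_calendar_day = trips_per_taxi / 365
print(round(trips_per_calendar_day, 1))  # 10.5

# At about half an hour per trip, 15 trips fill about 7.5 hours of driving
hours_driving = round(trips_per_day) * 0.5
print(hours_driving)  # 7.5
```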

### Inspect the data for one taxi
We can extract all entries for the first taxi by:

In [9]:
# Extract the entries for the first taxi
# Make a copy to avoid unexpected changes to the raw_data dataframe
data_first_taxi = raw_data[raw_data['TAXI_ID'] == taxi_ids[0]].copy()

# Sort this data by timestamp in ascending order:
data_first_taxi_sorted = data_first_taxi.sort_values('TIMESTAMP')

# Make column for date time in readable format
data_first_taxi_sorted['DATETIME'] = pd.to_datetime(data_first_taxi_sorted['TIMESTAMP'], unit='s')

We can take a closer look at this data. We can see that this taxi makes about one trip per hour during the day.

In [10]:
data_first_taxi_sorted.head(5)
# data_first_taxi_sorted.head(50)

Unnamed: 0,TRIP_ID,CALL_TYPE,ORIGIN_CALL,ORIGIN_STAND,TAXI_ID,TIMESTAMP,DAY_TYPE,MISSING_DATA,POLYLINE,DATETIME
624,1372662403620000001,B,,28.0,20000001,1372662403,A,False,"[[-8.584353,41.163174],[-8.585289,41.162994],[...",2013-07-01 07:06:43
853,1372666377620000001,A,2002.0,,20000001,1372666377,A,False,"[[-8.608824,41.153436],[-8.608815,41.153463],[...",2013-07-01 08:12:57
1170,1372669154620000001,B,,63.0,20000001,1372669154,A,False,"[[-8.609562,41.160249],[-8.609652,41.160375],[...",2013-07-01 08:59:14
1513,1372672248620000001,A,2002.0,,20000001,1372672248,A,False,"[[-8.608869,41.15349],[-8.608851,41.153508],[-...",2013-07-01 09:50:48
1854,1372676157620000001,B,,10.0,20000001,1372676157,A,False,"[[-8.607096,41.150331],[-8.607096,41.150313],[...",2013-07-01 10:55:57
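The "about one trip per hour" reading can be checked from the five rows shown above. The sketch below uses plain Python on the TIMESTAMP values from the table. On the full DataFrame, `data_first_taxi_sorted['TIMESTAMP'].diff()` would give the same gaps in seconds.

```python
# Trip start times (Unix seconds) of the five rows shown above
timestamps = [1372662403, 1372666377, 1372669154, 1372672248, 1372676157]

# Minutes between consecutive trip starts
gaps_minutes = [(b - a) / 60 for a, b in zip(timestamps, timestamps[1:])]
mean_gap = sum(gaps_minutes) / len(gaps_minutes)

print([round(g) for g in gaps_minutes])  # [66, 46, 52, 65]
print(round(mean_gap, 1))                # 57.3
```

Five rows are a very small sample, so this only illustrates the pattern for the first morning.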


# Visualize taxi trips on an interactive map with folium

### folium library
**[Folium](https://python-visualization.github.io/folium/latest/index.html)** is a Python library used to create interactive maps using the Leaflet.js library. Folium makes it easy to visualize data that’s been manipulated in Python on an interactive Leaflet map. It enables both the binding of data to a map for choropleth visualizations as well as passing rich vector/raster/HTML visualizations as markers on the map. With folium, you can visualize data directly from a Jupyter Notebook or export maps as HTML files. We'll be using this package to visualize the taxi trips throughout this project.

In [11]:
import folium

### Visualize trips on the map for one taxi during one day 
Let's visualize all trips made by the first taxi on the first day. To plot the trips on the map, we need to use the GPS coordinates stored in the 'POLYLINE' column as a **string** in this format: '[[LONGITUDE, LATITUDE], ...]'

In [12]:
print(type(data_first_taxi_sorted.iloc[0]['POLYLINE']))
print(data_first_taxi_sorted.iloc[0]['POLYLINE'])

<class 'str'>
[[-8.584353,41.163174],[-8.585289,41.162994],[-8.587512,41.163678],[-8.589042,41.164173],[-8.589024,41.164155],[-8.589024,41.164146],[-8.58906,41.164155],[-8.589609,41.164038],[-8.589618,41.164056],[-8.589879,41.163759],[-8.59131,41.162202],[-8.59293,41.160717],[-8.594622,41.160627],[-8.596494,41.160879],[-8.598744,41.161158],[-8.601471,41.161536],[-8.603577,41.161824],[-8.603694,41.161833],[-8.605206,41.161995],[-8.607537,41.162283]]


Let's first extract the desired data:

In [13]:
specific_date = '2013-07-01'
df = data_first_taxi_sorted[data_first_taxi_sorted['DATETIME'].dt.date == pd.to_datetime(specific_date).date()]

There are 14 trips in total. To visualize them, we need to convert the data to the format that folium takes, which is a nested list or a list of tuples, with latitude first (the reverse of the order used in 'POLYLINE'):  
[[latitude, longitude], ...] or [(latitude, longitude), ...]

In [14]:
# Initialize a list called "trips" to store coordinates of all trips
trips = []

# Convert all entries in column 'POLYLINE' to desired format
lines = [l[2:-2].split('],[') for l in df['POLYLINE'].to_list()]
for line in lines:
    trips.append([(float(p.split(',')[1]), float(p.split(',')[0])) for p in line])
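Since the POLYLINE strings printed above are valid JSON, an alternative to manual string splitting is to parse them with the standard `json` module. The following is a sketch, not the notebook's method. `polyline_to_latlon` is a hypothetical helper name, and it assumes every POLYLINE value has the `[[lon, lat], ...]` form shown above (or is `[]`).

```python
import json


def polyline_to_latlon(polyline):
    """Parse a POLYLINE string into a list of (latitude, longitude) tuples."""
    points = json.loads(polyline)  # '[[lon, lat], ...]' -> nested list
    # Swap the order: POLYLINE stores longitude first, folium wants latitude first
    return [(lat, lon) for lon, lat in points]


# First two points of the POLYLINE printed earlier, plus an empty trip
example = '[[-8.584353,41.163174],[-8.585289,41.162994]]'
print(polyline_to_latlon(example))
# [(41.163174, -8.584353), (41.162994, -8.585289)]
print(polyline_to_latlon('[]'))
# []
```

With this helper, the list could be built as `trips = [polyline_to_latlon(s) for s in df['POLYLINE']]`. An empty trip yields an empty list instead of raising an error, so it can be filtered out afterwards.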

Let's plot the trips on map with blue lines, and add circle markers to mark the start (red) and end (green) points of the trips:

In [15]:
# Initialize the map
m = folium.Map(location=trips[0][0], zoom_start=13)

# Plot all trips on the map
for trip in trips:
    folium.PolyLine(trip, color='blue', opacity=0.5).add_to(m)

    # Add circle markers for starting and ending points of the trip
    folium.CircleMarker(location=trip[0], radius=3, weight=0, fill_color='red', fill_opacity=1).add_to(m)
    folium.CircleMarker(location=trip[-1], radius=3, weight=0, fill_color='green', fill_opacity=1).add_to(m)

In [16]:
# Show interactive map
m

### Visualize trips on the map for one taxi during the year

Let's visualize all trips made by the first taxi during the entire year.

In [17]:
# Take all data of the first taxi, and remove rows where there's no GPS data recorded
# When there's no recorded GPS data, 'POLYLINE' would be '[]'
df = data_first_taxi_sorted[data_first_taxi_sorted['POLYLINE'].map(len) > 2]

# Initialize a list called "trips" to store coordinates of all trips
trips = []

for l in df['POLYLINE'].to_list():
    line = l[2:-2].split('],[')
    trip = []
    for p in line:
        split_item = p.split(',')
        trip.append((float(split_item[1]), float(split_item[0])))
    trips.append(trip)

In [18]:
# Initialize the map
m = folium.Map(location=trips[0][0], zoom_start=9)

# Plot all trips on the map
for trip in trips:
    folium.PolyLine(trip, color='blue', opacity=0.3).add_to(m)

In [19]:
# Show interactive map
m
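The zoom level above is chosen by hand. As an alternative sketch, the bounding box of all points can be computed and passed to folium's `Map.fit_bounds`, which takes a south-west and a north-east corner. `trip_bounds` is a hypothetical helper, and the toy trips below reuse the first two points of two POLYLINE values shown earlier.

```python
def trip_bounds(trips):
    """Return [[south, west], [north, east]] covering all (lat, lon) points."""
    lats = [lat for trip in trips for lat, _ in trip]
    lons = [lon for trip in trips for _, lon in trip]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


# Two shortened trips in (latitude, longitude) order, for illustration only
toy_trips = [
    [(41.163174, -8.584353), (41.162994, -8.585289)],
    [(41.153436, -8.608824), (41.153463, -8.608815)],
]
bounds = trip_bounds(toy_trips)
print(bounds)
# [[41.153436, -8.608824], [41.163174, -8.584353]]

# In the notebook this could be used as: m.fit_bounds(trip_bounds(trips))
```

A single long trip to a distant city would widen the box considerably, so a fixed zoom may still be preferable when the focus is the city center.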

Let's visualize the trips for another taxi.

In [20]:
# Randomly select another taxi
import random
random_id = random.choice(taxi_ids[1:])
print("Randomly selected taxi ID: {}".format(random_id))

# Extract the entries for the randomly selected taxi, sort by timestamp ascending, and add date time column
data_one_taxi = raw_data[raw_data['TAXI_ID'] == random_id].copy()
data_one_taxi_sorted = data_one_taxi.sort_values('TIMESTAMP')
data_one_taxi_sorted['DATETIME'] = pd.to_datetime(data_one_taxi_sorted['TIMESTAMP'], unit='s')

Randomly selected taxi ID: 20000940


In [21]:
# Remove rows where there's no GPS data recorded
df = data_one_taxi_sorted[data_one_taxi_sorted['POLYLINE'].map(len) > 2]

# Initialize a list called "trips" to store coordinates of all trips
trips = []

for l in df['POLYLINE'].to_list():
    line = l[2:-2].split('],[')
    trip = []
    for p in line:
        split_item = p.split(',')
        trip.append((float(split_item[1]), float(split_item[0])))
    trips.append(trip)

In [22]:
# Initialize the map
m = folium.Map(location=trips[0][0], zoom_start=9)

# Plot all trips on the map
for trip in trips:
    folium.PolyLine(trip, color='green', opacity=0.3).add_to(m)

In [23]:
# Show interactive map
m

### Observation and further inquiry
We can see that for both taxis, the trips are mostly within the city, with a few trips to nearby cities. If we zoom in on the city area, we can see that the main roads and highways are traveled much more frequently than the smaller roads, as indicated by the darker color of the overlapping lines.

This motivates us to explore further questions: which roads are most frequently traveled by taxis in Porto during this year? Is there a difference in frequently traveled routes between weekdays, weekends and holidays? Is there a difference in frequently traveled routes for different trip types (dispatched from the central, demanded from a stand, and otherwise)?

### Data subset
For initial studies, since the original dataset is large ($\approx 1.7 \times 10^6$ entries), we perform the subsequent studies on a subset of the original dataset. We randomly select 45 of the 448 taxis in the dataset, which corresponds to about 10% of the original data.

In [24]:
# Randomly select 45 taxis 
import random
sample_size = 45
random_ids = random.sample(taxi_ids, sample_size)
print("Randomly selected taxi IDs: ", random_ids)
if len(random_ids) == len(set(random_ids)): print("There is no repeated taxi IDs in the selected IDs.")

# Extract data of these taxis from the original dataset
subset_data = raw_data[raw_data['TAXI_ID'].isin(random_ids)].copy()
print("Data subset size: {}".format(len(subset_data)))

# Save the data subset as .pkl file
subset_data.to_pickle('subset_data.pkl')

Randomly selected taxi IDs:  [20000188, 20000685, 20000598, 20000360, 20000083, 20000397, 20000386, 20000698, 20000098, 20000129, 20000092, 20000095, 20000540, 20000065, 20000546, 20000233, 20000453, 20000116, 20000970, 20000213, 20000046, 20000612, 20000027, 20000307, 20000198, 20000618, 20000058, 20000047, 20000018, 20000002, 20000545, 20000625, 20000030, 20000017, 20000340, 20000328, 20000617, 20000483, 20000616, 20000325, 20000066, 20000671, 20000670, 20000432, 20000247]
There is no repeated taxi IDs in the selected IDs.
Data subset size: 179932
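The "about 10%" figure can be checked against the numbers printed above. Because the selection is unseeded, rerunning the cell would produce a different subset. The sketch below also shows one way to make the selection repeatable, under two loudly stated assumptions: the seed value 0 is arbitrary, and `stand_in_ids` is a hypothetical list used in place of the real `taxi_ids`.

```python
import random

# Shares implied by the numbers reported in this notebook
taxi_share = 45 / 448
row_share = 179932 / 1710670
print(f"{taxi_share:.1%} of taxis, {row_share:.1%} of rows")
# 10.0% of taxis, 10.5% of rows

# Seeded sampling on stand-in IDs (hypothetical values, not the real taxi IDs)
stand_in_ids = list(range(20000001, 20000449))  # 448 IDs
sample_a = random.Random(0).sample(stand_in_ids, 45)
sample_b = random.Random(0).sample(stand_in_ids, 45)
print(sample_a == sample_b)  # True: same seed, same selection
```

`random.sample` draws without replacement, so the selected IDs are distinct whenever the input list has no duplicates.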
