# Introduction

In [None]:
# Load libraries
import pandas as pd
import os
import matplotlib
import folium
from plotnine import *
from matplotlib.dates import date2num
%matplotlib inline

In this initial report, we explore descriptive statistics from the accident, traffic level, and traffic report datasets provided by our partners, Jakarta Smart City and UN Global Pulse. These data are a sample of traffic data from May 2018 and include information such as the number of reports, types of traffic incidents, and frequency of accidents. They are also geocoded by district.

In general, we find some support for the notion that traffic incidents change in frequency and type over the course of a day, week, or month. This initial exploration is useful as it points to other data sources that might be helpful for us.

# Accident Data

The accident dataset is small (only 2 observations), but we would be excited to see more data like it. The following fields are unclear to us:

- type: We only see 'ACCIDENT' here so we are curious about what other categories might be contained here
- sub-type: Similarly, we are interested in the range of sub-types
- total: As with the other datasets, we are curious as to whether 'total' refers to the total number of reports 


In [None]:
# Load datasets
accident_report = pd.read_csv("accident_report_data.csv")
traffic_level = pd.read_csv("traffic_level_data.csv")
traffic_report = pd.read_csv("traffic_report_data.csv")
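The open questions about the accident fields could be answered quickly once a larger extract arrives by tabulating each categorical column. The sketch below assumes the column names `type`, `sub-type`, and `total` from the list above; the data frame and its sub-type values are made up for illustration and are not taken from the partner data.

```python
import pandas as pd

# Hypothetical stand-in for accident_report; the sub-type values are invented.
accident_sample = pd.DataFrame({
    "type": ["ACCIDENT", "ACCIDENT"],
    "sub-type": ["ACCIDENT_MINOR", "ACCIDENT_MAJOR"],
    "total": [3, 1],
})


def summarize_categories(df, columns):
    """Count each distinct value (missing values included) per column."""
    return {col: df[col].value_counts(dropna=False).to_dict() for col in columns}


category_summary = summarize_categories(accident_sample, ["type", "sub-type"])
print(category_summary)
```

Running the same helper on the full `accident_report` frame would show at a glance whether `type` ever takes a value other than 'ACCIDENT'.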

# Map of Jakarta

In this section, we plot a preliminary map of Jakarta to get a sense of how the city is shaped and zoned. One useful addition for any geospatial analysis would be a GeoJSON file or shapefile containing the boundaries of each district within the city. With this information, we would be able to map the change in traffic incidents over time, as well as differences between the districts.

In [None]:
# Map Jakarta
jakarta_coords = (-6.1751, 106.8650)
jakarta_map = folium.Map(jakarta_coords, zoom_start = 12, tiles = 'Mapbox Bright')
folium.TileLayer('stamentoner').add_to(jakarta_map)

In [None]:
jakarta_map

# Traffic Level

In this section, we look at the trends in the "traffic_level_dataset" csv. These data are interesting because they provide information about the delays experienced by motorists in each region and are collected fairly regularly. Our major question here is what unit 'delay' is measured in, namely seconds or minutes.

In [None]:
# traffic_level metadata

traffic_level['date_time'] = pd.to_datetime(traffic_level['date'] + ' ' + traffic_level['start_time'])
traffic_level.head()
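One way to reason about the unit question is to read the same numbers under both candidate units and judge which is more plausible for a single road segment. This is a minimal sketch with made-up delay values; only their rough magnitude (around 100) comes from the data described here.

```python
import pandas as pd

# Made-up delay values of roughly the magnitude seen in the data.
delays = pd.Series([80, 100, 120, 140], name="delay")

# Interpret the same raw numbers under each candidate unit.
unit_check = pd.DataFrame({
    "delay": delays,
    "if_seconds": pd.to_timedelta(delays, unit="s"),
    "if_minutes": pd.to_timedelta(delays, unit="min"),
})
print(unit_check)
```

A value of 100 reads as under two minutes in seconds but over an hour and a half in minutes, so comparing these against the length of each reporting interval may help rule one unit out. The partners' confirmation is still needed.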

Below, we visualize the general delays experienced by each area in aggregate. We use histograms to show the overall distributions. In general, across all areas, delays tend to concentrate on the 'shorter' side. Gambir is notable in this regard as nearly all of its reports contain delays around 100.

We would be interested in learning more about each of these areas, and what drives these small, but potentially important, differences in the delays experienced in each.

In [None]:
(ggplot(traffic_level[traffic_level.delay < 2000], aes('delay')) 
 + geom_histogram(bins = 30)
 + facet_wrap('~area')
 + ggtitle('Histogram of Traffic Delays by Area') 
 + xlab('Delay')
 + ylab('Count')
 + theme_bw())

Here, we visualize time series across the two-week period for each area. It is interesting to note here that certain areas experience spikes on particular days, even if other areas do not. For instance, Gambir experienced a spike in delays on May 13th, even though no other area experienced a similar spike. Meanwhile, Gunung Sahari Utara experienced a spike on May 3rd that did not occur in Gambir. Kampung Melayu was very consistent across the entire period, never experiencing a major fluctuation in delay times.

We would need more data to explore whether these differences are largely explained by one-time events (such as construction or an accident), or whether they are systematically related to something like the day of week.

In [None]:
(ggplot(traffic_level[traffic_level.delay < 2000], aes('date_time', 'delay'))
+ geom_line()
+ xlab('Date/Time')
+ ylab('Delay')
+ ggtitle('Time Series of Traffic Delays across All Days')
+ theme_bw()
+ theme(axis_text_x = element_text(angle = 90, hjust = .5))
+ facet_wrap('~area'))
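With a longer sample, the day-of-week question raised above could be checked by averaging delays per area and weekday. The sketch below uses a small made-up frame with the same `area`, `date_time`, and `delay` columns as `traffic_level`; the values are illustrative only.

```python
import pandas as pd

# Made-up observations; May 1 and May 8, 2018 are Tuesdays, May 2 and May 9 Wednesdays.
toy_levels = pd.DataFrame({
    "area": ["Gambir"] * 4,
    "date_time": pd.to_datetime([
        "2018-05-01 08:00", "2018-05-02 08:00",
        "2018-05-08 08:00", "2018-05-09 08:00",
    ]),
    "delay": [100, 120, 110, 130],
})

# Derive the weekday name, then average delay per area and weekday.
toy_levels["day_of_week"] = toy_levels["date_time"].dt.day_name()
mean_delay_by_day = (
    toy_levels.groupby(["area", "day_of_week"])["delay"]
    .mean()
    .reset_index()
)
print(mean_delay_by_day)
```

If spikes recur on the same weekday across several weeks, a systematic pattern is more likely than a one-time event.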

Finally, we visualize a day of delays in Gambir, namely May 1st. Here, we can see that delays substantially decrease around mid-afternoon (down to around 80), but the mornings and early evenings experience larger delays (close to 140 at points).

In [None]:
(ggplot(traffic_level[(traffic_level.area == 'Gambir') & 
                      (traffic_level.delay < 2000) &
                      (traffic_level.date == '2018-05-01')], 
        aes('start_time', 'delay', group = 1))
+ geom_line()
+ xlab('Date/Time')
+ ylab('Delay')
+ ggtitle('Gambir Time Series of Traffic Delays on May 1, 2018')
+ theme_bw()
+ theme(axis_text_x = element_text(angle = 90, hjust = .5)))

# Traffic Report

Next, we visualize the "traffic_reports_dataset" csv. These data provide the type and sub-types for various incidents.

In [None]:
# Look at traffic_report
traffic_report.head()

In [None]:
# Build the timestamp from traffic_report's own date and start_time columns
traffic_report['date_time'] = pd.to_datetime(traffic_report['date'] + ' ' + traffic_report['start_time'])
traffic_report.head()

Here we visualize the frequency of different types of roadway incidents. Traffic jams are the dominant reason for reporting a traffic incident by a large margin. Citizens rarely reported roadway hazards or road closures as a reason for a delay.

One interesting implication of this finding is that hazards may not pose as large a problem for motorists as we originally thought, although report frequency is only an indirect measure of safety risk. This may inform how we process the video data; if roadway hazards rarely cause problems for motorists, our time may be better spent on other tasks (such as detecting dangerous driving patterns).

In [None]:
(ggplot(traffic_report)
+ geom_bar(aes('type'))
+ ggtitle('Barplot of Incident Types')
+ xlab('Type of Incident')
+ ylab('Count')
+ theme_bw())

Within traffic jams, we visualize the various 'sub-types' that people report. Citizens most frequently reported 'heavy' and 'stand still' traffic. This may be because people are more likely to complain about bad traffic than to report light or moderate traffic, but it nonetheless provides a useful snapshot of the prevalence of congestion in the dataset.

In [None]:
(ggplot(traffic_report[(traffic_report['type'] == 'JAM') & (traffic_report['sub-type'].notnull())])
+ geom_bar(aes('sub-type'))
+ ggtitle('Barplot of Traffic Jam Sub-Types')
+ xlab('Sub-Type')
+ ylab('Count')
+ theme_bw()
+ theme(axis_text_x = element_text(angle = 90, hjust = 1)))

# Traffic Levels and Reports Data

Finally, we merge the traffic levels and reports data to see if there is a plausible relationship between reported traffic jams and increased delays.

In [None]:
traffic_all = pd.merge(traffic_level, traffic_report.drop(['date', 'start_time', 'end_time', 'Unnamed: 0'], axis = 1), how = 'outer', on = ['area', 'date_time'], validate = 'one_to_many')
traffic_all.head(5)
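Because this is an outer merge, it can be helpful to check how many rows actually matched on `area` and `date_time`. The sketch below uses two made-up frames and the `indicator` option of `pd.merge`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The frame names `levels` and `reports` are placeholders, not the notebook's variables.

```python
import pandas as pd

# Made-up traffic level rows (one per area and timestamp).
levels = pd.DataFrame({
    "area": ["Gambir", "Gambir"],
    "date_time": pd.to_datetime(["2018-05-01 08:00", "2018-05-01 08:15"]),
    "delay": [120, 90],
})

# Made-up report rows; the 09:00 report has no matching level row.
reports = pd.DataFrame({
    "area": ["Gambir", "Gambir"],
    "date_time": pd.to_datetime(["2018-05-01 08:00", "2018-05-01 09:00"]),
    "type": ["JAM", "JAM"],
})

merged = pd.merge(
    levels, reports,
    how="outer", on=["area", "date_time"],
    validate="one_to_many", indicator=True,
)

# Count matched and unmatched rows.
match_counts = merged["_merge"].value_counts().to_dict()
print(match_counts)
```

A large share of `left_only` or `right_only` rows would suggest that the two datasets use different time intervals and need to be aligned before merging.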

In [None]:
cuts_gambir = traffic_all[(traffic_all['type'] == 'JAM') & 
                          (traffic_all['date'] == '2018-05-01') &
                          (traffic_all['area'] == 'Gambir') &
                         (traffic_all['delay'] < 2000)][['start_time', 'date_time', 'area', 'type', 'sub-type']]

As we can see from the plot below, reports of traffic jams are generally followed by sharp changes in delay, but not in a consistent direction. In two cases, they preceded substantial increases in delay time. However, in two other cases delay times dropped considerably. More data will be necessary to draw firm conclusions, and this pattern may reflect idiosyncrasies in the way the data were aggregated.

In [None]:
(ggplot(traffic_all[(traffic_all.area == 'Gambir') & 
                      (traffic_all.delay < 2000) &
                      (traffic_all.date == '2018-05-01')], 
        aes('date_time', 'delay', group = 1))
+ geom_line()
+ scale_x_datetime()
# Use the datetime column so the x axis and the jam markers share one scale
+ geom_vline(xintercept = cuts_gambir['date_time'],
            color = 'red')
+ xlab('Date/Time')
+ ylab('Delay')
+ ggtitle('Gambir Time Series of Traffic Delays on May 1, 2018 \n with Traffic Jams (red lines)')
+ theme_bw()
+ theme(axis_text_x = element_text(angle = 90, hjust = .5)))
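The visual impression from the plot above could be made more concrete by computing the change in delay from each interval to the next and looking only at intervals where a jam was reported. This is a minimal sketch on made-up values; the `jam_reported` flag is a hypothetical column that would have to be derived from the merged `type` field.

```python
import pandas as pd

# Made-up single-day series for one area; jam_reported is a hypothetical flag.
day_series = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2018-05-01 08:00", "2018-05-01 08:15",
        "2018-05-01 08:30", "2018-05-01 08:45",
    ]),
    "delay": [100, 140, 130, 90],
    "jam_reported": [True, False, True, False],
}).sort_values("date_time")

# Change in delay between each interval and the one that follows it.
day_series["next_delay_change"] = day_series["delay"].shift(-1) - day_series["delay"]

# Keep only the changes that follow a reported jam.
after_jam = day_series.loc[day_series["jam_reported"], "next_delay_change"]
print(after_jam.tolist())
```

In this toy series one report is followed by an increase and the other by a decrease, mirroring the mixed pattern described above. With more data, the distribution of these changes could be compared against intervals without reports.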