# Traffic Cleaning

This notebook combines and cleans the three original traffic count datasets from Montreal.  The files can be found at the [Montreal Open Data Portal](https://donnees.montreal.ca/ville-de-montreal/comptage-vehicules-pietons#data).  In this notebook, we:

- drop columns that aren't useful
- rename the columns to English
- pull out useful date columns (year, month, day, weekday)
- keep only the dates that can be lined up with the collisions dataset
- combine the counts for different vehicle types into a single total

The final saved dataset is `traffic_cleaned_by_hour.csv`.

We start by importing the standard libraries and reading in the data.

In [1]:
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

In [2]:
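# allows us to see all columns in expanded format
# (the full option name is required; the bare 'max_columns' alias fails in recent pandas versions)
pd.set_option('display.max_columns', None)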

# read in each file
traffic1 = pd.read_csv('Data/comptages_vehicules_cyclistes_pietons_2011_2013.csv')
traffic2 = pd.read_csv('Data/comptages_vehicules_cyclistes_pietons_2014_2016.csv')
traffic3 = pd.read_csv('Data/comptages_vehicules_cyclistes_pietons_2017_2019.csv')

# concatenate the individual files
traffic = pd.concat([traffic1, traffic2, traffic3])
traffic.shape

(1519470, 30)

The combined table has about 1.5 million rows.  Each intersection and time stamp appears once per vehicle type, so summing over vehicle types later on cuts the row count roughly five-fold.
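
To illustrate why the different vehicle counts inflate the row count, here is a minimal sketch on invented data. It assumes the layout seen in the raw files: one row per intersection, time stamp and vehicle category. `Camions legers` appears in the data; the other two category labels and all counts are made up for the example.

```python
import pandas as pd

# Toy data (invented values): one intersection, one 15-minute slot,
# three vehicle categories, each on its own row.
toy = pd.DataFrame({
    'Id_Intersection': [561, 561, 561],
    'Date': ['2012-02-24'] * 3,
    'Heure': [8, 8, 8],
    'Minute': [15, 15, 15],
    # 'Camions legers' is a real label; the other two are hypothetical
    'Description_Code_Banque': ['Camions legers', 'Category B', 'Category C'],
    'NBT': [3, 40, 1],
    'SBT': [2, 35, 0],
})

# Summing over the vehicle categories leaves one row per time stamp.
keys = ['Id_Intersection', 'Date', 'Heure', 'Minute']
combined = toy.groupby(keys).sum(numeric_only=True).reset_index()

print(len(toy), '->', len(combined), 'row(s)')
print(combined)
```

Three input rows collapse into one, with `NBT` and `SBT` holding the totals across categories. The same idea, applied to the full table, accounts for most of the reduction in rows.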

In [3]:
traffic.head()

Unnamed: 0,Id_Reference,Id_Intersection,Nom_Intersection,Date,Periode,Heure,Minute,Seconde,Code_Banque,Description_Code_Banque,NBLT,NBT,NBRT,SBLT,SBT,SBRT,EBLT,EBT,EBRT,WBLT,WBT,WBRT,Approche_Nord,Approche_Sud,Approche_Est,Approche_Ouest,Localisation_X,Localisation_Y,Longitude,Latitude
0,1974,4864,rue Jeanne-Mance / rue Villeneuve,2011-05-05,00:00:00,0,0,0,1,Camions legers,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,297634.913,5042110.0,-73.591714,45.518868
1,1975,561,Berri / Saint-Joseph,2011-02-24,00:00:00,0,0,0,1,Camions legers,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,298015.6545,5043057.0,-73.586854,45.527397
2,1976,2233,avenue de l' Esplanade / rue Villeneuve,2011-03-15,00:00:00,0,0,0,1,Camions legers,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,297681.812,5042182.0,-73.591115,45.519521
3,1974,4864,rue Jeanne-Mance / rue Villeneuve,2011-05-05,00:15:00,0,15,0,1,Camions legers,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,297634.913,5042110.0,-73.591714,45.518868
4,1975,561,Berri / Saint-Joseph,2011-02-24,00:15:00,0,15,0,1,Camions legers,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,298015.6545,5043057.0,-73.586854,45.527397


In [4]:
# reset the option so that we see the condensed form when we print tables
pd.reset_option('display.max_columns')

We have a lot of columns here, many of which are not particularly useful.  We therefore need to go through some cleaning steps:

In [5]:
# drop some non-useful columns
traffic.drop(columns=['Id_Reference',
                      'Periode',
                      'Seconde',
                      'Code_Banque',
                      'Localisation_X',
                      'Localisation_Y',
                      'Nom_Intersection',
                      'Approche_Nord',
                      'Approche_Sud',
                      'Approche_Est',
                      'Approche_Ouest'], inplace=True)

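# get useful columns from the date entries
# (pd.to_datetime is used because astype('datetime64') without a unit is rejected by recent pandas versions)
traffic['Date'] = pd.to_datetime(traffic['Date'])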

traffic['Year'] = traffic['Date'].dt.year
traffic['Month'] = traffic['Date'].dt.month
traffic['Day'] = traffic['Date'].dt.day
traffic['weekday'] = traffic['Date'].dt.dayofweek

# keep only entries from 2012 onward; 2011 and earlier are omitted (we don't have collision data there)
traffic = traffic[traffic['Year'] > 2011]

# rename columns to English
traffic.rename(columns={'Heure': 'hour',
                        'Minute': 'minute',
                        'Description_Code_Banque': 'traffic_type'}, inplace=True)

# combine rows which correspond to the same intersection and time stamp, to get total vehicles through the intersection
# numeric_only=True sums only the count columns and leaves out the text column traffic_type
traffic = traffic.groupby(['Id_Intersection',
                           'Date',
                           'hour',
                           'minute',
                           'Longitude',
                           'Latitude',
                           'Year',
                           'Month',
                           'Day',
                           'weekday']).sum(numeric_only=True).reset_index()

traffic.head()

Unnamed: 0,Id_Intersection,Date,hour,minute,Longitude,Latitude,Year,Month,Day,weekday,...,NBRT,SBLT,SBT,SBRT,EBLT,EBT,EBRT,WBLT,WBT,WBRT
0,1,2018-02-14,0,0,-73.575661,45.48265,2018,2,14,2,...,0,0,0,0,0,0,0,0,0,0
1,1,2018-02-14,0,15,-73.575661,45.48265,2018,2,14,2,...,0,0,0,0,0,0,0,0,0,0
2,1,2018-02-14,0,30,-73.575661,45.48265,2018,2,14,2,...,0,0,0,0,0,0,0,0,0,0
3,1,2018-02-14,0,45,-73.575661,45.48265,2018,2,14,2,...,0,0,0,0,0,0,0,0,0,0
4,1,2018-02-14,1,0,-73.575661,45.48265,2018,2,14,2,...,0,0,0,0,0,0,0,0,0,0
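
As a quick check on the date columns, the sketch below repeats the extraction on two toy dates. `2018-02-14` is taken from the output above; the second date is made up. The `day_name` column is an addition for readability and is not part of the cleaned dataset.

```python
import pandas as pd

# Two toy dates: the first appears in the output above, the second is invented.
toy = pd.DataFrame({'Date': ['2018-02-14', '2018-02-18']})
toy['Date'] = pd.to_datetime(toy['Date'])

toy['Year'] = toy['Date'].dt.year
toy['Month'] = toy['Date'].dt.month
toy['Day'] = toy['Date'].dt.day
toy['weekday'] = toy['Date'].dt.dayofweek    # Monday = 0, ..., Sunday = 6
toy['day_name'] = toy['Date'].dt.day_name()  # human-readable, for checking only

print(toy)
```

With pandas, `dayofweek` counts from Monday as 0, so the `weekday` value of 2 shown for 2018-02-14 corresponds to a Wednesday. This convention matters when lining the table up with another dataset that may number weekdays differently.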


In [6]:
traffic.shape

(286590, 22)

This is starting to look better, and we have reduced the number of rows to about 287,000.  Finally, let's combine the individual count columns into one total per direction and aggregate the 15-minute counts by the hour.

In [7]:
# total count per direction: add up the three count columns of each direction
traffic['NB'] = traffic['NBLT'] + traffic['NBT'] + traffic['NBRT']
traffic['SB'] = traffic['SBLT'] + traffic['SBT'] + traffic['SBRT']
traffic['EB'] = traffic['EBLT'] + traffic['EBT'] + traffic['EBRT']
traffic['WB'] = traffic['WBLT'] + traffic['WBT'] + traffic['WBRT']

# the individual count columns are no longer needed
traffic.drop(columns=['NBLT', 'NBT', 'NBRT',
                      'SBLT', 'SBT', 'SBRT',
                      'EBLT', 'EBT', 'EBRT',
                      'WBLT', 'WBT', 'WBRT'], inplace=True)

# sum the 15-minute counts within each hour; the minute column is discarded
traffic = traffic.groupby(['Id_Intersection', 'Date', 'hour', 'Longitude', 'Latitude',
                           'Year', 'Month', 'Day', 'weekday'])[['NB', 'SB', 'EB', 'WB']].sum().reset_index()

traffic.head()

Unnamed: 0,Id_Intersection,Date,hour,Longitude,Latitude,Year,Month,Day,weekday,NB,SB,EB,WB
0,1,2018-02-14,0,-73.575661,45.48265,2018,2,14,2,0,0,0,0
1,1,2018-02-14,1,-73.575661,45.48265,2018,2,14,2,0,0,0,0
2,1,2018-02-14,2,-73.575661,45.48265,2018,2,14,2,0,0,0,0
3,1,2018-02-14,3,-73.575661,45.48265,2018,2,14,2,0,0,0,0
4,1,2018-02-14,4,-73.575661,45.48265,2018,2,14,2,0,0,0,0
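
The two aggregation steps in the cell above can be sketched on a small invented table. The column names match the notebook, but every count is made up, and only the northbound direction is shown to keep the example short.

```python
import pandas as pd

# Toy 15-minute counts for one intersection (all values invented).
toy = pd.DataFrame({
    'Id_Intersection': [1] * 5,
    'hour':   [7, 7, 7, 7, 8],
    'minute': [0, 15, 30, 45, 0],
    'NBLT': [1, 2, 0, 1, 3],
    'NBT':  [10, 12, 9, 11, 20],
    'NBRT': [2, 1, 1, 0, 4],
})

# Step 1: one northbound total per 15-minute slot.
toy['NB'] = toy[['NBLT', 'NBT', 'NBRT']].sum(axis=1)

# Step 2: collapse the quarter-hour slots into one row per hour.
hourly = toy.groupby(['Id_Intersection', 'hour'])['NB'].sum().reset_index()

print(hourly)
```

Hour 7 has four slots and hour 8 only one, so an hourly total is only comparable across hours when all four quarter-hour slots were recorded. Whether that holds for every hour in the real data is not checked here.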


We can now check the final format of our traffic table and save it.

In [8]:
traffic.shape

(72319, 13)

In [9]:
traffic.head()

Unnamed: 0,Id_Intersection,Date,hour,Longitude,Latitude,Year,Month,Day,weekday,NB,SB,EB,WB
0,1,2018-02-14,0,-73.575661,45.48265,2018,2,14,2,0,0,0,0
1,1,2018-02-14,1,-73.575661,45.48265,2018,2,14,2,0,0,0,0
2,1,2018-02-14,2,-73.575661,45.48265,2018,2,14,2,0,0,0,0
3,1,2018-02-14,3,-73.575661,45.48265,2018,2,14,2,0,0,0,0
4,1,2018-02-14,4,-73.575661,45.48265,2018,2,14,2,0,0,0,0


In [11]:
traffic.to_csv('Data/traffic_cleaned_by_hour.csv', index=False)
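
When the saved file is read back, the `Date` column comes in as plain text unless it is parsed explicitly, because CSV does not store dtypes. The sketch below shows one way to reload it, using a tiny stand-in table and a temporary directory rather than the real `Data/` folder. The stand-in values are copied from the preview above and only a few columns are included.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the cleaned table (a subset of its columns).
cleaned = pd.DataFrame({
    'Id_Intersection': [1, 1],
    'Date': pd.to_datetime(['2018-02-14', '2018-02-14']),
    'hour': [0, 1],
    'NB': [0, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'traffic_cleaned_by_hour.csv')
    cleaned.to_csv(path, index=False)
    # parse_dates restores the datetime dtype that the CSV format drops
    reloaded = pd.read_csv(path, parse_dates=['Date'])

print(reloaded.dtypes)
```

Writing with `index=False`, as the notebook does, also keeps an extra unnamed index column out of the reloaded table.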