# Scraping LADOT Volume Data from PDFs, Part 2

##### Where I Left Off
In the first Python notebook, I described how I extracted volume data from the PDFs. At this point, the extracted data has been written to .csv files formatted for analysis.


In [3]:
### Setup
import csv
import glob
from datetime import datetime, date, time
import pdfquery
import pandas as pd
import numpy as np
import folium
import os

I'm going to start by loading and cleaning the tables provided by BOE. The first is the files table.

In [9]:
# Load traffic data files table
traffic_data_files_path = 'boe_tables/dbo_dot_traffic_data_files.csv'
dbo_dot_traffic_data_files = pd.read_csv(traffic_data_files_path, parse_dates=['UploadDT'], encoding="ISO-8859-1")

# Drop rows where TrafficID is NaN, convert TrafficID to int type
dbo_dot_traffic_data_files = dbo_dot_traffic_data_files.dropna(subset=['TrafficID'])
dbo_dot_traffic_data_files['TrafficID'] = dbo_dot_traffic_data_files['TrafficID'].astype(int)

# Keep only manual counts (this drops Survey Data and Automatic Counts)
dbo_dot_traffic_data_files = dbo_dot_traffic_data_files[(dbo_dot_traffic_data_files['TrafficType'] == 'manual_count')]

# See traffic data files head
print(f"There are {len(dbo_dot_traffic_data_files)} records in the table.")
dbo_dot_traffic_data_files.head()

There are 9034 records in the table.


Unnamed: 0,ID,TrafficID,TrafficType,DocName,UniqueDocName,UploadDT
0,1,1435,manual_count,2_GRAVDM93.pdf,2_GRAVDM93.pdf,2007-04-02 08:38:30
1,2,1436,manual_count,4_CULVIS95.pdf,4_CULVIS95.pdf,2008-02-20 09:15:12
2,3,1436,manual_count,4_MONCUL100928.pdf,4_MONCUL100928.pdf,2011-08-09 13:58:55
3,4,1437,manual_count,16_VISTA DEL MAR.WATERVIEW07.pdf,16_VISTA DEL MAR.WATERVIEW07.pdf,2007-11-28 13:01:46
4,5,1437,manual_count,16_visvis01.pdf,16_visvis01.pdf,2007-12-03 16:30:42

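The `Unnamed: 0` column in the output above is what pandas typically produces when a CSV was written with its index and then read back without `index_col`. I am assuming that is what happened with this table; the sketch below uses a made-up two-row sample rather than the real BOE file, and the `sample`, `naive`, and `clean` names are only for illustration.

```python
import io
import pandas as pd

# Hypothetical two-row sample, not the real BOE table
sample = pd.DataFrame({"ID": [1, 2], "TrafficID": [1435, 1436]})

# Writing with the default index=True stores the index as an unnamed first column
csv_text = sample.to_csv()

# Reading it back without index_col yields an "Unnamed: 0" column
naive = pd.read_csv(io.StringIO(csv_text))

# index_col=0 restores that first column as the index instead
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))
print(list(clean.columns))
```

If the same holds for the real file, passing `index_col=0` to the `pd.read_csv` call above would likely remove the extra column without any other changes.
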

In [13]:
# Load traffic data table
traffic_data_path = 'boe_tables/dot_traffic_data.csv'
dot_traffic_data = pd.read_csv(traffic_data_path)

# Drop "ext" and "Shape" columns
dot_traffic_data = dot_traffic_data.drop(['ext','Shape'], axis=1)
#dot_traffic_data['IntersectionID'] = dot_traffic_data['IntersectionID'].astype(int)

# See traffic data head
dot_traffic_data.head()

Unnamed: 0,TrafficID,IntersectionID,lat,lon,intersection
0,1,3667.0,33.78,-118.26,ISLAND AVE at L ST
1,2,3680.0,33.78,-118.28,FIGUEROA ST at L ST
2,3,3727.0,33.77,-118.26,FRIES AVE at HARRY BRIDGES BLVD
3,4,3787.0,33.78,-118.22,ANAHEIM ST AT FARRAGUT AVE
4,5,3839.0,33.79,-118.27,DON ST at FRIGATE AVE

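`IntersectionID` displays as a float (`3667.0`), which usually means pandas found missing values in the column. A plain `astype(int)` raises an error in that case, which may be why the conversion above is commented out (that is my guess, not something the table confirms). A minimal sketch of one alternative, assuming pandas 0.24 or later and using made-up values, is the nullable `Int64` dtype:

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the NaN stands in for a missing IntersectionID
sample = pd.DataFrame({
    "TrafficID": [1, 2, 3],
    "IntersectionID": [3667.0, np.nan, 3727.0],
})

# The nullable "Int64" dtype (capital I) stores integers alongside missing values
sample["IntersectionID"] = sample["IntersectionID"].astype("Int64")

print(sample["IntersectionID"].dtype)
print(sample["IntersectionID"].isna().sum())
```

This keeps rows with a missing ID instead of dropping them, which differs from the `dropna` approach used for `TrafficID` in the files table.
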

In [4]:
# Import output data tables
manualcount_df = pd.read_csv('TrafficCountData/Results/manualcount.csv')
pedestrian_df = pd.read_csv('TrafficCountData/Results/pedestrian.csv')
peakvol_df = pd.read_csv('TrafficCountData/Results/peakvol.csv')
specveh_df = pd.read_csv('TrafficCountData/Results/SpecialVehicle.csv')
info_df = pd.read_csv('TrafficCountData/Results/info.csv')

In [7]:
info_df.head()

Unnamed: 0.1,Unnamed: 0,count_id,date,dayofweek,district,hours,school_day,street_ew,street_ns,weather
0,0,68,2011-04-25,MONDAY,YES,7-10AM 2-5PM,YES,65th PL,VAN NESS AV,SUNNY
1,1,70,2011-09-08,THURSDAY,YES,7-10AM 2-5PM,YES,FLORENCE AV.,KANSAS AV.,SUNNY
2,2,89,2007-07-23,MONDAY,YES,7-10AM 3-6PM,YES,FLORENCE AV,WEST ST,SUNNY
3,3,112,2010-03-08,MONDAY,YES,7-10AM 2-5PM,YES,78TH ST,SAN PEDRO ST,SUNNY
4,4,142,2010-04-23,FRIDAY,YES,7-10AM 3-6PM,YES,92ND ST,GRAHAM AV,SUNNY
