# Scraping LADOT Volume Data from PDFs

## Background
At LADOT, we have a lot of historical (and relatively current) data on vehicle volume counts made at intersections throughout the city at various times. The problem is that the data are currently stored as PDFs, a format that doesn't readily allow for the kind of large-scale analysis we would like to perform. So for this task I set out to develop a method for scraping these historical PDF counts and formatting them into usable data tables using any Python package I could. I ended up settling on the pdfquery package, which is really just a lightweight wrapper around the better-known PDFMiner package.

The roughly 1,000 PDFs typically (though not always) look like this (converted to images for display here):

![Example Volume PDF](images/example.jpg?raw=TRUE)

##### Format Advantages
* One big advantage is that the (manual count) volume data sheets are generally in the same format.
* With very few exceptions, the PDFs were generated from another program (rather than being scanned images).

##### Format Challenges
There are a few minor challenges:
* There are multiple tables on each page, and each is formatted differently.
* The tables / text do not always appear in the exact same location on each page, which meant I needed a range of parameters to test for the bounding boxes.

## General Approach
My approach can be broken down into three key parts: (1) define bounding boxes, (2) search for text within the bounding boxes, and (3) reformat the resulting text into multiple data tables.

##### Define Bounding Boxes
This was tricky. I initially began defining bounding boxes using pixel measurements from a few sample pages. However, I quickly realized that this would not work because of the second challenge mentioned above: the tables sit in different locations from one PDF document to the next.

Instead, I decided to create bounding boxes on the fly for each document using relative positioning of certain keywords that appeared almost always on each PDF document. Using pdfquery, I could begin by searching the document for these keywords and then extract the x,y pixel coordinate locations for the bounding box of each one. By getting the coordinates of multiple keyword objects on the page, I could construct a set of bounding boxes that seemed to perform relatively well in capturing data tables.
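
As a minimal sketch of this idea (not the actual code in `VolumeCountSheets_V2`), the snippet below builds a bounding box from two keyword anchors and formats it as a pdfquery `in_bbox` selector. The helper names `bbox_from_anchors` and `in_bbox_selector`, the `right_edge` and `pad` parameters, and the coordinate values are all hypothetical. In pdfquery, the anchor coordinates would typically come from a query such as `pdf.pq('LTTextLineHorizontal:contains("...")')` followed by reading the `x0`, `y0`, `x1`, and `y1` attributes of the match.

```python
# Hypothetical helpers for illustration only.
# Coordinates follow PDFMiner's convention: origin at the bottom-left
# of the page, with y increasing upward.

def bbox_from_anchors(top, bottom, right_edge, pad=2.0):
    """Return (x0, y0, x1, y1) for the region between two keyword boxes.

    `top` and `bottom` are dicts with 'x0', 'y0', 'x1', 'y1' keys describing
    the keyword above and the keyword below the table of interest.
    """
    x0 = min(top["x0"], bottom["x0"]) - pad   # left edge of the leftmost anchor
    y0 = bottom["y1"] - pad                   # just below the top of the lower keyword
    x1 = right_edge + pad                     # assumed right limit of the table
    y1 = top["y0"] + pad                      # just above the bottom of the upper keyword
    return (x0, y0, x1, y1)


def in_bbox_selector(bbox, tag="LTTextLineHorizontal"):
    """Format a bounding box as a pdfquery in_bbox selector string."""
    return '%s:in_bbox("%s, %s, %s, %s")' % ((tag,) + tuple(bbox))


# Made-up anchor positions standing in for two keywords found on a page.
top = {"x0": 50.0, "y0": 700.0, "x1": 120.0, "y1": 712.0}
bottom = {"x0": 50.0, "y0": 400.0, "x1": 140.0, "y1": 412.0}

bbox = bbox_from_anchors(top, bottom, right_edge=560.0)
selector = in_bbox_selector(bbox)
print(bbox)
print(selector)
```

Because the box is rebuilt from the anchors on every document, a table that shifts up or down the page moves the box with it.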

(create image of what this looked like)

##### Search for Text within Bounding Boxes
Once I had the coordinates of the bounding boxes, this part was quite easy: I used pdfquery to extract the text that falls inside each box.
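
To illustrate what a bounding-box search does, here is a small pure-Python sketch that mimics the behavior of an `in_bbox` query: it keeps only the text elements that lie entirely inside the box and returns them in reading order. The function names and the sample elements are made up for the example; with pdfquery itself the filtering is done by the selector rather than by hand.

```python
def inside(elem, bbox):
    """True if the element's box lies entirely within bbox."""
    x0, y0, x1, y1 = bbox
    return (elem["x0"] >= x0 and elem["y0"] >= y0
            and elem["x1"] <= x1 and elem["y1"] <= y1)


def text_in_bbox(elements, bbox):
    """Return the text of all elements inside bbox, top to bottom, left to right."""
    hits = [e for e in elements if inside(e, bbox)]
    # PDF y grows upward, so a larger y0 means higher on the page.
    hits.sort(key=lambda e: (-e["y0"], e["x0"]))
    return [e["text"] for e in hits]


# Made-up text elements: a header, two cells on one row, and a footer.
elements = [
    {"text": "6", "x0": 100, "y0": 670, "x1": 108, "y1": 680},
    {"text": "header", "x0": 60, "y0": 690, "x1": 130, "y1": 700},
    {"text": "81", "x0": 60, "y0": 670, "x1": 75, "y1": 680},
    {"text": "footer", "x0": 60, "y0": 20, "x1": 130, "y1": 30},
]

bbox = (48.0, 410.0, 562.0, 702.0)
cells = text_in_bbox(elements, bbox)
print(cells)
```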

(do I need to adjust any of the parameters in PDFMiner??)

##### Reformat the Resulting Text into Data Tables
The final step was to take the scraped text from the bounding boxes and reformat it into usable data tables. I kept the relational database model in mind as I set the format for these tables. From the PDF image above, I decided on the following tables and attributes:

*tbl_manualcount_info:* This table contains the basic information about the manual count summary. Each count will have one tuple with the following information:
* street_ns: The North / South Street running through the intersection
* street_ew: The East / West street running through the intersection
* dayofweek: Day of the Week
* date: Date, in datetime format
* weather: Prevailing weather at the time of the count (Clear, Sunny, etc.)
* hours: the hours of the count (text)
* school_day: A Yes / No indication of whether the count occurred on a school day. This is important because it heavily affects the volume counts
* int_code: The "I/S Code" on the form corresponds to the CL_Node_ID on the BOE Centerline. This ID field makes it easy to join to the City's centerline network
* district: The DOT field district in which the count took place
* count_no: Unique identifier assigned to the summary

*tbl_manualcount_dualwheel:* This table contains count data for dual-wheeled vehicles (motorcycles), bikes, and buses. Each form will have 12 tuples (4 approaches * 3 types) with the following information:
* count_no: Unique identifier assigned to the summary in "tbl_manualcount_info"
* approach: Intersection approach being measured (N,S,E,W)
* type: Dual-Wheeled / Bikes / Buses
* volume: Count

*tbl_manualcount_peak:* This table contains the peak hour / 15 minute counts. Each form will have 16 tuples (4 approaches * 4 count types) with the following information:
* count_no: Unique identifier assigned to the summary in "tbl_manualcount_info"
* approach: Intersection approach being measured (N,S,E,W)
* type: The type of count
    * am_15: The AM peak 15 minute count
    * am_hour: The AM peak hour count
    * pm_15: The PM peak 15 minute count
    * pm_hour: The PM peak hour count
* time: Time of each count (in datetime format)
* volume: Count

*tbl_manualcount_volumes:* This table contains the main volume counts for each approach at the intersection. The number of tuples for each form will vary depending on the number of hours surveyed. A 6-hour count will have 6 hours * 3 movements (left, right, through) * 4 approach directions = 72 tuples. Each tuple will have the following information:
* count_no: Unique identifier assigned to the summary in "tbl_manualcount_info"
* volume: Count
* approach: Northbound (NB) / Southbound (SB) / Eastbound (EB) / Westbound (WB)
* movement: Right-Turn (Rt) / Through (Th) / Left-Turn (Lt)
* start_time: Start time of that count, in datetime format
* end_time: End time of that count, in datetime format
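
The reshaping into one tuple per hour, approach, and movement could look roughly like the sketch below. It is an illustration under stated assumptions, not the code in `VolumeCountSheets_V2`: the function `volume_tuples` and the `grid` input layout are hypothetical, and the volumes are stored as integers here for simplicity. The SB right-turn values are taken from the sample output in this notebook; every other cell is filled with a placeholder zero.

```python
from datetime import datetime, timedelta
from itertools import product

APPROACHES = ["NB", "SB", "EB", "WB"]
MOVEMENTS = ["Lt", "Th", "Rt"]


def volume_tuples(count_date, start_hours, grid):
    """Flatten a scraped grid into one dict per (approach, movement, hour).

    `grid` maps (approach, movement) to a list of volumes, one per start hour.
    """
    rows = []
    for approach, movement in product(APPROACHES, MOVEMENTS):
        for hour, volume in zip(start_hours, grid[(approach, movement)]):
            start = datetime(count_date.year, count_date.month, count_date.day, hour)
            rows.append({
                "approach": approach,
                "movement": movement,
                "start_time": start,
                "end_time": start + timedelta(hours=1),
                "volume": volume,
            })
    return rows


# A 6-hour count: three morning hours and three afternoon hours.
start_hours = [7, 8, 9, 15, 16, 17]

# Placeholder grid of zeros, with one real series from the sample output.
grid = {key: [0] * len(start_hours) for key in product(APPROACHES, MOVEMENTS)}
grid[("SB", "Rt")] = [81, 6, 3, 2, 6, 4]

rows = volume_tuples(datetime(2015, 6, 11), start_hours, grid)
# 6 hours * 3 movements * 4 approaches
print(len(rows))
```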

*tbl_manualcount_peds:* This table contains pedestrian and schoolchildren counts during the same time as the main volume counts, so the number of tuples will also be dependent on the number of hours the location was surveyed. Each tuple will have the following information:
* count_no: Unique identifier assigned to the summary in "tbl_manualcount_info"
* xing_leg: The leg of the intersection that is being crossed. South Leg (SL) / North Leg (NL) / West Leg (WL) / East Leg (EL)
* type: Ped / Sch
* volume: Count
* start_time: Start time of that count, in datetime format
* end_time: End time of that count, in datetime format
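
To show how `count_no` ties the tables together, here is a rough sketch that creates two of the tables in an in-memory SQLite database and joins them. The column types, the choice of SQLite, the integer `count_no`, and the street names are assumptions made for the example; the volume rows reuse values from the sample output in this notebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Column types are assumptions; dates and times are stored as text here.
conn.executescript("""
CREATE TABLE tbl_manualcount_info (
    count_no   INTEGER PRIMARY KEY,
    street_ns  TEXT,
    street_ew  TEXT,
    dayofweek  TEXT,
    date       TEXT,
    weather    TEXT,
    hours      TEXT,
    school_day TEXT,
    int_code   TEXT,
    district   TEXT
);
CREATE TABLE tbl_manualcount_volumes (
    count_no   INTEGER REFERENCES tbl_manualcount_info(count_no),
    approach   TEXT,
    movement   TEXT,
    start_time TEXT,
    end_time   TEXT,
    volume     INTEGER
);
""")

# One summary row (placeholder street names) and two of its volume rows.
conn.execute(
    "INSERT INTO tbl_manualcount_info (count_no, street_ns, street_ew) VALUES (?, ?, ?)",
    (1, "EXAMPLE NS ST", "EXAMPLE EW ST"),
)
conn.executemany(
    "INSERT INTO tbl_manualcount_volumes VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "SB", "Rt", "2015-06-11 07:00", "2015-06-11 08:00", 81),
        (1, "SB", "Rt", "2015-06-11 08:00", "2015-06-11 09:00", 6),
    ],
)

# Join on count_no to total the volumes recorded for each summary.
total = conn.execute("""
    SELECT i.street_ns, SUM(v.volume)
    FROM tbl_manualcount_info AS i
    JOIN tbl_manualcount_volumes AS v ON v.count_no = i.count_no
    GROUP BY i.count_no
""").fetchone()
print(total)
```

The same pattern would extend to the dual-wheel, peak, and pedestrian tables, each carrying `count_no` as the link back to the summary row.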


In [5]:
# Import the extraction module and supporting libraries
import VolumeCountSheets_V2
import csv
import glob
from datetime import datetime, date, time
import pdfquery

# Raw string so the backslashes in the Windows path are never read as escapes
files = glob.glob(r'TrafficCountData\Manual Counts-20170526T005814Z-001\Manual Counts\*.pdf')

print(len(files))
success = 0
failures = 0

for file in files:
    # Stop after the first 10 successful extractions
    if success >= 10:
        break
    print(success)
    try:
        Manual_TC = VolumeCountSheets_V2.pdf_extract(file)
        print(Manual_TC['Volume'])
        success = success + 1
    except Exception:
        # Count the failure and move on to the next file
        print("failure")
        failures = failures + 1
print("Success Count")
print(success)
print("Failure Count")
print(failures)


1018
0
success!
[{'volume': '81', 'start_time': datetime.datetime(2015, 6, 11, 7, 0), 'approach': 'SB', 'end_time': datetime.datetime(2015, 6, 11, 8, 0), 'movement': 'Rt'}, {'volume': '6', 'start_time': datetime.datetime(2015, 6, 11, 8, 0), 'approach': 'SB', 'end_time': datetime.datetime(2015, 6, 11, 9, 0), 'movement': 'Rt'}, {'volume': '3', 'start_time': datetime.datetime(2015, 6, 11, 9, 0), 'approach': 'SB', 'end_time': datetime.datetime(2015, 6, 11, 10, 0), 'movement': 'Rt'}, {'volume': '2', 'start_time': datetime.datetime(2015, 6, 11, 15, 0), 'approach': 'SB', 'end_time': datetime.datetime(2015, 6, 11, 16, 0), 'movement': 'Rt'}, {'volume': '6', 'start_time': datetime.datetime(2015, 6, 11, 16, 0), 'approach': 'SB', 'end_time': datetime.datetime(2015, 6, 11, 17, 0), 'movement': 'Rt'}, {'volume': '4', 'start_time': datetime.datetime(2015, 6, 11, 17, 0), 'approach': 'SB', 'end_time': datetime.datetime(2015, 6, 11, 18, 0), 'movement': 'Rt'}, {'volume': '75', 'start_time': datetime.datet
