# Scraping LADOT Volume Data from PDFs, Part 2

##### Where I Left Off
In the first Python notebook, I described the problem and my general approach to solving it. I downloaded all the manual count PDF documents and set up the file tables for later use.

## PDF Text Extraction Process
As I discussed in the first Python notebook, my process for extracting the text into usable data breaks down into four key parts: (1) define bounding boxes, (2) search for text within the bounding boxes, (3) reformat the resulting text into multiple data tables, and (4) join the resulting tables to the ID established by the Bureau of Engineering.

##### Define Bounding Boxes
This was tricky. I initially began defining bounding boxes using pixel measurements from a few sample pages. However, I quickly realized this would not work because of the second challenge I mentioned earlier: the tables are in different locations from one PDF document to the next.

Instead, I decided to create bounding boxes on the fly for each document, using the relative positions of certain keywords that appear on almost every PDF document. Using pdfquery, I could search the document for these keywords and then extract the x,y coordinates of the bounding box of each one. By getting the coordinates of multiple keyword objects on the page, I could construct a set of bounding boxes that performed relatively well in capturing the data tables.
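
As a minimal sketch of the idea, the function below builds one table bounding box from the boxes of two keyword anchors. The anchor coordinates are made-up placeholders, and `table_bbox` and `pad` are hypothetical names, not part of my extraction module. In practice the anchor values would come from pdfquery, for example `float(pdf.pq('LTTextLineHorizontal:contains("...")').attr('x0'))`.

```python
def table_bbox(anchor_a, anchor_b, pad=2.0):
    """Return (x0, y0, x1, y1) enclosing two keyword boxes, padded on every side.

    Each anchor is a dict with x0, y0, x1, y1 keys. PDF coordinates start at
    the bottom-left corner of the page, so y grows upward.
    """
    x0 = min(anchor_a["x0"], anchor_b["x0"]) - pad
    y0 = min(anchor_a["y0"], anchor_b["y0"]) - pad
    x1 = max(anchor_a["x1"], anchor_b["x1"]) + pad
    y1 = max(anchor_a["y1"], anchor_b["y1"]) + pad
    return (x0, y0, x1, y1)


# Placeholder anchors: one keyword near the top-left of a table,
# one near the bottom-right (values are illustrative only)
header_kw = {"x0": 100.0, "y0": 700.0, "x1": 160.0, "y1": 712.0}
footer_kw = {"x0": 400.0, "y0": 500.0, "x1": 450.0, "y1": 512.0}

bbox = table_bbox(header_kw, footer_kw)
print(bbox)
```

Because the box is derived from the keywords rather than fixed measurements, it moves with the table when the layout shifts between documents.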

(create image of what this looked like)

##### Search for Text within Bounding Boxes
Once I had the coordinates of the bounding boxes, this part was quite easy: I used PDFQuery to extract the text inside each one.
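
A rough sketch of this step: pdfquery accepts an `in_bbox` pseudo-selector, so a bounding box can be turned into a selector string and passed to `pdf.pq(...)`. The helper name `in_bbox_selector` and the coordinates are assumptions for illustration, and the pdfquery call itself is shown only as a comment because it needs a loaded PDF.

```python
def in_bbox_selector(bbox, element="LTTextLineHorizontal"):
    """Format a pdfquery selector matching elements that lie fully inside bbox."""
    x0, y0, x1, y1 = bbox
    return '{}:in_bbox("{}, {}, {}, {}")'.format(element, x0, y0, x1, y1)


# Placeholder bounding box (x0, y0, x1, y1) in PDF coordinates
selector = in_bbox_selector((98.0, 498.0, 452.0, 714.0))
print(selector)

# With a loaded pdfquery.PDFQuery object named pdf (not created here):
#     text = pdf.pq(selector).text()
```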

(do i need to adjust any of the parameters in rapidminer??)

##### Reformat the Resulting Text into Data Tables
The final problem was taking the scraped text from the bounding boxes and reformatting it into usable data tables. I kept the relational database model in mind as I set the format for these tables. From the PDF image above, I decided on the following tables and attributes:

*tbl_manualcount_info:* This table contains the basic information about the manual count summary. Each count will have one tuple with the following information:
* street_ns: The North / South Street running through the intersection
* street_ew: The East / West street running through the intersection
* dayofweek: Day of the Week
* date: Date, in datetime format
* weather: Prevailing weather at the time of the count (Clear, Sunny, etc.)
* hours: the hours of the count (text)
* school_day: A Yes / No indication of whether the count occurred on a school day. This is important because it heavily affects the volume counts
* int_code: The "I/S Code" on the form corresponds to the CL_Node_ID on the BOE Centerline. This ID field makes it easy to join to the City's centerline network
* district: The DOT field district in which the count took place
* count_id: Unique identifier assigned to the summary

*tbl_manualcount_dualwheel:* This table contains count data for dual-wheeled (motorcycles), bikes, and buses. Each form will have 12 tuples with the following information:
* count_id: Unique identifier assigned to the summary in "tbl_manualcount_info"
* approach: Intersection approach being measured (N,S,E,W)
* type: Dual-Wheeled / Bikes / Buses
* volume: Count

*tbl_manualcount_peak:* This table contains the peak hour / 15 minute counts. Each form will have 16 tuples with the following information:
* count_id: Unique identifier assigned to the summary in "tbl_manualcount_info"
* approach: Intersection approach being measured (N,S,E,W)
* type: The type of count
    * am_15: The AM peak 15 minute count
    * am_hour: The AM peak hour count
    * pm_15: The PM peak 15 minute count
    * pm_hour: The PM peak hour count
* time: Time of each count (in datetime format)
* volume: Count

*tbl_manualcount_volumes:* This table contains the main volume counts for each approach at the intersection. The number of tuples for each form will vary depending on the number of hours surveyed. A 6-hour count will have 6 hours * 3 movements (left, right, through) * 4 approach directions = 72 tuples. Each tuple will have the following information:
* count_id: Unique identifier assigned to the summary in "tbl_manualcount_info"
* approach: Northbound (NB) / Southbound (SB) / Eastbound (EB) / Westbound (WB)
* movement: Right-Turn (Rt) / Through (Th) / Left-Turn (Lt)
* start_time: Start time of that count, in datetime format
* end_time: End time of that count, in datetime format
* volume: Count

*tbl_manualcount_peds:* This table contains pedestrian and schoolchildren counts during the same time as the main volume counts, so the number of tuples will also be dependent on the number of hours the location was surveyed. Each tuple will have the following information:
* count_id: Unique identifier assigned to the summary in "tbl_manualcount_info"
* xing_leg: The leg of the intersection that is being crossed. South Leg (SL) / North Leg (NL) / West Leg (WL) / East Leg (EL)
* type: Ped / Sch
* start_time: Start time of that count, in datetime format
* end_time: End time of that count, in datetime format
* volume: Count
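
The expected number of tuples per form follows directly from the categories listed above, which gives a cheap sanity check on each scraped form. This is a small sketch under that reading; `expected_rows` is a hypothetical helper, and the pedestrian table is left out because its row count depends on how the hourly values are stored.

```python
APPROACHES = ("N", "S", "E", "W")
DUALWHEEL_TYPES = ("Dual-Wheeled", "Bikes", "Buses")
PEAK_TYPES = ("am_15", "am_hour", "pm_15", "pm_hour")
MOVEMENTS = ("Rt", "Th", "Lt")


def expected_rows(hours_surveyed):
    """Return the number of tuples one count form should contribute per table."""
    return {
        "tbl_manualcount_info": 1,
        "tbl_manualcount_dualwheel": len(APPROACHES) * len(DUALWHEEL_TYPES),
        "tbl_manualcount_peak": len(APPROACHES) * len(PEAK_TYPES),
        "tbl_manualcount_volumes": hours_surveyed * len(MOVEMENTS) * len(APPROACHES),
    }


rows_6hr = expected_rows(6)
print(rows_6hr)
```

A form whose scraped row counts differ from these totals is a good candidate for manual review.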


In [4]:
### Setup
import csv
import glob
from datetime import datetime, date, time
import pdfquery
import pandas as pd
import numpy as np
import os
from pathlib2 import Path

### Load NavLA Count file table
count_files_df = pd.read_csv('data/TrafficCountFileStructure/navLAfiles.csv',index_col=0)
print("There are {} total count files".format(len(count_files_df)))
count_files_df.head()

There are 26840 total count files


| count_id | cl_node_id | location | traffic_id | type | file |
| --- | --- | --- | --- | --- | --- |
| 1 | 3667 | ISLAND AVE at L ST | 1 | manual | 3667_ISLLST94.PDF |
| 2 | 3680 | FIGUEROA ST at L ST | 2 | manual | 3680_FIGLST94.PDF |
| 3 | 3727 | FRIES AVE at HARRY BRIDGES BLVD | 3 | manual | 3727_FRIHAR95.PDF |
| 4 | 3787 | ANAHEIM ST AT FARRAGUT AVE | 4 | manual | 3787_ANAFAR100817.PDF |
| 5 | 3787 | ANAHEIM ST AT FARRAGUT AVE | 4 | manual | FARRAGUT.ANAHEIM.110119-MAN.PDF |


### Step 1: Prepare the Data Tables
Based on the outline above, the first step is to prepare dataframes for each of the volume-related tables discussed above. Once I have the dataframe table structures, the next step involves looping through all the manual count PDFs, running my text extract function, and then inserting the rows into the appropriate tables. 

In [5]:
import pandas as pd

### Load NavLA Count file table
count_files_df = pd.read_csv('data/TrafficCountFileStructure/navLAfiles.csv',index_col=0)
manual_count_files_df = count_files_df[(count_files_df['type'] == 'manual')].reset_index()
print("Among the {} total count files, {}, {:.1%} are manual count files".format(len(count_files_df), len(manual_count_files_df), (len(manual_count_files_df)/len(count_files_df))))
manual_count_files_df.head()

Among the 26840 total count files, 9460, 35.2% are manual count files


|  | count_id | cl_node_id | location | traffic_id | type | file |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 3667 | ISLAND AVE at L ST | 1 | manual | 3667_ISLLST94.PDF |
| 1 | 2 | 3680 | FIGUEROA ST at L ST | 2 | manual | 3680_FIGLST94.PDF |
| 2 | 3 | 3727 | FRIES AVE at HARRY BRIDGES BLVD | 3 | manual | 3727_FRIHAR95.PDF |
| 3 | 4 | 3787 | ANAHEIM ST AT FARRAGUT AVE | 4 | manual | 3787_ANAFAR100817.PDF |
| 4 | 5 | 3787 | ANAHEIM ST AT FARRAGUT AVE | 4 | manual | FARRAGUT.ANAHEIM.110119-MAN.PDF |


### Step 2: Run the PDF text scraping script (extract / transform)
The powerhouse behind this process is the script I built to read a PDF and extract text into a list of dictionaries (one for each table). Because this script is so long, I opted not to include the code in the notebook, and instead treat it as a module, "VolumeCountSheets_V2", that I import into the notebook. I call the "pdf_extract" function from the module on each PDF.

### Step 3: Append each resulting dictionary to the Pandas dataframe (load)
The function returns a list of dictionaries, one for each dataframe. I then take the pandas dataframes I constructed in the cell above and append each dictionary to the appropriate one.
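
The three steps can be sketched as a small extract / transform / load loop. Everything here is a stand-in: `fake_pdf_extract` is a hypothetical placeholder for the real "pdf_extract" function, it returns a dict keyed by table name (an assumption about the return shape), and the row values are placeholders rather than scraped data.

```python
from collections import defaultdict


def fake_pdf_extract(path):
    """Hypothetical stand-in for pdf_extract: rows grouped by table name."""
    return {
        "tbl_manualcount_info": [{"street_ns": "placeholder", "street_ew": "placeholder"}],
        "tbl_manualcount_peds": [{"xing_leg": "SL", "type": "Ped", "volume": 0}],
    }


def load_counts(files, extract=fake_pdf_extract):
    """Run extract on each (count_id, path) pair and collect rows per table."""
    tables = defaultdict(list)
    for count_id, path in files:
        for table_name, rows in extract(path).items():
            for row in rows:
                # Tag every row with the count it came from so tables can be joined
                tables[table_name].append(dict(row, count_id=count_id))
    return dict(tables)


tables = load_counts([(1, "3667_ISLLST94.PDF"), (2, "3680_FIGLST94.PDF")])
print({name: len(rows) for name, rows in tables.items()})
```

Each list of row dictionaries can then be turned into a dataframe in one call with `pd.DataFrame.from_records(...)`, which tends to be faster than appending to a dataframe row by row.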

In [9]:
import cv2
import numpy as np
from operator import itemgetter, attrgetter
try:
    import Image
except ImportError:
    from PIL import Image
import pytesseract

# Set path to tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files (x86)\Tesseract-OCR\tesseract'
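# Define config parameters.
# '-l eng'  uses the English language data
# '--oem 0' uses the legacy (non-LSTM) OCR engine
# '--psm'   sets the page segmentation mode
# '-c tessedit_char_whitelist=0123456789' restricts recognition to digits
config = ('-l eng --oem 0 --psm 10000 -c tessedit_char_whitelist=0123456789')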

def GetContours(img):
    # Prep image
    imgray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    ret, thresh = cv2.threshold(imgray, 127, 255, 0)
    # Run contour analysis, sort by contour area (descending)
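    # findContours returns 3 values in OpenCV 3.x but 2 in OpenCV 2.x / 4.x;
    # the contour list is always the second-to-last item
    contours = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)[-2]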
    contours = sorted(contours, key=cv2.contourArea, reverse = True)
    return(contours)

def CropImage(img, contour):
    (x, y, w, h) = cv2.boundingRect(contour)
    crop_img = img[y:y+h, x:x+w]
    return(crop_img)

def TesseractText(img):
    text = pytesseract.image_to_string(img, config=config)
    counts = list(map(int, text.split()))
    # hmm, maybe here I should be concatenating everything, instead of eventually
    # only returning the first object in the list
    return(counts)

def ExtractCellVal(cells, img):
    vol = []
    # for each cell, crop & extract text
    for cell in cells:
        (x, y, w, h) = cell[1], cell[2], cell[3], cell[4]
        crop_img = img[y:y+h, x:x+w]
        val = TesseractText(crop_img)
        vol.append(val[0])
    return(vol)
    
def SortPedCells(contours):
    # Get the bounding box of each contour
    contour_list = []
    contour_len = len(contours)
    for contour in contours:
        (x, y, w, h) = cv2.boundingRect(contour)
        contour_list.append([contour, x, y, w, h])
    # Rows mix a contour array with plain ints, so store them as objects
    contour_a = np.array(contour_list, dtype=object)
    # Sort by x coordinate, split by number of columns (in this case, 2)
    contour_a = contour_a[contour_a[:,1].argsort()]
    pedvol = contour_a[:6]
    schvol = contour_a[6:]
    # Sort top to bottom (ascending y, since image y grows downward)
    pedvol = pedvol[pedvol[:,2].argsort()]
    schvol = schvol[schvol[:,2].argsort()]
    return(pedvol, schvol)
    
def AnalyzePedCrossingTable(img, pedtbl_contour):
    # Crop Image, get new contours
    crop_img = CropImage(img, pedtbl_contour[0])
    pedvol_contours = GetContours(crop_img)
    pedvol_cells = pedvol_contours[2:14]
    pedvol_cells, schvol_cells = SortPedCells(pedvol_cells)
    pedvol = ExtractCellVal(pedvol_cells, crop_img)
    schvol = ExtractCellVal(schvol_cells, crop_img)
    return(dict([("Ped", pedvol), ("Sch", schvol)]))
    
def GetPedData(img):
    ped_tbl_contours = GetContours(img)[5:9]
    ped_tbls = []
    for ped_tbl_contour in ped_tbl_contours:
        (x, y, w, h) = cv2.boundingRect(ped_tbl_contour)
        ped_tbls.append([ped_tbl_contour, x, y, w, h])
    # Rows mix a contour array with plain ints, so store them as objects
    ped_tbls = np.array(ped_tbls, dtype=object)
    ped_tbls = sorted(ped_tbls, key=itemgetter(1))
    ped_tbls = sorted(ped_tbls, key=itemgetter(2))
    
    ped_sch_extract = {}
    ped_sch_extract['SL'] = AnalyzePedCrossingTable(img, ped_tbls[0])
    ped_sch_extract['NL'] = AnalyzePedCrossingTable(img, ped_tbls[1])
    ped_sch_extract['WL'] = AnalyzePedCrossingTable(img, ped_tbls[2])
    ped_sch_extract['EL'] = AnalyzePedCrossingTable(img, ped_tbls[3])
    
    # Format as final df
    ped_sch_data = []
    for leg in ped_sch_extract:
        for pedtype in ped_sch_extract[leg]:

            ped_sch_dict = {}
            ped_sch_dict['xing_leg'] = leg
            ped_sch_dict['type'] = pedtype
            ped_sch_dict['volume'] = sum(ped_sch_extract[leg][pedtype])
            ped_sch_data.append(ped_sch_dict)
    
    return(ped_sch_data)
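
The cell-sorting logic in `SortPedCells` is easier to check in isolation. The sketch below applies the same idea to plain `(x, y, w, h)` tuples with no OpenCV dependency: sort by x to separate the columns, then sort each column by y so cells read top to bottom. The function name `split_into_columns` and the sample boxes are hypothetical.

```python
def split_into_columns(boxes, n_cols=2):
    """Group (x, y, w, h) boxes into n_cols columns, left to right.

    Assumes every column holds the same number of cells. Each column is
    returned sorted top to bottom (ascending y in image coordinates).
    """
    by_x = sorted(boxes, key=lambda b: b[0])
    per_col = len(by_x) // n_cols
    columns = [by_x[i * per_col:(i + 1) * per_col] for i in range(n_cols)]
    return [sorted(col, key=lambda b: b[1]) for col in columns]


# Four illustrative cells: two columns (x=10, x=50) and two rows (y=10, y=30)
cells = [(50, 30, 10, 10), (10, 30, 10, 10), (10, 10, 10, 10), (50, 10, 10, 10)]
left_col, right_col = split_into_columns(cells)
print(left_col)
print(right_col)
```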

In [15]:
 %matplotlib inline
import matplotlib.pyplot as plt
import pdfquery, sys, cv2
from wand.image import Image
from wand.color import Color
from wand.display import display

# Set up counters of successful / failed attempts
success = 0
failures = 0
ped_volumes = []

##### Loop through files, run function
for index, row in manual_count_files_df.iterrows():
    
    # for testing purposes only
    if index > 250:
        break
    
    #print(row['file'])
    fileloc = 'data/TrafficCountData/Manual/All/' + row['file']
    
    # ID for the count
    count_id = row['count_id']
    if index%100 == 0:
        print("Current Count:" + str(index+1))
        
    if Path(fileloc).exists():
        
        try:
            # Load the PDF
            pdf = pdfquery.PDFQuery(fileloc,resort=False)
            pdf.load()
            search = pdf.pq('LTTextLineHorizontal:contains("MANUAL TRAFFIC COUNT SUMMARY")')
            
            # If there is a keyword match, convert to image and run analysis
            if len(search) > 0:
                # Convert pdf -> img
                img = Image(filename = fileloc, resolution=300)
                img.background_color = Color("white")
                img.alpha_channel = 'remove'
                # Save as png to img folder
                fin, file_extension = os.path.splitext(row['file'])
                fout = 'data/TrafficCountData/Manual/Images/' + fin + '.png'
                img.save(filename=fout)
                img = cv2.imread(fout)
                # Run Ped Volume Extract Function
                Manual_TC = {}
                Manual_TC['Pedestrian'] = GetPedData(img)
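                # Append all volumes to new list
                for k in Manual_TC['Pedestrian']:
                    k['count_id'] = str(count_id)
                    ped_volumes.append(k)
                success += 1
        except Exception as e:
            # Record the failure instead of silently discarding it
            failures += 1
            print("Failed on {}: {}".format(row['file'], e))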

# Convert the list of dictionaries to a dataframe and output to csv
pedestrian_df = pd.DataFrame.from_records(ped_volumes)
pedestrian_df.to_csv(path_or_buf='data/TrafficCountData/Results/pedestrian.csv',sep=',')

3667_ISLLST94.PDF
Current Count:1
3680_FIGLST94.PDF
3727_FRIHAR95.PDF
3787_ANAFAR100817.PDF
FARRAGUT.ANAHEIM.110119-MAN.PDF
data/TrafficCountData/Manual/Images/FARRAGUT.ANAHEIM.110119-MAN.png
[{'xing_leg': 'SL', 'type': 'Ped', 'volume': 0}, {'xing_leg': 'SL', 'type': 'Sch', 'volume': 0}, {'xing_leg': 'NL', 'type': 'Ped', 'volume': 9}, {'xing_leg': 'NL', 'type': 'Sch', 'volume': 0}, {'xing_leg': 'WL', 'type': 'Ped', 'volume': 0}, {'xing_leg': 'WL', 'type': 'Sch', 'volume': 0}, {'xing_leg': 'EL', 'type': 'Ped', 'volume': 0}, {'xing_leg': 'EL', 'type': 'Sch', 'volume': 0}]
3839_DONFRI93.PDF
3975_ORCRED060627.PDF
3976_ORCRED060627.PDF
3976_ORCRED99.PDF
REDONDOBEACH.ORCHARD.150616-NDSMAN.PDF
data/TrafficCountData/Manual/Images/REDONDOBEACH.ORCHARD.150616-NDSMAN.png
3995_GARMEN04.PDF
3995_GARMEN98.PDF
GARDENA.MENLO.171019-MAN.PDF
data/TrafficCountData/Manual/Images/GARDENA.MENLO.171019-MAN.png
[{'xing_leg': 'SL', 'type': 'Ped', 'volume': 142}, {'xing_leg': 'SL', 'type': 'Sch', 'volume': 148}

4689_JEF11A97.PDF
11TH.JEFFERSON.120510-MAN.PDF
data/TrafficCountData/Manual/Images/11TH.JEFFERSON.120510-MAN.png
[{'xing_leg': 'SL', 'type': 'Ped', 'volume': 0}, {'xing_leg': 'SL', 'type': 'Sch', 'volume': 0}, {'xing_leg': 'NL', 'type': 'Ped', 'volume': 0}, {'xing_leg': 'NL', 'type': 'Sch', 'volume': 0}, {'xing_leg': 'WL', 'type': 'Ped', 'volume': 14}, {'xing_leg': 'WL', 'type': 'Sch', 'volume': 1}, {'xing_leg': 'EL', 'type': 'Ped', 'volume': 22}, {'xing_leg': 'EL', 'type': 'Sch', 'volume': 5}]
4690_11A36S01.PDF
4690_11A36S04.PDF
4690_11A36S98.PDF
4696_JEF2AV.PDF
4709_GRA36P00.PDF
4709_GRA36P95.PDF
4759_KINVAN01.PDF
4761_KINVER96.PDF
4763_KANKIN01.PDF
4763_KANMAR94.PDF
4769_MARMEN100908.PDF
4837_KAN43S98.PDF
4894_GRA48S97.PDF
4894_GRA48T92.PDF
4906_11A48S05.PDF
4906_11T48T94.PDF
4922_JEFSOM01.PDF
4922_JEFSOM03.PDF
4922_JEFSOM05.PDF
4922_JEFSOM091102.PDF
data/TrafficCountData/Manual/Images/4922_JEFSOM091102.png
[{'xing_leg': 'SL', 'type': 'Ped', 'volume': 0}, {'xing_leg': 'SL', 'type':

5731_ADAGRA110315.PDF
5731_ADAGRA93.PDF
ADAGRA06.PDF
data/TrafficCountData/Manual/Images/ADAGRA06.png
[{'xing_leg': 'SL', 'type': 'Ped', 'volume': 630}, {'xing_leg': 'SL', 'type': 'Sch', 'volume': 1}, {'xing_leg': 'NL', 'type': 'Ped', 'volume': 778}, {'xing_leg': 'NL', 'type': 'Sch', 'volume': 63}, {'xing_leg': 'WL', 'type': 'Ped', 'volume': 2}, {'xing_leg': 'WL', 'type': 'Sch', 'volume': 0}, {'xing_leg': 'EL', 'type': 'Ped', 'volume': 300}, {'xing_leg': 'EL', 'type': 'Sch', 'volume': 10}]
5736_ADABRO110614.PDF
5736_ADABRW93.PDF
5753_HIL28S98.PDF
HIL28T081017.PDF
data/TrafficCountData/Manual/Images/HIL28T081017.png
[{'xing_leg': 'SL', 'type': 'Ped', 'volume': 174}, {'xing_leg': 'SL', 'type': 'Sch', 'volume': 94}, {'xing_leg': 'NL', 'type': 'Ped', 'volume': 18}, {'xing_leg': 'NL', 'type': 'Sch', 'volume': 0}, {'xing_leg': 'WL', 'type': 'Ped', 'volume': 150}, {'xing_leg': 'WL', 'type': 'Sch', 'volume': 26}, {'xing_leg': 'EL', 'type': 'Ped', 'volume': 137}, {'xing_leg': 'EL', 'type': 'Sch

6121_CAMOLY080507.PDF
CAMULOS.OLYMPIC.130116-MAN.PDF
data/TrafficCountData/Manual/Images/CAMULOS.OLYMPIC.130116-MAN.png
[{'xing_leg': 'SL', 'type': 'Ped', 'volume': 91}, {'xing_leg': 'SL', 'type': 'Sch', 'volume': 14}, {'xing_leg': 'NL', 'type': 'Ped', 'volume': 37}, {'xing_leg': 'NL', 'type': 'Sch', 'volume': 5}, {'xing_leg': 'WL', 'type': 'Ped', 'volume': 6}, {'xing_leg': 'WL', 'type': 'Sch', 'volume': 0}, {'xing_leg': 'EL', 'type': 'Ped', 'volume': 105}, {'xing_leg': 'EL', 'type': 'Sch', 'volume': 16}]
6131_EVEOLY080507.PDF
6135_8THLOR080507.PDF
LOR8ST06.PDF
data/TrafficCountData/Manual/Images/LOR8ST06.png
[{'xing_leg': 'SL', 'type': 'Ped', 'volume': 107}, {'xing_leg': 'SL', 'type': 'Sch', 'volume': 65}, {'xing_leg': 'NL', 'type': 'Ped', 'volume': 154}, {'xing_leg': 'NL', 'type': 'Sch', 'volume': 44}, {'xing_leg': 'WL', 'type': 'Ped', 'volume': 105}, {'xing_leg': 'WL', 'type': 'Sch', 'volume': 72}, {'xing_leg': 'EL', 'type': 'Ped', 'volume': 246}, {'xing_leg': 'EL', 'type': 'Sch', '

In [51]:
# Let's compare ped volume totals from both methods

# Import original count file
ped_df1 = pd.read_csv('data/TrafficCountData/Results/20180323/pedestrian.csv')
# Some volumes were scraped as text (e.g. '0%', '5%'); coerce those to NaN
ped_df1['volume'] = pd.to_numeric(ped_df1['volume'], errors='coerce')
ped_df2 = pd.read_csv('data/TrafficCountData/Results/pedestrian.csv')
# For each df, sum volume by count_id
ped_df1_total = ped_df1.groupby('count_id')[['volume']].sum()
ped_df2_total = ped_df2.groupby('count_id')[['volume']].sum()
# Left join on count_id, keeping every count from the new results
pd_df_total = pd.merge(ped_df2_total, ped_df1_total, how='left', on='count_id')
print(len(pd_df_total))




52
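
To make the comparison concrete, here is a dependency-free sketch of the same logic: total the volumes per count for each method, then left join on `count_id` so that counts missing from the other method show up as `None`. The function names and the sample rows are illustrative assumptions, not values from the result files.

```python
from collections import defaultdict


def totals_by_count(rows):
    """Sum the volume of every row sharing a count_id."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["count_id"]] += row["volume"]
    return dict(totals)


def compare_totals(new_rows, original_rows):
    """Left join: one (new_total, original_total) pair per count in new_rows."""
    new = totals_by_count(new_rows)
    original = totals_by_count(original_rows)
    return {cid: (total, original.get(cid)) for cid, total in new.items()}


# Illustrative rows only
new_rows = [
    {"count_id": 5, "volume": 9},
    {"count_id": 8, "volume": 142},
    {"count_id": 8, "volume": 148},
]
original_rows = [{"count_id": 5, "volume": 4}, {"count_id": 5, "volume": 5}]

comparison = compare_totals(new_rows, original_rows)
print(comparison)
```

Counts where the two totals disagree, or where one side is missing, are the ones worth opening by hand.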
