# Test `scrape_packages.py`

This notebook tests the `scrape_packages.py` script interactively.

We start by importing the functions from the script and building a list of PDF files to test against.

In [1]:
from pathlib import Path

from scripts.scrape_packages import (
    read_single_page_from_pdf,
    extract_data_from_columns,
    parse_single_page
)

In [2]:
data_folder = Path("../data")
paving_package = data_folder / "paving_package"

filepaths = list(paving_package.rglob("*.pdf"))

## Read the raw data

The function `read_single_page_from_pdf()` reads a single page from a PDF and makes a few minimal changes, including dropping the first two header rows and the total row (if present).

In [6]:
test_file = filepaths[1]

df = read_single_page_from_pdf(test_file, page_number=1)

df

Unnamed: 0,project_num,SR_name,from,to,scope,miles,muni,adt
2,1,SR 1013\rSpringfield Rd,0010/0000\r0011/0000\rSproul Rd / SR 0320,0020/1115\r0021/1170\rWest Chester Pk / SR 0003,"Mill &\r1 1/2"" Overlay",0.80\r0.82,Marple,15000
3,2,SR 2002\r2nd Street,0010/0000\rIndustrial Hwy / SR 0291,0040/2604\rFourth Ave / SR 2029,"Mill &\r1 1/2"" Overlay",1.53,Tinicum,5800
4,3,SR 2009\rSpringfield Rd,0080/0000\rNetherwood Dr.,0130/1670\rSproul Rd / SR 0320,"Mill &\r1 1/2"" Overlay",2.48,"Marple, Springfield",18843
5,4,SR 2013\rClifton Ave,0010/0000\rHook Rd / SR 2015,0060/2830\rSpringfield Rd / SR 2009,"Mill &\r1 1/2"" Overlay",2.26,"Sharon Hill B.,\rCollingdale, Aldan",9000
6,5,SR 2016\rBaltimore Pk,0060/0000\r0061/0000\rProvidence Rd / SR 0252,0140/3471\r0141/3461\rOak Ave / SR 2015,"Mill &\r1 1/2"" Overlay",4.36\r4.35,"Swarthmore, Morton,\rMedia",34000
7,6,SR 2028\rMorton Ave,0010/0000\rFourth St / SR 0291,0010/1636\rNinth St / SR 0013,"Mill &\r1 1/2"" Overlay",0.31,Chester City,7600
8,7,SR 2031\rSellers Ave,0034/0000\rFairmount Rd,0040/2186\rWard Ave / Ridley Ave /\rSR 2004,"Mill &\r1 1/2"" Overlay",0.64,"Ridley Park, Ridley",9300
9,8,SR 2037\rCrum Creek Rd,0010/0000\rBeatty Rd / SR 2018,0020/2785\rState Rd / SR 1008,"Mill &\r1 1/2"" Overlay",1.09,"U. Providence,\rNether Providence",328
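
The cleanup described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in `scrape_packages.py`: it uses plain Python lists instead of a DataFrame, the sample header and total rows are made up, and the rule for spotting a total row (a cell that reads "Total") is a guess. The helper name `drop_headers_and_total` is hypothetical.

```python
# Hypothetical stand-in for the raw rows of one PDF page. The header and
# total rows below are invented for illustration only.
raw_rows = [
    ["", "Project", "State Route", "Miles"],   # header row 1 (made up)
    ["", "No.", "Name", ""],                   # header row 2 (made up)
    ["1", "1", "SR 1013\rSpringfield Rd", "0.80\r0.82"],
    ["2", "2", "SR 2002\r2nd Street", "1.53"],
    ["", "", "Total", "2.35"],                 # total row (made up)
]


def drop_headers_and_total(rows, header_rows=2):
    """Drop the leading header rows and, if present, a trailing total row."""
    body = rows[header_rows:]
    # Assumption: a total row is recognizable by a cell that reads "Total".
    if body and any(cell.strip().lower() == "total" for cell in body[-1]):
        body = body[:-1]
    return body


cleaned = drop_headers_and_total(raw_rows)
```

Dropping two leading rows without renumbering would be consistent with the index in the output above starting at 2.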


## Parse data from source columns into new columns

The function `extract_data_from_columns()` parses source columns such as `SR_name`, `muni`, `from`, and `to` into new columns: `sr`, `name`, `muni1` through `muni3`, `fsegment`, `foffset`, `tsegment`, and `toffset`.

In [7]:
df = extract_data_from_columns(df)

df

Unnamed: 0,project_num,SR_name,from,to,scope,miles,muni,adt,sr,name,muni1,muni2,muni3,fsegment,foffset,tsegment,toffset
2,1,SR 1013\rSpringfield Rd,0011/0000\rSproul Rd / SR 0320,0021/1170\rWest Chester Pk / SR 0003,"Mill &\r1 1/2"" Overlay",0.80\r0.82,Marple,15000,1013,Springfield Rd,Marple,,,10,0,20,1115
3,2,SR 2002\r2nd Street,Industrial Hwy / SR 0291,Fourth Ave / SR 2029,"Mill &\r1 1/2"" Overlay",1.53,Tinicum,5800,2002,2nd Street,Tinicum,,,10,0,40,2604
4,3,SR 2009\rSpringfield Rd,Netherwood Dr.,Sproul Rd / SR 0320,"Mill &\r1 1/2"" Overlay",2.48,"Marple, Springfield",18843,2009,Springfield Rd,Marple,Springfield,,80,0,130,1670
5,4,SR 2013\rClifton Ave,Hook Rd / SR 2015,Springfield Rd / SR 2009,"Mill &\r1 1/2"" Overlay",2.26,"Sharon Hill B.,\rCollingdale, Aldan",9000,2013,Clifton Ave,Sharon Hill B.,\rCollingdale,Aldan,10,0,60,2830
6,5,SR 2016\rBaltimore Pk,0061/0000\rProvidence Rd / SR 0252,0141/3461\rOak Ave / SR 2015,"Mill &\r1 1/2"" Overlay",4.36\r4.35,"Swarthmore, Morton,\rMedia",34000,2016,Baltimore Pk,Swarthmore,Morton,\rMedia,60,0,140,3471
7,6,SR 2028\rMorton Ave,Fourth St / SR 0291,Ninth St / SR 0013,"Mill &\r1 1/2"" Overlay",0.31,Chester City,7600,2028,Morton Ave,Chester City,,,10,0,10,1636
8,7,SR 2031\rSellers Ave,Fairmount Rd,Ward Ave / Ridley Ave /\rSR 2004,"Mill &\r1 1/2"" Overlay",0.64,"Ridley Park, Ridley",9300,2031,Sellers Ave,Ridley Park,Ridley,,34,0,40,2186
9,8,SR 2037\rCrum Creek Rd,Beatty Rd / SR 2018,State Rd / SR 1008,"Mill &\r1 1/2"" Overlay",1.09,"U. Providence,\rNether Providence",328,2037,Crum Creek Rd,U. Providence,\rNether Providence,,10,0,20,2785
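
To make the parsing step more concrete, here is a minimal sketch of how cells like the ones above could be split with the standard library. It is an assumption-laden illustration, not the script's implementation: the helper names `split_location`, `split_sr_name`, and `split_munis` are hypothetical, and the sketch assumes segment/offset pairs always look like `0010/0000` and that stacked values are separated by `\r`.

```python
import re

# Matches a four-digit "segment/offset" pair such as "0010/0000".
SEG_OFFSET = re.compile(r"^(\d{4})/(\d{4})$")


def split_location(cell):
    """Split a from/to cell into (segment, offset) pairs and a description."""
    pairs, text = [], []
    for part in cell.split("\r"):
        match = SEG_OFFSET.match(part.strip())
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
        else:
            text.append(part.strip())
    return pairs, " ".join(text)


def split_sr_name(cell):
    """Split an SR_name cell into the route number and the road name."""
    route, name = cell.split("\r", 1)
    return route.replace("SR", "", 1).strip(), name.strip()


def split_munis(cell):
    """Split a comma-separated municipality list, trimming whitespace."""
    return [m.strip() for m in cell.split(",") if m.strip()]


pairs, description = split_location("0010/0000\r0011/0000\rSproul Rd / SR 0320")
sr, name = split_sr_name("SR 1013\rSpringfield Rd")
munis = split_munis("Sharon Hill B.,\rCollingdale, Aldan")
```

Note that `split_munis` trims the carriage return in front of `Collingdale`, whereas the `muni2` value in the output above still carries it.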


## Explode grouped rows into two

The function `parse_single_page()` runs through each row and splits any grouped row (one that stacks two segment/offset pairs and two mileage values, such as projects 1 and 5) into two rows. It calls the previous two functions internally: `read_single_page_from_pdf()` and `extract_data_from_columns()`.

In [8]:
df = parse_single_page(test_file, page_number=1)

df

Unnamed: 0,project_num,SR_name,from,to,scope,miles,muni,adt,sr,name,muni1,muni2,muni3,fsegment,foffset,tsegment,toffset
0,1,SR 1013\rSpringfield Rd,Sproul Rd / SR 0320,West Chester Pk / SR 0003,"Mill &\r1 1/2"" Overlay",0.80,Marple,15000,1013,Springfield Rd,Marple,,,10,0,20,1115
1,2,SR 2002\r2nd Street,Industrial Hwy / SR 0291,Fourth Ave / SR 2029,"Mill &\r1 1/2"" Overlay",1.53,Tinicum,5800,2002,2nd Street,Tinicum,,,10,0,40,2604
2,3,SR 2009\rSpringfield Rd,Netherwood Dr.,Sproul Rd / SR 0320,"Mill &\r1 1/2"" Overlay",2.48,"Marple, Springfield",18843,2009,Springfield Rd,Marple,Springfield,,80,0,130,1670
3,4,SR 2013\rClifton Ave,Hook Rd / SR 2015,Springfield Rd / SR 2009,"Mill &\r1 1/2"" Overlay",2.26,"Sharon Hill B.,\rCollingdale, Aldan",9000,2013,Clifton Ave,Sharon Hill B.,\rCollingdale,Aldan,10,0,60,2830
4,5,SR 2016\rBaltimore Pk,0061/0000\rProvidence Rd / SR 0252,0141/3461\rOak Ave / SR 2015,"Mill &\r1 1/2"" Overlay",4.36\r4.35,"Swarthmore, Morton,\rMedia",34000,2016,Baltimore Pk,Swarthmore,Morton,\rMedia,60,0,140,3471
5,6,SR 2028\rMorton Ave,Fourth St / SR 0291,Ninth St / SR 0013,"Mill &\r1 1/2"" Overlay",0.31,Chester City,7600,2028,Morton Ave,Chester City,,,10,0,10,1636
6,7,SR 2031\rSellers Ave,Fairmount Rd,Ward Ave / Ridley Ave /\rSR 2004,"Mill &\r1 1/2"" Overlay",0.64,"Ridley Park, Ridley",9300,2031,Sellers Ave,Ridley Park,Ridley,,34,0,40,2186
7,8,SR 2037\rCrum Creek Rd,Beatty Rd / SR 2018,State Rd / SR 1008,"Mill &\r1 1/2"" Overlay",1.09,"U. Providence,\rNether Providence",328,2037,Crum Creek Rd,U. Providence,\rNether Providence,,10,0,20,2785
8,1,SR 1013\rSpringfield Rd,Sproul Rd / SR 0320,West Chester Pk / SR 0003,"Mill &\r1 1/2"" Overlay",0.82,Marple,15000,1013,Springfield Rd,Marple,,,11,0,21,1170
9,5,SR 2016\rBaltimore Pk,Providence Rd / SR 0252,Oak Ave / SR 2015,"Mill &\r1 1/2"" Overlay",4.35,"Swarthmore, Morton,\rMedia",34000,2016,Baltimore Pk,Swarthmore,Morton,\rMedia,61,0,141,3461
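
The row-splitting idea can be sketched as below. This is a simplified illustration rather than the logic inside `parse_single_page()`: the function name `explode_grouped_row` is hypothetical, rows are plain dictionaries instead of DataFrame rows, and the sketch assumes the n-th segment/offset pair in `from` lines up with the n-th pair in `to` and the n-th value in `miles`.

```python
import re

# Finds every four-digit "segment/offset" pair inside a cell.
PAIR = re.compile(r"(\d{4})/(\d{4})")


def explode_grouped_row(row):
    """Return one output row per segment/offset pair found in the row."""
    from_pairs = PAIR.findall(row["from"])
    to_pairs = PAIR.findall(row["to"])
    miles = row["miles"].split("\r")
    exploded = []
    for (fseg, foff), (tseg, toff), mile in zip(from_pairs, to_pairs, miles):
        new_row = dict(row)  # copy shared columns such as project_num
        new_row.update(
            fsegment=int(fseg), foffset=int(foff),
            tsegment=int(tseg), toffset=int(toff),
            miles=float(mile),
        )
        exploded.append(new_row)
    return exploded


# Project 1 from the raw page, which stacks two values per cell.
grouped = {
    "project_num": 1,
    "from": "0010/0000\r0011/0000\rSproul Rd / SR 0320",
    "to": "0020/1115\r0021/1170\rWest Chester Pk / SR 0003",
    "miles": "0.80\r0.82",
}
rows = explode_grouped_row(grouped)
```

For project 1 this yields the same segment, offset, and mileage values as rows 0 and 8 in the output above. The sketch returns the two rows next to each other, while the output above places the second row of each pair at the end of the table.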
