# StateLegiscraper: PDF Format Example Notebook

*Author*: Katherine Chang (kachang@uw.edu)

*Last Updated*: 12/14/2021

StateLegiscraper is a Python package that scrapes and processes data from U.S. state legislature websites. As of writing, the package focuses on converting standing committee hearings from each state legislature's native archival format to text, so that the text can be used for NLP research and for public review. For more details about StateLegiscraper, please visit its [GitHub repository](https://github.com/ka-chang/StateLegiscraper), where it is under active development.

This notebook walks a new user through the StateLegiscraper workflow, with a focus on the Nevada State Legislature and working with PDF file formats. 

This notebook assumes that the user has:

- At least novice-level familiarity with Python, including importing packages, running basic functions, and saving files.
- Knowledge of Python data types, particularly lists and dictionaries.
- Comfort working in the command line, as StateLegiscraper is installed through the user's choice of terminal.
- At least 100 MB of free space on a local hard drive or a mounted cloud drive to save the raw data.

## The Nevada Context

The Nevada State Legislature is a part-time, biennial legislature: state legislators meet in odd-numbered years, from February to June. The state legislature website, [www.leg.state.nv.us](https://www.leg.state.nv.us), hosts human-transcribed transcripts of its standing committee meetings in PDF format.

## Setup

Please ensure StateLegiscraper is installed on your local drive. Please refer to the [following instructions for details](https://github.com/ka-chang/StateLegiscraper/blob/main/README.md).

The following two code chunks add your local StateLegiscraper directory to Python's module search path, which allows us to import its modules.

In [1]:
import os
from pathlib import Path
import sys

In [2]:
github_file_path = str(Path(os.getcwd()).parents[0]) #Sets to local Github directory path

sys.path.insert(1, github_file_path) 

Code chunk 3 below prints your unique local `github_file_path`. It should end with the repository's root directory, `StateLegiscraper`.

In [3]:
print(github_file_path)

/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper
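
If the printed path does not end with the repository root (for example, because the notebook was started from a different folder), the root can be located by searching upward instead of assuming it is exactly one level up. The sketch below is a hypothetical helper, not part of StateLegiscraper; it assumes the repository root is the first directory that contains a `statelegiscraper` folder.

```python
from pathlib import Path
from typing import Optional


def find_repo_root(start: Path, package_name: str = "statelegiscraper") -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `package_name`.

    Hypothetical helper: returns None when no such directory is found.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each of its parents in turn.
    for candidate in [start, *start.parents]:
        if (candidate / package_name).is_dir():
            return candidate
    return None


# Possible usage in the notebook (commented out so the sketch has no side effects):
# root = find_repo_root(Path.cwd())
# if root is not None:
#     sys.path.insert(1, str(root))
```

Returning `None` rather than raising lets the caller decide whether to fall back to a manually entered path.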


## Nevada Assets

Before we start scraping data, we should decide what data we're interested in. As of writing, StateLegiscraper's coverage of Nevada supports scraping PDF transcripts of Nevada's standing committee hearings from 2011-2021. To get the committee page links that the scraper collects PDF links from, import the `nv_weblinks` module from `statelegiscraper.assets.weblinks`.

In [4]:
from statelegiscraper.assets.weblinks import nv_weblinks

I'm going to print the `nv_weblinks` source so that you can review the file.

In [5]:
import inspect
links = inspect.getsource(nv_weblinks)

In [6]:
print(links)

"""Weblinks for Nevada committee meeting pages, organized by chamber and committee name 
for regular sessions from 2011-2021"""

#ASSEMBLY

assem_comlabor=[
    "https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/340/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/219/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/79th2017/Committee/184/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/78th2015/Committee/47/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/77th2013/Committee/1/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/76th2011/Committee/24/Meetings"
]

assem_ed=[
    "https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/348/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/228/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/79th2017/Committee/168/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/78th2015/Committee/50/Meetings",
    "https://www.leg.stat

As you can see, `nv_weblinks` includes lists for all Assembly and Senate standing committee meetings from 2011-2021. Choose the standing committee you're interested in and bring its list into your environment by prefixing the list name with `nv_weblinks.`.
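
Because the printed source is long, it can help to list the available committee names instead of reading the whole file. The sketch below is a hypothetical helper, not part of StateLegiscraper; it assumes only that each committee is a module-level list of URL strings, as in the source shown above.

```python
def committee_lists(module):
    """Return {name: list_of_links} for every public list of strings in `module`.

    Hypothetical helper: it inspects module attributes and makes no
    assumptions about which committees exist.
    """
    found = {}
    for name, value in vars(module).items():
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        if isinstance(value, list) and all(isinstance(item, str) for item in value):
            found[name] = value
    return found


# Possible usage in the notebook (commented out; requires StateLegiscraper):
# for name, links in sorted(committee_lists(nv_weblinks).items()):
#     print(name, len(links))
```

In the source above, Assembly lists start with `assem_` and the Senate list used below starts with `sen_`, so the sorted output should group the two chambers together.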

In [7]:
sen_hhs = nv_weblinks.sen_hhs

In [8]:
print(sen_hhs)

['https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/350/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/221/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/79th2017/Committee/170/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/78th2015/Committee/63/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/77th2013/Committee/22/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/76th2011/Committee/45/Meetings']


If you'd only like data from specific years, e.g., 2021 and 2019, slice the list. The links are ordered from the most recent session (2021) to the oldest (2011), so `sen_hhs[0:2]` selects 2021 and 2019.

In [9]:
sen_hhs_2021_2019 = sen_hhs[0:2]

In [10]:
print(sen_hhs_2021_2019)

['https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/350/Meetings', 'https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/221/Meetings']
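
Slicing by position works only while you remember the order of the list. As an alternative, the session year can be read from the link itself, since each link contains a segment such as `81st2021`. The sketch below is a hypothetical helper, not part of StateLegiscraper, and assumes every link follows the `/REL/<session><year>/` pattern seen in the output above.

```python
import re

# Matches the session segment of a NELIS committee link, e.g. "/REL/81st2021/",
# and captures the four-digit session year.
SESSION_YEAR = re.compile(r"/REL/\d+[a-z]{2}(\d{4})/")


def select_sessions(links, years):
    """Return the links whose session year is in `years`, keeping the original order."""
    wanted = {str(year) for year in years}
    selected = []
    for link in links:
        match = SESSION_YEAR.search(link)
        if match and match.group(1) in wanted:
            selected.append(link)
    return selected


# Three links copied from the sen_hhs list printed above.
example_links = [
    "https://www.leg.state.nv.us/App/NELIS/REL/81st2021/Committee/350/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/80th2019/Committee/221/Meetings",
    "https://www.leg.state.nv.us/App/NELIS/REL/76th2011/Committee/45/Meetings",
]
recent = select_sessions(example_links, [2021, 2019])
```

Links that do not match the pattern are skipped, so a malformed entry will not raise an error, but it will also not be reported.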


## Nevada Scrape Class

Now that you've selected your targeted data through the weblinks asset, let's begin scraping data! StateLegiscraper is structured with two main classes of functions: Scrape and Process. We'll start with the Scrape class, which we import using the following code.

In [11]:
from statelegiscraper.states.nv import Scrape

There's one function in Nevada's Scrape class, `nv_scrape_pdf()`. All of StateLegiscraper's state module functions include detailed docstrings, so call `help()` on the class to read the documentation.

In [12]:
help(Scrape)

Help on class Scrape in module statelegiscraper.states.nv:

class Scrape(builtins.object)
 |  Scrape functions for Nevada State Legislature website
 |  
 |  Methods defined here:
 |  
 |  nv_scrape_pdf(webscrape_links, dir_chrome_webdriver, dir_save)
 |      Webscrape function for Nevada State Legislature Website. 
 |      
 |      Parameters
 |      ----------
 |      webscrape_links : List
 |          List of direct link(s) to NV committee webpage.
 |          see assets/weblinks/nv_weblinks.py for lists organized by chamber and committee
 |      dir_chrome_webdriver : String
 |          Local directory that contains the appropriate Chrome Webdriver.
 |      dir_save : String
 |          Local directory to save PDFs.
 |      
 |      Returns
 |      -------
 |      All PDF files found on the webscrape_links, saved on local dir_save.
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary f

As shown above, `nv_scrape_pdf()` requires three parameters:

1. `webscrape_links`: A list of links to Nevada committee hearing webpages. This is what we covered in the Assets section. We'll use `sen_hhs_2021_2019`, which is currently in our environment.
2. `dir_chrome_webdriver`: The path to your Chrome WebDriver. You should have downloaded it during installation and saved it in the `StateLegiscraper/statelegiscraper/assets/chromedriver` directory. [See more details here.](https://github.com/ka-chang/StateLegiscraper/blob/main/README.md)
3. `dir_save`: A local directory in which to save the scraped raw data (PDF files). We'll use `StateLegiscraper/examples/outputs` for this example.

Remember `github_file_path`? This is the local path to wherever you cloned the StateLegiscraper repository. Let's use it to build the two directory parameters: the path to the Chrome WebDriver and the folder where the files will be saved.

Please note:
- Change the chromedriver file name to the one that matches your Chrome version and hardware. I am using Google Chrome version 96 on a Mac Mini M1, but you are probably not. Read the installation guide to download the appropriate ChromeDriver and save it in the assets folder.
- The save folder can be anywhere on your local drive, but for now we will use `StateLegiscraper/examples/outputs`.

In [13]:
directory_chrome_webdriver = str(os.path.join(github_file_path, "statelegiscraper/assets/chromedriver/chromedriver_v96_m1"))

In [14]:
print(directory_chrome_webdriver)

/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper/statelegiscraper/assets/chromedriver/chromedriver_v96_m1


In [15]:
directory_raw_data = str(os.path.join(github_file_path, "examples/outputs/"))

In [16]:
print(directory_raw_data)

/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper/examples/outputs/
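
Before launching the scraper, it may be worth confirming that both paths are usable, since a wrong driver path would otherwise only surface once the browser fails to start. The sketch below is a hypothetical helper, not part of StateLegiscraper; it checks that the driver file exists and creates the save folder if it is missing.

```python
import os
from pathlib import Path


def prepare_scrape_paths(driver_path, save_dir):
    """Validate the WebDriver path and make sure the save directory exists.

    Hypothetical helper. Returns both paths as strings; the save directory
    keeps a trailing separator because the notebook passes one to the scraper.
    """
    driver = Path(driver_path)
    if not driver.is_file():
        raise FileNotFoundError(f"Chrome WebDriver not found at: {driver}")

    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)  # no error if it already exists

    # os.path.join(path, "") appends a trailing separator to the directory.
    return str(driver), os.path.join(str(out), "")


# Possible usage in the notebook (commented out so the sketch has no side effects):
# directory_chrome_webdriver, directory_raw_data = prepare_scrape_paths(
#     directory_chrome_webdriver, directory_raw_data
# )
```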


In [17]:
Scrape.nv_scrape_pdf(sen_hhs_2021_2019, directory_chrome_webdriver, directory_raw_data)

https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1351.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1321.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1257.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1231.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1216.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1164.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1146.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1117.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1024.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/972.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/884.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/787.pdf
https://www.leg.state.nv.us/Session/81st2021/Minutes/Se

Congratulations! You have data now! Let's check the raw data outputs folder to confirm that the PDF files were saved correctly.

In [18]:
os.listdir(directory_raw_data)

['81st2021_Minutes_Senate_HHS_Final_1351.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1321.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1257.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1231.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1216.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1164.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1146.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1117.pdf',
 '81st2021_Minutes_Senate_HHS_Final_1024.pdf',
 '81st2021_Minutes_Senate_HHS_Final_972.pdf',
 '81st2021_Minutes_Senate_HHS_Final_884.pdf',
 '81st2021_Minutes_Senate_HHS_Final_787.pdf',
 '81st2021_Minutes_Senate_HHS_Final_766.pdf',
 '81st2021_Minutes_Senate_HHS_Final_686.pdf',
 '81st2021_Minutes_Senate_HHS_Final_660.pdf',
 '81st2021_Minutes_Senate_HHS_Final_755.pdf',
 '81st2021_Minutes_Senate_HHS_Final_737.pdf',
 '81st2021_Minutes_Senate_HHS_Final_647.pdf',
 '81st2021_Minutes_Senate_HHS_Final_597.pdf',
 '81st2021_Minutes_Senate_HHS_Final_563.pdf',
 '81st2021_Minutes_Senate_HHS_Final_479.pdf',
 '81st2021_Minutes_Senate
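
Judging from the URLs printed by the scraper and the file names listed above, each PDF appears to be saved under the part of its URL path that follows `/Session/`, with slashes replaced by underscores. That naming rule is inferred from this output, not taken from the package source. Under that assumption, the hypothetical helpers below report which scraped URLs have no matching file on disk.

```python
import os
from urllib.parse import urlparse


def expected_filename(pdf_url):
    """Guess the saved file name for a transcript URL (rule inferred from the output)."""
    # e.g. "/Session/81st2021/Minutes/Senate/HHS/Final/1351.pdf"
    parts = [part for part in urlparse(pdf_url).path.split("/") if part]
    if parts and parts[0] == "Session":
        parts = parts[1:]  # drop the leading "Session" segment
    return "_".join(parts)


def missing_downloads(pdf_urls, save_dir):
    """Return the URLs whose expected file name is not present in `save_dir`."""
    on_disk = set(os.listdir(save_dir))
    return [url for url in pdf_urls if expected_filename(url) not in on_disk]


# URL copied from the scraper output above.
sample_url = "https://www.leg.state.nv.us/Session/81st2021/Minutes/Senate/HHS/Final/1351.pdf"
sample_name = expected_filename(sample_url)
```

An empty list from `missing_downloads` suggests every URL was saved; a non-empty list points to downloads worth retrying.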

## Nevada Process Class

In [None]:
from statelegiscraper.states.nv import Process

In [None]:
Process.nv_pdf_to_text()

In [None]:
Process.nv_text_clean()

## What Now?

You have data now, congratulations! This is where you, the user, have free rein to begin working with popular Python NLP packages, such as NLTK and spaCy.

In [None]:
# Word Frequency Example
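
A minimal word-frequency sketch using only the Python standard library is shown below. It assumes the transcript text is already available as a Python string; `word_frequencies` is a hypothetical helper, and the sample sentence is a placeholder, not real transcript text.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=10, min_length=3):
    """Return the `top_n` most common words of at least `min_length` letters."""
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    # Remove apostrophes left at the edges of a token by quotation marks.
    tokens = [token.strip("'") for token in tokens]
    counts = Counter(token for token in tokens if len(token) >= min_length)
    return counts.most_common(top_n)


# Placeholder text standing in for one processed transcript.
sample_text = "The committee heard the bill. The bill passed."
top_words = word_frequencies(sample_text, top_n=2)
```

For real analysis you would likely also remove stop words such as "the"; libraries like NLTK and spaCy ship stop-word lists for that purpose.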