# StateLegiscraper: PDF Format Example Notebook

*Author*: Katherine Chang (kachang@uw.edu)

*Last Updated*: 12/12/2021

StateLegiscraper is a Python package that scrapes and processes data from U.S. state legislature websites. As of writing, the package focuses on transcribing standing committee hearings from each state legislature's native archival format to text, so that the text data can be easily used for NLP research and for public review. For more details about StateLegiscraper, please visit its [GitHub repository](https://github.com/ka-chang/StateLegiscraper), where it is under active development.

This notebook walks a new user through the StateLegiscraper workflow, with a focus on the Nevada State Legislature and working with PDF file formats. 

This notebook makes several assumptions about the user, which are that they have:

- At least a novice level familiarity with Python, including importing packages, running basic functions, and saving files.
- Knowledge of basic Python data types, particularly lists and dictionaries.
- Comfort working in the command line, as StateLegiscraper is installed through the user's choice of terminal. 
- At least 100 MB of free space on a local hard drive or a mounted cloud drive for saving the raw data.

## The Nevada Context

The Nevada State Legislature is a part-time, biennial state legislature, which means state legislators meet in odd-numbered years, between February and June. The state legislature website, [www.leg.state.nv.us](https://www.leg.state.nv.us), hosts transcripts of its standing committee meetings, prepared by human transcribers, in PDF format.

## Setup

Please ensure StateLegiscraper is installed on your local drive. Please refer to the [following instructions for details](https://github.com/ka-chang/StateLegiscraper/blob/main/README.md).

The following code chunks add your local StateLegiscraper directory to Python's module search path, so that the package's modules can be imported.

In [1]:
import os
from pathlib import Path
import sys

In [16]:
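# Parent of the current working directory; this should be the local GitHub repository root
github_file_path = str(Path(os.getcwd()).parents[0])

# Add the repository root to the module search path so statelegiscraper can be imported
sys.path.insert(1, github_file_path)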

In [19]:
print(github_file_path)

/Volumes/GoogleDrive/My Drive/State Legislatures/StateLegiscraper


`github_file_path` should point to the GitHub repository's root directory, `/StateLegiscraper`.
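If the notebook is launched from somewhere other than a direct subfolder of the repository, `parents[0]` will not land on the root. The following is a minimal sketch of a more defensive lookup. `find_repo_root` is a hypothetical helper, not part of StateLegiscraper, and it assumes the cloned folder is still named `StateLegiscraper`.

```python
from pathlib import Path


def find_repo_root(start, repo_name="StateLegiscraper"):
    """Return the nearest directory named repo_name at or above start, else None.

    Hypothetical helper for illustration; assumes the cloned folder keeps
    its default name.
    """
    start = Path(start).resolve()
    # Check the starting directory itself, then each ancestor in turn.
    for candidate in (start, *start.parents):
        if candidate.name == repo_name:
            return candidate
    return None


repo_root = find_repo_root(Path.cwd())
if repo_root is None:
    print("StateLegiscraper directory not found at or above", Path.cwd())
else:
    print("Repository root:", repo_root)
```

If a root is found, `str(repo_root)` can be passed to `sys.path.insert` in place of `github_file_path`.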

## Nevada Scrape Class

In [18]:
from statelegiscraper.states.nv import Scrape

In [None]:
Scrape.nv_scrape_pdf()

In [None]:
from statelegiscraper.assets.weblinks import nv_weblinks

In [None]:
# Available committee links are defined in:
# https://github.com/ka-chang/StateLegiscraper/blob/main/statelegiscraper/assets/weblinks/nv_weblinks.py

sen_hhs = nv_weblinks.sen_hhs
assem_hhs = nv_weblinks.assem_hhs

## Nevada Process Class

In [None]:
from statelegiscraper.states.nv import Process

In [None]:
Process.nv_pdf_to_text()

In [None]:
Process.nv_text_clean()

## What Now?

You have data now. Congratulations! This is where you, the user, have free rein to begin working with popular Python NLP packages, such as NLTK and spaCy.

In [None]:
# Word Frequency Example
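One possible starting point is a word-frequency count that uses only the standard library. The sketch below assumes the cleaned transcripts are available as a dictionary mapping a hearing label to its text. The `sample_transcripts` data, the `STOP_WORDS` list, and the `word_frequencies` helper are all hypothetical and are not part of StateLegiscraper.

```python
import re
from collections import Counter

# Hypothetical stand-in for cleaned transcript text; replace with your own data.
sample_transcripts = {
    "hearing_1": "The committee heard testimony on the health budget.",
    "hearing_2": "Testimony on the education budget followed the health hearing.",
}

# A tiny illustrative stop-word list; NLTK and spaCy ship fuller ones.
STOP_WORDS = {"the", "on", "a", "an", "of", "and", "to", "in"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens in text, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)


# Combine counts across all hearings.
total_counts = Counter()
for text in sample_transcripts.values():
    total_counts.update(word_frequencies(text))

print(total_counts.most_common(3))
```

The regular-expression tokenizer is deliberately simple. For real transcripts, a tokenizer from NLTK or spaCy will usually handle punctuation, names, and abbreviations more carefully.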