# Step 2: Parsing HR Bills Data from the 104th-116th Congressional Sessions

## 2.1: Importing necessary packages

In [3]:
import json
import pandas as pd
# glob finds file paths that match a wildcard pattern (see https://docs.python.org/3/library/glob.html)
import glob
# os provides portable path utilities such as os.path.sep (see https://docs.python.org/3/library/os.html)
import os
from tqdm.notebook import tqdm



## 2.2: Creating a list called bill_paths to store only the file paths within our new data directory that include information on the bills of interest
Because the bills are stored in a consistent folder structure, we can use the glob module inside a for loop to collect the folder paths for only the bills we're interested in and store them in a list named bill_paths. For this project, that means bills that originated in the House of Representatives (indicated by "hr", usually "bills/hr", within the path) during the Congressional sessions of interest (the 104th through the 116th, produced by range(104, 117) under the loop variable *n*). Eventually we will narrow this list to only bills that were signed into law.

Because we discovered some slight variations in the folder layout between congresses, the loop below selects a different path pattern depending on the congress to accommodate them and ensure all the data is found and collected.

In [6]:
bill_paths = []
for n in range(104, 117):
    # The folder layout varies by congress, so pick the matching hr directory
    if n in (104, 105):
        hr_dir = f'data/{n}/hr/'
    elif n in (113, 114, 116):
        hr_dir = f'data/{n}/congress/data/{n}/bills/hr/'
    elif n == 115:
        hr_dir = f'data/{n}/{n}/bills/hr/'
    else:  # 106 through 112
        hr_dir = f'data/{n}/bills/hr/'

    # The trailing separator makes glob return only directories (one per bill)
    bill_paths.extend(glob.glob(hr_dir + '*' + os.path.sep))
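
The glob patterns above all end with a path separator, which is what limits the matches to bill folders rather than loose files. The following is a minimal sketch of that behavior on a throwaway directory tree; the folder and file names (hr1, hr2, notes.txt) are made up for illustration and are not part of the real dataset.

```python
import glob
import os
import tempfile

# Build a small temporary tree: two bill folders plus one stray file.
# All names here are hypothetical stand-ins for the real data directory.
with tempfile.TemporaryDirectory() as root:
    hr_dir = os.path.join(root, 'data', '106', 'bills', 'hr')
    os.makedirs(os.path.join(hr_dir, 'hr1'))
    os.makedirs(os.path.join(hr_dir, 'hr2'))
    open(os.path.join(hr_dir, 'notes.txt'), 'w').close()

    # A pattern that ends with the separator matches directories only,
    # and each returned path keeps the trailing separator.
    pattern = os.path.join(hr_dir, '*') + os.path.sep
    matches = sorted(glob.glob(pattern))

    names = [os.path.basename(m.rstrip(os.path.sep)) for m in matches]
    all_end_with_sep = all(m.endswith(os.path.sep) for m in matches)

print(names)
print(all_end_with_sep)
```

Because each returned path keeps its trailing separator, a file name such as data.json can be appended to it directly.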

### Checking out the length of the list to get a sense of how many bills it contains

In [7]:
len(bill_paths)

81986

## 2.3: Creating another list and using another for loop to store data on only bills that have been signed into law
As mentioned earlier, this analysis only examines bills that actually became public law, which is a much smaller subset than the bills that were introduced. The sub-folder for each bill in this dataset contains a data.json file that stores all the key metadata we need, so data.json is the only file we read for each bill. To isolate the enacted bills, we create a list called passed_bills that stores the folder path of every bill whose data.json has a status of "ENACTED:SIGNED". Folders for which data.json cannot be read are recorded in a second list, no_json.

In [8]:
# empty list to collect bills that passed
passed_bills = []
no_json = []

# loop over all the bill_paths
for bill_path in tqdm(bill_paths):
    try:    
        # nearly every bill_path contains a data.json file
        file_name = f'{ bill_path }data.json'

        # read the json
        with open(file_name) as f:
            bill_json = json.load(f)
            # every bill has a status key; we only want the ones where
            # `status` is 'ENACTED:SIGNED'
            if bill_json['status'] == 'ENACTED:SIGNED':
                # append bill_path to list if it was enacted/signed
                passed_bills.append(bill_path)
     
    # The try/except block inside this for loop accounts for the few bills that don't have a data.json file
    except Exception as e: 
        print('Failed for ', str(bill_path))
        print(e)
        no_json.append(bill_path)


Failed for  data/116/congress/data/116/bills/hr/hr9026/
[Errno 2] No such file or directory: 'data/116/congress/data/116/bills/hr/hr9026/data.json'
Failed for  data/116/congress/data/116/bills/hr/hr7153/
[Errno 2] No such file or directory: 'data/116/congress/data/116/bills/hr/hr7153/data.json'
Failed for  data/116/congress/data/116/bills/hr/hr8549/
[Errno 2] No such file or directory: 'data/116/congress/data/116/bills/hr/hr8549/data.json'
Failed for  data/116/congress/data/116/bills/hr/hr9025/
[Errno 2] No such file or directory: 'data/116/congress/data/116/bills/hr/hr9025/data.json'
Failed for  data/116/congress/data/116/bills/hr/hr9040/
[Errno 2] No such file or directory: 'data/116/congress/data/116/bills/hr/hr9040/data.json'
Failed for  data/116/congress/data/116/bills/hr/hr8996/
[Errno 2] No such file or directory: 'data/116/congress/data/116/bills/hr/hr8996/data.json'
Failed for  data/116/congress/data/116/bills/hr/hr7347/
[Errno 2] No such file or directory: 'data/116/congress/

### Checking out the length of the passed_bills list to get a sense of how many bills it contains
By using the len() function to compare the size of this new list with the previous one, we can see that focusing only on bills that became law reduces the dataset substantially, from 81,986 bills to 3,396.

In [9]:
len(passed_bills)

3396
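
The filtering step in this section catches every exception and treats it as a missing file. As a hedged alternative, the sketch below catches only FileNotFoundError, so that other problems (for example, malformed JSON) would surface instead of being counted as missing. The helper name split_by_status, the sample folders, and the sample status values are assumptions made for illustration and are not part of the original notebook.

```python
import json
import os
import tempfile


def split_by_status(bill_dirs, wanted='ENACTED:SIGNED'):
    """Return (dirs whose status matches, dirs with no data.json)."""
    matching, missing = [], []
    for bill_dir in bill_dirs:
        json_path = os.path.join(bill_dir, 'data.json')
        try:
            with open(json_path) as f:
                bill = json.load(f)
        except FileNotFoundError:
            # Only a missing file is treated as "no JSON".
            missing.append(bill_dir)
            continue
        # .get() avoids a KeyError if a file has no status field.
        if bill.get('status') == wanted:
            matching.append(bill_dir)
    return matching, missing


# Hypothetical sample: one enacted bill, one other bill, one folder without JSON.
with tempfile.TemporaryDirectory() as root:
    samples = {'hr1': 'ENACTED:SIGNED', 'hr2': 'REFERRED', 'hr3': None}
    dirs = []
    for name, status in samples.items():
        bill_dir = os.path.join(root, name)
        os.makedirs(bill_dir)
        dirs.append(bill_dir)
        if status is not None:
            with open(os.path.join(bill_dir, 'data.json'), 'w') as f:
                json.dump({'status': status}, f)

    enacted, missing = split_by_status(dirs)
    enacted_names = [os.path.basename(d) for d in enacted]
    missing_names = [os.path.basename(d) for d in missing]

print(enacted_names, missing_names)
```

Using os.path.join here also sidesteps the question of whether each folder path ends with a separator.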

## 2.4: Creating another list and using another for loop to extract only the few datapoints we're interested in within our list of passed bills
Now that we have the final list of hr bills we want to examine, along with the folder path for each one, it's time to parse each bill's data.json file, extract only the few data points relevant to this analysis, and store them in a new list named bills_data. Two of the data points can be read directly from the JSON by key: the congressional session ('congress') and the bill number ('number'). The third is the URL of the congress.gov web page containing the full text of the bill, which we build using string interpolation. We then use the .append method to add these values to bills_data as one dictionary per bill, and finally convert the list into a tabular format (a data frame) using the function pd.DataFrame. We name this data frame "bills".

In [10]:
#Creating a list called bills_data that will store only the data points of interest
bills_data = []
for bill_path in passed_bills:
    
    # Every bill_path in passed_bills contains a data.json file. With string interpolation, we can grab these files.
    # bill_path already ends with a path separator, so no extra slash is needed.
    file_name = f'{ bill_path }data.json'
    
    # Reading the json and extracting data on key variables as well as adding the URL to the full bill text
    with open(file_name) as f:
        bill_json = json.load(f)
        congress = bill_json['congress']
        bill_number = bill_json['number']
        bill_url = f'https://www.congress.gov/bill/{ congress }th-congress/house-bill/{ bill_number }/text?r=1&s=2&format=txt'
        bills_data.append({
            'congress': congress,
            'bill_number': bill_number,
            'url': bill_url
        })
# Converting the bills_data list into a tabular format as a dataframe called 'bills'        
bills = pd.DataFrame(bills_data)
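
The URL template above hard-codes the suffix "th", which is correct for every congress in this project (104th through 116th) but would not produce "101st" or "103rd". A minimal sketch of a more general version follows, in case the range is ever widened; the helper names ordinal and bill_text_url are hypothetical and do not appear in the original notebook.

```python
def ordinal(n):
    """Return n with its English ordinal suffix, e.g. 104 -> '104th'."""
    if 10 <= n % 100 <= 20:
        # 11th, 12th, 13th (and 111th, 112th, 113th) always take 'th'.
        suffix = 'th'
    else:
        suffix = {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th')
    return f'{n}{suffix}'


def bill_text_url(congress, bill_number):
    """Build the congress.gov full-text URL using the same template as above."""
    # int() accepts the congress whether it was read as a string or a number.
    return (
        f'https://www.congress.gov/bill/{ordinal(int(congress))}-congress/'
        f'house-bill/{bill_number}/text?r=1&s=2&format=txt'
    )


print(bill_text_url('104', '3546'))
```

For congresses 104 through 116 this returns exactly the same URLs as the hard-coded template.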

In [11]:
# Displaying the bills dataframe to check that it looks the way we expect. It seems that it does!
bills

Unnamed: 0,congress,bill_number,url
0,104,3546,https://www.congress.gov/bill/104th-congress/h...
1,104,1643,https://www.congress.gov/bill/104th-congress/h...
2,104,3910,https://www.congress.gov/bill/104th-congress/h...
3,104,2061,https://www.congress.gov/bill/104th-congress/h...
4,104,4168,https://www.congress.gov/bill/104th-congress/h...
...,...,...,...
3391,116,1058,https://www.congress.gov/bill/116th-congress/h...
3392,116,2423,https://www.congress.gov/bill/116th-congress/h...
3393,116,1252,https://www.congress.gov/bill/116th-congress/h...
3394,116,4981,https://www.congress.gov/bill/116th-congress/h...


## 2.5: Exporting our new dataframe to a CSV format for scraping in the next step

In [18]:
# Exporting the dataframe to a csv format, which we can use in the next step to scrape and add the word count for each bill
bills.to_csv('hr_bills_to_scrape.csv', index=False)
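
To illustrate what index=False does in the export above, here is a small sketch that writes a made-up two-row data frame to CSV text with and without the index, then reads the index-free version back. The rows and the example.com URLs are placeholders rather than real bill data, and the text is kept in memory instead of being written to disk.

```python
import io

import pandas as pd

# Two made-up rows standing in for the real bills data frame.
sample = pd.DataFrame([
    {'congress': '104', 'bill_number': '1', 'url': 'https://example.com/a'},
    {'congress': '116', 'bill_number': '2', 'url': 'https://example.com/b'},
])

# With the default index=True, the row index becomes an unnamed first column.
header_with = sample.to_csv().splitlines()[0]
# With index=False, only the three data columns are written.
csv_text = sample.to_csv(index=False)
header_without = csv_text.splitlines()[0]

# Reading the index-free CSV back yields the same three columns.
reloaded = pd.read_csv(io.StringIO(csv_text))

print(header_with)
print(header_without)
print(list(reloaded.columns))
```

Note that pd.read_csv infers column types on the way back in, so a congress stored as text comes back as an integer unless a dtype is specified.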