# Step 3: Reading in the bills for which I'll scrape word count data, sorting them, and grabbing wordcount data

## 3.1: Importing the necessary packages

In [2]:
import json
import requests
import pandas as pd
from tqdm.notebook import tqdm



## 3.2: Reading the 'hr_bills_to_scrape.csv' file into a new dataframe and sorting the entries in ascending order by congress number, then bill number
The 'hr_bills_to_scrape.csv' file that we created in the previous step is looking great, though it could be better organized. Here we read the CSV file into a new dataframe and use the .sort_values method to arrange the bills in ascending order, first by congressional session number and then by bill number. This will make our dataset cleaner and easier to analyze later on.

In [4]:
# Reading csv file into a new dataframe
bills_to_scrape = pd.read_csv('hr_bills_to_scrape.csv')

# Sorting the bills by congressional session and bill number
bills_to_scrape = bills_to_scrape.sort_values(by=['congress', 'bill_number']).reset_index(drop=True)

# Printing the updated dataframe to ensure the sorting worked as intended (it did!)
bills_to_scrape

Unnamed: 0,congress,bill_number,url
0,104,248,https://www.congress.gov/bill/104th-congress/h...
1,104,255,https://www.congress.gov/bill/104th-congress/h...
2,104,325,https://www.congress.gov/bill/104th-congress/h...
3,104,394,https://www.congress.gov/bill/104th-congress/h...
4,104,395,https://www.congress.gov/bill/104th-congress/h...
...,...,...,...
3391,116,8247,https://www.congress.gov/bill/116th-congress/h...
3392,116,8276,https://www.congress.gov/bill/116th-congress/h...
3393,116,8337,https://www.congress.gov/bill/116th-congress/h...
3394,116,8472,https://www.congress.gov/bill/116th-congress/h...
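To see what the two-key sort does on a small scale, here is a minimal sketch. The `toy` dataframe and its values are made up for illustration and are not taken from the real CSV; only the `congress` and `bill_number` column names match the notebook.

```python
import pandas as pd

# Toy data standing in for the real CSV (values invented for illustration)
toy = pd.DataFrame({
    'congress': [116, 104, 116, 104],
    'bill_number': [8247, 395, 12, 248],
})

# Sort by congress first, then by bill_number within each congress.
# reset_index(drop=True) renumbers the rows 0..n-1 and discards the old index.
toy_sorted = toy.sort_values(by=['congress', 'bill_number']).reset_index(drop=True)
print(toy_sorted)
```

Without `reset_index(drop=True)`, each row would keep the index label it had before sorting, so the index would no longer run in order from top to bottom.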


## 3.3: Looping through the bills to download
Here I start by converting the dataframe to a list of Python dictionaries (one dictionary per bill), which is easier to loop through...

In [5]:
# Converting the dataframe into a list of dictionaries (one per bill)
bills = json.loads(bills_to_scrape.to_json(orient='records'))

In [6]:
# Taking a quick look at the first five entries to make sure they look as intended (they do!)
bills[0:5]

[{'congress': 104,
  'bill_number': 248,
  'url': 'https://www.congress.gov/bill/104th-congress/house-bill/248/text?r=1&s=2&format=txt'},
 {'congress': 104,
  'bill_number': 255,
  'url': 'https://www.congress.gov/bill/104th-congress/house-bill/255/text?r=1&s=2&format=txt'},
 {'congress': 104,
  'bill_number': 325,
  'url': 'https://www.congress.gov/bill/104th-congress/house-bill/325/text?r=1&s=2&format=txt'},
 {'congress': 104,
  'bill_number': 394,
  'url': 'https://www.congress.gov/bill/104th-congress/house-bill/394/text?r=1&s=2&format=txt'},
 {'congress': 104,
  'bill_number': 395,
  'url': 'https://www.congress.gov/bill/104th-congress/house-bill/395/text?r=1&s=2&format=txt'}]
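As a side note, pandas can produce the same list of dictionaries directly with `to_dict(orient='records')`, without the round trip through a JSON string. The sketch below compares the two approaches on a made-up two-row dataframe; the `example.com` URLs are placeholders, not real bill pages.

```python
import json
import pandas as pd

# Two made-up rows with the same columns as the notebook's dataframe
toy = pd.DataFrame({
    'congress': [104, 104],
    'bill_number': [248, 255],
    'url': ['https://example.com/a', 'https://example.com/b'],
})

# Approach used in this notebook: dataframe -> JSON string -> Python objects
via_json = json.loads(toy.to_json(orient='records'))

# Alternative: dataframe -> list of dictionaries in one step
via_dict = toy.to_dict(orient='records')

print(via_json == via_dict)
```

Either form works for the loop that follows; the JSON round trip is kept in this notebook because that is how the step was originally written.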

In [7]:
!mkdir -p pages
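The `!mkdir -p pages` cell relies on a shell command, which may not be available in every environment (for example, a default Windows command prompt). A minimal pure-Python equivalent, assuming the same `pages` folder name, could look like this:

```python
from pathlib import Path

# Create the 'pages' folder if it does not exist yet.
# exist_ok=True mirrors the -p flag: no error if the folder is already there.
pages_dir = Path('pages')
pages_dir.mkdir(parents=True, exist_ok=True)
```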

## 3.4: Using a for loop to download all the .html pages on Congress.gov that contain the full text of each bill


In [8]:
# Looping through the bills to download the full-text page of each enacted bill
for bill in tqdm(bills):
    congress = bill['congress']
    bill_number = bill['bill_number']
    bill_url = bill['url']
    # Request the URL (the timeout keeps one stalled request from hanging the whole loop)
    page = requests.get(bill_url, timeout=30)

    # Save the HTML of the URL
    # See string_interpolation.ipynb notebook in this repo for how f-strings work
    with open(f'pages/{congress}_{bill_number}.html', 'w', encoding='utf-8') as f:
        f.write(page.text)

  0%|          | 0/3396 [00:00<?, ?it/s]
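With thousands of pages to fetch, the loop may get interrupted partway through. One possible way to make it restartable is sketched below: it skips any file that is already on disk and pauses between requests. This is not the code used in this notebook; `download_bills`, its `fetch` and `delay` parameters, and the demo data are all hypothetical names introduced for illustration. The demo uses a fake `fetch` and a temporary folder so it runs without touching the network.

```python
import tempfile
import time
from pathlib import Path


def download_bills(bills, fetch, out_dir='pages', delay=0.0):
    """Save each bill's page to out_dir, skipping files that already exist.

    `fetch` is any callable that takes a URL and returns the page text.
    Returns the list of paths written during this call.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for bill in bills:
        target = out_path / f"{bill['congress']}_{bill['bill_number']}.html"
        if target.exists():
            continue  # already saved on an earlier run
        target.write_text(fetch(bill['url']), encoding='utf-8')
        written.append(target)
        time.sleep(delay)  # pause between requests
    return written


# Demo with made-up bills and a fake fetch function (no network access)
demo_bills = [
    {'congress': 104, 'bill_number': 248, 'url': 'https://example.com/248'},
    {'congress': 104, 'bill_number': 255, 'url': 'https://example.com/255'},
]

with tempfile.TemporaryDirectory() as tmp:
    first = download_bills(demo_bills, fetch=lambda url: f'<html>{url}</html>', out_dir=tmp)
    second = download_bills(demo_bills, fetch=lambda url: f'<html>{url}</html>', out_dir=tmp)

print(len(first), len(second))  # prints "2 0": the second run finds both files and skips them
```

To use something like this for the real download, `fetch` could be `lambda url: requests.get(url, timeout=30).text`, with `bills` wrapped in `tqdm(...)` to keep the progress bar and a non-zero `delay` to space out the requests.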