# Part 1: Introduction and First Steps

# Introduction: Are New Federal Laws Getting Longer on Average and Shorter in Number?
Discussions of democratic decline in the United States have grown immensely, especially in the years since the election of former president Donald Trump in 2016. While voter-facing affronts on the electoral system such as campaign finance, voter suppression, and gerrymandering have all drawn well-deserved attention from activists, members of the media, and researchers, less is known about the state of our democracy *inside* the halls of Congress. While it's broadly understood that corporate capture and the influence of wealthy elites over the legislative process are a growing problem, less is known about how congressional processes have changed over time.

One major claim that I had heard from congressional commentators was that legislation is becoming more concentrated into massive, so-called omnibus bills, where corrupt payouts and bizarre, unpopular laws that benefit the elite and corporations can be buried under a mountain of words. Anecdotal accounts from some members of Congress and activists seemed plausible, but, as far as I could find, no one had tested them empirically. The goal of this data-driven research project is to draw upon the vast, publicly available datasets on bills that have been enacted into law (meaning passed by both houses of Congress and signed by the president), including metadata and the full text of the bills themselves, to attempt to answer this question. Starting with a manageable but still vast sample of ten recent congressional sessions (each session spanning 2 years), I will examine every piece of legislation enacted over the nearly 20 years between 1999 and 2018 to illuminate just how concerned citizens should be about the alleged concentration of federal lawmaking.

# Step 1: Download bulk datasets from ProPublica's Congress Database
The nonprofit investigative news outlet ProPublica maintains a complete database of all legislation that has been introduced in the U.S. Congress. It draws this publicly available data from the government website Congress.gov and packages it in a set of zip files arranged by legislative session (each of which spans two calendar years). The data includes metadata on each bill, including sponsors, cosponsors, committee actions, floor votes, and a summary, in addition to the date of last modification. The bulk download can be found here: https://www.propublica.org/datastore/dataset/congressional-data-bulk-legislation-bills, although the files themselves are hosted at an Amazon Web Services URL (visible below).

This Jupyter notebook is devoted to the sole task of downloading the bulk data on all bills for the congressional sessions that I am interested in. For the purpose of this analysis, I will be examining the 10 full congressional sessions from the 106th Congress through the 115th, spanning the roughly 20 years from 1999 through 2018. After some research, I determined that this period reaches back far enough to be informative while still yielding a manageable dataset for the scope of this project. The dataset intentionally excludes the 116th and 117th Congresses, since data was found to be missing for several bills passed during the 116th, and the 117th has not yet concluded.
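The mapping between Congress numbers and calendar years used above can be checked with a short sketch. The helper name `congress_years` is hypothetical and not part of the original notebook; it assumes only that the 1st Congress convened in 1789 and that each Congress covers two calendar years.

```python
def congress_years(n):
    """Return the two calendar years covered by the n-th Congress.

    Assumption: the 1st Congress convened in 1789 and every Congress
    spans two years. A Congress formally ends in early January of the
    following year, which this sketch ignores.
    """
    start = 1789 + 2 * (n - 1)
    return start, start + 1


# First and last Congresses in the sample
for n in (106, 115):
    print(n, congress_years(n))
```

Under that assumption, the 106th Congress maps to 1999-2000 and the 115th to 2017-2018, matching the period described above.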

## 1.1: Importing relevant packages

In [3]:
import requests
import zipfile
import io
from tqdm.notebook import tqdm

## 1.2: Creating a new directory called `data`

We will store the downloaded zip files for each session here.

In [7]:
!mkdir -p data

## 1.3: Download data 

I use a for loop over only the desired congressional sessions (106-115).

Because I had already downloaded the data for the 111th-115th Congresses, that code is commented out below.

In [8]:
# for n in tqdm(range(111, 116)):
#     url = f'https://s3.amazonaws.com/pp-projects-static/congress/bills/{n}.zip'
#     r = requests.get(url)
#     z = zipfile.ZipFile(io.BytesIO(r.content))
#     z.extractall(f'data/{n}')

  0%|          | 0/5 [00:00<?, ?it/s]

In [4]:
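# Download and extract bill data for the 106th-110th Congresses
for n in tqdm(range(106, 111)):
    url = f'https://s3.amazonaws.com/pp-projects-static/congress/bills/{n}.zip'
    r = requests.get(url)
    r.raise_for_status()  # stop early if the download failed
    # Unzip the archive from memory into data/<congress number>
    z = zipfile.ZipFile(io.BytesIO(r.content))
    z.extractall(f'data/{n}')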

  0%|          | 0/6 [00:00<?, ?it/s]
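As an alternative to commenting out code for sessions that are already on disk, one could check which session folders are still missing before downloading. The following is only a sketch: the helper `missing_congresses` is a hypothetical name, and it assumes that a non-empty `data/<n>` folder means the session was fully extracted, which an interrupted run could violate.

```python
from pathlib import Path


def missing_congresses(congresses, data_dir="data"):
    """Return the Congress numbers with no extracted files under data_dir.

    Assumption: a non-empty data_dir/<n> folder means that session's
    zip file was already downloaded and extracted.
    """
    missing = []
    for n in congresses:
        target = Path(data_dir) / str(n)
        # A missing or empty folder counts as "not yet downloaded"
        if not target.is_dir() or not any(target.iterdir()):
            missing.append(n)
    return missing


# Sessions in the sample (106th-115th) that still need downloading
print(missing_congresses(range(106, 116)))
```

The returned list could then be passed to the download loop in place of a hard-coded `range`, so that rerunning the notebook fetches only the sessions that are absent.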