# Copycat Finder

This project takes a simple approach to finding pieces of legislation that appear in multiple state legislatures.

For now, the analysis here deals only with bill titles, not full bill text. Sunlight's OpenStates project, the data source for this project, does not index full bill text, and scraping and cleaning bill text from 50 states is another can of worms. Still, the methods used here for comparing bill titles should apply to full bill text with minor modifications.

## Running the app

The app is written in Python and uses Django for some database help behind the scenes. Running the app should be pretty straightforward.

First, in either your global Python environment or a self-contained virtual environment, install the requirements by typing `pip install -r ./requirements.txt`.

Next, you'll need to set up a fresh SQLite database, which you can do by typing `python manage.py syncdb` and following the instructions.

Then you'll need to fill your database with bill titles from OpenStates. The code for doing that is in `copycat/bin/importer.py`. If you run that script with `python importer.py`, your database should populate automatically with bills from 39 state legislatures' 2013 sessions (get some coffee -- it takes a while).

Finally, running the analysis itself is as simple as running `python cluster.py`, which is located in the same directory as the importer. That script will output a GraphML file in `copycat/bin/output/`, which you can open and explore using software like Gephi. Nodes that are connected in the graph represent similar bills.
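
In short, and assuming you run the import and clustering scripts from inside `copycat/bin` as described above, the whole workflow looks roughly like this:

```bash
pip install -r ./requirements.txt
python manage.py syncdb
cd copycat/bin
python importer.py   # load 2013 bill titles from OpenStates (takes a while)
python cluster.py    # run the clustering and write a GraphML file to output/
```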

## A few other notes

There are a lot of methods available for grouping similar documents together, including more efficient approaches such as locality-sensitive hashing. The method used here is designed to be easy to reason about while still being relatively efficient.

The approach relies on building a similarity matrix that records how similar every document in each cluster is to every other document in that cluster. In a sufficiently large dataset, that matrix can take up a large amount of RAM, which is why the workflow also relies on a Python library called Gensim, which shards large matrices to disk rather than storing them in memory.
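
As a rough sketch of what that looks like (this is an illustration of Gensim's disk-backed similarity index, not this project's actual code -- the tokenized titles and index path are placeholders):

```python
from gensim import corpora, models, similarities

# Tokenized bill titles (placeholder data)
tokenized_titles = [
    ["an", "act", "relating", "to", "firearms"],
    ["an", "act", "concerning", "firearms"],
    ["a", "bill", "for", "an", "act", "relating", "to", "education"],
]

# Build a vocabulary and a TF-IDF representation of the titles
dictionary = corpora.Dictionary(tokenized_titles)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_titles]
tfidf = models.TfidfModel(corpus)

# Similarity shards the index to disk under the given prefix instead of
# holding the full document-by-document matrix in RAM
index = similarities.Similarity("/tmp/bill_index",
                                tfidf[corpus],
                                num_features=len(dictionary))

# Querying returns cosine similarities against every indexed title
sims = index[tfidf[dictionary.doc2bow(["an", "act", "about", "firearms"])]]
```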

The process also uses a clustering algorithm called mini-batch k-means, which runs more quickly than conventional k-means at the expense of a little precision. For some intuition on how the conventional k-means algorithm works, I put together a documented version you can read through.
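
If you want to see the mini-batch variant in action, scikit-learn's MiniBatchKMeans is one readily available implementation (again, a standalone illustration rather than this project's code; the titles below are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

titles = [
    "An act relating to the regulation of firearms",
    "An act concerning the regulation of firearms",
    "A bill for an act relating to public education funding",
    "An act relating to education funding",
]

# Turn the titles into TF-IDF vectors, then cluster them using small random
# batches of documents per iteration instead of the full dataset
vectors = TfidfVectorizer(stop_words="english").fit_transform(titles)
km = MiniBatchKMeans(n_clusters=2, batch_size=100, random_state=0)
labels = km.fit_predict(vectors)

# Titles sharing a label landed in the same cluster
print(list(zip(labels, titles)))
```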

## Contact

I'm at chase.davis@gmail.com if you have any questions. Thanks!
