n-gram models trained on congressional public speech data.
I'm working with this data as part of the Lazer Lab at Northeastern University, so I decided to have a little fun with it.
Install the requirements:
pip install -r requirements.txt
Then go to
API (I guess?)
The URLs are nice enough that you can probably use this as an API, if you really want. URL format:
talk-like-a-congressperson.dcalacci.net/congressperson/<VoteSmart id>/generate/<ngram length>/<generated text length>
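As a quick sketch, you could build these URLs programmatically. The helper below is hypothetical (not part of this repo); the VoteSmart id 1032 is the one used in the examples later in this README, and I'm assuming the service is reachable over plain HTTP:

```python
# Hypothetical helper that builds a generation URL for the service.
# The parameter values (VoteSmart id 1032, trigrams, 100 words) are
# illustrative.
def generation_url(votesmart_id, ngram_len, text_len):
    base = "http://talk-like-a-congressperson.dcalacci.net"
    return f"{base}/congressperson/{votesmart_id}/generate/{ngram_len}/{text_len}"

print(generation_url(1032, 3, 100))
# -> http://talk-like-a-congressperson.dcalacci.net/congressperson/1032/generate/3/100
```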
To run it on the command line, use the Python modules directly:
import corpus
import markov
read the vocabulary data and the corpus of the congressperson you want to talk like:
c = corpus.read_data('corpora/1032/corpus.txt', 'vocab.txt')
create the markov model:
m = markov.Markov(c, 3)
generate some text:
This should output something like:
they now have better access to technology in the affordable care act is already paying dividends for millions of americans with more to come children can no longer have the records to defend themselves similarly at least some irs agents have taken the position that anyone who claimed edc benefits as a certainly as a participant in a recent report from the leaders and residents of the pre jobs act with respect to the united states and that met the program s criteria for creating jobs and economic opportunity for virgin islanders the increase of our nation i believe the path'
their continued success we must find better ways to reward teacher excellence and innovation expand access to more affordable as our economy becomes increasingly necessary to enter the workforce unfortunately rising tuition costs force the average borrower 17 500 into debt upon graduation a recent report from the national center for disease control and epa to put train and deploy mixed teams depending on the particular environment we ve got about a thousand people out of their component either in a different direction the legislation before both the house and senate authorizing committees the subcommittee that is the way we work
No punctuation, just words. This is an artifact from how I generated the corpus.
The script that converts my pandas dataframe of all the statements in the dataset into the format this project needs takes days to run, so improving the generated text is mostly a matter of re-running that script.
A little documentation
Each congressperson's collection of public statements is treated as a separate corpus.
Each corpus consists of some number of documents, and each document consists of some number of tokens, or words.
A corpus is represented as a single file, where each line is a document. Each word is an integer, and words are separated by spaces.
There is a single vocab.txt file shared by all the corpora; it maps tokens to integer IDs. vocab.txt contains all the unique words recognized in the corpus on a single line, separated by spaces.
In the corpus code and in the markov model, each word is represented by its index in the vocabulary list; the first word on the line has index 0.
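A tiny sketch of this mapping, with illustrative file contents (these are not the real vocab.txt or corpus files, just the layout described above):

```python
# vocab.txt puts every unique word on one space-separated line;
# a word's id is its position in that line.
vocab_line = "the affordable care act"
vocab = vocab_line.split()

# Each line of a corpus file is one document of integer word ids.
document = "0 1 2 3"
words = [vocab[int(tok)] for tok in document.split()]
print(" ".join(words))  # -> the affordable care act
```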
Using the markov model
initialize a corpus object by giving it two arguments: the first is the corpus file to use, and the second is the vocabulary file:
c = corpus.read_data('corpus.txt', 'vocab.txt')
initialize a markov object with code that looks like this:
m = markov.Markov(c, 3)
c is a corpus object, and
3 is the length of n-grams to use.
to generate text, call
l is the length of the text you want to generate.
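The idea behind the generator is a plain n-gram Markov chain: each (n-1)-token context maps to the tokens that have followed it in the corpus, and generation repeatedly samples from those. This sketch is not the repo's actual markov.Markov implementation, just an illustration of the technique; the function names and signatures here are my own:

```python
import random
from collections import defaultdict

def build_model(docs, n=3):
    """Map each (n-1)-token context to the list of tokens that follow it.

    docs is a list of documents, each a list of integer word ids,
    matching the corpus format described above.
    """
    model = defaultdict(list)
    for doc in docs:
        for i in range(len(doc) - n + 1):
            context = tuple(doc[i:i + n - 1])
            model[context].append(doc[i + n - 1])
    return model

def generate(model, length, seed=None):
    """Random-walk the chain to produce up to `length` tokens."""
    rng = random.Random(seed)
    context = rng.choice(list(model))      # start from a random context
    out = list(context)
    while len(out) < length and context in model:
        nxt = rng.choice(model[context])   # sample a continuation
        out.append(nxt)
        context = context[1:] + (nxt,)     # slide the context window
    return out[:length]
```

Generated token ids would then be mapped back to words through vocab.txt, which is why the sample output above has no punctuation.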