Showing 1 changed file with 92 additions and 1 deletion.
talk-like-a-congressman
=======================

n-gram models trained on congressional public speech data.

I'm working with this data as part of the
[lazer lab](www.lazerlab.net) at Northeastern University, so I decided
to have a little fun with it.

## To Use:

Right now, there's no 'main script'. You have to run some methods in a Python interpreter.

To do this, import the modules:

```python
import corpus
import markov
```

Read the vocabulary data and the data of the congressman you want to talk like:

```python
# corpus file first, then the shared vocabulary file
c = corpus.read_data('corpora/1032/corpus.txt', 'vocab.txt')
```

Create the Markov model:

```python
# 3 is the length of the n-grams to use
m = markov.Markov(c, 3)
```

Generate some text:

```python
m.generate(100)
```
This should output something like:

```
they now have better access to technology in the affordable care
act is already paying dividends for millions of americans with more to
come children can no longer have the records to defend themselves
similarly at least some irs agents have taken the position that anyone
who claimed edc benefits as a certainly as a participant in a recent
report from the leaders and residents of the pre jobs act with respect
to the united states and that met the program s criteria for creating
jobs and economic opportunity for virgin islanders the increase of our
nation i believe the path
```

No punctuation, just words.

## A little documentation

Each congressperson's collection of public statements is considered a
separate corpus.

Each corpus consists of some number of documents, and each document
consists of some number of tokens, or words.

A corpus is represented as a single file, where each line is a
document. Each word is an integer, and words are separated by spaces.

There is a single `vocab.txt` file for all the corpora that serves to
map tokens to integer IDs.

`vocab.txt` contains all the unique words recognized across the corpora
on a single line, each separated by a space.

In the corpus code and in the Markov model, each word is represented
by its index in the vocabulary list. The first word on the line has
index 0.

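The file layout described above can be sketched in a few lines of Python. This is only an illustration of the format, not the code in this repository's `corpus` module: the helper names `parse_vocab`, `parse_corpus`, and `decode` are hypothetical, and the sample strings are made up rather than taken from the real data.

```python
def parse_vocab(text):
    """Split the single-line vocabulary into a list; a word's ID is its position."""
    return text.split()


def parse_corpus(text):
    """Turn each non-empty line (one document) into a list of integer word IDs."""
    return [[int(tok) for tok in line.split()]
            for line in text.splitlines() if line.strip()]


def decode(document, vocab):
    """Map a document's integer IDs back to words, joined by spaces."""
    return " ".join(vocab[i] for i in document)


# Tiny made-up example, not real data from the repository.
vocab = parse_vocab("the senate house votes today")
docs = parse_corpus("0 1 3 4\n0 2 3\n")
print(decode(docs[0], vocab))  # the senate votes today
```

With real files, the same helpers could be fed the contents of `vocab.txt` and of a corpus file read with `open(...).read()`.
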
### Using the Markov model

Initialize a corpus object by calling `corpus.read_data` with two
arguments: the first is the corpus file to use, and the second is the
vocabulary file:

```python
c = corpus.read_data('corpus.txt', 'vocab.txt')
```

Initialize a Markov object with code that looks like this:

```python
m = markov.Markov(c, 3)
```

where `c` is a corpus object, and `3` is the length of n-grams to use.

To generate text, call `m.generate(l)`, where `l` is the length of the text you want to generate.

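For readers curious about what an n-gram model does with the integer documents, here is a minimal sketch. It is not the `markov.Markov` class from this repository: the class name `NGramSketch`, the `seed` parameter, and the choice to jump to a random known context at a dead end are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class NGramSketch:
    """Toy n-gram model over documents given as lists of integer word IDs."""

    def __init__(self, documents, n, seed=None):
        self.n = n
        self.rng = random.Random(seed)
        # Map each (n-1)-word context to every word observed right after it.
        self.successors = defaultdict(list)
        for doc in documents:
            for i in range(len(doc) - n + 1):
                context = tuple(doc[i:i + n - 1])
                self.successors[context].append(doc[i + n - 1])
        self.contexts = list(self.successors)

    def generate(self, length):
        """Return `length` word IDs, or an empty list if no n-grams were seen."""
        if not self.contexts:
            return []
        out = list(self.rng.choice(self.contexts))
        while len(out) < length:
            # The context is the last n-1 generated IDs.
            context = tuple(out[len(out) - (self.n - 1):])
            followers = self.successors.get(context)
            if followers:
                out.append(self.rng.choice(followers))
            else:
                # Dead end (assumed handling): continue from a random known context.
                out.extend(self.rng.choice(self.contexts))
        return out[:length]


# Made-up documents; IDs would normally index into the vocabulary list.
docs = [[0, 1, 3, 4], [0, 2, 3, 4]]
model = NGramSketch(docs, n=3, seed=0)
ids = model.generate(10)
print(ids)
```

Storing every observed follower in a list, duplicates included, means a uniform random pick from that list reproduces the observed frequencies without computing explicit probabilities. To get words back, look each ID up in the vocabulary list and join the results with spaces.
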