talk-like-a-congressman
=======================

n-gram models trained on congressional public speech data.

I'm working with this data as part of the
[lazer lab](www.lazerlab.net) at Northeastern University, so I decided
to have a little fun with it.

## To Use:

Right now, there's no main script; you have to run a few methods in a Python interpreter.

To do this, import the modules:

```python
import corpus
import markov
```

read the vocabulary data and the data of the congressman you want to talk like:

```python
c = corpus.read_data('corpora/1032/corpus.txt', 'vocab.txt')
```

create the markov model:

```python
m = markov.Markov(c, 3)
```

generate some text:

```python
m.generate(100)
```
should output something like:

```
they now have better access to technology in the affordable care
act is already paying dividends for millions of americans with more to
come children can no longer have the records to defend themselves
similarly at least some irs agents have taken the position that anyone
who claimed edc benefits as a certainly as a participant in a recent
report from the leaders and residents of the pre jobs act with respect
to the united states and that met the program s criteria for creating
jobs and economic opportunity for virgin islanders the increase of our
nation i believe the path
```

No punctuation, just words.


## A little documentation

Each congressperson's collection of public statements is treated as a
separate corpus.

Each corpus consists of some number of documents, and each document
consists of some number of tokens, or words.

A corpus is represented as a single file, where each line is a
document. Each word is an integer, and words are separated by spaces.

There is a single `vocab.txt` file for all the corpora that serves to
map tokens to integer IDs.

`vocab.txt` contains all the unique words recognized in the corpus on
a single line, each separated by a space.

In the corpus code and in the markov model, each word is represented
by its index in the vocabulary list; the first word on the line has
index 0.
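
To make that layout concrete, here's a minimal sketch of reading the two
files back into words. It's only illustrative and assumes the format
described above; the function names are hypothetical, not the actual
`corpus.py` API:

```python
# Illustrative only -- the real corpus.py may work differently.

def load_vocab(vocab_path):
    """vocab.txt: all unique words on one line, separated by spaces.
    A word's position on that line is its integer ID (first word = 0)."""
    with open(vocab_path) as f:
        return f.read().split()

def decode_corpus(corpus_path, vocab):
    """corpus.txt: one document per line, each word written as its vocab index."""
    documents = []
    with open(corpus_path) as f:
        for line in f:
            ids = [int(tok) for tok in line.split()]
            documents.append([vocab[i] for i in ids])
    return documents

vocab = load_vocab('vocab.txt')
docs = decode_corpus('corpora/1032/corpus.txt', vocab)
print(docs[0][:10])  # first ten words of the first document
```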

### using the markov model

initialize a corpus object by calling `corpus.read_data` with two
arguments - the first is the corpus file to use, and the second is the
vocabulary file:

```python
c = corpus.read_data('corpus.txt', 'vocab.txt')
```

initialize a markov object with code that looks like this:

```python
m = markov.Markov(c, 3)
```

where `c` is the corpus object from the previous step, and `3` is the length of the n-grams to use.

to generate text, call `m.generate(l)`, where `l` is the number of words you want to generate.
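
Under the hood, an n-gram Markov model of this sort typically counts
which word follows each (n-1)-word context and samples from those counts
at generation time. Here's a rough, self-contained sketch of that idea
(my own illustration, not the actual `markov.py` code):

```python
import random
from collections import defaultdict

class SketchMarkov:
    """Toy n-gram generator: documents is a list of word lists."""

    def __init__(self, documents, n):
        self.n = n
        # map each (n-1)-word context to the words observed after it
        self.table = defaultdict(list)
        for doc in documents:
            for i in range(len(doc) - n + 1):
                context = tuple(doc[i:i + n - 1])
                self.table[context].append(doc[i + n - 1])

    def generate(self, length):
        # start from a random context, then repeatedly sample a next word
        context = random.choice(list(self.table.keys()))
        words = list(context)
        while len(words) < length:
            followers = self.table.get(tuple(words[-(self.n - 1):]))
            if not followers:  # dead end: restart from a fresh random context
                words.extend(random.choice(list(self.table.keys())))
                continue
            words.append(random.choice(followers))
        return ' '.join(words[:length])

# m = SketchMarkov(docs, 3); print(m.generate(50))
```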

