In [1]:
from lda import data

Each dataset is represented as a corpus object, which is really just a list of documents in a fancy wrapper.

In [2]:
data.reuters

Corpus with 19042 documents

Each document is, in turn, a list of words in a wrapper.

In [3]:
data.reuters.documents[0]

Document including 43 words

The Reuters dataset also has titles and topics for its documents. Other datasets won't necessarily have them.

In [4]:
data.reuters.documents[0].title

'CHRYSLER <C> LATE MARCH U.S. CAR SALES UP'

In [5]:
data.reuters.documents[70].topics

['ship', 'crude']

A word has three important attributes:
* Its original form (how it is displayed in text)
* Its LDA form (standardized so that case etc. don't matter)
* Whether it should be included in LDA (for example, tokens such as "the" or "." should not be)

In [6]:
data.reuters.documents[0].words[0]

Word(original_form='Chrysler', lda_form='chrysler', include=True)

In [7]:
data.reuters.documents[0].words[7]

Word(original_form='\n', lda_form='\n', include=False)
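To make the three attributes concrete, here is a minimal sketch of a word record. It is not the package's actual implementation: the field names are copied from the output above, while `make_word`, `STOPWORDS`, and the lowercase-only normalization are assumptions made for illustration. The real `lda_form` evidently does more than lowercasing (it maps 'said' to 'say', for instance).

```python
from collections import namedtuple

# Hypothetical stand-in for the package's Word class. Only the field names
# come from the notebook output; everything else here is an assumption.
Word = namedtuple("Word", ["original_form", "lda_form", "include"])

# Illustrative stopword list, far smaller than anything used in practice.
STOPWORDS = {"the", "a", "of"}


def make_word(token):
    """Build a Word from a raw token using a deliberately crude rule."""
    lda_form = token.lower()  # standardize case only; no lemmatization
    # Exclude anything that is not purely alphabetic, plus stopwords.
    include = token.isalpha() and lda_form not in STOPWORDS
    return Word(original_form=token, lda_form=lda_form, include=include)


print(make_word("Chrysler"))
print(make_word("\n"))
```

Under these assumptions the two printed records have the same shape as the outputs of cells 6 and 7: the first is included, the newline token is not.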

You can also easily ask for only the words which should be included in LDA.

In [8]:
data.reuters.documents[0].included_words

[Word(original_form='Chrysler', lda_form='chrysler', include=True),
 Word(original_form='Corp', lda_form='corp', include=True),
 Word(original_form='said', lda_form='say', include=True),
 Word(original_form='car', lda_form='car', include=True),
 Word(original_form='sales', lda_form='sale', include=True),
 Word(original_form='March', lda_form='march', include=True),
 Word(original_form='period', lda_form='period', include=True),
 Word(original_form='rose', lda_form='rise', include=True),
 Word(original_form='pct', lda_form='pct', include=True),
 Word(original_form='year', lda_form='year', include=True),
 Word(original_form='earlier', lda_form='early', include=True),
 Word(original_form='month', lda_form='month', include=True),
 Word(original_form='March', lda_form='march', include=True),
 Word(original_form='said', lda_form='say', include=True),
 Word(original_form='auto', lda_form='auto', include=True),
 Word(original_form='sales', lda_form='sale', include=True),
 Word(original_form='i
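Conceptually, asking for the included words amounts to filtering on the `include` flag. The sketch below shows that idea on a hand-made list; the `Word` namedtuple and the `included_words` function are hypothetical stand-ins, not the package's own code.

```python
from collections import namedtuple

# Hypothetical stand-in for the package's Word class (field names taken
# from the notebook output).
Word = namedtuple("Word", ["original_form", "lda_form", "include"])


def included_words(words):
    """Return only the words flagged for inclusion, preserving order."""
    return [w for w in words if w.include]


# A tiny hand-made document; the tokens are illustrative.
words = [
    Word("Chrysler", "chrysler", True),
    Word("Corp", "corp", True),
    Word(".", ".", False),
    Word("said", "say", True),
]

for w in included_words(words):
    print(w.lda_form)
```

Order and duplicates are preserved, which matches the output above, where 'March' and 'said' each appear more than once.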

One more thing! Both documents and the corpus as a whole can count words for you. A word count is a dictionary (a `Counter`) mapping word objects to the number of times they appear.

In [9]:
data.reuters.word_count.most_common(10)

[(Word(original_form='said', lda_form='say', include=True), 53721),
 (Word(original_form='dlrs', lda_form='dlrs', include=True), 20413),
 (Word(original_form='Reuter', lda_form='reuter', include=True), 18924),
 (Word(original_form='pct', lda_form='pct', include=True), 17036),
 (Word(original_form='vs', lda_form='vs', include=True), 14576),
 (Word(original_form='year', lda_form='year', include=True), 14347),
 (Word(original_form='mln', lda_form='mln', include=True), 14318),
 (Word(original_form='company', lda_form='company', include=True), 11383),
 (Word(original_form='Bank', lda_form='bank', include=True), 10103),
 (Word(original_form='shares', lda_form='share', include=True), 9646)]
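The counting described above can be sketched with `collections.Counter`. This is an illustration under stated assumptions rather than the package's code: `Word`, `count_words`, and `corpus_word_count` are hypothetical names, and restricting the count to included words is inferred from the outputs, which show only words with `include=True`.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for the package's Word class (field names taken
# from the notebook output).
Word = namedtuple("Word", ["original_form", "lda_form", "include"])


def count_words(words):
    """Count included words in one document (assumed behaviour)."""
    return Counter(w for w in words if w.include)


def corpus_word_count(documents):
    """Sum the per-document counts over an iterable of word lists."""
    total = Counter()
    for words in documents:
        total.update(count_words(words))
    return total


# Two tiny hand-made documents; the tokens are illustrative.
doc_a = [
    Word("said", "say", True),
    Word("sales", "sale", True),
    Word("said", "say", True),
    Word(".", ".", False),
]
doc_b = [Word("said", "say", True)]

total = corpus_word_count([doc_a, doc_b])
print(total.most_common(1))
```

Because a namedtuple hashes on all of its fields, this sketch treats two records as the same word only if `original_form`, `lda_form`, and `include` all match. Whether the package compares words that way is not shown in the notebook.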

In [10]:
data.reuters.documents[0].word_count

Counter({Word(original_form='Chrysler', lda_form='chrysler', include=True): 2,
         Word(original_form='Corp', lda_form='corp', include=True): 1,
         Word(original_form='said', lda_form='say', include=True): 5,
         Word(original_form='car', lda_form='car', include=True): 1,
         Word(original_form='sales', lda_form='sale', include=True): 4,
         Word(original_form='March', lda_form='march', include=True): 3,
         Word(original_form='period', lda_form='period', include=True): 1,
         Word(original_form='rose', lda_form='rise', include=True): 1,
         Word(original_form='pct', lda_form='pct', include=True): 4,
         Word(original_form='year', lda_form='year', include=True): 3,
         Word(original_form='earlier', lda_form='early', include=True): 1,
         Word(original_form='month', lda_form='month', include=True): 2,
         Word(original_form='auto', lda_form='auto', include=True): 1,
         Word(original_form='increased', lda_form='increase',