In [1]:
# !pip install nltk
# !pip install gensim

In [1]:
import gensim, logging, nltk, string
from nltk.corpus import brown
from nltk.util import ngrams

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Building Tf-Idf from the Brown Corpus


First we need to get the Brown Corpus, which is easily accessible through the Natural Language Toolkit ([nltk](https://www.nltk.org/)).

In [2]:
nltk.download('brown')

[nltk_data] Downloading package brown to /Users/ethan/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

You can view the words in this corpus quite easily:

In [3]:
brown.words()[0:20]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that']

### Step 1: Build a Vocabulary

Our first step will be to build a vocabulary set using the [gensim](https://radimrehurek.com/gensim/about.html) library. You could also use sklearn to do this, but I found gensim a bit more intuitive. Once we've created this vocabulary, we'll use it to create a bag-of-words representation for each document.

We'll start by generating a vocabulary based on unigrams and bigrams, that is, one-word and two-word tokens. The set we generate will ultimately define the universe of possible candidates that our keyword extractor can consider.

In [4]:
%%time
unigrams = list(ngrams(brown.words(), 1))  # 1 indicates that we want unigrams
bigrams = list(ngrams(brown.words(), 2))  # 2 indicates that we want bigrams

tokens = unigrams + bigrams
print(len(tokens))

2322383
CPU times: user 7.55 s, sys: 463 ms, total: 8.01 s
Wall time: 8.35 s
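
To make the n-gram step concrete, here is a minimal pure-Python sketch on a five-word sample. `simple_ngrams` is a hypothetical stand-in written for illustration, not the `nltk.util.ngrams` implementation. It shows that a sequence of N words yields N unigrams and N - 1 bigrams, which is consistent with the odd total printed above.

```python
# Hypothetical stand-in for nltk.util.ngrams, for illustration only.
def simple_ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = ['The', 'Fulton', 'County', 'Grand', 'Jury']

sample_unigrams = simple_ngrams(sample, 1)  # one tuple per word
sample_bigrams = simple_ngrams(sample, 2)   # one tuple per adjacent pair

print(sample_unigrams)
print(sample_bigrams)
print(len(sample_unigrams) + len(sample_bigrams))  # N + (N - 1) tokens in total
```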


In [5]:
from sklearn.utils import shuffle
shuffle(tokens)[0:20]

[('major', 'denominations'),
 ('came',),
 (')',),
 ('Here',),
 ('taxed',),
 ('us', 'who'),
 ('I', 'loved'),
 ('on',),
 ('.',),
 ('largest', 'heat'),
 ('a',),
 ('.',),
 ('are',),
 ('For',),
 ('if', 'their'),
 ('whose', 'daughter'),
 ('So', "Enright's"),
 ('that',),
 ('Moreover', ','),
 ('mention', 'names')]

We can filter this set, removing tokens that won't be of interest, e.g. punctuation and stop words like "the" and "a".

In [8]:
%%time
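# The stop word list must be available locally: nltk.download('stopwords')
punct = set(string.punctuation)
stopwords = set(nltk.corpus.stopwords.words('english'))

def has_no_punct_or_stopword(token):
    """Return True if no word in the n-gram is punctuation or a stop word."""
    for w in token:
        if (w in punct or w in stopwords):
            return False

    return True

# Keep only the n-grams that pass the check
tokens = list(filter(has_no_punct_or_stopword, tokens))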
print(len(tokens))

844500
CPU times: user 271 ms, sys: 9.09 ms, total: 280 ms
Wall time: 283 ms
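
One detail worth knowing about this kind of filter is that the membership test is case-sensitive. The sketch below uses a tiny, hypothetical stop list (`toy_stopwords`) and a hypothetical helper (`keep`) to show the difference; it assumes, as I believe is the case for NLTK's English list, that the stop words are all lowercase.

```python
import string

punct = set(string.punctuation)
# Hypothetical, tiny stop list for illustration; not NLTK's actual list.
toy_stopwords = {'the', 'a', 'of', 'we'}

def keep(token, lowercase=False):
    """Return True when no word in the token is punctuation or a stop word."""
    for w in token:
        key = w.lower() if lowercase else w
        if key in punct or key in toy_stopwords:
            return False
    return True

toy_tokens = [('We',), ('the', 'jury'), ('legal', 'residence'), ('.',), ('Moreover', ',')]

# Case-sensitive check: the capitalized 'We' is not in the lowercase stop list.
case_sensitive = [t for t in toy_tokens if keep(t)]
# Lowercasing each word before the lookup also removes 'We'.
case_insensitive = [t for t in toy_tokens if keep(t, lowercase=True)]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase is a design choice: keeping case preserves proper nouns as distinct tokens, at the cost of letting sentence-initial stop words through.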


Better?

In [9]:
shuffle(tokens)[0:20]

[('managers',),
 ('finally',),
 ('Report',),
 ('bringing', 'Myra'),
 ('legal', 'residence'),
 ('chicken',),
 ('expected',),
 ('every', 'opposing'),
 ('members',),
 ('board',),
 ('managed',),
 ('September',),
 ('putting',),
 ('We',),
 ('liquor',),
 ('years',),
 ('make',),
 ('people', 'might'),
 ('Caves',),
 ('denial',)]

Yep. So now let's build the vocabulary.


In [39]:
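# Note: the name `dict` shadows Python's built-in for the rest of the notebook.
# Dictionary treats each tuple as one document, so the words of a bigram tuple
# are added to the vocabulary individually.
dict = gensim.corpora.Dictionary(tokens)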

2018-07-16 14:29:28,029 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-07-16 14:29:28,127 : INFO : adding document #10000 to Dictionary(3821 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:28,202 : INFO : adding document #20000 to Dictionary(6666 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:28,271 : INFO : adding document #30000 to Dictionary(9125 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:28,342 : INFO : adding document #40000 to Dictionary(11422 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:28,414 : INFO : adding document #50000 to Dictionary(13230 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:28,495 : INFO : adding document #60000 to Dictionary(14842 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:28,573 : INFO : adding document #70000 to Dictionary(16175 

2018-07-16 14:29:32,997 : INFO : adding document #590000 to Dictionary(55127 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:33,079 : INFO : adding document #600000 to Dictionary(55793 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:33,168 : INFO : adding document #610000 to Dictionary(55887 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:33,254 : INFO : adding document #620000 to Dictionary(55887 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:33,361 : INFO : adding document #630000 to Dictionary(55887 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:33,449 : INFO : adding document #640000 to Dictionary(55887 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14:29:33,548 : INFO : adding document #650000 to Dictionary(55887 unique tokens: ['The', 'Fulton', 'County', 'Grand', 'Jury']...)
2018-07-16 14

### Step 2: Build a bag-of-words representation of the Corpus

In order to build our Tf-Idf model, we first need to build a bag-of-words representation of the Brown Corpus. This just means that we generate a table in which each row represents a document and each column a token. The value in each cell, then, is the number of times that token occurs in that document. So let's do that.
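
Before working on the real corpus, here is a toy sketch of that table using only the standard library. The two documents and the names `toy_docs`, `vocab`, and `bow_table` are made up for illustration.

```python
from collections import Counter

# Two made-up, already tokenized documents.
toy_docs = [
    ['the', 'jury', 'said', 'the', 'jury'],
    ['the', 'election', 'said'],
]

# Columns: every distinct token, in a fixed (sorted) order.
vocab = sorted({tok for doc in toy_docs for tok in doc})

# Rows: one list of counts per document; a missing token counts as 0.
bow_table = [[Counter(doc)[tok] for tok in vocab] for doc in toy_docs]

print(vocab)
for row in bow_table:
    print(row)
```

Most cells in such a table are zero for a real corpus, which is why libraries typically store only the non-zero entries.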

In [43]:
# Get the brown docs
brown_docs = [brown.words(file_id) for file_id in brown.fileids()]

In [44]:
len(brown_docs)

500

In [45]:
brown_docs[0:2]

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...],
 ['Austin', ',', 'Texas', '--', 'Committee', 'approval', ...]]

We'll also need to tokenize the texts as we did above.

In [63]:
%%time
def tokenize(text):
    """Return the unigrams and bigrams of a document as space-joined strings."""
    unigrams = [" ".join(token) for token in ngrams(text, 1)]
    bigrams = [" ".join(token) for token in ngrams(text, 2)]
    tokens = unigrams + bigrams
    return tokens

brown_docs_tokenized = [tokenize(text) for text in brown_docs]

CPU times: user 21.1 s, sys: 11.9 s, total: 33 s
Wall time: 1min 4s


In [74]:
shuffle(brown_docs_tokenized[0])[0:20]

['promise .',
 'State Welfare',
 'we',
 'over-all',
 'directed',
 'of',
 'city',
 'million to',
 'work .',
 'swipe',
 'to',
 'granted',
 'the jury',
 'employed',
 "mayor's present",
 'will wind',
 'his race',
 '.',
 'the Legislature',
 'Policeman']

Now that we have an array of tokenized documents, we can generate the bag-of-words representation.

In [75]:
corpus_bow = [dict.doc2bow(doc) for doc in brown_docs_tokenized]

Let's take a quick look. If we examine the first document, we'll see that it contains a series of tuples with two integers. These are, respectively, the id of the token in the dict and the number of times that token occurs in the document.

In [78]:
corpus_bow[0][0:5]

[(0, 28), (1, 14), (2, 10), (3, 1), (4, 1)]

Taking the first tuple, we can see that index 0 refers to "The", so "The" occurs 28 times in the first document.

In [85]:
dict[0]

'The'
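
The `(id, count)` pairs can be reproduced with a small pure-Python sketch. `toy_doc2bow` is a hypothetical helper that only mimics the behavior described above, and the ids in `token2id` are invented for the example. Like gensim's `doc2bow` with its default settings, it skips tokens that are not in the vocabulary.

```python
from collections import Counter

def toy_doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Hypothetical vocabulary: token -> integer id.
token2id = {'The': 0, 'Fulton': 1, 'County': 2}
id2token = {i: tok for tok, i in token2id.items()}

doc = ['The', 'Fulton', 'County', 'The', 'jury', 'The']
bow = toy_doc2bow(doc, token2id)  # 'jury' has no id, so it is ignored

print(bow)
print([(id2token[i], n) for i, n in bow])  # map ids back to tokens
```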

### Step 3: Build the Tf-Idf model

At this point, generating the tf-idf model with gensim takes a single call.

In [86]:
%%time
tfidf_model = gensim.models.TfidfModel(corpus_bow)

2018-07-16 16:43:36,490 : INFO : collecting document frequencies
2018-07-16 16:43:36,508 : INFO : PROGRESS: processing document #0
2018-07-16 16:43:37,253 : INFO : calculating IDF weights for 500 documents and 55886 features (354853 matrix non-zeros)


CPU times: user 379 ms, sys: 275 ms, total: 654 ms
Wall time: 1.05 s
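
To give a rough idea of what the model computes, here is a pure-Python sketch of tf-idf weighting over a bag-of-words corpus. It assumes what I understand to be gensim's default scheme: raw term frequency, idf = log2(N / df), and L2 normalization of each document vector. `toy_tfidf` and the toy corpus are hypothetical, and this is not gensim's implementation.

```python
import math

def toy_tfidf(bow_corpus):
    """Weight each (token_id, count) pair by count * log2(N / df), then L2-normalize."""
    n_docs = len(bow_corpus)

    # Document frequency: in how many documents does each token id appear?
    df = {}
    for bow in bow_corpus:
        for token_id, _ in bow:
            df[token_id] = df.get(token_id, 0) + 1

    idf = {token_id: math.log2(n_docs / d) for token_id, d in df.items()}

    weighted_corpus = []
    for bow in bow_corpus:
        # Tokens that occur in every document have idf 0 and are dropped.
        weights = [(tid, tf * idf[tid]) for tid, tf in bow if idf[tid] > 0]
        norm = math.sqrt(sum(w * w for _, w in weights))
        weighted_corpus.append([(tid, w / norm) for tid, w in weights] if norm else [])
    return weighted_corpus

# Made-up corpus: token 0 appears in every document, tokens 2 and 3 in one each.
toy_corpus = [
    [(0, 3), (1, 1), (2, 2)],
    [(0, 1), (1, 4)],
    [(0, 2), (3, 1)],
]
toy_weights = toy_tfidf(toy_corpus)
for doc in toy_weights:
    print(doc)
```

The point of the weighting is visible even on this toy input: a token that appears in every document contributes nothing, while a token concentrated in a single document scores highest there.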


And then we can save both the vocabulary and the model to be used in our keyword extractor.

In [88]:
tfidf_model.save('./brown_tfidf.mm')
dict.save('./brown_vocab.mm')

2018-07-16 16:45:00,363 : INFO : saving TfidfModel object under ./brown_tfidf.mm, separately None
2018-07-16 16:45:01,252 : INFO : saved ./brown_tfidf.mm
2018-07-16 16:45:01,254 : INFO : saving Dictionary object under ./brown_vocab.mm, separately None
2018-07-16 16:45:01,756 : INFO : saved ./brown_vocab.mm


And we are done!