In [1]:
# !pip install nltk
# !pip install gensim

In [2]:
import gensim, logging, nltk, string
from nltk.corpus import brown
from nltk.util import ngrams
from random import shuffle

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Building Tf-Idf from the Brown Corpus


First we need to get the Brown Corpus, which is easily accessible through the Natural Language Toolkit ([nltk](https://www.nltk.org/)).

In [3]:
nltk.download('brown')

[nltk_data] Downloading package brown to /Users/ethan/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

You can view the words in this corpus quite easily:

In [4]:
brown.words()[0:20]

['The',
 'Fulton',
 'County',
 'Grand',
 'Jury',
 'said',
 'Friday',
 'an',
 'investigation',
 'of',
 "Atlanta's",
 'recent',
 'primary',
 'election',
 'produced',
 '``',
 'no',
 'evidence',
 "''",
 'that']

In [5]:
# Get the brown docs
brown_docs = [brown.words(file_id) for file_id in brown.fileids()]

In [6]:
len(brown_docs)

500

In [7]:
brown_docs[0:2]

[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...],
 ['Austin', ',', 'Texas', '--', 'Committee', 'approval', ...]]

### Step 1: Build a Vocabulary and a Bag-of-words Representation of the Documents

Our first step is to build a vocabulary and a bag-of-words representation of the Brown Corpus documents.

The bag-of-words representation of the corpus is simply a matrix in which each row represents a document, each column represents a token, and each cell holds the number of times that token occurs in that document. We use this representation to build the Tf-Idf model.

The vocabulary is the set of tokens in our corpus, i.e. the column names of the bag-of-words representation. These are the only tokens our model will be capable of scoring; a word or phrase that is not in this set will be ignored.
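
To make the two ideas concrete, here is a minimal sketch in plain Python on two made-up documents. The toy data, the names `token2id` and `to_bow`, and the order in which ids are assigned are assumptions for illustration only; gensim builds its own dictionary and may number tokens differently. The sparse `(token_id, count)` pairs are the same shape we will see in `corpus_bow`.

```python
from collections import Counter

# Toy documents, already tokenized (hypothetical data, not from the Brown Corpus).
docs = [
    ["the", "jury", "said", "the", "election"],
    ["the", "committee", "said", "no"],
]

# Vocabulary: every distinct token gets an integer id, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return one matrix row in sparse form: sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

corpus_bow_toy = [to_bow(doc) for doc in docs]
print(token2id)
print(corpus_bow_toy)
```

Only non-zero cells are stored, which matters when the vocabulary has hundreds of thousands of columns and each document uses a small fraction of them.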

In [8]:
from itertools import chain

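def tokenize(tokenized_doc):
    """Return a generator over the purely alphabetic unigrams and bigrams of a tokenized document."""
    unigrams = ngrams(tokenized_doc, 1)
    bigrams = ngrams(tokenized_doc, 2)
    tokens = chain(unigrams, bigrams)
    # Each n-gram is a tuple of words; skip any n-gram containing punctuation or digits.
    return (" ".join(token) for token in tokens if all(word.isalpha() for word in token))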

In [9]:
list(tokenize(nltk.word_tokenize("I love eating pasta.")))

['I', 'love', 'eating', 'pasta', 'I love', 'love eating', 'eating pasta']
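
Notice that neither `.` nor `pasta .` shows up in the output. The sketch below reproduces the same filtering without nltk, starting from a hand-tokenized sentence; the function name `unigrams_and_bigrams` is hypothetical and the token list is written out by hand rather than produced by `nltk.word_tokenize`.

```python
def unigrams_and_bigrams(words):
    """Join alphabetic unigrams and bigrams into strings, unigrams first."""
    unigrams = [(word,) for word in words]
    bigrams = list(zip(words, words[1:]))  # adjacent pairs
    return [
        " ".join(gram)
        for gram in unigrams + bigrams
        if all(word.isalpha() for word in gram)  # drops "." and "pasta ."
    ]

words = ["I", "love", "eating", "pasta", "."]
grams = unigrams_and_bigrams(words)
print(grams)
```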

Now let's generate the corpus bag-of-words representation and the dictionary using gensim's `Text2BowTransformer`.

In [10]:
# Instantiate a transformer that takes a set of documents, tokenizes them, and builds a dictionary.
# Note: gensim.sklearn_api ships with gensim 3.x; it was removed in gensim 4.0.
from gensim.sklearn_api import Text2BowTransformer
bow_transformer = Text2BowTransformer(tokenizer=tokenize)
bow_transformer

Text2BowTransformer(prune_at=2000000,
          tokenizer=<function tokenize at 0x1a151922f0>)

In [11]:
%%time
# This will take a while so be patient
corpus_bow = bow_transformer.fit_transform(brown_docs)

2018-07-20 15:48:12,664 : INFO : adding document #0 to Dictionary(0 unique tokens: [])
2018-07-20 15:48:15,435 : INFO : built Dictionary(397456 unique tokens: ['A', 'A Highway', 'A revolving', 'A similar', 'A veteran']...) from 500 documents (total 1819791 corpus positions)


CPU times: user 21.4 s, sys: 1.11 s, total: 22.5 s
Wall time: 22.9 s


In [12]:
corpus_bow[0][0:10]

[(0, 4),
 (1, 1),
 (2, 1),
 (3, 1),
 (4, 1),
 (5, 1),
 (6, 1),
 (7, 1),
 (8, 2),
 (9, 1)]

In [13]:
vocab = bow_transformer.gensim_model
len(vocab)

397456

In [14]:
tokens = [vocab[token_id] for token_id in vocab]  # avoid shadowing the built-in `id`
tokens[0:10000:400]

['A',
 'action of',
 'does provide',
 'legislators act',
 'rejected a',
 'traditional',
 'Navigation',
 'called for',
 'insurance firms',
 'representing the',
 'would produce',
 'boost',
 'immediate action',
 'retirement systems',
 'year opposed',
 'also called',
 'element',
 'ministers',
 'such problems',
 'Indicating',
 'council voted',
 'motorists',
 'the rescue',
 'Scotch Plains',
 'explicit on']

### Step 2: Build the Tf-Idf model

Now we can quite easily build a Tf-Idf model from the bag-of-words corpus.

In [15]:
%%time
tfidf_model = gensim.models.TfidfModel(corpus_bow)

2018-07-20 15:48:27,287 : INFO : collecting document frequencies
2018-07-20 15:48:27,289 : INFO : PROGRESS: processing document #0
2018-07-20 15:48:27,670 : INFO : calculating IDF weights for 500 documents and 397455 features (1077838 matrix non-zeros)


CPU times: user 1.29 s, sys: 32.8 ms, total: 1.32 s
Wall time: 1.37 s
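
For intuition, the weighting can be sketched in plain Python. The sketch assumes what we understand to be gensim's default scheme: raw term counts, an inverse document frequency of log2(N / df), and L2 normalization of each document vector. The function name `tfidf_sketch` and the toy numbers are made up for illustration; this is not gensim's implementation.

```python
import math

def tfidf_sketch(bow, doc_freqs, num_docs):
    """Score one bag-of-words document: tf * log2(N / df), then L2-normalize."""
    weighted = [
        (token_id, count * math.log2(num_docs / doc_freqs[token_id]))
        for token_id, count in bow
    ]
    norm = math.sqrt(sum(weight ** 2 for _, weight in weighted))
    if norm == 0:
        return []
    # Tokens that occur in every document have weight 0 and are dropped.
    return [(token_id, weight / norm) for token_id, weight in weighted if weight > 0]

# Hypothetical corpus statistics: 4 documents, document frequency per token id.
toy_doc_freqs = {0: 4, 1: 2, 2: 1}
# Hypothetical document: token 0 occurs three times, tokens 1 and 2 once each.
toy_bow = [(0, 3), (1, 1), (2, 1)]

toy_scores = tfidf_sketch(toy_bow, toy_doc_freqs, num_docs=4)
print(toy_scores)
```

Token 0 is the most frequent token in the document but appears in all four documents, so it contributes nothing. Token 2 appears in only one document and ends up with the highest score.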


In [16]:
tfidf_model.save('./brown_tfidf.mm')
vocab.save('./brown_vocab.mm')

2018-07-20 15:48:28,667 : INFO : saving TfidfModel object under ./brown_tfidf.mm, separately None
2018-07-20 15:48:30,740 : INFO : saved ./brown_tfidf.mm
2018-07-20 15:48:30,742 : INFO : saving Dictionary object under ./brown_vocab.mm, separately None
2018-07-20 15:48:31,056 : INFO : saved ./brown_vocab.mm


And we are done with our setup of the Tf-Idf model and dictionary. Both have been saved to disk so that they can be loaded and used in some service.

## Extracting Keywords with the Tf-Idf Model

Now that we have the Tf-Idf model, we can use it to extract keywords. There are two steps involved in this process: 1) candidate selection, 2) keyword scoring and selection.

To accomplish the candidate selection, we'll use a few functions:

In [17]:
def get_pairs(phrase, tag_combos=(('JJ', 'NN'),)):
    # Yield adjacent word pairs whose POS tags match one of the allowed combinations.
    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))
    bigrams = nltk.ngrams(tagged, 2)
    for bigram in bigrams:
        tokens, tags = zip(*bigram)
        if tags in tag_combos:
            yield tokens


def get_unigrams(phrase, tags=('NN',)):
    # The default must be a tuple: with the plain string 'NN', `tag in tags` is a substring test.
    tagged = nltk.pos_tag(nltk.word_tokenize(phrase))
    return ((unigram,) for unigram, tag in tagged if tag in tags)

def get_tokens(doc):
    unigram_tags = ('NNP', 'NN')
    bigram_tag_combos = (('JJ', 'NN'), ('JJ', 'NNS'), ('JJR', 'NN'), ('JJR', 'NNS'))
    unigrams = list(get_unigrams(doc, tags=unigram_tags))
    bigrams = list(get_pairs(doc, tag_combos=bigram_tag_combos))
    return unigrams + bigrams


In [18]:
sample_text = """
Hegeman begins the book with a very frank statement of her own contextual relationship to the project: 
The book she says is an attempt to make sense of her own "local context as an academic trained in the 
latter part of one century and yet living in another."  (ix)  She contrasts the earlier period as one in
which she was faced with a "host of exciting possibilities broad about by the intellectual challenges of 
theory and interdisciplinarity" with the latter where she is "reckoning with forces that still evade
comprehension: globalization, financialization, neoliberlaism, and (perhaps most peronally undirgirding it all) 
what often feel like the final days of the American century's grand experiment in public higher education."  (ix)
"""

In [19]:
get_tokens(sample_text)

[('Hegeman',),
 ('book',),
 ('statement',),
 ('relationship',),
 ('project',),
 ('book',),
 ('attempt',),
 ('sense',),
 ('context',),
 ('part',),
 ('century',),
 ('ix',),
 ('period',),
 ('host',),
 ('theory',),
 ('interdisciplinarity',),
 ('latter',),
 ('comprehension',),
 ('globalization',),
 ('financialization',),
 ('neoliberlaism',),
 ('century',),
 ('experiment',),
 ('education',),
 ('ix',),
 ('frank', 'statement'),
 ('contextual', 'relationship'),
 ('local', 'context'),
 ('latter', 'part'),
 ('earlier', 'period'),
 ('intellectual', 'challenges'),
 ('final', 'days'),
 ('American', 'century'),
 ('grand', 'experiment'),
 ('higher', 'education')]
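
The selection logic is just a filter over part-of-speech patterns. The sketch below applies the same unigram tags and bigram tag combinations as `get_tokens` to a short, hand-tagged fragment, so it runs without nltk. The tags are written out by hand as an assumption about what a tagger would return, and `select_candidates` is a hypothetical helper, not part of the notebook's pipeline.

```python
# Hand-tagged tokens (assumed tags; the notebook obtains them from nltk.pos_tag).
tagged = [
    ("a", "DT"), ("very", "RB"), ("frank", "JJ"), ("statement", "NN"),
    ("of", "IN"), ("public", "JJ"), ("higher", "JJR"), ("education", "NN"),
]

UNIGRAM_TAGS = ("NNP", "NN")
BIGRAM_TAG_COMBOS = (("JJ", "NN"), ("JJ", "NNS"), ("JJR", "NN"), ("JJR", "NNS"))

def select_candidates(tagged_tokens):
    """Keep nouns, plus adjective-noun pairs of adjacent tokens."""
    unigrams = [(word,) for word, tag in tagged_tokens if tag in UNIGRAM_TAGS]
    bigrams = [
        (first, second)
        for (first, first_tag), (second, second_tag) in zip(tagged_tokens, tagged_tokens[1:])
        if (first_tag, second_tag) in BIGRAM_TAG_COMBOS
    ]
    return unigrams + bigrams

candidates = select_candidates(tagged)
print(candidates)
```

The pair `('public', 'higher')` is rejected because its tag combination `('JJ', 'JJR')` is not in the allowed list, which matches the real output above, where only `('higher', 'education')` survives from "public higher education".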

Now that we can extract candidates, all that's left is to score them using our model. Here's a function that will do that.

In [20]:
def get_keywords(text, model, vocab):
    tokens = [" ".join(x) for x in get_tokens(text)]
    bow = vocab.doc2bow(tokens)
    scores = model[bow]
    sorted_list = sorted(scores, key=lambda x: x[1], reverse=True)
    for word_id, score in sorted_list:
        yield vocab[word_id], score

The function takes two steps. First it transforms the set of candidate tokens into a bag-of-words representation, and then it scores them by passing the bag-of-words representation to the Tf-Idf model.

In [21]:
list(get_keywords(sample_text, tfidf_model, vocab))

[('final days', 0.45347209559671714),
 ('latter part', 0.40289402220134457),
 ('higher education', 0.3114814923020272),
 ('comprehension', 0.3114814923020272),
 ('century', 0.2638387273844912),
 ('book', 0.2423629128831085),
 ('context', 0.23131714261163927),
 ('host', 0.22157352572021147),
 ('experiment', 0.19615802661572673),
 ('theory', 0.16949088900733716),
 ('project', 0.15718670287645706),
 ('relationship', 0.15232031759951312),
 ('education', 0.15115279292125136),
 ('attempt', 0.13556856085166466),
 ('latter', 0.1319193636922456),
 ('statement', 0.13103488333965807),
 ('period', 0.08932653931694925),
 ('sense', 0.0817875836493768),
 ('part', 0.038253760734047626)]
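
Several candidates returned by `get_tokens`, such as 'Hegeman' and 'neoliberlaism', are missing from the scored list. The most likely reason is that they never occur in the Brown Corpus: `doc2bow` only counts tokens that are already in the vocabulary and skips the rest. The sketch below mimics that behaviour with a hypothetical three-entry vocabulary; `doc2bow_sketch` is a simplified stand-in, not gensim's implementation.

```python
from collections import Counter

# Hypothetical miniature vocabulary; the real one has 397,456 entries.
toy_token2id = {"book": 0, "century": 1, "higher education": 2}

def doc2bow_sketch(tokens, token2id):
    """Count known tokens; silently skip anything missing from the vocabulary."""
    counts = Counter(token2id[token] for token in tokens if token in token2id)
    return sorted(counts.items())

toy_candidates = ["Hegeman", "book", "book", "century", "neoliberlaism", "higher education"]
toy_bow = doc2bow_sketch(toy_candidates, toy_token2id)
print(toy_bow)
```

This is the practical meaning of the earlier remark that the vocabulary defines what the model can score: proper names, rare technical terms, and misspellings simply drop out.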

Here we have a set of keywords, scored by Tf-Idf. Judging keyword quality is subjective, but the results in this case aren't great. Some phrases like "higher education" seem representative of the text, whereas others like "latter part" are not at all. So there is room for improvement, but this is also a challenging text for which to select keywords.