# Extractive summarization with `gensim`

Practical course material for the ASDM Class 09 (Text Mining) by Florian Leitner.

© 2017 Florian Leitner. All rights reserved.

## Introduction

This notebook explains how to use [gensim](http://radimrehurek.com/gensim/index.html) for document summarization, closely following [its own tutorial for extractive summarization](https://nbviewer.jupyter.org/github/RaRe-Technologies/gensim/blob/develop/docs/notebooks/summarization_tutorial.ipynb).

## Setup

In line with the class content for Day 5, the gensim summarizer is based on TextRank, a graph-based content extraction technique published by [Mihalcea and Tarau (2004)](http://web.eecs.umich.edu/%7Emihalcea/papers/mihalcea.emnlp04.pdf), [in parallel](https://en.wikipedia.org/wiki/Automatic_summarization#TextRank_and_LexRank) with LexRank by [Erkan and Radev (2004)](http://www.jair.org/media/1523/live-1523-2354-jair.pdf), which adds some features. Gensim's summarizer also includes the BM25 ranking-based improvement described by [Barrios et al. (2015)](https://arxiv.org/abs/1602.03606). Both LexRank and TextRank are in turn based on PageRank, Google's world-famous link ranking algorithm, which explains their names.
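
To make the idea concrete, here is a minimal pure-Python sketch of TextRank for sentences, assuming the word-overlap similarity and the weighted PageRank update from the original TextRank paper. It is not gensim's implementation (which uses BM25-based weights); the helper names `split_sentences`, `similarity`, and `textrank` are hypothetical, and the sentence splitter is deliberately naive.

```python
import math
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def similarity(tokens_a, tokens_b):
    """Shared words, normalized by the log of both sentence lengths."""
    if len(tokens_a) < 2 or len(tokens_b) < 2:
        return 0.0  # guard: log(1) = 0 could make the denominator zero
    overlap = len(set(tokens_a) & set(tokens_b))
    return overlap / (math.log(len(tokens_a)) + math.log(len(tokens_b)))


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence via weighted PageRank iterations."""
    tokens = [re.findall(r'\w+', s.lower()) for s in sentences]
    n = len(sentences)
    # Symmetric edge weights between all pairs of distinct sentences.
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0)
            for i in range(n)
        ]
    return scores


sentences = split_sentences(
    "Neo is a hacker. Morpheus is a hacker too. "
    "Morpheus contacts Neo. The machines built the Matrix."
)
ranked = sorted(zip(textrank(sentences), sentences), reverse=True)
for score, sentence in ranked:
    print(f"{score:.3f}  {sentence}")
```

Sentences that share words with many other sentences collect a higher score, while a sentence with no overlap at all stays at the base value `1 - damping`.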

In [1]:
text = """
Thomas A. Anderson is a man living two lives. By day he is an
average computer programmer and by night a hacker known as
Neo. Neo has always questioned his reality, but the truth is
far beyond his imagination. Neo finds himself targeted by the
police when he is contacted by Morpheus, a legendary computer
hacker branded a terrorist by the government. Morpheus awakens
Neo to the real world, a ravaged wasteland where most of
humanity have been captured by a race of machines that live
off of the humans' body heat and electrochemical energy and
who imprison their minds within an artificial reality known as
the Matrix. As a rebel against the machines, Neo must return to
the Matrix and confront the agents: super-powerful computer
programs devoted to snuffing out Neo and the entire human
rebellion.
""".replace('\n', ' ').strip() # NB: we removed newlines
# (otherwise gensim assumes those are sentence breaks)

Gensim mindfully warns us that we're actually abusing its API. To make those warnings disappear and better concentrate on the result, click on the link the following cell generates in its output.

In [2]:
from IPython.display import HTML
HTML('''
<script>
  code_show_err=false; 
  function code_toggle_err(a) {
   if (code_show_err){
     $('div.output_stderr').hide();
   } else {
     $('div.output_stderr').show();
   }
   code_show_err = !code_show_err
  } 
  $( document ).ready(code_toggle_err);
</script>
To toggle on/off output_stderr, click
<a href="javascript:code_toggle_err()">here</a>.
''')

## Document summarization

In [3]:
from gensim.summarization import summarize

HTML(summarize(text))

Gensim lets you set the approximate fraction of the input text to keep as the summary via the `ratio` parameter, which defaults to 0.2 (20%).

In [4]:
HTML(summarize(text, ratio=0.5))
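
How a ratio turns into a number of sentences can be sketched as follows. This is an illustration only: the function `select_by_ratio` is hypothetical, and it assumes that the sentence count is truncated (not rounded) and that the chosen sentences are returned in document order.

```python
def select_by_ratio(sentences, scores, ratio=0.2):
    """Keep the top-scoring `ratio` share of sentences, in document order."""
    keep = int(len(sentences) * ratio)  # assumption: truncate, do not round
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:keep])  # restore the original sentence order
    return [sentences[i] for i in chosen]


# Hypothetical six-sentence document with made-up relevance scores.
demo_sentences = ["s1", "s2", "s3", "s4", "s5", "s6"]
demo_scores = [0.4, 0.9, 0.1, 0.7, 0.8, 0.2]
print(select_by_ratio(demo_sentences, demo_scores))
print(select_by_ratio(demo_sentences, demo_scores, ratio=0.5))
```

Under these assumptions, a six-sentence input yields one sentence at the default ratio and three sentences at `ratio=0.5`, which is why short texts produce very short summaries.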

## Keyword extraction from documents

Although this is getting a bit ahead of ourselves: on day 4 we will see that the TextRank algorithm can also be used to extract **keywords**, and gensim provides an API for that:

In [5]:
from gensim.summarization import keywords

HTML(keywords(text))

Note that this result might be improved if you first detect collocations (as we will discuss on day 4).
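
For intuition, keyword extraction with TextRank can be sketched as PageRank over a word co-occurrence graph. The version below is a simplified, pure-Python illustration and not gensim's implementation: the function name `rank_keywords`, the window size, and the tokenization are all assumptions, and it does no lemmatization or part-of-speech filtering.

```python
import re
from collections import defaultdict


def rank_keywords(text, window=2, damping=0.85, iterations=50,
                  stopwords=frozenset()):
    """Rank words by PageRank on an undirected co-occurrence graph."""
    words = [w for w in re.findall(r'[a-z]+', text.lower())
             if w not in stopwords]
    # Link every word to the words that follow it inside the window.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)
    scores = {w: 1.0 for w in neighbours}
    for _ in range(iterations):
        scores = {
            w: (1 - damping) + damping * sum(
                scores[v] / len(neighbours[v]) for v in neighbours[w])
            for w in neighbours
        }
    return sorted(scores, key=scores.get, reverse=True)


demo = "Neo meets Morpheus. Neo fights agents. Neo enters the Matrix."
print(rank_keywords(demo, stopwords=frozenset(["the"]))[:3])
```

Words that co-occur with many different neighbours end up at the top of the ranking. Joining collocations into single tokens beforehand would let multi-word terms compete as one node in the graph.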

## Keyword extraction from corpora

Finally, gensim also allows you to extract keywords from an entire document collection (corpus).

First, we generate a corpus just as we did for the first clustering (I) approach.

In [6]:
import gensim, os
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.utils import simple_preprocess

test_data_dir = os.path.join(
    gensim.__path__[0], 'test', 'test_data'
)
lee_train_file = os.path.join(
    test_data_dir, 'lee_background.cor'
)

def read_and_tokenize(file_path):
    with open(file_path) as f:
        for doc in f:
            yield simple_preprocess(doc, min_len=3)

lemmatizer = WordNetLemmatizer()
stopwords_en = (frozenset(stopwords.words('english'))
                | frozenset(["also"]))

def preprocess(doc):
    """
    Stopword filtering,
    collocation detection (joining),
    and token lemmatization.
    """
    
    doc_filtered = filter(lambda w: w not in stopwords_en,
                          doc)
    doc_colloc = collocations[doc_filtered]
    return [lemmatizer.lemmatize(token, pos='v')
            for token in doc_colloc]

raw_documents = list(read_and_tokenize(lee_train_file))
collocations = Phraser(Phrases(raw_documents))

texts = list(map(preprocess, raw_documents))
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(doc) for doc in texts]
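
As a reminder of what the bag-of-words (BoW) corpus holds, here is a pure-Python sketch of the representation: each document becomes a list of `(token_id, count)` pairs. The helpers `build_vocabulary` and `to_bow` are hypothetical stand-ins, and the ids are assigned in first-seen order here, which need not match the ids gensim's `Dictionary` assigns.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map each distinct token to an integer id (first-seen order)."""
    vocab = {}
    for doc in texts:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Turn a token list into sorted (token_id, count) pairs."""
    counts = Counter(vocab[t] for t in doc if t in vocab)
    return sorted(counts.items())


toy_texts = [["neo", "matrix", "neo"], ["matrix", "agent"]]
toy_vocab = build_vocabulary(toy_texts)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_texts]
print(toy_corpus)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

Word order is lost in this representation; only the per-document counts remain.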

Next, we let gensim select the most representative BoW documents of the corpus, here 1/30 of them:

In [7]:
%%time
from gensim.summarization import summarize_corpus

selection = summarize_corpus(corpus, 1/30)

CPU times: user 8.2 s, sys: 171 ms, total: 8.37 s
Wall time: 8.48 s


In [8]:
len(selection) # 300 * 1/30 = 10

10

In [9]:
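def collect_words(summary_selection, top_n=10):
    """Yield the top_n most frequent words of each selected document."""
    for tokens in summary_selection:
        # sort the (token_id, count) pairs by descending count
        tokens = sorted(tokens, key=lambda t: t[1], reverse=True)
        yield [dictionary[token_id]
               for (token_id, count) in tokens[:top_n]]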

HTML(" ".join(set(word 
                  for d in collect_words(selection)
                  for word in d)))

Yay! **Extractive** *document summarization* and *keyword extraction* have never been simpler!