# Text Summarization

What can we do?
- extract key influential phrases from the documents
- extract various diverse concepts or topics present in the documents
- summarize the documents to provide a gist that retains the important parts of the whole corpus


## Techniques

- keyphrase extraction - extracting keywords or phrases from a text document or corpus that capture its main concepts or themes
- topic modelling - using statistical and mathematical modelling techniques to extract the main topics, themes or concepts from a corpus of documents
- automated document summarization - using a program based on statistical and ML techniques to condense a document or corpus into a short summary that captures the essential concepts and themes of the original

In [25]:
from scipy.sparse.linalg import svds

def low_rank_svd(matrix, singular_count=2):
    u, s, vt = svds(matrix, k=singular_count)
    return u, s, vt
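
This section does not show `low_rank_svd` in use, so here is a small, hedged check of what it returns. The toy matrix is made up for illustration; the sketch assumes `svds` is given a floating-point matrix and a `singular_count` smaller than both matrix dimensions.

```python
import numpy as np
from scipy.sparse.linalg import svds

def low_rank_svd(matrix, singular_count=2):
    u, s, vt = svds(matrix, k=singular_count)
    return u, s, vt

# Toy 4x3 matrix of rank 2 (third column = first + second); values are illustrative.
toy_matrix = np.array([[1., 0., 1.],
                       [0., 1., 1.],
                       [1., 1., 2.],
                       [2., 1., 3.]])

u, s, vt = low_rank_svd(toy_matrix, singular_count=2)

# Rebuild the rank-2 approximation from the truncated factors.
approximation = u @ np.diag(s) @ vt
print(u.shape, s.shape, vt.shape)
print(np.allclose(approximation, toy_matrix, atol=1e-6))
```

Because the toy matrix has rank 2, keeping two singular values should be enough to rebuild it; on a real term-document matrix the product is only an approximation.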

In [26]:
import re
import unicodedata

import nltk

def parse_document(document):
    '''
    Remove newlines from the document, convert it to ASCII,
    and break it down into its sentence constituents.
    '''
    # Python 3 has no `unicode` type: bytes are decoded first, str is used as-is.
    if isinstance(document, bytes):
        document = document.decode('utf-8', 'ignore')
    elif not isinstance(document, str):
        raise ValueError('Document is not str or bytes!')
    document = re.sub(r'\n', ' ', document)
    # Strip accents and drop any remaining non-ASCII characters.
    document = (unicodedata.normalize('NFKD', document)
                .encode('ascii', 'ignore')
                .decode('ascii'))
    document = document.strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences

In [27]:
import html

def unescape_html(text):
    # HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it.
    return html.unescape(text)

In [28]:
from module.contractions import expand_contractions
from module.normalization import remove_special_characters, remove_stopwords

In [29]:
# CONTRACTION_MAP, lemmatize_text and tokenize_text are not defined in this
# section; they are assumed to come from the same local `module` package.
def normalize_corpus(corpus, lemmatize=True, tokenize=False):
    normalized_corpus = []
    for text in corpus:
        text = unescape_html(text)
        text = expand_contractions(text, CONTRACTION_MAP)
        if lemmatize:
            text = lemmatize_text(text)
        else:
            text = text.lower()
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

## Feature extraction

- binary term occurrence-based features
- frequency bag of words-based features
- tf-idf weighted features
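
To make the first two feature types concrete, here is a minimal standard-library sketch (not the scikit-learn implementation). The `toy_docs` corpus and the helper name `bag_of_words` are made up for illustration, and tokenization is a plain whitespace split.

```python
from collections import Counter

def bag_of_words(documents, binary=False):
    """Return (vocabulary, matrix) with one row of term features per document."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        if binary:
            # 1 if the term occurs at all, 0 otherwise.
            row = [1 if counts[term] else 0 for term in vocabulary]
        else:
            # Raw number of occurrences of each term.
            row = [counts[term] for term in vocabulary]
        matrix.append(row)
    return vocabulary, matrix

toy_docs = ['the cat sat', 'the cat saw the dog']
vocab, freq_rows = bag_of_words(toy_docs)
_, binary_rows = bag_of_words(toy_docs, binary=True)
print(vocab)
print(freq_rows)
print(binary_rows)
```

The two matrices differ only where a term repeats within a document (here, `the` in the second document). Tf-idf weighting starts from the same counts and down-weights terms that occur in many documents.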

In [30]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_feature_matrix(documents, feature_type='frequency'):
    feature_type = feature_type.lower().strip()
    
    if feature_type == 'binary':
        vectorizer = CountVectorizer(binary=True, min_df=1, ngram_range=(1, 1))
    elif feature_type == 'frequency':
        vectorizer = CountVectorizer(binary=False, min_df=1, ngram_range=(1, 1))
    elif feature_type == 'tfidf':
        vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1))
    else:
        raise ValueError('Wrong feature type entered. Possible values: "binary", "frequency", "tfidf"')

    feature_matrix = vectorizer.fit_transform(documents).astype(float)
    return vectorizer, feature_matrix

## Keyphrase Extraction

Keyphrase extraction, also known as terminology extraction, is the process of extracting the key terms or phrases from a body of unstructured text so that they capture the core topics or themes of the document(s).

Applications:
- semantic web
- query-based search engine and crawlers
- recommendation systems
- tagging systems
- document similarity
- translation

Techniques for keyphrase extraction:
- collocations
- weighted tag-based phrase extraction

## Collocation

A collocation is a sequence or group of words that occur together more often than would be expected by chance.

Techniques to extract collocations:
- n-gram grouping or segmentation approach (construct n-grams out of a corpus, count the frequency of each n-gram, and rank them by frequency to get the most frequent n-gram collocations)

In [32]:
from nltk.corpus import gutenberg
from module.normalization import normalize_corpus
import nltk
from operator import itemgetter

# Load corpus.
alice = gutenberg.sents(fileids='carroll-alice.txt')
alice = [' '.join(ts) for ts in alice]
norm_alice = list(filter(None, normalize_corpus(alice, False)))

# Print first line.
norm_alice[0]

'alice adventure wonderland lewis carroll 1865'

In [33]:
def flatten_corpus(corpus):
    return ' '.join([document.strip() 
                     for document in corpus])

In [34]:
def compute_ngrams(sequence, n):
    return zip(*[sequence[index:]
                 for index in range(n)])

In [35]:
list(compute_ngrams([1,2,3,4], 2))

[(1, 2), (2, 3), (3, 4)]

In [36]:
list(compute_ngrams([1,2,3,4], 3))

[(1, 2, 3), (2, 3, 4)]

In [37]:
def get_top_ngrams(corpus, ngram_val=1, limit=5):
    corpus = flatten_corpus(corpus)
    tokens = nltk.word_tokenize(corpus)
    ngrams = compute_ngrams(tokens, ngram_val)
    ngrams_freq_dist = nltk.FreqDist(ngrams)
    sorted_ngrams_fd = sorted(ngrams_freq_dist.items(),
                              key=itemgetter(1),
                              reverse=True)
    sorted_ngrams = sorted_ngrams_fd[0:limit]
    sorted_ngrams = [(' '.join(text), freq)
                     for text, freq in sorted_ngrams]
    return sorted_ngrams

In [38]:
# Get 10 bigrams.
get_top_ngrams(corpus=norm_alice, ngram_val=2, limit=10)

[('say alice', 123),
 ('mock turtle', 56),
 ('march hare', 31),
 ('say king', 29),
 ('thought alice', 26),
 ('white rabbit', 22),
 ('say hatter', 22),
 ('say mock', 21),
 ('alice say', 19),
 ('say gryphon', 19)]

In [39]:
# Get 10 trigrams.
get_top_ngrams(corpus=norm_alice, ngram_val=3, limit=10)

[('say mock turtle', 21),
 ('say march hare', 10),
 ('poor little thing', 6),
 ('say alice say', 6),
 ('little golden key', 5),
 ('certainly say alice', 5),
 ('white kid glove', 5),
 ('march hare say', 5),
 ('mock turtle say', 5),
 ('know say alice', 4)]

In [40]:
# Bigrams
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import BigramAssocMeasures

finder = BigramCollocationFinder.from_documents([item.split()
                                                 for item
                                                 in norm_alice])

bigram_measures = BigramAssocMeasures()

In [41]:
# Raw frequencies.
finder.nbest(bigram_measures.raw_freq, 10)

[('say', 'alice'),
 ('mock', 'turtle'),
 ('march', 'hare'),
 ('say', 'king'),
 ('thought', 'alice'),
 ('say', 'hatter'),
 ('white', 'rabbit'),
 ('say', 'mock'),
 ('say', 'caterpillar'),
 ('say', 'gryphon')]

In [42]:
# Pointwise mutual information.
finder.nbest(bigram_measures.pmi, 10)

[('acceptance', 'elegant'),
 ('accustom', 'usurpation'),
 ('actually', 'took'),
 ('adjourn', 'immediate'),
 ('adoption', 'energetic'),
 ('affair', 'trust'),
 ('agony', 'terror'),
 ('ambition', 'distraction'),
 ('ancient', 'modern'),
 ('arithmetic', 'ambition')]
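
The PMI ranking looks very different from the raw-frequency one because PMI rewards pairs whose words rarely appear apart, which tends to favour rare pairs. A minimal sketch of the usual definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), on a made-up token list follows. The helper `bigram_pmi` is hypothetical and the probability estimates are a simplification, not NLTK's exact code.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Score each adjacent word pair by pointwise mutual information."""
    total = len(tokens)
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (first, second), pair_count in pair_counts.items():
        # log2( P(x, y) / (P(x) * P(y)) ), with each probability estimated as count / total.
        scores[(first, second)] = math.log2(
            pair_count * total / (word_counts[first] * word_counts[second]))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

toy_tokens = 'a b a b c d'.split()
ranked = bigram_pmi(toy_tokens)
for pair, score in ranked:
    print(pair, round(score, 3))
```

In this toy input the pair `('c', 'd')` occurs once yet outranks `('a', 'b')`, which occurs twice, because `c` and `d` never appear in any other context. NLTK's collocation finders provide `apply_freq_filter(min_freq)` to drop rare n-grams before ranking, which may give more useful PMI results on a real corpus.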

In [43]:
# Trigrams.
from nltk.collocations import TrigramCollocationFinder
from nltk.collocations import TrigramAssocMeasures

finder = TrigramCollocationFinder.from_documents([item.split()
                                                  for item
                                                  in norm_alice])

trigram_measures = TrigramAssocMeasures()

In [44]:
# Raw frequencies.
finder.nbest(trigram_measures.raw_freq, 10)

[('say', 'mock', 'turtle'),
 ('say', 'march', 'hare'),
 ('poor', 'little', 'thing'),
 ('little', 'golden', 'key'),
 ('march', 'hare', 'say'),
 ('mock', 'turtle', 'say'),
 ('white', 'kid', 'glove'),
 ('beau', 'ootiful', 'soo'),
 ('certainly', 'say', 'alice'),
 ('might', 'well', 'say')]

In [45]:
# Pointwise mutual information.
finder.nbest(trigram_measures.pmi, 10)

[('accustom', 'usurpation', 'conquest'),
 ('adjourn', 'immediate', 'adoption'),
 ('adoption', 'energetic', 'remedy'),
 ('ancient', 'modern', 'seaography'),
 ('arithmetic', 'ambition', 'distraction'),
 ('brother', 'latin', 'grammar'),
 ('crocodile', 'improve', 'shin'),
 ('crust', 'gravy', 'meat'),
 ('curve', 'graceful', 'zigzag'),
 ('elsie', 'lacie', 'tillie')]

## Weighted Tag-Based Phrase Extraction

1. extract all noun phrase chunks using shallow parsing
2. compute tf-idf weights for each chunk and return the top weighted phrases
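
Step 1 can be sketched on a single hand-tagged sentence, which avoids needing a POS-tagger model. The tags below are assigned by hand for illustration rather than produced by `nltk.pos_tag`, and the grouping is a simplified version of the chunk extraction used in this section.

```python
import itertools
import nltk

# Optional determiner, any number of adjectives, then one or more nouns.
grammar = r'NP: {<DT>? <JJ>* <NN.*>+}'
chunker = nltk.RegexpParser(grammar)

# Hand-tagged tokens (Penn Treebank tags), purely illustrative.
tagged = [('the', 'DT'), ('large', 'JJ'), ('ear', 'NN'), ('flaps', 'NNS'),
          ('assist', 'VBP'), ('in', 'IN'), ('communication', 'NN')]

tree = chunker.parse(tagged)
# (word, pos_tag, chunk_tag) triples; chunk_tag is 'O' outside any chunk.
triples = nltk.chunk.tree2conlltags(tree)

# Group consecutive in-chunk tokens into phrases.
phrases = [' '.join(word for word, pos, chunk in group)
           for in_chunk, group
           in itertools.groupby(triples, lambda triple: triple[2] != 'O')
           if in_chunk]
print(phrases)
```

Grouping only on whether the chunk tag is `'O'` ignores the `B-`/`I-` distinction, so two noun phrases that sit directly next to each other would be merged into one phrase.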

In [46]:
toy_text = """Elephants are mammals of the family Elephantidae and the largest existing land animals. Three species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant. Elephantidae is the only surviving family of the order Proboscidea; extinct members include the mastodons. The family Elephantidae also contains several now-extinct groups, including the mammoths and straight-tusked elephants. African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs. Distinctive features of all elephants include a long trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin. The trunk, also called a proboscis, is used for breathing, bringing food and water to the mouth, and grasping objects. Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging. The large ear flaps assist in maintaining a constant body temperature as well as in communication. The pillar-like legs carry their great weight."""

In [50]:
from module.normalization import stopword_list
import itertools
import nltk
from gensim import corpora, models
import re

In [76]:
def get_chunks(sentences, grammar = r'NP: {<DT>? <JJ>* <NN.*>+}'):
    # Build chunker based on grammar pattern.
    all_chunks = []
    chunker = nltk.chunk.regexp.RegexpParser(grammar)
    
    for sentence in sentences:
        # POS tag sentences.
        tagged_sents = nltk.pos_tag_sents([nltk.word_tokenize(sentence)])
        
        # Extract chunks.
        chunks = [chunker.parse(tagged_sent)
                  for tagged_sent in tagged_sents]
        
        
        # Get word, pos_tag, chunk tag triples.
        wtc_sents = [nltk.chunk.tree2conlltags(chunk)
                     for chunk in chunks]
        
        flattened_chunks = list(itertools.chain.from_iterable(wtc_sents))
        
        # Get valid chunks based on tags.
        valid_chunks_tagged = [(status, list(chunk))
                               for status, chunk
                               in itertools.groupby(flattened_chunks,
                                                    lambda word_pos_chunk: word_pos_chunk[2] != 'O')]
        
        # Append words in each chunk to make phrases.
        valid_chunks = [' '.join(word.lower() 
                                 for word, tag, chunk
                                 in wtc_group
                                     if word.lower() 
                                         not in stopword_list)
                                for status, wtc_group
                                in valid_chunks_tagged 
                                    if status]
        
        # Append all valid chunked phrases.
        all_chunks.append(valid_chunks)
    return all_chunks

In [77]:
sentences = parse_document(toy_text)
sentences

['Elephants are mammals of the family Elephantidae and the largest existing land animals.',
 'Three species are currently recognised: the African bush elephant, the African forest elephant, and the Asian elephant.',
 'Elephantidae is the only surviving family of the order Proboscidea; extinct members include the mastodons.',
 'The family Elephantidae also contains several now-extinct groups, including the mammoths and straight-tusked elephants.',
 'African elephants have larger ears and concave backs, whereas Asian elephants have smaller ears, and convex or level backs.',
 'Distinctive features of all elephants include a long trunk, tusks, large ear flaps, massive legs, and tough but sensitive skin.',
 'The trunk, also called a proboscis, is used for breathing, bringing food and water to the mouth, and grasping objects.',
 'Tusks, which are derived from the incisor teeth, serve both as weapons and as tools for moving objects and digging.',
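 'The large ear flaps assist in maintaining a constant body temperature as well as in communication.',
 'The pillar-like legs carry their great weight.']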

In [78]:
# Print all valid keyphrases per sentence of our document. Since we targeted nouns, all phrases refer to noun-based entities.
valid_chunks = get_chunks(sentences)
valid_chunks

[['elephants', 'mammals', 'family elephantidae', 'land animals'],
 ['species',
  'african bush elephant',
  'african forest elephant',
  'asian elephant'],
 ['elephantidae',
  'family',
  'order proboscidea',
  'extinct members',
  'mastodons'],
 ['family elephantidae',
  'several now-extinct groups',
  'mammoths',
  'straight-tusked elephants'],
 ['african elephants',
  'ears',
  'backs',
  'whereas asian elephants',
  'ears',
  'convex',
  'level backs'],
 ['distinctive features',
  'elephants',
  'long trunk',
  'tusks',
  'large ear flaps',
  'massive legs',
  'sensitive skin'],
 ['trunk', 'proboscis', 'breathing', 'food', 'water', 'mouth', 'objects'],
 ['tusks', 'incisor teeth', 'weapons', 'tools', 'objects', 'digging'],
 ['large ear flaps', 'constant body temperature', 'communication'],
 ['pillar-like legs', 'great weight']]

In [80]:
def get_tfidf_weighted_keyphrases(sentences, 
                                  grammar=r'NP: {<DT>? <JJ>* <NN.*>+}',
                                  top_n=10):
    # Get valid chunks.
    valid_chunks = get_chunks(sentences, grammar)
    
    # Build tf-idf based model.
    dictionary = corpora.Dictionary(valid_chunks)
    
    corpus = [dictionary.doc2bow(chunk) for chunk in valid_chunks]
    
    tfidf = models.TfidfModel(corpus)
    
    corpus_tfidf = tfidf[corpus]
    
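    # Get phrases and their tf-idf weights. A phrase that occurs in several
    # sentences keeps the weight from the last sentence it appears in.
    weighted_phrases = {dictionary.get(term_id): round(value, 3)
                        for doc in corpus_tfidf
                        for term_id, value in doc}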
    weighted_phrases = sorted(weighted_phrases.items(), 
                              key=itemgetter(1), 
                              reverse=True)

    # Return top weighted phrases.
    return weighted_phrases[:top_n]

In [81]:
get_tfidf_weighted_keyphrases(sentences, top_n=10)

[('great weight', 0.707),
 ('pillar-like legs', 0.707),
 ('ears', 0.667),
 ('communication', 0.634),
 ('constant body temperature', 0.634),
 ('land animals', 0.58),
 ('mammals', 0.58),
 ('mammoths', 0.535),
 ('several now-extinct groups', 0.535),
 ('straight-tusked elephants', 0.535)]
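
The scores above are consistent with weighting each phrase by tf * log2(N / df) and then scaling each sentence's vector to unit length, which appears to be gensim's default `TfidfModel` behaviour (treat that as an assumption). A standard-library sketch with a hypothetical helper `tfidf_weights` and a two-sentence toy input:

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Weight terms by tf * log2(N / df), then L2-normalise each document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        term_freq = Counter(doc)
        raw = {term: count * math.log2(n_docs / doc_freq[term])
               for term, count in term_freq.items()}
        norm = math.sqrt(sum(weight * weight for weight in raw.values()))
        weighted.append({term: (weight / norm if norm else 0.0)
                         for term, weight in raw.items()})
    return weighted

toy_chunks = [
    ['great weight', 'pillar-like legs'],
    ['african elephants', 'ears', 'backs', 'whereas asian elephants',
     'ears', 'convex', 'level backs'],
]
weights = tfidf_weights(toy_chunks)
for doc_weights in weights:
    print({term: round(weight, 3) for term, weight in doc_weights.items()})
```

Under these assumptions, two equally rare phrases in one sentence each get 1/sqrt(2), about 0.707, and a phrase that appears twice among five other unique phrases gets 2/3, about 0.667. High scores in this setup therefore reflect short sentences and within-sentence repetition as much as importance.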