# Text Summarization

What can we do?
- extract key influential phrases from the documents
- extract the diverse concepts or topics present in the documents
- summarize the documents to provide a gist that retains the important parts of the whole corpus


## Techniques

- keyphrase extraction - extracting keywords or phrases from a text document or corpus that capture its main concepts or themes
- topic modelling - using statistical and mathematical modelling techniques to extract the main topics, themes or concepts from a corpus of documents
- automated document summarization - using a program based on statistical and ML techniques to condense a document or corpus into a short summary that captures the essential concepts and themes of the original

In [1]:
from scipy.sparse.linalg import svds

def low_rank_svd(matrix, singular_count=2):
    u, s, vt = svds(matrix, k=singular_count)
    return u, s, vt

In [2]:
import re
import unicodedata

import nltk

def parse_document(document):
    '''
    Replace newlines with spaces, fold the text to ASCII,
    and split it into its constituent sentences.
    '''
    if isinstance(document, bytes):
        # Python 3: decode raw bytes first (UTF-8 is an assumption here).
        document = document.decode('utf-8', 'ignore')
    elif not isinstance(document, str):
        raise ValueError('Document is not str or bytes!')
    document = re.sub(r'\n', ' ', document)
    # Decompose accented characters, then drop anything that is not ASCII.
    document = unicodedata.normalize('NFKD', document).encode('ascii', 'ignore').decode('ascii')
    document = document.strip()
    sentences = nltk.sent_tokenize(document)
    sentences = [sentence.strip() for sentence in sentences]
    return sentences

In [3]:
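import html
from html.parser import HTMLParser  # Python 3 name of the Python 2 `HTMLParser` module

html_parser = HTMLParser()
def unescape_html(parser, text):
    # HTMLParser.unescape was removed in Python 3.9; html.unescape does the work.
    # `parser` is kept only so the original call signature still works.
    return html.unescape(text)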

In [8]:
from module.contractions import expand_contractions
from module.normalization import remove_special_characters, remove_stopwords

In [11]:
import html

def normalize_corpus(corpus, lemmatize=True, tokenize=False):
    # CONTRACTION_MAP, lemmatize_text and tokenize_text are not defined in this
    # notebook; they are assumed to come from the local `module` package.
    normalized_corpus = []
    for text in corpus:
        # html.unescape replaces HTMLParser.unescape, which was removed in Python 3.9.
        text = html.unescape(text)
        text = expand_contractions(text, CONTRACTION_MAP)
        if lemmatize:
            text = lemmatize_text(text)
        else:
            text = text.lower()
        text = remove_special_characters(text)
        text = remove_stopwords(text)
        if tokenize:
            text = tokenize_text(text)
        normalized_corpus.append(text)
    return normalized_corpus

## Feature extraction

- binary term occurrence-based features
- frequency bag of words-based features
- tf-idf weighted features
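
To make the three weighting schemes concrete, here is a minimal, dependency-free sketch on a hypothetical two-document toy corpus. The helper names and the plain `tf * log(N / df)` formula are assumptions for illustration only; scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already normalized so that split() tokenizes it.
docs = ["cat sat mat cat", "dog sat log"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})

def frequency_vector(tokens):
    # Bag of words: how many times each vocabulary term occurs.
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def binary_vector(tokens):
    # Binary occurrence: 1 if the term is present at all, else 0.
    present = set(tokens)
    return [1 if term in present else 0 for term in vocab]

def tfidf_vector(tokens):
    # Unsmoothed tf-idf: term count times log(N / document frequency).
    counts = Counter(tokens)
    n_docs = len(tokenized)
    weights = []
    for term in vocab:
        df = sum(1 for doc in tokenized if term in doc)
        weights.append(counts[term] * math.log(n_docs / df))
    return weights

frequency = [frequency_vector(d) for d in tokenized]
binary = [binary_vector(d) for d in tokenized]
tfidf = [tfidf_vector(d) for d in tokenized]

for name, matrix in [("frequency", frequency), ("binary", binary), ("tfidf", tfidf)]:
    print(name, matrix)
```

Note that `sat` occurs in both documents, so its tf-idf weight is zero in this unsmoothed variant, while `cat` keeps a high weight because it is frequent in one document and absent from the other.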

In [13]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def build_feature_matrix(documents, feature_type='frequency'):
    feature_type = feature_type.lower().strip()
    
    if feature_type == 'binary':
        vectorizer = CountVectorizer(binary=True, min_df=1, ngram_range=(1, 1))
    elif feature_type == 'frequency':
        vectorizer = CountVectorizer(binary=False, min_df=1, ngram_range=(1, 1))
    elif feature_type == 'tfidf':
        vectorizer = TfidfVectorizer(min_df=1, ngram_range=(1, 1))
    else:
        raise ValueError('Wrong feature type entered. Possible values: "binary", "frequency", "tfidf"')

    feature_matrix = vectorizer.fit_transform(documents).astype(float)
    return vectorizer, feature_matrix

## Keyphrase Extraction

Keyphrase extraction, also known as terminology extraction, is the process of extracting the important and relevant terms or phrases from a body of unstructured text so that these key phrases capture the core topics or themes of the document(s).

Applications include:
- semantic web
- query-based search engine and crawlers
- recommendation systems
- tagging systems
- document similarity
- translation

Techniques for keyphrase extraction:
- collocations
- weighted tag-based phrase extraction

## Collocation

A collocation is a sequence or group of words that occur together more often than could be attributed to random chance.

Techniques to extract collocations:
- n-gram grouping or segmentation approach (construct n-grams out of a corpus, count the frequency of each n-gram, and rank them by frequency of occurrence to get the most frequent n-gram collocations)
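
The construct, count, rank steps can be sketched with the standard library alone. The function name `top_ngrams` and the toy sentence are hypothetical and only illustrate the idea on a tiny input.

```python
from collections import Counter

def top_ngrams(tokens, n, limit):
    # Zipping n shifted views of the token list yields consecutive n-grams.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    # Counter tallies each n-gram; most_common ranks them by frequency.
    return Counter(ngrams).most_common(limit)

# Hypothetical toy input.
tokens = "the mock turtle said the mock turtle sighed".split()
top_bigrams = top_ngrams(tokens, n=2, limit=2)
print(top_bigrams)
```

Here `('the', 'mock')` and `('mock', 'turtle')` each occur twice and every other bigram occurs once, so those two pairs are returned.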

In [42]:
from nltk.corpus import gutenberg
from module.normalization import normalize_corpus
import nltk
from operator import itemgetter

# Load corpus.
alice = gutenberg.sents(fileids='carroll-alice.txt')
alice = [' '.join(ts) for ts in alice]
norm_alice = list(filter(None, normalize_corpus(alice, False)))

# Print first line.
norm_alice[0]

'alice adventure wonderland lewis carroll 1865'

In [32]:
def flatten_corpus(corpus):
    return ' '.join([document.strip() 
                     for document in corpus])

In [33]:
def compute_ngrams(sequence, n):
    return zip(*[sequence[index:]
                 for index in range(n)])

In [34]:
list(compute_ngrams([1,2,3,4], 2))

[(1, 2), (2, 3), (3, 4)]

In [35]:
list(compute_ngrams([1,2,3,4], 3))

[(1, 2, 3), (2, 3, 4)]

In [43]:
def get_top_ngrams(corpus, ngram_val=1, limit=5):
    corpus = flatten_corpus(corpus)
    tokens = nltk.word_tokenize(corpus)
    ngrams = compute_ngrams(tokens, ngram_val)
    ngrams_freq_dist = nltk.FreqDist(ngrams)
    sorted_ngrams_fd = sorted(ngrams_freq_dist.items(),
                              key=itemgetter(1),
                              reverse=True)
    sorted_ngrams = sorted_ngrams_fd[0:limit]
    sorted_ngrams = [(' '.join(text), freq)
                     for text, freq in sorted_ngrams]
    return sorted_ngrams

In [44]:
# Get the 10 most frequent bigrams.
get_top_ngrams(corpus=norm_alice, ngram_val=2, limit=10)

[('say alice', 123),
 ('mock turtle', 56),
 ('march hare', 31),
 ('say king', 29),
 ('thought alice', 26),
 ('white rabbit', 22),
 ('say hatter', 22),
 ('say mock', 21),
 ('alice say', 19),
 ('say gryphon', 19)]

In [45]:
# Get the 10 most frequent trigrams.
get_top_ngrams(corpus=norm_alice, ngram_val=3, limit=10)

[('say mock turtle', 21),
 ('say march hare', 10),
 ('poor little thing', 6),
 ('say alice say', 6),
 ('little golden key', 5),
 ('certainly say alice', 5),
 ('white kid glove', 5),
 ('march hare say', 5),
 ('mock turtle say', 5),
 ('know say alice', 4)]

In [48]:
# Bigrams
from nltk.collocations import BigramCollocationFinder
from nltk.collocations import BigramAssocMeasures

finder = BigramCollocationFinder.from_documents([item.split()
                                                 for item
                                                 in norm_alice])

bigram_measures = BigramAssocMeasures()

In [49]:
# Raw frequencies.
finder.nbest(bigram_measures.raw_freq, 10)

[('say', 'alice'),
 ('mock', 'turtle'),
 ('march', 'hare'),
 ('say', 'king'),
 ('thought', 'alice'),
 ('say', 'hatter'),
 ('white', 'rabbit'),
 ('say', 'mock'),
 ('say', 'caterpillar'),
 ('say', 'gryphon')]

In [50]:
# Pointwise mutual information.
finder.nbest(bigram_measures.pmi, 10)

[('acceptance', 'elegant'),
 ('accustom', 'usurpation'),
 ('actually', 'took'),
 ('adjourn', 'immediate'),
 ('adoption', 'energetic'),
 ('affair', 'trust'),
 ('agony', 'terror'),
 ('ambition', 'distraction'),
 ('ancient', 'modern'),
 ('arithmetic', 'ambition')]
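
The PMI ranking looks very different from the raw-frequency one because PMI rewards pairs whose words rarely appear apart, and words that occur only once satisfy that trivially. The sketch below computes the textbook definition, PMI(x, y) = log2(P(x, y) / (P(x) P(y))), on a hypothetical token list. The function name `bigram_pmi` is made up for illustration, and this is not a copy of NLTK's implementation.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scores = {}
    for (w1, w2), joint in bigram_counts.items():
        # With probabilities estimated as count / total, the ratio
        # P(w1, w2) / (P(w1) * P(w2)) simplifies to the expression below.
        ratio = joint * total / (unigram_counts[w1] * unigram_counts[w2])
        scores[(w1, w2)] = math.log2(ratio)
    return scores

# Hypothetical toy input: one frequent pair and one pair seen only once.
tokens = "say alice say alice say king ancient modern".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 3))
```

The pair `('ancient', 'modern')` is seen once and outscores `('say', 'alice')`, which is seen twice. When rare pairs are not wanted, NLTK's collocation finders provide `apply_freq_filter(min_freq)` to drop low-frequency n-grams before ranking.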

In [53]:
# Trigrams.
from nltk.collocations import TrigramCollocationFinder
from nltk.collocations import TrigramAssocMeasures

finder = TrigramCollocationFinder.from_documents([item.split()
                                                  for item
                                                  in norm_alice])

trigram_measures = TrigramAssocMeasures()

In [54]:
# Raw frequencies.
finder.nbest(trigram_measures.raw_freq, 10)

[('say', 'mock', 'turtle'),
 ('say', 'march', 'hare'),
 ('poor', 'little', 'thing'),
 ('little', 'golden', 'key'),
 ('march', 'hare', 'say'),
 ('mock', 'turtle', 'say'),
 ('white', 'kid', 'glove'),
 ('beau', 'ootiful', 'soo'),
 ('certainly', 'say', 'alice'),
 ('might', 'well', 'say')]

In [55]:
# Pointwise mutual information.
finder.nbest(trigram_measures.pmi, 10)

[('accustom', 'usurpation', 'conquest'),
 ('adjourn', 'immediate', 'adoption'),
 ('adoption', 'energetic', 'remedy'),
 ('ancient', 'modern', 'seaography'),
 ('arithmetic', 'ambition', 'distraction'),
 ('brother', 'latin', 'grammar'),
 ('crocodile', 'improve', 'shin'),
 ('crust', 'gravy', 'meat'),
 ('curve', 'graceful', 'zigzag'),
 ('elsie', 'lacie', 'tillie')]