In [1]:
%matplotlib inline

Text Summarization
==================
Demonstrates summarizing text by extracting the most important sentences from it.


In [2]:
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

* This module summarizes text by extracting one or more important sentences from it. It can also extract keywords. It is based on [TextRank by Mihalcea et al](http://web.eecs.umich.edu/%7Emihalcea/papers/mihalcea.emnlp04.pdf).
* Improvements: [Barrios et al](https://raw.githubusercontent.com/summanlp/docs/master/articulo/articulo-en.pdf) introduce a "BM25 ranking function".
* Note: Gensim summarization only works for English for now, because the text
    is pre-processed so that stopwords are removed and the words are stemmed,
    and these processes are language-dependent.
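* The graph-ranking idea behind TextRank can be sketched in a few lines. This is an illustrative toy, not Gensim's implementation: the helper names ("tokenize", "similarity", "textrank") are hypothetical, the similarity is a set-based variant of the word-overlap measure from the TextRank paper rather than BM25, and no stopword removal or stemming is done.

```python
import math
import re


def tokenize(sentence):
    # Lowercased word tokens; a real pipeline would also drop stopwords and stem.
    return set(re.findall(r"\w+", sentence.lower()))


def similarity(a, b):
    # Shared words, normalized by the log of each sentence's distinct-word count.
    words_a, words_b = tokenize(a), tokenize(b)
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    # Weighted, undirected sentence graph stored as a dense matrix.
    weights = [[0.0 if i == j else similarity(sentences[i], sentences[j])
                for j in range(n)] for i in range(n)]
    totals = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        # Each sentence collects score from its neighbours, in proportion to edge weight.
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / totals[j] * scores[j]
                for j in range(n) if totals[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "The cat sat on the mat.",
    "The cat ate the fish.",
    "Stocks fell sharply today.",
]
scores = textrank(sentences)
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best])
```

* In this toy input the third sentence shares no words with the others, so it has no edges and keeps only the baseline score of 1 - damping.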

In [3]:
from pprint import pprint as print
from gensim.summarization import summarize



In [4]:
text = (
    "Thomas A. Anderson is a man living two lives. By day he is an "
    "average computer programmer and by night a hacker known as "
    "Neo. Neo has always questioned his reality, but the truth is "
    "far beyond his imagination. Neo finds himself targeted by the "
    "police when he is contacted by Morpheus, a legendary computer "
    "hacker branded a terrorist by the government. Morpheus awakens "
    "Neo to the real world, a ravaged wasteland where most of "
    "humanity have been captured by a race of machines that live "
    "off of the humans' body heat and electrochemical energy and "
    "who imprison their minds within an artificial reality known as "
    "the Matrix. As a rebel against the machines, Neo must return to "
    "the Matrix and confront the agents: super-powerful computer "
    "programs devoted to snuffing out Neo and the entire human "
    "rebellion. "
)
print(text)

('Thomas A. Anderson is a man living two lives. By day he is an average '
 'computer programmer and by night a hacker known as Neo. Neo has always '
 'questioned his reality, but the truth is far beyond his imagination. Neo '
 'finds himself targeted by the police when he is contacted by Morpheus, a '
 'legendary computer hacker branded a terrorist by the government. Morpheus '
 'awakens Neo to the real world, a ravaged wasteland where most of humanity '
 "have been captured by a race of machines that live off of the humans' body "
 'heat and electrochemical energy and who imprison their minds within an '
 'artificial reality known as the Matrix. As a rebel against the machines, Neo '
 'must return to the Matrix and confront the agents: super-powerful computer '
 'programs devoted to snuffing out Neo and the entire human rebellion. ')


* Pass the **raw string data** to "summarize".
* Note: make sure the string has no newlines in the middle of a sentence. A sentence containing a newline (i.e. a line feed, "\n") will be treated as two sentences.
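* One way to prepare hard-wrapped text is to join single line breaks with spaces while keeping blank lines as paragraph breaks. A minimal sketch, assuming a hypothetical helper named "unwrap_lines" (it is not part of Gensim):

```python
import re


def unwrap_lines(text):
    # Normalize Windows line endings so only "\n" remains.
    text = text.replace("\r\n", "\n")
    # A blank line marks a paragraph break; keep those.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse every run of whitespace (including "\n") to one space.
    cleaned = [" ".join(p.split()) for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)


wrapped = (
    "Neo has always questioned\n"
    "his reality, but the truth\n"
    "is far beyond his imagination.\n"
    "\n"
    "Morpheus awakens Neo to the real world."
)
print(unwrap_lines(wrapped))
```

* The cleaned string can then be passed to "summarize" in place of the raw text.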

In [5]:
print(summarize(text))

('Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.')


* Use the "split" option if you want a list of strings.




In [6]:
print(summarize(text, split=True))

['Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.']


* Adjust how much text the summarizer outputs with "ratio" or "word_count". "ratio" specifies the fraction of sentences in the original text that should be returned as output. The default is 0.2 (20%).
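* To make the effect of "ratio" concrete, the sketch below keeps the top-scoring fraction of sentences and returns them in document order. The function name "select_by_ratio", the example scores, and the truncating rounding rule are assumptions for illustration; Gensim's exact selection logic may differ.

```python
def select_by_ratio(sentences, scores, ratio=0.2):
    # Number of sentences to keep (assumed rule: truncate towards zero).
    keep = int(len(sentences) * ratio)
    # Rank sentence indices from highest to lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Keep the best ones, then restore their original order in the text.
    chosen = sorted(ranked[:keep])
    return [sentences[i] for i in chosen]


sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7]  # made-up importance scores
print(select_by_ratio(sentences, scores, ratio=0.2))
print(select_by_ratio(sentences, scores, ratio=0.5))
```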

In [7]:
print(summarize(text, ratio=0.5))

('By day he is an average computer programmer and by night a hacker known as '
 'Neo. Neo has always questioned his reality, but the truth is far beyond his '
 'imagination.\n'
 'Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.\n'
 'As a rebel against the machines, Neo must return to the Matrix and confront '
 'the agents: super-powerful computer programs devoted to snuffing out Neo and '
 'the entire human rebellion.')


* Use "word_count" to specify the maximum number of words in the summary.

In [9]:
print(summarize(text, word_count=50))

('Morpheus awakens Neo to the real world, a ravaged wasteland where most of '
 'humanity have been captured by a race of machines that live off of the '
 "humans' body heat and electrochemical energy and who imprison their minds "
 'within an artificial reality known as the Matrix.')


* This module also supports **keyword** extraction. It works in the same way as summary generation (i.e. sentence extraction), in that the algorithm tries to find words that are important or seem representative of the entire text. 
* The keywords are not always single words; in the case of multi-word keywords, they are typically all nouns.




In [10]:
from gensim.summarization import keywords
print(keywords(text))

'neo\nhumanity\nhuman\nhumans body\nsuper\nhacker\nreality'
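
* The result above is a single newline-separated string. A small sketch of turning it into a list and picking out the multi-word keywords, using the output shown above as literal input (the variable names are only for illustration):

```python
# Keyword output copied from the cell above.
raw = 'neo\nhumanity\nhuman\nhumans body\nsuper\nhacker\nreality'

# One keyword per line.
keyword_list = raw.split("\n")

# Multi-word keywords contain at least one space.
multi_word = [k for k in keyword_list if " " in k]

print(keyword_list)
print(multi_word)
```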


Larger example
--------------
* Use the synopsis from the [IMDb page of The Matrix](http://www.imdb.com/title/tt0133093/synopsis?ref_=ttpl_pl_syn).
* Read the text file directly from a web page using "requests", then produce a summary and some keywords.


In [12]:
import requests

text = requests.get('http://rare-technologies.com/the_matrix_synopsis.txt').text
print(text[0:250])

('The screen is filled with green, cascading code which gives way to the '
 'title, The Matrix.\r\n'
 '\r\n'
 'A phone rings and text appears on the screen: "Call trans opt: received. '
 '2-19-98 13:24:18 REC: Log>" As a conversation takes place between Trinity '
 '(Carrie-An')


* First, the summary:




In [13]:
print(summarize(text, ratio=0.01))

('Anderson, a software engineer for a Metacortex, the other life as Neo, a '
 'computer hacker "guilty of virtually every computer crime we have a law '
 'for." Agent Smith asks him to help them capture Morpheus, a dangerous '
 'terrorist, in exchange for amnesty.\n'
 "Morpheus explains that he's been searching for Neo his entire life and asks "
 'if Neo feels like "Alice in Wonderland, falling down the rabbit hole." He '
 'explains to Neo that they exist in the Matrix, a false reality that has been '
 'constructed for humans to hide the truth.\n'
 "Neo is introduced to Morpheus's crew including Trinity; Apoc (Julian "
 'Arahanga), a man with long, flowing black hair; Switch; Cypher (bald with a '
 'goatee); two brawny brothers, Tank (Marcus Chong) and Dozer (Anthony Ray '
 'Parker); and a young, thin man named Mouse (Matt Doran).\n'
 'Trinity brings the helicopter down to the floor that Morpheus is on and Neo '
 'opens fire on the three Agents.')


* Now the keywords:




In [14]:
print(keywords(text, ratio=0.01))

'neo\nmorpheus\ntrinity\ncypher\nsmith\nagents\nagent\ntank\nsays\nsaying'


* Notice that some of the most important characters (Neo, Morpheus, Trinity) were extracted as keywords.

* Another example, using the [IMDb synopsis of The Big Lebowski](http://www.imdb.com/title/tt0118715/synopsis?ref_=tt_stry_pl).




In [16]:
text = requests.get('http://rare-technologies.com/the_big_lebowski_synopsis.txt').text
print(text[0:250])
print(summarize(text, ratio=0.01))
print(keywords(text, ratio=0.01))

('A tumbleweed rolls up a hillside just outside of Los Angeles as a mysterious '
 'man known as The Stranger (Sam Elliott) narrates about a fella he wants to '
 'tell us about named Jeffrey Lebowski. With not much use for his given name, '
 'however, Jeffrey goes ')
('Dude agrees to meet with the Big Lebowski, hoping to get compensation for '
 'his rug since it "really tied the room together" and figures that his wife, '
 "Bunny, shouldn't be owing money around town.\n"
 'Walter resolves to go to Plan B; he tells Larry to watch out the window as '
 'he and Dude go back out to the car where Donny is waiting.')
'dude\ndudes\nwalter\nlebowski\nbrandt\nmaude\ndonny\nbunny'


* This time, the summary is not of high quality. (This might not be the algorithm's fault.) The keywords, however, do include some of the main characters.

Performance
-----------

* Test how the summarizer's speed scales with dataset size. Note: the summarizer does **not** support multithreading at this time.
* Test dataset: "Honest Abe" by Alonzo Rothschild. Download the plain text at http://www.gutenberg.org/ebooks/49679.
* Running times are measured by dataset size, using the first **n** characters of the book to create datasets of different sizes.
* The algorithm seems to be **quadratic in time**, so be careful before plugging a large dataset into the summarizer.
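* A timing experiment of this kind can be sketched as follows. The helper "time_by_size" is hypothetical, and "pairwise_work" is only a stand-in workload that compares every sentence with every other one; to measure the real summarizer, pass "summarize" as the function instead.

```python
import time


def pairwise_work(text):
    # Stand-in workload: count words shared by every pair of sentences.
    sentences = text.split(".")
    shared = 0
    for i, a in enumerate(sentences):
        words_a = set(a.split())
        for b in sentences[i + 1:]:
            shared += len(words_a & set(b.split()))
    return shared


def time_by_size(func, text, sizes):
    # Run func on the first n characters of text for each n and record wall-clock time.
    timings = []
    for n in sizes:
        start = time.perf_counter()
        func(text[:n])
        timings.append((n, time.perf_counter() - start))
    return timings


sample = "Neo must return to the Matrix. " * 200
for n, seconds in time_by_size(pairwise_work, sample, [1000, 2000, 4000]):
    print(n, round(seconds, 4))
```

* If the running time roughly quadruples each time n doubles, that is consistent with quadratic growth.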

Text-content dependent running times
------------------------------------

* Running time is not only dependent on the size of the dataset.
* In the original examples, summarizing "The Matrix" synopsis (~36K characters) takes about 3.1 seconds, while summarizing the first 35K characters of this book takes about 8.5 seconds. So the former is **more than twice as fast**.
* One reason is the data structure.
    + The algorithm represents the data using a graph, where vertices (nodes) are sentences, and then constructs weighted edges between the vertices that represent how the sentences relate to each other. 
    + This means that every piece of text will have a different graph, thus making the running times different. The size of this data structure is **quadratic in the worst case** (the worst case is when each vertex has an edge to every other vertex).
* Another possible reason for the difference in running times is that the problems converge at different rates, meaning that the error drops more slowly for some datasets than for others.
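* To illustrate how the graph depends on the text, the sketch below connects two sentences whenever they share a word and counts the resulting edges. The function "build_sentence_graph" and its shared-word edge weight are assumptions for illustration, not the weighting Gensim uses.

```python
import re
from itertools import combinations


def build_sentence_graph(sentences):
    # One bag of lowercased words per sentence (vertex).
    bags = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = len(bags[i] & bags[j])
        if shared:
            edges[(i, j)] = shared  # edge weight = number of shared words
    return edges


dense = ["neo meets morpheus", "morpheus trains neo", "neo fights agents"]
sparse = ["neo meets morpheus", "the dude abides", "stocks fell today"]

print(len(build_sentence_graph(dense)))
print(len(build_sentence_graph(sparse)))
```

* Both inputs have three sentences, but the first yields the worst-case n * (n - 1) / 2 edges while the second yields none.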

[Entropy-based keyword extraction (Montemurro and Zanette)](https://arxiv.org/abs/0907.1558)
-------------------------------------------------------------------

* The paper describes a technique to identify words that play a significant role in the large-scale structure of a text. These typically correspond to the major themes of the text.
* The text is divided into blocks of ~1000 words, and the entropy of each word's distribution among the blocks is compared with the expected entropy if the word were distributed randomly.
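* The comparison can be sketched with plain Python. This is a simplified illustration, not the Gensim code: the function names are hypothetical, and the expected entropy is estimated here by shuffling the words a few times rather than by the analytic expression used in the paper.

```python
import math
import random


def block_entropy(words, target, block_size):
    # Split the text into consecutive blocks and count the target word in each.
    blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]
    counts = [block.count(target) for block in blocks]
    total = sum(counts)
    if total == 0:
        return 0.0
    # Shannon entropy of the word's distribution over blocks.
    return sum((c / total) * math.log(total / c) for c in counts if c)


def shuffled_entropy(words, target, block_size, trials=20, seed=0):
    # Average entropy after random shuffles: a stand-in for the expected entropy.
    rng = random.Random(seed)
    pool = list(words)
    total = 0.0
    for _ in range(trials):
        rng.shuffle(pool)
        total += block_entropy(pool, target, block_size)
    return total / trials


# Toy text: "lincoln" is clustered at the start, so its observed entropy is low.
words = ["lincoln"] * 10 + ["filler"] * 90
observed = block_entropy(words, "lincoln", block_size=25)
expected = shuffled_entropy(words, "lincoln", block_size=25)
print(observed, expected)
```

* A word whose observed entropy is well below the shuffled baseline is concentrated in a few blocks, which is what marks it as structurally significant.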




In [21]:
import requests
from gensim.summarization import mz_keywords

text = requests.get("http://www.gutenberg.org/files/49679/49679-0.txt").text
print(mz_keywords(text, scores=True, threshold=0.001)[0:30])

[('i', 0.005071990145676084),
 ('the', 0.004078714811925573),
 ('lincoln', 0.003834207719481631),
 ('you', 0.00333099434510635),
 ('gutenberg', 0.0032861719465446127),
 ('v', 0.0031486824001772298),
 ('a', 0.0030225302081737385),
 ('project', 0.003013787365092158),
 ('s', 0.002804807648086567),
 ('iv', 0.0027211423370182043),
 ('he', 0.0026652557966447303),
 ('ii', 0.002522584294510855),
 ('his', 0.0021025932276434807),
 ('by', 0.002092414407555808),
 ('abraham', 0.0019871796860869762),
 ('or', 0.0019180648459331258),
 ('lincolna', 0.0019090487448340699),
 ('tm', 0.001887549850538215),
 ('iii', 0.001883132631521375),
 ('was', 0.0018691721439371533),
 ('work', 0.0017383218152950376),
 ('new', 0.0016870325205805429),
 ('co', 0.001654497521737427),
 ('case', 0.0015991334540419223),
 ('court', 0.0014413967155396973),
 ('york', 0.001429133695025362),
 ('on', 0.0013292841806795005),
 ('it', 0.001308454011675044),
 ('had', 0.001298103630126742),
 ('to', 0.0012629182579600709)]


* The algorithm weights the entropy by the overall frequency of the word in the document. We can remove this weighting by setting weighted=False.




In [20]:
print(mz_keywords(text, scores=True, weighted=False, threshold=1.0)[0:30])

[('gutenberg', 3.813054848640599),
 ('project', 3.573855036862196),
 ('tm', 3.5734630161654266),
 ('co', 3.188187179789419),
 ('foundation', 2.9349504275296248),
 ('dogskin', 2.767166394411781),
 ('electronic', 2.712759445340285),
 ('donations', 2.5598097474452906),
 ('foxboro', 2.552819829558231),
 ('access', 2.534996621584064),
 ('gloves', 2.534996621584064),
 ('_works_', 2.519083905903437),
 ('iv', 2.4068950059833725),
 ('v', 2.376066199199476),
 ('license', 2.32674033665853),
 ('works', 2.320294093790008),
 ('replacement', 2.297629530050557),
 ('e', 2.1840002559354215),
 ('coon', 2.1754936158294536),
 ('volunteers', 2.1754936158294536),
 ('york', 2.172102058646223),
 ('ii', 2.143421998464259),
 ('edited', 2.110161739139703),
 ('refund', 2.100145067024387),
 ('iii', 2.052633589900031),
 ('bounded', 1.9832369322912882),
 ('format', 1.9832369322912882),
 ('jewelry', 1.9832369322912882),
 ('metzker', 1.9832369322912882),
 ('millions', 1.9832369322912882)]


* With weighted=False, setting threshold="auto" calculates a threshold from the number of blocks.
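* The Gensim docstring for "mz_keywords" describes the automatic threshold as n_blocks / (n_blocks + 1.0) + 1e-8. The sketch below simply evaluates that formula; the helper name "auto_threshold", the 1024-word block size, and the integer division used to count blocks are assumptions, so check them against the installed version.

```python
def auto_threshold(n_words, block_size=1024):
    # Number of whole blocks in the text (assumed: partial last block is ignored).
    n_blocks = n_words // block_size
    # Formula as described in the mz_keywords docstring.
    return n_blocks / (n_blocks + 1.0) + 1e-8


for n_words in (10_000, 100_000, 1_000_000):
    print(n_words, auto_threshold(n_words))
```

* For long texts the value approaches 1.0 from below, which is consistent with the threshold=1.0 and threshold="auto" runs returning the same top keywords here.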




In [19]:
print(mz_keywords(text,scores=True,weighted=False,threshold="auto")[0:30])

[('gutenberg', 3.813054848640599),
 ('project', 3.573855036862196),
 ('tm', 3.5734630161654266),
 ('co', 3.188187179789419),
 ('foundation', 2.9349504275296248),
 ('dogskin', 2.767166394411781),
 ('electronic', 2.712759445340285),
 ('donations', 2.5598097474452906),
 ('foxboro', 2.552819829558231),
 ('access', 2.534996621584064),
 ('gloves', 2.534996621584064),
 ('_works_', 2.519083905903437),
 ('iv', 2.4068950059833725),
 ('v', 2.376066199199476),
 ('license', 2.32674033665853),
 ('works', 2.320294093790008),
 ('replacement', 2.297629530050557),
 ('e', 2.1840002559354215),
 ('coon', 2.1754936158294536),
 ('volunteers', 2.1754936158294536),
 ('york', 2.172102058646223),
 ('ii', 2.143421998464259),
 ('edited', 2.110161739139703),
 ('refund', 2.100145067024387),
 ('iii', 2.052633589900031),
 ('bounded', 1.9832369322912882),
 ('format', 1.9832369322912882),
 ('jewelry', 1.9832369322912882),
 ('metzker', 1.9832369322912882),
 ('millions', 1.9832369322912882)]
