In [9]:
import nltk
from bs4 import BeautifulSoup
import requests

readable_title = ''

def download_document(url):
    """Downloads a page with requests, parses it with BeautifulSoup, and
    extracts the page title and all text stored in paragraph tags
    """
    global readable_title  # without this, the assignment below only sets a local
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    readable_title = soup.find('title').get_text()
    document = ' '.join([p.get_text() for p in soup.find_all('p')])
    return document

url = "http://venturebeat.com/2014/07/04/facebooks-little-social-experiment-got-you-bummed-out-get-over-it/"
document = download_document(url)
   
  

## Frequency Analysis
Here's a little secret: much of NLP (and data science, for that matter) boils down to counting things. If you've got a bunch of data that needs *analyzin'* but you don't know where to start, counting things is usually a good place to begin. Sure, you'll need to figure out exactly what you want to count, how to count it, and what to do with the counts, but if you're lost and don't know what to do, **just start counting**.

Perhaps we'd like to begin (as is often the case in NLP) by examining the words that appear in our document. To do that, we'll first need to tokenize the text string into discrete words. Since we're working with English, this isn't so bad, but if we were working with a non-whitespace-delimited language like Chinese, Japanese, or Korean, it would be much more difficult.

Notice that the output below contains some punctuation and numbers, hasn't been lowercased, and counts *BuzzFeed* and *BuzzFeed's* separately. We'll tackle some of those issues next.

In [11]:
tokens = [word for sent in nltk.sent_tokenize(document) for word in nltk.word_tokenize(sent)]

for token in sorted(set(tokens))[:30]:
    print (token + ' [' + str(tokens.count(token)) + ']')

( [1]
) [1]
, [28]
. [42]
2012 [1]
700,000 [3]
: [5]
; [1]
? [9]
A/B [2]
According [2]
Actually [1]
After [2]
All [1]
And [3]
Before [1]
Believe [1]
Blame [1]
But [2]
BuzzFeed’s [1]
Buzzfeed [1]
Could [1]
Count [1]
David [1]
Did [2]
Don’t [1]
Editorial [1]
Epic [1]
Facebook [13]
Facebook’s [4]
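
The loop above calls `tokens.count()` once per unique token, which rescans the whole list each time. As a minimal sketch (using a short, hypothetical token list rather than the article itself), Python's standard `collections.Counter` produces the same counts in a single pass:

```python
from collections import Counter

# Hypothetical stand-in for the `tokens` list built above.
sample_tokens = ["Facebook", "is", "big", ".", "Facebook", "knows", "."]

counts = Counter(sample_tokens)  # one pass over the list

# Same alphabetical listing format as the loop above.
for token in sorted(counts):
    print(token + ' [' + str(counts[token]) + ']')
```

For a single article the difference is negligible, but it starts to matter once you count over a whole corpus.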


#### Word Stemming
[Stemming](http://en.wikipedia.org/wiki/Stemming) is the process of reducing a word to its base/stem/root form. Most stemmers are pretty basic and just chop off standard affixes indicating things like tense (e.g., "-ed") and possessive forms (e.g., "-'s"). Here, we'll use the Snowball stemmer for English, which comes with NLTK.

Once our tokens are stemmed, we can rest easy knowing that *BuzzFeed* and *BuzzFeed's* are now being counted together as... *buzzfe*? Don't worry: although this may look weird, it's pretty standard behavior for stemmers and won't affect our analysis (much). We also (probably) won't show the stemmed words to users -- we'll normally just use them for internal analysis or indexing purposes.

In [12]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stemmed_tokens = [stemmer.stem(t) for t in tokens]

for token in sorted(set(stemmed_tokens))[50:75]:
    print (token + ' [' + str(stemmed_tokens.count(token)) + ']')

behavior [1]
believ [1]
besid [1]
better [1]
blame [2]
breach [1]
built [1]
bum [1]
but [4]
button [1]
buzzfe [2]
by [5]
came [2]
can [3]
cat [1]
charg [1]
children [1]
citi [1]
click [1]
come-on [1]
compani [2]
condit [1]
confidenti [1]
connect [1]
contagion.” [1]
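
To make the "chop off standard affixes" idea concrete, here is a toy suffix stripper. It is only a sketch: the suffix list and the minimum stem length are assumptions picked for illustration, and this is not the Snowball algorithm, whose rules are considerably more involved. It does happen to land on the same odd-looking *buzzfe*, because it mistakes the final "-ed" of *BuzzFeed* for a past-tense ending.

```python
# A toy suffix-stripping stemmer (NOT the Snowball algorithm).
# The suffixes and min_stem_len are hypothetical choices for illustration.
def toy_stem(word, min_stem_len=3):
    word = word.lower().replace("\u2019", "'")  # normalize curly apostrophes
    if word.endswith("'s"):                      # strip the possessive first
        word = word[:-2]
    for suffix in ("ing", "ed", "s"):
        # Only strip when enough of the word is left to serve as a stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["BuzzFeed", "BuzzFeed's", "clicked", "testing", "users", "is"]:
    print(w, "->", toy_stem(w))
```

Rules this crude over-strip and under-strip constantly, which is why real stemmers layer many more conditions on top of the same basic idea.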


#### Lemmatization

Although the stemmer very helpfully chopped off pesky affixes (and made everything lowercase to boot), there are some word forms that give stemmers indigestion, especially *irregular* words. While the process of stemming typically involves rule-based methods of stripping affixes (making them small & fast), **lemmatization** involves dictionary-based methods to derive the canonical forms (i.e., *lemmas*) of words. For example, *run*, *runs*, *ran*, and *running* all correspond to the lemma *run*. However, lemmatizers are generally big, slow, and brittle due to the nature of the dictionary-based methods, so you'll only want to use them when necessary.

The example below compares the output of the Snowball stemmer with the WordNet lemmatizer (also distributed with NLTK). Notice that the lemmatizer correctly converts *women* into *woman*, while the stemmer turns *lying* into *lie*. Additionally, both replace *eyes* with *eye*, but neither of them properly transforms *told* into *tell*.

In [13]:
lemmatizer = nltk.WordNetLemmatizer()
temp_sent = "Several women told me I have lying eyes."

print ([stemmer.stem(t) for t in nltk.word_tokenize(temp_sent)])
print ([lemmatizer.lemmatize(t) for t in nltk.word_tokenize(temp_sent)])

['sever', 'women', 'told', 'me', 'i', 'have', 'lie', 'eye', '.']
['Several', 'woman', 'told', 'me', 'I', 'have', 'lying', 'eye', '.']
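
One likely reason *told* and *lying* come back unchanged is that `lemmatize()` treats each word as a noun unless a part of speech is supplied through its `pos` argument. The sketch below illustrates the dictionary-lookup idea with a hypothetical, hand-written table keyed by word and part of speech. It is not how WordNet stores its data, just the shape of the technique.

```python
# Hypothetical miniature lemma dictionary keyed by (word, part of speech).
# A real lemmatizer uses a far larger lexicon plus fallback rules.
LEMMAS = {
    ("women", "n"): "woman",
    ("eyes", "n"): "eye",
    ("told", "v"): "tell",
    ("lying", "v"): "lie",
    ("ran", "v"): "run",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form for (word, pos), or the word unchanged."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("told"))           # looked up as a noun: no entry, unchanged
print(toy_lemmatize("told", pos="v"))  # looked up as a verb: entry found
```

In practice this means a lemmatizer usually works best when fed part-of-speech tags, for example from `nltk.pos_tag`.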


#### NLTK Frequency Distributions
Thus far, we've been working with lists of tokens that we're manually sorting, uniquifying, and counting -- all of which can get to be a bit cumbersome. Fortunately, NLTK provides a data structure called <code>FreqDist</code> that makes it more convenient to work with these kinds of frequency distributions. The code snippet below builds a <code>FreqDist</code> from our list of stemmed tokens, and then displays the top 25 tokens appearing most frequently in the text of our article. Wasn't that easy?

In [24]:
fdist = nltk.FreqDist(stemmed_tokens)

#for item in list(fdist.items())[:25]:
for item in fdist.most_common(25):
     print (item)

('the', 44)
('.', 42)
('of', 29)
(',', 28)
('to', 22)
('it', 21)
('that', 20)
('facebook', 17)
('and', 16)
('you', 14)
('a', 13)
('is', 11)
('in', 11)
('be', 11)
('content', 10)
('they', 10)
('user', 9)
('?', 9)
('if', 9)
('their', 8)
('happen', 8)
('was', 7)
('would', 6)
('all', 6)
('this', 6)


#### Filtering out Stop Words
Notice in the output above that most of the top 25 tokens are worthless. With the exception of things like *facebook*, *content*, *user*, and perhaps *happen*, the rest are basically devoid of meaningful information. They don't really tell us anything about the article since these tokens will appear in just about any English document. What we need to do is filter out these [*stop words*](http://en.wikipedia.org/wiki/Stop_words) in order to focus on just the important material.

While there is no single, definitive list of stop words, NLTK provides a decent start. Let's load it up and take a look at what we get:

In [25]:
sorted(nltk.corpus.stopwords.words('english'))[:25]

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both']

Now we can use this list to filter out stop words from our list of stemmed tokens before we create the frequency distribution. You'll notice in the output below that we still have some things like punctuation that we'd probably like to remove, but we're much closer to having a list of the most "important" words in our article.

In [26]:
stop_words = set(nltk.corpus.stopwords.words('english'))  # load the list once, as a set
stemmed_tokens_no_stop = [t for t in stemmed_tokens if t not in stop_words]  # tokens are already stemmed

fdist2 = nltk.FreqDist(stemmed_tokens_no_stop)

#for item in fdist2.items()[:25]:
for item in fdist2.most_common(25):
    print(item)

('.', 42)
(',', 28)
('facebook', 17)
('content', 10)
('?', 9)
('user', 9)
('happen', 8)
('would', 6)
('emot', 6)
('peopl', 6)
('feed', 5)
(':', 5)
('research', 5)
('privaci', 5)
('friend', 5)
('test', 5)
('negat', 5)
('posit', 5)
('one', 5)
('use', 5)
('could', 4)
('might', 4)
('everi', 4)
('time', 4)
('read', 4)
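
The leftover punctuation can be dropped in the same pass as the stop words. Here is a minimal sketch, assuming a tiny hypothetical stop list and token sample in place of the notebook's real inputs, that keeps only tokens containing at least one letter:

```python
import re
from collections import Counter

# Hypothetical, tiny stand-ins for nltk.corpus.stopwords.words('english')
# and for the stemmed_tokens list.
stop_words = {"the", "of", "to", "it", "that", "and", "a", "is"}
sample_tokens = ["the", "user", "of", "facebook", ".", ",",
                 "facebook", "?", "2012", "a/b", "is"]

def keep(token):
    # Keep tokens that contain at least one letter and are not stop words.
    return re.search("[a-zA-Z]", token) is not None and token not in stop_words

filtered = [t for t in sample_tokens if keep(t)]
print(Counter(filtered).most_common(3))
```

Storing the stop list in a set makes each membership check constant-time, and the letter test also discards purely numeric tokens such as *2012*, which may or may not be what you want.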


## Named Entity Recognition
Another task we might want to do to help identify what's "important" in a text document is [named entity recognition (NER)](http://en.wikipedia.org/wiki/Named-entity_recognition). Also called *entity extraction*, this process involves automatically extracting the names of persons, places, organizations, and potentially other entity types out of unstructured text. Building an NER classifier requires *lots* of annotated training data and some [fancy machine learning algorithms](http://en.wikipedia.org/wiki/Conditional_random_field), but fortunately, NLTK comes with a pre-built/pre-trained NER classifier ready to extract entities right out of the box. This classifier has been trained to recognize PERSON, ORGANIZATION, and GPE (geo-political entity) entity types.

In [42]:
def extract_entities(text):
    entities = []
    for sentence in nltk.sent_tokenize(text):
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])
    return entities

for entity in extract_entities('My name is Charlie and I work for Altamira in Tysons Corner.'):
    print ('[' + entity.label() + '] ' + ' '.join(c[0] for c in entity.leaves()))

[PERSON] Charlie
[ORGANIZATION] Altamira
[GPE] Tysons Corner


## Automatic Summarization

The Reuters Corpus contains nearly 11,000 news articles about a variety of topics and subjects. If you've run the <code>nltk.download()</code> command as previously recommended, you can then easily import and explore the Reuters Corpus like so:

In [37]:
from nltk.corpus import reuters

print ('** BEGIN ARTICLE: ** \"' + reuters.raw(reuters.fileids()[0])[:500] + ' [...]\"')

** BEGIN ARTICLE: ** "ASIAN EXPORTERS FEAR DAMAGE FROM U.S.-JAPAN RIFT
  Mounting trade friction between the
  U.S. And Japan has raised fears among many of Asia's exporting
  nations that the row could inflict far-reaching economic
  damage, businessmen and officials said.
      They told Reuter correspondents in Asian capitals a U.S.
  Move against Japan might boost protectionist sentiment in the
  U.S. And lead to curbs on American imports of their products.
      But some exporters said that while the conflict wo [...]"



Our approach to summarization will be a simple extractive one:

- assign a score to each word in a document corresponding to its level of "importance"
- rank each sentence in the document by summing the individual word scores and dividing by the number of tokens in the sentence
- extract the top N highest scoring sentences and return them as our "summary"
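
The three steps above can be sketched in a few lines of plain Python. The word scores and the pre-tokenized sentences here are hypothetical placeholders, and `summarize` is an illustrative helper rather than anything provided by NLTK:

```python
# Hypothetical per-word importance scores (step 1).
word_scores = {"bank": 0.9, "loss": 0.8, "merger": 0.7, "said": 0.1}

# Hypothetical document, already split into tokenized sentences.
sentences = [
    ["the", "bank", "reported", "a", "loss"],
    ["officials", "said", "little"],
    ["merger", "talks", "bank"],
]

def summarize(sentences, word_scores, top_n=1):
    scored = []
    for tokens in sentences:
        if not tokens:
            continue  # nothing to score, and avoids dividing by zero
        # Step 2: sum the word scores and normalize by sentence length.
        total = sum(word_scores.get(t, 0.0) for t in tokens)
        scored.append((total / len(tokens), tokens))
    # Step 3: keep the top N highest-scoring sentences.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tokens for _, tokens in scored[:top_n]]

print(summarize(sentences, word_scores, top_n=1))
```

Dividing by sentence length keeps a long sentence from winning just because it contains more words.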

#### Term Frequency - Inverse Document Frequency (TF-IDF)

Consider a document that contains the word *baseball* 8 times. You might think, "wow, *baseball* isn't a stop word, and it appeared rather frequently here, so it's probably important." And you might be right. But what if that document is actually an article posted on a baseball blog? Won't the word *baseball* appear frequently in nearly every post on that blog? In this particular case, if you were generating a summary of this document, would the word *baseball* be a good indicator of importance, or would you maybe look for other words that help distinguish or differentiate this blog post from the rest?

Context is essential. What really matters here isn't the raw number of times each word appeared in a document, but rather the **relative frequency**: how often a word appeared in this document compared with how often it appeared across the rest of the collection of documents. "Important" words will be the ones that are generally rare across the collection, but which appear with an unusually high frequency in a given document.

We'll calculate this relative frequency using a statistical metric called [term frequency - inverse document frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf). Rather than implementing it ourselves, we'll use the TF-IDF implementation provided by the [scikit-learn](http://scikit-learn.org/) machine learning library for Python.
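
To see the baseball intuition in numbers, here is a small sketch of the textbook formula: term frequency times the log of (number of documents / number of documents containing the term). The three-post corpus is hypothetical, and scikit-learn's `TfidfVectorizer` uses a smoothed, normalized variant by default, so its values will differ from these.

```python
import math

# Hypothetical three-post "baseball blog" corpus, already tokenized.
corpus = {
    "post1": ["baseball", "pitcher", "injury", "baseball"],
    "post2": ["baseball", "trade", "deadline"],
    "post3": ["baseball", "stadium", "baseball", "tickets"],
}

def tf_idf(term, doc_id, corpus):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    doc = corpus[doc_id]
    tf = doc.count(term) / len(doc)
    df = sum(1 for tokens in corpus.values() if term in tokens)
    if df == 0:
        return 0.0  # term never appears anywhere in the corpus
    return tf * math.log(len(corpus) / df)

print(tf_idf("baseball", "post1", corpus))  # in every post, so idf = log(1) = 0
print(tf_idf("pitcher", "post1", corpus))   # in only one post, so it scores higher
```

Even though *baseball* is the most frequent word in the first post, it scores zero because it does nothing to distinguish that post from the others.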

#### Building a Term-Document Matrix

We'll use scikit-learn's <code>TfidfVectorizer</code> class to construct a [term-document matrix](http://en.wikipedia.org/wiki/Document-term_matrix) containing the TF-IDF score for each word in each document in the Reuters Corpus. In essence, the rows of this sparse matrix correspond to documents in the corpus, the columns represent each word in the vocabulary of the corpus, and each cell contains the TF-IDF value for a given word in a given document.
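
The row and column layout is easier to picture on a toy example. The sketch below builds a dense matrix of raw counts for a hypothetical three-document corpus. The real matrix is sparse and holds TF-IDF values, but it is indexed the same way: row for the document, column for the term.

```python
# Hypothetical toy corpus; the notebook builds the real matrix from Reuters.
docs = ["oil prices fall", "bank posts loss", "oil bank merger"]

# Column order: sorted vocabulary, with a dict for constant-time column lookup.
vocabulary = sorted({word for doc in docs for word in doc.split()})
column_of = {word: j for j, word in enumerate(vocabulary)}

# One row per document, one column per vocabulary term (raw counts here).
matrix = [[0] * len(vocabulary) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc.split():
        matrix[i][column_of[word]] += 1

print(vocabulary)
print(matrix[2][column_of["oil"]])  # count of "oil" in the third document
```

A fitted `TfidfVectorizer` exposes a comparable term-to-column mapping through its `vocabulary_` attribute.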


In [39]:
import datetime, re, sys
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize_and_stem(text):
    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems

token_dict = {}
for article in reuters.fileids():
    token_dict[article] = reuters.raw(article)
        
tfidf = TfidfVectorizer(tokenizer=tokenize_and_stem, stop_words='english', decode_error='ignore')
print ('building term-document matrix... [process started: ' + str(datetime.datetime.now()) + ']')
sys.stdout.flush()

tdm = tfidf.fit_transform(token_dict.values()) # this can take some time (about 60 seconds on my machine)
print ('done! [process finished: ' + str(datetime.datetime.now()) + ']')

building term-document matrix... [process started: 2018-04-25 01:01:52.435330]
done! [process finished: 2018-04-25 01:03:01.124108]


#### TF-IDF Scores

Now that we've built the term-document matrix, we can explore its contents:

In [40]:
from random import randint

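# get_feature_names() was removed in scikit-learn 1.2; current versions use get_feature_names_out()
feature_names = list(tfidf.get_feature_names_out())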
print ('TDM contains ' + str(len(feature_names)) + ' terms and ' + str(tdm.shape[0]) + ' documents')

print ('first term: ' + feature_names[0])
print ('last term: ' + feature_names[-1])

for i in range(0, 4):
    print ('random term: ' + feature_names[randint(1,len(feature_names) - 2)])

TDM contains 25833 terms and 10788 documents
first term: 'd
last term: zzzz
random term: mcfadden
random term: belcher
random term: baord
random term: sarji


#### Generating the Summary

That's all we'll need to produce a summary for any document in the corpus. In the example code below, we start by randomly selecting an article from the Reuters Corpus. We iterate through the article, calculating a score for each sentence by summing the TF-IDF values for each word appearing in the sentence. We normalize the sentence scores by dividing by the number of tokens in the sentence (to avoid bias in favor of longer sentences). Then we sort the sentences by their scores, and return the highest-scoring sentences as our summary. The number of sentences returned corresponds to roughly 20% of the overall length of the article.

Since some of the articles in the Reuters Corpus are rather small (i.e., a single sentence in length) or contain just raw financial data, some of the summaries won't make sense. If you run this code a few times, however, you'll eventually see a randomly-selected article that provides a decent demonstration of this simplistic method of identifying the "most important" sentence from a document.

In [45]:
import math

article_id = randint(0, tdm.shape[0] - 1)
article_text = reuters.raw(reuters.fileids()[article_id])

sent_scores = []
for sentence in nltk.sent_tokenize(article_text):
    score = 0
    sent_tokens = tokenize_and_stem(sentence)
    if not sent_tokens:
        continue  # skip sentences with no alphabetic tokens to avoid dividing by zero
    for token in (t for t in sent_tokens if t in feature_names):
        score += tdm[article_id, feature_names.index(token)]
    sent_scores.append((score / len(sent_tokens), sentence))

summary_length = int(math.ceil(len(sent_scores) / 5))
sent_scores.sort(key=lambda sent: sent[0], reverse=True)

print ('*** SUMMARY ***')
for summary_sentence in sent_scores[:summary_length]:
    print (summary_sentence[1])

print ('\n*** ORIGINAL ***')
print (article_text)

*** SUMMARY ***
Hard-hit by the collapse in oil and Texas real estate
  prices, First City's net loan chargeoffs totaled 366 mln dlrs
  last year, up from 261 mln dlrs in 1985.
The banks agreed to similar amendments to the covenants
  last year and First City has reduced its borrowings from 120
  mln dlrs at 1986 yearend to 68.5 mln dlrs in recent weeks.
The bank more than
  doubled its loan loss provision to 497 mln dlrs at the end of
  1986.
In real estate, First City said its nonperforming assets
  nearly doubled last year to 347 mln dlrs at year-end.
AUDITORS GIVE FIRST CITY &lt;FBT> QUALIFIED OPINION
  First City Bancorp of Texas, which lost
  a record 402 mln dlrs in 1986, said in its annual report it
  expected operating losses to continue "for the foreseeable
  future" as it continues to search for additional capital or a
  merger partner.

*** ORIGINAL ***
AUDITORS GIVE FIRST CITY &lt;FBT> QUALIFIED OPINION
  First City Bancorp of Texas, which lost
  a record 402 mln dlrs in 1