# Identify a text's bullet-point sentences

Suppose we want to share only the important parts of an article we've read, or we have many journal articles to get through and too little time, so we want to read only the highlights.

* Auto-generate the bullet points
  * Find most important words
  * Assign score to sentences based on their words
  * Output the top-scoring sentences

To do this we need to know how to:
* identify word importance
  * authors tend to repeat important words -> use word frequency
* assign score to sentences
  * take the words it contains and sum their "importances"
* output top scorers
  * rank the sentences
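
Before bringing in the real article and NLTK, the three steps above can be sketched with the standard library alone. This is only a rough illustration: the toy text, the tiny stop-word set, the regex-based splitting and the `summarize` helper are all assumptions made up for this sketch, not what the rest of the notebook uses.

```python
import re
from collections import Counter

# Hypothetical toy inputs, not the article used in this notebook.
TEXT = (
    "Computers help humanists read text. "
    "Some humanists count words in text. "
    "The weather was nice."
)
STOP_WORDS = {"the", "in", "was", "some"}


def summarize(text, stop_words, n=2):
    """Return the n highest-scoring sentences in their original order."""
    # naive sentence split: break after ., ! or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # word importance = how often each non-stop word occurs in the whole text
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]
    freq = Counter(words)
    # sentence score = sum of the frequencies of its words (stop words count 0)
    scores = [
        sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
        for s in sentences
    ]
    # pick the n best positions, then restore document order
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]


summary = summarize(TEXT, STOP_WORDS)
print(summary)
```

Here "humanists" and "text" each occur twice, so the two sentences that contain them outscore the sentence about the weather.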

## Retrieve an interesting article

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = 'https://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/'
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed
document = response.text     # raw HTML as a string

In [None]:
document = BeautifulSoup(response.text, "html.parser")

In [None]:
humtext = document.find('div', attrs={'class':'entry-content'}).text

In [None]:
# replace newlines, square brackets and curly quotes with spaces
for i in ['\n','[',']','’','”','“']:
    humtext = humtext.replace(i,' ')

In [None]:
print(humtext)

## Process text

In [None]:
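import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation
nltk.download('punkt')      # tokenizer data; recent NLTK releases may ask for 'punkt_tab' instead
nltk.download('stopwords')  # stop-word lists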

In [None]:
sentences = sent_tokenize(humtext)

In [None]:
sentences

In [None]:
words = word_tokenize(humtext.lower())

In [None]:
words

In [None]:
# a set makes the "w not in myStopWords" test below fast
myStopWords = set(punctuation) | set(stopwords.words('english'))

In [None]:
wordsNoStopWords = [w for w in words if w not in myStopWords]

In [None]:
wordsNoStopWords

In [None]:
from nltk.probability import FreqDist

In [None]:
freq = FreqDist(wordsNoStopWords)

In [None]:
freq

In [None]:
freq.most_common(10)
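
NLTK's FreqDist is a subclass of Python's collections.Counter, so the same counting can be tried out without NLTK. The sketch below uses a made-up token list (an assumption standing in for wordsNoStopWords) to show what `most_common` returns.

```python
from collections import Counter

# Hypothetical token list standing in for wordsNoStopWords
tokens = ["text", "words", "text", "model", "text", "words"]

counts = Counter(tokens)
# most_common(n) returns (word, count) pairs, highest count first
top_two = counts.most_common(2)
print(top_two)
```

Each item is a (word, count) tuple, which is why the plotting cell further down can unpack the pairs into two lists.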

## Slight detour:  visualization

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

In [None]:
# make two lists,
# one of the words and one of the wordcounts

commonwords = []
commonwords_freq = []

# each item is a (word, count) pair
for word, count in freq.most_common(10):
    commonwords.append(word)
    commonwords_freq.append(count)

# make a horizontal bar plot
plt.barh(commonwords, commonwords_freq)

In [None]:
textNoStopWords = ' '.join(wordsNoStopWords)

In [None]:
wordcloud = WordCloud().generate(textNoStopWords)
plt.imshow(wordcloud)
plt.show()

In [None]:
wordcloud = WordCloud(width=800, 
                      height=400, 
                      background_color='white').generate(textNoStopWords)
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

The Pandas way:

In [None]:
freq

In [None]:
freq.keys()

In [None]:
df_wordrank = pd.DataFrame({'words':list(freq.keys()),'counts':list(freq.values())})

In [None]:
df_wordrank.head()

In [None]:
# This is NOT the best plot to make.
# Why?

df_wordrank.sort_values(by='counts', ascending=False).plot(kind='barh')

In [None]:
# two options for better viewing

df_wordrank.sort_values(by='counts', ascending=False)[:10].plot(x='words',y='counts',kind='barh')
# df_wordrank.sort_values(by='counts')[-10:].plot(x='words',y='counts',kind='barh')

## Ranking sentence "importance"

In [None]:
for i in sorted(freq, key=freq.get, reverse=True)[:10]:
    print(i,freq[i])

In [None]:
ranking = {}

for sentence in sentences:
    ranking[sentence] = 0
    for word in word_tokenize(sentence.lower()):
        if word in freq:
            ranking[sentence] += freq[word]
            
ranking

We can sort these sentence scores in many ways, but let's go the Pandas way.

In [None]:
sentrank = pd.DataFrame({'sentence':list(ranking.keys()),'rank':list(ranking.values())})

In [None]:
sentrank.sort_values(by='rank',ascending=False)

In [None]:
sentrank = sentrank.sort_values(by='rank',ascending=False).reset_index()

In [None]:
sentrank.iloc[:5]

If we want a meaningful summary, we probably want to print the top sentences in the same order as they occur in the text.

The cell above gives us the top 5 sentences. Because reset_index() kept each sentence's original row number in a new "index" column, we can sort by that column to get the sentences back into the order they occurred in the article:

In [None]:
sentrank.iloc[:5].sort_values(by='index')
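
To see why sorting by "index" restores the article order, here is a small sketch on a made-up DataFrame. The four one-word "sentences" and their scores are assumptions for illustration only; the calls are the same ones used on sentrank.

```python
import pandas as pd

# Hypothetical toy scores; the row labels 0..3 record the order in the text
toy = pd.DataFrame({
    "sentence": ["first", "second", "third", "fourth"],
    "rank": [5, 30, 10, 40],
})

# sort by score; reset_index() moves the old labels into an "index" column
ranked = toy.sort_values(by="rank", ascending=False).reset_index()

# keep the top 2, then restore the order they had in the text
top2 = ranked.iloc[:2].sort_values(by="index")
print(top2)

# new label (position by score), original position, sentence
for i, row in top2.iterrows():
    print(i, row["index"], row["sentence"])
```

The "index" column holds the position in the text, while the DataFrame's own row labels now reflect the position by score, so they come out as 1 then 0 here.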

In [None]:
df_top5 = sentrank.iloc[:5].sort_values(by='index')

for i,row in df_top5.iterrows():
  print(row['sentence'])

In [None]:
# when we iterate over a dataframe like this
# the loop variables are the index and the row
# (and here the index is "out-of-order" numerically speaking)

df_top5 = sentrank.iloc[:5].sort_values(by='index')

for i,row in df_top5.iterrows():
  print(i)
  print(row['sentence'])