# Identify a text's bullet-point sentences

Suppose we want to share only the important parts of an article we've read, or we have a stack of journal articles and too little time to read them all, so we only want the highlights.

* Auto-generate the bullet points
  * Find most important words
  * Assign score to sentences based on their words
  * Output the top-scoring sentences

To do this we need to know how to:

* identify word importance
  * authors tend to repeat important words, so word frequency is a reasonable proxy
* assign a score to each sentence
  * sum the "importances" of the words it contains
* output the top scorers
  * rank the sentences by score
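
Before bringing in any libraries, the plan above can be sketched end to end with only the Python standard library. This is a rough illustration under stated assumptions, not the implementation used in the cells that follow: the regex-based sentence splitter, the tiny `STOP_WORDS` set, and the `summarize` helper are all invented for this example, and NLTK handles each of these steps more carefully.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "it"}


def summarize(text, n=2):
    """Return the n highest-scoring sentences, in their original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def tokenize(sentence):
        # Lowercase, keep alphabetic runs, drop stop words.
        words = re.findall(r"[a-z']+", sentence.lower())
        return [w for w in words if w not in STOP_WORDS]

    # Word importance = how often the word appears in the whole text.
    freq = Counter(w for s in sentences for w in tokenize(s))

    # Sentence score = sum of the importances of its words.
    scores = [sum(freq[w] for w in tokenize(s)) for s in sentences]

    # Rank by score, keep the top n, then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


sample = (
    "Cats chase mice. Mice eat cheese. "
    "Dogs bark loudly. Cats and mice are enemies."
)
summary = summarize(sample, n=2)
print(summary)
```

Here "mice" and "cats" are the most frequent words, so the two sentences that mention both come out on top.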

## First Steps:
* get text (scrape data)
* munge text

What does "munge" mean? Let's build our own dictionary lookup on top of WordNet to find out.

In [None]:
# Requires the WordNet corpus: run nltk.download('wordnet') once beforehand.
from nltk.corpus import wordnet

In [None]:
def bendict(word):
    """Print every WordNet synset for `word` along with its definition."""
    for ss in wordnet.synsets(word):
        print(ss, ss.definition())

In [None]:
bendict('hello')

In [None]:
bendict('coding')

In [None]:
bendict('munge')

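Ah well, WordNet evidently has no entry for "munge", so our dictionary has its shortcomings. (Informally, to munge data is to clean and reshape raw input into a usable form.)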

## Retrieve an interesting article

In [None]:
import requests
from bs4 import BeautifulSoup

In [None]:
url = 'https://tedunderwood.com/2015/06/04/seven-ways-humanists-are-using-computers-to-understand-text/'
response = requests.get(url)
response.raise_for_status()  # stop early if the page could not be fetched

In [None]:
document = BeautifulSoup(response.text, "html.parser")

In [None]:
humtext = document.find('div', attrs={'class':'entry-content'}).text

In [None]:
# Replace line breaks, square brackets, and curly quotes with spaces.
for char in ['\n','[',']','’','”','“']:
    humtext = humtext.replace(char,' ')
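
The same clean-up can also be done in a single pass with `str.translate`. A minimal sketch, assuming the same six characters should become spaces; the `clean` helper, the `UNWANTED` and `TABLE` names, and the sample string are made up for illustration.

```python
# Characters to blank out: newline, square brackets, and curly quotes.
UNWANTED = "\n[]’”“"

# str.maketrans builds a lookup table mapping each character to a space.
TABLE = str.maketrans({ch: " " for ch in UNWANTED})


def clean(text):
    """Replace every unwanted character with a space in one pass."""
    return text.translate(TABLE)


cleaned = clean("“Distant reading”[1]\nisn’t new.")
print(cleaned)
```

One side effect worth knowing about either version: replacing the curly apostrophe splits contractions, so "isn’t" becomes the two tokens "isn" and "t".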

In [None]:
print(humtext)

## Process text

In [None]:
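# Requires NLTK's tokenizer models ('punkt', or 'punkt_tab' on newer releases)
# and the 'stopwords' corpus; fetch them once with nltk.download(...).
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from string import punctuation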

In [None]:
sentences = sent_tokenize(humtext)

In [None]:
sentences

In [None]:
words = word_tokenize(humtext.lower())

In [None]:
words

In [None]:
# A set gives constant-time membership tests when filtering the word list.
myStopWords = set(punctuation) | set(stopwords.words('english'))

In [None]:
wordsNoStopWords = [w for w in words if w not in myStopWords]

In [None]:
wordsNoStopWords

In [None]:
from nltk.probability import FreqDist

In [None]:
freq = FreqDist(wordsNoStopWords)

In [None]:
freq

In [None]:
for i in sorted(freq, key=freq.get, reverse=True)[:10]:
    print(i,freq[i])
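
NLTK's `FreqDist` is a subclass of `collections.Counter`, so a top-ten listing like this should also be available through `most_common`. The sketch below uses a plain `Counter` on made-up tokens (`tokens` and `freq_demo` are illustrative names) so that it runs without NLTK.

```python
from collections import Counter

# Made-up tokens standing in for the filtered word list.
tokens = ["text", "model", "text", "topic", "model", "text"]
freq_demo = Counter(tokens)

# most_common(n) returns (word, count) pairs, highest count first.
top_two = freq_demo.most_common(2)
for word, count in top_two:
    print(word, count)
```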

In [None]:
# Score each sentence as the summed frequency of its non-stop words.
ranking = {}

for sentence in sentences:
    ranking[sentence] = 0
    for word in word_tokenize(sentence.lower()):
        if word in freq:
            ranking[sentence] += freq[word]

ranking

In [None]:
sorted(ranking, key=ranking.get, reverse=True)[:5]
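
When only the few best sentences are needed, `heapq.nlargest` from the standard library is an alternative to sorting the whole dictionary. A small sketch with made-up scores; `demo_ranking` is a stand-in for the `ranking` dictionary and is not taken from the article.

```python
import heapq

# Made-up scores standing in for the `ranking` dictionary.
demo_ranking = {
    "First point.": 7,
    "Minor aside.": 2,
    "Key finding.": 9,
    "Closing remark.": 4,
}

# Iterating a dict yields its keys; key=demo_ranking.get scores each one.
best_two = heapq.nlargest(2, demo_ranking, key=demo_ranking.get)
print(best_two)
```

For a handful of sentences the difference is negligible; the call mainly states the intent ("give me the top n") more directly.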

In [None]:
# Compute the top five once, then print them in document order.
top_sentences = set(sorted(ranking, key=ranking.get, reverse=True)[:5])
for sentence in sentences:
    if sentence in top_sentences:
        print(sentence)