# A Gentle Introduction to Text Summarization in Machine Learning

---

## PART 0: Imports and Initializations

In [39]:
# NLTK modules
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize, sent_tokenize

import nltk
nltk.download("punkt")
nltk.download("stopwords")

# Beautiful Soup and URL querying utilities
import bs4 as BeautifulSoup
from urllib import urlopen

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aakashsudhakar/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/aakashsudhakar/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Here, we initialize our data processing engine for miscellaneous text data online.

In [4]:
# TODO: Insert processor engine initialization here.

---

## PART 1: Overview of Concept

### Two Major Types of Text Summarization:
    - Extraction-based summarization
    - Abstraction-based summarization

### Steps to Perform Text Summarization:
    1. Convert the paragraph into sentences.
    2. Perform text processing.
    3. Perform tokenization.
    4. Evaluated the weighted occurrence frequency of the words. 
    5. Substitute words with their weighted frequencies.

<br>

![](https://paper-attachments.dropbox.com/s_5DD7360138DEDEB8828AD11E4B5921DC0A55833560A1BC79C451FADB6E7D209D_1554467410003_image.png)

<br>

---

## PART 2: Breakdown of Code Constructs

### Step 1: Prepare the data.

In [16]:
PATH_DATA = "https://en.wikipedia.org/wiki/20th_century"

data_read = urlopen(PATH_DATA).read()
data_parsed = BeautifulSoup.BeautifulSoup(data_read, "html.parser")

data_paragraphs = data_parsed.find_all("p")

data_content = str()
for paragraph in paragraphs:
    data_content += paragraph.text

### Step 2: Process the data.

In [41]:
def create_frequency_table(text):
    """ Function to create frequency histogram of word occurrences across input text. """
    stop_words = set(stopwords.words("english"))
    raw_words_from_data = word_tokenize(text)
    stem = PorterStemmer()
    # Create frequency table via dictionary operations
    frequency_table = dict()
    for word in raw_words_from_data:
        word_root = stem.stem(word)
        if word_root in stop_words:
            continue
        if word_root in frequency_table:
            frequency_table[word_root] += 1
        else:
            frequency_table[word_root] = 1
    return frequency_table

### Step 3: Tokenize the article into sentences.

In [24]:
sentences = sent_tokenize(data_content)

### Step 4: Find the weighted frequencies of the sentences.

In [33]:
def calculate_sentence_scores(sentences, frequency_table, num_chars=7):
    """ Function to create weighted frequency scores from parsed sentences using frequency table. """
    sentence_weight = dict()
    for sentence in sentences:
        sentence_wordcount_without_stop_words = 0
        sentence_wordcount = (len(word_tokenize(sentence)))
        for word_weight in frequency_table:
            if word_weight in sentence.lower():
                sentence_wordcount_without_stop_words += 1
                if sentence[:num_chars] in sentence_weight:
                    sentence_weight[sentence[:num_chars]] += frequency_table[word_weight]
                else:
                    sentence_weight[sentence[:num_chars]] = frequency_table[word_weight]
        sentence_weight[sentence[:num_chars]] /= sentence_wordcount_without_stop_words
    return sentence_weight

### Step 5: Calculate the threshold of the sentences.

In [34]:
def calculate_average_threshold(sentence_weight):
    """ Function to get the average weighted score of a sentence. """
    sum_values = 0
    for element in sentence_weight:
        sum_values += sentence_weight[element]
    return (sum_values / len(sentence_weight))

### Step 6: Obtain the summary.

In [35]:
def get_text_summary(sentences, sentence_weight, threshold, num_chars=7):
    """ Function to create summary statement of article using weighted sentence data and relative threshold. """
    sentence_counter, article_summary = 0, str()
    for sentence in sentences:
        if sentence[:num_chars] in sentence_weight and sentence_weight[sentence[:num_chars]] >= (threshold):
            article_summary += " {}".format(sentence)
            sentence_counter += 1
    return article_summary

---

## PART 3: Putting It All Together

We can wrap this all up into a nice outer function and run our summarization analysis on our sample Wikipedia and check our results!

Since this is extraction-based, it won't be nearly as nicely grammatical and well-structured as an abstraction-based (deep learning and advanced modeling) approach, but it should be sufficient to give us an adequate summary of the article's topic. 

In [36]:
def run_text_summary(text):
    frequency_table = create_frequency_table(text)
    sentences = sent_tokenize(text)
    sentence_scores = calculate_sentence_scores(sentences, frequency_table)
    threshold = calculate_average_threshold(sentence_scores)
    text_summary = get_text_summary(sentences, sentence_scores, 1.5 * threshold)
    return text_summary

In [42]:
run_text_summary(data_content)

" Terms like ideology, world war, genocide, and nuclear war entered common usage. Humans explored space for the first time, taking their first footsteps on the Moon. However, these same wars resulted in the destruction of the imperial system. The victorious Bolsheviks then established the Soviet Union, the world's first communist state. At the beginning of the period, the British Empire was the world's most powerful nation,[12] having acted as the world's policeman for the past century. In total, World War II left some 60 million people dead. With the Axis defeated and Britain and France rebuilding, the United States and the Soviet Union were left standing as the world's only superpowers. At the beginning of the century, strong discrimination based on race and sex was significant in general society. During the century, the social taboo of sexism fell. Communications and information technology, transportation technology, and medical advances had radically altered daily lives. With the e

---