# NLP Article summarization
In this notebook, I used the Beautiful Soup package to extract the text from a web article. Then I used the NLTK and textblob libraries to preprocess the text by removing words that don't contribute much to the overall interpretation of the text, as well as applying a lemma-ization process to simplify words to their rawest meaning. Then, I chunked up the text into batch sizes that were then fed through the Hugging Face Text-Summarization library. Lastly, I re-ran the initial summary through the pipeline a second time which allowed for an even more condensed summary.

This article was not chosen for any special content, but for the simplicity of the html formatting, which allowed for  the easy text extraction. This process could be automated and scheduled to summarize articles and store them for later reading.

One thing I am considering for future work on this is to add Text to Speech pipeline to read the summaries out loud. Another future project will be scheduling a scraping event (something like headlines only), and running sentiment analysis on those.

## 1. Importing Packages
- We use the requests module to get the source html data of the web article
- We will import the Beautiful Soup library to parse the scraped web data
- We will import the HuggingFace Transformers package and use the summarization pipeline

In [1]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

summarizer = pipeline("summarization")

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 (https://huggingface.co/sshleifer/distilbart-cnn-12-6)


## 2. Choose the html for the article
We are choosing an arbitrary article on www.hackernoon.com because it's webpages are simple compared to others

In [2]:
# URL = "https://hackernoon.com/will-layer-1-public-blockchains-rise-or-fall-in-the-next-bull-market"
URL = "https://hackernoon.com/how-to-manage-your-technology-and-reduce-your-digital-distractions"

## 3. Set up useful functions for task
Using Beautiful Soup, parse the html data and break text into sentence blocks. then run the sentence blocks through the summarization pipeline.

In [22]:
def bsArticle(url):
    # This function takes url's for articles from webpages with simple structure and outputs the sentences tagged
    # with <eos> tags
    a = requests.get(URL)
    soup = BeautifulSoup(a.text, 'html.parser')
    results = soup.find_all(['h1', 'h2', 'h3', 'p'])
    text = [result.text for result in results]
    article = ''.join(text)
    article = preprocess_article(article)
    sentences = add_eos(article)
    return block_sentences(sentences)


def add_eos(raw_article):
    # add end-of-sentence markers
    raw_article = raw_article.replace('.', '.<eos>')
    raw_article = raw_article.replace('?', '?<eos>')
    raw_article = raw_article.replace('!', '!<eos>')
    return split_sentences(raw_article)


def split_sentences(raw_article):
    # function for splitting a string element comprised of multiple sentences into individual sentences
    return raw_article.split('<eos>')


def block_sentences(sentences, max_block_size=2 ** 8):
    # this function puts the sentences into blocks up to a max_block_size
    blocks = []
    i = 0

    for sentence in sentences:
        # check if there is an existing block: Y, append; N, start new block
        if len(blocks) == i + 1:
            if len(blocks[i]) + len(sentence.split(' ')) <= max_block_size:
                blocks[i].extend(sentence.split(' '))
            else:
                i += 1
                blocks.append(sentence.split(' '))
        else:
            blocks.append(sentence.split(' '))

    for i in range(len(blocks)):
        blocks[i] = ' '.join(blocks[i])

    return blocks


def summarize(URL):
    #
    blocks = bsArticle(URL)
    print(f'Number of blocks of text: {len(blocks)}')
    return summarizer(blocks, min_length=16, max_length=128, do_sample=False)

def reSummarize(summary):
    # This function takes as input, a summary that was already generated and re-runs it through the pipeline for a
    # more condensed summary
    text = [sentence['summary_text'] for sentence in summary]
    article = ''.join(text)
    article = preprocess_article(article)
    sentences = add_eos(article)
    blocks = block_sentences(sentences, max_block_size=256)
    print(f'Number of blocks of text: {len(blocks)}')
    summary = summarizer(blocks, min_length=16, max_length=128, do_sample=False)
    return summary


def readSummary(summary):
    print('Primary Summary :', ''.join([sm['summary_text'] for sm in summary]))


def writeSummary(text, title='articleSummary'):
    with open(title + '.txt', 'w', encoding='utf-32') as f:
        f.write(''.join([sm['summary_text'] for sm in text]))

### Import nltk and textblob libraries for cleaning data

In [26]:
import nltk
from nltk.corpus import stopwords
# !pip install textblob
from textblob import Word

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
stop_words = stopwords.words('english')
# custom_stopwords = []

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\danny\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\danny\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\danny\AppData\Roaming\nltk_data...


In [24]:
def preprocess_article(article):  # , custom_stopwords):
    processed_article = article
    processed_article.replace('[^\w\s]', '')
    processed_article = " ".join(word for word in processed_article.split() if word not in stop_words)
    # processed_article = " ".join(word for word in processed_article.split() if word not in custom_stopwords)
    processed_article = " ".join(Word(word).lemmatize() for word in processed_article.split())
    return(processed_article)


## 4. Summarize the article

In [27]:
summary = summarize(URL)
readSummary(summary)
writeSummary(summary, 'articlePrimarySummary_wPreprocessing')

Number of blocks of text: 5
Primary Summary :  As world becomes connected people, businesses, thing connected time, it’s become increasingly important find way manage technology order reduce digital distractions . A digital distraction non-essential technology use take attention away need given moment time . Research found average person spends nearly half daily time awake either looking phone thinking it. Diversion Is Important So You Don’t Feel Overwhelmed . Turn off Your Automatic Response Feature Your Email . Take regular tech breaks and set timer remind need take break . Social Media Anxiety Disorder may indicate real condition called Social Media anxiety Disorder . Social medium usage causes psychological problems, including anxiety, depression, loneliness, loneliness . Technology negative impact physical health addition damaging real-world relationships . Technology fantastic form distraction provide comfortable distraction mental health issues . If you’re online time, it’s like

## 5. Resummarize the summary
Try to summarize the summary for a more compressed total summary

In [28]:
summary2 = reSummarize(summary)
readSummary(summary2)
writeSummary(summary2, 'articleSecondarySummary_wPreprocessing')

Number of blocks of text: 1
Primary Summary :  As world becomes connected people, businesses, thing connected time, it’s become increasingly important find way manage technology order reduce digital distraction . Social medium usage cause psychological problems, including anxiety, depression, loneliness, loneliness . Technology negative impact physical health addition damaging real-world relationship . Diversion Is Important So You Don’t Feel Overwhelmed .
