## 1. Importing Packages
- We use the requests module to fetch the HTML source of the web article
- We import the Beautiful Soup library to parse the scraped HTML
- We import the HuggingFace Transformers package and use its summarization pipeline

In [1]:
import requests
from bs4 import BeautifulSoup
from transformers import pipeline

## 2. Pull down the HTML for the article
- We are choosing an arbitrary article on hackernoon.com

Use requests to pull the HTML data for the article

In [2]:
URL = "https://hackernoon.com/will-layer-1-public-blockchains-rise-or-fall-in-the-next-bull-market"
# URL = "https://hackernoon.com/how-to-manage-your-technology-and-reduce-your-digital-distractions"  #"https://medium.com/codex/why-i-stopped-using-gmail-and-why-should-you-too-c542341ef8f1"
a = requests.get(URL)
# a.text  # text is non-formatted. Use Beautiful Soup!

## 3. Use Beautiful Soup
Use Beautiful Soup to parse the webpage and pull the header and paragraph tags

In [3]:
soup = BeautifulSoup(a.text, 'html.parser')
# results = soup.find_all(['h1', 'p'])
# results = soup.find_all(['h1', 'h2', 'p', 'li', 'ol'])
results = soup.find_all(['h1', 'h2', 'h3', 'p'])

In [4]:
# soup
text = [result.text for result in results]
article = ' '.join(text)  # join with a space so words at tag boundaries do not run together
# article
# soup.prettify()

## 4. Break text into sentence blocks

In [51]:
def bsArticle(url):
    r'''Take the URL of an article on a simply structured webpage and return its text grouped into
    blocks of sentences.'''
    a = requests.get(url)  # use the url argument, not the global URL
    soup = BeautifulSoup(a.text, 'html.parser')
    results = soup.find_all(['h1', 'h2', 'h3', 'p'])
    text = [result.text for result in results]
    article = ' '.join(text)  # join with a space so words at tag boundaries do not run together
    sentences = add_eos(article)
    return block_sentences(sentences)


def add_eos(raw_article):
    # add end-of-sentence markers
    # if type(raw_article) == 'list':
    #     raw_article = raw_article.replace('.', '.<eos>')
    #     raw_article = raw_article.replace('!', '!<eos>')
    #     raw_article = raw_article.replace('?', '?<eos>')

    raw_article = raw_article.replace('.', '.<eos>')
    raw_article = raw_article.replace('!', '!<eos>')
    raw_article = raw_article.replace('?', '?<eos>')
    return split_sentences(raw_article)


def split_sentences(raw_article):
    # function for splitting a string element comprised of multiple sentences into individual sentences
    return raw_article.split('<eos>')


def block_sentences(sentences, max_block_size=2 ** 8):
    '''Group consecutive sentences into blocks, starting a new block once adding a sentence
    would exceed max_block_size space-separated words.'''
    blocks = []
    i = 0

    for sentence in sentences:
        if len(blocks) == i + 1:
            if len(blocks[i]) + len(sentence.split(' ')) <= max_block_size:
                blocks[i].extend(sentence.split(' '))
            else:
                i += 1
                blocks.append(sentence.split(' '))
        else:
            blocks.append(sentence.split(' '))

    for i in range(len(blocks)):
        blocks[i] = ' '.join(blocks[i])

    return blocks


def reSummarize(summary):
    r'''This function takes a summary that was already generated and re-runs it through the pipeline for a more
    condensed summary'''
    # a = requests.get(URL)
    # soup = BeautifulSoup(a.text, 'html.parser')
    # results = soup.find_all(['h1', 'h2', 'h3', 'p'])
    text = [sentence['summary_text'] for sentence in summary]
    article = ''.join(text)
    sentences = add_eos(article)
    return block_sentences(sentences, max_block_size=2 ** 8)
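
The replace-based add_eos above also inserts a marker after periods inside numbers, so a string such as "Web 3.0" is cut into "Web 3." and "0". A minimal sketch of an alternative, using only the standard re module, splits only where the punctuation is followed by whitespace. The name split_sentences_regex is hypothetical, and the approach assumes sentences are separated by whitespace, so it needs the paragraphs to be joined with a space. Abbreviations such as "e.g." followed by a space would still be split.

```python
import re

# Split on whitespace that directly follows '.', '!' or '?'.
# A period inside a number ("3.0") is not followed by whitespace, so it is kept.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+')


def split_sentences_regex(raw_article):
    """Hypothetical alternative to add_eos/split_sentences (not part of the original notebook)."""
    return [s for s in _SENTENCE_END.split(raw_article.strip()) if s]


print(split_sentences_regex("Web 3.0 is here. Is it? Yes!"))
# ['Web 3.0 is here.', 'Is it?', 'Yes!']
```

The returned list could be passed to block_sentences in place of the output of add_eos.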

In [12]:
blocks = bsArticle(URL)

# sentences = add_eos(article)
# blocks = block_sentences(sentences)
# # article = add_eos(article)
#
# # sentences = article.split('<eos>')

print(f'Number of blocks of text: {len(blocks)}')

Number of blocks of text: 18
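
Note that max_block_size in block_sentences counts space-separated words, not model tokens, and subword tokenizers generally produce more tokens than words. As a quick sanity check, the sketch below reports the word count of each block. The helper name block_word_counts and the example strings are made up for illustration.

```python
def block_word_counts(blocks):
    """Return the number of space-separated words in each block."""
    return [len(block.split(' ')) for block in blocks]


# Made-up blocks for illustration; in the notebook, pass the blocks list instead.
example_blocks = ["one two three", "four five"]
print(block_word_counts(example_blocks))
# [3, 2]
```

Calling max(block_word_counts(blocks)) on the real blocks shows the size of the largest block. A single sentence longer than max_block_size would still end up in a block of its own.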


## 5. Use HuggingFace Summarization Pipeline

In [13]:
summarizer = pipeline("summarization", model="../sshleifer/distilbart-cnn-12-6")

In [14]:
summary = summarizer(blocks, min_length=16, max_length=128, do_sample=False)
# a[0]

{'summary_text': ' A digital distraction is any non-essential technology use that takes your attention away from what you need to be doing at any given moment in time . The average person spends nearly half of their daily time awake either looking at their phone or thinking about it . There are a number of ways that you can manage your technology in order to reduce distractions .'}

In [15]:
def readSummary(summary):
    for sentence in summary:
        print(sentence['summary_text'])


readSummary(summary)

 The development of alternative Layer-1s could be a successful way forward for the blockchain industry and Web 3. 0 to shape the next generation of the Internet .
 Layer-1s that have healthy technological foundations and continually attract native projects will have a good chance of surviving across the bulls and bears phases of the market . The public blockchain will also continue to gradually differentiate into other different development paths like general purpose, privacy, and specialized application blockchains .
 The relatively new public chain Solana, which emerged in the last bull market, currently has a bear market FDV of less than 20 billion US dollars . The formation of a stable ecosystem that can enable more application value and hence drive more value at the Layer-1 protocol level in the long run .
 The three stages of the development of Layer-1 public blockchains can be divided roughly into the following three stages . The first stage was from 2008 to 2013 after Satoshi N

In [52]:
blocks2 = reSummarize(summary)
summary2 = summarizer(blocks2, min_length=16, max_length=128, do_sample=False)
readSummary(summary2)

 Layer-1s that have healthy technological foundations and continually attract native projects will have a good chance of surviving across the bulls and bears phases of the market . Public blockchain will continue to gradually differentiate into other different development paths like general purpose, privacy, and specialized application blockchains . The relatively new public chain Solana, which emerged in the last bull market, currently has a bear market FDV of less than 20 billion US dollars .
 Public Layer-1 blockchains created during this period generally focused on smart contract functionality . The characteristic public blockchain is a concept separated from the standard generalized public blockchain . The development of blockchains will continue to differentiate further into alternative paths like permissioned systems, privacy-focused and specific application blockchains . The public mainnet look to be released in the third quarter of this year .
 Sui has issued incentivized test

In [55]:
blocks3 = reSummarize(summary2)
summary3 = summarizer(blocks3, min_length=16, max_length=64, do_sample=False)
readSummary(summary3)

 Layer-1s that have healthy technological foundations and continually attract native projects will have a good chance of surviving across the bulls and bears phases of the market . The relatively new public chain Solana, which emerged in the last bull market, currently has a bear market FDV of less than 20 billion US dollars
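
The three manual rounds above follow the same pattern: summarize, re-block, summarize again. A minimal sketch of that loop is shown below. The function name summarize_until_short and its parameters are hypothetical. The summarize argument stands in for the pipeline call and reblock for reSummarize, so the loop itself can be tried without loading a model.

```python
def summarize_until_short(blocks, summarize, reblock, max_rounds=3):
    """Summarize repeatedly until one summary remains or max_rounds is reached.

    summarize: callable taking a list of strings and returning a list of
        {'summary_text': ...} dicts (the shape printed by readSummary above).
    reblock: callable turning that list back into a list of text blocks.
    """
    summary = summarize(blocks)
    rounds = 1
    while len(summary) > 1 and rounds < max_rounds:
        summary = summarize(reblock(summary))
        rounds += 1
    return summary


# Stand-in callables for illustration only (no model involved):
fake_summarize = lambda bs: [{'summary_text': b.split(' ')[0]} for b in bs]
fake_reblock = lambda s: [' '.join(d['summary_text'] for d in s)]

print(summarize_until_short(["a b", "c d", "e f"], fake_summarize, fake_reblock))
# [{'summary_text': 'a'}]
```

In the notebook, summarize could be lambda b: summarizer(b, min_length=16, max_length=128, do_sample=False) and reblock could be reSummarize. Unlike the manual third round, this sketch keeps the same max_length in every round.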
