## Overview

Objective:
* Build a stock news summarization model (pre-trained Transformer-based) focusing on financial topics.

Motivation:
* How does a pre-trained Transformer-based summarization model work?

Dataset:
* News articles web-scraped from Yahoo! Finance.

Sources:
* https://huggingface.co/transformers/model_doc/pegasus.html
* https://huggingface.co/human-centered-summarization/financial-summarization-pegasus
* T. Passali, A. Gidiotis, E. Chatzikyriakidis and G. Tsoumakas. 2021. Towards Human-Centered Summarization: A Case Study on Financial News. In Proceedings of the First Workshop on Bridging Human-Computer Interaction and Natural Language Processing (pp. 21–27). Association for Computational Linguistics.

Please use Google Colab for more convenient navigation between the sections of this notebook (through the Table of Contents).

## Google Colab

<td>
<a target="_blank" href="https://colab.research.google.com/github/amdhiqal/ML/blob/main/Text%20Analytics/Stocks%20News%20Scraping%20and%20its%20Sentiment/1.%20Stock_News_Scraping_and_its_Sentiment.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
</td>

## Package Installations

In [None]:
!pip3 install transformers
!pip3 install sentencepiece

Collecting transformers
  Downloading transformers-4.11.0-py3-none-any.whl (2.9 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Installing collected packages: tokenizers, sacremoses, pyyaml, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13

## Imports

In [None]:
import requests
import re
import pandas as pd

from transformers import PegasusTokenizer, PegasusForConditionalGeneration, pipeline
from bs4 import BeautifulSoup

## Data Collection

In [None]:
url = "https://finance.yahoo.com/news/u-retail-industry-seeks-90-152903788.html"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
paragraphs = soup.find_all('p')

Checking the response.

In [None]:
r

<Response [200]>

In [None]:
# r.text

In [None]:
# paragraphs

Checking the paragraphs.

In [None]:
paragraphs[0:8]

[<p>By David Shepardson</p>,
 <p>WASHINGTON (Reuters) - Two major U.S. retail industry groups on Tuesday asked the Biden administration for at least 90 days before imposing new rules that will require employees at larger firms to be vaccinated against COVID-19 or submit to regular testing.</p>,
 <p>On Sept. 9, the White House said the Occupational Safety and Health Administration (OSHA) is developing an emergency temporary standard that will require all employers with 100 or more employees to ensure their workforce is fully vaccinated, or require any workers who remain unvaccinated to produce a negative COVID-19 test once a week.</p>,
 <p>The White House has said those rules will apply to more than 80 million private sector employees.</p>,
 <p>The Retail Industry Leaders Association and the National Retail Federation strongly encouraged OSHA "to provide a 90-day implementation timeline to allow retailers and other employers to create the systems necessary."</p>,
 <p>The retail groups, 

In [None]:
text = [paragraph.text for paragraph in paragraphs]
words = ' '.join(text).split(' ')[:400]
article_ = ' '.join(words)
df_article =  pd.DataFrame([article_], columns = ['article'])
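The cell above caps the article at 400 space-separated words before it reaches the model. A minimal sketch of that step as a reusable helper follows. The name `truncate_words` is hypothetical, and it splits on any whitespace rather than on single spaces, so its word count can differ slightly from the original `split(' ')`.

```python
def truncate_words(text, max_words=400):
    """Return the first `max_words` whitespace-separated words of `text`."""
    # str.split() with no argument collapses runs of spaces and newlines,
    # unlike split(' ') in the cell above.
    words = text.split()
    return ' '.join(words[:max_words])


sample = "Two major  U.S. retail\nindustry groups asked for 90 days"
print(truncate_words(sample, max_words=4))
```

A word cap only approximates the model's input limit, which is counted in tokens rather than words.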

In [None]:
article_

'By David Shepardson WASHINGTON (Reuters) - Two major U.S. retail industry groups on Tuesday asked the Biden administration for at least 90 days before imposing new rules that will require employees at larger firms to be vaccinated against COVID-19 or submit to regular testing. On Sept. 9, the White House said the Occupational Safety and Health Administration (OSHA) is developing an emergency temporary standard that will require all employers with 100 or more employees to ensure their workforce is fully vaccinated, or require any workers who remain unvaccinated to produce a negative COVID-19 test once a week. The White House has said those rules will apply to more than 80 million private sector employees. The Retail Industry Leaders Association and the National Retail Federation strongly encouraged OSHA "to provide a 90-day implementation timeline to allow retailers and other employers to create the systems necessary." The retail groups, which represent companies including Walmart, CVS

In [None]:
df_article

Unnamed: 0,article
0,By David Shepardson WASHINGTON (Reuters) - Two...


## Model


The model used here is based on PEGASUS and fine-tuned on a financial news dataset covering stocks, markets, currencies, rates, and cryptocurrencies.

In [None]:
model_name = "human-centered-summarization/financial-summarization-pegasus"
# model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)


In [None]:
input_ids = tokenizer.encode(article_, return_tensors='pt')
# input_ids = tokenizer(article_, return_tensors="pt").input_ids

output = model.generate(input_ids, 
                        max_length=55, 
                        num_beams=5, 
                        early_stopping=True)

summarization = tokenizer.decode(output[0], skip_special_tokens=True)
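`num_beams=5` makes `generate` use beam search: instead of committing to the single most likely next token at each step, it keeps the five best partial summaries and extends them all. The toy sketch below illustrates the idea on a hand-written probability table. The table, the tokens, and the function `beam_search` are hypothetical, and this is not how the transformers library implements it (for instance, there is no length penalty here).

```python
import math

# Hypothetical next-token probabilities, keyed by the previous token
NEXT = {
    '<s>': {'stocks': 0.6, 'shares': 0.4},
    'stocks': {'fall': 0.55, 'rise': 0.45},
    'shares': {'rise': 0.9, 'fall': 0.1},
    'rise': {'</s>': 1.0},
    'fall': {'</s>': 1.0},
}


def beam_search(num_beams, max_length):
    """Return the highest-scoring token sequence found with `num_beams` beams."""
    beams = [(0.0, ['<s>'])]  # (log-probability, tokens so far)
    for _ in range(max_length):
        candidates = []
        for score, tokens in beams:
            if tokens[-1] == '</s>':
                candidates.append((score, tokens))  # finished, carry over
                continue
            for token, prob in NEXT[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # Keep only the `num_beams` best partial sequences
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:num_beams]
    return beams[0][1]


print(beam_search(num_beams=1, max_length=3))  # greedy decoding
print(beam_search(num_beams=2, max_length=3))
```

With one beam the search commits to `stocks` first and ends on a sequence with probability 0.33. With two beams it also keeps `shares` alive and finds the `shares rise` sequence with probability 0.36.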

In [None]:
summarization

'Groups say as many as 4 million workers may need to be vaccinated. British Soft Drinks Association says manufacturers have ‘only a few days’ left'

In [None]:
def summarize_an_article(article_):
  '''
  Summarize a single article with the loaded tokenizer and model
  '''
  input_ids = tokenizer.encode(article_, return_tensors='pt')
  # input_ids = tokenizer(article_, return_tensors="pt").input_ids

  # Beam search with 5 beams; the summary is capped at 55 tokens
  output = model.generate(input_ids, 
                          max_length=55, 
                          num_beams=5, 
                          early_stopping=True)

  summarization = tokenizer.decode(output[0], skip_special_tokens=True)

  return summarization

### Summarization testing

In [None]:
summarize_an_article(article_)

'Groups say as many as 4 million workers may need to be vaccinated. British Soft Drinks Association says manufacturers have ‘only a few days’ left'

### Summarization testing (in a DataFrame)

In [None]:
df_article['summarize'] = df_article['article'].apply(summarize_an_article)

In [None]:
with pd.option_context('display.max_colwidth', None):
  display(df_article)

Unnamed: 0,article,summarize
0,"By David Shepardson WASHINGTON (Reuters) - Two major U.S. retail industry groups on Tuesday asked the Biden administration for at least 90 days before imposing new rules that will require employees at larger firms to be vaccinated against COVID-19 or submit to regular testing. On Sept. 9, the White House said the Occupational Safety and Health Administration (OSHA) is developing an emergency temporary standard that will require all employers with 100 or more employees to ensure their workforce is fully vaccinated, or require any workers who remain unvaccinated to produce a negative COVID-19 test once a week. The White House has said those rules will apply to more than 80 million private sector employees. The Retail Industry Leaders Association and the National Retail Federation strongly encouraged OSHA ""to provide a 90-day implementation timeline to allow retailers and other employers to create the systems necessary."" The retail groups, which represent companies including Walmart, CVS Best Buy, Target, Kroger and Home Depot, asked how the administration will ensure adequate COVID-19 testing capacity to satisfy the ""significant increase in demand."" The groups said ""there could be as many as 4 million retail workers who may need to be tested on a weekly basis."" They also asked other detailed questions like ""what remedial actions can be taken in situations in which employees refuse vaccinations and testing?"" U.S. Commerce Secretary Gina Raimondo told travel executives last week that the OSHA order is expected in ""a matter of weeks. ... We have been told in October."" (Reporting by David Shepardson; Editing by Andrea Ricci) (Bloomberg) -- The British Soft Drinks Association said manufacturers have “only a few days” of carbon dioxide left in reserve to produce beverages and can’t import supplies from the European Union due to Brexit. 
Most Read from BloombergThe Global Housing Market Is Broken, and It’s Dividing Entire CountriesMerkel’s Legacy Comes to Life on Berlin’s ‘Arab Street’Is There Room for E-Scooters in New York City?Amazon, Microsoft Swoop In on $24 Billion India Farm-Data TrovePalm Oil Giant’s Industry-Be Shares of Google parent Alphabet rose slightly Tuesday after the tech giant unveiled plans to purchase a $2.1 billion office building in Manhattan. Google already leases the 1.3 million square-foot-building located on Manhattan’s bustling West Side, known as St. John’s Terminal. The company has the option to purchase the building, which it plans to exercise by the first quarter of 2022, said",Groups say as many as 4 million workers may need to be vaccinated. British Soft Drinks Association says manufacturers have ‘only a few days’ left
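The article text above runs past the Reuters story into a Bloomberg piece and an Alphabet item, and the generated summary mixes two of them, because `soup.find_all('p')` returns every paragraph on the page. One possible remedy is to keep only the paragraphs inside the main story's container. The sketch below assumes, hypothetically, that the story sits in the first `<article>` element (the real Yahoo! Finance markup is not shown here and may differ). It uses the standard library's `html.parser` rather than BeautifulSoup only so that it runs without extra packages.

```python
from html.parser import HTMLParser


class ArticleParagraphs(HTMLParser):
    """Collect the text of <p> tags that sit inside the first <article> tag."""

    def __init__(self):
        super().__init__()
        self.article_depth = 0   # > 0 while inside the first <article>
        self.done = False        # True once the first <article> has closed
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'article' and not self.done:
            self.article_depth += 1
        elif tag == 'p' and self.article_depth > 0:
            self.in_p = True
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'article' and self.article_depth > 0:
            self.article_depth -= 1
            self.done = self.article_depth == 0
        elif tag == 'p':
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.paragraphs[-1] += data


# Hypothetical page with the main story followed by an unrelated one
html = (
    "<article><p>Retail groups asked for 90 days.</p>"
    "<p>The rules cover 80 million workers.</p></article>"
    "<article><p>Unrelated story about soft drinks.</p></article>"
)
parser = ArticleParagraphs()
parser.feed(html)
print(parser.paragraphs)
```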


## News and Sentiments

### Raw data

Selecting a few stocks of interest (daily).

In [None]:
stock_list = ['MSFT', 'AMD']

In [None]:
def url_finder_stock_news(stock):
  '''
  Scrape the links from a Google News search for the given stock
  '''
  url_finder = "https://www.google.com/search?q=yahoo+finance+{}&tbm=nws".format(stock)
  r = requests.get(url_finder)
  soup = BeautifulSoup(r.text, 'html.parser')
  # href=True skips <a> tags without an href, which would raise a KeyError below
  tags_a = soup.find_all('a', href=True)
  hrefs = [link['href'] for link in tags_a]

  return hrefs
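`url_finder_stock_news` builds the search URL with `str.format`, which works for plain tickers such as `MSFT` but does not escape symbols like `^` or `&`. A minimal sketch of the same query built with `urllib.parse.urlencode` follows; `build_news_search_url` is a hypothetical helper, not part of the notebook.

```python
from urllib.parse import urlencode


def build_news_search_url(stock):
    """Build the Google News search URL for 'yahoo finance <stock>'."""
    # urlencode turns spaces into '+' and percent-encodes unsafe characters
    params = {'q': 'yahoo finance {}'.format(stock), 'tbm': 'nws'}
    return 'https://www.google.com/search?' + urlencode(params)


print(build_news_search_url('MSFT'))
print(build_news_search_url('^GSPC'))
```

For a plain ticker the result matches the URL produced by the original function.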

In [None]:
raw_urls = {stock : url_finder_stock_news(stock) for stock in stock_list}

All raw URLs (for every stock of interest).

In [None]:
raw_urls

{'AMD': ['/?sa=X&ved=0ahUKEwiVyu2EwqTzAhXSXc0KHfpiDYcQOwgC',
  '/?output=search&ie=UTF-8&tbm=nws&sa=X&ved=0ahUKEwiVyu2EwqTzAhXSXc0KHfpiDYcQPAgE',
  '/search?q=yahoo+finance+AMD&tbm=nws&ie=UTF-8&gbv=1&sei=jYhUYdXPB9K7tQb6xbW4CA',
  '/search?q=yahoo+finance+AMD&ie=UTF-8&source=lnms&sa=X&ved=0ahUKEwiVyu2EwqTzAhXSXc0KHfpiDYcQ_AUIBygA',
  '/search?q=yahoo+finance+AMD&ie=UTF-8&tbm=shop&source=lnms&sa=X&ved=0ahUKEwiVyu2EwqTzAhXSXc0KHfpiDYcQ_AUICSgC',
  '/search?q=yahoo+finance+AMD&ie=UTF-8&tbm=vid&source=lnms&sa=X&ved=0ahUKEwiVyu2EwqTzAhXSXc0KHfpiDYcQ_AUICigD',
  '/search?q=yahoo+finance+AMD&ie=UTF-8&tbm=isch&source=lnms&sa=X&ved=0ahUKEwiVyu2EwqTzAhXSXc0KHfpiDYcQ_AUICygE',
  'https://maps.google.com/maps?q=yahoo+finance+AMD&um=1&ie=UTF-8&sa=X&ved=0ahUKEwiVyu2EwqTzAhXSXc0KHfpiDYcQ_AUIDCgF',
  '/search?q=yahoo+finance+AMD&ie=UTF-8&tbm=bks&source=lnms&sa=X&ved=0ahUKEwiVyu2EwqTzAhXSXc0KHfpiDYcQ_AUIDSgG',
  '/advanced_search',
  '/search?q=yahoo+finance+AMD&ie=UTF-8&tbm=nws&source=lnt&tbs=qdr:h&sa

Raw URLs for 'MSFT' only.

In [None]:
raw_urls['MSFT'][:5]

['/?sa=X&ved=0ahUKEwjuy9OEwqTzAhXSQc0KHTQyA_wQOwgC',
 '/?output=search&ie=UTF-8&tbm=nws&sa=X&ved=0ahUKEwjuy9OEwqTzAhXSQc0KHTQyA_wQPAgE',
 '/search?q=yahoo+finance+MSFT&tbm=nws&ie=UTF-8&gbv=1&sei=jIhUYe7VKtKDtQa05IzgDw',
 '/search?q=yahoo+finance+MSFT&ie=UTF-8&source=lnms&sa=X&ved=0ahUKEwjuy9OEwqTzAhXSQc0KHTQyA_wQ_AUIBygA',
 '/search?q=yahoo+finance+MSFT&ie=UTF-8&tbm=shop&source=lnms&sa=X&ved=0ahUKEwjuy9OEwqTzAhXSQc0KHTQyA_wQ_AUICSgC']

### Preprocessing

Filtering out unrelated URLs.

In [None]:
filter_out_list = ['account', 'support', 'preferences', 'policies', 'maps']

In [None]:
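def filtering_out_urls(urls, filter_out_list):
  '''
  Keep only https links that contain none of the filtered words
  '''

  val = []

  for url in urls:

    if 'https://' in url and not any(filtered_word in url for filtered_word in filter_out_list):

      # Take the embedded URL and drop everything after the first '&'
      re_ = re.findall(r'(https?://\S+)', url)[0].split('&')[0]
      val.append(re_)

  # set() removes duplicates, so the order of the result is not guaranteed
  return list(set(val))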
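The regex-and-split logic above suggests that the article links arrive wrapped in redirect-style hrefs of the form `/url?q=<target>&sa=...`, although the raw output shown earlier is cut off before any such link appears. Under that assumption, the target can also be read with `urllib.parse`. The sketch below is a hypothetical variant, not the notebook's method: `extract_news_urls` and the sample hrefs are made up for illustration, and unlike the original it ignores links that do not carry a `q` target.

```python
from urllib.parse import urlsplit, parse_qs


def extract_news_urls(hrefs, blocklist):
    """Return unique article URLs from search-result hrefs, in first-seen order."""
    found = {}
    for href in hrefs:
        query = parse_qs(urlsplit(href).query)
        # Redirect-style links carry the target in the `q` parameter
        target = query.get('q', [''])[0]
        if not target.startswith('https://'):
            continue
        if any(word in target for word in blocklist):
            continue
        found[target] = None  # dict keys keep insertion order
    return list(found)


# Hypothetical hrefs in the assumed format
hrefs = [
    '/search?q=yahoo+finance+AMD&tbm=nws',
    '/url?q=https://finance.yahoo.com/news/why-amd-shares-falling-171753939.html&sa=U&ved=abc',
    '/url?q=https://finance.yahoo.com/news/why-amd-shares-falling-171753939.html&sa=U&ved=def',
    '/url?q=https://support.google.com/websearch&sa=U',
]
print(extract_news_urls(hrefs, ['account', 'support', 'preferences', 'policies', 'maps']))
```

Using a dict instead of `set` keeps the URLs in the order they were found, which makes repeated runs easier to compare.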

Cleaned URLs.

In [None]:
cleaned_urls = {stock : filtering_out_urls(raw_urls[stock], filter_out_list) for stock in stock_list}
cleaned_urls

{'AMD': ['https://finance.yahoo.com/news/amd-captures-historic-best-16-181700779.html',
  'https://finance.yahoo.com/news/why-amd-shares-falling-171753939.html',
  'https://finance.yahoo.com/news/synamedia-delivers-industry-first-zero-130100869.html',
  'https://finance.yahoo.com/news/bull-day-amd-amd-110011826.html',
  'https://finance.yahoo.com/news/russia-fines-google-failing-delete-140214969.html',
  'https://finance.yahoo.com/news/daniel-patrick-gibson-sylebra-capital-172113336.html',
  'https://finance.yahoo.com/news/advanced-micro-devices-inc-nasdaq-185204606.html',
  'https://finance.yahoo.com/news/where-hedge-funds-stand-advanced-140015738.html',
  'https://finance.yahoo.com/news/nigeria-become-first-country-africa-114543801.html',
  'https://finance.yahoo.com/news/canada-stocks-toronto-market-rebounds-141240603.html'],
 'MSFT': ['https://finance.yahoo.com/news/why-10-stocks-were-spotlight-183007081.html',
  'https://finance.yahoo.com/news/hedge-funds-think-microsoft-corporati

### News scraping

Scraping news for the stocks of interest.

In [None]:
def scraping_(urls):
  '''
  Scrape the text of each news article, capped at 350 words
  '''

  articles_ = []

  for url in urls:

    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    paragraphs = soup.find_all('p')
    text = [paragraph.text for paragraph in paragraphs]
    words = ' '.join(text).split(' ')[:350]
    article_ = ' '.join(words)
    articles_.append(article_)

  return articles_

In [None]:
articles_ = {stock : scraping_(cleaned_urls[stock]) for stock in stock_list}
articles_

{'AMD': ['Another 351,000 individuals filed, 320,000 was estimated LONDON, September 23, 2021--(BUSINESS WIRE)--Third paragraph, third sentence and fourth paragraph, first sentence of release dated September 22, 2021, should read: "$21.5 billion" and "$92 billion" (instead of "$21.5 million" and "$92 million"). The updated release reads: AMD CAPTURES HISTORIC-BEST 16% OF SERVER CPU MARKET ACCORDING TO OMDIA DATA CENTER SERVER MARKET TRACKER Competition in the market for high-performance semiconductors targeting data center workloads is red hot according to the latest Data Center Server Market Tracker from research group Omdia. In the server CPU market, AMD scored its best-ever quarter from a market share and sales perspective with demand from hyperscale cloud service providers, and Google in particular, being a big contributing factor to AMD’s strong performance. The demand for servers across all market segments remained strong in the second quarter of 2021 amidst concerns about order 

### Summarization

In [None]:
def summarize(articles_):
  '''
  Summarizing articles
  '''
  summaries = []

  for article in articles_:

    input_ids = tokenizer.encode(article, return_tensors = 'pt')
    output = model.generate(input_ids, 
                            max_length = 55,
                            num_beams = 5,
                            early_stopping = True)
    summary = tokenizer.decode(output[0], skip_special_tokens = True)
    summaries.append(summary)

  return summaries

In [None]:
summarizations = {stock : summarize(articles_[stock]) for stock in stock_list}
summarizations

{'AMD': ['Competition in the market for high-performance semiconductors red hot',
  'The 10-year Treasury yield hit an intraday high of 1.567% Tuesday. Pfizer, BioNTech, Moderna all fall on Tuesday',
  "Synamedia's platform powered by 3rd Gen AMD EPYCTM processors achieves breakthrough 8K quality.",
  'We are aware of the issue and are working to resolve it.',
  'Moscow court fines Google for not deleting content. Lucid Motors to make first vehicle deliveries next month',
  'Chegg, DocuSign, Pinduoduoduo among top 5 stocks in Sylebra Capital’s portfolio.',
  'Xilinx acquisition still in the spotlight amid Chinese regulatory pressures.',
  "AMD was in 63 of the hedge funds' portfolios at the end of June.",
  'Central bank’s eNaira website has gone live ahead of schedule. Central bank says digital currency will ‘cultivate economic growth’.',
  'Consumer staple, industrial stocks lead gains. Technology stocks have come under pressure as fears of a slowing economic recovery'],
 'MSFT': ['I

### Sentiment

In [None]:
from transformers import pipeline

In [None]:
sentiment = pipeline('sentiment-analysis')

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)



## Summarization and its Sentiment

In [None]:
def summarization_and_sentiment(stock = stock_list[0]):
  """
  Print the sentiment and the summary of each article for the selected stock
  """
  for i, x in zip(sentiment(summarizations[stock]), summarizations[stock]):
    print(i, x)

In [None]:
summarization_and_sentiment('AMD')


{'label': 'NEGATIVE', 'score': 0.9542626738548279} Competition in the market for high-performance semiconductors red hot
{'label': 'NEGATIVE', 'score': 0.9761536717414856} The 10-year Treasury yield hit an intraday high of 1.567% Tuesday. Pfizer, BioNTech, Moderna all fall on Tuesday
{'label': 'POSITIVE', 'score': 0.9994142055511475} Synamedia's platform powered by 3rd Gen AMD EPYCTM processors achieves breakthrough 8K quality.
{'label': 'POSITIVE', 'score': 0.9979088306427002} We are aware of the issue and are working to resolve it.
{'label': 'NEGATIVE', 'score': 0.964374840259552} Moscow court fines Google for not deleting content. Lucid Motors to make first vehicle deliveries next month
{'label': 'POSITIVE', 'score': 0.9511938691139221} Chegg, DocuSign, Pinduoduoduo among top 5 stocks in Sylebra Capital’s portfolio.
{'label': 'POSITIVE', 'score': 0.9628427028656006} Xilinx acquisition still in the spotlight amid Chinese regulatory pressures.
{'label': 'NEGATIVE', 'score': 0.59194016
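Each result above is a dict with a `label` and a `score`, so the per-article labels can be folded into one rough number per stock. The sketch below shows one way to do that, assuming the two labels `POSITIVE` and `NEGATIVE` seen in the output; `signed_score` and `average_sentiment` are hypothetical helpers. Note that the default pipeline model was fine-tuned on SST-2, a movie-review dataset, so its labels may not match financial sentiment (the "red hot" competition headline is tagged `NEGATIVE`, for example).

```python
def signed_score(result):
    """Map a pipeline result dict to a score in [-1, 1]."""
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']


def average_sentiment(results):
    """Average the signed scores; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(signed_score(r) for r in results) / len(results)


# Two results in the format printed above (scores rounded for readability)
results = [
    {'label': 'NEGATIVE', 'score': 0.95},
    {'label': 'POSITIVE', 'score': 0.99},
]
print(average_sentiment(results))
```

A value near zero means the positive and negative articles roughly cancel out; it says nothing about how reliable each individual label is.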

## References & Credits

1. https://huggingface.co/transformers/model_doc/pegasus.html
2. https://huggingface.co/human-centered-summarization/financial-summarization-pegasus
3. T. Passali, A. Gidiotis, E. Chatzikyriakidis and G. Tsoumakas. 2021. Towards Human-Centered Summarization: A Case Study on Financial News. In Proceedings of the First Workshop on Bridging Human-Computer Interaction and Natural Language Processing (pp. 21–27). Association for Computational Linguistics.
4. https://github.com/nicknochnack