<a href="https://colab.research.google.com/github/abhi161/Stock_news_analysis_hugging_face/blob/master/Stock_news_analysis_hugging_face.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### **Problem Statement**:- Forecasting the stock market has always been a challenging task for business analysts and researchers. Stock prices are strongly affected by economic, social, and political factors, some internal to a company and some external. News can therefore move an individual stock or the market as a whole, creating positive or negative sentiment.


#### **Application**:- This project aims to give insight into the stock market and individual stocks using financial news and analyst opinions from Yahoo Finance. It also reports the sentiment around a particular stock, which analysts and financial institutions can use as a starting point for further fundamental and technical analysis.

#### **Approach**:- This project uses ***PEGASUS***, a state-of-the-art summarization model, to summarize news articles scraped from individual websites with Beautiful Soup. Sentiment analysis is then run on the summaries using the **Hugging Face** pipeline.

# Downloading dependencies


In [1]:
!pip install sentencepiece
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.23.1-py3-none-any.whl (5.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.10.1-py3-none-any.whl (163 kB)
Install

# Importing Libraries

In [None]:
from transformers import PegasusTokenizer, PegasusForConditionalGeneration, TFPegasusForConditionalGeneration
from bs4 import BeautifulSoup
import requests

In [None]:
# Load the model and tokenizer

model_name = "human-centered-summarization/financial-summarization-pegasus"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)  # For the TensorFlow model, use TFPegasusForConditionalGeneration instead



### Scraping stock news from a single link

In [None]:
URL = "https://finance.yahoo.com/news/tesla-earnings-theres-a-method-to-the-madness-analyst-says-174145962.html?guccounter=1&guce_referrer=aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8&guce_referrer_sig=AQAAAEujvknxe5hh7wRs4ahNbgSWy_nExFxYIhAjdEJT6TsUUo3utZNZLhVg8WBSrUsjALn-GBF6CiA8Xrh5xYj3HwWjNtfjbfB-RZyS8X1M39hnLHuvTSrSSU330MGGDbYFGdYdZgEoO_RWEjGaBPcIXvKLtlFsvHvcxgndQfL6o0JZ"
r = requests.get(URL)
soup = BeautifulSoup(r.text,'html.parser')
article = soup.find_all('p')

In [None]:
article = [text.text for text in article]
article_words = ' '.join(article).split(' ')[:400]
sentence = ' '.join(article_words)
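
The cell above keeps only the first 400 space-separated words of the article before passing it to the summarizer. A minimal sketch of the same idea as a reusable helper follows; `first_n_words` is a hypothetical name, not part of the notebook, and it uses `split()` with no argument so that newlines and repeated spaces inside the scraped paragraphs do not produce empty "words".

```python
def first_n_words(paragraphs, limit=400):
    """Join paragraph strings and keep at most `limit` whitespace-separated words.

    `first_n_words` is a hypothetical helper name, not part of the notebook.
    """
    # split() with no argument collapses runs of spaces, tabs and newlines
    words = " ".join(paragraphs).split()
    return " ".join(words[:limit])


sample = ["Tesla reported  an earnings beat.", "Revenue\nmissed estimates."]
print(first_n_words(sample, limit=5))
```

The word count is only a rough proxy for the number of model tokens, so a limit chosen this way may still need tuning.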

In [None]:
sentence

"There’s a tug of war between Tesla (TSLA) bulls and bears on Wall Street. Tesla reported an earnings beat for Q3, but a revenue miss, reflecting a concern among some that demand might be coming down a bit for its premium electric vehicles. Tesla bulls contend that CEO Elon Musk said challenges to physical delivery of vehicles to customers was a main bottleneck — and why the number of cars produced in Q3 was significantly higher than deliveries. “In fact, we're just fundamentally running out of — there weren't enough boats, there weren't enough trains, there weren't enough car carriers to actually support the wave because it got too big,” Musk said on the earnings call. “So, whether we like it or not, we actually have to smooth out the delivery of cars intra-quarter because there aren't just enough transportation objects to move them around.” At the same time, Musk warned about demand issues stemming from China’s property market headwinds, Europe in the midst of an energy-driven recess

In [None]:
input_ids = tokenizer.encode(sentence, return_tensors='pt', truncation=True)  # truncate inputs longer than the tokenizer's maximum length
output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)

summary = tokenizer.decode(output[0], skip_special_tokens=True)

In [None]:
summary

'Canaccord Genuity expects Tesla to grow significantly in Q4. ‘There is a bit of doublespeak,’ Canaccord’s Gianarikas says'

## Scraping every link related to particular stocks




In [None]:
stocks = ['tesla', 'GME', 'BTC']

In [None]:
# Function to extract all links from a Google News search for a stock

def extract_stock_urls(stock):
  URL = "https://www.google.com/search?q=yahoo+finance+{}&hl=en&biw=1366&bih=635&tbm=nws".format(stock)

  r = requests.get(URL)
  soup = BeautifulSoup(r.text, 'html.parser')
  atags = soup.find_all('a', href=True)  # skip anchors that have no href attribute
  hrefs = [link['href'] for link in atags]
  return hrefs


In [None]:
stock_links = {stock:extract_stock_urls(stock) for stock in stocks}
stock_links

{'tesla': ['/?sa=X&ved=0ahUKEwi0rbq_gvP6AhU7ELcAHfMnCkAQOwgC',
  '/search?q=yahoo+finance+tesla&tbm=nws&hl=en&biw=1366&bih=635&ie=UTF-8&gbv=1&sei=NXRTY_S-A7ug3LUP88-ogAQ',
  '/search?q=yahoo+finance+tesla&hl=en&biw=1366&bih=635&ie=UTF-8&source=lnms&sa=X&ved=0ahUKEwi0rbq_gvP6AhU7ELcAHfMnCkAQ_AUIBSgA',
  '/search?q=yahoo+finance+tesla&hl=en&biw=1366&bih=635&ie=UTF-8&tbm=vid&source=lnms&sa=X&ved=0ahUKEwi0rbq_gvP6AhU7ELcAHfMnCkAQ_AUIBygC',
  '/search?q=yahoo+finance+tesla&hl=en&biw=1366&bih=635&ie=UTF-8&tbm=isch&source=lnms&sa=X&ved=0ahUKEwi0rbq_gvP6AhU7ELcAHfMnCkAQ_AUICCgD',
  'https://maps.google.com/maps?q=yahoo+finance+tesla&hl=en&biw=1366&bih=635&um=1&ie=UTF-8&sa=X&ved=0ahUKEwi0rbq_gvP6AhU7ELcAHfMnCkAQ_AUICSgE',
  '/search?q=yahoo+finance+tesla&hl=en&biw=1366&bih=635&ie=UTF-8&tbm=shop&source=lnms&sa=X&ved=0ahUKEwi0rbq_gvP6AhU7ELcAHfMnCkAQ_AUICigF',
  '/search?q=yahoo+finance+tesla&hl=en&biw=1366&bih=635&ie=UTF-8&tbm=bks&source=lnms&sa=X&ved=0ahUKEwi0rbq_gvP6AhU7ELcAHfMnCkAQ_AUICygG',


## Stripping out bad URLs

In [None]:
import re

exclude_list =['policies', 'preferences','support', 'maps','accounts']



In [None]:

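def strip_unwanted_URL(urls, exclude_list):
  val = []

  for url in urls:
    # Keep only hrefs that embed an https link and contain none of the excluded words
    if 'https://' in url and not any(exclude_word in url for exclude_word in exclude_list):
      # Extract the embedded URL and drop the parameters that follow the first '&'
      r = re.findall(r'https?://\S+', url)[0].split('&')[0]
      val.append(r)

  return list(set(val))  # de-duplicate; note that set() does not preserve order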
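
The regex-and-split approach above relies on each result link embedding its target URL inside the href. Assuming the hrefs have the shape `/url?q=<target>&sa=...` (an assumption about Google's basic HTML results, not something the notebook output confirms), the target can also be read from the query string with `urllib.parse`. The names `extract_target` and `clean_urls` are hypothetical, and unlike the function above this sketch checks the excluded words against the target only and keeps first-seen order.

```python
from urllib.parse import urlparse, parse_qs

EXCLUDE = ["policies", "preferences", "support", "maps", "accounts"]


def extract_target(href):
    """Return the wrapped target from an href like '/url?q=https://...&sa=U', else None."""
    query = parse_qs(urlparse(href).query)
    target = query.get("q", [None])[0]
    if target and target.startswith("https://"):
        return target
    return None


def clean_urls(hrefs, exclude=EXCLUDE):
    """Keep unique target URLs, in first-seen order, that contain no excluded word."""
    seen = {}
    for href in hrefs:
        target = extract_target(href)
        if target and not any(word in target for word in exclude):
            seen.setdefault(target, None)  # dicts preserve insertion order
    return list(seen)


hrefs = [
    "/search?q=yahoo+finance+tesla&tbm=nws",
    "/url?q=https://finance.yahoo.com/news/a.html&sa=U&ved=xyz",
    "/url?q=https://finance.yahoo.com/news/a.html&sa=U&ved=abc",
    "/url?q=https://support.google.com/websearch&sa=U",
]
print(clean_urls(hrefs))
```

Keeping the order stable makes reruns easier to compare, since `list(set(...))` may return the URLs in a different order each session.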




In [None]:
good_urls ={stock:strip_unwanted_URL(stock_links[stock],exclude_list) for stock in stocks}
good_urls

{'tesla': ['https://finance.yahoo.com/news/tesla-valuation-4-trillion-stretch-174255913.html',
  'https://finance.yahoo.com/news/elon-musk-says-tesla-bigger-201312002.html',
  'https://finance.yahoo.com/video/tesla-expectations-were-too-high-203206046.html',
  'https://finance.yahoo.com/news/tesla-sinks-50-november-record-202622740.html',
  'https://finance.yahoo.com/video/tesla-stock-falls-ev-maker-154729630.html',
  'https://finance.yahoo.com/video/u-government-considers-cfius-review-153336006.html',
  'https://finance.yahoo.com/news/tesla-q-2-earnings-124821562.html',
  'https://finance.yahoo.com/video/tesla-stock-continues-slide-following-200255633.html',
  'https://finance.yahoo.com/news/tesla-earnings-theres-a-method-to-the-madness-analyst-says-174145962.html',
  'https://finance.yahoo.com/news/musk-says-excited-twitter-deal-233626009.html'],
 'GME': ['https://www.marketwatch.com/story/a-tesla-stock-plunge-could-destroy-zombie-stocks-such-as-gamestop-and-peloton-warns-equity-rese

### Getting data from each individual link

In [None]:
def scrap_data_from_urls(urls):
  paragraphs = []
  for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    paragraph = soup.find_all('p')
    article = [text.text for text in paragraph]
    # Keep the first 350 words to limit the input length passed to the summarizer
    article_words = ' '.join(article).split(' ')[:350]
    sentence = ' '.join(article_words)
    paragraphs.append(sentence)

  return paragraphs

In [None]:
data = {stock:scrap_data_from_urls(good_urls[stock]) for stock in stocks}
data

{'tesla': ['Elon Musk\'s latest lofty prediction for Tesla (TSLA) looks pie in the sky, even by his standards. "I see a potential path to be worth more than Apple and Saudi Aramco combined," Musk proudly proclaimed on the company\'s earnings call on Wednesday. Doing the math, that would put Tesla\'s worth at about $4 trillion at some point. Tesla\'s current market cap is $652 billion, according to Yahoo Finance data. Analysts say that valuation may not happen for eons, if at all. "That seems quite a bit of a stretch," Colin Langan, equity analyst at Wells Fargo, said on Yahoo Finance Live (video above). "You would have to give them full credit for all of these factors that I consider more long-term optionality issues. So things like whether you can get true level four self-driving, whether there is some value in the Optimus bot, Dojo, and these future projects. I think from a pure automaker side, that [valuation] is going to be extremely difficult to do." Tesla\'s path toward Musk\'s n

In [None]:
len(data['tesla'])

10

### Creating Summaries

In [None]:

def summary(articles):
  summaries = []
  for article in articles:
    input_ids = tokenizer.encode(article, return_tensors='pt', truncation=True)  # truncate over-long inputs
    output = model.generate(input_ids, max_length=55, num_beams=5, early_stopping=True)
    text = tokenizer.decode(output[0], skip_special_tokens=True)  # decoded summary for this article
    summaries.append(text)

  return summaries



In [None]:
final_summaries = {stock: summary(data[stock]) for stock in stocks}

In [None]:
final_summaries

{'tesla': ['EV maker will miss 50% growth target for this year. Shares fell more than 6% after the company missed revenue and delivery targets',
  'Shares drop as multiple analysts cut price targets. Musk sees electric-car maker becoming bigger than Apple, Aramco',
  'We are aware of the issue and are working to resolve it.',
  'Electric vehicle-maker’s shares close down 7.6% on Friday. Wall Street tumult has hit growth and tech companies hard',
  'We are aware of the issue and are working to resolve it.',
  'We are aware of the issue and are working to resolve it.',
  'Tesla says it still expects 50% average annual growth rate on vehicle deliveries.',
  'We are aware of the issue and are working to resolve it.',
  '‘There is a bit of doublespeak,’ Canaccord Genuity’s Gianarikas says. Musk says physical delivery bottlenecks are ‘running out’',
  "The world's richest person earlier tried to back out of the $44 billion deal. Musk says Tesla could be worth more than Apple, Saudi Aramco"],

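Several of the summaries above are the identical sentence "We are aware of the issue and are working to resolve it.", which suggests those pages returned placeholder text instead of an article (an inference from the output, not something verified here). One possible guard, sketched below, drops any summary that occurs more than once in a stock's list while keeping each remaining summary paired with its URL. `drop_repeated_summaries` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def drop_repeated_summaries(summaries, urls):
    """Return (summary, url) pairs whose summary text occurs exactly once."""
    counts = Counter(summaries)
    return [(s, u) for s, u in zip(summaries, urls) if counts[s] == 1]


summaries = [
    "Shares drop as multiple analysts cut price targets.",
    "We are aware of the issue and are working to resolve it.",
    "We are aware of the issue and are working to resolve it.",
]
urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
print(drop_repeated_summaries(summaries, urls))
```

Filtering before sentiment scoring matters here because the placeholder sentence would otherwise be scored as if it were news.
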
## Sentiment Analysis of all summaries


In [None]:
from transformers import pipeline
sentiment = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



In [None]:
scores = {stock:sentiment(final_summaries[stock]) for stock in stocks}
scores

{'tesla': [{'label': 'NEGATIVE', 'score': 0.9997394680976868},
  {'label': 'NEGATIVE', 'score': 0.9879145622253418},
  {'label': 'POSITIVE', 'score': 0.9979088306427002},
  {'label': 'NEGATIVE', 'score': 0.9895454049110413},
  {'label': 'POSITIVE', 'score': 0.9979088306427002},
  {'label': 'POSITIVE', 'score': 0.9979088306427002},
  {'label': 'NEGATIVE', 'score': 0.968706488609314},
  {'label': 'POSITIVE', 'score': 0.9979088306427002},
  {'label': 'NEGATIVE', 'score': 0.9992863535881042},
  {'label': 'NEGATIVE', 'score': 0.9509149789810181}],
 'GME': [{'label': 'NEGATIVE', 'score': 0.9286150932312012},
  {'label': 'NEGATIVE', 'score': 0.9993438124656677},
  {'label': 'NEGATIVE', 'score': 0.997637152671814},
  {'label': 'NEGATIVE', 'score': 0.9431325197219849},
  {'label': 'NEGATIVE', 'score': 0.9930197596549988},
  {'label': 'NEGATIVE', 'score': 0.9995417594909668},
  {'label': 'NEGATIVE', 'score': 0.9979785084724426},
  {'label': 'NEGATIVE', 'score': 0.9992830157279968},
  {'label': '
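
Per-article labels are hard to compare across stocks at a glance. One possible roll-up, sketched here as an illustration and not as part of the notebook, counts the labels and averages a signed score: `+score` for `POSITIVE` and `-score` for `NEGATIVE`. `aggregate_sentiment` is a hypothetical name.

```python
def aggregate_sentiment(results):
    """Summarize a list of {'label', 'score'} dicts as counts and a signed mean in [-1, 1]."""
    if not results:
        return {"positive": 0, "negative": 0, "signed_mean": 0.0}
    # POSITIVE scores count as positive values, anything else as negative
    signed = [r["score"] if r["label"] == "POSITIVE" else -r["score"] for r in results]
    positive = sum(1 for r in results if r["label"] == "POSITIVE")
    return {
        "positive": positive,
        "negative": len(results) - positive,
        "signed_mean": sum(signed) / len(signed),
    }


example = [
    {"label": "NEGATIVE", "score": 0.9},
    {"label": "POSITIVE", "score": 0.5},
]
print(aggregate_sentiment(example))
```

With the notebook's data this could be applied as `{stock: aggregate_sentiment(scores[stock]) for stock in stocks}`.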

### Exporting results to CSV for further analysis

In [None]:
def export_to_csv(scores, good_urls, final_summaries):
  # Build one row per article: [stock, summary, label, score, url]
  output = []

  for stock in stocks:
    for counter in range(len(final_summaries[stock])):
      doc = [
          stock,
          final_summaries[stock][counter],
          scores[stock][counter]['label'],
          scores[stock][counter]['score'],
          good_urls[stock][counter]
      ]

      output.append(doc)
  return output

In [None]:
final_output = export_to_csv(scores,good_urls,final_summaries)
final_output

[['tesla',
  'EV maker will miss 50% growth target for this year. Shares fell more than 6% after the company missed revenue and delivery targets',
  'NEGATIVE',
  0.9997394680976868,
  'https://finance.yahoo.com/news/tesla-valuation-4-trillion-stretch-174255913.html'],
 ['tesla',
  'Shares drop as multiple analysts cut price targets. Musk sees electric-car maker becoming bigger than Apple, Aramco',
  'NEGATIVE',
  0.9879145622253418,
  'https://finance.yahoo.com/news/elon-musk-says-tesla-bigger-201312002.html'],
 ['tesla',
  'We are aware of the issue and are working to resolve it.',
  'POSITIVE',
  0.9979088306427002,
  'https://finance.yahoo.com/video/tesla-expectations-were-too-high-203206046.html'],
 ['tesla',
  'Electric vehicle-maker’s shares close down 7.6% on Friday. Wall Street tumult has hit growth and tech companies hard',
  'NEGATIVE',
  0.9895454049110413,
  'https://finance.yahoo.com/news/tesla-sinks-50-november-record-202622740.html'],
 ['tesla',
  'We are aware of the i

In [None]:
# Insert the header row (one name per column, including the article URL)
final_output.insert(0, ['stock', 'summary', 'label', 'score', 'url'])

In [None]:
import csv
with open('analysis_stock.csv', mode='w', newline='') as f:
    csv_writer = csv.writer(f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    csv_writer.writerows(final_output)
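
An alternative sketch for the export step uses `csv.DictWriter` with an explicit list of field names, which keeps the header and the five values in each row aligned. It writes to an in-memory buffer for demonstration; `rows_to_csv` and the column name `url` are assumptions, not names taken from the notebook.

```python
import csv
import io

FIELDS = ["stock", "summary", "label", "score", "url"]


def rows_to_csv(rows, fields=FIELDS):
    """Serialize five-element rows to CSV text with a header row."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for row in rows:
        writer.writerow(dict(zip(fields, row)))
    return buffer.getvalue()


rows = [["tesla", "Shares fell, analysts say", "NEGATIVE", 0.99, "https://example.com/a"]]
text = rows_to_csv(rows)
# Read the text back to confirm that the comma inside the summary was quoted correctly
parsed = list(csv.DictReader(io.StringIO(text)))
print(parsed[0]["summary"])
```

To write a file instead, the same writer can be pointed at `open('analysis_stock.csv', mode='w', newline='')`.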