# Introduction

*This notebook is for exploring the data and getting familiar with spaCy, an NLP library. The code for the final app is in the app.py file.*

What is information retrieval?
Getting insights from raw, unstructured data such as text is a highly desirable capability. Uploading a document and getting the important bits of information out of it is called information retrieval, and it has long been a major challenge in NLP.

We can use NER (Named Entity Recognition) and NEL (Named Entity Linking) in several domains, such as finance, drug research, and e-commerce, for information retrieval purposes. In this tutorial, we use NEL to develop a stock market news feed that lists the stocks buzzing on the internet.

**Basic Tasks Related to NLP**
- Tokenization
- POS tagging
- Dependency parsing

How to do the task?
Apply Named Entity Recognition to extract the important entities (here, publicly traded companies), then link each entity to further information using a knowledge base (here, the Nifty 500 companies list).

The text to run NER on is taken from RSS feeds. The named entities (company names) extracted from it are linked to the knowledge base, and their stock information is then collected by name-derived symbol using the yfinance library (Yahoo Finance data).

**NER has many applications in industry.**






# News Feed Buzzing Stocks!

Major Steps:
1. Import the required libraries
  - spacy: the NLP pipeline
  - pandas: to read CSV files
  - requests: to send GET requests and fetch data
  - BeautifulSoup: to parse the XML data, since we will work with RSS feeds
  - streamlit: used for the app in VS Code, so it is imported there rather than here
  - yfinance: a library that provides stock market data from Yahoo Finance

2. Extract the data from the RSS feed links.
3. NER _(Named Entity Recognition)_ - use the spaCy NLP pipeline to process the extracted text.
4. NEL _(Named Entity Linking)_ - link the entities extracted in step 3 to an external dataset (the Nifty 500 company list). This is what turns the headlines into a financial news feed.
5. Extract the data for these entities (publicly traded companies) using the yfinance library.

### Step 1: Importing Required Libraries

In [1]:
import spacy
import pandas as pd
import requests
from bs4 import BeautifulSoup

### Step 2: How to extract trending stocks news data
We can use two sources:

1. Economic Times Markets RSS feed
2. Money Control RSS feed

What is an RSS feed? [2] \
An RSS (Really Simple Syndication) feed is an online file that contains details about every piece of content a site has published. Each time a site publishes a new piece of content, details about that content—including the full-text of the content or a summary, publication date, author, link, etc.—are automatically generated in the file and displayed in reverse chronological order.

We will get the headlines as text from the RSS feed and then use spaCy to extract the main entities from them. The headlines are in the \<title\> tags of the XML.

The requests package sends a GET request to the given link and returns the response. We use it to capture the entire XML document.

Display the response to check whether the request succeeded: HTTP status code 200 means success.

In [2]:
# Fetch the RSS feed XML into a response object.
# The timeout keeps the cell from hanging if the server does not answer.
resp = requests.get("https://economictimes.indiatimes.com/markets/rssfeeds/1977021501.cms", timeout=10)
resp

<Response [200]>

In [3]:
# resp.text
# resp.content

This complete XML content now needs to be parsed to extract the headlines, which are inside \<title\> tags. The BeautifulSoup class is used to parse the XML document.

In [4]:
soup = BeautifulSoup(resp.content, features='xml')
headlines = soup.find_all('title')  # find_all is the current name; findAll is a legacy alias

In [5]:
#headlines   # list of all title tags
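
Collecting every \<title\> tag also picks up titles that are not headlines, because an RSS 2.0 document carries a title for the channel itself in addition to one per \<item\>. That is probably where entries such as 'Markets-Economic Times' in the entity list further down come from. The sketch below shows one way to keep only item titles. It is an illustration under assumptions, not the notebook's method: it uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra installs, and `SAMPLE_RSS` and `item_titles` are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature feed with the usual RSS 2.0 channel/item layout.
# The first headline is the one used later in this notebook; the second is made up.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Markets-Economic Times</title>
    <item>
      <title>Sovereign Gold Bond Scheme: Issue price fixed at Rs 5,926/gm; subscription opens on Monday</title>
    </item>
    <item>
      <title>Second sample headline</title>
    </item>
  </channel>
</rss>"""


def item_titles(xml_text):
    """Return the title text of every <item>, skipping the channel-level title."""
    root = ET.fromstring(xml_text)
    # root.iter("item") walks all <item> elements; findtext reads the child <title>.
    return [item.findtext("title", default="").strip() for item in root.iter("item")]


print(item_titles(SAMPLE_RSS))
```

With the `soup` object from the cell above, a comparable filter might be `[item.title.text for item in soup.find_all('item')]`, assuming every item carries a title.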

### Step 3: How to extract entities from the headlines?

Of course we have to use NLP, but how?
We will use the spaCy library.
- It is an open-source NLP library.
- It is fast at processing text and is used in both research and industry.
- It supports many languages.
- It works well with TensorFlow and PyTorch.

We will use a pre-trained core language model from spaCy to extract the entities.
spaCy has two major classes of pre-trained language models:
1. Core Models: for general-purpose basic NLP tasks
2. Starter Models: for niche applications that require transfer learning.

In [6]:
# Colab ships with en_core_web_sm, so we only need to load it.
# On other machines, download it first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

`en_core_web_sm` is an English pipeline optimized for CPU. Tokenization is done by the tokenizer, which runs before the pipeline components. The pipeline itself has the following components:
- tok2vec - token-to-vector layer that turns each token into a vector used by downstream components such as the tagger and parser
- tagger - assigns a part-of-speech (POS) tag to each token, predicted by a statistical model
- parser - dependency parser that establishes the relationships among the tokens
- senter, ner, attribute_ruler, and lemmatizer.

In [10]:
# Testing for a single headline
print(headlines[10])
processed_hline = nlp(headlines[10].text)  # extract text from <title> tags
# processed_hline

# To check the tokens
for token in processed_hline:
    print(token.text)

<title>Sovereign Gold Bond Scheme: Issue price fixed at Rs 5,926/gm; subscription opens on Monday</title>
Sovereign
Gold
Bond
Scheme
:
Issue
price
fixed
at
Rs
5,926
/
gm
;
subscription
opens
on
Monday


In [11]:
# Each token has a tagged pos_ attribute
print(headlines[10])
processed_hline = nlp(headlines[10].text)
for token in processed_hline:
    print(token.text, "---", token.pos_)

<title>Sovereign Gold Bond Scheme: Issue price fixed at Rs 5,926/gm; subscription opens on Monday</title>
Sovereign --- PROPN
Gold --- PROPN
Bond --- PROPN
Scheme --- PROPN
: --- PUNCT
Issue --- NOUN
price --- NOUN
fixed --- VERB
at --- ADP
Rs --- NOUN
5,926 --- NUM
/ --- SYM
gm --- PROPN
; --- PUNCT
subscription --- NOUN
opens --- VERB
on --- ADP
Monday --- PROPN


In [12]:
# Each token also has a dep_ attribute which tells how that token is related to others
print(headlines[10])
processed_hline = nlp(headlines[10].text)
for token in processed_hline:
    print(token.text, "---", token.pos_, "---",token.dep_)

<title>Sovereign Gold Bond Scheme: Issue price fixed at Rs 5,926/gm; subscription opens on Monday</title>
Sovereign --- PROPN --- compound
Gold --- PROPN --- compound
Bond --- PROPN --- compound
Scheme --- PROPN --- dep
: --- PUNCT --- punct
Issue --- NOUN --- compound
price --- NOUN --- appos
fixed --- VERB --- acl
at --- ADP --- prep
Rs --- NOUN --- nmod
5,926 --- NUM --- pobj
/ --- SYM --- punct
gm --- PROPN --- appos
; --- PUNCT --- punct
subscription --- NOUN --- nsubj
opens --- VERB --- ROOT
on --- ADP --- prep
Monday --- PROPN --- pobj


In [13]:
print(headlines[10])
processed_hline = nlp(headlines[10].text)
for token in processed_hline:
    print(token.text, "---", token.pos_, "---",token.dep_, "---", spacy.explain(token.dep_))

<title>Sovereign Gold Bond Scheme: Issue price fixed at Rs 5,926/gm; subscription opens on Monday</title>
Sovereign --- PROPN --- compound --- compound
Gold --- PROPN --- compound --- compound
Bond --- PROPN --- compound --- compound
Scheme --- PROPN --- dep --- unclassified dependent
: --- PUNCT --- punct --- punctuation
Issue --- NOUN --- compound --- compound
price --- NOUN --- appos --- appositional modifier
fixed --- VERB --- acl --- clausal modifier of noun (adjectival clause)
at --- ADP --- prep --- prepositional modifier
Rs --- NOUN --- nmod --- modifier of nominal
5,926 --- NUM --- pobj --- object of preposition
/ --- SYM --- punct --- punctuation
gm --- PROPN --- appos --- appositional modifier
; --- PUNCT --- punct --- punctuation
subscription --- NOUN --- nsubj --- nominal subject
opens --- VERB --- ROOT --- root
on --- ADP --- prep --- prepositional modifier
Monday --- PROPN --- pobj --- object of preposition


In [14]:
# To visualize dependencies among the tokens,
# displacy render() method can be used.
spacy.displacy.render(processed_hline, style='dep', jupyter=True, options={'distance':120})

#### Named Entity Recognition

In [15]:
# Visualize NER results by passing 'ent' as the style.
# This displays labels on the recognized entities.
# ('distance' is a dependency-view option, so it is not needed here.)
spacy.displacy.render(processed_hline, style='ent', jupyter=True)

The entity tags that show up include DATE, PERCENT, GPE (countries, cities, states), and ORG.

We are interested in entities with the ORG tag, which covers companies, agencies, institutions, and so on. We will extract all of these entities.


In [17]:
companies = []
for title in headlines:
  doc = nlp(title.text)
  # doc.ents holds entity spans (not tokens); keep only the organizations
  for ent in doc.ents:
    if ent.label_ == 'ORG':
      companies.append(ent.text)
companies

['Markets-Economic Times',
 'Economic Times',
 'SEC',
 'BOJ',
 'RBZ Jewellers',
 'IPO',
 'gm',
 'Warburg Pincus',
 'Kalyan Jewellers',
 'Nomura',
 'AI',
 'BOJ',
 'Fed',
 'BSE',
 'Go Fashion',
 'ETMarket Watch',
 'Canara Bank',
 'Union Bank',
 'SMA',
 'SME',
 'Healthcare',
 'Nifty Auto',
 'Gainers',
 'TCS',
 'NSE',
 'Breaking Records',
 'ITC',
 'Nifty',
 'Nifty Realty',
 'NSE',
 'Nifty',
 'Bharat Phatak',
 'RSI']
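
The list above contains repeats ('BOJ', 'NSE', and 'Nifty' each appear twice), and how often a name is mentioned is one plausible signal for "buzz". As a minimal sketch, assuming we only want distinct names ranked by mention count, the standard library's `collections.Counter` is enough. The helper name `rank_mentions` is hypothetical, and the input is a hand-copied subset of the output above.

```python
from collections import Counter

# Subset of the ORG entities shown above, duplicates included.
companies = ['SEC', 'BOJ', 'RBZ Jewellers', 'Kalyan Jewellers', 'BOJ',
             'BSE', 'NSE', 'Nifty', 'NSE', 'Nifty']


def rank_mentions(names):
    """Count each distinct name and return (name, count) pairs, most frequent first."""
    return Counter(names).most_common()


ranked = rank_mentions(companies)
for name, count in ranked:
    print(name, count)
```

Counting does not remove false positives such as index names or abbreviations; that filtering is left to the linking step against the knowledge base.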

### Step 4: Named Entity Linking

We now have the names of the companies buzzing in the market, and we need their stock price information. Stock markets, however, identify a company by its trading symbol, not by its name.

As we are looking at Indian companies only, we use an external database of Nifty 500 companies (a CSV file) and look up each company name in it to find the symbol.

For that symbol, we then fetch the stock market statistics using the yfinance library.

In [18]:
# The organization names extracted above need a knowledge base to link against:
# here, the Nifty 500 company list, uploaded as a CSV file.
from google.colab import files
files.upload();

Saving ind_nifty500list.csv to ind_nifty500list.csv


In [19]:
import pandas as pd
stocks_df = pd.read_csv("./ind_nifty500list.csv")
stocks_df.head()

Unnamed: 0,Company Name,Industry,Symbol,Series,ISIN Code
0,360 ONE WAM Ltd.,Financial Services,360ONE,EQ,INE466L01038
1,3M India Ltd.,Diversified,3MINDIA,EQ,INE470A01017
2,ABB India Ltd.,Capital Goods,ABB,EQ,INE117A01022
3,ACC Ltd.,Construction Materials,ACC,EQ,INE012A01025
4,AIA Engineering Ltd.,Capital Goods,AIAENG,EQ,INE212H01026
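
Linking by plain substring search is permissive: a short entity such as 'AI' is contained in 'AIA Engineering Ltd.', and 'Fed' in 'Federal Bank Ltd.', which is likely how those two rows end up in the final table of this notebook. The sketch below shows one stricter alternative, whole-word matching. It is an assumption-laden illustration rather than the notebook's method: `link_entity` and `NAME_TO_SYMBOL` are hypothetical names, and the name/symbol pairs are copied from outputs shown in this notebook.

```python
import re

# A few (company name -> symbol) pairs taken from the outputs in this notebook.
NAME_TO_SYMBOL = {
    "AIA Engineering Ltd.": "AIAENG",
    "Federal Bank Ltd.": "FEDERALBNK",
    "Canara Bank": "CANBK",
    "City Union Bank Ltd.": "CUB",
    "Kalyan Jewellers India Ltd.": "KALYANKJIL",
}


def link_entity(entity, name_to_symbol):
    """Return (company name, symbol) pairs whose name contains `entity` as whole words."""
    # re.escape treats the entity as literal text; the lookarounds stop a match
    # from starting or ending in the middle of a word.
    pattern = re.compile(r"(?<!\w)" + re.escape(entity) + r"(?!\w)", re.IGNORECASE)
    return [(name, symbol) for name, symbol in name_to_symbol.items()
            if pattern.search(name)]


for entity in ["AI", "Fed", "Canara Bank", "Union Bank"]:
    print(entity, "->", link_entity(entity, NAME_TO_SYMBOL))
```

Whole-word matching rejects 'AI' and 'Fed' but still links 'Union Bank' to 'City Union Bank Ltd.', so the function returns every candidate instead of the first one and leaves disambiguation to the caller. With the dataframe loaded above, the mapping could be built with `dict(zip(stocks_df['Company Name'], stocks_df['Symbol']))`.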


In [20]:
# yfinance does not come preinstalled in Colab, so we have to install it.
!pip install yfinance

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [25]:
# To get stock info, pass the stock symbol to the Ticker class of the yfinance library.
# Note the '.NS' suffix: Yahoo Finance identifies NSE-listed Indian stocks
# by their symbol followed by '.NS'.
import yfinance as yf
stock_info = yf.Ticker("3MINDIA.NS")
stock_info.info

{'address1': 'WeWork Prestige Central',
 'address2': '3rd floor 36 Infantry Road Tasker Town',
 'city': 'Bengaluru',
 'zip': '560001',
 'country': 'India',
 'phone': '91 80 2223 1414',
 'website': 'https://www.3mindia.in',
 'industry': 'Conglomerates',
 'industryDisp': 'Conglomerates',
 'sector': 'Industrials',
 'longBusinessSummary': '3M India Limited engages in the manufacture and trade of various products for health care, manufacturing, automotive, safety, electronics, energy, commercial solutions, transportation, and design and construction industries in India. The company operates through four segments: Safety and Industrial, Health Care, Transportation and Electronics, and Consumer. The Safety and Industrial segment offers vinyl, polyester, foil, and specialty industrial tapes and adhesives, such as scotch masking tapes, scotch filament and packaging tapes, functional and decorative graphics, abrasion-resistant films, masking tapes, and other specialty materials. This segment ser

### Step 5: Extract stock market stats for these companies using yfinance

In [24]:
# To fetch stock data we need each company's trading symbol, not its name.
# After capturing company names from the RSS feeds, we link them to the
# external knowledge base (Nifty 500): we match the entity text against the
# 'Company Name' column and read the symbol from the matching row.
# The symbol is then passed to yfinance to get the stock info, which we
# collect in a dictionary and convert to a dataframe for display in our app.

import yfinance as yf

stock_info_dict = {
    'Org': [],
    'Symbol': [],
    'currentPrice': [],
    'dayHigh': [],
    'dayLow': []
    # 'forwardPE': [],
    # 'dividendYield': []
}

for title in headlines:
    doc = nlp(title.text)
    for ent in doc.ents:
        try:
            # Rows of the knowledge base whose company name contains the
            # entity text (literal substring match, not a regex)
            matches = stocks_df[stocks_df['Company Name'].str.contains(ent.text, regex=False)]
            if matches.empty:
                continue
            # Take the first match. Some companies will be missed,
            # but we prefer correct info over complete info.
            symbol = matches['Symbol'].values[0]
            org_name = matches['Company Name'].values[0]

            # Send yfinance the NSE symbol to get the stock info
            stock_info = yf.Ticker(symbol + ".NS").info
            # Read every field before appending anything, so a missing key
            # cannot leave the columns with different lengths
            current_price = stock_info['currentPrice']
            day_high = stock_info['dayHigh']
            day_low = stock_info['dayLow']
        except Exception:
            # Skip entities whose lookup or download fails
            continue

        stock_info_dict['Org'].append(org_name)
        stock_info_dict['Symbol'].append(symbol)
        stock_info_dict['currentPrice'].append(current_price)
        stock_info_dict['dayHigh'].append(day_high)
        stock_info_dict['dayLow'].append(day_low)

# Debugging note: forwardPE and dividendYield are not returned for every symbol,
# which is why they are left out above.
# arrayLength = {k: len(stock_info_dict[k]) for k in stock_info_dict.keys()}
# print(arrayLength)

output_df = pd.DataFrame(stock_info_dict)
print(output_df)

                                   Org      Symbol  currentPrice   dayHigh  \
0   Network18 Media & Investments Ltd.   NETWORK18         65.60     66.50   
1                   Ashok Leyland Ltd.    ASHOKLEY        164.40    167.80   
2          Kalyan Jewellers India Ltd.  KALYANKJIL        131.10    134.90   
3                 AIA Engineering Ltd.      AIAENG       3477.00   3498.95   
4   Network18 Media & Investments Ltd.   NETWORK18         65.60     66.50   
5                    Federal Bank Ltd.  FEDERALBNK        123.75    124.35   
6                             BSE Ltd.         BSE        572.25    580.65   
7              Go Fashion (India) Ltd.    GOCOLORS       1130.45   1143.95   
8                          Zomato Ltd.      ZOMATO         74.45     76.00   
9           One 97 Communications Ltd.       PAYTM        895.15    909.90   
10                         Canara Bank       CANBK        303.75    305.00   
11                City Union Bank Ltd.         CUB        124.65
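
The debugging note in the cell above records that fields such as forwardPE and dividendYield do not come back for every symbol. One way to tolerate such gaps is to build one row per company and read optional fields with `dict.get`, so a missing field becomes an empty value instead of an exception. The following is a minimal sketch under stated assumptions: `to_nse_ticker` and `build_row` are hypothetical helpers, and the `info` dictionaries are made-up placeholders standing in for `yf.Ticker(...).info`, not real market data.

```python
def to_nse_ticker(symbol):
    """Append the '.NS' suffix used for NSE-listed symbols."""
    return symbol + ".NS"


def build_row(org, symbol, info, fields=("currentPrice", "dayHigh", "dayLow")):
    """Build one table row; fields absent from `info` become None."""
    row = {"Org": org, "Symbol": symbol}
    for field in fields:
        # dict.get returns None instead of raising KeyError for a missing field
        row[field] = info.get(field)
    return row


# Placeholder `info` dicts with invented values; the second lacks 'dayLow' on purpose.
rows = [
    build_row("Example One Ltd.", "EXAMPLE1",
              {"currentPrice": 100.0, "dayHigh": 101.0, "dayLow": 99.0}),
    build_row("Example Two Ltd.", "EXAMPLE2",
              {"currentPrice": 50.0, "dayHigh": 51.0}),
]

print(to_nse_ticker("3MINDIA"))
for row in rows:
    print(row)
```

Because every row has the same keys, `pd.DataFrame(rows)` would produce a table in which absent fields show up as missing values, with no risk of columns of unequal length.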

References:


1. [Harshit Tyagi Blog](https://www.freecodecamp.org/news/use-python-spacy-streamlit-to-build-structured-financial-newsfeed/)
2. [How to use RSS feeds to boost your productivity?](https://zapier.com/blog/how-to-use-rss-feeds/)
3. [spaCy](https://spacy.io/)
4. [Money Control RSS feed](https://www.moneycontrol.com/rss/buzzingstocks.xml)
5. [Economic Times Markets RSS feed](https://economictimes.indiatimes.com/markets/stocks/rssfeeds/2146842.cms)
6. [Nifty 500 companies database](https://www.nseindia.com/products-services/indices-nifty500-index)
7. [Yahoo-finance library](https://pypi.org/project/yfinance/)


