# Scraping Stock Prices on Yahoo Finance


<img src="https://i.imgur.com/JlzyGl6.jpeg" style="width: 800px;" align="center"/>

### <font color=green>Introduction

Yahoo Finance is a source of financial information for users all over the world. This project scrapes ticker data from the market pages of the Yahoo Finance website.


The code written in this project draws information from a select group of market categories, which can be found under the 'Markets' header of the website, as shown below:

<img src="https://i.imgur.com/xH9XyZk.jpeg" style="width: 800px;" align="center" />


When the Trending Tickers page is opened, the tickers look like this:
<img src="https://i.imgur.com/OK4DXUx.png" style="width: 800px;" align="center" />


The goal of this project was to scrape the following stock ticker information for each of the tickers:

* Name
* Symbol
* Price
* URL

### <font color=green>Requests and BeautifulSoup

First, the required packages are installed and imported. For this project, we need the requests library to query a website and save the result as a response object.

Then, we install the bs4 package, which contains BeautifulSoup, a popular Python library for reading and parsing HTML data.

In [72]:
!pip install requests --upgrade --quiet
import requests

!pip install bs4 --upgrade --quiet
from bs4 import BeautifulSoup

### <font color=green>Defining the base_url

The website home page (https://www.finance.yahoo.com) is assigned to a variable 'base_url' so that the specific market paths which will be extracted from the html code can be appended to it. 

In [73]:
base_url = 'https://www.finance.yahoo.com'

### <font color=brown>Function No. 1: get_market_page

* A function called 'get_market_page' is defined, which takes a market category as string input and generates a beautifulsoup document as output. 
***
* A couple of market URLs (trending tickers and cryptocurrencies) were checked, and it was found that each market URL is the market name appended to the base URL, https://finance.yahoo.com. The outputs were checked and verified.

    The cryptocurrencies URL was found to be https://finance.yahoo.com/cryptocurrencies;

    Trending-tickers URL was found to be https://finance.yahoo.com/trending-tickers;

    Top ETFs URL was found to be https://finance.yahoo.com/etfs;

    ...and so on.
***

* The local variable 'response' is assigned to the response obtained when the market URL is passed to the requests.get function. 
***
* A valid response has a status code of 200. If the status code is anything else, an exception is raised stating that the page could not be fetched.
***
* The response text is passed to the BeautifulSoup constructor, and a BeautifulSoup document for that specific market is returned.

In [74]:
def get_market_page(market):
    market_tickers_url = 'https://www.finance.yahoo.com/'+market
    response = requests.get(market_tickers_url)
    if response.status_code != 200:
        print('Status code: ', response.status_code)
        raise Exception('Failed to fetch web page '+ market_tickers_url)
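    # Name the parser explicitly so the result does not depend on which parsers are installed
    doc = BeautifulSoup(response.text, 'html.parser')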
    return doc
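
The status check described above can be isolated into a small helper. The following is only a sketch: `FetchError` and `ensure_ok` are hypothetical names that do not appear in the notebook, and the two stand-in objects merely mimic the `status_code` and `text` attributes of a response so that the example runs without network access.

```python
from types import SimpleNamespace


class FetchError(Exception):
    """Raised when a page does not come back with HTTP status 200."""


def ensure_ok(response, url):
    """Return the response text, or raise FetchError naming the status code."""
    if response.status_code != 200:
        raise FetchError(f"Failed to fetch {url} (status {response.status_code})")
    return response.text


# Stand-in objects (assumptions for illustration, not real HTTP responses).
ok = SimpleNamespace(status_code=200, text="<html></html>")
missing = SimpleNamespace(status_code=404, text="")

print(ensure_ok(ok, "https://finance.yahoo.com/etfs"))
try:
    ensure_ok(missing, "https://finance.yahoo.com/no-such-page")
except FetchError as err:
    print(err)
```

Putting the status code into the exception message keeps it with the error instead of in a separate `print` call. The requests library also offers `response.raise_for_status()`, which raises an error for 4xx and 5xx responses and could serve a similar purpose.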

In [75]:
# when the cryptocurrencies market category is queried:
crypto_doc = get_market_page('cryptocurrencies')

The title of the webpage can be read from the BeautifulSoup document generated from the cryptocurrencies page by accessing .title.text on the document:

In [76]:
crypto_doc.title.text

'All Cryptocurrencies Screener - Yahoo Finance'

The trending tickers market of Yahoo Finance website is also queried and a document is generated. The title is then checked to make sure that the function above works for all applicable markets on the yahoo finance website.

In [77]:
# Another example 
market_doc = get_market_page('trending-tickers')
market_doc.title.text

'Trending Stocks Today - Yahoo Finance'

### <font color=green>Inspecting the HTML code of a market website on Yahoo Finance

Upon inspection of the HTML code of a Yahoo Finance market page, it was found that each ticker row is a 'tr' tag whose class begins with "simpTblRow", and that this holds for each of the market categories.

<img src="https://i.imgur.com/qhjhWsn.jpeg" style="width: 900px;" align="center" />

In [78]:
#to look at all the 'tr' tags and assign a variable to it: 'ticker_tags'
ticker_tags = market_doc.find_all('tr', class_="simpTblRow")

In [79]:
# There are thirty tickers in the 'trending-tickers' list.
len(ticker_tags)

30

In [80]:
# Inspect the first ticker tag and assign it to a variable: ticker_tag
ticker_tag = ticker_tags[0]
ticker_tag

<tr class="simpTblRow Bgc($hoverBgColor):h BdB Bdbc($seperatorColor) Bdbc($tableBorderBlue):h H(32px) Bgc($lv2BgColor) " data-reactid="41"><td aria-label="Symbol" class="Va(m) Ta(start) Pstart(6px) Pend(15px) Start(0) Pend(10px) simpTblRow:h_Bgc($hoverBgColor) Pos(st) Bgc($lv3BgColor) Bgc($lv2BgColor) Ta(start)! Fz(s)" colspan="" data-reactid="42"><a class="Fw(600) C($linkColor)" data-reactid="43" data-test="quoteLink" href="/quote/BTC-USD?p=BTC-USD" title="Bitcoin USD">BTC-USD</a><div class="W(3px) Pos(a) Start(100%) T(0) H(100%) Bg($pfColumnFakeShadowGradient) Pe(n) Pend(5px)" data-reactid="44"></div></td><td aria-label="Name" class="Va(m) Ta(start) Px(10px) Miw(180px) Fz(s)" colspan="" data-reactid="45"><!-- react-text: 46 -->Bitcoin USD<!-- /react-text --></td><td aria-label="Last Price" class="Va(m) Ta(end) Pstart(20px) Fw(600) Fz(s)" colspan="" data-reactid="47"><span class="Trsdu(0.3s) " data-reactid="48">48,643.65</span></td><td aria-label="Market Time" class="Va(m) Ta(end) Pst

When the 'tr' tags are inspected, we find that each 'tr' tag contains a list of 'td' tags:

<img src="https://i.imgur.com/ZzASKhi.jpeg" style="width: 600px;" align="center" />




In [81]:
#to look at all 'td' tags in a 'tr' tag
td_tags = ticker_tag.find_all('td')
td_tags

[<td aria-label="Symbol" class="Va(m) Ta(start) Pstart(6px) Pend(15px) Start(0) Pend(10px) simpTblRow:h_Bgc($hoverBgColor) Pos(st) Bgc($lv3BgColor) Bgc($lv2BgColor) Ta(start)! Fz(s)" colspan="" data-reactid="42"><a class="Fw(600) C($linkColor)" data-reactid="43" data-test="quoteLink" href="/quote/BTC-USD?p=BTC-USD" title="Bitcoin USD">BTC-USD</a><div class="W(3px) Pos(a) Start(100%) T(0) H(100%) Bg($pfColumnFakeShadowGradient) Pe(n) Pend(5px)" data-reactid="44"></div></td>,
 <td aria-label="Name" class="Va(m) Ta(start) Px(10px) Miw(180px) Fz(s)" colspan="" data-reactid="45"><!-- react-text: 46 -->Bitcoin USD<!-- /react-text --></td>,
 <td aria-label="Last Price" class="Va(m) Ta(end) Pstart(20px) Fw(600) Fz(s)" colspan="" data-reactid="47"><span class="Trsdu(0.3s) " data-reactid="48">48,643.65</span></td>,
 <td aria-label="Market Time" class="Va(m) Ta(end) Pstart(20px) Miw(90px) Fz(s)" colspan="" data-reactid="49"><span class="Trsdu(0.3s) " data-reactid="50">12:02PM BST</span></td>,
 <t

It was also found - by hovering on each 'td' tag - that they represent each of the items in the ticker row on the website:

<img src="https://i.imgur.com/UpdTOlz.jpeg" style="width: 900px;" align="center" />


When the td tags are expanded, we can find the information we are looking for - ticker name, ticker symbol, ticker price and ticker path (which needs to be appended to the base-URL) to lead to the correct URL. 
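
To make the row and cell structure concrete without fetching a live page, here is a sketch that uses only Python's standard-library `html.parser` instead of BeautifulSoup, so it is a swapped-in technique rather than the notebook's method. `SAMPLE_HTML` is a hypothetical, heavily trimmed row modelled on the markup shown above, and `RowCollector` is an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical, trimmed-down row modelled on the markup shown above.
SAMPLE_HTML = """
<table>
  <tr class="simpTblRow H(32px)">
    <td aria-label="Symbol"><a href="/quote/BTC-USD?p=BTC-USD">BTC-USD</a></td>
    <td aria-label="Name">Bitcoin USD</td>
    <td aria-label="Last Price"><span>48,643.65</span></td>
  </tr>
</table>
"""


class RowCollector(HTMLParser):
    """Collect the text of every <td> inside a <tr> whose class list has simpTblRow."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell texts per matching row
        self._in_row = False
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "tr" and "simpTblRow" in classes:
            self._in_row = True
            self.rows.append([])
        elif tag == "td" and self._in_row:
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._in_row = False

    def handle_data(self, data):
        # Text nested inside <a> or <span> still belongs to the enclosing cell.
        if self._in_cell:
            self.rows[-1][-1] += data


collector = RowCollector()
collector.feed(SAMPLE_HTML)
rows = [[cell.strip() for cell in row] for row in collector.rows]
print(rows)  # [['BTC-USD', 'Bitcoin USD', '48,643.65']]
```

The cell order (symbol, name, price) matches what hovering over the 'td' tags shows on the page.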

### <font color=green>Parsing the required information for each ticker:

Once the required information is located inside each of the 'td' tags, these are the findings:
* Ticker symbol is in the first 'td' tag
* Ticker name is in the second 'td' tag
* Ticker price is in the third 'td' tag
* Ticker path is located inside the 'a' tag in the first 'td' tag ('a' tags usually contain links)

 

Ticker path location:

<img src="https://i.imgur.com/jcPdycA.jpg" style="width: 900px;" align="center" />
    
    

In [82]:
ticker_symbol = td_tags[0].text
ticker_symbol

'BTC-USD'

In [83]:
ticker_name = td_tags[1].text
ticker_name

'Bitcoin USD'

In [84]:
ticker_price_text = td_tags[2].text
ticker_price = ticker_price_text.replace(',','')
ticker_price

'48643.65'
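
The price above is still a string. If a numeric value is wanted, one option is to convert it after removing the thousands separators. This is a minimal sketch; `parse_price` is a hypothetical helper that is not part of the notebook, and treating non-numeric cells as `None` is an assumption about how such cells should be handled.

```python
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Convert a displayed price such as '48,643.65' to a Decimal.

    Returns None when the cell does not hold a number (for example 'N/A').
    """
    cleaned = text.strip().replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(parse_price("48,643.65"))  # 48643.65
print(parse_price("N/A"))        # None
```

`Decimal` keeps the digits exactly as displayed, which avoids binary floating-point rounding when prices are later compared or summed.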

In [85]:
a_tags = ticker_tag.find_all('a')

In [86]:
a_tags

[<a class="Fw(600) C($linkColor)" data-reactid="43" data-test="quoteLink" href="/quote/BTC-USD?p=BTC-USD" title="Bitcoin USD">BTC-USD</a>,
 <a aria-label="Go to BTC-USD Chart" class="C($linkColor)" data-reactid="64" data-symbol="BTC-USD" href="/chart/BTC-USD?p=BTC-USD" rel="noopener noreferrer" target="_blank"><canvas data-reactid="65" style="width:70px;height:25px;"></canvas></a>]

In [87]:
ticker_path = a_tags[0]['href']

In [88]:
ticker_path

'/quote/BTC-USD?p=BTC-USD'

In [89]:
ticker_url = base_url+ticker_path

In [90]:
ticker_url

'https://www.finance.yahoo.com/quote/BTC-USD?p=BTC-USD'
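
String concatenation works here because every path starts with a slash. As an alternative sketch, the standard-library `urljoin` function builds the same URL and also copes with hrefs that are already absolute. `absolute_url` is a hypothetical helper name, and the example.com address is a placeholder.

```python
from urllib.parse import urljoin

base_url = "https://www.finance.yahoo.com"


def absolute_url(path, base=base_url):
    """Resolve a relative href against the base URL."""
    return urljoin(base, path)


print(absolute_url("/quote/BTC-USD?p=BTC-USD"))
# https://www.finance.yahoo.com/quote/BTC-USD?p=BTC-USD
print(absolute_url("https://example.com/already-absolute"))
# https://example.com/already-absolute
```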

In [91]:
print('Ticker name: ', ticker_name)
print('Ticker symbol: ', ticker_symbol)
print('Ticker price: ', ticker_price)
print('Ticker URL', ticker_url)

Ticker name:  Bitcoin USD
Ticker symbol:  BTC-USD
Ticker price:  48643.65
Ticker URL https://www.finance.yahoo.com/quote/BTC-USD?p=BTC-USD


### <font color=brown>Function No. 2: parse_ticker

* Next, we define a function 'parse_ticker', which takes the 'tr' tag (ticker_tag) as an input. 
***
Inside this function, we define the following local variables:
* ticker_name: This is the name of the ticker and is obtained from the text attribute of the second 'td' tag.

* ticker_symbol: This is the symbol of the ticker and is obtained from the text attribute of the first 'td' tag.

* ticker_price: This is the price of each unit of the ticker and is obtained from the text attribute of the third 'td' tag, with the thousands separators removed.

* ticker_url: This is the URL for the ticker on Yahoo Finance. It is obtained by appending the ticker_path to the base_url.

* The output that is generated is a dictionary containing key-value pairs, where the keys are 'Ticker name', 'Ticker symbol', 'Ticker price' and 'Ticker URL' and the values are their respective local variables mentioned in the text above. 

Putting the code for scraping all the ticker information together, a function 'parse_ticker' can be defined as follows:

In [92]:
def parse_ticker(ticker_tag):
    td_tags = ticker_tag.find_all('td')
    ticker_name = td_tags[1].text.replace(',','')
    ticker_symbol = td_tags[0].text
    ticker_price = (td_tags[2].text.replace(',',''))
    a_tags = ticker_tag.find_all('a')
    ticker_path = a_tags[0]['href']
    ticker_url = base_url+ticker_path
    return {
        'Ticker name ': ticker_name,
        'Ticker symbol ': ticker_symbol,
        'Ticker price ': ticker_price,
        'Ticker URL': ticker_url
    }

In [93]:
parse_ticker(ticker_tags[1])

{'Ticker URL': 'https://www.finance.yahoo.com/quote/%5EGSPC?p=%5EGSPC',
 'Ticker name ': 'S&P 500',
 'Ticker price ': '4432.99',
 'Ticker symbol ': '^GSPC'}
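
Note that three of the dictionary keys in the output above end with a space ('Ticker name ', 'Ticker symbol ', 'Ticker price '), so a lookup such as `record['Ticker name']` would raise a `KeyError`. If cleaner keys are preferred, they can be normalised afterwards. This is a sketch only; `normalise_keys` is a hypothetical helper and is not used elsewhere in the notebook.

```python
def normalise_keys(record):
    """Return a copy of the record with surrounding whitespace stripped from each key."""
    return {key.strip(): value for key, value in record.items()}


# Record copied from the parse_ticker output above.
raw = {
    'Ticker URL': 'https://www.finance.yahoo.com/quote/%5EGSPC?p=%5EGSPC',
    'Ticker name ': 'S&P 500',
    'Ticker price ': '4432.99',
    'Ticker symbol ': '^GSPC',
}

clean = normalise_keys(raw)
print(clean['Ticker name'])  # S&P 500
```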

To iterate over all the tickers in a particular category, we can use a list comprehension:

In [94]:
top_tickers = [parse_ticker(tag) for tag in ticker_tags]

In [95]:
#top_tickers
len(top_tickers)

30

In [96]:
# check the first 10
top_tickers[:10]

[{'Ticker URL': 'https://www.finance.yahoo.com/quote/BTC-USD?p=BTC-USD',
  'Ticker name ': 'Bitcoin USD',
  'Ticker price ': '48643.65',
  'Ticker symbol ': 'BTC-USD'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/%5EGSPC?p=%5EGSPC',
  'Ticker name ': 'S&P 500',
  'Ticker price ': '4432.99',
  'Ticker symbol ': '^GSPC'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/ETH-USD?p=ETH-USD',
  'Ticker name ': 'Ethereum USD',
  'Ticker price ': '3532.81',
  'Ticker symbol ': 'ETH-USD'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/%5EIXIC?p=%5EIXIC',
  'Ticker name ': 'Nasdaq',
  'Ticker price ': '15043.97',
  'Ticker symbol ': '^IXIC'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/DNA?p=DNA',
  'Ticker name ': '5885041',
  'Ticker price ': '12.18',
  'Ticker symbol ': 'DNA'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/ADA-USD?p=ADA-USD',
  'Ticker name ': 'Cardano USD',
  'Ticker price ': '2.4161',
  'Ticker symbol ': 'ADA-USD'},
 {'Ticker URL': 'http

### <font color=brown>Function No. 3: get_top_tickers
* A function get_top_tickers is written that can take in a beautifulsoup doc for any of the market categories, and parse the ticker information for it.
***
* The local variable 'ticker_tags' is defined which takes all the 'tr' tags in the html code, with class "simpTblRow". 
***
* We already found that each ticker row on the webpage is represented by a single 'tr' tag. So it is expected that the local variable 'ticker_tags' contains information on all the ticker rows visible on the page. 
***
* Then, we extract the information for each row by iterating over the 'tr' tags and parsing each one with the parse_ticker function defined as Function No. 2. The result is a list of dictionaries, one per ticker, which is assigned to the local variable market_tickers and returned.

In [97]:
def get_top_tickers(doc):
    ticker_tags = doc.find_all('tr', class_="simpTblRow")
    market_tickers = [parse_ticker(tag) for tag in ticker_tags]
    return market_tickers

In [98]:
# testing out the 'cryptocurrencies' market page on Yahoo Finance and assigning the output to crypto_test_tickers
crypto_page = get_market_page('cryptocurrencies')
crypto_test_tickers = get_top_tickers(crypto_page)
crypto_test_tickers[:5]

[{'Ticker URL': 'https://www.finance.yahoo.com/quote/BTC-USD?p=BTC-USD',
  'Ticker name ': 'Bitcoin USD',
  'Ticker price ': '48650.23',
  'Ticker symbol ': 'BTC-USD'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/ETH-USD?p=ETH-USD',
  'Ticker name ': 'Ethereum USD',
  'Ticker price ': '3533.52',
  'Ticker symbol ': 'ETH-USD'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/ADA-USD?p=ADA-USD',
  'Ticker name ': 'Cardano USD',
  'Ticker price ': '2.4213',
  'Ticker symbol ': 'ADA-USD'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/HEX-USD?p=HEX-USD',
  'Ticker name ': 'HEX USD',
  'Ticker price ': '0.447696',
  'Ticker symbol ': 'HEX-USD'},
 {'Ticker URL': 'https://www.finance.yahoo.com/quote/BNB-USD?p=BNB-USD',
  'Ticker name ': 'BinanceCoin USD',
  'Ticker price ': '418.49',
  'Ticker symbol ': 'BNB-USD'}]

In [99]:
(crypto_test_tickers[0])

{'Ticker URL': 'https://www.finance.yahoo.com/quote/BTC-USD?p=BTC-USD',
 'Ticker name ': 'Bitcoin USD',
 'Ticker price ': '48650.23',
 'Ticker symbol ': 'BTC-USD'}

### <font color=brown>Function No. 4: write_csv

* Once all the tickers are parsed, the information needs to be saved in a csv file for usability.
***
* The function 'write_csv' is defined, which takes inputs as items and path. 
***
* Items is the list of dictionaries, where each dictionary contains the information for one ticker row.
***
* Path is the name of the file that the items are saved to.
***
* The output is a CSV file written at the location given by path.

In [100]:
def write_csv(items, path):
    # Open the file in write mode; if there are no items, return without writing any rows.
    with open(path, 'w') as f:
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")
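
Joining values with commas by hand works as long as no value contains a comma, which is why parse_ticker strips commas from names and prices. As an alternative sketch, the standard-library `csv` module quotes such values automatically. `write_csv_quoted` is a hypothetical name, and the sample record is made up for illustration.

```python
import csv
import os
import tempfile


def write_csv_quoted(items, path):
    """Write a list of dictionaries to `path`, quoting values where needed."""
    if not items:
        return  # unlike write_csv above, no file is created for an empty list
    fieldnames = list(items[0].keys())
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames,
                                restval="", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(items)


# Made-up record whose values contain commas.
sample = [{"Ticker name": "Example Holdings, Inc.", "Ticker price": "1,234.50"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.csv")
    write_csv_quoted(sample, sample_path)
    with open(sample_path, newline="", encoding="utf-8") as f:
        rows_back = list(csv.DictReader(f))

print(rows_back[0]["Ticker name"])  # Example Holdings, Inc.
```

The `restval` and `extrasaction` arguments mirror the behaviour of `item.get(header, "")` in write_csv: missing keys become empty cells and unexpected keys are ignored.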

In [101]:
# write a test csv file: trending_stocks_test.csv
trending_stocks_page = get_market_page('trending-tickers')
trending_tickers = get_top_tickers(trending_stocks_page)
trending_tickers[:5]
write_csv(trending_tickers, 'trending_stocks_test.csv')

### <font color=brown>Function No. 5: scrape_market_tickers

* The function scrape_market_tickers takes a market category as a string input, writes the tickers for that market to a CSV file, and returns the path of that file.

***
* If the path for the file to be output is not provided, it takes the market name and appends '.csv'. 
***
* The get_market_page function (Function No. 1) takes the market name as a string and returns the BeautifulSoup document for that specific market. The result is assigned to the local variable market_page_doc.
***
* The local variable market_tickers is assigned the output of passing market_page_doc to the get_top_tickers function; it is then written to the file with write_csv.

In [102]:
def scrape_market_tickers(market, path=None):
    if path is None:
        path = market+'.csv'
    market_page_doc = get_market_page(market)
    market_tickers = get_top_tickers(market_page_doc)
    write_csv(market_tickers, path)
    return path

### <font color=green>Putting all the above functions together:

In [103]:
#Let's attempt to scrape all these market categories mentioned in the next line:

markets = ['cryptocurrencies', 'trending-tickers','most-active', 'gainers', 'losers', 'etfs', 'world-indices', 'mutualfunds' ]

import requests
from bs4 import BeautifulSoup
base_url = 'https://www.finance.yahoo.com'

def scrape_market_tickers(market, path=None):
    if path is None:
        path = market+'.csv'
    market_page_doc = get_market_page(market)
    market_tickers = get_top_tickers(market_page_doc)
    write_csv(market_tickers, path)
    return path

def get_market_page(market):
    market_tickers_url = 'https://www.finance.yahoo.com/'+market
    response = requests.get(market_tickers_url)
    if response.status_code != 200:
        print('Status code: ', response.status_code)
        raise Exception('Failed to fetch web page '+ market_tickers_url)
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

def get_top_tickers(doc):
    ticker_tags = doc.find_all('tr', class_="simpTblRow")
    market_tickers = [parse_ticker(tag) for tag in ticker_tags]
    return market_tickers

def parse_ticker(ticker_tag):
    td_tags = ticker_tag.find_all('td')
    ticker_name = td_tags[1].text.replace(',','')
    ticker_symbol = td_tags[0].text
    ticker_price = td_tags[2].text.replace(',','')
    a_tags = ticker_tag.find_all('a')
    ticker_path = a_tags[0]['href']
    ticker_url = base_url + ticker_path
    return {
        'Ticker name ': ticker_name,
        'Ticker symbol ': ticker_symbol,
        'Ticker price ': ticker_price,
        'Ticker URL': ticker_url
    }

def write_csv(items, path):
    with open(path, 'w') as f:
        if len(items) == 0:
            return
        
        # Write the headers in the first line
        headers = list(items[0].keys())
        f.write(','.join(headers) + '\n')
        
        # Write one item per line
        for item in items:
            values = []
            for header in headers:
                values.append(str(item.get(header, "")))
            f.write(','.join(values) + "\n")

for market in markets:
    scrape_market_tickers(market, path=None)
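
In the loop above, a single failing market raises an exception and stops the remaining markets from being scraped. One possible variation is to record failures and carry on. This is a sketch under assumptions: `scrape_all` and `fake_scrape` are hypothetical names, and `fake_scrape` stands in for scrape_market_tickers so that the example runs without network access.

```python
def scrape_all(markets, scrape):
    """Call `scrape` for each market, collecting saved paths and error messages."""
    saved, failed = {}, {}
    for market in markets:
        try:
            saved[market] = scrape(market)
        except Exception as err:
            failed[market] = str(err)
    return saved, failed


def fake_scrape(market):
    """Stand-in for scrape_market_tickers: fails for one made-up market name."""
    if market == "no-such-market":
        raise Exception("Failed to fetch web page")
    return market + ".csv"


saved, failed = scrape_all(["etfs", "no-such-market", "gainers"], fake_scrape)
print(saved)   # {'etfs': 'etfs.csv', 'gainers': 'gainers.csv'}
print(failed)  # {'no-such-market': 'Failed to fetch web page'}
```

With the real function, the call would be `scrape_all(markets, scrape_market_tickers)`.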

### <font color=green>The Output
Results can be checked by opening the generated files:
File > Open > Check all the files and their contents

    
The file list looks like: 
<img src="https://i.imgur.com/UcOWmWR.jpeg" style="width: 400px;" align="center" />
    
***
When the cryptocurrencies.csv file is opened and inspected, its contents look like this (note: the contents of the output files change as the tickers are updated on the Yahoo Finance website):
<img src="https://i.imgur.com/pZt4K9i.jpeg" style="width: 900px;" align="center" />


### <font color=green> Merging all the scraped data into a single file

In [104]:
import pandas as pd
import os, glob
path = ""
my_files = glob.glob(os.path.join(path, "*.csv"))
df_from_each_file = (pd.read_csv(f, sep=',') for f in my_files)
df_merged   = pd.concat(df_from_each_file, ignore_index=True)
df_merged.to_csv( "merged.csv", index=False)

merged_df = pd.read_csv('merged.csv')
merged_df

Unnamed: 0,Ticker name,Ticker symbol,Ticker price,Ticker URL
0,Arrow DWA Tactical: International ETF,DWCR,36.63,https://www.finance.yahoo.com/quote/DWCR?p=DWCR
1,Invesco QQQ Trust,QQQ,373.83,https://www.finance.yahoo.com/quote/QQQ?p=QQQ
2,Arrow DWA Tactical: International ETF,DWCR,36.63,https://www.finance.yahoo.com/quote/DWCR?p=DWCR
3,Invesco QQQ Trust,QQQ,373.83,https://www.finance.yahoo.com/quote/QQQ?p=QQQ
4,Arrow DWA Tactical: International ETF,DWCR,36.63,https://www.finance.yahoo.com/quote/DWCR?p=DWCR
...,...,...,...,...
664,S&P/CLX IPSA,^IPSA,5058.88,https://www.finance.yahoo.com/quote/%5EIPSA?p=...
665,MERVAL,^MERV,38390.84,https://www.finance.yahoo.com/quote/%5EMERV?p=...
666,TA-125,^TA125.TA,1862.53,https://www.finance.yahoo.com/quote/%5ETA125.T...
667,EGX 30 Price Return Index,^CASE30,10996.80,https://www.finance.yahoo.com/quote/%5ECASE30?...
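
The glob pattern `*.csv` used above also matches merged.csv itself once it exists, as well as any test files in the folder, so re-running the cell can fold earlier results back into the merge. The sketch below shows one way to avoid that using only the standard library rather than pandas. `merge_csv_files` and the added 'Market' column are assumptions for illustration, and the two small input files are made up.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_files(folder, output_name="merged.csv"):
    """Merge every CSV in `folder` except the output file; return the row count."""
    folder = Path(folder)
    rows = []
    for csv_path in sorted(folder.glob("*.csv")):
        if csv_path.name == output_name:
            continue  # do not fold an earlier merge result back in
        with open(csv_path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                row["Market"] = csv_path.stem  # remember which file the row came from
                rows.append(row)
    if rows:
        with open(folder / output_name, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()),
                                    restval="", extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
    return len(rows)


with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)
    (folder / "etfs.csv").write_text(
        "Ticker symbol,Ticker price\nQQQ,373.83\n", encoding="utf-8")
    (folder / "cryptocurrencies.csv").write_text(
        "Ticker symbol,Ticker price\nBTC-USD,48643.65\n", encoding="utf-8")
    first_run = merge_csv_files(folder)
    second_run = merge_csv_files(folder)  # merged.csv now exists and is skipped
    with open(folder / "merged.csv", newline="", encoding="utf-8") as f:
        merged_rows = list(csv.DictReader(f))

print(first_run, second_run)  # 2 2
```

Recording the source file in a 'Market' column also makes it possible to tell which category each merged row came from.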


### <font color=green>Conclusion:

* The top tickers for various market categories on the Yahoo Finance website were scraped, and CSV files were created containing the target information.

* Please note that some market categories on the website such as 'Calendars' and 'Currency converter' do not have ticker / row data and therefore cannot be scraped with the above method. 

* The BeautifulSoup package was used in this web-scraping project to parse the HTML data.


### <font color=green>References and Future Work:
* References:
    * Jupyter notebook formatting guide - https://medium.com/analytics-vidhya/the-jupyter-notebook-formatting-guide-873ab39f765e
    * Merging scraped data - https://blog.softhints.com/how-to-merge-multiple-csv-files-with-python/
* Ideas for future work:
    * Scrape more tickers in each category by increasing the view size
    * For cryptocurrencies, get the total volume as well to get a better understanding about its performance
    * The Yahoo Finance pages scraped here were served as static HTML; dynamic scraping with Selenium or other tools needs to be explored
    * Using REST APIs that return JSON is often an easier way to collect web data; a possible next step is collecting data from Twitter to analyze real-time events