# Scraping UK's most active stocks from Yahoo Finance
![.](https://i.imgur.com/0AwqPhP.jpeg)

[Yahoo Finance UK](https://uk.finance.yahoo.com/) provides free stock quotes, up-to-date and premium news and video, portfolio management resources, real-time market data, career tips, and personal finance content.

The page https://uk.finance.yahoo.com/most-active lists the most active stocks on Yahoo Finance for the United Kingdom region. In this project, we'll retrieve information from this page using _web scraping_: the process of automatically extracting information from a website with code and storing it in a structured form.

We'll use the Python libraries [Requests](https://docs.python-requests.org/en/master/) and [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) to scrape data from this page. 

Here's an outline of the steps we'll follow:
1. Download the webpage using `requests`
2. Parse the HTML source code using Beautiful Soup
3. Extract information about the stocks from the page
4. Compile extracted information into Python lists and dictionaries
5. Extract data from multiple pages
6. Save the extracted information to a CSV file.

By the end of the project, we'll create a CSV file in the following format:

```
symbol,name,price,change,change%,market cap,url
SYME.L, Supply@ME Capital plc,0.3316,-0.0024,-0.7186%,107.6M,https://uk.finance.yahoo.com/quote/SYME.L?p=SYME.L
...
```

Before going any further, let's understand a bit more about the fields we are going to extract:

* **symbol** - A stock symbol (also known as a ticker symbol) is a unique series of characters assigned to a security for trading purposes. For example, stocks listed on the New York Stock Exchange (NYSE) can have four or fewer letters, and Nasdaq-listed securities can have up to five characters. Symbols are just a shorthand way of describing a company's stock, so the number of characters carries no significance.
* **name** - The company's name.
* **price** - The amount it would cost to buy one share in a company. The price of a share is not fixed but fluctuates according to market conditions. It will likely increase if the company is perceived to be doing well, or fall if the company isn't meeting expectations.
* **change** - The difference between the current price and the closing price of the previous trading day.
* **change%** - The change expressed as a percentage of the previous day's close. It is signed, so a negative value means the price has fallen since that close.
* **market cap** - Market capitalization is the total market value of a company's outstanding shares of stock. It is calculated by multiplying the total number of a company's outstanding shares by the current market price of one share.
* **url** - The web address (Uniform Resource Locator) of the stock's quote page on Yahoo Finance.
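
To make the relationship between **price**, **change** and **change%** concrete, here is a small worked example using the figures from the sample CSV row above. The helper name `percent_change` is our own (it is not part of any library), and the sketch assumes that the change is measured against the previous close.

```python
def percent_change(price, change):
    """Return the change as a percentage of the previous close."""
    # By definition, change = price - previous_close
    previous_close = price - change
    return change / previous_close * 100


# Figures from the sample row for SYME.L: price 0.3316, change -0.0024
pct = percent_change(0.3316, -0.0024)
print(f"{pct:.4f}%")  # -0.7186%
```

The result matches the `change%` value shown in the sample row.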



## How to run the code
You can execute the code using the 'Run' button at the top of this page and selecting 'Run on Binder'. You can make changes and save your own version of the notebook to [Jovian](https://jovian.ai/) by executing the following cells:

In [3]:
!pip install jovian --upgrade --quiet

In [4]:
import jovian

In [5]:
# Execute this to save new versions of the notebook
jovian.commit(project="final")

[jovian] Detected Colab notebook...
[jovian] Please enter your API key ( from https://jovian.ai/ ):
API KEY: ··········
[jovian] Uploading colab notebook to Jovian...
Committed successfully! https://jovian.ai/darshandesai/yahoo-web-scraping-project-final


'https://jovian.ai/darshandesai/yahoo-web-scraping-project-final'

## Download the webpage using `requests`

We'll use the `requests` library to download the web page.

The library can be installed using pip.

In [6]:
!pip install requests==2.23.0 --upgrade --quiet

In [7]:
import requests

The library is now installed and imported.

To download a page, we can use the `get` function from `requests`.

In [8]:
act_stocks_url = 'https://uk.finance.yahoo.com/most-active'
response = requests.get(act_stocks_url)

`requests.get` returns a response object containing the data from the web page and some other information.

The `.status_code` property can be used to check if the response was successful. A successful response will have an [HTTP status code](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) between 200 and 299.

In [9]:
response.status_code

200
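
As a minimal sketch of the check described above, the helper below tests whether a status code falls in the 200 to 299 range. The name `is_success` is hypothetical and chosen only for illustration. Note that the `response.ok` property in `requests` is slightly more lenient: it is `True` for any status code below 400.

```python
def is_success(status_code):
    """Return True for HTTP status codes in the 2xx (success) range."""
    return 200 <= status_code <= 299


# Try the helper on a few common status codes
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```

With a real response object, this would be called as `is_success(response.status_code)`.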

The request was successful! We can get the contents of the page using `response.text`.

In [10]:
page_content = response.text

Let's check the number of characters on the page.

In [11]:
len(page_content)

1009159

The page contains over **1,000,000** characters!

Here are the first 1000 characters of the page:

In [12]:
page_content[:1000]

'<!DOCTYPE html><html data-color-theme="light" id="atomic" class="NoJs chrome desktop failsafe" lang="en-GB"><head prefix="og: http://ogp.me/ns#"><script>window.performance && window.performance.mark && window.performance.mark(\'PageStart\');</script><meta charset="utf-8" /><title>Most active stocks today – Yahoo Finance</title><meta name="keywords" content="Stock screener, industry, index membership, share data, stock price, market cap, beta, sales, profitability, valuation ratios, analyst estimates, large cap value, bargain growth, preset stock screens" /><meta http-equiv="x-dns-prefetch-control" content="on" /><meta property="twitter:dnt" content="on" /><meta property="fb:app_id" content="115060728528067" /><meta name="theme-color" content="#400090" /><meta name="viewport" content="width=device-width, initial-scale=1" /><meta name="description" lang="en-GB" content="See a list of the most active stocks today, including share price change and percentage, trading volume, intra-day hig

What we're looking at above is the [HTML source code](https://simple.wikipedia.org/wiki/HTML) of the web page.

We can also save it to a file and view the page locally within Jupyter using "File > Open".

In [13]:
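# Write with an explicit encoding so non-ASCII characters on the page are saved correctly
with open('webpage.html', 'w', encoding='utf-8') as f:
    f.write(page_content)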

The preview looks similar to the original page, but none of the links work.

![.](https://i.imgur.com/jUCnw2S.png)

We have successfully downloaded the web page using requests.

## Parse the HTML source code using beautiful soup




First, we'll install the `beautifulsoup4` package and import the `BeautifulSoup` class from its `bs4` module.

In [14]:
!pip install beautifulsoup4 --upgrade --quiet


In [15]:
from bs4 import BeautifulSoup

We've installed the library and imported `BeautifulSoup`.

Now, we're going to create a parsed document, `doc`, from the page content.

In [16]:
doc = BeautifulSoup(response.text, 'html.parser')

In [17]:
doc.find('title')

<title>Most active stocks today – Yahoo Finance</title>
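
To see what `find` returns without downloading anything, here is a minimal sketch that parses a tiny hand-written HTML string. The HTML snippet is made up for illustration and is not taken from the Yahoo Finance page.

```python
from bs4 import BeautifulSoup

# A tiny, hand-written HTML document (illustrative only)
html = (
    '<html><head><title>Most active stocks today</title></head>'
    '<body><p class="note">Hello</p></body></html>'
)

sample_doc = BeautifulSoup(html, 'html.parser')

title_tag = sample_doc.find('title')   # a Tag object, printed as HTML
title_text = title_tag.text            # just the text inside the tag
note_text = sample_doc.find('p', class_='note').text

print(title_tag)
print(title_text)
print(note_text)
```

`find` returns the first matching tag (or `None` if nothing matches), and `.text` gives the text inside it.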

Let's create a function to download a page using `requests` and parse it using Beautiful Soup.

In [18]:
def get_page(url):
    """Download a web page and return a Beautiful Soup doc."""
    # Download the page
    response = requests.get(url)

    # Check if the download was successful and, if not, raise an exception
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))

    # Parse the page HTML into a bs4 doc
    doc = BeautifulSoup(response.text, 'html.parser')
    return doc

In [19]:
doc = get_page(act_stocks_url)

In [20]:
doc.find('title')

<title>Most active stocks today – Yahoo Finance</title>

In the above section, we parsed the document using the Beautiful Soup library and created a function, **`get_page`**, that downloads any web page using `requests` and parses it using Beautiful Soup.

## Extract required information from the page

Let's create functions to extract **symbol, name, price, change, change%, market cap, url** fields from the page

### Stock Symbols

In [21]:
def get_stock_symbols(doc):
    stk_table = doc.find('table', class_='W(100%)')
    # Every row of the table body describes one stock
    rows = stk_table.find('tbody').find_all('tr')
    # The symbol is in the first cell of each row
    return [row.find('td').text for row in rows]

Now that we have defined the `get_stock_symbols` function, let's use it to get the symbols from the page.

In [22]:
symbols = get_stock_symbols(doc)

In [23]:
symbols[:5]

['3LRR.L', 'RBD.L', 'PREM.L', '0RTY.IL', 'UKOG.L']

In [24]:
len(symbols)

25

We have extracted 25 symbols from the page.

Similarly, we'll create functions for the remaining fields below.

### Stock names

In [25]:
def get_stock_names(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    # The name cell is identified by its exact class attribute
    return [row.find('td', class_='Va(m) Ta(start) Px(10px) Fz(s)').text for row in rows]

In [26]:
names = get_stock_names(doc)

In [27]:
names[:5]

['GraniteShares 3x Long Rolls-Royce Daily ETC',
 'Reabold Resources plc',
 'Premier African Minerals Limited',
 'Piraeus Financial Holdings S.A.',
 'UK Oil & Gas Investments PLC']

### Stock prices

In [28]:
def get_stock_prices(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    # The price cell is identified by its aria-label attribute
    return [row.find('td', {'aria-label': 'Price (intraday)'}).text for row in rows]


In [29]:
prices = get_stock_prices(doc)

### Stock price changes

In [30]:
def get_stock_changes(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td', {'aria-label': 'Change'}).text for row in rows]

In [31]:
changes = get_stock_changes(doc)

### Stock % changes

In [32]:
def get_stock_percnt_changes(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td', {'aria-label': '% change'}).text for row in rows]


In [33]:
percnt_changes = get_stock_percnt_changes(doc)

### Stock Market cap

In [34]:
def get_stock_mkt_caps(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td', {'aria-label': 'Market cap'}).text for row in rows]

In [35]:
mkt_caps = get_stock_mkt_caps(doc)

### Stock urls

In [36]:
def get_stock_urls(doc):
    stk_table = doc.find('table', class_='W(100%)')
    base_url = 'https://uk.finance.yahoo.com'
    rows = stk_table.find('tbody').find_all('tr')
    # The links in the table are relative, so prepend the site's base URL
    return [base_url + row.find('a', href=True)['href'] for row in rows]

In [37]:
stock_urls = get_stock_urls(doc)

In [38]:
stock_urls

['https://uk.finance.yahoo.com/quote/3LRR.L?p=3LRR.L',
 'https://uk.finance.yahoo.com/quote/RBD.L?p=RBD.L',
 'https://uk.finance.yahoo.com/quote/PREM.L?p=PREM.L',
 'https://uk.finance.yahoo.com/quote/0RTY.IL?p=0RTY.IL',
 'https://uk.finance.yahoo.com/quote/UKOG.L?p=UKOG.L',
 'https://uk.finance.yahoo.com/quote/3NGS.L?p=3NGS.L',
 'https://uk.finance.yahoo.com/quote/BOIL.L?p=BOIL.L',
 'https://uk.finance.yahoo.com/quote/CLON.L?p=CLON.L',
 'https://uk.finance.yahoo.com/quote/KZG.L?p=KZG.L',
 'https://uk.finance.yahoo.com/quote/VAST.L?p=VAST.L',
 'https://uk.finance.yahoo.com/quote/LLOY.L?p=LLOY.L',
 'https://uk.finance.yahoo.com/quote/0MPP.IL?p=0MPP.IL',
 'https://uk.finance.yahoo.com/quote/MSMN.L?p=MSMN.L',
 'https://uk.finance.yahoo.com/quote/EQT.L?p=EQT.L',
 'https://uk.finance.yahoo.com/quote/UPL.L?p=UPL.L',
 'https://uk.finance.yahoo.com/quote/VOD.L?p=VOD.L',
 'https://uk.finance.yahoo.com/quote/HUR.L?p=HUR.L',
 'https://uk.finance.yahoo.com/quote/POG.L?p=POG.L',
 'https://uk.finance
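
Joining the base URL and the relative link with `+` only works if exactly one of them supplies the `/` between them. As an alternative sketch (not what the notebook uses), the standard library's `urllib.parse.urljoin` produces the same result whether or not the base URL ends with a slash. The `href` value below is an assumed example of a relative link, modelled on the URLs in the output above.

```python
from urllib.parse import urljoin

base_url = 'https://uk.finance.yahoo.com'
href = '/quote/VOD.L?p=VOD.L'  # assumed example of a relative link from the table

# urljoin handles the slash between the two parts for us
url_without_slash = urljoin(base_url, href)
url_with_slash = urljoin(base_url + '/', href)

print(url_without_slash)
print(url_with_slash)
```

Both calls print `https://uk.finance.yahoo.com/quote/VOD.L?p=VOD.L`, with no doubled slash.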

In the above section, we created the functions required to extract the necessary fields from the page. In the next section, we're going to compile the extracted information into a dictionary.
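
The functions above each walk the table once per field. As an alternative sketch, the rows can also be read in a single pass, producing one dictionary per stock. The HTML below is a simplified, hand-written stand-in for the real table (it only reuses the class names and `aria-label` values seen in the functions above), and `parse_stock_rows` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the real table markup (an assumption for illustration)
SAMPLE_HTML = """
<table class="W(100%)">
  <tbody>
    <tr>
      <td><a href="/quote/RBD.L?p=RBD.L">RBD.L</a></td>
      <td class="Va(m) Ta(start) Px(10px) Fz(s)">Reabold Resources plc</td>
      <td aria-label="Price (intraday)">0.289</td>
      <td aria-label="Change">-0.001</td>
      <td aria-label="% change">-0.34%</td>
      <td aria-label="Market cap">25.807M</td>
    </tr>
  </tbody>
</table>
"""

BASE_URL = 'https://uk.finance.yahoo.com'


def parse_stock_rows(doc):
    """Return a list with one dictionary per table row."""
    table = doc.find('table', class_='W(100%)')
    records = []
    for row in table.find('tbody').find_all('tr'):
        records.append({
            'symbol': row.find('td').text,
            'name': row.find('td', class_='Va(m) Ta(start) Px(10px) Fz(s)').text,
            'price': row.find('td', {'aria-label': 'Price (intraday)'}).text,
            'change': row.find('td', {'aria-label': 'Change'}).text,
            '% change': row.find('td', {'aria-label': '% change'}).text,
            'market cap': row.find('td', {'aria-label': 'Market cap'}).text,
            'url': BASE_URL + row.find('a', href=True)['href'],
        })
    return records


records = parse_stock_rows(BeautifulSoup(SAMPLE_HTML, 'html.parser'))
print(records[0])
```

A list of dictionaries like this can be passed directly to `pd.DataFrame`, and it keeps all fields of a row together, so a missing cell cannot shift one column relative to the others.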

## Compile extracted information into a dictionary

In [39]:
stocks_data = {
    'symbol': symbols,
    'name': names,
    'price': prices,
    'change':changes,
    '% change':percnt_changes,
    'market cap':mkt_caps,
    'url':stock_urls
}


We've created a `stocks_data` dictionary from the extracted information. Now we'll use the pandas library to create a data frame from it.

Let's install and import pandas as pd

In [40]:
!pip install pandas==1.1.0 --upgrade --quiet


In [41]:
import pandas as pd

In [42]:
pd.DataFrame(stocks_data)

| | symbol | name | price | change | % change | market cap | url |
|---|---|---|---|---|---|---|---|
| 0 | 3LRR.L | GraniteShares 3x Long Rolls-Royce Daily ETC | 0.1691 | -0.0212 | -11.14% | | https://uk.finance.yahoo.com/quote/3LRR.L?p=3L... |
| 1 | RBD.L | Reabold Resources plc | 0.289 | -0.001 | -0.34% | 25.807M | https://uk.finance.yahoo.com/quote/RBD.L?p=RBD.L |
| 2 | PREM.L | Premier African Minerals Limited | 0.3284 | -0.0016 | -0.48% | 73.621M | https://uk.finance.yahoo.com/quote/PREM.L?p=PR... |
| 3 | 0RTY.IL | Piraeus Financial Holdings S.A. | 0.9846 | -0.5354 | -35.22% | 1.231B | https://uk.finance.yahoo.com/quote/0RTY.IL?p=0... |
| 4 | UKOG.L | UK Oil & Gas Investments PLC | 0.117 | -0.0055 | -4.49% | 19M | https://uk.finance.yahoo.com/quote/UKOG.L?p=UK... |
| 5 | 3NGS.L | WisdomTree Natural Gas 3x Daily Short | 0.0326 | 0.0014 | +4.32% | | https://uk.finance.yahoo.com/quote/3NGS.L?p=3N... |
| 6 | BOIL.L | Baron Oil Plc | 0.0735 | 0.0005 | +0.68% | 10.535M | https://uk.finance.yahoo.com/quote/BOIL.L?p=BO... |
| 7 | CLON.L | Clontarf Energy plc | 0.0608 | -0.0067 | -9.93% | 1.441M | https://uk.finance.yahoo.com/quote/CLON.L?p=CL... |
| 8 | KZG.L | Kazera Global plc | 0.925 | 0.0 | 0.00% | 8.669M | https://uk.finance.yahoo.com/quote/KZG.L?p=KZG.L |
| 9 | VAST.L | Vast Resources plc | 0.78 | -0.02 | -2.50% | 11.432M | https://uk.finance.yahoo.com/quote/VAST.L?p=VA... |


We used pandas to create a dataframe in the above section.
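
The market cap values are scraped as text such as `25.807M` or `1.231B`, which cannot be sorted or summed as numbers. Here is a minimal sketch of converting them, assuming the suffixes are plain thousand/million/billion/trillion multipliers. Only `M` and `B` appear in the output above; `K`, `T` and the `N/A` placeholder are assumptions, and `parse_market_cap` is a hypothetical helper name.

```python
from decimal import Decimal

# Suffix multipliers; K and T are assumed, only M and B appear in the data above
MULTIPLIERS = {'K': 10**3, 'M': 10**6, 'B': 10**9, 'T': 10**12}


def parse_market_cap(text):
    """Convert a string like '25.807M' to an integer, or None if it is missing."""
    text = text.strip().replace(',', '')
    if text in ('', 'N/A'):
        return None
    suffix = text[-1].upper()
    if suffix in MULTIPLIERS:
        # Decimal avoids floating-point rounding errors such as 25807000.000000004
        return int(Decimal(text[:-1]) * MULTIPLIERS[suffix])
    return int(Decimal(text))


for value in ('25.807M', '1.231B', '19M', ''):
    print(repr(value), '->', parse_market_cap(value))
```

With a data frame, this could be applied as `df['market cap'].apply(parse_market_cap)`.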


## Getting information out of a stock page

In [43]:
stock_page_url = stock_urls[0]

In [44]:
stock_page_url

'https://uk.finance.yahoo.com/quote/3LRR.L?p=3LRR.L'

### Install Selenium

In [45]:
!pip install selenium --quiet

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
requests 2.23.0 requires urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1, but you have urllib3 1.26.9 which is incompatible.
datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.

In [46]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

In [47]:
driver = webdriver.Chrome()

WebDriverException: ignored

## Extract and combine data from multiple pages

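Now, let's create a function to get the required information from multiple pages. Each page shows 25 stocks, and the `offset` parameter in the URL selects where the page starts, so the function's `page_number` argument is really a row offset (0, 25, 50, and so on).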

In [None]:
def get_page_stocks(page_number):
    url = 'https://uk.finance.yahoo.com/most-active?count=25&offset=' + str(page_number)
    doc = get_page(url)
    symbols = get_stock_symbols(doc)
    names = get_stock_names(doc)
    prices = get_stock_prices(doc)
    changes = get_stock_changes(doc)
    percnt_changes = get_stock_percnt_changes(doc)
    market_caps = get_stock_mkt_caps(doc)
    urls = get_stock_urls(doc)
    return symbols, names, prices, changes, percnt_changes, market_caps, urls

In [None]:
all_symbols, all_names, all_prices, all_changes, all_percnt_changes, all_mkt_caps, all_urls = [],[],[],[],[],[],[]

for page_number in range(0,200,25):
    symbols, names, prices, changes, percnt_changes, market_caps, urls = get_page_stocks(page_number)
    all_symbols += symbols
    all_names += names
    all_prices += prices
    all_changes += changes
    all_percnt_changes += percnt_changes
    all_mkt_caps += market_caps
    all_urls += urls


In the above cell, we use a `for` loop to iterate through the pages, extract the required data from each one, and append it to the combined lists.
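
To see which pages the loop requests, here is a small sketch that only builds the URLs, without downloading anything. The `page_urls` helper is hypothetical. The URL pattern and the page size of 25 come from `get_page_stocks` above.

```python
BASE_URL = 'https://uk.finance.yahoo.com/most-active'
PAGE_SIZE = 25


def page_urls(total_rows, page_size=PAGE_SIZE):
    """Build one URL per page, stepping the offset by the page size."""
    return [
        f'{BASE_URL}?count={page_size}&offset={offset}'
        for offset in range(0, total_rows, page_size)
    ]


urls = page_urls(200)
print(len(urls))   # 8
print(urls[0])
print(urls[-1])
```

`range(0, 200, 25)` yields the offsets 0, 25, ..., 175, so the loop makes 8 requests of 25 rows each.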

Now, let's create one dictionary to store data from all the pages.

In [None]:
stocks_all_pages = {
    'symbol': all_symbols,
    'name': all_names,
    'price': all_prices,
    'change': all_changes,
    '% change': all_percnt_changes,
    'market cap': all_mkt_caps,
    'url': all_urls
}

In [None]:
dataframe = pd.DataFrame(stocks_all_pages)

In [None]:
dataframe

Here's a preview of the first five and last five rows of the extracted data:

In [None]:
dataframe.head()

In [None]:
dataframe.tail()

In the above section, we extracted data from multiple pages, combined it into one dictionary, and created a data frame from it. With `range(0, 200, 25)`, the loop requests 8 pages of 25 rows each, so the data frame holds up to 200 rows and 7 columns.

## Save the extracted information to a CSV file.

In [None]:
# index=False leaves the row numbers out of the file
dataframe.to_csv('active_stocks.csv', index=False)

In [None]:
!head active_stocks.csv

We have saved the extracted information to a CSV file, in the format defined at the beginning of the project.
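
For comparison, the same kind of file can be written without pandas, using Python's built-in `csv` module. This is only a sketch of an alternative, not what the notebook does. The two records are copied from the data frame preview earlier, and the output goes to an in-memory buffer instead of a file so the result can be printed directly.

```python
import csv
import io

fieldnames = ['symbol', 'name', 'price', 'change', '% change', 'market cap', 'url']

# Two rows copied from the data frame preview
records = [
    {'symbol': 'RBD.L', 'name': 'Reabold Resources plc', 'price': '0.289',
     'change': '-0.001', '% change': '-0.34%', 'market cap': '25.807M',
     'url': 'https://uk.finance.yahoo.com/quote/RBD.L?p=RBD.L'},
    {'symbol': 'KZG.L', 'name': 'Kazera Global plc', 'price': '0.925',
     'change': '0.0', '% change': '0.00%', 'market cap': '8.669M',
     'url': 'https://uk.finance.yahoo.com/quote/KZG.L?p=KZG.L'},
]

# Write to an in-memory buffer; use open('active_stocks.csv', 'w', newline='') for a real file
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

The first line of the output is the header row, followed by one line per record.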

In [None]:
import jovian

In [None]:
jovian.commit(files=['active_stocks.csv'])

## Summary

Here's what we've covered in this notebook:

1. Download the webpage using `requests`
2. Parse the HTML source code using Beautiful Soup
3. Extract information about the stocks from the page
4. Compile extracted information into Python lists and dictionaries
5. Extract data from multiple pages (8 pages of 25 rows, so up to 200 rows with 7 columns)
6. Save the extracted information to a CSV file.

The CSV file we created has this format:

![.](https://i.imgur.com/4ZtVtPc.png)


Here's the complete code for this project:

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_page(url):
    """Download a web page and return a Beautiful Soup doc."""
    # Download the page
    response = requests.get(url)

    # Check if the download was successful
    if response.status_code != 200:
        raise Exception('Unable to download page {}'.format(url))

    # Parse the page HTML into a bs4 doc
    return BeautifulSoup(response.text, 'html.parser')

def get_stock_symbols(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td').text for row in rows]

def get_stock_names(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td', class_='Va(m) Ta(start) Px(10px) Fz(s)').text for row in rows]

def get_stock_prices(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td', {'aria-label': 'Price (intraday)'}).text for row in rows]

def get_stock_changes(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td', {'aria-label': 'Change'}).text for row in rows]

def get_stock_percnt_changes(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td', {'aria-label': '% change'}).text for row in rows]

def get_stock_mkt_caps(doc):
    stk_table = doc.find('table', class_='W(100%)')
    rows = stk_table.find('tbody').find_all('tr')
    return [row.find('td', {'aria-label': 'Market cap'}).text for row in rows]

def get_stock_urls(doc):
    stk_table = doc.find('table', class_='W(100%)')
    # No trailing slash: the links in the table already start with '/'
    base_url = 'https://uk.finance.yahoo.com'
    rows = stk_table.find('tbody').find_all('tr')
    return [base_url + row.find('a', href=True)['href'] for row in rows]

def get_page_stocks(page_number):
    # page_number is the row offset of the page (0, 25, 50, ...)
    url = 'https://uk.finance.yahoo.com/most-active?count=25&offset=' + str(page_number)
    doc = get_page(url)
    symbols = get_stock_symbols(doc)
    names = get_stock_names(doc)
    prices = get_stock_prices(doc)
    changes = get_stock_changes(doc)
    percnt_changes = get_stock_percnt_changes(doc)
    market_caps = get_stock_mkt_caps(doc)
    urls = get_stock_urls(doc)
    return symbols, names, prices, changes, percnt_changes, market_caps, urls

# Start with empty lists and extend them with the data from each page
all_symbols, all_names, all_prices, all_changes, all_percnt_changes, all_mkt_caps, all_urls = [],[],[],[],[],[],[]
for page_number in range(0,200,25):
    symbols, names, prices, changes, percnt_changes, market_caps, urls = get_page_stocks(page_number)
    all_symbols += symbols
    all_names += names
    all_prices += prices
    all_changes += changes
    all_percnt_changes += percnt_changes
    all_mkt_caps += market_caps
    all_urls += urls

stocks_all_pages = {
    'symbol': all_symbols,
    'name': all_names,
    'price': all_prices,
    'change': all_changes,
    '% change': all_percnt_changes,
    'market cap': all_mkt_caps,
    'url': all_urls
}

dataframe = pd.DataFrame(stocks_all_pages)
dataframe.to_csv('active_stocks.csv', index=False)

## Future Work

- Add a company profile
- Extract additional fields available on the pages, such as volume, average volume (3-month), P/E ratio and 52-week range.
- Get information from each individual stock page

## References

1. https://jovian.ai/learn/zero-to-data-analyst-bootcamp/lesson/web-scraping-and-rest-apis
2. https://docs.python-requests.org/en/master/
3. https://beautiful-soup-4.readthedocs.io/en/latest/
4. https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
5. https://simple.wikipedia.org/wiki/HTML
    

In [None]:
jovian.submit(assignment="zerotoanalyst-project1")