# Scraping data on the world's top 500 companies by `market capitalization`

![image](https://i.imgur.com/1UjHN2u.jpeg)

All data scientists and data analysts need some **data** to begin their research or analysis. The questions that come to mind are: "How is the data prepared, where can we get this much data, and how do we extract exactly the data we need?"

The answer is:
> There are several ways to collect this data, and one of them is `Web Scraping`: `"Web scraping is a technique used to extract structured data from websites containing unstructured data through an automated process."`


Several tools are available for web scraping, and Python comes to the rescue with some great libraries that automate the work and make it easier.
> These libraries include: `requests`, `lxml`, `BeautifulSoup`, `scrapy`, `selenium`.

In this notebook I've used the `requests` and `BeautifulSoup` libraries to scrape data from the website [value.today](https://www.value.today/world/world-top-500-companies).

Value.Today is a software analytics company that provides information on the world's top corporate companies, company financial data, and world financial news.

The dataset created in this notebook will be useful for a comparative study of these 500 companies. You can easily study the rise and fall of these companies roughly every six months since 2020.

### Here is an outline for the full journey of `Web-Scraping`
* First of all, we import all the useful libraries that we are going to use in this notebook
* Download and parse the `html data` using the `requests` and `BeautifulSoup` libraries
* Find all the useful `tags` and the data inside those tags
* After collecting all the data, create a `dataframe` using the pandas library, and also write a `CSV file` to save the data


#### Note
* Not every website allows its data to be scraped; some websites allow their data to be used for research and study purposes only.
* Some websites provide their data in CSV format or let you retrieve it through a [REST API](https://www.redhat.com/en/topics/api/what-is-a-rest-api).
* Before scraping any website, we should look for a terms and conditions page to see if there are explicit rules about scraping. If there are, we should follow them. If there are not, then it becomes more of a judgement call.


* You can run the cells by clicking the `Run` button available in the toolbar, or you can just press `Shift + Enter` to run the cell.
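
Alongside the terms and conditions mentioned in the note above, many sites publish a `robots.txt` file that states which paths automated clients may fetch. The sketch below uses Python's standard `urllib.robotparser` to check a URL against such rules. The rules in `SAMPLE_ROBOTS_TXT` and the `example.com` URLs are made up for illustration and are not the actual rules of value.today; `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/world/top-500"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/admin/settings"))
```

For a live site, you would point the parser at the real file with `parser.set_url(...)` followed by `parser.read()` instead of calling `parse` on a string.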

In [1]:
!pip install jovian --upgrade --quiet

In [2]:
import jovian

In [3]:
jovian.commit(project="web-scraping-final")

[jovian] Attempting to save notebook..
[jovian] Updating notebook "thakubhai-007/web-scraping-final" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/thakubhai-007/web-scraping-final


'https://jovian.ai/thakubhai-007/web-scraping-final'

## Importing the webpage with the `requests` library

In this section we import all the libraries used in this notebook and then download the webpage using the `requests` library.

In [4]:
!pip install jovian --upgrade --quiet
!pip install beautifulsoup4 --upgrade --quiet
!pip install pandas --upgrade --quiet
!pip install requests --upgrade --quiet
import pandas as pd
import jovian
from bs4 import BeautifulSoup
import requests
import time

Let's download the webpage we are going to scrape. It is always useful to put the link in a variable: if you ever need to change the link, you only have to update the variable and the change applies to the entire notebook, instead of having to go through the notebook and edit every occurrence.

In [5]:
topic_url = "https://www.value.today/world/world-top-500-companies"
response = requests.get(topic_url)

Let's check the status code for the page that we have downloaded above.

The status code tells us whether the request succeeded, and therefore whether we have a page to scrape. A successful status code lies between 200 and 299. You can find the list of HTTP status codes [here](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status).

In [6]:
# Check the status code
response.status_code

200
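
The 200 to 299 rule can be written as a small check. This is a minimal sketch; `is_success` is a hypothetical helper, not part of `requests`. On a real `requests` response, `response.ok` and `response.raise_for_status()` provide built-in checks along similar lines, although they only treat status codes of 400 and above as failures.

```python
def is_success(status_code):
    """Return True when an HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code <= 299


# Try the check on a few common status codes.
for code in (200, 204, 301, 404, 500):
    print(code, is_success(code))
```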

Let's save our page content in a variable and check the length of the entire HTML source of the webpage.

In [7]:
page_content = response.text

In [8]:
len(page_content) # Checking the length of the page_content

1559705

As we can see, the length of page_content is more than 1.5 million characters, which means the webpage we downloaded contains a lot of markup. Let's have a look at some of the page_content below.

In [9]:
# Let's have a look at the first 1000 characters of the page's HTML source
page_content[:1000]

'<!DOCTYPE html>\n<html lang="en" dir="ltr" prefix="content: http://purl.org/rss/1.0/modules/content/  dc: http://purl.org/dc/terms/  foaf: http://xmlns.com/foaf/0.1/  og: http://ogp.me/ns#  rdfs: http://www.w3.org/2000/01/rdf-schema#  schema: http://schema.org/  sioc: http://rdfs.org/sioc/ns#  sioct: http://rdfs.org/sioc/types#  skos: http://www.w3.org/2004/02/skos/core#  xsd: http://www.w3.org/2001/XMLSchema# ">\n  <head>\n    <meta charset="utf-8"/>\n<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>\n<script>(adsbygoogle=window.adsbygoogle||[]).push({google_ad_client:"ca-pub-2407955258669770",enable_page_level_ads:true});</script><script>window.google_analytics_uacct="UA-121331115-1";(function(i,s,o,g,r,a,m){i["GoogleAnalyticsObject"]=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,"script","https

In the output above you can see the raw [HTML](https://www.quackit.com/html/examples/html_text_examples.cfm) source of the page.

## Parsing the HTML source code using BeautifulSoup

With the code below we save a copy of the webpage that we downloaded above. You can find this copy from the notebook menu:
> File/Open: this opens a new tab where you will find an HTML file named [word's-top-500-companies-by-narket-capitalization.html](https://hub.binder.jovian.ml/user/thakubhai-007/api-git-0b8226b-976aef155822_37-7ggm32d3/view/word's-top-500-companies-by-narket-capitalization.html)

In [10]:
with open("word's-top-500-companies-by-narket-capitalization.html", 'w', encoding='utf-8') as file:
    file.write(page_content)

In [11]:
# Read back the HTML file created above so that we can extract the data we need from it.
with open("word's-top-500-companies-by-narket-capitalization.html", 'r', encoding='utf-8') as f:
    html_source = f.read()

Let's turn it into a BeautifulSoup object so that we can work with it using the BS4 library, and then verify the result by checking the type of `doc`.

In [12]:
doc = BeautifulSoup(html_source, 'html.parser')

In [13]:
type(doc)

bs4.BeautifulSoup

Let's find the title tag and have a look at the page title.
> "Tag" is a new term that I've used here; I'll explain it in the next section.

In [14]:
title_tag = doc.title
title_tag

<title>World Top 500 Companies by Market Capitalization as on Jan 1st, 2020</title>

In [15]:
title_tag.text

'World Top 500 Companies by Market Capitalization as on Jan 1st, 2020'

In [16]:
# Before going further, first save our work because we are doing all this on an online platform
jovian.commit(project="web-scraping-final")



'https://jovian.ai/thakubhai-007/web-scraping-final'


## Extracting information from the webpage with the `BeautifulSoup` library

![image](https://i.imgur.com/fUH2KAg.gif)

All the data on a webpage is written in [HTML (HyperText Markup Language)](https://devdocs.io/html/), which uses tags to hold all the information on a webpage. So, in this section we will fetch the useful tags and the information we need for our project.

You can find the tags on any web page with the following process:
> Right-click on the page, or on any text whose tag or source code you want to check, then click `Inspect`. This opens a panel in your browser; depending on the browser's settings it appears at the bottom, left, or right.

Here is a sample picture of the source code.
![Source code image](https://i.imgur.com/japh8JV.png)

In [17]:
company_block_tag = doc.find_all('div', {'class':'row well views-row'})

In [18]:
len(company_block_tag)

500

In [19]:
# Find the tag containing each company's name and collect the names
companies_name = []
for tag in company_block_tag:
    company_name_tag = tag.find('div',{'class':'views-field views-field-title col-sm-12 clearfix'})
    companies_name.append(company_name_tag.find('a').text)

In [20]:
# Look at the first 5 company names to check the extraction, and check the list length to confirm we got every name on the webpage
print(companies_name[:5])
len(companies_name)

['SAUDI ARABIAN OIL COMPANY (Saudi Aramco)', 'APPLE', 'MICROSOFT CORPORATION', 'ALPHABET', 'AMAZON.COM']


500

Yes, we have extracted exactly 500 company names. `We have reached the first milestone in our journey`.
Let's write a few separate pieces of code to extract the `Rank`, `Headquarter Location`, `CEO`, etc. that we need to save for our study.

In [21]:
rank = []
for tag in company_block_tag:
    rank_tag = tag.find('div', {'class':'views-field views-field-field-world-rank-jan-2020 clearfix col-sm-12'})
    rank.append(int(rank_tag.find('span').text))

hq_location = []
for tag in company_block_tag:
    hq_tag = tag.find('div', {'class':'views-field views-field-field-headquarters-of-company clearfix col-sm-12'})
    hq_location.append(hq_tag.find('span').text)

ceo_name = []
for tag in company_block_tag:
    ceo_tag = tag.find('div', {'class':'views-field views-field-field-ceo clearfix col-sm-12'})
    try:
        ceo = ceo_tag.find('span', {'class':'field-content'})
        ceo_name.append(ceo.text)
    except AttributeError:
        ceo_name.append(None)
        
market_cap = []
for tag in company_block_tag:
    market_cap_tag = tag.find('div', {'class':'views-field views-field-field-market-value-jan-2020 clearfix col-sm-12'})
    try:
        market_cap_value = market_cap_tag.find('span', {'class':'field-content'})
        market_cap.append(market_cap_value.text)
    except AttributeError:
        market_cap.append(None)
        
total_employee = []
for tag in company_block_tag:
    employee_tag = tag.find('div',{'class':'views-field views-field-field-employee-count clearfix col-sm-12'})
    try:
        total_employee.append(employee_tag.find('span').text)
    except AttributeError:
        total_employee.append(None)
        
sectors = []
for tag in company_block_tag:
    sector_tag = tag.find('div', {'class':'views-field views-field-field-company-category-primary clearfix col-sm-12'})
    sectors.append(sector_tag.find('span').text)

In [22]:
# Check that every list has exactly 500 entries, one for each company.
print(len(rank),len(ceo_name),len(market_cap),len(hq_location), len(total_employee), len(sectors))

500 500 500 500 500 500


In [23]:
# Extract the URL for the company page
base_url = 'https://value.today'
url = []
for tag in company_block_tag:
    url_tag = tag.find('div',{'class':'views-field views-field-title col-sm-12 clearfix'})
    url.append(base_url + url_tag.find('a')['href'])
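
String concatenation works here as long as every `href` is a site-relative path such as `/company/apple`. As a slightly more defensive alternative, `urllib.parse.urljoin` from the standard library also copes with links that are already absolute. The sketch below is an assumption about how you might want to handle such cases, not something this page is known to require; the `hrefs` values are illustrative.

```python
from urllib.parse import urljoin

base_url = "https://value.today"

# Illustrative href values: one site-relative path and one absolute URL.
hrefs = ["/company/apple", "https://value.today/company/alphabet"]

# urljoin resolves relative paths against base_url and leaves absolute URLs alone.
absolute_urls = [urljoin(base_url, href) for href in hrefs]
print(absolute_urls)
```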

## Save all the information to a `CSV` file

To create a pandas dataframe, we first collect all the information in a dictionary of lists, which `pd.DataFrame` can convert directly. After that we create a `CSV` file so the data can be downloaded locally or used for research.

In [24]:
companies_name_dict = {
    'Name': companies_name,
    'Rank': rank,
    'Headquarter': hq_location,
    'CEO': ceo_name,
    'Market Capitalization': market_cap,
    'Total No. Of Employee': total_employee,
    'Sectors': sectors,
    'url': url
}

In [25]:
companies_df = pd.DataFrame(companies_name_dict)

In [26]:
# Have a look on the dataframe we have created
companies_df

Unnamed: 0,Name,Rank,Headquarter,CEO,Market Capitalization,Total No. Of Employee,Sectors,url
0,SAUDI ARABIAN OIL COMPANY (Saudi Aramco),1,Saudi Arabia,Amin H. Al-Nasser,1898.10 Billion USD,79000,"Energy, Oil and Gas, Chemicals, Oil Refining, ...",https://value.today/company/saudi-arabian-oil-...
1,APPLE,2,USA,Tim Cook,1323.00 Billion USD,147000,"Technology, Mobiles & Accessories, Electronics...",https://value.today/company/apple
2,MICROSOFT CORPORATION,3,USA,Satya Nadella,1215.00 Billion USD,156439,"Technology, Software and IT, Laptops, Video Ga...",https://value.today/company/microsoft-corporation
3,ALPHABET,4,USA,Sundar Pichai,943.90 Billion USD,135301,"Technology, Internet or Mobile App Based Busin...",https://value.today/company/alphabet
4,AMAZON.COM,5,USA,Jeff Bezos,941.03 Billion USD,1298000,"eCommerce, Internet or Mobile App Based Busine...",https://value.today/company/amazon.com
...,...,...,...,...,...,...,...,...
495,SEVEN & I HOLDINGS,471,Japan,Ryuichi Isaka,32.65 Billion USD,58165,"Consumer Defensive, Retail, Super Markets, Con...",https://value.today/company/seven-i-holdings
496,ASSICURAZIONI GENERALI,472,Italy,Philippe Donnet,32.61 Billion USD,72000,"Financial Services, Insurance",https://value.today/company/assicurazioni-gene...
497,AMPHENOL CORPORATION,473,USA,Richard Adam Norwitt,32.53 Billion USD,74000,"Technology, Electronics, Cables and Wires, Ele...",https://value.today/company/amphenol-corporation
498,GENERAL MILLS,474,USA,Jeff Harmening,32.52 Billion USD,40000,"Consumer Defensive, Food Products, FMCG, Dairy...",https://value.today/company/general-mills


In [27]:
# Save all the information to a CSV file
companies_df.to_csv('companies.csv', index=False)

In [28]:
jovian.commit(project="web-scraping-final", files=[])



'https://jovian.ai/thakubhai-007/web-scraping-final'

What have we done so far?
> We have scraped the data available on the [web page](https://www.value.today/world/world-top-500-companies).

This is a very basic, beginner-level way of scraping. We have collected data about:
* Company name
* Company CEO
* Company headquarters
* Company rank in January 2020
* Company market capitalization in January 2020
* Company URL
* Total number of employees in the company
* Sectors in which the company works


Now it's time to go deeper into `Web_Scraping`. Next we will collect data from each company's individual page on the same website. For example, visit the page of [APPLE](https://www.value.today/company/apple).



## Writing functions to scrape data from each company page

Python lets us write useful, reusable functions. A function is a block of reusable code that performs a specific task, which helps with code reuse and modular design.

In this section we will write functions to get the company name, rank, market capitalization and other details for different time periods. This time the data comes from each company's own page, and it is not practical to write separate code for every single page, which is why we are going to write functions.

In [29]:
def get_company_url(topic_url):
    '''Download the listing page and return the URLs of all company pages on it.'''
    url = []  # list that will hold the URL of each company page
    # Fetch the page and check the status code, as we did earlier
    response = requests.get(topic_url)
    if response.status_code != 200:
        print("Status code :", response.status_code)
        raise Exception('Failed to fetch page content ' + topic_url)
    # The request succeeded, so parse the HTML text into a BeautifulSoup object
    page_content = BeautifulSoup(response.text, 'html.parser')
    # Find the block for each company, then extract the link to its page
    company_block_tag = page_content.find_all('div', {'class':'row well views-row'})
    base_url = 'https://value.today'
    for tag in company_block_tag:
        url_tag = tag.find('div',{'class':'views-field views-field-title col-sm-12 clearfix'})
        url.append(base_url + url_tag.find('a')['href'])
    return url

Let's extract all the company URLs from the web page we have chosen to scrape.

I have put a special time command just before calling the function that extracts all the URLs. This command, `%%time`, is one of Jupyter Notebook's built-in magic commands, and it tells us how long the cell takes to execute.

In [30]:
%%time
url_list = get_company_url('https://www.value.today/world/world-top-500-companies')

CPU times: user 1.12 s, sys: 61.3 ms, total: 1.18 s
Wall time: 1.51 s
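
`%%time` only works inside Jupyter/IPython. In a plain Python script, a similar wall-clock measurement can be sketched with `time.perf_counter`. Here `timed` is a hypothetical helper, and `sum(range(...))` merely stands in for a slow call such as fetching the URLs.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# sum(range(...)) is a stand-in for a slow call such as fetching URLs.
result, elapsed = timed(sum, range(1_000_000))
print(f"Wall time: {elapsed:.3f} s")
```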


Great, we have collected all the URLs available on the page and saved them in a variable. Let's check the first URL.

In [31]:
url_0 = url_list[0]
url_0

'https://value.today/company/saudi-arabian-oil-company'

In [32]:
# Save the URLs of the top 5 companies in a variable, to test our code and show a sample of the dataset we are going to create in the following cells.
top_5_companies = url_list[:5]

Let's define a function to get the HTML content of a company's webpage. Since we will pick all the data from each individual webpage, this function returns a BeautifulSoup object. It also checks the status code.

In [33]:
def get_company_page(url):
    response = requests.get(url)
    if response.status_code != 200:
        print("Status code:", response.status_code)
        # Report which URL failed
        raise Exception('Failed to fetch page content ' + url + '!')
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup

Let's test this function on our very first URL, which we checked in one of the cells above.

In [34]:
page_content_0 = get_company_page(url_0)

Below I'll write some functions to get the data we are most interested in: the rank and market capitalization of each company.

Let's go

In [35]:
def jan_2020_market_cap(soup):
    market_value_tag = soup.find('div', {'class':'clearfix col-sm-6 field field--name-field-market-value-jan-2020 field--type-float field--label-above'})
    # Getting the div tag for the market_cap, we will then fetch the detail inside it and extract the value 
    try:
        market_value = market_value_tag.find('div', {'class':'field--item'}).text
    except AttributeError:
        market_value = 'None'
    return market_value

def jan_2020_rank(soup):
    rank_tag = soup.find('div', {'class':'clearfix col-sm-6 field field--name-field-world-rank-jan-2020 field--type-integer field--label-above'})
    # Getting the div tag for the rank, we will then fetch the detail inside it and extract the value
    try:
        rank = rank_tag.find('div', {'class':'field--item'}).text
    except AttributeError:
        rank = 'None'
    return rank

The code above extracts the rank and market capitalization for January 2020.

Similarly, I'll write three more blocks of code like the one above to collect the `Rank` and `Market Capitalization` for `July 2020`, `August 2020` and `January 2021`.

In [37]:
def july_2020_market_cap(soup):
    market_value_tag = soup.find('div',{'class':'clearfix col-sm-6 field field--name-field-market-cap-july-04-2020- field--type-float field--label-above'})
    try:
        market_value = market_value_tag.find('div', {'class':'field--item'}).text
    except AttributeError:
        market_value = 'None'
    return market_value


def july_2020_rank(soup):
    rank_tag = soup.find('div', {'class':'clearfix col-sm-6 field field--name-field-world-rank-july-04-2020- field--type-integer field--label-above'})
    try:
        rank = rank_tag.find('div', {'class':'field--item'}).text
    except AttributeError:
        rank = 'None'
    
    return rank

In [36]:
def aug_2020_market_cap(soup):
    market_value_tag = soup.find('div',{'class':'clearfix col-sm-6 field field--name-field-market-cap-aug-22-2020 field--type-float field--label-above'})
    try:
        market_value = market_value_tag.find('div', {'class':'field--item'}).text
    except AttributeError:
        market_value = 'None'
    return market_value

def august_2020_rank(soup):
    rank_tag = soup.find('div', {'class':'clearfix col-sm-6 field field--name-field-world-rank-aug-22-2020- field--type-integer field--label-above'})
    try:
        rank = rank_tag.find('div', {'class':'field--item'}).text     
    except AttributeError:
        rank = 'None'
    
    return rank

In [38]:
def jan_2021_market_cap(soup):
    market_value_tag = soup.find('div',{'class':'clearfix col-sm-6 field field--name-field-market-value-jan012021 field--type-float field--label-above'})
    try:
        market_value = market_value_tag.find('div', {'class':'field--item'}).text
    except AttributeError:
        market_value = 'None'
    return market_value

def jan_2021_rank(soup):
    rank_tag = soup.find('div', {'class':'clearfix col-sm-6 field field--name-field-world-rank-jan012021 field--type-integer field--label-above'})
    try:
        rank = rank_tag.find('div', {'class':'field--item'}).text
    except AttributeError:
        rank = 'None'
    
    return rank 
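
The eight functions above differ only in the CSS class of the field they look up, so they could be folded into one helper that takes the class name as a parameter. The following is a sketch under that assumption, not the approach used in the rest of this notebook: `get_field_text` is a hypothetical name, `SAMPLE_HTML` is a made-up fragment shaped like the company pages, and the helper returns `None` (rather than the string `'None'`) when a field is missing. It relies on BeautifulSoup matching `class_` against any single class of a tag.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the structure of a company page.
SAMPLE_HTML = """
<div class="clearfix col-sm-6 field field--name-field-world-rank-jan-2020 field--type-integer field--label-above">
  <div class="field--label">World Rank (Jan-2020)</div>
  <div class="field--item">2</div>
</div>
"""


def get_field_text(soup, field_class):
    """Return the text of the 'field--item' div inside the div carrying field_class."""
    field_tag = soup.find('div', class_=field_class)
    if field_tag is None:
        return None  # the page has no such field
    item_tag = field_tag.find('div', class_='field--item')
    return item_tag.get_text(strip=True) if item_tag is not None else None


sample_soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
# A field that exists in the sample, then one that does not.
print(get_field_text(sample_soup, 'field--name-field-world-rank-jan-2020'))
print(get_field_text(sample_soup, 'field--name-field-world-rank-jan012021'))
```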


Let's extract some more information, such as the `CEO`, `Logo`, `HQ`, etc.

In [39]:
def get_ceo_name(soup):
    ceo_tag = soup.find('div', {'class':'clearfix col-sm-12 field field--name-field-ceo field--type-entity-reference field--label-above'})
    # We have picked the CEO tag for the company and below we will extract the name of CEO
    try:
        ceo_name = ceo_tag.find('a').text
    except AttributeError:
        ceo_name = 'None'
    return ceo_name

In [40]:
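def get_company_logo(soup):
    logo_tag = soup.find('div', {'class':'clearfix col-sm-12 field field--name-field-company-logo-lc field--type-image field--label-hidden field--item'})
    try:
        # base_url is the global 'https://value.today' defined earlier in the notebook
        logo = base_url + logo_tag.find('img')['src']
    except (AttributeError, TypeError, KeyError):
        # Missing div (AttributeError), missing <img> (TypeError) or missing src (KeyError)
        logo = 'None'
    return logo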

In [41]:
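# Note: this reuses the name of the earlier get_company_url(topic_url);
# from this cell onwards the name refers to the function below.
def get_company_url(soup):
    company_url_tag = soup.find('div', {'class':'clearfix col-sm-12 field field--name-field-company-website field--type-link field--label-above'})
    try:
        company_url = company_url_tag.find('a')['href']
    except (AttributeError, TypeError, KeyError):
        # Missing div (AttributeError), missing <a> (TypeError) or missing href (KeyError)
        company_url = 'None'
    return company_url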

In [42]:
def get_company_headquarter(soup):
    hq_tag = soup.find('div', {'class':'clearfix col-sm-12 field field--name-field-headquarters-of-company field--type-entity-reference field--label-above'})
    try:
        hq_name = hq_tag.find('div',{'class':'field--item'}).text
    except AttributeError:
        hq_name = 'None'
    return hq_name

In [43]:
def get_company_name(soup):
    h1_tag = soup.find('div', {'class':'field field--name-node-title field--type-ds field--label-hidden field--item'}).find('h1')
    company_name = h1_tag.find('a').text.strip()
    return company_name

## Let's write a function to collect the data for the CSV file

Our work is almost complete, as we have written nearly all the functions needed to get the useful information from the website.

Let's write a function to extract all the data and return it as a dictionary; a list of such dictionaries makes it much easier to create a pandas dataframe.

> This is the tricky part. We are going to extract data from each company's webpage, and there are 500 companies, so the website may block us for sending so many requests. To protect against this, I'll use the `time.sleep()` function from the `time` library. It makes our function pause for 1 second (the pause depends on the argument passed to `time.sleep()`), which should be enough to avoid getting blocked.
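
The pause-between-requests idea can be kept separate from the scraping logic. The sketch below is one possible arrangement, not the code used in this notebook: `fetch_politely` is a hypothetical helper that takes the fetching function and the sleep function as parameters, so the pacing can be tried out without touching the network. The one-second default mirrors the delay chosen in this notebook; whether it is enough for a given site is a judgement call.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results


# Stand-ins so the example runs offline: a fake fetch and a sleep recorder.
recorded_pauses = []
pages = fetch_politely(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: "content of " + url,
    delay_seconds=1.0,
    sleep=recorded_pauses.append,
)
print(pages)
print(recorded_pauses)
```

In real use, `fetch` would be a function that downloads and parses one page, and `sleep` would be left at its default.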

In [48]:
def parse_companies_data(url):
    # Collect the data for one company into a dictionary, which makes it easy to create a `pandas` dataframe
    soup = get_company_page(url) # Make a BeautifulSoup object with our predefined function
    ceo_name = get_ceo_name(soup)
    company_logo = get_company_logo(soup)
    company_url = get_company_url(soup)
    company_hq = get_company_headquarter(soup)
    company_name = get_company_name(soup)
    # The 5 variables above store the CEO, company name, company URL, company headquarters and company logo
    # The variables below store the ranks and market capitalizations for the different time periods
    rank_on_jan_2020 = jan_2020_rank(soup)
    rank_on_july_2020 = july_2020_rank(soup)
    rank_on_aug_2020 = august_2020_rank(soup)
    rank_on_jan_2021 = jan_2021_rank(soup)
    market_cap_on_jan_2020 = jan_2020_market_cap(soup)
    market_cap_on_july_2020 = july_2020_market_cap(soup)
    market_cap_on_aug_2020 = aug_2020_market_cap(soup)
    market_cap_on_jan_2021 = jan_2021_market_cap(soup)
    time.sleep(1)
    # Print a message to confirm that the page has been scraped by our function.
    print("Required data has been scraped from the page: " + company_name)
    # Return the collected data as a dictionary
    return {
        'Company Name': company_name,
        'CEO': ceo_name,
        'Headquarter': company_hq,
        'Jan 2020 Rank': rank_on_jan_2020,
        'Jan 2020 Market Capitalization': market_cap_on_jan_2020,
        'July 2020 Rank': rank_on_july_2020,
        'July 2020 Market Capitalization': market_cap_on_july_2020,
        'Aug 2020 Rank': rank_on_aug_2020,
        'Aug 2020 Market Capitalization': market_cap_on_aug_2020,
        'Jan 2021 Rank': rank_on_jan_2021,
        'Jan 2021 Market Capitalization': market_cap_on_jan_2021,
        'Logo':company_logo,
        'URL':company_url
    }

Let's check if it works correctly, and also check the time to execute the cell.

In [49]:
%%time
top_5_companies_data = []
for url in top_5_companies:
    data = parse_companies_data(url)
    top_5_companies_data.append(data)

Required data has been scraped from the page: SAUDI ARABIAN OIL COMPANY (Saudi Aramco)
Required data has been scraped from the page: APPLE
Required data has been scraped from the page: MICROSOFT CORPORATION
Required data has been scraped from the page: ALPHABET
Required data has been scraped from the page: AMAZON.COM
CPU times: user 559 ms, sys: 26.7 ms, total: 586 ms
Wall time: 6.44 s


Nice, it took about 6.4 seconds to fetch the data from `5 pages`. Let's have a look at the data for these top 5 companies.

In [50]:
top_5_companies_data

[{'Company Name': 'SAUDI ARABIAN OIL COMPANY (Saudi Aramco)',
  'CEO': 'Amin H. Al-Nasser',
  'Headquarter': 'Saudi Arabia',
  'Jan 2020 Rank': '1',
  'Jan 2020 Market Capitalization': '1,898.100 Billion USD',
  'July 2020 Rank': '1',
  'July 2020 Market Capitalization': '1,953.180 Billion USD',
  'Aug 2020 Rank': '2',
  'Aug 2020 Market Capitalization': '1,997.390 Billion USD',
  'Jan 2021 Rank': '2',
  'Jan 2021 Market Capitalization': '2,051.500 Billion USD',
  'Logo': 'https://value.today/sites/default/files/styles/medium/public/2020-05/saudi-aaramco.JPG?itok=aNP5GBsE',
  'URL': 'https://www.saudiaramco.com/'},
 {'Company Name': 'APPLE',
  'CEO': 'Tim Cook',
  'Headquarter': 'USA',
  'Jan 2020 Rank': '2',
  'Jan 2020 Market Capitalization': '1,323.000 Billion USD',
  'July 2020 Rank': '2',
  'July 2020 Market Capitalization': '1,578.000 Billion USD',
  'Aug 2020 Rank': '1',
  'Aug 2020 Market Capitalization': '2,127.000 Billion USD',
  'Jan 2021 Rank': '1',
  'Jan 2021 Market Capit

#### Final command
Now it's time to give the final command to collect the data from all 500 pages, after which we will put it into our `Dataframe`.
We will pass each of the 500 URLs in `url_list` to our `parse_companies_data` function.


In [51]:
%%time 
companies_data = []
for x in url_list:
    companies_data.append(parse_companies_data(x))

Required data has been scraped from the page: SAUDI ARABIAN OIL COMPANY (Saudi Aramco)
Required data has been scraped from the page: APPLE
Required data has been scraped from the page: MICROSOFT CORPORATION
Required data has been scraped from the page: ALPHABET
Required data has been scraped from the page: AMAZON.COM
Required data has been scraped from the page: FACEBOOK
Required data has been scraped from the page: ALIBABA GROUP HOLDING
Required data has been scraped from the page: BERKSHIRE HATHAWAY
Required data has been scraped from the page: TENCENT
Required data has been scraped from the page: J P MORGAN CHASE & CO
Required data has been scraped from the page: VISA
Required data has been scraped from the page: JOHNSON & JOHNSON
Required data has been scraped from the page: WALMART
Required data has been scraped from the page: SAMSUNG ELECTRONICS
Required data has been scraped from the page: BANK OF AMERICA CORPORATION
Required data has been scraped from the page: NESTLE AG
Requir

Required data has been scraped from the page: SOFTBANK GROUP
Required data has been scraped from the page: CHRISTIAN DIOR
Required data has been scraped from the page: ALTRIA GROUP
Required data has been scraped from the page: NTT DOCOMO
Required data has been scraped from the page: SBERBANK OF RUSSIA
Required data has been scraped from the page: CHINA PETROLEUM & CHEMICAL CORPORATION (SINOPEC)
Required data has been scraped from the page: CSL
Required data has been scraped from the page: STATE FARM
Required data has been scraped from the page: BOOKING HOLDINGS
Required data has been scraped from the page: FIDELITY NATIONAL INFORMATION SERVICES
Required data has been scraped from the page: KEYENCE CORPORATION
Required data has been scraped from the page: ITAU UNIBANCO HOLDING
Required data has been scraped from the page: KERING
Required data has been scraped from the page: MORGAN STANLEY
Required data has been scraped from the page: CATERPILLAR
Required data has been scraped from the p

Required data has been scraped from the page: SHERWIN-WILLIAMS COMPANY
Required data has been scraped from the page: ERNST & YOUNG
Required data has been scraped from the page: LAS VEGAS SANDS CORP
Required data has been scraped from the page: JIANGSU HENGRUI MEDICINE
Required data has been scraped from the page: AMERICA MOVIL
Required data has been scraped from the page: BAYERISCHE MOTOREN WERKE (BMW)
Required data has been scraped from the page: CHINA VANKE
Required data has been scraped from the page: DANONE
Required data has been scraped from the page: GENERAL MOTORS COMPANY
Required data has been scraped from the page: ABB
Required data has been scraped from the page: WALGREENS BOOTS ALLIANCE
Required data has been scraped from the page: BAIDU
Required data has been scraped from the page: WAL-MART DE MEXICO
Required data has been scraped from the page: ALDI
Required data has been scraped from the page: MERCK KGAA O.N.
Required data has been scraped from the page: BANK OF MONTREAL


Required data has been scraped from the page: NIDEC CORPORATION
Required data has been scraped from the page: COMPASS GROUP
Required data has been scraped from the page: HANG SENG BANK
Required data has been scraped from the page: MITSUBISHI CORPORATION
Required data has been scraped from the page: KEURIG DR PEPPER
Required data has been scraped from the page: DOW Inc
Required data has been scraped from the page: CENTRAL JAPAN RAILWAY COMPANY
Required data has been scraped from the page: MANULIFE FINANCIAL CORPORATION
Required data has been scraped from the page: MURATA MANUFACTURING
Required data has been scraped from the page: V.F. CORPORATION
Required data has been scraped from the page: CHINA CITIC BANK CORPORATION
Required data has been scraped from the page: DIDI CHUXING
Required data has been scraped from the page: DOLLAR GENERAL CORPORATION
Required data has been scraped from the page: FEDEX CORPORATION
Required data has been scraped from the page: ENGIE
Required data has been 

Great, it took about 13 minutes, a bit faster than the time we estimated earlier. That's exactly the magic of `Coding with Python`.

Let's check whether the first 5 elements of the list match the data we got earlier in `top_5_companies_data`.

In [52]:
top_5_companies_data == companies_data[:5]

True

Great, we got exactly the same result.

Now it's time to create a Pandas DataFrame from the above list of dictionaries.

In [55]:
companies_data_df = pd.DataFrame(companies_data)
companies_data_df

| | Company Name | CEO | Headquarter | Jan 2020 Rank | Jan 2020 Market Capitalization | July 2020 Rank | July 2020 Market Capitalization | Aug 2020 Rank | Aug 2020 Market Capitalization | Jan 2021 Rank | Jan 2021 Market Capitalization | Logo | URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SAUDI ARABIAN OIL COMPANY (Saudi Aramco) | Amin H. Al-Nasser | Saudi Arabia | 1 | 1,898.100 Billion USD | 1 | 1,953.180 Billion USD | 2 | 1,997.390 Billion USD | 2 | 2,051.500 Billion USD | https://value.today/sites/default/files/styles... | https://www.saudiaramco.com/ |
| 1 | APPLE | Tim Cook | USA | 2 | 1,323.000 Billion USD | 2 | 1,578.000 Billion USD | 1 | 2,127.000 Billion USD | 1 | 2,256.000 Billion USD | https://value.today/sites/default/files/styles... | https://www.apple.com/ |
| 2 | MICROSOFT CORPORATION | Satya Nadella | USA | 3 | 1,215.000 Billion USD | 3 | 1,564.000 Billion USD | 4 | 1,612.000 Billion USD | 3 | 1,682.000 Billion USD | https://value.today/sites/default/files/styles... | https://www.microsoft.com/ |
| 3 | ALPHABET | Sundar Pichai | USA | 4 | 943.897 Billion USD | 5 | 1,002.000 Billion USD | 5 | 1,073.000 Billion USD | 5 | 1,185.000 Billion USD | https://value.today/sites/default/files/styles... | https://abc.xyz/ |
| 4 | AMAZON.COM | Jeff Bezos | USA | 5 | 941.028 Billion USD | 4 | 1,442.000 Billion USD | 3 | 1,645.000 Billion USD | 4 | 1,634.000 Billion USD | https://value.today/sites/default/files/styles... | https://www.amazon.com/ |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 495 | SEVEN & I HOLDINGS | Ryuichi Isaka | Japan | 471 | 32.646 Billion USD | 500 | 28.219 Billion USD | 529 | 28.890 Billion USD | 573 | 31.270 Billion USD | | https://www.7andi.com |
| 496 | ASSICURAZIONI GENERALI | Philippe Donnet | Italy | 472 | 32.607 Billion USD | 596 | 24.205 Billion USD | 657 | 23.968 Billion USD | 674 | 27.258 Billion USD | | https://www.generali.com/ |
| 497 | AMPHENOL CORPORATION | Richard Adam Norwitt | USA | 473 | 32.525 Billion USD | 498 | 28.367 Billion USD | 452 | 32.296 Billion USD | 442 | 39.121 Billion USD | | https://www.amphenol.com/ |
| 498 | GENERAL MILLS | Jeff Harmening | USA | 474 | 32.516 Billion USD | 364 | 37.491 Billion USD | 362 | 39.111 Billion USD | 490 | 35.952 Billion USD | | https://www.generalmills.com/ |


Let's write this data into a `CSV` file

In [56]:
companies_data_df.to_csv('companies_data.csv', index=False)

Great, our CSV file has been created and our work is done.

It's time to save our entire notebook and upload both of our CSV files to the Jovian account.
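As a quick sanity check, a written CSV file can be read back with `pd.read_csv` and compared with the frame it came from. The sketch below is only an illustration: it uses a tiny stand-in frame (two rows copied from the output above) and a temporary directory instead of the real `companies_data_df`, and the names `sample_df`, `csv_path` and `reloaded_df` are made up for this example.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for companies_data_df (values copied from the output above).
sample_df = pd.DataFrame([
    {"Company Name": "APPLE", "CEO": "Tim Cook", "Headquarter": "USA", "Jan 2020 Rank": 2},
    {"Company Name": "ALPHABET", "CEO": "Sundar Pichai", "Headquarter": "USA", "Jan 2020 Rank": 4},
])

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "companies_data.csv")
    # Do not write the row index into the file.
    sample_df.to_csv(csv_path, index=False)
    # Read the file back while the temporary directory still exists.
    reloaded_df = pd.read_csv(csv_path)

# The reloaded frame should have the same shape and columns as the original.
print(reloaded_df.shape == sample_df.shape)
print(list(reloaded_df.columns) == list(sample_df.columns))
```

If the index is written to the file (the default behaviour of `to_csv`), reading the file back produces an extra `Unnamed: 0` column, which is why the index is usually left out.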

In [58]:
jovian.commit(project="web-scraping-final", files=['companies_data.csv', 'companies.csv'])

[jovian] Attempting to save notebook..
[jovian] Updating notebook "thakubhai-007/web-scraping-final" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/thakubhai-007/web-scraping-final


'https://jovian.ai/thakubhai-007/web-scraping-final'

## Summary

Finally, we got the data we needed and successfully saved all of it to a `CSV` file.

Here is a very short description of our entire journey:
* First, we installed and imported all the required libraries. You can also install and import a library in any later code cell, whenever you need it.
* We made an HTTPS request with the `requests` library to download the web page into our notebook.
* We parsed the entire web page using the `BeautifulSoup` library, and then started our main job: `Finding the important data under the Tags`.
* After finding all the `tags` that hold the required data, we stored the data in a dictionary and then created a `DataFrame` using the `Pandas` library.
* That completed our work for the web page we had selected.
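The steps above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: the HTML snippet, the tag names and the class names (`company`, `name`, `ceo`, `hq`) are all made up so that the example runs without a network connection. In the real notebook the HTML text would come from a `requests` response, and the tags have to be found by inspecting the page.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page; the real site's tags and
# class names are different and must be found by inspecting the page.
html_text = """
<div class="company">
  <h2 class="name">APPLE</h2>
  <span class="ceo">Tim Cook</span>
  <span class="hq">USA</span>
</div>
<div class="company">
  <h2 class="name">ALPHABET</h2>
  <span class="ceo">Sundar Pichai</span>
  <span class="hq">USA</span>
</div>
"""

# Parse the HTML text with Python's built-in parser.
soup = BeautifulSoup(html_text, "html.parser")

# Collect one dictionary per company block.
rows = []
for block in soup.find_all("div", class_="company"):
    rows.append({
        "Company Name": block.find("h2", class_="name").text.strip(),
        "CEO": block.find("span", class_="ceo").text.strip(),
        "Headquarter": block.find("span", class_="hq").text.strip(),
    })

# Turn the list of dictionaries into a DataFrame.
demo_df = pd.DataFrame(rows)
print(demo_df)
```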

Then we went much deeper into web scraping.

We wrote some reusable functions to get data from a web page. This time we wrote functions because we were going to collect data from 500 different web pages, one for each company. For example, the [Saudi Arabian Oil Company (Saudi Aramco)](https://www.value.today/company/saudi-arabian-oil-company) and [APPLE](https://www.value.today/company/apple) pages.

* We wrote functions to get data from each company's web page.
* We wrote a parsing function to parse all the data into a list of dictionaries.
* Finally, we wrote code to fetch the data and parse all the required data into a list of dictionaries.
* We made a pandas `DataFrame` from the collected data and then created a `CSV` file from it.
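The overall shape of that multi-page loop can be sketched as follows. This is a hedged illustration, not the functions written in this notebook: `scrape_companies`, `fake_fetch`, `fake_parse` and the `fake_pages` contents are hypothetical stand-ins so the example runs offline. With real pages, the fetch function would typically wrap a `requests` call and the parse function would use `BeautifulSoup`.

```python
import time


def scrape_companies(page_urls, fetch_page, parse_page, delay_seconds=1.0):
    """Fetch and parse each URL, returning a list of dictionaries."""
    companies = []
    for url in page_urls:
        try:
            html_text = fetch_page(url)
            record = parse_page(html_text)
        except Exception as error:
            # One broken page should not stop the whole run.
            print("Skipped", url, "because of", repr(error))
            continue
        companies.append(record)
        print("Required data has been scraped from the page:", record["Company Name"])
        time.sleep(delay_seconds)  # pause between requests to be polite to the server
    return companies


# Hypothetical stand-ins so the sketch runs without a network connection.
fake_pages = {
    "https://www.value.today/company/saudi-arabian-oil-company":
        "SAUDI ARABIAN OIL COMPANY (Saudi Aramco)|Amin H. Al-Nasser",
    "https://www.value.today/company/apple": "APPLE|Tim Cook",
}


def fake_fetch(url):
    # Raises KeyError for unknown URLs, which mimics a failed download.
    return fake_pages[url]


def fake_parse(text):
    name, ceo = text.split("|")
    return {"Company Name": name, "CEO": ceo}


urls = list(fake_pages) + ["not-a-real-page"]  # the last entry is skipped
demo_data = scrape_companies(urls, fake_fetch, fake_parse, delay_seconds=0)
print(len(demo_data))
```

Passing the fetch and parse steps in as functions keeps the loop itself easy to test, and the small delay between requests reduces the load placed on the website.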

## References

I made this project under the guidance of [Aakash N S](https://aakashns.medium.com/?source=collection_about-------------------------------------) and the [Jovian team](https://blog.jovian.ai/about).


Here is a list of resources that can help you understand some key points:
- There is a workshop on `Web-Scraping` by `Aakash N S`: [Let's Build a Python Web Scraping Project from Scratch | Hands-On Tutorial](https://www.youtube.com/watch?v=RKsLLG-bzEY&t=6677s)
- You can read about all the libraries: [`Selenium`](https://www.selenium.dev/documentation/en/), [`Scrapy`](https://docs.scrapy.org/en/latest/), [`lxml`](https://lxml.de/), and [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- You can also read some interesting blog posts on web scraping written by my mates: [Web Scraping Popular Movies using BeautifulSoup](https://blog.jovian.ai/web-scraping-popular-movies-using-beautifulsoup-5bab0852fee4) and [Web scraping tutorial for beginners](https://blog.jovian.ai/a-comprehensive-web-scraping-tutorial-385a2ac27107)

## Future work

"There is always a lot to do with everything you have done previously"

The project we have completed leads naturally to exploratory analysis of the resulting dataset. Here is a list of future work that I will try next:
- We can do exploratory data analysis on the dataset we have created so far.
- There is still a lot of data available on the same website that I can scrape, viz. `Top companies data by Stock Exchange`, `Top companies data by Sector`, `Country wise Top companies data`, and many more ways to scrape data from the website: [value.today](https://www.value.today/)
- I will make another web scraping project using some of the other libraries mentioned above. I'll soon try scraping using `Scrapy` and `Selenium`.
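As a first step towards that exploratory analysis, the market capitalization columns would need to become numbers, since they are stored as text such as `1,898.100 Billion USD`. The sketch below shows one possible way to do this. It assumes every value has the form `<number> Billion USD`, which has not been checked against all 500 rows. The helper `market_cap_to_billions` and the small frame `caps_df` are hypothetical and use only values shown in the output above.

```python
import pandas as pd


def market_cap_to_billions(text):
    """Convert a string such as '1,898.100 Billion USD' to a float in billions.

    Assumes the value always starts with the number and is quoted in billions.
    """
    number_part = text.split()[0]  # e.g. '1,898.100'
    return float(number_part.replace(",", ""))


# Small stand-in frame with values copied from the output above.
caps_df = pd.DataFrame({
    "Company Name": ["APPLE", "ALPHABET"],
    "Jan 2020 Market Capitalization": ["1,323.000 Billion USD", "943.897 Billion USD"],
    "Jan 2021 Market Capitalization": ["2,256.000 Billion USD", "1,185.000 Billion USD"],
})

cap_columns = ["Jan 2020 Market Capitalization", "Jan 2021 Market Capitalization"]
for column in cap_columns:
    caps_df[column] = caps_df[column].apply(market_cap_to_billions)

# With numeric columns, simple comparisons become possible.
caps_df["Change (Billion USD)"] = (
    caps_df["Jan 2021 Market Capitalization"] - caps_df["Jan 2020 Market Capitalization"]
)
print(caps_df)
```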


### Hope you enjoyed the whole journey from scratch. Thanks for reading my notebook patiently.

In [59]:
# Let's submit our project to the Jovian platform
jovian.submit(assignment="zerotoanalyst-project1")

[jovian] Attempting to save notebook..
[jovian] Updating notebook "thakubhai-007/web-scraping-final" on https://jovian.ai
[jovian] Uploading notebook..
[jovian] Uploading additional files...
[jovian] Committed successfully! https://jovian.ai/thakubhai-007/web-scraping-final
[jovian] Submitting assignment..
[jovian] Verify your submission at https://jovian.ai/learn/zero-to-data-analyst-bootcamp/assignment/project-1-web-scraping-with-python
