This is a script that scrapes global revenue figures from Microsoft Investor Relations (NOTE 17 — SEGMENT INFORMATION AND GEOGRAPHIC DATA; Revenue from external customers, classified by significant product and service offerings). One example from these webpages can be found at https://www.microsoft.com/en-us/Investor/earnings/FY-2021-Q1/IRFinancialStatementsPopups?tag=us-gaap:SegmentReportingDisclosureTextBlock&title=More%20Personal%20Computing.

The following block loads all necessary packages and libraries required by the script. The block needs to be run every time the code is used to scrape data. If required packages are not installed, running the code will throw an error. Refer to Statistics Canada's instructions for installing and requesting packages on your Net B VDI. If Python is not yet installed on your system, you will need to submit an SRM for access.

In [81]:
# the Python Requests package will allow us to send HTTP requests to get HTML files
import requests

# the GET method indicates that you’re trying to get or retrieve data from a specified resource. 
# to make a GET request, invoke requests.get()
from requests import get

# Beautiful Soup is a Python library for pulling data out of HTML and XML files
from bs4 import BeautifulSoup

# pandas is a Python data analysis library
import pandas as pd

# NumPy is a Python library used for working with large, multi-dimensional arrays and matrices
import numpy as np

# the time module in Python has a function sleep() that you can use to suspend execution of the calling thread 
from time import sleep

# The randint() method returns a pseudo-random integer number 
from random import randint

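# re provides regular expressions, used below to strip non-digit characters from scraped values
import re

# os provides file-system helpers such as os.path.exists
import os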
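
The import cell above fails with a ModuleNotFoundError when a package is absent. As a minimal sketch (not part of the original script), the check below reports which third-party packages are missing before the imports run. The `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here for illustration.

```python
from importlib.util import find_spec

# Hypothetical mapping: importable module name -> name of the package to request
REQUIRED = {
    "requests": "requests",
    "bs4": "beautifulsoup4",
    "pandas": "pandas",
    "numpy": "numpy",
}

def missing_packages(required=REQUIRED):
    """Return the package names whose modules cannot be found."""
    return [package for module, package in required.items()
            if find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

If the list is not empty, the packages named in it are the ones to request before running the rest of the notebook.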

In [89]:
existing_data = pd.read_csv('microsoft.csv')

In [90]:
existing_data

Unnamed: 0,fiscal_year,fiscal_quarter,segment,revenue
0,2017,1,Office products and cloud services,5982
1,2017,1,Server products and cloud services,4689
2,2017,1,Windows,4643
3,2017,1,Gaming,1885
4,2017,1,Search advertising,1429
...,...,...,...,...
215,2022,2,Search and news advertising,3064
216,2022,2,Devices,2285
217,2022,2,Enterprise Services,1823
218,2022,2,Other,1357


The following block scrapes a single quarter and can be used to update the existing file. Running it appends the values for the selected reference quarter to the lists fiscal_year, fiscal_quarter, segment, and revenue. It is the earlier HTML conversion method without the nested loop that accommodated multiple files.

In [104]:
# year controls the reference year that you wish to scrape
# quarter controls the reference quarter that you wish to scrape
year = 2022
quarter = 3

fiscal_year = []
fiscal_quarter = []
segment = []
revenue = []
        
# specify the URL for each reference date
url = 'https://www.microsoft.com/en-us/Investor/earnings/FY-' + str(year) + '-Q' + str(quarter) + '/IRFinancialStatementsPopups?tag=us-gaap:SegmentReportingDisclosureTextBlock&title=More%20Personal%20Computing'

# method we use to grab the contents of the URL
results = requests.get(url)

# soup holds the parsed page returned by BeautifulSoup
# the "html.parser" argument selects Python's built-in HTML parser
# parsing lets Python navigate the components of the page rather than treating it as one long string
soup = BeautifulSoup(results.text, "html.parser")

# the necessary financial data are stored in <table> elements identified by their inline style attribute
alltables = soup.find_all('table', attrs={'style' : ["margin:auto;border-collapse:collapse; width:100%;", "border-collapse:collapse; width:99.86%;", "border-collapse:collapse; width:100%;"]})

# avoid overloading the website being scraped by reducing the crawl rate
sleep(randint(2,10))

for table in alltables:
    # only one table contains the necessary segment revenue
    if ('Office products and cloud services') in str(table):
        rows = table.find_all('tr')

        for row in rows:

            # the process of finding revenue values on the page is hard-coded for the specific HTML of the Microsoft website, as we look for data based on HTML styles
            # styles are subject to change in future quarters, so it is vital to reevaluate the data source website when scraping results are unexpected
            # segment category names in the table are not right-justified                    
            # The code finds segment names and adds them as dataset rows
            names = row.find_all('p', style=lambda value: value and ('text-align:justify;' in value or 'text-align:left;' in value) and 'font-size:10pt;' in value)
            for name in names:
                # record the fiscal year and quarter that the scraped data refer to

                fiscal_year.append(str(year))
                fiscal_quarter.append(str(quarter))

                print(str(year) + 'Q' + str(quarter))

                print(name.text)
                # adding the name of each segment category to "Segment"
                segment.append(name.text)

            # some tables have multiple columns of revenue values. We are only interested in the first column
            # values is a list created for each row of the table 
            # it will store all revenue numbers from each row, so the value in the first column will be the first element
            # this is also a hard-coded way of distinguishing numbers in the first column from numbers in other columns
            # values corresponding to each segment in the table are right-justified
            # the find_all command will find all text with the stated properties
            values = row.find_all('p', style=lambda value: value and 'text-align:right;' in value and 'font-size:10pt;' in value)

            # don't count empty lists (empty rows) or empty strings
            if len(values) > 0:
                all_values = []
                for value in values:
                    value_stripped = re.sub("[^0-9]", "", value.text)
                    if value_stripped.isdigit() and len(value_stripped) > 0:
                        all_values.append(value_stripped)
                if len(all_values) > 0:
                    print(all_values[0])
                    revenue.append(all_values[0])
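
The stripping step above removes every non-digit character, so "5,982" becomes "5982", but a decimal point or a negative sign would be lost as well. The sketch below is an illustration, not the script's own method. It wraps the same regular expression in a hypothetical helper, `parse_amount`, that returns an integer and treats an opening parenthesis or a leading minus sign as negative. The parenthesis rule assumes the usual accounting convention; the source does not confirm that these tables use it.

```python
import re

def parse_amount(text):
    """Return the whole number found in a table cell, or None if it has no digits.

    Hypothetical helper: assumes values are whole numbers (no decimals) and
    that "(" or a leading "-" marks a negative amount.
    """
    cleaned = text.replace("\xa0", " ").strip()   # normalise non-breaking spaces
    digits = re.sub(r"[^0-9]", "", cleaned)       # same pattern as the script
    if not digits:
        return None
    negative = "(" in cleaned or cleaned.lstrip("$ ").startswith("-")
    return -int(digits) if negative else int(digits)

# Made-up cell contents for illustration
for cell in ["$ 5,982", "1,357", "(12)", "$", ""]:
    print(repr(cell), "->", parse_amount(cell))
```

Returning integers instead of strings would also keep the revenue column numeric when the new rows are combined with the existing file.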

In [105]:
new_data = pd.DataFrame(
    {'fiscal_year': fiscal_year,
     'fiscal_quarter': fiscal_quarter,
     'segment': segment,
     'revenue': revenue,
    })

In [106]:
new_data

Unnamed: 0,fiscal_year,fiscal_quarter,segment,revenue
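
The empty table above shows that this run returned no rows for the selected quarter. A mismatch is also possible: the script appends segment names and revenue values independently, so the lists can end up with different lengths, in which case pd.DataFrame raises a ValueError. The following is a minimal sketch of a guard that could run before the DataFrame is built. `check_scrape` is a hypothetical helper, not part of the original script.

```python
def check_scrape(fiscal_year, fiscal_quarter, segment, revenue):
    """Return the number of scraped rows, or raise ValueError if the lists are unusable."""
    lengths = {
        "fiscal_year": len(fiscal_year),
        "fiscal_quarter": len(fiscal_quarter),
        "segment": len(segment),
        "revenue": len(revenue),
    }
    # every list must describe the same number of rows
    if len(set(lengths.values())) != 1:
        raise ValueError(f"list lengths differ: {lengths}")
    # an empty scrape usually means the page or its HTML styles have changed
    if lengths["segment"] == 0:
        raise ValueError("no rows were scraped; check the URL and the HTML styles")
    return lengths["segment"]

# Made-up values for illustration
print(check_scrape(["2022"], ["3"], ["Windows"], ["100"]))

try:
    check_scrape([], [], [], [])
except ValueError as error:
    print("Scrape rejected:", error)
```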


In [107]:
joined_data = pd.concat([existing_data, new_data])

In [109]:
joined_data.tail(30)

Unnamed: 0,fiscal_year,fiscal_quarter,segment,revenue
190,2021.0,4.0,Server products and cloud services,52589
191,2021.0,4.0,Office products and cloud services,39872
192,2021.0,4.0,Windows,23227
193,2021.0,4.0,Gaming,15370
194,2021.0,4.0,LinkedIn,10289
195,2021.0,4.0,Search advertising,8528
196,2021.0,4.0,Enterprise Services,6943
197,2021.0,4.0,Devices,6791
198,2021.0,4.0,Other,4479
199,2021.0,4.0,Total,168088


In [110]:
output_path = 'microsoft.csv'

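# write the combined data back to the CSV file, replacing the old file and keeping the header row
# (the original call referenced df, which is not defined anywhere in this notebook)
joined_data.to_csv(output_path, index=False)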

The data will need to be cleaned before any analysis takes place. Specifically, the scraped table includes the revenue total, which is not needed for analysis. Another issue arises with Q4 values, which are not reported individually: the figures scraped for Q4 are totals for the full fiscal year. To get the value for the three months of Q4, subtract the sum of the three previous quarters from the full-year value.
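
The two cleaning steps can be sketched with pandas as follows. This is an illustration under stated assumptions, not a definitive implementation. The sample values are made up, `clean_revenue` is a hypothetical helper, fiscal_quarter is assumed to be numeric, and each segment is assumed to keep the same name across all four quarters of a fiscal year. Segment names that change between quarters (for example "Search advertising" and "Search and news advertising") would need to be harmonised first.

```python
import pandas as pd

# Made-up sample in the same layout as microsoft.csv
sample = pd.DataFrame({
    "fiscal_year":    [2021] * 8,
    "fiscal_quarter": [1, 2, 3, 4, 1, 2, 3, 4],
    "segment":        ["Windows"] * 4 + ["Total"] * 4,
    "revenue":        [10, 20, 30, 100, 50, 60, 70, 250],
})

def clean_revenue(df):
    """Drop 'Total' rows and convert full-year Q4 figures to three-month values."""
    out = df[df["segment"] != "Total"].copy()
    out["revenue"] = pd.to_numeric(out["revenue"]).astype(float)

    # sum of Q1 to Q3 for each fiscal year and segment
    q1_q3 = (out[out["fiscal_quarter"] < 4]
             .groupby(["fiscal_year", "segment"], as_index=False)["revenue"]
             .sum()
             .rename(columns={"revenue": "q1_q3"}))
    out = out.merge(q1_q3, on=["fiscal_year", "segment"], how="left")

    # Q4 (three months) = full-year figure minus the three previous quarters
    is_q4 = out["fiscal_quarter"] == 4
    out.loc[is_q4, "revenue"] = out.loc[is_q4, "revenue"] - out.loc[is_q4, "q1_q3"]
    return out.drop(columns="q1_q3")

cleaned = clean_revenue(sample)
print(cleaned)
```

A Q4 row with no matching Q1 to Q3 rows ends up with a missing revenue value rather than a misleading number, which makes gaps in the scraped history easy to spot.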