This is a script that scrapes global revenue figures from Microsoft Investor Relations (NOTE 17 — SEGMENT INFORMATION AND GEOGRAPHIC DATA; Revenue from external customers, classified by significant product and service offerings). One example from these webpages can be found at https://www.microsoft.com/en-us/Investor/earnings/FY-2021-Q1/IRFinancialStatementsPopups?tag=us-gaap:SegmentReportingDisclosureTextBlock&title=More%20Personal%20Computing.

The following block loads all necessary packages and libraries required by the script. The block needs to be run every time the code is used to scrape data. If required packages are not installed, running the code will throw an error. Refer to Statistics Canada's instructions for installing and requesting packages on your Net B VDI. If Python is not yet installed on your system, you will need to submit an SRM for access.
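
Before running the import cell, it can help to see which packages are missing rather than waiting for an import error. The sketch below is not part of the original script: it relies only on the standard-library `importlib.util.find_spec`, and the helper name `missing_packages` is hypothetical. Note that Beautiful Soup is imported under the name `bs4`.

```python
from importlib.util import find_spec

# Import names used by this notebook (Beautiful Soup is imported as "bs4").
REQUIRED = ["requests", "bs4", "pandas", "numpy"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Any name reported as missing would need to be installed or requested before the rest of the notebook is run.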

In [89]:
# the Python Requests package will allow us to send HTTP requests to get HTML files
import requests

# the GET method indicates that you’re trying to get or retrieve data from a specified resource. 
# to make a GET request, invoke requests.get()
from requests import get

# Beautiful Soup is a Python library for pulling data out of HTML and XML files
from bs4 import BeautifulSoup

# pandas is a Python data analysis library
import pandas as pd

# NumPy is a Python library used for working with large, multi-dimensional arrays and matrices
import numpy as np

# the time module in Python has a function sleep() that you can use to suspend execution of the calling thread 
from time import sleep

# The randint() method returns a pseudo-random integer number 
from random import randint

The following block loads the initial Excel file. I created an original file which contains financial data from the 2017 fiscal year, not available through scraping by the below method. The block needs to be run every time the code is opened and used.
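
As a rough illustration of the data structure the next cell builds, the sketch below converts a small dataframe into a dictionary of lists, appends one row, and rebuilds the dataframe. The two-row frame is a hypothetical stand-in for the 'Microsoft' sheet (the figures are taken from the output shown further down), so no Excel file is needed to run it.

```python
import pandas as pd

# Hypothetical two-row stand-in for the 'Microsoft' sheet.
sheet = pd.DataFrame({
    "Fiscal Year": [2017, 2017],
    "Fiscal Quarter": [1, 1],
    "Segment": ["Windows", "Gaming"],
    "Revenue": [4643, 1885],
})

# orient="list" maps each column name to a plain Python list of its values.
columns = sheet.to_dict(orient="list")

# Appending one value to every list adds one row.
columns["Fiscal Year"].append(2018)
columns["Fiscal Quarter"].append(1)
columns["Segment"].append("Windows")
columns["Revenue"].append(4643)

# All lists must have the same length for the dataframe to be rebuilt.
rebuilt = pd.DataFrame.from_dict(columns)
print(rebuilt)
```

Keeping the lists the same length matters here: if a name is appended without a matching revenue value, rebuilding the dataframe will fail.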

In [90]:
# reads the original Excel file I made with 2017 data as a Pandas dataframe
xl_file = pd.ExcelFile('Microsoft.xlsx')

# create a dictionary with sheet names as keys and dataframes corresponding to each sheet as values
# in this case, only a single sheet exists
dfs = {sheet_name: xl_file.parse(sheet_name) 
          for sheet_name in xl_file.sheet_names}

# storing the relevant sheet to the name microsoft_data
microsoft_data = dfs.get('Microsoft')

# converting the sheet to a dictionary that maps each column name to a list of its values
my_dict = microsoft_data.to_dict('list')

my_dict

{'Fiscal Year': [2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017,
  2017],
 'Fiscal Quarter': [1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  1,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  2,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  3,
  4,
  4,
  4,
  4,
  4,
  4,
  4,
  4,
  4,
  4],
 'Segment': ['Office products and cloud services',
  'Server products and cloud services',
  'Windows',
  'Gaming',
  'Search advertising',
  'Enterprise Services',
  'Devices',
  'LinkedIn',
  'Other',
  'Total',
  'Office products and cloud services',
  'Server products and cloud services',
  'Windows',
  'Gaming',
  'Search advertising',
  'Enterprise Services',
  'Devices',
  'LinkedIn',
  'Other',
  'Total',
  'Office products and cloud

The following block extends the data loaded from the Excel file with segment revenues from Q1 of fiscal year 2018 to Q4 of fiscal year 2021. However, as of June 17, 2021, Q4 2021 figures are not available on the Microsoft website; Q4 data can be gathered directly once available. Running this block twice will duplicate rows. It does not need to be run again unless you wish to add several quarters of financial information at once, in which case you should alter the range of years to be scraped.
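
One way to avoid duplicate rows on a second run is to skip any quarter that is already present in the data. This is a minimal sketch, assuming the data is held in a dictionary of lists with 'Fiscal Year' and 'Fiscal Quarter' keys, as `my_dict` is; the helper name `already_scraped` is hypothetical and not part of the original script.

```python
def already_scraped(data, year, quarter):
    """Return True if any existing row belongs to the given fiscal year and quarter."""
    pairs = zip(data["Fiscal Year"], data["Fiscal Quarter"])
    return any(int(y) == int(year) and int(q) == int(quarter) for y, q in pairs)


# Hypothetical miniature of my_dict.
example = {
    "Fiscal Year": [2017, 2017, 2018],
    "Fiscal Quarter": [1, 2, 1],
    "Segment": ["Windows", "Windows", "Windows"],
    "Revenue": [4643, 4805, 4643],
}

print(already_scraped(example, 2018, 1))  # True
print(already_scraped(example, 2018, 2))  # False
```

Inside the loop, a check such as `if already_scraped(my_dict, year, quarter): continue` placed before the request would skip quarters that were scraped earlier.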

In [91]:
# using the NumPy arange() function, we can create vectors that return evenly spaced values within a given interval.
# years is a vector containing the range of years from which data will be scraped (in this case, 2018 to 2021) and should be altered for the desired range
# quarters is the vector [1 2 3 4]
years = np.arange(2018, 2022, 1) 
quarters = np.arange(1, 5, 1)

# iterate through each quarter of each year
for year in years:
    for quarter in quarters:
        
        # specify the URL for each reference date
        url = 'https://www.microsoft.com/en-us/Investor/earnings/FY-' + str(year) + '-Q' + str(quarter) + '/IRFinancialStatementsPopups?tag=us-gaap:SegmentReportingDisclosureTextBlock&title=More%20Personal%20Computing'
        
        # method we use to grab the contents of the URL
        results = requests.get(url)
        
        # soup holds the page as parsed by BeautifulSoup
        # "html.parser" selects Python's built-in HTML parser
        # parsing lets Python navigate the page's elements rather than treating it as one long string
        soup = BeautifulSoup(results.text, "html.parser")
        
        # the necessary financial data are stored in <table> elements with one of the following inline styles
        alltables = soup.find_all('table', attrs={'style' : ["margin:auto;border-collapse:collapse; width:100%;", "border-collapse:collapse; width:99.86%;", "border-collapse:collapse; width:100%;"]})
        
        # avoid overloading the website being scraped by reducing the crawl rate
        sleep(randint(2,10))
        
        for table in alltables:
            # only one table contains the necessary segment revenue
            # it is identified by checking whether a known segment name appears in the table's HTML
            if ('Office products and cloud services') in str(table):
                rows = table.find_all('tr')

                for row in rows:

                    # the process of finding revenue values on the page is hard-coded for the specific HTML of the Microsoft website, as we look for data based on HTML styles
                    # styles are subject to change in future quarters, so it is vital to reevaluate the data source website when scraping results are unexpected
                    # segment category names in the table are not right-justified                    
                    # The code finds segment names and adds them as dataset rows
                    names = row.find_all('p', style=lambda value: value and ('text-align:justify;' in value or 'text-align:left;' in value) and 'font-size:10pt;' in value)
                    for name in names:
                        # record the fiscal year and quarter of the page being scraped in "Fiscal Year" and "Fiscal Quarter"
                        my_dict['Fiscal Year'].append(year)
                        my_dict['Fiscal Quarter'].append(quarter)
                        print(str(year) + 'Q' + str(quarter))

                        print(name.text)
                        # adding the name of each segment category to "Segment"
                        my_dict['Segment'].append(name.text)
                    
                    # some tables have multiple columns of revenue values. We are only interested in the first column
                    # values is a list created for each row of the table 
                    # it will store all revenue numbers from each row, so the value in the first column will be the first element
                    # this is also a hard-coded solution to distinguishing from numbers in the first column and numbers in other columns
                    # values corresponding to each segment in the table are right-justified
                    # the find_all command will find all text with the stated properties
                    values = row.find_all('p', style=lambda value: value and 'text-align:right;' in value and 'font-size:10pt;' in value)
                    
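                    # drop any <p> element whose direct content is just a dollar sign, so the first remaining element is the number
                    values = [x for x in values if '$' not in x]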
                    
                    # don't count empty lists (empty rows) or empty strings
                    if len(values) > 0:
                        if len(values[0].text) > 0:
                            # as previously mentioned, the first item of the list will correspond to the first column
                            print(values[0].text)
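                            # remove thousands separators and any non-breaking spaces (\xa0) around the number
                            value_nocomma = values[0].text.replace(",", "").replace("\xa0", "").strip()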
                            # adding the revenue in each row to "Revenue"
                            
                            my_dict['Revenue'].append(value_nocomma)
                            # print(my_dict['Revenue']) 

2018Q1
Office products and cloud services
6,575
2018Q1
Server products and cloud services
5,496
2018Q1
Windows
4,643
2018Q1
Gaming
1,896

2018Q4
Office products and cloud services
28,316
2018Q4
Server products and cloud services
26,129

2019Q2
Office products and cloud services
7,747
2019Q2
Server products and cloud services
7,791

2019Q4
Server products and cloud services
32,622
2019Q4
Office products and cloud services
31,769

2020Q1
Server products and cloud services
9,192
2020Q1
Office products and cloud services
8,4

2020Q2
Server products and cloud services
10,119

2020Q3
Server products and cloud services
10,490

2020Q4
Server products and cloud services
41,379

2021Q1
Server products and cloud services
11,195

2021Q2
Server products and cloud services
12,729

2021Q3
Server products and cloud services
13,204

2021Q4
Server products and cloud services
52,589

In [92]:
# convert the dictionary back to a dataframe
df = pd.DataFrame.from_dict(my_dict)

# the dataframe is saved in a new CSV file
df.to_csv('Microsoft.csv', index = False)

In [93]:
df

Unnamed: 0,Fiscal Year,Fiscal Quarter,Segment,Revenue
0,2017,1,Office products and cloud services,5982
1,2017,1,Server products and cloud services,4689
2,2017,1,Windows,4643
3,2017,1,Gaming,1885
4,2017,1,Search advertising,1429
...,...,...,...,...
195,2021,4,Search advertising,8528
196,2021,4,Enterprise Services,6943
197,2021,4,Devices,6791
198,2021,4,Other,4479


The following block adds rows for a single quarter and can be used to update the existing file. Running this block appends the values for the selected reference quarter to the lists in my_dict. It is the same scraping procedure as above, without the nested loop over multiple years and quarters.
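
Because the same URL pattern is used for every quarter, it can be built by one small function. The sketch below assumes the pattern seen in the example link at the top of this notebook continues to hold for other quarters; the names `BASE`, `QUERY` and `build_url` are hypothetical and not part of the original script.

```python
BASE = "https://www.microsoft.com/en-us/Investor/earnings"
QUERY = ("tag=us-gaap:SegmentReportingDisclosureTextBlock"
         "&title=More%20Personal%20Computing")


def build_url(year, quarter):
    """Return the segment-disclosure URL for one fiscal year and quarter."""
    quarter = int(quarter)
    if quarter not in (1, 2, 3, 4):
        raise ValueError(f"quarter must be 1-4, got {quarter}")
    return f"{BASE}/FY-{int(year)}-Q{quarter}/IRFinancialStatementsPopups?{QUERY}"


print(build_url(2021, 1))
```

Validating the quarter up front means a typo such as `quarter = 5` fails immediately instead of sending a request for a page that does not exist.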

In [94]:
# year controls the reference year that you wish to scrape
# quarter controls the reference quarter that you wish to scrape
year = 2022
quarter = 1

# specify the URL for the specific reference period
url = 'https://www.microsoft.com/en-us/Investor/earnings/FY-' + str(year) + '-Q' + str(quarter) + '/IRFinancialStatementsPopups?tag=us-gaap:SegmentReportingDisclosureTextBlock&title=More%20Personal%20Computing'

# method we use to grab the contents of the URL
results = requests.get(url)

# soup holds the page as parsed by BeautifulSoup
# "html.parser" selects Python's built-in HTML parser
# parsing lets Python navigate the page's elements rather than treating it as one long string
soup = BeautifulSoup(results.text, "html.parser")

alltables = soup.find_all('table', attrs={'style' : ["margin:auto;border-collapse:collapse; width:100%;", "border-collapse:collapse; width:99.86%;", "border-collapse:collapse; width:100%;"]})

for table in alltables:
    if ('Office products and cloud services') in str(table):
        rows = table.find_all('tr')
        
        for row in rows:

            # the process of finding revenue values on the page is hard-coded for the specific HTML of the Microsoft website, as we look for data based on HTML styles
            # styles are subject to change in future quarters, so it is vital to reevaluate the data source website when scraping results are unexpected
            # segment category names in the table are not right-justified                    
            # The code finds segment names and adds them as dataset rows
            names = row.find_all('p', style=lambda value: value and ('text-align:justify;' in value or 'text-align:left;' in value) and 'font-size:10pt;' in value)
            for name in names:
                # record the fiscal year and quarter of the page being scraped in "Fiscal Year" and "Fiscal Quarter"
                my_dict['Fiscal Year'].append(year)
                my_dict['Fiscal Quarter'].append(quarter)
                print(str(year) + 'Q' + str(quarter))

                print(name.text)
                # adding the name of each segment category to "Segment"
                my_dict['Segment'].append(name.text)

            # some tables have multiple columns of revenue values. We are only interested in the first column
            # values is a list created for each row of the table 
            # it will store all revenue numbers from each row, so the value in the first column will be the first element
            # this is also a hard-coded solution to distinguishing from numbers in the first column and numbers in other columns
            # values corresponding to each segment in the table are right-justified
            # the find_all command will find all text with the stated properties
            values = row.find_all('p', style=lambda value: value and 'text-align:right;' in value and 'font-size:10pt;' in value)

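            # drop any <p> element whose direct content is just a dollar sign, so the first remaining element is the number
            values = [x for x in values if '$' not in x]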

            # don't count empty lists (empty rows) or empty strings
            if len(values) > 0:
                if len(values[0].text) > 0:
                    # as previously mentioned, the first item of the list will correspond to the first column
                    print(values[0].text)
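                    # remove thousands separators and any non-breaking spaces (\xa0) around the number
                    value_nocomma = values[0].text.replace(",", "").replace("\xa0", "").strip()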
                    # adding the revenue in each row to "Revenue"

                    my_dict['Revenue'].append(value_nocomma)
                    # print(my_dict['Revenue']) 


In [95]:
# convert the dictionary back to a dataframe
df = pd.DataFrame.from_dict(my_dict)

# save the modified file into CSV form to load into R

df.to_csv('Microsoft.csv', index = False)

In [96]:
df

Unnamed: 0,Fiscal Year,Fiscal Quarter,Segment,Revenue
0,2017,1,Office products and cloud services,5982
1,2017,1,Server products and cloud services,4689
2,2017,1,Windows,4643
3,2017,1,Gaming,1885
4,2017,1,Search advertising,1429
...,...,...,...,...
195,2021,4,Search advertising,8528
196,2021,4,Enterprise Services,6943
197,2021,4,Devices,6791
198,2021,4,Other,4479


The data will need to be cleaned before any analysis takes place. Specifically, scraping the relevant table on Microsoft's website includes the revenue total, which is not necessary for analysis. Another issue arises with Q4 values, which are not reported individually: the Q4 figures represent the total for the full fiscal year. To get the value for the three months belonging to Q4, subtract the sum of the three previous quarters from the full-year figure.
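
The cleaning steps described above could be sketched in pandas as follows. This is an illustration rather than the original author's method: it assumes the column names used in this notebook, that the total row is labelled exactly 'Total', and that Q1 to Q3 are all present for every segment and year (otherwise the derived Q4 value comes out as NaN). The function name `clean_revenue` is hypothetical.

```python
import pandas as pd


def clean_revenue(df):
    """Drop 'Total' rows, make Revenue numeric, and turn full-year Q4 figures into quarterly ones."""
    out = df.copy()
    # Scraped values may be strings containing commas or non-breaking spaces.
    out["Revenue"] = pd.to_numeric(
        out["Revenue"].astype(str)
        .str.replace(",", "", regex=False)
        .str.replace("\xa0", "", regex=False)
        .str.strip()
    ).astype(float)
    out = out[out["Segment"].str.strip() != "Total"].copy()

    # Sum of Q1 to Q3 for each (fiscal year, segment) pair.
    first_three = (
        out[out["Fiscal Quarter"] < 4]
        .groupby(["Fiscal Year", "Segment"])["Revenue"]
        .sum()
    )
    is_q4 = out["Fiscal Quarter"] == 4
    keys = pd.MultiIndex.from_frame(out.loc[is_q4, ["Fiscal Year", "Segment"]])
    prior = first_three.reindex(keys).to_numpy()
    # Q4 = full-year figure minus the sum of the three previous quarters.
    out.loc[is_q4, "Revenue"] = out.loc[is_q4, "Revenue"].to_numpy() - prior
    return out.reset_index(drop=True)


# Small example using the fiscal 2017 Windows figures shown earlier.
example = pd.DataFrame({
    "Fiscal Year": [2017, 2017, 2017, 2017, 2017],
    "Fiscal Quarter": [1, 2, 3, 4, 1],
    "Segment": ["Windows", "Windows", "Windows", "Windows", "Total"],
    "Revenue": [4643, 4805, 4253, "18593", 21928],
})
result = clean_revenue(example)
print(result)
```

In this example the Q4 Windows figure becomes 18593 - (4643 + 4805 + 4253) = 4892, and the 'Total' row is removed.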