This is a simple script that downloads Sony Investor Relations PDF financial report files from a specified range of dates into the working directory. The main webpage can be found at https://www.sony.com/en/SonyInfo/IR/library/presen/er/archive.html.

The following block loads all necessary packages and libraries required by the script. The block needs to be run every time the code is used to scrape data. If required packages are not installed, running the code will throw an error. Refer to Statistics Canada's instructions for installing and requesting packages on your Net B VDI. If Python is not yet installed on your system, you will need to submit an SRM for access.

In [1]:
# This module provides a portable way of using operating system dependent functionality. 
import os

# NumPy is a Python library used for working with large, multi-dimensional arrays and matrices
import numpy as np

# wget is a module used to download files
import wget

# PDFMiner is a text extraction tool for PDF documents that obtains the exact location of text as well as other layout
# information (fonts, etc.), performs automatic layout analysis, and can convert PDF into other formats (HTML/XML)
import pdfminer
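
Since a missing package makes the import block above fail, it can help to check first which packages are available. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical and is not part of the original script.

```python
import importlib.util

def missing_packages(names):
    """Return the names in `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# top-level modules imported by this script
required = ['os', 'numpy', 'wget', 'pdfminer']
print('Missing packages:', missing_packages(required))
```

An empty list means every package was found. Any name that is listed needs to be installed or requested before the script is run.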

The next block of code, if run, will download all available PDF files from FYQ1 2018 to FYQ4 2022, then convert each PDF file to HTML format. Running this code twice will duplicate the downloaded files. It does not need to be run again unless you wish to add several quarters of financial information at once, in which case you should alter the range of years from which data is required. Although the HTML conversions are difficult to scrape, the HTML files may be useful in the future. Otherwise, the lines of code that perform the HTML conversion can be commented out or removed.

In [3]:
# using the NumPy arange() function, we can create vectors that return evenly spaced values within a given interval.
# years is a vector containing the range of years from which data will be scraped (in this case, 2018 to 2022) and should be altered for the desired range
# quarters is the vector [1 2 3 4]
years = np.arange(18, 23, 1)
quarters = np.arange(1, 5, 1)

# iterate through each quarter of each year
for year in years:
    for quarter in quarters:
        
        # specify the URL for the specific reference period
        url = 'https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/' + str(year) + 'q' + str(quarter) + '_sony.pdf'
        
        # specify the pathname that the downloaded file will take
        # the path should be changed to place files into the desired folder
        path = str(year) + 'q' + str(quarter) + '_sony.pdf'
        
        print(url)
        
        print(path)
        
        try:
            # download the file available at the URL and save it to the location given by path
            # the path should be changed to place files into the desired folder
            wget.download(url, out=path)
            
            # the file pdf2txt.py included in PDFMiner converts PDF files to HTML
            # currently, each HTML file is named 'FYXXQX', but this can be changed by altering the command string
            command = 'pdf2txt.py -o FY' + str(year) + 'Q' + str(quarter) + '.html -t html ' + path
            
            # run the command through the system terminal/command prompt
            # HTML files will not be duplicated when the file name already exists in the directory
            os.system(command) 
            
        except Exception:
            # the download raises an error when no report is available at the URL
            print('Page not found')


https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/18q1_sony.pdf
18q1_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/18q2_sony.pdf
18q2_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/18q3_sony.pdf
18q3_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/18q4_sony.pdf
18q4_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/19q1_sony.pdf
19q1_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/19q2_sony.pdf
19q2_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/19q3_sony.pdf
19q3_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/19q4_sony.pdf
19q4_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/20q1_sony.pdf
20q1_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/20q2_sony.pdf
20q2_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/20q3_sony.pdf
20q3_sony.pdf
https://www.sony.com/en/SonyInfo/IR/library/presen/er/
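
Because running the loop twice duplicates files, one option is to skip any report that is already saved in the target folder. The sketch below assumes the URL and file name pattern used in the loop above. The names `report_name`, `pending_reports`, and the `folder` parameter are hypothetical and introduced only for illustration.

```python
import os

BASE_URL = 'https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/'

def report_name(year, quarter):
    """File name for a two-digit fiscal year and a quarter, e.g. '18q1_sony.pdf'."""
    return f'{year}q{quarter}_sony.pdf'

def pending_reports(years, quarters, folder='.'):
    """Return (url, path) pairs for reports not yet saved in `folder`."""
    pending = []
    for year in years:
        for quarter in quarters:
            name = report_name(year, quarter)
            path = os.path.join(folder, name)
            # only keep reports that are not already on disk
            if not os.path.exists(path):
                pending.append((BASE_URL + name, path))
    return pending
```

Each pair returned can then be passed to `wget.download(url, out=path)` inside the existing `try` block, so that only missing quarters are fetched.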

For future periods, you may want to download the financial report for a single quarter, such as the most recent available file. Running the following block will download the file located at the URL for the specified year and quarter, then convert that file to HTML. This is simply the previous method without the nested loop that accommodates multiple files.

In [4]:
# year controls the reference year that you wish to scrape
# quarter controls the reference quarter that you wish to scrape
year = 22
quarter = 4

# specify the URL for the specific reference period
url = 'https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/' + str(year) + 'q' + str(quarter) + '_sony.pdf'

# specify the pathname that the downloaded file will take
# the path should be changed to place files into the desired folder
path = str(year) + 'q' + str(quarter) + '_sony.pdf'

print(url)

print(path)

try:
    # download the file available at the URL and save it to the location given by path
    # the path should be changed to place files into the desired folder
    wget.download(url, out=path)
    
    # the file pdf2txt.py included in PDFMiner converts PDF files to HTML
    # currently, each HTML file is named 'FYXXQX', but this can be changed by altering the command string
    command = 'pdf2txt.py -o FY' + str(year) + 'Q' + str(quarter) + '.html -t html ' + path
    
    # run the command through the system terminal/command prompt
    # HTML files will not be duplicated when the file name already exists in the directory
    os.system(command) 

except Exception:
    # the download raises an error when no report is available at the URL
    print('Page not found')


https://www.sony.com/en/SonyInfo/IR/library/presen/er/pdf/22q4_sony.pdf
22q4_sony.pdf
Page not found
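
Note that `os.system` returns an exit status rather than raising an error, so a failed HTML conversion is not caught by the `try`/`except` above. As a sketch of one alternative, the command can be built as a list of arguments and run with `subprocess.run`, which raises an error on failure when `check=True` is set. The function names below are hypothetical, and the sketch assumes `pdf2txt.py` can be run directly from the system path, as in the original command. On some systems it may need to be invoked through the Python interpreter instead.

```python
import subprocess

def build_convert_command(year, quarter, pdf_path):
    """Build the pdf2txt.py argument list that converts one PDF to HTML."""
    html_name = f'FY{year}Q{quarter}.html'
    return ['pdf2txt.py', '-o', html_name, '-t', 'html', pdf_path]

def convert_to_html(year, quarter, pdf_path):
    """Run the conversion; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_convert_command(year, quarter, pdf_path), check=True)

# show the command that would be run for FY22 Q4 without running it
print(build_convert_command(22, 4, '22q4_sony.pdf'))
```

Passing the arguments as a list also avoids problems with spaces in file names, which the concatenated command string does not handle.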
