### Summarization of financial reports

Task: Use various Azure OpenAI engines in a zero-shot setup to generate summaries of financial reports.

In [24]:
!pip install pdfminer.six
try:
    import openai
except ModuleNotFoundError:
    # Install the client library if it is missing, then import it
    !pip install openai
    import openai





In [25]:
!pip install num2words
import os
import re
import requests
from num2words import num2words
from pdfminer.high_level import extract_text





#### Loading and setting attributes for the Azure OpenAI endpoint

The instructors will provide the endpoint URL and key from their Azure OpenAI portal. Add them below.

In [26]:
openai.api_base = "" # Locate the Endpoint URL in the Azure OpenAI portal and add it here
openai.api_key = "" # Add provided api key here
openai.api_type = "azure"
openai.api_version =  "2022-03-01-preview"

Note: Although we specify the endpoint and the key explicitly here, real-world deployments should use Azure Key Vault and Azure Active Directory (AAD) based authentication instead.
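As a small step away from hard-coded credentials, the values can be read from environment variables. This is only a minimal sketch and not a replacement for Key Vault or AAD authentication; the function name `load_azure_openai_settings` and the variable names `AZURE_OPENAI_ENDPOINT` and `AZURE_OPENAI_KEY` are assumptions made for illustration.

```python
import os

def load_azure_openai_settings(env=os.environ):
    """Read the endpoint URL and key from environment variables.

    The variable names are hypothetical; use whatever names your
    environment defines.
    """
    endpoint = env.get("AZURE_OPENAI_ENDPOINT", "")
    key = env.get("AZURE_OPENAI_KEY", "")
    if not endpoint or not key:
        raise RuntimeError(
            "Set AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_KEY before running the notebook."
        )
    return endpoint, key
```

The returned pair could then be assigned in the cell above, for example `openai.api_base, openai.api_key = load_azure_openai_settings()`.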

#### Summary of summaries approach

GPT3 has a token limit of 2048, which means that the prompt and the completion together cannot exceed 2048 tokens.
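To get a feel for this budget, a prompt's token count can be roughly estimated before sending it. The sketch below assumes the common rule of thumb of about four characters per token for English text; it is not a real tokenizer, and the helper names `estimate_tokens` and `fits_in_context` are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate (assumption: ~4 characters per token)."""
    # Ceiling division so that a partial token still counts as one
    return -(-len(text) // chars_per_token)

def fits_in_context(prompt, max_completion_tokens, limit=2048):
    """Check whether prompt plus completion stay within the token limit."""
    return estimate_tokens(prompt) + max_completion_tokens <= limit
```

For example, a prompt of about 8,000 characters with `max_completion_tokens=200` would be estimated at 2,200 tokens and would not fit in a 2048-token limit.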

To summarize text documents that are longer than 2048 tokens, we use a summary of summaries approach. It consists of two stages, as explained below:

<b>Stage 1</b>

The combined text extracted from the PDF is broken down into smaller sub-documents and all of them are summarized individually.

<b>Stage 2</b>

The summaries generated in Stage 1 are concatenated, and the combined text is summarized again.


Note: The token limit has been increased to 4096 for the 002 series of the Instruct models (also known as GPT3.5).
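The two stages can be sketched independently of any model. In the illustration below, `summarize` is any callable that maps text to a shorter text; `first_sentence` is a stand-in used only so the example runs without an API call, and all function names here are hypothetical.

```python
def split_words(text, n):
    """Break text into chunks of at most n whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(0, len(words), n)]

def summary_of_summaries(text, summarize, chunk_words=200):
    """Apply `summarize` to each chunk, then to the joined chunk summaries."""
    # Stage 1: summarize each sub-document individually
    partial = [summarize(chunk) for chunk in split_words(text, chunk_words)]
    # Stage 2: summarize the concatenated Stage 1 summaries
    return summarize(" ".join(partial))

def first_sentence(text):
    """Stand-in summarizer for illustration only: keeps the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."
```

In the notebook, the role of `summarize` is played by calls to the Completions endpoint with a summarization prompt.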

In [27]:
# Split the text scraped from the PDF into shorter sub-documents that fit within the token limit
def splitter(n, s):
    pieces = s.split()
    list_out = [" ".join(pieces[i:i+n]) for i in range(0, len(pieces), n)]
    return list_out

# Data cleaning
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (spaces, tabs, newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove stray ". ," sequences (the dot is escaped so it matches a literal period)
    s = re.sub(r"\. ,","",s)
    # collapse doubled periods left over from extraction
    s = s.replace("..",".")
    s = s.replace(". .",".")
    return s

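# Drop a trailing incomplete sentence from a completion
def trim_incomplete(t):
    # A final period that does not follow a lowercase letter (e.g. after a digit)
    # is treated as part of an unfinished sentence and removed
    if t.endswith('.'):
        if not re.search(r'[a-z]\.$',t):
            t = t[:-1]

    # If the text does not end with a period, cut it back to the last complete sentence
    if not t.endswith('.'):
        t = t.rsplit('. ', 1)[:-1]
        t = "".join(t)+'.'
    
    t = t.strip()
    return t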

Here, we are going to summarize a real-world financial document.

We are summarizing the document at this URL: https://www.rathbones.com/sites/rathbones.com/files/imce/rathbones_2020_preliminary_results_announcement_-_final-.pdf

Go ahead and check out the report by clicking on the link above.

In [28]:
# URL of financial report to be summarized
url = "https://www.rathbones.com/sites/rathbones.com/files/imce/rathbones_2020_preliminary_results_announcement_-_final-.pdf"
r = requests.get(url)

Please go ahead and use a file handler to write the contents of the above request into a local document called 'report.pdf'

Hint: Use open() with the 'wb' mode

In [29]:
### Insert code here
with open("report.pdf","wb") as f:
    f.write(r.content)

In [30]:
# Setting the directory to read the local document from
name = os.path.abspath(os.path.join(os.getcwd(), 'report.pdf')).replace('\\', '/')

In this exercise, we summarize only the first page of the PDF, which gives a broad description of financial performance for the year 2020. We specify the page indices below; the list can be extended to cover more pages.

In [31]:
pdfms_pages = [0]

In [32]:
def summarizer_wrapper(engine_name, name, pdfms_pages):
    """A wrapper function around the summary of summaries completion calls

    Args:
        engine_name (str): Deployment name from Azure OpenAI portal
        name (str): Path to the pdf file
        pdfms_pages (list): Indices of the pages to be summarized in the PDF (starting from 0)

    Returns:
        str: Summary of the long document
    """
    text = extract_text(name, page_numbers=pdfms_pages)

    r = splitter(200, text)

    # The token limit of GPT3 is 2048.
    # We divide an approximate budget of 2000 tokens evenly across the Stage 1 sub-documents,
    # so that the concatenated Stage 1 summaries still fit in the Stage 2 prompt.
    # This dynamically controls the length of each summary generated in Stage 1
    tok_l = int(2000/len(r))

    # Adding max_tokens to prompt as words
    tok_l_w = num2words(tok_l)

    res_lis = []


    # Stage 1
    # The sub-documents of the PDF are summarized here
    # The nature of summaries is controlled by the prompt, hyperparameters and engine
    for i in range(len(r)):
        prompt_i = f'Extract and summarize the key financial numbers and percentages mentioned in the Text in less than {tok_l_w} words.\n\nText:\n'+normalize_text(r[i])+'\n\nSummary in one paragraph:'
        response = openai.Completion.create(
            engine=engine_name,
            prompt = prompt_i,
            temperature = 0,
            max_tokens = tok_l,
            top_p = 1.0,
            frequency_penalty=0.5,
            presence_penalty = 0.5,
            best_of = 1
        )
        t = response.choices[0].text
        
        t = trim_incomplete(t)
        res_lis.append(t)



    # Stage 2
    # The summaries generated above are stored in Python list res_lis
    # The sub-document summaries are concatenated into a string and passed to the Completions endpoint
    prompt_i = 'Summarize the financial performance of the business like revenue, profit, etc. in less than one hundred words. Do not make up values that are not mentioned in the Text.\n\nText:\n'+" ".join([normalize_text(res) for res in res_lis])+'\n\nSummary:\n'
    response = openai.Completion.create(
            engine=engine_name,
            prompt = prompt_i,
            temperature = 0,
            max_tokens = 200,
            top_p = 1.0,
            frequency_penalty=0.5,
            presence_penalty = 0.5,
            best_of = 1
        )

    return trim_incomplete(response.choices[0].text)

Using GPT3 with Curie engine

Instructors will provide you with the engine names.

In [33]:
### Add the engine name as first argument
summarizer_wrapper("curie", name, pdfms_pages)

'Rathbones reported strong financial performance in 2020, with FUMA growing by 8.5% and underlying profit before tax increasing by 4.3%. The company also reported profits before tax of £43.8 million, up 5.2% from the previous year. The company reported an increase in operating income and earnings per share, as well as a declaration of final dividend.'

Next, let's look at summarization with the Davinci Instruct engine, using both GPT3 and GPT3.5.

Activity: Using GPT3 with Davinci engine

In [36]:
### Insert code here
summarizer_wrapper("davinci", name, pdfms_pages)

'Rathbones delivered a strong performance in 2020, with funds under management and administration (FUMA) growing by 8.5% to reach £54.7 billion at the end of the year. Underlying profit before tax increased by 4.3% to £92.5 million, delivering an underlying operating margin of 25.3%. Total net inflows across the group were £2.1 billion, representing a growth rate of 4.2%. Profit before tax for the year was £43.8 million, with basic earnings per share totalling 49.6p. Operating income for the year was 5.2% ahead of the prior year, totalling £366.1 million.'

Activity: Using GPT3.5 with Davinci engine

In [37]:
### Insert code here
summarizer_wrapper("davinci", name, pdfms_pages)

'Rathbones delivered a strong performance in 2020, with funds under management and administration (FUMA) growing by 8.5% to reach £54.7 billion at the end of the year. Underlying profit before tax increased by 4.3% to £92.5 million, delivering an underlying operating margin of 25.3%. Total net inflows across the group were £2.1 billion, representing a growth rate of 4.2%. Profit before tax for the year was £43.8 million, with basic earnings per share totalling 49.6p. Operating income for the year was 5.2% ahead of the prior year, totalling £366.1 million.'

The hyperparameters specified above were found to give good performance for the summarization task in question. Feel free to play around with the hyperparameters and see how the completions change.

Activity: Go ahead and summarize the first five pages instead of just the first page, and see how the generated summaries change.

In [39]:
# 5 pages summarization
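When more pages are selected, the text is split into more sub-documents, so each Stage 1 summary gets a smaller share of the token budget. The sketch below mirrors the `int(2000/len(r))` calculation from the wrapper; the helper `stage1_budget` and the word counts are made up for illustration, and unlike the wrapper it guards against an empty document.

```python
def stage1_budget(total_words, chunk_words=200, budget_tokens=2000):
    """Return (number of sub-documents, max tokens per Stage 1 summary)."""
    # Ceiling division gives the number of chunks; at least one to avoid dividing by zero
    n_chunks = max(1, -(-total_words // chunk_words))
    return n_chunks, budget_tokens // n_chunks

# Page indices for the first five pages (indices start at 0)
pdfms_pages = list(range(5))

# Hypothetical word counts: one page vs. five pages
one_page = stage1_budget(600)     # 3 sub-documents
five_pages = stage1_budget(3000)  # 15 sub-documents
```

With these made-up counts, the per-summary budget drops from 666 tokens to 133 tokens, so the Stage 1 summaries for a five-page run would be much shorter.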
