**Author:** Ronny Ashar
<br>
**Date:** May 2024
<br>
**Description:** This notebook demonstrates using OpenAI and some basic prompt engineering for document summarization. Large documents may exceed the token limits of some models, so we summarize in chunks of a few pages at a time and then consolidate those summaries into a final report.
<br>
**Note:** Some code is copied and modified from #StanfordTech16 examples.
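
As a rough sketch of the chunk-then-consolidate idea described above, the function below summarizes groups of pages and then summarizes the combined partial summaries. The names `summarize_in_chunks` and `toy_summarize` are hypothetical and used only for illustration; `toy_summarize` is a stand-in for a real model call, not the method used later in this notebook.

```python
def summarize_in_chunks(pages, summarize, pages_per_chunk=5):
    """Summarize page texts chunk by chunk, then summarize the partial summaries.

    pages: list of page texts (strings)
    summarize: any callable mapping text -> summary text (stand-in for a model call)
    """
    partial_summaries = []
    for start in range(0, len(pages), pages_per_chunk):
        # Join the pages of this chunk into a single piece of text
        chunk_text = "\n".join(pages[start:start + pages_per_chunk])
        partial_summaries.append(summarize(chunk_text))
    # Final pass: consolidate the partial summaries into one report
    return summarize("\n\n".join(partial_summaries))


def toy_summarize(text):
    # Illustration only: "summarize" by keeping the first 40 characters
    return text[:40]


pages = [f"Text of page {n}. " * 3 for n in range(12)]
final_report = summarize_in_chunks(pages, toy_summarize)
print(final_report)
```

Passing the summarizer in as an argument keeps the chunking logic separate from the model call, so the loop can be tried out without spending API credits.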


In [4]:
!pip install openai



You will need an OpenAI account and an API key to authenticate.
Refer to this documentation for setup instructions: https://platform.openai.com/docs/api-reference/introduction
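
Outside of Colab, the `google.colab` module is not available, so the key has to come from somewhere else. The sketch below is one possible fallback, assuming the key is stored in an environment variable named `OPENAI_API_KEY`; the helper name `load_api_key` is hypothetical.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key from Colab secrets if possible, else from an environment variable."""
    try:
        # Only importable when running inside Google Colab
        from google.colab import userdata
        return userdata.get("open_ai_key")
    except Exception:
        # Not in Colab, or the secret is missing: fall back to the environment
        key = os.environ.get(env_var)
        if not key:
            raise RuntimeError(f"No API key found; set the {env_var} environment variable")
        return key
```

Keeping the key in a secret store or an environment variable avoids pasting it into the notebook itself.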

In [5]:
from google.colab import userdata
api_key = userdata.get('open_ai_key')

In [6]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0


In [7]:
from openai import OpenAI
import requests

In [8]:
client = OpenAI(api_key=api_key)

We'll download the Federal Reserve's Monetary Policy Report, a 71-page document, use chunking to generate summaries for 5-page chunks, and then consolidate those into a final report. For smaller documents, chunking is not needed.
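
To decide whether a document is small enough to skip chunking, a rough size check can help. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text, and the `context_limit` value in the usage lines is illustrative rather than the limit of any particular model. The function names are hypothetical; a real tokenizer would be needed for exact counts.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate based on character count (an approximation, not an exact count)."""
    return int(len(text) / chars_per_token)


def needs_chunking(text, context_limit, reserved_for_reply=1000):
    """Return True if the estimated prompt size leaves no room within context_limit."""
    return estimate_tokens(text) > context_limit - reserved_for_reply


sample = "word " * 20000  # 100,000 characters of filler text
print(estimate_tokens(sample))
print(needs_chunking(sample, context_limit=16000))  # illustrative limit, an assumption
```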

In [9]:
!curl  -o fed_notes.pdf https://www.federalreserve.gov/publications/files/20240301_mprfullreport.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 5081k  100 5081k    0     0  31.0M      0 --:--:-- --:--:-- --:--:-- 31.2M


In [10]:
def summarize_document(document):
    # Call the OpenAI chat completions API to summarize one piece of text
    summary = client.chat.completions.create(
        model="gpt-3.5-turbo",
        # temperature=0.9,
        # max_tokens=500,
        messages=[
            {"role": "system", "content": "You are an analyst at Goldman Sachs summarizing sections of a document for a report your team will send to a managing director. Please summarize and be sure to include all relevant details"},
            {"role": "user", "content": document},
        ],
    )

    # Return the full response object; the text is in summary.choices[0].message.content
    return summary

In [11]:
from pypdf import PdfReader

reader = PdfReader("fed_notes.pdf")
num_pages = len(reader.pages)

In [12]:
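def getText(reader, start_page, end_page):
    # Concatenate the text of pages start_page..end_page (0-based, inclusive)
    consolidated_text = ''
    last_page = min(end_page, len(reader.pages) - 1)  # never index past the final page
    for num in range(start_page, last_page + 1):
        page = reader.pages[num]  # index by the loop variable, not start_page
        text = page.extract_text()
        consolidated_text += text
    return consolidated_text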

Note: the page-range loop in the cell below was generated by ChatGPT; I then made a minor edit to insert the call to my summarize_document function.

In [13]:
total_pages = num_pages
pages_per_batch = 5

summaries = ''
for i in range(0, total_pages, pages_per_batch):
    start_page = i
    # Page indices are 0-based, so the last valid index is total_pages - 1
    end_page = min(i + pages_per_batch - 1, total_pages - 1)
    print(f"Summarizing pages {start_page} to {end_page}")
    summary = summarize_document(getText(reader, start_page, end_page))
    # Separate chunk summaries so they do not run together
    summaries += summary.choices[0].message.content + "\n\n"

Summarizing pages 0 to 4
Summarizing pages 5 to 9
Summarizing pages 10 to 14
Summarizing pages 15 to 19
Summarizing pages 20 to 24
Summarizing pages 25 to 29
Summarizing pages 30 to 34
Summarizing pages 35 to 39
Summarizing pages 40 to 44
Summarizing pages 45 to 49
Summarizing pages 50 to 54
Summarizing pages 55 to 59
Summarizing pages 60 to 64
Summarizing pages 65 to 69
Summarizing pages 70 to 70


In the cell below, we simply take the concatenated summaries and reuse the exact original prompt to have it do the consolidated summarization. We can contrast that with a few cells below, where we provide more specific instructions regarding the format and length of the consolidated report.
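
Since the per-chunk call and the consolidation call differ only in their system prompt, one way to make that contrast explicit is to build the message list in a small helper. This is a sketch; `build_messages` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names, not part of the OpenAI library.

```python
DEFAULT_SYSTEM_PROMPT = (
    "You are an analyst summarizing sections of a document for a report. "
    "Please summarize and be sure to include all relevant details"
)


def build_messages(document, system_prompt=DEFAULT_SYSTEM_PROMPT):
    """Return a chat message list: one system instruction plus the text to summarize."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": document},
    ]


# Same document text, two different instructions
generic = build_messages("partial summaries go here")
specific = build_messages(
    "partial summaries go here",
    system_prompt="Prepare a consolidated report. Include employment trends.",
)
print(generic[0]["content"])
print(specific[0]["content"])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`, which makes it easy to rerun the consolidation with a different prompt.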

In [14]:
consolidated_summary=summarize_document(summaries)

In [15]:
print(consolidated_summary.choices[0].message.content)

The document provides a comprehensive overview of several key economic and financial aspects that are likely to be covered in the Federal Reserve's Monetary Policy Report for March 1, 2024. 

1. **Technology Sector Trends**: Discusses market trends in cloud computing, artificial intelligence, and cybersecurity, emphasizing opportunities for growth and profitability in these subsectors.

2. **Inflation and Consumer Energy Prices**: Highlights that while inflation has eased from its peak, it remains above the Federal Open Market Committee's target of 2 percent. It also details the moderation in oil prices and the impact on consumer energy prices.

3. **Labor Market Trends**: Examines employment patterns across different demographic groups, noting record highs in employment for prime-age women and decreasing gaps in employment ratios between various demographic groups.

4. **Wage Growth and Productivity**: Wage growth has slowed compared to the previous year, but productivity growth has s

In [16]:
len(consolidated_summary.choices[0].message.content)

2655

In the cell below, our prompt is more specific, asking for a 10-page summary of at least 5000 words.

In [17]:
summary_report = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    #temperature=0.9,
    #max_tokens=500,
    messages = [
        {"role": "system", "content": (
            "You are an analyst at Goldman Sachs preparing a consolidated summary report to send to a managing director. "
            "Please do a 10-page summary including relevant details. Make sure the summary is at least 5000 words long. "
            "Be sure to include details on employment trends and interest rate forecasts!"
        )},
        {"role": "user", "content": summaries}
    ]
)

In [18]:
from ipywidgets import widgets
out = widgets.Output(layout={'border': '1px solid black'})

In [19]:
with out:
  print(summary_report.choices[0].message.content)

out

Output(layout=Layout(border='1px solid black'))

In [20]:
len(summary_report.choices[0].message.content)

5303
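
Note that `len()` on a string counts characters, not words, so a result of 5303 characters is well short of the 5000 words the prompt asked for. A minimal sketch for checking a response against a word target follows; the helper name `length_report` is hypothetical.

```python
def length_report(text, target_words):
    """Count characters and whitespace-separated words, and compare against a word target."""
    words = len(text.split())
    return {
        "characters": len(text),
        "words": words,
        "meets_target": words >= target_words,
    }


print(length_report("Inflation has eased but remains above target.", target_words=5000))
```

In the notebook, the same check could be applied to `summary_report.choices[0].message.content`.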

In [21]:
summary_report = client.chat.completions.create(
    model = "gpt-3.5-turbo",
    temperature=0.8,
    #max_tokens=500,
    messages = [
        {"role": "system", "content": (
            "You are a professor trying to make a finance and economics class fun for your students. "
            "Please do a summary sonnet in the style of Shakespeare. "
            "Be sure to include details on employment trends and interest rate forecasts!"
        )},
        {"role": "user", "content": summaries}
    ]
)

In [23]:
print(summary_report.choices[0].message.content)


In realms of finance and markets so grand,
The Monetary Policy Report doth stand,
Interest rates and trends of employment strong,
A sonnet for students, a merry song.

Technological trends in cloud and AI,
Employment rises, reaching for the sky,
Cybersecurity, a vital need,
For companies to prosper, to succeed.

Inflation eases, yet remains quite high,
Interest rates forecasted to the sky,
The labor market, trends of varied kind,
Equality in work, a goal enshrined.

To make finance fun for students keen,
Economics and markets, a vibrant scene.
