**Author:** Ronny Ashar
<br>
**Date:** May 2024
<br>
**Description:** This notebook demonstrates using the OpenAI API with some basic prompt engineering to summarize documents. Large documents may exceed the token limits of some models, so we summarize a few pages at a time and then consolidate those partial summaries into a final report.
<br>
**Note:** Some code was copied and modified from the #StanfordTech16 examples.


In [None]:
!pip install openai

Collecting openai
  Downloading openai-1.30.1-py3-none-any.whl (320 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.27.0-py3-none-any.whl (75 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.5-py3-none-any.whl (77 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-

You will need an OpenAI account and an API key to authenticate.
Refer to this documentation for setup instructions: https://platform.openai.com/docs/api-reference/introduction
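
Outside Colab, `google.colab.userdata` is not available. A minimal sketch of one alternative, assuming the key is stored in an environment variable named `OPENAI_API_KEY` (the name the `openai` client reads by default when no `api_key` is passed). The helper `load_api_key` is hypothetical and not part of this notebook:

```python
import os

def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Failing early is easier to debug than an authentication error later.
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

The returned value could then be assigned to `api_key` in place of the `userdata.get(...)` call.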

In [None]:
from google.colab import userdata
api_key = userdata.get('open_ai_key')

In [None]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0


In [None]:
from openai import OpenAI
import requests

In [None]:
client = OpenAI(api_key=api_key)

We'll download the Federal Reserve's Monetary Policy Report, a 71-page document, summarize it in 5-page chunks, and then consolidate those summaries into a final report. For smaller documents that fit within the model's token limit, chunking is not needed.
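
To make the chunking arithmetic concrete, here is a small sketch that splits a page count into batches. The helper `page_batches` is hypothetical, and it assumes zero-based, inclusive page indices, which is how `pypdf` indexes `reader.pages`:

```python
from typing import Iterator, Tuple

def page_batches(total_pages: int, pages_per_batch: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) zero-based inclusive page indices for each batch."""
    if pages_per_batch <= 0:
        raise ValueError("pages_per_batch must be positive")
    for start in range(0, total_pages, pages_per_batch):
        # The final batch may be shorter than pages_per_batch.
        end = min(start + pages_per_batch, total_pages) - 1
        yield start, end

batches = list(page_batches(71, 5))
print(len(batches), batches[0], batches[-1])  # 15 (0, 4) (70, 70)
```

For a 71-page document this gives 14 full batches and one final single-page batch.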

In [None]:
!curl  -o fed_notes.pdf https://www.federalreserve.gov/publications/files/20240301_mprfullreport.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5081k  100 5081k    0     0  14.1M      0 --:--:-- --:--:-- --:--:-- 14.1M


In [None]:
def summarize_document(document):
    """Summarize a block of text with a single chat completion call."""
    # max_tokens caps the length of each summary. The full response object is
    # returned, so callers read the text from .choices[0].message.content.
    summary = client.chat.completions.create(
        model="gpt-3.5-turbo",
        # temperature=0.9,
        max_tokens=500,
        messages=[
            {"role": "system", "content": "You are an analyst at Goldman Sachs summarizing sections of a document for a report your team will send to a managing director. Please summarize and be sure to include all relevant details"},
            {"role": "user", "content": document},
        ],
    )

    return summary

In [None]:
from pypdf import PdfReader

reader = PdfReader("fed_notes.pdf")
num_pages = len(reader.pages)

In [None]:
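def getText(reader, start_page, end_page):
    """Concatenate the text of pages start_page..end_page (zero-based, inclusive)."""
    # Clamp to the last valid index so an end_page past the document is safe.
    last_page = min(end_page, len(reader.pages) - 1)
    consolidated_text = ''
    for num in range(start_page, last_page + 1):
        # Extract each page in the range in turn.
        page = reader.pages[num]
        text = page.extract_text()
        consolidated_text += text
    return consolidated_text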

Note: the page-range loop in the cell below was generated by ChatGPT; I then made a minor edit to insert the call to my summarize_document function.

In [None]:
total_pages = num_pages
pages_per_batch = 5

summaries = ''
for i in range(0, total_pages, pages_per_batch):
    start_page = i
    # Page indices are zero-based, so the last valid index is total_pages - 1.
    end_page = min(i + pages_per_batch - 1, total_pages - 1)
    print(f"Summarizing pages {start_page} to {end_page}")
    summary = summarize_document(getText(reader, start_page, end_page))
    summaries += (summary.choices[0].message.content)

Summarizing pages 0 to 4
Summarizing pages 5 to 9
Summarizing pages 10 to 14
Summarizing pages 15 to 19
Summarizing pages 20 to 24
Summarizing pages 25 to 29
Summarizing pages 30 to 34
Summarizing pages 35 to 39
Summarizing pages 40 to 44
Summarizing pages 45 to 49
Summarizing pages 50 to 54
Summarizing pages 55 to 59
Summarizing pages 60 to 64
Summarizing pages 65 to 69
Summarizing pages 70 to 70


In the cell below we simply take the concatenated summaries and reuse the exact original prompt to have it produce the consolidated summary. We can contrast that with a few cells further down, where we provide more specific instructions regarding the format and length of the consolidated report.
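
The overall pattern (summarize each chunk, then summarize the joined summaries) can be sketched independently of any API. In this sketch `summarize_in_chunks` and `first_sentence` are hypothetical names; `first_sentence` is only a stand-in summarizer so the example runs offline. In this notebook the `summarize` argument would instead wrap `summarize_document` and return its `.choices[0].message.content`.

```python
from typing import Callable, List

def summarize_in_chunks(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    partial = [summarize(chunk) for chunk in chunks]
    # A blank line between partial summaries keeps them visually distinct.
    combined = "\n\n".join(partial)
    return summarize(combined)

def first_sentence(text: str) -> str:
    """Stand-in summarizer for illustration only: keep the first sentence."""
    return text.split(".")[0].strip() + "."

chunks = ["Inflation eased. It remains elevated.", "Employment grew. Wages rose."]
print(summarize_in_chunks(chunks, first_sentence))  # Inflation eased.
```

Passing the summarizer in as a function makes it straightforward to use one prompt for the chunk step and, with a second callable, a different prompt for the consolidation step.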

In [None]:
consolidated_summary=summarize_document(summaries)

In [None]:
print(consolidated_summary.choices[0].message.content)

The Monetary Policy Report from the Board of Governors of the Federal Reserve System for March 1, 2024, provides a comprehensive overview of key economic indicators and factors influencing the Federal Reserve's monetary policy decisions. The report covers various aspects including market trends, sector performance, stock recommendations, risk assessment, inflation, employment trends, wage growth, housing market developments, interest rates, and foreign economic growth. It highlights the importance of understanding the current state of the economy and the Federal Reserve's perspective to guide investment decisions. Key points discussed in the document include:

1. **Inflation**: Inflation has eased but remains elevated, with the price index for personal consumption expenditures (PCE) still above the Federal Open Market Committee's (FOMC) longer-run objective of 2 percent. Core PCE prices and consumer energy prices have also shown fluctuations influenced by geopolitical tensions.

2. **E

In [None]:
len(consolidated_summary.choices[0].message.content)

2903
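
Because `summarize_document` sets `max_tokens=500`, a response can be cut off when it reaches that cap. The Chat Completions response reports this through `finish_reason`, which is `"length"` when the token limit ended the output. A minimal sketch of a check follows; the helper `was_truncated` is hypothetical, and the fake response object below only mimics the attribute shape so the example runs without an API call:

```python
from types import SimpleNamespace

def was_truncated(response) -> bool:
    """True if the first choice stopped because it hit the token cap."""
    return response.choices[0].finish_reason == "length"

# Fake response with the same attribute layout, for illustration only.
fake = SimpleNamespace(choices=[SimpleNamespace(finish_reason="length")])
print(was_truncated(fake))  # True
```

Calling `was_truncated(consolidated_summary)` would indicate whether the consolidated summary ended at the cap rather than at a natural stopping point.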

In the cell below our prompt is more specific, asking for a 10-page summary of at least 5,000 words.

In [None]:
summary_report = client.chat.completions.create(
    model="gpt-3.5-turbo",
    # temperature=0.9,
    # max_tokens=500,
    messages=[
        # Adjacent string literals are joined, which keeps the source
        # indentation out of the prompt text.
        {"role": "system", "content": (
            "You are an analyst at Goldman Sachs preparing a consolidated summary report to send to a managing director. "
            "Please do a 10-page summary including relevant details. Make sure the summary is at least 5000 words long. "
            "Be sure to include details on employment trends and interest rate forecasts!"
        )},
        {"role": "user", "content": summaries},
    ],
)

In [None]:
from ipywidgets import widgets
out = widgets.Output(layout={'border': '1px solid black'})

In [None]:
with out:
  print(summary_report.choices[0].message.content)

out

Output(layout=Layout(border='1px solid black'))

In [None]:
len(summary_report.choices[0].message.content)

4794
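
Note that `len` on a string counts characters, so the 4794 above is a character count, while the prompt asked for at least 5000 words. A minimal sketch for counting words instead, assuming whitespace-separated tokens are a good enough definition of a word (the helper `word_count` is hypothetical):

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())

sample = "Inflation has eased but remains elevated."
print(len(sample), word_count(sample))  # 41 6
```

Applying `word_count` to `summary_report.choices[0].message.content` would show how far the output falls from the requested length.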