# Long Form Summarization using OpenAI's `GPT-4 8k` and `DocAI`

This notebook covers the following steps:

<span style="color:blue; font-weight:bold">Setup</span>
- Import Libraries: Import the required Python libraries.
- Initialize Logging: Set up a logger to record each operation.

<span style="color:green; font-weight:bold">Configuration</span>
- Define Parameters: Lay out all the essential parameters for our operations.
- Authentication: Load the Google Cloud and OpenAI credentials.
- Setup Clients: Initialize the API clients for our services.

<span style="color:purple; font-weight:bold">Document Preprocessing</span>
- Split PDF: Break the PDF into parts of at most 15 pages to stay within the DocAI per-document page limit.

<span style="color:orange; font-weight:bold">Document OCR</span>
- Convert to Text: Use DocAI to transcribe the PDF content into text format.

<span style="color:red; font-weight:bold">Summarization</span>
- Use `GPT-4 8k`: Run multiple rounds of map-reduce summarization over the document with the `GPT-4 8k` model.

<span style="color:teal; font-weight:bold">Final Consolidation</span>
- Refine Summary: Make sure the summary is crisp and to the point.
- Save: Store the final summary for future reference.

<span style="color:brown; font-weight:bold">Focused Summarization</span>
- Extract Sections: Dive deeper and pull out specific sections from the final summary for a more focused understanding.
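
The rounds of summarization listed above follow a map-reduce pattern: group the current texts, summarize each group, and repeat on the results. The sketch below illustrates that control flow only. `chunked`, `summarize_rounds`, and `fake_summarize` are hypothetical names, and the stand-in summarizer simply truncates text instead of calling a model.

```python
from typing import Callable, List


def chunked(items: List, size: int) -> List[List]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def summarize_rounds(pages: List[str],
                     summarize: Callable[[str], str],
                     group_sizes: List[int]) -> List[str]:
    """Run one summarization round per entry in `group_sizes`."""
    contexts = list(pages)
    for size in group_sizes:
        # Map: summarize each group; the outputs feed the next round
        contexts = [summarize('\n'.join(group)) for group in chunked(contexts, size)]
    return contexts


def fake_summarize(text: str) -> str:
    """Stand-in for a model call (assumption): keep the first 20 characters."""
    return text[:20]


pages = [f'page {i} text' for i in range(12)]
result = summarize_rounds(pages, fake_summarize, group_sizes=[5, 25])
print(len(result))
```

With group sizes of 5 and 25, twelve pages collapse to three summaries and then to one.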

#### Imports 

In [46]:
from langchain.prompts.chat import SystemMessagePromptTemplate
from langchain.prompts.chat import HumanMessagePromptTemplate
from google.api_core.client_options import ClientOptions
from langchain.prompts.chat import ChatPromptTemplate
from concurrent.futures import ThreadPoolExecutor
from langchain.chat_models import ChatOpenAI
from langchain.schema import SystemMessage
from langchain.schema import HumanMessage
from google.cloud import documentai
from pypdf import PdfWriter
from pypdf import PdfReader 
from tqdm import tqdm
import tiktoken
import vertexai
import requests
import logging
import time
import json
import yaml
import os

##### Setup logging 

In [2]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

#### Essentials 

In [3]:
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './../credentials/vai-key.json'
access_token = !gcloud auth print-access-token

In [5]:
with open('./../credentials/oai-key.yml', 'rb') as f:
    credentials = yaml.safe_load(f)
    
api_key = credentials['key']
os.environ['OPENAI_API_KEY'] = api_key

In [6]:
PROJECT_ID = 'arun-genai-bb'
LOCATION = 'us-central1'
MODEL_NAME = 'gpt-4'
ENCODING_NAME = 'cl100k_base'
CONTEXT_LENGTH = 8196  # GPT-4 8k (the documented context window is 8192 tokens)
STREAMING_API_URL = f'https://us-central1-aiplatform.googleapis.com/ui/projects/{PROJECT_ID}/locations/us-central1/publishers/google/models/{MODEL_NAME}:serverStreamingPredict'
DOCAI_PROCESSOR_NAME = 'projects/390991481152/locations/us/processors/ad9557a5be49204e'  # copy from notebook 00
vertexai.init(project=PROJECT_ID, location=LOCATION)

In [7]:
client_options = ClientOptions(api_endpoint='us-documentai.googleapis.com')
docai_client = documentai.DocumentProcessorServiceClient(client_options=client_options)

In [8]:
encoder = tiktoken.get_encoding(ENCODING_NAME)
logger.info(f'Using encoder=={encoder.name}')

Using encoder==cl100k_base


In [10]:
model = ChatOpenAI(openai_api_key=api_key, 
                   model_name=MODEL_NAME, 
                   temperature=0.0, 
                   max_tokens=512)
logger.info(f'Using model=={model.model_name}')

Using model==gpt-4


#### Use Google DocumentAI to process input PDF

##### Break PDF into smaller PDFs for OCR

In [11]:
LOCAL_INPUT_DIR = './DATA/INPUT'
LOCAL_OUTPUT_DIR = './DATA/OUTPUT'
FILE_NAME = 'file-2'

In [12]:
reader = PdfReader(f'{LOCAL_INPUT_DIR}/{FILE_NAME}.pdf') 
pages = {}

for i, page in enumerate(reader.pages):
    pages[i] = page

In [13]:
n = len(reader.pages)
d = 15  # DocAI currently limits each document to 15 pages
parts_dir = f'{LOCAL_INPUT_DIR}/{FILE_NAME}/PARTS/'
os.makedirs(parts_dir, exist_ok=True)
for i in range(0, n, d):
    writer = PdfWriter()
    # Copy up to d pages, stopping at the end of the document
    for j in range(i, min(i + d, n)):
        writer.add_page(pages[j])
    # File names carry the 1-based page range of the part; the upper bound is not clamped to n
    with open(f'{parts_dir}{FILE_NAME}_{i+1}-{i+d}.pdf', 'wb') as f:
        writer.write(f)
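
As a quick check on the splitting arithmetic, the sketch below computes the 1-based page range of each part for a given page count and part size. `page_ranges` is a hypothetical helper used only for illustration. Unlike the file names written above, it clamps the last range to the final page.

```python
def page_ranges(n_pages: int, part_size: int) -> list:
    """Return 1-based inclusive (first, last) page ranges, clamped to n_pages."""
    return [(start + 1, min(start + part_size, n_pages))
            for start in range(0, n_pages, part_size)]


ranges = page_ranges(1089, 15)
print(len(ranges), ranges[0], ranges[-1])
```

For a 1089-page document and 15 pages per part this yields 73 parts, the last of which holds only nine pages.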

In [14]:
def layout_to_text(layout: documentai.Document.Page.Layout, text: str) -> str:
    """
    Document AI identifies text in different parts of the document by their
    offsets in the entirety of the document's text. This function converts
    offsets to a string.
    """
    # If a text segment spans several lines, it will be stored in different text segments.
    return ''.join(text[int(segment.start_index): int(segment.end_index)] for segment in layout.text_anchor.text_segments)
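
To make the offset logic concrete, the sketch below applies the same slicing to stand-in objects. `segments_to_text` is a hypothetical helper, and `SimpleNamespace` only mimics the `start_index` and `end_index` attributes of a text segment; it does not use the Document AI types.

```python
from types import SimpleNamespace


def segments_to_text(segments, text: str) -> str:
    """Concatenate the slices of `text` addressed by each segment's offsets."""
    return ''.join(text[int(s.start_index):int(s.end_index)] for s in segments)


full_text = 'Capital rules\napply to banks.\n'
segments = [
    SimpleNamespace(start_index=0, end_index=14),   # first line, including its newline
    SimpleNamespace(start_index=14, end_index=30),  # second line
]
print(repr(segments_to_text(segments, full_text)))
```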

In [15]:
def get_file_paths(dir_name: str) -> list:
    file_paths = []
    for file_name in os.listdir(dir_name):
        if os.path.isfile(os.path.join(dir_name, file_name)):
            file_path = os.path.join(dir_name, file_name)
            file_paths.append(file_path)
    return file_paths

In [16]:
def ocr_docai(file_path: str) -> dict:
    pages_map = {}

    with open(file_path, 'rb') as f:
        pdf = f.read()
        raw_document = documentai.RawDocument(content=pdf, mime_type='application/pdf')
        request = documentai.ProcessRequest(name=DOCAI_PROCESSOR_NAME, raw_document=raw_document)
        response = docai_client.process_document(request=request)
        text = response.document.text
        file_name = os.path.basename(file_path)
        # Part files are named '<FILE_NAME>_<first>-<last>.pdf'; start numbering at the first page
        page_number = int(file_name.rsplit('_', 1)[-1].split('-')[0])
        for page in response.document.pages:
            page_text = []
            for paragraph in page.paragraphs:
                paragraph_text = layout_to_text(paragraph.layout, text)
                page_text.append(paragraph_text)
            pages_map[page_number] = ''.join(page_text)
            page_number += 1
    return pages_map

In [17]:
%%time 

input_dir = f'./DATA/INPUT/{FILE_NAME}/PARTS/'
file_paths = get_file_paths(input_dir)
    
pages_map_list = []
with ThreadPoolExecutor(max_workers=os.cpu_count()) as executor:  
    pages_map_list = list(tqdm(executor.map(ocr_docai, file_paths)))

merged_dict = {k: v for d in pages_map_list for k, v in d.items()}   
sorted_pages_map = dict(sorted(merged_dict.items()))

pages = []
for _, page_text in sorted_pages_map.items():
    pages.append(page_text)
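
`os.listdir` returns files in arbitrary order, so the per-part results can arrive out of sequence; merging the dictionaries and sorting by page number restores document order. A minimal sketch of that step with made-up page maps:

```python
# Two per-part results, deliberately out of order (illustrative data)
page_maps = [{16: 'p16', 17: 'p17'}, {1: 'p1', 2: 'p2'}]

# Merge all page maps into one dictionary keyed by page number
merged = {k: v for page_map in page_maps for k, v in page_map.items()}

# Sort by page number to recover reading order
ordered_pages = [merged[k] for k in sorted(merged)]
print(ordered_pages)
```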

73it [00:47,  1.54it/s]

CPU times: user 3.37 s, sys: 4.17 s, total: 7.55 s
Wall time: 47.4 s





Save the concatenated pages as a text file for later use (if needed).

In [18]:
extracted_pages = ''.join(pages)
os.makedirs(f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/OAI/', exist_ok=True)
with open(f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/OAI/{FILE_NAME}.txt', 'w') as out:
    out.write(extracted_pages)

In [19]:
def get_total_tokens(contexts: list) -> int:
    total_tokens = 0
    for context in contexts:
        n_tokens = len(encoder.encode(context))
        total_tokens += n_tokens 
    return total_tokens

In [20]:
total_tokens = get_total_tokens([extracted_pages])
logger.info(f'Total tokens in the input doc = {total_tokens}')

Total tokens in the input doc = 414088


In [21]:
def get_max_tokens_per_page(contexts: list) -> int:
    max_tokens_per_page = 0
    for context in contexts:
        n_tokens = len(encoder.encode(context))
        if n_tokens > max_tokens_per_page:
            max_tokens_per_page = n_tokens
    return max_tokens_per_page

#### Map Reduce 1

In [22]:
model = ChatOpenAI(openai_api_key=api_key, 
                   model_name=MODEL_NAME, 
                   temperature=0.0, 
                   max_tokens=256)

In [26]:
def get_summary(chunk: str) -> str:
    template = "You are a Financial Regulations & Derivatives Expert"
    system_message = SystemMessagePromptTemplate.from_template(template)

    template = "Summarize the following information into five brief sentences in English, capturing the essential details.\n\n{chunk}"
    human_message = HumanMessagePromptTemplate.from_template(template)

    chat_prompt = ChatPromptTemplate.from_messages([system_message, human_message])
    prompt = chat_prompt.format_prompt(chunk=chunk).to_messages()

    response = model(prompt)
    return response.content

In [27]:
%%time 

MAX_OUTPUT_TOKENS = 256
CONTEXTS_PER_CALL = 5  # process 5 pages per API call


def reduce(contexts: list) -> list:
    partitions = []
    max_input_tokens = CONTEXT_LENGTH - MAX_OUTPUT_TOKENS
    logger.info(f'Max input tokens allowed per API call = {max_input_tokens}')
    max_tokens_per_page = get_max_tokens_per_page(contexts)
    logger.info(f'Max tokens per page = {max_tokens_per_page}')
    logger.info(f'Processing {CONTEXTS_PER_CALL} pages per API call')
    
    for i in range(0, len(contexts), CONTEXTS_PER_CALL):
        partitions.append(contexts[i: i+CONTEXTS_PER_CALL])

    chunks = []
    for partition in partitions:
        chunks.append('\n'.join(partition))

    reduced_contexts = []

    # Too many workers can exceed the API's request quota (the limit noted for text-bison was 60/min).
    # In our experiments, max_workers=4 ran without breaching any limit.
    with ThreadPoolExecutor(max_workers=4) as executor:  
        reduced_contexts = list(tqdm(executor.map(get_summary, chunks),  total=len(chunks)))
    return reduced_contexts
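
`reduce` groups a fixed number of pages per call and only logs the input budget. An alternative, sketched below under stated assumptions, packs contexts greedily until a token budget is reached. `pack_by_budget` is a hypothetical helper and not part of this notebook; the example counts whitespace-separated words, whereas in the notebook the counter could be `lambda s: len(encoder.encode(s))`. The tokens added by the joining newlines and the prompt template are ignored.

```python
from typing import Callable, List


def pack_by_budget(contexts: List[str],
                   max_tokens: int,
                   count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily join consecutive contexts without exceeding max_tokens per chunk."""
    chunks, current, used = [], [], 0
    for context in contexts:
        n = count_tokens(context)
        # Start a new chunk when adding this context would exceed the budget
        if current and used + n > max_tokens:
            chunks.append('\n'.join(current))
            current, used = [], 0
        current.append(context)
        used += n
    if current:
        chunks.append('\n'.join(current))
    return chunks


def count_words(text: str) -> int:
    """Stand-in token counter (assumption): whitespace-separated words."""
    return len(text.split())


chunks = pack_by_budget(['a b c', 'd e', 'f g h i', 'j'],
                        max_tokens=5, count_tokens=count_words)
print(chunks)
```

A single context larger than the budget still becomes its own chunk, so oversized pages would need separate handling.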


CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 10 µs


In [28]:
logger.info(f'Number of pages to process = {len(pages)}')
summaries = reduce(pages)
logger.info(f'Number of generated summaries = {len(summaries)}')
n_tokens = get_total_tokens(summaries)
logger.info(f'Total number of tokens in generated summaries = {n_tokens}')

Number of pages to process = 1089
Max input tokens allowed per API call = 7940
Max tokens per page = 720
Processing 5 pages per API call
 65%|██████▍   | 141/218 [10:20<06:12,  4.83s/it]Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 4.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..
100%|██████████| 218/218 [16:58<00:00,  4.67s/it]
Number of generated summaries = 218
Total number of tokens in generated summaries = 45598
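
The logged figures can be cross-checked with a little arithmetic. The values below are copied from the configuration and the log output above; the variable names are illustrative.

```python
import math

n_pages, pages_per_call = 1089, 5
context_length, max_output_tokens = 8196, 256
max_tokens_per_page = 720

n_calls = math.ceil(n_pages / pages_per_call)           # API calls in this round
max_input_tokens = context_length - max_output_tokens   # input budget per call
worst_case_input = pages_per_call * max_tokens_per_page # five pages of maximum length

print(n_calls, max_input_tokens, worst_case_input)
```

Even five pages of the maximum observed length stay well below the input budget, although the tokens of the prompt template itself are not counted here.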


In [29]:
logger.info(summaries[5])
logger.info('-' * 100)
logger.info(summaries[15])
logger.info('-' * 100)
logger.info(summaries[20])

Banking organizations with over $100 billion in total consolidated assets are subject to a standardized approach capital conservation buffer requirement. This requirement is calculated as the sum of the organization's stress capital buffer requirement, applicable countercyclical capital buffer requirement, and applicable GSIB surcharge. The Board is proposing to amend its capital plan rule to take into account capital ratios calculated under the expanded risk-based approach, in addition to the standardized approach. The proposal would also revise the calculation of the stress capital buffer requirement for large banking organizations. The Board is not proposing changes to the targeted severity of the stress capital buffer requirement, as it aims to better reflect the risk of banking organizations' exposures in the calculation of risk-weighted assets.

----------------------------------------------------------------------------------------------------
1. Evaluating defaulted residential

##### Persist intermediate summaries (Map Reduce 1) to local disk

In [30]:
logger.info(f'Total number of summaries = {len(summaries)}')

Total number of summaries = 218


In [31]:
for i, summary in enumerate(summaries):
    os.makedirs(f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/OAI/MAP_REDUCE_1/', exist_ok=True)
    with open(f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/OAI/MAP_REDUCE_1/summary-{i}.txt', 'w') as f:
        f.write(summary)
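
If the persisted summaries are reloaded later, a plain lexicographic sort would place `summary-10.txt` before `summary-2.txt`. The sketch below reads them back in numeric order. `load_summaries` is a hypothetical helper, and the demo writes to a temporary directory rather than the notebook's output folder.

```python
import os
import re
import tempfile


def load_summaries(dir_name: str) -> list:
    """Read summary-<i>.txt files from dir_name, ordered by their numeric index."""
    pattern = re.compile(r'summary-(\d+)\.txt$')
    names = [name for name in os.listdir(dir_name) if pattern.search(name)]
    names.sort(key=lambda name: int(pattern.search(name).group(1)))
    summaries = []
    for name in names:
        with open(os.path.join(dir_name, name)) as f:
            summaries.append(f.read())
    return summaries


with tempfile.TemporaryDirectory() as tmp:
    # Write twelve dummy summaries, then read them back
    for i in range(12):
        with open(os.path.join(tmp, f'summary-{i}.txt'), 'w') as f:
            f.write(f'summary {i}')
    loaded = load_summaries(tmp)

print(loaded[:3], loaded[-1])
```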

#### Map Reduce 2

In [47]:
model = ChatOpenAI(openai_api_key=api_key, 
                   model_name=MODEL_NAME, 
                   temperature=0.0, 
                   max_tokens=2048)

def get_summary(context: str) -> str:
    template = "You are a Financial Regulations & Derivatives Expert"
    system_message = SystemMessagePromptTemplate.from_template(template)

    template = """For the context below, create a consolidated refined short summary with the most important pointers only.\n\n{context}\n\nDo not repeat pointers. Break down the summary into SECTIONS. Make it crisp and concise."""
    human_message = HumanMessagePromptTemplate.from_template(template)

    chat_prompt = ChatPromptTemplate.from_messages([system_message, human_message])
    prompt = chat_prompt.format_prompt(context=context).to_messages()

    response = model(prompt)
    return response.content

In [50]:
%%time 

CONTEXTS_PER_CALL = 25  # process 25 summaries per API call

def reduce(contexts: list) -> list:
    partitions = []
    max_input_tokens = CONTEXT_LENGTH - MAX_OUTPUT_TOKENS
    logger.info(f'Max input tokens allowed per API call = {max_input_tokens}')
    max_tokens_per_page = get_max_tokens_per_page(contexts)
    logger.info(f'Max tokens per page = {max_tokens_per_page}')
    logger.info(f'Processing {CONTEXTS_PER_CALL} pages per API call')
    
    for i in range(0, len(contexts), CONTEXTS_PER_CALL):
        partitions.append(contexts[i: i+CONTEXTS_PER_CALL])

    chunks = []
    for partition in partitions:
        chunks.append('\n'.join(partition))
    logger.info(f'Total number of chunks of summaries = {len(chunks)}')

    reduced_contexts = []
    for chunk in tqdm(chunks):
        summary = get_summary(chunk)
        reduced_contexts.append(summary)
        time.sleep(1)
    return reduced_contexts

CPU times: user 5 µs, sys: 2 µs, total: 7 µs
Wall time: 10 µs


In [51]:
reduced_summaries = reduce(summaries)

Max input tokens allowed per API call = 7940
Max tokens per page = 256
Processing 25 pages per API call
Total number of chunks of summaries = 9
100%|██████████| 9/9 [09:30<00:00, 63.44s/it]
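
The input budget logged for this round (7940) is still derived from `MAX_OUTPUT_TOKENS = 256`, while the model for this round is configured with `max_tokens=2048`. The sketch below redoes the budget arithmetic with the round's own output cap, using only figures logged above; the variable names are illustrative and prompt-template tokens are ignored.

```python
import math

n_summaries, summaries_per_call = 218, 25
total_summary_tokens = 45598
max_tokens_per_summary = 256
context_length = 8196
round2_max_output = 2048

n_calls = math.ceil(n_summaries / summaries_per_call)

# Worst case: every summary in a call has the maximum observed length
worst_case_total = summaries_per_call * max_tokens_per_summary + round2_max_output

# Typical case: summaries of average length
avg_tokens = total_summary_tokens / n_summaries
typical_total = summaries_per_call * avg_tokens + round2_max_output

print(n_calls, worst_case_total, round(typical_total))
```

On these numbers a typical call fits within the context length, but a call made up entirely of maximum-length summaries would not, so a smaller `CONTEXTS_PER_CALL` may be the safer choice.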


In [52]:
logger.info(reduced_summaries[0])
logger.info('-' * 100)
logger.info(reduced_summaries[1])
logger.info('-' * 100)
logger.info(reduced_summaries[2])
logger.info('-' * 100)

**Section 1: Proposed Rule Change**
The Office of the Comptroller of the Currency, the Board of Governors of the Federal Reserve System, and the Federal Deposit Insurance Corporation have proposed a rule change to revise capital requirements for large banking organizations and those with significant trading activity. The changes aim to improve risk-based capital requirement calculations, reduce complexity, enhance consistency, and facilitate more effective supervisory and market assessments of capital adequacy. The revisions include replacing current requirements that use banking organizations' internal models for credit risk and operational risk with standardized approaches.

**Section 2: Basel Committee Reforms**
The Basel Committee on Banking Supervision has proposed reforms to strengthen risk-based capital requirements for large banking organizations. The changes would align with international capital standards issued by the Basel Committee, known as the Basel III reforms, but woul

##### Persist intermediate summaries (Map Reduce 2) to local disk

In [53]:
logger.info(f'Total number of summaries after map reduce 2 = {len(reduced_summaries)}')

Total number of summaries after map reduce 2 = 9


In [54]:
for i, summary in enumerate(reduced_summaries):
    os.makedirs(f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/OAI/MAP_REDUCE_2/', exist_ok=True)
    with open(f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/OAI/MAP_REDUCE_2/summary-{i}.txt', 'w') as f:
        f.write(summary)

In [55]:
consolidated_summaries = '\n'.join(reduced_summaries)
logger.info(get_total_tokens([consolidated_summaries]))

6530


In [56]:
logger.info(consolidated_summaries)

**Section 1: Proposed Rule Change**
The Office of the Comptroller of the Currency, the Board of Governors of the Federal Reserve System, and the Federal Deposit Insurance Corporation have proposed a rule change to revise capital requirements for large banking organizations and those with significant trading activity. The changes aim to improve risk-based capital requirement calculations, reduce complexity, enhance consistency, and facilitate more effective supervisory and market assessments of capital adequacy. The revisions include replacing current requirements that use banking organizations' internal models for credit risk and operational risk with standardized approaches.

**Section 2: Basel Committee Reforms**
The Basel Committee on Banking Supervision has proposed reforms to strengthen risk-based capital requirements for large banking organizations. The changes would align with international capital standards issued by the Basel Committee, known as the Basel III reforms, but woul

#### Final Consolidation

In [57]:
model = ChatOpenAI(openai_api_key=api_key, 
                   model_name=MODEL_NAME, 
                   temperature=0.0, 
                   max_tokens=1536)

def get_summary(context: str) -> str:
    template = "You are a Financial Regulations & Derivatives Expert"
    system_message = SystemMessagePromptTemplate.from_template(template)

    template = """Given the context below, combine and merge duplicate sections and pointers.\n\n{context}\nAdd SECTIONS and bullets wherever needed. Rewrite cleanly and renumber the sections."""
    human_message = HumanMessagePromptTemplate.from_template(template)

    chat_prompt = ChatPromptTemplate.from_messages([system_message, human_message])
    prompt = chat_prompt.format_prompt(context=context).to_messages()

    response = model(prompt)
    return response.content

In [58]:
final_summary = get_summary(consolidated_summaries)
logger.info(final_summary)

**Section 1: Proposed Rule Change and Basel Committee Reforms**
The Office of the Comptroller of the Currency, the Board of Governors of the Federal Reserve System, and the Federal Deposit Insurance Corporation, along with the Basel Committee on Banking Supervision, have proposed a rule change to revise capital requirements for large banking organizations and those with significant trading activity. The changes aim to improve risk-based capital requirement calculations, reduce complexity, enhance consistency, and facilitate more effective supervisory and market assessments of capital adequacy. The revisions include replacing current requirements that use banking organizations' internal models for credit risk and operational risk with standardized approaches. The changes would align with international capital standards issued by the Basel Committee, known as the Basel III reforms, but would also reflect specific characteristics of U.S. markets and legal requirements.

**Section 2: GSIB 

In [59]:
MAX_OUTPUT_TOKENS = 1536
max_input_tokens = CONTEXT_LENGTH - MAX_OUTPUT_TOKENS
logger.info(f'Max input tokens allowed per API call = {max_input_tokens}')
logger.info(f'Total tokens in final summary = {get_total_tokens([final_summary])}')

Max input tokens allowed per API call = 6660
Total tokens in final summary = 1536
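
The final summary's token count equals the `max_tokens` value of 1536, which suggests the generation may have stopped at the cap rather than at a natural end. A minimal check, assuming the locally counted tokens match the model's own count; `hit_output_cap` is a hypothetical helper.

```python
def hit_output_cap(n_tokens: int, max_output_tokens: int) -> bool:
    """Return True when a response used the whole output budget (possible truncation)."""
    return n_tokens >= max_output_tokens


# Figures taken from the logged runs in this notebook
print(hit_output_cap(1536, 1536))  # final summary
print(hit_output_cap(1116, 1536))  # a response that ended below the cap
```

When the check fires, raising `max_tokens` or reducing the input is worth considering. The OpenAI API also reports a finish reason of `length` for responses cut off at the cap, which is a more direct signal where it is available.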


##### Persist final summary to local disk

In [60]:
with open(f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/OAI/final-summary.txt', 'w') as f:
    f.write(final_summary)

##### Create a filtered summary with all the proposed changes on the `Processing of Derivative Contracts`

In [63]:
model = ChatOpenAI(openai_api_key=api_key, 
                   model_name=MODEL_NAME, 
                   temperature=0.0, 
                   max_tokens=1536)

def get_summary(context: str) -> str:
    template = "You are a Financial Regulations & Derivatives Expert"
    system_message = SystemMessagePromptTemplate.from_template(template)

    template = """Given the SUMMARY below, extract and refine all proposed changes related to the processing of derivative contracts into a separate list. Create a concise and detailed summary with clear pointers.\n\n{context}"""
    human_message = HumanMessagePromptTemplate.from_template(template)

    chat_prompt = ChatPromptTemplate.from_messages([system_message, human_message])
    prompt = chat_prompt.format_prompt(context=context).to_messages()

    response = model(prompt)
    return response.content

In [64]:
proposed_changes_summary = get_summary(consolidated_summaries)
logger.info(proposed_changes_summary)

Proposed Changes Related to the Processing of Derivative Contracts:

1. The proposal suggests replacing the use of banking organizations' internal models for setting regulatory capital requirements for credit risk with a new expanded risk-based approach. This approach would apply the same risk-weight treatment for on-balance sheet exposures, including those to sovereigns, certain supranational entities, and multilateral development banks.

2. The proposal suggests that banking organizations should maintain capital in line with the level and nature of their risk exposure. It proposes replacing the use of internal models for setting regulatory capital requirements for credit risk with a new expanded risk-based approach.

3. The proposal would eliminate the use of models for credit risk under the current capital rule and replace them with standardized approaches.

4. The proposal outlines term-based haircuts for investment grade senior securitization exposures with a risk weight of less t

In [65]:
MAX_OUTPUT_TOKENS = 1536
max_input_tokens = CONTEXT_LENGTH - MAX_OUTPUT_TOKENS
logger.info(f'Max input tokens allowed per API call = {max_input_tokens}')
logger.info(f'Total tokens in final summary = {get_total_tokens([proposed_changes_summary])}')

Max input tokens allowed per API call = 6660
Total tokens in final summary = 1116


In [66]:
with open(f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/OAI/proposed-changes-summary.txt', 'w') as f:
    f.write(proposed_changes_summary)