## Summarization with Claude-2

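This notebook uses the Claude-2 model and its 100k-token context window to summarize long financial documents. A document that exceeds the context window is split into chunks, each chunk is summarized separately, and the chunk summaries are then consolidated into one final summary.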

##### Table of Contents
- Initialization
  - Import Libraries
  - Logging Setup
- Configuration
  - Load API Credentials
  - Client and Encoder Initialization
- Preprocessing
  - Token Calculation Helpers
  - Load PDF Content
  - Segment PDF into Chunks
- Summarization
  - Generate Summaries for Chunks
  - Consolidate and Refine Summaries
- Output
  - Save Final Summary
  - Save Individual Summaries


#### Imports 

In [1]:
import logging
import os
from typing import List

import tiktoken
import yaml
from anthropic import AI_PROMPT, HUMAN_PROMPT, Client


##### Setup logging

In [2]:
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(logging.StreamHandler())

#### Essentials

In [3]:
with open('./../credentials/claude-api.yml', 'rb') as f:
    credentials = yaml.safe_load(f)
    
api_key = credentials['key']
os.environ['ANTHROPIC_API_KEY'] = api_key

client = Client(api_key=api_key)

In [4]:
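# cl100k_base is an OpenAI encoding, not Claude's own tokenizer,
# so the token counts computed with it are approximations.
ENCODING_NAME = 'cl100k_base'
encoder = tiktoken.get_encoding(ENCODING_NAME)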

#### Summarize 

In [5]:
with open('./DATA/file-2.txt', 'r') as f:
    pdf = f.read()

In [6]:
def get_total_tokens(contexts: List[str]) -> int:
    """Return the total number of tokens across all strings in contexts."""
    return sum(len(encoder.encode(context)) for context in contexts)

In [7]:
total_tokens = get_total_tokens([pdf])
logger.info(f'Approximate number of tokens in the PDF = {total_tokens}')

Approximate number of tokens in the PDF = 414095


Segment the PDF into chunks of roughly 92k tokens each: the 100k-token context window minus the 8,192 tokens reserved for the model's output.

In [8]:
def split_into_chunks(text: str, c: int) -> List[str]:
    """
    Split the input text into chunks where each chunk contains approximately c tokens.

    Parameters:
    - text (str): The input text to be chunked.
    - c (int): The approximate number of tokens per chunk.

    Returns:
    - List[str]: A list of string chunks.
    """
    if c <= 0:
        raise ValueError('c must be a positive integer')

    # Encode once, then decode one slice of c token IDs per chunk.
    # Decoding a whole slice (rather than one token at a time) keeps
    # multi-byte characters intact everywhere except at chunk boundaries.
    tokens = encoder.encode(text)
    return [encoder.decode(tokens[i:i + c]) for i in range(0, len(tokens), c)]



In [9]:
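CLAUDE_CONTEXT_WINDOW = 100000
MAX_OUTPUT_TOKENS = 8192
# Reserve room for the completion. The short instruction text added to
# each prompt is not budgeted separately here.
c = CLAUDE_CONTEXT_WINDOW - MAX_OUTPUT_TOKENS
chunks = split_into_chunks(pdf, c)
logger.info(len(chunks))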

5


In [12]:
logger.info(f'Number of tokens in chunk 0 = {get_total_tokens([chunks[0]])}')
logger.info(f'Number of tokens in chunk 4 = {get_total_tokens([chunks[4]])}')

Number of tokens in chunk 0 = 91757
Number of tokens in chunk 4 = 46860
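
The chunk count and the size of the last chunk can be sanity-checked with plain arithmetic from the figures logged above. The sketch below uses only those logged values; the names `chunk_size`, `expected_chunks` and `last_chunk_tokens` are illustrative. The logged per-chunk counts (91757 and 46860) differ slightly from the ideal values, presumably because decoded text does not always re-encode to exactly the same tokens.

```python
import math

TOTAL_TOKENS = 414095           # approximate token count logged for the PDF
CLAUDE_CONTEXT_WINDOW = 100000
MAX_OUTPUT_TOKENS = 8192

# Tokens available for input once the output budget is reserved
chunk_size = CLAUDE_CONTEXT_WINDOW - MAX_OUTPUT_TOKENS

# Number of chunks needed to cover the whole document
expected_chunks = math.ceil(TOTAL_TOKENS / chunk_size)

# Tokens left over for the final, partial chunk
last_chunk_tokens = TOTAL_TOKENS - (expected_chunks - 1) * chunk_size

print(chunk_size, expected_chunks, last_chunk_tokens)
```
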


In [11]:
logger.info(HUMAN_PROMPT)



Human:


In [None]:
logger.info(AI_PROMPT)
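
The two constants logged above are the turn markers that the legacy Text Completions prompt format expects: the prompt opens with the human turn and ends with an empty assistant turn for the model to complete. A minimal sketch of that layout follows. The helper `build_prompt` is hypothetical, and the constants are redefined locally (mirroring the SDK values) only so the snippet runs without the `anthropic` package.

```python
# Local copies of the SDK constants (assumption: same values as
# anthropic.HUMAN_PROMPT and anthropic.AI_PROMPT).
HUMAN_PROMPT = "\n\nHuman:"
AI_PROMPT = "\n\nAssistant:"


def build_prompt(instruction: str, document: str) -> str:
    """Wrap an instruction and a document in the Human/Assistant turn format."""
    return f"{HUMAN_PROMPT} {instruction}\n\n{document}{AI_PROMPT}"


example = build_prompt("Summarize the text below.", "Some long document text.")
print(repr(example))
```
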

In [15]:
%%time 

summaries = []

for chunk in chunks:
    prompt = f"""{HUMAN_PROMPT} You are a Financial Regulations & Derivatives Expert. Given the chunk below, extract all the proposed changes related to `the processing of derivative contracts` into a long detailed summary with bullet points.\n\n{chunk}\n\n{AI_PROMPT}"""
    response = client.completions.create(prompt=prompt, 
                             model='claude-2', 
                             max_tokens_to_sample=MAX_OUTPUT_TOKENS)
    summary = response.completion
    summaries.append(summary)

CPU times: user 35.9 ms, sys: 9.29 ms, total: 45.1 ms
Wall time: 6min 16s


#### Consolidate and refine generated summaries 

In [25]:
%%time

stacked_summaries = '\n'.join(summaries)

prompt = f"""{HUMAN_PROMPT} You are a Financial Regulations & Derivatives Expert. Given the context below, create a detailed summary broken down by sections.\n\n{stacked_summaries}\n\n{AI_PROMPT}"""
response = client.completions.create(prompt=prompt, 
                            model='claude-2', 
                            max_tokens_to_sample=MAX_OUTPUT_TOKENS)
final_summary = response.completion
logger.info(final_summary)


 Here is a summary of the key proposed changes related to the processing of derivative contracts:

- Require banks to use the standardized approach for counterparty credit risk (SA-CCR) to calculate exposure amounts for all derivative contracts. This replaces internal models and aims to standardize calculations across banks.

- Make technical revisions to SA-CCR to improve implementation consistency. This includes revisions to the treatment of collateral, supervisory delta adjustments, decomposition of indices, etc. 

- Introduce minimum haircut floors for certain transactions with unregulated entities to limit leverage build up outside the banking system. The floors are based on collateral type.

- Replace model-based approaches for credit risk mitigation with standardized approaches from the current framework. This prohibits recognition of certain credit derivatives. 

- Make revisions to the securitization framework, including changes to the standardized approach, treatment of overl

CPU times: user 7.49 ms, sys: 3.13 ms, total: 10.6 ms
Wall time: 21.2 s
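
Here the five chunk summaries fit into a single consolidation prompt. For a larger document the stacked summaries could themselves exceed the context window, in which case one option is to consolidate in rounds. The sketch below illustrates that idea under stated assumptions: `consolidate`, `summarize`, `count_tokens` and `budget` are hypothetical names, and the stand-in callables at the bottom replace the real API call and tokenizer so the snippet runs on its own.

```python
from typing import Callable, List


def consolidate(
    summaries: List[str],
    summarize: Callable[[str], str],
    count_tokens: Callable[[str], int],
    budget: int,
) -> str:
    """Merge summaries in batches that fit within budget until one remains."""
    if not summaries:
        raise ValueError("summaries must not be empty")

    while len(summaries) > 1:
        batches: List[List[str]] = []
        current: List[str] = []
        for s in summaries:
            # Start a new batch when adding s would exceed the budget
            if current and count_tokens("\n".join(current + [s])) > budget:
                batches.append(current)
                current = []
            current.append(s)
        batches.append(current)

        # If no two summaries fit together, another round cannot make progress
        if len(batches) == len(summaries):
            raise ValueError("budget too small to merge any two summaries")

        summaries = [summarize("\n".join(batch)) for batch in batches]

    return summaries[0]


# Stand-ins for illustration only: whitespace token count, first-word "summary"
result = consolidate(
    ["a b c", "d e f", "g h i"],
    summarize=lambda text: text.split()[0],
    count_tokens=lambda text: len(text.split()),
    budget=6,
)
print(result)
```

In the notebook's setting, `summarize` would wrap the `client.completions.create` call and `count_tokens` would wrap `get_total_tokens`, with `budget` set to the same input allowance used for chunking.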


In [26]:
with open('./DATA/final-summary.txt', 'w') as f:
    f.write(final_summary)

In [28]:
for i, summary in enumerate(summaries, start=1):
    with open(f'./DATA/summary-{i}.txt', 'w') as f:
        f.write(summary)