## Search, Retrieve & Summarize using Vertex AI's Text-Bison (PaLM 2) Model
This notebook uses Vertex AI's Text-Bison model, together with LangChain and a FAISS vector store, to search, retrieve, and summarize text documents. <br>
The workflow has four stages:<br><br>
* Configuration and Setup
* Encoding and Indexing Text Content
* Document Retrieval
* Summarization Using MapReduce

#### 1. Configuration and Setup
1.1. Import Necessary Libraries

In [1]:
# Importing the required libraries for chains, embeddings, document loading, Vertex AI, vector stores, etc.
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain, StuffDocumentsChain
from langchain.embeddings import VertexAIEmbeddings
from langchain.embeddings.base import Embeddings
from langchain.document_loaders import JSONLoader
from langchain.llms.vertexai import VertexAI
from langchain.prompts import PromptTemplate
from langchain.chains.llm import LLMChain
from langchain.vectorstores import FAISS
from typing import List
from tqdm import tqdm
import logging
import os 

1.2. Logging Setup

In [2]:
# Configure the logging mechanism to capture progress and potential issues
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler())

1.3. Load API Key and Set Paths

In [3]:
# Set the path for Vertex AI API credentials
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = './../credentials/vai-key.json'

# Define the paths for input data and output results
LOCAL_INPUT_DIR = './DATA/INPUT'
LOCAL_OUTPUT_DIR = './DATA/OUTPUT'
TEXT_GEN_MODEL_NAME = 'text-bison@latest'
TEXT_EMBED_MODEL_NAME = 'textembedding-gecko'
FILE_NAME = 'file-2'

1.4. Define Custom Embeddings Class

In [4]:
# Custom VertexAIEmbeddings class that sends texts to the embedding model in small batches.
# The batching method is named embed_documents because that is the method LangChain
# vector stores such as FAISS call when indexing documents.
class MyVertexAIEmbeddings(VertexAIEmbeddings, Embeddings):
    model_name = 'textembedding-gecko'
    max_batch_size = 5
    
    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        embeddings = []
        for i in tqdm(range(0, len(texts), self.max_batch_size)):
            batch = texts[i: i+self.max_batch_size]
            embeddings.extend(self.client.get_embeddings(batch))
        return [embedding.values for embedding in embeddings]
    
    def embed_query(self, query: str) -> List:
        embeddings = self.client.get_embeddings([query])
        return embeddings[0].values

text_embeder = MyVertexAIEmbeddings(model_name=TEXT_EMBED_MODEL_NAME)
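
The batching logic in the class above can be exercised without calling Vertex AI. The sketch below is a stand-alone illustration, not the LangChain class: `FakeEmbedding`, `FakeClient`, and `embed_in_batches` are hypothetical names, and the "embeddings" are just text lengths. It only shows how a list of texts is split into requests of at most `max_batch_size` items.

```python
from typing import List


class FakeEmbedding:
    """Hypothetical stand-in for an embedding result that exposes `.values`."""

    def __init__(self, values: List[float]):
        self.values = values


class FakeClient:
    """Hypothetical client that records the size of every batch it receives."""

    def __init__(self):
        self.batch_sizes: List[int] = []

    def get_embeddings(self, texts: List[str]) -> List[FakeEmbedding]:
        self.batch_sizes.append(len(texts))
        # Placeholder vector: the text length, not a real embedding.
        return [FakeEmbedding([float(len(t))]) for t in texts]


def embed_in_batches(client, texts: List[str], max_batch_size: int = 5) -> List[List[float]]:
    """Embed `texts` by sending at most `max_batch_size` items per request."""
    embeddings = []
    for i in range(0, len(texts), max_batch_size):
        batch = texts[i:i + max_batch_size]
        embeddings.extend(client.get_embeddings(batch))
    return [e.values for e in embeddings]


client = FakeClient()
vectors = embed_in_batches(client, [f"segment {n}" for n in range(12)])
print(client.batch_sizes)  # [5, 5, 2]
print(len(vectors))        # 12
```

Twelve segments with a batch size of 5 produce three requests, and the order of the returned vectors matches the order of the input texts.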

#### 2. Encoding and Indexing Text Content
2.1. Metadata Extraction Function

In [5]:
# Function to extract necessary metadata from a record
def extract_metadata(record: dict, metadata: dict) -> dict:
    metadata['doc_name'] = record.get('doc_name')
    metadata['page_num'] = record.get('page_num')
    return metadata
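
As a quick check of the metadata function, the sketch below calls it directly on a hand-written record. The sample values are hypothetical, and the `source` and `seq_num` keys reflect my understanding of what the loader pre-populates before calling `metadata_func`; treat them as an assumption.

```python
def extract_metadata(record: dict, metadata: dict) -> dict:
    # Copy the two fields of interest from the record into the metadata dict.
    metadata['doc_name'] = record.get('doc_name')
    metadata['page_num'] = record.get('page_num')
    return metadata


# Hypothetical record and base metadata, for illustration only.
record = {'doc_name': 'file-2', 'page_num': 7, 'page_content': 'Example page text.'}
base = {'source': 'file-2.jsonl', 'seq_num': 1}

result = extract_metadata(record, dict(base))
print(result)

# A record without the expected keys yields None values rather than raising an error.
missing = extract_metadata({'page_content': 'No metadata here.'}, {})
print(missing)
```

Because the function uses `dict.get`, records that lack `doc_name` or `page_num` are still indexed, with `None` stored for the missing fields.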

2.2. Load Documents and Build Vector Store

In [6]:
# Load documents from a JSONL file and index them in a FAISS vector store
loader = JSONLoader(file_path=f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/TEXT/{FILE_NAME}.jsonl', 
                    jq_schema='.', 
                    metadata_func=extract_metadata, 
                    content_key='page_content', 
                    json_lines=True)
segments = loader.load()
vector_store = FAISS.from_documents(segments, text_embeder)
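
The loader call above expects a JSON Lines file in which every line is one JSON object carrying `page_content`, `doc_name`, and `page_num`. The following sketch writes and re-reads such a file with the standard library only, to make the expected record shape concrete. The sample records and the temporary file are hypothetical; the real input file is produced elsewhere.

```python
import json
import os
import tempfile

# Hypothetical records in the shape the loader configuration implies.
records = [
    {'doc_name': 'file-2', 'page_num': 1, 'page_content': 'First page text.'},
    {'doc_name': 'file-2', 'page_num': 2, 'page_content': 'Second page text.'},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'file-2.jsonl')

    # One JSON object per line is what makes the file "JSON Lines".
    with open(path, 'w', encoding='utf-8') as f:
        for record in records:
            f.write(json.dumps(record) + '\n')

    # Read the file back line by line, skipping blank lines.
    with open(path, encoding='utf-8') as f:
        parsed = [json.loads(line) for line in f if line.strip()]

# The field named by content_key becomes the document text.
contents = [r['page_content'] for r in parsed]
print(contents)
```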

#### 3. Document Retrieval
3.1. Initialize Retrieval Chain

In [7]:
# Set up the retriever using FAISS and initialize the Vertex AI model for text generation
retriever = vector_store.as_retriever(search_type='mmr', search_kwargs={'k': 10})
llm = VertexAI(max_output_tokens=2048, 
               model_name=TEXT_GEN_MODEL_NAME,
               temperature=0,
               top_p=0.8,
               top_k=40,
               verbose=True)
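
The retriever above uses `search_type='mmr'` (maximal marginal relevance), which trades relevance to the query against redundancy among the documents already selected. The sketch below is a simplified, pure-Python illustration of that idea; the function names, the toy vectors, and the `lambda_mult` weighting are assumptions for illustration, and the library's implementation may differ in its details.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, k: int, lambda_mult: float = 0.5) -> List[int]:
    """Greedily pick k document indices, balancing relevance and diversity."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for idx in candidates:
            relevance = cosine(query_vec, doc_vecs[idx])
            # Redundancy: highest similarity to anything already selected.
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.0]
docs = [
    [1.0, 0.1],   # most relevant
    [1.0, 0.12],  # nearly identical to the first document
    [1.0, -1.0],  # less relevant, but different from the first
]
print(mmr(query, docs, k=2))                   # [0, 2]
print(mmr(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
```

With equal weighting, the near-duplicate is skipped in favour of the more distinct document; with `lambda_mult=1.0` the selection reduces to plain similarity ranking.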

3.2. Retrieve Relevant Documents

In [8]:
# Execute a query and log the resulting documents
query = "derivative contracts"
pages = retriever.get_relevant_documents(query)
for page in pages:
    logger.info(f'Content => {page.page_content}')
    logger.info(f'Metadata => {page.metadata}')
    logger.info(f"\n{'-' * 100}\n")

Content =>   
  
 (ii) The maturity factor of a derivative contract that is not subject to a variation margin 
agreement, or derivative contracts under which the counterparty is not required to post variation 
margin, is determined by the following formula: 
$$\text{Maturity factor} = \sqrt{\frac{\min\{M,\ 250\}}{250}}$$
Where M equals the greater of 10 business days and the remaining maturity of the 
contract, as measured in business days. 
(iii) For purposes of paragraph (i)(4) of this section, if a [BANKING ORGANIZATION] 
has elected pursuant to paragraph (e)(5) of this section to treat a derivative contract that is a 
cleared transaction that is not subject to a variation margin agreement as one that is subject to a 
variation margin agreement, the [BANKING ORGANIZATION] must treat the derivative 
contract as subject to a variation margin agreement with maturity factor as determined according 
to paragraph (i)(4)(i) of this section, and daily settlement does not change the end date of the 
period refer
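
As a worked example of the formula on the retrieved page, the sketch below reads it as the square root of min{M, 250} divided by 250, with M floored at 10 business days. This reading is an interpretation of the extracted text and should be checked against the source regulation; the function name is hypothetical.

```python
import math


def maturity_factor(remaining_maturity_days: float) -> float:
    """Maturity factor for a contract not subject to a variation margin agreement.

    Assumes the formula sqrt(min(M, 250) / 250), where M is the greater of
    10 business days and the remaining maturity in business days.
    """
    m = max(10.0, remaining_maturity_days)
    return math.sqrt(min(m, 250.0) / 250.0)


for days in (5, 10, 125, 250, 500):
    print(days, round(maturity_factor(days), 4))
# 5 0.2
# 10 0.2
# 125 0.7071
# 250 1.0
# 500 1.0
```

Under this reading the factor is floored at 0.2 for very short maturities and capped at 1.0 once the remaining maturity reaches 250 business days.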

#### 4. Summarization Using MapReduce
4.1. Map Chain for Page Summarization

In [9]:
# Define the map chain to produce a summary for each individual page
map_template = """You are a Derivatives Risk Analyst. Your job is to create a detailed SUMMARY of financial documents.
Your summary should include all important facts, math formulas, and technical details, and leave out extra information.
Your goal is to provide a clear and precise SUMMARY of the PAGE below that shows your knowledge of finance, with high accuracy and quality.
---
PAGE => {PAGE}
---
SUMMARY =>"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt, verbose=True)
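
The map step applies the same prompt to every page independently, and a later reduce step merges the per-page summaries. The sketch below shows that control flow with plain functions in place of LLM calls; `map_reduce_summarize`, `first_sentence`, and `join_summaries` are hypothetical stand-ins, not part of LangChain.

```python
from typing import Callable, List


def map_reduce_summarize(
    pages: List[str],
    summarize_page: Callable[[str], str],
    combine: Callable[[List[str]], str],
) -> str:
    # Map: summarize each page on its own.
    partial_summaries = [summarize_page(page) for page in pages]
    # Reduce: merge the per-page summaries into one result.
    return combine(partial_summaries)


def first_sentence(page: str) -> str:
    """Toy 'summarizer': keep only the first sentence of the page."""
    return page.split('.')[0].strip() + '.'


def join_summaries(summaries: List[str]) -> str:
    """Toy 'reducer': concatenate the summaries."""
    return ' '.join(summaries)


pages = [
    'Page one covers margin. It has more detail.',
    'Page two covers maturity. Extra text.',
]
print(map_reduce_summarize(pages, first_sentence, join_summaries))
# Page one covers margin. Page two covers maturity.
```

Because each page is summarized independently, the map step can handle more text than fits into a single prompt, at the cost of one model call per page.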

4.2. Reduce Chain for Consolidation

In [10]:
# Define the reduce chain to consolidate individual summaries into a single summary
reduce_template = """The following is a set of SUMMARIES:
---
SUMMARIES => {SUMMARIES}
---
Combine the summaries into a single cohesive CONSOLIDATED summary without distilling or removing any information. Add sections wherever applicable to clearly delineate topics or themes.
CONSOLIDATED SUMMARY =>"""
reduce_prompt = PromptTemplate.from_template(reduce_template)
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt, verbose=True)
combine_documents_chain = StuffDocumentsChain(llm_chain=reduce_chain, 
                                              document_variable_name='SUMMARIES', 
                                              verbose=True)
reduce_documents_chain = ReduceDocumentsChain(combine_documents_chain=combine_documents_chain, 
                                              collapse_documents_chain=combine_documents_chain, 
                                              token_max=2048, 
                                              verbose=True)
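
The `token_max` setting controls when the summaries are first collapsed in groups before the final combine step. The sketch below illustrates that idea in plain Python under loud assumptions: `count_tokens` is a crude whitespace count rather than the model's tokenizer, `keep_first_words` is a toy stand-in for the LLM call, and the grouping strategy is a simplification of whatever the library does internally.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Crude token estimate (assumption): number of whitespace-separated words."""
    return len(text.split())


def split_by_budget(texts: List[str], token_max: int) -> List[List[str]]:
    """Group consecutive texts so that each group stays within token_max."""
    groups, current, used = [], [], 0
    for text in texts:
        n = count_tokens(text)
        if current and used + n > token_max:
            groups.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        groups.append(current)
    return groups


def reduce_with_collapse(
    texts: List[str], combine: Callable[[List[str]], str], token_max: int
) -> str:
    """Collapse groups of texts until everything fits, then combine once more."""
    while len(texts) > 1 and sum(count_tokens(t) for t in texts) > token_max:
        collapsed = [combine(group) for group in split_by_budget(texts, token_max)]
        if len(collapsed) == len(texts):
            break  # grouping made no progress; stop instead of looping forever
        texts = collapsed
    return combine(texts)


def keep_first_words(group: List[str]) -> str:
    """Toy 'summarizer': keep the first word of every text in the group."""
    return ' '.join(text.split()[0] for text in group)


summaries = ['a1 a2 a3', 'b1 b2 b3', 'c1 c2 c3', 'd1 d2 d3']
print(split_by_budget(summaries, token_max=6))
# [['a1 a2 a3', 'b1 b2 b3'], ['c1 c2 c3', 'd1 d2 d3']]
print(reduce_with_collapse(summaries, keep_first_words, token_max=6))
# a1 c1
```

In this toy run, twelve "tokens" exceed the budget of six, so the four summaries are collapsed in two groups and the two intermediate results are then combined in a final pass.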

4.3. Execute the MapReduce Chain

In [11]:
# Run the map-reduce chain to produce the final consolidated summary
map_reduce_chain = MapReduceDocumentsChain(llm_chain=map_chain, 
                                           reduce_documents_chain=reduce_documents_chain, 
                                           document_variable_name='PAGE', 
                                           return_intermediate_steps=False, 
                                           verbose=True)
completion = map_reduce_chain.run(pages)
completion = completion.strip()
logger.info(completion)

> Entering new MapReduceDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
You are a Derivatives Risk Analyst. Your job is to create a detailed SUMMARY of financial documents.
Your summary should include all important facts, math formulas, and technical details, and leave out extra information.
Your goal is to provide a clear and precise SUMMARY of the PAGE below that shows your knowledge of finance, with high accuracy and quality.
---
PAGE =>   
  
 (ii) The maturity factor of a derivative contract that is not subject to a variation margin 
agreement, or derivative contracts under which the counterparty is not required to post variation 
margin, is determined by the following formula: 
$$\text{Maturity factor} = \sqrt{\frac{\min\{M,\ 250\}}{250}}$$
Where M equals the greater of 10 business days and the remaining maturity of the 
contract, as measured in business days. 
(iii) For purposes of paragraph (i)(4) of this section, if a [BANKING ORGANIZAT

Token indices sequence length is longer than the specified maximum sequence length for this model (1741 > 1024). Running this sequence through the model will result in indexing errors


> Entering new LLMChain chain...
Prompt after formatting:
The following is a set of SUMMARIES:
---
SUMMARIES =>  1. Maturity factor of a derivative contract that is not subject to a variation margin agreement, or derivative contracts under which the counterparty is not required to post variation margin, is determined by the formula: 
Maturity factor = min{250/M, 1}
Where M equals the greater of 10 business days and the remaining maturity of the contract, as measured in business days.

2. For purposes of paragraph (i)(4) of this section, if a [BANKING ORGANIZATION] has elected pursuant to paragraph (e)(5) of this section to treat a derivative contract that is a cleared transaction that is not subject to a variation margin agreement as one that is subject to a variation margin agreement, the [BANKING ORGANIZATION] must treat the derivative contract as subject to a variation margin agreement with maturity factor as determined according to paragraph (i)(4)(i) of this s

**Maturity Factor** 

- Maturity factor of a derivative contract that is not subject to a variation margin agreement, or derivative contracts under which the counterparty is not required to post variation margin, is determined by the formula: 
Maturity factor = min{250/M, 1}
Where M equals the greater of 10 business days and the remaining maturity of the contract, as measured in business days.

- For purposes of paragraph (i)(4) of this section, if a [BANKING ORGANIZATION] has elected pursuant to paragraph (e)(5) of this section to treat a derivative contract that is a cleared transaction that is not subject to a variation margin agreement as one that is subject to a variation margin agreement, the [BANKING ORGANIZATION] must treat the derivative contract as subject to a variation margin agreement with maturity factor as determined according to paragraph (i)(4)(i) of this section, and daily settlement does not change the end date of the period referenced by the derivative contract.


*


> Finished chain.

> Finished chain.


4.4. Write the Final Summary to a File

In [12]:
# Save the consolidated summary to a file for future reference
SUMMARY_OUT_PATH = f'{LOCAL_OUTPUT_DIR}/{FILE_NAME}/SUMMARY'
os.makedirs(SUMMARY_OUT_PATH, exist_ok=True)
with open(f'{SUMMARY_OUT_PATH}/{FILE_NAME}-summary-vai.txt', 'w') as f:
    f.write(completion)