### Extracting Information  
This notebook extracts information from an insurance policy document and answers questions about it.  
It uses:  
- LangChain to load the document and retrieve relevant passages
- An OpenAI language model to rephrase the user question and generate responses  
- OpenAI Embeddings  
- A FAISS in-memory vector store

##### Load the required modules

In [None]:
#!pip install unstructured
#!pip install tqdm # This is useful for showing progress

In [1]:
!python --version

Python 3.9.18


#### 1 - Import dependencies and set up the language model

In [2]:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.prompts import PromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain 
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import UnstructuredHTMLLoader, BSHTMLLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, HTMLHeaderTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain.chains import create_retrieval_chain
from langchain_core.output_parsers import JsonOutputParser
from langchain.chains.summarize import load_summarize_chain
from langchain_community.document_loaders import DirectoryLoader
from langchain.memory import ConversationBufferMemory
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain.schema import format_document
from langchain_core.messages import AIMessage, HumanMessage, get_buffer_string
from langchain_core.runnables import RunnableParallel
from langchain.chains import LLMChain
from operator import itemgetter
from datetime import datetime
import subprocess

##### Read in the OpenAI API key

In [3]:
import os
# Read OpenAI API key from environment variable
openai_api_key = os.getenv("OPENAI_API_KEY")
if openai_api_key is None:
    raise ValueError("OpenAI API Key not set in environment variables")
os.environ["OPENAI_API_KEY"] = openai_api_key

##### Set up the language models

In [4]:
# This is the model used to rephrase the user question
rephrase_model = ChatOpenAI(
    model = "gpt-3.5-turbo-1106",
    temperature = 0.1
)

In [5]:
# This is the model used to generate responses
model = ChatOpenAI(
    model = "gpt-3.5-turbo-1106",
    #model = "gpt-4-1106-preview", # Note: Map reduce with GPT4 is long and costly
    temperature = 0.1
)

#### 3 - Create Useful Functions

##### Create a function to rephrase the user question

In [54]:
def rephrase_question(user_question):
    # Use a plain (non-f) string so that {user_question} stays a template variable
    # and is filled in by the chain rather than by Python string formatting
    rephrase_prompt = ChatPromptTemplate.from_template("""
    You are an AI trained as an experienced insurance underwriter, well-versed in South African insurance policies. 
    Your responses should be careful, concise, authoritative, quantitative and factual, drawing specifically from the provided context. 
    If the answer to a question is not clearly supported by the context, state that you do not know the answer. 
    Avoid speculations or assumptions outside the given context. 
    Focus on delivering brief, factual, and directly relevant responses.
    
    Given the diverse range of inquiries an LLM receives, it's crucial to first present questions in a manner that enhances the \
    LLM's ability to comprehend and address them effectively. Your task is to refine the following user questions to make \
    them more specific, concise, and directly aligned with what an LLM can answer accurately. \
    Please rephrase each question below, aiming for clarity and directness in seeking information.
    
    The questions are not generic. They relate to a specific insurance policy document.
    When answering a question, always give the corresponding amounts if you can.
    Take your time and always think step by step.
    
    Example 1 (Understanding Coverage Limits):
    Original: "How much coverage do I have for personal property?"
    Rephrased: "What are the specific coverage limits for personal property under this policy?"
    
    Example 2 (Clarifying Exclusions):
    Original: "Is water damage from leaks covered?"
    Rephrased: "Under what conditions does this policy cover water damage from plumbing leaks?"
    
    Example 3 (Excess Details):
    Original: "What’s my excess if something happens?"
    Rephrased: "What are the excess amounts for different types of claims under this policy?"

    Example 4 (Cost breakdown):
    Original: "What makes up the total monthly premium for the Jaguar?"
    Rephrased: "List all the quantified items that are included in the monthly premium for the Jaguar."

    Example 5 (Extensions):
    Original: "Tell me about the landslip extension."
    Rephrased: "Can you provide specific details about the landslip extension included in this policy?"
    
    By adjusting the questions as demonstrated, you can better leverage the LLM's capabilities to provide informative and \
    precise answers. Now, proceed to rephrase the user question {user_question} following the examples provided.
    If the user question {user_question} is phrased adequately, do not rephrase it. Simply return the user question.
    
    Rephrased question:
    """)
    
    chain = LLMChain(llm=rephrase_model, prompt=rephrase_prompt)
    # The input key must match the template variable name
    answer = chain.invoke({"user_question": user_question})
    # Strip the quotation marks the model tends to copy from the examples
    rephrased_question = answer['text'].replace('"', '')

    return rephrased_question

##### Create a function to convert the PDF to HTML

    

In [7]:
# This uses poppler, a PDF rendering library, commonly used as a backend for PDF to image, html or text conversion tools 
# in various programming environments.
# You can ignore the warning: Syntax Warning: Bad annotation destination

def convert_pdf_to_html(pdf_path, html_output_path):
    # Construct the command to convert the PDF file to HTML
    command = ['pdftohtml', '-c', '-noframes', pdf_path, html_output_path]

    # Run the command
    subprocess.run(command, check=True)
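
The conversion above depends on the `pdftohtml` executable from poppler being installed. The sketch below shows one way to fail early with a readable message when it is missing. The names `build_pdftohtml_command` and `convert_pdf_to_html_checked` are hypothetical helpers introduced for illustration, not part of the notebook.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, html_output_path):
    """Return the argument list for the pdftohtml call (hypothetical helper)."""
    # -c produces "complex" output that preserves the page layout;
    # -noframes writes a single HTML document instead of a frameset
    return ["pdftohtml", "-c", "-noframes", pdf_path, html_output_path]


def convert_pdf_to_html_checked(pdf_path, html_output_path):
    """Run pdftohtml, but check first that the executable is on the PATH."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found on PATH; install poppler first")
    subprocess.run(build_pdftohtml_command(pdf_path, html_output_path), check=True)
```

Passing the command as a list rather than a single string avoids quoting problems with paths that contain spaces, such as the policy file name used later in this notebook.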


##### Create a function to build the vector database from the HTML document

In [62]:
def create_vector_db(source_html_file):
    loader = UnstructuredHTMLLoader(source_html_file)
    #loader = PyPDFLoader("./insurance/Southsure policy docs.pdf")
    pages=loader.load()

    # Define chunk size, overlap and separators
    # Separators are matched as literal strings by default, so '. ' splits on sentence boundaries
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=8000,  # experiment with this
        chunk_overlap=800,
        separators=['\n\n', '\n', '. ', ' ', '']
    )
    docs = text_splitter.split_documents(pages)

    # Select the embeddings to use
    embeddings = OpenAIEmbeddings()

    # Create the FAISS vector store from the chunks
    # Vectorstore: https://python.langchain.com/en/latest/modules/indexes/vectorstores.html

    vector_db = FAISS.from_documents(docs, embeddings)

    return vector_db, docs
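
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is only an illustration of the two parameters: the real splitter also tries to break at the listed separators, which this sketch does not do. The function name `chunk_with_overlap` is an assumption made for this example.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo_chunks = chunk_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Each chunk repeats the last three characters of the previous one, which is how overlap keeps a sentence that straddles a boundary retrievable from either side.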


##### Create a function to summarize the document

In [9]:
def summarize_doc(docs):

    summarization_prompt = """
    Write a concise summary of the following:
    {text}
    CONCISE SUMMARY:
    """

    prompt = PromptTemplate(template=summarization_prompt, input_variables=["text"])

    summary_chain = load_summarize_chain(
        llm=model,
        chain_type="map_reduce",
        map_prompt=prompt,
        combine_prompt=prompt
    )

    summary = summary_chain.run(docs)
    #print(summary)

    return summary
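
The map-reduce strategy used above can be sketched without any LLM: summarize each chunk on its own (map), then summarize the joined partial summaries (reduce). In this sketch `first_sentence` is a toy stand-in for the model call, and both function names are assumptions made for illustration.

```python
def map_reduce_summarize(chunks, summarize):
    """Summarize each chunk (map), then summarize the joined partial summaries (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize(" ".join(partial_summaries))


def first_sentence(text):
    """Toy 'summarizer' for the demo: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


demo_sections = [
    "Section one is about cars. It lists two vehicles.",
    "Section two is about buildings. It lists one house.",
]
demo_summary = map_reduce_summarize(demo_sections, first_sentence)
print(demo_summary)
```

With a real model, the map step costs one call per chunk plus one more for the reduce step, which is why larger chunks reduce the number of calls.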

##### Create a function to create the retrieval chain with conversation history and memory
First, add the conversation history.  
Given the history and a follow-up question, the chain rephrases the follow-up so that it becomes a standalone question.  
The rephrased question should be understandable without the chat history.  

Then format the retrieved documents and combine them into a single string. Each document is formatted with a specified or default template, and the results are joined with a specified separator.  

Finally, add memory and return the source documents alongside the answer.
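
The first two steps can be sketched in plain Python, as a rough illustration of what the prompt receives. The helper names `history_to_text`, `combine_texts` and `build_condense_prompt` are hypothetical. The "Human:" and "AI:" prefixes mirror the defaults of LangChain's `get_buffer_string`.

```python
def history_to_text(history):
    """Flatten (speaker, message) pairs into one 'Speaker: message' line per turn."""
    return "\n".join(f"{speaker}: {message}" for speaker, message in history)


def combine_texts(texts, separator="\n\n"):
    """Join the text of several retrieved documents into a single context string."""
    return separator.join(texts)


def build_condense_prompt(history, follow_up):
    """Assemble the text that asks the model for a standalone question."""
    return (
        "Given the following conversation and a follow up question, rephrase the "
        "follow up question to be a standalone question.\n\n"
        f"Chat History:\n{history_to_text(history)}\n"
        f"Follow Up Input: {follow_up}\n"
        "Standalone question:"
    )


demo_history = [("Human", "Is the Jaguar covered?"), ("AI", "Yes.")]
print(build_condense_prompt(demo_history, "What is its excess?"))
print(combine_texts(["first chunk", "second chunk"]))
```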

In [55]:
def create_retrieval_chain():
    # Given the chat history, rephrase the follow-up question to be a stand-alone question.
    _template = """Given the following conversation and a follow up question, rephrase the follow up question \
    to be a standalone question, in its original language.
    
    Chat History:
    {chat_history}
    Follow Up Input: {question}
    Standalone question:"""
    CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

    # Create a summary prompt template from the document summary
    summary_template = PromptTemplate(input_variables=["summary"], template="{summary}")

    # Add it to the QA prompt template
    template = summary_template.format(summary=summary) + """
    You are an AI trained as an experienced insurance underwriter, well-versed in South African insurance policies, \
    including interpreting tabulated data. Your responses should be quantitative, careful, concise, authoritative, and factual,\
    focusing specifically on extracting explicit numerical values provided in the context.
    
    When answering a question about the components of a monthly premium for a specific insured item, please ensure you list \
    not only the factors but also the explicit numerical values associated with each factor, \
    as stated in the policy document. For each factor contributing to the monthly premium, provide the exact amount or rate \
    if it is directly mentioned in the provided document.
    
    If the document explicitly lists figures, rates, or conditions, reference these details directly in your response. \
    If certain numerical values or specifics about the premium components are not mentioned or are unclear in the document, \
    state clearly that the numerical detail is not available. \
    Avoid speculation and focus on delivering factual responses with direct relevance to the question asked.
    
    Your role is to accurately convey the information as presented in the policy document, emphasizing the extraction and \
    reporting of numerical data to ensure clarity and precision in understanding the policy's cost breakdown.
    
    Only answer the question from the following context, with a special emphasis on directly quoted numerical data:
    {context}
    
    Question: {question}

    """
    ANSWER_PROMPT = ChatPromptTemplate.from_template(template)

    # Format and combine multiple documents into a single string
    DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")


    def _combine_documents(
        docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
    ):
        doc_strings = [format_document(doc, document_prompt) for doc in docs]
        return document_separator.join(doc_strings)

    # Reset memory
    memory = ConversationBufferMemory(
        return_messages=True, output_key="answer", input_key="question"
    )

    # First we add a step to load memory
    # This adds a "chat_history" key to the input object
    loaded_memory = RunnablePassthrough.assign(
        chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
    )
    # Now we calculate the standalone question
    standalone_question = {
        "standalone_question": {
            "question": lambda x: x["question"],
            "chat_history": lambda x: get_buffer_string(x["chat_history"]),
        }
        | CONDENSE_QUESTION_PROMPT
        | model
        | StrOutputParser(),
    }
    # Now we retrieve the documents
    retrieved_documents = {
        "docs": itemgetter("standalone_question") | retriever,
        "question": lambda x: x["standalone_question"],
    }
    # Now we construct the inputs for the final prompt
    final_inputs = {
        "context": lambda x: _combine_documents(x["docs"]),
        "question": itemgetter("question"),
    }
    # And finally, we do the part that returns the answers
    answer = {
        "answer": final_inputs | ANSWER_PROMPT | model,
        "docs": itemgetter("docs"),
    }
    # And now we put it all together!
    final_chain = loaded_memory | standalone_question | retrieved_documents | answer

    return final_chain, memory
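
Note that the chain above only reads from memory. Each turn still has to be written back after the call, and the same chain and memory objects have to be reused between questions, otherwise the history stays empty. With `ConversationBufferMemory` this is typically done with `memory.save_context(inputs, {"answer": result["answer"].content})`. The toy sketch below shows the load, answer, save cycle without any LangChain dependency. `TurnMemory`, `ask` and `echo_answer` are hypothetical names used only for illustration.

```python
class TurnMemory:
    """Toy stand-in for a conversation buffer: a list of (speaker, message) pairs."""

    def __init__(self):
        self.turns = []

    def load(self):
        return list(self.turns)

    def save(self, question, answer):
        self.turns.append(("Human", question))
        self.turns.append(("AI", answer))


def ask(memory, question, answer_fn):
    """Load the history, produce an answer, then store the new turn."""
    history = memory.load()
    answer = answer_fn(question, history)
    memory.save(question, answer)  # without this step the next call sees no history
    return answer


def echo_answer(question, history):
    """Stand-in for the model call: report how much history was visible."""
    return f"answer to {question!r} with {len(history)} earlier messages"


memory_demo = TurnMemory()
first_answer = ask(memory_demo, "q1", echo_answer)
second_answer = ask(memory_demo, "q2", echo_answer)
print(first_answer)
print(second_answer)
```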

#### 4 - Run the functions

##### A - Convert the original PDF to HTML

In [13]:
pdf_file = 'insurance/Southsure policy docs.pdf'  # This is the source PDF
html_output_file = 'insurance/output_file'  # This is the destination HTML file name (without extension)

convert_pdf_to_html(pdf_file, html_output_file)

# This creates a file output_file.html as well as a png for each page.
# The png files are required to maintain the html structure.

Page-1
Page-2
Page-3
Page-4
Page-5
Page-6
Page-7
Page-8
Page-9
Page-10
Page-11
Page-12
Page-13
Page-14
Page-15
Page-16
Page-17
Page-18
Page-19
Page-20
Page-21
Page-22


##### B - Create the vector database from the converted HTML document

In [63]:
vector_db, docs = create_vector_db(source_html_file="./insurance/output_file.html") 
# This is a one-time step and can take some time depending on the length of the document

##### C - Create a summary of the case or document

In [None]:
# Get the current date and time
current_time = datetime.now()
# Format the date and time as a string
formatted_time = current_time.strftime("Started summarizing document at %Y-%m-%d %H:%M:%S")
# Print the formatted date and time
print(formatted_time)

summary = summarize_doc(docs) #This is a one time event and can take some time

# Get the current date and time
current_time = datetime.now()
# Format the date and time as a string
formatted_time = current_time.strftime("Finished summarizing document at %Y-%m-%d %H:%M:%S")
# Print the formatted date and time
print(formatted_time + "\n")
print(summary)

In [None]:
# Edit as necessary
summary = """
The document is an insurance policy for MISS NB RADEBE & MR EA DUKER
"""

##### D - Optional: Save summary to disk

In [None]:
# Specify the filename where you want to save the summary
filename = "./insurance/summaries/summary.txt"

# Open the file in write mode and write the summary
with open(filename, 'w') as file:
    file.write(summary)

print(f"The document summary has been saved to {filename}")
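
Opening the file for writing fails if the `summaries` folder does not exist yet. A minimal sketch of a save and load pair that creates the folder first is shown below. `save_text` and `load_text` are hypothetical helper names, and the demo writes to a temporary directory rather than to the notebook's own paths.

```python
import tempfile
from pathlib import Path


def save_text(text, path):
    """Write text to path, creating parent folders if they are missing."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path


def load_text(path):
    """Read the whole file back as a string."""
    return Path(path).read_text(encoding="utf-8")


# Demo in a throwaway directory so nothing in the project is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    target = save_text("demo summary", Path(tmp_dir) / "summaries" / "summary.txt")
    round_trip = load_text(target)

print(round_trip)
```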

##### E - Optional: Load summary from disk

In [15]:
# Specify the filename from which you want to load the summary
filename = "./insurance/summaries/summary.txt"

# Open the file in read mode and read the contents
with open(filename, 'r') as file:
    summary = file.read()

# Optionally, print the loaded summary
print(f"Summary:\n {summary}")


Summary:
 
The document is an insurance policy for MISS NB RADEBE & MR EA DUKER



#### 5 - Set up the retriever  
    
Parameters for search_type = similarity_score_threshold:    
search_type="similarity_score_threshold",  
search_kwargs={'score_threshold': 0.4} (only documents scoring at or above the threshold are returned)  
  
Parameters for search_type = mmr (Maximal Marginal Relevance):   
k = number of documents to return (default = 4)  
lambda_mult = diversity of the results, from 0 (maximum diversity) to 1 (minimum diversity) (default = 0.5)  
fetch_k = number of documents to fetch and pass to the MMR algorithm (default = 20)
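
The following plain-Python sketch illustrates how these parameters interact. It uses cosine similarity on small hand-made vectors for simplicity. The actual relevance score in a vector store depends on its distance metric, so treat this as an illustration of the idea rather than the FAISS implementation. The function names are assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_threshold(query, candidates, score_threshold):
    """Keep the indices of candidates whose similarity is at or above the threshold."""
    return [i for i, c in enumerate(candidates) if cosine(query, c) >= score_threshold]


def mmr_select(query, candidates, k=4, lambda_mult=0.5, fetch_k=20):
    """Return the indices of up to k candidates chosen by maximal marginal relevance."""
    # Step 1: keep only the fetch_k candidates most similar to the query
    ranked = sorted(range(len(candidates)),
                    key=lambda i: cosine(query, candidates[i]), reverse=True)
    pool = ranked[:fetch_k]
    selected = []
    # Step 2: repeatedly pick the candidate with the best relevance/redundancy trade-off
    while pool and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max((cosine(candidates[i], candidates[j]) for j in selected),
                             default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


demo_query = [1.0, 0.0]
demo_vectors = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
print(filter_by_threshold(demo_query, demo_vectors, score_threshold=0.7))
print(mmr_select(demo_query, demo_vectors, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(demo_query, demo_vectors, k=2, lambda_mult=0.3))  # favours diversity
```

With `lambda_mult=1.0` the two most similar vectors are returned. With `lambda_mult=0.3` the near-duplicate second vector is skipped in favour of the more distinct third one.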

In [64]:
retriever = vector_db.as_retriever(
    #search_type="similarity",
    search_type = 'similarity_score_threshold',
    search_kwargs={'score_threshold': 0.5},
    #search_type = 'mmr',
    #search_kwargs={'k': 6, 'lambda_mult': 0.5, 'fetch_k': 20},  # fetch_k should be at least k
)

#### 6 - Ask questions

In [67]:
user_question = input("Your question here: ")
rephrased_question = rephrase_question(user_question)

final_chain, memory = create_retrieval_chain()

inputs = {"question": rephrased_question}
result = final_chain.invoke(inputs)

# Print results
print("\033[1m" + "Your question: " + user_question + "\033[0m")
print("\033[1m" + "Rephrased question: " + "\033[0m" +rephrased_question)
print("\033[1m" + "Answer: " + "\033[0m" + result['answer'].content)
print("\n")

Your question here:  what makes up the monthly premium for the jaguar?


Your question: what makes up the monthly premium for the jaguar?
Rephrased question: What are the specific components included in the total monthly premium for the Jaguar?
Answer: The specific components included in the total monthly premium for the Jaguar are as follows:

1. Premium: 690.06
2. Sasria Cover: 2.02
3. Private and Professional Use: Included in the premium
4. Comprehensive Cover Type: Included in the premium
5. Tracking device installed: Included in the premium
6. Basic Excess Waiver: Included in the premium
7. Car Hire: Included in the premium
8. Total: 692.08




You can also ask questions without rephrasing to help optimise the examples in the rephrase_question function.

In [52]:
# Ask a question without rephrasing
user_question = input("Your question here: ")
final_chain, memory = create_retrieval_chain()

inputs = {"question": user_question}
result = final_chain.invoke(inputs)

# Print results
print("\033[1m" + "Your question: " + user_question + "\033[0m")
print(result['answer'].content)
print("\n")


Your question here:  under whhat conditions would landslip cover be excluded?


Your question: under whhat conditions would landslip cover be excluded?
The document does not explicitly state the conditions that would result in the exclusion of landslip cover.




##### Alternate ways to summarize the document

In [None]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="map_reduce")

summary_map_reduce=chain.run(docs)
print(summary_map_reduce)

In [None]:
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="refine")

summary_refine=chain.run(docs)
print(summary_refine)

In [None]:
# For long documents, this method may exceed the model token limit
llm = ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo-1106")
chain = load_summarize_chain(llm, chain_type="stuff")

summary_stuff=chain.run(docs)
print(summary_stuff)

References

[Langchain: Retrieval Augmented Generation](https://python.langchain.com/docs/expression_language/cookbook/retrieval)