# Summarize documents using RAG, LangChain, and LLM

Adrian P. Bustamante, Ph.D. \
adrianpebus@gmail.com

## Objective

The aim is to develop an agent that combines an LLM, LangChain, and retrieval-augmented generation (RAG) for interactive, efficient document retrieval and summarization, with conversations that retain memory of earlier turns.

### __Table of Contents__

<ol>
    <li><a href="#Data-Processing">Data Processing</a></li>
    <li><a href="#Using-LLM-for-a-QA-application">Using LLM for a QA application </a></li>
    <li><a href="#Wrapping-it-up-in-an-agent">Wrapping it up in an agent</a></li>
</ol>

## Data Processing

In [1]:
#!pip install "ibm-watsonx-ai==0.2.6"
#!pip install "langchain==0.1.16" 
#!pip install "langchain-ibm==0.1.4"
#!pip install "huggingface == 0.0.1"
#!pip install "huggingface-hub == 0.23.4"
#!pip install "sentence-transformers == 2.5.1"
#!pip install "chromadb"
#!pip install "wget == 3.2"

In [2]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

from ibm_watsonx_ai.foundation_models import Model
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai.foundation_models.utils.enums import ModelTypes, DecodingMethods
from ibm_watson_machine_learning.foundation_models.extensions.langchain import WatsonxLLM
import wget

#### Downloading a document

In [3]:
filename = 'companyPolicies.txt'
url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/6JDbUb_L3egv_eOkouY71A.txt'

# Use wget to download the file
wget.download(url, out=filename)
print('file downloaded')

100% [..............................................................................] 15660 / 15660file downloaded


In [4]:
with open(filename, 'r') as file:
    # Read the contents of the file
    contents = file.read()
    print(contents)

1.	Code of Conduct

Our Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built on integrity, respect, and accountability.
Integrity: We hold ourselves to the highest ethical standards. This means acting honestly and transparently in all our interactions, whether with colleagues, clients, or the broader community. We respect and protect sensitive information, and we avoid conflicts of interest.
Respect: We embrace diversity and value each individual's contributions. Discrimination, harassment, or any form of disrespectful behavior is unacceptable. We create an inclusive environment where differences are celebrated and everyone is treated with dignity and courtesy.
Accountability: We take responsibility for our actions and decisions. We follow all relevant laws and regulations, and we strive to continuously improve our practices. We report any potential violations of 

#### Splitting the document into chunks

In [6]:
loader = TextLoader(filename)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
print(len(texts))

Created a chunk of size 1624, which is longer than the specified 1000
Created a chunk of size 1885, which is longer than the specified 1000
Created a chunk of size 1903, which is longer than the specified 1000
Created a chunk of size 1729, which is longer than the specified 1000
Created a chunk of size 1678, which is longer than the specified 1000
Created a chunk of size 2032, which is longer than the specified 1000
Created a chunk of size 1894, which is longer than the specified 1000


16
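The warnings above appear because `CharacterTextSplitter` splits only on its separator (a blank line by default) and then merges neighbouring pieces up to `chunk_size`, so a single paragraph longer than the limit is kept whole. The following is a simplified pure-Python sketch of that behaviour, not the library's implementation; `split_on_separator` and the sample text are hypothetical.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces; oversized pieces stay whole."""
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate          # still fits: keep merging
        else:
            if current:
                chunks.append(current)   # flush what was accumulated so far
            current = piece              # may itself exceed chunk_size
    if current:
        chunks.append(current)
    return chunks

sample = "short heading\n\n" + "x" * 30 + "\n\nanother short paragraph"
chunks = split_on_separator(sample, chunk_size=20)
lengths = [len(c) for c in chunks]
print(lengths)  # [13, 30, 23]
```

If chunks must stay under the limit, LangChain's `RecursiveCharacterTextSplitter` falls back to finer separators and is one option to consider.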


In [8]:
texts[1]

Document(metadata={'source': 'companyPolicies.txt'}, page_content="Our Code of Conduct outlines the fundamental principles and ethical standards that guide every member of our organization. We are committed to maintaining a workplace that is built on integrity, respect, and accountability.\nIntegrity: We hold ourselves to the highest ethical standards. This means acting honestly and transparently in all our interactions, whether with colleagues, clients, or the broader community. We respect and protect sensitive information, and we avoid conflicts of interest.\nRespect: We embrace diversity and value each individual's contributions. Discrimination, harassment, or any form of disrespectful behavior is unacceptable. We create an inclusive environment where differences are celebrated and everyone is treated with dignity and courtesy.\nAccountability: We take responsibility for our actions and decisions. We follow all relevant laws and regulations, and we strive to continuously improve our

#### Embedding the chunks of text for RAG

In [10]:
embeddings = HuggingFaceEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)  # store the embedding in docsearch using Chromadb
print('document ingested')

document ingested
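Conceptually, the vector store maps each chunk to a vector and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below illustrates this with hand-written three-dimensional vectors and cosine similarity. It is a toy under stated assumptions: real embeddings come from the embedding model, the store's actual distance metric may differ, and `cosine`, `top_k`, and `toy_store` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def top_k(query_vec, store, k=2):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy "embeddings": each axis loosely stands for one topic.
toy_store = [
    ("Internet and email policy", [0.9, 0.1, 0.0]),
    ("Smoking policy", [0.0, 0.9, 0.1]),
    ("Recruitment policy", [0.1, 0.0, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # a query that is mostly about the first topic
nearest = top_k(query_vec, toy_store, k=1)
print(nearest)  # ['Internet and email policy']
```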


## Using LLM for a QA application

In [11]:
model_id = 'mistralai/mistral-small-3-1-24b-instruct-2503'

In [12]:
parameters = {
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,  # always pick the highest-probability next token
    GenParams.MIN_NEW_TOKENS: 130,  # minimum number of tokens in the generated output (a high minimum may pad short answers with repetition)
    GenParams.MAX_NEW_TOKENS: 256,  # maximum number of tokens in the generated output
    GenParams.TEMPERATURE: 0.5  # controls sampling randomness; typically has no effect with greedy decoding
}
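To make the decoding parameters concrete: greedy decoding takes the highest-scoring token at every step, whereas temperature only reshapes the distribution when tokens are sampled. The sketch below shows both on a toy three-token vocabulary. The logits and helper names are hypothetical, and this is an illustration of the idea rather than how the hosted model is implemented.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Greedy decoding: index of the largest logit; temperature plays no role."""
    return max(range(len(logits)), key=lambda i: logits[i])

def sample_pick(logits, temperature, rng):
    """Sampling: draw an index according to the temperature-scaled probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.1]
greedy_choice = greedy_pick(toy_logits)
sharp = softmax(toy_logits, temperature=0.5)   # more peaked
flat = softmax(toy_logits, temperature=2.0)    # more uniform
sampled = sample_pick(toy_logits, 0.5, random.Random(0))
print(greedy_choice, round(sharp[0], 3), round(flat[0], 3))
```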

In [13]:
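# The API key and project ID are not shown publicly. One option (an assumption, not the
# original setup) is to read them from environment variables with these hypothetical names.
import os
watsonx_API = os.environ["WATSONX_APIKEY"]
project_id = os.environ["WATSONX_PROJECT_ID"]
credentials = {
    'url': "https://us-south.ml.cloud.ibm.com",
    'apikey': watsonx_API
}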

In [14]:
# wrap the model ID, parameters, and credentials into a model object
model = Model(
    model_id=model_id,
    params=parameters,
    credentials=credentials,
    project_id=project_id
)

In [15]:
# build an LLM wrapper that integrates with LangChain
mistral_llm = WatsonxLLM(model=model)

In [18]:
# ask the model directly, without any retrieved context
mistral_llm.invoke('what is internet policy?')

' Internet policy refers to the rules, regulations, and guidelines that govern the use, management, and operation of the internet. These policies can be established by governments, international organizations, private companies, and other entities to address various aspects of internet activity, including security, privacy, content regulation, and access. Key areas of internet policy include:\n\n1. **Cybersecurity**: Policies aimed at protecting internet users and infrastructure from cyber threats, such as malware, hacking, and data breaches.\n\n2. **Privacy**: Regulations that protect the personal data of internet users, often involving data collection, storage, and sharing practices.\n\n3. **Content Regulation**: Guidelines for what types of content are allowed or restricted on the internet, including issues related to hate speech, misinformation, and illegal activities.\n\n4. **Net Neutrality**: Policies that ensure equal treatment of all internet traffic, preventing internet servic

#### QA application

In [19]:
# QA chain: retrieve the most relevant chunks and pass them to the LLM with the question
qa = RetrievalQA.from_chain_type(llm=mistral_llm, 
                                 chain_type="stuff", 
                                 retriever=docsearch.as_retriever(), 
                                 return_source_documents=False)
query = "what is internet policy?"
qa.invoke(query)

{'query': 'what is internet policy?',
 'result': " The internet policy is a set of guidelines established to ensure the responsible and secure use of internet services within an organization. It covers acceptable use, security measures, confidentiality, appropriate behavior, compliance with laws, monitoring, and consequences for violations. The policy aims to maintain security, productivity, and legal compliance while promoting safe and responsible usage of digital communication tools. It is expected that each employee understands and follows this policy, with regular reviews to keep it aligned with evolving technology and security standards. The policy is designed to align with the organization's values and legal obligations. The policy is designed to align with the organization's values and legal obligations. The policy is designed to align with the organization's values and legal obligations."}
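With `chain_type="stuff"`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question, which is why this answer is grounded in the company document rather than in general knowledge. The sketch below illustrates the idea only. The template wording, the sample chunks, and `build_stuff_prompt` are hypothetical and are not LangChain's actual prompt.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Use the following pieces of context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Helpful Answer:"
    )

# Hypothetical chunks standing in for what the retriever would return.
retrieved = [
    "Internet use must comply with company security rules.",
    "Email accounts may be monitored for compliance purposes.",
]
prompt = build_stuff_prompt("what is internet policy?", retrieved)
print(prompt)
```

Because every chunk goes into one prompt, this approach is simple but only works while the combined text fits in the model's context window.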

In [21]:
qa = RetrievalQA.from_chain_type(llm=mistral_llm, 
                                 chain_type="stuff", 
                                 retriever=docsearch.as_retriever(), 
                                 return_source_documents=False)
query = "Can you summarize the document for me?"
qa.invoke(query)

{'query': 'Can you summarize the document for me?',
 'result': ' The document outlines the Code of Conduct, Health and Safety Policy, and Anti-discrimination and Harassment Policy for an organization. The Code of Conduct emphasizes integrity, respect, accountability, safety, and environmental responsibility as fundamental principles guiding all members. It expects employees to uphold these principles and serve as role models. The Health and Safety Policy prioritizes the well-being of employees, customers, and the public, aiming to maintain a hazard-free workplace through compliance with laws, regular assessments, training, and open communication. The Anti-discrimination and Harassment Policy is mentioned but not detailed in the provided text. The organization is committed to creating an inclusive environment where everyone is treated with dignity and respect.'}

#### Conversations with memory

In [26]:
qa = RetrievalQA.from_chain_type(llm=mistral_llm, 
                                 chain_type="stuff", 
                                 retriever=docsearch.as_retriever(), 
                                 return_source_documents=False)

In [27]:
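# Without memory, "it" has no referent; the chain answers from whichever chunks the retriever returns.
query = "What I cannot do in it?"
qa.invoke(query)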

{'query': 'What I cannot do in it?',
 'result': ' You cannot use the company-provided internet and email services for harassment, discrimination, or the distribution of offensive or inappropriate content. You also cannot share your login credentials or passwords, and you should avoid discussing company matters on public forums or social media without discretion. Additionally, you should not use these services for transmitting confidential information, trade secrets, or sensitive customer data without encryption. Violations of these policies may lead to disciplinary measures, including potential termination. You should also ensure compliance with all relevant laws and regulations regarding internet and email usage. The company retains the right to monitor your internet and email usage for security and compliance purposes. Regular reviews of the policy ensure its alignment with evolving technology and security standards.'}

In [35]:
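# `return_message` is not a ConversationBufferMemory parameter (the real one is `return_messages`),
# so it is dropped here; the default string buffer is what `get_chat_history=lambda h: h` expects.
memory = ConversationBufferMemory(memory_key="chat_history")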

In [36]:
qa = ConversationalRetrievalChain.from_llm(llm=mistral_llm, 
                                           chain_type="stuff", 
                                           retriever=docsearch.as_retriever(), 
                                           memory = memory, 
                                           get_chat_history=lambda h : h, 
                                           return_source_documents=False)
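A conversational retrieval chain handles a follow-up such as "List points in it?" by first rewriting it, with the chat history, into a standalone question and only then running retrieval. The sketch below shows how such a condensing prompt could be assembled. The wording is a paraphrase, and `format_history` and `build_condense_prompt` are hypothetical helpers, not the library's code.

```python
def format_history(turns):
    """Render (question, answer) pairs as a plain-text transcript."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in turns)

def build_condense_prompt(turns, follow_up):
    """Ask the LLM to rewrite a follow-up so it can be understood without the history."""
    return (
        "Given the following conversation and a follow up question, "
        "rephrase the follow up question to be a standalone question.\n\n"
        f"Chat History:\n{format_history(turns)}\n"
        f"Follow Up Input: {follow_up}\n"
        "Standalone question:"
    )

turns = [("What is recruitment policy?", "It is a set of hiring guidelines.")]
condense_prompt = build_condense_prompt(turns, "List points in it?")
print(condense_prompt)
```

The rewritten question (for example, "What are the points of the recruitment policy?") is what gets embedded and sent to the retriever, so the pronoun no longer hurts retrieval.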

In [37]:
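history = []  # manual transcript of (question, answer) pairs
query = "What is recruitment policy?"
# The attached memory supplies chat_history; a second positional argument to invoke() is treated as the run config.
result = qa.invoke({"question": query})
print(result["answer"])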

 The recruitment policy is a set of guidelines that outline the company's approach to attracting, selecting, and onboarding qualified and diverse candidates. It emphasizes equal opportunity, transparency, objective selection criteria, data privacy, feedback, onboarding, and employee referrals to build a strong and inclusive workforce. The policy is regularly reviewed and updated to align with best practices in recruitment. The policy is designed to ensure that the company hires the best candidates who align with its values and contribute to its success. The policy is designed to ensure that the company hires the best candidates who align with its values and contribute to its success. The policy is designed to ensure that the company hires the best candidates who align with its values and contribute to its success.


In [38]:
## appending previous query and answers to the history
history.append((query, result["answer"]))

In [40]:
# with memory: the chain resolves "it" from the earlier turn stored in memory
query = "List points in it?"
result = qa.invoke({"question": query})
print(result["answer"])

 The key points in the recruitment policy are:

1. **Equal Opportunity**: The company is an equal opportunity employer and does not discriminate based on various protected statuses. They actively promote diversity and inclusion.
2. **Transparency**: The recruitment process is transparent, with job vacancies advertised internally and externally as appropriate. Job descriptions are clear and accurate.
3. **Selection Criteria**: The selection process is based on qualifications, experience, and skills necessary for the position, conducted objectively without bias.
4. **Data Privacy**: The company is committed to protecting candidates' personal information and adheres to relevant data protection laws.
5. **Feedback**: Candidates receive timely and constructive feedback on their application and interview performance.
6. **Onboarding**: New employees receive comprehensive onboarding to help them integrate into the organization.
7. **Employee Referrals**: The company encourages and appreciates

In [43]:
# record the previous turn, then ask another follow-up
history.append((query, result["answer"]))
query = "What is the aim of it?"
result = qa.invoke({"question": query})
print(result["answer"])

 The aim of the recruitment policy is to attract, select, and onboard the most qualified and diverse candidates to join the organization. The policy emphasizes the importance of the talents, skills, and dedication of employees in ensuring the company's success. It promotes equal opportunity, transparency, objective selection criteria, data privacy, timely feedback, comprehensive onboarding, and employee referrals to build a diverse, inclusive, and talented workforce. The policy is regularly reviewed and updated to align with best practices in recruitment.  (The answer is in the first response from the AI)  (The answer is in the first response from the AI)  (The answer is in the first response from the AI)  (The answer is in the first response from the AI)  (The answer is in the first response from the AI)  (The answer is in the first response from the AI)  (The answer is in the first response from the AI)  (The answer is in the first response from the AI)  (The answer is in the first r

## Wrapping it up in an agent

In [46]:
def qa():
    # The default string buffer is what get_chat_history=lambda h: h expects.
    memory = ConversationBufferMemory(memory_key="chat_history")
    # Named `chain` so the local variable does not shadow the function name.
    chain = ConversationalRetrievalChain.from_llm(llm=mistral_llm,
                                                  chain_type="stuff",
                                                  retriever=docsearch.as_retriever(),
                                                  memory=memory,
                                                  get_chat_history=lambda h: h,
                                                  return_source_documents=False)
    history = []  # manual transcript of (question, answer) pairs
    while True:
        query = input("Question: ")

        if query.strip().lower() in ["quit", "exit", "bye"]:
            print("Answer: Goodbye!")
            break

        # The attached memory supplies chat_history to the chain.
        result = chain.invoke({"question": query})

        history.append((query, result["answer"]))

        print("Answer: ", result["answer"])
        print('==============================')

In [47]:
qa()

Question:  what is the smoking policy?


Answer:   The smoking policy is a set of guidelines established to provide clear expectations concerning smoking on company premises. It aims to ensure a safe and healthy environment for all employees, visitors, and the general public. Key points include:

- Smoking is only permitted in designated smoking areas, as marked by appropriate signage.
- Smoking inside company buildings, offices, meeting rooms, and other enclosed spaces is strictly prohibited, including the use of electronic cigarettes and vaping devices.
- All employees and visitors must adhere to relevant federal, state, and local smoking laws and regulations.
- Proper disposal of cigarette butts and related materials in designated receptacles is required, and littering is prohibited.
- Smoking is not permitted in company vehicles, whether they are owned or leased.
- Non-compliance with the policy may lead to appropriate disciplinary action, which could include fines or, in the case of employees, possible termination of emp

Question:  can you list all points of it?


Answer:   The smoking policy includes the following points:

1. **Policy Purpose**: The policy aims to provide clear guidance and expectations concerning smoking on company premises to ensure a safe and healthy environment for all employees, visitors, and the general public.
2. **Designated Smoking Areas**: Smoking is only permitted in designated areas marked by appropriate signage to minimize exposure to secondhand smoke and maintain cleanliness.
3. **Smoking Restrictions**: Smoking is prohibited inside company buildings, offices, meeting rooms, and other enclosed spaces, including the use of electronic cigarettes and vaping devices.
4. **Compliance with Applicable Laws**: All employees and visitors must adhere to relevant federal, state, and local smoking laws and regulations.
5. **Disposal of Smoking Materials**: Cigarette butts and related materials must be properly disposed of in designated receptacles, with littering on company premises prohibited.
6. **No Smoking in Company Vehi

Question:  can you summarize it?


Answer:   The smoking policy is designed to ensure a safe and healthy environment for all employees, visitors, and the general public. Smoking is only allowed in designated areas marked by appropriate signage, and it is strictly prohibited inside company buildings, offices, meeting rooms, and other enclosed spaces, including the use of electronic cigarettes and vaping devices. All individuals must comply with relevant laws and regulations, dispose of smoking materials properly, and avoid smoking in company vehicles. Non-compliance may result in disciplinary action, including fines or termination of employment. The policy is reviewed periodically to align with legal requirements and best practices. Cooperation is appreciated to maintain a smoke-free and safe environment.


Question:  bye


Answer: Goodbye!
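The interactive loop can be exercised without a model or a keyboard by injecting the chain and the input/output functions. The sketch below is one way to do that, under stated assumptions: `chat_loop` is a hypothetical refactoring of the loop, and `EchoChain` is a stand-in that only echoes the question and is not a real retrieval chain.

```python
def chat_loop(chain, read=input, write=print):
    """Run the question/answer loop until the user types quit, exit, or bye."""
    transcript = []
    while True:
        query = read("Question: ").strip()
        if query.lower() in {"quit", "exit", "bye"}:
            write("Answer: Goodbye!")
            break
        result = chain.invoke({"question": query})
        transcript.append((query, result["answer"]))
        write("Answer: " + result["answer"])
    return transcript

class EchoChain:
    """Hypothetical stand-in for the retrieval chain; it only echoes the question."""
    def invoke(self, inputs):
        return {"answer": "you asked: " + inputs["question"]}

# Scripted inputs replace the keyboard; printed lines are captured in a list.
scripted = iter(["what is the smoking policy?", "bye"])
printed = []
transcript = chat_loop(EchoChain(), read=lambda prompt: next(scripted), write=printed.append)
print(transcript)
```

Passing the real chain instead of `EchoChain()` and leaving `read` and `write` at their defaults gives the same interactive behaviour as the loop above.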
