# CHAT WITH YOUR DATA - PDFs

1. [Document Loading](#Document-Loading)
2. [Document Splitting](#Document-Splitting)
3. [Vector Stores & Embeddings](#Vector-Stores-&-Embeddings)
4. [Retrieval](#Retrieval)
5. [Question Answering](#Question-Answering)

<img src='../data/images/rag_workflow.jpeg' alt='RAG Workflow'>

In [1]:
import os
from pathlib import Path
from dotenv import load_dotenv

env_path = Path('..') / '.env'
_ = load_dotenv(dotenv_path=env_path)

db_credentials = {
    'username': os.getenv('DB_USERNAME'),
    'password': os.getenv('DB_PASSWORD'),
    'host': os.getenv('DB_HOST'),
    'name': os.getenv('DB_NAME')
}

if not all(db_credentials.values()):
    raise ValueError('One or more environment variables are missing.')

In [2]:
import warnings

warnings.filterwarnings('ignore')

## Google PaLM

In [3]:
from langchain.llms import GooglePalm

In [5]:
llm = GooglePalm(google_api_key=os.environ["GOOGLE_API_KEY"], temperature=0)

## Document Loading

In [6]:
from langchain.chains import create_sql_query_chain
from langchain.document_loaders import PyPDFLoader, WebBaseLoader
from langchain_community.utilities.sql_database import SQLDatabase
from langchain_community.tools.sql_database.tool import QuerySQLDataBaseTool

### SQL Loader

In [7]:
database_uri = f'mysql+pymysql://{db_credentials["username"]}:{db_credentials["password"]}@{db_credentials["host"]}/{db_credentials["name"]}'

db = SQLDatabase.from_uri(database_uri)
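
The URI above interpolates the credentials as-is, which can break if the password contains characters such as `@`, `:` or `/`. A minimal sketch of one way to guard against that, assuming percent-encoding with the standard library's `quote_plus`; the helper name `build_mysql_uri` and the sample credentials are hypothetical and not part of the original notebook:

```python
from urllib.parse import quote_plus


def build_mysql_uri(username: str, password: str, host: str, name: str) -> str:
    """Build a mysql+pymysql URI with percent-encoded credentials."""
    # quote_plus escapes characters such as '@', ':' and '/' that would
    # otherwise be read as URI delimiters.
    return (
        f'mysql+pymysql://{quote_plus(username)}:{quote_plus(password)}'
        f'@{host}/{name}'
    )


# Hypothetical credentials, for illustration only.
print(build_mysql_uri('app_user', 'p@ss/word', 'localhost', 'blog_db'))
# prints: mysql+pymysql://app_user:p%40ss%2Fword@localhost/blog_db
```

The resulting string could be passed to `SQLDatabase.from_uri` in place of `database_uri`.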

In [8]:
print(f'Dialect: {db.dialect}')
print(f'Table Names: {db.get_usable_table_names()}')

Dialect: mysql
Table Names: ['blog', 'user']


In [9]:
generate_query = create_sql_query_chain(llm, db)
execute_query = QuerySQLDataBaseTool(db=db)

In [10]:
query = generate_query.invoke({'question': 'What is the total number of blogs in the database?'})
query

'SELECT COUNT(*) FROM blog'

In [11]:
result = execute_query.invoke(query)

print(f'Type: {type(result)}')
print(f'Result: {result}')

Type: <class 'str'>
Result: [(3,)]


In [12]:
query = generate_query.invoke({'question': 'Give me all the details of the blogs.'})
query

'SELECT blog_id, user_id, title, content, created_at FROM blog'

In [13]:
result = execute_query.invoke(query)

print(f'Type: {type(result)}')
print(f'Result: {result}')

Type: <class 'str'>
Result: [(1, 1, 'First Blog Post', 'This is the content of the first blog post.', datetime.datetime(2024, 5, 24, 16, 44, 22)), (2, 1, 'Second Blog Post', 'This is the content of the second blog post.', datetime.datetime(2024, 5, 24, 16, 44, 22)), (3, 2, "Jane's Blog Post", "This is the content of Jane's blog post.", datetime.datetime(2024, 5, 24, 16, 44, 22))]


### PDF Loader

In [14]:
loader = PyPDFLoader(r'../data/pdfs/chatbot_description.pdf')
pages = loader.load()

In [15]:
total_pages = len(pages)
print(f'Total Length: {total_pages}\n')

for i, page in enumerate(pages, start=1):
    print(f'Page no. {i}')
    print(f'{page.metadata}')
    print(f'{page.page_content}\n')

Total Length: 2

Page no. 1
{'source': '../data/pdfs/chatbot_description.pdf', 'page': 0}
 
Client Support Chatbot  
 
Objectives  
• Automate FAQs   
Provide immediate, accurate answers to frequently asked questions, reducing the need for 
human intervention. The chatbot will use natural language processing to understand and 
respond to a variety of question formats and phrasings, ensuring users receive p rompt and 
relevant information without having to navigate through extensive content manually.  
 
• Consultation Scheduling   
Enable clients to schedule consultations directly through the chatbot interface. By 
integrating with the calendar system, the chatbot can offer real -time availability and booking 
confirmations, making the scheduling process seamless and efficient for both clients and 
service providers . 
 
• Personalized Service Recommendations  
Analyze user preferences, past interactions, and inputs to suggest tailored services. 
Leveraging machine learning algorithms,

## Document Splitting

In [16]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [17]:
def get_content(docs):
    """Format a list of documents (metadata and content) as a printable string."""
    output = f'Total Length: {len(docs)}\n\n'

    for i, doc in enumerate(docs, start=1):
        output += f'Document no. {i}\n'
        output += f'{doc.metadata}\n'
        output += f'{doc.page_content}\n\n'

    return output

In [18]:
r_splitter = RecursiveCharacterTextSplitter(
    chunk_size=450,
    chunk_overlap=50, 
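    # Raw string avoids an invalid-escape warning for '\.'. Note: in recent
    # LangChain releases the lookbehind is matched literally unless
    # is_separator_regex=True is also passed.
    separators=['\n\n', '\n', r'(?<=\. )', ' ', '']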
)
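
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch: a fixed-window character splitter. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break on the listed separators); the function name `split_with_overlap` is hypothetical and only illustrates how consecutive chunks share text.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


print(split_with_overlap('abcdefghij', chunk_size=4, chunk_overlap=1))
# prints: ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is the same effect that makes a heading appear at the end of one chunk and the start of the next in the splitter output.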

In [19]:
r_splitter_docs = r_splitter.split_documents(pages)

In [20]:
content = get_content(r_splitter_docs)
print(content)

Total Length: 10

Document no. 1
{'source': '../data/pdfs/chatbot_description.pdf', 'page': 0}
Client Support Chatbot  
 
Objectives  
• Automate FAQs   
Provide immediate, accurate answers to frequently asked questions, reducing the need for 
human intervention. The chatbot will use natural language processing to understand and 
respond to a variety of question formats and phrasings, ensuring users receive p rompt and 
relevant information without having to navigate through extensive content manually.  
 
• Consultation Scheduling

Document no. 2
{'source': '../data/pdfs/chatbot_description.pdf', 'page': 0}
• Consultation Scheduling   
Enable clients to schedule consultations directly through the chatbot interface. By 
integrating with the calendar system, the chatbot can offer real -time availability and booking 
confirmations, making the scheduling process seamless and efficient for both clients and 
service providers . 
 
• Personalized Service Recommendations

Document no. 3
{'sou

## Vector Stores & Embeddings

In [21]:
from langchain.vectorstores import Chroma
from langchain.embeddings import GooglePalmEmbeddings

In [22]:
embedding = GooglePalmEmbeddings()

In [23]:
import numpy as np

sentence1 = 'Coffee fuels productivity.'
sentence2 = 'Sitting breaks boost programmer health.'

result1 = embedding.embed_query(sentence1)
result2 = embedding.embed_query(sentence2)

print(f'Total Length: {len(result1)}, {len(result2)}')
print(f'Similarity Score: {np.dot(result1, result2)}\n')
print(result1[:5])
print(result2[:5])

Total Length: 768, 768
Similarity Score: 0.6409481923464352

[-0.03758112, 0.0025011664, -0.0037450416, 0.05306987, 0.054309156]
[0.009226829, -0.0013863699, -0.039381962, 0.08616447, 0.048203044]
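
The similarity score printed above is a plain dot product, which equals cosine similarity only when both vectors have unit length; the notebook does not show whether this embedding model returns normalized vectors. A minimal sketch of an explicit cosine similarity using only the standard library (the function name `cosine_similarity` is an illustrative helper, not part of LangChain):

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for zero vectors')
    return dot / (norm_a * norm_b)


# Parallel vectors score 1.0 regardless of their length.
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))
# Orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Calling `cosine_similarity(result1, result2)` would give a length-independent score for the two sentences.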


In [24]:
vectordb = Chroma.from_documents(
    documents=r_splitter_docs,
    embedding=embedding,
    persist_directory='../data/vector_db/chroma_pdfs/'
)

In [25]:
print(f'Collection Count: {vectordb._collection.count()}')

Collection Count: 10


## Retrieval

- Similarity Search
- Maximum Marginal Relevance
- Self Query Retriever
- Contextual Compression Retriever
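
Of the strategies listed above, Maximum Marginal Relevance (MMR) is the least self-explanatory: it picks results one at a time, trading relevance to the query against similarity to results already chosen. The following is a minimal sketch on toy 2-D vectors, assuming cosine similarity and a `lambda_mult` weight between 0 and 1; the function names and vectors are hypothetical and this is not the vector store's actual implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query, candidates, k, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by Maximum Marginal Relevance."""
    selected = []
    remaining = list(range(len(candidates)))

    def score(i):
        relevance = cosine(query, candidates[i])
        # Redundancy: highest similarity to anything already selected.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.3]
candidates = [[1.0, 0.0], [1.0, 0.05], [0.6, 0.8]]

# Balanced weighting skips the near-duplicate of the first pick.
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # prints: [1, 2]
# lambda_mult=1.0 ignores redundancy and ranks by relevance alone.
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # prints: [1, 0]
```

With `lambda_mult=0.5` the second pick is the more distinct vector even though the near-duplicate is more relevant, which is the diversity effect MMR is meant to provide.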

In [26]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [27]:
question = 'What are the main features included in this chatbot?'

In [28]:
docs_ss = vectordb.similarity_search(question, k=4, filter=None)

result = get_content(docs_ss)
print(result)

Total Length: 4

Document no. 1
{'page': 0, 'source': '../data/pdfs/chatbot_description.pdf'}
• Content Navigation Assistance  
Assist users in finding relevant content on the website quickly and efficiently. The chatbot 
will offer search functionality and guide users through the website's structure, helping them 
locate specific information without frustration, thereby improving us er satisfaction and 
engagement.  
 
Additional Features  
• Personality and Tone

Document no. 2
{'page': 0, 'source': '../data/pdfs/chatbot_description.pdf'}
Client Support Chatbot  
 
Objectives  
• Automate FAQs   
Provide immediate, accurate answers to frequently asked questions, reducing the need for 
human intervention. The chatbot will use natural language processing to understand and 
respond to a variety of question formats and phrasings, ensuring users receive p rompt and 
relevant information without having to navigate through extensive content manually.  
 
• Consultation Scheduling

Document 

In [29]:
docs_mmr = vectordb.max_marginal_relevance_search(question, k=4)

result = get_content(docs_mmr)
print(result)

Number of requested results 20 is greater than number of elements in index 10, updating n_results = 10


Total Length: 4

Document no. 1
{'page': 0, 'source': '../data/pdfs/chatbot_description.pdf'}
• Content Navigation Assistance  
Assist users in finding relevant content on the website quickly and efficiently. The chatbot 
will offer search functionality and guide users through the website's structure, helping them 
locate specific information without frustration, thereby improving us er satisfaction and 
engagement.  
 
Additional Features  
• Personality and Tone

Document no. 2
{'page': 0, 'source': '../data/pdfs/chatbot_description.pdf'}
Client Support Chatbot  
 
Objectives  
• Automate FAQs   
Provide immediate, accurate answers to frequently asked questions, reducing the need for 
human intervention. The chatbot will use natural language processing to understand and 
respond to a variety of question formats and phrasings, ensuring users receive p rompt and 
relevant information without having to navigate through extensive content manually.  
 
• Consultation Scheduling

Document 

In [30]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=LLMChainExtractor.from_llm(llm),
    base_retriever=vectordb.as_retriever(search_type='mmr')
)

compressed_docs = compression_retriever.invoke(question)

result = get_content(compressed_docs)
print(result)

Number of requested results 20 is greater than number of elements in index 10, updating n_results = 10


Total Length: 4

Document no. 1
{'page': 0, 'source': '../data/pdfs/chatbot_description.pdf'}
• Content Navigation Assistance  
Assist users in finding relevant content on the website quickly and efficiently. The chatbot 
will offer search functionality and guide users through the website's structure, helping them 
locate specific information without frustration, thereby improving us er satisfaction and 
engagement.  
 
Additional Features  
• Personality and Tone

Document no. 2
{'page': 0, 'source': '../data/pdfs/chatbot_description.pdf'}
• Automate FAQs   
Provide immediate, accurate answers to frequently asked questions, reducing the need for 
human intervention. The chatbot will use natural language processing to understand and 
respond to a variety of question formats and phrasings, ensuring users receive p rompt and 
relevant information without having to navigate through extensive content manually.  
 
• Consultation Scheduling

Document no. 3
{'page': 1, 'source': '../data/pdf

## Question Answering

In [31]:
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.chains import RetrievalQA, ConversationalRetrievalChain

### RetrievalQA Chain

In [32]:
def process_qa_retrieval_chain(chain, query):
    """Run a RetrievalQA chain and format the answer with its source documents."""
    response = chain.invoke({'query': query})

    result_str = f'Query: {response["query"]}\n\n'
    result_str += f'Result: {response["result"]}\n\n'

    for i, doc in enumerate(response['source_documents'], start=1):
        result_str += f'Relevant Doc {i}:\n'
        result_str += doc.page_content + '\n'
        result_str += str(doc.metadata) + '\n\n'

    return result_str

In [33]:
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, 
just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. 
Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

In [34]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={'prompt': QA_CHAIN_PROMPT}
)

In [35]:
result = process_qa_retrieval_chain(qa_chain, question)
print(result)

Query: What are the main features included in this chatbot?

Result: The main features of this chatbot are: 
- Content Navigation Assistance
- Automate FAQs
- Consultation Scheduling
- Real-Time Updates and News
- Multi-Language Support
- Voice Integration
Thanks for asking!

Relevant Doc 1:
• Content Navigation Assistance  
Assist users in finding relevant content on the website quickly and efficiently. The chatbot 
will offer search functionality and guide users through the website's structure, helping them 
locate specific information without frustration, thereby improving us er satisfaction and 
engagement.  
 
Additional Features  
• Personality and Tone
{'page': 0, 'source': '../data/pdfs/chatbot_description.pdf'}

Relevant Doc 2:
Client Support Chatbot  
 
Objectives  
• Automate FAQs   
Provide immediate, accurate answers to frequently asked questions, reducing the need for 
human intervention. The chatbot will use natural language processing to understand and 
respond to a var

In [41]:
result = process_qa_retrieval_chain(qa_chain, 'Tell me about Mars')
print(result)

Query: Tell me about Mars

Result: Mars is the fourth planet from the Sun, and the second smallest planet in the Solar System after Mercury. Thanks for asking!

Relevant Doc 1:
• Real -Time Updates and News  
Provide users with the latest updates, news, and announcements directly through the 
chatbot. By integrating with news feeds and internal update systems, the chatbot ensures 
timely information delivery, keeping users informed and up -to-date with the most re cent 
developments.  
 
• Multi -Language Support  
Offer support for multiple languages to cater to a diverse audience. Utilizing language
{'page': 1, 'source': '../data/pdfs/chatbot_description.pdf'}

Relevant Doc 2:
• Content Navigation Assistance  
Assist users in finding relevant content on the website quickly and efficiently. The chatbot 
will offer search functionality and guide users through the website's structure, helping them 
locate specific information without frustration, thereby improving us er satisfaction and

In [42]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type='map_reduce',
    chain_type_kwargs={'question_prompt': QA_CHAIN_PROMPT}
)

In [43]:
result = process_qa_retrieval_chain(qa_chain_mr, question)
print(result)

Query: What are the main features included in this chatbot?

Result: The main features included in this chatbot are content navigation assistance, personality and tone, automate FAQs, consultation scheduling, real-time updates and news, multi-language support, language detection and translation, and voice integration.

Relevant Doc 1:
• Content Navigation Assistance  
Assist users in finding relevant content on the website quickly and efficiently. The chatbot 
will offer search functionality and guide users through the website's structure, helping them 
locate specific information without frustration, thereby improving us er satisfaction and 
engagement.  
 
Additional Features  
• Personality and Tone
{'page': 0, 'source': '../data/pdfs/chatbot_description.pdf'}

Relevant Doc 2:
Client Support Chatbot  
 
Objectives  
• Automate FAQs   
Provide immediate, accurate answers to frequently asked questions, reducing the need for 
human intervention. The chatbot will use natural language pr

### Conversational Retrieval Chain

In [44]:
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

In [45]:
qa_chain_cr = ConversationalRetrievalChain.from_llm(
    llm,
    memory=memory,
    retriever=vectordb.as_retriever(),
)

In [46]:
def process_conversational_retrieval_chain(chain, question):
    """Run a ConversationalRetrievalChain and format the accumulated chat history."""
    result = chain.invoke({'question': question})
    output_str = f'Query: {result["question"]}\n'

    chat_history = result['chat_history']
    output_str += f'Total Memory: {len(chat_history)}\n'

    for i, message in enumerate(chat_history, start=1):
        output_str += f'#{i}.\n'
        output_str += message.content + '\n\n'

    return output_str

In [47]:
result = process_conversational_retrieval_chain(qa_chain_cr, question)
print(result)

Query: What are the main features included in this chatbot?
Total Memory: 2
#1.
What are the main features included in this chatbot?

#2.
The main features included in this chatbot are:

- Content Navigation Assistance
- Automate FAQs
- Consultation Scheduling
- Real-Time Updates and News
- Multi-Language Support
- Voice Integration




In [48]:
result = process_conversational_retrieval_chain(qa_chain_cr, 'Tell me more about Automate FAQs')
print(result)

Query: Tell me more about Automate FAQs
Total Memory: 4
#1.
What are the main features included in this chatbot?

#2.
The main features included in this chatbot are:

- Content Navigation Assistance
- Automate FAQs
- Consultation Scheduling
- Real-Time Updates and News
- Multi-Language Support
- Voice Integration

#3.
Tell me more about Automate FAQs

#4.
Provide immediate, accurate answers to frequently asked questions, reducing the need for human intervention.


