## Project: Question Answering on Private Documents

In [89]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)


True

## Load PDF File 

In [90]:
pip install pypdf -q


In [91]:
def load_document(file):
    from langchain.document_loaders import PyPDFLoader
    print(f'Loading {file}')
    loader = PyPDFLoader(file)
    data = loader.load()
    return data

### Start Running the Code

In [92]:
data = load_document('ngConstitution.pdf')
print(data[1].page_content)

Loading ngConstitution.pdf


(5) The provisions of this Constitution in Part I of Chapter VIII hereof shall in relation to the Federal 
Capital Territory, Abuja, have effect in the manner set out thereunder. 
(6) There shall be 768 Local Government Areas in Nigeri a as shown in the second column of Part I of the 
First Schedule to this Constitution and six area councils as shown in Part II of that Schedule. 
  
  
  
  
Part II  
  
Powers of the Federal Republic of Nigeria 
  
  
4. (1) The legislative powers of the Federal Republic of Nigeria shall be vested in a National Assembly for 
the Federation, which shall consist of a Senate and a House of Representatives. 
 
(2) The National Assembly shall have power to make laws for the peace, order and good government of 
the Federation or any part thereof with respect to an y matter included in the Exclusive Legislative List set 
out in Part I of the Second Schedule to this Constitution.  
(3) The power of the National Assembly to make laws for the peace, order and g

In [93]:
data = load_document('ngConstitution.pdf')
print(data[10].metadata)

Loading ngConstitution.pdf
{'source': 'ngConstitution.pdf', 'page': 10}


In [94]:
# Check how many pages were loaded
data = load_document('ngConstitution.pdf')
print(f'You have {len(data)} pages in your data')

# Check the number of characters on one page
print(f'There are {len(data[20].page_content)} characters in the page')

Loading ngConstitution.pdf
You have 118 pages in your data
There are 3314 characters in the page


### Document Format
Loading files in different document formats.

In [95]:
pip install docx2txt -q


In [96]:
def load_document(file):
    import os
    name, extension = os.path.splitext(file)

    if extension == '.pdf':
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)
    elif extension == '.docx':
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    # Add branches for any other formats you need
    else:
        print('Document format is not supported')
        return None
    data = loader.load()
    return data
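
As more formats are added, the `if`/`elif` chain could be replaced by a lookup table. The following is a minimal sketch of that idea, not the notebook's implementation: `SUPPORTED_LOADERS` and `pick_loader_name` are hypothetical names, and the loaders are represented by plain strings so the snippet runs without LangChain installed. It also lowercases the extension, which the function above does not do.

```python
import os

# Hypothetical registry mapping a file extension to a loader name.
# Strings stand in for the real loader classes so this runs on its own.
SUPPORTED_LOADERS = {
    '.pdf': 'PyPDFLoader',
    '.docx': 'Docx2txtLoader',
}

def pick_loader_name(file):
    """Return the loader name for a file, or None if the format is unsupported."""
    _, extension = os.path.splitext(file)
    # Lowercase so 'Report.PDF' and 'report.pdf' are treated the same way
    return SUPPORTED_LOADERS.get(extension.lower())

print(pick_loader_name('ngConstitution.pdf'))
print(pick_loader_name('podcast.DOCX'))
print(pick_loader_name('notes.md'))
```

Supporting a new format then means adding one entry to the dictionary rather than another branch.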

### Running Data

In [97]:
data = load_document('podcast.docx')
print(data[0].page_content)

Loading podcast.docx
INDIVIDUAL PODCAST





Hello,

Welcome to my podcast, I will be talking about HIV/AIDS which is my health campaign topic. The purpose is to help control HIV/AIDS in Nigeria and help everyone know their status.

Symptoms of HIV/AIDS include night sweats, fatigue, weight loss, vomiting, skin rashes, etc. The transmission of this disease is through having unsafe sex and sharing equipment such as needles, blades, etc. Antiretroviral drugs help HIV/AIDS reduce the risks of transmission.





I would like to include a story about an HIV patient, about a young lady called Joy who was diagnosed with HIV.This led to her family members mocking and leaving her and thus made the young lady destabilized. After some few months, she went for counseling and was given HIV drugs which are the antiretroviral drugs i mentioned previously. These brought about lots of changes in her and she opened her own organization to help people living with HIV/AIDS to gain their confidence and mov

## Public and Private Service Loaders

Load from Wikipedia

In [98]:
pip install wikipedia -q


In [99]:
# Wikipedia
def load_from_wikipedia(query, lang='en', load_max_docs=2):
    from langchain.document_loaders import WikipediaLoader
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
    data = loader.load()
    return data  

In [100]:
data = load_from_wikipedia('Olusegun Obasanjo')
print(data[0].page_content)

Chief Olusegun Matthew Okikiola Ogunboye Aremu Obasanjo  (  oh-BAH-sən-joh; Yoruba: Olúṣẹ́gun Ọbásanjọ́ [olúʃɛ́ɡũ ɔbásanɟɔ] ; born c. 5 March 1937) is a Nigerian retired military officer and statesman who served as Nigeria's head of state from 1976 to 1979 and later as its president from 1999 to 2007. Ideologically a Nigerian nationalist, he was a member of the Peoples Democratic Party (PDP) from 1998 to 2015, and since 2018.
Born in the village of Ibogun-Olaogun to a farming family of the Owu branch of the Yoruba, Obasanjo was educated largely in Abeokuta, Ogun State. He joined the Nigerian Army and specialised in engineering and was assigned to the Congo, Britain, and India, rising to the rank of major. In the late 1960s, he played a senior role in combating Biafran separatists during the Nigerian Civil War, accepting their surrender in 1970. In 1975, a military coup established a junta with Obasanjo as part of its ruling triumvirate. After the triumvirate's leader, Murtala Muhammed,

### Chunking Strategies and Splitting the Document

In [101]:
def chunk_data(data, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size, chunk_overlap=0)
    chunks = text_splitter.split_documents(data)
    return chunks
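
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified sketch that cuts a string into fixed-size windows. It is not how `RecursiveCharacterTextSplitter` works (that splitter prefers to break on separators such as paragraph and line breaks); `split_fixed` is a hypothetical helper used only for illustration.

```python
def split_fixed(text, chunk_size=256, chunk_overlap=0):
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    # Each new window starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

print(split_fixed('abcdefghij', chunk_size=4, chunk_overlap=0))
print(split_fixed('abcdefghij', chunk_size=4, chunk_overlap=2))
```

With zero overlap, as in `chunk_data` above, every character lands in exactly one chunk; a non-zero overlap repeats some text across neighbouring chunks so that a sentence cut at a boundary still appears whole in at least one of them.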

In [102]:
# Check how many pages were loaded
data = load_document('ngConstitution.pdf')
print(f'You have {len(data)} pages in your data')

# Check the number of characters on one page
print(f'There are {len(data[20].page_content)} characters in the page')

Loading ngConstitution.pdf
You have 118 pages in your data
There are 3314 characters in the page


In [103]:
chunks = chunk_data(data)
print(len(chunks))

1892


In [104]:
# Print one of the chunks
print(chunks[11].page_content)

(5) The provisions of this Constitution in Part I of Chapter VIII hereof shall in relation to the Federal 
Capital Territory, Abuja, have effect in the manner set out thereunder.


In [105]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    # Count the tokens in every chunk
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # 0.0004 is the USD price per 1,000 tokens assumed here; check current pricing
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

In [106]:
print_embedding_cost(chunks)

Total Tokens: 82159
Embedding Cost in USD: 0.032864
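
The cost printed above is plain arithmetic: tokens divided by 1,000, multiplied by the price per 1,000 tokens. The sketch below reproduces it without `tiktoken`; `embedding_cost_usd` is a hypothetical helper, and the rate of 0.0004 is simply the value hard-coded in the function above, not a statement about current pricing.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens):
    """Estimate embedding cost from a token count and a per-1K-token price."""
    return total_tokens / 1000 * usd_per_1k_tokens

# Token count taken from the notebook output; the rate is the one assumed in the code above
cost = embedding_cost_usd(82159, 0.0004)
print(f'Embedding Cost in USD: {cost:.6f}')
```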


### Embedding and Uploading to a Vector Database (Pinecone)

In [107]:
# Load the Pinecone API key from the .env file
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [108]:
pip install -q pinecone-client


In [109]:
pip install --upgrade -q pinecone-client


In [110]:
def insert_or_fetch_embeddings(index_name, chunks):
    import pinecone
    from langchain_community.vectorstores import Pinecone
    from langchain_openai import OpenAIEmbeddings
    from pinecone import PodSpec

    pc = pinecone.Pinecone()

    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)

    if index_name in pc.list_indexes().names():
        print(f'Index {index_name} already exists. Loading embeddings ...', end='')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print('ok')
    else:
        print(f'Creating index {index_name} and embeddings ...', end='')
        pc.create_index(
            name=index_name,
            dimension=1536,
            metric='cosine',
            spec=PodSpec(environment='gcp-starter')
        )
        vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
        print('Ok')

    # Return the store in both branches (existing index or newly created one)
    return vector_store

In [111]:
def delete_pinecone_index(index_name='all'):
    import pinecone
    pc = pinecone.Pinecone()
    if index_name == 'all':
        indexes = pc.list_indexes().names()
        print('Deleting all indexes ... ')
        for index in indexes:
            pc.delete_index(index)
        print('ok')
    else:
        print(f'Deleting index {index_name} ...', end='')
        pc.delete_index(index_name)
        print('ok')

In [112]:
# Delete all indexes in Pinecone (app.pinecone.io); this removes every index in the account
delete_pinecone_index()


Deleting all indexes ... 
ok


In [113]:
chunks = chunk_data(data)
print(len(chunks))

1892


In [114]:
index_name = 'askadocument'
vector_store = insert_or_fetch_embeddings(index_name, chunks)

Creating index askadocument and embeddings ...Ok


### Asking and Getting Answers

In [115]:
def ask_and_get_answer(vector_store, q):
    from langchain.chains import RetrievalQA
    from langchain.chat_models import ChatOpenAI

    llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': 3})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)

    answer = chain.run(q)
    return answer
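
The index is created with `metric='cosine'` and the retriever asks for the `k` most similar chunks. The following toy sketch shows what that ranking means on hand-written vectors. It is an illustration under simple assumptions, not how Pinecone performs the search internally; `cosine_similarity` and `top_k_indices` are hypothetical helpers.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)

def top_k_indices(query_vec, doc_vecs, k=3):
    """Return the positions of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

# Toy 2-dimensional "embeddings"; real ones here have 1536 dimensions
docs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k_indices([1.0, 0.0], docs, k=2))
```

In the `stuff` chain type, the text of the chunks selected this way is placed directly into the prompt that is sent to the model.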

In [116]:
# Run the function

q = 'What is the whole document about?'
answer = ask_and_get_answer(vector_store, q)
print(answer)

The document being referred to is a Constitution, specifically outlining the Fundamental Objectives and Directive Principles of State Policy. It emphasizes principles of freedom, equality, and justice, aiming to consolidate the unity of the people in the country. It contains provisions regarding governance, duties of government organs, and authority, as well as guidelines for matters such as stamp duties and welfare of all individuals in the country.


### Looping through the application 

In [117]:
import time
i = 1
print('Write Quit or Exit to quit.')
while True:
    q = input(f'Question #{i}: ')
    i = i + 1
    if q.lower() in ['quit', 'exit']:
        print('Quitting ... bye bye!')
        time.sleep(2)
        break

    answer = ask_and_get_answer(vector_store, q)
    print(f'\nAnswer: {answer}')
    print(f'\n {"-" * 50} \n')

Write Quit or Exit to quit.
Quitting ... bye bye!


In [118]:
delete_pinecone_index()

Deleting all indexes ... 
ok


In [121]:
# Loading from Wikipedia
data = load_from_wikipedia('ChatGPT', 'ro')
chunks = chunk_data(data)
index_name='chatgpt'
vector_store = insert_or_fetch_embeddings(index_name, chunks)

Creating index chatgpt and embeddings ...Ok


In [122]:
q = "Ce este ChatGPT?"
answer = ask_and_get_answer(vector_store, q)
print(answer)

ChatGPT este un membru al familiei de modele de limbaj generative pre-antrenate dezvoltat de OpenAI. A fost construit pe versiunea îmbunătățită a lui GPT-3, cunoscută sub numele de „GPT-3.5”, iar o versiune mai nouă, bazată pe GPT-4, a fost lansată pentru abonații plătitori pe 14 martie 2023. Murati consideră că ChatGPT poate fi o oportunitate pentru educație, dar regulile și limitele trebuie stabilite în utilizarea lui zilnică.


## RAG: Retrieval Augmented Generation

### Using Chroma as a Vector DB

In [123]:
pip install -q chromadb


In [133]:
def create_embeddings_chroma(chunks, persist_directory='./chroma_db'):
    from langchain.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)
    vector_store = Chroma.from_documents(chunks, embeddings, persist_directory=persist_directory)
    return vector_store

def load_embeddings_chroma(persist_directory='./chroma_db'):
    from langchain.vectorstores import Chroma
    from langchain_openai import OpenAIEmbeddings

    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)
    vector_store = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
    return vector_store
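
One thing to watch for: calling `create_embeddings_chroma` more than once on the same `persist_directory` may store the same chunks again, which would be consistent with the repeated context passages visible in the verbose output later in this notebook. A possible mitigation, sketched here with the standard library only, is to derive a deterministic ID from each chunk's text so that identical chunks always map to the same ID. `stable_chunk_ids` is a hypothetical helper; whether and how a given vector store accepts caller-supplied IDs is an assumption to check against its documentation.

```python
import hashlib

def stable_chunk_ids(texts):
    """Return one deterministic ID per text, derived from its content."""
    return [hashlib.sha256(text.encode('utf-8')).hexdigest() for text in texts]

texts = ['first chunk', 'second chunk', 'first chunk']
ids = stable_chunk_ids(texts)

# Identical text yields an identical ID, so duplicates can be detected or overwritten
print(ids[0] == ids[2])
print(len(set(ids)))
```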

In [127]:
# Load the PDF File
data = load_document('./ngConstitution.pdf')
chunks = chunk_data(data, chunk_size=256)
vector_store = create_embeddings_chroma(chunks)

Loading ./ngConstitution.pdf


In [128]:
q = 'What is Constitution?'
answer = ask_and_get_answer(vector_store, q)
print(answer)

A constitution is a set of fundamental principles or established precedents that a state or organization is governed by. It typically outlines the rights and duties of citizens and the structure of the government.


In [134]:
db = load_embeddings_chroma()
q = 'What is the constitution all about?'
# Query the store that was just loaded from disk
answer = ask_and_get_answer(db, q)
print(answer)

The constitution is about establishing and regulating authorities for the country or any part thereof in order to promote and enforce the observance of fundamental objectives and directive principles contained in the constitution. It also aims to consolidate the unity of the people and promote good government based on principles of freedom, equality, and justice.


### Adding Memory to the RAG System (Chat History)

In [144]:
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k' : 5})
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

crc = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    chain_type='stuff',
    verbose=True
)
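
Conceptually, the memory object keeps every question and answer and replays them to the chain on the next turn. The sketch below mimics that behaviour with a plain list. `SimpleBufferMemory` is a hypothetical stand-in, not LangChain's `ConversationBufferMemory`; the `Human:` / `Assistant:` layout copies what the verbose "Chat History" output in this notebook shows.

```python
class SimpleBufferMemory:
    """Hypothetical stand-in for a conversation buffer: stores every turn in order."""

    def __init__(self):
        self.messages = []  # list of (role, content) tuples

    def save_turn(self, question, answer):
        """Record one question/answer exchange."""
        self.messages.append(('Human', question))
        self.messages.append(('Assistant', answer))

    def as_text(self):
        """Render the history in the form injected into the follow-up prompt."""
        return '\n'.join(f'{role}: {content}' for role, content in self.messages)

memory_sketch = SimpleBufferMemory()
memory_sketch.save_turn('What is constitution?', 'A set of fundamental principles.')
print(memory_sketch.as_text())
```

Because the buffer grows with every turn, long conversations eventually enlarge the prompt; that is the trade-off of keeping the full history.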

In [145]:
def ask_question(q, chain):
    result = chain.invoke({'question': q})
    return result

In [146]:
data = load_document('./ngConstitution.pdf')
chunks = chunk_data(data, chunk_size=256)
vector_store = create_embeddings_chroma(chunks)

Loading ./ngConstitution.pdf


In [147]:
q = 'What is constitution?'
result = ask_question(q, crc)
print(result)

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
Constitution, ensure their existence under a Law wh ich provides for the establishment, structure, 
composition, finance and functions of such councils.  
(2) The person authorised by law to prescribe th e area over which a local government council may

Constitution, ensure their existence under a Law wh ich provides for the establishment, structure, 
composition, finance and functions of such councils.  
(2) The person authorised by law to prescribe th e area over which a local government council may

Constitution, ensure their existence under a Law wh ich provides for the establishment, structure, 
composition, finance and functions of such councils.  
(2) The person au

In [148]:
print(result['answer'])

A constitution is a set of fundamental principles or established precedents according to which a state or other organization is governed. It typically outlines the structure, powers, and duties of the government, as well as the rights of the citizens.


In [149]:
q = 'how does the constitution related to US constitution'
result = ask_question(q, crc)

> Entering new LLMChain chain...
Prompt after formatting:
Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:

H: What is constitution?
Assistant: A constitution is a set of fundamental principles or established precedents according to which a state or other organization is governed. It typically outlines the structure, powers, and duties of the government, as well as the rights of the citizens.
Follow Up Input: how does the constitution related to US constitution
Standalone question:

> Finished chain.

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
Constitution; and (c) r

In [150]:
print(result['answer'])

The US Constitution is a specific constitution that governs the United States of America. It is a written document that outlines the structure of the government, the rights of the citizens, and the powers of the different branches of government. In general, a constitution is a set of fundamental principles or established precedents according to which a state or other organization is governed. The US Constitution is an example of a constitution that serves as the supreme law of the land in the United States.


In [151]:
# Print the history
for item in result['chat_history']:
    print(item)

content='What is constitution?'
content='A constitution is a set of fundamental principles or established precedents according to which a state or other organization is governed. It typically outlines the structure, powers, and duties of the government, as well as the rights of the citizens.'
content='how does the constitution related to US constitution'
content='The US Constitution is a specific constitution that governs the United States of America. It is a written document that outlines the structure of the government, the rights of the citizens, and the powers of the different branches of government. In general, a constitution is a set of fundamental principles or established precedents according to which a state or other organization is governed. The US Constitution is an example of a constitution that serves as the supreme law of the land in the United States.'


## Using a Custom Prompt

In [159]:
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

from langchain.prompts import ChatPromptTemplate, SystemMessagePromptTemplate, HumanMessagePromptTemplate

llm = ChatOpenAI(model_name='gpt-3.5-turbo', temperature=0)
retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k' : 5})
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# The prompt can be written in any language
system_template = r'''
Use the following pieces of context to answer the user's question.
If you don't find the answer in the provided context, just respond "I don't know."
--------------------
Context: ```{context}```
'''

user_template = '''
Question: ```{question}```
Chat History: ```{chat_history}```
'''

messages= [
    SystemMessagePromptTemplate.from_template(system_template),
    HumanMessagePromptTemplate.from_template(user_template)
]

qa_prompt = ChatPromptTemplate.from_messages(messages)

crc = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory,
    chain_type='stuff',
    combine_docs_chain_kwargs={'prompt': qa_prompt},
    verbose=True
)
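
The placeholders in the two templates are filled in at query time: `{context}` with the retrieved chunks, `{question}` with the user's input, and `{chat_history}` with the earlier turns. Here is a minimal sketch of that substitution using plain `str.format`. The templates are shortened, `build_messages` is a hypothetical helper, and joining chunks with a blank line is an assumption based on how the context appears in the verbose output; this is not LangChain's own formatting code.

```python
SYSTEM_TEMPLATE = (
    "Use the following pieces of context to answer the user's question.\n"
    "If you don't find the answer in the provided context, just respond \"I don't know.\"\n"
    "Context: {context}"
)
USER_TEMPLATE = "Question: {question}\nChat History: {chat_history}"

def build_messages(context_chunks, question, chat_history=''):
    """Fill both templates and return (role, text) pairs."""
    # Assumption: retrieved chunks are joined with a blank line between them
    context = '\n\n'.join(context_chunks)
    return [
        ('system', SYSTEM_TEMPLATE.format(context=context)),
        ('human', USER_TEMPLATE.format(question=question, chat_history=chat_history)),
    ]

for role, text in build_messages(['chunk A', 'chunk B'], 'When was Elon Musk born?'):
    print(f'[{role}]\n{text}\n')
```

Since the instruction to answer "I don't know." sits in the system message next to the context, a question the retrieved chunks cannot answer should produce that reply instead of an answer drawn from the model's general knowledge.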

In [156]:
print(qa_prompt)

input_variables=['chat_history', 'context', 'question'] messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], template="\nUse the following pieces of context to answer the user's question.\n--------------------\nContext: ```{context}```\n")), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['chat_history', 'question'], template='\nQuestion: ```{question}```\nChat History: ```{chat_history}```\n'))]


In [157]:
db = load_embeddings_chroma()
q = 'how does the constitution related to US constitution'
result = ask_question(q, crc)
print(result)

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
System:
Use the following pieces of context to answer the user's question.
--------------------
Context: ```Constitution; and (c) references to persons, offices and authorities of a St ate were references to the persons, offices and authorities

Constitution; and (c) references to persons, offices and authorities of a St ate were references to the persons, offices and authorities

Constitution; and (c) references to persons, offices and authorities of a St ate were references to the persons, offices and authorities

(3) If any other law is inconsistent with the provisions of this Constitution, this Constitution shall prevail, 
and that other law shall, to the extent of the inconsistency, be void.

(3) If any other law is inconsistent with the provisions of this Constitution, this Constitution shall prevail, 
and that other law shall, to the extent of t

In [160]:
q = 'When was Elon Musk born?'
result = ask_question(q, crc)
print(result)

> Entering new StuffDocumentsChain chain...

> Entering new LLMChain chain...
Prompt after formatting:
System:
Use the following pieces of context to answer the user's question.
If you don't find the answer in the provided context, just respond "I don't know."
--------------------
Context: ```citizen of Nigeria if at the time of the birth of that person such parent or grandparent would have possessed that status by birth if he had been alive on the date of independence; and in this section, "the

citizen of Nigeria if at the time of the birth of that person such parent or grandparent would have possessed that status by birth if he had been alive on the date of independence; and in this section, "the

citizen of Nigeria if at the time of the birth of that person such parent or grandparent would have possessed that status by birth if he had been alive on the date of independence; and in this section, "the

Provided that a person shall not become a citizen

In [161]:
print(result['answer'])

I don't know.
