## Project: Question Answering on Private Documents

In [1]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)


True
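
`load_dotenv` returning `True` only says that at least one variable was set from the file. It does not confirm that the keys needed by later cells are present. A minimal sketch of such a check follows. The key names `OPENAI_API_KEY` and `PINECONE_API_KEY` are assumptions about what this notebook's `.env` file holds, and `missing_env_vars` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

def missing_env_vars(names, env=None):
    """Return the names that are absent or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Assumed key names; adjust them to match your own .env file.
required = ["OPENAI_API_KEY", "PINECONE_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Running a check like this right after `load_dotenv` surfaces a missing key before any paid API call is attempted.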

## Load PDF File 

In [2]:
pip install pypdf -q


In [3]:
def load_document(file):
    from langchain.document_loaders import PyPDFLoader
    print(f'Loading {file}')
    loader = PyPDFLoader(file)
    data = loader.load()
    return data

### Start Running the Code

In [4]:
data = load_document('ngConstitution.pdf')
print(data[1].page_content)

Loading ngConstitution.pdf
(5) The provisions of this Constitution in Part I of Chapter VIII hereof shall in relation to the Federal 
Capital Territory, Abuja, have effect in the manner set out thereunder. 
(6) There shall be 768 Local Government Areas in Nigeri a as shown in the second column of Part I of the 
First Schedule to this Constitution and six area councils as shown in Part II of that Schedule. 
  
  
  
  
Part II  
  
Powers of the Federal Republic of Nigeria 
  
  
4. (1) The legislative powers of the Federal Republic of Nigeria shall be vested in a National Assembly for 
the Federation, which shall consist of a Senate and a House of Representatives. 
 
(2) The National Assembly shall have power to make laws for the peace, order and good government of 
the Federation or any part thereof with respect to an y matter included in the Exclusive Legislative List set 
out in Part I of the Second Schedule to this Constitution.  
(3) The power of the National Assembly to make laws

In [5]:
data = load_document('ngConstitution.pdf')
print(data[10].metadata)

Loading ngConstitution.pdf
{'source': 'ngConstitution.pdf', 'page': 10}


In [7]:
# Check how many pages the document has
data = load_document('ngConstitution.pdf')
print(f'You have {len(data)} pages in your data')

# Check the number of characters on one page
print(f'There are {len(data[20].page_content)} characters in the page')

Loading ngConstitution.pdf
You have 118 pages in your data
There are 3314 characters in the page


### Document Format
Loading files in different document formats

In [8]:
pip install docx2txt -q


In [9]:
def load_document(file):
    import os
    name, extension = os.path.splitext(file)

    if extension == '.pdf':
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)
    elif extension == '.docx':
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    # Add branches for any other formats you need
    else:
        print('Document format is not supported')
        return None
    data = loader.load()
    return data
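
The `if`/`elif` chain above compares the extension exactly, so a file named `REPORT.PDF` would be reported as unsupported. One possible alternative is a lookup table keyed on the lowercased extension. The sketch below only resolves the loader's class name as a string so that it runs without LangChain installed; `LOADER_NAMES` and `pick_loader_name` are hypothetical names introduced for illustration.

```python
import os

# Map lowercase file extensions to the loader class names used above.
LOADER_NAMES = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
}

def pick_loader_name(file):
    """Return the loader class name for a file, or None if unsupported."""
    _, extension = os.path.splitext(file)
    return LOADER_NAMES.get(extension.lower())

print(pick_loader_name("ngConstitution.pdf"))
print(pick_loader_name("REPORT.PDF"))
print(pick_loader_name("notes.txt"))
```

With this layout, supporting a new format means adding one dictionary entry instead of another `elif` branch.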

### Running Data

In [11]:
data = load_document('podcast.docx')
print(data[0].page_content)

Loading podcast.docx
INDIVIDUAL PODCAST





Hello,

Welcome to my podcast, I will be talking about HIV/AIDS which is my health campaign topic. The purpose is to help control HIV/AIDS in Nigeria and help everyone know their status.

Symptoms of HIV/AIDS include night sweats, fatigue, weight loss, vomiting, skin rashes, etc. The transmission of this disease is through having unsafe sex and sharing equipment such as needles, blades, etc. Antiretroviral drugs help HIV/AIDS reduce the risks of transmission.





I would like to include a story about an HIV patient, about a young lady called Joy who was diagnosed with HIV.This led to her family members mocking and leaving her and thus made the young lady destabilized. After some few months, she went for counseling and was given HIV drugs which are the antiretroviral drugs i mentioned previously. These brought about lots of changes in her and she opened her own organization to help people living with HIV/AIDS to gain their confidence and mov

## Public and Private Service Loaders

Load from Wikipedia

In [12]:
pip install wikipedia -q


In [13]:
# Wikipedia
def load_from_wikipedia(query, lang='en', load_max_docs=2):
    from langchain.document_loaders import WikipediaLoader
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
    data = loader.load()
    return data  

In [15]:
data = load_from_wikipedia('Olusegun Obasanjo')
print(data[0].page_content)

Chief Olusegun Matthew Okikiola Ogunboye Aremu Obasanjo  (  oh-BAH-sən-joh; Yoruba: Olúṣẹ́gun Ọbásanjọ́ [olúʃɛ́ɡũ ɔbásanɟɔ] ; born 5 March 1937) is a Nigerian retired military General and statesman who served as Nigeria's head of state from 1976 to 1979 and later as its president from 1999 to 2007. Ideologically a Nigerian nationalist, he was a member of the Peoples Democratic Party (PDP) from 1998 to 2015, and since 2018.
Born in the village of Ibogun-Olaogun to a farming family of the Owu branch of the Yoruba, Obasanjo was educated largely in Abeokuta, Ogun State. He joined the Nigerian Army and specialised in engineering and was assigned to the Congo, Britain, and India, rising to the rank of major. In the late 1960s, he played a senior role in combating Biafran separatists during the Nigerian Civil War, accepting their surrender in 1970. In 1975, a military coup established a junta with Obasanjo as part of its ruling triumvirate. After the triumvirate's leader, Murtala Muhammed, wa

### Chunking Strategies and Splitting the Document

In [23]:
def chunk_data(data, chunk_size=256):
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size = chunk_size, chunk_overlap=0)
    chunks = text_splitter.split_documents(data)
    return chunks
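
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified splitter that cuts fixed-width character windows. It is not how `RecursiveCharacterTextSplitter` works internally: that class tries to split on separators such as paragraph breaks, line breaks, and spaces first. `split_fixed` is a hypothetical helper used only to show how the two parameters interact.

```python
def split_fixed(text, chunk_size=256, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    # Each new window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4, chunk_overlap=0))
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))
```

A non-zero overlap produces more chunks, and therefore more tokens to embed, in exchange for keeping text that straddles a boundary together in at least one chunk.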

In [24]:
# Check how many pages the document has
data = load_document('ngConstitution.pdf')
print(f'You have {len(data)} pages in your data')

# Check the number of characters on one page
print(f'There are {len(data[20].page_content)} characters in the page')

Loading ngConstitution.pdf
You have 118 pages in your data
There are 3314 characters in the page


In [29]:
chunks = chunk_data(data)
print(len(chunks))

1892


In [30]:
# Print one of the chunks
print(chunks[11].page_content)

(5) The provisions of this Constitution in Part I of Chapter VIII hereof shall in relation to the Federal 
Capital Territory, Abuja, have effect in the manner set out thereunder.


In [40]:
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-ada-002')
    total_tokens = sum(len(enc.encode(page.page_content)) for page in texts)
    print(f'Total Tokens: {total_tokens}')
    # 0.0004 USD per 1,000 tokens is the rate assumed in this notebook
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.0004:.6f}')

In [34]:
print_embedding_cost(chunks)

Total Tokens: 82159
Embedding Cost in USD: 0.032864
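
The estimate is plain arithmetic: tokens divided by 1,000, multiplied by the price per 1,000 tokens. The sketch below separates the rate from the calculation so it can be updated easily. The value 0.0004 is the one hard-coded in this notebook and may not match current pricing or the `text-embedding-3-small` model used later, so treat it as an assumption. `embedding_cost_usd` is a hypothetical helper.

```python
def embedding_cost_usd(total_tokens, usd_per_1k_tokens):
    """Estimate embedding cost from a token count and a per-1K-token rate."""
    return total_tokens / 1000 * usd_per_1k_tokens

total_tokens = 82159   # token count reported for the 1,892 chunks
rate = 0.0004          # rate used in this notebook; verify against current pricing
print(f"Embedding Cost in USD: {embedding_cost_usd(total_tokens, rate):.6f}")
```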


### Embedding and Uploading to a Vector Database (Pinecone)

In [43]:
# Load the Pinecone API key from the .env file
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)

True

In [60]:
pip install -q pinecone-client


In [61]:
pip install --upgrade -q pinecone-client


In [77]:
def insert_or_fetch_embeddings(index_name, chunks):
    import pinecone
    from langchain_community.vectorstores import Pinecone
    from langchain_openai import OpenAIEmbeddings
    from pinecone import PodSpec

    pc = pinecone.Pinecone()

    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)

    if index_name in pc.list_indexes().names():
        print(f'Index {index_name} already exists. Loading embeddings ...', end='')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print('ok')
    else:
        print(f'Creating index {index_name} and embeddings ...', end='')
        pc.create_index(
            name=index_name,
            dimension=1536,
            metric='cosine',
            spec=PodSpec(environment='gcp-starter')
        )
        vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
        print('Ok')
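    # Return the store in both branches; returning inside the else branch
    # would yield None whenever the index already exists.
    return vector_store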
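
The index is created with `metric='cosine'`, which ranks vectors by the angle between them and ignores their length. As a rough illustration of that metric, and not of Pinecone's internal implementation, a pure-Python version might look like this. `cosine_similarity` is a hypothetical helper, and the two-dimensional vectors are toy values standing in for 1536-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

query = [1.0, 0.0]
print(cosine_similarity(query, [2.0, 0.0]))   # same direction, different length
print(cosine_similarity(query, [0.0, 3.0]))   # orthogonal
print(cosine_similarity(query, [-1.0, 0.0]))  # opposite direction
```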

In [54]:
def delete_pinecone_index(index_name='all'):
    import pinecone
    pc = pinecone.Pinecone()
    if index_name == 'all':
        indexes = pc.list_indexes().names()
        print('Deleting all indexes ... ')
        for index in indexes:
            pc.delete_index(index)
        print('ok')
    else:
        print(f'Deleting index {index_name} ...', end='')
        pc.delete_index(index_name)
        print('ok')

In [47]:
# Delete all indexes in Pinecone (app.pinecone.io)
delete_pinecone_index()


Deleting all indexes ... 
ok


In [78]:
chunks = chunk_data(data)
print(len(chunks))

1892


In [79]:
index_name  =  'askadocument'
vector_store = insert_or_fetch_embeddings(index_name, chunks)

Creating index askadocument and embeddings ...Ok
