# RAG for private documents

How can LLMs learn new knowledge?

- Fine-tuning on a training set
- Model inputs (inserting the knowledge into the prompt)

The recommended approach is to use model inputs with embedding-based search.

1. Prepare the search data
    - Load the data into LangChain documents
    - Split the documents into chunks
    - Embed the chunks into numeric vectors
    - Save the chunks and embeddings to a vector database

2. Search, once per query
    - Embed the user's question
    - Rank the chunk embeddings by similarity to the question embedding; the nearest vectors represent the most relevant chunks

3. Ask
    - Insert the question and the most relevant chunks into a message to a GPT model
    - Return the GPT model's answer
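
The search step above can be sketched in a few lines of plain Python. This is only an illustration, not the pipeline used later in this notebook: the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `rank_chunks` are hypothetical helper names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_chunks(question_embedding, chunk_embeddings, k=3):
    """Return the ids of the k chunks most similar to the question, best first."""
    scored = [
        (cosine_similarity(question_embedding, emb), chunk_id)
        for chunk_id, emb in chunk_embeddings.items()
    ]
    scored.sort(reverse=True)
    return [chunk_id for _, chunk_id in scored[:k]]

# Toy 3-dimensional "embeddings" (made up for illustration only).
chunk_embeddings = {
    "senate": [0.9, 0.1, 0.0],
    "house": [0.7, 0.3, 0.1],
    "preamble": [0.0, 0.2, 0.9],
}
question_embedding = [1.0, 0.0, 0.0]

top_chunks = rank_chunks(question_embedding, chunk_embeddings, k=2)
print(top_chunks)
```

A vector database such as Pinecone performs the same kind of ranking, but over many more vectors and with indexing that avoids comparing the question against every chunk.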



# Load the documents

In [None]:
# !pip install -q pypdf  #already installed so commenting out 

In [None]:
# !pip install -q docx2txt   #already installed so commenting out 

In [27]:
# !pip install wikipedia -q   # already installed so commenting out




In [38]:
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv(), override=True)


###############  loading PDF, DOCX and TXT files as LangChain Documents
def load_document(file):
    import os
    name, extension = os.path.splitext(file)

    if extension == '.pdf':
        from langchain.document_loaders import PyPDFLoader
        print(f'Loading {file}')
        loader = PyPDFLoader(file)  # `file` can also be a URL pointing to an online PDF
    elif extension == '.docx':
        from langchain.document_loaders import Docx2txtLoader
        print(f'Loading {file}')
        loader = Docx2txtLoader(file)
    elif extension == '.txt':
        from langchain.document_loaders import TextLoader
        print(f'Loading {file}')
        loader = TextLoader(file)
    else:
        print('Document format is not supported!')
        return None

    data = loader.load()
    return data


##############  Loading the data from wikipedia
def load_from_wikipedia(query, lang="en",load_max_docs=2):
    from langchain.document_loaders import WikipediaLoader
    loader = WikipediaLoader(query=query, lang=lang, load_max_docs=load_max_docs)
    data = loader.load()
    return data

############# Data chunking before embedding into a vector database
def chunk_data(data, chunk_size=256):
    # Splitter recommended by LangChain: tries to split on "\n\n", then "\n", then whitespace
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0)
    chunks = text_splitter.split_documents(data)
    return chunks

############ Calculate the cost of embedding
def print_embedding_cost(texts):
    import tiktoken
    enc = tiktoken.encoding_for_model('text-embedding-3-small')
    total_tokens = sum([len(enc.encode(page.page_content)) for page in texts])
    print(f'Total Tokens: {total_tokens}')
    print(f'Embedding Cost in USD: {total_tokens / 1000 * 0.00002:.6f}')
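
To make the idea behind `chunk_data` more concrete, here is a simplified stand-in for recursive character splitting, written in plain Python. It is not LangChain's implementation (it ignores `chunk_overlap` and other options), and `split_recursively` is a hypothetical name. It only shows the principle: try the coarsest separator first and fall back to finer ones when a piece is still too long.

```python
def split_recursively(text, chunk_size=256, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters.

    Pieces are cut at the first separator where possible; a piece that is
    still too long is split again with the next separator. Text with no
    separators left is cut at fixed character offsets.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate      # the piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)   # close the open chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece          # start a new chunk with this piece
        else:
            chunks.extend(split_recursively(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks

text = (
    "Section 1: Congress\n\n"
    "All legislative Powers herein granted shall be vested in a Congress "
    "of the United States.\n\n"
    "Section 2: The House"
)
chunks = split_recursively(text, chunk_size=60)
for chunk in chunks:
    print(repr(chunk))
```

The two short headings survive as their own chunks, while the long sentence is split on spaces because it contains no newlines.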
    
    

# Embedding and Uploading a Vector Database to Pinecone

In [40]:
def insert_or_fetch_embeddings(index_name, chunks):
    # importing the necessary libraries and initializing the Pinecone client
    import pinecone
    from langchain_community.vectorstores import Pinecone
    from langchain_openai import OpenAIEmbeddings
    from pinecone import PodSpec

    
    pc = pinecone.Pinecone()
        
    embeddings = OpenAIEmbeddings(model='text-embedding-3-small', dimensions=1536)  # 512 also works, but it must match the index dimension

    # loading from existing index
    if index_name in pc.list_indexes().names():
        print(f'Index {index_name} already exists. Loading embeddings ... ', end='')
        vector_store = Pinecone.from_existing_index(index_name, embeddings)
        print('Ok')
    else:
        # creating the index and embedding the chunks into the index 
        print(f'Creating index {index_name} and embeddings ...', end='')

        # creating a new index
        pc.create_index(
            name=index_name,
            dimension=1536,
            metric='cosine',
            spec=PodSpec(
                environment='gcp-starter'
            )
        )

        # processing the input documents, generating embeddings using the provided `OpenAIEmbeddings` instance,
        # inserting the embeddings into the index and returning a new Pinecone vector store object. 
        vector_store = Pinecone.from_documents(chunks, embeddings, index_name=index_name)
        print('Ok')
        
    return vector_store

# Delete a Pinecone index

In [44]:
def delete_pinecone_index(index_name='all'):
    import pinecone
    pc = pinecone.Pinecone()
    
    if index_name == 'all':
        indexes = pc.list_indexes().names()
        print('Deleting all indexes ... ')
        for index in indexes:
            pc.delete_index(index)
        print('Ok')
    else:
        print(f'Deleting index {index_name} ...', end='')
        pc.delete_index(index_name)
        print('Ok')

# Running the code

In [34]:
# Loading the pdf and docx document into LangChain 

data = load_document('files/us_constitution.pdf')   # for PDF
# data = load_document('files/the_great_gatsby.docx')   # for docx
print(data[0].page_content)
print(data[10].metadata)
print(f'You have {len(data)} pages in your document')
print(f'There are  {len(data[20].page_content)} characters in page')


Loading files/us_constitution.pdf
The
United
States
Constitution
W e
the
People
of
the
United
States,
in
Order
to
form
a
more
perfect
Union,
establish
Justice,
insure
domestic
T ranquility ,
provide
for
the
common
defence,
promote
the
general
W elfare,
and
secure
the
Blessings
of
Liberty
to
ourselves
and
our
Posterity ,
do
ordain
and
establish
this
Constitution
for
the
United
States
of
America.
The
Constitutional
Con v ention
Article
I
Section
1:
Congress
All
legislative
Powers
herein
granted
shall
be
vested
in
a
Congress
of
the
United
States,
which
shall
consist
of
a
Senate
and
House
of
Representatives.
Section
2:
The
House
of
Representatives
{'source': 'files/us_constitution.pdf', 'page': 10}
You have 41 pages in your document
There are  1137 characters in page
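
The page text above comes out of the PDF with one word per line. One possible cleanup, sketched here as an assumption and not part of the original pipeline, is to collapse whitespace before chunking. `normalize_whitespace` is a hypothetical helper; it does not repair words the extractor split in two (such as `W e`), and it removes paragraph breaks that the splitter could otherwise use.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace (including newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()

sample = "The\nUnited\nStates\nConstitution\nW e\nthe\nPeople"
cleaned = normalize_whitespace(sample)
print(cleaned)
```

If this were adopted, it could be applied to each document's `page_content` before calling `chunk_data`.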


In [32]:
# Loading the data from Wikipedia

data = load_from_wikipedia('GPT-4', lang="pt")
print(data[0].page_content)

Generative Pre-trained Transformer 4 (GPT-4) é um modelo de linguagem grande multimodal criado pela OpenAI e o quarto modelo da série GPT.  Foi lançado em 14 de março de 2023, e se tornou publicamente aberto de forma limitada por meio do ChatGPT Plus, com o seu acesso à API comercial sendo provido por uma lista de espera. Sendo um transformador, foi pré-treinado para prever o próximo token (usando dados públicos e "licenciados de provedores terceirizados"), e então foi aperfeiçoado através de uma técnica de aprendizagem por reforço com humanos. 
A empresa Microsoft, após o lançamento do modelo, confirmou que versões do Bing utilizando o GPT estavam, de fato, utilizando o modelo mais recente da OpenAI antes de seu lançamento oficial. 


== Capacidades ==
Diferentemente de seu predecessor, o GPT-3, o GPT-4 é capaz de processar imagens como entrada, não apenas texto, analisando o conteúdo da imagem de forma semelhante a um humano, e emitindo uma saída em forma de texto.
Pesquisadores da M

In [39]:
# Splitting the document into chunks
chunks = chunk_data(data, chunk_size=256)
print(len(chunks))
print(chunks[10].page_content)
print_embedding_cost(chunks)
# # Creating a Chroma vector store using the provided text chunks and embedding model (default is text-embedding-3-small)
# vector_store = create_embeddings_chroma(chunks)

190
Representatives
shall
chuse
their
Speaker
and
other
Of ficers;and
shall
have
the
sole
Power
of
Impeachment.
Section
3:
The
Senate
The
Senate
of
the
United
States
shall
be
composed
of
two
Senators
from
each
State,
chosen
by
the
Legislature
thereof,
for
six
Total Tokens: 16711
Embedding Cost in USD: 0.000334
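
As a quick check of the figure above, the arithmetic in `print_embedding_cost` can be reproduced on its own. The rate is the one hard-coded in that function; actual pricing may differ, and `embedding_cost_usd` is a hypothetical helper name.

```python
PRICE_PER_1K_TOKENS_USD = 0.00002  # rate hard-coded in print_embedding_cost

def embedding_cost_usd(total_tokens, price_per_1k=PRICE_PER_1K_TOKENS_USD):
    """Cost of embedding total_tokens tokens at a per-1,000-token price."""
    return total_tokens / 1000 * price_per_1k

cost = embedding_cost_usd(16711)
print(f"Embedding Cost in USD: {cost:.6f}")
```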


In [45]:
delete_pinecone_index()

Deleting all indexes ... 
Ok


In [47]:
index_name = "askadoc"
vector_store = insert_or_fetch_embeddings(index_name, chunks)

Creating index askadoc and embeddings ...Ok


# Getting into QA with similarity search

In [48]:
def ask_and_get_answer(vector_store, q, k=3):
    from langchain.chains import RetrievalQA
    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(model='gpt-3.5-turbo', temperature=1)
    retriever = vector_store.as_retriever(search_type='similarity', search_kwargs={'k': k})
    chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=retriever)
    answer = chain.invoke(q)
    return answer
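
The `"stuff"` chain type places all retrieved chunks into a single prompt together with the question. The sketch below shows that idea in plain Python. The prompt wording is made up for illustration and is not LangChain's actual template, and `build_stuff_prompt` is a hypothetical name.

```python
def build_stuff_prompt(question, chunks):
    """Put every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question. "
        "If the answer is not in the context, say that you don't know.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Stand-ins for the k chunks returned by the retriever.
retrieved = [
    "The Senate of the United States shall be composed of two Senators from each State,",
    "chosen by the Legislature thereof, for six Years;",
]
prompt = build_stuff_prompt("How long is a Senator's term?", retrieved)
print(prompt)
```

Because every chunk is inserted verbatim, the value of `k` and the chunk size together determine how much of the model's context window the prompt consumes.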

In [57]:
q = "What is the whole document about?"
answer = ask_and_get_answer(vector_store, q)
print(answer["result"])

The excerpts provided are from the United States Constitution. The content of the Constitution outlines the fundamental principles and laws that govern the United States, including the structure of the government, the rights of citizens, and the relationship between the government and the people.


In [None]:
import time
i = 1
print("Write QUIT or EXIT to leave program")
while True:
    q = input(f"Question #{i}: ")
    i = i + 1
    if q.lower() in ["quit", "exit"]:
        print("Leaving the program.")
        time.sleep(2)
        break
    answer = ask_and_get_answer(vector_store, q)
    print("\nAnswer: " + answer["result"] + "\n\n")
    print("-----" * 25)
    

Write QUIT or EXIT to leave program


Question #1:  Explain the second ammendment to the constitution



Answer: The Second Amendment to the United States Constitution states: "A well regulated Militia, being necessary to the security of a free State, the right of the people to keep and bear Arms, shall not be infringed." This provision protects the right of the American people to keep and bear firearms. It has been the subject of considerable debate, with discussions focusing on the balance between individuals' right to own guns and the need for public safety.


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Question #2:  Explain the concept of Federalism as it is presented in the us constitution



Answer: Federalism, as presented in the United States Constitution, is the division of powers between the national government and the state governments. The Constitution outlines the responsibilities and powers of the federal government, while also reserving certain powers to the states or to the people. This system of government ensures a balance of power between the national government and the individual states, allowing for both centralized authority and state autonomy. This concept of Federalism is a key principle of the U.S. Constitution, which establishes a system of government that includes both national and state level governance.


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Question #3:  Q1: Expplain the bill of rights Q2: Describe what happens in presidential succession. Answer both question separately 



Answer: Q1: I don't have information on the Bill of Rights. 
Q2: In the case of the removal of the President from office, their death, or resignation, the Vice President will become the President. If there is a vacancy in the office of the Vice President, the President shall nominate a Vice President who must be confirmed by a majority vote of Congress. The Speaker of the House of Representatives and the President pro tempore of the Senate are next in the line of succession.


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


Question #4:  Describe the Bill of Rights



Answer: The Bill of Rights is the first ten amendments to the United States Constitution. It includes rights such as freedom of speech, religion, and the right to bear arms. Each of the ten amendments provides specific protections for the citizens of the United States, ensuring their individual liberties and rights are protected.


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
