<a href="https://colab.research.google.com/github/anastaszi/GenAI/blob/main/Building_a_Question_Answering_Bot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building a Q&A Bot
In the previous two notebooks, we discussed quite a few concepts related to LLM basics and prompting. However, there hasn't been much code showing how these pieces fit together. In this notebook, we'll code through an example of a question-answering bot based on a fixed knowledge base. We'll cover how LLMs and prompting play into the patterns of search, retrieval, and text generation in the application.

To understand the details, we'll first take a look at the high level architecture of the application:

![search_retrieval_generation](./images/search_retrevial_generation.png)

## Search

Vector search could be an entire course on its own: it's a vast topic with many nuanced details. For this course, we'll cover only the basics. Vector search is a technique used to retrieve similar items or entities based on their vector representations. In vector search, data points are represented as vectors in a high-dimensional space, where each dimension corresponds to a specific feature or attribute. The goal is to efficiently search and retrieve items that are close or similar to a given query vector.

Here are the basic steps involved in vector search:

1. Vector Representation: Each item or entity in the search space is transformed into a vector representation. This can be done using techniques like word embeddings, document embeddings, or deep learning models. The vectors capture the essential characteristics or semantic meaning of the items.

2. Indexing: The vector representations of the items are stored in an index, which organizes the vectors for efficient search. Various indexing structures like k-d trees, ball trees, or approximate nearest neighbor (ANN) indexes can be used to speed up the search process.

3. Querying: When a search query is provided, it is also transformed into a vector representation using the same method used for the indexed items. The query vector represents the desired item or the characteristics being searched for.

4. Similarity Measurement: The similarity between the query vector and the vectors in the index is calculated using a distance or similarity metric, such as Euclidean distance or cosine similarity. This metric quantifies the similarity or dissimilarity between two vectors based on their positions in the high-dimensional space.

5. Retrieval: The items in the index that are most similar to the query vector are retrieved based on their proximity in the vector space. The retrieval can be performed using algorithms that efficiently search the index structure for nearest neighbors or approximate nearest neighbors.
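
The five steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how Chroma or any ANN library works internally: the `embed` function is a hypothetical bag-of-words stand-in for a real embedding model, the vocabulary and corpus are made up, and the "index" is a plain list that is scanned exhaustively.

```python
import math
from collections import Counter

# Hypothetical vocabulary; a real embedding model learns its own features.
VOCAB = ["president", "congress", "speech", "economy", "jefferson", "washington"]

def embed(text):
    """Step 1 and 3: map text to a vector (toy word counts, not a trained model)."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in VOCAB]

def cosine_similarity(a, b):
    """Step 4: cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def build_index(items):
    """Step 2: store each item next to its vector (a flat list, no ANN structure)."""
    return [(item, embed(item)) for item in items]

def search(index, query, k=2):
    """Step 5: score every indexed item against the query and keep the top k."""
    query_vector = embed(query)
    scored = [(cosine_similarity(query_vector, vec), item) for item, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

corpus = [
    "jefferson sent a written message to congress",
    "the president gave a speech on the economy",
    "washington gave the first speech to congress",
]
index = build_index(corpus)
results = search(index, "what did jefferson tell congress", k=2)
for score, item in results:
    print(f"{score:.3f}  {item}")
```

An exhaustive scan like this is fine for a handful of items. The indexing structures mentioned in step 2 exist to avoid scoring every vector when the collection is large.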


## Vector Search Technologies

There are various types of vector search technologies, including vector libraries, vector databases, and vector plugins.

- **Vector Libraries**: A vector search library is specifically designed to handle large-scale vector data sets and perform search operations to find nearest neighbors or similar items to a given query vector. It utilizes advanced indexing structures and search algorithms to optimize the search process and provide fast retrieval of similar vectors. These libraries typically offer a set of APIs and functions that allow developers to build applications or systems that require similarity search capabilities. Popular vector search libraries include Annoy, FAISS, NMSLIB, and SPTAG. These libraries provide efficient indexing structures, search algorithms, and APIs to support similarity search tasks in various domains, including recommendation systems, content-based image retrieval, natural language processing, and data mining.

- **Vector Databases**: A vector database, sometimes called a vector store, is a specialized database system designed to efficiently store, manage, and retrieve vector data. It is optimized for handling large-scale vector datasets and performing similarity searches on those vectors. While a vector database and a vector library have several overlapping capabilities, a vector database typically offers everything a vector library does and also includes classic database features (CRUD operations, metadata handling, etc.). Some notable vector databases are Chroma, Pinecone, and Weaviate.

- **Vector Plugins**: "Vector plugin" is not a standardized term. Here we use it to mean plugins or extensions that add vector search support to traditional database technologies (Postgres, MySQL, MongoDB, etc.). This allows you to repurpose an existing database for vector search.


Since this example includes a small amount of data, we'll use an open-source vector database, Chroma, to index our data.

For this example, we'll use some State of the Union text data and the `Langchain` loaders to load our `.txt` file:

In [None]:
%pip install sentence_transformers chromadb openai

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Defaulting to user installation because normal site-packages is not writeable
Collecting openai
  Downloading openai-0.27.8-py3-none-any.whl (73 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 73.6/73.6 kB 1.9 MB/s eta 0:00:00
Installing collected packages: openai
Successfully installed openai-0.27.8
Note: you may need to restart the kernel to use updated packages.


In [None]:
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

In [None]:
# Load in our Text Data using TextLoader
loader = TextLoader("./data/state_of_the_union.txt")
documents = loader.load()


Now that we've read in our data, we need to chunk it before indexing embeddings in Chroma. Chunking is the process of dividing a piece of text into smaller, manageable pieces. Each chunk is embedded and retrieved on its own, so smaller chunks keep search results focused and help the retrieved context fit within the model's token limit. The ideal chunk size depends on the number of documents, the layout of those documents, how much context is shared across documents, the nature of the queries, and the token limit of the model being used. Chunking best practices are still evolving, but [here](https://www.pinecone.io/learn/chunking-strategies/) is a good resource.

Since our data is small, we can get away with a simple fixed `chunk_size` of 1000 characters and no `chunk_overlap`.

In [None]:
# Chunk our data into smaller pieces for more effective vector search
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
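
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal fixed-window chunker in plain Python. It is a simplified sketch, not `CharacterTextSplitter`'s algorithm (that splitter first splits on a separator and then merges pieces, so its output will generally differ), and the `chunk_text` helper is a hypothetical name introduced for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=0):
    """Split text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so each new
    window starts (chunk_size - chunk_overlap) characters after the last.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
no_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=0)
with_overlap = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(no_overlap)    # ['abcdefghij', 'klmnopqrst', 'uvwxyz']
print(with_overlap)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With overlap, the end of one chunk is repeated at the start of the next, which can help when a relevant sentence would otherwise be cut at a chunk boundary.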

After the data is chunked, we'll use the `MPNetModel` from the `SentenceTransformer` library via `HuggingFace` to map our text data into numeric vectors, then index them with the `from_documents` function of the `Chroma` module in `Langchain`.

In [None]:
embeddings = HuggingFaceEmbeddings()
docsearch = Chroma.from_documents(texts, embeddings)

The `docsearch` object is an instance of a `Chroma` vector store in `Langchain`.

In [None]:
type(docsearch)

langchain.vectorstores.chroma.Chroma

In [None]:
# Example similarity search
docsearch.similarity_search("Thomas Jefferson")

[Document(page_content="For example, Thomas Jefferson thought Washington's oral presentation was too kingly for the new republic. Likewise, Congress's practice of giving a courteous reply in person at the President's residence was too formal. Jefferson detailed his priorities in his first annual message in 1801 and sent copies of the written message to each house of Congress. The President's annual message, as it was then called, was not spoken by the President for the next 112 years. The message was often printed in full or as excerpts in newspapers for the American public to read.\n\nThe first President to revive Washington's spoken precedent was Woodrow Wilson in 1913. Although controversial at the time, Wilson delivered his first annual message in person to both houses of Congress and outlined his legislative priorities.", metadata={'source': './data/state_of_the_union.txt'}),
 Document(page_content="Most annual messages outline the President's legislative agenda and national prior

To review the architecture above:

![search_retrieval_generation](./images/search_retrevial_generation.png)


So far we've covered the first few pieces:
- Search: we embedded and indexed our knowledge base using `TextLoader`, `HuggingFaceEmbeddings`, and `Chroma`
- Retrieval: within the `Chroma` collection, we can leverage the built-in search capabilities to identify relevant pieces of content based on the query


Let's shift the focus to the text generation portion:

## Langchain Chains
Chains are an incredibly versatile part of the `Langchain` library. They let users build more complex applications by combining several steps, models, and other components into a single, coherent pipeline. For example, we can create a chain that takes user input, formats it with a PromptTemplate, and then passes the formatted prompt to an LLM. We can build more complex chains by combining multiple chains together, or by combining chains with other components.

Read more about Chains [here](https://python.langchain.com/docs/modules/chains/)
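
As a rough mental model, a chain can be thought of as a sequence of functions where each step's output becomes the next step's input. The sketch below uses only plain Python and is not how `Langchain` implements chains: `make_chain`, `format_prompt`, and `fake_llm` are hypothetical names, and `fake_llm` merely stands in for a real model call.

```python
def format_prompt(user_input):
    """Step 1: wrap the raw user input in a prompt."""
    return f"Answer the question concisely.\n\nQuestion: {user_input}\nAnswer:"

def fake_llm(prompt):
    """Step 2: hypothetical stand-in for a model call; it only reports the prompt length."""
    return f"(model response to a {len(prompt)}-character prompt)"

def make_chain(*steps):
    """Compose steps left to right into a single callable."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run

# A third step (upper-casing) shows that post-processing can be chained on too.
chain = make_chain(format_prompt, fake_llm, str.upper)
output = chain("Who revived the spoken address?")
print(output)
```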


`Langchain` is being developed so rapidly that new built-in chains for different applications appear every few weeks. `Langchain` recently released a `RetrievalQA` chain that is designed to be used in a Q&A bot application. This chain takes the following inputs:
- **llm**: the large language model used for text generation
- **retriever**: the retriever object used to fetch relevant content (here, created with `docsearch.as_retriever()`)
- **chain_type**: this denotes how relevant content is passed into the llm, more details [here](https://python.langchain.com/docs/modules/chains/popular/vector_db_qa#chain-type)
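
With `chain_type="stuff"`, all retrieved documents are placed ("stuffed") into a single prompt that is sent to the LLM in one call. The idea can be sketched in plain Python as follows. The `stuff_prompt` helper, the prompt wording, and the sample documents are made up for illustration; `RetrievalQA` uses its own prompt.

```python
def stuff_prompt(documents, question):
    """Concatenate every retrieved document into one context block."""
    context = "\n\n".join(documents)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical retrieved chunks, standing in for the retriever's results.
retrieved = [
    "Jefferson sent his annual message to Congress in writing.",
    "Woodrow Wilson revived the spoken address in 1913.",
]
prompt_text = stuff_prompt(retrieved, "Who revived the spoken address?")
print(prompt_text)
```

Because everything goes into one prompt, this approach is simple but only works while the combined chunks fit within the model's token limit.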




Note that the `RetrievalQA` chain below uses an `OpenAI` model. To run this code, you'll need to generate an `OpenAI` API key and store it as an environment variable:

```python
import os

# Replace INSERT_KEY with your own OpenAI API key
os.environ["OPENAI_API_KEY"] = "INSERT_KEY"
```

In [None]:
# Create the chain
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=docsearch.as_retriever())

In [None]:
# Prompt the chain with a question
qa.run("What did the president say about Thomas Jefferson?")

" Thomas Jefferson thought Washington's oral presentation was too kingly for the new republic and Congress's practice of giving a courteous reply in person at the President's residence was too formal."

Great! We've been able to stand up a working example of a search, retrieval, and text generation process. Let's add one more layer of complexity by adding a prompt template:

In [None]:
from langchain.prompts import PromptTemplate
prompt_template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

{context}

Question: {question}
Answer in Iambic Pentameter:"""
prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
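
To see what the template does, it can help to fill it in by hand. The sketch below uses only the standard library and assumes format-style `{name}` placeholders like those in the template above. The `template_variables` helper and the sample context are hypothetical, and this is not `PromptTemplate`'s implementation.

```python
from string import Formatter

template_text = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}
Answer in Iambic Pentameter:"""

def template_variables(template):
    """Return the sorted placeholder names found in a format-style template."""
    return sorted({name for _, name, _, _ in Formatter().parse(template) if name})

# The placeholders found should match the declared input variables.
variables = template_variables(template_text)
print(variables)  # ['context', 'question']

# Fill the template the way a chain would: retrieved text plus the user's question.
filled = template_text.format(
    context="Wilson delivered his first annual message in person in 1913.",
    question="Who revived the spoken address?",
)
print(filled)
```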

In [None]:
# Incorporate the prompt into the chain
chain_type_kwargs = {"prompt": prompt}

qa_with_prompt = RetrievalQA.from_chain_type(llm=OpenAI(),
                                             chain_type="stuff",
                                             retriever=docsearch.as_retriever(),
                                             chain_type_kwargs=chain_type_kwargs)

In [None]:
print(qa_with_prompt.run("What did the president say about George Washington?"))


George Washington's speech did set/A precedent for all presidents yet/For kingly presence he was seen/And courteous replies at his home were keen/He wrote a dateline of United States/To prove the nation's union and fate.


Fantastic! We've now incorporated some prompt engineering best practices and techniques into our search, retrieval, and text generation pipeline.