<a href="https://colab.research.google.com/github/bharathpgp/aicodes/blob/main/Cod(H)erSpace_RAG_Handson.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Cod(H)erSpace RAG - Hands-on


### Step 1: Ensure Google Colab GPU Availability

Before proceeding, confirm that you're using a Google Colab GPU instance.

1. Click the dropdown in the top-right corner, beside **Connect / Reconnect**
2. Select **Change Runtime Type**
3. Select **T4** or **A100** as the hardware accelerator

If a GPU is attached, the command in the next cell prints a table describing it.

In [None]:
!nvidia-smi

Sat Nov 25 01:40:47 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P8    11W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces
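
The same check can also be done from Python, which is handy if a later cell should stop early when no GPU is attached. This is a minimal sketch, assuming `nvidia-smi` is on the `PATH` whenever a GPU runtime is active; the helper name `gpu_available` is ours and not part of any library.

```python
import shutil
import subprocess

def gpu_available() -> bool:
    """Return True if nvidia-smi exists and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False
    try:
        # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: Tesla T4 (...)"
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.SubprocessError):
        return False
    return result.returncode == 0 and "GPU" in result.stdout

print("GPU detected" if gpu_available() else "No GPU detected; check the runtime type")
```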

### Step 2: Install Required Libraries/Dependencies

Here are the essential prerequisites for our project:

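* **CTransformers**: Python bindings for Transformer models implemented in C/C++ on top of GGML, a C tensor library. We use it to run inference on a quantized LLM.
* **langchain**: A Python library for composing LLM applications such as RAG pipelines and agents.
* **ChromaDB**: A lightweight vector database, used here to index documents for the RAG examples.
* **sentence_transformers**: Provides the embedding model used by `HuggingFaceEmbeddings` later in this notebook.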

In [None]:
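!pip install langchain==0.0.250 chromadb==0.4.4
!pip install sentence_transformers
# Build ctransformers from source with CUDA (cuBLAS) support for the Colab GPU
!CT_CUBLAS=1 pip install ctransformers --no-binary ctransformers
# On macOS, build with Metal support instead:
#!CT_METAL=1 pip install ctransformers --no-binary ctransformers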

Collecting langchain==0.0.250
  Downloading langchain-0.0.250-py3-none-any.whl (1.4 MB)
Collecting chromadb==0.4.4
  Downloading chromadb-0.4.4-py3-none-any.whl (402 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7 (from langchain==0.0.250)
  Downloading dataclasses_json-0.5.14-py3-none-any.whl (26 kB)
Collecting langsmith<0.1.0,>=0.0.11 (from langchain==0.0.250)
  Downloadin
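
After the install finishes, it can help to confirm which versions actually ended up in the environment, since the pins above matter for the imports used later. The following is a small sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

wanted = ["langchain", "chromadb", "sentence-transformers", "ctransformers"]
for name, version in installed_versions(wanted).items():
    print(f"{name}: {version or 'not installed'}")
```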

### Step 3: Set Up a Self-Hosted LLM

Model - [Fine-tuned OpenLLaMA 3B model with 4-bit quantization](https://huggingface.co/TheBloke/orca_mini_3B-GGML).

Here's how you can get started:

1. Use the `AutoModelForCausalLM` class from the `ctransformers` library to download and initialize the Llama model.
2. We use Orca Mini 3B, a fine-tuned version of OpenLLaMA, an open-source model from UC Berkeley. It is available in 3B, 7B, and 13B sizes with 4-bit quantization.

Config:

* `max_new_tokens`: The maximum number of new tokens to generate for a prompt.
* `repetition_penalty`: How strongly repeated tokens are penalized during generation.
* `temperature`: How random the sampling is. Typical values range from 0 to 2: values near 0 give focused, logical answers; values near 2 give more varied, creative ones.
* `stream`: Whether to stream each token as soon as it is generated.
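
To build intuition for `temperature` and `repetition_penalty`, here is a simplified, library-independent sketch of how such settings typically reshape the model's token scores before sampling. It is an illustration under our own assumptions (toy scores, one common penalty formulation), not the exact code that ctransformers runs.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]  # temperature must be > 0 here
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """Make already-generated tokens less likely (one common formulation)."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Token 0 was already generated, so its score is pushed down.
print(apply_repetition_penalty(logits, seen_token_ids=[0], penalty=1.1))
```

At a low temperature almost all probability mass lands on the top-scoring token, which is why the notebook's setting of 0.1 gives near-deterministic answers.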

In [None]:
from ctransformers import AutoModelForCausalLM
config = {'max_new_tokens': 256, 'repetition_penalty': 1.1,
          'temperature': 0.1, 'stream': True}

llm = AutoModelForCausalLM.from_pretrained(
      "TheBloke/orca_mini_3B-GGML", model_file='orca-mini-3b.ggmlv3.q4_0.bin',
      model_type="llama",
      gpu_layers=86, #110 for 7b, 130 for 13b
      **config
)

Fetching 0 files: 0it [00:00, ?it/s]

Fetching 1 files:   0%|          | 0/1 [00:00<?, ?it/s]

orca-mini-3b.ggmlv3.q4_0.bin:   0%|          | 0.00/1.93G [00:00<?, ?B/s]

### Interact with LLM

If you're new to using Large Language Models (LLMs), don't worry! Here's a simple way to interact with the AI:

1. **Create a Prompt**: The code sets up a funny AI assistant that awaits your instruction.
2. **Ask for a Joke**: The prompt `"Write one line joke."` is already formatted for you. You can replace it with any other instruction you like.
3. **See the AI's Response**: Run the code, and the AI will respond.

Feel free to experiment with different instructions to see how the AI responds.

In [None]:
prompt = "Write one line joke."
template = f"""
You are a funny AI assistant.
### Instruction:
{prompt}
### Response:"""

tokens = llm.tokenize(template)
for token in llm.generate(tokens):
    print(llm.detokenize(token), end='', flush=True)

 Why don't scientists trust atoms? Because they make up everything.
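
The prompt layout used in the cell above can be wrapped in a small helper so that different instructions and personas are easy to try. This is a sketch; `build_prompt` is a hypothetical name, and the layout simply mirrors the template from this notebook.

```python
def build_prompt(instruction, system=""):
    """Format an instruction in the '### Instruction / ### Response' layout."""
    return f"\n{system}\n### Instruction:\n{instruction}\n### Response:"

template = build_prompt("Write one line joke.", system="You are a funny AI assistant.")
print(template)
```

The resulting string can be passed to `llm.tokenize(...)` exactly like `template` in the cell above.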

### Closed book knowledge test:
Before adding Retrieval Augmented Generation, check whether the model on its own can give an appropriate answer to our prompt.



In [None]:
prompt = "What is Aleph.js?"
template = f"""
If you don't know the answer, say I don't know.
### Instruction:
{prompt}
### Response:"""

tokens = llm.tokenize(template)
for token in llm.generate(tokens):
    print(llm.detokenize(token), end='', flush=True)

 I'm sorry, but I cannot provide a clear answer to your question as Aleph.js is not a well-known or commonly used programming language. It appears to be an open-source JavaScript engine developed by Facebook in 2013, but there is very little information available about it online. Can you please provide more context or details about the question you are asking?

---

### Load the sample structured tech radar data

Download the sample tech radar data and load the CSV file.

In [None]:
!wget https://raw.githubusercontent.com/rkandas/aibootcampdata/main/Radar_datatable.csv
from langchain.document_loaders.csv_loader import CSVLoader
loader = CSVLoader(file_path='./Radar_datatable.csv')
data = loader.load()
print(data[:10])

--2023-11-25 01:45:19--  https://raw.githubusercontent.com/rkandas/aibootcampdata/main/Radar_datatable.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 152029 (148K) [text/plain]
Saving to: ‘Radar_datatable.csv’


2023-11-25 01:45:19 (53.1 MB/s) - ‘Radar_datatable.csv’ saved [152029/152029]

[Document(page_content='Name: Aleph.js\nURL: https://www.thoughtworks.com/radar/languages-and-frameworks/aleph-js\nVolume: Oct-22\nRing: Assess\nQuadrant: Languages & Frameworks\nshorturl: https://tinyurl.com/2pawh2eu\nDescription: There is certainly no shortage of frameworks to build web applications in JavaScript/ TypeScript . We\'ve featured many of them in the Radar, but what sets  Aleph.js  apart in this crowded field is that it\'s built to run on  Deno , the new
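
As the output shows, each CSV row becomes one document whose text is a list of `column: value` lines, with the row number and source file kept as metadata. The sketch below imitates that behaviour with the standard library only, to make the transformation concrete. It is an approximation and not `CSVLoader`'s actual code; plain dictionaries stand in for LangChain's `Document` class, and the sample rows are abbreviated.

```python
import csv
import io

def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with 'page_content' and 'metadata'."""
    documents = []
    for index, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"row": index, "source": source}}
        )
    return documents

sample = (
    "Name,Ring,Quadrant\n"
    "Aleph.js,Assess,Languages & Frameworks\n"
    "BentoML,Assess,Languages & Frameworks\n"
)
for doc in rows_to_documents(sample, source="./sample.csv"):
    print(doc)
```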

---

## Introduction to Chroma DB and Embeddings

ChromaDB is a vector database designed to store embeddings and search them by similarity. This section introduces some key terms and concepts:

* **ChromaDB**: A vector database that stores text documents together with their embeddings so they can be retrieved by similarity. Learn more about ChromaDB [here](https://docs.trychroma.com/getting-started).

* **Embeddings**: Numeric vectors that represent data such as text and images (with audio and video as future prospects), arranged so that similar items end up close together. Discover more about embeddings [here](https://docs.trychroma.com/embeddings).

* **HuggingFaceEmbeddings**: A class in the LangChain library that converts text documents into embeddings. You can find additional details on how to get started with embeddings on the [HuggingFace blog](https://huggingface.co/blog/getting-started-with-embeddings).
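
To see what "similar items end up close together" means in practice, here is a toy sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are made up for illustration; real embedding models produce vectors with hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, chosen so related phrases point in similar directions.
toy_embeddings = {
    "web framework": [0.9, 0.1, 0.0],
    "javascript library": [0.8, 0.3, 0.1],
    "model serving": [0.1, 0.2, 0.9],
}
reference = toy_embeddings["web framework"]
for text, vector in toy_embeddings.items():
    print(f"{text}: {cosine_similarity(reference, vector):.3f}")
```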

### Example: Converting Text to Embeddings with Chroma DB
In the following example, we use the `HuggingFaceEmbeddings` class to convert the CSV data loaded in the previous step into embeddings. The embedded documents are then loaded into ChromaDB for further use.

Together, ChromaDB and LangChain cover the embed-and-index step in a few lines of code, which gives us the foundation for the retrieval examples that follow.


In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings()
db = Chroma.from_documents(data, embeddings)

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

.gitattributes:   0%|          | 0.00/1.18k [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

data_config.json:   0%|          | 0.00/39.3k [00:00<?, ?B/s]

pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

train_script.py:   0%|          | 0.00/13.1k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

### Querying VectorDB with Text Data Embeddings

Once the text embeddings are loaded into a vector database (such as ChromaDB), you can start querying it. The query is embedded in the same way as the documents, and the database returns the documents whose embeddings are closest to it.

**Note**: The results returned by the query might not be exact matches to the query itself. Instead, they are its "nearest neighbors": the documents that, in the space of the embeddings, are most likely to carry the answer. They may provide relevant information or context that closely aligns with the query's intent. By default, `similarity_search` returns the four closest documents.

This makes querying more flexible than keyword matching, especially in applications where exact matches would miss relevant data.
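
Conceptually, a nearest-neighbor query ranks every stored vector by its distance to the query vector and keeps the closest few. The sketch below shows that idea with made-up vectors and a brute-force scan. The squared Euclidean distance and the names `squared_l2` and `nearest_neighbors` are assumptions for illustration; a real vector database uses an approximate index and whichever metric the collection is configured with.

```python
import heapq

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbors(query_vector, index, k=4):
    """Return the k (distance, doc_id) pairs closest to query_vector."""
    scored = (
        (squared_l2(query_vector, vector), doc_id)
        for doc_id, vector in index.items()
    )
    return heapq.nsmallest(k, scored)

# Hypothetical 3-dimensional embeddings keyed by document id.
index = {
    "aleph-js": [0.9, 0.1, 0.0],
    "bentoml": [0.1, 0.2, 0.9],
    "deno": [0.8, 0.2, 0.1],
    "dataops-live": [0.2, 0.9, 0.3],
}
query_vector = [0.9, 0.1, 0.05]
for distance, doc_id in nearest_neighbors(query_vector, index, k=2):
    print(doc_id, round(distance, 4))
```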

In [None]:
query = "What is Aleph.js?"
docs = db.similarity_search(query)
print(len(docs))
print(docs)

4
[Document(page_content='Name: Aleph.js\nURL: https://www.thoughtworks.com/radar/languages-and-frameworks/aleph-js\nVolume: Oct-22\nRing: Assess\nQuadrant: Languages & Frameworks\nshorturl: https://tinyurl.com/2pawh2eu\nDescription: There is certainly no shortage of frameworks to build web applications in JavaScript/ TypeScript . We\'ve featured many of them in the Radar, but what sets  Aleph.js  apart in this crowded field is that it\'s built to run on  Deno , the new server-side run time created by the original developer of  Node . This puts Aleph.js on a modern foundation that addresses several shortcomings and problems with Node. Aleph.js is still new " it\'s approaching the 1.0 release at the time of writing " but it already offers a solid developer experience, including hot module replacement. With Deno now way past its  1.0 release , this is a modern choice for projects that can take the risk.', metadata={'row': 0, 'source': './Radar_datatable.csv'}), Document(page_content="Nam

## RAG - Retrieval Augmented Generation

### Integrating VectorDB with LLMs for RAG

RAG combines the retrieval capabilities of a vector database with an LLM, so that answers are grounded in documents the model was never trained on. Here's how it works:

1. **Query VectorDB**: First, a query is run on the VectorDB, just like in the previous example. This retrieves the documents most closely related to the query, known as the "nearest neighbors."

2. **Pass Documents to LLM**: The retrieved documents are then passed as context to the LLM. The LLM uses this additional information to generate a more informed and relevant response.

3. **Generate Response**: With the added context from the filtered documents, the LLM crafts a response to the query, creating a more nuanced and precise answer.

Below is an example of passing the VectorDB results to the LLM to generate the final response:

In [None]:
prompt = f"""Use the information provided in context to answer the question.
If answer is not available in the given context,
just say that you don't know, don't try to make up an answer.
Context: {docs[0]}
Question: {query}
Answer:"""

tokens = llm.tokenize(prompt)
result = ""
for token in llm.generate(tokens):
  result += llm.detokenize(token)

print(result)


 Aleph.js is a JavaScript framework built to run on Deno, a new server-side run time created by the original developer of Node. It offers a solid developer experience, including hot module replacement, and is approaching the 1.0 release at the time of writing.
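
The cell above passes only the first retrieved document as context. A possible variation is to join the text of several documents, leaving out the metadata. The sketch below assumes document objects expose a `page_content` attribute, as in the printed output earlier; the `Doc` dataclass is a hypothetical stand-in so the example runs without LangChain, and `max_docs` is our own parameter.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def build_rag_prompt(question, docs, max_docs=2):
    """Build the RAG prompt from the text of the first max_docs documents."""
    context = "\n\n".join(doc.page_content for doc in docs[:max_docs])
    return (
        "Use the information provided in context to answer the question.\n"
        "If answer is not available in the given context,\n"
        "just say that you don't know, don't try to make up an answer.\n"
        f"Context: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )

sample_docs = [Doc("Name: Aleph.js\nRing: Assess"), Doc("Name: BentoML\nRing: Assess")]
print(build_rag_prompt("What is Aleph.js?", sample_docs))
```

Adding more documents gives the model more to work with but also lengthens the prompt, so the number of documents has to stay within what the model's context window can hold.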


In [None]:
query = "What is the use of Bento ML?"
docs = db.similarity_search(query)
print(len(docs))
print(docs)

4
[Document(page_content='Name: BentoML\nURL: https://www.thoughtworks.com/radar/languages-and-frameworks/bentoml\nVolume: Oct-22\nRing: Assess\nQuadrant: Languages & Frameworks\nshorturl: https://tinyurl.com/2l8syruf\nDescription: BentoML  is a python-first framework for serving machine-learning models in production at scale. The models it provides are agnostic of their environment; all model artifacts, source code and dependencies are encapsulated in a self-contained format called Bento. It\'s like having your model "as a service." Think of BentoML as the  Docker  for ML models: It generates VM images with pre-programmed APIs ready for deployment and includes features that make it easy to test these images. BentoML can help speed up the initial development effort by easing the start of projects which is why we included it in Assess.', metadata={'row': 8, 'source': './Radar_datatable.csv'}), Document(page_content='Name: DataOps.live\nURL: https://www.thoughtworks.com/radar/platforms/d

In [None]:
prompt = f"""Use the information provided in context to answer the question.
If answer is not available in the given context,
just say that you don't know, don't try to make up an answer.
Context: {docs[0]}
Question: {query}
Answer:"""
tokens = llm.tokenize(prompt)
result = ""
for token in llm.generate(tokens):
  result += llm.detokenize(token)

print(result)

 Bento ML is a python-first framework for serving machine-learning models in production at scale.
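
Since the same retrieve, prompt, and generate steps repeat for every question, they can be folded into one function. This is a sketch with stub functions in place of the vector database and the model, so it runs anywhere; `answer_with_rag`, `fake_retrieve`, and `fake_generate` are hypothetical names.

```python
def answer_with_rag(question, retrieve, generate):
    """Retrieve context for the question, build a prompt, and generate an answer."""
    contexts = retrieve(question)
    if not contexts:
        return "I don't know."
    prompt = (
        "Use the information provided in context to answer the question.\n"
        f"Context: {contexts[0]}\n"
        f"Question: {question}\n"
        "Answer:"
    )
    return generate(prompt)

# Stub retriever and generator so the sketch runs without a model or database.
def fake_retrieve(question):
    return ["BentoML is a python-first framework for serving machine-learning models."]

def fake_generate(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"

print(answer_with_rag("What is the use of Bento ML?", fake_retrieve, fake_generate))
```

In this notebook, `retrieve` would wrap `db.similarity_search(...)` and `generate` would wrap the `llm.tokenize` / `llm.generate` / `llm.detokenize` loop from the cells above.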


## Task
1. In code cell 1 below, replace `<add your query here>` in the `prompt` variable with a question about the radar data, and query the LLM without giving it the context document.
2. Then replace `<add your query here>` in the `query` variable with the same question, and run cells 2 and 3 to see how effective RAG is.


In [None]:
prompt = "<add your query here>"  # replace the placeholder with your question
template = f"""
If you don't know the answer, say I don't know.
### Instruction:
{prompt}
### Response:"""

tokens = llm.tokenize(template)
for token in llm.generate(tokens):
    print(llm.detokenize(token), end='', flush=True)

In [None]:
query = "<add your query here>"  # replace the placeholder with the same question
docs = db.similarity_search(query)
print(len(docs))
print(docs)

In [None]:
prompt = f"""Use the information provided in context to answer the question.
If answer is not available in the given context,
just say that you don't know, don't try to make up an answer.
Context: {docs[0]}
Question: {query}
Answer:"""
tokens = llm.tokenize(prompt)
result = ""
for token in llm.generate(tokens):
  result += llm.detokenize(token)

print(result)

## Resources:
CTransformers - https://github.com/marella/ctransformers

GGML Models - https://huggingface.co/models?search=ggml

