# RAG Implementation from Scratch

In this project, I will create a simple RAG implementation from scratch using two Mistral models: the chat completion model **mistral-large-latest** and the embedding model **mistral-embed**.

In [33]:
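import os  # read the key from the environment instead of hard-coding a secret in the notebook
api_key = os.environ["MISTRAL_API_KEY"]  # the variable name MISTRAL_API_KEY is an assumption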

## Libraries Used

**Requests Library**: We will use the requests library to fetch Paul Graham's essay data from the web. This library allows us to easily send HTTP requests and retrieve content, which will serve as the knowledge base for our Retrieval-Augmented Generation (RAG) application.

**NumPy**: NumPy is a widely used Python library for numerical computing. We use it to collect the embedding vectors into arrays, which is the form in which they are passed to FAISS.

**FAISS Vector Database**: We will use FAISS (Facebook AI Similarity Search), a library for similarity search over dense vectors, as the vector store for the embeddings generated from our data. It supports fast and scalable similarity search, which makes it a good fit for the retrieval step of our RAG application.

In [35]:
from mistralai import Mistral
import requests
import numpy as np

import faiss
import os
from getpass import getpass

client = Mistral(api_key=api_key)

## Data Fetching

Fetch the essay text from the web using **requests** and save it to the file **essay.txt**.

In [36]:
response = requests.get('https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt')
text = response.text

In [37]:
# A context manager closes the file even if the write fails
with open('essay.txt', 'w', encoding='utf-8') as f:
    f.write(text)

## Chunking

In a Retrieval-Augmented Generation (RAG) system, breaking a document into smaller pieces is important. This step, called **chunking**, makes it easier to find and pull out the most relevant information later, during retrieval. For this example, we split the text by characters, grouping every 2048 characters into a single chunk. This gives us 37 chunks in total.

In [38]:
chunk_size = 2048
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
len(chunks)

37
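To make the arithmetic behind the chunk count concrete, here is a minimal sketch of the same slicing on a toy string. The helper name `chunk_by_chars` and the sample string are illustrative assumptions, not part of the notebook above.

```python
import math

def chunk_by_chars(text, chunk_size):
    """Split text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

sample = "abcdefghij"  # 10 characters, a stand-in for the essay text
pieces = chunk_by_chars(sample, 4)
print(pieces)  # ['abcd', 'efgh', 'ij']

# The number of chunks is the text length divided by the chunk size, rounded up.
print(len(pieces) == math.ceil(len(sample) / 4))  # True
```

By the same rule, 37 chunks of 2048 characters means the essay is between 73,729 and 75,776 characters long, and the last chunk is usually shorter than the others.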

## Embedding

The next step is embedding, where we convert each text chunk into a numeric vector using Mistral AI’s mistral-embed model, enabling efficient retrieval in our RAG system.

Embeddings are numeric vectors that represent text in a way computers can understand, placing similar meanings closer together in a virtual space. In a Retrieval-Augmented Generation (RAG) system, we need them to quickly find the most relevant text chunks by comparing their vectors to a question’s vector. This helps the system retrieve meaningful information efficiently and pass it to the AI for generating answers.
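The idea that "similar meanings sit closer together" can be illustrated with a toy example. The three vectors below are hand-made assumptions for illustration only; real mistral-embed vectors are much longer and come from the API.

```python
import numpy as np

# Hand-made 3-dimensional vectors, chosen so that "cat" and "kitten" point
# in nearly the same direction while "invoice" points elsewhere.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.0])
invoice = np.array([0.0, 0.1, 0.9])

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return float(np.sum((a - b) ** 2))

# The related pair is much closer than the unrelated pair.
print(squared_l2(cat, kitten) < squared_l2(cat, invoice))  # True
```

Retrieval relies on exactly this comparison: the chunks whose vectors are closest to the question's vector are treated as the most relevant.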

In [39]:
def get_text_embedding(input_text):
    """Return the mistral-embed vector for a single piece of text."""
    embeddings_batch_response = client.embeddings.create(
        model="mistral-embed",
        inputs=input_text
    )
    return embeddings_batch_response.data[0].embedding

Here, a timer is added to create a delay between requests, helping us stay within the Mistral API's request limit.

In [40]:
import time

text_embeddings = []
for chunk in chunks:
    text_embeddings.append(get_text_embedding(chunk))
    time.sleep(2)  # Wait 2 seconds between requests; adjust as needed
text_embeddings = np.array(text_embeddings, dtype="float32")  # FAISS works with float32 vectors
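A fixed delay is the simplest way to respect a rate limit. As an alternative, here is a hedged sketch of a generic retry helper that waits longer after each failure. The name `call_with_retry` and its parameters are hypothetical, and it catches every `Exception` only for brevity; in practice you would narrow this to the rate-limit error raised by the client you use.

```python
import time

def call_with_retry(fn, *args, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args); on an exception, wait and retry with a doubling delay."""
    for attempt in range(max_attempts):
        try:
            return fn(*args)
        except Exception:
            # Out of attempts: re-raise the last error to the caller.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Demo with a fake function that fails twice, then succeeds.
# The waits are recorded in a list instead of actually sleeping.
calls = {"n": 0}

def flaky(x):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return x * 2

waits = []
result = call_with_retry(flaky, 21, sleep=waits.append)
print(result, waits)  # 42 [1.0, 2.0]
```

With the notebook's function, the call would look like `call_with_retry(get_text_embedding, chunk)`.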

## Storing Document Embeddings

Storing vector embeddings in a vector database is a common practice for efficient processing and retrieval. For this project we use the FAISS vector database, as it is freely available.



In [41]:
d = text_embeddings.shape[1]
index = faiss.IndexFlatL2(d)
index.add(text_embeddings)

## Create embeddings for a question

Whenever a user asks a question, we also need to create an embedding for the question using the same embedding model as before. This enables us to perform a vector similarity search.

In [42]:
question = "What were the two main things the author worked on before college?"
question_embeddings = np.array([get_text_embedding(question)], dtype="float32")  # shape (1, d), float32 for FAISS

## Retrieval

After creating and storing the embeddings in the vector database, we retrieve the most relevant chunks by performing a similarity search with the question's embedding. The search returns the top k chunks (here k=2), which are then combined with the user's question to form a prompt from which the LLM (mistral-large-latest) generates a response.

In [43]:
D, I = index.search(question_embeddings, k=2) # distance, index
retrieved_chunk = [chunks[i] for i in I.tolist()[0]]
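To show what the search step computes, here is a small NumPy sketch of exact nearest-neighbour search by squared L2 distance, which is the metric `IndexFlatL2` uses. The function `flat_l2_search` and the toy vectors are illustrative assumptions; this is not how FAISS is implemented internally.

```python
import numpy as np

def flat_l2_search(vectors, queries, k):
    """Exact top-k search by squared L2 distance (brute force)."""
    # Pairwise differences, shape (n_queries, n_vectors, dim)
    diffs = queries[:, None, :] - vectors[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)          # shape (n_queries, n_vectors)
    idx = np.argsort(dists, axis=1)[:, :k]      # k closest vectors per query
    return np.take_along_axis(dists, idx, axis=1), idx

vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")

D_demo, I_demo = flat_l2_search(vectors, queries, k=2)
print(I_demo.tolist())  # [[1, 0]]
```

As in the FAISS call, both results have one row per query and `k` columns: the first holds distances and the second holds the row numbers of the stored vectors, which is why `I.tolist()[0]` can index directly into `chunks`.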

In [44]:
prompt = f"""
Context information is below.
---------------------
{retrieved_chunk}
---------------------
Given the context information and not prior knowledge, answer the query. 
Depending on the question and context, add explanation or example where necessary.
Query: {question}
Answer:
"""
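Because `retrieved_chunk` is a Python list, formatting it directly into an f-string inserts the list's printed form, brackets and quotes included. One possible variation, sketched here under the assumption that plain text separated by blank lines is preferred, is to join the chunks first. The helper `build_prompt` is hypothetical and reuses the prompt wording above.

```python
def build_prompt(retrieved_chunks, question):
    """Assemble the RAG prompt from a list of chunk strings and a question."""
    context = "\n\n".join(retrieved_chunks)  # plain text, one blank line between chunks
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Given the context information and not prior knowledge, answer the query.\n"
        "Depending on the question and context, add explanation or example where necessary.\n"
        f"Query: {question}\n"
        "Answer:\n"
    )

demo_prompt = build_prompt(["First chunk.", "Second chunk."], "What is this about?")
print(demo_prompt)
```

In the notebook this would be called as `build_prompt(retrieved_chunk, question)`.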

## Generating Response

We can now pass the prompt to Mistral's chat completion endpoint to generate a response. As the example below shows, our simple RAG app answers the question correctly, but there is still significant room for improvement.

In [46]:
def run_mistral(user_message, model="mistral-large-latest"):
    messages = [
        {
            "role": "user", "content": user_message
        }
    ]
    chat_response = client.chat.complete(
        model=model,
        messages=messages
    )
    return chat_response.choices[0].message.content

run_mistral(prompt)

'Before college, the author worked on writing and programming. Specifically, they wrote short stories and worked on programming using an IBM 1401 computer.'