The goal here is to build a simple RAG (retrieval-augmented generation) project to get a better understanding of how RAG works and how I can iterate on it to build larger-scale projects.

 
## Recollection of how RAG works

My understanding of RAG is that documents are split into pieces of text called chunks. These chunks are then embedded, meaning each one is converted into a vector in a high-dimensional space (384 dimensions for the model used below). The vectors are stored in an index, and each vector's position in that index is used to look up the original text chunk again.

Once the documents are embedded, the user passes in a query. The query is embedded the same way and compared against the stored chunk vectors, and the k most similar chunks are retrieved.

The retrieved text chunks are then passed into the context of the LLM and used to generate a response.

There are more advanced and more efficient versions of this approach, but we will use the simple version to better understand how the pieces fit together.
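
The flow described above can be sketched end to end with a toy example. This is only an illustration: `toy_embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, and the chunks and query are made up for the demo.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    # Hypothetical stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    # Embed the query, score every chunk against it, keep the k best.
    query_vector = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, toy_embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "supervised learning uses labeled data",
    "unsupervised learning finds hidden patterns",
    "bananas are yellow",
]
query = "which learning uses labeled data"

top_chunks = retrieve(query, chunks, k=2)
# The retrieved chunks become the context that is handed to the LLM.
prompt = "Context: " + "\n\n".join(top_chunks) + "\nQuestion: " + query + "\nAnswer:"
print(prompt)
```

The real project below swaps `toy_embed` for a sentence-transformers model and the sorted list for a FAISS index, but the steps stay the same.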


### Step 0: Gather the documents.

For this example we will just use 2 text files, so we won't necessarily need any special loading logic. In other projects, documents might need to be loaded through API calls or other means.

### Step 1: Clean the data

When working with text documents, it's important to clean the data to avoid irrelevant or damaged text chunks.

We do this with basic Python string functions and the regular expression library `re`, which we use to take out non-ASCII characters.

In [1]:
import re

def clean_data(text: str) -> str:
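    # Remove every run of whitespace (spaces, tabs, newlines).
    # Note: replacing with '' also deletes the spaces between words;
    # use ' ' as the replacement to keep word boundaries.
    text = re.sub(r'\s+', '', text)

    # Remove all non-ASCII characters.
    text = re.sub(r'[^\x00-\x7F]+', '', text)

    # Strip any remaining leading or trailing whitespace.
    return text.strip()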

### Step 2: Load the text

Now we will create a function that loads the documents into a list, one entry per file.


In [2]:
import os
#allows for us to read other directories.

def load_data(folder_path):
    #creates an array of documents that would match our txt files.
    docs = []
    for file in os.listdir(folder_path):
        if file.endswith(".txt"):
            with open(os.path.join(folder_path,file),'r',encoding="utf-8") as f:
                docs.append(f.read())
    return docs


In [3]:
print(load_data("../tutorial-project/documents/"))

['In this type of machine learning (supervised), the model is trained on labeled data. \nIn simple terms, every training example has an input and an associated output label. \nThe objective is to build a model that generalizes well on unseen data. \nCommon algorithms include:\n- Linear Regression\n- Decision Trees\n- Random Forests\n- Support Vector Machines\n\nClassification and regression tasks are performed in supervised machine learning.\nFor example: spam detection (classification) and house price prediction (regression).\nThey can be evaluated using accuracy, F1-score, precision, recall, or mean squared error.', 'In this type of machine learning (unsupervised), the model is trained on unlabeled data. \nPopular algorithms include:\n- K-Means\n- Principal Component Analysis (PCA)\n- Autoencoders\n\nThere are no predefined output labels; the algorithm automatically detects \nunderlying patterns or structures within the data.\nTypical use cases include anomaly detection, customer clu

As you can see, it loads all of the documents, and we are now ready to begin the cleaning step.

### Step 3: Prepare the data

In [4]:
#no imports since this is one notebook

def prepare_data(folder_path="../tutorial-project/documents/"):
    rawData = load_data(folder_path)
    cleanedData = [clean_data(doc) for doc in rawData]
    print(f"prepared {len(cleanedData)} documents")

    return cleanedData


In [5]:
print(prepare_data())

prepared 2 documents
['Inthistypeofmachinelearning(supervised),themodelistrainedonlabeleddata.Insimpleterms,everytrainingexamplehasaninputandanassociatedoutputlabel.Theobjectiveistobuildamodelthatgeneralizeswellonunseendata.Commonalgorithmsinclude:-LinearRegression-DecisionTrees-RandomForests-SupportVectorMachinesClassificationandregressiontasksareperformedinsupervisedmachinelearning.Forexample:spamdetection(classification)andhousepriceprediction(regression).Theycanbeevaluatedusingaccuracy,F1-score,precision,recall,ormeansquarederror.', 'Inthistypeofmachinelearning(unsupervised),themodelistrainedonunlabeleddata.Popularalgorithmsinclude:-K-Means-PrincipalComponentAnalysis(PCA)-AutoencodersTherearenopredefinedoutputlabels;thealgorithmautomaticallydetectsunderlyingpatternsorstructureswithinthedata.Typicalusecasesincludeanomalydetection,customerclustering,anddimensionalityreduction.Performancecanbemeasuredqualitativelyorwithmetricssuchassilhouettescoreandreconstructionerror.']


As the output above shows, preparing the data removes all whitespace from the text and returns each document as its own list entry. Note that the words now run together, because `clean_data` replaces every whitespace run with an empty string rather than a single space.
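
The prepared output above has no spaces between words because `clean_data` replaces each whitespace run with an empty string. A variant that keeps word boundaries might look like the sketch below. The name `clean_data_keep_spaces` and the sample string are made up for illustration; the only behavioral change is the replacement string.

```python
import re

def clean_data_keep_spaces(text: str) -> str:
    # Collapse every run of whitespace (spaces, tabs, newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    # Drop non-ASCII characters, as the original function does.
    text = re.sub(r"[^\x00-\x7F]+", "", text)
    # Strip leading and trailing whitespace.
    return text.strip()

sample = "In this type of machine learning (supervised), \nthe model is trained on labeled data. "
cleaned = clean_data_keep_spaces(sample)
print(cleaned)
# In this type of machine learning (supervised), the model is trained on labeled data.
```

Keeping the spaces matters for the later steps, since both the text splitter and the embedding model work best on text with intact word boundaries.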

With the data ready for use, the next steps prepare for retrieval: chunking the text, embedding the chunks, and storing the embeddings in a vector index. Let's get started with chunking the text we have now.



### Step 3: Chunking the text.

We will use a LangChain library (`langchain_text_splitters`) to split the text into chunks of up to 500 characters with a 100-character overlap.
The 100-character overlap helps preserve meaning that would otherwise be cut off at chunk boundaries.
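
To make the size and overlap numbers concrete, here is a minimal fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraphs, newlines, and spaces); `sliding_window_chunks` is a hypothetical helper that only illustrates how consecutive chunks share characters.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    # Each new chunk starts (chunk_size - chunk_overlap) characters after the last.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

text = "abcdefghij" * 120  # 1200 characters of dummy text
chunks = sliding_window_chunks(text)

print([len(c) for c in chunks])            # [500, 500, 400]
print(chunks[0][-100:] == chunks[1][:100]) # True: the last 100 chars repeat
```

The final 100 characters of one chunk reappear at the start of the next, so a sentence that straddles a boundary is still seen whole in at least one chunk.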

We are now going to pip install the LangChain text splitter package.

In [6]:
%pip install langchain_text_splitters

Note: you may need to restart the kernel to use updated packages.


In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_docs(documents, chunk_size=500, chunk_overlap=100):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size = chunk_size,
        chunk_overlap = chunk_overlap
    )
    chunks = splitter.create_documents(documents)
    print(f"Total chunks created {len(chunks)}")
    return chunks

### Step 4: Embedding Chunks

Now that we have chunked the text, we can embed these chunks as vectors for easy retrieval.

First we are going to install the sentence-transformers library.

In [8]:
%pip install sentence-transformers



In [9]:
from sentence_transformers import SentenceTransformer  # model wrapper used to embed text chunks
import numpy as np  # multi-dimensional array library

def get_embeddings(text_chunks):
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    print(f"embedding a total of {len(text_chunks)} chunks: ")

    embeddings = model.encode(text_chunks,show_progress_bar=True)
    print(f"the shape of the embedding is {embeddings.shape} ")

    return np.array(embeddings)



The shape of the embedding array is (number of chunks, embedding dimension), so the second value is the dimension of the vector space the embeddings live in.

The model we use in this practice project is `sentence-transformers/all-MiniLM-L6-v2`, which produces 384-dimensional embeddings.

Embedding dimensions for common models typically range from 384 to 1024.
Larger dimensions can capture more detail and often improve retrieval quality, while smaller dimensions need less memory and are faster to compute.
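
As a rough illustration of the memory side of that trade-off, the sketch below estimates how much space the raw vectors take at different dimensions. It assumes 4 bytes per value (float32) and a made-up corpus of 100,000 chunks; `embedding_memory_mib` is a hypothetical helper, and index overhead is ignored.

```python
def embedding_memory_mib(num_chunks, dimensions, bytes_per_value=4):
    # Raw storage for the vectors only: chunks x dimensions x bytes per value.
    return num_chunks * dimensions * bytes_per_value / (1024 ** 2)

for dims in (384, 768, 1024):
    size = embedding_memory_mib(100_000, dims)
    print(f"{dims} dimensions: {size:.1f} MiB")
```

Storage grows linearly with the dimension, and so does the work a brute-force search has to do for each query.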

### Step 4: Vectorize your embeddings

We are going to use FAISS, an open-source library from Facebook (Meta) AI Research, to store the embeddings in a vector index.

In [10]:
%pip install faiss-cpu





With this library we are going to build the vector index from the embeddings and return it.

In [11]:
import faiss
import numpy as np
import pickle

def build_faiss_index(embeddings, save_path="faiss_index"):
    dimensions = embeddings.shape[1]
    print(f"Building FAISS index with dimensions: {dimensions} ")

    index = faiss.IndexFlatL2(dimensions)
    index.add(embeddings.astype('float32'))

    faiss.write_index(index, f"{save_path}.index")
    print(f"Saved faiss index to {save_path}.index")

    return index
    

When we create an index, we are creating a data structure that makes it easy to find specific embeddings.

This example uses `IndexFlatL2`, which breaks down as follows:

Index -> a data structure for looking things up

Flat -> no clustering and no graphs; every stored vector is compared against the query

L2 -> Euclidean distance, which measures how close two vectors are in space
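
Conceptually, a flat L2 search is a brute-force loop over every stored vector. The pure-Python sketch below shows that idea on tiny made-up 2-dimensional vectors; `flat_l2_search` is a hypothetical helper, not FAISS code. It returns squared distances, which, as far as I know, is also what `IndexFlatL2` reports.

```python
def flat_l2_search(vectors, query, top_k=3):
    # Compare the query against every stored vector (the "flat" part).
    scored = []
    for position, vector in enumerate(vectors):
        # Squared Euclidean distance (the "L2" part, without the square root).
        squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
        scored.append((squared_distance, position))
    # Smallest distance first, then keep the top_k nearest.
    scored.sort()
    nearest = scored[:top_k]
    distances = [d for d, _ in nearest]
    positions = [p for _, p in nearest]
    return distances, positions

stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
distances, positions = flat_l2_search(stored, [0.9, 0.1], top_k=2)
print(positions)  # [1, 0]
```

The returned positions are what tie a search result back to its text chunk, which is why the chunks need to be saved in the same order as the embeddings.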


#### Storing Metadata
We are now going to store the text chunks as metadata, in the same order as their embeddings in the index, so they can be looked up during the retrieval step.

In [12]:
def store_metadata(text_chunks, path="faiss_metadata.pkl"):
    with open(path,"wb") as f:
        pickle.dump(text_chunks,f)
    print(f"saved text metaData to {path}")

### Step 5: Retrieve relevant information

To decide which information is relevant, we first need to know what is being asked. This works by taking a user query, converting it to a vector, and then comparing that vector to the vectors of the previously processed text chunks.

This is called a similarity search.

We need to load the vector index (FAISS) and the metadata (pickle file).


In [13]:
import faiss
import pickle
import numpy as np
from sentence_transformers import SentenceTransformer

def load_faiss_index(index_path="faiss_index.index"):
    print("loading FAISS index")
    return faiss.read_index(index_path)

def load_metadata(metadata_path="faiss_metadata.pkl"):
    print("Loading MetaData")
    with open(metadata_path,"rb") as f:
        return pickle.load(f)
        

Now that we have the metadata and the vector index, how do we use them with the query to load the relevant chunks?

We are going to embed the query with the same model we used to embed the text chunks.

Then we use the index to return the top_k most relevant chunks.
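
One detail worth guarding against when mapping search results back to text: if `top_k` is larger than the number of stored vectors, FAISS fills the unused result slots with `-1`, and in Python `text_chunks[-1]` silently returns the last chunk. The sketch below shows one way to filter those slots out. `chunks_for_indices` and the sample data are hypothetical, and the index row is hard-coded rather than produced by a real search.

```python
def chunks_for_indices(index_row, text_chunks):
    # Keep only positions that point at a real chunk; skip unfilled (-1) slots.
    return [text_chunks[i] for i in index_row if 0 <= i < len(text_chunks)]

text_chunks = ["chunk A", "chunk B"]
index_row = [1, 0, -1]  # e.g. top_k=3 requested, but only 2 vectors stored

result = chunks_for_indices(index_row, text_chunks)
print(result)  # ['chunk B', 'chunk A']
```

With 4 chunks and `top_k=3`, as in this project, the problem does not show up, but it can with very small document sets.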


In [14]:
def retrieve_similar_chunks(query, index, text_chunks, top_k=3):
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    query_vector = model.encode([query]).astype('float32')

    print("Searching for chunks....")

    distances, indices = index.search(query_vector, top_k)
    print(f"return {top_k} releven text chunks")

    return [text_chunks[i] for i in indices[0]]

    

Now that we have all the relevant chunks, we are going to combine them into a single piece of context to pass to the LLM for generation.

In [15]:
def createContext(query,index,text_chunks,top_k=3):
    context_chunks = retrieve_similar_chunks(query, index, text_chunks, top_k)
    context = "\n\n".join(context_chunks)
    return context

### Step 6: Generate answer from LLM with context acquired

We are going to use a small local model called TinyLlama. Since this RAG use case is typical of chatbots, any other chat model could be used in its place.

In [16]:
%pip install accelerate 



In [17]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
# Load at the top, outside generate_answer()
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    dtype=torch.float16,  # use float16 if supported
    device_map="auto"
)
model.eval()


def generate_answer(query,top_k=3):
    index = load_faiss_index()
    text_chunks = load_metadata()
    

    context = createContext(query, index, text_chunks, top_k)
    prompt = f"""
        Context: {context}
        Question: {query}
        Answer: 
    """

    # Move the tokenized inputs to the same device as the model.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id)

    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # The decoded text includes the prompt, so keep only what follows "Answer:".
    answer = full_text.split("Answer:")[1].strip() if "Answer:" in full_text else full_text.strip()
    print("\nFinal Answer")
    print(answer)
    return answer



### Step 7: Create the RAG pipeline

Now all we have to do is merge everything into one single linear process.

In [20]:
def run_pipeline():
    print("Load and Clean Data: \n")
    documents = load_data("../tutorial-project/documents/")
    print(f"Loaded {len(documents)} documents: \n")

    print("Splitting text into chunks\n")
    chunks_as_text = split_docs(documents, 500, 100,)

    texts = [c.page_content for c in chunks_as_text]
    print(f"Created {len(texts)} text chunks. \n")

    print("Generating embeddings: \n")
    embeddings = get_embeddings(texts)

    print("storing embeddings in vector space\n")
    index = build_faiss_index(embeddings)
    store_metadata(texts)
    print("Stored Embeddings and MetaData Successfully\n")

    print("Retrieve similar chunks and generate answer")
    query = "Does unsupervised ML cover regression tasks"
    generate_answer(query)



In [21]:
run_pipeline()

Load and Clean Data: 

Loaded 2 documents: 

Splitting text into chunks

Total chunks created 4
Created 4 text chunks. 

Generating embeddings: 

embedding a total of 4 chunks: 



the shape of the embedding is (4, 384) 
storing embeddings in vector space

Building FAISS index with dimensions: 384 
Saved faiss index to faiss_index.index
saved text metaData to faiss_metadata.pkl
Stored Embeddings and MetaData Successfully

Retrieve similar chunks and generate answer
loading FAISS index
Loading MetaData
Searching for chunks....
return 3 releven text chunks

Final Answer
Yes, unsupervised ML can be used for regression tasks. 
For example, K-Means clustering can be used to group data points based on their similarity. 
The model can then predict the cluster membership of new data points. 
Autoencoders can be used to reconstruct the input data from its latent representation. 
These algorithms can be used for regression tasks such as clustering, dimensionality reduction, and feature selection.
