<a href="https://colab.research.google.com/github/brandon-baek/SLM-RAG-Project/blob/main/SLM_RAG_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **SLM RAG Project**

This project is a simple implementation of a small language model (SLM) enhanced with retrieval-augmented generation (RAG).

# Prerequisites

In [None]:
from transformers import pipeline
from IPython.display import clear_output
from sentence_transformers import SentenceTransformer
import numpy as np

llm = pipeline('text-generation', model='HuggingFaceTB/SmolLM-360M')
emb = SentenceTransformer('all-MiniLM-L6-v2')

clear_output()

print('Imports and Initializations complete.')

Imports and Initializations complete.


# Retrieval Augmented Generation

Cosine similarity is used to measure the semantic similarity between the embedding vectors of two texts. A value of 1 means the vectors point in the same direction, 0 means they are orthogonal (no similarity), and -1 means they point in opposite directions.

The formula for Cosine Similarity is: Similarity = (A · B) / (||A|| * ||B||)
- A · B is the dot product of the two vectors
- ||A|| and ||B|| represent the magnitude of each vector

This formula is implemented with NumPy.

```
def cosine_similarity(v1, v2):
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
```
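
As a quick sanity check of the three reference values mentioned above, the function can be run on small hand-made vectors. This is only an illustrative sketch: the vectors are made up for the example and are not embeddings from the model.

```python
import numpy as np

def cosine_similarity(v1, v2):
    # Same formula as above: dot product divided by the product of the magnitudes.
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 2.0, 3.0])

# Same direction, different length: similarity is approximately 1.
same_direction = cosine_similarity(a, 2 * a)

# Perpendicular vectors: similarity is approximately 0.
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

# Opposite direction: similarity is approximately -1.
opposite = cosine_similarity(a, -a)

print(same_direction, orthogonal, opposite)
```

Note that scaling a vector does not change the result, which is why cosine similarity compares direction rather than magnitude.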

Using cosine similarity and a pretrained SentenceTransformer embedding model, all-MiniLM-L6-v2, we can calculate the similarity between two texts.

This is done by first getting the embedding vector of each text from the pretrained model initialized as "emb", then applying cosine similarity to the two vectors.

```
def text_similarity(t1, t2):
    return cosine_similarity(emb.encode(t1), emb.encode(t2))
```

We can put this all together into one Retriever class that can compare the text similarities between a query and a large text source to return text that can be used to augment generations.

First, if the data is a single string, it is split into batches of words based on batch_size and batch_overlap. Then the search method computes the similarity score between the query and every batch. Finally, the top k scoring batches are selected and returned.
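
The batching step can be sketched on its own, without the embedding model. The helper name `make_batches` is hypothetical and exists only for this illustration; it assumes, as in the class, that batch_size and batch_overlap are counted in words.

```python
def make_batches(text, batch_size=10, batch_overlap=3):
    # Hypothetical helper: splits text into overlapping batches of words.
    if batch_overlap >= batch_size:
        raise ValueError('batch_size must be greater than batch_overlap')
    words = text.split()
    # Each new batch starts this many words after the previous one.
    step = batch_size - batch_overlap
    return [' '.join(words[start:start + batch_size])
            for start in range(0, len(words), step)]

batches = make_batches('a b c d e f g h', batch_size=4, batch_overlap=2)
for batch in batches:
    print(batch)
```

With a batch size of 4 and an overlap of 2, each batch repeats the last two words of the previous one, and the final batch may be shorter than batch_size.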

In [None]:
class Retriever:
    def __init__(self, data, batch_size=10, batch_overlap=3, custom_batch=False):
        # batch_size and batch_overlap are counted in words. The overlap must be
        # smaller than the batch size, otherwise the window would never advance.
        if batch_overlap >= batch_size:
            raise ValueError('batch_size must be greater than batch_overlap')

        if not custom_batch:
            # Split a single string into overlapping batches of words.
            self.data = []
            words = data.split()
            start = 0
            while start < len(words):
                end = start + batch_size
                self.data.append(' '.join(words[start:end]))
                start += batch_size - batch_overlap
        else:
            # The caller supplies a ready-made list of text batches.
            self.data = data

    def cosine_similarity(self, v1, v2):
        return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

    def text_similarity(self, t1, t2):
        return self.cosine_similarity(emb.encode(t1), emb.encode(t2))

    def search(self, query, top_k=2):
        similarity_scores = [self.text_similarity(batch, query) for batch in self.data]
        # Rank by position so that batches with equal scores are not returned twice.
        ranked = np.argsort(-np.asarray(similarity_scores), kind='stable')[:top_k]
        return [self.data[i] for i in ranked]
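
The ranking step of the search method can be illustrated in isolation with made-up scores. This is a minimal sketch under stated assumptions: the helper name `top_k_batches` and the score values are hypothetical, and ranking positions with `np.argsort` is one possible way to do the selection. Ranking positions rather than looking up score values means that two batches with the same score both appear in the result.

```python
import numpy as np

def top_k_batches(batches, scores, top_k=2):
    # Hypothetical helper: sort positions by descending score. A stable sort
    # keeps tied scores in their original order.
    order = np.argsort(-np.asarray(scores), kind='stable')[:top_k]
    return [batches[i] for i in order]

batches = ['alpha', 'beta', 'gamma', 'delta']
scores = [0.10, 0.80, 0.80, 0.30]  # 'beta' and 'gamma' are tied on purpose

best = top_k_batches(batches, scores, top_k=3)
print(best)
```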

# Example Usage

In [None]:
text = 'Hi, my name is Brandon Baek and I am 15 years old. Ice cream is my favorite food in the entire world. The weather is doing pretty nice! I got an A on the English exam. Someone passed away yesterday and I am very sad. I\'m very sad today. I love coding! I\'m Korean American.'

In [None]:
ret = Retriever(text, batch_size=10, batch_overlap=3)

In [None]:
query = "What's your hobby?"
context = '\n'.join(ret.search(query, top_k=3))
print('Context given:\n' + context + '\n')
to_ai = f'<context>{context}</context>\n<query>{query}</query>\n<answer>'
print('Answer:\n' + llm(to_ai, pad_token_id=0)[0]['generated_text'].split('<answer>')[1].split('</answer>')[0])

Context given:
world. The weather is doing pretty nice! I got an
I love coding! I'm Korean American.
is my favorite food in the entire world. The weather

Answer:
I love coding! I'm Korean American.
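
The prompt format used above wraps the retrieved context and the query in tags and leaves the `<answer>` tag open for the model to complete. The answer is then taken as the text between `<answer>` and the first `</answer>`. A minimal sketch of those two steps follows. The helper names `build_prompt` and `extract_answer` are hypothetical, and `fake_output` is a hand-written stand-in, not real model output.

```python
def build_prompt(context, query):
    # Hypothetical helper: same tag layout as the example cell above.
    return f'<context>{context}</context>\n<query>{query}</query>\n<answer>'

def extract_answer(generated_text):
    # Keep the text after the opening tag and cut at the first closing tag.
    # If the model never emits '</answer>', everything after '<answer>' is returned.
    return generated_text.split('<answer>')[1].split('</answer>')[0]

prompt = build_prompt('I love coding!', "What's your hobby?")

# Hand-written stand-in for the generated text (prompt followed by a completion).
fake_output = prompt + 'Coding.</answer>\n<query>extra text'

answer = extract_answer(fake_output)
print(answer)
```

Cutting at the first closing tag discards any extra text the model keeps generating after its answer.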
