<h1>This is a python notebook containing our implmementation of a RAG for code generation, completion, debugging and other code relatied "natural" language queries</h1>

In [2]:
!pip install chromadb sentence_transformers pandas bs4



You should consider upgrading via the 'c:\users\38641\documents\faks\5.letnik\2.semester\nlp\ul-fri-nlp-course-project-2024-2025-1-6-3-musketeers\venv\scripts\python.exe -m pip install --upgrade pip' command.


In [3]:
import chromadb
from sentence_transformers import SentenceTransformer
import pandas as pd
import time
import uuid
from bs4 import BeautifulSoup

  from .autonotebook import tqdm as notebook_tqdm


Read the Stackoverflow questions

In [None]:
data = ['python_questions0.csv']
MAX_DOCS = 5000
df = pd.DataFrame()
for d in data:
    df = pd.concat([df, pd.read_csv(d)], ignore_index=True)

    
df = df.loc[:min(len(df), MAX_DOCS-1), ["tags", "question_title", "question_body", "answer", "question_score"]]
total_docs = len(df)
print(f"Loaded {total_docs} questions")

Loaded 100 questions


Chunk the questions and prepare them to be embedded

In [None]:
chunks = []
min_code_block = 10

for ix, content in df.iterrows():
    answer = content.loc['answer']
    tags = content.loc["tags"]
    score = content.loc["question_score"]

    question_chunk = f"{content.loc['question_title']}\n{content.loc['question_body']}".lower()
    chunks.append({"chunk": question_chunk,
                   "metadata": {"tags": tags,
                                "score": score,
                                "question": True,
                                "code": False,
                                "answer": answer.lower()
                                }})
    
    answer_chunk = str(answer).lower()
    chunks.append({"chunk": answer_chunk,
                   "metadata": {"tags": tags,
                   "score": score,
                   "question": False,
                   "code": False}
                   })

    soup = BeautifulSoup(answer, 'html.parser')
    code_blocks = [code.get_text() for code in soup.find_all('code')]
    for block in code_blocks:
        if len(block) > min_code_block and '\n' in block.strip():
            chunks.append({"chunk": block.lower(),
                           "metadata": {"tags": tags,
                                        "score": score,
                                        "question": False,
                                        "code": True}})

chunks = pd.DataFrame(chunks)
total_chunks = len(chunks)
print(f"Prepared {total_chunks} chunks.")

Prepared 347 chunks.


Initiate the embedder model and the vector database to store embeddings

In [35]:
# Initialize Chroma client
client = chromadb.PersistentClient(path="./test_db")

collection = client.get_or_create_collection(
    name="stackoverflow_demo",
    metadata={"hnsw:space": "cosine"}
)

In [36]:
# Initialize model
model = SentenceTransformer('all-MiniLM-L6-v2')

Embed the chunks and save them into the database

In [37]:
def print_progress(current, total, start_time, operation="Processing"):
    elapsed = time.time() - start_time
    percent = current / total
    eta = (elapsed / current) * (total - current) if current > 0 else 0
    print(
        f"\r{operation}: {current}/{total} ({percent:.1%}) | "
        f"Elapsed: {elapsed:.1f}s | ETA: {eta:.1f}s",
        end="", flush=True
    )

In [38]:
BATCH_SIZE = 200
total_added = 0
start_time = time.time()

for batch_num in range(0, total_chunks, BATCH_SIZE):
    batch = chunks.iloc[batch_num:batch_num + BATCH_SIZE]
    
    documents = []
    metadatas = []
    ids = []
    
    for ix, row in batch.iterrows():
        chunk = row["chunk"]
        metadata = row["metadata"]
        documents.append(chunk)
        metadatas.append(metadata)
        ids.append(str(uuid.uuid4()))  # Generate unique UUID for each document
    
    collection.add(
        documents=documents,
        metadatas=metadatas,
        ids=ids
    )
    total_added += len(documents)

    print_progress(min(batch_num + BATCH_SIZE, total_chunks), total_chunks, start_time)


print(f"\n\nSuccessfully added {total_added} documents")
print(f"Total documents in collection: {collection.count()}")
print(f"Total time: {time.time() - start_time:.2f} seconds")

Processing: 347/347 (100.0%) | Elapsed: 61.9s | ETA: 0.0s

Successfully added 347 documents
Total documents in collection: 347
Total time: 61.91 seconds


In [39]:
results = collection.get()
print(f"Total documents: {len(results['ids'])}")

# Inspect first few items
for i in range(min(3, len(results['ids']))):
    print(f"\nDocument {i+1}:")
    print(f"ID: {results['ids'][i]}")
    print(f"Content: {results['documents'][i][:200]}...")  # First 200 chars
    print(f"Metadata: {results['metadatas'][i]}")

Total documents: 347

Document 1:
ID: 8da50ba7-e7ce-4318-bfda-9dcd55f97f5f
Content: deleting dataframe row in pandas based on column value
<p>i have the following dataframe:</p>

<pre><code>             daysago  line_race rating        rw    wrating
 line_date                        ...
Metadata: {'question': True, 'tags': 'python|pandas', 'answer': "<p>the given answer is correct nontheless as someone above said you can use <code>df.query('line_race != 0')</code> which depending on your problem is much faster. highly recommend.</p>", 'score': 256, 'code': False}

Document 2:
ID: 9661644b-99db-4da1-9622-a1fa0e894c29
Content: <p>the given answer is correct nontheless as someone above said you can use <code>df.query('line_race != 0')</code> which depending on your problem is much faster. highly recommend.</p>...
Metadata: {'tags': 'python|pandas', 'question': False, 'code': False, 'score': 256}

Document 3:
ID: 5e1c7261-2877-42ea-a705-1da24e3d13e5
Content: df.query('line_race != 0')...
M

Check the retrieval process

In [40]:
# Search for similar questions
query_text = "how to parse json in python"
query_embedding = model.encode(query_text.lower()).tolist()

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)

print("\nTop 3 similar questions:")
for i, (doc, meta) in enumerate(zip(results['documents'][0], results['metadatas'][0])):
    print(f"\nResult {i+1}:")
    print(f"Score: {1 - results['distances'][0][i]:.2f}")
    print(f"Content: {doc[:200]}...")
    print(f"Tags: {meta['tags']}")


Top 3 similar questions:

Result 1:
Score: 0.52
Content: >>> data = {'uid': ['restest'], 'mail': [''], 'givenname': ['research'], 'cn': ['research test account'], 'sn': ['account']}
>>> data
{'mail': [''], 'sn': ['account'], 'givenname': ['research'], 'uid'...
Tags: python|django|dictionary|ldap

Result 2:
Score: 0.49
Content: body=json.dumps(query_params),
...
Tags: google-webmaster-tools|google-api-python-client|google-api-webmasters

Result 3:
Score: 0.47
Content: <p>found it! body parameter should actually by python object, not json formatted string!</p>

<pre><code>body=json.dumps(query_params),
</code></pre>

<p>should be </p>

<pre><code>body=query_params,
...
Tags: google-webmaster-tools|google-api-python-client|google-api-webmasters


In [41]:
!pip install torch transformers --index-url https://download.pytorch.org/whl/cpu
! pip install accelerate

You should consider upgrading via the 'c:\users\38641\documents\faks\5.letnik\2.semester\nlp\ul-fri-nlp-course-project-2024-2025-1-6-3-musketeers\venv\scripts\python.exe -m pip install --upgrade pip' command.


Looking in indexes: https://download.pytorch.org/whl/cpu


You should consider upgrading via the 'c:\users\38641\documents\faks\5.letnik\2.semester\nlp\ul-fri-nlp-course-project-2024-2025-1-6-3-musketeers\venv\scripts\python.exe -m pip install --upgrade pip' command.


In [42]:
!pip install transformers accelerate



You should consider upgrading via the 'c:\users\38641\documents\faks\5.letnik\2.semester\nlp\ul-fri-nlp-course-project-2024-2025-1-6-3-musketeers\venv\scripts\python.exe -m pip install --upgrade pip' command.


Test the whole RAG pipeline on a small LLM that runs on CPU. Compare results to plain LLM, to see if retrieval even helps.

In [None]:
# for testing only
from transformers import AutoTokenizer, AutoModelForCausalLM

class RAG:
    def __init__(self, embedder, collection, retrieve_number=3, gpu_based=False):
        model_id = "stabilityai/stablelm-2-zephyr-1_6b" if gpu_based else "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        self.device = "cuda" if self.gpu_based else "cpu"
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.llm = AutoModelForCausalLM.from_pretrained(model_id, device_map=self.device)
        self.embedder = embedder
        self.retriever = collection
        self.retrieve_number = retrieve_number


    def generate(self, query):
        query_embedding = self.embedder.encode(query.lower()).tolist()
        results = self.retriever.query(query_embeddings=[query_embedding], n_results=self.retrieve_number)
        prompt = self.build_prompt(query, results)
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        outputs = self.llm.generate(**inputs, max_new_tokens=200)
        output = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return output.split("Answer:")[1]

    def context_from_results(self, results):
        contexts = []
        for document, metadata in zip(results["documents"], results["metadatas"]):
            metadata = metadata[0]
            document = document[0]
            if metadata["question"]:
                contexts.append(metadata["answer"])
            else:
                contexts.append(document)
        return contexts

    def build_prompt(self, query, results):
        contexts = self.context_from_results(results)
        return f'''
            Answer the following code related question using the context provided inside triple qoutes in it is useful.
            In the answer provide an example of code that is related to the question.
            If you do not know the answer, say that you do not know. Do not try to invent the solution.
            

            Question: {query}


            ```{''.join(f"Context {i}: {context}{chr(10)}{chr(10)}" for i, context in enumerate(contexts))}´´´

            
            Answer:

            '''

In [None]:
class BasicLLM:
    def __init__(self, gpu_based=False):
        model_id = "stabilityai/stablelm-2-zephyr-1_6b" if gpu_based else "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
        self.device = "cuda" if self.gpu_based else "cpu"
        self.llm = AutoModelForCausalLM.from_pretrained(model_id, device_map=self.device)
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.gpu_based = gpu_based

    def generate(self, query):
        inputs = self.tokenizer(query, return_tensors="pt").to(self.device)
        outputs = self.llm.generate(**inputs, max_new_tokens=200)
        output = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return output.split("?")[1]


In [None]:
# tuki nej bo vprašaj na koncu (nej bo samo en vprašaj)
query = "How to parse json in python?"

code_llm = RAG(model, collection)
answer = code_llm.generate(query)
print(f"RAG answer:\n{answer}")

print("\n")
basicLLM = BasicLLM()
basic_answer = basicLLM.generate(query)
print(f"Basic answer:\n{basic_answer}")

RAG answer:


            
            In this example, we are parsing a JSON object using the `json` library.

            We start by importing the `json` library and creating a dictionary from the JSON data.

            We then iterate over the keys of the dictionary and extract the values for each key.

            We use the `get` method to extract the value for the `uid` key, which is a list of strings.

            We then convert the list of strings to a string using the `join` method.

            Finally, we assign the resulting string to the `uid` variable.

            Here's the complete code:

            ```
            import json

            data = {'uid': ['restest'], 'mail': [''], 'givenname': ['research'], 'cn': ['research test account'], 'sn': ['account']}

            uid = [v[0] for (k


Basic answer:
 I am trying to parse a json file in python. The json file contains a list of objects with keys "name" and "age". I want to extract the name and age from each obj