# Headstarter Codebase RAG Project

## Objective

The objective of the Codebase RAG Project is to develop an intelligent system that leverages Retrieval-Augmented Generation to:

- Efficiently retrieve relevant information from a codebase or documentation.
- Provide AI-powered insights and assistance for coding tasks, such as debugging, refactoring, or feature implementation.
- Improve developer productivity by integrating contextual understanding into IDEs or development workflows.


## System Architecture

![Screenshot 2024-11-25 at 7 12 58 PM](https://github.com/user-attachments/assets/0bd67cf0-43d5-46d2-879c-a752cae4c8e3)

**Explanation of RAG Workflow:**

1. User Query: The system takes a user query (e.g., a natural language question or task input).

2. Query Embedding: The user query is converted into an embedding (a numerical vector representation) using a pre-trained embedding model.

3. Vector Search in Database: The embedding is used to search for the most relevant entries in a vector database, such as Pinecone, ChromaDB, or FAISS. These systems retrieve items (e.g., documents, code snippets) that have embeddings closest to the query embedding based on similarity metrics like cosine similarity.

4. Data Retrieval: The most relevant results are retrieved from the database, forming the context for the next step.

5. LLM Processing: The retrieved context is formatted and passed, together with the original user query, to a large language model (LLM) such as Llama 3.1, through an API such as the Groq API.

6. Relevant Answer Generation: The LLM processes the input (user query + retrieved context) and generates a relevant and contextually aware answer, which is returned to the user.




**Notes:**

- The matches returned by Pinecone carry their text and metadata in readable form, so the context passed to the LLM is plain text, not embeddings.
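
The retrieval half of the workflow (steps 2 to 4) can be sketched without any external service. This is a toy illustration only: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, `toy_store` is an in-memory list standing in for the vector database, and the sample texts are invented. The LLM call (steps 5 and 6) is left out.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, store, top_k=2):
    """Embed the query, rank stored texts by similarity, return the top_k texts."""
    query_vector = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vector, item["vector"]), reverse=True)
    return [item["text"] for item in ranked[:top_k]]

toy_docs = [
    "python parser finds the enclosing function",
    "review prompt for pull requests",
    "javascript parser handles jsx",
]
toy_store = [{"text": doc, "vector": embed(doc)} for doc in toy_docs]

toy_query = "how does the python parser work"
toy_contexts = retrieve(toy_query, toy_store)

# The retrieved text plus the original question form the LLM input
toy_prompt = "<CONTEXT>\n" + "\n---\n".join(toy_contexts) + "\n</CONTEXT>\n\nQuestion: " + toy_query
print(toy_prompt)
```

The store returns text, not vectors, which is why the retrieved items can be pasted straight into the prompt.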

## Install Necessary Libraries

In [1]:
! pip install pygithub langchain langchain-community openai tiktoken pinecone-client langchain_pinecone sentence-transformers


In [2]:
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from langchain_pinecone import PineconeVectorStore
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from google.colab import userdata
from pinecone import Pinecone
import os
import tempfile
from github import Github, Repository
from git import Repo
from openai import OpenAI
from pathlib import Path
from langchain.schema import Document




**Note on `cosine_similarity`:**

- Call: `similarity = cosine_similarity(query_vector, document_vector)`
- `query_vector`: shape `(1, n)`
- `document_vector`: shape `(1, n)`
- Output: a 2D array of shape `(1, 1)`, e.g. `[[0.85]]`, holding the similarity score between the two vectors.
- To extract the similarity score as a scalar, access it with `similarity[0][0]`.
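
To make the shapes concrete, here is a minimal pure-Python sketch of the same computation. `cosine_similarity_2d` is a hypothetical helper written for illustration, not the scikit-learn function; it only mimics the 2D-in, 2D-out convention and assumes non-zero vectors of equal length.

```python
import math

def cosine_similarity_2d(a_rows, b_rows):
    """Return a len(a_rows) x len(b_rows) nested list of cosine similarities."""
    def cos(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(y * y for y in v))
        return dot / (norm_u * norm_v)  # assumes neither vector is all zeros
    return [[cos(u, v) for v in b_rows] for u in a_rows]

query_vector = [[1.0, 0.0, 1.0]]     # one row, n = 3, i.e. shape (1, 3)
document_vector = [[1.0, 1.0, 0.0]]  # one row, n = 3, i.e. shape (1, 3)

similarity = cosine_similarity_2d(query_vector, document_vector)
score = similarity[0][0]  # unwrap the single entry of the (1, 1) result
print(similarity, score)
```

With one query row and one document row the result has exactly one entry, which is why the scalar is read with two index operations.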





## Clone a GitHub Repo Locally

In [3]:
# check the current directory
os.getcwd()

'/content'

In [4]:
def clone_repo(repo_url):
  """Clones a GitHub repository into /content (the Colab working directory).

    Args:
        repo_url: The URL of the GitHub repository.

    Returns:
        The path to the cloned repository.
  """

  # Extract the repository name, tolerating a trailing slash or ".git" suffix
  repo_name = repo_url.rstrip("/").split("/")[-1].removesuffix(".git")
  repo_path = f"/content/{repo_name}"

  if os.path.exists(repo_path):
      # Reuse the existing clone instead of cloning again
      print(f"Repository already exists at {repo_path}")
      return repo_path

  # Clone the repository if the directory does not exist
  Repo.clone_from(repo_url, repo_path)
  print(f"Repository cloned to {repo_path}")
  return repo_path

In [5]:
repo_url =  "https://github.com/CoderAgent/SecureAgent"
path = clone_repo(repo_url)

Repository cloned to /content/SecureAgent


In [6]:
print(path)

/content/SecureAgent


In [7]:
SUPPORTED_EXTENSIONS = {'.py', '.js', '.tsx', '.jsx', '.ipynb', '.java',
                         '.cpp', '.ts', '.go', '.rs', '.vue', '.swift', '.c', '.h'}

IGNORED_DIRS = {'node_modules', 'venv', 'env', 'dist', 'build', '.git',
                '__pycache__', '.next', '.vscode', 'vendor'}

In [8]:
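def get_file_content(file_path, repo_path):
  """
  Get content of a single file.

  Args:
  file_path (str): Path to the file.
  repo_path (str): Path to the repo root, used to compute the relative name.

  Returns:
  Optional[Dict[str, str]]: Dictionary with the repo-relative file name and content,
  or None if the file could not be read.
  """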

  try:
    with open(file_path, 'r', encoding="utf-8") as file:
      content = file.read()

      # get the relative path from repo root
      rel_path = os.path.relpath(file_path, repo_path)

    return {
          "name": rel_path,
          "content": content
    }


  except Exception as e:
    print(f"Error reading file {file_path}: {e}")
    return None


In [9]:
# Example of the content
get_file_content("/content/SecureAgent/src/data/PRSuggestionImpl.ts", "/content/SecureAgent")

{'name': 'src/data/PRSuggestionImpl.ts',
 'content': 'import { PRSuggestion } from "../constants";\n\nexport class PRSuggestionImpl implements PRSuggestion {\n  describe: string;\n  type: string;\n  comment: string;\n  code: string;\n  filename: string;\n\n  constructor(\n    describe: string,\n    type: string,\n    comment: string,\n    code: string,\n    filename: string\n  ) {\n    this.describe = describe;\n    this.type = type;\n    this.comment = comment;\n    this.code = code;\n    this.filename = filename;\n  }\n\n  toString(): string {\n    const xmlElements = [\n      `<suggestion>`,\n      `  <describe>${this.describe}</describe>`,\n      `  <type>${this.type}</type>`,\n      `  <comment>${this.comment}</comment>`,\n      `  <code>${this.code}</code>`,\n      `  <filename>${this.filename}</filename>`,\n      `</suggestion>`,\n    ];\n    return xmlElements.join("\\n");\n  }\n\n  identity(): string {\n    return `${this.filename}:${this.comment}`;\n  }\n}\n'}

In [10]:
def get_main_files_content (repo_path:str):
  """
  Get content of supported code files from local repo.

  Args:
  repo_path (str): Path to the local repo

  Returns:
  List of dictionaries with file name and content
  """

  files_content = []

  try:
    for root, dirs, files in os.walk(repo_path):
      # Prune ignored directories in place so os.walk never descends into them.
      # Comparing whole directory names avoids substring false positives
      # (e.g. 'env' matching a folder named 'environment').
      dirs[:] = [d for d in dirs if d not in IGNORED_DIRS]

      # Process each file in the current directory
      for file in files:
        file_path = os.path.join(root, file)
        if os.path.splitext(file)[1] in SUPPORTED_EXTENSIONS:
          file_content = get_file_content(file_path, repo_path)
          if file_content:
            files_content.append(file_content)

  except Exception as e:
    print(f"Error reading repository: {str(e)}")

  return files_content
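
When filtering directories during a walk, comparing whole directory names behaves differently from a substring test on the path. The sketch below builds a throwaway directory tree and compares the two approaches. The names `walk_kept_dirs` and `IGNORED` are hypothetical, and the folder names are invented for the demonstration.

```python
import os
import tempfile

IGNORED = {"env", "build"}  # small stand-in for IGNORED_DIRS

def walk_kept_dirs(repo_path, ignored):
    """Return repo-relative directories visited when ignored names are pruned."""
    kept = []
    for root, dirs, _files in os.walk(repo_path):
        # In-place pruning: os.walk will not descend into removed entries
        dirs[:] = sorted(d for d in dirs if d not in ignored)
        kept.append(os.path.relpath(root, repo_path))
    return kept

with tempfile.TemporaryDirectory() as tmp:
    for name in ("env", "environment", os.path.join("src", "build"), os.path.join("src", "builder")):
        os.makedirs(os.path.join(tmp, name))
    visited = walk_kept_dirs(tmp, IGNORED)
    # A substring test also drops 'environment' and 'src/builder'
    substring_kept = [d for d in visited if not any(name in d for name in IGNORED)]

print(visited)
print(substring_kept)
```

Exact-name pruning keeps `environment` and `src/builder`, while the substring test discards both because they merely contain `env` and `build`.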



In [11]:
file_content = get_main_files_content(path)

In [12]:
file_content

[{'name': 'src/prompts.ts',
  'content': 'import { encode, encodeChat } from "gpt-tokenizer";\nimport type { ChatCompletionMessageParam } from "groq-sdk/resources/chat/completions";\nimport type { PRFile } from "./constants";\nimport {\n  rawPatchStrategy,\n  smarterContextPatchStrategy,\n} from "./context/review";\nimport { GROQ_MODEL, type GroqChatModel } from "./llms/groq";\n\nconst ModelsToTokenLimits: Record<GroqChatModel, number> = {\n  "mixtral-8x7b-32768": 32768,\n  "gemma-7b-it": 32768,\n  "llama3-70b-8192": 8192,\n  "llama3-8b-8192": 8192,\n};\n\nexport const REVIEW_DIFF_PROMPT = `You are PR-Reviewer, a language model designed to review git pull requests.\nYour task is to provide constructive and concise feedback for the PR, and also provide meaningful code suggestions.\n\nExample PR Diff input:\n\'\n## src/file1.py\n\n@@ -12,5 +12,5 @@ def func1():\ncode line that already existed in the file...\ncode line that already existed in the file....\n-code line that was removed in t

In [35]:
len(file_content)

13

## Embeddings

In [13]:
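_MODEL_CACHE = {}  # loaded models, keyed by name, so each model is loaded only once

def get_huggingface_embeddings(text, model_name="sentence-transformers/all-mpnet-base-v2"):
  if model_name not in _MODEL_CACHE:
    _MODEL_CACHE[model_name] = SentenceTransformer(model_name)
  return _MODEL_CACHE[model_name].encode(text)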

In [14]:
text = "I am a programmer and try to learn machine learning"

embeddings = get_huggingface_embeddings(text)



In [15]:
embeddings

array([-2.07913760e-02,  8.33775091e-04, -4.02439758e-02,  1.02785742e-02,
        1.62548646e-02,  4.59598117e-02,  4.83331718e-02, -3.26138251e-02,
        6.93292310e-03,  3.23573835e-02,  5.38950600e-02,  4.41972204e-02,
        9.05252807e-03,  4.64829840e-02,  2.23122425e-02, -5.03575727e-02,
        3.81771363e-02, -3.67080048e-02,  9.01245233e-03, -3.55045237e-02,
       -2.39045285e-02,  7.86051434e-03, -4.05415148e-02,  2.36402564e-02,
       -4.52324450e-02, -6.29183277e-02, -1.43290134e-02, -1.93033032e-02,
       -1.41940517e-02,  6.64439127e-02,  1.38363549e-02, -7.18007088e-02,
        4.80623636e-03,  4.04268242e-02,  1.73956948e-06, -3.81604135e-02,
       -4.20074165e-02,  3.14945057e-02, -3.30295451e-02, -3.20510752e-02,
        4.32071500e-02,  1.19105252e-02,  1.37608647e-02,  4.67022583e-02,
       -3.01357210e-02,  1.28489546e-02,  1.03936329e-01,  7.10375328e-03,
        1.57412365e-02,  7.01721907e-02, -1.67294545e-03, -2.11559962e-02,
        2.20719688e-02, -

## Setting up Pinecone

In [16]:
# Set the pinecone_api_key as environment variable
pinecone_api_key = userdata.get('PINECONE_API_KEY')
os.environ['PINECONE_API_KEY'] = pinecone_api_key


# Initialize pinecone
pc = Pinecone(api_key = pinecone_api_key)

# Connect to pinecone index
pinecone_index = pc.Index("codebase-rag")

In [17]:
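# HuggingFaceEmbeddings() defaults to sentence-transformers/all-mpnet-base-v2,
# the same model that get_huggingface_embeddings uses for queries
vector_store = PineconeVectorStore(
    index_name = "codebase-rag",
    embedding = HuggingFaceEmbeddings()
)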



In [18]:
documents = []


# create document object for the file
for file in file_content:
  doc = Document(
      page_content = f"{file['name']}\n{file['content']}",
      metadata = {"source": file["name"]}
  )
  documents.append(doc)

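# embed each document and upsert it into the index
# (re-running this cell inserts the documents again, which shows up as duplicate matches)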
vectorstore = PineconeVectorStore.from_documents(
    documents = documents,
    embedding = HuggingFaceEmbeddings(),
    index_name = "codebase-rag",
    namespace = "https://github.com/CoderAgent/SecureAgent"

)
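
Each file is embedded as a single document here. According to its model card, `all-mpnet-base-v2` truncates input longer than 384 word pieces, so for long files only the beginning influences the embedding. One possible mitigation, shown as a sketch and not as part of this notebook's pipeline, is to split files into overlapping line windows before creating `Document` objects. The helper `chunk_lines` and its `max_lines` and `overlap` parameters are hypothetical, and the values are untuned.

```python
def chunk_lines(text, max_lines=40, overlap=5):
    """Split text into overlapping windows of at most max_lines lines."""
    if overlap >= max_lines:
        raise ValueError("overlap must be smaller than max_lines")
    lines = text.splitlines()
    step = max_lines - overlap
    chunks = []
    for start in range(0, max(len(lines), 1), step):
        window = lines[start:start + max_lines]
        if window:
            chunks.append("\n".join(window))
        if start + max_lines >= len(lines):
            break  # the last window already reached the end of the file
    return chunks

sample = "\n".join(f"line {i}" for i in range(100))
sample_chunks = chunk_lines(sample, max_lines=40, overlap=5)
print(len(sample_chunks))
```

Each chunk could then become its own `Document` with the same `source` metadata, so a match still points back to the originating file.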



## Perform RAG

In [19]:
client = OpenAI(
    base_url = "https://api.groq.com/openai/v1",
    api_key = userdata.get("GROQ_API_KEY")
)

In [20]:
query = "How are python files parsed?"

In [21]:
raw_query_embed = get_huggingface_embeddings(query)
raw_query_embed

array([ 5.29357232e-02, -6.24647178e-02, -2.87437718e-02,  1.83179416e-02,
       -4.33840672e-04,  4.03239094e-02, -7.76652806e-03, -2.74394872e-03,
        2.53445264e-02, -8.10819939e-02, -8.44583288e-03, -6.59269514e-03,
        4.16187495e-02,  3.98627296e-02,  2.82911733e-02,  2.84344628e-02,
        2.65303329e-02, -2.60126498e-02,  4.16299142e-02,  3.92820686e-02,
       -5.15580364e-02,  5.83349541e-02,  5.88829117e-03,  3.46064568e-02,
       -2.46872660e-03,  2.72809248e-02,  1.07212560e-02,  4.55760621e-02,
       -1.69188846e-02, -4.85301390e-02, -3.02424375e-02, -3.29698175e-02,
        2.46010050e-02,  3.23601812e-02,  1.16030503e-06,  9.71379410e-03,
       -3.70800160e-02,  1.84200946e-02, -1.39834182e-02,  4.25723344e-02,
        6.78140819e-02, -6.66246563e-02,  2.11651716e-02, -1.11712900e-03,
       -1.80115458e-02, -7.90139660e-02,  5.93152717e-02, -5.23733571e-02,
        5.63013554e-02,  4.31280173e-02,  7.77091645e-03, -2.30586994e-02,
       -2.94572674e-02,  

In [22]:
# Find the closest matches among the vectors already stored in Pinecone
top_matches = pinecone_index.query(
    vector = raw_query_embed.tolist(),
    top_k = 5,
    include_metadata = True,
    namespace = "https://github.com/CoderAgent/SecureAgent"
)

In [23]:
top_matches

{'matches': [{'id': 'a3c23e19-0ef9-40ec-9f67-4a21630e5061',
              'metadata': {'source': 'src/context/language/python-parser.ts',
                           'text': 'src/context/language/python-parser.ts\n'
                                   'import { AbstractParser, EnclosingContext '
                                   '} from "../../constants";\n'
                                   'export class PythonParser implements '
                                   'AbstractParser {\n'
                                   '  findEnclosingContext(\n'
                                   '    file: string,\n'
                                   '    lineStart: number,\n'
                                   '    lineEnd: number\n'
                                   '  ): EnclosingContext {\n'
                                   '    // TODO: Implement this method for '
                                   'Python\n'
                                   '    return null;\n'
                          

In [24]:
contexts = [item["metadata"]["text"] for item in top_matches["matches"]]

In [25]:
contexts

['src/context/language/python-parser.ts\nimport { AbstractParser, EnclosingContext } from "../../constants";\nexport class PythonParser implements AbstractParser {\n  findEnclosingContext(\n    file: string,\n    lineStart: number,\n    lineEnd: number\n  ): EnclosingContext {\n    // TODO: Implement this method for Python\n    return null;\n  }\n  dryRun(file: string): { valid: boolean; error: string } {\n    // TODO: Implement this method for Python\n    return { valid: false, error: "Not implemented yet" };\n  }\n}\n',
 'src/context/language/python-parser.ts\nimport { AbstractParser, EnclosingContext } from "../../constants";\nexport class PythonParser implements AbstractParser {\n  findEnclosingContext(\n    file: string,\n    lineStart: number,\n    lineEnd: number\n  ): EnclosingContext {\n    // TODO: Implement this method for Python\n    return null;\n  }\n  dryRun(file: string): { valid: boolean; error: string } {\n    // TODO: Implement this method for Python\n    return { va

In [26]:
len(contexts)

5

In [27]:
augmented_query = "<CONTEXT>\n" + "\n\n-----------\n\n".join(contexts[:5]) + "\n\n----------------\n</CONTEXT>\n\n\nMy Question: \n" + query
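
The retrieved contexts can contain the same file more than once, for example when the same documents were upserted several times. A minimal sketch of building the same prompt layout while dropping exact duplicates follows. `build_augmented_query`, `max_contexts`, and the sample strings are hypothetical names and data introduced for illustration.

```python
def build_augmented_query(contexts, query, max_contexts=5):
    """Join unique contexts into a <CONTEXT> block followed by the question."""
    # dict.fromkeys removes exact duplicates while keeping first-seen order
    unique = list(dict.fromkeys(contexts))[:max_contexts]
    separator = "\n\n-----------\n\n"
    return (
        "<CONTEXT>\n"
        + separator.join(unique)
        + "\n\n----------------\n</CONTEXT>\n\n\nMy Question: \n"
        + query
    )

sample_contexts = ["file_a.ts\ncode A", "file_a.ts\ncode A", "file_b.ts\ncode B"]
sample_prompt = build_augmented_query(sample_contexts, "How are python files parsed?")
print(sample_prompt)
```

Removing duplicates leaves more of the model's context window for distinct files.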

In [28]:
print(augmented_query)

<CONTEXT>
src/context/language/python-parser.ts
import { AbstractParser, EnclosingContext } from "../../constants";
export class PythonParser implements AbstractParser {
  findEnclosingContext(
    file: string,
    lineStart: number,
    lineEnd: number
  ): EnclosingContext {
    // TODO: Implement this method for Python
    return null;
  }
  dryRun(file: string): { valid: boolean; error: string } {
    // TODO: Implement this method for Python
    return { valid: false, error: "Not implemented yet" };
  }
}


-----------

src/context/language/python-parser.ts
import { AbstractParser, EnclosingContext } from "../../constants";
export class PythonParser implements AbstractParser {
  findEnclosingContext(
    file: string,
    lineStart: number,
    lineEnd: number
  ): EnclosingContext {
    // TODO: Implement this method for Python
    return null;
  }
  dryRun(file: string): { valid: boolean; error: string } {
    // TODO: Implement this method for Python
    return { valid: false,

In [29]:
system_prompt = """You are a Senior Software Engineer, specializing in TypeScript.

Answer any questions I have about the codebase, based on the code provided. Always consider all of the context provided when forming a response.
"""

In [30]:
model = "llama-3.1-70b-versatile"

llm_response = client.chat.completions.create(
    model = model,
    messages= [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": augmented_query}
    ]
)

response = llm_response.choices[0].message.content

In [31]:
response

'Based on the provided codebase, it seems that Python files are not properly parsed yet. The `PythonParser` class is implemented, but its `findEnclosingContext` and `dryRun` methods are not implemented and return a `null` or a hardcoded error message.\n\nTo properly parse Python files, the `PythonParser` class would need to be updated with a Python parsing library, such as `pycodestyle` or `ast` (abstract syntax tree), and the `findEnclosingContext` and `dryRun` methods would need to be implemented to utilize this library.\n\nFor example, the `findEnclosingContext` method could parse the Python file and then find the enclosing context (e.g., the function or class) of a given range of lines. The `dryRun` method could also parse the Python file and return an error message if the parsing fails.\n\nHere\'s a basic example of how the `PythonParser` class could be updated using the `ast` library:\n\n```typescript\n// src/context/language/python-parser.ts\nimport * as ast from \'python-ast-pa

## Putting It All Together

In [32]:
def perform_rag(query):
    raw_query_embedding = get_huggingface_embeddings(query)

    top_matches = pinecone_index.query(vector=raw_query_embedding.tolist(), top_k=5, include_metadata=True, namespace="https://github.com/CoderAgent/SecureAgent")

    # Get the list of retrieved texts
    contexts = [item['metadata']['text'] for item in top_matches['matches']]

    # top_k=5 above already caps the number of contexts, so no extra slicing is needed
    augmented_query = "<CONTEXT>\n" + "\n\n-------\n\n".join(contexts) + "\n-------\n</CONTEXT>\n\n\n\nMY QUESTION:\n" + query

    # Modify the prompt below as needed to improve the response quality
    system_prompt = f"""You are a Senior Software Engineer, specializing in TypeScript.

    Answer any questions I have about the codebase, based on the code provided. Always consider all of the context provided when forming a response.
    """

    llm_response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": augmented_query}
        ]
    )

    return llm_response.choices[0].message.content

In [33]:
response = perform_rag("How to use python file parser?")

In [34]:
print(response)

Based on the code provided, it seems like the Python file parser is currently not implemented correctly. The class `PythonParser` is implemented, but the methods `findEnclosingContext` and `dryRun` are not fully implemented. 

However, to answer your question, you would typically use a Python file parser to parse a Python file and extract data from it. Assuming the methods in the `PythonParser` class are implemented correctly, you can use it in the following way:

```typescript
import { PythonParser } from './python-parser';

const pythonParser = new PythonParser();

// To parse a Python file and find the enclosing context
const file = 'path/to/your/python/file.py';
const startLine = 10;
const endLine = 20;
const result = pythonParser.findEnclosingContext(file, startLine, endLine);

// To check if a Python file is valid or not
const isValid = pythonParser.dryRun(file);
console.log(`Is the file valid? ${isValid.valid}`);
console.log(`Error: ${isValid.error}`);
```

However, since the me