<a href="https://colab.research.google.com/github/VarunTej9/Code-Embedding-Pipeline-with-CodeBERT/blob/main/CLEAN_Code_Embedding_Pipeline_with_CodeBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing dependencies

- `transformers` - to load CodeBERT
- `torch` - to run the model
- `faiss-cpu` - to store and search embeddings

In [None]:
!pip install transformers torch faiss-cpu


Collecting faiss-cpu
  Downloading faiss_cpu-1.11.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (4.8 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.meta


In the next step, we define small Python functions (as strings) to test our embedding pipeline.

We're storing them inside a Python dictionary named `code_snippets`.  
Each key is a **filename** (like `"add.py"`), and each value is the **actual code**.

These snippets will be passed into the model to generate embeddings.


In [None]:
code_snippets = {
    "add.py": "def add(a, b):\n    return a + b",
    "factorial.py": "def factorial(n):\n    if n == 0:\n        return 1\n    return n * factorial(n - 1)"
}



In the next step, we load the pre-trained **CodeBERT model** (`microsoft/codebert-base`) and its tokenizer using Hugging Face's `transformers` library.

- The **tokenizer** splits the code into smaller pieces (called tokens) that the model can understand.
- The **model** processes those tokens and returns embeddings — numerical representations that capture the meaning of the code.

CodeBERT was pre-trained on paired source code and natural language, which makes it well suited to code understanding tasks.


In [None]:
from transformers import RobertaTokenizer, RobertaModel
import torch
import numpy as np

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")
model = RobertaModel.from_pretrained("microsoft/codebert-base")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.




Now we define a function called `get_embedding()` that takes a code snippet (as a string) and returns its **embedding** using the CodeBERT model.

### What this function does:
- **Tokenizes** the input code using the CodeBERT tokenizer, truncating inputs that exceed the model's maximum length.
- **Passes** the tokens through the model to get the last layer's hidden states (one vector per token).
- **Averages** those token vectors to get a single vector representing the whole code snippet.
- **Returns** the final embedding as a one-dimensional NumPy array (768 values for `codebert-base`).

This embedding captures the **semantic meaning** of the code and can be used for tasks like similarity search, clustering, or code classification.


In [None]:
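def get_embedding(code):
    """Return a mean-pooled CodeBERT embedding for one code string."""
    # Tokenize; inputs longer than the model's limit are truncated.
    inputs = tokenizer(code, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(**inputs)
    # Average the per-token vectors into a single 1-D vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze().numpy()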
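
The function above averages over every token position, which is fine when one snippet is embedded at a time. If several snippets were batched with padding, the pad positions would be averaged in as well. Below is a minimal sketch of masked mean pooling that avoids this. It uses small NumPy arrays as stand-ins for `last_hidden_state` and `attention_mask`; the helper name `masked_mean_pool` and the toy values are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def masked_mean_pool(hidden, mask):
    """Average token vectors while ignoring padded positions.

    hidden: (batch, seq_len, dim) array, a stand-in for last_hidden_state.
    mask:   (batch, seq_len) array with 1 for real tokens and 0 for padding.
    """
    mask = mask[..., None].astype(hidden.dtype)      # (batch, seq_len, 1)
    summed = (hidden * mask).sum(axis=1)             # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1.0, None)    # guard against division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 positions, 2 dimensions (values are made up).
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],  # last position is padding
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
], dtype=np.float32)
mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(hidden, mask)  # padding excluded
plain = hidden.mean(axis=1)              # padding included
```

For the first toy sequence, `pooled` averages only the two real tokens, while `plain` is pulled toward the padding values.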



In this step, we run each code snippet through the `get_embedding()` function to generate its corresponding **vector embedding**.

### What happens here:
- We loop through each code snippet.
- We call `get_embedding()` to convert it into a numerical vector.
- We store the filename and its embedding in a dictionary called `embeddings`, converting the vector to a plain list so it can be serialized as JSON.

This dictionary will be used in the next step to save the results in a `.json` file.


In [None]:
embeddings = {}

for fname, code in code_snippets.items():
    emb = get_embedding(code)
    embeddings[fname] = emb.tolist()
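
A quick sanity check on a dictionary like `embeddings` is to compare two of its vectors. The sketch below computes cosine similarity with NumPy. The helper name `cosine_similarity` and the three-dimensional toy vectors are assumptions for illustration; real CodeBERT vectors would be read from `embeddings` instead.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return float(np.dot(a, b) / denom)

# Toy stand-ins shaped like the `embeddings` dictionary (filename -> list).
toy_embeddings = {
    "a.py": [1.0, 0.0, 0.0],
    "b.py": [2.0, 0.0, 0.0],  # same direction as a.py, different length
    "c.py": [0.0, 1.0, 0.0],  # orthogonal to a.py
}

sim_ab = cosine_similarity(toy_embeddings["a.py"], toy_embeddings["b.py"])
sim_ac = cosine_similarity(toy_embeddings["a.py"], toy_embeddings["c.py"])
```

With the notebook's data, the same call would look like `cosine_similarity(embeddings["add.py"], embeddings["factorial.py"])`.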




Now that we have generated embeddings for all our code snippets, we will save them to a file called `embeddings.json`.

### What this does:
- Uses Python's `json` module to write the `embeddings` dictionary into a `.json` file.
- Makes the embeddings easy to store, share, and reuse.
- Uses `google.colab.files.download()` to download the file to your local machine.

This file contains each code snippet's filename and its corresponding embedding vector.


In [None]:
import json

with open("embeddings.json", "w") as f:
    json.dump(embeddings, f, indent=2)

from google.colab import files
files.download("embeddings.json")
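
To reuse the saved file later, the JSON can be read back and each list converted to a NumPy array. This is a minimal sketch: the helper name `load_embeddings` is hypothetical, and the example writes a small toy file to a temporary directory so it does not depend on the model or on Colab.

```python
import json
import os
import tempfile

import numpy as np

def load_embeddings(path):
    """Read a {filename: list_of_floats} JSON file into float32 arrays."""
    with open(path) as f:
        raw = json.load(f)
    return {name: np.asarray(vec, dtype=np.float32) for name, vec in raw.items()}

# Toy data in the same shape as the notebook's `embeddings` dictionary.
toy = {"add.py": [0.1, 0.2, 0.3], "factorial.py": [0.4, 0.5, 0.6]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.json")
    with open(path, "w") as f:
        json.dump(toy, f, indent=2)
    loaded = load_embeddings(path)
```

Converting to `float32` up front is a deliberate choice here, since that is the dtype FAISS works with.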




In this step, we use **FAISS** (Facebook AI Similarity Search) to store our embeddings in a special index structure.  
This allows for **fast similarity search** — for example, finding which code snippets are most similar to a given one.

### What this code does:
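- Computes the embedding for every code snippet and stacks the results into a NumPy array with one row per snippet.
- Creates a **FAISS index** (`IndexFlatL2`) that performs exact search using L2 (Euclidean) distance.
- Adds all vectors to the index.
- Saves the index to `faiss_index.index` and downloads the file so it can be reused later for code search tasks.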

This sets the foundation for building tools like **semantic code search**, **clone detection**, and more!


In [None]:
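import faiss

# Stack the embeddings into an (n_snippets, dim) float32 matrix, as FAISS expects.
vecs = np.array([get_embedding(code) for code in code_snippets.values()], dtype="float32")
index = faiss.IndexFlatL2(vecs.shape[1])  # exact search with L2 distance
index.add(vecs)
# Save the index to disk, then download it from Colab.
faiss.write_index(index, "faiss_index.index")
files.download("faiss_index.index")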
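
The notebook builds the index but does not query it. With FAISS, a query is `index.search(query, k)`, where `query` is a 2-D `float32` array of shape `(n_queries, dim)`. The call returns distances and row indices, and `IndexFlatL2` reports squared L2 distances. The sketch below reproduces that exhaustive search in plain NumPy so the logic is visible without FAISS installed. The helper name `l2_search` and the toy vectors are assumptions for illustration.

```python
import numpy as np

def l2_search(vectors, query, k):
    """Return the k nearest rows to `query` by squared L2 distance."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    dists = ((vectors - query) ** 2).sum(axis=1)  # squared distance to each row
    order = np.argsort(dists)[:k]                 # indices of the k smallest
    return dists[order], order

# Toy "index" of three 2-D vectors (values are made up).
vecs_toy = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]], dtype=np.float32)
query = [0.9, 0.1]

dists, ids = l2_search(vecs_toy, query, k=2)
```

The returned `ids` are row positions, so they need to be mapped back to filenames in the same order the vectors were added, for example with `list(code_snippets.keys())`.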




In [None]:
print("\n📦 Sample Embeddings Output (truncated):\n")
for file, emb in embeddings.items():
    print(f"{file}: [ {', '.join(f'{v:.4f}' for v in emb[:5])} ... ]\n")


📦 Sample Embeddings Output (truncated):

add.py: [ -0.3996, 0.1826, 0.1412, 0.0673, -0.1280 ... ]

factorial.py: [ -0.2689, 0.1082, 0.2437, 0.1180, -0.1852 ... ]

