To build a RAG (Retrieval-Augmented Generation) system with ChromaDB for storing and searching vectorized data, we can adapt your example to handle the `.csv` file containing Java snippets. Here's a step-by-step explanation and an example in Python:

### Steps:

1. **Load the Dataset**: Read the `javaSnippets.csv` file.
2. **Embed Text**: Use a language model to embed the Java snippets into vector representations.
3. **Store in ChromaDB**: Save the embeddings into ChromaDB for retrieval.
4. **Search for Snippets**: Perform similarity searches on the database using vectorized queries.
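
To make step 4 concrete, here is a minimal, library-free sketch of what a similarity search does: score every stored vector against the query vector and keep the best match. The three-dimensional vectors and their labels are made up for illustration; real embeddings come from a model and have hundreds of dimensions, and a vector database would use an index rather than this brute-force scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, stored, k=1):
    """Return the k (label, score) pairs most similar to query_vec."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy "embeddings" (made-up numbers, for illustration only)
toy_store = {
    "compare method": [0.9, 0.1, 0.0],
    "license header": [0.0, 0.2, 0.9],
    "import block": [0.1, 0.9, 0.1],
}

best = top_k([0.8, 0.2, 0.1], toy_store, k=1)
print(best[0][0])
```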

### 1. **Install Required Libraries**

Before starting, install the necessary libraries:

In [23]:
# pip install pandas sentence-transformers chromadb


---

### 2. **Load and Inspect Data**

In [1]:
import pandas as pd

# Step 1: Load the dataset
file_path = './datasets/snippets/javaSnippets.csv'  # Update with your file path
df = pd.read_csv(file_path)

# Inspect the first few rows of the dataset
print("Dataset Preview:")
print(df.head())

# Ensure the column with code snippets is correctly identified
column_name = 'snippet'
assert column_name in df.columns, f"Column '{column_name}' not found in the dataset."

# Extract snippets
snippets = df[column_name].tolist()
print(f"Loaded {len(snippets)} code snippets.")

Dataset Preview:
      id                                            snippet language  \
0  88608  /*\r\n * Copyright 2014 The Netty Project\r\n ...     Java   
1  88609   * with the License. You may obtain a copy of ...     Java   
2  88610   * distributed under the License is distribute...     Java   
3  88611  \r\npackage io.netty.resolver.dns;\r\n\r\nimpo...     Java   
4  88612  \r\nimport java.net.InetSocketAddress;\r\nimpo...     Java   

                                      repo_file_name  \
0  netty/netty/resolver-dns/src/test/java/io/nett...   
1  netty/netty/resolver-dns/src/test/java/io/nett...   
2  netty/netty/resolver-dns/src/test/java/io/nett...   
3  netty/netty/resolver-dns/src/test/java/io/nett...   
4  netty/netty/resolver-dns/src/test/java/io/nett...   

                  github_repo_url     license  \
0  https://github.com/netty/netty  Apache-2.0   
1  https://github.com/netty/netty  Apache-2.0   
2  https://github.com/netty/netty  Apache-2.0   
3  https://github

---
To clean your dataset (`javaSnippets.csv`), you can follow a structured process similar to the example you provided. Here's how you can clean the data:

### Step-by-Step Data Cleaning
1. **Check for Missing Values**:
   Identify which columns have missing values and decide whether to drop rows or columns or fill missing values.

2. **Remove Unnecessary Characters**:
   Since the dataset includes code snippets, it’s essential to clean characters like extra whitespace or unnecessary escape sequences (`\r\n`).

3. **Check for Duplicates**:
   Remove duplicate rows to avoid redundant data.

4. **Validate Data Types**:
   Ensure columns have correct data types (e.g., `starting_line_number` and `chunk_size` should be integers).

5. **Filter Invalid Data**:
   Remove rows with invalid or irrelevant data (e.g., empty or overly short snippets).

Here's the Python code for cleaning your dataset:

In [None]:
# pip install dask[dataframe]



In [2]:
# Drop rows that are missing a value in any of the critical columns
df = df.dropna(subset=["snippet", "repo_file_name", "github_repo_url"])

# Confirm that no missing values remain
df.isna().sum()

id                      0
snippet                 0
language                0
repo_file_name          0
github_repo_url         0
license                 0
commit_hash             0
starting_line_number    0
chunk_size              0
dtype: int64

### Explanation of the Code
1. **Handling Missing Values**:
   - Implemented in the cell above: rows where a critical column (`snippet`, `repo_file_name`, or `github_repo_url`) is empty are dropped, and `df.isna().sum()` confirms that no missing values remain.

The remaining steps are described here but are not applied by the cell above; the snippets stored later in this notebook still contain `\r\n`.

2. **Removing Unnecessary Characters**:
   - Replace `\r\n` with `\n` for better readability of the code snippets.
   - Trim leading/trailing whitespace.

3. **Removing Duplicates**:
   - Remove rows with identical values across all columns.

4. **Validating Data Types**:
   - Ensure numeric columns like `starting_line_number` and `chunk_size` are properly formatted.

5. **Filtering Invalid Snippets**:
   - Remove rows with overly short or invalid snippets (e.g., fewer than 10 characters).
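
Steps 2 to 5 could be sketched roughly as follows. This is one possible implementation, not code from the original notebook: the function name `clean_snippets`, the `min_chars` parameter, and the tiny demo dataframe are assumptions made for illustration, and the column names are taken from the dataset preview above.

```python
import pandas as pd

def clean_snippets(df, min_chars=10):
    """Apply steps 2-5 to a dataframe that has the columns used below."""
    df = df.copy()
    # Step 2: normalise Windows line endings, then trim surrounding whitespace
    df["snippet"] = df["snippet"].str.replace("\r\n", "\n", regex=False).str.strip()
    df["commit_hash"] = df["commit_hash"].str.strip()
    # Step 3: drop rows that are identical across all columns
    df = df.drop_duplicates()
    # Step 4: coerce numeric columns; unparsable values become NaN and are dropped
    for col in ["starting_line_number", "chunk_size"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df.dropna(subset=["starting_line_number", "chunk_size"])
    df = df.astype({"starting_line_number": int, "chunk_size": int})
    # Step 5: drop snippets shorter than min_chars characters
    df = df[df["snippet"].str.len() >= min_chars]
    return df.reset_index(drop=True)

# Hypothetical demo rows: one valid snippet, one too-short snippet, one exact duplicate
demo = pd.DataFrame({
    "snippet": [
        "public int add(int a, int b) {\r\n    return a + b;\r\n}\r\n",
        "x;",
        "public int add(int a, int b) {\r\n    return a + b;\r\n}\r\n",
    ],
    "commit_hash": ["abc123\r\n", "abc123\r\n", "abc123\r\n"],
    "starting_line_number": ["0", "5", "0"],
    "chunk_size": [5, 5, 5],
})
cleaned = clean_snippets(demo)
print(cleaned)
```

Note that trimming whitespace also removes the leading indentation of a snippet's first line; if indentation matters for retrieval, strip only trailing whitespace instead.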

### Next Steps
- Save the cleaned dataset for further processing:
  

In [4]:
pip install datasets


In [9]:
import math
from datasets import Dataset

# Split the dataframe into fixed-size chunks and save each one to disk
def process_in_chunks(df, chunk_size):
    # Ceiling division avoids writing an empty trailing chunk when
    # len(df) is an exact multiple of chunk_size
    num_chunks = math.ceil(len(df) / chunk_size)
    for i in range(num_chunks):
        chunk = df.iloc[i * chunk_size:(i + 1) * chunk_size]
        # preserve_index=False keeps the pandas index out of the saved columns
        dataset = Dataset.from_pandas(chunk, preserve_index=False)
        dataset.save_to_disk(f"./datasets/snippets/cleaned_chunk_{i}")

# Process the dataframe in chunks (adjust the chunk size as needed)
process_in_chunks(df, chunk_size=10000)  # For example, 10,000 rows per chunk


Saving the dataset (1/1 shards): 100%|██████████| 10000/10000 [00:00<00:00, 35855.89 examples/s]

In [14]:
dataset_chunk_0 = Dataset.load_from_disk("./datasets/snippets/cleaned_chunk_0")

dataset_chunk_0[0]

{'id': 88608,
 'snippet': '/*\r\n * Copyright 2014 The Netty Project\r\n *\r\n * The Netty Project licenses this file to you under the Apache License,\r\n * version 2.0 (the "License"); you may not use this file except in compliance\r\n',
 'language': 'Java',
 'repo_file_name': 'netty/netty/resolver-dns/src/test/java/io/netty/resolver/dns/DnsServerAddressesTest.java',
 'github_repo_url': 'https://github.com/netty/netty',
 'license': 'Apache-2.0',
 'commit_hash': 'a60825c3b425892af9be3e9284677aa8a58faa6b\r\n',
 'starting_line_number': 0,
 'chunk_size': 5}

---
Now that you have embeddings working fine, it's time to store them in ChromaDB, a vector database that stores embeddings efficiently and supports similarity search, which is what tasks like semantic search need.

Here’s how you can modify your code to store and retrieve embeddings using ChromaDB:

### Step 1: Install Required Libraries

Make sure you have the required libraries installed:

- `chromadb`
- `pandas`
- `numpy`

You can install them using pip:

In [10]:
# pip install chromadb pandas numpy


---

### **Step 3: Create Collection**
Create a persistent client, then create the collection (or connect to it if it already exists).

In [19]:
import chromadb
import numpy as np
import pandas as pd


# Initialize ChromaDB client (data is persisted in this directory)
chroma_client = chromadb.PersistentClient(path="JavaCodeSnippets-store-vdb")

# Create or connect to collection
collection_name = 'JavaCodeSnippetsDB'

# get_or_create_collection returns the existing collection or creates it,
# so no manual existence check is needed
collection = chroma_client.get_or_create_collection(collection_name)

In [10]:
# Optional reset: uncomment to delete the collection and start over.
# Do not run this between creating the collection and adding data to it.
# chroma_client.delete_collection(collection_name)

In [20]:
if collection:
    print("ChromaDB is working! Collections found:", collection)
else:
    print("Collection is not available; re-run the cell that creates it.")

ChromaDB is working! Collections found: Collection(name=JavaCodeSnippetsDB)



---

### **Step 4: Add Data to the Collection**
You can now insert documents into the collection.

In [38]:
import os
import json
from datasets import Dataset
from tqdm import tqdm

In [39]:
dataset_dir = "./datasets/snippets/"
progress_file = "./progress.json" 

# Load progress from file, if it exists
def load_progress():
    if os.path.exists(progress_file):
        with open(progress_file, 'r') as f:
            return json.load(f)
    return {"chunk": None, "last_row_idx": 0}

# Save progress to file
def save_progress(chunk_dir, row_idx):
    progress = {"chunk": chunk_dir, "last_row_idx": row_idx}
    with open(progress_file, 'w') as f:
        json.dump(progress, f)
    print(f"Progress saved: {chunk_dir}, row {row_idx}", end="\r")

    
# List all the subdirectories in the dataset directory
chunk_dirs = [f for f in os.listdir(dataset_dir) 
              if os.path.isdir(os.path.join(dataset_dir, f)) and f.startswith('cleaned_chunk_')]

print("Found directories:", chunk_dirs)
total_chunks = len(chunk_dirs)
print(f"Total chunks to process: {total_chunks}")

Found directories: ['cleaned_chunk_0', 'cleaned_chunk_1', 'cleaned_chunk_10', 'cleaned_chunk_100', 'cleaned_chunk_101', 'cleaned_chunk_102', 'cleaned_chunk_103', 'cleaned_chunk_104', 'cleaned_chunk_105', 'cleaned_chunk_106', 'cleaned_chunk_107', 'cleaned_chunk_108', 'cleaned_chunk_109', 'cleaned_chunk_11', 'cleaned_chunk_110', 'cleaned_chunk_111', 'cleaned_chunk_112', 'cleaned_chunk_113', 'cleaned_chunk_114', 'cleaned_chunk_115', 'cleaned_chunk_116', 'cleaned_chunk_117', 'cleaned_chunk_118', 'cleaned_chunk_119', 'cleaned_chunk_12', 'cleaned_chunk_120', 'cleaned_chunk_121', 'cleaned_chunk_122', 'cleaned_chunk_123', 'cleaned_chunk_124', 'cleaned_chunk_125', 'cleaned_chunk_126', 'cleaned_chunk_127', 'cleaned_chunk_128', 'cleaned_chunk_129', 'cleaned_chunk_13', 'cleaned_chunk_130', 'cleaned_chunk_131', 'cleaned_chunk_132', 'cleaned_chunk_133', 'cleaned_chunk_134', 'cleaned_chunk_135', 'cleaned_chunk_136', 'cleaned_chunk_137', 'cleaned_chunk_138', 'cleaned_chunk_139', 'cleaned_chunk_14', 'c
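
The listing above is in lexicographic order (`cleaned_chunk_10` comes before `cleaned_chunk_2`), and `os.listdir` does not guarantee any particular order. If the chunks should be processed in numeric order, a key function like the one below could be used. The helper name `chunk_number` and the sample names are illustrative assumptions, not part of the original notebook.

```python
import re

def chunk_number(name):
    """Extract the trailing integer from a name like 'cleaned_chunk_12'."""
    match = re.search(r"(\d+)$", name)
    # Names without a trailing number sort first
    return int(match.group(1)) if match else -1

# Sample names in the lexicographic order seen in the listing
names = [
    "cleaned_chunk_0",
    "cleaned_chunk_1",
    "cleaned_chunk_10",
    "cleaned_chunk_100",
    "cleaned_chunk_2",
]
ordered = sorted(names, key=chunk_number)
print(ordered)
```

In the notebook this would amount to `chunk_dirs = sorted(chunk_dirs, key=chunk_number)` before the processing loop.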

In [40]:
# Load the last saved progress (if any); the defaults mean "start from the beginning"
progress = load_progress()
start_chunk = progress.get('chunk')
start_row_idx = progress.get('last_row_idx', 0)
# Total number of chunks
total_chunks = len(chunk_dirs)
print(f"Total chunks to process: {total_chunks}")
print(f"\tCurrent chunks: {start_chunk}")
print(f"\tCurrent Index: {start_row_idx}")

Total chunks to process: 755
	Current chunks: cleaned_chunk_0
	Current Index: 6563
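
Because the progress file is rewritten every 50 rows, an interruption in the middle of a write could leave it truncated and unreadable on the next start. One way to guard against that, sketched here under the assumption that the same JSON layout is kept, is to write to a temporary file and swap it into place, and to fall back to the defaults when the file cannot be parsed. The function names `save_progress_atomic` and `load_progress_safe` are hypothetical.

```python
import json
import os
import tempfile

def save_progress_atomic(path, chunk_dir, row_idx):
    """Write the progress file so that a crash never leaves it half-written."""
    progress = {"chunk": chunk_dir, "last_row_idx": row_idx}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it into place
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(progress, f)
    os.replace(tmp_path, path)

def load_progress_safe(path):
    """Return saved progress, or the defaults if the file is missing or unreadable."""
    try:
        with open(path, "r") as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"chunk": None, "last_row_idx": 0}
```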


In [None]:
# Process each chunk
for idx, chunk_dir in enumerate(chunk_dirs, 1):
    dataset_chunk = Dataset.load_from_disk(os.path.join(dataset_dir, chunk_dir))
    total_rows = len(dataset_chunk)

    # Resume from the saved row only in the chunk recorded in the progress file.
    # A separate offset keeps start_row_idx intact until that chunk is reached.
    offset = 0
    if start_chunk and chunk_dir == start_chunk:
        start_chunk = None  # the saved position applies only once
        if start_row_idx >= total_rows:
            print(f"Skipping {chunk_dir}, start_row_idx exceeds total_rows.")
            save_progress(chunk_dir, 0)
            continue
        offset = start_row_idx
        print(f"Resuming {chunk_dir} from row {offset}")

    # select() returns a Dataset that iterates row by row; slicing with
    # dataset_chunk[offset:] would return a dict of columns instead
    rows_to_process = dataset_chunk.select(range(offset, total_rows))

    # Process rows in the current chunk
    for row_idx, row in enumerate(tqdm(rows_to_process, desc=f"Processing {chunk_dir}",
                                       total=total_rows - offset, ncols=100, unit="row"),
                                  start=offset):
        try:
            ids = [str(row['id'])]
            documents = [row['snippet']]
            metadatas = [{
                "type": "snippet",
                "repo": row['repo_file_name'],
                "language": row['language']
            }]

            # Avoid duplicates: only add rows whose ID is not stored yet
            existing_ids = collection.get(ids=ids)['ids']
            if not existing_ids:
                collection.add(
                    ids=ids,
                    documents=documents,
                    metadatas=metadatas
                )
        except Exception as e:
            print(f"Error processing row {row_idx} in {chunk_dir}: {e}")

        # Save progress every 50 rows
        if row_idx % 50 == 0:
            save_progress(chunk_dir, row_idx + 1)

    # After finishing a chunk, reset the saved row index
    # (the duplicate check above makes re-running this chunk safe)
    save_progress(chunk_dir, 0)

    # Display progress
    chunk_progress = (idx / total_chunks) * 100
    print(f"\033[KProcessed {idx}/{total_chunks} chunks ({chunk_progress:.2f}%)", end="\r")

print("\n✅ All chunks have been added to the collection successfully!")

Processing cleaned_chunk_1:  16%|████▍                        | 1551/10000 [07:07<56:25,  2.50row/s]

Progress saved: cleaned_chunk_1, row 1551

Processing cleaned_chunk_1:  16%|████▎                      | 1601/10000 [07:26<1:01:14,  2.29row/s]

Progress saved: cleaned_chunk_1, row 1601

Processing cleaned_chunk_1:  17%|████▊                        | 1651/10000 [07:47<51:52,  2.68row/s]

Progress saved: cleaned_chunk_1, row 1651

Processing cleaned_chunk_1:  17%|████▉                        | 1701/10000 [08:06<56:40,  2.44row/s]

Progress saved: cleaned_chunk_1, row 1701

Processing cleaned_chunk_1:  18%|█████                        | 1751/10000 [08:27<54:35,  2.52row/s]

Progress saved: cleaned_chunk_1, row 1751

Processing cleaned_chunk_1:  18%|████▊                      | 1801/10000 [08:57<2:17:11,  1.00s/row]

Progress saved: cleaned_chunk_1, row 1801

Processing cleaned_chunk_1:  18%|████▉                      | 1807/10000 [09:00<1:20:51,  1.69row/s]Exception occurred invoking consumer for subscription b90ef180c8d3423d97d10fe0f92bb295to topic persistent://default/default/d2a3e480-de6c-4bee-84e3-95ca4a05e2d2 
Processing cleaned_chunk_1:  18%|████▉                      | 1833/10000 [09:16<1:17:54,  1.75row/s]
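
The loop above sends one row per call and looks up the ID first, and the progress output shows a throughput of roughly 2.5 rows per second. A possible alternative, sketched below, is to group rows into batches and pass each batch to the collection's `upsert` method, which adds new IDs and overwrites existing ones instead of skipping them. The helper names `iter_batches` and `upsert_rows` and the batch size of 500 are assumptions for illustration, and the sketch assumes that IDs are unique within a batch.

```python
def iter_batches(rows, batch_size):
    """Yield consecutive lists of at most batch_size rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def upsert_rows(collection, rows, batch_size=500):
    """Send rows to the collection in batches; return the number of rows sent."""
    sent = 0
    for batch in iter_batches(rows, batch_size):
        collection.upsert(
            ids=[str(row["id"]) for row in batch],
            documents=[row["snippet"] for row in batch],
            metadatas=[
                {"type": "snippet", "repo": row["repo_file_name"], "language": row["language"]}
                for row in batch
            ],
        )
        sent += len(batch)
    return sent
```

With the notebook's objects this would be called as `upsert_rows(collection, dataset_chunk)` for each chunk. Whether batching helps in practice depends on where the time is spent, so it is worth timing on a single chunk first.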

In [46]:
collection.peek()

{'ids': ['88608',
  '88609',
  '88610',
  '88611',
  '88612',
  '88613',
  '88614',
  '88615',
  '88616',
  '88617'],
 'embeddings': array([[-7.49901228e-05,  1.13361422e-02, -9.46090091e-03, ...,
         -1.62886903e-02,  2.23879162e-02, -5.50048575e-02],
        [ 3.67557779e-02, -2.23476384e-02, -4.64685336e-02, ...,
          6.27410263e-02,  4.71353680e-02, -4.86179851e-02],
        [-6.68321401e-02,  9.62347910e-03, -3.16255689e-02, ...,
          1.55157335e-02,  1.53038085e-01, -5.40957153e-02],
        ...,
        [ 6.40597418e-02,  2.80569047e-02, -1.88555662e-02, ...,
          7.03850985e-02, -6.30347878e-02, -2.12625600e-02],
        [-4.71711643e-02, -4.86637540e-02, -1.74724516e-02, ...,
          1.06600709e-02,  5.52940881e-04, -3.09259407e-02],
        [-5.24540320e-02,  2.24042162e-02, -2.29494162e-02, ...,
          7.10451677e-02, -1.19481666e-03,  3.72953597e-03]],
       shape=(10, 384)),
 'documents': ['/*\r\n * Copyright 2014 The Netty Project\r\n *\r\n * The

In [47]:
# Get the number of items in the collection
num_items = collection.count()

print(f"Number of items in the collection: {num_items}")

Number of items in the collection: 10375


---
### **Step 5: Query the Collection**
After the collection is ready, we can query it using the collection's `query()` method and print the results it returns.

In [7]:
# query_texts takes a list of query strings; each one is embedded and matched
results = collection.query(query_texts=["compareTo function"], n_results=1)

print(results)

{'ids': [['88884']], 'embeddings': None, 'documents': [['                    @Override\r\n                    public int compare(InetSocketAddress o1, InetSocketAddress o2) {\r\n                        if (o1.equals(o2)) {\r\n                            return 0;\r\n                        }\r\n']], 'uris': None, 'data': None, 'metadatas': [[{'language': 'Java', 'repo': 'netty/netty/resolver-dns/src/test/java/io/netty/resolver/dns/DefaultAuthoritativeDnsServerCacheTest.java', 'type': 'snippet'}]], 'distances': [[1.0026335716247559]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}
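
The result is a dictionary of nested lists: the outer list has one entry per query text and the inner list has one entry per hit. A small helper can flatten this into one record per hit, which is easier to print or pass on to a prompt. The function name `flatten_results` is an assumption, and the sample dictionary is an abbreviated copy of the output above with the document text shortened. Smaller distances mean closer matches.

```python
def flatten_results(results):
    """Turn the nested query() result into one flat record per hit."""
    records = []
    # Each outer list holds the hits for one query text
    for q_idx, ids in enumerate(results["ids"]):
        for hit_idx, doc_id in enumerate(ids):
            records.append({
                "query_index": q_idx,
                "id": doc_id,
                "document": results["documents"][q_idx][hit_idx],
                "metadata": results["metadatas"][q_idx][hit_idx],
                "distance": results["distances"][q_idx][hit_idx],
            })
    return records

# Abbreviated copy of the result shown above (document text shortened)
sample = {
    "ids": [["88884"]],
    "documents": [["@Override ..."]],
    "metadatas": [[{"language": "Java", "type": "snippet"}]],
    "distances": [[1.0026335716247559]],
}
for rec in flatten_results(sample):
    print(rec["id"], round(rec["distance"], 3), rec["metadata"]["language"])
```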


In [26]:
# # Note: query() is a similarity search, so this returns the n_results nearest
# # neighbours of the text "*", not a guaranteed dump of every item
# results = collection.query(query_texts="*", n_results=10000)  # Adjust the n_results as needed (limit)

# # Ensure the directory exists
# directory_path = './JavaCodeSnippetsDB'
# if not os.path.exists(directory_path):
#     os.makedirs(directory_path)

# # Define the path for the file to save the results
# file_path = os.path.join(directory_path, 'full_collection_data.json')

# # Save the entire collection data to a JSON file
# with open(file_path, 'w') as f:
#     json.dump(results, f)

# print(f"Entire collection data saved to {file_path}")

Number of requested results 10000 is greater than number of elements in index 3696, updating n_results = 3696


Entire collection data saved to ./JavaCodeSnippetsDB\full_collection_data.json
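
Since `query()` ranks items by similarity, a full export is better served by the collection's `get()` method, which returns stored items without a query text. The sketch below pages through the collection with `limit` and `offset` and writes IDs, documents, and metadata to JSON. The function name `export_collection` and the page size are assumptions; embeddings are left out because they are NumPy arrays and would need converting before `json.dump` can handle them.

```python
import json
import os

def export_collection(collection, path, page_size=1000):
    """Page through collection.get() and write ids, documents and metadatas to JSON."""
    exported = {"ids": [], "documents": [], "metadatas": []}
    offset = 0
    while True:
        page = collection.get(limit=page_size, offset=offset,
                              include=["documents", "metadatas"])
        if not page["ids"]:
            break
        for key in exported:
            exported[key].extend(page[key])
        offset += len(page["ids"])
    # Make sure the target directory exists before writing
    os.makedirs(os.path.dirname(os.path.abspath(path)), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(exported, f)
    return len(exported["ids"])
```

With the notebook's objects this would be called as `export_collection(collection, './JavaCodeSnippetsDB/full_collection_data.json')`.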
