# Advanced Commit Search

#### Configuration Constants

These configuration variables control the behavior of the vector database operations:

- **VECTOR_DB_RESULT_LIMIT**: Maximum number of similar commits to return in search results (set to 5)
- **VECTOR_DB_INSERT_BATCH_SIZE**: Number of vectors to insert into Qdrant in each batch operation (set to 100)

These constants help optimize performance and manage resource usage throughout the application.

In [1]:
VECTOR_DB_RESULT_LIMIT = 5
VECTOR_DB_INSERT_BATCH_SIZE = 100
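
Both values are hard-coded in the cell above. If they should be overridable without editing the notebook, one option is to read them from the process environment and fall back to the same defaults. A minimal sketch, assuming a hypothetical helper named `env_int` that is not part of the original notebook:

```python
import os

def env_int(name, default, env=None):
    """Return env[name] parsed as an int, or `default` when unset or blank."""
    env = os.environ if env is None else env
    raw = env.get(name, "")
    return int(raw) if raw.strip() else default

# Same defaults as the notebook cell; export the variables to override them.
VECTOR_DB_RESULT_LIMIT = env_int("VECTOR_DB_RESULT_LIMIT", 5)
VECTOR_DB_INSERT_BATCH_SIZE = env_int("VECTOR_DB_INSERT_BATCH_SIZE", 100)
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.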

#### Imports and Downloads

This section imports all the necessary libraries and downloads required NLTK data:

**Core Libraries:**
- `re`: Regular expressions for text processing
- `nltk`: Natural Language Toolkit for sentence tokenization
- `xml.etree.ElementTree`: XML parsing capabilities
- `tqdm`: Progress bars for long-running operations
- `os`: File system operations
- `sqlite3`: SQLite database operations
- `subprocess`: Running shell commands

**Machine Learning & Vector Database:**
- `sentence_transformers`: Convert text to semantic embeddings
- `qdrant_client`: Vector database for similarity search
- `qdrant_client.models`: Data structures for vector operations

**NLTK Downloads:**
- `punkt`: Sentence tokenizer models
- `punkt_tab`: Additional tokenization data

These libraries enable the complete pipeline from data extraction to semantic search.

In [3]:
import re
import nltk
from nltk.tokenize import sent_tokenize
import xml.etree.ElementTree as ET
from tqdm import tqdm
import os
from qdrant_client.http.models import Filter, FieldCondition, MatchValue
import sqlite3
import subprocess
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

nltk.download('punkt')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /home/jovyan/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## 1- Download Commit Messages

This section executes the shell script to download commit messages from Git repositories:

**Process Flow:**
1. **Script Execution**: Runs `export_commit_messages.sh` using subprocess
2. **Real-time Output**: Streams stdout in real-time to show progress
3. **Error Handling**: Captures and displays any errors from stderr
4. **Exit Validation**: Checks return code to ensure successful completion

The script downloads commit data and converts it to XML format for further processing. This step is essential for gathering the raw data that will be processed into embeddings.
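
One caveat when both streams are piped separately: if the child process writes a lot to stderr while the parent is still reading stdout, the stderr pipe can fill up and stall the child. A minimal sketch that sidesteps this by merging stderr into stdout. The helper name `run_streaming` and the stand-in child command are assumptions for illustration, not part of the notebook:

```python
import subprocess
import sys

def run_streaming(cmd):
    """Run cmd, echo its merged stdout/stderr line by line,
    and return (returncode, list of output lines)."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # one pipe to drain, so no stall on stderr
        text=True,
    )
    lines = []
    for line in proc.stdout:
        print(line, end="")
        lines.append(line.rstrip("\n"))
    proc.wait()
    return proc.returncode, lines

# Stand-in child process; the notebook would pass
# ['bash', 'export_commit_messages.sh'] instead.
code, output = run_streaming([sys.executable, "-c", "print('exporting...')"])
if code != 0:
    raise RuntimeError(f"export failed with exit code {code}")
```

The trade-off is that error text is no longer separated from normal output; it simply appears inline where the script emitted it.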

In [4]:
process = subprocess.Popen(['bash', 'export_commit_messages.sh'], stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)

for line in process.stdout:
    print(line, end='')

process.wait()
if process.returncode != 0:
    stderr = process.stderr.read()
    print("Error downloading commit messages:", stderr)
    exit(1)
print("Commit messages downloaded successfully.")


Git Repository Log Export Script
Number of repositories to process: 2
Number of logs to export per repository: 9999999
Output directory: XML_commit_messages

Processing repository 1/2: https://github.com/carbon-language/carbon-lang.git
Step 1: Cloning repository...
✓ Repository cloned successfully

Step 2: Exporting logs...
Getting commit list...
Found 4239 commit(s) to process

Processing commit 1/4239...
Processing commit 2/4239...
Processing commit 3/4239...
Processing commit 4/4239...
Processing commit 5/4239...
Processing commit 6/4239...
Processing commit 7/4239...
Processing commit 8/4239...
Processing commit 9/4239...
Processing commit 10/4239...
Processing commit 11/4239...
Processing commit 12/4239...
Processing commit 13/4239...

## 2- Validate XML Files with XSD

You will see an error in the carbon-lang repository at commit afa265dd5710a81bda1ad81fc992f5441e9a15da because its commit message contains invalid characters. Remove the invalid characters and re-run the cell.

This step validates the downloaded XML files against a predefined XSD schema:

**Validation Process:**
1. **Schema Validation**: Uses `validate-xml.sh` to check XML structure
2. **Error Detection**: Identifies malformed XML or invalid characters
3. **Data Quality**: Ensures XML files meet expected format requirements
4. **Pre-processing Check**: Validates data before proceeding to parsing

**Common Issues:**
- Invalid characters in commit messages (like control characters)
- Malformed XML structure
- Encoding problems

XML validation is crucial for preventing parsing errors in subsequent steps.
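
Removing invalid characters can be automated. XML 1.0 only allows tab, line feed, carriage return, and the code point ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. The sketch below strips everything else; the function name is hypothetical, and where to apply it (in the export script or on the generated files) is left open:

```python
import re
import xml.etree.ElementTree as ET

# Anything outside the XML 1.0 "Char" production: most C0 control codes,
# surrogates, U+FFFE and U+FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed."""
    return _INVALID_XML_CHARS.sub("", text)

raw = "<message>Fix typo\x08 in parser\x00</message>"
clean = strip_invalid_xml_chars(raw)
element = ET.fromstring(clean)  # parses once the control characters are gone
print(element.text)
```

Note that this silently drops characters, so the stored message can differ slightly from what `git log` shows for that commit.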

In [9]:
result = subprocess.run(['bash', 'validate-xml.sh'], capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    print("Error validating XML file:", result.stderr)
    exit(1)
print("XML file validated successfully.")

XML Validation Script
Schema file: repository-commits.xsd
XML directory: XML_commit_messages

Validating schema file...
✓ Schema file is valid

Validating XML files...

Validating carbon-lang_commits.xml... ✓ VALID
Validating pyrefly_commits.xml... ✓ VALID

Validation Summary:
  Files found: 2
  Valid files: 2
  Invalid files: 0

✓ All XML files are valid according to the schema

XML file validated successfully.


## 3- Create Qdrant Collection

### Paragraph to Sentence

In [10]:
def paragraph_to_sentences(paragraph):
    # Split by two or more consecutive newlines
    parts = re.split(r'\n{2,}', paragraph)
    sentences = []
    for part in parts:
        # Replace single newlines with spaces
        part = part.replace('\n', ' ')
        sentences.extend(sent_tokenize(part))
    return sentences

This function processes commit messages that may contain multiple paragraphs or sentences:

1. **Split paragraphs**: Uses regex to split on two or more consecutive newlines
2. **Normalize whitespace**: Replaces single newlines with spaces within paragraphs
3. **Sentence tokenization**: Uses NLTK to properly split text into individual sentences

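This preprocessing is meant to improve embedding quality by creating more focused, sentence-level chunks. Note that the embedding cell later in this notebook encodes each full commit message and does not call this function.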
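
To make the first two steps concrete, the sketch below reproduces only the paragraph split and the newline normalization with the standard library, leaving out the NLTK sentence tokenizer. The helper name `split_paragraphs` and the sample message are made up for illustration, and unlike the original function it also trims whitespace and drops empty parts:

```python
import re

def split_paragraphs(message):
    """Split on blank lines, then fold remaining newlines into spaces."""
    parts = re.split(r"\n{2,}", message)
    return [part.replace("\n", " ").strip() for part in parts if part.strip()]

sample = "Fix parser crash\n\nHandle empty input\nbefore tokenizing."
paragraphs = split_paragraphs(sample)
print(paragraphs)
```

Each resulting paragraph would then be handed to `sent_tokenize` to obtain individual sentences.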

### XML Parser

In [11]:
def parse_commits(xml_file):
    tree = ET.parse(xml_file)
    root = tree.getroot()
    repo = root.find('repository')
    repo_url = repo.find('url').text if repo is not None else None
    repo_name = repo.find('name').text if repo is not None else None
    commits = []
    for commit in tqdm(root.findall('.//commit'), desc="Parsing commits"):
        # findtext returns '' for an empty <message/>, so later steps never see None
        message = commit.findtext('message', default='')
        author = commit.find('author').text
        date = commit.find('date').text
        commit_hash = commit.find('hash').text  # avoid shadowing the built-in hash()
        commits.append({"message": message,
                        "author": author,
                        "date": date,
                        "hash": commit_hash,
                        "repo_url": repo_url,
                        "repo_name": repo_name})
    return commits

Parse XML files containing commit data and extract structured information:

1. **XML parsing**: Use ElementTree to parse the XML structure
2. **Repository metadata**: Extract repository URL and name from the `repository` element under the root
3. **Commit iteration**: Loop through all commit elements with a progress bar
4. **Data extraction**: Extract message, author, date, and hash for each commit
5. **Return structured data**: Create a list of dictionaries with all commit information

The function returns a standardized format that can be easily processed by other parts of the pipeline.
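
For reference, here is a small self-contained run of the same kind of ElementTree lookups against an inline document. The element layout is an assumption inferred from the tag names that `parse_commits` looks up; the actual export format is defined by the shell script and the XSD, which are not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout, inferred from the tags that parse_commits reads.
SAMPLE = """<repository_commits>
  <repository>
    <url>https://example.com/demo.git</url>
    <name>demo</name>
  </repository>
  <commits>
    <commit>
      <hash>abc123</hash>
      <author>dev@example.com</author>
      <date>2024-01-01 00:00:00 +0000</date>
      <message>Fix bug in lexer</message>
    </commit>
    <commit>
      <hash>def456</hash>
      <author>dev@example.com</author>
      <date>2024-01-02 00:00:00 +0000</date>
      <message></message>
    </commit>
  </commits>
</repository_commits>"""

root = ET.fromstring(SAMPLE)
repo_name = root.findtext("repository/name")
rows = [
    {
        "hash": c.findtext("hash"),
        # findtext yields "" (not None) when the element exists but is empty
        "message": c.findtext("message", default=""),
        "repo_name": repo_name,
    }
    for c in root.findall(".//commit")
]
print(rows)
```

The second commit shows why empty elements deserve attention: `Element.text` would be `None` there, which neither the embedding model nor a `NOT NULL` column accepts.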

In [12]:
commits = []
xml_dir = 'XML_commit_messages'
for filename in os.listdir(xml_dir):
    if filename.endswith('.xml'):
        file_path = os.path.join(xml_dir, filename)
        commits.extend(parse_commits(file_path))

Parsing commits: 100%|██████████| 6531/6531 [00:00<00:00, 1107817.34it/s]
Parsing commits: 100%|██████████| 4239/4239 [00:00<00:00, 1023289.48it/s]


Process all XML files in the XML_commit_messages directory and combine their commit data into a single list. This allows us to work with commits from multiple repositories in a unified way.

**Multi-Repository Processing:**

This code iterates through all XML files in the `XML_commit_messages` directory and processes them:

1. **Directory Scanning**: Lists all files in the XML directory
2. **File Filtering**: Only processes files with `.xml` extension
3. **Batch Processing**: Calls `parse_commits()` for each XML file
4. **Data Aggregation**: Combines commits from all repositories into a single list

This approach lets the system handle multiple Git repositories in one pass, creating a unified dataset for analysis. Each repository's commits are parsed and appended to the master `commits` list.
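
`os.listdir` returns entries in arbitrary order, and the Qdrant point IDs assigned later follow the order of the `commits` list, so IDs can differ between runs or machines. If stable ordering matters, one option is to sort the paths first. A sketch with a hypothetical helper named `list_xml_files`:

```python
import tempfile
from pathlib import Path

def list_xml_files(xml_dir):
    """Return the .xml files directly inside xml_dir, sorted by path."""
    return sorted(p for p in Path(xml_dir).iterdir() if p.suffix == ".xml")

# Demo on a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_commits.xml", "a_commits.xml", "notes.txt"):
        (Path(tmp) / name).write_text("<root/>")
    names = [p.name for p in list_xml_files(tmp)]
print(names)
```

In the notebook, the loop would then iterate over `list_xml_files(xml_dir)` and pass each path to `parse_commits`.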

### Generate Embeddings

In [13]:
model_name = 'multi-qa-MiniLM-L6-cos-v1'
print(f"Loading model {model_name} ...")
model = SentenceTransformer(model_name)  # better for query → doc
print("Generating embeddings...")

Loading model multi-qa-MiniLM-L6-cos-v1 ...
Generating embeddings...


Load the SentenceTransformer model for generating embeddings:

- **Model choice**: `multi-qa-MiniLM-L6-cos-v1` is optimized for query-to-document similarity
- **Size**: Produces 384-dimensional vectors
- **Performance**: Good balance between speed and quality for semantic search tasks

This model will convert commit messages into numerical vectors that capture their semantic meaning.

### Set Up Qdrant & Store Embeddings

In [14]:
client = QdrantClient(host='qdrant', port=6333)

# Recreate the collection from scratch so that reruns start empty
if client.collection_exists(collection_name="commits"):
    client.delete_collection(collection_name="commits")
client.create_collection(
    collection_name="commits",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)


points = [PointStruct(id=i, vector=model.encode(c['message'], convert_to_numpy=True).tolist(), payload={"commit-hash": c['hash'], "author": c['author'], "date": c['date'], "message": c['message']})
          for i, c in enumerate(commits)]


for i in tqdm(range(0, len(points), VECTOR_DB_INSERT_BATCH_SIZE), desc="Upserting to Qdrant"):
    batch = points[i:i+VECTOR_DB_INSERT_BATCH_SIZE]
    client.upsert(collection_name="commits", points=batch)

Upserting to Qdrant: 100%|██████████| 108/108 [00:03<00:00, 34.60it/s]


Set up the Qdrant vector database and store commit embeddings:

1. **Collection management**: Create or recreate the "commits" collection
2. **Vector configuration**: 384-dimensional vectors with cosine similarity
3. **Embedding generation**: Convert each commit message to a vector
4. **Batch insertion**: Insert vectors in batches for better performance
5. **Metadata storage**: Store commit hash, author, date, and message as payload

This creates a searchable vector database where we can find semantically similar commits.
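
The batch loop is plain list slicing. As a check on the arithmetic: the two parsing runs above report 6531 and 4239 commits, 10770 in total, which at 100 points per batch gives 108 batches, matching the progress bar. A minimal sketch with a hypothetical `batched` helper:

```python
import math

def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

total = 6531 + 4239  # commit counts reported by the two parsing runs
batches = list(batched(list(range(total)), 100))

# Number of batches and size of the final, partially filled batch
print(len(batches), len(batches[-1]))
assert len(batches) == math.ceil(total / 100)
```

Every batch except possibly the last is full, so the final `upsert` call carries the remainder.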

### Embed User Input

In [15]:
def embed_user_query(query):
    return model.encode(query, convert_to_numpy=True)

Convert user search queries into embeddings using the same model used for commit messages. This ensures that queries and documents exist in the same vector space for accurate similarity comparison.
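
Cosine similarity, the metric configured for the collection, compares the direction of two vectors and ignores their length. A small pure-Python sketch on toy 3-dimensional vectors (real embeddings from this model have 384 dimensions; the function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [1.0, 0.0, 1.0]
same_direction = cosine_similarity(query, [2.0, 0.0, 2.0])  # close to 1
orthogonal = cosine_similarity(query, [0.0, 1.0, 0.0])      # 0
print(same_direction, orthogonal)
```

Scaling a vector does not change its score, which is why the second vector, twice as long as the query, still counts as a near-perfect match.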

### Search Qdrant for Similar Commit Messages

In [16]:
def search_similar_commit(query):
    vector = embed_user_query(query)
    results = client.search(
        collection_name="commits",
        query_vector=vector.tolist(),
        limit=VECTOR_DB_RESULT_LIMIT,
        with_payload=True,
    )
    if not results:
        return "No similar commit found."
    return results

**Semantic Search Function:**

The `search_similar_commit` function performs vector-based similarity search on commit messages:

**Process Flow:**
1. **Query Embedding**: Convert the user's search query into a vector using the same model
2. **Vector Search**: Use Qdrant to find the most similar commit message vectors
3. **Result Limiting**: Return only the top N results (defined by `VECTOR_DB_RESULT_LIMIT`)
4. **Payload Inclusion**: Include commit metadata (hash, author, date, message) in results
5. **Fallback**: Return the string "No similar commit found." when the search yields no results

**Key Features:**
- Uses cosine similarity to measure semantic closeness
- Returns structured results with similarity scores
- Maintains consistent vector space between queries and documents

Note that callers receive either a list of scored points or a plain string, so they need to handle both cases. Recent `qdrant-client` releases also mark `client.search` as deprecated in favor of `query_points`.

### Example Usage

In [17]:
x = search_similar_commit("bug fix")
x


[ScoredPoint(id=10115, version=101, score=0.6703605, payload={'commit-hash': '70a7839a3199e044f391573a184d050beff095ab', 'author': 'gromer@google.com', 'date': '2021-10-27 17:09:03 -0700', 'message': 'Fix bug from #909 (#924)'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id=9564, version=95, score=0.56431377, payload={'commit-hash': '4192ee42a6e8aab7cfa60d5d8328df83a8032790', 'author': 'josh11b@users.noreply.github.com', 'date': '2022-08-30 08:25:24 -0700', 'message': 'Misc small fixes (#2123)'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id=10425, version=104, score=0.5512945, payload={'commit-hash': '82f5c4224b25f793619585786f0368915e96479c', 'author': '46229924+jonmeow@users.noreply.github.com', 'date': '2021-05-14 13:23:10 -0700', 'message': 'Small fixes from #530 (#537)'}, vector=None, shard_key=None, order_value=None),
 ScoredPoint(id=9719, version=97, score=0.52749014, payload={'commit-hash': '8ed38c42f398c8051cdb833c4f621d0b96e56732', 'author

**Demo Search Query:**

This example demonstrates the semantic search functionality by searching for commits related to "bug fix":

**What This Does:**
- Converts "bug fix" into a vector representation
- Searches the commit database for semantically similar messages
- Returns the top 5 most similar commits with their metadata
- Shows similarity scores indicating how closely each commit matches the query

**Expected Results:**
- Commits containing words like "fix", "bug", "issue", "resolve"
- Commits with similar semantic meaning even without exact word matches
- Results ranked by semantic similarity score
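
The raw `ScoredPoint` list is hard to scan. A small formatter can reduce each hit to its score, a shortened hash, and the first line of the message. This is a sketch: `format_hits` is a hypothetical helper, and the demo uses a `SimpleNamespace` stand-in carrying the values of the first hit shown above rather than a real Qdrant object.

```python
from types import SimpleNamespace

def format_hits(hits):
    """Render search hits as 'score  short-hash  first line of message'."""
    if isinstance(hits, str):  # the "no similar commit" fallback string
        return [hits]
    lines = []
    for hit in hits:
        payload = hit.payload or {}
        message = payload.get("message") or ""
        first_line = message.splitlines()[0] if message else ""
        short_hash = payload.get("commit-hash", "")[:8]
        lines.append(f"{hit.score:.3f}  {short_hash}  {first_line}")
    return lines

# Stand-in object mimicking the first ScoredPoint in the output above
demo = [SimpleNamespace(score=0.6703605, payload={
    "commit-hash": "70a7839a3199e044f391573a184d050beff095ab",
    "message": "Fix bug from #909 (#924)",
})]
print("\n".join(format_hits(demo)))
```

Because the helper only reads `.score` and `.payload`, it should work on the objects returned by the search function as well as on the fallback string.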

## 4- Create a SQLite Database

### SQLite Database Setup

This section creates a SQLite database to store commit messages in a relational format. The database will have two tables:
- `repositories`: stores repository information (URL and name)
- `commits`: stores commit details with a foreign key reference to repositories

SQLite provides ACID transactions and allows us to perform complex queries on the commit data.

In [18]:
conn = sqlite3.connect('commit_messages.db')
cursor = conn.cursor()

**Database Connection Setup:**

This code establishes a connection to the SQLite database:

- **Database File**: Creates or connects to `commit_messages.db`
- **Auto-Creation**: SQLite automatically creates the file if it doesn't exist
- **Cursor Object**: Provides an interface to execute SQL commands
- **Local Storage**: Stores data persistently on the local file system

SQLite is chosen for its simplicity, zero-configuration setup, and ability to handle the commit data efficiently without requiring a separate database server.


### Drop existing tables to ensure clean schema

In [19]:
cursor.execute('DROP TABLE IF EXISTS commits')
cursor.execute('DROP TABLE IF EXISTS repositories')

<sqlite3.Cursor at 0x7dbcaefbc8c0>

**Clean Database Reset:**

These SQL commands ensure a fresh start by removing any existing tables:

- **DROP TABLE IF EXISTS**: Safely removes tables without errors if they don't exist
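- **Order Matters**: Drops the child table `commits` before its parent `repositories`. The order only becomes mandatory when foreign key enforcement is switched on, since SQLite leaves it off by default
- **Clean Slate**: Prevents schema conflicts or data inconsistencies from previous runs
- **Repeatable**: Running the cell again yields the same empty state, at the cost of discarding all existing rows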

This approach ensures that each notebook run starts with a consistent, empty database schema.


### Create Tables

In [20]:
cursor.execute('''
    CREATE TABLE IF NOT EXISTS repositories (
        repository_url TEXT PRIMARY KEY,
        repository_name TEXT NOT NULL
    )
''')


<sqlite3.Cursor at 0x7dbcaefbc8c0>

**Repositories Table Schema:**

This creates the parent table for storing repository information:

**Table Structure:**
- **repository_url**: Primary key, unique identifier for each repository
- **repository_name**: Human-readable name of the repository
- **Data Type**: TEXT fields for string data
- **Constraints**: PRIMARY KEY ensures uniqueness, NOT NULL prevents empty values

**Purpose:**
- Normalizes data to avoid redundancy
- Establishes the parent entity in a one-to-many relationship with commits
- Provides a clean separation between repository metadata and commit data


In [21]:
cursor.execute('''
    CREATE TABLE IF NOT EXISTS commits (
        hash TEXT NOT NULL,
        author TEXT NOT NULL,
        date TEXT NOT NULL,
        message TEXT NOT NULL,
        repository_url TEXT NOT NULL,
        PRIMARY KEY (repository_url, hash),
        FOREIGN KEY (repository_url) REFERENCES repositories (repository_url)
    )
''')

<sqlite3.Cursor at 0x7dbcaefbc8c0>

**Commits Table Schema:**

This creates the main table for storing detailed commit information:

**Table Structure:**
- **hash**: The unique Git commit SHA identifier
- **author**: Developer who made the commit
- **date**: Timestamp when the commit was made
- **message**: Full commit message content
- **repository_url**: Links to the repositories table (foreign key)

**Key Constraints:**
- **Composite Primary Key**: (repository_url, hash) ensures uniqueness across repos
- **Foreign Key**: Links commits to their respective repositories
- **NOT NULL**: All fields are required for data integrity

This schema supports querying commits by repository, author, date ranges, or message content.
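
One detail worth knowing about the foreign key: SQLite parses the constraint but does not enforce it unless `PRAGMA foreign_keys = ON` is issued on the connection. The sketch below demonstrates this on an in-memory database with a trimmed-down version of the schema (the `author`, `date`, and `message` columns are omitted for brevity, and the URL is a placeholder):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default, set per connection
conn.execute("""
    CREATE TABLE repositories (
        repository_url TEXT PRIMARY KEY,
        repository_name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE commits (
        hash TEXT NOT NULL,
        repository_url TEXT NOT NULL,
        PRIMARY KEY (repository_url, hash),
        FOREIGN KEY (repository_url) REFERENCES repositories (repository_url)
    )
""")

try:
    # No matching row in repositories, so this insert is an orphan
    conn.execute(
        "INSERT INTO commits VALUES ('abc123', 'https://example.com/missing.git')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("orphan commit rejected:", rejected)
conn.close()
```

Without the pragma, the same insert would succeed and leave a commit pointing at a repository that does not exist.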


### Insert data from the commits list

In [22]:
for commit in commits:
    # Insert repository
    cursor.execute(
        'INSERT OR IGNORE INTO repositories (repository_url, repository_name) VALUES (?, ?)',
        (commit['repo_url'], commit['repo_name'])
    )
    
    # Insert commit
    cursor.execute(
        'INSERT OR IGNORE INTO commits (hash, author, date, message, repository_url) VALUES (?, ?, ?, ?, ?)',
        (commit['hash'], commit['author'], commit['date'], commit['message'], commit['repo_url'])
    )

conn.commit()
conn.close()

**Data Population Process:**

This code populates the database with all parsed commit data:

**Two-Step Insertion:**
1. **Repository Insert**: Adds repository metadata first (parent table)
2. **Commit Insert**: Adds commit details with foreign key reference

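**Key Features:**
- **INSERT OR IGNORE**: Skips any row that would violate a uniqueness or NOT NULL constraint instead of raising an error, so re-inserting the same repository or commit is harmless
- **Referential Integrity**: Each repository row is inserted before the commits that reference it. SQLite only enforces the foreign key itself when `PRAGMA foreign_keys = ON` is set on the connection, which this notebook does not do
- **Batch Processing**: Processes all commits in a single transaction
- **Data Consistency**: Because of the insertion order, every commit row has a matching repository row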

**Final Steps:**
- **commit()**: Saves all changes to the database file
- **close()**: Properly closes the database connection and frees resources

This creates a fully populated, normalized database ready for complex queries and analysis.
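
The effect of `INSERT OR IGNORE` is easy to verify on an in-memory database. The sketch below uses a simplified single-table schema and made-up rows; it also swaps the per-row loop for `executemany` inside a `with conn:` block, which is one possible variation rather than what the notebook does:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE commits (
        hash TEXT NOT NULL,
        message TEXT NOT NULL,
        PRIMARY KEY (hash)
    )
""")

rows = [
    ("abc123", "Fix bug"),
    ("abc123", "Fix bug"),  # duplicate primary key: skipped
    ("def456", None),       # NULL in a NOT NULL column: also skipped
]

with conn:  # commits on success, rolls back if an exception escapes
    conn.executemany(
        "INSERT OR IGNORE INTO commits (hash, message) VALUES (?, ?)", rows
    )

stored = conn.execute("SELECT hash, message FROM commits").fetchall()
print(stored)
conn.close()
```

Only one row survives. The silent skip is convenient for reruns, but it also means a commit with a missing field would disappear without any error, so comparing row counts after loading is a sensible sanity check.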
