# Preprocessing Notebook

This notebook handles data preprocessing for the Cirq-RAG-Code-Assistant project.

## Purpose
- Fetch quantum code from GitHub repositories
- Load and clean knowledge base data
- Process Cirq code snippets
- Generate descriptions for code samples
- Prepare data for embedding generation
- Organize knowledge base structure

## Usage
Import the preprocessing classes (`DatasetFetcher`, `DataPreprocessor`, `DescriptionGenerator`, `DatasetLoader`) from `src.data` and run the cells in order to process your data.


## 1. Setup and Imports

Import the necessary modules for data fetching, preprocessing, and loading.


In [1]:
# Add project root to Python path
import sys
from pathlib import Path
import os

# Get the project root (parent of notebooks directory)
# In Jupyter notebooks, we need to navigate from the current working directory
current_dir = Path(os.getcwd())
# If we're in the notebooks directory, go up one level; otherwise assume we're at project root
if current_dir.name == "notebooks":
    project_root = current_dir.parent
else:
    # Try to find the project root by looking for src directory
    project_root = current_dir
    while project_root != project_root.parent:
        if (project_root / "src").exists():
            break
        project_root = project_root.parent

if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

print(f"📁 Project root: {project_root}")
print(f"📁 Current directory: {current_dir}")

# Import data processing modules
from src.data.fetcher import DatasetFetcher
from src.data.preprocessor import DataPreprocessor
from src.data.description_generator import DescriptionGenerator
from src.data.dataset_loader import DatasetLoader

# Set up paths (relative to project root)
DATA_DIR = project_root / "data" / "datasets"
DATA_DIR.mkdir(parents=True, exist_ok=True)

print("✅ Imports successful!")


📁 Project root: d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant
📁 Current directory: d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\notebooks
✅ Imports successful!
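
The root lookup in the cell above can also be written as a small reusable function. The following is a minimal sketch, assuming the project root is the nearest ancestor directory that contains a `src` folder; `find_project_root` is a hypothetical helper and not part of the project. Unlike the loop in the cell, which ends at the filesystem root when no `src` directory is found, this version returns `None` so the caller can fail early.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "src") -> Optional[Path]:
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Hypothetical helper: `marker` is the directory name used to recognise
    the project root. Returns None when no ancestor contains it.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None
```

In a notebook this could be called as `find_project_root(Path.cwd())`, with the result checked for `None` before it is added to `sys.path`.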


## 2. Fetch Data from GitHub

Fetch Cirq code samples from the Cirq GitHub repository.


In [2]:
# Initialize fetcher
fetcher = DatasetFetcher(
    repos_dir="repos",  # Directory to clone repositories
    output_dir=DATA_DIR,  # Output directory for extracted data
)

# Fetch code from all repositories
# Note: This will clone repositories if they don't exist
# Set force_clone=True to re-clone existing repositories
output_file = fetcher.fetch_all(
    output_filename="quantum_code_samples_filtered.jsonl",
    force_clone=False,  # Set to True to re-clone
    min_code_length=50,
    max_code_length=50000,
)

print(f"✅ Data fetched and saved to: {output_file}")


Repository Cirq already exists. Skipping clone.
Scanning 1175 Python files in Cirq...


Extracting Cirq code: 100%|██████████| 1175/1175 [00:00<00:00, 7258.06it/s]

✅ Collected 427 samples from https://github.com/quantumlib/Cirq

✅ Extraction complete!
Total samples extracted: 427
  - Cirq: 427 samples

💾 Saved to: d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\datasets\quantum_code_samples_filtered.jsonl

✅ Data fetched and saved to: d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\datasets\quantum_code_samples_filtered.jsonl
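
The `min_code_length` and `max_code_length` arguments passed to `fetch_all` bound the size of each extracted sample. Below is a minimal sketch of what such a filter and the JSONL output step might look like, assuming inclusive bounds measured in characters and a `code` field on every entry. `filter_by_length` and `write_jsonl` are hypothetical helpers for illustration; the actual `DatasetFetcher` internals may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict, Iterable, List


def filter_by_length(
    samples: Iterable[Dict[str, Any]],
    min_code_length: int = 50,
    max_code_length: int = 50000,
) -> List[Dict[str, Any]]:
    """Keep samples whose `code` length (in characters) is within the bounds.

    Assumption: both bounds are inclusive.
    """
    return [
        sample
        for sample in samples
        if min_code_length <= len(sample.get("code", "")) <= max_code_length
    ]


def write_jsonl(entries: Iterable[Dict[str, Any]], path: Path) -> int:
    """Write one JSON object per line and return the number of entries written."""
    count = 0
    with Path(path).open("w", encoding="utf-8") as handle:
        for entry in entries:
            handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```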





## 3. Load and Inspect Dataset

Load the dataset and view statistics.


In [3]:
# Load dataset
dataset_path = DATA_DIR / "quantum_code_samples_filtered.jsonl"
loader = DatasetLoader(dataset_path)

# Print statistics
loader.print_stats()

# Get some sample entries
samples = loader.sample(3, seed=42)
print("\n📋 Sample entries:")
for i, entry in enumerate(samples, 1):
    print(f"\n--- Sample {i} ---")
    print(f"Framework: {entry.get('framework')}")
    print(f"File: {entry.get('file')}")
    print(f"Code length: {len(entry.get('code', ''))} characters")
    print(f"Code preview: {entry.get('code', '')[:200]}...")



Dataset Statistics: quantum_code_samples_filtered.jsonl
Total entries: 427

Frameworks:
  - Cirq: 427

Code length:
  - Average: 9706 characters
  - Min: 854 characters
  - Max: 49381 characters

Descriptions:
  - With descriptions: 0
  - Coverage: 0.0%


📋 Sample entries:

--- Sample 1 ---
Framework: Cirq
File: cirq-core\cirq\transformers\gauge_compiling\idle_moments_gauge.py
Code length: 8517 characters
Code preview: # Copyright 2025 The Cirq Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of t...

--- Sample 2 ---
Framework: Cirq
File: cirq-google\cirq_google\engine\util.py
Code length: 1357 characters
Code preview: # Copyright 2022 The Cirq Developers
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of t...

--- Sample 3 ---
Framework: Cirq
File: cir
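
The statistics printed by `print_stats()` can be approximated directly from the JSONL file. The sketch below assumes each line is a JSON object with a `code` field and an optional `description` field; `load_jsonl` and `summarize` are hypothetical helpers, not the `DatasetLoader` implementation. An empty dataset yields zeros instead of raising an error.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    entries = []
    with Path(path).open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def summarize(entries: List[Dict[str, Any]]) -> Dict[str, float]:
    """Compute entry count, code-length statistics, and description coverage."""
    total = len(entries)
    lengths = [len(entry.get("code", "")) for entry in entries]
    # An entry counts as described only if its description is non-empty
    described = sum(1 for entry in entries if entry.get("description"))
    return {
        "total": total,
        "avg_length": sum(lengths) / total if total else 0,
        "min_length": min(lengths, default=0),
        "max_length": max(lengths, default=0),
        "with_descriptions": described,
        "coverage_pct": 100.0 * described / total if total else 0.0,
    }
```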

## 4. Preprocess Dataset

Clean and validate the dataset, remove duplicates, and extract metadata.


In [4]:
# Initialize preprocessor
preprocessor = DataPreprocessor(
    min_code_length=50,
    max_code_length=50000,
    min_lines=5,
    max_lines=1000,
    remove_duplicates=True,
    validate_syntax=True,
)

# Preprocess dataset
input_file = DATA_DIR / "quantum_code_samples_filtered.jsonl"
output_file = DATA_DIR / "quantum_code_samples_preprocessed.jsonl"

stats = preprocessor.preprocess_dataset(
    input_path=input_file,
    output_path=output_file,
    add_metadata=True,
)

print(f"✅ Preprocessing complete! Processed {stats['processed']} entries.")


Loading dataset from d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\datasets\quantum_code_samples_filtered.jsonl...
Preprocessing 427 entries...


Preprocessing:   0%|          | 0/427 [00:00<?, ?it/s]

Preprocessing: 100%|██████████| 427/427 [00:00<00:00, 1026.12it/s]

Writing preprocessed data to d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\datasets\quantum_code_samples_preprocessed.jsonl...

✅ Preprocessing complete!
Total entries: 427
Processed: 0
Filtered out: 427
  - Duplicates: 0
  - Quality issues: 427
Retention rate: 0.0%

💾 Saved to: d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\datasets\quantum_code_samples_preprocessed.jsonl

   Check your filtering criteria:
   - min_code_length: 50
   - max_code_length: 50000
   - min_lines: 5
   - max_lines: 1000
   - validate_syntax: True

   Sample quality issues found:
   1. File: cirq-google\cirq_google\cloud\quantum_v1alpha1\services\quantum_engine_service\transports\grpc.py
      Code length: 49381, Lines: 1102
      Issues: Too many lines: 1102 > 1000
   2. File: cirq-google\cirq_google\cloud\quantum_v1alpha1\services\quantum_engine_service\transports\rest_base.py
      Code length: 39372, Lines: 1070
      Issues: Too many
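
The run above filtered out every entry for quality issues, so it helps to reproduce the checks on a single sample and see which one fires. The following is a rough sketch of checks matching the configured criteria (length, line count, syntax validation, duplicate removal). It assumes syntax is validated with `ast.parse` and duplicates are detected by hashing the stripped code; `find_quality_issues` and `is_duplicate` are hypothetical helpers, and the real `DataPreprocessor` may apply different or additional rules.

```python
import ast
import hashlib
from typing import List, Set


def find_quality_issues(
    code: str,
    min_code_length: int = 50,
    max_code_length: int = 50000,
    min_lines: int = 5,
    max_lines: int = 1000,
    validate_syntax: bool = True,
) -> List[str]:
    """Return a list of human-readable issues; an empty list means the code passes."""
    issues = []
    n_chars = len(code)
    n_lines = len(code.splitlines())
    if n_chars < min_code_length:
        issues.append(f"Code too short: {n_chars} < {min_code_length}")
    if n_chars > max_code_length:
        issues.append(f"Code too long: {n_chars} > {max_code_length}")
    if n_lines < min_lines:
        issues.append(f"Too few lines: {n_lines} < {min_lines}")
    if n_lines > max_lines:
        issues.append(f"Too many lines: {n_lines} > {max_lines}")
    if validate_syntax:
        try:
            ast.parse(code)  # Parses only; the code is never executed
        except (SyntaxError, ValueError) as exc:
            issues.append(f"Syntax error: {exc}")
    return issues


def is_duplicate(code: str, seen_hashes: Set[str]) -> bool:
    """Return True if the stripped code was seen before; otherwise record it."""
    digest = hashlib.sha256(code.strip().encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

Running `find_quality_issues(entry["code"])` over the filtered dataset and counting the returned messages would show which criterion rejects the most entries, which is a reasonable first step before loosening `max_lines` or any other threshold.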




## 5. Generate Descriptions

Add natural language descriptions to code samples.


In [5]:
# Initialize description generator
# Set use_ml=True to use ML-based summarization (requires transformers)
generator = DescriptionGenerator(
    use_ml=False,  # Set to True for ML-enhanced descriptions
    ml_model="facebook/bart-large-cnn",
    device="auto",  # "auto", "cpu", or "cuda"
)

# Generate descriptions
input_file = DATA_DIR / "quantum_code_samples_preprocessed.jsonl"
output_file = DATA_DIR / "quantum_dataset_with_descriptions.jsonl"

desc_stats = generator.add_descriptions_to_dataset(
    input_path=input_file,
    output_path=output_file,
    use_ml=False,  # Override instance setting if needed
    batch_size=100,
)

print(f"✅ Descriptions generated! Processed {desc_stats['processed']} entries.")


Reading dataset from d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\datasets\quantum_code_samples_preprocessed.jsonl...
Generating descriptions for 0 entries...


Generating descriptions: 0it [00:00, ?it/s]

Writing output to d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\datasets\quantum_dataset_with_descriptions.jsonl...

✅ Description generation complete!
Total entries: 0
Processed: 0
Skipped: 0
Errors: 0

💾 Saved to: d:\University\Uni\Semester 7\Generative AI\Project\Cirq-RAG-Code-Assistant\data\datasets\quantum_dataset_with_descriptions.jsonl

✅ Descriptions generated! Processed 0 entries.
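
With `use_ml=False`, descriptions have to come from the code itself. One plausible rule-based approach, shown here only as an illustration, is to reuse the first paragraph of the module docstring, falling back to the first documented function or class. `describe_from_docstrings` is a hypothetical helper; how `DescriptionGenerator` actually builds non-ML descriptions is not shown in this notebook.

```python
import ast
from typing import Optional


def describe_from_docstrings(code: str, max_chars: int = 200) -> Optional[str]:
    """Return a one-line summary taken from the first available docstring.

    Returns None when the code cannot be parsed or has no docstring.
    """
    try:
        tree = ast.parse(code)
    except (SyntaxError, ValueError):
        return None

    # Prefer the module docstring, then the first documented top-level definition
    docstring = ast.get_docstring(tree)
    if not docstring:
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                docstring = ast.get_docstring(node)
                if docstring:
                    break
    if not docstring:
        return None

    # Keep the first paragraph and collapse internal whitespace
    first_paragraph = docstring.strip().split("\n\n", 1)[0]
    summary = " ".join(first_paragraph.split())
    return summary[:max_chars]
```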





## 6. Verify Final Dataset

Load and verify the final preprocessed dataset with descriptions.


In [6]:
# Load final dataset
final_dataset = DatasetLoader(DATA_DIR / "quantum_dataset_with_descriptions.jsonl")

# Print statistics
final_dataset.print_stats()

# View a sample entry with description
samples = final_dataset.sample(1, seed=42)
if samples:
    entry = samples[0]
    print("\n📋 Sample entry with description:")
    print(f"Framework: {entry.get('framework')}")
    print(f"File: {entry.get('file')}")
    print("\nDescription:")
    print(entry.get('description', 'No description'))
    print("\nMetadata:")
    if 'metadata' in entry:
        for key, value in entry['metadata'].items():
            print(f"  - {key}: {value}")



Dataset Statistics: quantum_dataset_with_descriptions.jsonl
Total entries: 0

Frameworks:

Code length:
  - Average: 0 characters
  - Min: 0 characters
  - Max: 0 characters

Descriptions:
  - With descriptions: 0
  - Coverage: 0.0%



## 7. View Cirq Samples

View and analyze Cirq samples from the dataset.


In [7]:
# Get all Cirq samples (all entries should be Cirq)
cirq_samples = final_dataset.get_by_framework("Cirq")
print(f"Found {len(cirq_samples)} Cirq samples")

# View a Cirq sample
if cirq_samples:
    sample = cirq_samples[0]
    print("\n📋 Cirq Sample:")
    print(f"File: {sample.get('file')}")
    print(f"Description: {sample.get('description', 'No description')[:200]}...")
    print("\nCode preview:")
    print(sample.get('code', '')[:300] + "...")


Found 0 Cirq samples


## 8. Complete Pipeline

Run the complete preprocessing pipeline in one go.


In [8]:
# Complete preprocessing pipeline
def run_preprocessing_pipeline(
    fetch_data: bool = False,
    generate_descriptions: bool = True,
    use_ml: bool = False,
):
    """
    Run the complete data preprocessing pipeline.
    
    Args:
        fetch_data: Whether to fetch data from GitHub
        generate_descriptions: Whether to generate descriptions
        use_ml: Whether to use ML for description generation
    """
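    # Step 1: Fetch data (optional, if not already done)
    if fetch_data:
        print("Step 1: Fetching data from GitHub...")
        fetcher = DatasetFetcher(output_dir=DATA_DIR)
        # Write to the same filename that Step 2 reads
        fetcher.fetch_all(output_filename="quantum_code_samples_filtered.jsonl")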
    
    # Step 2: Preprocess data
    print("\nStep 2: Preprocessing data...")
    preprocessor = DataPreprocessor()
    preprocessor.preprocess_dataset(
        input_path=DATA_DIR / "quantum_code_samples_filtered.jsonl",
        output_path=DATA_DIR / "quantum_code_samples_preprocessed.jsonl",
    )
    
    # Step 3: Generate descriptions
    if generate_descriptions:
        print("\nStep 3: Generating descriptions...")
        generator = DescriptionGenerator(use_ml=use_ml)
        generator.add_descriptions_to_dataset(
            input_path=DATA_DIR / "quantum_code_samples_preprocessed.jsonl",
            output_path=DATA_DIR / "quantum_dataset_with_descriptions.jsonl",
        )
    
    print("\n✅ Pipeline complete!")

# Uncomment to run the complete pipeline:
# run_preprocessing_pipeline(fetch_data=False, generate_descriptions=True, use_ml=False)
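
In the run recorded in this notebook, preprocessing retained 0 of 427 entries, yet description generation still ran on the resulting empty file. A small guard between the steps would surface that earlier. The sketch below assumes the statistics dictionary exposes a `processed` count, as used in the preprocessing cell; `ensure_entries_retained` is a hypothetical helper and not part of the project.

```python
from typing import Any, Dict


def ensure_entries_retained(stats: Dict[str, Any], stage: str) -> int:
    """Return the number of retained entries, or raise if a stage kept nothing.

    Assumption: `stats` contains a `processed` key with the retained count.
    """
    processed = int(stats.get("processed", 0))
    if processed == 0:
        raise ValueError(
            f"{stage} retained 0 entries; check the filtering criteria before continuing."
        )
    return processed
```

Inside `run_preprocessing_pipeline`, the return value of `preprocess_dataset` could be passed to this guard before Step 3, so that the pipeline stops with a clear message instead of producing an empty final dataset.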
