# Indexing the NeuroConv Repository

This notebook demonstrates how to use repo-indexer to index and search the [NeuroConv](https://github.com/catalystneuro/neuroconv) repository, a tool for converting neurophysiology data to NWB format.

In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
# Enable nested asyncio support for Jupyter notebooks
import nest_asyncio
nest_asyncio.apply()

## Setup

Choose between OpenAI embeddings (higher quality, requires an API key) and a local model (free, runs on your machine):

### Option 1: Using OpenAI Embeddings

In [None]:
import os
from repo_indexer.indexer import RepoIndexer
from repo_indexer.embeddings import OPENAI_ADA_002

# Initialize indexer with OpenAI embeddings
indexer = RepoIndexer(
    qdrant_url="http://localhost:6333",
    api_key=os.getenv("OPENAI_API_KEY"),
    collection_name="neuroconv",  # Dedicated collection for neuroconv
    embedding_model=OPENAI_ADA_002
)
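
Note that `os.getenv` returns `None` when the variable is missing, so a forgotten key only surfaces later as a less obvious error. A minimal sketch of an early check follows; `require_env` is a hypothetical helper, not part of repo-indexer.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for this notebook; repo-indexer does not provide it.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage when building the indexer:
#   api_key=require_env("OPENAI_API_KEY")
```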

### Option 2: Using Local E5 Model

This option requires running the local embedding server. First, start the server with:
```bash
docker run -p 8001:80 ghcr.io/huggingface/text-embeddings-inference:cpu-0.3 --model-id intfloat/multilingual-e5-small
```

The server will be available at http://localhost:8001. If you run it on a different host or port, pass the matching URL to a custom embedding generator:

In [None]:
from repo_indexer.indexer import RepoIndexer
from repo_indexer.embeddings import LocalE5Embeddings

# Create custom embedding generator with specific URL
embedding_generator = LocalE5Embeddings(
    url="http://localhost:8001/v1/embeddings"
)

# Initialize indexer with custom generator
indexer = RepoIndexer(
    qdrant_url="http://localhost:6333",
    collection_name="neuroconv",
    embedding_generator=embedding_generator
)
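
Both options rely on services listening on local ports (Qdrant on 6333, and the embedding server on 8001 for Option 2). A small sketch for checking that something is accepting connections before indexing is shown below. `is_reachable` is a hypothetical helper that only tests the TCP connection; it does not verify that the service behind the port is healthy.

```python
import socket
from urllib.parse import urlparse


def is_reachable(url: str, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parsed = urlparse(url)
    # Fall back to the scheme's default port when the URL does not name one.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        with socket.create_connection((parsed.hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


# URLs taken from the cells above.
services = {
    "qdrant": "http://localhost:6333",
    "embeddings": "http://localhost:8001",
}
for name, url in services.items():
    status = "reachable" if is_reachable(url) else "NOT reachable"
    print(f"{name}: {url} is {status}")
```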

## Repository Processing

Let's process the repository step by step, focusing on Python files to understand the core functionality:

### 1. Parse Repository

First, we parse the repository to get its raw content:

In [None]:
# Parse the repository
repo_url = "https://github.com/catalystneuro/neuroconv.git"
indexer.parse_repo(repo_url)

# Access stored metadata
repo_data = indexer.repositories[repo_url]

print("\nRepository Summary:")
print(repo_data["summary"])

In [None]:
print("\nRepository Tree:")
print(repo_data["tree"])

In [None]:
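# Inspect a single parsed entry; index 240 is an arbitrary example
# and may need adjusting as the repository changes
print(repo_data["content"][240][0])
print(repo_data["content"][240][1])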
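
A fixed index such as 240 can point at a different file whenever the repository changes. Looking an entry up by path is more stable. The sketch below assumes each entry in `repo_data["content"]` is a `(path, text)` pair, which is what the `[0]` and `[1]` indexing above suggests but is not confirmed here. `find_file` is a hypothetical helper and the sample data is made up.

```python
from typing import Optional, Sequence, Tuple


def find_file(
    content: Sequence[Tuple[str, str]], path_suffix: str
) -> Optional[Tuple[str, str]]:
    """Return the first (path, text) entry whose path ends with path_suffix."""
    for path, text in content:
        if path.endswith(path_suffix):
            return path, text
    return None


# Toy stand-in for repo_data["content"] (assumed shape, illustrative values).
content = [
    ("README.md", "# Demo"),
    ("src/pkg/tools.py", "def f(): ..."),
]
print(find_file(content, "tools.py"))
```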

### 2. Generate Chunks

Next, we split the content into chunks, keeping only Python files and skipping chunks shorter than 20 tokens:

In [None]:
# Generate chunks from Python files
indexer.generate_chunks(
    repo_url=repo_url,
    min_tokens=20,     # Skip very small chunks
    # max_tokens=5000, # Optional: uncomment to cap chunk size
    file_types=[".py"] # Only process Python files
)

chunks = indexer.repositories[repo_url]["chunks"]
print(f"Generated {len(chunks)} chunks from Python files\n")
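
To make the roles of `file_types`, `min_tokens`, and `max_tokens` concrete, here is a simplified sketch of that kind of filtering. It is not repo-indexer's actual chunking algorithm: it approximates tokens by whitespace-separated words, splits on blank lines, and `select_chunks` is a hypothetical name.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional, Tuple


def select_chunks(
    files: Iterable[Tuple[str, str]],
    file_types: List[str],
    min_tokens: int,
    max_tokens: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Split files into blank-line-separated blocks and filter them by size."""
    chunks = []
    for path, text in files:
        # Keep only files with a wanted extension, e.g. ".py".
        if PurePosixPath(path).suffix not in file_types:
            continue
        for block in text.split("\n\n"):
            words = block.split()  # crude stand-in for a real tokenizer
            if max_tokens is None:
                windows = [words]
            else:
                # Cut oversized blocks into windows of at most max_tokens words.
                windows = [
                    words[i:i + max_tokens]
                    for i in range(0, len(words), max_tokens)
                ]
            for window in windows:
                # Drop empty windows and those below the minimum size.
                if window and len(window) >= min_tokens:
                    # Joining with spaces loses the original layout;
                    # acceptable for a sketch.
                    chunks.append((path, " ".join(window)))
    return chunks
```

A very low `min_tokens` keeps trivial fragments such as lone imports, while a `max_tokens` cap keeps each chunk small enough for the embedding model's input limit.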


### 3. Generate Embeddings

Now we generate embeddings for each chunk:

In [None]:
# Generate embeddings
indexer.generate_embeddings(repo_url)

# Get a sample chunk with its embedding
sample_chunk = chunks[0]
print(f"Generated embeddings for {len(chunks)} chunks")
print(f"Vector size: {len(sample_chunk.embedding)}")

print("\nSample chunk:")
print(f"Content: {sample_chunk.content[:100]}...")
print("\nIts embedding vector (first 10 dimensions):")
print(sample_chunk.embedding[:10])
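
The embedding vectors are what make semantic search possible: a query is embedded the same way, and chunks are ranked by how close their vectors are to the query vector. The sketch below illustrates this with cosine similarity on toy two-dimensional vectors. Cosine is an assumption here, since Qdrant collections can also be configured with dot-product or Euclidean distance, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(
    query: Sequence[float], vectors: Dict[str, Sequence[float]], limit: int = 3
) -> List[Tuple[str, float]]:
    """Return the `limit` most similar entries as (name, score), best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:limit]


# Toy vectors; real embeddings have hundreds of dimensions or more.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(rank([1.0, 0.0], toy_vectors, limit=2))
```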

### 4. Store in Vector Database

Finally, we store the chunks and their embeddings:

In [None]:
# Store chunks and embeddings
indexer.insert_chunks(repo_url)

### Alternative: One-Step Processing

If you don't need to examine the intermediate results, you can use the convenience method that runs all four steps (parse, chunk, embed, store) in one call:

In [None]:
# Process everything in one step, still focusing on Python files
indexer.index_repository(
    repo_url="https://github.com/catalystneuro/neuroconv.git",
    min_tokens=50,
    max_tokens=500,
    file_types=[".py"]
)

## Search Examples

Let's try some searches focused on different aspects of NeuroConv:

In [None]:
def print_results(results):
    """Helper to print search results nicely."""
    for i, result in enumerate(results, 1):
        print(f"\nResult {i} (Score: {result['score']:.3f})")
        print(f"File: {result['file']}")
        if result['lines'][0]:
            print(f"Lines: {result['lines'][0]}-{result['lines'][1]}")
        print("Content:")
        print(result['content'])
        print("-" * 80)

### 1. Find Data Conversion Functions

In [None]:
# Search for data conversion functionality
results = indexer.search(
    query="function to convert data to NWB format",
    limit=3,
    chunk_type="code"
)
print_results(results)

### 2. Find Interface Documentation

In [None]:
# Search for interface documentation
results = indexer.search(
    query="interface documentation for converting neurophysiology data",
    limit=3,
    chunk_type="documentation"
)
print_results(results)

### 3. Find Configuration Examples

In [None]:
# Search for configuration examples
results = indexer.search(
    query="configuration settings for data conversion",
    limit=3,
    chunk_type="configuration"
)
print_results(results)

### 4. Find Test Examples

In [None]:
# Search for test examples
results = indexer.search(
    query="test cases for data conversion",
    limit=3,
    file_extension="py",  # Look in Python files
    chunk_type="code"
)
print_results(results)

### 5. Find Error Handling

In [None]:
# Search for error handling code
results = indexer.search(
    query="error handling during data conversion",
    limit=3,
    chunk_type="code"
)
print_results(results)
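
Several top hits often come from the same file. The sketch below post-processes results by dropping low-scoring hits and keeping only the best hit per file. It assumes the result dictionaries have the `score` and `file` keys used by `print_results` above. `best_per_file` is a hypothetical helper, and the 0.5 threshold is an arbitrary illustration, not a recommended value.

```python
from typing import Dict, List


def best_per_file(results: List[dict], min_score: float = 0.5) -> List[dict]:
    """Keep the highest-scoring result per file, dropping scores below min_score."""
    best: Dict[str, dict] = {}
    for result in results:
        if result["score"] < min_score:
            continue
        current = best.get(result["file"])
        if current is None or result["score"] > current["score"]:
            best[result["file"]] = result
    # Highest score first, matching the usual search ordering.
    return sorted(best.values(), key=lambda r: r["score"], reverse=True)


# Mock results in the shape print_results expects (values are made up).
mock_results = [
    {"score": 0.81, "file": "a.py", "lines": (1, 10), "content": "..."},
    {"score": 0.77, "file": "a.py", "lines": (20, 30), "content": "..."},
    {"score": 0.64, "file": "b.py", "lines": (5, 9), "content": "..."},
    {"score": 0.31, "file": "c.py", "lines": (None, None), "content": "..."},
]
for hit in best_per_file(mock_results):
    print(hit["file"], hit["score"])
```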

## Clean Up

To remove the indexed data, delete the collection from Qdrant:

In [None]:
from repo_indexer.clients import QdrantManager

# Initialize manager
manager = QdrantManager(url="http://localhost:6333")

# Delete the collection
manager.delete_collection("neuroconv")