# Method 3: AST-Based (Language-Aware) Splitting

This notebook demonstrates the gold standard for code chunking: **AST-Based Splitting**.

## Concept
This method parses the source code into an **Abstract Syntax Tree (AST)**, which is a formal representation of the code's structure. By traversing this tree, we can extract complete, syntactically correct nodes like functions and classes.
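To make the idea concrete before bringing in `tree-sitter`, here is a minimal sketch that uses Python's built-in `ast` module instead. This is a swapped-in technique that only handles Python source, and the helper name `top_level_chunks` and the `demo` string are illustrative assumptions, not part of this notebook's pipeline. It walks the top-level statements of a module and returns each class, function, and async function as a separate chunk.

```python
import ast

def top_level_chunks(source: str):
    """Return (kind, name, code) for each top-level class or function."""
    tree = ast.parse(source)
    kinds = {
        ast.ClassDef: "class",
        ast.FunctionDef: "function",
        ast.AsyncFunctionDef: "async_function",
    }
    chunks = []
    for node in tree.body:  # top-level statements only; methods stay inside their class
        kind = kinds.get(type(node))
        if kind is not None:
            # get_source_segment returns the exact source text spanned by the node
            chunks.append((kind, node.name, ast.get_source_segment(source, node)))
    return chunks

# Hypothetical input, used only for this illustration
demo = '''
import os

def greet(name):
    return f"hi {name}"

async def ping():
    return "pong"

class Box:
    def size(self):
        return 1
'''

for kind, name, code in top_level_chunks(demo):
    print(f"{kind}: {name} ({len(code.splitlines())} lines)")
```

Imports and comments are skipped here because they are not class or function nodes. A real chunker would likely need a policy for that leftover module-level code.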

### Pros:
- **Maximally Semantic:** Chunks are inherently meaningful because they align with the code's logical structure.
- **High-Quality Embeddings:** Coherent chunks produce accurate and useful embeddings.
- **Precise Retrieval:** Allows a search query to point to the exact function or class that is relevant.
- **Multi-Language Support:** The same approach carries over to other programming languages, since `tree-sitter` provides grammars for many of them.

### Cons:
- **More Complex to Implement:** Requires integrating a parser like `tree-sitter` and writing language-specific logic.
- **Language-Specific:** Needs the correct grammar/parser for each programming language.
- **Version Dependencies:** Requires compatible versions of tree-sitter libraries.
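Because the version constraint is easy to get wrong, it can help to check what is installed before parsing anything. The sketch below uses the standard library's `importlib.metadata`. The helper name `check_pins` is an assumption made for illustration, and the pinned versions are copied from the install cell in this notebook.

```python
from importlib import metadata

# Pins copied from the install cell in this notebook
EXPECTED = {"tree-sitter": "0.20.4", "tree_sitter_languages": "1.9.1"}

def check_pins(expected):
    """Map each package name to (installed_version_or_None, matches_pin)."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == pinned)
    return report

for name, (installed, ok) in check_pins(EXPECTED).items():
    status = "ok" if ok else "mismatch"
    print(f"{name}: installed={installed} pinned={EXPECTED[name]} -> {status}")
```

This only reports mismatches and never changes the environment, so it is safe to run before deciding whether a reinstall is needed.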

In [1]:
# Install compatible versions of tree-sitter libraries
# Note: These specific versions are required for compatibility
import subprocess
import sys

def install_compatible_packages():
    """Install compatible versions that work together"""
    
    # First, uninstall any existing versions
    print("🔄 Removing any existing incompatible versions...")
    try:
        subprocess.run([sys.executable, "-m", "pip", "uninstall", "tree-sitter", "tree-sitter-languages", "-y"], 
                      capture_output=True)
    except Exception:
        # Best-effort cleanup: if the uninstall step fails, continue with the install
        pass
    
    # Install the compatible versions (using available versions)
    print("📦 Installing compatible versions...")
    packages = [
        "tree-sitter==0.20.4",
        "tree_sitter_languages==1.9.1"  # Available version that should work
    ]
    
    for package in packages:
        try:
            result = subprocess.run([sys.executable, "-m", "pip", "install", package], 
                                  capture_output=True, text=True)
            if result.returncode == 0:
                print(f"✓ {package} installed successfully")
            else:
                print(f"✗ Error installing {package}: {result.stderr}")
        except Exception as e:
            print(f"✗ Exception installing {package}: {e}")
    
    # Verify installation
    print("\n🔍 Verifying installation...")
    try:
        from tree_sitter_languages import get_parser
        parser = get_parser('python')
        print("✅ tree_sitter_languages is working correctly!")
        return True
    except Exception as e:
        print(f"❌ Verification failed: {e}")
        return False

success = install_compatible_packages()
if success:
    print("\n🎉 All packages are working correctly!")
else:
    print("\n⚠️  Installation issues detected. You may need to restart the kernel.")

🔄 Removing any existing incompatible versions...
📦 Installing compatible versions...
✓ tree-sitter==0.20.4 installed successfully
✓ tree_sitter_languages==1.9.1 installed successfully

🔍 Verifying installation...
✅ tree_sitter_languages is working correctly!

🎉 All packages are working correctly!


## Sample Code for Demonstration
We will use the following block of Python code as the input to demonstrate AST-based splitting.

In [2]:
sample_code = """
# File management and data processing utilities

import os
from typing import List, Dict
import asyncio

def read_file_content(filepath: str) -> str:
    \"\"\"Read and return the content of a file.\"\"\"
    try:
        with open(filepath, 'r', encoding='utf-8') as file:
            return file.read()
    except FileNotFoundError:
        return ""

def validate_email(email: str) -> bool:
    \"\"\"Simple email validation function.\"\"\"
    return "@" in email and "." in email.split("@")[-1]

async def fetch_user_data(user_id: int) -> Dict:
    \"\"\"Async function to fetch user data from API.\"\"\"
    # Simulate API call
    await asyncio.sleep(0.1)
    return {"id": user_id, "name": f"User {user_id}"}

class DataProcessor:
    \"\"\"A class for processing and analyzing data.\"\"\"
    
    def __init__(self, data_source: str):
        self.data_source = data_source
        self.processed_count = 0
    
    def process_batch(self, items: List[str]) -> List[str]:
        \"\"\"Process a batch of items.\"\"\"
        processed = []
        for item in items:
            processed.append(item.strip().upper())
            self.processed_count += 1
        return processed
    
    def get_statistics(self) -> Dict[str, int]:
        \"\"\"Get processing statistics.\"\"\"
        return {
            "processed_count": self.processed_count,
            "data_source_length": len(self.data_source)
        }

class FileManager:
    \"\"\"Utility class for file operations.\"\"\"
    
    def __init__(self, base_directory: str = "."):
        self.base_directory = base_directory
    
    def list_files(self, extension: str = None) -> List[str]:
        \"\"\"List files in the base directory.\"\"\"
        files = os.listdir(self.base_directory)
        if extension:
            files = [f for f in files if f.endswith(extension)]
        return files
    
    def file_exists(self, filename: str) -> bool:
        \"\"\"Check if a file exists.\"\"\"
        return os.path.exists(os.path.join(self.base_directory, filename))
"""

print("Sample code loaded:")
print(sample_code)

Sample code loaded:

# File management and data processing utilities

import os
from typing import List, Dict
import asyncio

def read_file_content(filepath: str) -> str:
    """Read and return the content of a file."""
    try:
        with open(filepath, 'r', encoding='utf-8') as file:
            return file.read()
    except FileNotFoundError:
        return ""

def validate_email(email: str) -> bool:
    """Simple email validation function."""
    return "@" in email and "." in email.split("@")[-1]

async def fetch_user_data(user_id: int) -> Dict:
    """Async function to fetch user data from API."""
    # Simulate API call
    await asyncio.sleep(0.1)
    return {"id": user_id, "name": f"User {user_id}"}

class DataProcessor:
    """A class for processing and analyzing data."""

    def __init__(self, data_source: str):
        self.data_source = data_source
        self.processed_count = 0

    def process_batch(self, items: List[str]) -> List[str]:
        """Process a batch of

In [3]:
from tree_sitter_languages import get_parser

def extract_code_elements(code: str, language: str = 'python'):
    """
    Extract classes and functions from source code using tree-sitter AST parsing.
    The traversal also descends into class bodies, so methods are reported as
    separate function entries in addition to appearing inside their class chunk.
    
    Args:
        code (str): The source code to analyze
        language (str): Programming language (default: 'python')
    
    Returns:
        dict: Dictionary containing extracted classes and functions
    """
    try:
        # Initialize parser
        parser = get_parser(language)
        
        # Parse code into AST
        tree = parser.parse(bytes(code, "utf8"))
        
        # Results container
        extracted_elements = {
            'classes': [],
            'functions': [],
            'async_functions': []
        }
        
        def traverse_ast(node):
            """Recursively traverse the AST and extract code elements."""
            
            # Extract class definitions
            if node.type == "class_definition":
                class_name = node.child_by_field_name("name").text.decode('utf8')
                class_code = node.text.decode('utf8')
                extracted_elements['classes'].append({
                    'name': class_name,
                    'code': class_code,
                    'type': 'class'
                })
            
            # Extract regular function definitions
            elif node.type == "function_definition":
                func_name = node.child_by_field_name("name").text.decode('utf8')
                func_code = node.text.decode('utf8')
                extracted_elements['functions'].append({
                    'name': func_name,
                    'code': func_code,
                    'type': 'function'
                })
            
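            # Extract async function definitions.
            # NOTE: the tree-sitter Python grammar usually parses `async def` as a
            # `function_definition` node with a leading `async` token, so this branch
            # may never match and async functions land in 'functions' instead.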
            elif node.type == "async_function_definition":
                async_func_name = node.child_by_field_name("name").text.decode('utf8')
                async_func_code = node.text.decode('utf8')
                extracted_elements['async_functions'].append({
                    'name': async_func_name,
                    'code': async_func_code,
                    'type': 'async_function'
                })
            
            # Recursively process child nodes
            for child in node.children:
                traverse_ast(child)
        
        # Start traversal from root
        traverse_ast(tree.root_node)
        return extracted_elements
        
    except Exception as e:
        print(f"Error extracting code elements: {e}")
        return {'classes': [], 'functions': [], 'async_functions': []}

def group_all_elements(extracted_elements):
    """
    Group all extracted elements into a single list for easier processing.
    
    Args:
        extracted_elements (dict): Dictionary from extract_code_elements
    
    Returns:
        list: All elements combined with their metadata
    """
    all_elements = []
    
    # Add classes
    all_elements.extend(extracted_elements['classes'])
    
    # Add regular functions
    all_elements.extend(extracted_elements['functions'])
    
    # Add async functions
    all_elements.extend(extracted_elements['async_functions'])
    
    return all_elements

def display_element(element, index=None):
    """
    Display a single code element in a formatted way.
    
    Args:
        element (dict): Element dictionary with name, code, and type
        index (int): Optional index for numbering
    """
    element_type = element['type'].replace('_', ' ').title()
    name = element['name']
    code = element['code']
    
    if index is not None:
        print(f"--- Element {index}: {element_type} '{name}' ---")
    else:
        print(f"--- {element_type} '{name}' ---")
    
    print(code)
    print()  # Add spacing

def display_elements_by_type(extracted_elements):
    """
    Display elements grouped by their type.
    
    Args:
        extracted_elements (dict): Dictionary from extract_code_elements
    """
    # Display classes
    if extracted_elements['classes']:
        print("🏛️  CLASSES:")
        print("=" * 50)
        for element in extracted_elements['classes']:
            display_element(element)
    
    # Display functions
    if extracted_elements['functions']:
        print("⚡ FUNCTIONS:")
        print("=" * 50)
        for element in extracted_elements['functions']:
            display_element(element)
    
    # Display async functions
    if extracted_elements['async_functions']:
        print("🔄 ASYNC FUNCTIONS:")
        print("=" * 50)
        for element in extracted_elements['async_functions']:
            display_element(element)

print("✓ Code extraction functions defined")

✓ Code extraction functions defined
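The recorded run further down reports 9 functions for a sample that has 3 top-level functions and 6 methods, which suggests that methods are collected both inside their class chunk and as standalone entries. If that duplication is unwanted for embedding, one option is to filter it out afterwards. The sketch below is a heuristic built on plain dictionaries in the same shape that `extract_code_elements` returns. The helper name `drop_nested_functions` and the `sample` data are hypothetical.

```python
def drop_nested_functions(extracted):
    """Remove function entries whose code already appears inside a class chunk.

    Heuristic: a method's text is a contiguous slice of its class's text,
    so a simple substring test is enough to spot it.
    """
    class_codes = [c["code"] for c in extracted["classes"]]

    def is_nested(func):
        return any(func["code"] in code for code in class_codes)

    return {
        "classes": list(extracted["classes"]),
        "functions": [f for f in extracted["functions"] if not is_nested(f)],
        "async_functions": [
            f for f in extracted["async_functions"] if not is_nested(f)
        ],
    }

# Hand-written data mimicking the extractor's output shape (illustration only)
sample = {
    "classes": [
        {"name": "Box", "type": "class",
         "code": "class Box:\n    def size(self):\n        return 1"},
    ],
    "functions": [
        {"name": "size", "type": "function",
         "code": "def size(self):\n        return 1"},
        {"name": "greet", "type": "function",
         "code": "def greet():\n    return 'hi'"},
    ],
    "async_functions": [],
}

deduped = drop_nested_functions(sample)
print([f["name"] for f in deduped["functions"]])  # prints ['greet']
```

Whether to keep methods as separate chunks is a design choice: separate method chunks allow more precise retrieval, while class-only chunks avoid indexing the same text twice.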


## Demonstration

Let's apply the AST-based splitting to our sample code and see the results.

In [None]:
print("--- Method 3: AST-Based Splitting (using tree-sitter-languages) ---")

# Extract code elements using our new function
extracted_elements = extract_code_elements(sample_code, language='python')

# Get summary statistics
total_classes = len(extracted_elements['classes'])
total_functions = len(extracted_elements['functions'])
total_async_functions = len(extracted_elements['async_functions'])
total_elements = total_classes + total_functions + total_async_functions

print("📊 EXTRACTION SUMMARY:")
print(f"   Classes: {total_classes}")
print(f"   Functions: {total_functions}")
print(f"   Async Functions: {total_async_functions}")
print(f"   Total Elements: {total_elements}")
print()

# Display elements grouped by type
display_elements_by_type(extracted_elements)

print("=" * 60)

--- Method 3: AST-Based Splitting (using tree-sitter-languages) ---
This method extracts complete functions and classes as semantically meaningful chunks.
Supports multiple programming languages!

📊 EXTRACTION SUMMARY:
   Classes: 2
   Functions: 9
   Async Functions: 0
   Total Elements: 11

🏛️  CLASSES:
--- Class 'DataProcessor' ---
class DataProcessor:
    """A class for processing and analyzing data."""

    def __init__(self, data_source: str):
        self.data_source = data_source
        self.processed_count = 0

    def process_batch(self, items: List[str]) -> List[str]:
        """Process a batch of items."""
        processed = []
        for item in items:
            processed.append(item.strip().upper())
            self.processed_count += 1
        return processed

    def get_statistics(self) -> Dict[str, int]:
        """Get processing statistics."""
        return {
            "processed_count": self.processed_count,
            "data_source_length": len(self.data_s