# Ingesting MDX and Jupyter Notebooks from documents/docs

This notebook demonstrates a complete workflow for loading and processing all MDX and Jupyter notebook files from the `documents/docs` folder. We load the files, split them into chunks, embed the chunks with OpenAI, and store them in a Supabase vector table for retrieval.

In [2]:
pip install nbformat

Collecting nbformat
  Downloading nbformat-5.10.4-py3-none-any.whl.metadata (3.6 kB)
Collecting fastjsonschema>=2.15 (from nbformat)
  Downloading fastjsonschema-2.21.1-py3-none-any.whl.metadata (2.2 kB)
Downloading nbformat-5.10.4-py3-none-any.whl (78 kB)
Downloading fastjsonschema-2.21.1-py3-none-any.whl (23 kB)
Installing collected packages: fastjsonschema, nbformat
Successfully installed fastjsonschema-2.21.1 nbformat-5.10.4
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Import required libraries
import os
import glob
import ssl
import urllib3
from pathlib import Path
from dotenv import load_dotenv
import nbformat

# Import langchain components
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders.notebook import NotebookLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import SupabaseVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Import supabase
from supabase.client import Client, create_client

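# Build a permissive SSL context (hostname checks and certificate verification disabled).
# Note: this context is not passed to any client later in this notebook, so it does not
# change how the Supabase or OpenAI connections behave. Avoid disabling verification in production.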
ssl_context = ssl.create_default_context()
ssl_context.check_hostname = False
ssl_context.verify_mode = ssl.CERT_NONE

# Disable SSL warnings
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Load environment variables
load_dotenv()

True

## Configure Database and Embedding Model

Set up connections to Supabase and initialize the OpenAI embedding model.

In [4]:
# Initialize Supabase client with improved connection settings
supabase_url = os.environ.get("SUPABASE_URL")
supabase_key = os.environ.get("SUPABASE_SERVICE_KEY")
supabase: Client = create_client(
    supabase_url,
    supabase_key,
    options={
        "timeout": 60,  # Increase timeout
        "headers": {
            "Connection": "keep-alive"
        }
    }
)

# Initialize embeddings model with improved timeout and retry settings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    timeout=60,  # Increase timeout for API calls
    max_retries=5  # Add retries for resilience
)

print(f"Supabase connection established: {bool(supabase)}")
print(f"Embeddings model initialized: {bool(embeddings)}")

Supabase connection established: True
Embeddings model initialized: True
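
Because `os.environ.get` returns `None` for a missing variable, a typo in `.env` may only surface later as a less direct client error. The sketch below shows one way to fail early. The helper name `require_env` and the placeholder values are assumptions for illustration and are not part of the original notebook.

```python
import os


def require_env(names, environ=None):
    """Return {name: value} for each name, raising if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: environ[name] for name in names}


# Placeholder values for illustration only; real values come from .env
fake_env = {
    "SUPABASE_URL": "https://example.supabase.co",
    "SUPABASE_SERVICE_KEY": "placeholder-key",
}
settings = require_env(["SUPABASE_URL", "SUPABASE_SERVICE_KEY"], environ=fake_env)
print(sorted(settings))
```

In the notebook itself the call would use the default `os.environ`, for example `require_env(["SUPABASE_URL", "SUPABASE_SERVICE_KEY"])` right after `load_dotenv()`.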


## Define Custom MDX Loader

Create a custom loader for MDX files that can extract content and metadata.

In [5]:
class MDXLoader:
    """Loader for MDX files to extract content and frontmatter metadata."""
    
    def __init__(self, file_path):
        """Initialize with file path."""
        self.file_path = file_path
    
    def load(self):
        """Load and parse MDX file."""
        with open(self.file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        
        # Extract metadata (if present at the beginning of the file)
        metadata = {}
        if content.startswith('---'):
            # Extract the frontmatter between the first two '---' delimiters
            parts = content.split('---', 2)
            if len(parts) >= 3:
                frontmatter = parts[1].strip()
                # Simple parsing of key-value pairs
                for line in frontmatter.split('\n'):
                    if ':' in line:
                        key, value = line.split(':', 1)
                        metadata[key.strip()] = value.strip()
                content = parts[2]
        
        # Add file metadata
        metadata['source'] = self.file_path
        metadata['filetype'] = 'mdx'
        
        return [Document(page_content=content, metadata=metadata)]
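
To see what this frontmatter parsing produces, the sketch below applies the same splitting rules to an in-memory string. The function name `split_frontmatter` and the sample text are made up for illustration. Only the parsing logic mirrors `MDXLoader.load`.

```python
def split_frontmatter(text):
    """Return (metadata, body) using the same rules as MDXLoader.load."""
    metadata = {}
    if text.startswith('---'):
        # Split on the first two '---' delimiters only
        parts = text.split('---', 2)
        if len(parts) >= 3:
            for line in parts[1].strip().split('\n'):
                if ':' in line:
                    # Split on the first colon so values may contain colons
                    key, value = line.split(':', 1)
                    metadata[key.strip()] = value.strip()
            text = parts[2]
    return metadata, text


sample = (
    "---\n"
    "sidebar_position: 2\n"
    'description: "Agents: an overview"\n'
    "---\n"
    "\n"
    "# Agents\n"
)
metadata, body = split_frontmatter(sample)
print(metadata)
print(repr(body))
```

Every value stays a string (`'2'`, not `2`), surrounding quotes are kept, and multi-line or nested YAML values are not handled. If typed metadata matters, a real YAML parser would be a better fit than this line-based approach.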

## Find All MDX and Jupyter Notebook Files

Recursively locate all MDX and Jupyter notebook files in the docs directory.

In [6]:
def find_files(base_path, extensions):
    """Find all files with the given extensions in the base path recursively."""
    all_files = []
    for ext in extensions:
        # Use glob pattern to find all files with the extension recursively
        pattern = os.path.join(base_path, '**', f'*.{ext}')
        files = glob.glob(pattern, recursive=True)
        all_files.extend(files)
    return all_files

# Define the base path and extensions
base_path = 'documents/docs'
extensions = ['mdx', 'ipynb']

# Find all files
files = find_files(base_path, extensions)
print(f"Found {len(files)} total files")

# Count files by type
mdx_files = [f for f in files if f.endswith('.mdx')]
ipynb_files = [f for f in files if f.endswith('.ipynb')]
print(f"MDX files: {len(mdx_files)}")
print(f"Jupyter notebooks: {len(ipynb_files)}")

# Display sample files of each type
if mdx_files:
    print(f"\nSample MDX files:")
    for f in mdx_files[:3]:
        print(f"  - {f}")
if ipynb_files:
    print(f"\nSample Jupyter notebooks:")
    for f in ipynb_files[:3]:
        print(f"  - {f}")

Found 1514 total files
MDX files: 413
Jupyter notebooks: 1101

Sample MDX files:
  - documents/docs/people.mdx
  - documents/docs/introduction.mdx
  - documents/docs/_templates/integration.mdx

Sample Jupyter notebooks:
  - documents/docs/versions/migrating_chains/stuff_docs_chain.ipynb
  - documents/docs/versions/migrating_chains/refine_docs_chain.ipynb
  - documents/docs/versions/migrating_chains/multi_prompt_chain.ipynb
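
The same search can also be written with `pathlib`, walking the tree once rather than once per extension and returning paths in a stable sorted order. This is an alternative sketch, not the notebook's method. The name `find_files_pathlib` is hypothetical, and the dot-directory filter is there because `Path.rglob` also visits hidden folders such as `.ipynb_checkpoints`.

```python
import tempfile
from pathlib import Path


def find_files_pathlib(base_path, extensions):
    """Return sorted file paths under base_path with one of the given extensions."""
    base = Path(base_path)
    wanted = {f".{ext}" for ext in extensions}
    matches = []
    for path in base.rglob("*"):
        relative_parts = path.relative_to(base).parts
        # Skip dot-files and anything inside dot-directories
        if any(part.startswith(".") for part in relative_parts):
            continue
        if path.is_file() and path.suffix in wanted:
            matches.append(str(path))
    return sorted(matches)


# Build a small throwaway tree to exercise the function
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guides").mkdir()
    (root / ".ipynb_checkpoints").mkdir()
    (root / "intro.mdx").write_text("# Intro")
    (root / "guides" / "demo.ipynb").write_text("{}")
    (root / "guides" / "notes.txt").write_text("ignored")
    (root / ".ipynb_checkpoints" / "demo-checkpoint.ipynb").write_text("{}")
    found = [
        Path(p).relative_to(root).as_posix()
        for p in find_files_pathlib(root, ["mdx", "ipynb"])
    ]

print(found)
```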


## Load and Process Documents

Load all documents using the appropriate loader for each file type.

In [7]:
def load_documents(files):
    """Load all documents from the given file paths."""
    documents = []
    failed_files = []
    
    for i, file_path in enumerate(files):
        try:
            if i % 50 == 0:
                print(f"Processing file {i+1}/{len(files)}")
                
            if file_path.endswith('.mdx'):
                loader = MDXLoader(file_path)
                docs = loader.load()
                documents.extend(docs)
            elif file_path.endswith('.ipynb'):
                loader = NotebookLoader(file_path, include_outputs=True, max_output_length=50)
                docs = loader.load()
                documents.extend(docs)
        except Exception as e:
            failed_files.append((file_path, str(e)))
            print(f"Error loading {file_path}: {str(e)}")
    
    return documents, failed_files

# Load all documents
print("Loading documents...")
documents, failed_files = load_documents(files)
print(f"Loaded {len(documents)} documents")

if failed_files:
    print(f"Failed to load {len(failed_files)} files")
    print("First few failures:")
    for path, error in failed_files[:3]:
        print(f"  - {path}: {error}")

Loading documents...
Processing file 1/1514
Processing file 51/1514
Processing file 101/1514
Processing file 151/1514
Processing file 201/1514
Processing file 251/1514
Processing file 301/1514
Processing file 351/1514
Processing file 401/1514
Processing file 451/1514
Processing file 501/1514
Processing file 551/1514
Processing file 601/1514
Processing file 651/1514
Processing file 701/1514
Processing file 751/1514
Processing file 801/1514
Processing file 851/1514
Processing file 901/1514
Processing file 951/1514
Processing file 1001/1514
Processing file 1051/1514
Processing file 1101/1514
Processing file 1151/1514
Processing file 1201/1514
Processing file 1251/1514
Processing file 1301/1514
Processing file 1351/1514
Processing file 1401/1514
Processing file 1451/1514
Processing file 1501/1514
Loaded 1514 documents


## Analyze Document Metadata

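Examine the metadata of the loaded documents to understand what we've collected. Only `MDXLoader` sets a `filetype` key, so the 1101 notebook documents are counted as `unknown` in the output below.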

In [8]:
# Analyze document metadata
file_types = {}
metadata_keys = set()

for doc in documents:
    file_type = doc.metadata.get('filetype', 'unknown')
    file_types[file_type] = file_types.get(file_type, 0) + 1
    metadata_keys.update(doc.metadata.keys())

print("Document types:")
for file_type, count in file_types.items():
    print(f"  - {file_type}: {count}")

print("\nMetadata keys found:")
for key in sorted(metadata_keys):
    print(f"  - {key}")

# Sample document content
if documents:
    print("\nSample document content (first 200 chars):")
    print(f"Source: {documents[0].metadata.get('source', 'Unknown')}")
    print(f"Content: {documents[0].page_content[:200]}...")

Document types:
  - mdx: 413
  - unknown: 1101

Metadata keys found:
  - description
  - filetype
  - hide_table_of_contents
  - keywords
  - pagination_next
  - pagination_prev
  - sidebar-position
  - sidebar_class_name
  - sidebar_label
  - sidebar_position
  - source

Sample document content (first 200 chars):
Source: documents/docs/people.mdx
Content: 

import People from "@theme/People";

# People

There are some incredible humans from all over the world who have been instrumental in helping the LangChain community flourish 🌐!

This page highlight...
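
The `unknown` count above comes from documents that have no `filetype` key. One possible way to label them, assuming each document carries a `source` path, is to fall back to the file extension. The helper `infer_filetype` and the sample metadata dicts below are hypothetical and use plain dictionaries rather than `Document` objects so the sketch runs on its own.

```python
from collections import Counter
from pathlib import Path


def infer_filetype(metadata):
    """Return the existing 'filetype', else the extension of 'source', else 'unknown'."""
    if metadata.get("filetype"):
        return metadata["filetype"]
    suffix = Path(metadata.get("source", "")).suffix
    return suffix.lstrip(".") or "unknown"


# Hypothetical metadata dicts shaped like the ones in this notebook
sample_metadata = [
    {"source": "documents/docs/people.mdx", "filetype": "mdx"},
    {"source": "documents/docs/versions/migrating_chains/stuff_docs_chain.ipynb"},
    {},
]
counts = Counter(infer_filetype(m) for m in sample_metadata)
print(dict(counts))
```

Applied to the loaded documents, this would amount to setting `doc.metadata["filetype"] = infer_filetype(doc.metadata)` before splitting, so that the chunks inherit the label.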


## Split Documents into Chunks

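Split the documents into chunks of at most 1,000 characters with a 100-character overlap, so that each embedding covers a focused passage and retrieval can return the relevant part of a long page.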

In [9]:
# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    length_function=len,
)

chunks = text_splitter.split_documents(documents)
print(f"Split {len(documents)} documents into {len(chunks)} chunks")

# Sample chunk
if chunks:
    print("\nSample chunk:")
    print(f"Source: {chunks[0].metadata.get('source', 'Unknown')}")
    print(f"Content length: {len(chunks[0].page_content)} chars")
    print(f"Content preview: {chunks[0].page_content[:150]}...")

Split 1514 documents into 12791 chunks

Sample chunk:
Source: documents/docs/people.mdx
Content length: 906 chars
Content preview: import People from "@theme/People";

# People

There are some incredible humans from all over the world who have been instrumental in helping the Lang...
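
To make `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified fixed-width sliding window. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line boundaries. It only illustrates how size and overlap interact. The function name `sliding_window_chunks` is an assumption for this sketch.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=100):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


text = "".join(str(i % 10) for i in range(2500))
window_chunks = sliding_window_chunks(text)
print([len(c) for c in window_chunks])
# The tail of one chunk is repeated at the head of the next
print(window_chunks[0][-100:] == window_chunks[1][:100])
```

With a step of 900 characters, a 2,500-character text yields three windows. The real splitter usually produces chunks shorter than `chunk_size`, as the 906-character sample chunk above shows, because it breaks at separators instead of at fixed offsets.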


In [43]:
# Upload all chunks (for production)
upload_chunks = chunks
# For testing with a small sample, use this instead:
# upload_chunks = chunks[:3]

print(f"Total chunks to upload: {len(upload_chunks)}")

Total chunks to upload: 12791


## Store Documents in Vector Database

Store the document chunks in Supabase vector store for retrieval.

In [26]:
vectorstore = SupabaseVectorStore(
    client=supabase,
    table_name="documents",
    query_name="rag_query",
    embedding=embeddings,
)

In [None]:
# Store documents in vector database using smaller batches with better error handling
import math
import time

print("Storing documents in vector database...")
print(f"Total chunks to upload: {len(upload_chunks)}")

# Process in batches of 30 chunks
batch_size = 30
num_chunks = len(upload_chunks)
num_batches = math.ceil(num_chunks / batch_size)

print(f"Processing in {num_batches} batches of {batch_size} chunks each")

successful_batches = 0
skipped_batches = 0
failed_batches = 0
ssl_error_batches = []  # Track batches with SSL errors

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, num_chunks)
    current_batch = upload_chunks[start_idx:end_idx]
    
    print(f"\nProcessing batch {i+1}/{num_batches} with {len(current_batch)} chunks (chunks {start_idx+1}-{end_idx})...")
    
    try:       
        # Store the current batch in the vector database
        batch_vector_store = SupabaseVectorStore.from_documents(
            current_batch,
            embeddings,
            client=supabase,
            table_name="documents",
            query_name="rag_query",
            chunk_size=30  # Smaller internal chunk size for API calls
        )
        
        
        print(f"Batch {i+1}/{num_batches} successfully stored!")
        successful_batches += 1
        
        # Add a small delay between batches to avoid rate limits
        if i < num_batches - 1:
            print("Waiting 1 second before processing next batch...")
            time.sleep(1)
            
    except Exception as e:
        error_message = str(e) if not hasattr(e, 'message') else str(e.message)
        
        # Check if it's an SSL error
        if any(ssl_pattern in error_message.lower() for ssl_pattern in ["ssl", "tls", "certificate", "handshake", "bad record"]):
            ssl_error_batches.append({
                "batch_number": i+1,
                "batch_size": len(current_batch),
                "start_idx": start_idx,
                "end_idx": end_idx,
                "error": error_message
            })
            failed_batches += 1
            print(f"SSL Error in batch {i+1}: {error_message}")
            print("\nRecorded SSL error information. Trying to continue with next batch...")
            # Add a longer delay after SSL error before trying next batch
            time.sleep(3)
            continue
            
        # Check if it's a duplicate exception (look for common duplicate error patterns)
        if any(dup_pattern in error_message.lower() for dup_pattern in ["duplicate", "already exists", "unique constraint", "unique violation", "conflict"]):
            skipped_batches += 1
            print(f"Duplicate detected in batch {i+1}, skipping: {error_message}")
            
            # Add a small delay before continuing
            time.sleep(0.5)
            continue  # Skip this batch and continue with the next one
        else:
            # For any other exception, break the loop
            failed_batches += 1
            print(f"Non-duplicate error in batch {i+1}: {error_message}")
            print("\nBreaking process due to non-duplicate error.")
            break  # Exit the loop on any non-duplicate error

print(f"\nUpload process summary:")
print(f"- {successful_batches} batches successfully stored")
print(f"- {skipped_batches} batches skipped due to duplicates")
print(f"- {failed_batches} batches failed with other errors")

# Display SSL error information if any were recorded
if ssl_error_batches:
    print("\nSSL Error Information:")
    print(f"Total SSL errors: {len(ssl_error_batches)}")
    for i, error_info in enumerate(ssl_error_batches):
        print(f"\nSSL Error #{i+1}:")
        print(f"  Batch number: {error_info['batch_number']}")
        print(f"  Batch size: {error_info['batch_size']}")
        print(f"  Chunk range: {error_info['start_idx']+1}-{error_info['end_idx']}")
        print(f"  Error: {error_info['error']}")

if successful_batches > 0 or skipped_batches > 0:
    print("You can now use the vector store for retrieving documents!")
    
    # Create a vector store instance for retrieval if needed
    vector_store = SupabaseVectorStore(
        client=supabase,
        table_name="documents",
        query_name="rag_query",
        embedding=embeddings,
    )

Storing documents in vector database...
Total chunks to upload: 12791
Processing in 427 batches of 30 chunks each

Processing batch 1/427 with 30 chunks (chunks 1-30)...
Duplicate detected in batch 1, skipping: duplicate key value violates unique constraint "documents_content_key"

Processing batch 2/427 with 30 chunks (chunks 31-60)...
Duplicate detected in batch 2, skipping: duplicate key value violates unique constraint "documents_content_key"

Processing batch 3/427 with 30 chunks (chunks 61-90)...
Duplicate detected in batch 3, skipping: duplicate key value violates unique constraint "documents_content_key"

Processing batch 4/427 with 30 chunks (chunks 91-120)...
Duplicate detected in batch 4, skipping: duplicate key value violates unique constraint "documents_content_key"

Processing batch 5/427 with 30 chunks (chunks 121-150)...
Duplicate detected in batch 5, skipping: duplicate key value violates unique constraint "documents_content_key"

Processing batch 6/427 with 30 chunks 
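
The log above shows whole batches being skipped as soon as one insert violates the `documents_content_key` constraint. Assuming that constraint is on the chunk text, a partial mitigation is to drop repeated texts locally before slicing into batches. The helpers `dedupe_texts` and `iter_batches` are hypothetical names, and the sketch works on plain strings rather than `Document` objects.

```python
import math


def dedupe_texts(texts):
    """Drop repeated texts, keeping the first occurrence and the original order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


def iter_batches(items, batch_size):
    """Yield (batch_number, batch) pairs, with batch_number starting at 1."""
    for start in range(0, len(items), batch_size):
        yield start // batch_size + 1, items[start:start + batch_size]


texts = ["alpha", "beta", "alpha", "gamma", "beta", "delta", "epsilon"]
unique = dedupe_texts(texts)
batches = list(iter_batches(unique, batch_size=2))
print(unique)
print(batches)

# Batch arithmetic for this notebook: 12791 chunks in batches of 30
total_batches = math.ceil(12791 / 30)
last_batch_size = 12791 - (total_batches - 1) * 30
print(total_batches, last_batch_size)
```

This only removes repeats within the current run. Rows already present in the table from an earlier run would still conflict, which is what the skipped batches in the log suggest.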

## Test Retrieval

Test the retrieval functionality with a sample query.

In [None]:
# Test retrieval
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

# Test with a sample query
query = "How do agents work in LangChain?"
# `invoke` replaces the deprecated `get_relevant_documents` in recent LangChain releases
docs = retriever.invoke(query)

print(f"Retrieved {len(docs)} documents for query: '{query}'")
print("\nRetrieved documents:")

for i, doc in enumerate(docs):
    print(f"\nDocument {i+1}:")
    print(f"Source: {doc.metadata.get('source', 'Unknown')}")
    print(f"File type: {doc.metadata.get('filetype', 'Unknown')}")
    print(f"Content (first 200 chars): {doc.page_content[:200]}...")
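
The ranking itself happens inside the `rag_query` database function, whose definition is not shown in this notebook. Assuming it ranks rows by cosine similarity between the query embedding and the stored embeddings, the effect of `k` can be sketched with toy three-dimensional vectors. All names and numbers below are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed, k):
    """Return the k names whose vectors are most similar to query_vector."""
    scored = [(cosine_similarity(query_vector, vec), name) for name, vec in indexed.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Toy stand-ins for stored chunk embeddings
toy_index = {
    "agents.mdx": [0.9, 0.1, 0.0],
    "retrievers.mdx": [0.1, 0.9, 0.0],
    "agent_tools.ipynb": [0.8, 0.2, 0.1],
}
nearest = top_k([1.0, 0.0, 0.0], toy_index, k=2)
print(nearest)
```

Raising `k` returns more candidates at the cost of including less similar chunks, which is the trade-off behind `search_kwargs={"k": 5}` in the cell above.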