# Document Ingestion and Indexing Demo

This notebook demonstrates the document ingestion pipeline:
1. Uploading documents (PDF, DOCX, TXT, images)
2. Text extraction
3. Chunking with overlap
4. Embedding generation
5. Storage in PostgreSQL with pgvector

## Setup

First, let's set up Django and import the necessary modules.

In [None]:
import os
import sys
import django

# Add backend to path
sys.path.insert(0, os.path.abspath('..'))
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'config.settings')
django.setup()

print("Django setup complete!")

In [None]:
from documents.models import Document, DocumentChunk
from documents.services import DocumentProcessor
from django.core.files.uploadedfile import SimpleUploadedFile
import google.generativeai as genai
from django.conf import settings

print(f"Gemini Model: {settings.GEMINI_MODEL}")
print(f"Embedding Model: {settings.GEMINI_EMBEDDING_MODEL}")
print(f"Chunk Size: {settings.CHUNK_SIZE}")
print(f"Chunk Overlap: {settings.CHUNK_OVERLAP}")

## 1. Document Upload

Let's create a sample document and upload it to the system.

In [None]:
# Create a sample text document
sample_text = """
Introduction to Machine Learning

Machine learning is a subset of artificial intelligence (AI) that provides systems 
the ability to automatically learn and improve from experience without being 
explicitly programmed. Machine learning focuses on the development of computer 
programs that can access data and use it to learn for themselves.

Types of Machine Learning

1. Supervised Learning: The algorithm learns from labeled training data and makes 
predictions based on that data. Examples include classification and regression tasks.

2. Unsupervised Learning: The algorithm learns from unlabeled data and finds hidden 
patterns or intrinsic structures. Examples include clustering and dimensionality reduction.

3. Reinforcement Learning: The algorithm learns by interacting with an environment 
and receiving rewards or penalties for actions taken.

Applications of Machine Learning

Machine learning has numerous applications across various industries:
- Healthcare: Disease prediction, medical imaging analysis
- Finance: Fraud detection, algorithmic trading
- Technology: Speech recognition, recommendation systems
- Transportation: Autonomous vehicles, route optimization

Deep Learning

Deep learning is a subset of machine learning that uses neural networks with 
multiple layers (deep neural networks). It has achieved remarkable success in 
tasks like image recognition, natural language processing, and speech recognition.
"""

print(f"Sample text length: {len(sample_text)} characters")

In [None]:
# Create document record
document = Document.objects.create(
    title="Introduction to Machine Learning",
    file_type="txt",
    file_size=len(sample_text.encode("utf-8")),  # size in bytes, not characters
    language="en",
    status=Document.Status.UPLOADED
)

print(f"Created document: {document}")
print(f"Document ID: {document.id}")
print(f"Status: {document.status}")

## 2. Text Extraction and Chunking

The sample document is already plain text, so no extraction step is needed here. Let's go straight to chunking and see how it works.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize text splitter with same settings as DocumentProcessor
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=settings.CHUNK_SIZE,
    chunk_overlap=settings.CHUNK_OVERLAP,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Split text into chunks
chunks = text_splitter.split_text(sample_text)

print(f"Number of chunks: {len(chunks)}")
print("\n" + "="*50)
for i, chunk in enumerate(chunks):
    print(f"\nChunk {i} (length: {len(chunk)} chars):")
    print("-" * 40)
    # Parentheses make the precedence of the conditional expression explicit
    print((chunk[:200] + "...") if len(chunk) > 200 else chunk)
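
To make the effect of `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on the configured separators); the function name `sliding_window_chunks` is hypothetical and only illustrates how consecutive chunks share their boundary characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows; each window repeats the
    last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


demo = sliding_window_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=3)
for piece in demo:
    print(piece)
```

Each printed chunk starts with the last three characters of the one before it, which is the property that keeps a sentence cut at a chunk boundary retrievable from either side.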

## 3. Embedding Generation

Let's generate an embedding for the first chunk using Google Gemini. The remaining chunks are embedded in the next section.

In [None]:
# Configure Gemini
genai.configure(api_key=settings.GOOGLE_API_KEY, transport='rest')

# Generate embedding for first chunk as example
result = genai.embed_content(
    model=settings.GEMINI_EMBEDDING_MODEL,
    content=chunks[0],
    task_type="retrieval_document"
)

embedding = result['embedding']
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 10 values: {embedding[:10]}")
print(f"Embedding type: {type(embedding)}")
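
Because a pgvector column has a fixed dimension, it can help to check an embedding before storing it. The sketch below assumes the 768 dimensions stated in the summary; `validate_embedding` and its `expected_dim` parameter are hypothetical helpers, not part of the project or of the Gemini client.

```python
import math


def validate_embedding(vector, expected_dim=768):
    """Return the vector as a list of floats, or raise ValueError.

    expected_dim=768 is an assumption taken from the summary of this
    notebook; adjust it to match the actual vector column.
    """
    values = [float(v) for v in vector]
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("embedding contains NaN or infinite values")
    return values


checked = validate_embedding([0.1] * 768)
print(len(checked))

try:
    validate_embedding([0.1] * 10)
except ValueError as exc:
    print(f"rejected: {exc}")
```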

## 4. Store Chunks with Embeddings

Now let's store all chunks with their embeddings in the database.

In [None]:
# Generate embeddings and store chunks
chunk_objects = []

for idx, chunk_text in enumerate(chunks):
    # Generate embedding
    result = genai.embed_content(
        model=settings.GEMINI_EMBEDDING_MODEL,
        content=chunk_text,
        task_type="retrieval_document"
    )
    embedding = result['embedding']
    
    # Create chunk object
    chunk = DocumentChunk(
        document=document,
        index=idx,
        text=chunk_text,
        embedding=embedding,
        char_count=len(chunk_text),
        token_count=len(chunk_text.split())  # whitespace word count, a rough proxy for model tokens
    )
    chunk_objects.append(chunk)
    print(f"Created chunk {idx}: {len(chunk_text)} chars, {len(embedding)} dim embedding")

# Bulk create chunks
DocumentChunk.objects.bulk_create(chunk_objects)
print(f"\nStored {len(chunk_objects)} chunks in database")
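
The loop above makes one API call per chunk, so a single transient failure aborts the whole run. One possible mitigation is a small retry wrapper with exponential backoff. This is a sketch under assumptions: `call_with_retry` is a hypothetical helper, and the flaky function below merely simulates an embedding call so the example runs without network access.

```python
import time


def call_with_retry(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying on any exception.

    The wait doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated embedding call that fails twice before succeeding
calls = {"count": 0}


def flaky_embed(text):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated transient failure")
    return {"embedding": [0.0, 1.0]}


delays = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_embed, "hello", attempts=3, base_delay=0.5, sleep=delays.append)
print(result["embedding"], delays)
```

In the storage loop, the same wrapper could be applied as `call_with_retry(genai.embed_content, model=..., content=chunk_text, task_type="retrieval_document")`. Catching every exception is deliberately broad for the sketch; narrowing it to the client's transient error types would be safer in practice.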

In [None]:
# Update document status
document.status = Document.Status.READY
document.num_chunks = len(chunk_objects)
document.num_pages = 1
document.save()

print(f"Document status: {document.status}")
print(f"Number of chunks: {document.num_chunks}")

## 5. Verify Storage

Let's verify that everything was stored correctly.

In [None]:
# Query stored chunks
stored_chunks = DocumentChunk.objects.filter(document=document).order_by('index')

print(f"Retrieved {stored_chunks.count()} chunks from database")
print("\n" + "="*50)

for chunk in stored_chunks:
    print(f"\nChunk {chunk.index}:")
    print(f"  - Characters: {chunk.char_count}")
    print(f"  - Tokens: {chunk.token_count}")
    print(f"  - Embedding dimensions: {len(chunk.embedding)}")
    print(f"  - Text preview: {chunk.text[:100]}...")

## 6. Vector Similarity Search

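Let's test vector similarity search on our stored chunks. To keep the steps visible, this demo loads the stored embeddings and ranks them in Python with NumPy instead of running the search inside PostgreSQL.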

In [None]:
import numpy as np

# Test query
query = "What are the applications of machine learning?"

# Generate query embedding
query_result = genai.embed_content(
    model=settings.GEMINI_EMBEDDING_MODEL,
    content=query,
    task_type="retrieval_query"
)
query_embedding = np.array(query_result['embedding'])

print(f"Query: {query}")
print(f"Query embedding dimensions: {len(query_embedding)}")

In [None]:
# Calculate cosine similarity with all chunks
similarities = []

for chunk in stored_chunks:
    chunk_embedding = np.array(chunk.embedding)
    
    # Cosine similarity
    similarity = np.dot(query_embedding, chunk_embedding) / (
        np.linalg.norm(query_embedding) * np.linalg.norm(chunk_embedding)
    )
    similarities.append((chunk, similarity))

# Sort by similarity (descending)
similarities.sort(key=lambda x: x[1], reverse=True)

print("Search Results (sorted by similarity):")
print("=" * 50)

for chunk, sim in similarities:
    print(f"\nChunk {chunk.index} (similarity: {sim:.4f}):")
    print(f"  {chunk.text[:200]}...")
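
For reference, the same ranking can be written without NumPy, which may be convenient in unit tests. The sketch below uses only the standard library, guards against zero-length vectors (the loop above would divide by zero on one), and returns only the top `k` matches. The names `cosine_similarity` and `top_k` are hypothetical, and the two-dimensional vectors are toy data, not real embeddings.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as unrelated rather than dividing by zero
    return dot / (norm_a * norm_b)


def top_k(query, candidates, k=3):
    """Return the k best (score, label) pairs from (label, vector) candidates."""
    scored = [(cosine_similarity(query, vec), label) for label, vec in candidates]
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


toy_chunks = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 1.0]),
    ("d", [0.0, 0.0]),
]
best = top_k([1.0, 0.0], toy_chunks, k=2)
for score, label in best:
    print(f"{label}: {score:.4f}")
```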

## 7. Cleanup

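The deletion below is commented out so the demo document stays in the database for further testing. Uncomment it to remove the document.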

In [None]:
# Optionally delete the demo document
# document.delete()
# print("Demo document deleted")

print("Demo complete! Document remains in database for further testing.")

## Summary

In this notebook, we demonstrated:

1. **Document Upload**: Creating a document record in Django
2. **Text Chunking**: Using LangChain's RecursiveCharacterTextSplitter with overlap
3. **Embedding Generation**: Using Google Gemini's text-embedding-004 model
4. **Storage**: Storing chunks with 768-dimensional embeddings in PostgreSQL with pgvector
5. **Vector Search**: Performing cosine similarity search on stored embeddings

The system supports:
- PDF, DOCX, TXT, JPG, JPEG, PNG file formats
- Configurable chunk size and overlap
- Hybrid retrieval (BM25 + vector search)
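
Hybrid retrieval is not demonstrated in this notebook, and how the system combines its BM25 and vector results is not shown here. As one common way to merge two ranked lists, the sketch below uses reciprocal rank fusion (RRF). This is an assumption for illustration only: the function name, the chunk identifiers, and the choice of RRF itself are hypothetical, and `k=60` is simply the constant commonly used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids into one list, best first.

    Each id scores 1 / (k + rank) for every list it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from a keyword search and a vector search
bm25_ranking = ["c2", "c0", "c1"]
vector_ranking = ["c0", "c3", "c2"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Chunks that appear in both lists rise to the top, while chunks found by only one retriever are still kept, ranked by their position in that single list.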