# RAGCUN - Document Processing Example

This notebook demonstrates how to work with documents in RAGCUN, including:
- Uploading documents in Google Colab
- Processing different document formats
- Building a document-based Q&A system

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ctn/ragcun/blob/main/notebooks/document_processing.ipynb)

## 1. Setup

In [None]:
# Install RAGCUN
!git clone https://github.com/ctn/ragcun.git
%cd ragcun
!pip install -q -e .

print("✓ Installation complete!")

In [None]:
from ragcun import RAGPipeline
from pathlib import Path
import os

print("✓ Imports successful!")

## 2. Upload Documents

Choose your preferred method to add documents.

### Method 1: Upload Files from Computer

In [None]:
from google.colab import files

# Upload files
uploaded = files.upload()

# Save to data/raw directory
os.makedirs('data/raw', exist_ok=True)
for filename, content in uploaded.items():
    # files.upload() maps each filename to the file's raw bytes
    with open(f'data/raw/{filename}', 'wb') as f:
        f.write(content)
    print(f"✓ Saved {filename}")

### Method 2: Mount Google Drive

In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# You can now access files from your Drive
# Example: drive_path = '/content/drive/MyDrive/documents/'
print("✓ Google Drive mounted!")
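
Once Drive is mounted, its files can be read like ordinary local files, so they can be copied into `data/raw` next to any uploaded documents. The following is a minimal sketch, assuming the text files sit in a single Drive folder. The helper name `copy_text_files` and the example folder path are hypothetical and not part of RAGCUN.

```python
import shutil
from pathlib import Path

def copy_text_files(src_dir, dest_dir='data/raw'):
    """Copy every .txt file from src_dir into dest_dir and return the copied names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(src_dir).glob('*.txt')):
        shutil.copy(src, dest / src.name)
        copied.append(src.name)
    return copied

# Example (hypothetical Drive folder):
# copy_text_files('/content/drive/MyDrive/documents')
```

Copying rather than reading directly from Drive keeps every document in one place, so the same directory-based loader works for all three upload methods.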

### Method 3: Download from URL

In [None]:
# Example: Download a sample text file
import urllib.request

os.makedirs('data/raw', exist_ok=True)

# Replace with your URL
# url = "https://example.com/document.txt"
# urllib.request.urlretrieve(url, 'data/raw/document.txt')

print("✓ Ready to download documents from URLs")
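
The commented-out lines above hard-code the destination filename. A small wrapper can derive it from the URL instead. This is a sketch under stated assumptions: the function name `download_document` and the fallback name `document.txt` are hypothetical, and no error handling is included for failed requests.

```python
import os
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse

def download_document(url, dest_dir='data/raw'):
    """Download url into dest_dir, naming the file after the last URL path segment."""
    # Fall back to a generic name when the URL path has no final segment
    name = unquote(os.path.basename(urlparse(url).path)) or 'document.txt'
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    target = dest / name
    urllib.request.urlretrieve(url, target)
    return target

# Example (replace with your URL):
# download_document("https://example.com/document.txt")
```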

## 3. Process Documents

Read text from the uploaded documents. The helpers below load plain-text (`.txt`) files from a directory; other formats need their own readers.

In [None]:
def read_text_file(filepath):
    """Read a plain text file."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.read()

def read_documents_from_directory(directory):
    """Read all text documents from a directory."""
    documents = []
    path = Path(directory)
    
    for file in sorted(path.glob('*.txt')):  # sorted for a stable load order
        content = read_text_file(file)
        documents.append(content)
        print(f"✓ Loaded {file.name}")
    
    return documents

# Example: Read documents from data/raw
# documents = read_documents_from_directory('data/raw')
print("✓ Document processing functions ready")
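
`read_documents_from_directory` only picks up `.txt` files and discards the filenames. One way to cover other plain-text formats is sketched here, assuming Markdown files can be treated as plain text. The name `read_named_documents` and its default pattern list are hypothetical. Binary formats such as PDF would need a dedicated parser and are not handled.

```python
from pathlib import Path

def read_named_documents(directory, patterns=('*.txt', '*.md')):
    """Return (filename, text) pairs for files matching any of the patterns."""
    results = []
    for pattern in patterns:
        for file in sorted(Path(directory).glob(pattern)):
            # errors='replace' keeps loading even if a file is not valid UTF-8
            text = file.read_text(encoding='utf-8', errors='replace')
            results.append((file.name, text))
    return results

# Example: keep only the text when passing documents on
# documents = [text for _, text in read_named_documents('data/raw')]
```

Returning the filename with each text makes it easier to trace which source file a document came from.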

## 4. Demo with Sample Documents

Let's create some sample documents to demonstrate the system.

In [None]:
# Create sample documents
sample_documents = [
    """Artificial Intelligence Overview
    
    Artificial Intelligence (AI) refers to the simulation of human intelligence 
    in machines that are programmed to think and learn. AI systems can perform 
    tasks such as visual perception, speech recognition, decision-making, and 
    language translation.""",
    
    """Machine Learning Basics
    
    Machine Learning is a subset of AI that enables systems to learn and improve 
    from experience without being explicitly programmed. It focuses on developing 
    algorithms that can access data and use it to learn patterns.""",
    
    """Deep Learning Explained
    
    Deep Learning is a subset of machine learning that uses neural networks with 
    multiple layers. These deep neural networks can automatically learn hierarchical 
    representations of data, making them powerful for tasks like image recognition 
    and natural language processing.""",
    
    """Natural Language Processing
    
    Natural Language Processing (NLP) is a branch of AI that helps computers 
    understand, interpret, and generate human language. Applications include 
    chatbots, translation services, sentiment analysis, and text summarization.""",
]

print(f"✓ Created {len(sample_documents)} sample documents")

## 5. Build RAG System

In [None]:
# Create RAG pipeline
pipeline = RAGPipeline()

# Add documents
pipeline.add_documents(sample_documents)

print(f"✓ RAG pipeline ready with {len(sample_documents)} documents")

## 6. Query the System

In [None]:
# Define queries
queries = [
    "What is artificial intelligence?",
    "Explain machine learning",
    "What are the applications of NLP?",
    "How does deep learning differ from machine learning?",
]

# Run queries
print("=" * 70)
print("Q&A Session")
print("=" * 70 + "\n")

for query in queries:
    print(f"\n📝 Question: {query}")
    response = pipeline.query(query, top_k=2)
    print(f"\n💡 Answer:\n{response}")
    print("\n" + "-" * 70)

## 7. Interactive Q&A

Try asking your own questions!

In [None]:
# Interactive query
user_question = input("Ask a question: ")

if user_question:
    response = pipeline.query(user_question, top_k=3)
    print(f"\nAnswer:\n{response}")
else:
    print("No question provided.")

## Next Steps

- Try uploading your own documents
- Experiment with different types of queries
- Adjust the `top_k` parameter to retrieve more or fewer documents
- Explore the other notebooks and examples
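
To build intuition for what `top_k` controls, the toy retriever below ranks documents by how many distinct words they share with the query and keeps the best `k`. It is only an illustration: RAGCUN's actual retrieval method is not shown in this notebook, and the function `top_k_by_overlap` and its word-overlap score are assumptions made for this sketch.

```python
import re

def top_k_by_overlap(query, documents, k=2):
    """Return the k documents sharing the most distinct words with the query."""
    def words(text):
        return set(re.findall(r'[a-z0-9]+', text.lower()))

    query_words = words(query)
    # Highest overlap first; ties keep their original order
    ranked = sorted(documents, key=lambda d: len(query_words & words(d)), reverse=True)
    return ranked[:k]

docs = [
    "Machine learning lets systems learn from data.",
    "Deep learning uses neural networks with many layers.",
    "Chatbots are one application of natural language processing.",
]
print(top_k_by_overlap("what is deep learning", docs, k=1))
print(len(top_k_by_overlap("what is deep learning", docs, k=3)))
```

A small `k` returns only the closest match, while a larger `k` also brings in weaker matches, which can add context or noise to the answer.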

Happy experimenting! 🚀