# Day 4 Solution - Tokenization and Text Processing

This is my solution to the Day 4 assignment. It covers how tokenization works and applies token-aware text processing techniques.

## Features Implemented:
- Tokenization with tiktoken library
- Token counting and analysis
- Text chunking strategies
- Model-specific tokenization
- Cost estimation and optimization
- Advanced text processing techniques


In [None]:
# Day 4 Solution - Imports and Setup
import tiktoken
import os
from dotenv import load_dotenv
from openai import OpenAI
import json

# Load environment variables
load_dotenv(override=True)
openai = OpenAI()

print("Day 4 setup complete! Ready for tokenization analysis.")


In [None]:
# Understanding Tokenization
print("## Tokenization Fundamentals")
print("="*50)

# Get encoding for different models
models = ["gpt-4o-mini", "gpt-4o", "gpt-3.5-turbo", "o1-mini"]

encodings = {}
for model in models:
    try:
        encodings[model] = tiktoken.encoding_for_model(model)
        print(f"✅ {model}: {encodings[model].name}")
    except Exception as e:
        print(f"❌ {model}: {e}")

# Test text
test_text = "Hi my name is Ed and I like banoffee pie. This is a test of tokenization!"

print(f"\nTest text: '{test_text}'")
print(f"Text length: {len(test_text)} characters")

# Tokenize with different models
for model, encoding in encodings.items():
    tokens = encoding.encode(test_text)
    print(f"\n{model}:")
    print(f"  Tokens: {len(tokens)}")
    print(f"  Token IDs: {tokens}")
    
    # Show individual tokens
    print("  Individual tokens:")
    for i, token_id in enumerate(tokens):
        token_text = encoding.decode([token_id])
        print(f"    {i+1}. {token_id} = '{token_text}'")


In [None]:
# Token Counting and Cost Estimation
def count_tokens(text, model="gpt-4o-mini"):
    """Count tokens for a given text and model"""
    try:
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except Exception as e:
        print(f"Error counting tokens for {model}: {e}")
        return 0

def estimate_cost(text, model="gpt-4o-mini", operation="output"):
    """Estimate the cost in USD of `text` billed as "input" or "output" tokens.

    Returns (token_count, cost); cost is 0 for models missing from the pricing table.
    """
    token_count = count_tokens(text, model)

    # USD per 1K tokens (as of 2024; check current pricing before relying on these)
    pricing = {
        "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
        "gpt-4o": {"input": 0.005, "output": 0.015},
        "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015}
    }

    if model not in pricing:
        return token_count, 0

    # Any operation other than "input" is billed at the output rate
    rate = pricing[model]["input" if operation == "input" else "output"]
    return token_count, (token_count / 1000) * rate

# Test with different texts
test_texts = [
    "Hello world!",
    "This is a longer text that will have more tokens and cost more money to process.",
    "Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data without being explicitly programmed for every task.",
    "The quick brown fox jumps over the lazy dog. " * 10  # Repeated text
]

print("## Token Counting and Cost Analysis")
print("="*60)

for i, text in enumerate(test_texts, 1):
    print(f"\nText {i}: '{text[:50]}{'...' if len(text) > 50 else ''}'")
    print(f"Length: {len(text)} characters")
    
    for model in ["gpt-4o-mini", "gpt-4o", "gpt-3.5-turbo"]:
        tokens, cost = estimate_cost(text, model, "input")
        print(f"  {model}: {tokens} tokens, ${cost:.6f}")
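
A real request is billed for both the prompt and the completion, each at its own rate. The sketch below shows only that arithmetic, using plain token counts. `request_cost` and `PRICES_PER_1K` are hypothetical names, and the rates are copied from the `pricing` dictionary in `estimate_cost`, so they may no longer match current list prices.

```python
# Hypothetical helper: the rates mirror the 2024 table in estimate_cost and may be out of date.
PRICES_PER_1K = {
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "gpt-4o": {"input": 0.005, "output": 0.015},
    "gpt-3.5-turbo": {"input": 0.0005, "output": 0.0015},
}


def request_cost(input_tokens, output_tokens, model="gpt-4o-mini"):
    """Return the estimated USD cost of one request, or None for an unpriced model."""
    rates = PRICES_PER_1K.get(model)
    if rates is None:
        return None  # unknown model: signal "no estimate" rather than a misleading 0
    input_cost = (input_tokens / 1000) * rates["input"]
    output_cost = (output_tokens / 1000) * rates["output"]
    return input_cost + output_cost


# 1,000 prompt tokens plus 100 completion tokens on gpt-4o-mini
print(f"${request_cost(1000, 100):.6f}")
```

Returning `None` for an unknown model is a design choice for this sketch: it keeps a missing price from being mistaken for a free request.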


In [None]:
# Text Chunking Strategies
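def chunk_text_by_tokens(text, max_tokens=1000, model="gpt-4o-mini", overlap=50):
    """Split text into chunks of at most max_tokens tokens; neighbouring chunks share `overlap` tokens"""
    if not 0 <= overlap < max_tokens:
        # With overlap >= max_tokens the start index would never advance (infinite loop)
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    encoding = tiktoken.encoding_for_model(model)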
    
    # Encode the entire text
    tokens = encoding.encode(text)
    chunks = []
    
    start = 0
    while start < len(tokens):
        # Get chunk of tokens
        end = min(start + max_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        
        # Decode back to text
        chunk_text = encoding.decode(chunk_tokens)
        chunks.append(chunk_text)
        
        # Move start position with overlap
        start = end - overlap if end < len(tokens) else end
    
    return chunks

def chunk_text_by_sentences(text, max_tokens=1000, model="gpt-4o-mini"):
    """Split text into chunks by sentences, respecting token limits.

    A single sentence longer than max_tokens is kept whole as its own chunk.
    """
    # Split on ". " (simple approach: ignores "?" and "!" and trips on abbreviations)
    sentences = [s.strip() for s in text.split('. ') if s.strip()]
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        # Restore the period removed by split(), unless the sentence kept its own
        if not sentence.endswith('.'):
            sentence += '.'

        # Add sentence to current chunk
        test_chunk = f"{current_chunk} {sentence}" if current_chunk else sentence

        # Check token count
        if count_tokens(test_chunk, model) <= max_tokens:
            current_chunk = test_chunk
        else:
            # Save current chunk and start new one
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = sentence

    # Add final chunk
    if current_chunk:
        chunks.append(current_chunk)

    return chunks

# Test chunking strategies
long_text = """
Machine learning is a subset of artificial intelligence that focuses on algorithms that can learn from data without being explicitly programmed for every task. 
It involves training models on large datasets to make predictions or decisions. 
There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. 
Supervised learning uses labeled training data to learn a mapping from inputs to outputs. 
Unsupervised learning finds hidden patterns in data without labeled examples. 
Reinforcement learning learns through interaction with an environment using rewards and penalties. 
Deep learning is a subset of machine learning that uses neural networks with multiple layers. 
These networks can automatically learn hierarchical representations of data. 
Popular deep learning frameworks include TensorFlow, PyTorch, and Keras. 
Machine learning has applications in computer vision, natural language processing, speech recognition, and many other domains.
""" * 3  # Repeat to make it longer

print("## Text Chunking Strategies")
print("="*50)

print(f"Original text length: {len(long_text)} characters")
print(f"Token count: {count_tokens(long_text, 'gpt-4o-mini')} tokens")

# Test token-based chunking
print("\n📊 Token-based chunking:")
token_chunks = chunk_text_by_tokens(long_text, max_tokens=200, model="gpt-4o-mini")
for i, chunk in enumerate(token_chunks):
    tokens = count_tokens(chunk, "gpt-4o-mini")
    print(f"  Chunk {i+1}: {tokens} tokens, {len(chunk)} chars")

# Test sentence-based chunking
print("\n📊 Sentence-based chunking:")
sentence_chunks = chunk_text_by_sentences(long_text, max_tokens=200, model="gpt-4o-mini")
for i, chunk in enumerate(sentence_chunks):
    tokens = count_tokens(chunk, "gpt-4o-mini")
    print(f"  Chunk {i+1}: {tokens} tokens, {len(chunk)} chars")
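
The overlap logic in `chunk_text_by_tokens` is easier to check when separated from the tokenizer. The following is a minimal sketch that applies the same sliding window to a plain list of integers standing in for token IDs. `window_bounds` is a hypothetical helper written for this illustration, not part of tiktoken.

```python
def window_bounds(n_tokens, max_tokens, overlap):
    """Return (start, end) index pairs covering n_tokens with overlapping windows."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(start + max_tokens, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break  # the last window reached the end of the sequence
        start = end - overlap  # step back so neighbouring windows share `overlap` tokens
    return bounds


fake_tokens = list(range(10))  # stand-in for encoding.encode(text)
for start, end in window_bounds(len(fake_tokens), max_tokens=4, overlap=1):
    print(fake_tokens[start:end])
```

Each window starts `max_tokens - overlap` positions after the previous one, so a larger overlap produces more chunks and more tokens to pay for.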


In [None]:
# Advanced Text Processing with Token Awareness
def process_large_text(text, model="gpt-4o-mini", max_tokens=1000, operation="summarize"):
    """Process large text with token awareness"""
    chunks = chunk_text_by_tokens(text, max_tokens, model)
    
    print(f"📊 Processing {len(chunks)} chunks with {model}")
    
    results = []
    total_cost = 0
    
    for i, chunk in enumerate(chunks):
        print(f"\nProcessing chunk {i+1}/{len(chunks)}...")
        
        # Build the prompt for the requested operation
        if operation == "summarize":
            prompt = f"Summarize this text in 2-3 sentences:\n\n{chunk}"
        elif operation == "extract_keywords":
            prompt = f"Extract the 5 most important keywords from this text:\n\n{chunk}"
        elif operation == "sentiment":
            prompt = f"Analyze the sentiment of this text (positive/negative/neutral):\n\n{chunk}"
        else:
            prompt = f"Process this text:\n\n{chunk}"
        
        # Estimate input cost from the full prompt (instruction + chunk), since that is what gets sent
        tokens, cost = estimate_cost(prompt, model, "input")
        total_cost += cost
        
        try:
            response = openai.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                max_tokens=100,
                temperature=0.3
            )
            
            result = response.choices[0].message.content
            results.append(result)
            
            # Estimate output cost
            output_tokens, output_cost = estimate_cost(result, model, "output")
            total_cost += output_cost
            
            print(f"  ✅ Chunk {i+1} processed: {len(result)} chars")
            
        except Exception as e:
            print(f"  ❌ Error processing chunk {i+1}: {e}")
            results.append(f"Error: {e}")
    
    print(f"\n💰 Total estimated cost: ${total_cost:.6f}")
    return results, total_cost

# Test with a long document
document = """
Artificial Intelligence (AI) has become one of the most transformative technologies of the 21st century. 
It encompasses a wide range of techniques and applications that enable machines to perform tasks that typically require human intelligence. 
Machine learning, a subset of AI, allows systems to automatically learn and improve from experience without being explicitly programmed. 
Deep learning, which uses neural networks with multiple layers, has achieved remarkable success in areas like image recognition, natural language processing, and game playing. 
AI applications are now ubiquitous, from recommendation systems on e-commerce platforms to autonomous vehicles and medical diagnosis tools. 
The field continues to evolve rapidly, with new architectures and training methods being developed regularly. 
However, AI also raises important questions about ethics, bias, job displacement, and the need for responsible development and deployment. 
As AI becomes more powerful and widespread, it's crucial to ensure that these systems are fair, transparent, and beneficial to society as a whole.
""" * 5  # Make it longer

print("## Advanced Text Processing with Token Awareness")
print("="*60)

# Test summarization
print("\n📝 Testing summarization...")
summaries, cost = process_large_text(document, operation="summarize")
print(f"\nGenerated {len(summaries)} summaries")
for i, summary in enumerate(summaries):
    print(f"\nSummary {i+1}: {summary}")

print(f"\nTotal cost: ${cost:.6f}")
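
Re-tokenizing the prompt and the reply gives only an estimate, because the chat format adds a few tokens per message. The chat completions response also reports token counts in `response.usage` (`prompt_tokens` and `completion_tokens`), which could be summed across chunks instead. The sketch below shows one way to do that. `add_usage` and `usage_cost` are hypothetical helpers, and the usage objects are made-up stand-ins so the example runs without an API call.

```python
from types import SimpleNamespace


def add_usage(totals, usage):
    """Add one response's reported token counts to a running totals dict."""
    totals["input"] += usage.prompt_tokens
    totals["output"] += usage.completion_tokens
    return totals


def usage_cost(totals, input_rate_per_1k, output_rate_per_1k):
    """Convert accumulated token counts to USD using per-1K-token rates."""
    input_cost = (totals["input"] / 1000) * input_rate_per_1k
    output_cost = (totals["output"] / 1000) * output_rate_per_1k
    return input_cost + output_cost


# Stand-ins for response.usage from two chunk requests (made-up counts, no API call)
fake_usages = [
    SimpleNamespace(prompt_tokens=820, completion_tokens=64),
    SimpleNamespace(prompt_tokens=790, completion_tokens=58),
]

totals = {"input": 0, "output": 0}
for usage in fake_usages:
    add_usage(totals, usage)

# Rates are the gpt-4o-mini entries from the pricing table in estimate_cost
print(totals, f"${usage_cost(totals, 0.00015, 0.0006):.6f}")
```

Inside `process_large_text`, the same accumulation would take `response.usage` from each successful call in place of the stand-in objects.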
