# Tutorial 2: Data Exploration and Your First API Call

Alright, let's get our hands dirty with some actual data and API calls. 

## What we're doing today

- Download the GDSC 8 dataset (jobs + trainings from Brazil)
- Make our first Mistral API call (and not go broke doing it)
- Understand why tokens matter (spoiler: they cost money)
- Use LLMs to filter data instead of writing regex hell

**Reality check**: This is about building AI agents that help people find green jobs in Brazil. Cool mission, but also we're in a competition, so let's be smart about costs and performance.

---

## Understanding the Challenge Data

### The Mission (kinda cool actually)
We're helping young people in Brazil find green jobs. UNICEF partnership, climate action, meaningful careers - the whole deal. But here's the thing: we need to build AI agents that can sift through job descriptions and training programs, match them to people's profiles, and do it efficiently.

### Let's grab the data
Time to download some files from S3:

In [None]:
# Download the GDSC 8 dataset
!aws s3 cp s3://gdsc25test/ . --recursive

After this runs, you'll have a `data` directory with:
- **`jobs/`** - 200 job postings
- **`trainings/`** - 497 training programs

### Quick math reality check
697 items × however many personas we need to match = potentially expensive if we're not careful with API calls.

This is where being smart about it pays off. Literally.
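
To put rough numbers on that, here's a back-of-the-envelope sketch. The persona count and tokens per call are made-up placeholders (we don't know the real values yet), and the price is the rough small-model estimate used in this tutorial, not verified pricing.

```python
# Back-of-the-envelope estimate for matching every item against every persona.
# All inputs except the item count are placeholder assumptions.
def naive_matching_cost(items: int, personas: int, tokens_per_call: int,
                        price_per_million: float) -> tuple[int, float]:
    """Return (number of API calls, estimated cost in USD)."""
    calls = items * personas
    cost = calls * tokens_per_call / 1_000_000 * price_per_million
    return calls, cost

calls, cost = naive_matching_cost(
    items=697,               # 200 jobs + 497 trainings
    personas=100,            # hypothetical persona count
    tokens_per_call=1_000,   # hypothetical average prompt + response size
    price_per_million=0.20,  # rough small-model estimate, not verified
)
print(f"{calls:,} calls, roughly ${cost:.2f}")
```

Swap in the large-model price and the same workload gets a lot more painful, which is the whole point of planning before looping.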

In [None]:
# Let's see what we're working with
import os
from pathlib import Path

# Count files and get basic statistics
jobs_dir = Path('data/jobs')
trainings_dir = Path('data/trainings')

job_files = list(jobs_dir.glob('*.md')) if jobs_dir.exists() else []
training_files = list(trainings_dir.glob('*.md')) if trainings_dir.exists() else []

print("Dataset Overview:")
print(f"Jobs: {len(job_files)}")
print(f"Trainings: {len(training_files)}")
print(f"Total items: {len(job_files) + len(training_files)}")

# TODO: Add breakdowns by domain, seniority level, location
# (Author note: this would be useful for understanding what we're dealing with)
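
The breakdowns in that TODO need labels we don't have yet, but document length is something we can measure right away, and it feeds straight into token costs. A minimal sketch, assuming you've read the files into a list of strings (`length_stats` is a helper name invented here):

```python
from statistics import median

def length_stats(texts: list[str]) -> dict:
    """Word-count summary (count, min, median, max) for a list of documents."""
    counts = [len(text.split()) for text in texts]
    if not counts:
        return {"count": 0, "min": 0, "median": 0, "max": 0}
    return {
        "count": len(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
    }

# Usage in the notebook (assumes job_files from the cell above):
# job_texts = [p.read_text(encoding="utf-8") for p in job_files]
# print(length_stats(job_texts))
```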

In [None]:
# Helper function to peek at files
from IPython.display import Markdown, display

def display_markdown_file(path: "str | Path") -> None:
    """Display a markdown file in Jupyter - nothing fancy"""
    p = Path(path)
    if not p.exists():
        print(f"File not found: {p}")
        return
    content = p.read_text(encoding='utf-8', errors='ignore')
    display(Markdown(content))

### Let's look at a job posting

In [None]:
# Display a sample job
if job_files:
    display_markdown_file(job_files[0])
else:
    print("No job files found. Make sure you've downloaded the data!")

### And a training program

In [None]:
# Display a sample training
if training_files:
    display_markdown_file(training_files[0])
else:
    print("No training files found. Make sure you've downloaded the data!")

### What you'll notice

Both jobs and trainings have:
- **Overview/Description** 
- **Location** (this matters for matching)
- **Prerequisites** (skills, experience levels)
- **Outcomes** (for trainings)

But here's the kicker: they're not consistently formatted. Some use different headers, different structures, different language. Your LLM solution needs to handle this chaos gracefully.

This is why we can't just use regex or simple parsing - we need something smarter.
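
To make that concrete, here's a sketch of a naive header-based parser and how quickly it falls over. Both sample documents are invented for illustration; they are not taken from the dataset.

```python
import re

def split_sections(markdown_text: str) -> dict[str, str]:
    """Naive parser: map each '## Header' line to the text underneath it."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown_text.splitlines():
        match = re.match(r"^##\s+(.*)", line)
        if match:
            current = match.group(1).strip().lower()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections

# Same information, two layouts (both hypothetical)
doc_a = "# Solar Technician\n## Location\nSão Paulo\n## Requirements\nNone\n"
doc_b = "# Técnico Solar\n**Local:** São Paulo\n### Pré-requisitos\nNenhum\n"

print(list(split_sections(doc_a)))  # finds the two '##' sections
print(list(split_sections(doc_b)))  # finds nothing: different header level and labels
```

You could keep adding patterns for every variation, but each new format means another special case. An LLM reads both layouts the same way.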

---

## Your First Mistral API Call

Now for the actual AI part.

In [None]:
# Setup time
# !pip install python-dotenv mistralai

import os
import dotenv
import time
from mistralai import Mistral

# Load your API key from .env file
dotenv.load_dotenv()

# Check if we're good to go
if not os.getenv("MISTRAL_API_KEY"):
    print("❌ No MISTRAL_API_KEY found!")
    print("Create a .env file with your API key")
else:
    print("✅ API key found, we're ready to roll")

### Quick Token 101

**Tokens are how LLMs "see" text**. Think of them as word chunks:
- "Hello world!" ≈ 3 tokens
- "The" = 1 token, "ing" = 1 token
- Roughly 1 token ≈ 0.75 English words

**Why you care**:
- Tokens = money (API costs)
- Tokens = speed (more tokens = slower)
- Tokens = limits (models have max context windows)

**Golden rule**: Start with the smallest/cheapest model that can do the job. Scale up only if needed.

In [None]:
# Quick token estimator (rough but useful)
def estimate_tokens(text: str) -> int:
    """Ballpark token count - good enough for cost estimates"""
    words = len(text.split())
    return int(words / 0.75)

# Test it out
test_prompt = "Analyze this job posting and extract the key requirements."
print(f"Prompt: '{test_prompt}'")
print(f"Estimated tokens: {estimate_tokens(test_prompt)}")

# TODO: Use actual Mistral tokenizer for precise counts

### Model Comparison - The Money Talk

| Model | Size | Cost/1M tokens | Best for |
|-------|------|---------------|----------|
| mistral-small-latest | Small | ~$0.20 | Classification, simple extraction |
| mistral-large-latest | Large | ~$3.00 | Complex reasoning, detailed analysis |

*TODO: Get actual current pricing - these are rough estimates*

**Real talk**: For most filtering/classification tasks, the small model is plenty good and, at the rough prices above, about 15x cheaper. Only use the big guns when you really need them.
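
Here's a small sketch that turns the table into numbers. The prices are the rough estimates from the table above, not verified pricing, and using one flat rate for all tokens is a simplification (providers often charge input and output tokens at different rates).

```python
# Rough prices per 1M tokens, copied from the table above (not verified)
PRICE_PER_MILLION = {
    "mistral-small-latest": 0.20,
    "mistral-large-latest": 3.00,
}

def estimate_cost(total_tokens: int, model: str) -> float:
    """Estimated cost in USD for a given token count, using one flat rate per model."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION[model]

for model in PRICE_PER_MILLION:
    print(f"{model}: ${estimate_cost(500, model):.6f} for a 500-token call")
```

Once you have a real API response, you can pass its `total_tokens` value in directly.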

> Author note: Show concrete examples with actual token counts and costs side-by-side

In [None]:
# API wrapper with timing and token tracking
from typing import Optional

client = Mistral(api_key=os.getenv("MISTRAL_API_KEY"))

def call_mistral(prompt: str, model: str = "mistral-small-latest") -> Optional[dict]:
    """Call the Mistral API and track time and token usage; returns None if the call fails"""
    start_time = time.time()
    
    try:
        response = client.chat.complete(
            model=model,
            messages=[{"role": "user", "content": prompt}]
        )
        
        end_time = time.time()
        
        # Extract useful info
        result = {
            "content": response.choices[0].message.content,
            "model": model,
            "duration": end_time - start_time,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
            "total_tokens": response.usage.total_tokens
        }
        
        return result
        
    except Exception as e:
        print(f"❌ API call failed: {e}")
        return None

print("✅ Mistral client ready!")
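
Since `call_mistral` returns `None` when something goes wrong, a thin retry layer is easy to bolt on. This is a generic sketch, not part of the Mistral SDK: `with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary defaults.

```python
import time

def with_retries(fn, max_attempts: int = 3, base_delay: float = 1.0):
    """Call fn() until it returns something other than None, backing off exponentially."""
    for attempt in range(max_attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay between attempts
            time.sleep(base_delay * 2 ** attempt)
    return None

# Usage with the wrapper above:
# result = with_retries(lambda: call_mistral("What's a 'green job'?"))
```

Retrying blindly can also burn money on requests that will never succeed, so keep `max_attempts` small.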

### First API call - let's do this

In [None]:
# Simple test to make sure everything works
test_prompt = """What's a 'green job'? Keep it short and practical."""

print("🚀 First API call...")
result = call_mistral(test_prompt, "mistral-small-latest")

if result:
    print(f"\n📊 Stats:")
    print(f"Model: {result['model']}")
    print(f"Time: {result['duration']:.2f} seconds")
    print(f"Input tokens: {result['input_tokens']}")
    print(f"Output tokens: {result['output_tokens']}")
    print(f"Total tokens: {result['total_tokens']}")
    
    # TODO: Calculate actual cost
    # estimated_cost = (result['total_tokens'] / 1_000_000) * 0.20
    # print(f"Estimated cost: ${estimated_cost:.6f}")
    
    print(f"\n💬 Response:")
    print(result['content'])

### Model comparison experiment

Let's see the difference between small and large models on the same task:

In [None]:
# Complex analysis prompt for comparison
analysis_prompt = """Analyze this job description and extract:
1. Required skills (list)
2. Seniority level (basic/intermediate/advanced)
3. Location requirements
4. Whether it's related to sustainability/green jobs

Job Description:
# Renewable Energy Systems Engineer
## Overview
Join our mission to accelerate Brazil's transition to clean energy! We're seeking a systems engineer to design and optimize solar and wind energy installations across São Paulo region.

## Requirements
- Engineering degree (Electrical, Mechanical, or Environmental)
- 2-3 years experience in renewable energy projects
- Proficiency in MATLAB/Simulink and AutoCAD
- Portuguese fluency required
- Willingness to travel within São Paulo state

Format your response as structured JSON."""

print("🔬 Running model comparison experiment...\n")

# Test with small model
print("Testing mistral-small-latest:")
result_small = call_mistral(analysis_prompt, "mistral-small-latest")

if result_small:
    print(f"Duration: {result_small['duration']:.2f}s | Tokens: {result_small['total_tokens']}")
    print("Response:")
    # Preview: first 200 characters, with "..." appended only if truncated
    print((result_small['content'][:200] + "...") if len(result_small['content']) > 200 else result_small['content'])

print("\n" + "="*50 + "\n")

# Test with large model
print("Testing mistral-large-latest:")
result_large = call_mistral(analysis_prompt, "mistral-large-latest")

if result_large:
    print(f"Duration: {result_large['duration']:.2f}s | Tokens: {result_large['total_tokens']}")
    print("Response:")
    # Preview: first 200 characters, with "..." appended only if truncated
    print((result_large['content'][:200] + "...") if len(result_large['content']) > 200 else result_large['content'])

# TODO: Add cost comparison calculation
# TODO: Add quality assessment framework
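
The prompt asks for structured JSON, but models often wrap their answer in a Markdown code fence or add a sentence around it. Here's a minimal sketch for pulling the JSON out; `parse_json_response` is a helper invented for this tutorial, and it only handles the common "fenced block" case.

```python
import json
import re

FENCE = "`" * 3  # three backticks
FENCED_BLOCK = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def parse_json_response(text: str):
    """Try to parse JSON from a model reply; return None if nothing parses."""
    match = FENCED_BLOCK.search(text)
    # Use the fenced content if there is one, otherwise try the whole reply
    candidate = match.group(1) if match else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return None

# Hypothetical reply, shaped like what a model might send back
reply = FENCE + 'json\n{"seniority": "intermediate", "green": true}\n' + FENCE
print(parse_json_response(reply))
```

Returning `None` instead of raising keeps it consistent with how `call_mistral` reports failures.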

---

## Using LLMs for Data Filtering

### The problem
697 items in our dataset. We need to categorize them efficiently without manually reading everything.

### The solution
Use LLMs to do the boring classification work for us.

In [None]:
# Load a training example for filtering
sample_training_path = None
if training_files:
    sample_training_path = training_files[0]
    with open(sample_training_path, 'r', encoding='utf-8') as f:
        sample_training = f.read()
    
    print(f"📁 Loaded training: {sample_training_path.name}")
    print(f"Content length: {len(sample_training)} characters")
    print(f"Estimated tokens: {estimate_tokens(sample_training)}")
else:
    print("No training files available for analysis")
    sample_training = None

In [None]:
# Build a simple classifier
def classify_seniority(content: str, model: str = "mistral-small-latest") -> str:
    """Figure out if this is entry-level, mid-level, or senior stuff"""
    
    prompt = f"""Look at this job/training content and tell me the seniority level.

Options:
* Basic - Entry level, no experience needed
* Intermediate - Some experience (1-3 years)
* Advanced - Senior level (more than 3 years)

Just respond with one word: Basic, Intermediate, or Advanced.

Content:
{content}"""
    
    result = call_mistral(prompt, model)
    if result:
        return result['content'].strip()
    return "Unknown"

# Test it on our sample
if sample_training:
    print("🎯 Testing the classifier...")
    classification = classify_seniority(sample_training)
    print(f"Result: {classification}")
    
    print("\n📋 Here's what it analyzed:")
    display_markdown_file(sample_training_path)
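
Even when you ask for one word, you may get back "Intermediate." or "**Basic**" or a full sentence. A small normalizer keeps downstream code sane. This is a sketch under the assumption that the three labels from the prompt are the only valid answers; `normalize_seniority` is a name made up here.

```python
VALID_LEVELS = ("Basic", "Intermediate", "Advanced")

def normalize_seniority(raw: str) -> str:
    """Map a free-form model reply onto exactly one label, else 'Unknown'."""
    cleaned = raw.strip().lower()
    matches = [level for level in VALID_LEVELS if level.lower() in cleaned]
    # Accept the reply only if it names exactly one level
    return matches[0] if len(matches) == 1 else "Unknown"

print(normalize_seniority("**Intermediate**."))
print(normalize_seniority("Could be Basic or Advanced"))
```

Treating ambiguous replies as "Unknown" is a deliberate choice: it's better to flag them for a second look than to guess.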

### Batch Processing Strategy

For efficiency, we'd like to process multiple items at once when possible. The version below keeps it simple: it still makes one API call per file, and only runs on a small sample.

In [None]:
def batch_classify_trainings(training_files: list, sample_size: int = 3) -> dict:
    """Classify a small sample of trainings, one API call per file"""
    
    results = {}
    
    # Process only the first few files as a demo
    sample_files = training_files[:sample_size]
    
    for i, file_path in enumerate(sample_files):
        print(f"Processing {i+1}/{len(sample_files)}: {file_path.name}")
        
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            
            classification = classify_seniority(content)
            results[file_path.name] = {
                'seniority': classification,
                'tokens': estimate_tokens(content)
            }
            
        except Exception as e:
            print(f"Error processing {file_path.name}: {e}")
            results[file_path.name] = {'seniority': 'Error', 'tokens': 0}
    
    return results

# Run batch classification
if training_files:
    print("🔄 Running batch classification (sample of 3 trainings)...")
    batch_results = batch_classify_trainings(training_files)
    
    print("\n📊 Results:")
    for filename, data in batch_results.items():
        print(f"{filename}: {data['seniority']} ({data['tokens']} tokens)")
        
    # Sum the estimated content tokens (a ballpark, not the usage reported by the API)
    total_tokens = sum(data['tokens'] for data in batch_results.values())
    print(f"\n💰 Estimated tokens processed: {total_tokens}")
    # TODO: Add actual cost calculation
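
If you want to actually put several items into one request, one option is to number them in a single prompt and parse a numbered answer back. This is a sketch of that idea, not a tested recipe: the prompt wording, the `<number>: <level>` reply format, and both helper names are assumptions.

```python
import re

def build_batch_prompt(contents: list[str]) -> str:
    """Pack several documents into one classification prompt."""
    parts = [
        "Classify the seniority level of each numbered item below as "
        "Basic, Intermediate, or Advanced.",
        "Answer with one line per item in the form '<number>: <level>'.",
    ]
    for i, content in enumerate(contents, start=1):
        parts.append(f"\n--- Item {i} ---\n{content}")
    return "\n".join(parts)

def parse_batch_response(text: str, expected: int) -> list[str]:
    """Read '<number>: <level>' lines; items the model skipped stay 'Unknown'."""
    labels = ["Unknown"] * expected
    pattern = r"(\d+)\s*[:.)-]\s*(Basic|Intermediate|Advanced)"
    for match in re.finditer(pattern, text, re.IGNORECASE):
        index = int(match.group(1)) - 1
        if 0 <= index < expected:
            labels[index] = match.group(2).capitalize()
    return labels

# Usage with the wrapper above:
# reply = call_mistral(build_batch_prompt(texts))
# labels = parse_batch_response(reply["content"], len(texts)) if reply else []
```

The trade-off: fewer calls and less repeated instruction text, but longer prompts (watch the context window) and one bad response now affects several items at once.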

### Caching - because we're not stupid

Process once, reuse forever. Basic optimization.

In [None]:
import json

def save_classification_cache(results: dict, filename: str = "classification_cache.json"):
    """Save classification results to avoid re-processing"""
    with open(filename, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"✅ Saved classification cache to {filename}")

def load_classification_cache(filename: str = "classification_cache.json") -> dict:
    """Load cached classification results"""
    try:
        with open(filename, 'r') as f:
            results = json.load(f)
        print(f"✅ Loaded classification cache from {filename}")
        return results
    except FileNotFoundError:
        print(f"No cache file found: {filename}")
        return {}

# Save our batch results
if 'batch_results' in locals():
    save_classification_cache(batch_results)

# Demonstrate loading
loaded_cache = load_classification_cache()
print(f"Cache contains {len(loaded_cache)} items")
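
Saving and loading is only half of it; the payoff comes from skipping work that's already cached. A minimal sketch, assuming the cache is a dict keyed by filename like `batch_results` above (`classify_with_cache` is a hypothetical helper, and `classify` is any function that takes content and returns a label):

```python
def classify_with_cache(names_to_content: dict, cache: dict, classify) -> dict:
    """Classify only items missing from the cache; update and return the cache."""
    for name, content in names_to_content.items():
        if name in cache:
            continue  # already processed, no API call needed
        label = classify(content)
        # Don't cache failures, so they get retried on the next run
        if label not in ("Unknown", "Error"):
            cache[name] = {"seniority": label}
    return cache

# Usage in the notebook:
# cache = load_classification_cache()
# contents = {p.name: p.read_text(encoding="utf-8") for p in training_files[:10]}
# cache = classify_with_cache(contents, cache, classify_seniority)
# save_classification_cache(cache)
```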

---

## Exercises (aka homework)

### Exercise 1: Data analysis
Build some actual statistics about our dataset:

In [None]:
# Your mission: analyze the dataset properly
# What we want to know:
# - Jobs by domain (accounting, marketing, tourism, etc.)
# - Geographic distribution (São Paulo, Rio, etc.)
# - Average token counts per category
# - Most common skills mentioned

print("📝 Exercise 1: Data analysis")
print("Use LLMs to extract domains from titles/content")
print("Count location mentions")  
print("Calculate processing costs for different approaches")

# Your code goes here...
# Hint: Use the classify function pattern we just built

### Exercise 2: Cost optimization
Figure out the cheapest way to process everything:

In [None]:
# Cost comparison challenge
# Calculate costs for:
# 1. All 697 items with small model
# 2. All 697 items with large model
# 3. Hybrid: small for classification, large for complex analysis

print("📝 Exercise 2: Cost optimization")
print("Which approach gives best quality/cost ratio?")
print("What's the break-even point?")

# Your implementation here...

### Exercise 3: Green jobs detector
Build a classifier for sustainability-related jobs:

In [None]:
# Green jobs classifier
def is_green_job(content: str) -> bool:
    """Detect sustainability/climate-related jobs and trainings"""
    # Your implementation here
    # Look for keywords like: renewable energy, sustainability, climate, environment
    pass

print("📝 Exercise 3: Green jobs detector")
print("Test it on the dataset - how many green opportunities can you find?")
print("Bonus: what makes a job 'green'?")

# Implement the function and test it...

---

## What we learned

✅ **Data structure**: 697 items in messy formats  
✅ **API basics**: Tokens, models, costs  
✅ **Smart filtering**: LLMs > regex for unstructured data  
✅ **Cost optimization**: Start small, scale strategically  

### The real lessons
- Token counting matters when you're processing lots of data
- Small models are surprisingly good for classification tasks
- Caching is your friend
- Always track costs as you go

### Next up
Tutorial 3: Building your first submission and getting on the leaderboard ASAP.

---

## Notes for tutorial authors

**TODOs that need real implementation:**
- [ ] Get current Mistral pricing 
- [ ] Add actual token counting with Mistral tokenizer
- [ ] Test all code cells with real data
- [ ] Add Brazilian location context  
- [ ] Implement the exercises properly

**Stuff that works:**
- Basic API setup and calls
- Classification pattern
- Cost tracking framework
- Caching strategy

Stick with the conversational tone - this feels way more authentic than corporate training speak.