# Tutorial 2: Data Exploration and Your First API Call

Alright, let's get our hands dirty with some actual data and API calls. 

## What we're doing today

- Download the GDSC 8 dataset (jobs + trainings from Brazil)
- Make our first Mistral API call (and not go broke doing it)
- Understand why tokens matter (spoiler: they cost money)
- Use LLMs to filter data instead of writing regex hell

**Reality check**: This is about building AI agents that help people find green jobs in Brazil. Cool mission, but also we're in a competition, so let's be smart about costs and performance.

---

## Understanding the Challenge Data

### The Mission (kinda cool actually)
We're helping young people in Brazil find green jobs. UNICEF partnership, climate action, meaningful careers - the whole deal. 
And because it's 2025, we'll build AI agents that sift through job descriptions and training programs, match them to people's profiles, and do it efficiently and ethically.

**The Brazilian green jobs landscape we're working with:**
- **Major cities**: São Paulo (finance & tech hub), Rio de Janeiro (energy & environment), Brasília (policy & government), Salvador (renewable energy), Recife (innovation centers)
- **Key sectors**: Renewable energy (solar, wind, hydro), sustainable agriculture, environmental consulting, green construction, waste management
- **Companies leading the charge**: Petrobras (transitioning to renewables), Vale (sustainable mining), Suzano (sustainable forestry), plus hundreds of green startups

### Let's grab the data
Time to download some files from S3:

In [1]:
# Download the GDSC 8 dataset
!aws s3 cp s3://gdsc-25-data-bucket/ . --recursive

'aws' is not recognized as an internal or external command,
operable program or batch file.


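If you see `'aws' is not recognized` like in the output above, the AWS CLI isn't installed or isn't on your PATH; install it and rerun the cell. After it runs, you'll have a `data` directory with: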
- **`jobs/`** - 200 job postings
- **`trainings/`** - 497 training programs

### Quick math reality check
697 items × however many personas we need to match = potentially expensive if we're not careful with API calls.

This is where being smart about it pays off. Literally.
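
To put a number on it, here's a rough back-of-the-envelope sketch. The persona count and the per-call cost are made-up placeholders (we don't know the real figures yet), so read the result as an order of magnitude, not a quote.

```python
def estimate_matching_cost(n_items: int, n_personas: int, cost_per_call: float) -> tuple[int, float]:
    """Worst case: one API call for every (persona, item) pair."""
    n_calls = n_items * n_personas
    return n_calls, n_calls * cost_per_call

# Hypothetical inputs: 100 personas and $0.00003 per call
calls, cost = estimate_matching_cost(n_items=697, n_personas=100, cost_per_call=0.00003)
print(f"{calls:,} calls, about ${cost:.2f}")
```

The point isn't the exact dollar figure. It's that the number of calls grows with items × personas, so anything that cuts calls (pre-filtering, caching) pays off fast.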

In [1]:
# Let's see what we're working with
from pathlib import Path

# Count files and get basic statistics
jobs_dir = Path('../data/jobs')
trainings_dir = Path('../data/trainings')

job_files = list(jobs_dir.glob('*.md')) if jobs_dir.exists() else []
training_files = list(trainings_dir.glob('*.md')) if trainings_dir.exists() else []

print(f"Dataset Overview:")
print(f"Jobs: {len(job_files)}")
print(f"Trainings: {len(training_files)}")
print(f"Total items: {len(job_files) + len(training_files)}")

Dataset Overview:
Jobs: 200
Trainings: 467
Total items: 667


### Let's look at a job posting

In [2]:
# Helper function to peek at files
from IPython.display import Markdown, display

def display_markdown_file(path: str) -> None:
    """Display a markdown file in Jupyter - nothing fancy"""
    p = Path(path)
    if not p.exists():
        print(f"File not found: {p}")
        return
    content = p.read_text(encoding='utf-8', errors='ignore')
    display(Markdown(content))

In [3]:
# Display a sample job
display_markdown_file(job_files[0])

# Design Research Analyst

**Location:** Belo Horizonte
**Type:** Full-Time

**About the Role:**
We are seeking a **Design Research Analyst** to join our team in Design Research Studies and Development. This is an excellent entry-level opportunity for someone looking to start their career in design research and user experience analysis.

**Key Responsibilities:**
- Conduct user research studies to understand customer needs and behaviors
- Analyze research data and translate findings into actionable design insights
- Support the development of user personas and journey maps
- Collaborate with design teams to inform product development decisions
- Document research methodologies and present findings to stakeholders
- Assist in planning and executing usability testing sessions

**Qualifications:**
- Tecnólogo degree in a relevant field
- Strong analytical and critical thinking skills
- Interest in user experience and design research methodologies
- Excellent communication skills in Portuguese (BR)
- Ability to work collaboratively in a team environment

**Preferred Qualifications:**
- Familiarity with research tools and survey platforms
- Basic understanding of design thinking principles
- Experience with data analysis or statistics coursework

This role is based in Belo Horizonte and offers the opportunity to grow your expertise in design research while contributing to meaningful product development initiatives.

**How to Apply:**
Please submit your resume and cover letter detailing your interest in design research.

### And a training program

In [4]:
# Display a sample training
display_markdown_file(training_files[0])

**Why take this course?**

The **Intermediate Ship Operations Training** will help you:
✅ Master ship handling and operational procedures on intermediate level
✅ Apply best practices for transparency and compliance
✅ Strengthen your resume with a recognized credential

**Course Details:**
- **Duration:** 8 weeks
- **Format:** online
- **Language:** Portuguese (Brazil)
- **Certification:** Yes

**Prerequisites:**
- Basic knowledge of ship operations and maritime procedures

This comprehensive program focuses on advancing your maritime operational expertise through practical scenarios and industry standards. You'll develop the technical competencies needed to handle complex vessel operations while ensuring safety and regulatory compliance.

The training covers essential aspects of ship management, from navigation procedures to cargo handling protocols. Each module builds systematically on foundational concepts, preparing you for real-world challenges in maritime transport operations.

Upon completion, you'll receive official certification that validates your intermediate-level capabilities in maritime operations, making you a stronger candidate for advancement in the shipping industry.

**Don't miss the chance to stand out—register today!**

### What you'll notice

Both jobs and trainings have:
- **Overview/Description** 
- **Location** (this matters for matching)
- **Prerequisites** (skills, experience levels)
- **Outcomes** (for trainings)

But here's the kicker: they're not consistently formatted. Some use different headers, different structures, different language. 
Our solution needs to handle this chaos gracefully. 

This is why we can't just use regex or simple parsing - we need something smarter: GenAI!
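
Here's a tiny illustration of the problem, using snippets copied from the two samples above. The pattern is just one we made up for the job layout; it's a sketch of why a single hand-written rule breaks, not a serious parser.

```python
import re

# Snippets taken from the sample job and sample training shown above
job_snippet = "**Location:** Belo Horizonte\n**Type:** Full-Time"
training_snippet = "**Course Details:**\n- **Duration:** 8 weeks\n- **Format:** online"

# A pattern written with the job layout in mind
location_pattern = re.compile(r"\*\*Location:\*\*\s*(.+)")

for name, text in [("job", job_snippet), ("training", training_snippet)]:
    match = location_pattern.search(text)
    print(name, "->", match.group(1) if match else "no match")
```

The job matches, the training doesn't (it's online and has no `Location` field at all). Every new layout would need another rule.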

## Your First Mistral API Call

Time to get our hands dirty with the actual AI part.

First, create a **`.env`** file. Right-click in the project tree next to the **`data`** folder, select *New File*, and name it `.env`.
Paste in the Mistral API key you generated in the first tutorial (**`Tutorial_1_Account_setup.ipynb`**), exactly like this:

MISTRAL_API_KEY="your-api-key"

You need this in place before the next sections will work.

In [None]:
# Install the Strands library with its Mistral extra (quoted so the brackets survive any shell)
!pip install "strands-agents[mistral]" python-dotenv

In [5]:
# Setup time
import os
import dotenv

# Load your API key from .env file
dotenv.load_dotenv(".env")

# Check if we're good to go
if not os.getenv("MISTRAL_API_KEY"):
    print("❌ No MISTRAL_API_KEY found!")
    print("Create a .env file with your API key")
else:
    print("✅ API key found, we're ready to roll")

✅ API key found, we're ready to roll


Next, we add a helper function to actually connect to Mistral, using the [strands framework](https://strandsagents.com/latest/). 

**What's Strands?** It's basically a wrapper that makes Mistral (and other LLMs) actually useful for production. Handles retries, structured output, all that boring stuff. Check their [docs](https://strandsagents.com/latest/) if you're curious, but we'll show you what matters.

In [6]:
import time
from strands import Agent
from strands.models.mistral import MistralModel

def call_mistral(prompt: str, model: str = "mistral-small-latest") -> dict | None:
    """Call the Mistral API and return the reply plus timing and token usage (None on failure)"""
    mistral_model = MistralModel(
        api_key=os.environ["MISTRAL_API_KEY"],
        model_id=model,
        stream=False
    )
    agent = Agent(model=mistral_model, callback_handler=None)
    start_time = time.time()
    
    try:
        response = agent(prompt)
        end_time = time.time()
        
        # Extract useful info
        result = {
            "content": response.message['content'][0]['text'],
            "model": model,
            "duration": end_time - start_time,
            "input_tokens": response.metrics.accumulated_usage['inputTokens'],
            "output_tokens": response.metrics.accumulated_usage['outputTokens'],
            "total_tokens": response.metrics.accumulated_usage['totalTokens']
        }
        
        return result
        
    except Exception as e:
        print(f"❌ API call failed: {e}")
        return None

print("✅ Mistral client ready!")

✅ Mistral client ready!


In [7]:
# Simple test to make sure everything works
test_prompt = """What's a 'green job'? Keep it short and practical."""

print("🚀 First API call...")
result = call_mistral(test_prompt, "mistral-small-latest")

if result:
    print(f"\n📊 Stats:")
    print(f"Model: {result['model']}")
    print(f"Time: {result['duration']:.2f} seconds")
    print(f"Input tokens: {result['input_tokens']}")
    print(f"Output tokens: {result['output_tokens']}")
    print(f"Total tokens: {result['total_tokens']}")
    
    # Upper bound: prices every token at the $0.30 output rate (input tokens only cost $0.10 per 1M)
    estimated_cost = (result['total_tokens'] / 1_000_000) * 0.30
    print(f"Estimated cost: ${estimated_cost:.6f}")
    
    print(f"\n💬 Response:")
    print(result['content'])

🚀 First API call...

📊 Stats:
Model: mistral-small-latest
Time: 1.25 seconds
Input tokens: 17
Output tokens: 87
Total tokens: 104
Estimated cost: $0.000031

💬 Response:
A **green job** is a role that helps protect the environment or reduce pollution. Examples include:

- **Renewable energy** (solar/wind technician)
- **Sustainability** (energy auditor, recycling coordinator)
- **Conservation** (forest ranger, wildlife biologist)
- **Green construction** (LEED-certified builder)

These jobs focus on reducing carbon footprints and promoting eco-friendly practices.


### Understanding Tokens - Your Cost Unit

LLMs are usually priced per token, typically quoted as a dollar amount per 1 million tokens. But what exactly is a token?

**Tokens are how LLMs process text**. Think of them as the "billing units" for AI:
- "Olá mundo!" ≈ 4 tokens (Portuguese uses slightly more tokens than English)
- "Green jobs in São Paulo" ≈ 6 tokens  
- Roughly 1 token ≈ 0.75 English words (varies by language)
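
If you want a quick guess before spending anything, the 0.75-words-per-token rule of thumb can be sketched like this. It's a heuristic, not a tokenizer: the function name and default are ours, and the real counts are the ones the API reports back in the response metrics.

```python
import math

def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Very rough token estimate from the word count (heuristic, not a tokenizer)."""
    n_words = len(text.split())
    return math.ceil(n_words / words_per_token)

print(estimate_tokens("Green jobs in São Paulo"))
print(estimate_tokens("Olá mundo!"))
```

Expect it to be off by a token or two on short strings, and by more on Portuguese text. It's good enough for budgeting, not for billing.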

**Why tokens matter for our challenge:**
- **Cost control**: 697 items × 100 tokens each = 69,700 tokens to process
- **Speed**: More tokens = slower responses (matters when processing hundreds of items)  
- **Planning**: Models have token limits (128k for the three Mistral models compared below)

**Quick cost reality check (input tokens only):**
- Small model: 69,700 tokens ≈ $0.007 to classify all items
- Large model: Same task ≈ $0.14 (20x more expensive)
- For 697 items, choosing the right model matters!

**Pro tip**: Always start with the smallest model that can handle your task. You can always upgrade to larger models for complex reasoning later.

### Model Comparison - The Money Talk

| Model Name          | Size / Version | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Context Window |
|---------------------|----------------|----------------------------|-----------------------------|----------------|
| Mistral Large 24-11 | Large          | \$2.00                     | \$6.00                      | 128k tokens    |
| Mistral Medium 3    | Medium         | \$0.40                     | \$2.00                      | 128k tokens    |
| Mistral Small 3.1   | Small          | \$0.10                     | \$0.30                      | 128k tokens    |
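
The table translates directly into a small cost helper. This is a sketch: the `PRICES` dict and `estimate_cost` function are our own names, and the numbers are simply copied from the table, so update them if the pricing changes.

```python
# USD per 1M tokens, copied from the table above
PRICES = {
    "large": {"input": 2.00, "output": 6.00},
    "medium": {"input": 0.40, "output": 2.00},
    "small": {"input": 0.10, "output": 0.30},
}

def estimate_cost(size: str, input_tokens: int, output_tokens: int) -> float:
    """Price input and output tokens separately, since their rates differ."""
    price = PRICES[size]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Token counts from our first test call (17 in, 87 out)
print(f"small: ${estimate_cost('small', 17, 87):.6f}")
print(f"large: ${estimate_cost('large', 17, 87):.6f}")
```

Pricing the two directions separately matters because output tokens cost 3x to 5x more than input tokens in this table.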

**Real talk**: For most filtering/classification tasks, the small model is plenty good and 20x cheaper per token than the large one. Only use the big guns when you really need them.
We'll see an example in a minute. But first we need to talk about prompting.

## Prompt Engineering Essentials

Before we compare models, let's talk about **prompts** - your instructions to the AI. Think of prompts as the difference between asking a colleague "Can you help?" vs "Can you analyze this São Paulo job posting and extract the required skills in bullet format?"

### What makes a good prompt?

**❌ Vague prompt:**
```
"Analyze this job"
```

**✅ Specific prompt:**
```
"Analyze this Brazilian green job posting and extract:
1. Required skills (list format)
2. Experience level (entry/mid/senior)  
3. Location requirements
4. Sustainability focus areas

Format as structured JSON."
```

### Key principles for GDSC challenge:

1. **Be specific about the task** - "classify" vs "analyze deeply"
2. **Specify output format** - JSON, bullet points, yes/no answers
3. **Provide context** - mention it's Brazilian data, green jobs focus
4. **Set constraints** - "keep it under 50 words" for cost control

### Why this matters:
- **Small models** need very clear, specific instructions
- **Large models** can handle more ambiguous, complex requests
- **Good prompts** = consistent results across all 697 items
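
One way to keep prompts consistent across hundreds of items is to build them from a template instead of typing them by hand. A minimal sketch, where `build_extraction_prompt` and its parameters are our own invention:

```python
def build_extraction_prompt(posting: str, fields: list[str], max_words: int = 50) -> str:
    """Assemble a specific, constrained prompt from the principles above."""
    numbered = "\n".join(f"{i}. {field}" for i, field in enumerate(fields, start=1))
    return (
        "Analyze this Brazilian green job posting and extract:\n"  # task + context
        f"{numbered}\n\n"
        f"Respond with JSON only, using at most {max_words} words.\n\n"  # format + constraint
        f"Job posting:\n{posting}"
    )

prompt = build_extraction_prompt(
    posting="**Location:** Belo Horizonte\n**Type:** Full-Time",
    fields=["Required skills", "Experience level (entry/mid/senior)", "Location requirements"],
)
print(prompt)
```

Every item then gets exactly the same instructions, which makes differences in the answers easier to attribute to the data rather than to the prompt.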

Let's see this in action with model comparisons...

In [8]:
# Complex analysis prompt for comparison - let's use a Brazilian green energy company!
analysis_prompt = """Analyze this job description and extract:
1. Required skills (list)
2. Seniority level (basic/intermediate/advanced)
3. Location requirements
4. Whether it's related to sustainability/green jobs

Job Description:
# Renewable Energy Systems Engineer - Petrobras Renewables Division
## Overview
Join Petrobras's mission to accelerate Brazil's transition to clean energy! We're seeking a systems engineer to design and optimize solar and wind energy installations across São Paulo and Minas Gerais regions.

## Requirements
- Engineering degree (Electrical, Mechanical, or Environmental)
- 2-3 years experience in renewable energy projects
- Proficiency in MATLAB/Simulink and AutoCAD
- Portuguese fluency required
- Willingness to travel within Southeast Brazil

Format your response as structured JSON."""

print("🔬 Running model comparison experiment...\n")

# Test with small model
print("Testing mistral-small-latest:")
result_small = call_mistral(analysis_prompt, "mistral-small-latest")

if result_small:
    small_input_cost = (result_small['input_tokens'] / 1_000_000) * 0.10
    small_output_cost = (result_small['output_tokens'] / 1_000_000) * 0.30
    small_total_cost = small_input_cost + small_output_cost
    
    print(f"Duration: {result_small['duration']:.2f}s | Tokens: {result_small['total_tokens']} | Cost: ${small_total_cost:.6f}")
    print("Full Response:")
    print(result_small['content'])

print("\n" + "="*50 + "\n")

# Test with large model
print("Testing mistral-large-latest:")
result_large = call_mistral(analysis_prompt, "mistral-large-latest")

if result_large:
    large_input_cost = (result_large['input_tokens'] / 1_000_000) * 2.00
    large_output_cost = (result_large['output_tokens'] / 1_000_000) * 6.00
    large_total_cost = large_input_cost + large_output_cost
    
    print(f"Duration: {result_large['duration']:.2f}s | Tokens: {result_large['total_tokens']} | Cost: ${large_total_cost:.6f}")
    print("Full Response:")
    print(result_large['content'])

# Cost comparison summary
if result_small and result_large:
    cost_multiplier = large_total_cost / small_total_cost
    print(f"\n💰 Cost Analysis:")
    print(f"Small model cost: ${small_total_cost:.6f}")
    print(f"Large model cost: ${large_total_cost:.6f}")
    print(f"Large model is {cost_multiplier:.1f}x more expensive")
    print(f"\nFor 697 items:")
    print(f"Small model total: ${small_total_cost * 697:.2f}")
    print(f"Large model total: ${large_total_cost * 697:.2f}")
    print(f"Difference: ${(large_total_cost - small_total_cost) * 697:.2f}")

🔬 Running model comparison experiment...

Testing mistral-small-latest:
Duration: 1.49s | Tokens: 298 | Cost: $0.000056
Full Response:
```json
{
  "required_skills": [
    "Engineering degree (Electrical, Mechanical, or Environmental)",
    "2-3 years experience in renewable energy projects",
    "Proficiency in MATLAB/Simulink",
    "Proficiency in AutoCAD",
    "Portuguese fluency"
  ],
  "seniority_level": "intermediate",
  "location_requirements": {
    "primary_location": "São Paulo and Minas Gerais regions, Brazil",
    "travel_requirements": "Willingness to travel within Southeast Brazil"
  },
  "sustainability_green_job": true
}
```


Testing mistral-large-latest:
Duration: 3.06s | Tokens: 503 | Cost: $0.002354
Full Response:
```json
{
  "job_analysis": {
    "required_skills": [
      {
        "skill": "Engineering degree (Electrical, Mechanical, or Environmental)",
        "type": "education"
      },
      {
        "skill": "2-3 years experience in renewable energy project
```

### When to use large vs small models?

Looking at both responses, they seem pretty similar, right? Both extracted the key information correctly. So why would you ever pay 20x more per token for the large model (about 42x in this run, since it also wrote a longer answer)?

**Small model wins when:**
- Simple extraction tasks (skills, location, yes/no questions)
- Consistent input format
- High-volume processing (like our 697 jobs)
- Budget constraints

**Large model wins when:**
- Complex reasoning required ("Would this person from Recife be successful in this São Paulo role given the cultural differences?")
- Ambiguous or poorly formatted input
- Nuanced analysis (understanding implicit requirements)
- Multi-step logical chains
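
One practical detail from the run above: both models wrapped their JSON in a Markdown `json` code fence, so you can't pass the reply straight to `json.loads`. Here's a minimal sketch of a parser that tolerates the fence. `parse_json_reply` is our own helper, not part of Strands or the Mistral SDK.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way so the example stays readable

def parse_json_reply(reply: str) -> dict:
    """Strip an optional Markdown code fence, then parse the JSON inside."""
    match = re.search(r"`{3}(?:json)?\s*(.*?)`{3}", reply, re.DOTALL)
    payload = match.group(1) if match else reply
    return json.loads(payload)

# A shortened version of the small model's reply
reply = f'{FENCE}json\n{{"seniority_level": "intermediate", "sustainability_green_job": true}}\n{FENCE}'
print(parse_json_reply(reply)["seniority_level"])
```

A reply that gets cut off mid-JSON will still raise `json.JSONDecodeError`, so wrap the call in a `try`/`except` when you process the full dataset.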

**Exercise for you:**
Try these prompts and compare small vs large model responses:

1. **Complex cultural reasoning:**
```
"This job is in São Paulo but requires frequent travel to Amazon region. 
The candidate is from Rio and has never been to Northern Brazil. 
Analyze the cultural and practical challenges they might face."
```

2. **Implicit skill detection:**
```
"This job mentions 'coordinating with stakeholders across different time zones' 
and 'managing distributed teams.' What soft skills are implicitly required?"
```

3. **Brazilian regulatory knowledge:**
```
"This environmental consulting role mentions 'compliance with CONAMA regulations.' 
What does this tell us about the job requirements?"
```

Share your results in the Teams channel - you'll probably find some interesting differences!

## Using LLMs for Data Filtering

### The problem
We have 697 items in our dataset. How can we categorize them efficiently without manually reading everything?

### Why traditional approaches fail
**Regex and keyword matching** would be a nightmare here. Consider these challenges:
- Job titles vary: "Engenheiro de Energia Solar" vs "Solar Energy Engineer" vs "Renewable Systems Specialist"  
- Skills are described differently: "2 years experience" vs "minimum 24 months" vs "experiência de 2 anos"
- Location formats differ: "São Paulo, SP" vs "Greater São Paulo Area" vs "Estado de São Paulo"
- Requirements buried in paragraphs vs structured lists

**Rule-based classification** would need hundreds of if-then statements and constant maintenance.

### The LLM solution
LLMs understand **semantic meaning**, not just keywords:
- They recognize "energia renovável" and "renewable energy" as the same concept
- They infer experience levels from contextual clues
- They handle inconsistent formatting gracefully  
- They can extract implicit information (e.g., senior-level roles often mention "leadership")

### The trade-offs
- **Accuracy**: Much higher than regex, handles edge cases
- **Cost**: API calls add up - need to optimize model choice
- **Speed**: Slower than regex, but parallel processing helps
- **Consistency**: Good with proper prompt design
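
To show what "parallel processing helps" can look like, here's a sketch using Python's standard `ThreadPoolExecutor`. The classifier is a fake stand-in that just sleeps, so the example runs offline; with the real API you'd also have to respect your rate limits, and the worker count here is an arbitrary choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_classify(text: str) -> str:
    """Stand-in for an API call: sleeps briefly instead of hitting the network."""
    time.sleep(0.05)
    return "Intermediate" if "intermediate" in text.lower() else "Basic"

docs = ["Intermediate Ship Operations Training", "Entry-level research analyst"] * 4

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    labels = list(pool.map(fake_classify, docs))  # results keep the input order
elapsed = time.perf_counter() - start

print(labels[:2], f"{elapsed:.2f}s for {len(docs)} items")
```

Threads work well here because the time is spent waiting on the network, not computing.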

Let's see this in action...

In [9]:
# Load a training example for filtering
sample_training_path = None
if training_files:
    sample_training_path = training_files[0]
    with open(sample_training_path, 'r', encoding='utf-8') as f:
        sample_training = f.read()
    
    print(f"📁 Loaded training: {sample_training_path.name}")
    print(f"Content length: {len(sample_training)} characters")
else:
    print("No training files available for analysis")
    sample_training = None

📁 Loaded training: tr_marc_vessel_operations_02.md
Content length: 1258 characters


In [10]:
# Build a simple classifier
def classify_seniority(content: str, model: str = "mistral-large-latest") -> str:
    """Figure out if this is entry-level, mid-level, or senior stuff"""
    
    prompt = f"""Look at this job/training content and tell me the seniority level. Analyze whole file before answering.
Options:
* Basic - Entry level, no experience needed
* Intermediate - Some experience (1-3 years)
* Advanced - Senior level (3+ years)

Just respond with one word: Basic, Intermediate, or Advanced.

Content:
{content}"""
    
    result = call_mistral(prompt, model)
    if result:
        return result['content'].strip()
    return "Unknown"

### Why this design works:

- ✅ Constrained outputs: Only 3 allowed answers, so replies are easy to validate
- ✅ Clear definitions: Explicit criteria for each level
- ✅ Simple instruction: 'Just respond with one word' keeps replies short and cheap (models usually comply, but not always)
- ✅ Full context: 'Analyze whole file' nudges the model to read everything before answering
- ✅ Large model default: Gives an accuracy baseline; try the small model next, since it is often good enough for classification
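
Even with a one-word instruction, a model will sometimes reply with a full sentence or add Markdown bold. A small guard that maps whatever comes back onto the three allowed labels could look like this. It's a sketch: `normalize_label` is our own helper, and it naively takes the first label mentioned.

```python
import re

LABELS = ("Basic", "Intermediate", "Advanced")
LABEL_PATTERN = re.compile(r"\b(" + "|".join(LABELS) + r")\b", re.IGNORECASE)

def normalize_label(raw: str) -> str:
    """Map a free-text model reply onto one of the allowed labels, else 'Unknown'."""
    match = LABEL_PATTERN.search(raw)
    return match.group(1).capitalize() if match else "Unknown"

print(normalize_label("Intermediate"))
print(normalize_label("The program should be classified as **Intermediate**."))
print(normalize_label("I am not sure"))
```

Taking the first match can misfire on replies like "not Basic, but Advanced", so log the raw reply alongside the normalized label while you're testing.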

In [11]:
# Test it on our sample
if sample_training:
    print("🎯 Testing the classifier...")
    classification = classify_seniority(sample_training)
    print(f"Result: {classification}")
    print('---------------------------------------------------------------')
    print("\n📋 Here's what it analyzed:")
    display_markdown_file(sample_training_path)

🎯 Testing the classifier...
Result: Intermediate
---------------------------------------------------------------

📋 Here's what it analyzed:


**Why take this course?**

The **Intermediate Ship Operations Training** will help you:
✅ Master ship handling and operational procedures on intermediate level
✅ Apply best practices for transparency and compliance
✅ Strengthen your resume with a recognized credential

**Course Details:**
- **Duration:** 8 weeks
- **Format:** online
- **Language:** Portuguese (Brazil)
- **Certification:** Yes

**Prerequisites:**
- Basic knowledge of ship operations and maritime procedures

This comprehensive program focuses on advancing your maritime operational expertise through practical scenarios and industry standards. You'll develop the technical competencies needed to handle complex vessel operations while ensuring safety and regulatory compliance.

The training covers essential aspects of ship management, from navigation procedures to cargo handling protocols. Each module builds systematically on foundational concepts, preparing you for real-world challenges in maritime transport operations.

Upon completion, you'll receive official certification that validates your intermediate-level capabilities in maritime operations, making you a stronger candidate for advancement in the shipping industry.

**Don't miss the chance to stand out—register today!**

### Pro tips for production:

- Start with large model for accuracy baseline
- Test small model on sample - might be sufficient
- Use temperature=0 for consistent classifications
- Consider few-shot examples for edge cases
- Always validate on known examples before scaling
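
For that last tip, a validation harness can be tiny. In this sketch the labeled examples are hypothetical and the classifier is a dummy so it runs offline; swap in `classify_seniority` and items you have checked by hand.

```python
from typing import Callable

def accuracy(classify: Callable[[str], str], labeled: list[tuple[str, str]]) -> float:
    """Share of hand-labeled examples the classifier gets right."""
    if not labeled:
        return 0.0
    hits = sum(1 for text, expected in labeled if classify(text) == expected)
    return hits / len(labeled)

# Hypothetical hand-labeled examples; replace with items you have verified yourself
labeled = [
    ("Entry-level opportunity, no experience needed", "Basic"),
    ("Requires 2 years of experience", "Intermediate"),
]

def always_basic(text: str) -> str:
    """Dummy classifier so the sketch runs without an API key."""
    return "Basic"

print(f"Accuracy: {accuracy(always_basic, labeled):.0%}")
```

Run it once per model on the same examples and you have a like-for-like comparison before committing to the full dataset.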

### Batch Processing Strategy

When you're processing hundreds of items, you need to think about **scale optimization**:

**Why batch processing matters:**
- **API rate limits**: Most APIs limit requests per minute/hour
- **Progress tracking**: Users want to see something happening  
- **Error handling**: Individual failures shouldn't kill the whole job
- **Memory management**: Don't load all 697 files into memory at once
- **Cost monitoring**: Track spending as you go, not at the end
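
For rate limits and flaky calls, a common pattern is retrying with exponential backoff. Here's a generic sketch: `with_retries` is our own helper (not a Strands feature), and the `sleep` parameter exists only so the demo doesn't actually wait.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(fn: Callable[[], T], attempts: int = 3, base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying with exponential backoff (1s, 2s, 4s, ...) on any exception."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")

# Demo: a function that fails twice, then succeeds
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_retries(flaky, sleep=lambda seconds: None), "after", calls["n"], "calls")
```

Note that `call_mistral` above catches exceptions and returns `None`, so to use this with it you'd either retry on a `None` result or let the exception propagate.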

**Batch size considerations:**
- **Too small** (1-2 items): Lots of overhead, slow overall progress
- **Too large** (100+ items): Memory issues, harder to recover from errors
- **Sweet spot** (10-25 items): Balance between efficiency and manageability

**For our GDSC dataset:**
- 697 total items to process
- Average ~500 characters per item 
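- At 10 items per batch = 70 batches total (69 full batches plus one of 7)
- Estimated time: 70 batches × 2 seconds = 140 seconds (~2.3 minutes), assuming each batch takes about 2 seconds because its calls run in parallel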
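
The batch arithmetic is easy to check in code. A minimal chunking sketch, with `chunked` as our own helper name:

```python
import math
from typing import Iterator, Sequence

def chunked(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

items = list(range(697))  # stand-ins for our 697 files
batches = list(chunked(items, 10))

print(f"{len(batches)} batches, last one has {len(batches[-1])} items")
print(f"Check: ceil(697 / 10) = {math.ceil(697 / 10)}")
```

Because it slices lazily, you can read and process one batch of files at a time instead of loading everything up front.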

Let's implement a smart batch processor:

In [12]:
def batch_classify_trainings(training_files: list, batch_size: int = 5) -> tuple[dict, float, int]:
    """Classify trainings one by one, tracking tokens and cost (batch_size is not used yet)"""
    
    results = {}
    total_cost = 0.0
    total_tokens = 0
    
    # Process first few files as example (in production, remove [:3])
    sample_files = training_files[:3]  # Just process 3 for demo
    
    print(f"🔄 Processing {len(sample_files)} training files...")
    print(f"Progress tracking and cost accumulation:")
    print()
    
    for i, file_path in enumerate(sample_files):
        # Progress indicator
        progress = ((i + 1) / len(sample_files)) * 100
        print(f"Processing {i+1}/{len(sample_files)} ({progress:.0f}%): {file_path.name}")
        
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            
            # Classify and get actual token usage
            result = call_mistral(f"""Classify this training program seniority level:
Options: Basic, Intermediate, Advanced
Content: {content}""", "mistral-small-latest")
            
            if result:
                # Token counts come from the small model, but we price them at large-model
                # rates so the scaling estimate below is a worst case
                input_cost = (result['input_tokens'] / 1_000_000) * 2.00  # Large model input rate
                output_cost = (result['output_tokens'] / 1_000_000) * 6.00  # Large model output rate
                item_cost = input_cost + output_cost
                
                total_cost += item_cost
                total_tokens += result['total_tokens']
                
                results[file_path.name] = {
                    'seniority': result['content'].strip(),
                    'tokens': result['total_tokens'],
                    'cost': item_cost,
                    'duration': result['duration']
                }
                
                print(f"  → {result['content'].strip()} | {result['total_tokens']} tokens | ${item_cost:.6f}")
            else:
                print(f"  → Error processing {file_path.name}")
                
        except Exception as e:
            print(f"  → Error processing {file_path.name}: {e}")
            results[file_path.name] = {'seniority': 'Error', 'tokens': 0, 'cost': 0, 'duration': 0}
    
    return results, total_cost, total_tokens

# Run batch classification with cost tracking
if training_files:
    batch_results, batch_cost, batch_tokens = batch_classify_trainings(training_files)
    
    print(f"\n📊 Batch Processing Results:")
    print(f"Items processed: {len(batch_results)}")
    print(f"Total tokens used: {batch_tokens:,}")
    print(f"Total cost: ${batch_cost:.4f}")
    print(f"Average cost per item: ${batch_cost / len(batch_results):.6f}")
    
    print(f"\n💰 Scaling to full dataset (697 items):")
    avg_cost_per_item = batch_cost / len(batch_results)
    full_dataset_cost = avg_cost_per_item * 697
    print(f"Estimated cost with large model: ${full_dataset_cost:.2f}")
    
    # Cost comparison with small model (20x cheaper per token, per the pricing table)
    small_model_cost = full_dataset_cost / 20
    print(f"Estimated cost with small model: ${small_model_cost:.2f}")
    print(f"Potential savings: ${full_dataset_cost - small_model_cost:.2f}")
    
    print(f"\n⚡ Performance insights:")
    avg_duration = sum(r.get('duration', 0) for r in batch_results.values()) / len(batch_results)
    print(f"Average API call duration: {avg_duration:.2f} seconds")
    print(f"Full dataset processing time: ~{(avg_duration * 697) / 60:.1f} minutes")
    print(f"Recommendation: Use parallel processing for production!")

🔄 Processing 3 training files...
Progress tracking and cost accumulation:

Processing 1/3 (33%): tr_marc_vessel_operations_02.md
  → Based on the provided content, the training program should be classified as **Intermediate**.

### Key Indicators:
1. **Title:** "Intermediate Ship Operations Training" explicitly states the level.
2. **Prerequisites:** Requires "basic knowledge of ship operations and maritime procedures," implying prior foundational learning.
3. **Content Focus:** Advances beyond basics (e.g., "master ship handling," "complex vessel operations," "real-world challenges") but does not suggest expert-level specialization (e.g., advanced simulations, leadership, or niche maritime technologies).
4. **Certification:** Validates "intermediate-level capabilities," not entry-level or advanced mastery.

The program bridges foundational knowledge and advanced expertise, aligning with the **Intermediate** category. | 396 tokens | $0.001412
Processing 2/3 (67%): tr_law_case_analysis_

---

## Exercises (aka homework)

### Exercise 1: Data analysis
Build some actual statistics about our dataset:

In [None]:
# Your mission: analyze the dataset properly
# What we want to know about Brazilian green jobs:
# - Geographic distribution (São Paulo, Rio, Brasília, Salvador, Recife, etc.)
# - Average token counts per category
# - Most common skills mentioned
# - Portuguese vs English content ratio
# - Green job concentration by region

print("📝 Exercise 1: Data analysis")
print("Use LLMs to extract domains from job titles and training content")
print("Count location mentions across major Brazilian cities")  
print("Calculate processing costs for different classification approaches")
print("Bonus: Identify uniquely Brazilian requirements (e.g., Portuguese fluency, CONAMA compliance)")

# Your code goes here...
# Hint: Use the classify function pattern we just built
# Consider analyzing:
# - Job titles: "Engenheiro Ambiental" vs "Environmental Engineer"
# - Location patterns: "São Paulo, SP" vs "Greater São Paulo" vs "Interior de São Paulo"
# - Brazilian-specific skills: Portuguese fluency, local regulations, regional travel

### Exercise 2: Cost optimization
Figure out the cheapest way to process everything:

In [None]:
# Cost comparison challenge
# Calculate costs for:
# 1. All 697 items with small model
# 2. All 697 items with large model
# 3. Hybrid: small for classification, large for complex analysis

print("📝 Exercise 2: Cost optimization")
print("Which approach gives best quality/cost ratio?")
print("What's the break-even point?")

# Your implementation here...

### Exercise 3: Green jobs detector
Build a classifier for sustainability-related jobs:

In [None]:
# Green jobs classifier for Brazilian context
def is_green_job(content: str) -> bool:
    """Detect sustainability/climate-related jobs and trainings in Brazilian context"""
    # Your implementation here
    # Look for keywords like: 
    # - English: renewable energy, sustainability, climate, environment, solar, wind
    # - Portuguese: energia renovável, sustentabilidade, meio ambiente, solar, eólica
    # - Brazilian specifics: CONAMA, Amazônia, Mata Atlântica, etanol, biodiesel
    # - Companies: Petrobras renewables, Vale sustainability, Suzano forestry
    pass

print("📝 Exercise 3: Green jobs detector")
print("Build a classifier that recognizes sustainability jobs in both Portuguese and English")
print("Test it on the dataset - how many green opportunities can you find?")
print("Bonus questions:")
print("• What makes a job 'green' in the Brazilian context?")
print("• How do green job requirements differ between São Paulo (urban) and Amazon region?")
print("• Which green sectors are growing fastest in Brazil?")

# Implement the function and test it...
# Consider Brazilian green job examples:
# - Solar panel installer in Northeast Brazil
# - Environmental consultant for mining companies  
# - Sustainable agriculture specialist in Cerrado region
# - Carbon credit analyst for forestry companies
# - Renewable energy engineer for hydroelectric plants

## What we learned

✅ **Data structure**: 697 items in messy formats  
✅ **API basics**: Tokens, models, costs  
✅ **Smart filtering**: LLMs > regex for unstructured data  
✅ **Cost optimization**: Start small, scale strategically  

### The real lessons
- Token counting matters when you're processing lots of data
- Small models are surprisingly good for classification tasks
- Always track costs as you go

### Next up
Tutorial 3: Building your first submission and getting on the leaderboard ASAP.