# Tutorial 2: Data Exploration and Your First API Call

Alright, let's get our hands dirty with some actual data and API calls. 

## What we're doing today

- Download the GDSC 8 dataset (jobs + trainings from Brazil)
- Make our first Mistral API call (and not go broke doing it)
- Understand why tokens matter (spoiler: they cost money)
- Use LLMs to filter data instead of writing regex hell

**Reality check**: This is about building AI agents that help people find green jobs in Brazil. Cool mission, but also we're in a competition, so let's be smart about costs and performance.

---

## Understanding the Challenge Data

### The Mission (kinda cool actually)
We're helping young people in Brazil find green jobs. UNICEF partnership, climate action, meaningful careers - the whole deal. 
And because it's 2025 we will build AI agents that can sift through job descriptions and training programs, match them to people's profiles, and do it efficiently and ethically!

### Let's grab the data
Time to download some files from S3:

In [None]:
# Download the GDSC 8 dataset
!aws s3 cp s3://gdsc-25-data-bucket/ . --recursive

download: s3://gdsc25test/data/jobs/job_acc_010.md to data/jobs/job_acc_010.md
download: s3://gdsc25test/data/jobs/job_acc_002.md to data/jobs/job_acc_002.md
download: s3://gdsc25test/data/jobs/job_acc_009.md to data/jobs/job_acc_009.md
download: s3://gdsc25test/data/jobs/job_acc_008.md to data/jobs/job_acc_008.md
download: s3://gdsc25test/data/jobs/job_acc_004.md to data/jobs/job_acc_004.md
download: s3://gdsc25test/data/jobs/job_acc_007.md to data/jobs/job_acc_007.md
download: s3://gdsc25test/data/jobs/job_acc_005.md to data/jobs/job_acc_005.md
download: s3://gdsc25test/data/jobs/job_acc_006.md to data/jobs/job_acc_006.md
download: s3://gdsc25test/data/jobs/job_acc_003.md to data/jobs/job_acc_003.md
download: s3://gdsc25test/data/jobs/job_acc_001.md to data/jobs/job_acc_001.md
download: s3://gdsc25test/data/jobs/job_adm_001.md to data/jobs/job_adm_001.md
download: s3://gdsc25test/data/jobs/job_adm_006.md to data/jobs/job_adm_006.md
download: s3://gdsc25test/data/jobs/job_adm_004.md t

After this runs, you'll have a `data` directory with:
- **`jobs/`** - 200 job postings
- **`trainings/`** - 497 training programs

### Quick math reality check
697 items × however many personas we need to match = potentially expensive if we're not careful with API calls.

This is where being smart about it pays off. Literally.

In [18]:
# Let's see what we're working with
import os
from pathlib import Path

# Count files and get basic statistics
jobs_dir = Path('../data/jobs')
trainings_dir = Path('../data/trainings')

job_files = list(jobs_dir.glob('*.md')) if jobs_dir.exists() else []
training_files = list(trainings_dir.glob('*.md')) if trainings_dir.exists() else []

print(f"Dataset Overview:")
print(f"Jobs: {len(job_files)}")
print(f"Trainings: {len(training_files)}")
print(f"Total items: {len(job_files) + len(training_files)}")

Dataset Overview:
Jobs: 200
Trainings: 467
Total items: 667


### Let's look at a job posting

In [19]:
# Helper function to peek at files
from IPython.display import Markdown, display

def display_markdown_file(path: str) -> None:
    """Display a markdown file in Jupyter - nothing fancy"""
    p = Path(path)
    if not p.exists():
        print(f"File not found: {p}")
        return
    content = p.read_text(encoding='utf-8', errors='ignore')
    display(Markdown(content))

In [20]:
# Display a sample job
display_markdown_file(job_files[0])

# Detailed Job Description: Coordinator – Reception & Guest Experience

**Position Summary:**
The **Coordinator – Reception & Guest Experience** will be responsible for managing front desk operations and ensuring exceptional guest service in our hotel reception environment. This role focuses on creating positive first impressions and maintaining smooth daily operations for all guests.

**Responsibilities & Duties:**
- **Guest Service Operations:** Handle arrival and departure processes with attention to detail and professionalism
  - Manage guest check-in and check-out procedures efficiently
  - Address guest inquiries and resolve concerns promptly
- **Front Desk Management:** Coordinate multiple reception tasks while maintaining service quality
  - Balance various administrative duties throughout each shift
  - Maintain accurate guest records and reservation systems

**Required Skills and Experience:**
- **Technical Skills:** Basic proficiency in guest relations, check-in/check-out procedures, and handling multiple tasks simultaneously
- **Education & Experience:** Bachelor's degree required; entry-level position suitable for recent graduates
- **Language Requirements:** Fluency in Portuguese (Brazilian)

**Location:**
This position is based in Salvador and requires on-site presence at our hotel reception desk.

**To Apply:**
Submit your resume and cover letter for consideration. This role offers an excellent opportunity to begin your career in hospitality while developing essential guest service skills in a professional hotel environment.

### And a training program

In [21]:
# Display a sample training
display_markdown_file(training_files[0])

**Master the Basics of Electronics and Electricity!**

Join our **Basic Electrical Wiring Course** and learn how to safely install, connect, and troubleshoot electrical systems from the ground up.

✔ **Duration:** 12 weeks  
✔ **Format:** online  
✔ **Language:** pt-BR  
✔ **Certification:** Included

**Who is it for?** Anyone looking to enter the electrical field or gain foundational wiring skills—no prior experience needed.

**Prerequisites:** None

You'll master fundamental wiring techniques, understand electrical safety protocols, and gain hands-on knowledge of residential and commercial electrical installations. By the end of this course, you'll confidently handle basic electrical projects and understand how electrical systems work.

This comprehensive program covers everything from reading electrical diagrams to proper wire connections, circuit protection, and troubleshooting common electrical issues. Perfect for aspiring electricians, maintenance professionals, or anyone wanting to understand electrical systems better.

**Secure your spot and invest in your future!**

### What you'll notice

Both jobs and trainings have:
- **Overview/Description** 
- **Location** (this matters for matching)
- **Prerequisites** (skills, experience levels)
- **Outcomes** (for trainings)

But here's the kicker: they're not consistently formatted. Some use different headers, different structures, different language. 
Our solution needs to handle this chaos gracefully. 

This is why we can't just use regex or simple parsing - we need something smarter: GenAI!
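
Before reaching for the LLM, it's worth seeing that inconsistency in numbers. Here's a minimal sketch that counts section labels across documents. `collect_section_labels` is a hypothetical helper (not part of the challenge tooling), and it assumes labels show up either as Markdown headings or as bold `**Label:**` markers, like in the two samples above. The three documents are made up so the cell runs without the dataset.

```python
import re
from collections import Counter

# Matches "## Heading" lines and "**Label:**" markers (both styles appear in the samples)
HEADING_RE = re.compile(r"^#{1,6}\s+(.+?)\s*$", re.MULTILINE)
BOLD_LABEL_RE = re.compile(r"\*\*([^*\n]+?):\*\*")

def collect_section_labels(documents: list[str]) -> Counter:
    """Count how often each heading or bold label appears across documents."""
    counts: Counter = Counter()
    for text in documents:
        labels = HEADING_RE.findall(text) + BOLD_LABEL_RE.findall(text)
        counts.update(label.strip().lower() for label in labels)
    return counts

# Tiny made-up documents standing in for real files
docs = [
    "# Job A\n**Location:**\nSalvador\n**Required Skills and Experience:**\n...",
    "## Overview\ntext\n## Requirements\n- Portuguese fluency",
    "✔ **Duration:** 12 weeks\n**Prerequisites:** None",
]
label_counts = collect_section_labels(docs)
for label, count in label_counts.most_common():
    print(f"{count:>3}  {label}")
```

On the real data you could pass `[p.read_text(encoding="utf-8") for p in job_files + training_files]` instead of `docs`. A long tail of labels that each appear only a handful of times is the sign that fixed parsing rules won't hold up.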

## Your First Mistral API Call

Time to get our hands dirty with the actual AI part.

First, create a **`.env`** file. Right-click in the file browser next to the **`data`** folder and select *New File*. Name the file `.env`.
Paste in the Mistral API key you generated in the first tutorial, **`Tutorial_1_Account_setup.ipynb`**, exactly as shown below:

MISTRAL_API_KEY="your-api-key"

You need this in place before you can continue with the next sections.

In [None]:
# Install the strands library with Mistral support, plus python-dotenv for loading the .env file
!pip install "strands-agents[mistral]" python-dotenv

In [22]:
# Setup time
import os
import dotenv

# Load your API key from .env file
dotenv.load_dotenv(".env")

# Check if we're good to go
if not os.getenv("MISTRAL_API_KEY"):
    print("❌ No MISTRAL_API_KEY found!")
    print("Create a .env file with your API key")
else:
    print("✅ API key found, we're ready to roll")

✅ API key found, we're ready to roll


Next, we add a helper function to actually connect to Mistral, using the [strands framework](https://strandsagents.com/latest/).

In [8]:
import time
from typing import Optional

from strands import Agent
from strands.models.mistral import MistralModel

def call_mistral(prompt: str, model: str = "mistral-small-latest") -> Optional[dict]:
    """Call Mistral API and track what it costs us"""
    mistral_model = MistralModel(
        api_key=os.environ["MISTRAL_API_KEY"],
        model_id=model,
        stream=False
    )
    agent = Agent(model=mistral_model, callback_handler=None)
    start_time = time.time()
    
    try:
        response = agent(prompt)
        end_time = time.time()
        
        # Extract useful info
        result = {
            "content": response.message['content'][0]['text'],
            "model": model,
            "duration": end_time - start_time,
            "input_tokens": response.metrics.accumulated_usage['inputTokens'],
            "output_tokens": response.metrics.accumulated_usage['outputTokens'],
            "total_tokens": response.metrics.accumulated_usage['totalTokens']
        }
        
        return result
        
    except Exception as e:
        print(f"❌ API call failed: {e}")
        return None

print("✅ Mistral client ready!")

✅ Mistral client ready!


In [23]:
# Simple test to make sure everything works
test_prompt = """What's a 'green job'? Keep it short and practical."""

print("🚀 First API call...")
result = call_mistral(test_prompt, "mistral-small-latest")

if result:
    print(f"\n📊 Stats:")
    print(f"Model: {result['model']}")
    print(f"Time: {result['duration']:.2f} seconds")
    print(f"Input tokens: {result['input_tokens']}")
    print(f"Output tokens: {result['output_tokens']}")
    print(f"Total tokens: {result['total_tokens']}")
    
    # Rough upper bound: prices every token at the small model's output rate ($0.30 per 1M tokens)
    estimated_cost = (result['total_tokens'] / 1_000_000) * 0.30
    print(f"Estimated cost: ${estimated_cost:.6f}")
    
    print(f"\n💬 Response:")
    print(result['content'])

🚀 First API call...

📊 Stats:
Model: mistral-small-latest
Time: 0.88 seconds
Input tokens: 17
Output tokens: 87
Total tokens: 104
Estimated cost: $0.000031

💬 Response:
A **green job** is a role that helps protect the environment or reduce pollution. Examples include:

- **Renewable energy** (solar/wind technician)
- **Sustainability** (energy auditor, recycling coordinator)
- **Conservation** (park ranger, wildlife biologist)
- **Green construction** (LEED-certified builder)

These jobs focus on reducing carbon footprints and promoting eco-friendly practices.


### Understanding Tokens - Your Cost Unit

The prices in the model comparison table below are all "per 1 million tokens" - but what exactly is a token?

**Tokens are how LLMs process text**. Think of them as the "billing units" for AI:
- "Olá mundo!" ≈ 4 tokens (Portuguese uses slightly more tokens than English)
- "Green jobs in São Paulo" ≈ 6 tokens  
- Roughly 1 token ≈ 0.75 English words (varies by language)

**Why tokens matter for our challenge:**
- **Cost control**: 697 items × 100 tokens each = 69,700 tokens to process
- **Speed**: More tokens = slower responses (matters when processing hundreds of items)  
- **Planning**: Models have token limits (128k for the three Mistral models compared below)

**Quick cost reality check:**
- Small model: 69,700 input tokens ≈ \$0.007 to classify all items
- Large model: Same task ≈ \$0.14 (20x more expensive)
- For 697 items, choosing the right model matters!

**Pro tip**: Always start with the smallest model that can handle your task. You can always upgrade to larger models for complex reasoning later.

### Model Comparison - The Money Talk

| Model Name           | Size / Version     | Input Cost (per 1M tokens)  | Output Cost (per 1M tokens)  | Context Window |
|----------------------|--------------------|-----------------------------|--------------------------|----------------|
| Mistral Large 24-11  | Large              | \$2.00                       | \$6.00                        | 128k tokens      |
| Mistral Medium 3     | Medium             | \$0.40                       | \$2.00                        | 128k tokens      |
| Mistral Small 3.1    | Small              | \$0.10                       | \$0.30                        | 128k tokens      |
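
The table turns into a small cost estimator pretty directly. This is a minimal sketch, assuming the prices in the table are still current (check the Mistral pricing page before trusting the numbers). `PRICES_PER_MILLION` and `estimate_cost` are illustrative names, not part of any library.

```python
# USD per 1M tokens, copied from the table above (assumed current; verify before use)
PRICES_PER_MILLION = {
    "large": {"input": 2.00, "output": 6.00},
    "medium": {"input": 0.40, "output": 2.00},
    "small": {"input": 0.10, "output": 0.30},
}

def estimate_cost(input_tokens: int, output_tokens: int, size: str = "small") -> float:
    """Estimated cost in USD, pricing input and output tokens separately."""
    price = PRICES_PER_MILLION[size]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

# Input-only cost of sending 697 items of ~100 tokens each to every model size
n_items, tokens_per_item = 697, 100
for size in ("small", "medium", "large"):
    cost = estimate_cost(n_items * tokens_per_item, 0, size)
    print(f"{size:<6} {cost:.4f} USD")
```

Pricing input and output separately matters: for the first API call (17 input, 87 output tokens) this gives a slightly lower figure than the estimate printed earlier, which charged every token at the output rate.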

**Real talk**: For most filtering/classification tasks, the small model is plenty good and 20x cheaper than the large one. Only use the big guns when you really need them.
We'll see an example in a minute. But first we need to talk about prompting.

## Prompt Engineering Essentials

Before we compare models, let's talk about **prompts** - your instructions to the AI. Think of prompts as the difference between asking a colleague "Can you help?" vs "Can you analyze this São Paulo job posting and extract the required skills in bullet format?"

### What makes a good prompt?

**❌ Vague prompt:**
```
"Analyze this job"
```

**✅ Specific prompt:**
```
"Analyze this Brazilian green job posting and extract:
1. Required skills (list format)
2. Experience level (entry/mid/senior)  
3. Location requirements
4. Sustainability focus areas

Format as structured JSON."
```

### Key principles for GDSC challenge:

1. **Be specific about the task** - "classify" vs "analyze deeply"
2. **Specify output format** - JSON, bullet points, yes/no answers
3. **Provide context** - mention it's Brazilian data, green jobs focus
4. **Set constraints** - "keep it under 50 words" for cost control
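
One way to make those four principles hard to forget is to bake them into a tiny prompt builder. This is just a sketch: `build_prompt` and its parameter names are made up for illustration, and the example job text is invented.

```python
def build_prompt(task: str, context: str, output_format: str,
                 constraints: str, content: str) -> str:
    """Assemble a prompt from the four principles plus the text to analyze."""
    return (
        f"{task}\n\n"
        f"Context: {context}\n"
        f"Output format: {output_format}\n"
        f"Constraints: {constraints}\n\n"
        f"Content:\n{content}"
    )

prompt = build_prompt(
    task="Classify the seniority level of this job posting.",
    context="Brazilian job market data, with a focus on green jobs.",
    output_format="One word: Basic, Intermediate, or Advanced.",
    constraints="No explanation, no punctuation.",
    content="Entry-level solar panel installer in Recife, no experience needed.",
)
print(prompt)
```

Keeping the instructions in one function also means every item gets the exact same wording, which helps consistency.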

### Why this matters:
- **Small models** need very clear, specific instructions
- **Large models** can handle more ambiguous, complex requests
- **Good prompts** = consistent results across your 697 items

Let's see this in action with model comparisons...

In [24]:
# Complex analysis prompt for comparison
analysis_prompt = """Analyze this job description and extract:
1. Required skills (list)
2. Seniority level (basic/intermediate/advanced)
3. Location requirements
4. Whether it's related to sustainability/green jobs

Job Description:
# Renewable Energy Systems Engineer
## Overview
Join our mission to accelerate Brazil's transition to clean energy! We're seeking a systems engineer to design and optimize solar and wind energy installations across São Paulo region.

## Requirements
- Engineering degree (Electrical, Mechanical, or Environmental)
- 2-3 years experience in renewable energy projects
- Proficiency in MATLAB/Simulink and AutoCAD
- Portuguese fluency required
- Willingness to travel within São Paulo state

Format your response as structured JSON."""

print("🔬 Running model comparison experiment...\n")

# Test with small model
print("Testing mistral-small-latest:")
result_small = call_mistral(analysis_prompt, "mistral-small-latest")

if result_small:
    print(f"Duration: {result_small['duration']:.2f}s | Tokens: {result_small['total_tokens']}")
    print("Response:")
    print(result_small['content'][:200] + "..." if len(result_small['content']) > 200 else result_small['content'])

print("\n" + "="*50 + "\n")

# Test with large model
print("Testing mistral-large-latest:")
result_large = call_mistral(analysis_prompt, "mistral-large-latest")

if result_large:
    print(f"Duration: {result_large['duration']:.2f}s | Tokens: {result_large['total_tokens']}")
    print("Response:")
    print(result_large['content'][:200] + "..." if len(result_large['content']) > 200 else result_large['content'])


🔬 Running model comparison experiment...

Testing mistral-small-latest:
Duration: 1.08s | Tokens: 284
Response:
```json
{
  "required_skills": [
    "Engineering degree (Electrical, Mechanical, or Environmental)",
    "2-3 years experience in renewable energy projects",
    "Proficiency in MATLAB/Simulink",
   ...


Testing mistral-large-latest:
Duration: 2.67s | Tokens: 453
Response:
```json
{
  "job_analysis": {
    "required_skills": [
      {
        "type": "education",
        "details": ["Engineering degree (Electrical, Mechanical, or Environmental)"]
      },
      {
      ...


Looks very similar, right? So why would you ever need to use a larger model? 

**Exercise:**
- Try to find an example where the large model produces obviously better results than the small model. Share your results in the Teams channel.
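
One practical detail from the outputs above: both models wrapped their JSON in a Markdown code fence, so calling `json.loads` on the raw reply would fail. Here's a minimal sketch for unwrapping it. `parse_json_response` is a hypothetical helper, and it assumes the reply contains at most one fenced block with well-formed JSON inside.

```python
import json
import re
from typing import Any, Optional

# A Markdown fence is three backticks; built this way so the pattern holds no literal fence
FENCE = "`" * 3
FENCED_JSON_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)

def parse_json_response(text: str) -> Optional[Any]:
    """Return the parsed JSON from a model reply, or None if it cannot be parsed."""
    match = FENCED_JSON_RE.search(text)
    payload = match.group(1) if match else text.strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return None

# Made-up reply in the same shape as the model outputs above
reply = FENCE + 'json\n{"seniority": "intermediate", "green": true}\n' + FENCE
parsed = parse_json_response(reply)
print(parsed)
```

Pass the full `result['content']` to it, not the 200-character preview, since truncated JSON won't parse.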

---

## Using LLMs for Data Filtering

### The problem
697 items in our dataset. We need to categorize them efficiently without manually reading everything.

### The solution
Use LLMs to do the boring classification work for us.
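
The cells in this section call an `estimate_tokens` helper that isn't defined anywhere in this notebook. Here's a minimal stand-in, assuming roughly 4 characters per token. That ratio is a common rule of thumb, not Mistral's actual tokenizer, so treat the counts as ballpark figures. They won't necessarily match the outputs printed in this section.

```python
import math

# Assumption: ~4 characters per token on average; Portuguese text may need more tokens
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Rough token estimate from character count (not a real tokenizer)."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)

print(estimate_tokens("Green jobs in São Paulo"))
```

For billing-grade numbers, use the `input_tokens` and `output_tokens` that `call_mistral` reports after a real call.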

In [11]:
# Load a training example for filtering
sample_training_path = None
if training_files:
    sample_training_path = training_files[0]
    with open(sample_training_path, 'r', encoding='utf-8') as f:
        sample_training = f.read()
    
    print(f"📁 Loaded training: {sample_training_path.name}")
    print(f"Content length: {len(sample_training)} characters")
    print(f"Estimated tokens: {estimate_tokens(sample_training)}")
else:
    print("No training files available for analysis")
    sample_training = None

📁 Loaded training: tr_adm_document_control_02.md
Content length: 565 characters
Estimated tokens: 100


In [14]:
# Build a simple classifier
def classify_seniority(content: str, model: str = "mistral-large-latest") -> str:
    """Figure out if this is entry-level, mid-level, or senior stuff"""
    
    prompt = f"""Look at this job/training content and tell me the seniority level. Analyze whole file before answering.
Options:
* Basic - Entry level, no experience needed
* Intermediate - Some experience (1-3 years)
* Advanced - Senior level (3+ years)

Just respond with one word: Basic, Intermediate, or Advanced.

Content:
{content}"""
    
    result = call_mistral(prompt, model)
    if result:
        return result['content'].strip()
    return "Unknown"

# Test it on our sample
if sample_training:
    print("🎯 Testing the classifier...")
    classification = classify_seniority(sample_training)
    print(f"Result: {classification}")
    print('---------------------------------------------------------------')
    print("\n📋 Here's what it analyzed:")
    display_markdown_file(sample_training_path)

🎯 Testing the classifier...
Result: Intermediate
---------------------------------------------------------------

📋 Here's what it analyzed:


**Master the Basics of Administrative Management in Banking and Insurance!**

Join **Document Control - Intermediário** and learn how to interpret and prepare financial reports with confidence and clarity.

✔ **Duration:** 12 weeks  
✔ **Format:** online  
✔ **Language:** pt-BR  
✔ **Certification:** Included

**Who is it for?** Beginners and early-career professionals in Administrative Management in Banking and Insurance—no prior experience needed.

**Prerequisites:**
- Document Control - Básico level required

**Secure your spot and invest in your future!**
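
The classifier returns whatever text the model sends back, and models don't always stick to "one word". A small normalization step keeps stray formatting out of your results. This is a sketch: `normalize_seniority` is a hypothetical helper, and the sample replies are invented.

```python
VALID_LEVELS = ("Basic", "Intermediate", "Advanced")

def normalize_seniority(raw: str) -> str:
    """Map a free-text model reply onto one of the allowed labels, else 'Unknown'."""
    cleaned = raw.strip().strip("*.`\"' ").lower()
    for level in VALID_LEVELS:
        if cleaned == level.lower():
            return level
    # Fall back to a label only if exactly one appears somewhere in the reply
    found = [level for level in VALID_LEVELS if level.lower() in cleaned]
    return found[0] if len(found) == 1 else "Unknown"

for reply in ["Intermediate", "**Basic**.", "I'd say advanced.", "Basic or Intermediate"]:
    print(f"{reply!r} -> {normalize_seniority(reply)}")
```

Ambiguous replies end up as `Unknown` on purpose, so you can count them and re-run those items instead of silently trusting a guess.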

### Batch Processing Strategy

To process many files, we loop over them and collect the results. This demo makes one API call per file and only takes the first three:

In [15]:
def batch_classify_trainings(training_files: list, batch_size: int = 5) -> dict:
    """Classify a small sample of trainings, one API call per file (batch_size is not used yet)"""
    
    results = {}
    
    # Process first few files as example
    sample_files = training_files[:3]  # Just process 3 for demo
    
    for i, file_path in enumerate(sample_files):
        print(f"Processing {i+1}/{len(sample_files)}: {file_path.name}")
        
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                content = f.read()
            
            classification = classify_seniority(content)
            results[file_path.name] = {
                'seniority': classification,
                'tokens': estimate_tokens(content)
            }
        
            
        except Exception as e:
            print(f"Error processing {file_path.name}: {e}")
            results[file_path.name] = {'seniority': 'Error', 'tokens': 0}
    
    return results

# Run batch classification
if training_files:
    print("🔄 Running batch classification (sample of 3 trainings)...")
    batch_results = batch_classify_trainings(training_files)
    
    print("\n📊 Results:")
    for filename, data in batch_results.items():
        print(f"{filename}: {data['seniority']} ({data['tokens']} tokens)")
        
    # Calculate total tokens used
    total_tokens = sum(data['tokens'] for data in batch_results.values())
    print(f"\n💰 Total tokens processed: {total_tokens}")
    # TODO: Add actual cost calculation

🔄 Running batch classification (sample of 3 trainings)...
Processing 1/3: tr_adm_document_control_02.md
Processing 2/3: tr_acc_budgeting_and_forecasting_02.md
Processing 3/3: tr_hot_problem_resolution_01.md

📊 Results:
tr_adm_document_control_02.md: Intermediate (100 tokens)
tr_acc_budgeting_and_forecasting_02.md: Intermediate (102 tokens)
tr_hot_problem_resolution_01.md: Basic (98 tokens)

💰 Total tokens processed: 300
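
Note that the loop above still makes one API call per file. Real batching means putting several items into a single prompt and asking for one label per item. Below is a minimal sketch of the prompt-building and reply-parsing halves. `chunked`, `build_batch_prompt` and `parse_batch_reply` are hypothetical helpers, the numbered-line reply format is an assumption about how the model will answer, and the API call itself is left out.

```python
import re

def chunked(items: list, size: int) -> list[list]:
    """Split items into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_batch_prompt(contents: list[str]) -> str:
    """Put several documents into one prompt, each tagged with a number."""
    header = (
        "Classify the seniority level of each item below.\n"
        "Answer with one line per item in the form '<number>: <level>', "
        "where level is Basic, Intermediate, or Advanced.\n"
    )
    body = "\n".join(f"--- Item {i} ---\n{text}" for i, text in enumerate(contents, 1))
    return f"{header}\n{body}"

def parse_batch_reply(reply: str, n_items: int) -> list[str]:
    """Read '<number>: <level>' lines; items missing from the reply stay 'Unknown'."""
    labels = ["Unknown"] * n_items
    for number, level in re.findall(r"^\s*(\d+)\s*:\s*(\w+)", reply, re.MULTILINE):
        index = int(number) - 1
        if 0 <= index < n_items:
            labels[index] = level.capitalize()
    return labels

# Placeholder strings stand in for file contents; the reply is made up
batches = chunked(["doc a", "doc b", "doc c", "doc d", "doc e", "doc f", "doc g"], size=3)
print([len(batch) for batch in batches])
print(build_batch_prompt(batches[0]))
print(parse_batch_reply("1: Basic\n2: intermediate\n3: Advanced", n_items=3))
```

Fewer calls means the instructions are sent once per batch instead of once per file. The trade-off is that replies get harder to parse and a model can mix up items, so spot-check a few batches against single-item results.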


### Caching - because we're not stupid

Process once, reuse forever. Basic optimization.

In [16]:
import json

def save_classification_cache(results: dict, filename: str = "classification_cache.json"):
    """Save classification results to avoid re-processing"""
    with open(filename, 'w') as f:
        json.dump(results, f, indent=2)
    print(f"✅ Saved classification cache to {filename}")

def load_classification_cache(filename: str = "classification_cache.json") -> dict:
    """Load cached classification results"""
    try:
        with open(filename, 'r') as f:
            results = json.load(f)
        print(f"✅ Loaded classification cache from {filename}")
        return results
    except FileNotFoundError:
        print(f"No cache file found: {filename}")
        return {}

# Save our batch results
if 'batch_results' in locals():
    save_classification_cache(batch_results)

# Demonstrate loading
loaded_cache = load_classification_cache()
print(f"Cache contains {len(loaded_cache)} items")

✅ Saved classification cache to classification_cache.json
✅ Loaded classification cache from classification_cache.json
Cache contains 3 items
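
Saving results only pays off if you check the cache before calling the API. Here's a minimal sketch of that lookup. `classify_with_cache` is a hypothetical helper, and a fake classifier stands in for the real one so the cell runs without an API key.

```python
from typing import Callable

def classify_with_cache(items: dict[str, str], cache: dict[str, dict],
                        classify: Callable[[str], str]) -> dict[str, dict]:
    """Classify only the items whose name is not in the cache yet."""
    for name, content in items.items():
        if name in cache:
            continue  # already processed: no API call, no cost
        cache[name] = {"seniority": classify(content)}
    return cache

# Stand-in classifier that records how often it gets called
calls = []
def fake_classify(content: str) -> str:
    calls.append(content)
    return "Basic"

cache = {"a.md": {"seniority": "Intermediate"}}
items = {"a.md": "already done", "b.md": "new file"}
cache = classify_with_cache(items, cache, fake_classify)
print(cache)
print(f"API calls made: {len(calls)}")
```

In the notebook you would pass `classify_seniority` instead of `fake_classify`, start from `load_classification_cache()`, and call `save_classification_cache(cache)` when you're done.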


---

## Exercises (aka homework)

### Exercise 1: Data analysis
Build some actual statistics about our dataset:

In [None]:
# Your mission: analyze the dataset properly
# What we want to know:
# - Geographic distribution (São Paulo, Rio, etc.)
# - Average token counts per category
# - Most common skills mentioned

print("📝 Exercise 1: Data analysis")
print("Use LLMs to extract domains from titles/content")
print("Count location mentions")  
print("Calculate processing costs for different approaches")

# Your code goes here...
# Hint: Use the classify function pattern we just built

### Exercise 2: Cost optimization
Figure out the cheapest way to process everything:

In [None]:
# Cost comparison challenge
# Calculate costs for:
# 1. All 697 items with small model
# 2. All 697 items with large model
# 3. Hybrid: small for classification, large for complex analysis

print("📝 Exercise 2: Cost optimization")
print("Which approach gives best quality/cost ratio?")
print("What's the break-even point?")

# Your implementation here...

### Exercise 3: Green jobs detector
Build a classifier for sustainability-related jobs:

In [None]:
# Green jobs classifier
def is_green_job(content: str) -> bool:
    """Detect sustainability/climate-related jobs and trainings"""
    # Your implementation here
    # Look for keywords like: renewable energy, sustainability, climate, environment
    pass

print("📝 Exercise 3: Green jobs detector")
print("Test it on the dataset - how many green opportunities can you find?")
print("Bonus: what makes a job 'green'?")

# Implement the function and test it...

---

## What we learned

✅ **Data structure**: 697 items in messy formats  
✅ **API basics**: Tokens, models, costs  
✅ **Smart filtering**: LLMs > regex for unstructured data  
✅ **Cost optimization**: Start small, scale strategically  

### The real lessons
- Token counting matters when you're processing lots of data
- Small models are surprisingly good for classification tasks
- Caching is your friend
- Always track costs as you go

### Next up
Tutorial 3: Building your first submission and getting on the leaderboard ASAP.
