# Exploring data
---

## Challenge Overview

During this challenge, you will be working with text data representing **job** and **training** opportunities.

The goal of this year's challenge is to develop a chatbot that assists users in navigating their career paths by analyzing available job and training offers. Through interactive conversations, the chatbot will gather relevant information from users and recommend the most suitable next steps in their professional journey.

This prototype aims to simulate a solution that could eventually evolve into a powerful global tool—helping individuals around the world who are facing challenges in finding their next role.

### Data Download
To download the required data, run the following command in a code cell below:

```bash
!aws s3 cp s3://gdsc25test/ . --recursive
```

In [None]:
!aws s3 cp s3://gdsc25test/ . --recursive


After executing the command, your project structure will include a new directory named **`data`**.

Inside the `data` directory, you will find two subdirectories:

- **`trainings`**
- **`jobs`**

Each subdirectory contains Markdown files that provide structured descriptions of training sessions and job definitions.
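To confirm that the download produced the expected layout, the files can be listed per subdirectory. The snippet below is a minimal sketch; the helper name `list_markdown_files` is hypothetical, and it assumes the files use the `.md` extension as in the examples shown in this notebook.

```python
from pathlib import Path


def list_markdown_files(data_dir: str = "data") -> dict[str, list[Path]]:
    """Return the Markdown files found in each subdirectory of data_dir."""
    root = Path(data_dir)
    files: dict[str, list[Path]] = {}
    if not root.is_dir():
        return files  # nothing has been downloaded yet
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        files[sub.name] = sorted(sub.glob("*.md"))
    return files


# Print one summary line per subdirectory (no output if data/ is missing)
for name, paths in list_markdown_files().items():
    example = paths[0].name if paths else "n/a"
    print(f"{name}: {len(paths)} files, e.g. {example}")
```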

---

### Preview
To preview a file, use the **`display_markdown_file`** function:

In [5]:
from IPython.display import Markdown, display
from pathlib import Path

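def display_markdown_file(path: str) -> None:
    """Render a Markdown file in the notebook output."""
    p = Path(path)
    if not p.exists():
        print(f"File not found: {p}")
        return
    # Skip undecodable bytes instead of failing on a stray character
    content = p.read_text(encoding='utf-8', errors='ignore')
    display(Markdown(content))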

## Job Example

### Handling Diverse Job Offer Formats

Job offers provided in this challenge may come in various formats. Your task is to design a solution that can **flexibly extract key information** regardless of structural differences between documents.

Each job offer contains essential details such as:

- **Position**
- **Description**
- **Location**
- **Prerequisites** (e.g., skills, experience)

Keep in mind that while all offers include this information, it may be presented in different ways across documents. Your solution should be robust enough to handle these variations and consistently retrieve the necessary data for accurate matching.
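As a rule-based baseline, one layout can be handled with a regular expression. The sketch below assumes an offer marks its sections with bold labels such as `**Location:**` on a line of their own, which is only one of the layouts that may occur; the function name `split_bold_sections` is hypothetical. Offers that use other layouts will return an empty or partial result, so a fallback (for example, LLM-based extraction) would still be needed.

```python
import re

# A line made up only of a bold label ending with a colon, e.g. "**Location:**"
HEADING_RE = re.compile(r"^\*\*(?P<label>[^*\n]+?):\*\*[ \t]*$", re.MULTILINE)


def split_bold_sections(markdown_text: str) -> dict[str, str]:
    """Map each bold "**Label:**" heading to the text that follows it."""
    matches = list(HEADING_RE.finditer(markdown_text))
    sections: dict[str, str] = {}
    for i, match in enumerate(matches):
        # A section ends where the next heading starts, or at the end of the text
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown_text)
        sections[match.group("label").strip()] = markdown_text[match.end():end].strip()
    return sections


sample = """# Job Description: Accounting Intern

**Position Summary:**
You'll handle day-to-day financial record keeping.

**Location:**
This position is based in **Brasília**.
"""
sections = split_bold_sections(sample)
print(sections["Location"])
```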

Run the cell **below** to take a look at one job example.

In [9]:
display_markdown_file('data/jobs/job_acc_001.md')

# Job Description: Accounting Intern – Bookkeeping & Admin

**Position Summary:**
As an **Accounting Intern – Bookkeeping & Admin** on our **Accounting and Management** team, you'll handle day-to-day financial record keeping and administrative tasks that keep our operations running smoothly.

**Your Responsibilities:**
Your main tasks will include maintaining accurate financial records and transaction entries, managing tax-related documentation and compliance requirements, and supporting general administrative functions across the accounting department. You'll work closely with senior accounting staff and other departments that need financial data and reporting.

**What We're Looking For:**
You should have solid experience with **tax regulations and compliance processes at an intermediate level**, along with strong attention to detail and organizational skills. We expect a background of **graduation-level education and around 2 years of relevant experience**. You'll need to be fluent in **Portuguese (Brazilian)** and **English** to handle our documentation and communications effectively.

**Location:**
This position is based in **Brasília** and requires in-person work to access our financial systems and collaborate with the team.

**How to Apply:**
If you think you're a good fit, please send your application with your resume and a brief cover letter explaining your interest in this accounting internship role.

---

## Training Example

### Training Data Characteristics

Training data shares a similar structure with job offers and contains all the necessary information for effective matching.

A key aspect to remember is that **each training is designed to develop a single skill**. This constraint is crucial when building your matching solution, as it ensures that recommendations are precise and aligned with the user's skill development needs.

Your solution should be capable of extracting relevant details from training descriptions and associating them with the appropriate skill, even when the format or phrasing varies across documents.

Run the cell **below** to take a look at one training example.

In [19]:
display_markdown_file('data/trainings/tr_acc_accounting_software_proficiency_01.md')

**Why take this course?**

The **Accounting Software Proficiency - Básico** will help you:
✅ Build core skills in Accounting Software Proficiency (Current level: Básico)
✅ Apply best practices for transparency and compliance  
✅ Strengthen your resume with a recognized credential

**Course Details:**
- **Domain:** Accounting and Management
- **Duration:** 8 weeks
- **Format:** online
- **Language:** pt-BR
- **Certification:** Yes

**Prerequisites:**
None

**Don't miss the chance to stand out—register today!**

---

### Data Contents

Inside the provided files, you will find structured information including:

- **Overview**
- **Location**
- **Prerequisites** (e.g., required skills, experience)
- **Outcomes**

The dataset includes job and training opportunities from various fields and industries. These documents are presented in **multiple formats**, which means your data retrieval solution must be **flexible and adaptable** to handle structural and linguistic variations.

Accurately extracting this information is essential for building a system that can effectively match available opportunities with individual user needs.
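The training example shown earlier lists its course details as `- **Key:** value` bullets. For files that follow that layout, the fields can be read with a small parser. This is a sketch under that assumption only; `parse_detail_bullets` is a hypothetical helper, and trainings written in a different layout will yield an empty dictionary.

```python
import re

# Bullets such as "- **Duration:** 8 weeks"
DETAIL_RE = re.compile(
    r"^[ \t]*[-*][ \t]+\*\*(?P<key>[^*:\n]+):\*\*[ \t]*(?P<value>.+?)[ \t]*$",
    re.MULTILINE,
)


def parse_detail_bullets(markdown_text: str) -> dict[str, str]:
    """Collect "- **Key:** value" bullets into a dictionary."""
    return {
        m.group("key").strip(): m.group("value")
        for m in DETAIL_RE.finditer(markdown_text)
    }


sample_training = """**Course Details:**
- **Domain:** Accounting and Management
- **Duration:** 8 weeks
- **Format:** online
- **Language:** pt-BR
- **Certification:** Yes
"""
details = parse_detail_bullets(sample_training)
print(details["Duration"], "|", details["Language"])
```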


### Basic Overview

The dataset includes:

- **`200 jobs`**
- **`497 trainings`**
- Average token count for jobs: **`264 tokens`**
- Average token count for trainings: **`143 tokens`**
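These figures can be roughly reproduced from the downloaded files. The sketch below counts whitespace-separated words as a stand-in for tokens; the tokenizer behind the averages above is not stated, so the numbers it prints are expected to differ somewhat. The helper name `corpus_stats` is hypothetical.

```python
from pathlib import Path
from statistics import mean


def corpus_stats(directory: str) -> tuple[int, float]:
    """Return (number of .md files, mean whitespace-token count per file)."""
    root = Path(directory)
    paths = sorted(root.glob("*.md")) if root.is_dir() else []
    # Whitespace splitting is only an approximation of model tokenization
    counts = [len(p.read_text(encoding="utf-8", errors="ignore").split()) for p in paths]
    return len(paths), (mean(counts) if counts else 0.0)


for name in ("jobs", "trainings"):
    n_files, avg_tokens = corpus_stats(f"data/{name}")
    print(f"{name}: {n_files} files, ~{avg_tokens:.0f} whitespace tokens on average")
```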

---

### Filtering Example

Given the volume of data—**200 job offers** and **497 training opportunities**—iterating over every item for each query is inefficient. A recommended approach is to **pre-filter and organize** the data based on relevant attributes.

One such attribute is **seniority level**, which can be used to group and narrow down the options.

In the exercise below, you'll explore how AI can be leveraged to filter training data by seniority level, enabling more targeted and efficient matching.


In [None]:
# Import necessary libraries
import os
import dotenv
from strands.agent import Agent
from strands.models.mistral import MistralModel

# Load environment variables from the .env file
dotenv.load_dotenv(".env")

# Initialize the model. MISTRAL_API_KEY must be defined in the .env file.
model = MistralModel(
    api_key=os.environ["MISTRAL_API_KEY"],
    model_id="mistral-large-latest",
    stream=False,
)

# Load an example training (explicit encoding, since the files contain non-ASCII text)
with open('data/trainings/tr_acc_accounting_software_proficiency_02.md', 'r', encoding='utf-8') as file:
    training = file.read()

# System prompt for the agent
system_prompt = """
You are a helpful assistant that analyzes training offers and classifies them by seniority level.
Use exactly three categories:
* Basic
* Intermediate
* Advanced
As the result, state which category the training belongs to.
Provide just the answer, without reasoning.
The training will be provided by the user.
"""

# Initialize the agent (callback_handler=None disables the default console output)
agent = Agent(model=model, system_prompt=system_prompt, callback_handler=None)

# Run the agent on the training text
result = agent(training)

# Print the predicted category
print(result.message['content'][0]['text'])

Possible improvements:
* Implement a solution that takes all trainings at once and classifies each of them
* Return three lists of trainings, one per seniority level
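
One way to approach the improvements listed above is to separate the grouping logic from the model call. The sketch below is only an illustration: `normalize_level` and `group_by_level` are hypothetical helpers, and `classify` is any function that takes the training text and returns the model's answer as a string. Answers that match none of the three categories are collected under `"Unknown"` rather than dropped.

```python
from collections.abc import Callable
from pathlib import Path

LEVELS = ("Basic", "Intermediate", "Advanced")


def normalize_level(answer: str) -> str:
    """Map a free-text model answer to one of LEVELS, or "Unknown"."""
    lowered = answer.lower()
    for level in LEVELS:
        if level.lower() in lowered:
            return level
    return "Unknown"


def group_by_level(directory: str, classify: Callable[[str], str]) -> dict[str, list[str]]:
    """Classify every Markdown file in directory and group file names by level."""
    groups: dict[str, list[str]] = {level: [] for level in LEVELS}
    groups["Unknown"] = []
    for path in sorted(Path(directory).glob("*.md")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        groups[normalize_level(classify(text))].append(path.name)
    return groups
```

With the agent defined in the cell above, `classify` could be `lambda text: agent(text).message['content'][0]['text']`, called as `group_by_level('data/trainings', classify)`. Because an agent may keep conversation history between calls, creating a fresh `Agent` for each training is likely the safer choice; this also means one model call per file, so the result is worth caching.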