In [150]:
review_prompt = r"""
You are an expert in job-candidates matching.
From the job description: {job_description}, 
company name: {company_name},
job title: {job_title},
and location: {location},
and my profile: {profil_pro},

Identify which of the following criteria are met by the job description:

## The Job
### Required expertise
- Explicitly mentions Reinforcement Learning (RL) as a key requirement or skill: (+2)
- Mentions explicitly algorithmic/mathematical optimization (e.g., Operations Research, planning, combinatorial optimization, MILP): (+2)
- Agentic workflows (ie. langchain, tool use, prompt engineering, etc.) are part of the job: (+2), +1 more if a large part of the job is dedicated to this.
- Requires demonstrated expertise in a specific technical domain or toolset that is absent from my profile's listed skills and experiences: (-2 if this domain/tool is central to the role, defined as being in the job title, company name, or a primary responsibility/requirement; -1 if it is a secondary qualification).
- Requires a programming language I am not familiar with, AND does not mention Python: (-1)
- More focused on infrastructure (databases, cloud, Docker) than on algorithms: (-3)
- Vague description of actual tasks for a data scientist/engineer job: (-1)
- 'Optimization' mentioned primarily for performance/infrastructure (e.g., inference speed, cloud costs, MLOps): (-3)
- 'optimization' mentioned primarily in the context of quantum algorithms: (-4)
- The job is based in France and requires a good english level. If the description is in english and the job is based in France, this criterion is verified. : (+0.5)
- Requires "deep expertise" / "senior-level experience" / "mastery" of MLOps, large-scale training, or inference optimization (beyond just "good fundamentals" or "being comfortable"): (-1)
- Requires a PhD in a field close to mine (or even if it is just a plus) (has to be explicitly mentioned in the job description. Having experience leading research teams does not imply a PhD): (+1.5)
- Does not mention a PhD but requires experience doing research: (+1)
### Type of role
- More managerial than technical role: (-2)
- Involves leading a team of highly qualified/experienced people (junior excluded): (-1) In a domain I am not familiar with: (-1)
- Involves coaching world-class scientists: (-2)

## The Company
- Top-tier company (e.g., Google, Apple, Meta, Helsing, Mistral AI, Perplexity, OpenAI, Anthropic, Nvidia): (+2) (Do not trust the description of the company in the job description for this criteria, but your prior knowledge about the company if any.)
- More than 150 employees: (-1)
- Offers a full-remote option: (+2)
- Consulting job for a standard/low-tier consulting firm: (-2)
- In the defense sector: (+2)
- In the robotics sector: (+2)
- If not french, requires security clearance: (-1.5)

^ Only mention the lines that are relevant to the job description, with the associated score bonus or penalty.
For example, do not output "- Leading a team: No (+0)". Instead, do not output anything for this criterion.
For each line that is present in the result, mention the sentence/line that satisfies the criterion.
Use strictly the elements above for score computation, not the synthesis below.
"""

In [34]:
from jobseeker_agent.utils.paths import load_prompt, load_full_job

profil_pro = load_prompt("profil_pro")
job_id = 18
job = load_full_job(job_id)
job_description = job["description"]

In [18]:
job

{'id': 18,
 'title': 'Applied ML/AI Engineer - Monitoring',
 'company': 'Sifflet',
 'location': 'Paris, Île-de-France, France',
 'job_link': 'https://fr.linkedin.com/jobs/view/applied-ml-ai-engineer-monitoring-at-sifflet-4314688170',
 'posted_date': '2025-10-14',
 'status': 'Open',
 'workplace_type': 'Remote',
 'description': "**About Sifflet  \n  \n** We are building the world’s best data observability platform to help\ncompanies excel at data-driven decision making.  \n  \nToday, half of a data team’s time is spent troubleshooting data quality\nissues. Sifflet is putting an end to that. Our solution allows data engineers\nand data consumers to visualize how data flows between their services, define\ndata quality checks, and quickly find the root cause of any data anomaly.  \n  \n**About The Job  \n  \n** The monitoring team implements the foundational capabilities of Sifflet:\ndetecting data quality issues across a wide range of data warehouses and\ndatabases.  \n  \nSifflet's monito

In [15]:
print(job_description)

**About Sifflet  
  
** We are building the world’s best data observability platform to help
companies excel at data-driven decision making.  
  
Today, half of a data team’s time is spent troubleshooting data quality
issues. Sifflet is putting an end to that. Our solution allows data engineers
and data consumers to visualize how data flows between their services, define
data quality checks, and quickly find the root cause of any data anomaly.  
  
**About The Job  
  
** The monitoring team implements the foundational capabilities of Sifflet:
detecting data quality issues across a wide range of data warehouses and
databases.  
  
Sifflet's monitoring capabilities rely heavily on machine learning (ML)
techniques. Most advanced data quality checks are based on time series
forecasting models that detect unexpected distribution changes while
accounting for seasonality and one-off events. Additionally, ML-based features
are present throughout our product, be it for intelligent alert groupi

In [35]:
from typing import Annotated, List, TypedDict

class Evaluation(TypedDict):
    """Evaluation of the job description."""
    criteria: Annotated[str, ..., "The criteria that are met by the job description."]
    evidence: Annotated[str, ..., "The evidence for the criteria that are met by the job description."]
    score: Annotated[float, ..., "The score for the criteria that are met by the job description."]

class JobReviewResponse(TypedDict):
    """Response structure for job review."""
    evaluation_grid: Annotated[List[Evaluation], ..., "List of evaluations for each relevant evaluation criterion"]
    score: Annotated[float, ..., "raw score computed from the evaluation grid. Can be negative."]

class JobReviewCorrectionResponse(TypedDict):
    """Response structure for job review correction."""
    correction: Annotated[str, ..., "Correction of the evaluation grid."]
    evaluation_grid: Annotated[List[Evaluation], ..., "List of evaluations for each relevant evaluation criterion"]
    score: Annotated[float, ..., "raw score computed from the evaluation grid. Can be negative."]
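
The `score` field is filled in by the model itself, so it may not always equal the sum of the per-criterion scores in `evaluation_grid`. Below is a minimal sketch of a consistency check, assuming a response shaped like `JobReviewResponse`. The helper name `recompute_score` and the tolerance are illustrative and not part of the notebook; the example grid reuses the scores of the gpt-4.1 result shown further down for the job whose description is missing, with the criteria text abbreviated.

```python
import math
from typing import Tuple


def recompute_score(response: dict, tol: float = 1e-6) -> Tuple[float, bool]:
    """Sum the per-criterion scores and compare them with the reported total."""
    total = sum(item["score"] for item in response["evaluation_grid"])
    matches = math.isclose(total, response["score"], abs_tol=tol)
    return total, matches


# Scores copied from the gpt-4.1 result for the job with a missing description
# (criteria and evidence shortened for readability).
example = {
    "evaluation_grid": [
        {"criteria": "Vague description of actual tasks", "evidence": "...", "score": -1},
        {"criteria": "Based in France, English required", "evidence": "...", "score": 0.5},
    ],
    "score": -0.5,
}

total, matches = recompute_score(example)
print(total, matches)
```

When `matches` is false, using the recomputed total instead of the reported one is one possible policy, since the prompt asks for a score derived strictly from the grid.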


In [97]:
import json

from langchain_core.messages import AIMessage, HumanMessage

from jobseeker_agent.utils.llm import get_llm


def review(model, job_description, job_title, company_name, location, with_correction=False):
    """Score a job offer with `model`; optionally ask the model to correct its own grid."""
    llm = get_llm(model).with_structured_output(JobReviewResponse)
    message = HumanMessage(
        content=review_prompt.format(
            job_description=job_description,
            profil_pro=profil_pro,
            job_title=job_title,
            company_name=company_name,
            location=location,
        )
    )
    response = llm.invoke([message])
    if with_correction:
        # Second pass: replay the first answer and ask the model to review it
        messages = [
            message,
            AIMessage(content=json.dumps(response)),
            HumanMessage(content="Please correct the evaluation grid. Evaluate each element. Is it correct? Are there any missing elements? Leave any removed elements out of the corrected evaluation grid."),
        ]
        response = llm.invoke(messages)
    return response

In [39]:
import json
model = "gpt-4o"
result = review(model, job_description, job["title"], job["company"], job["location"], with_correction=True)
print(result["score"])
# print the dict in easy to read format
# print(json.dumps(result, indent=4))


✅ Chargement du modèle OpenAI : gpt-4o
4.5


In [13]:
model = "gemini-2.5-flash"
result = review(model, job_description, job["title"], job["company"], job["location"])

# print the dict in easy to read format
print(json.dumps(result, indent=4))

✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761664652.990170 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


{
    "evaluation_grid": [
        {
            "criteria": "Heavily features agentic workflows (ie. langchain, tool use, prompt engineering, etc.)",
            "evidence": "Implement generative AI workflows across the product, such as enabling users to describe their monitoring needs in natural language.",
            "score": 3.0
        },
        {
            "criteria": "Requires a programming language I am not familiar with (and not Python)",
            "evidence": "The web API is written in (modern) Java with Spring Boot 3, the web frontend is a VueJS application written in Typescript. You may occasionally need to make minor changes to this code base.",
            "score": -1.0
        },
        {
            "criteria": "The job is based in France and requires a good english level",
            "evidence": "We have offices in Paris, but we\u2019re very remote friendly - several team members are fully remote. All written communication at Sifflet is in English, but the engi

In [41]:
import time
models = ["gpt-4.1", "gpt-4o", "gpt-5-nano", "gpt-5-mini", "gpt-5"]
responses = []
corrections = [True, False]
for model in models:
    for correction in corrections:
        start_time = time.time()
        response = review(model, job_description, job["title"], job["company"], job["location"], with_correction=correction)
        end_time = time.time()
        response["model"] = model
        response["time"] = end_time - start_time
        response["correction"] = correction
        responses.append(response)

# print the dict in easy to read format
# print(json.dumps(responses, indent=4))

✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-4o
✅ Chargement du modèle OpenAI : gpt-4o
✅ Chargement du modèle OpenAI : gpt-5-nano
✅ Chargement du modèle OpenAI : gpt-5-nano
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-5
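
The benchmark loops in this notebook repeat the same `start_time` / `end_time` bookkeeping around each call. A small sketch of a reusable timing wrapper follows. The names `timed` and `fake_review` are hypothetical; `fake_review` only stands in for `review()`, which needs an LLM backend to run.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical stand-in for review(); it returns an empty grid immediately.
def fake_review(model: str, with_correction: bool = False) -> dict:
    return {"evaluation_grid": [], "score": 0.0}


response, elapsed = timed(fake_review, "gpt-4.1", with_correction=True)
response["model"] = "gpt-4.1"
response["time"] = elapsed
response["correction"] = True
```

`time.perf_counter()` is used here instead of `time.time()` because it is a monotonic clock intended for measuring intervals.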


In [48]:
openai_responses = responses.copy()

In [57]:
models = ["gemini-2.5-flash"]
google_responses = []
corrections = [False]
for model in models:
    for correction in corrections:
        start_time = time.time()
        response = review(model, job_description, job["title"], job["company"], job["location"], with_correction=correction)
        end_time = time.time()
        response["model"] = model
        response["time"] = end_time - start_time
        response["correction"] = correction
        google_responses.append(response)

✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761729285.206628 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


In [59]:
# show a table with score, time and model
import pandas as pd
df = pd.DataFrame(google_responses)
df = df[["model", "correction", "score", "time"]]
# print(df[df["score"] ==4.5])
print(df)


              model  correction  score       time
0  gemini-2.5-flash       False    4.5  35.850183


The two best models are gpt-4.1 with correction and gpt-5-mini without correction.
The former takes almost half as long as the latter, so for now it wins.
Still to be seen which are 

In [None]:
# Show the evaluation grids of the two best configurations
best_configs = {("gpt-4.1", True), ("gpt-5-mini", False)}
for response in responses:
    if (response["model"], response["correction"]) in best_configs:
        print("____________")
        print(response["model"].upper())
        print(response["evaluation_grid"])


____________
GPT-4.1
[{'criteria': 'Explicitly mentions Reinforcement Learning (RL) as a key requirement or skill', 'evidence': 'The job description does not mention Reinforcement Learning (RL) anywhere.', 'score': 0}, {'criteria': 'Mentions explicitly algorithmic/mathematical optimization (e.g., Operations Research, planning, combinatorial optimization, MILP)', 'evidence': 'There is no explicit mention of algorithmic/mathematical optimization, Operations Research, or related terms in the job description.', 'score': 0}, {'criteria': 'Agentic workflows (ie. langchain, tool use, prompt engineering, etc.) are part of the job', 'evidence': "The job description mentions 'Implement generative AI workflows across the product, such as enabling users to describe their monitoring needs in natural language.' This suggests some involvement with prompt engineering and generative AI workflows, but does not explicitly mention agentic workflows or frameworks like LangChain.", 'score': 2}, {'criteria':

In [30]:
for r in responses:
    print("____________")
    print(r["model"].upper())
    # Print each criterion followed by its supporting evidence
    for e in r["evaluation_grid"]:
        print(e["criteria"])
        print(e["evidence"])

____________
GPT-4O
The job is based in France and requires a good English level
"All written communication at Sifflet is in English, but the engineering team routinely uses French, so some level of fluency in French is required."
Offers a full-remote option
"We have offices in Paris, but we’re very remote friendly - several team members are fully remote."
Requires a PhD in a field close to mine (or even if it is just a plus)
"More than three years of experience in a ML engineer role or equivalent. Hands-on production experience is appreciated."
Requires a programming language I am not familiar with (and not Python)
"The web API is written in (modern) Java with Spring Boot 3, the web frontend is a VueJS application written in Typescript."
Requires strong expertise in a topic/domain I am not familiar with
"Sifflet's monitoring capabilities rely heavily on machine learning (ML) techniques. Most advanced data quality checks are based on time series forecasting models."
More focused on inf

# Evaluation on the examples from the evaluation set

In [63]:
models_tested = [("gpt-4.1", True), ("gpt-5-mini", False), ("gemini-2.5-flash", False)]
oracle_model = ("gpt-5", False)

In [61]:
from jobseeker_agent.utils.paths import get_data_path
evals_path = get_data_path() / "reviewer" / "tests" / "5" / "evals.json"
with open(evals_path, "r") as f:
    evals = json.load(f)

ids = [ev["id"] for ev in evals]


In [131]:
results = []
for job_id in ids:
    print(f"Evaluation for job {job_id}")
    job = load_full_job(job_id)
    job_description = job["description"]
    job_title = job["title"]
    company_name = job["company"]
    location = job["location"]
    result = {"id": job_id, "job_description": job_description}

    # Oracle run: the reference score the other models are compared against
    start_time = time.time()
    result["oracle"] = review(oracle_model[0], job_description, job_title, company_name, location, oracle_model[1])
    result["oracle"]["time"] = time.time() - start_time

    # Candidate models, each with its own correction setting
    for model_name, with_correction in models_tested:
        start_time = time.time()
        result[model_name] = review(model_name, job_description, job_title, company_name, location, with_correction)
        result[model_name]["time"] = time.time() - start_time
    results.append(result)



Evaluation for job 1
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761740195.939168 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 2
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761740330.323214 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 3
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761740469.829023 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 4
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761740598.169456 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 5
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761740803.188805 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 6
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761740976.450089 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 7
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761741094.993964 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 8
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761741286.783700 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 9
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761741438.654276 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 10
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761741580.157041 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 11
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761741720.866506 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 12
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761741838.563378 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


Evaluation for job 13
✅ Chargement du modèle OpenAI : gpt-5
✅ Chargement du modèle OpenAI : gpt-4.1
✅ Chargement du modèle OpenAI : gpt-5-mini
✅ Chargement du modèle Gemini : gemini-2.5-flash


E0000 00:00:1761741990.196153 9513333 alts_credentials.cc:93] ALTS creds ignored. Not running on GCP and untrusted ALTS is not enabled.


KeyboardInterrupt: 
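
The run above was interrupted by hand at job 13. Since every job costs several LLM calls, one option is to write each result to disk as soon as it is ready, so an interrupted run can be resumed. This is a sketch only, assuming results are plain JSON-serializable dicts. The helper names, the `results.jsonl` file name and the temporary directory are all illustrative, and the two records are placeholders rather than real results.

```python
import json
import tempfile
from pathlib import Path


def append_result(path: Path, result: dict) -> None:
    """Append one result as a single JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result, ensure_ascii=False) + "\n")


def load_results(path: Path) -> list:
    """Read back every result saved so far (empty list if no file yet)."""
    if not path.exists():
        return []
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative location; a real run would use a stable path instead.
checkpoint = Path(tempfile.mkdtemp()) / "results.jsonl"

# Placeholder records standing in for entries of `results`.
append_result(checkpoint, {"id": 1, "oracle": {"score": 0}})
append_result(checkpoint, {"id": 2, "oracle": {"score": 2}})

# On restart, skip the jobs that already have a saved result.
done_ids = {r["id"] for r in load_results(checkpoint)}
remaining = [job_id for job_id in [1, 2, 3] if job_id not in done_ids]
print(remaining)
```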

In [132]:
results

[{'id': 1,
  'job_description': 'Description not found.',
  'oracle': {'evaluation_grid': [], 'score': 0, 'time': 26.32979917526245},
  'gpt-4.1': {'evaluation_grid': [{'criteria': 'Vague description of actual tasks for a data scientist/engineer job',
     'evidence': "Job description is 'Description not found.' No details are provided about the actual tasks or requirements.",
     'score': -1},
    {'criteria': 'The job is based in France and requires a good english level. If the description is in english and the job is based in France, this criterion is verified.',
     'evidence': "Job title and location: 'Research Scientist (AI) - Science Team', Paris, Île-de-France, France. The job title is in English, and the location is in France.",
     'score': 0.5}],
   'score': -0.5,
   'time': 5.39702296257019},
  'gpt-5-mini': {'evaluation_grid': [{'criteria': 'Vague description of actual tasks for a data scientist/engineer job',
     'evidence': 'Description not found.',
     'score': -1}

In [133]:
# Build a DataFrame for a single job (index 10 in results):
# one column per model, with rows for evaluation_grid, score and time
df = pd.DataFrame(results[10])
df




Unnamed: 0,id,job_description,oracle,gpt-4.1,gpt-5-mini,gemini-2.5-flash
evaluation_grid,11,**Job Title:** Research Scientist – GC-MS/MS\n...,[{'criteria': 'Requires demonstrated expertise...,[{'criteria': 'Requires demonstrated expertise...,[{'criteria': 'Requires demonstrated expertise...,[{'criteria': 'Requires demonstrated expertise...
score,11,**Job Title:** Research Scientist – GC-MS/MS\n...,-3.5,-2,-2,-5
time,11,**Job Title:** Research Scientist – GC-MS/MS\n...,70.463378,21.458253,26.781385,21.411925


In [130]:
results[0]["gemini-2.5-flash"]["evaluation_grid"]

[{'criteria': "Requires demonstrated expertise in a specific technical domain or toolset that is absent from my profile's listed skills and experiences",
  'evidence': 'Strong software-design instincts: testing, code review, CI/CD',
  'score': -1.0},
 {'criteria': "'Optimization' mentioned primarily for performance/infrastructure (e.g., inference speed, cloud costs, MLOps)",
  'evidence': 'you’ll build and optimise the large-scale learning systems',
  'score': -3.0},
 {'criteria': 'The job is based in France and requires a good english level. If the description is in english and the job is based in France, this criterion is verified.',
  'evidence': 'Location: Paris / London (hybrid) or remote from EU/UK',
  'score': 0.5},
 {'criteria': 'Requires "deep expertise" / "senior-level experience" / "mastery" of MLOps, large-scale training, or inference optimization (beyond just "good fundamentals" or "being comfortable")',
  'evidence': '4 + years working on large-scale ML codebases',
  'sco

In [134]:
# Compute mean absolute error and average time for each model compared to oracle
model_performance = {}

for model_name in ["gpt-4.1", "gpt-5-mini", "gemini-2.5-flash"]:
    mae_scores = []
    times = []
    
    for result in results:
        oracle_score = result["oracle"]["score"]
        model_score = result[model_name]["score"]
        mae = abs(oracle_score - model_score)
        mae_scores.append(mae)
        times.append(result[model_name]["time"])
    
    model_performance[model_name] = {
        "mean_absolute_error": sum(mae_scores) / len(mae_scores),
        "average_time": sum(times) / len(times)
    }

# Display results
for model, metrics in model_performance.items():
    print(f"{model}:")
    print(f"  Mean Absolute Error: {metrics['mean_absolute_error']:.3f}")
    print(f"  Average Time: {metrics['average_time']:.3f}s")
    print()


gpt-4.1:
  Mean Absolute Error: 1.292
  Average Time: 10.774s

gpt-5-mini:
  Mean Absolute Error: 1.792
  Average Time: 31.905s

gemini-2.5-flash:
  Mean Absolute Error: 1.417
  Average Time: 27.163s
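
The first entry of `results` has the description 'Description not found.', so its scores say little about how well the models read a real offer. The sketch below recomputes the mean absolute error with and without that entry. The scores are copied from the variance cell output in this notebook, in the order oracle, gpt-4.1, gpt-5-mini, gemini-2.5-flash. The function `mean_absolute_error` and its `skip` parameter are illustrative names.

```python
MODELS = ["gpt-4.1", "gpt-5-mini", "gemini-2.5-flash"]

# index in results -> [oracle, gpt-4.1, gpt-5-mini, gemini-2.5-flash]
SCORES = {
    0: [0, -0.5, -1, 0],
    1: [2, 2, 2, 3],
    2: [-4, -4, -2, -4],
    3: [3, 3, 4, 3],
    4: [6, 4, 2.0, 9],
    5: [-1.5, 3, 4.5, 5],
    6: [1, 1, 1.0, 1],
    7: [-1.5, 3.5, 3.5, -2.5],
    8: [1, 0, 1.0, 1],
    9: [4, 4.0, 4.0, 4],
    10: [-3.5, -2, -2, -5],
    11: [2.5, 1.5, 1.5, -1.5],
}


def mean_absolute_error(scores, model_pos, skip=()):
    """Mean |oracle - model| over all rows whose index is not in `skip`."""
    errors = [
        abs(row[0] - row[model_pos])
        for idx, row in scores.items()
        if idx not in skip
    ]
    return sum(errors) / len(errors)


for pos, name in enumerate(MODELS, start=1):
    full = mean_absolute_error(SCORES, pos)
    filtered = mean_absolute_error(SCORES, pos, skip={0})
    print(f"{name}: all jobs {full:.3f}, without index 0 {filtered:.3f}")
```

With only twelve jobs, a single outlier moves the average noticeably, so the ranking between models should be read with caution.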



In [135]:
# Compute variance in scores for each job offer
job_variances = []

for i, result in enumerate(results):
    scores = [
        result["oracle"]["score"],
        result["gpt-4.1"]["score"],
        result["gpt-5-mini"]["score"],
        result["gemini-2.5-flash"]["score"]
    ]
    
    # Calculate variance
    mean_score = sum(scores) / len(scores)
    variance = sum((score - mean_score) ** 2 for score in scores) / len(scores)
    
    job_variances.append({
        "job_id": i,  # index into results, not the job's own id (that one is result["id"])
        "variance": variance,
        "scores": scores
    })

# Display results sorted by variance (highest first)
job_variances_sorted = sorted(job_variances, key=lambda x: x["variance"], reverse=True)

print("Job offer score variances (sorted by variance):")
print("=" * 50)
for job in job_variances_sorted:
    print(f"Job ID: {job['job_id']}")
    print(f"  Variance: {job['variance']:.3f}")
    print(f"  Scores: {job['scores']}")
    print()


Job offer score variances (sorted by variance):
Job ID: 7
  Variance: 7.688
  Scores: [-1.5, 3.5, 3.5, -2.5]

Job ID: 4
  Variance: 6.688
  Scores: [6, 4, 2.0, 9]

Job ID: 5
  Variance: 6.562
  Scores: [-1.5, 3, 4.5, 5]

Job ID: 11
  Variance: 2.250
  Scores: [2.5, 1.5, 1.5, -1.5]

Job ID: 10
  Variance: 1.547
  Scores: [-3.5, -2, -2, -5]

Job ID: 2
  Variance: 0.750
  Scores: [-4, -4, -2, -4]

Job ID: 1
  Variance: 0.188
  Scores: [2, 2, 2, 3]

Job ID: 3
  Variance: 0.188
  Scores: [3, 3, 4, 3]

Job ID: 8
  Variance: 0.188
  Scores: [1, 0, 1.0, 1]

Job ID: 0
  Variance: 0.172
  Scores: [0, -0.5, -1, 0]

Job ID: 6
  Variance: 0.000
  Scores: [1, 1, 1.0, 1]

Job ID: 9
  Variance: 0.000
  Scores: [4, 4.0, 4.0, 4]
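
The loop above divides by `len(scores)`, which is the population variance. The standard library's `statistics.pvariance` computes the same quantity, while `statistics.variance` divides by n - 1 (sample variance) and gives larger values for the same four scores. A short check on the scores printed for the entry at index 7:

```python
from statistics import pvariance, variance

# Scores for the entry at index 7: [oracle, gpt-4.1, gpt-5-mini, gemini-2.5-flash]
scores = [-1.5, 3.5, 3.5, -2.5]

population = pvariance(scores)  # divides by n, like the hand-written loop
sample = variance(scores)       # divides by n - 1

print(f"population variance: {population:.3f}")
print(f"sample variance:     {sample:.3f}")
```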



In [142]:
results[7]["job_description"]

"As a research engineer on our team, you will partner with research scientists\nto turn research ideas into working systems; building the data, tooling, and\ninfrastructure that enable rapid iteration, trustworthy evaluation, and a\nsmooth path from prototype to production.  \n  \nBuilding on our proven track record of AI-powered solutions (e.g., Bits AI,\nWatchdog, and Toto), Datadog AI Research is tackling high-risk, high-reward\nprojects grounded in real-world challenges in cloud observability and\nsecurity.  \n  \nWe are currently focused on three key research areas:  \n  \n\n  * Observability Foundation Models – Building state-of-the-art models for advanced forecasting, anomaly detection, and multi-modal telemetry analysis (logs, metrics, traces, etc.). These models will also provide the foundation for our agents (described below) to natively analyze telemetry data. \n  * Site Reliability Engineering (SRE) Autonomous Agents – Creating AI agents to automatically detect, diagnose, a

In [153]:
results[7]["gpt-4.1"]["evaluation_grid"]

[{'criteria': 'Explicitly mentions Reinforcement Learning (RL) as a key requirement or skill',
  'evidence': "The job description states: 'Orchestrate distributed training and distributed RL with Ray, including scheduling, scaling, and failure recovery.'",
  'score': 2},
 {'criteria': 'Mentions explicitly algorithmic/mathematical optimization (e.g., Operations Research, planning, combinatorial optimization, MILP)',
  'evidence': 'The job description does not mention algorithmic/mathematical optimization, operations research, or related terms.',
  'score': 0},
 {'criteria': 'Agentic workflows (ie. langchain, tool use, prompt engineering, etc.) are part of the job',
  'evidence': "The job description includes: 'Site Reliability Engineering (SRE) Autonomous Agents – Creating AI agents to automatically detect, diagnose, and resolve incidents in production environments...' and 'Production Code Repair Agents – Developing agents and models...' This is about building AI agents, but does not me

In [79]:
load_full_job(2)["description"]


"AryaXAI stands at the forefront of AI innovation, revolutionizing AI for\nmission-critical businesses by building explainable, safe, and aligned systems\nthat scale responsibly. Our mission is to create AI tools that empower\nresearchers, engineers, and organizations to unlock AI's full potential while\nmaintaining transparency and safety.  \n  \nOur team thrives on a shared passion for cutting-edge innovation,\ncollaboration, and a relentless drive for excellence. At AryaXAI, everyone\ncontributes hands-on to our mission in a flat organizational structure that\nvalues curiosity, initiative, and exceptional performance.  \n  \nAs a research scientist at AryaXAI, you will be uniquely positioned in our\nteam to work on very large-scale industry problems and push forward the\nfrontiers of AI technologies. You will become a part of the unique atmosphere\nwhere startup culture meets research innovation, with key outcomes of speed\nand reliability.  \n  \n**Responsibilities  \n  \n**\n\n  *