# End-to-end run of Agent Arena v0

This Jupyter notebook walks through all the components of Agent Arena.

![Main Stages of Agent Arena](media/AgentArenaPartsv2.png)

Each necessary component will be showcased in its v0 form.

# Building the Agent Arena Dataset

The first step in Agent Arena is building the dataset.  This involves pulling the right data from Upwork's databases and separating it out for the arena.

There are several important design decisions for the Agent Arena Dataset.  For each one, we list the possible directions and bold the path we took for this demo:
- How will we store the dataset?
  - **a job data .csv that contains filepaths to a bucket of attached files**
  - Advanced Database
- How will we sample the data from Upwork's database?
  - **random uniform sample**
  - sample based on time
  - sample based on sector/category (like ontological category)
  - sample based on difficulty of task
  - sample based on skillset needed
  - sample based on budget
- Will we include attachment data?
  - **No.**
  - Yes.
- How do we filter for feasibility of the job?
  - We don't.
  - **We use an off-the-shelf LLM**
  - We train Uma to filter for feasibility.
  - We single-verify with a freelancer.
  - We triple-verify with a freelancer.
- Do we include deliverables data?
  - **No.**
  - Yes
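
As a rough illustration of the sampling options listed above, here is a minimal sketch in plain Python. The notebook itself pulls the sample with SQL queries that are not shown, so `job_pool`, `uniform_sample`, and `stratified_sample` are hypothetical names introduced only for this example.

```python
import random

# Hypothetical stand-in for the pool of recently posted jobs.
job_pool = [
    {"POST_KEY": 63_000_000 + i,
     "SECTOR": "Writing" if i % 2 else "Design & Creative"}
    for i in range(10_000)
]

def uniform_sample(rows, k, seed=0):
    """Return k rows drawn uniformly at random without replacement."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return rng.sample(rows, k)

def stratified_sample(rows, key, k_per_group, seed=0):
    """Return up to k_per_group rows for each distinct value of `key`."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)
    picked = []
    for group_rows in groups.values():
        picked.extend(rng.sample(group_rows, min(k_per_group, len(group_rows))))
    return picked

sample = uniform_sample(job_pool, 1000)                   # the path taken in this demo
by_sector = stratified_sample(job_pool, "SECTOR", 500)    # one of the alternatives
print(len(sample), len(by_sector))
```

A uniform sample mirrors the natural mix of jobs on the platform, while a stratified sample guarantees coverage of small sectors at the cost of distorting that mix.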

In [1]:
# Part 1: Pulling the Data from Upwork Databases

# Using SQL queries, we can pull 1000 random recently posted jobs.
# The result is stored in a .csv file under data/ (loaded below as csv_orig).
# Let's take a look at what this looks like.

import pandas as pd

csv_orig = "data/df_randomized_attachment.csv"

df = pd.read_csv(csv_orig)
df.head()

Unnamed: 0,POST_KEY,TITLE,DESCRIPTION,SECTOR,EXPERIENCE_LEVEL,CLIENT_RATING,PROJECTED_VALUE,PROJECTED_VALUE_CATEGORY,IS_HOURLY,HOURLY_LOW,HOURLY_HIGH,BUDGET,COUNTRY,LANGUAGE,SKILLS_AND_EXPERTISE,POST_DATE,AGORA_POST_ID,HAS_ATTACHMENT,SUBSECTOR,SUBSUBSECTOR
0,63107529,Need shopify backend expert for selecting my p...,This is going to be 5 min work for an expert ....,"Web, Mobile & Software Dev",Expert/Expensive,,83.195333,VLV,False,,,40.0,United States,,Shopify,2024-01-02,1742055756478644224,False,Ecommerce Development,Ecommerce Website Development
1,63114550,Presentation Pitch Deck,Hi there\n\nI want to have a series of calls e...,Writing,Expert/Expensive,,415.515007,LV,False,,,300.0,United Kingdom,,"Business Presentation,Marketing Presentation,M...",2024-01-02,1742206942868066304,False,Professional & Business Writing,Academic & Research Writing
2,63120800,Video creation and editing,\nI would like videos that mother is making to...,Design & Creative,Intermediate,,312.69655,LV,False,,,5.0,Australia,,"Adobe After Effects,Adobe Premiere Pro,Cinemat...",2024-01-02,1742299785375170560,False,Video & Animation,Video Editing
3,63095881,Real Estate Cold Calling Lead Manager,We seek a skilled and experienced cold-calling...,Sales & Marketing,Intermediate,,1657.371981,HV,True,6.0,10.0,,United States,,"Cold Calling,Communications,Lead Generation,Sa...",2024-01-01,1741723011960696832,False,Lead Generation & Telemarketing,Telemarketing
4,63108525,Intercom Ticketing Software Setup and Automation,We are seeking an experienced professional to ...,Sales & Marketing,Intermediate,,844.183541,MV,True,15.0,75.0,,United States,,"API,Chatbot Development,Intercom,Python,Zendesk",2024-01-02,1742083254525677568,False,Digital Marketing,Marketing Automation


In [2]:
# Part 2: Re-inserting the links

# The data has links removed by default: every URL is replaced
# by "(link removed)".  We use a separate link .csv to reinsert
# the links into the job descriptions.

# TODO: CHECK FOR STALE LINKS

import pandas as pd

from utils.links import reinsert_links

csv_orig = "data/df_randomized_attachment.csv"
csv_link = "data/df_links.csv"

df = reinsert_links(csv_orig, csv_link)
df.head()

Unnamed: 0,POST_KEY,TITLE,DESCRIPTION,SECTOR,EXPERIENCE_LEVEL,CLIENT_RATING,PROJECTED_VALUE,PROJECTED_VALUE_CATEGORY,IS_HOURLY,HOURLY_LOW,HOURLY_HIGH,BUDGET,COUNTRY,LANGUAGE,SKILLS_AND_EXPERTISE,POST_DATE,AGORA_POST_ID,HAS_ATTACHMENT,SUBSECTOR,SUBSUBSECTOR
0,63107529,Need shopify backend expert for selecting my p...,This is going to be 5 min work for an expert ....,"Web, Mobile & Software Dev",Expert/Expensive,,83.195333,VLV,False,,,40.0,United States,,Shopify,2024-01-02,1742055756478644224,False,Ecommerce Development,Ecommerce Website Development
1,63114550,Presentation Pitch Deck,Hi there\n\nI want to have a series of calls e...,Writing,Expert/Expensive,,415.515007,LV,False,,,300.0,United Kingdom,,"Business Presentation,Marketing Presentation,M...",2024-01-02,1742206942868066304,False,Professional & Business Writing,Academic & Research Writing
2,63120800,Video creation and editing,\nI would like videos that mother is making to...,Design & Creative,Intermediate,,312.69655,LV,False,,,5.0,Australia,,"Adobe After Effects,Adobe Premiere Pro,Cinemat...",2024-01-02,1742299785375170560,False,Video & Animation,Video Editing
3,63095881,Real Estate Cold Calling Lead Manager,We seek a skilled and experienced cold-calling...,Sales & Marketing,Intermediate,,1657.371981,HV,True,6.0,10.0,,United States,,"Cold Calling,Communications,Lead Generation,Sa...",2024-01-01,1741723011960696832,False,Lead Generation & Telemarketing,Telemarketing
4,63108525,Intercom Ticketing Software Setup and Automation,We are seeking an experienced professional to ...,Sales & Marketing,Intermediate,,844.183541,MV,True,15.0,75.0,,United States,,"API,Chatbot Development,Intercom,Python,Zendesk",2024-01-02,1742083254525677568,False,Digital Marketing,Marketing Automation
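
The internals of `reinsert_links` are not shown in this notebook. As a minimal sketch of the idea, assuming the link .csv yields an ordered list of URLs per post (an assumption about its layout), each "(link removed)" placeholder can be filled left to right. `fill_placeholders` and the example URL are hypothetical.

```python
PLACEHOLDER = "(link removed)"

def fill_placeholders(description, links):
    """Replace each placeholder, left to right, with the next link.

    Placeholders that have no matching link are left untouched.
    """
    parts = description.split(PLACEHOLDER)
    out = [parts[0]]
    for i, tail in enumerate(parts[1:]):
        out.append(links[i] if i < len(links) else PLACEHOLDER)
        out.append(tail)
    return "".join(out)

# Hypothetical example: one stored link, two placeholders in the text.
description = "Details at (link removed). Older version: (link removed)"
restored = fill_placeholders(description, ["https://example.com/brief"])
print(restored)
```

Leaving unmatched placeholders in place keeps a mismatch between the two files visible instead of silently dropping text.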


In [7]:
# Part 3: Downloading attachments

# Some jobs have attachments, as indicated by the HAS_ATTACHMENT column.
# We go through and download all attachments using the AGORA_POST_ID.

# TODO: CLEAN UP THE FILENAMES.

import pandas as pd

from utils.download_attachments import download_all_attachments

csv_file = "data/df_randomized_attachment_links.csv"
save_path = "data"

download_all_attachments(csv_file, save_path)

Found 41 posts with attachments

Processing post 1741721204732399616
Renaming to: Looking%20for%20a%20Dedicated%20A-Player%20Virtual%20Assistant.docx
Most recently created file: dce94b4805824951345f9633780a853e?response-content-disposition=attachment; filename="Looking%20for%20a%20Dedicated%20A-Player%20Virtual%20Assistant.docx"; filename*=utf-8''Looking%20for%20a%20Dedicated%20A-Player%20Virtual%20Assistant.do
Successfully renamed file to: Looking%20for%20a%20Dedicated%20A-Player%20Virtual%20Assistant.docx

Processing post 1741990855169941504
Renaming to: fulllogo.jpg
Most recently created file: 7fb8337385bda2232e2018bfd54dd4a7?response-content-disposition=inline; filename="fulllogo.jpg"; filename*=utf-8''fulllogo.jpg&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEFgaCXVzLXdlc3QtMiJHMEUCIHR9144J9x8SdGuJxB+BaFy1S9uGDrjnTVDOIgwRmYAmAiEA3P
Successfully renamed file to: fulllogo.jpg
Renaming to: fulllogo_transparent.png
Most recently created file: a6750c11fc49507d8b05489f792ca36d?response-content-
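
The output above shows percent-encoded filenames such as `Looking%20for%20a%20...docx`, which is what the filename TODO refers to. One possible cleanup, sketched with the standard library only, is to pull the quoted `filename="..."` value out of the downloaded name and percent-decode it. `clean_filename` is a hypothetical helper, not part of `utils.download_attachments`.

```python
import re
from urllib.parse import unquote

def clean_filename(raw_name):
    """Extract the quoted filename from a download name and percent-decode it."""
    match = re.search(r'filename="([^"]+)"', raw_name)
    name = match.group(1) if match else raw_name
    name = unquote(name)
    # Drop any directory components so the file stays inside the target folder.
    return name.replace("\\", "/").split("/")[-1]

raw = ('dce94b4805824951345f9633780a853e?response-content-disposition=attachment; '
       'filename="Looking%20for%20a%20Dedicated%20A-Player%20Virtual%20Assistant.docx"')
print(clean_filename(raw))
```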

In [2]:
# Part 4: Filtering for Feasibility

# Here we use an off-the-shelf LLM to judge whether the job is feasible or not.
# This step can be done by humans as well, such as freelancers.
# Having human experts judge feasibility could be necessary to earn the trust of our third-party partners in the Arena.
# In the future, this can also be done by Uma if it is trained to understand job feasibility.

from utils.create_feasible_dataset import filter_csv_for_feasible_jobs

csv_feasible = "data/df_randomized_feasible.csv"

# csv_orig is defined in an earlier cell.  This step can also be done manually by freelancers.
filter_csv_for_feasible_jobs(csv_orig, csv_feasible)

Reading CSV file...
Analyzing job descriptions...


100%|██████████████████████████████████████████████████████████| 1000/1000 [11:38<00:00,  1.43it/s]

Saving results...
Done! Found 187 feasible jobs out of 1000 total jobs analyzed.
Results saved to: data/df_randomized_feasible.csv
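
Because feasibility can be judged by an LLM, by freelancers, or by Uma, it can help to think of the filter as taking an interchangeable judge function. The sketch below is an assumption about structure, not the actual `filter_csv_for_feasible_jobs` implementation; `keyword_judge` is a toy placeholder for the LLM call, and its rule says nothing about how feasibility is really defined.

```python
def filter_feasible(jobs, judge):
    """Keep the jobs for which judge(description) returns True."""
    return [job for job in jobs if judge(job["DESCRIPTION"])]

def keyword_judge(description):
    """Toy stand-in for an LLM or human judge (purely illustrative rule)."""
    blocked = ("phone call", "in person", "on-site")
    text = description.lower()
    return not any(phrase in text for phrase in blocked)

# Hypothetical job rows.
jobs = [
    {"ID": 0, "DESCRIPTION": "Design a simple modern logo for my website."},
    {"ID": 1, "DESCRIPTION": "Make daily phone calls to real estate leads."},
]
feasible = filter_feasible(jobs, keyword_judge)
print([job["ID"] for job in feasible])

# Feasibility rate reported in the run above: 187 of 1000 jobs.
print(f"{187 / 1000:.1%}")
```

Swapping `keyword_judge` for a function that calls an LLM or records a freelancer's verdict leaves the rest of the pipeline unchanged.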





In [1]:
# Part 5: Cleanup

# We clean up the data columns and keep only what is necessary for the Arena.
# As an example of cleanup, we replace post_key with a new Arena JobID.

# TODO: there still seem to be some mapping issues

from utils.clean_up_csv import clean_csv, rename_data_dirs

csv_orig = "data/df_randomized_attachment_links.csv"
data_dir = "data"

# Clean up the .csv file.
clean_csv(csv_orig)
# Rename the attachment directories.
rename_data_dirs(csv_orig, data_dir)

Successfully saved cleaned data to data/df_randomized_attachment_links_cleaned.csv
Columns saved in order: ID, POST_KEY, AGORA_POST_ID, TITLE, DESCRIPTION, SECTOR, SUBSECTOR, SUBSUBSECTOR, SKILLS_AND_EXPERTISE, EXPERIENCE_LEVEL, CLIENT_RATING, IS_HOURLY, HOURLY_LOW, HOURLY_HIGH, BUDGET, COUNTRY, LANGUAGE, POST_DATE
Renamed: 1741883886514262016 -> 63101171
Renamed: 1742203609755996160 -> 63114316
Renamed: 1741854292737830912 -> 63099905
Renamed: 1742142968768155648 -> 63111055
Renamed: 1742258392386719744 -> 63118102
Renamed: 1742233439487676416 -> 63116411
Renamed: 1742092662980657152 -> 63108892
Renamed: 1741882011580149760 -> 63101102
Renamed: 1742213737218592768 -> 63115009
Renamed: 1742311730898817024 -> 63121503
Skipped: 1886807791342739171 (no mapping found)
Renamed: 1742211586835566592 -> 63114861
Renamed: 1742170049703628800 -> 63112363
Renamed: 1742208045882155008 -> 63114628
Renamed: 1741897769277554688 -> 63101743
Renamed: 1742280608905674752 -> 63119587
Renamed: 17417212047
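
The log above shows directories being renamed from an AGORA_POST_ID to a mapped ID, with one directory skipped because no mapping was found. A minimal sketch of that behavior follows, assuming the mapping is a plain dictionary; `rename_dirs` is a hypothetical helper and not the `rename_data_dirs` used in the cell. Returning the skipped names makes mapping problems easy to inspect afterwards.

```python
import tempfile
from pathlib import Path

def rename_dirs(data_dir, mapping):
    """Rename each subdirectory whose name is a key in `mapping`.

    Returns (renamed, skipped): a list of (old, new) pairs and a list of
    directory names that were left alone.
    """
    renamed, skipped = [], []
    # sorted() materializes the listing before any rename happens.
    for path in sorted(Path(data_dir).iterdir()):
        if not path.is_dir():
            continue
        new_name = mapping.get(path.name)
        target = path.with_name(str(new_name)) if new_name is not None else None
        if target is None or target.exists():
            skipped.append(path.name)  # no mapping, or target already taken
            continue
        path.rename(target)
        renamed.append((path.name, target.name))
    return renamed, skipped

with tempfile.TemporaryDirectory() as tmp:
    for name in ("1741883886514262016", "1886807791342739171"):
        (Path(tmp) / name).mkdir()
    renamed, skipped = rename_dirs(tmp, {"1741883886514262016": "63101171"})
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(renamed, skipped)
```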

In [5]:
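# Let's take a look at what our final Arena Dataset looks like.

import pandas as pd

csv_data = "data/df_randomized_feasible_cleaned.csv"  # same file the API cell loads
df = pd.read_csv(csv_data)
df.head()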

Unnamed: 0,ID,TITLE,DESCRIPTION,SECTOR,SKILLS_AND_EXPERTISE,EXPERIENCE_LEVEL,CLIENT_RATING,IS_HOURLY,HOURLY_LOW,HOURLY_HIGH,BUDGET,COUNTRY,LANGUAGE,POST_DATE
0,0,Kayak Bass Fishing Logo Design,"My website is focused on kayak bass fishing, s...",Design & Creative,"Abstract Logo,Brand Identity,Emblem Logo,Graph...",Intermediate,9.0,True,0.0,0.0,,United States,,2024-01-02
1,1,C# Native AOT console app,I need someone to get a c# native AOT console...,"Web, Mobile & Software Dev",C#,Intermediate,,True,16.0,35.0,,United States,,2024-01-01
2,2,Italian typist,Hello - I am looking for someone to retype on ...,Admin Support,"Data Entry,Italian",Cheap/Inexperienced,,True,5.0,7.0,,United States,,2024-01-01
3,3,Fix Laravel API for Firebase Social & Apple Login,Need help with fixing bugs in Laravel based Re...,"Web, Mobile & Software Dev","API,API Development,API Integration,Firebase,L...",Intermediate,,False,,,50.0,India,,2024-01-02
4,4,Improve speed loading time and Core Web Vitals,"Hello,\nWe are looking for someone who is skil...","Web, Mobile & Software Dev","CSS,JavaScript,Page Speed Optimization,PHP,Web...",Expert/Expensive,0.0,False,,,200.0,United States,,2024-01-02


# Agent Arena API

We can provide an API that agents use to pull this data.  In this demo, the API is a Python object initialized with the Arena Dataset we created above.  In the future, this should be a served API with logging, so that we can understand how agents interact with the Arena.

In this demo, there are four main API calls:
- `get_num_jobs`: returns the total number of jobs in the arena
- `get_jobs_metadata`: returns a dictionary mapping each jobID to key metadata for that job
- `get_job_description`: given a jobID, returns the full job description and any additional context
- `submit_job`: submits the deliverables of a job (i.e., the output of the agent) to be evaluated

This allows agent behavior such as scanning all the jobs with `get_jobs_metadata`, picking the jobs the agent thinks it can do, and then fetching the details of each chosen job through `get_job_description`.

One of the most important aspects of the API is that it keeps the Upwork platform in the loop with the agents competing in the arena.  It also lets us keep track of when the API is being triggered and thus which jobs the agents are currently looking at.
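
To make the four calls and the logging idea concrete, here is a minimal in-memory sketch. It is not the `AgentArenaData` class from `api.data`; `ToyArenaAPI`, its `call_log` attribute, the reduced metadata tuple, and the placeholder descriptions are all assumptions made for illustration.

```python
import time

class ToyArenaAPI:
    """Hypothetical in-memory version of the four calls, with call logging."""

    def __init__(self, jobs):
        self._jobs = {job["ID"]: job for job in jobs}
        self.submissions = {}
        self.call_log = []  # entries are (timestamp, call name, job ID or None)

    def _log(self, call, job_id=None):
        self.call_log.append((time.time(), call, job_id))

    def get_num_jobs(self):
        self._log("get_num_jobs")
        return len(self._jobs)

    def get_jobs_metadata(self):
        self._log("get_jobs_metadata")
        return {job_id: (job["TITLE"], job["SECTOR"])
                for job_id, job in self._jobs.items()}

    def get_job_description(self, job_id):
        self._log("get_job_description", job_id)
        return self._jobs[job_id]["DESCRIPTION"]

    def submit_job(self, job_id, deliverable):
        self._log("submit_job", job_id)  # failed attempts are logged too
        if job_id not in self._jobs:
            raise KeyError(f"unknown job ID: {job_id}")
        self.submissions[job_id] = deliverable

api = ToyArenaAPI([
    {"ID": 0, "TITLE": "Kayak Bass Fishing Logo Design",
     "SECTOR": "Design & Creative", "DESCRIPTION": "placeholder description 0"},
    {"ID": 2, "TITLE": "Italian typist",
     "SECTOR": "Admin Support", "DESCRIPTION": "placeholder description 2"},
])
for job_id in api.get_jobs_metadata():
    api.submit_job(job_id, "draft for: " + api.get_job_description(job_id))
calls = [call for _, call, _ in api.call_log]
print(calls)
```

The call log is what would let the platform see which jobs an agent inspected and how long it took between reading a description and submitting.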

In [6]:
# Let's take a look at this API

import pandas as pd  # needed for pd.notna below

from api.data import AgentArenaData

csv_data = "data/df_randomized_feasible_cleaned.csv"

# Initialize the data
data = AgentArenaData(csv_data)

# Print total number of jobs
print(f"\nTotal number of jobs: {data.get_num_jobs():,}")

# Print metadata for first 3 jobs
print("\nMetadata for first 3 jobs:")
metadata = data.get_jobs_metadata()
for job_id in range(3):
    title, sector, skills, exp_level, budget, country = metadata[job_id]
    print(f"\nJob ID: {job_id}")
    print(f"Title: {title}")
    print(f"Sector: {sector}")
    print(f"Skills: {skills}")
    print(f"Experience Level: {exp_level}")
    print(f"Budget: ${budget:,.2f}" if pd.notna(budget) else "Budget: Not specified")
    print(f"Country: {country}")

# Print description for first job
print("\nDescription for first job:")
print("-" * 80)
print(data.get_job_description(0))
print("-" * 80)


Total number of jobs: 187

Metadata for first 3 jobs:

Job ID: 0
Title: Kayak Bass Fishing Logo Design
Sector: Design & Creative
Skills: Abstract Logo,Brand Identity,Emblem Logo,Graphic Design,Logo Animation,Logo Design,Logo Transparency,Logo Usage Guidelines,Logotype,Minimalist
Experience Level: Intermediate
Budget: Not specified
Country: United States

Job ID: 1
Title: C# Native AOT console app
Sector: Web, Mobile & Software Dev
Skills: C#
Experience Level: Intermediate
Budget: Not specified
Country: United States

Job ID: 2
Title: Italian typist
Sector: Admin Support
Skills: Data Entry,Italian
Experience Level: Cheap/Inexperienced
Budget: Not specified
Country: United States

Description for first job:
--------------------------------------------------------------------------------
My website is focused on kayak bass fishing, so I was hoping that you could create a simple modern logo of either a kayak or largemouth bass. Please let me know if you would be able to do this. I've atta

# Simple Agent

We create a simple agent.  This agent is just a set of OpenAI LLM API calls wrapped in a Python class.

It is initialized with an API key and the API data class.

The agent then looks through all of the jobs using the `get_jobs_metadata` call.  Normally, an agent would decide whether or not to take each job.  In our case, a random number generator flips a coin to decide whether the agent takes on the job.

If the agent takes on the job, it feeds the job description to a gpt-4o API call, along with a system prompt about how it should attempt that job.

Finally, it will submit that job with `submit_job`.
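
The loop described above can be sketched without any network access by passing the LLM call in as a function. This is an illustration under assumptions, not the `BasicLLMAgent` implementation: `process_jobs` here is a standalone hypothetical function, and `complete` stands in for the gpt-4o call.

```python
import random

def process_jobs(metadata, get_description, submit, complete, seed=0):
    """Flip a coin for every job; attempt and submit the ones that are taken."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    taken = []
    for job_id in metadata:
        if rng.random() < 0.5:  # stand-in for a real take-or-skip decision
            continue
        deliverable = complete(get_description(job_id))
        submit(job_id, deliverable)
        taken.append(job_id)
    return taken

# Hypothetical data and a fake "LLM" that just echoes the description.
descriptions = {i: f"placeholder description {i}" for i in range(10)}
submissions = {}
taken = process_jobs(
    metadata=descriptions.keys(),
    get_description=descriptions.__getitem__,
    submit=submissions.__setitem__,
    complete=lambda text: "DRAFT: " + text,
)
print(f"took {len(taken)} of {len(descriptions)} jobs")
```

Replacing the coin flip with a function of the job metadata is the natural next step for an agent that actually chooses its work.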

In [7]:
# Part 1: Initializing the Agent with Access to the API

from dotenv import load_dotenv
import os
load_dotenv()  # take environment variables

from agent.basic_llm import BasicLLMAgent
from api.data import AgentArenaData

csv_data = "data/df_randomized_feasible_cleaned.csv"

# Initialize the Agent Arena API
API = AgentArenaData(csv_data)

# Initializing the Agent with the API
agent = BasicLLMAgent(os.getenv("OPENAI_API_KEY"), API)

In [8]:
# Part 2: Processing all the jobs in the Arena and Submitting any Deliverables

agent.process_jobs("output/")

Agent decided to take job 0: Kayak Bass Fishing Logo Design
Successfully processed job 0. Output saved to: output/output_simpleLLM_0.txt
Agent decided to take job 1: C# Native AOT console app
Successfully processed job 1. Output saved to: output/output_simpleLLM_1.txt
Agent decided not to take job 2: Italian typist
Agent decided to take job 3: Fix Laravel API for Firebase Social & Apple Login
Successfully processed job 3. Output saved to: output/output_simpleLLM_3.txt
Agent decided not to take job 4: Improve speed loading time and Core Web Vitals
Agent decided to take job 5: Webflow site build
Successfully processed job 5. Output saved to: output/output_simpleLLM_5.txt
Agent decided not to take job 6: Data entry | Download files and compress a file
Agent decided to take job 7: Configure pppoe client via pppd and site to site layer 2 openvpn all via terminal
Successfully processed job 7. Output saved to: output/output_simpleLLM_7.txt
Agent decided to take job 8: Add custom domain in for

# Evaluating Submitted Results

We now take all of the outputs and verify them.  We do this with a very simple wrapper around an LLM that looks at each output file and determines whether the output deserves payment.  However, this can be done in several other ways:
- By human verification (using freelancers who are experts in this area)
- By using a well-trained LLM (i.e. Uma trained on this task)
- By the client themselves.
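
Since the verdict can come from an LLM, a freelancer, or the client, one way to structure the evaluator is around an interchangeable judge function. The sketch below is hypothetical and is not the `SimpleVerifier` class; `evaluate_outputs` and the trivial `non_empty_judge` exist only to show the shape, and the "win"/"fail" labels follow the results file shown later in this notebook.

```python
def evaluate_outputs(outputs, get_description, judge):
    """Map each job ID to "win" or "fail" using judge(description, output)."""
    return {
        job_id: "win" if judge(get_description(job_id), output) else "fail"
        for job_id, output in outputs.items()
    }

def non_empty_judge(description, output):
    """Toy stand-in for an LLM or human verdict: any non-blank output passes."""
    return bool(output.strip())

# Hypothetical descriptions and agent outputs.
descriptions = {0: "placeholder description 0", 3: "placeholder description 3"}
outputs = {0: "Here is a proposed deliverable.", 3: "   "}
results = evaluate_outputs(outputs, descriptions.__getitem__, non_empty_judge)
print(results)
```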

In [9]:
# Part 1: We initialize our evaluator
from dotenv import load_dotenv
import os
load_dotenv()  # take environment variables

from agent.verifier_simple import SimpleVerifier
from api.data import AgentArenaData

csv_data = "data/df_randomized_feasible_cleaned.csv"

# Initialize the Agent Arena API
API = AgentArenaData(csv_data)

# Initializing the evaluator
verifier = SimpleVerifier(os.getenv("OPENAI_API_KEY"), API)

In [10]:
# Part 2: Evaluating all Outputs

verifier.process_outputs("output/")


Job: RTSP and WebRTC bug fix (ID: 46)
Agent: simpleLLM
Result: ✗ FAILURE
--------------------------------------------------------------------------------

Job: Tech pack for boxers/ men’s underwear  (ID: 52)
Agent: simpleLLM
Result: ✓ SUCCESS
--------------------------------------------------------------------------------

Job: Making Llama (LLM) Labeling Model work (ID: 148)
Agent: simpleLLM
Result: ✗ FAILURE
--------------------------------------------------------------------------------

Job: One week Job - Code Simple Landing page (ID: 85)
Agent: simpleLLM
Result: ✗ FAILURE
--------------------------------------------------------------------------------

Job: Ethereum Gas Testing Proof-of-Concept (ID: 91)
Agent: simpleLLM
Result: ✗ FAILURE
--------------------------------------------------------------------------------

Job: Pdf form field editor (ID: 90)
Agent: simpleLLM
Result: ✓ SUCCESS
--------------------------------------------------------------------------------

Job: Websi

In [11]:
# Let's take a look at the results of the evaluation

import pandas as pd

csv_results = "output/results.csv"

df = pd.read_csv(csv_results)
df.head()

Unnamed: 0,jobID,simpleLLM
0,0,fail
1,1,win
2,3,win
3,5,fail
4,7,fail


# Metrics

The final aspect of the arena is synthesizing the results from the different competing agents, together with data collected from the API (such as time to job completion or the price of running the agent).  This will allow us to analyze the different models and present metrics that showcase the strengths and weaknesses of each agent.
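
As a small example of one such metric, the sketch below computes a per-agent win rate from rows shaped like the results table above (a `jobID` column plus one column per agent). The rows are only the five shown by `df.head()`, so the number is illustrative and not the arena-wide result; `win_rates` is a hypothetical helper.

```python
def win_rates(rows, agents):
    """Return {agent: fraction of attempted jobs marked "win"} (None if no attempts)."""
    rates = {}
    for agent in agents:
        attempted = [row[agent] for row in rows if row.get(agent) in ("win", "fail")]
        wins = sum(result == "win" for result in attempted)
        rates[agent] = wins / len(attempted) if attempted else None
    return rates

# The five rows displayed above, not the full results file.
rows = [
    {"jobID": 0, "simpleLLM": "fail"},
    {"jobID": 1, "simpleLLM": "win"},
    {"jobID": 3, "simpleLLM": "win"},
    {"jobID": 5, "simpleLLM": "fail"},
    {"jobID": 7, "simpleLLM": "fail"},
]
rates = win_rates(rows, ["simpleLLM"])
print(f"simpleLLM win rate on these rows: {rates['simpleLLM']:.0%}")
```

With more agents, each one becomes another column, and the same function can be combined with API-side data such as completion time or run cost.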