# End-to-end run of Agent Arena v1

This Jupyter notebook walks through all the components of Agent Arena.

![Main Stages of Agent Arena](media/AgentArenaParts.jpg)

The notebook showcases every stage of Agent Arena end to end, but the implementation is still quite hacky.

# Building the Agent Arena Dataset

The first step is building the dataset: pulling the right data from Upwork's databases and splitting it out for the arena.

Several design questions shape the Agent Arena dataset.  Under each question, we list the options we considered, with the one we took for this demo in bold:
- How will we store the dataset?
  - SQL Database
  - **a job data .csv that contains filepaths to a bucket of attached files**
- How will we sample the data from Upwork's database?
  - **random uniform sample**
  - sample based on time
  - sample based on sector/category (like ontological category)
  - sample based on difficulty of task
  - sample based on skillset needed
  - sample based on budget
- How do we filter for feasibility of the job?
  - We don't.
  - **We use an off-the-shelf LLM with a prompt and a prayer.**
  - We train Uma to filter for feasibility.
  - We single-verify with a freelancer.
  - We triple-verify with a freelancer.
- Do we include deliverables data?
  - Yes
  - **No.**
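
As a rough illustration of the bolded choices above, a random uniform sample can be drawn with pandas' `DataFrame.sample`. This is only a sketch on toy data: the `sample_jobs` helper and the `ATTACHMENT_PATHS` column are hypothetical names, not part of the current pipeline.

```python
import pandas as pd


def sample_jobs(df: pd.DataFrame, n: int, seed: int = 0) -> pd.DataFrame:
    """Draw n rows uniformly at random, without replacement."""
    return df.sample(n=n, random_state=seed).reset_index(drop=True)


# Toy stand-in for the full job table; the real rows come from Upwork's database.
jobs = pd.DataFrame({
    "POST_KEY": range(100),
    "TITLE": [f"job {i}" for i in range(100)],
    # Hypothetical column: paths into a bucket of attached files.
    "ATTACHMENT_PATHS": [""] * 100,
})

sampled = sample_jobs(jobs, n=10, seed=42)
# sampled.to_csv("data/df_randomized.csv") would then produce the file used below.
```

Fixing the seed keeps the sample reproducible, which matters if several agents are meant to see the same arena.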

Also important to note: as of the current date (3/25/25), I (darvin) don't have the ability to run SQL queries against Snowflake, and we as a team don't have a way to find the files attached to the jobs.  The code below reflects both limitations.

## The biggest gaps (rank ordered)
1. **We do not have attachment data.**
2. **We do not have deliverables data.**
3. **We need a stronger pipeline of pulling data.**
4. **We need to verify feasibility with human freelancers.**
5. **We need to train Uma to learn feasibility.**

In [3]:
# Pulling the data: this is all fake/pseudo-code that represents what we do today.
# When this is all done, let's say we end up with 1000 rows of random uniform data in a .csv file in data/.
# We will save this as data/df_randomized.csv

# import snowflake_database
# import ted
# ted.pull_data(snowflake_database, "random_uniform")

import pandas as pd

csv_orig = "data/df_randomized.csv"

df = pd.read_csv(csv_orig)
df.head()

Unnamed: 0,POST_KEY,TITLE,DESCRIPTION,SECTOR,EXPERIENCE_LEVEL,CLIENT_RATING,PROJECTED_VALUE,PROJECTED_VALUE_CATEGORY,IS_HOURLY,HOURLY_LOW,HOURLY_HIGH,BUDGET,COUNTRY,LANGUAGE,SKILLS_AND_EXPERTISE,POST_DATE
0,63107529,Need shopify backend expert for selecting my p...,This is going to be 5 min work for an expert ....,"Web, Mobile & Software Dev",Expert/Expensive,,83.195333,VLV,False,,,40.0,United States,,Shopify,2024-01-02
1,63114550,Presentation Pitch Deck,Hi there\n\nI want to have a series of calls e...,Writing,Expert/Expensive,,415.515007,LV,False,,,300.0,United Kingdom,,"Business Presentation,Marketing Presentation,M...",2024-01-02
2,63120800,Video creation and editing,\nI would like videos that mother is making to...,Design & Creative,Intermediate,,312.69655,LV,False,,,5.0,Australia,,"Adobe After Effects,Adobe Premiere Pro,Cinemat...",2024-01-02
3,63095881,Real Estate Cold Calling Lead Manager,We seek a skilled and experienced cold-calling...,Sales & Marketing,Intermediate,,1657.371981,HV,True,6.0,10.0,,United States,,"Cold Calling,Communications,Lead Generation,Sa...",2024-01-01
4,63108525,Intercom Ticketing Software Setup and Automation,We are seeking an experienced professional to ...,Sales & Marketing,Intermediate,,844.183541,MV,True,15.0,75.0,,United States,,"API,Chatbot Development,Intercom,Python,Zendesk",2024-01-02


In [4]:
from utils.create_feasible_dataset import filter_csv_for_feasible_jobs

csv_feasible = "data/df_randomized_feasible.csv"

filter_csv_for_feasible_jobs(csv_orig, csv_feasible) #This can be manually done with Freelancers.

Reading CSV file...
Analyzing job descriptions...


100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [11:41<00:00,  1.43it/s]

Saving results...
Done! Found 188 feasible jobs out of 1000 total jobs analyzed.
Results saved to: data/df_randomized_feasible.csv
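
The internals of `filter_csv_for_feasible_jobs` are not shown here. A minimal sketch of the same idea, assuming the judge is passed in as a plain function, might look like the following. `filter_feasible` and `keyword_judge` are hypothetical names, and the keyword check is only a stand-in for the LLM prompt.

```python
from typing import Callable

import pandas as pd


def filter_feasible(df: pd.DataFrame, is_feasible: Callable[[str], bool]) -> pd.DataFrame:
    """Keep only the rows whose DESCRIPTION the judge marks as feasible."""
    mask = df["DESCRIPTION"].fillna("").map(is_feasible)
    return df[mask].reset_index(drop=True)


def keyword_judge(description: str) -> bool:
    # Stand-in for the LLM call: reject jobs that need calls or physical presence.
    blocked = ("phone call", "on-site", "in person")
    text = description.lower()
    return not any(phrase in text for phrase in blocked)


jobs = pd.DataFrame({
    "TITLE": ["Logo design", "Cold calling", "Data entry"],
    "DESCRIPTION": ["Create a simple logo.", "Make a phone call to each lead.", None],
})

feasible = filter_feasible(jobs, keyword_judge)
print(f"Found {len(feasible)} feasible jobs out of {len(jobs)} total jobs analyzed.")
```

Taking the judge as a parameter means the LLM, a trained model, or a freelancer's verdicts could be swapped in without touching the filtering code.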





In [5]:
from utils.clean_up_csv import clean_csv

csv_data = "data/df_randomized_feasible_cleaned.csv"

clean_csv(csv_feasible, csv_data)

Successfully saved cleaned data to data/df_randomized_feasible_cleaned.csv
Columns saved in order: ID, TITLE, DESCRIPTION, SECTOR, SKILLS_AND_EXPERTISE, EXPERIENCE_LEVEL, CLIENT_RATING, IS_HOURLY, HOURLY_LOW, HOURLY_HIGH, BUDGET, COUNTRY, LANGUAGE, POST_DATE
Number of actionable tasks saved: 188
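
Judging only from the column order printed above, the clean-up appears to drop the columns agents should not see and renumber the jobs from 0. The sketch below reproduces that behavior on toy data. `clean_jobs` is a hypothetical name and the real `clean_csv` may do more.

```python
import pandas as pd

# Columns kept after ID, in the order printed by clean_csv above.
KEEP = [
    "TITLE", "DESCRIPTION", "SECTOR", "SKILLS_AND_EXPERTISE", "EXPERIENCE_LEVEL",
    "CLIENT_RATING", "IS_HOURLY", "HOURLY_LOW", "HOURLY_HIGH", "BUDGET",
    "COUNTRY", "LANGUAGE", "POST_DATE",
]


def clean_jobs(df: pd.DataFrame) -> pd.DataFrame:
    """Select the kept columns and add a fresh 0-based ID column in front."""
    cleaned = df[KEEP].reset_index(drop=True)
    cleaned.insert(0, "ID", list(range(len(cleaned))))
    return cleaned


# Toy input with placeholder values plus two columns the clean-up drops.
raw = pd.DataFrame({col: ["a", "b"] for col in KEEP})
raw["POST_KEY"] = [63107529, 63114550]
raw["PROJECTED_VALUE"] = [83.195333, 415.515007]

cleaned = clean_jobs(raw)
```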


In [6]:
df = pd.read_csv(csv_data)
df.head()

Unnamed: 0,ID,TITLE,DESCRIPTION,SECTOR,SKILLS_AND_EXPERTISE,EXPERIENCE_LEVEL,CLIENT_RATING,IS_HOURLY,HOURLY_LOW,HOURLY_HIGH,BUDGET,COUNTRY,LANGUAGE,POST_DATE
0,0,Kayak Bass Fishing Logo Design,"My website is focused on kayak bass fishing, s...",Design & Creative,"Abstract Logo,Brand Identity,Emblem Logo,Graph...",Intermediate,9.0,True,0.0,0.0,,United States,,2024-01-02
1,1,C# Native AOT console app,I need someone to get a c# native AOT console...,"Web, Mobile & Software Dev",C#,Intermediate,,True,16.0,35.0,,United States,,2024-01-01
2,2,Italian typist,Hello - I am looking for someone to retype on ...,Admin Support,"Data Entry,Italian",Cheap/Inexperienced,,True,5.0,7.0,,United States,,2024-01-01
3,3,Fix Laravel API for Firebase Social & Apple Login,Need help with fixing bugs in Laravel based Re...,"Web, Mobile & Software Dev","API,API Development,API Integration,Firebase,L...",Intermediate,,False,,,50.0,India,,2024-01-02
4,4,Improve speed loading time and Core Web Vitals,"Hello,\nWe are looking for someone who is skil...","Web, Mobile & Software Dev","CSS,JavaScript,Page Speed Optimization,PHP,Web...",Expert/Expensive,0.0,False,,,200.0,United States,,2024-01-02


# Lightweight API

We can provide a lightweight API for agents to use to pull this data.  In this notebook demo, a VERY lightweight `AgentArenaData` object acts as a stand-in for an API that we will serve one day.

In this demo, there are four main API calls:
- get_num_jobs: returns the total number of jobs in the arena
- get_jobs_metadata: returns a dictionary of jobID keys mapped to key metadata of the jobs
- get_job_description: given a jobID, returns the full job description and attachments (**NOTE: WE DON'T HAVE ATTACHMENTS YET SO THIS IS NOT BUILT.**)
- submit_job: this is a very simple function to help save the output string as a .txt file.  For the real API, we probably want to design something stronger with output directories to handle any complexity of files being the output of the agent's work.

The assumed agent behavior loop is that it will look over all the jobs with `get_jobs_metadata`, and then pick jobs that it thinks it can do, and then it will get the details of that job through `get_job_description`.

It is important to note that this is not more secure than just passing the full .csv.  However, it keeps the Upwork platform in the loop of the agents competing in the arena.  It also lets us keep track of when the API is being triggered and thus which jobs the agents are currently looking at.
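
To make the shape of these calls concrete, here is a minimal sketch of an object with the same four methods. It is not the actual `AgentArenaData` implementation: the class name `ArenaDataSketch`, the in-memory DataFrame input, the `submit_job` signature, and the `call_log` list are all assumptions for illustration. The metadata tuple order follows how the demo cell below unpacks it.

```python
import os

import pandas as pd


class ArenaDataSketch:
    """Hypothetical stand-in for AgentArenaData, backed by an in-memory DataFrame."""

    METADATA_COLS = ["TITLE", "SECTOR", "SKILLS_AND_EXPERTISE",
                     "EXPERIENCE_LEVEL", "BUDGET", "COUNTRY"]

    def __init__(self, df: pd.DataFrame):
        self._df = df.set_index("ID")
        self.call_log = []  # (method name, job ID or None) for every call

    def get_num_jobs(self) -> int:
        self.call_log.append(("get_num_jobs", None))
        return len(self._df)

    def get_jobs_metadata(self) -> dict:
        self.call_log.append(("get_jobs_metadata", None))
        rows = self._df[self.METADATA_COLS].iterrows()
        return {job_id: tuple(row) for job_id, row in rows}

    def get_job_description(self, job_id: int) -> str:
        self.call_log.append(("get_job_description", job_id))
        return self._df.loc[job_id, "DESCRIPTION"]

    def submit_job(self, job_id: int, output: str, agent_name: str, output_dir: str) -> str:
        # Assumed signature; the file name mirrors the output_<agent>_<id>.txt pattern.
        self.call_log.append(("submit_job", job_id))
        os.makedirs(output_dir, exist_ok=True)
        path = os.path.join(output_dir, f"output_{agent_name}_{job_id}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(output)
        return path


# Toy rows; the descriptions are placeholders, not real job text.
toy = pd.DataFrame({
    "ID": [0, 1],
    "TITLE": ["Kayak Bass Fishing Logo Design", "C# Native AOT console app"],
    "DESCRIPTION": ["placeholder description 0", "placeholder description 1"],
    "SECTOR": ["Design & Creative", "Web, Mobile & Software Dev"],
    "SKILLS_AND_EXPERTISE": ["Logo Design", "C#"],
    "EXPERIENCE_LEVEL": ["Intermediate", "Intermediate"],
    "BUDGET": [float("nan"), float("nan")],
    "COUNTRY": ["United States", "United States"],
})
api = ArenaDataSketch(toy)
```

Recording every call in `call_log` is one simple way to see which jobs an agent inspected, which is the tracking benefit described above.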

## The biggest gaps (rank ordered)
1. **We need to create a real API that's served.**
2. **We need to create logging associated with API calls.**
3. **We need to design an actually good API.**

In [8]:
from api.data import AgentArenaData

csv_data = "data/df_randomized_feasible_cleaned.csv"

# Initialize the data
data = AgentArenaData(csv_data)

# Print total number of jobs
print(f"\nTotal number of jobs: {data.get_num_jobs():,}")

# Print metadata for first 3 jobs
print("\nMetadata for first 3 jobs:")
metadata = data.get_jobs_metadata()
for job_id in range(3):
    title, sector, skills, exp_level, budget, country = metadata[job_id]
    print(f"\nJob ID: {job_id}")
    print(f"Title: {title}")
    print(f"Sector: {sector}")
    print(f"Skills: {skills}")
    print(f"Experience Level: {exp_level}")
    print(f"Budget: ${budget:,.2f}" if pd.notna(budget) else "Budget: Not specified")
    print(f"Country: {country}")

# Print description for first job
print("\nDescription for first job:")
print("-" * 80)
print(data.get_job_description(0))
print("-" * 80)


Total number of jobs: 188

Metadata for first 3 jobs:

Job ID: 0
Title: Kayak Bass Fishing Logo Design
Sector: Design & Creative
Skills: Abstract Logo,Brand Identity,Emblem Logo,Graphic Design,Logo Animation,Logo Design,Logo Transparency,Logo Usage Guidelines,Logotype,Minimalist
Experience Level: Intermediate
Budget: Not specified
Country: United States

Job ID: 1
Title: C# Native AOT console app
Sector: Web, Mobile & Software Dev
Skills: C#
Experience Level: Intermediate
Budget: Not specified
Country: United States

Job ID: 2
Title: Italian typist
Sector: Admin Support
Skills: Data Entry,Italian
Experience Level: Cheap/Inexperienced
Budget: Not specified
Country: United States

Description for first job:
--------------------------------------------------------------------------------
My website is focused on kayak bass fishing, so I was hoping that you could create a simple modern logo of either a kayak or largemouth bass. Please let me know if you would be able to do this. I've atta

# Simple Agent

We create the world's worst agent.  This agent is just a set of OpenAI LLM API calls wrapped in a Python class.

It is initialized with an API key and the API data class.

The "agent" will then look through all of the jobs using the `get_jobs_metadata` method.  Normally, an agent would decide for itself whether or not to take a job.  In our case, we simply use a random number generator to flip a coin on whether our agent takes each job.

If the agent takes on the job, it simply feeds the job description through a gpt-4o API call with a system prompt about how it should attempt to do that job.

Finally, it will submit that job with `submit_job`.
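
The loop described above can be sketched without any network calls. This is not the `BasicLLMAgent` code: `run_coin_flip_agent` and `fake_llm` are hypothetical names, the jobs are passed in as a plain dictionary, and the model call is replaced by an injected function so the example runs offline.

```python
import random
from typing import Callable, Dict


def run_coin_flip_agent(
    job_descriptions: Dict[int, str],
    complete: Callable[[str], str],
    rng: random.Random,
) -> Dict[int, str]:
    """Flip a coin per job; for accepted jobs, return the model output keyed by job ID."""
    outputs = {}
    for job_id, description in job_descriptions.items():
        if rng.random() < 0.5:
            print(f"Agent decided not to take job {job_id}")
            continue
        print(f"Agent decided to take job {job_id}")
        outputs[job_id] = complete(description)
    return outputs


def fake_llm(description: str) -> str:
    # Stand-in for the real chat-completion call.
    return f"[draft deliverable for: {description}]"


# Placeholder descriptions, not real job text.
jobs = {0: "design a logo", 1: "fix a console app", 2: "retype a document"}
results = run_coin_flip_agent(jobs, fake_llm, random.Random(0))
```

Passing in a seeded `random.Random` makes the accept/decline pattern repeatable between runs, which helps when comparing graders on the same set of submissions.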

In [1]:
from dotenv import load_dotenv
import os
load_dotenv()  # take environment variables

from agent.basic_llm import BasicLLMAgent
from api.data import AgentArenaData

csv_data = "data/df_randomized_feasible_cleaned.csv"

# Initialize the data
data = AgentArenaData(csv_data)

agent = BasicLLMAgent(os.getenv("OPENAI_API_KEY"), data)
agent.process_jobs("output/")

Agent decided not to take job 0: Kayak Bass Fishing Logo Design
Agent decided not to take job 1: C# Native AOT console app
Agent decided to take job 2: Italian typist
Successfully processed job 2. Output saved to: output/output_gpt-4_2.txt
Agent decided to take job 3: Fix Laravel API for Firebase Social & Apple Login
Successfully processed job 3. Output saved to: output/output_gpt-4_3.txt
Agent decided not to take job 4: Improve speed loading time and Core Web Vitals
Agent decided to take job 5: Webflow site build
Successfully processed job 5. Output saved to: output/output_gpt-4_5.txt
Agent decided to take job 6: Data entry | Download files and compress a file
Successfully processed job 6. Output saved to: output/output_gpt-4_6.txt
Agent decided not to take job 7: Configure pppoe client via pppd and site to site layer 2 openvpn all via terminal
Agent decided to take job 8: Add custom domain in form & webhook settings in self-hosted n8n
Successfully processed job 8. Output saved to: ou

# Grading Submitted Results

We will now take all of the outputs and verify them.  We do this using a very simple wrapper around an LLM that looks at each output file and determines whether the output deserves payment or not.  However, this can be done in several other ways:
- By human verification (using freelancers who are experts in this area)
- By using a well-trained LLM (i.e. Uma trained on this task)
- By the client themselves.
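
Because the grader could be an LLM, a freelancer, or the client, it helps to keep the judging step pluggable. The sketch below assumes output files follow the `output_<agent>_<id>.txt` naming seen in the agent run above. `parse_output_name`, `grade_outputs`, and `nonempty_judge` are hypothetical names, and the non-empty check is only a placeholder for a real judge.

```python
import re
from typing import Callable, Dict, Tuple

# Matches names such as "output_gpt-4_2.txt".
OUTPUT_NAME = re.compile(r"^output_(?P<agent>.+)_(?P<job_id>\d+)\.txt$")


def parse_output_name(filename: str) -> Tuple[str, int]:
    """Split an output file name into (agent name, job ID)."""
    match = OUTPUT_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected output file name: {filename}")
    return match.group("agent"), int(match.group("job_id"))


def grade_outputs(
    outputs: Dict[str, str],
    judge: Callable[[int, str], bool],
) -> Dict[Tuple[str, int], str]:
    """Label each submitted output 'win' or 'fail' using the supplied judge."""
    results = {}
    for filename, text in outputs.items():
        agent, job_id = parse_output_name(filename)
        results[(agent, job_id)] = "win" if judge(job_id, text) else "fail"
    return results


def nonempty_judge(job_id: int, text: str) -> bool:
    # Placeholder judge: any non-blank output counts as a success.
    return bool(text.strip())


graded = grade_outputs(
    {"output_gpt-4_2.txt": "retyped text", "output_gpt-4_3.txt": "   "},
    nonempty_judge,
)
```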

In [3]:
from dotenv import load_dotenv
import os
load_dotenv()  # take environment variables

from agent.verifier_simple import SimpleVerifier
from api.data import AgentArenaData

csv_data = "data/df_randomized_feasible_cleaned.csv"

# Initialize the data
data = AgentArenaData(csv_data)

# Verifying the Output
verifier = SimpleVerifier(os.getenv("OPENAI_API_KEY"), data)
verifier.process_outputs("output/")


Job: Re-create this poster in hi res print format (ID: 105)
Agent: gpt-4
Result: ✗ FAILURE
--------------------------------------------------------------------------------

Job:  developer to create a receipt generation system (ID: 111)
Agent: gpt-4
Result: ✓ SUCCESS
--------------------------------------------------------------------------------

Job: Data Entry- Provide entry from a restaurant menu into another (ID: 110)
Agent: gpt-4
Result: ✗ FAILURE
--------------------------------------------------------------------------------

Job: Telegram Bot to post new token creation alerts and liquidity pool additions on Eth and BNB (ID: 12)
Agent: gpt-4
Result: ✓ SUCCESS
--------------------------------------------------------------------------------

Job: Designing a  wiring part on detected symbols of electrical diagrams. (ID: 10)
Agent: gpt-4
Result: ✓ SUCCESS
--------------------------------------------------------------------------------

Job: Embroidery Digitizing  (ID: 38)
Agent: g

In [5]:
import pandas as pd

csv_results = "output/results.csv"

df = pd.read_csv(csv_results)
df.head()

Unnamed: 0.1,Unnamed: 0,gpt-4
0,2,fail
1,3,win
2,5,fail
3,6,fail
4,8,win
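
A per-agent win rate falls out of this table directly. The sketch below uses only the five rows shown by `df.head()` above, so the number it computes describes those rows and not the full `results.csv`. The `job_id` column name is an assumption for the unnamed index column, and `win_rate` is a hypothetical helper.

```python
import pandas as pd

# The five rows printed above; job_id is an assumed name for the unnamed column.
head = pd.DataFrame({
    "job_id": [2, 3, 5, 6, 8],
    "gpt-4": ["fail", "win", "fail", "fail", "win"],
})


def win_rate(results: pd.DataFrame, agent: str) -> float:
    """Fraction of graded jobs that the given agent won."""
    return float((results[agent] == "win").mean())


rate = win_rate(head, "gpt-4")
print(f"gpt-4 win rate on these five rows: {rate:.0%}")
```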
