PLUGH

This repository contains the supplementary code for the paper "PLUGH: A Benchmark for Spatial Understanding and Reasoning in Large Language Models," accepted at the Wordplay Workshop at ACL 2024.

Abstract

We present PLUGH*, a modern benchmark that currently consists of 5 tasks, each with 125 input texts extracted from 48 different games and representing 61 different (non-isomorphic) spatial graphs to assess the abilities of Large Language Models (LLMs) for spatial understanding and reasoning. Our evaluation of API-based and open-sourced LLMs shows that while some commercial LLMs exhibit strong reasoning abilities, open-sourced competitors can demonstrate almost the same level of quality; however, all models still have significant room for improvement. We identify typical reasons for LLM failures and discuss possible ways to deal with them. Datasets and evaluation code are released.

Content

plugh.json -- the benchmark data.
sample_responses.py -- queries models via the OpenAI API, or generates the prompts so that you can query any custom model yourself.
plugh.responses.json -- previously sampled responses for several models mentioned in the paper.
calc_metrics.py -- parses the responses and calculates the benchmark metrics.
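
Before running the scripts, it can help to look at the top-level structure of the two JSON files. The following is a minimal sketch for that purpose only. The helper name summarize_json is hypothetical and not part of this repository, and the sketch makes no assumption about the schema of plugh.json or plugh.responses.json beyond their being valid JSON.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Describe the top-level structure of a JSON file (hypothetical helper)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Report the number of keys and a few of them, sorted for stable output.
        keys = sorted(data, key=str)[:5]
        return {"type": "dict", "size": len(data), "sample_keys": keys}
    if isinstance(data, list):
        return {"type": "list", "size": len(data), "sample_keys": []}
    # Scalars (string, number, bool, null) have no inner structure to report.
    return {"type": type(data).__name__, "size": 1, "sample_keys": []}


if __name__ == "__main__":
    # File names are taken from the Content list above.
    for name in ("plugh.json", "plugh.responses.json"):
        if Path(name).exists():
            print(name, summarize_json(name))
```

Run it from the repository root so that both files are found. Files that are missing are skipped silently.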

Results

Running python calc_metrics.py -i plugh.json -r plugh.responses.json should produce the following report:

# task1

## task1_strict_nodes_f1
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  55.2%     66.6%     70.7%     73.5%
openai:gpt-4-turbo-2024-04-09         113  73.2%     76.5%     76.4%     77.1%
anthropic:claude-3-opus-20240229      113  70.0%     77.5%     78.9%     81.4%
local:llama3_8b                       113  64.8%     60.1%     65.4%     67.1%
local:mixtral_8x7b                    113  25.2%     67.2%     69.0%     69.0%
local:llama3_70b                      113  68.0%     73.7%     74.0%     74.2%
local:mixtral_8x22b                   113  67.7%     75.3%     76.6%     75.4%
openai:gpt-4o-2024-05-13              113  71.7%     73.4%     74.7%     75.8%

## task1_strict_edges_f1
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  30.6%     42.0%     46.0%     49.5%
openai:gpt-4-turbo-2024-04-09         113  53.1%     56.4%     56.4%     57.4%
anthropic:claude-3-opus-20240229      113  50.7%     59.0%     60.4%     62.9%
local:llama3_8b                       113  41.1%     36.8%     40.3%     40.9%
local:mixtral_8x7b                    113  12.6%     41.4%     42.4%     44.6%
local:llama3_70b                      113  46.5%     53.1%     53.6%     54.2%
local:mixtral_8x22b                   113  45.8%     54.1%     56.2%     55.4%
openai:gpt-4o-2024-05-13              113  52.2%     53.7%     56.1%     57.4%

## task1_fuzzy_nodes_f1
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  64.6%     75.8%     81.3%     82.8%
openai:gpt-4-turbo-2024-04-09         113  82.9%     86.4%     86.3%     86.8%
anthropic:claude-3-opus-20240229      113  80.3%     87.1%     88.3%     90.7%
local:llama3_8b                       113  77.5%     70.3%     76.2%     78.6%
local:mixtral_8x7b                    113  29.2%     76.7%     79.1%     79.3%
local:llama3_70b                      113  78.9%     84.0%     85.4%     84.9%
local:mixtral_8x22b                   113  78.9%     85.2%     86.5%     84.7%
openai:gpt-4o-2024-05-13              113  82.0%     83.3%     84.3%     85.5%

## task1_fuzzy_edges_f1
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  41.4%     54.3%     60.4%     62.3%
openai:gpt-4-turbo-2024-04-09         113  67.1%     71.1%     71.3%     71.8%
anthropic:claude-3-opus-20240229      113  64.4%     74.3%     76.2%     79.0%
local:llama3_8b                       113  55.7%     50.9%     53.9%     57.2%
local:mixtral_8x7b                    113  17.0%     53.0%     55.6%     58.4%
local:llama3_70b                      113  61.6%     68.0%     70.1%     70.5%
local:mixtral_8x22b                   113  60.1%     68.5%     70.6%     68.8%
openai:gpt-4o-2024-05-13              113  66.8%     67.8%     70.8%     72.3%

## task1_strict_nodes_rec
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  73.7%     77.3%     76.8%     77.2%
openai:gpt-4-turbo-2024-04-09         113  76.5%     76.8%     75.6%     76.5%
anthropic:claude-3-opus-20240229      113  79.9%     81.4%     80.2%     81.1%
local:llama3_8b                       113  68.6%     70.1%     70.6%     70.7%
local:mixtral_8x7b                    113  30.5%     73.6%     73.0%     71.8%
local:llama3_70b                      113  75.6%     77.3%     76.4%     75.9%
local:mixtral_8x22b                   113  73.9%     77.1%     76.3%     74.2%
openai:gpt-4o-2024-05-13              113  78.8%     78.8%     78.7%     78.9%

## task1_strict_edges_rec
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  44.2%     52.5%     52.9%     54.8%
openai:gpt-4-turbo-2024-04-09         113  57.0%     56.7%     55.7%     56.9%
anthropic:claude-3-opus-20240229      113  58.0%     61.9%     61.1%     62.2%
local:llama3_8b                       113  48.2%     45.8%     45.9%     45.5%
local:mixtral_8x7b                    113  15.2%     46.9%     45.9%     48.2%
local:llama3_70b                      113  54.5%     57.9%     57.2%     56.5%
local:mixtral_8x22b                   113  51.1%     56.4%     56.0%     54.2%
openai:gpt-4o-2024-05-13              113  58.4%     58.2%     59.5%     60.0%

## task1_fuzzy_nodes_rec
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  87.6%     88.5%     88.7%     87.4%
openai:gpt-4-turbo-2024-04-09         113  86.7%     87.0%     85.4%     86.1%
anthropic:claude-3-opus-20240229      113  91.8%     91.5%     89.6%     90.6%
local:llama3_8b                       113  82.3%     82.0%     82.2%     83.1%
local:mixtral_8x7b                    113  35.3%     84.0%     83.7%     82.3%
local:llama3_70b                      113  88.2%     88.3%     88.3%     87.1%
local:mixtral_8x22b                   113  86.6%     87.4%     86.2%     83.5%
openai:gpt-4o-2024-05-13              113  90.4%     89.7%     88.8%     89.2%

## task1_fuzzy_edges_rec
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  62.2%     68.5%     70.5%     69.9%
openai:gpt-4-turbo-2024-04-09         113  71.9%     71.9%     70.4%     71.2%
anthropic:claude-3-opus-20240229      113  73.8%     78.0%     76.9%     78.2%
local:llama3_8b                       113  65.3%     64.1%     61.1%     63.6%
local:mixtral_8x7b                    113  21.2%     60.7%     60.6%     62.6%
local:llama3_70b                      113  72.5%     74.3%     74.7%     73.6%
local:mixtral_8x22b                   113  67.7%     71.6%     70.6%     67.5%
openai:gpt-4o-2024-05-13              113  75.1%     74.0%     75.1%     76.0%

## task1_strict_nodes_prec
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  50.0%     62.5%     68.5%     72.3%
openai:gpt-4-turbo-2024-04-09         113  72.4%     77.8%     78.7%     78.9%
anthropic:claude-3-opus-20240229      113  64.7%     75.6%     78.8%     82.6%
local:llama3_8b                       113  63.4%     55.6%     64.4%     66.5%
local:mixtral_8x7b                    113  23.0%     64.9%     67.9%     68.6%
local:llama3_70b                      113  63.8%     72.1%     73.1%     74.0%
local:mixtral_8x22b                   113  65.2%     75.1%     78.3%     78.2%
openai:gpt-4o-2024-05-13              113  67.6%     70.4%     72.7%     74.3%

## task1_strict_edges_prec
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  26.0%     37.1%     42.5%     46.6%
openai:gpt-4-turbo-2024-04-09         113  51.7%     57.3%     58.6%     59.2%
anthropic:claude-3-opus-20240229      113  47.0%     57.9%     60.9%     65.1%
local:llama3_8b                       113  37.1%     32.6%     38.0%     38.8%
local:mixtral_8x7b                    113  11.4%     39.0%     41.0%     43.0%
local:llama3_70b                      113  41.9%     50.5%     51.7%     53.1%
local:mixtral_8x22b                   113  43.4%     53.4%     57.8%     58.3%
openai:gpt-4o-2024-05-13              113  48.8%     51.6%     54.7%     56.6%

## task1_fuzzy_nodes_prec
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  58.1%     71.0%     78.4%     81.2%
openai:gpt-4-turbo-2024-04-09         113  81.9%     87.6%     88.9%     88.9%
anthropic:claude-3-opus-20240229      113  74.1%     84.9%     88.2%     92.1%
local:llama3_8b                       113  75.9%     65.0%     75.0%     77.6%
local:mixtral_8x7b                    113  26.7%     74.1%     77.7%     79.1%
local:llama3_70b                      113  73.9%     82.1%     84.4%     84.6%
local:mixtral_8x22b                   113  75.8%     84.9%     88.3%     87.6%
openai:gpt-4o-2024-05-13              113  77.1%     79.6%     82.0%     83.7%

## task1_fuzzy_edges_prec
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  34.7%     48.2%     55.6%     58.5%
openai:gpt-4-turbo-2024-04-09         113  65.4%     72.0%     74.1%     74.0%
anthropic:claude-3-opus-20240229      113  59.6%     73.0%     77.0%     81.5%
local:llama3_8b                       113  50.8%     44.8%     51.1%     54.2%
local:mixtral_8x7b                    113  15.3%     49.7%     53.7%     56.8%
local:llama3_70b                      113  55.6%     64.7%     67.8%     69.1%
local:mixtral_8x22b                   113  56.8%     67.6%     72.6%     72.1%
openai:gpt-4o-2024-05-13              113  62.4%     64.7%     69.1%     71.0%


# task2a

## task2a_strict_distance
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  38.0%     31.6%     29.7%     28.3%
openai:gpt-4-turbo-2024-04-09         113  26.1%     12.8%     12.2%     11.7%
anthropic:claude-3-opus-20240229      113  34.6%     11.2%     10.5%     10.0%
local:llama3_8b                       113  80.6%     65.8%     53.7%     54.0%
local:mixtral_8x7b                    113  61.4%     46.6%     47.7%     46.3%
local:llama3_70b                      113  58.7%     15.1%     14.5%     15.9%
local:mixtral_8x22b                   113  34.7%     19.7%     16.9%     18.1%
openai:gpt-4o-2024-05-13              113  21.5%     10.5%     10.0%     9.5%

## task2a_fuzzy_distance
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  33.6%     30.5%     28.5%     28.2%
openai:gpt-4-turbo-2024-04-09         113  16.2%     12.8%     11.9%     11.7%
anthropic:claude-3-opus-20240229      113  19.2%     11.1%     10.5%     10.0%
local:llama3_8b                       113  38.7%     51.3%     44.1%     46.4%
local:mixtral_8x7b                    113  35.9%     36.6%     38.2%     37.6%
local:llama3_70b                      113  23.4%     14.9%     14.4%     15.8%
local:mixtral_8x22b                   113  18.7%     17.7%     16.4%     17.6%
openai:gpt-4o-2024-05-13              113  12.7%     9.5%      9.3%      9.3%


# task2b

## task2b_strict_distance
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  64.5%     61.0%     58.8%     59.6%
openai:gpt-4-turbo-2024-04-09         113  22.8%     18.2%     15.2%     15.5%
anthropic:claude-3-opus-20240229      113  31.1%     23.6%     21.1%     21.0%
local:llama3_8b                       113  90.7%     69.6%     67.1%     66.1%
local:mixtral_8x7b                    113  83.7%     60.7%     63.2%     60.6%
local:llama3_70b                      113  59.9%     32.2%     30.3%     32.9%
local:mixtral_8x22b                   113  63.0%     40.9%     39.8%     39.0%
openai:gpt-4o-2024-05-13              113  21.8%     13.0%     11.7%     12.4%

## task2b_fuzzy_distance
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  64.5%     60.2%     58.4%     59.5%
openai:gpt-4-turbo-2024-04-09         113  21.7%     18.2%     15.2%     15.5%
anthropic:claude-3-opus-20240229      113  25.7%     23.6%     21.1%     21.0%
local:llama3_8b                       113  65.0%     63.6%     62.2%     61.7%
local:mixtral_8x7b                    113  60.8%     54.4%     57.2%     53.4%
local:llama3_70b                      113  37.2%     32.1%     30.2%     32.8%
local:mixtral_8x22b                   113  46.1%     40.8%     39.8%     38.8%
openai:gpt-4o-2024-05-13              113  19.4%     13.0%     11.7%     12.4%


# task3

## task3_strict_distance
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  31.0%     21.6%     21.2%     20.6%
openai:gpt-4-turbo-2024-04-09         113  18.6%     13.1%     12.4%     11.8%
anthropic:claude-3-opus-20240229      113  49.3%     23.4%     20.3%     17.4%
local:llama3_8b                       113  75.3%     52.1%     58.0%     73.8%
local:mixtral_8x7b                    113  64.7%     60.1%     51.5%     50.0%
local:llama3_70b                      113  56.3%     15.4%     16.2%     14.1%
local:mixtral_8x22b                   113  36.8%     18.1%     16.8%     14.4%
openai:gpt-4o-2024-05-13              113  45.9%     15.9%     13.2%     13.0%

## task3_fuzzy_distance
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  30.4%     21.6%     21.2%     20.6%
openai:gpt-4-turbo-2024-04-09         113  18.6%     13.1%     12.4%     11.8%
anthropic:claude-3-opus-20240229      113  37.7%     23.0%     20.2%     17.4%
local:llama3_8b                       113  53.4%     43.3%     51.9%     69.7%
local:mixtral_8x7b                    113  37.4%     34.1%     21.5%     18.2%
local:llama3_70b                      113  38.3%     15.3%     16.2%     14.1%
local:mixtral_8x22b                   113  33.2%     17.2%     15.9%     13.5%
openai:gpt-4o-2024-05-13              113  26.2%     15.9%     13.2%     13.0%


# task4

## task4_strict_distance
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  67.3%     62.2%     52.2%     53.5%
openai:gpt-4-turbo-2024-04-09         113  42.0%     17.9%     20.0%     20.7%
anthropic:claude-3-opus-20240229      113  49.0%     30.2%     25.7%     21.6%
local:llama3_8b                       113  91.8%     74.4%     83.4%     89.4%
local:mixtral_8x7b                    113  74.8%     75.9%     75.5%     74.8%
local:llama3_70b                      113  73.1%     31.4%     30.6%     31.0%
local:mixtral_8x22b                   113  55.9%     35.0%     33.0%     29.5%
openai:gpt-4o-2024-05-13              113  59.8%     15.9%     14.8%     14.7%

## task4_fuzzy_distance
model                               items  0-shot    1-shot    2-shot    3-shot
--------------------------------  -------  --------  --------  --------  --------
openai:gpt-3-5-turbo                  113  66.6%     60.3%     51.9%     53.1%
openai:gpt-4-turbo-2024-04-09         113  30.2%     17.9%     20.0%     20.7%
anthropic:claude-3-opus-20240229      113  38.7%     30.2%     25.7%     21.6%
local:llama3_8b                       113  70.6%     67.8%     79.7%     87.2%
local:mixtral_8x7b                    113  64.5%     62.8%     60.9%     59.5%
local:llama3_70b                      113  44.8%     31.4%     30.4%     31.0%
local:mixtral_8x22b                   113  46.3%     34.8%     32.6%     28.9%
openai:gpt-4o-2024-05-13              113  24.4%     15.9%     14.8%     14.7%
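
The task1 tables above report precision (prec), recall (rec), and F1 over the nodes and edges of the reconstructed graph. As a rough illustration of how such set-based scores relate to each other, here is a minimal sketch that uses exact matching only. It is not the implementation in calc_metrics.py: the function name and the example room names are hypothetical, and how the script matches items in its strict and fuzzy modes is not described in this README.

```python
def precision_recall_f1(predicted, reference):
    """Set-based precision, recall, and F1 with exact matching (illustrative only)."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    # Guard against empty sets to avoid division by zero.
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: edges are stored as frozensets, so direction is ignored.
reference_edges = {frozenset(("hall", "kitchen")), frozenset(("kitchen", "garden"))}
predicted_edges = {frozenset(("kitchen", "hall")), frozenset(("hall", "cellar"))}

print(precision_recall_f1(predicted_edges, reference_edges))  # (0.5, 0.5, 0.5)
```

In this example one of the two predicted edges is correct and one of the two reference edges is recovered, so all three scores equal 0.5.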
