Some OCD concerns about the guides #127

Open
basilkorompilias opened this issue Jul 14, 2024 · 0 comments

Hey there!
I'd like to add some input about the way you describe the data, since those descriptions are critically important for understanding it right away. There are three descriptions: one on Kaggle, one on the website, and one on the readme page of this repo, which I only just found. I believe the readme gives the best and clearest presentation of the hierarchy and logic of the files (too bad for me that I did not open every link first and spent hours trying to figure out the basics).

My concern is with the way the actual datasets are structured. More specifically, in "arc-agi_training-challenges.json" the tests are placed first, which causes some confusion when reading the file. This might sound trivial to many, and computationally it might not be a problem at all for well-structured models, but logically the tests should come after the training examples, as they do in the guides.

Before finding the clear and very direct explanation in this repo, I wrote the overview below. I am sharing it in case you want to adapt it and place it on Kaggle and your website to improve your guides.

I also have a concern about the term "Train" when we talk about AGI, but I will open a separate thread about that.

Why is this important?

  • First, it is important for people from all backgrounds to understand how we approach this task of designing an AGI that senses different environments and eventually attempts to make sense of one.
  • Second, it is important for our AI collaborators (the NLP agents most of us use today for help) to understand all the components and their hierarchy within the datasets precisely. Otherwise they offer suggestions that lead to trivial errors, because the guides are not very clear, most people find it difficult to explain repetitive patterns in a simple hierarchical way, and the datasets themselves are organized in an unorthodox manner.

P.S. If I am the only one who sees this as unorthodox, please excuse me: I am not an engineer, but a designer and information architect first. I just hope my input helps you become more consistent and specific, which is important when outlining tasks.

Cheers,
Basil.


Dataset Structure Overview

Each dataset is a collection of tasks, uniquely identified by an ID. Each task includes training data to develop models and test data to evaluate their performance.

Tasks Collection:

  • Each ID is a key in the dataset.
  • The value for each ID is an object (dictionary) containing "train" and "test" keys. In the file the "test" key appears first, although logically the test is our concluding reference.
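
To make the two points above concrete, here is a minimal Python sketch of walking the tasks collection with the standard `json` module. The `SAMPLE` string, the ID `task_a`, and the helper `summarize` are invented for illustration; a real run would parse "arc-agi_training-challenges.json" instead.

```python
import json

# Hypothetical miniature dataset: the ID and the grids are made up.
# "test" is written first on purpose, mirroring the order described above.
SAMPLE = """
{
  "task_a": {
    "test":  [{"input": [[0, 1]]}],
    "train": [{"input": [[1, 0]], "output": [[0, 1]]}]
  }
}
"""

def summarize(dataset):
    """Return {task_id: (number of train pairs, number of test inputs)}."""
    return {
        task_id: (len(task["train"]), len(task["test"]))
        for task_id, task in dataset.items()
    }

dataset = json.loads(SAMPLE)
summary = summarize(dataset)
print(summary)
```

Because each task is looked up by key ("train", "test"), the order in which the keys appear in the file does not change what the code sees. It only affects how the file reads to a person.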

ID Object

  • Train (Key-Value Pair)
    • Key: "train"
    • Value: List (Array)
      • Each element in the list is a dictionary holding one pair of "input" and "output" keys; the list as a whole provides multiple such pairs to help us train/design our model.
  • Test (Key-Value Pair)
    • Key: "test"
    • Value: List (Array)
      • Typically a single element: a dictionary containing only an "input" key, used for testing our designed/trained model.
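
The shape of one ID object described above can be checked with a short Python sketch. The helpers `is_grid` and `check_task` and the sample tasks are hypothetical, and the sketch assumes that a grid is a non-empty rectangular list of integer rows.

```python
def is_grid(value):
    """True if value is a non-empty list of equally long rows of ints."""
    return (
        isinstance(value, list) and len(value) > 0
        and all(isinstance(row, list) for row in value)
        and len({len(row) for row in value}) == 1
        and all(isinstance(cell, int) for row in value for cell in row)
    )

def check_task(task):
    """Return a list of problems found in one ID object (empty if none)."""
    problems = []
    # Every train entry needs both an input grid and an output grid.
    for i, pair in enumerate(task.get("train", [])):
        for key in ("input", "output"):
            if not is_grid(pair.get(key)):
                problems.append(f"train[{i}].{key} is not a grid")
    # Every test entry needs only an input grid.
    for i, entry in enumerate(task.get("test", [])):
        if not is_grid(entry.get("input")):
            problems.append(f"test[{i}].input is not a grid")
    return problems

# Made-up tasks: the second one is missing a train output.
good_task = {"train": [{"input": [[1]], "output": [[0]]}], "test": [{"input": [[0]]}]}
bad_task = {"train": [{"input": [[1]]}], "test": [{"input": [[0]]}]}
print(check_task(good_task))
print(check_task(bad_task))
```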

Here is the structured representation with emphasis:

  • ID
    • Train
      • Entry 1
        • Input: Grid (2D array of numbers)
        • Output: Grid (2D array of numbers)
      • Entry 2
        • Input: Grid (2D array of numbers)
        • Output: Grid (2D array of numbers)
      • ...
    • Test
      • Single Input: Grid (2D array of numbers)

Example structure from the JSON file with the correct hierarchy:

{
  "ID": {
    "train": [
      {
        "input": [
          [/* grid data */]
        ],
        "output": [
          [/* grid data */]
        ]
      },
      {
        "input": [
          [/* grid data */]
        ],
        "output": [
          [/* grid data */]
        ]
      }
      // more train entries
    ],
    "test": [
      {
        "input": [
          [/* grid data */]
        ]
      }
      // typically one test input per ID
    ]
  }
  // more IDs
}
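
For anyone who prefers to read the file in the train-then-test order shown above, a small Python sketch can rewrite it that way. The function `train_first` and the sample string `raw` are hypothetical; the sketch relies on Python dictionaries preserving insertion order and on `json.dumps` writing keys in that order.

```python
import json

def train_first(dataset):
    """Return a copy of the dataset with "train" before "test" in every task."""
    return {
        task_id: {"train": task["train"], "test": task["test"]}
        for task_id, task in dataset.items()
    }

# Made-up input with "test" first, as in the file discussed in this issue.
raw = '{"task_a": {"test": [{"input": [[0]]}], "train": [{"input": [[1]], "output": [[0]]}]}}'

reordered = train_first(json.loads(raw))
text = json.dumps(reordered, indent=2)
print(text)
```

The data itself is unchanged; only the order in which the keys are written differs.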