You said:
How do I change the permissions for the user when launching the studio?
ChatGPT said:

Gotcha — “Studio” means different things in different platforms, and permissions live in different places. Here are the exact clicks for the most common ones. Pick the section that matches what you’re using and you should be set.

Azure Machine Learning Studio
Azure Portal → your ML workspace → Access control (IAM) → Add role assignment.

Assign a role (common picks):

AzureML Data Scientist (run jobs and experiments, but not create or delete compute) or Contributor (broader workspace management)

Reader (view-only)

Also grant data/secret access so Studio works end-to-end:

On the linked Storage account: Storage Blob Data Reader or Storage Blob Data Contributor (the plain Contributor role does not grant blob data access)

On the linked Key Vault: Key Vault Secrets User (if they use secrets)

Have the user sign out/in and re-open Studio from the workspace (Launch studio).

AWS SageMaker Studio
AWS Console → SageMaker → Domains → your domain → User profiles → select the user → Edit.

Set the Execution role (IAM role) that has the needed permissions (S3, ECR, KMS, SageMaker APIs, etc.).

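If you switched the profile to a different execution role, restart the user's apps so they assume it: Studio Control Panel → user’s Apps (JupyterServer / KernelGateway) → Delete/Stop, then Re-launch. (Their home EFS volume persists.) Edits to the policies on the existing role normally apply without a restart, though they can take a short while to propagate.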

Google Vertex AI Studio (Generative AI)
Google Cloud Console → IAM & Admin → grant roles at the project:

Vertex AI User (use resources)

Service Account User (if they need to run with a SA)

For data, grant Storage Object Viewer (read) or Storage Object Creator (write) on the relevant GCS buckets.

Reopen Vertex AI Studio.

OpenAI Studio
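In your OpenAI organization settings, go to Members and set the user's role. Check which role names your organization offers and pick the least-privileged one that covers the user's tasks.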

If you’re using Projects, adjust the user’s project role too (project-level sharing is separate).

Looker Studio (formerly Data Studio)
Open the report → Share → Manage access → add the user as Viewer or Editor.

Also share the Data Source (Data Source → Share) if the report doesn’t use owner’s credentials.

Quick checks if changes don’t “stick”
Close/reopen the Studio (or sign out/in).

In SageMaker, restart/delete the user’s Studio Apps.

In Azure, confirm the role assignment's scope covers the workspace (an assignment on the resource group or subscription is inherited by the workspace), and that the Storage/Key Vault roles are set.

Role changes can take a few minutes to propagate, and an existing session may keep cached credentials, so retest in a fresh session.

If you tell me which Studio you’re on (Azure ML, SageMaker, Vertex AI, OpenAI, Looker, something else), I’ll tailor the steps precisely to your setup (including which exact roles your user needs for their tasks).

You said:
Don't I need to apply this to the user?

{
	"Version": "2012-10-17",
	"Statement": [
		{
			"Effect": "Allow",
			"Action": "s3:ListBucket",
			"Resource": "arn:aws:s3:::arc-data-training"
		},
		{
			"Effect": "Allow",
			"Action": [
				"s3:GetObject",
				"s3:GetObjectVersion"
			],
			"Resource": "arn:aws:s3:::arc-data-training/*"
		}
	]
}
ChatGPT said:

Short answer: attach that policy to the SageMaker Studio execution role, not to the human IAM user. That’s the role your Studio apps (JupyterServer/KernelGateway) assume when they run, so S3 access needs to live there.

Here’s what to do:

Find the role Studio uses

AWS Console → SageMaker → Domains → your domain → User profiles → select the user.

Note the Execution role ARN.

Attach the policy to that role

AWS Console → IAM → Roles → open that execution role → Add permissions → Create inline policy → JSON → paste your policy → Save.
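If you would rather script the lookup than click through the console, here is a minimal sketch. It assumes a boto3 SageMaker client is passed in; the function name `resolve_execution_role` and its parameter names are made up for illustration. It reads the profile's own execution role and falls back to the domain default when the profile does not set one.

```python
def resolve_execution_role(sm_client, domain_id, user_profile_name):
    """Return the execution role ARN used by a Studio user profile.

    sm_client is assumed to be boto3.client("sagemaker") or an object
    exposing the same describe_user_profile / describe_domain calls.
    """
    profile = sm_client.describe_user_profile(
        DomainId=domain_id, UserProfileName=user_profile_name
    )
    # A profile may override the role; if not, the domain default applies.
    role = profile.get("UserSettings", {}).get("ExecutionRole")
    if role:
        return role
    domain = sm_client.describe_domain(DomainId=domain_id)
    return domain["DefaultUserSettings"]["ExecutionRole"]
```

Usage would look like `resolve_execution_role(boto3.client("sagemaker"), "<domain-id>", "<user-profile-name>")`, with your own domain ID and profile name filled in.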

Your read-only policy (as posted) is fine for reading from s3://arc-data-training:

{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow","Action": "s3:ListBucket","Resource": "arn:aws:s3:::arc-data-training"},
    {"Effect": "Allow","Action": ["s3:GetObject","s3:GetObjectVersion"],"Resource": "arn:aws:s3:::arc-data-training/*"}
  ]
}
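The same policy can be generated and attached from Python. This is a sketch under assumptions: `build_read_only_policy` and `attach_inline_policy` are hypothetical helper names, and the client passed in is expected to be a boto3 IAM client, whose `put_role_policy` call creates the inline policy.

```python
import json


def build_read_only_policy(bucket):
    """Return the read-only S3 policy shown above as a Python dict."""
    bucket_arn = f"arn:aws:s3:::{bucket}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # ListBucket is granted on the bucket itself
            {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": bucket_arn},
            # GetObject is granted on the objects inside the bucket
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:GetObjectVersion"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


def attach_inline_policy(iam_client, role_name, policy_name, policy):
    """Attach the policy to the role as an inline policy (boto3 IAM client assumed)."""
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(policy),
    )


print(json.dumps(build_read_only_policy("arc-data-training"), indent=2))
```

To apply it, you would call `attach_inline_policy(boto3.client("iam"), "<studio-execution-role-name>", "<policy-name>", policy)` with credentials that are allowed to edit the role.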
Common add-ons (use if needed)
Write data back to the bucket

{"Effect":"Allow","Action":["s3:PutObject","s3:AbortMultipartUpload"],"Resource":"arn:aws:s3:::arc-data-training/*"}
List object versions (if the bucket is versioned and you need to browse versions)

{"Effect":"Allow","Action":"s3:ListBucketVersions","Resource":"arn:aws:s3:::arc-data-training"}
SSE-KMS encryption (if objects use a CMK)

Add to the role policy:

{
  "Effect":"Allow",
  "Action":["kms:Decrypt","kms:Encrypt","kms:GenerateDataKey*","kms:DescribeKey"],
  "Resource":"arn:aws:kms:<region>:<account>:key/<key-id>"
}
Ensure the KMS key policy allows this role as a principal.

If access still fails
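Check the bucket policy. An explicit Deny in arc-data-training’s bucket policy overrides any Allow on the role, so look for Deny statements or allow-lists and adjust them to exempt the role. If the bucket lives in another account, or you want the grant stated on the bucket as well, add an allow for the role: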

{
  "Version":"2012-10-17",
  "Statement":[
    {
      "Sid":"AllowStudioRoleRead",
      "Effect":"Allow",
      "Principal":{"AWS":"arn:aws:iam::<account-id>:role/<studio-execution-role-name>"},
      "Action":["s3:ListBucket"],
      "Resource":"arn:aws:s3:::arc-data-training"
    },
    {
      "Sid":"AllowStudioRoleReadObjects",
      "Effect":"Allow",
      "Principal":{"AWS":"arn:aws:iam::<account-id>:role/<studio-execution-role-name>"},
      "Action":["s3:GetObject","s3:GetObjectVersion"],
      "Resource":"arn:aws:s3:::arc-data-training/*"
    }
  ]
}
Credentials caching: After changing the role, in Studio go to Control Panel → (your user) → Apps, stop/delete the running JupyterServer/KernelGateway apps, then reopen Studio so it picks up the new permissions.
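To see why a Deny on the bucket wins over an Allow on the role, here is a deliberately simplified model of the decision. It is not how AWS implements evaluation: it ignores principals, conditions, and cross-account rules, treats action names as case-sensitive, and uses shell-style wildcards. The function name `evaluate` is made up for illustration.

```python
from fnmatch import fnmatchcase


def _matches(patterns, value):
    """True if value matches a pattern or any pattern in a list ('*' wildcards)."""
    if isinstance(patterns, str):
        patterns = [patterns]
    return any(fnmatchcase(value, p) for p in patterns)


def evaluate(statements, action, resource):
    """Simplified decision: explicit Deny > Allow > implicit deny."""
    allowed = False
    for st in statements:
        if _matches(st["Action"], action) and _matches(st["Resource"], resource):
            if st["Effect"] == "Deny":
                return "Deny"  # one matching Deny ends the evaluation
            allowed = True
    return "Allow" if allowed else "ImplicitDeny"
```

Feeding it the role's two Allow statements plus a bucket-side `{"Effect": "Deny", "Action": "s3:*", "Resource": "arn:aws:s3:::arc-data-training/*"}` returns `"Deny"` for `s3:GetObject`, which is the situation where editing the role alone will not help.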

If you paste me the execution role ARN (or confirm whether the bucket uses KMS or needs write access), I’ll hand you the exact, ready-to-paste policy.

You said:
Change this code to read from a folder in my directory called training instead of a bucket.


import os
import sys
import json
import boto3
from pathlib import Path
from urllib.parse import urlparse

import torch
import numpy as np
import torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


def load_jsons_from_folder(s3_uri, profile_name='arcprofile'):
    if not str(s3_uri).startswith("s3://"):
        raise ValueError("Expected an S3 URI like s3://bucket/prefix/")

    o = urlparse(s3_uri)
    bucket = o.netloc
    prefix = o.path.lstrip("/")
    if prefix and not prefix.endswith("/"):
        prefix += "/"

    # use default creds, or pick a profile
    session = boto3.Session(profile_name=profile_name) if profile_name else boto3.Session()
    s3 = session.client("s3")

    # list only the current "folder" (non-recursive)
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/")

    json_keys = []
    for page in pages:
        for obj in page.get("Contents", []):
            key = obj["Key"]
            if key.endswith(".json"):
                json_keys.append(key)

    if not json_keys:
        raise FileNotFoundError(f"No .json objects found under s3://{bucket}/{prefix}")

    # sort by filename to mirror your local behavior
    json_keys.sort(key=lambda k: os.path.basename(k))

    data = {}
    for key in json_keys:
        uri = f"s3://{bucket}/{key}"
        try:
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            obj = json.loads(body)
        except Exception as e:
            print("Failed to read:", uri)
            print(" Error:", e)
            continue

        # same validation as your original
        if ("train" not in obj) or ("test" not in obj):
            print("Skipping (no train/test):", uri)
            continue

        ok = True
        for split in ["train", "test"]:
            if not isinstance(obj[split], list):
                ok = False
                break
            for pairs in obj[split]:
                if ("input" not in pairs) or ("output" not in pairs):
                    ok = False
                    break
            if not ok:
                break
        if not ok:
            print("Skipping (bad format):", uri)
            continue

        name = os.path.splitext(os.path.basename(key))[0]
        data[name] = obj

    return data


def _add_one_to_all_values_in_place(data):
    """
    Adds +1 to every scalar value in each input/output grid across all samples.
    Done BEFORE padding so pad_value=0 remains 0.
    """
    for sample in data.values():
        for split in ["train", "test"]:
            for pairs in sample.get(split, []):
                # input grid
                r = 0
                while r < len(pairs["input"]):
                    c = 0
                    row = pairs["input"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1
                # output grid
                r = 0
                while r < len(pairs["output"]):
                    c = 0
                    row = pairs["output"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1


def get_metrics(data):
    metric_dict = {
        "max_train_len": 0,
        "max_test_len": 0,
        "max_train_input_height": 0,
        "max_test_input_height": 0,
        "max_train_output_height": 0,
        "max_test_output_height": 0,
        "max_train_input_width": 0,
        "max_test_input_width": 0,
        "max_train_output_width": 0,
        "max_test_output_width": 0
    }

    for sample in data.values():
        if (len(sample['train']) > metric_dict['max_train_len']):
            metric_dict['max_train_len'] = len(sample['train'])
        if (len(sample['test']) > metric_dict['max_test_len']):
            metric_dict['max_test_len'] = len(sample['test'])
        for pairs in sample['train']:
            if (len(pairs['input']) > metric_dict['max_train_input_height']):
                metric_dict['max_train_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_train_output_height']):
                metric_dict['max_train_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_train_input_width']):
                    metric_dict['max_train_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_train_output_width']):
                    metric_dict['max_train_output_width'] = len(output)
        for pairs in sample['test']:
            if (len(pairs['input']) > metric_dict['max_test_input_height']):
                metric_dict['max_test_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_test_output_height']):
                metric_dict['max_test_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_test_input_width']):
                    metric_dict['max_test_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_test_output_width']):
                    metric_dict['max_test_output_width'] = len(output)
    return metric_dict


def pad_data(data, metric_dict=None, pad_value=0):
    """
    Pads each sample independently to its own max square size.
    metric_dict is ignored (kept for backward compatibility).
    """
    for sample in data.values():
        # ----- compute per-sample maxima for TRAIN -----
        max_train_input_height = 0
        max_train_input_width  = 0
        max_train_output_height = 0
        max_train_output_width  = 0

        for pairs in sample.get('train', []):
            if len(pairs['input'])  > max_train_input_height:  max_train_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_train_output_height: max_train_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_train_input_width:  max_train_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_train_output_width: max_train_output_width = len(outp)

        # ----- compute per-sample maxima for TEST -----
        max_test_input_height = 0
        max_test_input_width  = 0
        max_test_output_height = 0
        max_test_output_width  = 0

        for pairs in sample.get('test', []):
            if len(pairs['input'])  > max_test_input_height:  max_test_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_test_output_height: max_test_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_test_input_width:  max_test_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_test_output_width: max_test_output_width = len(outp)

        # ----- per-sample square sizes -----
        max_train_size = max(
            max_train_input_height,
            max_train_input_width,
            max_train_output_height,
            max_train_output_width
        )
        max_test_size = max(
            max_test_input_height,
            max_test_input_width,
            max_test_output_height,
            max_test_output_width
        )

        # ----- pad TRAIN for this sample -----
        for pairs in sample.get('train', []):
            # input
            while len(pairs['input']) < max_train_size:
                pairs['input'].append([pad_value] * max_train_size)
            for inp in pairs['input']:
                while len(inp) < max_train_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_train_size:
                pairs['output'].append([pad_value] * max_train_size)
            for outp in pairs['output']:
                while len(outp) < max_train_size:
                    outp.append(pad_value)

        # ----- pad TEST for this sample -----
        for pairs in sample.get('test', []):
            # input
            while len(pairs['input']) < max_test_size:
                pairs['input'].append([pad_value] * max_test_size)
            for inp in pairs['input']:
                while len(inp) < max_test_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_test_size:
                pairs['output'].append([pad_value] * max_test_size)
            for outp in pairs['output']:
                while len(outp) < max_test_size:
                    outp.append(pad_value)

    return data


def _infer_original_size_from_padded(grid, pad_value=0):
    h = 0
    w = 0
    r = 0
    while r < len(grid):
        row = grid[r]
        any_nonpad = False
        last_nonpad = -1
        c = 0
        while c < len(row):
            if row[c] != pad_value:
                any_nonpad = True
                last_nonpad = c
            c += 1
        if any_nonpad:
            if (r + 1) > h:
                h = r + 1
            if (last_nonpad + 1) > w:
                w = last_nonpad + 1
        r += 1
    return (h, w)


def build_sample_level_dataset(data, pad_value=0):
    """
    Build a list of per-sample records.
    NEW: also stores per-pair masks: 1 where value != pad_value, else 0.
    """
    dataset = []
    for sample_name, sample in data.items():
        # containers
        train_pairs = []
        test_pairs = []

        # track original (unpadded) sizes per split
        train_max_h = 0
        train_max_w = 0
        test_max_h = 0
        test_max_w = 0

        # ----- TRAIN -----
        idx = 0
        for pairs in sample['train']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            # original sizes (prefer stored, else infer)
            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            # update split-wide original size (max over inputs/outputs)
            if in_h > train_max_h: train_max_h = in_h
            if out_h > train_max_h: train_max_h = out_h
            if in_w > train_max_w: train_max_w = in_w
            if out_w > train_max_w: train_max_w = out_w

            # tensors
            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            # store pair
            train_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # ----- TEST -----
        idx = 0
        for pairs in sample['test']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            if in_h > test_max_h: test_max_h = in_h
            if out_h > test_max_h: test_max_h = out_h
            if in_w > test_max_w: test_max_w = in_w
            if out_w > test_max_w: test_max_w = out_w

            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            test_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # assemble sample-level record
        item = {
            "id": str(sample_name),
            "train_pairs": train_pairs,
            "test_pairs": test_pairs,
            "train_original_size": (train_max_h, train_max_w),
            "test_original_size": (test_max_h, test_max_w)
        }
        dataset.append(item)

    return dataset


class ARCSampleDataset(Dataset):
    def __init__(self, sample_list):
        self.data = sample_list

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]

        # stack per-sample pairs into tensors
        train_inputs = torch.stack([p["input"] for p in sample["train_pairs"]])      # [num_train, H, W]
        train_outputs = torch.stack([p["output"] for p in sample["train_pairs"]])    # [num_train, H, W]
        test_inputs = torch.stack([p["input"] for p in sample["test_pairs"]])        # [num_test, H, W]
        test_outputs = torch.stack([p["output"] for p in sample["test_pairs"]])      # [num_test, H, W]

        # NEW: stack masks alongside data
        train_input_masks = torch.stack([p["input_mask"] for p in sample["train_pairs"]])   # [num_train, H, W]
        train_output_masks = torch.stack([p["output_mask"] for p in sample["train_pairs"]]) # [num_train, H, W]
        test_input_masks  = torch.stack([p["input_mask"] for p in sample["test_pairs"]])    # [num_test, H, W]
        test_output_masks = torch.stack([p["output_mask"] for p in sample["test_pairs"]])   # [num_test, H, W]

        return {
            "id": sample["id"],
            "train_inputs": train_inputs,
            "train_outputs": train_outputs,
            "test_inputs": test_inputs,
            "test_outputs": test_outputs,
            "train_input_masks": train_input_masks,   # NEW
            "train_output_masks": train_output_masks, # NEW
            "test_input_masks": test_input_masks,     # NEW
            "test_output_masks": test_output_masks,   # NEW
            "train_original_size": torch.tensor(sample["train_original_size"], dtype=torch.long),
            "test_original_size": torch.tensor(sample["test_original_size"], dtype=torch.long)
        }

def arc_collate_fn_bs1(batch):
    # batch size is guaranteed to be 1; return the single dict unchanged
    return batch[0]


# ----------- main -----------
folder_path = f"s3://arc-data-training/training/"
data = load_jsons_from_folder(folder_path)

# add +1 to every grid value BEFORE padding
_add_one_to_all_values_in_place(data)

metrics = get_metrics(data)
padded_data = pad_data(data, metrics, pad_value=0)
sample_level = build_sample_level_dataset(padded_data, pad_value=0)
arc_dataset = ARCSampleDataset(sample_list=sample_level)

arc_loader = DataLoader(
    arc_dataset,
    batch_size=1,
    shuffle=True,
    collate_fn=arc_collate_fn_bs1,
    num_workers=0,
    pin_memory=False
)

# NEW: print *every* array for *every* sample (can be large!)
for batch in arc_loader:
    print("=== SAMPLE (FULL) ===")
    print("ID:", batch["id"])
    print("train_inputs:", batch["train_inputs"].tolist())
    print("train_outputs:", batch["train_outputs"].tolist())
    print("train_input_masks:", batch["train_input_masks"].tolist())
    print("train_output_masks:", batch["train_output_masks"].tolist())
    print("test_inputs:", batch["test_inputs"].tolist())
    print("test_outputs:", batch["test_outputs"].tolist())
    print("test_input_masks:", batch["test_input_masks"].tolist())
    print("test_output_masks:", batch["test_output_masks"].tolist())
    print("train_original_size:", batch["train_original_size"].tolist())
    print("test_original_size:", batch["test_original_size"].tolist())
    print()

print(type(arc_loader))
ChatGPT said:

Here’s a local-filesystem version (no S3/boto3). It reads .json files directly from a folder named training (non-recursive), keeps your validation and downstream logic the same, and replaces the combined import line that imported torch a second time with separate imports.

import os
import sys
import json
from pathlib import Path

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


def load_jsons_from_folder(dir_path):
    """
    Load all .json files from a local directory (non-recursive).
    Mirrors the S3 behavior: list current folder only, sort by filename,
    validate schema, and return a dict[name] = parsed_json.
    """
    p = Path(dir_path)
    if not p.exists() or not p.is_dir():
        raise FileNotFoundError(f"Directory not found: {p}")

    # list only current folder, non-recursive, .json only
    json_files = [f for f in p.iterdir() if f.is_file() and f.suffix.lower() == ".json"]
    if not json_files:
        raise FileNotFoundError(f"No .json files found in {p}")

    # sort by filename to mirror previous behavior
    json_files.sort(key=lambda f: f.name)

    data = {}
    for f in json_files:
        try:
            body = f.read_text(encoding="utf-8")
            obj = json.loads(body)
        except (OSError, ValueError) as e:  # unreadable file, bad encoding, or invalid JSON
            print("Failed to read:", f)
            print(" Error:", e)
            continue

        # same validation as your original
        if ("train" not in obj) or ("test" not in obj):
            print("Skipping (no train/test):", f)
            continue

        ok = True
        for split in ["train", "test"]:
            if not isinstance(obj[split], list):
                ok = False
                break
            for pairs in obj[split]:
                if ("input" not in pairs) or ("output" not in pairs):
                    ok = False
                    break
            if not ok:
                break
        if not ok:
            print("Skipping (bad format):", f)
            continue

        name = f.stem  # filename without extension
        data[name] = obj

    return data


def _add_one_to_all_values_in_place(data):
    """
    Adds +1 to every scalar value in each input/output grid across all samples.
    Done BEFORE padding so pad_value=0 remains 0.
    """
    for sample in data.values():
        for split in ["train", "test"]:
            for pairs in sample.get(split, []):
                # input grid
                r = 0
                while r < len(pairs["input"]):
                    c = 0
                    row = pairs["input"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1
                # output grid
                r = 0
                while r < len(pairs["output"]):
                    c = 0
                    row = pairs["output"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1


def get_metrics(data):
    metric_dict = {
        "max_train_len": 0,
        "max_test_len": 0,
        "max_train_input_height": 0,
        "max_test_input_height": 0,
        "max_train_output_height": 0,
        "max_test_output_height": 0,
        "max_train_input_width": 0,
        "max_test_input_width": 0,
        "max_train_output_width": 0,
        "max_test_output_width": 0
    }

    for sample in data.values():
        if (len(sample['train']) > metric_dict['max_train_len']):
            metric_dict['max_train_len'] = len(sample['train'])
        if (len(sample['test']) > metric_dict['max_test_len']):
            metric_dict['max_test_len'] = len(sample['test'])
        for pairs in sample['train']:
            if (len(pairs['input']) > metric_dict['max_train_input_height']):
                metric_dict['max_train_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_train_output_height']):
                metric_dict['max_train_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_train_input_width']):
                    metric_dict['max_train_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_train_output_width']):
                    metric_dict['max_train_output_width'] = len(output)
        for pairs in sample['test']:
            if (len(pairs['input']) > metric_dict['max_test_input_height']):
                metric_dict['max_test_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_test_output_height']):
                metric_dict['max_test_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_test_input_width']):
                    metric_dict['max_test_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_test_output_width']):
                    metric_dict['max_test_output_width'] = len(output)
    return metric_dict


def pad_data(data, metric_dict=None, pad_value=0):
    """
    Pads each sample independently to its own max square size.
    metric_dict is ignored (kept for backward compatibility).
    """
    for sample in data.values():
        # ----- compute per-sample maxima for TRAIN -----
        max_train_input_height = 0
        max_train_input_width  = 0
        max_train_output_height = 0
        max_train_output_width  = 0

        for pairs in sample.get('train', []):
            if len(pairs['input'])  > max_train_input_height:  max_train_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_train_output_height: max_train_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_train_input_width:  max_train_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_train_output_width: max_train_output_width = len(outp)

        # ----- compute per-sample maxima for TEST -----
        max_test_input_height = 0
        max_test_input_width  = 0
        max_test_output_height = 0
        max_test_output_width  = 0

        for pairs in sample.get('test', []):
            if len(pairs['input'])  > max_test_input_height:  max_test_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_test_output_height: max_test_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_test_input_width:  max_test_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_test_output_width: max_test_output_width = len(outp)

        # ----- per-sample square sizes -----
        max_train_size = max(
            max_train_input_height,
            max_train_input_width,
            max_train_output_height,
            max_train_output_width
        )
        max_test_size = max(
            max_test_input_height,
            max_test_input_width,
            max_test_output_height,
            max_test_output_width
        )

        # ----- pad TRAIN for this sample -----
        for pairs in sample.get('train', []):
            # input
            while len(pairs['input']) < max_train_size:
                pairs['input'].append([pad_value] * max_train_size)
            for inp in pairs['input']:
                while len(inp) < max_train_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_train_size:
                pairs['output'].append([pad_value] * max_train_size)
            for outp in pairs['output']:
                while len(outp) < max_train_size:
                    outp.append(pad_value)

        # ----- pad TEST for this sample -----
        for pairs in sample.get('test', []):
            # input
            while len(pairs['input']) < max_test_size:
                pairs['input'].append([pad_value] * max_test_size)
            for inp in pairs['input']:
                while len(inp) < max_test_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_test_size:
                pairs['output'].append([pad_value] * max_test_size)
            for outp in pairs['output']:
                while len(outp) < max_test_size:
                    outp.append(pad_value)

    return data


def _infer_original_size_from_padded(grid, pad_value=0):
    h = 0
    w = 0
    r = 0
    while r < len(grid):
        row = grid[r]
        any_nonpad = False
        last_nonpad = -1
        c = 0
        while c < len(row):
            if row[c] != pad_value:
                any_nonpad = True
                last_nonpad = c
            c += 1
        if any_nonpad:
            if (r + 1) > h:
                h = r + 1
            if (last_nonpad + 1) > w:
                w = last_nonpad + 1
        r += 1
    return (h, w)


def build_sample_level_dataset(data, pad_value=0):
    """
    Build a list of per-sample records.
    NEW: also stores per-pair masks: 1 where value != pad_value, else 0.
    """
    dataset = []
    for sample_name, sample in data.items():
        # containers
        train_pairs = []
        test_pairs = []

        # track original (unpadded) sizes per split
        train_max_h = 0
        train_max_w = 0
        test_max_h = 0
        test_max_w = 0

        # ----- TRAIN -----
        idx = 0
        for pairs in sample['train']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            # original sizes (prefer stored, else infer)
            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            # update split-wide original size (max over inputs/outputs)
            if in_h > train_max_h: train_max_h = in_h
            if out_h > train_max_h: train_max_h = out_h
            if in_w > train_max_w: train_max_w = in_w
            if out_w > train_max_w: train_max_w = out_w

            # tensors
            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            # store pair
            train_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # ----- TEST -----
        idx = 0
        for pairs in sample['test']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            if in_h > test_max_h: test_max_h = in_h
            if out_h > test_max_h: test_max_h = out_h
            if in_w > test_max_w: test_max_w = in_w
            if out_w > test_max_w: test_max_w = out_w

            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            test_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # assemble sample-level record
        item = {
            "id": str(sample_name),
            "train_pairs": train_pairs,
            "test_pairs": test_pairs,
            "train_original_size": (train_max_h, train_max_w),
            "test_original_size": (test_max_h, test_max_w)
        }
        dataset.append(item)

    return dataset


class ARCSampleDataset(Dataset):
    def __init__(self, sample_list):
        self.data = sample_list

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]

        # stack per-sample pairs into tensors
        train_inputs = torch.stack([p["input"] for p in sample["train_pairs"]])      # [num_train, H, W]
        train_outputs = torch.stack([p["output"] for p in sample["train_pairs"]])    # [num_train, H, W]
        test_inputs = torch.stack([p["input"] for p in sample["test_pairs"]])        # [num_test, H, W]
        test_outputs = torch.stack([p["output"] for p in sample["test_pairs"]])      # [num_test, H, W]

        # NEW: stack masks alongside data
        train_input_masks = torch.stack([p["input_mask"] for p in sample["train_pairs"]])   # [num_train, H, W]
        train_output_masks = torch.stack([p["output_mask"] for p in sample["train_pairs"]]) # [num_train, H, W]
        test_input_masks  = torch.stack([p["input_mask"] for p in sample["test_pairs"]])    # [num_test, H, W]
        test_output_masks = torch.stack([p["output_mask"] for p in sample["test_pairs"]])   # [num_test, H, W]

        return {
            "id": sample["id"],
            "train_inputs": train_inputs,
            "train_outputs": train_outputs,
            "test_inputs": test_inputs,
            "test_outputs": test_outputs,
            "train_input_masks": train_input_masks,   # NEW
            "train_output_masks": train_output_masks, # NEW
            "test_input_masks": test_input_masks,     # NEW
            "test_output_masks": test_output_masks,   # NEW
            "train_original_size": torch.tensor(sample["train_original_size"], dtype=torch.long),
            "test_original_size": torch.tensor(sample["test_original_size"], dtype=torch.long)
        }

def arc_collate_fn_bs1(batch):
    # batch size is guaranteed to be 1; return the single dict unchanged
    return batch[0]


# ----------- main -----------
# Point to your local folder named "training" (sibling/relative to where you run the script)
folder_path = Path("training")
data = load_jsons_from_folder(folder_path)

# add +1 to every grid value BEFORE padding
_add_one_to_all_values_in_place(data)

metrics = get_metrics(data)
padded_data = pad_data(data, metrics, pad_value=0)
sample_level = build_sample_level_dataset(padded_data, pad_value=0)
arc_dataset = ARCSampleDataset(sample_list=sample_level)

arc_loader = DataLoader(
    arc_dataset,
    batch_size=1,
    shuffle=True,
    collate_fn=arc_collate_fn_bs1,
    num_workers=0,
    pin_memory=False
)

# NEW: print *every* array for *every* sample (can be large!)
for batch in arc_loader:
    print("=== SAMPLE (FULL) ===")
    print("ID:", batch["id"])
    print("train_inputs:", batch["train_inputs"].tolist())
    print("train_outputs:", batch["train_outputs"].tolist())
    print("train_input_masks:", batch["train_input_masks"].tolist())
    print("train_output_masks:", batch["train_output_masks"].tolist())
    print("test_inputs:", batch["test_inputs"].tolist())
    print("test_outputs:", batch["test_outputs"].tolist())
    print("test_input_masks:", batch["test_input_masks"].tolist())
    print("test_output_masks:", batch["test_output_masks"].tolist())
    print("train_original_size:", batch["train_original_size"].tolist())
    print("test_original_size:", batch["test_original_size"].tolist())
    print()

print(type(arc_loader))
Notes:

If your training directory is elsewhere, set folder_path = Path("/full/path/to/training").

To search subfolders too, replace the listing with p.glob("**/*.json") and sort those paths.
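
A minimal sketch of that recursive listing, assuming a hypothetical helper named list_json_files (it is not part of the script above). Note that glob patterns are case-sensitive on Linux, so this version would miss files ending in ".JSON":

```python
from pathlib import Path


def list_json_files(folder, recursive=False):
    """Return sorted .json paths in folder; include subfolders if recursive."""
    root = Path(folder).expanduser()
    # "**/*.json" also matches files sitting directly in the root folder.
    pattern = "**/*.json" if recursive else "*.json"
    return sorted(f for f in root.glob(pattern) if f.is_file())
```

The loader could then loop over list_json_files(folder_path, recursive=True) instead of the flat listing.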

You said:
its not a directory its a folder
ChatGPT said:

Totally fair—UI calls it a “folder.” In code, folder == directory. Your function already expects a folder named training. If the error wording threw you off, here’s a tiny patch that changes the message and makes the path handling a bit friendlier:

Patch (minimal change):

def load_jsons_from_folder(dir_path):
    """
    Load all .json files from a local folder (non-recursive).
    """
    p = Path(dir_path).expanduser().resolve()
    if not p.exists() or not p.is_dir():
        raise FileNotFoundError(f"Folder not found (expected a folder containing .json files): {p}")

    json_files = [f for f in p.iterdir() if f.is_file() and f.suffix.lower() == ".json"]
    if not json_files:
        raise FileNotFoundError(f"No .json files found in folder: {p}")

    json_files.sort(key=lambda f: f.name)

    data = {}
    for f in json_files:
        try:
            body = f.read_text(encoding="utf-8")
            obj = json.loads(body)
        except Exception as e:
            print("Failed to read:", f)
            print(" Error:", e)
            continue
        # ... (rest unchanged)
Point the code at your folder named training:

# If the folder sits next to your script:
folder_path = (Path(__file__).parent / "training").resolve()

# If you're in a notebook/REPL and your CWD already contains the folder:
# folder_path = Path("training").resolve()

data = load_jsons_from_folder(folder_path)
Quick sanity check (optional)

print("Reading from:", folder_path)
print("Found:", [p.name for p in Path(folder_path).iterdir() if p.suffix.lower()==".json"])
If your “training” folder is elsewhere, just give the absolute path, e.g.:

folder_path = Path(r"C:\Users\you\projects\myapp\training").resolve()  # Windows
# or
folder_path = Path("/home/you/projects/myapp/training").resolve()      # macOS/Linux
That’s it—no S3, just a local folder.

You said:
your file read isn't working make a simple load jsons that reads all of the data



import os
import sys
import json
from pathlib import Path

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


def load_jsons_from_folder(dir_path):
    """
    Load all .json files from a local folder (non-recursive).
    """
    p = Path(dir_path).expanduser().resolve()
    if not p.exists() or not p.is_dir():
        raise FileNotFoundError(f"Folder not found (expected a folder containing .json files): {p}")

    json_files = [f for f in p.iterdir() if f.is_file() and f.suffix.lower() == ".json"]
    if not json_files:
        raise FileNotFoundError(f"No .json files found in folder: {p}")

    json_files.sort(key=lambda f: f.name)

    data = {}
    for f in json_files:
        try:
            body = f.read_text(encoding="utf-8")
            obj = json.loads(body)
        except Exception as e:
            print("Failed to read:", f)
            print(" Error:", e)
            continue
        # ... (rest unchanged)



def _add_one_to_all_values_in_place(data):
    """
    Adds +1 to every scalar value in each input/output grid across all samples.
    Done BEFORE padding so pad_value=0 remains 0.
    """
    for sample in data.values():
        for split in ["train", "test"]:
            for pairs in sample.get(split, []):
                # input grid
                r = 0
                while r < len(pairs["input"]):
                    c = 0
                    row = pairs["input"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1
                # output grid
                r = 0
                while r < len(pairs["output"]):
                    c = 0
                    row = pairs["output"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1


def get_metrics(data):
    metric_dict = {
        "max_train_len": 0,
        "max_test_len": 0,
        "max_train_input_height": 0,
        "max_test_input_height": 0,
        "max_train_output_height": 0,
        "max_test_output_height": 0,
        "max_train_input_width": 0,
        "max_test_input_width": 0,
        "max_train_output_width": 0,
        "max_test_output_width": 0
    }

    for sample in data.values():
        if (len(sample['train']) > metric_dict['max_train_len']):
            metric_dict['max_train_len'] = len(sample['train'])
        if (len(sample['test']) > metric_dict['max_test_len']):
            metric_dict['max_test_len'] = len(sample['test'])
        for pairs in sample['train']:
            if (len(pairs['input']) > metric_dict['max_train_input_height']):
                metric_dict['max_train_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_train_output_height']):
                metric_dict['max_train_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_train_input_width']):
                    metric_dict['max_train_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_train_output_width']):
                    metric_dict['max_train_output_width'] = len(output)
        for pairs in sample['test']:
            if (len(pairs['input']) > metric_dict['max_test_input_height']):
                metric_dict['max_test_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_test_output_height']):
                metric_dict['max_test_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_test_input_width']):
                    metric_dict['max_test_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_test_output_width']):
                    metric_dict['max_test_output_width'] = len(output)
    return metric_dict


def pad_data(data, metric_dict=None, pad_value=0):
    """
    Pads each sample independently to its own max square size.
    metric_dict is ignored (kept for backward compatibility).
    """
    for sample in data.values():
        # ----- compute per-sample maxima for TRAIN -----
        max_train_input_height = 0
        max_train_input_width  = 0
        max_train_output_height = 0
        max_train_output_width  = 0

        for pairs in sample.get('train', []):
            if len(pairs['input'])  > max_train_input_height:  max_train_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_train_output_height: max_train_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_train_input_width:  max_train_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_train_output_width: max_train_output_width = len(outp)

        # ----- compute per-sample maxima for TEST -----
        max_test_input_height = 0
        max_test_input_width  = 0
        max_test_output_height = 0
        max_test_output_width  = 0

        for pairs in sample.get('test', []):
            if len(pairs['input'])  > max_test_input_height:  max_test_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_test_output_height: max_test_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_test_input_width:  max_test_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_test_output_width: max_test_output_width = len(outp)

        # ----- per-sample square sizes -----
        max_train_size = max(
            max_train_input_height,
            max_train_input_width,
            max_train_output_height,
            max_train_output_width
        )
        max_test_size = max(
            max_test_input_height,
            max_test_input_width,
            max_test_output_height,
            max_test_output_width
        )

        # ----- pad TRAIN for this sample -----
        for pairs in sample.get('train', []):
            # input
            while len(pairs['input']) < max_train_size:
                pairs['input'].append([pad_value] * max_train_size)
            for inp in pairs['input']:
                while len(inp) < max_train_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_train_size:
                pairs['output'].append([pad_value] * max_train_size)
            for outp in pairs['output']:
                while len(outp) < max_train_size:
                    outp.append(pad_value)

        # ----- pad TEST for this sample -----
        for pairs in sample.get('test', []):
            # input
            while len(pairs['input']) < max_test_size:
                pairs['input'].append([pad_value] * max_test_size)
            for inp in pairs['input']:
                while len(inp) < max_test_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_test_size:
                pairs['output'].append([pad_value] * max_test_size)
            for outp in pairs['output']:
                while len(outp) < max_test_size:
                    outp.append(pad_value)

    return data


def _infer_original_size_from_padded(grid, pad_value=0):
    h = 0
    w = 0
    r = 0
    while r < len(grid):
        row = grid[r]
        any_nonpad = False
        last_nonpad = -1
        c = 0
        while c < len(row):
            if row[c] != pad_value:
                any_nonpad = True
                last_nonpad = c
            c += 1
        if any_nonpad:
            if (r + 1) > h:
                h = r + 1
            if (last_nonpad + 1) > w:
                w = last_nonpad + 1
        r += 1
    return (h, w)


def build_sample_level_dataset(data, pad_value=0):
    """
    Build a list of per-sample records.
    NEW: also stores per-pair masks: 1 where value != pad_value, else 0.
    """
    dataset = []
    for sample_name, sample in data.items():
        # containers
        train_pairs = []
        test_pairs = []

        # track original (unpadded) sizes per split
        train_max_h = 0
        train_max_w = 0
        test_max_h = 0
        test_max_w = 0

        # ----- TRAIN -----
        idx = 0
        for pairs in sample['train']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            # original sizes (prefer stored, else infer)
            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            # update split-wide original size (max over inputs/outputs)
            if in_h > train_max_h: train_max_h = in_h
            if out_h > train_max_h: train_max_h = out_h
            if in_w > train_max_w: train_max_w = in_w
            if out_w > train_max_w: train_max_w = out_w

            # tensors
            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            # store pair
            train_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # ----- TEST -----
        idx = 0
        for pairs in sample['test']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            if in_h > test_max_h: test_max_h = in_h
            if out_h > test_max_h: test_max_h = out_h
            if in_w > test_max_w: test_max_w = in_w
            if out_w > test_max_w: test_max_w = out_w

            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            test_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # assemble sample-level record
        item = {
            "id": str(sample_name),
            "train_pairs": train_pairs,
            "test_pairs": test_pairs,
            "train_original_size": (train_max_h, train_max_w),
            "test_original_size": (test_max_h, test_max_w)
        }
        dataset.append(item)

    return dataset


class ARCSampleDataset(Dataset):
    def __init__(self, sample_list):
        self.data = sample_list

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]

        # stack per-sample pairs into tensors
        train_inputs = torch.stack([p["input"] for p in sample["train_pairs"]])      # [num_train, H, W]
        train_outputs = torch.stack([p["output"] for p in sample["train_pairs"]])    # [num_train, H, W]
        test_inputs = torch.stack([p["input"] for p in sample["test_pairs"]])        # [num_test, H, W]
        test_outputs = torch.stack([p["output"] for p in sample["test_pairs"]])      # [num_test, H, W]

        # NEW: stack masks alongside data
        train_input_masks = torch.stack([p["input_mask"] for p in sample["train_pairs"]])   # [num_train, H, W]
        train_output_masks = torch.stack([p["output_mask"] for p in sample["train_pairs"]]) # [num_train, H, W]
        test_input_masks  = torch.stack([p["input_mask"] for p in sample["test_pairs"]])    # [num_test, H, W]
        test_output_masks = torch.stack([p["output_mask"] for p in sample["test_pairs"]])   # [num_test, H, W]

        return {
            "id": sample["id"],
            "train_inputs": train_inputs,
            "train_outputs": train_outputs,
            "test_inputs": test_inputs,
            "test_outputs": test_outputs,
            "train_input_masks": train_input_masks,   # NEW
            "train_output_masks": train_output_masks, # NEW
            "test_input_masks": test_input_masks,     # NEW
            "test_output_masks": test_output_masks,   # NEW
            "train_original_size": torch.tensor(sample["train_original_size"], dtype=torch.long),
            "test_original_size": torch.tensor(sample["test_original_size"], dtype=torch.long)
        }

def arc_collate_fn_bs1(batch):
    # batch size is guaranteed to be 1; return the single dict unchanged
    return batch[0]


# ----------- main -----------
# Point to your local folder named "training" (sibling/relative to where you run the script)
folder_path = Path("~/ARC-AGI-Model/src/data_pipeline/ARC_data/data/training")
data = load_jsons_from_folder(folder_path)

# add +1 to every grid value BEFORE padding
_add_one_to_all_values_in_place(data)

metrics = get_metrics(data)
padded_data = pad_data(data, metrics, pad_value=0)
sample_level = build_sample_level_dataset(padded_data, pad_value=0)
arc_dataset = ARCSampleDataset(sample_list=sample_level)

arc_loader = DataLoader(
    arc_dataset,
    batch_size=1,
    shuffle=True,
    collate_fn=arc_collate_fn_bs1,
    num_workers=0,
    pin_memory=False
)

# NEW: print *every* array for *every* sample (can be large!)
for batch in arc_loader:
    print("=== SAMPLE (FULL) ===")
    print("ID:", batch["id"])
    print("train_inputs:", batch["train_inputs"].tolist())
    print("train_outputs:", batch["train_outputs"].tolist())
    print("train_input_masks:", batch["train_input_masks"].tolist())
    print("train_output_masks:", batch["train_output_masks"].tolist())
    print("test_inputs:", batch["test_inputs"].tolist())
    print("test_outputs:", batch["test_outputs"].tolist())
    print("test_input_masks:", batch["test_input_masks"].tolist())
    print("test_output_masks:", batch["test_output_masks"].tolist())
    print("train_original_size:", batch["train_original_size"].tolist())
    print("test_original_size:", batch["test_original_size"].tolist())
    print()

print(type(arc_loader))
ChatGPT said:

Got it. Here’s a drop-in, simple loader that reads all .json files under your training folder (recursively) and returns them in a dict. It does no schema checks; a file that fails to parse is reported and left out of the result.

from pathlib import Path
import json

def load_jsons_from_folder(dir_path):
    """
    Read every .json file under dir_path (recursively) and return a dict
    keyed by the file's relative path (without the .json extension).
    """
    root = Path(dir_path).expanduser().resolve()
    files = sorted(p for p in root.rglob("*.json") if p.is_file())

    if not files:
        raise FileNotFoundError(f"No .json files found under: {root}")

    data = {}
    for p in files:
        key = p.relative_to(root).with_suffix("").as_posix()  # e.g. "subdir/file" on every OS
        try:
            with p.open("r", encoding="utf-8") as fh:
                data[key] = json.load(fh)
        except Exception as e:
            print(f"Failed to read {p}: {e}")

    if not data:
        raise FileNotFoundError(f"Unable to load any .json files under: {root}")

    return data
Use it exactly like you are:
folder_path = Path("~/ARC-AGI-Model/src/data_pipeline/ARC_data/data/training")
data = load_jsons_from_folder(folder_path)
Note: This reads everything. If any JSONs don’t have the expected train/test shape, your later steps will error. If you want, I can add a one-liner filter to keep only files that contain those keys.
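
For reference, that filter could look roughly like the sketch below. The helper name keep_arc_tasks and the toy entries are made up for illustration; the only assumption taken from your pipeline is that a valid task is a dict with list-valued "train" and "test" keys.

```python
def keep_arc_tasks(data):
    """Keep only entries that are dicts with list-valued 'train' and 'test' keys."""
    return {
        key: task
        for key, task in data.items()
        if isinstance(task, dict)
        and isinstance(task.get("train"), list)
        and isinstance(task.get("test"), list)
    }


# Toy usage with made-up entries:
loaded = {
    "task_a": {"train": [{"input": [[1]], "output": [[2]]}], "test": []},
    "notes": {"description": "not a task"},
    "broken": [1, 2, 3],
}
filtered = keep_arc_tasks(loaded)
print(sorted(filtered))  # ['task_a']
```

You would call it right after loading, e.g. data = keep_arc_tasks(load_jsons_from_folder(folder_path)).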

You said:
add its actual size also at the end


# NEW: print *every* array for *every* sample (can be large!)
for batch in arc_loader:
    print("=== SAMPLE (FULL) ===")
    print("ID:", batch["id"])
    print("train_inputs:", batch["train_inputs"].tolist())
    print("train_outputs:", batch["train_outputs"].tolist())
    print("train_input_masks:", batch["train_input_masks"].tolist())
    print("train_output_masks:", batch["train_output_masks"].tolist())
    print("test_inputs:", batch["test_inputs"].tolist())
    print("test_outputs:", batch["test_outputs"].tolist())
    print("test_input_masks:", batch["test_input_masks"].tolist())
    print("test_output_masks:", batch["test_output_masks"].tolist())
    print("train_original_size:", batch["train_original_size"].tolist())
    print("test_original_size:", batch["test_original_size"].tolist())
    print()

print(type(arc_loader))
ChatGPT said:

You got it — here’s a minimal patch that prints the actual tensor shapes for each sample and a summary at the very end with total samples/batches.

Replace your print section with this:

# NEW: print *every* array for *every* sample (can be large!)
for batch in arc_loader:
    print("=== SAMPLE (FULL) ===")
    print("ID:", batch["id"])
    print("train_inputs:", batch["train_inputs"].tolist())
    print("train_outputs:", batch["train_outputs"].tolist())
    print("train_input_masks:", batch["train_input_masks"].tolist())
    print("train_output_masks:", batch["train_output_masks"].tolist())
    print("test_inputs:", batch["test_inputs"].tolist())
    print("test_outputs:", batch["test_outputs"].tolist())
    print("test_input_masks:", batch["test_input_masks"].tolist())
    print("test_output_masks:", batch["test_output_masks"].tolist())
    print("train_original_size:", batch["train_original_size"].tolist())
    print("test_original_size:", batch["test_original_size"].tolist())

    # >>> Added: actual tensor sizes (after padding)
    print("train_inputs.shape:", tuple(batch["train_inputs"].shape))      # (num_train, H, W)
    print("train_outputs.shape:", tuple(batch["train_outputs"].shape))    # (num_train, H, W)
    print("test_inputs.shape:", tuple(batch["test_inputs"].shape))        # (num_test, H, W)
    print("test_outputs.shape:", tuple(batch["test_outputs"].shape))      # (num_test, H, W)
    print()

# >>> Added: end-of-run summary
print("=== DATASET SUMMARY ===")
print("Total samples:", len(arc_dataset))
print("Total batches per epoch:", len(arc_loader))  # batch_size=1 so this equals samples
print("Batch size:", arc_loader.batch_size)
This keeps your detailed lists but also shows the actual shapes of the tensors per sample (which reflect the padded square sizes), and prints a concise dataset summary at the end.
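
To make the difference between "original size" and "padded size" concrete, here is a small pure-Python sketch on one toy grid. The helper names pad_to_square and mask_of are hypothetical, and the sketch assumes the target size is at least as large as both grid dimensions and that real values were already shifted by +1, so 0 only ever means padding.

```python
def pad_to_square(grid, size, pad_value=0):
    """Return a new size x size grid: original cells first, padding after."""
    padded = [list(row) + [pad_value] * (size - len(row)) for row in grid]
    padded += [[pad_value] * size for _ in range(size - len(grid))]
    return padded


def mask_of(grid, pad_value=0):
    """1 where the cell holds real data, 0 where it is padding."""
    return [[1 if v != pad_value else 0 for v in row] for row in grid]


grid = [[1, 2, 3], [4, 5, 6]]            # original size (2, 3)
padded = pad_to_square(grid, size=3)     # padded size (3, 3)
print(padded)                            # [[1, 2, 3], [4, 5, 6], [0, 0, 0]]
print(mask_of(padded))                   # [[1, 1, 1], [1, 1, 1], [0, 0, 0]]
print((len(padded), len(padded[0])))     # (3, 3)
```

The shapes printed by the patch above correspond to the last line here; the original size corresponds to the (2, 3) of the unpadded grid.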

You said:
okay reset the prints I just want number of test, number of train, original and new size of the inputs and outputs of them, and one example of them printed once

for batch in arc_loader:
    print("=== SAMPLE (FULL) ===")
    print("ID:", batch["id"])
    print("train_inputs:", batch["train_inputs"].tolist())
    print("train_outputs:", batch["train_outputs"].tolist())
    print("train_input_masks:", batch["train_input_masks"].tolist())
    print("train_output_masks:", batch["train_output_masks"].tolist())
    print("test_inputs:", batch["test_inputs"].tolist())
    print("test_outputs:", batch["test_outputs"].tolist())
    print("test_input_masks:", batch["test_input_masks"].tolist())
    print("test_output_masks:", batch["test_output_masks"].tolist())
    print("train_original_size:", batch["train_original_size"].tolist())
    print("test_original_size:", batch["test_original_size"].tolist())
    print("train_inputs.shape:", tuple(batch["train_inputs"].shape))      # (num_train, H, W)
    print("train_outputs.shape:", tuple(batch["train_outputs"].shape))    # (num_train, H, W)
    print("test_inputs.shape:", tuple(batch["test_inputs"].shape))        # (num_test, H, W)
    print("test_outputs.shape:", tuple(batch["test_outputs"].shape))      # (num_test, H, W)
    print()

print(type(arc_loader))
ChatGPT said:

Here’s a concise print block that shows exactly what you asked: #train, #test, the original (max) sizes and new padded sizes of inputs/outputs for each split, plus one example pair from each split.

# -------- concise reporting --------
for batch in arc_loader:
    num_train = batch["train_inputs"].shape[0]
    num_test  = batch["test_inputs"].shape[0]

    # original (max over pairs) sizes recorded earlier
    train_orig_h, train_orig_w = map(int, batch["train_original_size"].tolist())
    test_orig_h,  test_orig_w  = map(int, batch["test_original_size"].tolist())

    # new padded sizes (actual tensor shapes)
    train_in_h, train_in_w   = batch["train_inputs"].shape[1], batch["train_inputs"].shape[2]
    train_out_h, train_out_w = batch["train_outputs"].shape[1], batch["train_outputs"].shape[2]
    test_in_h,  test_in_w    = batch["test_inputs"].shape[1], batch["test_inputs"].shape[2]
    test_out_h, test_out_w   = batch["test_outputs"].shape[1], batch["test_outputs"].shape[2]

    print("=== SAMPLE ===")
    print("ID:", batch["id"])
    print(f"#train: {num_train} | #test: {num_test}")

    print(f"Train original size (max over pairs): ({train_orig_h}, {train_orig_w})")
    print(f"Train padded sizes — input: ({train_in_h}, {train_in_w}), output: ({train_out_h}, {train_out_w})")

    print(f"Test  original size (max over pairs): ({test_orig_h}, {test_orig_w})")
    print(f"Test  padded sizes — input: ({test_in_h}, {test_in_w}), output: ({test_out_h}, {test_out_w})")

    # One example pair from each split (if present)
    if num_train > 0:
        print("--- Example TRAIN pair [0] ---")
        print("input:", batch["train_inputs"][0].tolist())
        print("output:", batch["train_outputs"][0].tolist())

    if num_test > 0:
        print("--- Example TEST pair [0] ---")
        print("input:", batch["test_inputs"][0].tolist())
        print("output:", batch["test_outputs"][0].tolist())

    print()
You said:
how to change this in my env

--NotebookApp.iopub_data_rate_limit=1.0e10 --NotebookApp.rate_limit_window=10
ChatGPT said:

Here are the easiest ways to set those Jupyter limits in your environment—pick the one that matches how you launch notebooks.

Quick (one-time) for the current run
Classic Notebook

jupyter notebook \
  --NotebookApp.iopub_data_rate_limit=1.0e10 \
  --NotebookApp.rate_limit_window=10
JupyterLab / Jupyter Server

jupyter lab \
  --ServerApp.iopub_data_rate_limit=1.0e10 \
  --ServerApp.rate_limit_window=10
Jupyter Server exposes the same byte-rate option under ServerApp (ServerApp.iopub_data_rate_limit). A separate option, ServerApp.iopub_msg_rate_limit, caps messages per second rather than bytes; raise it as well only if you see “IOPub message rate exceeded.”

Persistent (all future runs)
Create or edit your Jupyter config files in ~/.jupyter/:

Classic Notebook (~/.jupyter/jupyter_notebook_config.py)

c = get_config()
c.NotebookApp.iopub_data_rate_limit = 1.0e10
c.NotebookApp.rate_limit_window = 10.0
JupyterLab / Jupyter Server (~/.jupyter/jupyter_server_config.py)

c = get_config()
c.ServerApp.iopub_data_rate_limit = 1.0e10
c.ServerApp.rate_limit_window = 10.0
If those files don’t exist:

jupyter notebook --generate-config
jupyter server --generate-config   # for Jupyter Server / JupyterLab
Per–Conda/venv (keep it scoped to your env)
Put the config files inside the env and point Jupyter to them:

# Activate your env first...
export JUPYTER_CONFIG_DIR="$CONDA_PREFIX/etc/jupyter"   # or path to your venv
mkdir -p "$JUPYTER_CONFIG_DIR"
jupyter server --generate-config
# then edit $JUPYTER_CONFIG_DIR/jupyter_server_config.py as above
SageMaker Studio (if that’s your setup)
Open a terminal in Studio.

Edit ~/.jupyter/jupyter_server_config.py:

c = get_config()
c.ServerApp.iopub_data_rate_limit = 1.0e10  # bytes/sec
c.ServerApp.rate_limit_window = 10.0        # seconds
Control Panel → Apps → Stop the JupyterServer, then start it again.

Tip: You trimmed your prints (nice!)—that’s the best long-term fix. The huge limits above will suppress the “IOPub data rate exceeded” errors, but it’s still smart to avoid dumping giant tensors in loops.
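Before restarting, it can help to confirm where a config file would actually be picked up from. The snippet below is a minimal sketch using only the standard library. candidate_config_files is a hypothetical helper, and it only checks the two locations discussed above: the user config directory (honoring JUPYTER_CONFIG_DIR) and the active environment's etc/jupyter, which Jupyter normally also searches. It does not reproduce Jupyter's full search path; running jupyter --paths in a terminal remains the authoritative list.

```python
import os
import sys
from pathlib import Path


def candidate_config_files(name="jupyter_server_config.py"):
    """Return (path, exists) pairs for the two locations discussed above.

    Hypothetical helper: it does not reproduce Jupyter's full search path.
    """
    # User-level directory: JUPYTER_CONFIG_DIR if set, otherwise ~/.jupyter
    user_dir = Path(os.environ.get("JUPYTER_CONFIG_DIR") or Path.home() / ".jupyter")
    # Environment-level directory inside the active conda env / venv
    env_dir = Path(sys.prefix) / "etc" / "jupyter"

    seen, results = set(), []
    for directory in (user_dir, env_dir):
        path = directory.expanduser() / name
        if path not in seen:  # both may point at the same place
            seen.add(path)
            results.append((path, path.is_file()))
    return results


for path, exists in candidate_config_files():
    print(("found   " if exists else "missing ") + str(path))
```

If the file shows up as missing in both places, the settings above have nothing to load from yet.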

You said:
IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--ServerApp.iopub_data_rate_limit.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)
ChatGPT said:

You’re hitting the Jupyter Server limits (not the classic Notebook app). Use the ServerApp settings.

Quick (for this run only)
From a terminal in your env, start Jupyter with higher limits:

jupyter lab \
  --ServerApp.iopub_data_rate_limit=1.0e9 \
  --ServerApp.rate_limit_window=10
# (or if you run classic notebook)
jupyter notebook \
  --NotebookApp.iopub_data_rate_limit=1.0e9 \
  --NotebookApp.rate_limit_window=10
Tip: You can also raise the message-count limiter if needed:
--ServerApp.iopub_msg_rate_limit=1.0e9

Persistent (scoped to your env)
Point Jupyter’s config to live inside your env and generate it:

# conda
export JUPYTER_CONFIG_DIR="$CONDA_PREFIX/etc/jupyter"
mkdir -p "$JUPYTER_CONFIG_DIR"
jupyter server --generate-config
Edit $JUPYTER_CONFIG_DIR/jupyter_server_config.py and add:

c = get_config()
c.ServerApp.iopub_data_rate_limit = 1.0e9   # bytes/sec
c.ServerApp.rate_limit_window = 10.0        # seconds
c.ServerApp.iopub_msg_rate_limit = 1.0e9    # optional, msgs/sec
Restart Jupyter Lab/Server.

SageMaker Studio (if that’s what you’re on)
Open a terminal and edit ~/.jupyter/jupyter_server_config.py with the same lines above.

Control Panel → Apps → Stop the JupyterServer (and KernelGateway if needed), then start again.

Even with higher limits, it’s best to avoid huge prints. Your new concise logging should help a lot.
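To judge whether a particular print is likely to trip the limit, a rough size estimate is often enough. The sketch below assumes the default values shown in the error message (1,000,000 bytes/sec over a 3-second window) and treats limit × window as a byte budget. printed_size_bytes and likely_exceeds are hypothetical helpers, and the server's own accounting differs in detail, so treat the result as a hint rather than a guarantee.

```python
def printed_size_bytes(obj):
    """Approximate UTF-8 size of what print(obj) would send (text plus newline)."""
    return len((str(obj) + "\n").encode("utf-8"))


def likely_exceeds(obj, limit_bps=1_000_000, window_s=3.0):
    """Rough check: would one print use more than a whole window's byte budget?"""
    return printed_size_bytes(obj) > limit_bps * window_s


grid = [[0] * 30 for _ in range(30)]   # one 30x30 grid as nested lists
print(printed_size_bytes(grid))        # size of a single grid dump
print(likely_exceeds(grid))            # one grid on its own
print(likely_exceeds([grid] * 2000))   # thousands of grids printed at once
```

A single grid is small; the problem comes from printing many of them in a tight loop.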

You said:
can I set it in my notebook?
ChatGPT said:

Short answer: not for the current session.
ServerApp.iopub_data_rate_limit is a server startup setting. You can write the config from a notebook, but you must restart the Jupyter server for it to take effect.

Do it from a notebook (persists for next restart)
Run this in a cell to create/update your config inside the active env:

import os
from pathlib import Path
from jupyter_core.paths import jupyter_config_dir

DATA_RATE = 1.0e9   # bytes/sec
WINDOW    = 10.0    # seconds

cfg_dir = Path(os.environ.get("JUPYTER_CONFIG_DIR") or jupyter_config_dir())
cfg_dir.mkdir(parents=True, exist_ok=True)

cfg_path = cfg_dir / "jupyter_server_config.py"  # JupyterLab / Jupyter Server
text = (
    "c = get_config()\n"
    f"c.ServerApp.iopub_data_rate_limit = {DATA_RATE}\n"
    f"c.ServerApp.rate_limit_window = {WINDOW}\n"
    # optional, sometimes useful:
    f"# c.ServerApp.iopub_msg_rate_limit = {DATA_RATE}\n"
)

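# Append instead of overwriting, so any settings already in the file are kept.
# Later assignments win, so re-running the cell is harmless.
with cfg_path.open("a", encoding="utf-8") as fh:
    fh.write("\n" + text)
print(f"Updated config at: {cfg_path}\nRestart Jupyter Server to apply.")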
If you’re on classic Notebook instead of Jupyter Server, write jupyter_notebook_config.py with:

c = get_config()
c.NotebookApp.iopub_data_rate_limit = 1.0e9
c.NotebookApp.rate_limit_window = 10.0
After running the cell, restart Jupyter Lab/Server (or in SageMaker Studio: Control Panel → Apps → Stop/Start JupyterServer).

If you can’t restart right now
Use one of these in-notebook workarounds to avoid hitting the limit:

Print less: keep your trimmed logger; avoid large .tolist() dumps.

Write to a file instead of the browser:

import json

with open("debug_dump.json", "w", encoding="utf-8") as f:
    json.dump(batch["train_inputs"][0].tolist(), f)
Silence output for a cell:

%%capture
# code that would spam output
Limit tensor/array printing:

import torch, numpy as np
torch.set_printoptions(edgeitems=2, threshold=1000, linewidth=120)
np.set_printoptions(edgeitems=2, threshold=1000, linewidth=120)
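For the "print less" option, one approach is to print a short summary of each grid instead of its full contents. This is a minimal sketch that works on plain nested lists, so it has no torch or numpy dependency. summarize_grid is a hypothetical helper, not part of your pipeline.

```python
from collections import Counter


def summarize_grid(grid, max_rows=3, max_cols=8):
    """Return a short text summary of a 2D list instead of its full contents."""
    height = len(grid)
    width = max((len(row) for row in grid), default=0)
    # Count how often each cell value occurs across the whole grid
    counts = Counter(value for row in grid for value in row)
    lines = [f"shape=({height}, {width}) values={dict(sorted(counts.items()))}"]
    # Show only the top-left corner of the grid
    for row in grid[:max_rows]:
        shown = " ".join(str(v) for v in row[:max_cols])
        lines.append(shown + (" ..." if len(row) > max_cols else ""))
    if height > max_rows:
        lines.append(f"... ({height - max_rows} more rows)")
    return "\n".join(lines)


example = [[1, 2, 0, 0], [3, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(summarize_grid(example, max_rows=2, max_cols=3))
```

With tensors, pass the list form, for example summarize_grid(batch["train_inputs"][0].tolist()).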
If you tell me whether you’re on JupyterLab locally, VS Code, or SageMaker Studio, I can tailor the exact restart steps.

You said:
is there some workaround way I can do?
ChatGPT said:

Yep—there are a few workarounds you can do right now in-notebook that avoid touching the server config:

1) Send the heavy output to a file (best fix)
No output goes over IOPub, so you won’t hit the limit.

from contextlib import redirect_stdout
from pathlib import Path

Path("logs").mkdir(exist_ok=True)

with open("logs/batch_dump.txt", "w", encoding="utf-8") as f, redirect_stdout(f):
    # >>> put the noisy prints here <<<
    for batch in arc_loader:
        print("ID:", batch["id"])
        # ... any other huge prints ...

print("Wrote logs/batch_dump.txt")
Then preview a tiny tail in the notebook:

print("\n".join(Path("logs/batch_dump.txt").read_text(encoding="utf-8").splitlines()[-50:]))
2) Capture cell output, then show just the tail
This keeps the flood off the browser during the run.

%%capture cap --no-stderr
# >>> put noisy prints here <<<
for batch in arc_loader:
    print("ID:", batch["id"])
    # ... huge prints ...

# New cell:
from pathlib import Path
Path("logs").mkdir(exist_ok=True)
Path("logs/run.log").write_text(cap.stdout, encoding="utf-8")
print("\n".join(cap.stdout.splitlines()[-50:]))
3) Throttle printing to stay under the limit
If you must print a big blob, print in chunks with a small sleep so bytes/sec < limit.

import time, json

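def rate_limited_print(obj, limit_bps=1_000_000, chunk=50_000, safety=0.5):
    """Print JSON/text in chunks, targeting roughly safety * limit_bps bytes/sec."""
    s = obj if isinstance(obj, str) else json.dumps(obj)
    # len(s) counts characters, which equals bytes for ASCII output such as JSON numbers.
    # Aiming below the limit leaves headroom for message overhead.
    pause = chunk / (safety * float(limit_bps))  # seconds between chunks
    for i in range(0, len(s), chunk):
        print(s[i:i+chunk], end="", flush=True)
        time.sleep(pause)
    print()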

# example
# rate_limited_print(batch["train_inputs"][0].tolist())
4) Overwrite instead of appending (much less output)
Show only a small summary and update it each loop:

from IPython.display import clear_output

for i, batch in enumerate(arc_loader, 1):
    clear_output(wait=True)
    print(f"Sample {i}/{len(arc_dataset)}")
    print(f"#train={batch['train_inputs'].shape[0]}  #test={batch['test_inputs'].shape[0]}")
5) Make big objects smaller before printing
import numpy as np, torch
np.set_printoptions(edgeitems=2, threshold=200, linewidth=120)
torch.set_printoptions(edgeitems=2, threshold=200, linewidth=120)
Pick (1) or (2) if you just need the data saved; they’re the most reliable. Use (3) only if you truly have to stream a large blob to the notebook UI.
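For option (1), the tail preview reads the whole log into memory first, which can be slow for a very large dump. The sketch below is one alternative that keeps only the last n lines while streaming through the file. tail_lines is a hypothetical helper, and the temporary file only stands in for logs/batch_dump.txt.

```python
import tempfile
from collections import deque
from pathlib import Path


def tail_lines(path, n=50):
    """Return the last n lines of a text file without holding the whole file in memory."""
    with open(path, "r", encoding="utf-8") as fh:
        # deque with maxlen discards older lines as new ones are read
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


# Demo on a throwaway file
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "batch_dump.txt"
    log.write_text("\n".join(f"line {i}" for i in range(1000)), encoding="utf-8")
    print("\n".join(tail_lines(log, n=3)))
```

In the notebook you would call tail_lines("logs/batch_dump.txt", n=50) instead.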

You said:
IOPub data rate exceeded.
The Jupyter server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
--ServerApp.iopub_data_rate_limit.

Current values:
ServerApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
ServerApp.rate_limit_window=3.0 (secs)
ChatGPT said:

Yep — print it once and keep it tiny. Replace your loop with this single-sample summary:

# Print ONCE: counts, original sizes, padded sizes, and one example pair
for batch in arc_loader:
    num_train = int(batch["train_inputs"].shape[0])
    num_test  = int(batch["test_inputs"].shape[0])

    # original (max over pairs before padding)
    train_orig_h, train_orig_w = map(int, batch["train_original_size"].tolist())
    test_orig_h,  test_orig_w  = map(int, batch["test_original_size"].tolist())

    # new padded sizes (actual tensor shapes)
    train_in_h, train_in_w   = batch["train_inputs"].shape[1], batch["train_inputs"].shape[2]
    train_out_h, train_out_w = batch["train_outputs"].shape[1], batch["train_outputs"].shape[2]
    test_in_h,  test_in_w    = batch["test_inputs"].shape[1], batch["test_inputs"].shape[2]
    test_out_h, test_out_w   = batch["test_outputs"].shape[1], batch["test_outputs"].shape[2]

    print("=== SUMMARY (single sample) ===")
    print(f"#train: {num_train} | #test: {num_test}")
    print(f"Train original size (max): ({train_orig_h}, {train_orig_w})")
    print(f"Train padded sizes — input: ({train_in_h}, {train_in_w}), output: ({train_out_h}, {train_out_w})")
    print(f"Test  original size (max): ({test_orig_h}, {test_orig_w})")
    print(f"Test  padded sizes — input: ({test_in_h}, {test_in_w}), output: ({test_out_h}, {test_out_w})")

    # one example from each split (if present)
    if num_train > 0:
        print("\n--- Example TRAIN pair [0] ---")
        print("input:", batch["train_inputs"][0].tolist())
        print("output:", batch["train_outputs"][0].tolist())
    if num_test > 0:
        print("\n--- Example TEST pair [0] ---")
        print("input:", batch["test_inputs"][0].tolist())
        print("output:", batch["test_outputs"][0].tolist())

    break  # <<< ensure we only print once
If you still hit the IOPub limit, dump to a file and print a tiny tail:

from contextlib import redirect_stdout
from pathlib import Path

Path("logs").mkdir(exist_ok=True)
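with open("logs/summary.txt", "w", encoding="utf-8") as f, redirect_stdout(f):
    # same loop body as the summary above; keep the break so it runs once
    for batch in arc_loader:
        print("ID:", batch["id"])
        # ... remaining summary prints ...
        break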

print("\n".join(Path("logs/summary.txt").read_text(encoding="utf-8").splitlines()[-60:]))
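If you end up doing this in several cells, a file-backed logger may be easier to reuse than wrapping each cell in redirect_stdout. This is a minimal sketch with the standard logging module. make_file_logger and the logger name arc_debug are hypothetical, and the temporary directory only keeps the demo self-contained.

```python
import logging
import tempfile
from pathlib import Path


def make_file_logger(path, name="arc_debug"):
    """Create a logger that writes to `path` only, so nothing reaches the notebook output."""
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False               # keep records away from root/notebook handlers
    for handler in list(logger.handlers):  # avoid duplicate handlers when a cell is re-run
        logger.removeHandler(handler)
        handler.close()
    handler = logging.FileHandler(path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


# Demo: log one summary line to a throwaway file, then read it back
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "logs" / "summary.log"
    logger = make_file_logger(log_path)
    logger.info("#train: %d | #test: %d", 3, 1)
    print(log_path.read_text(encoding="utf-8"), end="")
    for handler in list(logger.handlers):  # release the file before cleanup
        logger.removeHandler(handler)
        handler.close()
```

In the notebook you would pass a path such as "logs/summary.log" and replace the noisy print calls with logger.info.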
You said:
make the dataloader come from a class

import os
import sys
import json
from pathlib import Path

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


def load_jsons_from_folder(dir_path):
    """
    Read every .json file under dir_path (recursively) and return a dict
    keyed by the file's relative path (without the .json extension).
    """
    root = Path(dir_path).expanduser().resolve()
    files = sorted(p for p in root.rglob("*.json") if p.is_file())

    if not files:
        raise FileNotFoundError(f"No .json files found under: {root}")

    data = {}
    for p in files:
        key = str(p.relative_to(root).with_suffix(""))  # e.g. "subdir/file"
        try:
            with p.open("r", encoding="utf-8") as fh:
                data[key] = json.load(fh)
        except Exception as e:
            print(f"Failed to read {p}: {e}")

    if not data:
        raise FileNotFoundError(f"Unable to load any .json files under: {root}")

    return data



def _add_one_to_all_values_in_place(data):
    """
    Adds +1 to every scalar value in each input/output grid across all samples.
    Done BEFORE padding so pad_value=0 remains 0.
    """
    for sample in data.values():
        for split in ["train", "test"]:
            for pairs in sample.get(split, []):
                # input grid
                r = 0
                while r < len(pairs["input"]):
                    c = 0
                    row = pairs["input"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1
                # output grid
                r = 0
                while r < len(pairs["output"]):
                    c = 0
                    row = pairs["output"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1


def get_metrics(data):
    metric_dict = {
        "max_train_len": 0,
        "max_test_len": 0,
        "max_train_input_height": 0,
        "max_test_input_height": 0,
        "max_train_output_height": 0,
        "max_test_output_height": 0,
        "max_train_input_width": 0,
        "max_test_input_width": 0,
        "max_train_output_width": 0,
        "max_test_output_width": 0
    }

    for sample in data.values():
        if (len(sample['train']) > metric_dict['max_train_len']):
            metric_dict['max_train_len'] = len(sample['train'])
        if (len(sample['test']) > metric_dict['max_test_len']):
            metric_dict['max_test_len'] = len(sample['test'])
        for pairs in sample['train']:
            if (len(pairs['input']) > metric_dict['max_train_input_height']):
                metric_dict['max_train_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_train_output_height']):
                metric_dict['max_train_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_train_input_width']):
                    metric_dict['max_train_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_train_output_width']):
                    metric_dict['max_train_output_width'] = len(output)
        for pairs in sample['test']:
            if (len(pairs['input']) > metric_dict['max_test_input_height']):
                metric_dict['max_test_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_test_output_height']):
                metric_dict['max_test_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_test_input_width']):
                    metric_dict['max_test_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_test_output_width']):
                    metric_dict['max_test_output_width'] = len(output)
    return metric_dict


def pad_data(data, metric_dict=None, pad_value=0):
    """
    Pads each sample independently to its own max square size.
    metric_dict is ignored (kept for backward compatibility).
    """
    for sample in data.values():
        # ----- compute per-sample maxima for TRAIN -----
        max_train_input_height = 0
        max_train_input_width  = 0
        max_train_output_height = 0
        max_train_output_width  = 0

        for pairs in sample.get('train', []):
            if len(pairs['input'])  > max_train_input_height:  max_train_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_train_output_height: max_train_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_train_input_width:  max_train_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_train_output_width: max_train_output_width = len(outp)

        # ----- compute per-sample maxima for TEST -----
        max_test_input_height = 0
        max_test_input_width  = 0
        max_test_output_height = 0
        max_test_output_width  = 0

        for pairs in sample.get('test', []):
            if len(pairs['input'])  > max_test_input_height:  max_test_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_test_output_height: max_test_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_test_input_width:  max_test_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_test_output_width: max_test_output_width = len(outp)

        # ----- per-sample square sizes -----
        max_train_size = max(
            max_train_input_height,
            max_train_input_width,
            max_train_output_height,
            max_train_output_width
        )
        max_test_size = max(
            max_test_input_height,
            max_test_input_width,
            max_test_output_height,
            max_test_output_width
        )

        # ----- pad TRAIN for this sample -----
        for pairs in sample.get('train', []):
            # input
            while len(pairs['input']) < max_train_size:
                pairs['input'].append([pad_value] * max_train_size)
            for inp in pairs['input']:
                while len(inp) < max_train_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_train_size:
                pairs['output'].append([pad_value] * max_train_size)
            for outp in pairs['output']:
                while len(outp) < max_train_size:
                    outp.append(pad_value)

        # ----- pad TEST for this sample -----
        for pairs in sample.get('test', []):
            # input
            while len(pairs['input']) < max_test_size:
                pairs['input'].append([pad_value] * max_test_size)
            for inp in pairs['input']:
                while len(inp) < max_test_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_test_size:
                pairs['output'].append([pad_value] * max_test_size)
            for outp in pairs['output']:
                while len(outp) < max_test_size:
                    outp.append(pad_value)

    return data


def _infer_original_size_from_padded(grid, pad_value=0):
    h = 0
    w = 0
    r = 0
    while r < len(grid):
        row = grid[r]
        any_nonpad = False
        last_nonpad = -1
        c = 0
        while c < len(row):
            if row[c] != pad_value:
                any_nonpad = True
                last_nonpad = c
            c += 1
        if any_nonpad:
            if (r + 1) > h:
                h = r + 1
            if (last_nonpad + 1) > w:
                w = last_nonpad + 1
        r += 1
    return (h, w)


def build_sample_level_dataset(data, pad_value=0):
    """
    Build a list of per-sample records.
    NEW: also stores per-pair masks: 1 where value != pad_value, else 0.
    """
    dataset = []
    for sample_name, sample in data.items():
        # containers
        train_pairs = []
        test_pairs = []

        # track original (unpadded) sizes per split
        train_max_h = 0
        train_max_w = 0
        test_max_h = 0
        test_max_w = 0

        # ----- TRAIN -----
        idx = 0
        for pairs in sample['train']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            # original sizes (prefer stored, else infer)
            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            # update split-wide original size (max over inputs/outputs)
            if in_h > train_max_h: train_max_h = in_h
            if out_h > train_max_h: train_max_h = out_h
            if in_w > train_max_w: train_max_w = in_w
            if out_w > train_max_w: train_max_w = out_w

            # tensors
            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            # store pair
            train_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # ----- TEST -----
        idx = 0
        for pairs in sample['test']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            if in_h > test_max_h: test_max_h = in_h
            if out_h > test_max_h: test_max_h = out_h
            if in_w > test_max_w: test_max_w = in_w
            if out_w > test_max_w: test_max_w = out_w

            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            test_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # assemble sample-level record
        item = {
            "id": str(sample_name),
            "train_pairs": train_pairs,
            "test_pairs": test_pairs,
            "train_original_size": (train_max_h, train_max_w),
            "test_original_size": (test_max_h, test_max_w)
        }
        dataset.append(item)

    return dataset


class ARCSampleDataset(Dataset):
    def __init__(self, sample_list):
        self.data = sample_list

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]

        # stack per-sample pairs into tensors
        train_inputs = torch.stack([p["input"] for p in sample["train_pairs"]])      # [num_train, H, W]
        train_outputs = torch.stack([p["output"] for p in sample["train_pairs"]])    # [num_train, H, W]
        test_inputs = torch.stack([p["input"] for p in sample["test_pairs"]])        # [num_test, H, W]
        test_outputs = torch.stack([p["output"] for p in sample["test_pairs"]])      # [num_test, H, W]

        # NEW: stack masks alongside data
        train_input_masks = torch.stack([p["input_mask"] for p in sample["train_pairs"]])   # [num_train, H, W]
        train_output_masks = torch.stack([p["output_mask"] for p in sample["train_pairs"]]) # [num_train, H, W]
        test_input_masks  = torch.stack([p["input_mask"] for p in sample["test_pairs"]])    # [num_test, H, W]
        test_output_masks = torch.stack([p["output_mask"] for p in sample["test_pairs"]])   # [num_test, H, W]

        return {
            "id": sample["id"],
            "train_inputs": train_inputs,
            "train_outputs": train_outputs,
            "test_inputs": test_inputs,
            "test_outputs": test_outputs,
            "train_input_masks": train_input_masks,   # NEW
            "train_output_masks": train_output_masks, # NEW
            "test_input_masks": test_input_masks,     # NEW
            "test_output_masks": test_output_masks,   # NEW
            "train_original_size": torch.tensor(sample["train_original_size"], dtype=torch.long),
            "test_original_size": torch.tensor(sample["test_original_size"], dtype=torch.long)
        }

def arc_collate_fn_bs1(batch):
    # batch size is guaranteed to be 1; return the single dict unchanged
    return batch[0]


# ----------- main -----------
# Point to your local folder named "training" (sibling/relative to where you run the script)
folder_path = Path("~/ARC-AGI-Model/src/data_pipeline/ARC_data/data/training")
data = load_jsons_from_folder(folder_path)

# add +1 to every grid value BEFORE padding
_add_one_to_all_values_in_place(data)

metrics = get_metrics(data)
padded_data = pad_data(data, metrics, pad_value=0)
sample_level = build_sample_level_dataset(padded_data, pad_value=0)
arc_dataset = ARCSampleDataset(sample_list=sample_level)

arc_loader = DataLoader(
    arc_dataset,
    batch_size=1,
    shuffle=True,
    collate_fn=arc_collate_fn_bs1,
    num_workers=0,
    pin_memory=False
)

# Print a concise summary for a single sample, then stop
for batch in arc_loader:
    num_train = int(batch["train_inputs"].shape[0])
    num_test  = int(batch["test_inputs"].shape[0])

    # original (max over pairs before padding)
    train_orig_h, train_orig_w = map(int, batch["train_original_size"].tolist())
    test_orig_h,  test_orig_w  = map(int, batch["test_original_size"].tolist())

    # new padded sizes (actual tensor shapes)
    train_in_h, train_in_w   = batch["train_inputs"].shape[1], batch["train_inputs"].shape[2]
    train_out_h, train_out_w = batch["train_outputs"].shape[1], batch["train_outputs"].shape[2]
    test_in_h,  test_in_w    = batch["test_inputs"].shape[1], batch["test_inputs"].shape[2]
    test_out_h, test_out_w   = batch["test_outputs"].shape[1], batch["test_outputs"].shape[2]

    print("=== SUMMARY (single sample) ===")
    print(f"#train: {num_train} | #test: {num_test}")
    print(f"Train original size (max): ({train_orig_h}, {train_orig_w})")
    print(f"Train padded sizes — input: ({train_in_h}, {train_in_w}), output: ({train_out_h}, {train_out_w})")
    print(f"Test  original size (max): ({test_orig_h}, {test_orig_w})")
    print(f"Test  padded sizes — input: ({test_in_h}, {test_in_w}), output: ({test_out_h}, {test_out_w})")

    # one example from each split (if present)
    if num_train > 0:
        print("\n--- Example TRAIN pair [0] ---")
        print("input:", batch["train_inputs"][0].tolist())
        print("output:", batch["train_outputs"][0].tolist())
    if num_test > 0:
        print("\n--- Example TEST pair [0] ---")
        print("input:", batch["test_inputs"][0].tolist())
        print("output:", batch["test_outputs"][0].tolist())

    break  # <<< ensure we only print once

print(type(arc_loader))
ChatGPT said:

Here’s a tiny wrapper class that builds the DataLoader for you. Drop it in, then use ARCDataModule(...).prepare().get_loader() (or iterate over the module directly).

# --- NEW: simple data module that returns a DataLoader ---
class ARCDataModule:
    def __init__(
        self,
        dir_path,
        batch_size=1,
        shuffle=True,
        num_workers=0,
        pin_memory=False,
        pad_value=0,
    ):
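        # arc_collate_fn_bs1 returns batch[0], so a larger batch would silently drop samples
        if batch_size != 1:
            raise ValueError("ARCDataModule only supports batch_size=1")
        self.dir_path = Path(dir_path).expanduser().resolve()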
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.num_workers = num_workers
        self.pin_memory = pin_memory
        self.pad_value = pad_value

        self.dataset = None
        self._loader = None

    def prepare(self):
        # load + preprocess
        data = load_jsons_from_folder(self.dir_path)
        _add_one_to_all_values_in_place(data)
        # pad_data ignores metric_dict, so we can skip get_metrics()
        padded = pad_data(data, metric_dict=None, pad_value=self.pad_value)
        sample_list = build_sample_level_dataset(padded, pad_value=self.pad_value)

        # build dataset + loader
        self.dataset = ARCSampleDataset(sample_list=sample_list)
        self._loader = DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=self.shuffle,
            collate_fn=arc_collate_fn_bs1,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
        )
        return self  # allow chaining

    def get_loader(self):
        if self._loader is None:
            self.prepare()
        return self._loader

    # optional niceties: use the module like a loader directly
    def __iter__(self):
        return iter(self.get_loader())

    def __len__(self):
        return len(self.dataset) if self.dataset is not None else 0
Use it in your main section
# ----------- main -----------
folder_path = Path("~/ARC-AGI-Model/src/data_pipeline/ARC_data/data/training")

data_module = ARCDataModule(
    dir_path=folder_path,
    batch_size=1,
    shuffle=True,
    num_workers=0,
    pin_memory=False,
    pad_value=0,
).prepare()

arc_loader = data_module.get_loader()   # or: for batch in data_module: ...

# concise summary print (once)
for batch in arc_loader:
    num_train = int(batch["train_inputs"].shape[0])
    num_test  = int(batch["test_inputs"].shape[0])

    train_orig_h, train_orig_w = map(int, batch["train_original_size"].tolist())
    test_orig_h,  test_orig_w  = map(int, batch["test_original_size"].tolist())

    train_in_h, train_in_w   = batch["train_inputs"].shape[1], batch["train_inputs"].shape[2]
    train_out_h, train_out_w = batch["train_outputs"].shape[1], batch["train_outputs"].shape[2]
    test_in_h,  test_in_w    = batch["test_inputs"].shape[1], batch["test_inputs"].shape[2]
    test_out_h, test_out_w   = batch["test_outputs"].shape[1], batch["test_outputs"].shape[2]

    print("=== SUMMARY (single sample) ===")
    print(f"#train: {num_train} | #test: {num_test}")
    print(f"Train original size (max): ({train_orig_h}, {train_orig_w})")
    print(f"Train padded sizes — input: ({train_in_h}, {train_in_w}), output: ({train_out_h}, {train_out_w})")
    print(f"Test  original size (max): ({test_orig_h}, {test_orig_w})")
    print(f"Test  padded sizes — input: ({test_in_h}, {test_in_w}), output: ({test_out_h}, {test_out_w})")

    if num_train > 0:
        print("\n--- Example TRAIN pair [0] ---")
        print("input:", batch["train_inputs"][0].tolist())
        print("output:", batch["train_outputs"][0].tolist())
    if num_test > 0:
        print("\n--- Example TEST pair [0] ---")
        print("input:", batch["test_inputs"][0].tolist())
        print("output:", batch["test_outputs"][0].tolist())
    break
This keeps your existing functions intact, but now the DataLoader “comes from a class.” If you want a train/val/test split at the sample level later, we can extend the module with separate loaders.
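As a starting point for that sample-level split, the list returned by build_sample_level_dataset could be divided before it is wrapped in a dataset. This is a minimal sketch using only the standard library. split_samples, val_fraction, and seed are hypothetical names, and the dicts below only stand in for real sample records.

```python
import random


def split_samples(sample_list, val_fraction=0.1, seed=0):
    """Shuffle a copy of sample_list and split it into (train, val) lists."""
    if not 0.0 <= val_fraction < 1.0:
        raise ValueError("val_fraction must be in [0, 1)")
    shuffled = list(sample_list)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)     # seeded, so the split is reproducible
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


samples = [{"id": f"task_{i}"} for i in range(10)]  # stand-ins for sample records
train_samples, val_samples = split_samples(samples, val_fraction=0.2, seed=0)
print(len(train_samples), len(val_samples))
```

Each list could then go into its own ARCSampleDataset and DataLoader, with shuffle=False for the validation loader.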

You said:
recreate my entire file to include this

import os
import sys
import json
from pathlib import Path

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


def load_jsons_from_folder(dir_path):
    """
    Read every .json file under dir_path (recursively) and return a dict
    keyed by the file's relative path (without the .json extension).
    """
    root = Path(dir_path).expanduser().resolve()
    files = sorted(p for p in root.rglob("*.json") if p.is_file())

    if not files:
        raise FileNotFoundError(f"No .json files found under: {root}")

    data = {}
    for p in files:
        key = str(p.relative_to(root).with_suffix(""))  # e.g. "subdir/file"
        try:
            with p.open("r", encoding="utf-8") as fh:
                data[key] = json.load(fh)
        except Exception as e:
            print(f"Failed to read {p}: {e}")

    if not data:
        raise FileNotFoundError(f"Unable to load any .json files under: {root}")

    return data



def _add_one_to_all_values_in_place(data):
    """
    Adds +1 to every scalar value in each input/output grid across all samples.
    Done BEFORE padding so pad_value=0 remains 0.
    """
    for sample in data.values():
        for split in ["train", "test"]:
            for pairs in sample.get(split, []):
                # input grid
                r = 0
                while r < len(pairs["input"]):
                    c = 0
                    row = pairs["input"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1
                # output grid
                r = 0
                while r < len(pairs["output"]):
                    c = 0
                    row = pairs["output"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1


def get_metrics(data):
    metric_dict = {
        "max_train_len": 0,
        "max_test_len": 0,
        "max_train_input_height": 0,
        "max_test_input_height": 0,
        "max_train_output_height": 0,
        "max_test_output_height": 0,
        "max_train_input_width": 0,
        "max_test_input_width": 0,
        "max_train_output_width": 0,
        "max_test_output_width": 0
    }

    for sample in data.values():
        if (len(sample['train']) > metric_dict['max_train_len']):
            metric_dict['max_train_len'] = len(sample['train'])
        if (len(sample['test']) > metric_dict['max_test_len']):
            metric_dict['max_test_len'] = len(sample['test'])
        for pairs in sample['train']:
            if (len(pairs['input']) > metric_dict['max_train_input_height']):
                metric_dict['max_train_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_train_output_height']):
                metric_dict['max_train_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_train_input_width']):
                    metric_dict['max_train_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_train_output_width']):
                    metric_dict['max_train_output_width'] = len(output)
        for pairs in sample['test']:
            if (len(pairs['input']) > metric_dict['max_test_input_height']):
                metric_dict['max_test_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_test_output_height']):
                metric_dict['max_test_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_test_input_width']):
                    metric_dict['max_test_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_test_output_width']):
                    metric_dict['max_test_output_width'] = len(output)
    return metric_dict


def pad_data(data, metric_dict=None, pad_value=0):
    """
    Pads each sample independently to its own max square size.
    metric_dict is ignored (kept for backward compatibility).
    """
    for sample in data.values():
        # ----- compute per-sample maxima for TRAIN -----
        max_train_input_height = 0
        max_train_input_width  = 0
        max_train_output_height = 0
        max_train_output_width  = 0

        for pairs in sample.get('train', []):
            if len(pairs['input'])  > max_train_input_height:  max_train_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_train_output_height: max_train_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_train_input_width:  max_train_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_train_output_width: max_train_output_width = len(outp)

        # ----- compute per-sample maxima for TEST -----
        max_test_input_height = 0
        max_test_input_width  = 0
        max_test_output_height = 0
        max_test_output_width  = 0

        for pairs in sample.get('test', []):
            if len(pairs['input'])  > max_test_input_height:  max_test_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_test_output_height: max_test_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_test_input_width:  max_test_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_test_output_width: max_test_output_width = len(outp)

        # ----- per-sample square sizes -----
        max_train_size = max(
            max_train_input_height,
            max_train_input_width,
            max_train_output_height,
            max_train_output_width
        )
        max_test_size = max(
            max_test_input_height,
            max_test_input_width,
            max_test_output_height,
            max_test_output_width
        )

        # ----- pad TRAIN for this sample -----
        for pairs in sample.get('train', []):
            # input
            while len(pairs['input']) < max_train_size:
                pairs['input'].append([pad_value] * max_train_size)
            for inp in pairs['input']:
                while len(inp) < max_train_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_train_size:
                pairs['output'].append([pad_value] * max_train_size)
            for outp in pairs['output']:
                while len(outp) < max_train_size:
                    outp.append(pad_value)

        # ----- pad TEST for this sample -----
        for pairs in sample.get('test', []):
            # input
            while len(pairs['input']) < max_test_size:
                pairs['input'].append([pad_value] * max_test_size)
            for inp in pairs['input']:
                while len(inp) < max_test_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_test_size:
                pairs['output'].append([pad_value] * max_test_size)
            for outp in pairs['output']:
                while len(outp) < max_test_size:
                    outp.append(pad_value)

    return data


def _infer_original_size_from_padded(grid, pad_value=0):
    h = 0
    w = 0
    r = 0
    while r < len(grid):
        row = grid[r]
        any_nonpad = False
        last_nonpad = -1
        c = 0
        while c < len(row):
            if row[c] != pad_value:
                any_nonpad = True
                last_nonpad = c
            c += 1
        if any_nonpad:
            if (r + 1) > h:
                h = r + 1
            if (last_nonpad + 1) > w:
                w = last_nonpad + 1
        r += 1
    return (h, w)


def build_sample_level_dataset(data, pad_value=0):
    """
    Build a list of per-sample records.
    NEW: also stores per-pair masks: 1 where value != pad_value, else 0.
    """
    dataset = []
    for sample_name, sample in data.items():
        # containers
        train_pairs = []
        test_pairs = []

        # track original (unpadded) sizes per split
        train_max_h = 0
        train_max_w = 0
        test_max_h = 0
        test_max_w = 0

        # ----- TRAIN -----
        idx = 0
        for pairs in sample['train']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            # original sizes (prefer stored, else infer)
            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            # update split-wide original size (max over inputs/outputs)
            if in_h > train_max_h: train_max_h = in_h
            if out_h > train_max_h: train_max_h = out_h
            if in_w > train_max_w: train_max_w = in_w
            if out_w > train_max_w: train_max_w = out_w

            # tensors
            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            # store pair
            train_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # ----- TEST -----
        idx = 0
        for pairs in sample['test']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            if in_h > test_max_h: test_max_h = in_h
            if out_h > test_max_h: test_max_h = out_h
            if in_w > test_max_w: test_max_w = in_w
            if out_w > test_max_w: test_max_w = out_w

            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            test_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,    # NEW
                "output_mask": out_mask    # NEW
            })
            idx += 1

        # assemble sample-level record
        item = {
            "id": str(sample_name),
            "train_pairs": train_pairs,
            "test_pairs": test_pairs,
            "train_original_size": (train_max_h, train_max_w),
            "test_original_size": (test_max_h, test_max_w)
        }
        dataset.append(item)

    return dataset


class ARCSampleDataset(Dataset):
    def __init__(self, sample_list):
        self.data = sample_list

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]

        # stack per-sample pairs into tensors
        train_inputs = torch.stack([p["input"] for p in sample["train_pairs"]])      # [num_train, H, W]
        train_outputs = torch.stack([p["output"] for p in sample["train_pairs"]])    # [num_train, H, W]
        test_inputs = torch.stack([p["input"] for p in sample["test_pairs"]])        # [num_test, H, W]
        test_outputs = torch.stack([p["output"] for p in sample["test_pairs"]])      # [num_test, H, W]

        # NEW: stack masks alongside data
        train_input_masks = torch.stack([p["input_mask"] for p in sample["train_pairs"]])   # [num_train, H, W]
        train_output_masks = torch.stack([p["output_mask"] for p in sample["train_pairs"]]) # [num_train, H, W]
        test_input_masks  = torch.stack([p["input_mask"] for p in sample["test_pairs"]])    # [num_test, H, W]
        test_output_masks = torch.stack([p["output_mask"] for p in sample["test_pairs"]])   # [num_test, H, W]

        return {
            "id": sample["id"],
            "train_inputs": train_inputs,
            "train_outputs": train_outputs,
            "test_inputs": test_inputs,
            "test_outputs": test_outputs,
            "train_input_masks": train_input_masks,   # NEW
            "train_output_masks": train_output_masks, # NEW
            "test_input_masks": test_input_masks,     # NEW
            "test_output_masks": test_output_masks,   # NEW
            "train_original_size": torch.tensor(sample["train_original_size"], dtype=torch.long),
            "test_original_size": torch.tensor(sample["test_original_size"], dtype=torch.long)
        }

def arc_collate_fn_bs1(batch):
    # batch size is guaranteed to be 1; return the single dict unchanged
    return batch[0]


# ----------- main -----------
# Point to your local folder named "training" (sibling/relative to where you run the script)
folder_path = Path("~/ARC-AGI-Model/src/data_pipeline/ARC_data/data/training")
data = load_jsons_from_folder(folder_path)

# add +1 to every grid value BEFORE padding
_add_one_to_all_values_in_place(data)

metrics = get_metrics(data)
padded_data = pad_data(data, metrics, pad_value=0)
sample_level = build_sample_level_dataset(padded_data, pad_value=0)
arc_dataset = ARCSampleDataset(sample_list=sample_level)

arc_loader = DataLoader(
    arc_dataset,
    batch_size=1,
    shuffle=True,
    collate_fn=arc_collate_fn_bs1,
    num_workers=0,
    pin_memory=False
)

# NEW: print *every* array for *every* sample (can be large!)
for batch in arc_loader:
    num_train = int(batch["train_inputs"].shape[0])
    num_test  = int(batch["test_inputs"].shape[0])

    # original (max over pairs before padding)
    train_orig_h, train_orig_w = map(int, batch["train_original_size"].tolist())
    test_orig_h,  test_orig_w  = map(int, batch["test_original_size"].tolist())

    # new padded sizes (actual tensor shapes)
    train_in_h, train_in_w   = batch["train_inputs"].shape[1], batch["train_inputs"].shape[2]
    train_out_h, train_out_w = batch["train_outputs"].shape[1], batch["train_outputs"].shape[2]
    test_in_h,  test_in_w    = batch["test_inputs"].shape[1], batch["test_inputs"].shape[2]
    test_out_h, test_out_w   = batch["test_outputs"].shape[1], batch["test_outputs"].shape[2]

    print("=== SUMMARY (single sample) ===")
    print(f"#train: {num_train} | #test: {num_test}")
    print(f"Train original size (max): ({train_orig_h}, {train_orig_w})")
    print(f"Train padded sizes — input: ({train_in_h}, {train_in_w}), output: ({train_out_h}, {train_out_w})")
    print(f"Test  original size (max): ({test_orig_h}, {test_orig_w})")
    print(f"Test  padded sizes — input: ({test_in_h}, {test_in_w}), output: ({test_out_h}, {test_out_w})")

    # one example from each split (if present)
    if num_train > 0:
        print("\n--- Example TRAIN pair [0] ---")
        print("input:", batch["train_inputs"][0].tolist())
        print("output:", batch["train_outputs"][0].tolist())
    if num_test > 0:
        print("\n--- Example TEST pair [0] ---")
        print("input:", batch["test_inputs"][0].tolist())
        print("output:", batch["test_outputs"][0].tolist())

    break  # <<< ensure we only print once

print(type(arc_loader))
ChatGPT said:

Here’s your file rebuilt so the DataLoader comes from a class (ARCDataModule), while keeping your existing logic intact and the concise, single-sample summary at the end.

import os
import sys
import json
from pathlib import Path

import torch
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader


# ----------------------------
# Load all JSONs (recursive)
# ----------------------------
def load_jsons_from_folder(dir_path):
    """
    Read every .json file under dir_path (recursively) and return a dict
    keyed by the file's relative path (without the .json extension).
    """
    root = Path(dir_path).expanduser().resolve()
    files = sorted(p for p in root.rglob("*.json") if p.is_file())

    if not files:
        raise FileNotFoundError(f"No .json files found under: {root}")

    data = {}
    for p in files:
        key = str(p.relative_to(root).with_suffix(""))  # e.g. "subdir/file"
        try:
            with p.open("r", encoding="utf-8") as fh:
                data[key] = json.load(fh)
        except Exception as e:
            print(f"Failed to read {p}: {e}")

    if not data:
        raise FileNotFoundError(f"Unable to load any .json files under: {root}")

    return data


# ----------------------------
# Preprocess helpers
# ----------------------------
def _add_one_to_all_values_in_place(data):
    """
    Adds +1 to every scalar value in each input/output grid across all samples.
    Done BEFORE padding so pad_value=0 remains 0.
    """
    for sample in data.values():
        for split in ["train", "test"]:
            for pairs in sample.get(split, []):
                # input grid
                r = 0
                while r < len(pairs["input"]):
                    c = 0
                    row = pairs["input"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1
                # output grid
                r = 0
                while r < len(pairs["output"]):
                    c = 0
                    row = pairs["output"][r]
                    while c < len(row):
                        row[c] = row[c] + 1
                        c += 1
                    r += 1


def get_metrics(data):
    metric_dict = {
        "max_train_len": 0,
        "max_test_len": 0,
        "max_train_input_height": 0,
        "max_test_input_height": 0,
        "max_train_output_height": 0,
        "max_test_output_height": 0,
        "max_train_input_width": 0,
        "max_test_input_width": 0,
        "max_train_output_width": 0,
        "max_test_output_width": 0
    }

    for sample in data.values():
        if (len(sample['train']) > metric_dict['max_train_len']):
            metric_dict['max_train_len'] = len(sample['train'])
        if (len(sample['test']) > metric_dict['max_test_len']):
            metric_dict['max_test_len'] = len(sample['test'])
        for pairs in sample['train']:
            if (len(pairs['input']) > metric_dict['max_train_input_height']):
                metric_dict['max_train_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_train_output_height']):
                metric_dict['max_train_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_train_input_width']):
                    metric_dict['max_train_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_train_output_width']):
                    metric_dict['max_train_output_width'] = len(output)
        for pairs in sample['test']:
            if (len(pairs['input']) > metric_dict['max_test_input_height']):
                metric_dict['max_test_input_height'] = len(pairs['input'])
            if (len(pairs['output']) > metric_dict['max_test_output_height']):
                metric_dict['max_test_output_height'] = len(pairs['output'])
            for inp in pairs['input']:
                if (len(inp) > metric_dict['max_test_input_width']):
                    metric_dict['max_test_input_width'] = len(inp)
            for output in pairs['output']:
                if (len(output) > metric_dict['max_test_output_width']):
                    metric_dict['max_test_output_width'] = len(output)
    return metric_dict


def pad_data(data, metric_dict=None, pad_value=0):
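    """
    Pads each sample independently: its train pairs to one square size and its
    test pairs to another, each the max height/width seen in that split.
    Mutates data in place and returns it.
    metric_dict is ignored (kept for backward compatibility).
    """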
    for sample in data.values():
        # ----- compute per-sample maxima for TRAIN -----
        max_train_input_height = 0
        max_train_input_width  = 0
        max_train_output_height = 0
        max_train_output_width  = 0

        for pairs in sample.get('train', []):
            if len(pairs['input'])  > max_train_input_height:  max_train_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_train_output_height: max_train_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_train_input_width:  max_train_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_train_output_width: max_train_output_width = len(outp)

        # ----- compute per-sample maxima for TEST -----
        max_test_input_height = 0
        max_test_input_width  = 0
        max_test_output_height = 0
        max_test_output_width  = 0

        for pairs in sample.get('test', []):
            if len(pairs['input'])  > max_test_input_height:  max_test_input_height  = len(pairs['input'])
            if len(pairs['output']) > max_test_output_height: max_test_output_height = len(pairs['output'])
            for inp in pairs['input']:
                if len(inp) > max_test_input_width:  max_test_input_width  = len(inp)
            for outp in pairs['output']:
                if len(outp) > max_test_output_width: max_test_output_width = len(outp)

        # ----- per-sample square sizes -----
        max_train_size = max(
            max_train_input_height,
            max_train_input_width,
            max_train_output_height,
            max_train_output_width
        )
        max_test_size = max(
            max_test_input_height,
            max_test_input_width,
            max_test_output_height,
            max_test_output_width
        )

        # ----- pad TRAIN for this sample -----
        for pairs in sample.get('train', []):
            # input
            while len(pairs['input']) < max_train_size:
                pairs['input'].append([pad_value] * max_train_size)
            for inp in pairs['input']:
                while len(inp) < max_train_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_train_size:
                pairs['output'].append([pad_value] * max_train_size)
            for outp in pairs['output']:
                while len(outp) < max_train_size:
                    outp.append(pad_value)

        # ----- pad TEST for this sample -----
        for pairs in sample.get('test', []):
            # input
            while len(pairs['input']) < max_test_size:
                pairs['input'].append([pad_value] * max_test_size)
            for inp in pairs['input']:
                while len(inp) < max_test_size:
                    inp.append(pad_value)
            # output
            while len(pairs['output']) < max_test_size:
                pairs['output'].append([pad_value] * max_test_size)
            for outp in pairs['output']:
                while len(outp) < max_test_size:
                    outp.append(pad_value)

    return data


def _infer_original_size_from_padded(grid, pad_value=0):
    h = 0
    w = 0
    r = 0
    while r < len(grid):
        row = grid[r]
        any_nonpad = False
        last_nonpad = -1
        c = 0
        while c < len(row):
            if row[c] != pad_value:
                any_nonpad = True
                last_nonpad = c
            c += 1
        if any_nonpad:
            if (r + 1) > h:
                h = r + 1
            if (last_nonpad + 1) > w:
                w = last_nonpad + 1
        r += 1
    return (h, w)


def build_sample_level_dataset(data, pad_value=0):
    """
    Build a list of per-sample records.
    NEW: also stores per-pair masks: 1 where value != pad_value, else 0.
    """
    dataset = []
    for sample_name, sample in data.items():
        # containers
        train_pairs = []
        test_pairs = []

        # track original (unpadded) sizes per split
        train_max_h = 0
        train_max_w = 0
        test_max_h = 0
        test_max_w = 0

        # ----- TRAIN -----
        idx = 0
        for pairs in sample['train']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            # original sizes (prefer stored, else infer)
            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            # update split-wide original size (max over inputs/outputs)
            if in_h > train_max_h: train_max_h = in_h
            if out_h > train_max_h: train_max_h = out_h
            if in_w > train_max_w: train_max_w = in_w
            if out_w > train_max_w: train_max_w = out_w

            # tensors
            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            # store pair
            train_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,
                "output_mask": out_mask
            })
            idx += 1

        # ----- TEST -----
        idx = 0
        for pairs in sample['test']:
            inp_grid = pairs['input']
            out_grid = pairs['output']

            if ('orig_input_size' in pairs):
                in_h, in_w = pairs['orig_input_size']
            else:
                in_h, in_w = _infer_original_size_from_padded(inp_grid, pad_value)
            if ('orig_output_size' in pairs):
                out_h, out_w = pairs['orig_output_size']
            else:
                out_h, out_w = _infer_original_size_from_padded(out_grid, pad_value)

            if in_h > test_max_h: test_max_h = in_h
            if out_h > test_max_h: test_max_h = out_h
            if in_w > test_max_w: test_max_w = in_w
            if out_w > test_max_w: test_max_w = out_w

            inp_tensor = torch.tensor(inp_grid).long()
            out_tensor = torch.tensor(out_grid).long()

            # NEW: masks (1 for non-pad, 0 for pad)
            inp_mask = (inp_tensor != pad_value).long()
            out_mask = (out_tensor != pad_value).long()

            test_pairs.append({
                "input": inp_tensor,
                "output": out_tensor,
                "input_mask": inp_mask,
                "output_mask": out_mask
            })
            idx += 1

        # assemble sample-level record
        item = {
            "id": str(sample_name),
            "train_pairs": train_pairs,
            "test_pairs": test_pairs,
            "train_original_size": (train_max_h, train_max_w),
            "test_original_size": (test_max_h, test_max_w)
        }
        dataset.append(item)

    return dataset


# ----------------------------
# Torch dataset
# ----------------------------
class ARCSampleDataset(Dataset):
    def __init__(self, sample_list):
        self.data = sample_list

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        sample = self.data[idx]

        # stack per-sample pairs into tensors
        train_inputs = torch.stack([p["input"] for p in sample["train_pairs"]])      # [num_train, H, W]
        train_outputs = torch.stack([p["output"] for p in sample["train_pairs"]])    # [num_train, H, W]
        test_inputs = torch.stack([p["input"] for p in sample["test_pairs"]])        # [num_test, H, W]
        test_outputs = torch.stack([p["output"] for p in sample["test_pairs"]])      # [num_test, H, W]

        # masks
        train_input_masks = torch.stack([p["input_mask"] for p in sample["train_pairs"]])
        train_output_masks = torch.stack([p["output_mask"] for p in sample["train_pairs"]])
        test_input_masks  = torch.stack([p["input_mask"] for p in sample["test_pairs"]])
        test_output_masks = torch.stack([p["output_mask"] for p in sample["test_pairs"]])

        return {
            "id": sample["id"],
            "train_inputs": train_inputs,
            "train_outputs": train_outputs,
            "test_inputs": test_inputs,
            "test_outputs": test_outputs,
            "train_input_masks": train_input_masks,
            "train_output_masks": train_output_masks,
            "test_input_masks": test_input_masks,
            "test_output_masks": test_output_masks,
            "train_original_size": torch.tensor(sample["train_original_size"], dtype=torch.long),
            "test_original_size": torch.tensor(sample["test_original_size"], dtype=torch.long)
        }


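def arc_collate_fn_bs1(batch):
    # Only valid for batch_size=1: samples are padded to different sizes, so they
    # cannot be stacked. Fail loudly instead of silently dropping extra samples.
    if len(batch) != 1:
        raise ValueError(f"arc_collate_fn_bs1 expects batch_size=1, got {len(batch)}")
    return batch[0]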


# ----------------------------
# NEW: Data module wrapper
# ----------------------------
class ARCDataModule:
    """
    Simple wrapper to produce a DataLoader from your folder.
    Usage:
        dm = ARCDataModule("~/path/to/training").prepare()
        loader = dm.get_loader()
        for batch in loader: ...
    """
    def __init__(
        self,
        dir_path,
        batch_size=1,
        shuffle=True,
        num_workers=0,
        pin_memory=False,
        pad_value=0,
    ):
        self.dir_path = Path(dir_path).expanduser().resolve()
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.num_workers = num_workers
        self.pin_memory = pin_memory
        self.pad_value = pad_value

        self.dataset = None
        self._loader = None

    def prepare(self):
        # load + preprocess
        data = load_jsons_from_folder(self.dir_path)
        _add_one_to_all_values_in_place(data)

        # pad each sample independently (metric_dict unused)
        padded = pad_data(data, metric_dict=None, pad_value=self.pad_value)
        sample_list = build_sample_level_dataset(padded, pad_value=self.pad_value)

        # build dataset + loader
        self.dataset = ARCSampleDataset(sample_list=sample_list)
        self._loader = DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            shuffle=self.shuffle,
            collate_fn=arc_collate_fn_bs1,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
        )
        return self  # allow chaining

    def get_loader(self):
        if self._loader is None:
            self.prepare()
        return self._loader

    # convenience so the module itself is iterable
    def __iter__(self):
        return iter(self.get_loader())

    def __len__(self):
        return len(self.dataset) if self.dataset is not None else 0


# ----------------------------
# Main
# ----------------------------
if __name__ == "__main__":
    # Point to your local folder named "training"
    folder_path = Path("~/ARC-AGI-Model/src/data_pipeline/ARC_data/data/training")

    data_module = ARCDataModule(
        dir_path=folder_path,
        batch_size=1,
        shuffle=True,
        num_workers=0,
        pin_memory=False,
        pad_value=0,
    ).prepare()

    arc_loader = data_module.get_loader()

    # Concise, single-sample summary to avoid IOPub overflow
    for batch in arc_loader:
        num_train = int(batch["train_inputs"].shape[0])
        num_test  = int(batch["test_inputs"].shape[0])

        # original (max over pairs before padding)
        train_orig_h, train_orig_w = map(int, batch["train_original_size"].tolist())
        test_orig_h,  test_orig_w  = map(int, batch["test_original_size"].tolist())

        # new padded sizes (actual tensor shapes)
        train_in_h, train_in_w   = batch["train_inputs"].shape[1], batch["train_inputs"].shape[2]
        train_out_h, train_out_w = batch["train_outputs"].shape[1], batch["train_outputs"].shape[2]
        test_in_h,  test_in_w    = batch["test_inputs"].shape[1], batch["test_inputs"].shape[2]
        test_out_h, test_out_w   = batch["test_outputs"].shape[1], batch["test_outputs"].shape[2]

        print("=== SUMMARY (single sample) ===")
        print(f"#train: {num_train} | #test: {num_test}")
        print(f"Train original size (max): ({train_orig_h}, {train_orig_w})")
        print(f"Train padded sizes — input: ({train_in_h}, {train_in_w}), output: ({train_out_h}, {train_out_w})")
        print(f"Test  original size (max): ({test_orig_h}, {test_orig_w})")
        print(f"Test  padded sizes — input: ({test_in_h}, {test_in_w}), output: ({test_out_h}, {test_out_w})")

        if num_train > 0:
            print("\n--- Example TRAIN pair [0] ---")
            print("input:", batch["train_inputs"][0].tolist())
            print("output:", batch["train_outputs"][0].tolist())
        if num_test > 0:
            print("\n--- Example TEST pair [0] ---")
            print("input:", batch["test_inputs"][0].tolist())
            print("output:", batch["test_outputs"][0].tolist())
        break

    print(type(arc_loader))
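
If it helps to see why the +1 shift happens before padding, here is a small pure-Python sketch on a toy grid. `shift_and_pad` and `mask_of` are hypothetical helpers written only for this illustration (they are not functions from the file above), and the sketch assumes `size` is at least as large as the grid's height and width.

```python
def shift_and_pad(grid, size, pad_value=0):
    """Add 1 to every cell, then pad to size x size with pad_value."""
    shifted = [[cell + 1 for cell in row] for row in grid]
    # pad each row on the right, then add full pad rows at the bottom
    padded = [row + [pad_value] * (size - len(row)) for row in shifted]
    padded += [[pad_value] * size for _ in range(size - len(padded))]
    return padded


def mask_of(grid, pad_value=0):
    """1 where the cell holds real data, 0 where it is padding."""
    return [[1 if cell != pad_value else 0 for cell in row] for row in grid]


toy_grid = [[0, 3], [5, 0]]           # original colour 0 is real data here
padded = shift_and_pad(toy_grid, size=3)
mask = mask_of(padded)
print(padded)
print(mask)
```

Because the original 0 cells become 1 before padding, the mask marks them as real data and only the appended cells stay 0.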




SyntaxError: invalid decimal literal (3881276199.py, line 8)