# Task
Add a new instance (represented by a JSON string) to the 'SEC-bench/Seed' dataset from Hugging Face, and then upload the updated dataset.

In [2]:
from datasets import load_dataset

**Reasoning**:
Now that `load_dataset` is imported, load the 'SEC-bench/Seed' dataset. Because no `split` is passed, the result is a `DatasetDict` (one `Dataset` per split), stored here as `dataset`.



In [3]:
dataset = load_dataset('SEC-bench/Seed')
print("Loaded 'SEC-bench/Seed' dataset.")

Loaded 'SEC-bench/Seed' dataset.


In [4]:
import pandas as pd

In [5]:
df = pd.DataFrame(dataset['cve'])

In [6]:
df.head(3)

,instance_id,repo,base_commit,date,project_name,lang,dockerfile,build_sh,work_dir,sanitizer,bug_description,additional_files,candidate_fixes
0,gpac.cve-2023-2838,gpac/gpac,ba59206b3225f0e8e95a27eff41cb1c49ddf9a37,2023-05-22 10:37:24,gpac,c++,FROM hwiwonlee/secb.base:latest\nRUN apt-get u...,#!/bin/bash -eu\n# Minimized build script with...,/src/gpac,address,================= Bug Report (1/1) ===========...,[],"[{'sha': '711e0988', 'url': None}, {'sha': 'c8..."
1,unicorn.cve-2022-29694,unicorn-engine/unicorn,cf18982e1c29d354805863a8e017cddd974e3114,2022-04-16 11:19:41,unicorn,c++,FROM hwiwonlee/secb.base:latest\nRUN apt-get u...,#!/bin/bash -eu\n# Minimized build script with...,/src/unicorn,address,================= Bug Report (1/1) ===========...,[],[{'sha': '3d3deac5e6d38602b689c4fef5dac004f07a...
2,njs.cve-2022-34031,nginx/njs,37dc1e788060ba17cdcd6e3fd2695177c9d7aa38,2022-06-20 23:38:49,njs,c++,FROM hwiwonlee/secb.base:latest\nRUN apt-get u...,#!/bin/bash -eu\n# Minimized build script with...,/src/njs,address,================= Bug Report (1/1) ===========...,[],[{'sha': 'c62a9fb92b102c90a66aa724cb9054183a33...


In [7]:
# Raw string so the \n sequences stay as JSON escapes instead of becoming literal newlines
json_row = r'''
{
  "instance_id": "gpac.cve-2025-7797",
  "repo": "gpac/gpac",
  "base_commit": "153ea314b6b053db17164f8bc3c7e1e460938eaa",
  "date": "2025-11-08T00:00:00",
  "project_name": "gpac",
  "lang": "c",
  "dockerfile": "FROM hwiwonlee/secb.base:latest\nRUN apt-get update\nRUN apt-get install -y pkg-config wget\nRUN git clone https://github.com/gpac/gpac gpac\nRUN git -C gpac checkout 153ea314b6b053db17164f8bc3c7e1e460938eaa\nWORKDIR $SRC/gpac\nCOPY build.sh $SRC/",
  "build_sh": "#!/bin/bash -eu\n# Minimized build script with only core build commands\nset -eu\n./configure --enable-debug\nmake -j$(nproc)",
  "work_dir": "/src/gpac",
  "sanitizer": "address",
  "bug_description": "================= Bug Report ==================\n## Source: NVD CVE-2025-7797\n## CVE ID: CVE-2025-7797\n## Title: Null Pointer Dereference in GPAC\n## Description:\nA vulnerability was found in GPAC up to 2.4. It has been rated as problematic. Affected by this issue is the function gf_dash_download_init_segment of the file src/media_tools/dash_client.c. The manipulation of the argument base_init_url leads to null pointer dereference. The attack may be launched remotely. The exploit has been disclosed to the public and may be used.\n\n## Vulnerability Details\n- **Affected Component**: gf_dash_download_init_segment function\n- **Affected File**: src/media_tools/dash_client.c\n- **Vulnerable Parameter**: base_init_url\n- **Issue Type**: Null pointer dereference\n- **Attack Vector**: Remote\n- **Status**: Exploit publicly disclosed\n\n## Patch Information\nPatch commit: 153ea314b6b053db17164f8bc3c7e1e460938eaa",
  "additional_files": [
    {
      "filename": "default.options",
      "content": "[libfuzzer]\ndetect_leaks=0\n"
    }
  ],
  "candidate_fixes": [
    {
      "sha": "153ea314b6b053db17164f8bc3c7e1e460938eaa",
      "url": "https://github.com/gpac/gpac/commit/153ea314b6b053db17164f8bc3c7e1e460938eaa"
    }
  ]
}
'''
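
Before appending, it can help to confirm that the string parses and carries the same fields as the existing rows. The sketch below is a minimal check under stated assumptions: `EXPECTED_FIELDS` is copied from the column names in the `df.head(3)` output and is assumed to be the full schema, `validate_record` is a hypothetical helper (not part of `datasets` or pandas), and `sample` is a shortened stand-in for `json_row` with made-up values.

```python
import json

# Column names taken from the df.head(3) output (assumed to be the full schema)
EXPECTED_FIELDS = {
    "instance_id", "repo", "base_commit", "date", "project_name", "lang",
    "dockerfile", "build_sh", "work_dir", "sanitizer", "bug_description",
    "additional_files", "candidate_fixes",
}
LIST_FIELDS = ("additional_files", "candidate_fixes")


def validate_record(raw):
    """Parse one JSON record and check its keys and list-valued fields."""
    # strict=False tolerates literal control characters (such as newlines) inside strings
    record = json.loads(raw, strict=False)
    keys = set(record)
    missing = EXPECTED_FIELDS - keys
    extra = keys - EXPECTED_FIELDS
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field in LIST_FIELDS:
        if not isinstance(record[field], list):
            raise TypeError(f"{field} must be a list, got {type(record[field]).__name__}")
    return record


# Shortened stand-in for json_row (hypothetical values, not a real instance)
scalars = {name: "" for name in EXPECTED_FIELDS - set(LIST_FIELDS)}
sample = json.dumps(
    {**scalars, "additional_files": [], "candidate_fixes": [{"sha": "abc", "url": None}]}
)
checked = validate_record(sample)
print(len(checked), "fields validated")
```

Calling `validate_record(json_row)` in the notebook would raise early if a field name were misspelled or a list field were given as a bare object.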

In [8]:
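# Add json_row to df as one row; strict=False accepts literal newlines inside JSON strings
import json
new_instance = pd.DataFrame([json.loads(json_row, strict=False)])
# Keep 'date' consistent with the existing column if it holds timestamps
if pd.api.types.is_datetime64_any_dtype(df['date']):
    new_instance['date'] = pd.to_datetime(new_instance['date'])
df = pd.concat([df, new_instance], ignore_index=True)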
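
When a single JSON record is converted to a DataFrame, list-valued fields such as `candidate_fixes` need care. As a small illustration with a made-up record (the values are hypothetical), building the frame from a dict of columns expands a multi-element list into several rows, whereas wrapping the record in a list keeps it as one row with the list stored in a single cell.

```python
import pandas as pd

# Hypothetical record with two candidate fixes
record = {
    "instance_id": "demo.cve-0000-0000",
    "candidate_fixes": [{"sha": "aaa", "url": None}, {"sha": "bbb", "url": None}],
}

# Dict of columns: the scalar is broadcast and the list becomes two rows
as_columns = pd.DataFrame(record)

# List of records: one row, with the whole list kept in one cell
as_row = pd.DataFrame([record])

print(len(as_columns), len(as_row))
print(type(as_row.loc[0, "candidate_fixes"]).__name__)
```

The instance added here has exactly one entry in each list field, so the difference does not show up as extra rows, but it would for an instance with two or more candidate fixes.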

In [13]:
from datasets import Dataset as HFDataset

# Ensure list-like columns never mix list and non-list values (avoids pyarrow's ArrowInvalid)
def ensure_list(val):
    """Normalize a cell value to a plain Python list."""
    # Return lists as-is
    if isinstance(val, list):
        return val
    # Wrap a single dict into a one-element list
    if isinstance(val, dict):
        return [val]
    # Convert other iterable containers (tuples, sets, NumPy arrays) to a list;
    # strings and bytes are treated as scalars
    if hasattr(val, '__iter__') and not isinstance(val, (str, bytes)):
        return list(val)
    # Treat None/NaN as an empty list (val is a scalar here, so pd.isna returns a bool)
    if val is None or pd.isna(val):
        return []
    # Fallback: wrap any other scalar in a list
    return [val]

for col in ('additional_files', 'candidate_fixes'):
    if col in df.columns:
        df[col] = df[col].apply(ensure_list)

# convert the pandas DataFrame to a Hugging Face Dataset and replace the 'cve' split
hf_cve = HFDataset.from_pandas(df.reset_index(drop=True))
dataset['cve'] = hf_cve
print("Added new instance to the dataset.")

# expose the updated dataset dict for the next cell that pushes to hub
hf_dataset = dataset

Added new instance to the dataset.
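
Columns whose cells mix Python types are one cause of `ArrowInvalid` during `Dataset.from_pandas`, which is what the `ensure_list` step guards against. A quick way to spot such columns before converting is to tally the types seen per column. The helper below, `find_mixed_columns`, is a hypothetical name used for illustration; it works on plain record dicts (for example the output of `df.to_dict('records')`) and is a sketch, not part of `datasets` or pandas.

```python
from collections import defaultdict


def find_mixed_columns(records):
    """Return {column: sorted type names} for columns holding more than one type."""
    seen = defaultdict(set)
    for row in records:
        for column, value in row.items():
            seen[column].add(type(value).__name__)
    return {col: sorted(types) for col, types in seen.items() if len(types) > 1}


# Hypothetical rows: the second one stores a bare dict instead of a list
rows = [
    {"instance_id": "a", "candidate_fixes": [{"sha": "111"}]},
    {"instance_id": "b", "candidate_fixes": {"sha": "222"}},
]
print(find_mixed_columns(rows))
```

An empty result means every column holds a single Python type; it does not guarantee that nested structures agree, so Arrow can still reject a column for other reasons.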


In [14]:
# Push the dataset to Hugging Face Hub
hf_dataset.push_to_hub('SongTonyLi/CVE_Instances', private=False)
print("Dataset uploaded to Hugging Face Hub.")

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 33.76ba/s]
Processing Files (1 / 1): 100%|██████████| 2.88MB / 2.88MB, 4.81MB/s  
New Data Upload: 100%|██████████| 2.88MB / 2.88MB, 4.81MB/s  
Uploading the dataset shards: 100%|██████████| 1/1 [00:01<00:00,  1.43s/ shards]


Dataset uploaded to Hugging Face Hub.
