Goal: This script randomly samples 60 holdout datasets from the 98 usable regression datasets in PMLB.

- The 98 datasets meet these criteria:
      - Fewer than 10,000 observations
      - Regression task
      - Fetchable through the pmlb package
      - Not already in my training data
            - These datasets were already in my training data: "4544_GeographicalOriginalofMusic", "solar_flare", "197_cpu_act", "573_cpu_act", "562_cpu_small", "227_cpu_small", "225_puma8NH"

- The remaining 38 datasets (98 - 60) will be used for further training/optimization of noisy forest

- The 60 holdout datasets will be the test datasets


Author: David Gormley

Date: 5/30/25

Structure

**1) Sample Datasets**

**2) Why only 98 datasets (vs. 138 on website)**

In [None]:
! pip install pmlb

Collecting pmlb
  Downloading pmlb-1.0.1.post3-py3-none-any.whl.metadata (1.7 kB)
Downloading pmlb-1.0.1.post3-py3-none-any.whl (19 kB)
Installing collected packages: pmlb
Successfully installed pmlb-1.0.1.post3


In [59]:
import random
import pandas as pd
from pmlb import dataset_names, fetch_data

# ─── 1) Settings ─────────────────────────────────────────────
RANDOM_SEED   = 42
HOLDOUT_SIZE  = 60
MAX_SAMPLES   = 10000

# ─── 2) Load metadata ────────────────────────────────────────
meta_url = "https://raw.githubusercontent.com/EpistasisLab/pmlb/master/pmlb/all_summary_stats.tsv"
meta = pd.read_csv(meta_url, sep="\t")

# Filter only regression datasets under sample cap
meta_filtered = meta[
    (meta.task == "regression") &
    (meta.n_instances < MAX_SAMPLES)
]

# ─── 3) Datasets to exclude entirely ─────────────────────────
EXCLUDE = {
    # Already in my training data
    "4544_GeographicalOriginalofMusic",
    "solar_flare",
    "197_cpu_act",
    "573_cpu_act",
    "562_cpu_small",
    "227_cpu_small",
    "225_puma8NH",
    # Listed on the PMLB website with a "_deprecated_" prefix
    "195_auto_price",
    "207_autoPrice",
}

# ─── 4) Remove excluded and check fetchability ───────────────
candidates = []
failed_fetch = []
for name in meta_filtered.dataset:
    if name in EXCLUDE:
        continue
    try:
        _ = fetch_data(name, return_X_y=False)
        candidates.append(name)
    except Exception:
        failed_fetch.append(name)

print(f"⚠️ Failed to fetch: {len(failed_fetch)} datasets")
print(f"✅ Usable candidates: {len(candidates)}")

# ─── 5) Random split ─────────────────────────────────────────
random.seed(RANDOM_SEED)
holdout = set(random.sample(candidates, HOLDOUT_SIZE))
train   = [n for n in candidates if n not in holdout]

# ─── 6) Save and Print ───────────────────────────────────────
pd.Series(sorted(holdout), name="holdout_datasets").to_csv("holdout_datasets.txt", index=False)
pd.Series(sorted(train), name="train_datasets").to_csv("train_datasets.txt", index=False)

print("\n✔︎ Hold-out datasets:\n", sorted(holdout))
print("\n✔︎ Train datasets:\n", sorted(train))


⚠️ Failed to fetch: 33 datasets
✅ Usable candidates: 98

✔︎ Hold-out datasets:
 ['1027_ESL', '1029_LEV', '1030_ERA', '1089_USCrime', '1096_FacultySalaries', '192_vineyard', '210_cloud', '294_satellite_image', '503_wind', '505_tecator', '522_pm10', '523_analcatdata_neavote', '527_analcatdata_election2000', '529_pollen', '542_pollution', '547_no2', '556_analcatdata_apnea2', '557_analcatdata_apnea1', '560_bodyfat', '561_cpu', '581_fri_c3_500_25', '582_fri_c1_500_25', '583_fri_c1_1000_50', '586_fri_c3_1000_25', '590_fri_c0_1000_50', '591_fri_c1_100_10', '592_fri_c4_1000_25', '594_fri_c2_100_5', '595_fri_c0_1000_10', '596_fri_c2_250_5', '599_fri_c2_1000_5', '603_fri_c0_250_50', '605_fri_c2_250_25', '611_fri_c3_100_5', '612_fri_c1_1000_5', '615_fri_c4_250_10', '616_fri_c4_500_50', '617_fri_c3_500_5', '621_fri_c0_100_10', '624_fri_c0_100_5', '627_fri_c2_500_10', '633_fri_c0_500_25', '634_fri_c2_100_10', '635_fri_c0_250_10', '641_fri_c1_500_10', '644_fri_c4_250_25', '646_fri_c3_500_10', '647_f
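
The split in step 5 is worth checking for two properties: the holdout and train lists should be disjoint, and the same seed should always give the same split. Below is a minimal sketch of such a check, assuming a hypothetical helper `split_holdout` and made-up `toy_*` dataset names. Unlike the cell above, it sorts the candidates before sampling so the result does not depend on the row order of the metadata file; applying that change to the real run would draw a different set of 60.

```python
import random

def split_holdout(candidates, holdout_size, seed=42):
    """Hypothetical helper: return (holdout, train) as sorted lists."""
    ordered = sorted(set(candidates))  # fixed order, duplicates dropped
    if holdout_size > len(ordered):
        raise ValueError("holdout_size exceeds the number of candidates")
    rng = random.Random(seed)  # local RNG; the global random state is untouched
    holdout = set(rng.sample(ordered, holdout_size))
    train = [name for name in ordered if name not in holdout]
    return sorted(holdout), train

# Made-up names standing in for the 98 usable candidates.
toy = [f"toy_{i:03d}" for i in range(98)]
holdout, train = split_holdout(toy, 60)

print(len(holdout), len(train))          # sizes of the two lists
print(set(holdout).isdisjoint(train))    # True when no dataset is in both
```

Because the helper sorts its input, passing the candidates in a different order with the same seed yields the same split.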

**2) Why only 98 datasets?**

- Expected 138 (the number listed on the website)?

    - 34 are missing from the API (currently deprecated)
          - 1 of these I have already trained on ("solar_flare"), so it is counted here and not in the 6 below
    - 6 datasets that are still available I have already trained on ("4544_GeographicalOriginalofMusic", "197_cpu_act", "573_cpu_act", "562_cpu_small", "227_cpu_small", "225_puma8NH")

    - 138 - 34 - 6 = 98
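
The bookkeeping above can be written out as a short check. This is only a sketch of the arithmetic using the counts stated in this notebook; the constant names are made up for illustration. The point it makes explicit is that "solar_flare" is both missing from the API and already trained on, so it must be subtracted only once.

```python
# Counts taken from the text above; the names are illustrative.
WEBSITE_TOTAL = 138        # datasets in the website list
MISSING_FROM_API = 34      # listed on the website, absent from the pmlb package
ALREADY_TRAINED = 7        # datasets already in my training data
TRAINED_AND_MISSING = 1    # "solar_flare" falls in both groups
HOLDOUT_SIZE = 60

# Avoid double counting: solar_flare is already part of the 34 missing.
trained_still_available = ALREADY_TRAINED - TRAINED_AND_MISSING

usable = WEBSITE_TOTAL - MISSING_FROM_API - trained_still_available
train_pool = usable - HOLDOUT_SIZE

print(usable, train_pool)  # 98 38
```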


In [54]:
import requests
import yaml
from pmlb import dataset_names

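def get_dataset_status(name):
    """Return the "status" field from a dataset's metadata.yaml on GitHub.

    Returns "missing" if the file cannot be downloaded and "unknown" if the
    file has no "status" key. Not called below; kept for spot checks.
    """
    url = f"https://raw.githubusercontent.com/EpistasisLab/pmlb/master/pmlb/datasets/{name}/metadata.yaml"
    r = requests.get(url, timeout=30)  # avoid hanging on a stalled request
    if r.status_code != 200:
        return "missing"
    meta = yaml.safe_load(r.text) or {}  # an empty file parses to None
    return meta.get("status", "unknown")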

# Regression datasets listed on the PMLB website (138 names)
website_datasets = ['1027_ESL',
 '1028_SWD',
 '1029_LEV',
 '1030_ERA',
 '1089_USCrime',
 '1096_FacultySalaries',
 '192_vineyard',
 '197_cpu_act',
 '210_cloud',
 '225_puma8NH',
 '227_cpu_small',
 '228_elusage',
 '229_pwLinear',
 '230_machine_cpu',
 '294_satellite_image',
 '4544_GeographicalOriginalofMusic',
 '485_analcatdata_vehicle',
 '503_wind',
 '505_tecator',
 '519_vinnie',
 '522_pm10',
 '523_analcatdata_neavote',
 '527_analcatdata_election2000',
 '529_pollen',
 '542_pollution',
 '547_no2',
 '556_analcatdata_apnea2',
 '557_analcatdata_apnea1',
 '560_bodyfat',
 '561_cpu',
 '562_cpu_small',
 '573_cpu_act',
 '579_fri_c0_250_5',
 '581_fri_c3_500_25',
 '582_fri_c1_500_25',
 '583_fri_c1_1000_50',
 '584_fri_c4_500_25',
 '586_fri_c3_1000_25',
 '588_fri_c4_1000_100',
 '589_fri_c2_1000_25',
 '590_fri_c0_1000_50',
 '591_fri_c1_100_10',
 '592_fri_c4_1000_25',
 '593_fri_c1_1000_10',
 '594_fri_c2_100_5',
 '595_fri_c0_1000_10',
 '596_fri_c2_250_5',
 '597_fri_c2_500_5',
 '598_fri_c0_1000_25',
 '599_fri_c2_1000_5',
 '601_fri_c1_250_5',
 '602_fri_c3_250_10',
 '603_fri_c0_250_50',
 '604_fri_c4_500_10',
 '605_fri_c2_250_25',
 '606_fri_c2_1000_10',
 '607_fri_c4_1000_50',
 '608_fri_c3_1000_10',
 '609_fri_c0_1000_5',
 '611_fri_c3_100_5',
 '612_fri_c1_1000_5',
 '613_fri_c3_250_5',
 '615_fri_c4_250_10',
 '616_fri_c4_500_50',
 '617_fri_c3_500_5',
 '618_fri_c3_1000_50',
 '620_fri_c1_1000_25',
 '621_fri_c0_100_10',
 '622_fri_c2_1000_50',
 '623_fri_c4_1000_10',
 '624_fri_c0_100_5',
 '626_fri_c2_500_50',
 '627_fri_c2_500_10',
 '628_fri_c3_1000_5',
 '631_fri_c1_500_5',
 '633_fri_c0_500_25',
 '634_fri_c2_100_10',
 '635_fri_c0_250_10',
 '637_fri_c1_500_50',
 '641_fri_c1_500_10',
 '643_fri_c2_500_25',
 '644_fri_c4_250_25',
 '645_fri_c3_500_50',
 '646_fri_c3_500_10',
 '647_fri_c1_250_10',
 '648_fri_c1_250_50',
 '649_fri_c0_500_5',
 '650_fri_c0_500_50',
 '651_fri_c0_100_25',
 '653_fri_c0_250_25',
 '654_fri_c0_500_10',
 '656_fri_c1_100_5',
 '657_fri_c2_250_10',
 '658_fri_c3_250_25',
 '659_sleuth_ex1714',
 '663_rabe_266',
 '665_sleuth_case2002',
 '666_rmftsa_ladata',
 '678_visualizing_environmental',
 '687_sleuth_ex1605',
 '690_visualizing_galaxy',
 '695_chatfield_4',
 '706_sleuth_case1202',
 '712_chscase_geyser1',
 '_deprecated_195_auto_price',
 '_deprecated_207_autoPrice',
 'auto_insurance_losses',
 'auto_insurance_price',
 'first_principles_absorption',
 'first_principles_bode',
 'first_principles_hubble',
 'first_principles_ideal_gas',
 'first_principles_kepler',
 'first_principles_leavitt',
 'first_principles_newton',
 'first_principles_planck',
 'first_principles_rydberg',
 'first_principles_schechter',
 'first_principles_supernovae_zg',
 'first_principles_supernovae_zr',
 'first_principles_tully_fisher',
 'nikuradse_1',
 'nikuradse_2',
 'solar_flare',
 'strogatz_bacres1',
 'strogatz_bacres2',
 'strogatz_barmag1',
 'strogatz_barmag2',
 'strogatz_glider1',
 'strogatz_glider2',
 'strogatz_lv1',
 'strogatz_lv2',
 'strogatz_predprey1',
 'strogatz_predprey2',
 'strogatz_shearflow1',
 'strogatz_shearflow2',
 'strogatz_vdp1',
 'strogatz_vdp2']

# Datasets available via the pmlb package
api_datasets = set(dataset_names)

# Datasets on the website but not in the API
missing_in_api = set(website_datasets) - api_datasets

print("Datasets listed on the website but missing in the API:")
for dataset in missing_in_api:
    print(f"  - {dataset}")


Datasets listed on the website but missing in the API:
  - first_principles_bode
  - strogatz_glider1
  - first_principles_absorption
  - first_principles_leavitt
  - first_principles_planck
  - first_principles_schechter
  - strogatz_bacres1
  - strogatz_lv1
  - first_principles_supernovae_zg
  - strogatz_bacres2
  - first_principles_hubble
  - first_principles_supernovae_zr
  - auto_insurance_price
  - first_principles_kepler
  - first_principles_ideal_gas
  - first_principles_newton
  - nikuradse_1
  - first_principles_tully_fisher
  - solar_flare
  - strogatz_glider2
  - strogatz_shearflow1
  - strogatz_predprey1
  - _deprecated_195_auto_price
  - strogatz_barmag1
  - strogatz_vdp1
  - strogatz_lv2
  - auto_insurance_losses
  - strogatz_predprey2
  - nikuradse_2
  - strogatz_shearflow2
  - strogatz_barmag2
  - first_principles_rydberg
  - strogatz_vdp2
  - _deprecated_207_autoPrice
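
The 34 names above are easier to read when grouped by family. The sketch below tallies them by name prefix; the `FAMILY_PREFIXES` tuple and the `family` helper are assumptions made for illustration (chosen by reading the names), not anything defined by pmlb.

```python
from collections import Counter

# The 34 names printed above, copied by hand.
missing_in_api = [
    "first_principles_absorption", "first_principles_bode",
    "first_principles_hubble", "first_principles_ideal_gas",
    "first_principles_kepler", "first_principles_leavitt",
    "first_principles_newton", "first_principles_planck",
    "first_principles_rydberg", "first_principles_schechter",
    "first_principles_supernovae_zg", "first_principles_supernovae_zr",
    "first_principles_tully_fisher",
    "strogatz_bacres1", "strogatz_bacres2", "strogatz_barmag1",
    "strogatz_barmag2", "strogatz_glider1", "strogatz_glider2",
    "strogatz_lv1", "strogatz_lv2", "strogatz_predprey1",
    "strogatz_predprey2", "strogatz_shearflow1", "strogatz_shearflow2",
    "strogatz_vdp1", "strogatz_vdp2",
    "nikuradse_1", "nikuradse_2",
    "auto_insurance_losses", "auto_insurance_price",
    "_deprecated_195_auto_price", "_deprecated_207_autoPrice",
    "solar_flare",
]

# Hypothetical family prefixes, picked by inspecting the names.
FAMILY_PREFIXES = (
    "first_principles", "strogatz", "nikuradse", "auto_insurance", "_deprecated",
)

def family(name):
    """Return the first matching prefix, or "other" if none matches."""
    for prefix in FAMILY_PREFIXES:
        if name.startswith(prefix):
            return prefix
    return "other"

counts = Counter(family(name) for name in missing_in_api)

# Largest families first, ties broken alphabetically.
for fam, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{fam:<18}{n}")
```

Most of the gap comes from two families: 14 `strogatz_*` and 13 `first_principles_*` datasets. Only "solar_flare" falls outside the listed prefixes.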
