Data Requirements and Type Detection

Connor Scully-Allison edited this page Jun 15, 2026 · 1 revision

Guidepost accepts a single pandas DataFrame, one row per record (e.g. one HPC job). On load, every column is validated, cleaned, and classified into a semantic type that decides where it can be used in the chart. This page explains what is accepted, how columns are classified, and the knobs you have to override the defaults.

See also: Configuration (mapping columns to the chart) and API Reference (load_data, constructor options).

Minimum shape

  • A pandas DataFrame.
  • At least 3 numeric columns (to fill the x, y, and color roles) and 2 categorical columns (to fill the facet and categorical-bar roles).
  • Datetime columns are supported and are typically used on the x-axis.

If your data has fewer than two usable categorical columns, Guidepost injects a synthetic "no grouping" column so the chart still renders; in the configuration dropdowns it appears as n/a. See Configuration.
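
Before loading, it can help to check a DataFrame against the minimums above. The sketch below is not part of Guidepost: `summarize_columns` and `meets_minimum_shape` are hypothetical helpers that group columns by raw pandas dtype only, so they ignore the name heuristics and overrides described later on this page.

```python
import pandas as pd
from pandas.api import types as ptypes


def summarize_columns(df: pd.DataFrame) -> dict:
    """Group column names by raw pandas dtype (a rough proxy, not Guidepost's classifier)."""
    groups = {"numeric": [], "label": [], "datetime": []}
    for name in df.columns:
        col = df[name]
        if ptypes.is_bool_dtype(col):
            groups["label"].append(name)       # booleans behave like labels
        elif ptypes.is_datetime64_any_dtype(col):
            groups["datetime"].append(name)
        elif ptypes.is_numeric_dtype(col) or col.dtype.kind == "m":
            groups["numeric"].append(name)     # timedeltas are converted to seconds on load
        else:
            groups["label"].append(name)       # strings, objects, pandas Categorical
    return groups


def meets_minimum_shape(df: pd.DataFrame) -> bool:
    """True if there are at least 3 numeric and 2 label-like columns."""
    groups = summarize_columns(df)
    return len(groups["numeric"]) >= 3 and len(groups["label"]) >= 2


jobs = pd.DataFrame({
    "runtime_s": [120.5, 98.0, 301.2],
    "nodes": [4, 8, 16],
    "energy_kj": [10.1, 22.4, 55.0],
    "user": ["alice", "bob", "alice"],
    "queue": ["batch", "debug", "batch"],
    "start": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
})
print(summarize_columns(jobs))
print(meets_minimum_shape(jobs))
```

A DataFrame that fails this check can still load; Guidepost fills in a missing categorical role with the synthetic n/a column described above.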

What happens to your data on load

load_data() runs a validation + cleaning pass and prints a report of anything it changes (suppress it with suppress_warnings=True):

| Situation | What Guidepost does |
| --- | --- |
| Column is entirely NaN/None | Dropped, reported under "na_columns". |
| Column has some nulls | Kept. Nulls are skipped per-axis at render time, not dropped. The per-column null counts are reported. |
| timedelta column | Converted to seconds (kept as a continuous variable), reported as converted. |
| Cell contains a raw numpy array / list | Dropped, unless you declare it as a list column (see below). |
| More than 250,000 rows | Kept, but a performance warning is printed. Consider subsampling toward < 200k rows. |

A synthetic gp_idx index column is added internally to track which rows you select; it is removed from any data you retrieve.
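
To make the table concrete, here is a minimal pandas sketch of three of those rules: dropping all-null columns, counting nulls, and converting timedeltas to seconds. It is an approximation, not Guidepost's implementation. The function name `clean_on_load` and the report keys other than "na_columns" are assumptions made for illustration.

```python
import pandas as pd


def clean_on_load(df: pd.DataFrame):
    """Return a cleaned copy plus a small report (hypothetical helper, not Guidepost's API)."""
    report = {"na_columns": [], "null_counts": {}, "converted": []}
    out = df.copy()
    for name in list(out.columns):
        col = out[name]
        if col.isna().all():
            # Entirely NaN/None: drop the column and record it.
            report["na_columns"].append(name)
            out = out.drop(columns=name)
            continue
        nulls = int(col.isna().sum())
        if nulls:
            # Some nulls: keep the column, only count them.
            report["null_counts"][name] = nulls
        if col.dtype.kind == "m":
            # timedelta64: convert to float seconds so it stays continuous.
            out[name] = col.dt.total_seconds()
            report["converted"].append(name)
    return out, report


raw = pd.DataFrame({
    "empty": [None, None, None],
    "wait": pd.to_timedelta(["1min", "2min", None]),
    "nodes": [4, 8, 16],
})
cleaned, report = clean_on_load(raw)
print(cleaned)
print(report)
```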

Semantic types

Each surviving column is classified as one of:

  • continuous — a quantitative measure (can be x, y, or color)
  • ordinal — a small-domain / integer measure
  • categorical — a label used for faceting or the categorical bar chart
  • temporal — datetime, used on the x-axis (grouped under "Temporal" in the config dropdowns)
  • list — a multi-valued cell (see List columns)

How the type is decided

Classification is applied in this order (first match wins):

  1. Declared list column → list (always treated as categorical, exploded to individual values). See below.
  2. Declared categorical_columns → categorical (your override beats every heuristic below).
  3. Pandas Categorical dtype → ordinal if the dtype is ordered=True, otherwise categorical.
  4. Boolean dtype → categorical.
  5. Numeric dtype:
    • ID-like name (e.g. JOB_ID, USER_GENID, a bare id/uid/uuid) → categorical. The value is treated as a label, not a measure, so it groups/filters instead of being binned. Common words that merely end in "id" — GRID, VALID, SOLID, HUMID, … — are not matched.
    • Integer dtype, or fewer than 20 unique values → ordinal (e.g. ratings, small counts).
    • Otherwise → continuous.
  6. Datetime / timedelta dtype → continuous (temporal; usable on the x-axis).
  7. String/object dtype:
    • If every value is a number with a K/M/B suffix (e.g. "1.5K", "2.3M") → parsed to a float and treated as continuous.
    • Otherwise → categorical.
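
The first-match-wins order above can be sketched as a single function. This is an illustrative reading of the rules, not Guidepost's code: `classify`, `looks_like_id`, and `COMMON_ID_WORDS` are hypothetical names, the stop-list contains only the four words named above (the real list is longer and not shown here), and the exact ID-name and K/M/B patterns are assumptions.

```python
import re
import pandas as pd
from pandas.api import types as ptypes

# Assumption: only the words named on this page; the real list is not reproduced here.
COMMON_ID_WORDS = {"grid", "valid", "solid", "humid"}
# Assumption: uppercase K/M/B suffix on a plain decimal number.
KMB_PATTERN = r"[+-]?\d+(?:\.\d+)?[KMB]"


def looks_like_id(name: str) -> bool:
    """Rough ID-name test: last name token ends in 'id' and is not a common word."""
    last_token = re.split(r"[^a-z0-9]+", name.lower())[-1]
    return last_token.endswith("id") and last_token not in COMMON_ID_WORDS


def classify(col: pd.Series, name: str, list_columns=(), categorical_columns=()) -> str:
    if name in list_columns:                                  # 1. declared list column
        return "list"
    if name in categorical_columns:                           # 2. declared override
        return "categorical"
    if isinstance(col.dtype, pd.CategoricalDtype):            # 3. pandas Categorical
        return "ordinal" if col.dtype.ordered else "categorical"
    if ptypes.is_bool_dtype(col):                             # 4. boolean
        return "categorical"
    if ptypes.is_numeric_dtype(col):                          # 5. numeric
        if looks_like_id(name):
            return "categorical"
        if ptypes.is_integer_dtype(col) or col.nunique() < 20:
            return "ordinal"
        return "continuous"
    if col.dtype.kind in "Mm":                                # 6. datetime / timedelta
        return "continuous"
    values = col.dropna().astype(str)                         # 7. string / object
    if len(values) and values.str.fullmatch(KMB_PATTERN).all():
        return "continuous"
    return "categorical"


print(classify(pd.Series([101, 102, 103]), "JOB_ID"))
print(classify(pd.Series([1, 2, 3]), "GRID"))
print(classify(pd.Series(["1.5K", "2.3M"]), "size"))
```

Because the checks run top to bottom, moving a column into `categorical_columns` short-circuits every dtype-based rule beneath it.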

The widget further separates datetime columns into a "Temporal" group and list columns into a "List" group in the configuration dropdowns, even though both are continuous/categorical under the hood. See Configuration.
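
The K/M/B rule for string columns (step 7) could be implemented along these lines. This is a sketch under stated assumptions: the multipliers (K = 1e3, M = 1e6, B = 1e9) and the uppercase-only pattern are guesses, since this page does not spell them out, and `parse_suffixed` and `parse_column` are hypothetical names.

```python
import re
from typing import List, Optional

# Assumption: K, M, B stand for thousand, million, billion.
SUFFIX_MULTIPLIER = {"K": 1e3, "M": 1e6, "B": 1e9}
SUFFIXED_NUMBER = re.compile(r"([+-]?\d+(?:\.\d+)?)([KMB])")


def parse_suffixed(value: str) -> Optional[float]:
    """Return the float for strings like '1.5K', or None if the string does not match."""
    match = SUFFIXED_NUMBER.fullmatch(value.strip())
    if match is None:
        return None
    return float(match.group(1)) * SUFFIX_MULTIPLIER[match.group(2)]


def parse_column(values: List[str]) -> Optional[List[float]]:
    """Parse a whole column; return None unless every value matches."""
    parsed = [parse_suffixed(v) for v in values]
    if not parsed or any(p is None for p in parsed):
        return None  # one non-matching value keeps the column categorical
    return parsed


print(parse_column(["1.5K", "2M", "3B"]))
print(parse_column(["1.5K", "large"]))
```

The all-or-nothing check mirrors the "every value" wording in step 7: a single free-text value leaves the column as categorical.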

List columns (multi-valued cells)

Some HPC fields hold several values per row — for example a node list like ['x1008c0s0b0n0', 'x1008c0s0b1n0'] or a delimited string "x1008c0s0b0n0,x1008c0s0b1n0". Declare these as list columns so Guidepost explodes them into individual values:

from guidepost import Guidepost

gp = Guidepost(
    list_columns=['LOCATION', 'nodelist'],
    list_delimiter=','     # used only to split delimited strings; default ','
)
gp.records = df            # df: your pandas DataFrame, one row per record

Guidepost accepts, per cell:

  • a Python list/tuple/ndarray,
  • a stringified Python literal ("['a', 'b']"), parsed with ast.literal_eval,
  • a delimited string ("a,b,c"), split on list_delimiter.

Values within a cell are de-duplicated (order preserved) so a record is never double-counted. Genuine array-typed columns (e.g. from Parquet list types) are auto-detected as list columns even if you don't declare them.
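
The three accepted cell forms and the order-preserving de-duplication can be approximated as below. This is a sketch, not Guidepost's parser: `parse_list_cell` is a hypothetical helper, and the fallback from a failed literal parse to delimiter splitting is an assumption.

```python
import ast
import numpy as np
import pandas as pd


def parse_list_cell(cell, delimiter=","):
    """Normalize one cell to a de-duplicated list, preserving first-seen order."""
    if cell is None or (isinstance(cell, float) and cell != cell):
        return []                                   # None / NaN: no values
    if isinstance(cell, np.ndarray):
        items = cell.tolist()
    elif isinstance(cell, (list, tuple)):
        items = list(cell)
    elif isinstance(cell, str):
        text = cell.strip()
        items = None
        if text.startswith(("[", "(")):
            try:
                literal = ast.literal_eval(text)    # stringified Python literal
                if isinstance(literal, (list, tuple)):
                    items = list(literal)
            except (ValueError, SyntaxError):
                items = None                        # fall back to delimiter splitting
        if items is None:
            items = [part.strip() for part in text.split(delimiter) if part.strip()]
    else:
        items = [cell]                              # scalar: treat as a one-item list
    return list(dict.fromkeys(items))               # de-duplicate, keep order


jobs = pd.DataFrame({
    "job": [1, 2],
    "nodelist": ["x1008c0s0b0n0,x1008c0s0b1n0", "['x1008c0s0b0n0', 'x1008c0s0b0n0']"],
})
jobs["nodelist"] = jobs["nodelist"].map(parse_list_cell)
exploded = jobs.explode("nodelist")                 # one row per (job, node) pair
print(exploded)
```

De-duplicating before exploding is what keeps a job that lists the same node twice from being counted twice for that node.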

When a list column is placed on the x-axis, its values get an intelligent ordering so related values sit next to each other, and extra strips/arcs appear under the heatmap. See Main Summary View Heatmap for those interactions.

Useful overrides

| Goal | How |
| --- | --- |
| Force a numeric column to be a label (categorical) the name heuristic missed | Guidepost(categorical_columns=['SOME_CODE']) |
| Parse multi-valued cells | Guidepost(list_columns=[...], list_delimiter=',') |
| Quiet the load report | Set gp.suppress_warnings = True before calling gp.load_data(df) |

YYYYMMDD "date-ID" columns

Integer columns that encode a date as YYYYMMDD (e.g. START_DATE_ID = 20241231) are automatically converted to real datetimes so they drive an ordered temporal axis instead of being binned numerically. Detection is strict: every non-null value must be a whole number in the range 19000101–21001231 and must decode to a valid calendar date (so 20241350 disqualifies the column). Columns you list in categorical_columns are left untouched (your override wins).
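
The strict detection described above can be sketched with pandas as follows. `decode_date_id` is a hypothetical helper written for illustration, not Guidepost's function. It assumes a unique index and returns None whenever any non-null value fails one of the checks.

```python
import pandas as pd


def decode_date_id(col: pd.Series):
    """Return a datetime Series if every non-null value is a valid YYYYMMDD, else None."""
    values = col.dropna()
    if values.empty or not pd.api.types.is_numeric_dtype(values):
        return None
    if not (values % 1 == 0).all():
        return None                                  # must be whole numbers
    as_int = values.astype("int64")
    if not as_int.between(19000101, 21001231).all():
        return None                                  # outside the accepted range
    parsed = pd.to_datetime(as_int.astype(str), format="%Y%m%d", errors="coerce")
    if parsed.isna().any():
        return None                                  # e.g. 20241350 is not a real date
    return parsed.reindex(col.index)                 # nulls come back as NaT


good = decode_date_id(pd.Series([20241231, 20240101]))
bad = decode_date_id(pd.Series([20241231, 20241350]))
print(good)
print(bad)
```

Returning None for the whole column, rather than coercing the bad values to NaT, matches the rule that a single invalid date disqualifies the column.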


Next: Configuration · Understanding the Views · API Reference
