Data Requirements and Type Detection

Guidepost accepts a single pandas DataFrame, one row per record (e.g. one HPC job). On load, every column is validated, cleaned, and classified into a semantic type that decides where it can be used in the chart. This page explains what is accepted, how columns are classified, and the knobs you have to override the defaults.

See also: Configuration (mapping columns to the chart) and API Reference (load_data, constructor options).

Minimum shape

A pandas DataFrame.
At least 3 numeric columns (to fill the x, y, and color roles) and 2 categorical columns (to fill the facet and categorical-bar roles).
Datetime columns are supported and are typically used on the x-axis.

If your data has fewer than two usable categorical columns, Guidepost injects a synthetic "no grouping" column so the chart still renders; in the configuration dropdowns it appears as n/a. See Configuration.
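As a rough pre-flight check, the minimum shape above can be verified with plain pandas before handing a frame to Guidepost. This is only a sketch: the column names and values are made up for illustration, and the counts ignore Guidepost's own heuristics (for example, ID-like numeric names being treated as labels).

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical job records; names and values are illustrative only.
df = pd.DataFrame({
    "runtime_s": [120.5, 3600.0, 45.2, 980.1],         # numeric
    "nodes": [1, 16, 2, 8],                            # numeric
    "energy_kj": [10.2, 950.7, 3.1, 210.4],            # numeric
    "partition": ["debug", "batch", "debug", "gpu"],   # label
    "user": ["alice", "bob", "alice", "carol"],        # label
    "submit_time": pd.to_datetime(
        ["2024-01-01 08:00", "2024-01-01 09:30",
         "2024-01-02 10:15", "2024-01-03 11:45"]
    ),                                                 # datetime
})

# Booleans count as numeric in pandas, so exclude them explicitly.
numeric_cols = [
    c for c in df.columns
    if ptypes.is_numeric_dtype(df[c]) and not ptypes.is_bool_dtype(df[c])
]
datetime_cols = [c for c in df.columns if ptypes.is_datetime64_any_dtype(df[c])]
label_cols = [c for c in df.columns if c not in numeric_cols + datetime_cols]

meets_minimum = len(numeric_cols) >= 3 and len(label_cols) >= 2
print(numeric_cols, label_cols, datetime_cols, meets_minimum)
```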

What happens to your data on load

load_data() runs a validation + cleaning pass and prints a report of anything it changes (suppress it with suppress_warnings=True):

Situation	What Guidepost does
Column is entirely `NaN`/`None`	Dropped, reported under "na_columns".
Column has some nulls	Kept. Rows are not dropped; a null is skipped at render time only on the axis that uses that column. The per-column null counts are reported.
`timedelta` column	Converted to seconds (kept as a continuous variable), reported as converted.
Cell contains a raw `numpy` array / list	Dropped — unless you declare it as a list column (see below).
More than 250,000 rows	Kept, but a performance warning is printed. Consider subsampling to fewer than 200,000 rows.

A synthetic gp_idx index column is added internally to track which rows you select; it is removed from any data you retrieve.
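To preview what the load pass is likely to do to a frame, the rules in the table can be approximated in plain pandas. This is a sketch, not Guidepost's implementation: `clean_for_load` and the report keys other than "na_columns" are hypothetical names, and the handling of raw array/list cells is left out.

```python
import numpy as np
import pandas as pd

def clean_for_load(df: pd.DataFrame):
    """Illustrative approximation of the documented cleaning rules."""
    report = {"na_columns": [], "converted": [], "null_counts": {},
              "performance_warning": len(df) > 250_000}
    out = df.copy()
    for col in list(out.columns):
        series = out[col]
        if series.isna().all():
            # Entirely null: drop the column and record it.
            out = out.drop(columns=col)
            report["na_columns"].append(col)
            continue
        if pd.api.types.is_timedelta64_dtype(series):
            # timedelta -> seconds, kept as a continuous variable.
            out[col] = series.dt.total_seconds()
            report["converted"].append(col)
        n_null = int(out[col].isna().sum())
        if n_null:
            # Partially null columns are kept; only the count is reported.
            report["null_counts"][col] = n_null
    return out, report

jobs = pd.DataFrame({
    "wait": pd.Series([pd.Timedelta(seconds=90), pd.NaT, pd.Timedelta(minutes=10)]),
    "empty": [np.nan, np.nan, np.nan],
    "nodes": [1, 4, 2],
})
cleaned, report = clean_for_load(jobs)
print(report)
```

If the performance warning applies, `df.sample(n=200_000, random_state=0)` is one simple way to subsample before loading.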

Semantic types

Each surviving column is classified as one of:

continuous — a quantitative measure (can be x, y, or color)
ordinal — a small-domain / integer measure
categorical — a label used for faceting or the categorical bar chart
temporal — datetime, used on the x-axis (grouped under "Temporal" in the config dropdowns)
list — a multi-valued cell (see List columns)

How the type is decided

Classification is applied in this order (first match wins):

Declared list column → list (always treated as categorical, exploded to individual values). See below.
Declared categorical_columns → categorical (your override beats every heuristic below).
Pandas Categorical dtype → ordinal if the dtype is ordered=True, otherwise categorical.
Boolean dtype → categorical.
Numeric dtype:
- ID-like name (e.g. JOB_ID, USER_GENID, a bare id/uid/uuid) → categorical. The value is treated as a label, not a measure, so it groups/filters instead of being binned. Common words that merely end in "id" — GRID, VALID, SOLID, HUMID, … — are not matched.
- Integer dtype, or fewer than 20 unique values → ordinal (e.g. ratings, small counts).
- Otherwise → continuous.
Datetime / timedelta dtype → continuous. Datetimes are shown as temporal and are usable on the x-axis; timedeltas have already been converted to seconds on load.
String/object dtype:
- If every value is a number with a K/M/B suffix (e.g. "1.5K", "2.3M") → parsed to a float and treated as continuous.
- Otherwise → categorical.
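The K/M/B string rule can be sketched as follows. The multipliers (K = 1e3, M = 1e6, B = 1e9), the case-insensitive match, and the decision to let nulls pass through are assumptions made for this example; `parse_suffixed` is a hypothetical helper, not part of Guidepost.

```python
import re
import pandas as pd

# Assumed multipliers; the page does not spell them out.
_SUFFIX = {"K": 1e3, "M": 1e6, "B": 1e9}
_PATTERN = re.compile(r"^\s*([+-]?\d+(?:\.\d+)?)\s*([KMB])\s*$", re.IGNORECASE)

def parse_suffixed(series: pd.Series):
    """Return a float Series if every non-null value looks like '1.5K', else None."""
    parsed = []
    seen_value = False
    for value in series:
        if pd.isna(value):
            parsed.append(float("nan"))
            continue
        match = _PATTERN.match(str(value))
        if match is None:
            return None  # one non-conforming value disqualifies the column
        seen_value = True
        parsed.append(float(match.group(1)) * _SUFFIX[match.group(2).upper()])
    if not seen_value:
        return None
    return pd.Series(parsed, index=series.index, dtype="float64")

sizes = parse_suffixed(pd.Series(["1.5K", "2.3M"]))   # parsed, so continuous
labels = parse_suffixed(pd.Series(["1.5K", "big"]))   # None, so categorical
```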

The widget further separates datetime columns into a "Temporal" group and list columns into a "List" group in the configuration dropdowns, even though both are continuous/categorical under the hood. See Configuration.
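The first-match-wins order in this section can be sketched as a single function. This is an illustration, not Guidepost's code: `classify` and `looks_like_id` are hypothetical, the ID-name test (last underscore-separated token ends in "id" and is not on a small exclusion list) is a crude stand-in for the real heuristic, and the K/M/B string check is omitted for brevity.

```python
import pandas as pd
from pandas.api import types as ptypes

# Hypothetical exclusion list, taken from the examples on this page.
_NOT_IDS = {"grid", "valid", "solid", "humid"}

def looks_like_id(name: str) -> bool:
    token = name.lower().split("_")[-1]
    return token.endswith("id") and token not in _NOT_IDS

def classify(series, name, list_columns=(), categorical_columns=()):
    if name in list_columns:
        return "list"
    if name in categorical_columns:
        return "categorical"          # user override beats every heuristic
    if isinstance(series.dtype, pd.CategoricalDtype):
        return "ordinal" if series.dtype.ordered else "categorical"
    if ptypes.is_bool_dtype(series):
        return "categorical"
    if ptypes.is_numeric_dtype(series):
        if looks_like_id(name):
            return "categorical"      # a label, not a measure
        if ptypes.is_integer_dtype(series) or series.nunique() < 20:
            return "ordinal"
        return "continuous"
    if ptypes.is_datetime64_any_dtype(series) or ptypes.is_timedelta64_dtype(series):
        return "continuous"
    return "categorical"              # strings and anything else

print(classify(pd.Series([101, 102, 103]), "JOB_ID"))
print(classify(pd.Series([i + 0.5 for i in range(30)]), "runtime_s"))
```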

List columns (multi-valued cells)

Some HPC fields hold several values per row — for example a node list like ['x1008c0s0b0n0', 'x1008c0s0b1n0'] or a delimited string "x1008c0s0b0n0,x1008c0s0b1n0". Declare these as list columns so Guidepost explodes them into individual values:

from guidepost import Guidepost

gp = Guidepost(
    list_columns=['LOCATION', 'nodelist'],
    list_delimiter=','     # used only to split delimited strings; default ','
)
gp.records = df            # df: your pandas DataFrame, one row per record

Guidepost accepts, per cell:

a Python list/tuple/ndarray,
a stringified Python literal ("['a', 'b']"), parsed with ast.literal_eval,
a delimited string ("a,b,c"), split on list_delimiter.

Values within a cell are de-duplicated (order preserved) so a record is never double-counted. Genuine array-typed columns (e.g. from Parquet list types) are auto-detected as list columns even if you don't declare them.
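The three accepted cell forms and the order-preserving de-duplication can be approximated like this. `normalize_cell` is a hypothetical helper written for this page; the real parser may differ in details such as whitespace trimming or how it treats nulls.

```python
import ast
import numpy as np
import pandas as pd

def normalize_cell(cell, delimiter=","):
    """Turn one cell into a de-duplicated list of values (order preserved)."""
    if cell is None or (isinstance(cell, float) and cell != cell):
        return []                                   # None / NaN -> no values
    if isinstance(cell, np.ndarray):
        items = cell.tolist()
    elif isinstance(cell, (list, tuple)):
        items = list(cell)
    elif isinstance(cell, str):
        text = cell.strip()
        items = None
        if text.startswith(("[", "(")):
            # Stringified Python literal, e.g. "['a', 'b']"
            try:
                literal = ast.literal_eval(text)
                if isinstance(literal, (list, tuple)):
                    items = list(literal)
            except (ValueError, SyntaxError):
                items = None
        if items is None:
            # Fall back to a delimited string, e.g. "a,b,c"
            items = [part.strip() for part in text.split(delimiter) if part.strip()]
    else:
        items = [cell]
    return list(dict.fromkeys(items))               # de-duplicate, keep order

df = pd.DataFrame({"job": [1, 2], "nodelist": ["n1,n2,n1", "['n3']"]})
exploded = df.assign(nodelist=df["nodelist"].map(normalize_cell)).explode("nodelist")
print(exploded)
```

After the explode step each row carries a single node, and job 1 appears once per distinct node rather than once per listed occurrence.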

When a list column is placed on the x-axis, its values are ordered so that related values sit next to each other, and extra strips/arcs appear under the heatmap. See Main Summary View Heatmap for those interactions.

Useful overrides

Goal	How
Treat a numeric column that the ID-name heuristic missed as a label (categorical)	`Guidepost(categorical_columns=['SOME_CODE'])`
Parse multi-valued cells	`Guidepost(list_columns=[...], list_delimiter=',')`
Quiet the load report	Set `gp.suppress_warnings = True` before calling `gp.load_data(df)`

YYYYMMDD "date-ID" columns

Integer columns that encode a date as YYYYMMDD (e.g. START_DATE_ID = 20241231) are automatically converted to real datetimes so they drive an ordered temporal axis instead of being binned numerically. Detection is strict: every non-null value must be a whole number in the range 19000101–21001231 and must decode to a valid calendar date (so 20241350 disqualifies the column). Columns you list in categorical_columns are left untouched (your override wins).
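The strict detection described above can be sketched in pandas as follows. `convert_date_id` is a hypothetical helper, not Guidepost's function; it assumes a numeric column and returns None whenever any non-null value fails the whole-number, range, or calendar check.

```python
import pandas as pd

def convert_date_id(series: pd.Series):
    """Return a datetime Series for a strict YYYYMMDD column, else None."""
    values = series.dropna()
    if values.empty or not pd.api.types.is_numeric_dtype(values):
        return None
    if not (values % 1 == 0).all():
        return None                                  # must be whole numbers
    if not values.between(19000101, 21001231).all():
        return None                                  # must be in range
    # Nulls stay null; everything else becomes an 8-character string.
    text = series.map(lambda v: None if pd.isna(v) else str(int(v)))
    parsed = pd.to_datetime(text, format="%Y%m%d", errors="coerce")
    if parsed[series.notna()].isna().any():
        return None                                  # e.g. 20241350 is not a date
    return parsed

good = convert_date_id(pd.Series([20241231, 20240101]))   # datetimes
bad = convert_date_id(pd.Series([20241231, 20241350]))    # None: month 13
```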

Next: Configuration · Understanding the Views · API Reference
