Data Requirements and Type Detection
Guidepost accepts a single pandas DataFrame, one row per record (e.g. one HPC job). On load, every column is validated, cleaned, and classified into a semantic type that decides where it can be used in the chart. This page explains what is accepted, how columns are classified, and the knobs you have to override the defaults.
See also: Configuration (mapping columns to the chart) and API Reference (load_data, constructor options).
- A `pandas` DataFrame.
- At least 3 numeric columns (to fill the x, y, and color roles) and 2 categorical columns (to fill the facet and categorical-bar roles).
- Datetime columns are supported and are typically used on the x-axis.
If your data has fewer than two usable categorical columns, Guidepost injects a synthetic "no grouping" column so the chart still renders; in the configuration dropdowns it appears as n/a. See Configuration.
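Before loading, it can help to check a DataFrame against the column counts above. The following is a minimal sketch using only pandas; `summarize_columns` is a hypothetical helper, not part of Guidepost, and its grouping is deliberately rough (Guidepost's own classifier may, for example, treat an ID-like numeric column as a label).

```python
import pandas as pd
from pandas.api import types as pdt


def summarize_columns(df: pd.DataFrame) -> dict:
    """Group column names by rough dtype family (a pre-check, not Guidepost's classifier)."""
    groups = {"numeric": [], "datetime": [], "label": []}
    for name in df.columns:
        col = df[name]
        if pdt.is_bool_dtype(col):
            groups["label"].append(name)       # booleans act as labels
        elif pdt.is_datetime64_any_dtype(col):
            groups["datetime"].append(name)
        elif pdt.is_numeric_dtype(col) or pdt.is_timedelta64_dtype(col):
            groups["numeric"].append(name)     # timedeltas count as measures
        else:
            groups["label"].append(name)       # strings, categoricals, objects
    return groups


# Illustrative data only; the column names are made up for this example.
jobs = pd.DataFrame({
    "runtime_s": [12.5, 300.0, 48.2],
    "nodes": [1, 4, 2],
    "energy_j": [1.1e3, 9.7e4, 5.0e3],
    "queue": ["debug", "batch", "batch"],
    "user": ["alice", "bob", "alice"],
})
summary = summarize_columns(jobs)
enough_columns = len(summary["numeric"]) >= 3 and len(summary["label"]) >= 2
print(summary, enough_columns)
```

If `enough_columns` is false only because of the label count, the chart should still render thanks to the synthetic "no grouping" column described above.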
load_data() runs a validation + cleaning pass and prints a report of anything it changes (suppress it with suppress_warnings=True):
| Situation | What Guidepost does |
|---|---|
| Column is entirely `NaN`/`None` | Dropped, reported under "na_columns". |
| Column has some nulls | Kept. Nulls are skipped per-axis at render time, not dropped. The per-column null counts are reported. |
| `timedelta` column | Converted to seconds (kept as a continuous variable), reported as converted. |
| Cell contains a raw numpy array / list | Dropped, unless you declare it as a list column (see below). |
| More than 250,000 rows | Kept, but a performance warning is printed. Consider subsampling toward < 200k rows. |
A synthetic gp_idx index column is added internally to track which rows you select; it is removed from any data you retrieve.
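To preview what the cleaning pass is likely to do to your data, the rules in the table can be approximated with plain pandas. This is a sketch under assumptions, not Guidepost's implementation: `clean_frame` is a hypothetical helper, and apart from "na_columns" the report keys are invented for illustration. Cells holding raw arrays or lists are not handled here.

```python
import pandas as pd
from pandas.api.types import is_timedelta64_dtype

ROW_WARNING_THRESHOLD = 250_000  # row count from the table above


def clean_frame(df: pd.DataFrame):
    """Approximate the documented cleaning rules; returns (cleaned_df, report)."""
    # "converted", "null_counts" and "performance_warning" are hypothetical keys.
    report = {"na_columns": [], "converted": [], "null_counts": {},
              "performance_warning": len(df) > ROW_WARNING_THRESHOLD}
    out = df.copy()
    for name in list(out.columns):
        col = out[name]
        if col.isna().all():
            # Entirely-null columns are dropped and reported.
            out = out.drop(columns=name)
            report["na_columns"].append(name)
            continue
        if is_timedelta64_dtype(col):
            # Timedeltas become float seconds and stay continuous.
            out[name] = col.dt.total_seconds()
            report["converted"].append(name)
        nulls = int(out[name].isna().sum())
        if nulls:
            # Partially-null columns are kept; only the count is recorded.
            report["null_counts"][name] = nulls
    return out, report


raw = pd.DataFrame({
    "empty": [None, None, None],
    "wait": pd.to_timedelta([30, 90, 120], unit="s"),
    "mem_gb": [1.5, None, 4.0],
})
cleaned, report = clean_frame(raw)
print(report)
```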
Each surviving column is classified as one of:
- continuous — a quantitative measure (can be x, y, or color)
- ordinal — a small-domain / integer measure
- categorical — a label used for faceting or the categorical bar chart
- temporal — datetime, used on the x-axis (grouped under "Temporal" in the config dropdowns)
- list — a multi-valued cell (see List columns)
Classification is applied in this order (first match wins):
- Declared list column → `list` (always treated as categorical, exploded to individual values). See below.
- Declared `categorical_columns` → `categorical` (your override beats every heuristic below).
- Pandas `Categorical` dtype → `ordinal` if the dtype is `ordered=True`, otherwise `categorical`.
- Boolean dtype → `categorical`.
- Numeric dtype:
  - ID-like name (e.g. `JOB_ID`, `USER_GENID`, a bare `id`/`uid`/`uuid`) → `categorical`. The value is treated as a label, not a measure, so it groups/filters instead of being binned. Common words that merely end in "id" (`GRID`, `VALID`, `SOLID`, `HUMID`, …) are not matched.
  - Integer dtype, or fewer than 20 unique values → `ordinal` (e.g. ratings, small counts).
  - Otherwise → `continuous`.
- Datetime / timedelta dtype → `continuous` (temporal; usable on the x-axis).
- String/object dtype:
  - If every value is a number with a `K`/`M`/`B` suffix (e.g. `"1.5K"`, `"2.3M"`) → parsed to a float and treated as `continuous`.
  - Otherwise → `categorical`.
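The first-match-wins order above can be sketched as a single function. This is an illustration under stated assumptions, not Guidepost's actual code: `classify_column`, the `ID_NAME` pattern, and the `SUFFIXED` pattern are all hypothetical, and the real name heuristic may match more or fewer names than this regex does.

```python
import re
import pandas as pd
from pandas.api import types as pdt

# Assumed ID heuristic: a bare id/uid/uuid, or a name ending in _ID, _GENID, etc.
ID_NAME = re.compile(r"(^|_)(id|uid|uuid|genid)$", re.IGNORECASE)
# Assumed shape of a suffixed number such as "1.5K" or "2.3M".
SUFFIXED = r"\s*-?\d+(?:\.\d+)?[KMB]\s*"


def classify_column(col: pd.Series, list_columns=(), categorical_columns=()) -> str:
    """Return a semantic type for one column, checking the rules in order."""
    name = str(col.name)
    if name in list_columns:
        return "list"
    if name in categorical_columns:
        return "categorical"              # user override beats the heuristics
    if isinstance(col.dtype, pd.CategoricalDtype):
        return "ordinal" if col.dtype.ordered else "categorical"
    if pdt.is_bool_dtype(col):
        return "categorical"
    if pdt.is_numeric_dtype(col):
        if ID_NAME.search(name):
            return "categorical"          # an ID is a label, not a measure
        if pdt.is_integer_dtype(col) or col.nunique() < 20:
            return "ordinal"
        return "continuous"
    if pdt.is_datetime64_any_dtype(col) or pdt.is_timedelta64_dtype(col):
        return "continuous"
    # String/object branch; ignoring nulls here is an assumption of this sketch.
    values = col.dropna().astype(str)
    if len(values) and values.str.fullmatch(SUFFIXED).all():
        return "continuous"
    return "categorical"


runtime = pd.Series([i + 0.5 for i in range(30)], name="runtime")
job_id = pd.Series([101, 102, 103], name="JOB_ID")
grid = pd.Series([1, 2, 3], name="GRID")
print(classify_column(runtime), classify_column(job_id), classify_column(grid))
```

Keeping the rules in one ordered chain of early returns mirrors the "first match wins" wording: moving a check up or down changes the result for columns that satisfy more than one rule.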
The widget further separates datetime columns into a "Temporal" group and list columns into a "List" group in the configuration dropdowns, even though both are continuous/categorical under the hood. See Configuration.
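For the string rule in the classification order, parsing a suffixed value such as "1.5K" might look like the sketch below. The multipliers are an assumption (decimal thousand, million, billion); this page does not state which scale Guidepost uses, and `parse_suffixed` and `parse_suffixed_column` are hypothetical names.

```python
import re

# Assumed multipliers: decimal thousand / million / billion.
MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9}
SUFFIXED = re.compile(r"\s*(-?\d+(?:\.\d+)?)([KMB])\s*")


def parse_suffixed(text: str) -> float:
    """Convert one suffixed string to a float, or raise ValueError."""
    match = SUFFIXED.fullmatch(text)
    if match is None:
        raise ValueError(f"not a K/M/B-suffixed number: {text!r}")
    number, suffix = match.groups()
    return float(number) * MULTIPLIERS[suffix]


def parse_suffixed_column(values):
    """Return a list of floats only if every string parses, otherwise None."""
    try:
        return [parse_suffixed(value) for value in values]
    except ValueError:
        return None  # one bad value means the column stays categorical


print(parse_suffixed_column(["2K", "3B"]))
print(parse_suffixed_column(["1.5K", "n/a"]))
```

The all-or-nothing return matches the "every value" wording of the rule: a single unparseable string leaves the whole column as a label.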
Some HPC fields hold several values per row — for example a node list like ['x1008c0s0b0n0', 'x1008c0s0b1n0'] or a delimited string "x1008c0s0b0n0,x1008c0s0b1n0". Declare these as list columns so Guidepost explodes them into individual values:
```python
gp = Guidepost(
    list_columns=['LOCATION', 'nodelist'],
    list_delimiter=','  # used only to split delimited strings; default ','
)
gp.records = df
```

Guidepost accepts, per cell:
- a Python `list`/`tuple`/`ndarray`,
- a stringified Python literal (`"['a', 'b']"`), parsed with `ast.literal_eval`,
- a delimited string (`"a,b,c"`), split on `list_delimiter`.
Values within a cell are de-duplicated (order preserved) so a record is never double-counted. Genuine array-typed columns (e.g. from Parquet list types) are auto-detected as list columns even if you don't declare them.
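The three accepted cell shapes and the order-preserving de-duplication can be illustrated with a small helper. This is a sketch of the described behavior, not Guidepost's parser; `parse_list_cell` is a hypothetical name, and falling back to delimiter splitting when `ast.literal_eval` fails is an assumption.

```python
import ast
import numpy as np


def parse_list_cell(cell, delimiter=","):
    """Normalize one cell to a list of unique values, first occurrence first."""
    if isinstance(cell, np.ndarray):
        items = cell.tolist()
    elif isinstance(cell, (list, tuple)):
        items = list(cell)
    elif isinstance(cell, str):
        text = cell.strip()
        parsed = None
        if text.startswith(("[", "(")):
            try:
                parsed = ast.literal_eval(text)   # e.g. "['a', 'b']"
            except (ValueError, SyntaxError):
                parsed = None                     # assumed fallback: split instead
        if isinstance(parsed, (list, tuple)):
            items = list(parsed)
        else:
            items = [part.strip() for part in text.split(delimiter) if part.strip()]
    else:
        items = [cell]                            # scalar: a one-element list
    # dict keys keep insertion order, so this de-duplicates without reordering.
    # Values must be hashable for this to work.
    return list(dict.fromkeys(items))


print(parse_list_cell("x1008c0s0b0n0,x1008c0s0b1n0,x1008c0s0b0n0"))
```

After normalizing every cell this way, `pandas.Series.explode` is one way to produce one row per individual value.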
When a list column is placed on the x-axis, its values get an intelligent ordering so related values sit next to each other, and extra strips/arcs appear under the heatmap. See Main Summary View Heatmap for those interactions.
| Goal | How |
|---|---|
| Force a numeric column to be a label (categorical) the name heuristic missed | Guidepost(categorical_columns=['SOME_CODE']) |
| Parse multi-valued cells | Guidepost(list_columns=[...], list_delimiter=',') |
| Quiet the load report | Set gp.suppress_warnings = True before calling gp.load_data(df) |
Integer columns that encode a date as YYYYMMDD (e.g. START_DATE_ID = 20241231) are automatically converted to real datetimes so they drive an ordered temporal axis instead of being binned numerically. Detection is strict: every non-null value must be a whole number in the range 19000101–21001231 and must decode to a valid calendar date (so 20241350 disqualifies the column). Columns you list in categorical_columns are left untouched (your override wins).
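The strict check described above can be approximated in pandas as follows. This is a sketch assuming the three stated conditions (whole numbers, range 19000101 to 21001231, valid calendar date); `decode_yyyymmdd` is a hypothetical helper and not Guidepost's own function.

```python
import pandas as pd
from pandas.api import types as pdt


def decode_yyyymmdd(col: pd.Series):
    """Return a datetime Series if every non-null value is a valid YYYYMMDD, else None."""
    values = col.dropna()
    if values.empty or pdt.is_bool_dtype(values) or not pdt.is_numeric_dtype(values):
        return None
    if not (values % 1 == 0).all():
        return None                                   # must be whole numbers
    ints = values.astype("int64")
    if not ints.between(19000101, 21001231).all():
        return None                                   # outside the accepted range
    parsed = pd.to_datetime(ints.astype(str), format="%Y%m%d", errors="coerce")
    if parsed.isna().any():
        return None                                   # e.g. month 13 or day 50
    return parsed.reindex(col.index)                  # nulls come back as NaT


good = decode_yyyymmdd(pd.Series([20241231, 20240101]))
bad = decode_yyyymmdd(pd.Series([20241231, 20241350]))
print(good)
print(bad)
```

Returning `None` for the whole column when any value fails reflects the all-or-nothing detection: a single value like 20241350 leaves the column numeric.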
Next: Configuration · Understanding the Views · API Reference