# Tutorial 1: Loading and Sampling Trajectory Data

Real-world mobility files vary widely in structure and formatting. Timestamps may be recorded as UNIX integers or ISO-formatted strings, with or without timezone offsets. Coordinate columns may follow different naming conventions, and files may be stored either as flat CSVs or as partitioned Parquet directories. This notebook demonstrates how `nomad.io.base` standardizes data loading across these variations using two example datasets: a CSV file (`gc-data.csv`) and a partitioned Parquet directory (`gc-data/`).

## Inspecting schemas
We start by inspecting the schemas of both datasets with `table_columns`, a helper function in nomad's `io` module. It reports column names for both flat files and partitioned datasets without reading the full contents into memory.

In [None]:
from nomad.io import base as loader

print(loader.table_columns("gc-data.csv", format="csv"))
print(loader.table_columns("gc-data/", format="parquet"))
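
For a flat CSV, the idea of reporting column names without loading the data can be illustrated with the standard library alone. The sketch below is not nomad's implementation; `csv_column_names` is a hypothetical helper and the sample columns are made up. It parses only the header row and leaves the data rows unread.

```python
import csv
import io


def csv_column_names(handle):
    """Return the header row of a CSV stream; data rows are never parsed."""
    reader = csv.reader(handle)
    return next(reader, [])  # empty list if the file has no header at all


# Illustrative in-memory file; the column names are hypothetical.
sample = io.StringIO(
    "identifier,unix_timestamp,device_lat,device_lon\n"
    "u1,1704114000,38.32,-36.66\n"
)
columns = csv_column_names(sample)
print(columns)
```

Partitioned Parquet datasets keep their schema in file metadata, so a reader can obtain column names there without touching the row data either.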

## Loading data 

Reading data with `pandas` or Parquet readers does not enforce any particular schema, but spatiotemporal data often contains columns that must follow specific formats. The `from_file` function applies consistent type casting: it converts temporal fields to `datetime` objects, casts coordinates to floats and UNIX timestamps to integers, and optionally creates a `tz_offset` column that stores the timezone offset when parsing datetime strings. Keeping the offset in its own column preserves compatibility with engines such as Spark, whose `Timestamp` type cannot store timezone information.
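
To see why a separate offset column is enough, consider a single ISO-formatted string with a UTC offset. The sketch below uses only the standard library and is not how `from_file` is implemented; `split_iso_timestamp` is a hypothetical helper, and storing the offset in seconds is an assumption made for illustration.

```python
from datetime import datetime, timezone


def split_iso_timestamp(text):
    """Split an ISO-8601 string with a UTC offset into (unix_seconds, offset_seconds)."""
    parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        raise ValueError("expected a UTC offset in the timestamp string")
    unix_seconds = int(parsed.timestamp())
    offset_seconds = int(parsed.utcoffset().total_seconds())
    return unix_seconds, offset_seconds


unix_ts, tz_offset = split_iso_timestamp("2024-01-01T08:00:00-05:00")
print(unix_ts, tz_offset)  # 1704114000 -18000

# The local wall-clock time is recoverable from the two integers alone.
local_naive = datetime.fromtimestamp(unix_ts + tz_offset, tz=timezone.utc).replace(tzinfo=None)
print(local_naive)  # 2024-01-01 08:00:00
```

Because the instant and the offset are both plain integers, no timezone-aware timestamp type is needed to round-trip the original local time.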

To make the code easy to reproduce, we abstract away the different possible column names, since most datasets contain the same kinds of columns under different names. Renaming is sometimes an option, but most `nomad` methods can handle arbitrary column names through a `traj_cols` dictionary that maps default column names to the actual column names in the dataset. This also lets downstream functions find the required spatial, temporal, or even tessellation columns without excessive argument-passing.
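
Conceptually, such a mapping is an overlay on a set of default names. The sketch below illustrates the pattern in plain Python; it is not nomad's code. `resolve_columns` and `ILLUSTRATIVE_DEFAULTS` are hypothetical, and the real defaults may differ from the ones listed here.

```python
# Hypothetical defaults for illustration only; nomad's actual defaults may differ.
ILLUSTRATIVE_DEFAULTS = {
    "user_id": "user_id",
    "timestamp": "timestamp",
    "latitude": "latitude",
    "longitude": "longitude",
}


def resolve_columns(traj_cols=None, defaults=ILLUSTRATIVE_DEFAULTS):
    """Overlay user-supplied column names on the defaults, rejecting unknown keys."""
    traj_cols = traj_cols or {}
    unknown = set(traj_cols) - set(defaults)
    if unknown:
        raise KeyError(f"unrecognized trajectory keys: {sorted(unknown)}")
    return {**defaults, **traj_cols}


resolved = resolve_columns({"user_id": "identifier", "latitude": "device_lat"})
print(resolved["latitude"])   # device_lat (overridden)
print(resolved["timestamp"])  # timestamp (default kept)
```

A function that needs the latitude column only has to look up `resolved["latitude"]`, regardless of what the column is called in the file.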


In [None]:
traj_cols = {
    "user_id": "identifier",
    "timestamp": "unix_timestamp",
    "latitude": "device_lat",
    "longitude": "device_lon",
    "datetime": "local_datetime",
    "date": "date",
}
df_mapped = loader.from_file("gc-data.csv", format="csv", traj_cols=traj_cols)
df_mapped.head()

This mapping makes the dataset compatible with nomad tools without modifying its original structure. When a dataset already uses the default names, possibly because its columns were renamed beforehand, many `nomad` methods work without any mapping or extra arguments. Inspecting the default column names shows that the second dataset uses them, so type casting and date parsing are applied to the appropriate columns automatically.

In [None]:
from nomad.constants import DEFAULT_SCHEMA
DEFAULT_SCHEMA

In [None]:
# This dataset has default column names, so no traj_cols argument is necessary
df_pq = loader.from_file("gc-data/", format="parquet", parse_dates=True)
df_pq.head()

Even when GPS data is stored in partitioned directories (e.g. `date=2024-01-01/`), `from_file` can read it using PyArrow's file reader.
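
In a Hive-style layout, the value of each partition column is encoded in a directory name of the form `key=value` rather than stored inside the data files, and the reader reconstructs the column from the path. The sketch below shows that decoding step with the standard library; `partition_values` is a hypothetical helper and the file name `part-0.parquet` is made up for illustration.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Extract key=value pairs from the directory components of a Hive-style path."""
    values = {}
    for part in PurePosixPath(path).parent.parts:
        key, sep, value = part.partition("=")
        if sep and key:  # skip ordinary directories such as "gc-data"
            values[key] = value
    return values


# Hypothetical file path inside a partitioned dataset.
example = "gc-data/date=2024-01-01/part-0.parquet"
partitions = partition_values(example)
print(partitions)  # {'date': '2024-01-01'}
```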

## Working on smaller samples and persistence

Large mobility datasets are usually too big to load fully into memory during interactive analysis, so subsampling by user is a common early step. The `sample_users` function selects a reproducible subset of user IDs, and `sample_from_file` filters the input dataset to include only records from those users. The resulting sample can be written to disk with `to_file`, partitioned by date in `hive` format to preserve compatibility with distributed engines. Reading the output back with `from_file` confirms that the sample was saved correctly and remains compatible with the same loading functions.

In [None]:
users = loader.sample_users("gc-data/", format="parquet", size=10, seed=300)
sample_df = loader.sample_from_file("gc-data/", users=users, format="parquet", frac_records=0.25)

loader.to_file(
    sample_df,
    "/tmp/nomad_sample",
    format="parquet",
    partition_by=["date"],
    existing_data_behavior="delete_matching",
)

round_trip = loader.from_file("/tmp/nomad_sample", format="parquet")
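
The reason for sampling by user rather than by row is that each selected user keeps a complete trajectory. A minimal standard-library sketch of seeded user sampling follows; it is not nomad's implementation, `pick_users` is a hypothetical helper, and the records are made up.

```python
import random


def pick_users(user_ids, size, seed):
    """Choose `size` distinct user IDs; the same seed always yields the same subset."""
    unique_ids = sorted(set(user_ids))  # sorting removes dependence on input order
    return random.Random(seed).sample(unique_ids, size)


# Made-up records standing in for rows of a trajectory table.
records = [
    {"user_id": "a", "timestamp": 1},
    {"user_id": "b", "timestamp": 2},
    {"user_id": "a", "timestamp": 3},
    {"user_id": "c", "timestamp": 4},
    {"user_id": "d", "timestamp": 5},
]

chosen = pick_users([r["user_id"] for r in records], size=2, seed=300)
chosen_set = set(chosen)
subset = [r for r in records if r["user_id"] in chosen_set]
print(chosen, len(subset))
```

Using a dedicated `random.Random(seed)` instance keeps the selection reproducible without altering the global random state.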

In [None]:
# The amount of data in the working sample is much smaller, and sufficient for prototyping

In [None]:
print("- Value counts for sample of data:\n")
print(round_trip.user_id.value_counts())
print("\n---------------------------------\n")
print("- Value counts for original data:\n")
print(df_pq.user_id.value_counts())