# **Loading and Sampling Trajectory Data**

## Getting started

Real-world mobility files vary widely in structure and formatting:
- **Timestamps** may be **UNIX** integers or **ISO-formatted strings**.
- Datetimes may carry **timezone** information in different forms, e.g. `-05:00`, `Z`, `(GMT+01)`, or an offset in seconds such as `-3600`.
- Coordinates may be **projected** or **geographic**.
- Files may be a flat **CSV** or **partitioned Parquet**, stored locally or **in S3**.
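
To make the timestamp and timezone points concrete, here is a small sketch in plain `pandas` (not `nomad`) that writes one instant three ways and checks that they agree. The values come from the first row of the sample data shown later in this tutorial; the variable names are illustrative only.

```python
import pandas as pd

# One instant, three common encodings (values from the sample data's first row)
unix_seconds = 1704114435
iso_with_offset = "2024-01-01 09:07:15-04:00"
iso_utc = "2024-01-01T13:07:15Z"

t_unix = pd.to_datetime(unix_seconds, unit="s", utc=True)  # UNIX seconds -> UTC timestamp
t_offset = pd.Timestamp(iso_with_offset)                   # keeps the -04:00 offset
t_utc = pd.Timestamp(iso_utc)                              # "Z" means UTC

# All three refer to the same moment in time
same_instant = (t_unix == t_offset) and (t_offset == t_utc)

# The offset can also be expressed in seconds, as some files do
offset_seconds = int(t_offset.utcoffset().total_seconds())

print(same_instant, offset_seconds)
```

Comparing timezone-aware timestamps compares instants, not wall-clock strings, which is why the three values are equal even though they are written differently.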

`nomad.io` is here to help.

In [8]:
from nomad.io import base as loader

## **Typical processing**: an example with `pandas` and `geopandas`

Perform a preliminary analysis of the data in `gc_data.csv`:
- **Load** the trajectory and geometry data.
- **Plot** the data of one user.
- Create a **heatmap** of ping **hotspots**.
- Analyze **gaps** in the user's signals.

In [9]:
import pandas as pd
import geopandas as gpd

df = pd.read_csv("../../tutorials/IC2S2-2025/IC2S2-2025/gc_data.csv")
city = gpd.read_file("../../tutorials/IC2S2-2025/IC2S2-2025/garden_city.geojson")

df.head()

Unnamed: 0,identifier,device_lon,device_lat,unix_timestamp,local_datetime,date,ha
0,cocky_stallman,-38.318802,36.669894,1704114435,2024-01-01 09:07:15-04:00,2024-01-01,8.492856
1,cocky_stallman,-38.318765,36.669905,1704114753,2024-01-01 09:12:33-04:00,2024-01-01,11.336772
2,cocky_stallman,-38.318627,36.669856,1704114792,2024-01-01 09:13:12-04:00,2024-01-01,18.436612
3,cocky_stallman,-38.318661,36.66992,1704114989,2024-01-01 09:16:29-04:00,2024-01-01,27.370737
4,cocky_stallman,-38.318602,36.669823,1704115195,2024-01-01 09:19:55-04:00,2024-01-01,12.506606
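
As a rough illustration of the gap analysis mentioned above, the sketch below computes the time between consecutive pings for each user with plain `pandas`. It rebuilds the five rows printed by `df.head()` so that it runs on its own; the `gap_s` column name and the 300-second threshold are assumptions chosen for the example, not part of the dataset or of `nomad`.

```python
import pandas as pd

# The five pings shown in the output above (identifier and UNIX timestamp only)
pings = pd.DataFrame({
    "identifier": ["cocky_stallman"] * 5,
    "unix_timestamp": [1704114435, 1704114753, 1704114792, 1704114989, 1704115195],
})

# Sort within each user, then take the difference between consecutive timestamps
pings = pings.sort_values(["identifier", "unix_timestamp"])
pings["gap_s"] = pings.groupby("identifier")["unix_timestamp"].diff()

# Hypothetical threshold: flag gaps longer than 5 minutes
long_gaps = pings[pings["gap_s"] > 300]

print(pings["gap_s"].tolist())
print(len(long_gaps))
```

The first ping of each user has no predecessor, so its gap is `NaN`. On the full file the same two lines apply unchanged, since `groupby` keeps the differences from crossing user boundaries.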


## `nomad.io` facilitates type casting and default names

`nomad.io.base.from_file` is a wrapper around the `pandas` and `pyarrow` readers that simplifies the formatting of canonical variables:

- dates and datetimes in **ISO format** are cast to the `pandas` `datetime64[ns]` dtype
- **unix timestamps** are cast to integers and **reformatted to seconds**.
- **user identifiers** are cast to strings
- **partition folders** can be read as columns (Hive)
- **timezone handling** parses ISO datetime strings (with or without timezones)
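
For intuition, the casts in this list can be approximated by hand in plain `pandas`. This is only a sketch of the idea, not `nomad`'s implementation: the input frame is made up, and it assumes the timestamps are known to be in milliseconds.

```python
import pandas as pd

# Made-up raw input: numeric user ids, millisecond timestamps, ISO datetime strings
raw = pd.DataFrame({
    "user_id": [101, 102],
    "timestamp": [1704114435000, 1704114753000],
    "datetime": ["2024-01-01 09:07:15-04:00", "2024-01-01 09:12:33-04:00"],
})

clean = raw.copy()
clean["user_id"] = clean["user_id"].astype(str)                    # identifiers as strings
clean["timestamp"] = (clean["timestamp"] // 1000).astype("Int64")  # milliseconds -> seconds
clean["datetime"] = pd.to_datetime(clean["datetime"], utc=True)    # ISO strings -> datetimes (UTC)

print(clean.dtypes)
```

Writing these casts once per dataset is manageable, but repeating them across files with different column names and units is the boilerplate that a loader is meant to absorb.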

Don't read partitioned data with a for loop! `nomad`'s `from_file` wraps `PyArrow`'s file readers, maintaining the same signature.

In [10]:
# For the partitioned dataset
traj_cols = {"user_id": "user_id",
             "timestamp": "timestamp",
             "latitude": "latitude",
             "longitude": "longitude",
             "datetime": "datetime",
             "date": "date"}

file_path = "../../tutorials/IC2S2-2025/IC2S2-2025/gc_data/" # partitioned


df = loader.from_file(file_path, format="csv", traj_cols=traj_cols, parse_dates=True)
print(df.dtypes)

user_id              object
longitude           float64
latitude            float64
timestamp             Int64
datetime     datetime64[ns]
ha                  float64
date                 object
tz_offset             Int64
dtype: object


In [11]:
from nomad.constants import DEFAULT_SCHEMA
print("Canonical column names in nomad")
DEFAULT_SCHEMA

Canonical column names in nomad


{'user_id': 'user_id',
 'latitude': 'latitude',
 'longitude': 'longitude',
 'datetime': 'datetime',
 'start_datetime': 'start_datetime',
 'end_datetime': 'end_datetime',
 'start_timestamp': 'start_timestamp',
 'end_timestamp': 'end_timestamp',
 'timestamp': 'timestamp',
 'date': 'date',
 'utc_date': 'date',
 'x': 'x',
 'y': 'y',
 'geohash': 'geohash',
 'tz_offset': 'tz_offset',
 'duration': 'duration',
 'ha': 'ha',
 'h3_cell': 'h3_cell',
 'location_id': 'location_id'}
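
The idea of default names with per-dataset overrides can be sketched as a dictionary merge. The helper `resolve_columns` below is hypothetical and written only for illustration (it is not a `nomad` function); the defaults are a subset of the schema printed above, and the override uses the column names of `gc_data.csv`.

```python
# Subset of the canonical names printed above
DEFAULT_SCHEMA_SUBSET = {
    "user_id": "user_id",
    "latitude": "latitude",
    "longitude": "longitude",
    "timestamp": "timestamp",
    "datetime": "datetime",
    "date": "date",
}

def resolve_columns(traj_cols=None, defaults=DEFAULT_SCHEMA_SUBSET):
    """Hypothetical helper: user-supplied names override the defaults."""
    resolved = dict(defaults)
    resolved.update(traj_cols or {})
    return resolved

# Mapping from canonical names to the column names used in gc_data.csv
gc_cols = {
    "user_id": "identifier",
    "latitude": "device_lat",
    "longitude": "device_lon",
    "timestamp": "unix_timestamp",
    "datetime": "local_datetime",
}

resolved = resolve_columns(gc_cols)
print(resolved["user_id"])  # identifier
print(resolved["date"])     # date (falls back to the default name)
```

Only columns whose names differ from the canonical ones need to be listed; everything else falls back to the default.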

`from_file` automatically detects and reads Parquet files (single or partitioned directories) using `PyArrow`'s dataset API, applying the same validation, type casting, and timezone handling as for CSV inputs.

In [17]:
traj_cols = {"user_id": "uid",
             "timestamp": "timestamp",
             "latitude": "latitude",
             "longitude": "longitude",
             "date": "date"}

file_path = "../../nomad/data/partitioned_parquet" # partitioned

df = loader.from_file(file_path, format="parquet", traj_cols=traj_cols, parse_dates=True)
print(df.dtypes)

uid           object
timestamp      Int64
latitude     float64
longitude    float64
date          object
dtype: object
