# Loading Trajectory Data

Mobility data comes in many formats: timestamps may be Unix integers or ISO 8601 strings (with or without timezone offsets),
coordinates may be geographic (lat/lon) or projected, and files may be single CSVs or partitioned directories.

`nomad.io.from_file` handles these cases with a single function call.
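To make the timestamp variety concrete, here is a minimal sketch in plain pandas of the two encodings mentioned above. The sample values are made up for illustration and are not taken from the nomad test data; the Unix integers are assumed to be in seconds.

```python
import pandas as pd

# Hypothetical sample values, not taken from the nomad test data.
unix_seconds = pd.Series([1704119340, 1704119700])
iso_strings = pd.Series(["2024-01-01T14:29:00-05:00",
                         "2024-01-01T14:35:00-05:00"])

# Unix integers need an explicit unit (assumed to be seconds here);
# the result is a timezone-naive timestamp expressed in UTC.
from_unix = pd.to_datetime(unix_seconds, unit="s")

# ISO strings that carry an offset can be normalized to UTC while parsing.
from_iso = pd.to_datetime(iso_strings, utc=True)

print(from_unix.iloc[0])
print(from_iso.iloc[0])
```

Each encoding needs a different call and different arguments, which is the bookkeeping a single loader entry point is meant to absorb.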

In [1]:
import glob
import pandas as pd
import nomad.io.base as loader
import nomad.data as data_folder
from pathlib import Path

data_dir = Path(data_folder.__file__).parent

## Pandas vs nomad.io for partitioned data

Partitioned directories (e.g., `date=2024-01-01/`, `date=2024-01-02/`, ...) require a loop with pandas:

In [2]:
csv_files = glob.glob(str(data_dir / "partitioned_csv" / "*" / "*.csv"))
df_list = []
for f in csv_files:
    df_list.append(pd.read_csv(f))
df_pandas = pd.concat(df_list, ignore_index=True)

print(f"Pandas: {len(df_pandas)} rows")
print(df_pandas.dtypes)
print("\nFirst few rows:")
print(df_pandas.head(3))

Pandas: 25835 rows
user_id            object
dev_lat           float64
dev_lon           float64
local_datetime     object
dtype: object

First few rows:
             user_id    dev_lat    dev_lon                 local_datetime
0    wizardly_joliot  38.321711 -36.667334  2024-01-01 14:29:00.000000000
1    wizardly_joliot  38.321676 -36.667365  2024-01-01 14:35:00.000000000
2  wonderful_swirles  38.321017 -36.667869  2024-01-01 15:06:00.000000000
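The loop above leaves `local_datetime` as `object` and reads nothing from the directory names. If you stay with plain pandas, both steps have to be added by hand. The following is a rough sketch under stated assumptions: `read_partitioned_csv` is a hypothetical helper (not part of pandas or nomad), the directories are assumed to follow a `key=value` naming scheme, and the tiny dataset is generated on the fly so the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_partitioned_csv(root, datetime_col):
    """Hypothetical helper: read root/<key>=<value>/*.csv into one DataFrame."""
    frames = []
    for csv_path in sorted(Path(root).glob("*/*.csv")):
        frame = pd.read_csv(csv_path)
        # Recover the partition value from a directory named like "date=2024-01-01".
        key, _, value = csv_path.parent.name.partition("=")
        if value:
            frame[key] = value
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    # Cast the datetime strings explicitly; read_csv leaves them as text.
    combined[datetime_col] = pd.to_datetime(combined[datetime_col])
    return combined


# Build a throwaway two-partition dataset with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    for day in ("2024-01-01", "2024-01-02"):
        part = Path(tmp) / f"date={day}"
        part.mkdir()
        pd.DataFrame({
            "user_id": ["demo_user"],
            "dev_lat": [38.32],
            "dev_lon": [-36.67],
            "local_datetime": [f"{day} 14:29:00.000000000"],
        }).to_csv(part / "part-0.csv", index=False)
    demo = read_partitioned_csv(tmp, "local_datetime")

print(demo.dtypes)
```

Sorting the paths keeps the row order reproducible, which `glob` alone does not guarantee.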


`nomad.io.from_file` reads the whole partitioned directory in one call and also applies type casting and column mapping automatically:

In [3]:
traj_cols = {"user_id": "user_id",
             "latitude": "dev_lat",
             "longitude": "dev_lon",
             "datetime": "local_datetime"}

df = loader.from_file(data_dir / "partitioned_csv", format="csv", traj_cols=traj_cols, parse_dates=True)
print(f"nomad.io: {len(df)} rows")
print(df.dtypes)
print("\nFirst few rows:")
print(df.head(3))
print("\nNote: 'local_datetime' is now datetime64[ns], not object!")

nomad.io: 25835 rows
user_id                   object
dev_lat                  float64
dev_lon                  float64
local_datetime    datetime64[ns]
dtype: object

First few rows:
          user_id    dev_lat    dev_lon      local_datetime
0  admiring_curie  38.320444 -36.666827 2024-01-04 02:40:00
1  admiring_curie  38.320438 -36.666755 2024-01-04 03:16:00
2  admiring_curie  38.320434 -36.666877 2024-01-04 19:21:00

Note: 'local_datetime' is now datetime64[ns], not object!
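The row counts of the two loads match, but the first rows differ, presumably because the files were read in a different order. One way to check that two loads hold the same records regardless of order is to cast the datetime column, sort both frames on the same keys, and compare. This is a minimal sketch with made-up rows; `same_rows` is a hypothetical helper, not a nomad function.

```python
import pandas as pd


def same_rows(left, right, sort_cols):
    """Hypothetical helper: True if both frames hold the same rows in any order."""
    left_sorted = left.sort_values(sort_cols).reset_index(drop=True)
    right_sorted = right.sort_values(sort_cols).reset_index(drop=True)
    return left_sorted.equals(right_sorted)


# Made-up rows standing in for the two loads; the second is in a different order.
plain = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "local_datetime": ["2024-01-01 14:29:00", "2024-01-04 02:40:00"],
})
typed = pd.DataFrame({
    "user_id": ["u2", "u1"],
    "local_datetime": pd.to_datetime(
        pd.Series(["2024-01-04 02:40:00", "2024-01-01 14:29:00"])
    ),
})

# The string column must be cast first, otherwise the dtypes never match.
plain["local_datetime"] = pd.to_datetime(plain["local_datetime"])

match = same_rows(plain, typed, ["user_id", "local_datetime"])
print(match)
```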




The same pattern works for Parquet files. Type casting and processing rely on `traj_cols`, which tells the loader which of your columns correspond to the default canonical spatio-temporal column names:

In [4]:
traj_cols = {"user_id": "uid", "timestamp": "timestamp", 
             "latitude": "latitude", "longitude": "longitude", "date": "date"}

df = loader.from_file(data_dir / "partitioned_parquet", format="parquet", traj_cols=traj_cols, parse_dates=True)
print(f"Loaded {len(df)} rows")
print(df.dtypes)

Loaded 25835 rows
uid           object
timestamp      Int64
latitude     float64
longitude    float64
date          object
dtype: object


In [5]:
# These are the default canonical column names
from nomad.constants import DEFAULT_SCHEMA
print(DEFAULT_SCHEMA.keys())

dict_keys(['user_id', 'latitude', 'longitude', 'datetime', 'start_datetime', 'end_datetime', 'start_timestamp', 'end_timestamp', 'timestamp', 'date', 'utc_date', 'x', 'y', 'geohash', 'tz_offset', 'duration', 'ha', 'h3_cell', 'location_id'])
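To illustrate the idea of mapping canonical names to actual column names, here is a simplified sketch of how such a lookup could work. It is not nomad's implementation: `resolve_columns` and `ASSUMED_DEFAULTS` are hypothetical, and the default values are assumed to equal the canonical names, which the source does not confirm.

```python
# Hypothetical stand-in for a defaults table; NOT the actual DEFAULT_SCHEMA values.
ASSUMED_DEFAULTS = {
    "user_id": "user_id",
    "latitude": "latitude",
    "longitude": "longitude",
    "datetime": "datetime",
    "timestamp": "timestamp",
}


def resolve_columns(columns, traj_cols=None, defaults=ASSUMED_DEFAULTS):
    """Map canonical names to column names that are actually present."""
    traj_cols = traj_cols or {}
    unknown = set(traj_cols) - set(defaults)
    if unknown:
        raise KeyError(f"Unknown canonical names: {sorted(unknown)}")
    resolved = {}
    for canonical, default_name in defaults.items():
        # A user-supplied name wins; otherwise fall back to the default name.
        actual = traj_cols.get(canonical, default_name)
        if actual in columns:
            resolved[canonical] = actual
    return resolved


resolved = resolve_columns(
    ["uid", "timestamp", "latitude", "longitude"],
    traj_cols={"user_id": "uid"},
)
print(resolved)
```

In this sketch only the non-default name (`uid`) has to be spelled out; canonical names with no matching column, such as `datetime` here, are simply left out of the result.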
