### Outline
Explain how nomad can address different challenges in the data ingestion process:
1) The data can be partitioned, which complicates things for users most familiar with pandas. A simple wrapper function, `from_file`, handles both cases: the same call reads a single CSV or a partitioned dataset.
2) Column names and formats vary from dataset to dataset, but spatial analysis functions require spatial and temporal columns. Sometimes the time is expected as a unix timestamp, sometimes as a datetime in the local timezone; similarly, an algorithm might require the latitude and longitude. Users can always rename the columns to match the defaults of those functions, or pass the right column names (or the relevant columns) to functions that offer such flexibility. Nothing wrong with that. However, it can be preferable not to alter the data permanently for analysis, especially when filtering or producing a derivative dataset that is expected to be joined with the original data later on. Passing the correct column names to every processing function is burdensome and verbose, and makes code less reusable when it is applied to a dataset with different column names. nomad addresses this with an auxiliary dictionary of column names, which processing methods use to find the appropriate columns. This is roughly equivalent to passing the column names as keyword arguments, except that functions also fall back to default column names for certain expected columns (latitude, longitude, user_id, timestamp, etc.).
3) We can demonstrate the flexibility that this auxiliary dictionary offers by loading some device-level data found in `gc-data.csv`. Beyond wrapping the pandas or pyarrow reader functions, the `io` reader method `from_file` also ensures that the trajectory columns (coordinates and time columns) are cast to the correct data types, issues warnings when unix timestamps are not in seconds, and raises errors when the data seems to lack spatial or temporal columns that will likely be required in downstream processing. Comparing the output of `pandas.read_csv` with that of `from_file` shows that the right columns have been cast to the right data types:

4) Of particular importance is the standardized handling of datetime strings in ISO 8601 format. These can be timezone naive, carry timezone information, or even have mixed timezones, for instance when a trajectory spans multiple regions or crosses a daylight saving time change. nomad tries to simplify the parsing of dates in such situations by distinguishing three cases: [code explaining]

5) This last case is important because distributed algorithms relying on Spark do not store timezone information in the timestamp format. This is a challenge when the analysis depends on the local datetime, because that information is lost. Switching to UTC is always an option and makes naive datetimes comparable, but it complicates the analysis of daytime and nighttime behaviors when there are mixed timezones. A standard way to deal with timezone data is to strip the timezone information from the timestamps and store it in a separate column as the offset from UTC in seconds. Thus, for compatibility with Spark workflows, setting `mixed_timezone_behavior = "naive"` will create a `tz_offset` column (when one does not already exist).

6) nomad makes it easy to switch between small analyses on a small example of the data (which could be stored in a .csv file) for testing code, and the same or similar functions at scale in a distributed environment. This facilitates a common (and recommended) workflow: read the data of a few users from a large dataset, use standard pandas functionality, benchmark the code, and scale up with more resources only once the code is in good shape. This can easily be done with `io` methods like `sample_users` and `sample_from_file` (which may optionally take a sample of users drawn from somewhere else). This is shown as follows:

7) Finally, a user might want to persist such a sample with care for the data types and, perhaps, recover the date string format with timezone, which is possible even when this information was saved in the `tz_offset` column. Notice that this writer function can also switch seamlessly between csv and parquet formats, leveraging pyarrow and pandas. For example:
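
One way to picture the `tz_offset` round trip described in items 5 and 7 is the sketch below. It uses only pandas and the standard library, not nomad's reader or writer. The helper `to_iso_with_offset` and the variable names are hypothetical, introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

# ISO 8601 strings whose UTC offsets differ (here, across a daylight saving change)
iso_strings = ["2024-03-09T23:30:00-05:00", "2024-03-10T08:15:00-04:00"]

# Parse with the standard library so each value keeps its own offset
parsed = [datetime.fromisoformat(s) for s in iso_strings]

tz_df = pd.DataFrame({
    # local wall-clock time with the timezone information stripped
    "datetime": pd.to_datetime([t.replace(tzinfo=None) for t in parsed]),
    # offset from UTC in seconds, kept in its own column
    "tz_offset": [int(t.utcoffset().total_seconds()) for t in parsed],
})

def to_iso_with_offset(naive, offset_seconds):
    """Hypothetical helper: rebuild an ISO 8601 string from local time and offset."""
    tz = timezone(timedelta(seconds=int(offset_seconds)))
    return naive.to_pydatetime().replace(tzinfo=tz).isoformat()

recovered = [
    to_iso_with_offset(t, o)
    for t, o in zip(tz_df["datetime"], tz_df["tz_offset"])
]
print(recovered == iso_strings)
```

Keeping the offset in seconds next to a naive local datetime means nothing is lost when a storage format drops timezone information, and the original strings can be rebuilt on demand.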

# **Tutorial 1: Loading and Sampling Trajectory Data**

## Getting started

Real-world mobility files vary widely in structure and formatting. Timestamps may be recorded as UNIX integers or ISO-formatted strings, with or without timezone offsets. Coordinate columns may follow different naming conventions, and files may be stored either as flat CSVs or as partitioned Parquet directories. This notebook demonstrates how `nomad.io.base` standardizes data loading across these variations using two example datasets: a CSV file (`gc-data.csv`) and a partitioned Parquet directory (`gc-data/`). For visualization, we will also use `garden_city.geojson`, a dataset with the building geometries underlying the synthetic data in these examples.

## Inspecting schemas
Let's start by inspecting the schemas of the datasets we will use, with the `nomad` helper function `table_columns` from the `io` module. This function reports column names for both flat files and partitioned datasets without reading the full content into memory.
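
To illustrate how column names can be listed without loading any rows, here is a small pandas-only sketch. `peek_columns` is a hypothetical helper, not `nomad`'s `table_columns`, and it assumes csv files stored in hive-style `key=value` folders.

```python
import tempfile
from pathlib import Path

import pandas as pd

def peek_columns(path):
    """Hypothetical helper: list column names without reading any data rows."""
    path = Path(path)
    if path.is_file():
        # nrows=0 parses only the header line
        return list(pd.read_csv(path, nrows=0).columns)
    # For a directory, read the header of one file and add the partition keys
    first = sorted(path.rglob("*.csv"))[0]
    columns = list(pd.read_csv(first, nrows=0).columns)
    folders = first.relative_to(path).parts[:-1]
    partition_keys = [f.split("=", 1)[0] for f in folders if "=" in f]
    return columns + partition_keys

# Tiny throwaway files so the sketch runs on its own
tmp = Path(tempfile.mkdtemp())
toy = pd.DataFrame({"user_id": ["a", "b"], "timestamp": [1704326400, 1704326460]})
toy.to_csv(tmp / "flat.csv", index=False)
part_dir = tmp / "parts" / "date=2024-01-04"
part_dir.mkdir(parents=True)
toy.to_csv(part_dir / "part-0.csv", index=False)

print(peek_columns(tmp / "flat.csv"))
print(peek_columns(tmp / "parts"))
```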

In [None]:
from nomad.io import base as loader

print(loader.table_columns("gc-data.csv", format="csv"))
print(loader.table_columns("gc-data/", format="parquet")) # <<< SHOULD BE A PARTITIONED CSV

## Typical processing with `pandas`, `geopandas`

When analyzing a manageable sample of data, pandas and geopandas provide excellent functionality for preliminary analysis and plotting without many additional tools. Suppose we want to perform some preliminary analysis of the data in `gc-data.csv`, namely:
- Load the trajectory and geometry data from disk (using `pandas.read_csv()` and `geopandas.read_file()`).
- Plot the data of a user for a given day (using `GeoDataFrame.plot()` and `matplotlib`'s `Axes.scatter()`).
- Create a heatmap of the areas with a lot of pings (for this we could use a tessellation, for example `h3`).
- Analyze whether there are gaps in the users' signals (likely with a simple histogram from `matplotlib`).

For example, we can do it like this:

In [None]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import h3
from shapely import Polygon

# read data
df = pd.read_csv("gc-data.csv")
# TODO: temporary coordinate workaround (swaps and negates lat/lon, then shifts by 0.003); fix the source data and delete this line
df.loc[:, ['device_lon', 'device_lat']] = -1 * df.loc[:, ['device_lat', 'device_lon']].values + 0.003
city = gpd.read_file("garden_city.geojson")

df.head()

In [None]:
# Plot trajectory data of a single user
user = df['identifier'].iloc[0]
user_df = df.loc[(df['identifier'] == user) & (df['date'] == '2024-01-04')]

# Plot trajectory
fig, ax1 = plt.subplots(figsize=(4,4))
ax1.set_axis_off()

city.plot(ax=ax1, column='type', edgecolor='black', linewidth=0.75, cmap='Set3')
ax1.scatter(user_df['device_lon'], user_df['device_lat'], s=6, alpha=0.75, color='black')

plt.show()

In [None]:
## Plot an h3 heatmap
#  For this task we can leverage `h3.latlng_to_cell` and `h3.cell_to_boundary` to get a cell's polygon
#  'Sharp edge # 1' the order of latitude and longitude depends on the library. We need to switch the order!

# Switch lat, lon to lon, lat and pass to shapely Polygon
def h3_cell_to_polygon(cell):
    coords = h3.cell_to_boundary(cell)
    lat, lon = zip(*coords)
    return Polygon(zip(lon, lat))    

# Cell for each row
def row_to_h3_cell(row, res):
    return h3.latlng_to_cell(lat=row['device_lat'], lng=row['device_lon'], res=res)

df['cell'] = df.apply(row_to_h3_cell, res=12, axis=1)

pings_per_cell = df.groupby('cell').agg(pings=('unix_timestamp', 'count')).reset_index()
pings_per_cell['geometry'] = pings_per_cell['cell'].apply(h3_cell_to_polygon)

h3_gdf = gpd.GeoDataFrame(pings_per_cell, geometry='geometry')

fig, ax2 = plt.subplots(figsize=(5,4))
city.plot(ax=ax2, column='type', edgecolor='black', linewidth=0.75, cmap='Set3')
h3_gdf.plot(column='pings', cmap='Reds', alpha=0.75, ax=ax2, legend=True)
ax2.set_axis_off()
plt.show()

In [None]:
## Let's find the maximum temporal gap in the trajectory of each user
# A simple pandas groupby on the unix_timestamp column (in seconds since epoch)
def get_max_gap_minutes(times):
    # sort first so that consecutive differences are true gaps (in seconds)
    gaps = times.sort_values().diff() // 60 # gaps in whole minutes; the first value is NaN
    return gaps.max() # NaN for users with a single ping

max_gap = df.groupby('identifier')['unix_timestamp'].apply(get_max_gap_minutes)

fig, ax3 = plt.subplots(figsize=(4,3))
max_gap.hist(ax=ax3, bins=24, color='#8dd3c7')
ax3.set_xlabel('minutes')
ax3.set_title('max temporal gap per user')
plt.show()

## Data ingestion with `nomad` — a reusable pipeline

Here, we explore which advantages (if any) `nomad` offers for the same preliminary analysis. When reading a single csv file, the reader function `nomad.io.base.from_file` is essentially a `pandas` wrapper, except that it facilitates the parsing of spatiotemporal columns that are known to follow specific formats. For instance:

- dates and datetimes in ISO format are cast to pandas datetime dtypes (e.g. `datetime64[ns]`)
- unix timestamps are cast to integers
- user identifiers are cast to strings
- coordinates are cast to floats

For such typecasting and other methods, `nomad` relies on a user-provided mapping from "default" column names to the column names in the data, namely the dictionary `traj_cols`. This avoids renaming columns *ad hoc* to reuse code, and reduces the number of arguments passed to different methods.
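
The idea of a mapping with a fallback to default names can be sketched in a few lines of plain Python. The helper `resolve_column` and the dictionary `ASSUMED_DEFAULTS` are hypothetical and only illustrate the lookup logic. The defaults `nomad` actually uses are the ones in `nomad.constants.DEFAULT_SCHEMA`, which may differ.

```python
# Illustrative defaults only; see nomad.constants.DEFAULT_SCHEMA for the real ones
ASSUMED_DEFAULTS = {
    "user_id": "user_id",
    "timestamp": "timestamp",
    "latitude": "latitude",
    "longitude": "longitude",
}

def resolve_column(key, columns, traj_cols=None):
    """Hypothetical helper: find the data column that plays the role `key`."""
    # Use the user's mapping when it names this role, else the default name
    name = (traj_cols or {}).get(key, ASSUMED_DEFAULTS[key])
    if name not in columns:
        raise KeyError(f"no column for '{key}': expected '{name}'")
    return name

columns = ["identifier", "unix_timestamp", "device_lat", "device_lon"]
traj_cols = {"user_id": "identifier",
             "latitude": "device_lat",
             "longitude": "device_lon"}

# Mapped explicitly by the user
print(resolve_column("latitude", columns, traj_cols))
# Falls back to the default name when the mapping is silent
print(resolve_column("timestamp", ["user_id", "timestamp"], traj_cols=None))
```

With this pattern, analysis code refers only to roles such as `"latitude"`, so the same function body can run on datasets with different column names.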


In [None]:
# Default column names that methods in `nomad` look for; `traj_cols` maps each one to a column in the data
from nomad.constants import DEFAULT_SCHEMA
DEFAULT_SCHEMA

<a id='hidden-cell'></a>

A problem that can arise when analyzing data with just `pandas` is that geospatial data is often partitioned, e.g. stored as smaller csv chunks in partition directories (such as `date=2024-01-01/`). Rather than reading the data with a for loop (and turning the partition directories into variables), `nomad`'s `from_file` can read a whole directory with the same function call, by wrapping `PyArrow`'s file reader.
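
For reference, the manual loop mentioned above looks roughly like the following in plain pandas. The directory layout, the file names, and the variable `df_manual` are made up for this sketch.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny hive-partitioned csv directory so the sketch is self-contained
root = Path(tempfile.mkdtemp()) / "gc-sample"
for day, users in [("2024-01-01", ["a", "b"]), ("2024-01-02", ["a"])]:
    part_dir = root / f"date={day}"
    part_dir.mkdir(parents=True)
    chunk = pd.DataFrame({"user_id": users, "timestamp": list(range(len(users)))})
    chunk.to_csv(part_dir / "part-0.csv", index=False)

# Manual approach: loop over the files and turn each folder name into a column
chunks = []
for csv_path in sorted(root.glob("date=*/*.csv")):
    key, value = csv_path.parent.name.split("=", 1)
    chunk = pd.read_csv(csv_path)
    chunk[key] = value  # the partition value is only encoded in the folder name
    chunks.append(chunk)

df_manual = pd.concat(chunks, ignore_index=True)
print(df_manual)
```

The loop works, but the partition key, the glob pattern, and the concatenation all have to be rewritten whenever the layout changes. A single reader call hides exactly this boilerplate.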

Let's replicate the previous analysis with `nomad`'s file reader, first on the single csv file and then on the partitioned dataset.

In [None]:
# For the single csv dataset
traj_cols = {"user_id": "identifier",
             "timestamp": "unix_timestamp",
             "latitude": "device_lat",
             "longitude": "device_lon",
             "datetime": "local_datetime",
             "date": "date"}
file_path = "gc-data.csv"

df = loader.from_file(file_path, format="csv", traj_cols=traj_cols)
# check data types
#print(df.dtypes)

In [None]:
# For the partitioned dataset
traj_cols = {"user_id": "user_id",
             "timestamp": "timestamp",
             "latitude": "latitude",
             "longitude": "longitude",
             "datetime": "datetime",
             "date": "date"} # the dataset has default column names
file_path = "gc-data/" # partitioned

# Try traj_cols=None. It should work because of default names
# Try mixed_timezone_behavior="utc" or "object", or parse_dates = False and inspect df
df = loader.from_file(file_path, format="parquet", traj_cols=traj_cols, parse_dates=True)
# TODO: temporary coordinate workaround (swaps and negates lat/lon, then shifts by 0.003); fix the source data and delete this line
df.loc[:, ['longitude', 'latitude']] = -1 * df.loc[:, ['latitude', 'longitude']].values + 0.003
# check data types
print(df.dtypes)

In [None]:
## Compute all three statistics as before 

# Trajectory of a single user
user = df[traj_cols['user_id']].iloc[0]
user_df = df.loc[(df[traj_cols['user_id']] == user) & (df[traj_cols['date']] == '2024-01-04')]

# Pings per cell geodataframe
df["cell"] = df.apply(
    lambda r: h3.latlng_to_cell(lat=r[traj_cols["latitude"]], lng=r[traj_cols["longitude"]], res=12),
    axis=1)

pings_per_cell = df.groupby('cell').agg(pings=(traj_cols['timestamp'], 'count')).reset_index()
h3_gdf = gpd.GeoDataFrame(pings_per_cell, geometry=pings_per_cell['cell'].apply(h3_cell_to_polygon))

# Maximum gap for each user
max_gap = df.groupby(traj_cols['user_id'])[traj_cols['timestamp']].apply(get_max_gap_minutes)

In [None]:
## Plotting
fig, (ax1, ax2, ax3) = plt.subplots(figsize=(12,3), ncols=3)

# trajectory of a single user
city.plot(ax=ax1, column='type', edgecolor='black', linewidth=0.75, cmap='Set3')
ax1.scatter(user_df[traj_cols["longitude"]], user_df[traj_cols["latitude"]], s=6, alpha=0.75, color='black')
ax1.set_axis_off()
# heatmap
city.plot(ax=ax2, column='type', edgecolor='black', linewidth=0.75, cmap='Set3')
h3_gdf.plot(column='pings', cmap='Reds', alpha=0.75, ax=ax2, legend=True)
ax2.set_axis_off()
# gaps
max_gap.hist(ax=ax3, bins=24, color='#8dd3c7')
ax3.set_xlabel('minutes')
ax3.set_title('max temporal gap per user')

plt.tight_layout()
plt.show()

Now, go back to (hidden) [cell 7](#hidden-cell) and try simply changing the file path and the column name mapping (`traj_cols`). The rest of the code works the same.

## A good practice: prototype on a small sample, scale up later

While a researcher is still exploring and designing their experiments, it can be impractical, time-consuming, or outright intractable to use the entire dataset. It is therefore recommended to work on a sample of the data, selecting some users at random, some records at random, or both. `nomad`'s `io.base.sample_users` selects a reproducible subset of user IDs, while `io.base.sample_from_file` reads the data of only those users, optionally sampling records. The resulting sample can be written to disk using `io.base.to_file`.

When persisting the sample, we partition by `date` again to preserve the likeness with the original dataset. Reading the output back with `from_file` confirms that the sample was saved correctly and remains compatible with the same loading functions.
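
To build intuition for what this kind of sampling involves, here is a minimal pandas sketch. The helper `sample_user_rows` and its parameters are hypothetical and only mimic the idea; they are not `nomad`'s implementation.

```python
import pandas as pd

def sample_user_rows(df, n_users, frac_records=1.0, seed=None, user_col="user_id"):
    """Hypothetical helper: keep a random subset of users, then a fraction of their rows."""
    # A fixed seed makes the chosen users reproducible across runs
    users = df[user_col].drop_duplicates().sample(n=n_users, random_state=seed)
    subset = df[df[user_col].isin(users)]
    return subset.sample(frac=frac_records, random_state=seed)

# Toy data: 5 users with 10 pings each
toy = pd.DataFrame({
    "user_id": [f"u{i % 5}" for i in range(50)],
    "timestamp": list(range(50)),
})

sample_a = sample_user_rows(toy, n_users=2, frac_records=0.5, seed=314)
sample_b = sample_user_rows(toy, n_users=2, frac_records=0.5, seed=314)

print(sample_a["user_id"].unique())
print(sample_a.equals(sample_b))  # same seed, same sample
```

Sampling whole users first, and records second, keeps each selected trajectory coherent, and fixing the seed makes benchmarks on the sample repeatable.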

In [None]:
file_path = "gc-data/" # has default names
fmt = "parquet"

# full data
df = loader.from_file(file_path, format=fmt)

# sample users
users = loader.sample_users(file_path, format=fmt, size=12, seed=314) # change if user_id has other name
# sample data, pass users
sample_df = loader.sample_from_file(file_path, users=users, format=fmt, frac_records=0.30, seed=314)

## optionally try uncommenting this line
# sample_df = loader.sample_from_file(file_path, users=users, format=fmt, frac_records=0.30, frac_users=0.12, seed=314)

# persist
loader.to_file(sample_df, "/tmp/nomad_sample", format=fmt, partition_by=["date"], existing_data_behavior='overwrite_or_ignore')
round_trip = loader.from_file("/tmp/nomad_sample", format=fmt)

In [None]:
print("- Value counts for sample of data:\n")
print(round_trip.user_id.value_counts())
print("\n---------------------------------\n")
print("- Value counts for original data:\n")
print(df.user_id.value_counts())