# How to use `Source`?
## Synthetic Data

The `tab2seq` package provides a function that generates four synthetic datasets: `health`, `labour`, `income`, and `survey`. Each of them has its own data structure.

In [None]:
from tab2seq.datasets import generate_synthetic_data
import polars as pl

In [None]:
data_paths = generate_synthetic_data(output_dir="synthetic_data", 
                                     n_entities=10000, 
                                     seed=742, 
                                     registries=["health", "labour", "survey", "income"],
                                     file_format="parquet")
print("Generated synthetic data at:", data_paths)

You can use `polars` to load and look at these datasets.

In [None]:
# Note: pl.read_parquet loads the file eagerly and returns a DataFrame
lf_health = pl.read_parquet(data_paths["health"])
lf_health.head()

In [None]:
lf_labour = pl.read_parquet(data_paths["labour"])
lf_labour.sample(10)

## Sources
A `Source` represents one or more data tables of a specific event type. This could be a hospital admissions registry,
an income registry, a labour market record, or any other event table.

Each `Source` stores the information needed to read and validate that table:
1. where it lives on disk,
2. which column identifies the entity (e.g. a person, firm, or object),
3. which column holds the timestamp, and
4. which columns carry categorical or continuous features.

`Source` relies heavily on `pydantic` configurations, which makes it straightforward to define new event types simply by writing
a config, without touching any reading or validation logic.
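
To illustrate the config-driven idea without depending on `tab2seq` internals, here is a minimal sketch that uses a standard-library dataclass as a stand-in for a `pydantic` model. `EventTableConfig`, its validation rule, and the `prescriptions` table with its column names are all hypothetical; none of them are part of `tab2seq`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EventTableConfig:
    """Hypothetical stand-in for a pydantic-style config (not tab2seq's SourceConfig)."""

    name: str
    filepath: str
    entity_id_col: str
    timestamp_cols: list[str] = field(default_factory=list)
    categorical_cols: list[str] = field(default_factory=list)
    continuous_cols: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed rule for this sketch: each column may play only one role.
        roles = [
            self.entity_id_col,
            *self.timestamp_cols,
            *self.categorical_cols,
            *self.continuous_cols,
        ]
        duplicates = sorted({col for col in roles if roles.count(col) > 1})
        if duplicates:
            raise ValueError(f"columns listed more than once: {duplicates}")


# A new (hypothetical) event type is described purely by its config;
# the reading and validation code stays untouched.
prescriptions_cfg = EventTableConfig(
    name="prescriptions",
    filepath="prescriptions.parquet",
    entity_id_col="entity_id",
    timestamp_cols=["date"],
    categorical_cols=["drug_code"],
    continuous_cols=["dose"],
)
```

The point of the design is that a mistake in the table description fails when the config is created, before any data is read.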

**Note**: `Source` performs the first filtering and preprocessing steps: it removes rows with an empty `entity_id_col` value
and rows with empty `timestamp_cols` values (if you specified any).
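
The filtering described in the note can be pictured with a small plain-Python sketch. This is not how `tab2seq` implements it: `drop_incomplete_rows` is a hypothetical helper, "empty" is assumed to mean `None`, and the sketch keeps rows where only a feature column is missing.

```python
from typing import Any, Iterable, Sequence


def drop_incomplete_rows(
    rows: Iterable[dict[str, Any]],
    entity_id_col: str,
    timestamp_cols: Sequence[str] = (),
) -> list[dict[str, Any]]:
    """Keep only rows whose entity ID and timestamp columns are all present."""
    required = [entity_id_col, *timestamp_cols]
    return [
        row for row in rows
        if all(row.get(col) is not None for col in required)
    ]


rows = [
    {"entity_id": 1, "date": "2020-01-01", "cost": 10.0},
    {"entity_id": None, "date": "2020-01-02", "cost": 5.0},  # dropped: no entity ID
    {"entity_id": 2, "date": None, "cost": 7.5},             # dropped: no timestamp
    {"entity_id": 3, "date": "2020-01-04", "cost": None},    # kept in this sketch
]

clean = drop_incomplete_rows(rows, "entity_id", ["date"])
print([row["entity_id"] for row in clean])  # [1, 3]
```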

In [None]:
from tab2seq.source import Source, SourceConfig, SourceCollection

In [None]:
source_H = Source(config=SourceConfig(
    name="health",
    filepath="synthetic_data/health.parquet",
    entity_id_col="entity_id",
    categorical_cols=["diagnosis", "procedure", "department"],
    continuous_cols=["cost", "length_of_stay"],
    output_format="parquet",
    timestamp_cols=["date"]
))

print("Number of unique IDs:", len(source_H.get_entity_ids()))
source_H.scan()

In [None]:
# Alternatively, define the SourceConfig separately and then create the Source

config_L = SourceConfig(
    name="labour",
    filepath="synthetic_data/labour.parquet",
    entity_id_col="entity_id",
    categorical_cols=["status", "occupation", "residence_region"],
    continuous_cols=["weekly_hours"],
    output_format="parquet",
    timestamp_cols=["date", "birthday"],
)
source_L = Source(config=config_L)

print("Number of unique IDs:", len(source_L.get_entity_ids()))

In [None]:
# You can also create a SourceCollection to manage multiple sources together
collection = SourceCollection(sources=[source_H, source_L])


print("All unique entity IDs in collection:", len(collection.get_all_entity_ids()))
# You can access the individual sources in the collection as follows:
collection.sources