5 changes: 5 additions & 0 deletions .pre-commit-config.yaml
@@ -33,3 +33,8 @@ repos:
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
- repo: https://github.com/tcort/markdown-link-check
rev: 'v3.11.2'
hooks:
- id: markdown-link-check
args: [-q]
6 changes: 3 additions & 3 deletions docs/index.md
@@ -25,7 +25,7 @@ data = load()

Depending on how the data was prepared, `load` may return a [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), or [modin](https://github.com/modin-project/modin) dataframe. The choice is limited to these libraries because the returned dataframe must be supported by [pandera](https://pandera.readthedocs.io/en/stable/).

Not only accessing data will be this easy, but you will also have the [pandera schema model](https://pandera.readthedocs.io/en/stable/schema_models.html) associated with the data. How?
Not only will accessing the data be this easy, but you will also have the [pandera DataFrame Model](https://pandera.readthedocs.io/en/stable/dataframe_models.html) associated with the data. How?
```python
from demo_data import Schema
```
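As a rough sketch of how these pieces fit together (assuming the packaged module exposes `load` at the top level, as in the snippet further up, and that `Schema` is the packaged pandera DataFrame Model), consumer code could look like this:

```python
from demo_data import Schema, load

df = load()               # returns one of the supported dataframe types
df = Schema.validate(df)  # check the data against the packaged pandera model
print(list(df.columns))   # column names come from the schema, not from user code
```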
@@ -46,15 +46,15 @@ and use the command `dac pack` (run `dac pack --help` for detailed instructions)
At a high level, the most important elements you must provide are (a minimal model is sketched after this list):

* python code to load the data. It should return the data as a DataFrame from one of the supported libraries: [pandas](https://pandas.pydata.org/), [dask](https://www.dask.org/), [pyspark](https://spark.apache.org/docs/latest/api/python/index.html), or [modin](https://github.com/modin-project/modin)
* a [pandera ModelSchema](https://pandera.readthedocs.io/en/stable/schema_models.html) fitting the data that can be loaded
* a [pandera DataFrame Model](https://pandera.readthedocs.io/en/stable/dataframe_models.html) fitting the data that can be loaded
* python dependencies
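For orientation only, a minimal DataFrame Model for the second element might look like the sketch below; the column names and checks are made up for the example:

```python
import pandera as pa
from pandera.typing import Series


class Schema(pa.DataFrameModel):
    # hypothetical columns, purely illustrative
    id: Series[int] = pa.Field(ge=0)
    name: Series[str] = pa.Field(nullable=False)
```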

## What are the advantages of distributing data in this way?

* The code needed to load the data, the data source, and its location are abstracted away from the user.
This means that the data engineer can start from local files and later move to a SQL database, cloud file storage, or a Kafka topic, without the user noticing or needing to adapt their code.

* Column names are passed to the user, and can be abstracted from the data source leveraging on the pandera [`Field.alias`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.model_components.Field.html#pandera.model_components.Field). In this way, the user code will not contain hard-coded column names, and changes in data source column names won't impact the user.
* Column names are passed to the user, and can be abstracted from the data source by leveraging the pandera [`Field.alias`](https://pandera.readthedocs.io/en/stable/reference/generated/pandera.api.pandas.model_components.Field.html). In this way, user code will not contain hard-coded column names, and changes in data source column names won't impact the user (see the sketch after this list).

* Users can build robust code by [writing unit testing for their functions](https://pandera.readthedocs.io/en/stable/data_synthesis_strategies.html) effortlessly.
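
As a rough illustration of the alias mechanism mentioned in the list above (the attribute and column names are made up), user code refers to `Schema.user_id` while the physical column stays `uid`:

```python
import pandera as pa
from pandera.typing import DataFrame, Series


class Schema(pa.DataFrameModel):
    # the source column is named "uid"; user code only ever sees `Schema.user_id`
    user_id: Series[int] = pa.Field(alias="uid")


def n_users(df: DataFrame[Schema]) -> int:
    # no hard-coded "uid": renaming the source column only touches the alias
    return df[Schema.user_id].nunique()
```

The unit-testing bullet builds on the same model: with pandera's data synthesis strategies installed, a test can draw synthetic frames that satisfy `Schema` instead of hand-writing fixtures.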

2 changes: 1 addition & 1 deletion test/data/schema/wrong_syntax.py
@@ -1,4 +1,4 @@
iport pandera as pa
iport pandera as pa # noqa: E999
from pandera.typing import Series

