Add Pandas `profile.py` analyzer for JSON, CSV/TSV, and Parquet #157

criccomini · 2023-01-30T00:02:06Z

JSON, CSV, TSV, and Parquet files now have a Pandas data profile analyzer for local filesystems, remote object stores (S3), and remote HTTP(S) locations. The profiler also runs against all SQLAlchemy-compatible URLs, so `TablePath` and `ViewPath` locations are analyzed with Pandas as well.
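As a rough illustration of what a Pandas-based profile can contain, here is a minimal sketch. It is not the `profile.py` analyzer from this PR: the `profile_frame` helper, the sample data, and the chosen statistics are all assumptions made for the example.

```python
import io

import pandas as pd

# Hypothetical sample data; stands in for a CSV file on disk, S3, or HTTP(S).
CSV_TEXT = """id,name,score
1,alice,9.5
2,bob,7.0
3,carol,
"""


def profile_frame(df: pd.DataFrame) -> dict:
    """Return per-column summary statistics as a plain dict.

    Illustrative helper only; it is not the analyzer added in this PR.
    """
    profile = {}
    for column in df.columns:
        series = df[column]
        stats = {
            "dtype": str(series.dtype),
            "count": int(series.count()),      # non-null values
            "nulls": int(series.isna().sum()),  # missing values
            "unique": int(series.nunique()),    # distinct non-null values
        }
        # Numeric summaries only make sense for numeric columns.
        if pd.api.types.is_numeric_dtype(series):
            stats["min"] = float(series.min())
            stats["max"] = float(series.max())
            stats["mean"] = float(series.mean())
        profile[column] = stats
    return profile


df = pd.read_csv(io.StringIO(CSV_TEXT))
profile = profile_frame(df)
```

The same helper would work on frames loaded with `pd.read_json`, `pd.read_parquet`, or `pd.read_sql_table`, since it only depends on the resulting `DataFrame`.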

I also took the opportunity to fix a bug in the Frictionless, GenSON, and DuckDB `columns.py` analyzers, which used forward-reference types for their `create_analyzer` methods. Because all three analyzers had the same class name, the plugins all resolved to the first analyzer class encountered. As a result, DuckDB's analyzer was returned for the Frictionless and GenSON analyzers as well.
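The PR description does not show how the plugin loader resolves analyzers, so the following is only one plausible way this class of bug can arise. Every name here (`ColumnAnalyzer`, the factories, the registry helpers) is hypothetical. The sketch assumes a registry keyed by each factory's return annotation: with string forward references, three different classes that share a name produce the same key.

```python
def _make_analyzer_class(origin: str) -> type:
    # Each call mimics a separate module defining `class ColumnAnalyzer`.
    return type("ColumnAnalyzer", (), {"origin": origin})


duckdb_cls = _make_analyzer_class("duckdb")
frictionless_cls = _make_analyzer_class("frictionless")
genson_cls = _make_analyzer_class("genson")


# Buggy style: the return type is a string forward reference, so the
# annotation carries only the name, not the class object.
def duckdb_factory() -> "ColumnAnalyzer":
    return duckdb_cls()


def frictionless_factory() -> "ColumnAnalyzer":
    return frictionless_cls()


def genson_factory() -> "ColumnAnalyzer":
    return genson_cls()


# Fixed style: the annotation is the class object itself.
def duckdb_factory_fixed() -> duckdb_cls:
    return duckdb_cls()


def frictionless_factory_fixed() -> frictionless_cls:
    return frictionless_cls()


def genson_factory_fixed() -> genson_cls:
    return genson_cls()


def build_registry(factories):
    """Hypothetical registry keyed by return annotation; first entry wins."""
    registry = {}
    for factory in factories:
        registry.setdefault(factory.__annotations__["return"], factory)
    return registry


def resolve(registry, factory):
    """Look up the factory registered for this factory's return annotation."""
    return registry[factory.__annotations__["return"]]


buggy = build_registry([duckdb_factory, frictionless_factory, genson_factory])
fixed = build_registry(
    [duckdb_factory_fixed, frictionless_factory_fixed, genson_factory_fixed]
)

# With string annotations every lookup collapses to the first registration.
buggy_origin = resolve(buggy, genson_factory)().origin
# With class-object annotations each lookup finds its own factory.
fixed_origin = resolve(fixed, genson_factory_fixed)().origin
```

In this sketch the string-keyed registry holds a single entry, so a lookup for the GenSON factory returns the DuckDB one, while keying on the class objects keeps the three analyzers distinct.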

Future work:

* Add Pandas's `DataFrame.hist()` data to the profile.
* Add sampling support to the analyzer.


criccomini merged commit d21635a into main Jan 30, 2023

criccomini deleted the add-pandas-stats branch January 30, 2023 00:02

This was referenced Jan 30, 2023

Add DataFrame.hist() data to pandas/profile.py #158

Closed

Support sampling in Pandas ProfileAnalyzer #159

Closed
