Typed DataFrames

Pandas DataFrame subclasses that self-organize and serialize robustly.

import typeddfs

Film = typeddfs.typed("Film").require("name", "studio", "year").build()
df = Film.read_csv("file.csv")
assert df.columns.tolist() == ["name", "studio", "year"]
type(df)  # Film

Your types remember how to be read, including columns, dtypes, indices, and custom requirements. No index_col=, header=, set_index, or astype needed.

Read and write any format:

path = input("input file? [.csv/.tsv/.tab/.json/.xml.bz2/.feather/.snappy.h5/...]")
df = Film.read_file(path)
df.write_file("output.snappy")

Need dataclasses?

instances = df.to_dataclass_instances()
Film.from_dataclass_instances(instances)

Save metadata?

df = df.set_attrs(dataset="piano")
df.write_file("df.csv", attrs=True)
df = Film.read_file("df.csv", attrs=True)
print(df.attrs)  # e.g. {"dataset": "piano"}

Make dirs? Don’t overwrite?

df.write_file("df.csv", mkdirs=True, overwrite=False)

Write / verify checksums?

df.write_file("df.csv", file_hash=True)
df = Film.read_file("df.csv", file_hash=True)  # fails if wrong
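For readers curious what writing and verifying a checksum involves, here is a minimal standard-library sketch of the idea. It is not how typeddfs implements file_hash; the helper names and the ".sha256" sidecar naming are assumptions made for this illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def write_with_hash(path: Path, data: bytes) -> Path:
    """Write data, then record its SHA-256 digest in a sibling file."""
    path.write_bytes(data)
    # Hypothetical naming scheme: "<filename>.sha256" next to the data file.
    hash_path = path.with_name(path.name + ".sha256")
    hash_path.write_text(hashlib.sha256(data).hexdigest(), encoding="utf-8")
    return hash_path


def read_with_hash(path: Path) -> bytes:
    """Read data; raise ValueError if it does not match the recorded digest."""
    data = path.read_bytes()
    hash_path = path.with_name(path.name + ".sha256")
    expected = hash_path.read_text(encoding="utf-8").strip()
    if hashlib.sha256(data).hexdigest() != expected:
        raise ValueError(f"Checksum mismatch for {path}")
    return data


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "df.csv"
    write_with_hash(csv_path, b"name,studio,year\n")
    roundtrip = read_with_hash(csv_path)  # passes: the file is unchanged
    csv_path.write_bytes(b"tampered\n")   # simulate corruption
    try:
        read_with_hash(csv_path)
        tamper_detected = False
    except ValueError:
        tamper_detected = True
```

The point of the sketch is that verification happens at read time, so a file that changed after it was written is rejected rather than silently loaded.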

Get example datasets?

print(ExampleDfs.penguins().df)
#    species     island  bill_length_mm  ...  flipper_length_mm  body_mass_g     sex
# 0    Adelie  Torgersen            39.1  ...              181.0       3750.0    MALE

Pretty-print the obvious way?

df.pretty_print(to="all_data.md.zip")
wiki_txt = df.pretty_print(fmt="mediawiki")

All standard DataFrame methods remain available. Use .of(df) to convert to your type, or .vanilla() for a plain DataFrame.

Read the docs 📚 for more info and examples.

🐛 Pandas serialization bugs fixed

Pandas has several issues with serialization.

See: Fixed issues

Depending on the format and columns, these issues occur:

columns being silently added or dropped,
errors on either read or write of empty DataFrames,
the inability to use DataFrames with indices in Feather,
writing to Parquet failing with half-precision,
lingering partially written files on error,
the buggy xlrd being preferred by read_excel,
the buggy odfpy also being preferred,
a DataFrame that differs after being written and read back,
the inability to write fixed-width format,
the platform text encoding being used rather than UTF-8,
and invalid JSON being written via the built-in json library.
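The last item can be reproduced with the standard library alone. The snippet below shows the built-in json encoder emitting a token that the JSON specification does not allow. The string-encoding workaround at the end is only an assumption for this sketch, not a description of what typeddfs does.

```python
import json
import math

# The built-in encoder emits the bare token NaN, which is not valid JSON.
text = json.dumps({"score": float("nan")})
print(text)  # {"score": NaN}

# allow_nan=False makes the encoder refuse instead of writing invalid JSON.
try:
    json.dumps({"score": float("nan")}, allow_nan=False)
    strict_rejected = False
except ValueError:
    strict_rejected = True


def encode_float(value: float):
    """Hypothetical workaround: represent non-finite floats as strings."""
    return value if math.isfinite(value) else str(value)


safe_text = json.dumps({"score": encode_float(float("nan"))})
print(safe_text)  # {"score": "nan"}
```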

🎁 Other features

See more in the guided walkthrough ✏️

See: Short feature list

Dtype-aware natural sorting
UTF-8 by default
Near-atomicity of read/write
Matrix-like typed dataframes and methods (e.g. matrix.is_symmetric())
DataFrame-compatible frozen, hashable, ordered collections (dict, list, and set)
Serialize JSON robustly, preserving NaN, inf, −inf, enums, timezones, complex numbers, etc.
Serialize more formats like TOML and INI
Interpreting paths and formats (e.g. FileFormat.split("dir/myfile.csv.gz").compression # gz)
Generate good CLI help text for input DataFrames
Parse/verify/add/update/delete files in a .shasum-like file
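As an illustration of what near-atomic writing can mean, the following sketch writes to a temporary file in the target directory and then renames it into place. This is a common general technique, shown here under the assumption that it resembles the idea; it is not taken from the typeddfs source, and the function name is hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_text_atomically(path: Path, text: str) -> None:
    """Write to a temporary file in the same directory, then rename it."""
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        # os.replace is atomic on POSIX when both paths share a filesystem.
        os.replace(tmp_name, path)
    except BaseException:
        # Clean up so that no partially written file lingers.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "df.csv"
    write_text_atomically(target, "name,studio,year\n")
    written = target.read_text(encoding="utf-8")
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name != "df.csv"]
```

Readers of the target path see either the old file or the complete new one, never a half-written file.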

💔 Limitations

See: List of limitations

Multi-level columns are not yet supported.
Columns and index levels cannot share names.
Duplicate column names are not supported. (These are strange anyway.)
A typed DF cannot have columns "level_0", "index", or "Unnamed: 0".
inplace is forbidden in some functions; avoid it or use .vanilla().

🔌 Serialization support

TypedDfs provides the methods read_file and write_file, which guess the format from the filename extension. For example, this will convert a gzipped, tab-delimited file to Feather:

TastyDf = typeddfs.typed("TastyDf").build()
TastyDf.read_file("myfile.tab.gz").write_file("myfile.feather")
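To make the extension-guessing idea concrete, here is a small standard-library sketch that splits a filename into a stem, a format, and an optional compression. The lookup tables and the split_path helper are hypothetical and far less complete than what typeddfs recognizes.

```python
from pathlib import PurePath
from typing import NamedTuple, Optional

# Hypothetical lookup tables for this sketch only.
COMPRESSION_SUFFIXES = (".gz", ".bz2", ".xz", ".zst")
FORMAT_SUFFIXES = {
    ".csv": "csv",
    ".tsv": "tsv",
    ".tab": "tsv",  # tab-delimited
    ".json": "json",
    ".feather": "feather",
}


class SplitPath(NamedTuple):
    stem: str
    fmt: str
    compression: Optional[str]


def split_path(path: str) -> SplitPath:
    """Peel off a compression suffix, then match the format suffix."""
    name = PurePath(path).name
    compression = None
    for suffix in COMPRESSION_SUFFIXES:
        if name.endswith(suffix):
            compression = suffix.lstrip(".")
            name = name[: -len(suffix)]
            break
    for suffix, fmt in FORMAT_SUFFIXES.items():
        if name.endswith(suffix):
            return SplitPath(name[: -len(suffix)], fmt, compression)
    raise ValueError(f"Unrecognized format: {path}")


print(split_path("myfile.tab.gz"))
print(split_path("myfile.feather"))
```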

Pandas does most of the serialization, but some formats require extra packages. Typed-dfs specifies extras that install the required packages at compatible versions.

Here are the extras:

feather: Feather (uses: pyarrow)
parquet: Parquet (e.g. .snappy) (uses: pyarrow)
xml: XML (uses: lxml)
excel: Excel and LibreOffice .xlsx/.ods/.xls, etc. (uses: openpyxl, defusedxml)
toml: TOML (uses: tomlkit)
yaml: YAML (uses: ruamel.yaml)
html: HTML (uses: html5lib, beautifulsoup4)
xlsb: rare binary Excel file (uses: pyxlsb)
HDF5: no extra provided (use: tables or pandas[hdf5])

For example, for Feather and TOML support, use typeddfs[feather,toml]. As a shorthand for all formats, use typeddfs[all].
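If it is unclear whether the packages behind an extra are installed, a quick check like the one below can help. This is only a sketch using the standard library; the missing_modules helper is hypothetical, and the module names are the ones listed for each extra above.

```python
from importlib.util import find_spec

# Import names taken from the list of extras above (a subset).
EXTRA_MODULES = {
    "feather": ["pyarrow"],
    "parquet": ["pyarrow"],
    "xml": ["lxml"],
    "toml": ["tomlkit"],
    "xlsb": ["pyxlsb"],
}


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


for extra, modules in EXTRA_MODULES.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing " + ", ".join(missing)
    print(f"{extra}: {status}")
```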

📊 Serialization in-depth

See: Full table

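| format | changes | packages | extra | sanity | speed | bitrate |
|---|---|---|---|---|---|---|
| Feather | fixed | `pyarrow` | `feather` | +++ | +++ | +++ |
| Parquet | fixed | `pyarrow` | `parquet` * | ++ † | +++ | +++ |
| csv/tsv | fixed | | | - | − | text |
| flexwf ‡ | new | | | - | − | text |
| .fwf | +read | | | - | − | text |
| json | fixed | | | - | −− | text |
| xml | fixed | `lxml` | `xml` | − | −− | text |
| .properties | new | | | - | − | text |
| toml | new | `tomlkit` | `toml` | - | − | text |
| yaml | new | `ruamel.yaml` | `yaml` | - | - | text |
| INI | new | | | -- | − | text |
| .lines | new | | | - | − | text |
| .npy | | | | − | + | +++ |
| .npz | | | | − | + | +++ |
| .html | | `html5lib,beautifulsoup4` | `html` | −− | −− | text |
| pickle | | | | - | −− | - |
| XLSX | fixed | `openpyxl,defusedxml` | `excel` | + | − | - |
| ODS | fixed | `openpyxl,defusedxml` | `excel` | + | − | - |
| XLS | fixed | `openpyxl,defusedxml` | `excel` | −− | − | - |
| XLSB | | `pyxlsb` | `xlsb` | −− | − | + |
| HDF5 | | `tables` | none | - | − | + |
| GZIP | | | | N/A | - | ++ |
| ZIP § | | | | N/A | - | ++ |
| BZIP2 | | | | N/A | -- | +++ |
| XZ | | | | N/A | -- | +++ |
| ZSTD | | `zstandard` | | N/A | +++ | +++ |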

See: serialization notes

* Parquet only supports str, float64, float32, int64, int32, and bool. Other numeric types are automatically converted during write.
† fastparquet can be used instead. It is slower but a much smaller dependency.
‡ .flexwf is fixed-width with optional delimiters.
For HDF5 support, use pandas[hdf5]. Wheels for pytables are often unavailable, blocking dependency resolution.
§ ZIP is currently not supported via write_file and read_file.
JSON has inconsistent handling of None (orjson is more consistent).
XML requires Pandas 1.3+.
Not all JSON, XML, TOML, and HDF5 files can be read.
.ini and .properties can only be written with exactly 2 columns + index levels: a key and a value. INI keys are in the form section.name.
.lines can only be written with exactly 1 column or index level.
.npy and .npz only serialize numpy objects. They are not supported in read_file and write_file.
.html is not supported in read_file and write_file.
Pickle is insecure and not recommended.
Pandas supports odfpy for ODS and xlrd for XLS. In fact, it prefers those. However, they are very buggy; openpyxl is much better.
XLSM, XLTX, XLTM, XLS, and XLSB files can contain macros, which Microsoft Excel will ingest.
XLS is a deprecated format.
XLSB is not fully supported in Pandas.
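The note about INI keys in the form section.name can be pictured with the standard library's configparser. The sketch below maps two-column key-value rows onto INI sections and back. It only illustrates the key convention, under the assumption that the part before the first dot is the section; it is not the typeddfs implementation, and the rows are placeholder values.

```python
import configparser
import io

# Two "columns": a key in the form section.name, and a value.
rows = [
    ("film.name", "Example"),
    ("film.year", "2000"),
    ("studio.name", "Example Studio"),
]

parser = configparser.ConfigParser()
for key, value in rows:
    # Everything before the first dot is treated as the section.
    section, _, name = key.partition(".")
    if not parser.has_section(section):
        parser.add_section(section)
    parser.set(section, name, value)

buffer = io.StringIO()
parser.write(buffer)
ini_text = buffer.getvalue()
print(ini_text)

# Reading back recovers the same key-value rows.
reader = configparser.ConfigParser()
reader.read_string(ini_text)
recovered = [
    (f"{section}.{name}", reader[section][name])
    for section in reader.sections()
    for name in reader[section]
]
```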

Feather offers far better performance than CSV, gzipped CSV, and HDF5 in read speed, write speed, memory overhead, and compression ratio. Parquet typically produces smaller files than Feather at some cost in speed. Feather is the preferred format for most cases.

🔒 Security

Refer to the security policy.

📝 Extra notes

See: Pinned versions

Dependencies in the extras only have version minimums, not maximums. For example, typed-dfs requires pyarrow >= 4. natsort is also only assigned a minimum version number, so a newer natsort release could change the result of typed-dfs’s sort_natural. To avoid this, pin natsort to a specific major version; e.g. natsort = "^8" with Poetry or natsort>=8,<9 with pip.
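The order produced by a natural sort depends entirely on how the sort key is computed, which is why a change in the underlying library can change results. The sketch below is a minimal standard-library natural sort for illustration only; it is not natsort's algorithm, and natural_key is a hypothetical helper.

```python
import re


def natural_key(text: str) -> list:
    """Split text into digit and non-digit runs so numbers compare numerically."""
    parts = re.split(r"(\d+)", text)
    # Odd positions hold the digit runs captured by the group.
    # Tagging each piece avoids comparing int with str.
    return [
        (0, int(part)) if index % 2 else (1, part.lower())
        for index, part in enumerate(parts)
        if part
    ]


names = ["film10", "film2", "film1"]
print(sorted(names))                   # ['film1', 'film10', 'film2']
print(sorted(names, key=natural_key))  # ['film1', 'film2', 'film10']
```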

🍁 Contributing

Typed-Dfs is licensed under the Apache License, version 2.0. New issues and pull requests are welcome. Please refer to the contributing guide. Generated with Tyrannosaurus.
