Parquet export IR #183

max-hoffman · 2022-06-01T19:45:35Z

Exporting data through a CSV intermediary loses specificity
and type info. This is particularly noticeable
for read_pandas, where the resulting dataframe has every column
of type `object` and NULLs are indistinguishable from zero values.

I used a small hack to export data from Dolt into a DataFrame using Parquet
instead of CSV. This requires the pyarrow dependency.

I left TODOs marking improvements on the Dolt side that would make
this code cleaner, and filed Dolt issues for the associated features.

There is one known bug with NULL datetime values; I filed a Dolt issue
for it.
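
For reviewers who want to see the kind of loss described above, here is a minimal sketch using only the Python standard library. It is not the doltpy code path: the rows, column names, and the `csv_round_trip` helper are hypothetical stand-ins for a Dolt table export. It shows that a CSV round trip returns every value as a string and collapses a NULL into the same value as an empty string, whereas a typed format such as Parquet stores a schema alongside the data.

```python
import csv
import io

# Hypothetical rows standing in for a Dolt table:
# an integer id, a float score, and a nullable text column.
rows = [
    {"id": 1, "score": 0.5, "note": None},  # note is NULL
    {"id": 2, "score": 0.0, "note": ""},    # note is an empty string
]


def csv_round_trip(records):
    """Write records to an in-memory CSV and read them back."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    buf.seek(0)
    return list(csv.DictReader(buf))


restored = csv_round_trip(rows)

for row in restored:
    # Every value comes back as str; the original int/float types are gone.
    print({name: (value, type(value).__name__) for name, value in row.items()})

# The NULL and the empty string are no longer distinguishable.
print(restored[0]["note"] == restored[1]["note"])
```

Any consumer of the CSV has to guess types and NULLs after the fact, which is the guesswork the Parquet intermediary is meant to avoid.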

max-hoffman · 2022-06-01T19:45:54Z

re: #179

codecov-commenter · 2022-06-01T21:17:26Z

Codecov Report

Merging #183 (83a5f9e) into main (f3c83cc) will increase coverage by 1.10%.
The diff coverage is 95.65%.

@@            Coverage Diff             @@
##             main     #183      +/-   ##
==========================================
+ Coverage   42.88%   43.98%   +1.10%     
==========================================
  Files          23       23              
  Lines         977      998      +21     
==========================================
+ Hits          419      439      +20     
- Misses        558      559       +1

Impacted Files	Coverage Δ
doltpy/cli/read.py	`97.05% <95.65%> (-2.95%)`	⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

max-hoffman added 2 commits June 1, 2022 12:53

fix fmt

92e572e

revert python changes, use early pyarrow

83a5f9e

max-hoffman merged commit 0043e7e into main Jun 1, 2022

max-hoffman deleted the max/export-pq-ir branch June 1, 2022 21:19

kretes mentioned this pull request Jun 6, 2022

This package has too many nonoptional dependencies #132

Open
