Big refactoring to simplify code and implement more parts of the spec #26

quinnj · 2020-09-27T06:22:47Z

In particular, this PR:

* Changes the flow of writing so that input columns are converted to their "arrow" equivalent, most of the time lazily, so that the actual writing is much simpler and just deals with each array type
* Introduces an ArrowTypes module that defines the ArrowType trait, which types can overload to signal what kind of arrow type they should be converted to. This isn't fully fleshed out yet, as we still need to figure out the actual requirements for the different arrow types, but it's a start
* Supports reading/writing compressed arrow buffers: automatically when reading, and via keyword arg when writing (`compress=:lz4` or `compress=:zstd`)
* Supports nested dict encoding (fixes "Support writing dict encodings for nested columns" #15)
* Fixes "Unable to round-trip a DataFrame" #24; not exactly sure what the issue was, but it doesn't happen on this branch
* Reorganizes tests so that for each kind of "table", we do an IPC test, a compressed buffers test, and a file test
* Supports nesting "special" types like Char/Symbol; previously they were only allowed as top-level columns, but now they can appear in NamedTuple, vector of vectors, etc.
* Fixes the Map type; I mistakenly thought Map meant that each element of the column was a Pair, but it actually means each element is a Dict. So it's more similar to the List type than Struct; this fixes the reading/writing for this type.
* Ensures that array types that need to hold a reference to their `unsafe_wrap`ped bytes do so; these bytes may be from the original arrow blob or an uncompressed buffer
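
To illustrate the Map point above: in the Arrow columnar format a Map column is laid out like a List whose entries are key/value structs, so each column element is a whole dictionary delimited by an offsets array. The following is a minimal, language-agnostic sketch in Python of that layout idea only. It is not the code from this PR, and the function names `encode_map_column` and `decode_map_column` are hypothetical, chosen purely for illustration.

```python
from typing import Dict, List, Tuple


def encode_map_column(
    rows: List[Dict[str, int]],
) -> Tuple[List[int], List[str], List[int]]:
    """Flatten a column of dicts into offsets plus flat key/value arrays.

    Hypothetical helper: mirrors the List-like layout of an Arrow Map,
    where offsets[i]..offsets[i+1] delimits the entries of element i.
    """
    offsets = [0]
    keys: List[str] = []
    values: List[int] = []
    for row in rows:
        for key, value in row.items():
            keys.append(key)
            values.append(value)
        # One offset per column element, exactly as for a List column.
        offsets.append(len(keys))
    return offsets, keys, values


def decode_map_column(
    offsets: List[int], keys: List[str], values: List[int]
) -> List[Dict[str, int]]:
    """Rebuild one dict per column element from the flattened arrays."""
    return [
        dict(zip(keys[start:stop], values[start:stop]))
        for start, stop in zip(offsets, offsets[1:])
    ]


# Each element of the column is a whole dict, not a single key/value pair.
column = [{"a": 1, "b": 2}, {}, {"c": 3}]
offsets, keys, values = encode_map_column(column)
roundtrip = decode_map_column(offsets, keys, values)
```

Under the earlier "each element is a Pair" reading, the column would have needed no offsets at all, which is why the type behaves more like List than Struct.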

codecov · 2020-10-01T04:27:49Z

Codecov Report

Merging #26 into master will increase coverage by 1.63%.
The diff coverage is 82.12%.

@@            Coverage Diff             @@
##           master      #26      +/-   ##
==========================================
+ Coverage   79.01%   80.64%   +1.63%     
==========================================
  Files          13       14       +1     
  Lines        2063     2557     +494     
==========================================
+ Hits         1630     2062     +432     
- Misses        433      495      +62

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/FlatBuffers/FlatBuffers.jl | `66.25% <0.00%> (-3.88%)` | ⬇️ |
| src/metadata/Schema.jl | `81.25% <0.00%> (+3.70%)` | ⬆️ |
| src/Arrow.jl | `28.57% <23.07%> (-71.43%)` | ⬇️ |
| src/utils.jl | `73.89% <71.12%> (+2.88%)` | ⬆️ |
| src/arraytypes.jl | `76.98% <76.03%> (+1.55%)` | ⬆️ |
| src/arrowtypes.jl | `80.00% <80.00%> (ø)` | |
| src/metadata/Message.jl | `74.56% <80.00%> (+14.03%)` | ⬆️ |
| src/write.jl | `94.23% <93.75%> (+3.61%)` | ⬆️ |
| src/table.jl | `93.41% <95.36%> (+0.24%)` | ⬆️ |
| src/eltypes.jl | `83.91% <98.80%> (+6.23%)` | ⬆️ |
| ... and 8 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Last update cb39e96...3cb988f.

dmbates mentioned this pull request Sep 28, 2020

Unable to round-trip a DataFrame #24

Closed

quinnj added 2 commits September 29, 2020 00:14

more work on refactor

a70d925

get tests passing

08d2423

quinnj added 4 commits September 30, 2020 22:33

fix some failing test

de1795d

fixes

bd30c5e

fix

59685f9

drop 32-bit for now

3cb988f

quinnj merged commit 1a1d3c5 into master Oct 1, 2020

quinnj deleted the jq/compression branch October 1, 2020 05:03

This was referenced Oct 1, 2020

How to support 64-bit offset array writing #14

Closed

Do we need to handle data compression yet? #16

Closed
