Support AbstractPath where file paths are used (#255)
* Support AbstractPath where file paths are used
* Set package to version 2.2.0

Co-authored-by: Jarrett Revels <jarrettrevels@gmail.com>
omus and jrevels committed Oct 29, 2021
1 parent 56f8f93 commit a3eec89b51f712d916e7c6c8de78153de3430417
Showing 5 changed files with 47 additions and 13 deletions.
@@ -1,7 +1,7 @@
name = "Arrow"
uuid = "69666777-d1a9-59fb-9406-91d4454c9d45"
authors = ["quinnj <quinn.jacobd@gmail.com>"]
-version = "2.1.0"
+version = "2.2.0"

[deps]
ArrowTypes = "31f734f8-188a-4ce0-8406-c8a06bd891cd"
@@ -23,6 +23,7 @@ BitIntegers = "0.2"
CodecLz4 = "0.4"
CodecZstd = "0.7"
DataAPI = "1"
+FilePathsBase = "0.9"
PooledArrays = "0.5, 1.0"
SentinelArrays = "1"
Tables = "1.1"
@@ -31,10 +32,11 @@ julia = "1.3"

[extras]
CategoricalArrays = "324d7699-5711-5eae-9e2f-1d82baa6b597"
+FilePathsBase = "48062228-2e41-5def-b9a4-89aafe57970f"
JSON3 = "0f8b85d8-7281-11e9-16c2-39a750bddbf1"
Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
StructTypes = "856f2bd8-1eba-4b0a-8007-ebc267875bd4"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
-test = ["Test", "Random", "JSON3", "StructTypes", "CategoricalArrays"]
+test = ["CategoricalArrays", "FilePathsBase", "JSON3", "Random", "StructTypes", "Test"]
@@ -6,6 +6,16 @@ The best place to learn about the Apache arrow project is [the website itself](h

The [Arrow.jl](https://github.com/JuliaData/Arrow.jl) Julia package is another implementation, allowing the ability to both read and write data in the arrow format. As a data format, arrow specifies an exact memory layout to be used for columnar table data, and as such, "reading" involves custom Julia objects ([`Arrow.Table`](@ref) and [`Arrow.Stream`](@ref)), which read the *metadata* of an "arrow memory blob", then *wrap* the array data contained therein, having learned the type and size, amongst other properties, from the metadata. Let's take a closer look at what this "reading" of arrow memory really means/looks like.

+## Support for generic path-like types
+
+Arrow.jl attempts to support any path-like type wherever a function takes a path as an argument. The Arrow.jl API should generically work as long as the type supports:
+
+- `Base.open(path, mode)::I where I <: IO`
+
+When a custom `IO` subtype is returned (`I`) then the following methods also need to be defined:
+
+- `Base.read(io::I, ::Type{UInt8})` or `Base.read(io::I)`
+- `Base.write(io::I, x)`
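
As a rough illustration of the interface listed above, the following sketch defines a hypothetical `WrappedPath` type (not part of Arrow.jl or FilePathsBase.jl) and gives it the single required `Base.open(path, mode)` method. Because that method returns an ordinary `IOStream`, no extra `read`/`write` methods are needed. The `roundtrip` helper is also an assumption made for the example; it uses the same `open(f, file_path, mode)` calling pattern that this commit relies on and needs only Base.

```julia
# Hypothetical path-like type for illustration only;
# it is not defined by Arrow.jl or FilePathsBase.jl.
struct WrappedPath
    path::String
end

# The one required method: open the path in the given mode and return an IO.
# Forwarding to the String method yields a standard IOStream.
Base.open(p::WrappedPath, mode::AbstractString) = open(p.path, mode)

# Hypothetical helper: write bytes through the path-like type, then read them back.
# `open(f, p, mode)` falls through to Base's generic `open(f::Function, args...)`,
# which calls the two-argument method defined above and closes the stream afterwards.
function roundtrip(bytes::Vector{UInt8})
    mktempdir() do dir
        p = WrappedPath(joinpath(dir, "demo.bin"))
        open(io -> write(io, bytes), p, "w")
        return open(read, p, "r")
    end
end

roundtrip(UInt8[0x01, 0x02, 0x03])
```

Assuming Arrow.jl 2.2.0 or later is installed, a value such as `WrappedPath("data.arrow")` would be expected to work with `Arrow.write` and `Arrow.Table` in the same way, since both go through `open`; that is an inference from this diff rather than something verified here.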

## Reading arrow data

@@ -173,7 +183,7 @@ Ok, so that's a pretty good rundown of *reading* arrow data, but how do you *pro

### `Arrow.write`

-With `Arrow.write`, you provide either an `io::IO` argument or `file::String` to write the arrow data to, as well as a Tables.jl-compatible source that contains the data to be written.
+With `Arrow.write`, you provide either an `io::IO` argument or a [`file_path`](#support-for-generic-path-like-types) to write the arrow data to, as well as a Tables.jl-compatible source that contains the data to be written.

What are some examples of Tables.jl-compatible sources? A few examples include:
* `Arrow.write(io, df::DataFrame)`: A `DataFrame` is a collection of indexable columns
@@ -24,11 +24,8 @@ ArrowBlob(bytes::Vector{UInt8}, pos::Int, len::Nothing) = ArrowBlob(bytes, pos,

tobytes(bytes::Vector{UInt8}) = bytes
tobytes(io::IO) = Base.read(io)
-function tobytes(str)
-f = string(str)
-isfile(f) || throw(ArgumentError("$f is not a file"))
-return Mmap.mmap(f)
-end
+tobytes(io::IOStream) = Mmap.mmap(io)
+tobytes(file_path) = open(tobytes, file_path, "r")

struct BatchIterator
bytes::Vector{UInt8}
@@ -53,11 +53,11 @@ function write end

write(io_or_file; kw...) = x -> write(io_or_file, x; kw...)

-function write(filename::String, tbl; metadata=getmetadata(tbl), colmetadata=nothing, largelists::Bool=false, compress::Union{Nothing, Symbol, LZ4FrameCompressor, ZstdCompressor}=nothing, denseunions::Bool=true, dictencode::Bool=false, dictencodenested::Bool=false, alignment::Int=8, maxdepth::Int=DEFAULT_MAX_DEPTH, ntasks=Inf, file::Bool=true)
-open(filename, "w") do io
+function write(file_path, tbl; metadata=getmetadata(tbl), colmetadata=nothing, largelists::Bool=false, compress::Union{Nothing, Symbol, LZ4FrameCompressor, ZstdCompressor}=nothing, denseunions::Bool=true, dictencode::Bool=false, dictencodenested::Bool=false, alignment::Int=8, maxdepth::Int=DEFAULT_MAX_DEPTH, ntasks=Inf, file::Bool=true)
+open(file_path, "w") do io
write(io, tbl, file, largelists, compress, denseunions, dictencode, dictencodenested, alignment, maxdepth, ntasks, metadata, colmetadata)
end
-return filename
+return file_path
end

function write(io::IO, tbl; metadata=getmetadata(tbl), colmetadata=nothing, largelists::Bool=false, compress::Union{Nothing, Symbol, LZ4FrameCompressor, ZstdCompressor}=nothing, denseunions::Bool=true, dictencode::Bool=false, dictencodenested::Bool=false, alignment::Int=8, maxdepth::Int=DEFAULT_MAX_DEPTH, ntasks=Inf, file::Bool=false)
@@ -15,7 +15,7 @@
# limitations under the License.

using Test, Arrow, ArrowTypes, Tables, Dates, PooledArrays, TimeZones, UUIDs,
-CategoricalArrays, DataAPI
+CategoricalArrays, DataAPI, FilePathsBase
using Random: randstring

include(joinpath(dirname(pathof(ArrowTypes)), "../test/tests.jl"))
@@ -71,6 +71,30 @@ end

end # @testset "arrow json integration tests"

+@testset "abstract path" begin
+# Make a custom path type that simulates how AWSS3.jl's S3Path works
+struct CustomPath <: AbstractPath
+path::PosixPath
+end
+
+Base.read(p::CustomPath) = read(p.path)
+
+io = Arrow.tobuffer((col=[0],))
+tt = Arrow.Table(io)
+
+mktempdir() do dir
+p = Path(joinpath(dir, "test.arrow"))
+Arrow.write(p, tt)
+@test isfile(p)
+
+tt2 = Arrow.Table(p)
+@test values(tt) == values(tt2)
+
+tt3 = Arrow.Table(CustomPath(p))
+@test values(tt) == values(tt3)
+end
+end # @testset "abstract path"
+
@testset "misc" begin

# multiple record batches
@@ -167,7 +191,8 @@ tt = Arrow.Table(Arrow.tobuffer(t))
@test tt.a == ["aaaaaaaaaa", "aaaaaaaaaa"]

# 49
-@test_throws ArgumentError Arrow.Table("file_that_doesnt_exist")
+@test_throws SystemError Arrow.Table("file_that_doesnt_exist")
+@test_throws SystemError Arrow.Table(p"file_that_doesnt_exist")

# 52
t = (a=Arrow.DictEncode(string.(1:129)),)

2 comments on commit a3eec89


@omus omus commented on a3eec89 Oct 29, 2021

@JuliaRegistrator JuliaRegistrator commented on a3eec89 Oct 29, 2021


Registration pull request created: JuliaRegistries/General/47749

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v2.2.0 -m "<description of version>" a3eec89b51f712d916e7c6c8de78153de3430417
git push origin v2.2.0
