Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some tricky data for Arrow #10

Closed
xiaodaigh opened this issue Aug 31, 2020 · 11 comments · Fixed by #11
Closed

Some tricky data for Arrow #10

xiaodaigh opened this issue Aug 31, 2020 · 11 comments · Fixed by #11

Comments

@xiaodaigh
Copy link

xiaodaigh commented Aug 31, 2020

I have this data in the JDF.jl tests. Arrow is currently failing to write it. We could incorporate something like this in the tests.

using JDF
using DataFrames
using Random: randstring
using WeakRefStrings

df = DataFrame([collect(1:100) for i = 1:3000])
df[!, :int_missing] =
    rand([rand(rand([UInt, Int, Float64, Float32, Bool])), missing], nrow(df))

df[!, :missing] .= missing
df[!, :strs] = [randstring(8) for i = 1:nrow(df)]
df[!, :stringarray] = StringVector([randstring(8) for i = 1:nrow(df)])

df[!, :strs_missing] = [rand([missing, randstring(8)]) for i = 1:nrow(df)]
df[!, :stringarray_missing] =
    StringVector([rand([missing, randstring(8)]) for i = 1:nrow(df)])
df[!, :symbol_missing] = [rand([missing, Symbol(randstring(8))]) for i = 1:nrow(df)]
df[!, :char] = getindex.(df[!, :strs], 1)
df[!, :char_missing] = allowmissing(df[!, :char])
df[rand(1:nrow(df), 10), :char_missing] .= missing

@time JDF.save("a.jdf", df) # works with JDF 


# but fails with Arrow
Arrow.write("a.arrow", df)
error msg 4,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Int64,Union{Missing, UInt64},Missing,String,String,Union{Missing, String},Union{Missing, String},Union{Missing, Symbol},Char,Union{Missing, Char}}}, ::DataFrame, ::Dict{Int64,Tuple{Int64,Type,Any}}) at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:258 [9] macro expansion at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:71 [inlined] [10] macro expansion at .\task.jl:332 [inlined] [11] write(::IOStream, ::DataFrame, ::Bool, ::Bool) at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:58 [12] #29 at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:21 [inlined] [13] open(::Arrow.var"#29#30"{Bool,DataFrame}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .\io.jl:325 [14] open at .\io.jl:323 [inlined] [15] #write#28 at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:20 [inlined] [16] write(::String, ::DataFrame) at C:\Users\RTX2080\.julia\packages\Arrow\FjbLX\src\write.jl:20 [17] top-level scope at REPL[44]:1 [18] include_string(::Function, ::Module, ::String, ::String) at .\loading.jl:1088
@quinnj
Copy link
Member

quinnj commented Aug 31, 2020

It's helpful to post backtraces as that can often quickly show where the error is

@xiaodaigh
Copy link
Author

It prints out alot of stuff. It's like 11pm. I will try to get a MWE together.

@quinnj
Copy link
Member

quinnj commented Aug 31, 2020

Alright, so the error here is because it doesn't know how to map Symbol or Char columns to arrow format. I guess in both cases, we should just map them to strings.

@quinnj
Copy link
Member

quinnj commented Aug 31, 2020

I guess the biggest question is if we're ok not having round-tripping in these cases; i.e. we could allow writing columns of Symbol/Char, but they'll be read back in as strings. @bkamins @nalimilan , any thoughts/preferences here?

@nalimilan
Copy link
Contributor

Not sure. Maybe see what Feather.jl does? Also of interest is what RCall should do with CategoricalArray{T} when !(T<:AbstractString), as R only supports string factors (JuliaInterop/RCall.jl#331).

@bkamins
Copy link
Contributor

bkamins commented Aug 31, 2020

I think the priority is to allow saving things with as little problems as possible. Symbol and Char are normal in Julia.

Given your comment I understand that we have a choice:

  1. either error on them
  2. or convert to String when saving

I would prefer the option 2. maybe with a @warn about conversion

@quinnj
Copy link
Member

quinnj commented Aug 31, 2020

I mean, Feather.jl will just be a wrapper around Arrow.jl pretty soon, so they'll do whatever we do. I also think I like the idea of option 2 w/ a warning. I think people should understand that Arrow.jl isn't meant to be a serialization package, even though it can handle a wide range of types of objects. But even still, I think doing some automatic conversions will be convenient.

@bkamins
Copy link
Contributor

bkamins commented Aug 31, 2020

Agreed

@xiaodaigh
Copy link
Author

xiaodaigh commented Sep 1, 2020

There is a way to store arbitrary metadata right? We should remember them as Julia:Symbol and Julia:Char types. So in Python and r, they get read as String but in Julia we can use the metadata to restore them as Symbol and Char?

@quinnj
Copy link
Member

quinnj commented Sep 1, 2020

Ah, I forgot about that; yes, that's a pretty good idea. I'll have to see how we can work the plumbing in for that.

quinnj added a commit that referenced this issue Sep 1, 2020
Fixes #10. This PR is large mainly because it adds fairly generic
support for custom field metadata. We then use that support to add
custom type extension for Symbol/Char vectors. We could perhaps factor
this a bit more to make it even more generic for custom types, but I
think this gets us pretty darn far.
@quinnj
Copy link
Member

quinnj commented Sep 1, 2020

Ok, took a bit of work, but I think I've got it all working for Symbol/Char. #11

@quinnj quinnj closed this as completed in #11 Sep 1, 2020
quinnj added a commit that referenced this issue Sep 1, 2020
* Allow initial support for custom extension types

Fixes #10. This PR is large mainly because it adds fairly generic
support for custom field metadata. We then use that support to add
custom type extension for Symbol/Char vectors. We could perhaps factor
this a bit more to make it even more generic for custom types, but I
think this gets us pretty darn far.

* Fix dict encoded
quinnj added a commit that referenced this issue Oct 3, 2020
* Allow initial support for custom extension types

Fixes #10. This PR is large mainly because it adds fairly generic
support for custom field metadata. We then use that support to add
custom type extension for Symbol/Char vectors. We could perhaps factor
this a bit more to make it even more generic for custom types, but I
think this gets us pretty darn far.

* Fix dict encoded
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants