Add initial support to the list dtype #725
This is a way to create a series with nested lists.
Use a custom enum for better encode/decode between Rust and Elixir. Recursive dtypes - such as lists dtype - are now easier to decode.
Also implement the conversion of ExSeriesDtype to Polars' `DataType`.
It makes both sides much cleaner, since the conversions are now done in a centralized place.
This attempt is a simplified version that uses `List.flatten/1` and checks the "leaf" dtype without caring about the height of the nested lists.
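To make the `List.flatten/1` idea concrete, here is a minimal sketch in plain Elixir. The module `LeafDtypeSketch` and its function names are hypothetical and are not the code in this PR; the sketch only shows flattening first and then checking that every leaf agrees on a single type, which is why the nesting height never enters the picture.

```elixir
defmodule LeafDtypeSketch do
  # Hypothetical helper for illustration only; not Explorer's implementation.

  # Returns the dtype shared by all non-nil leaves of a (possibly nested) list.
  def leaf_dtype(list) when is_list(list) do
    leaf_types =
      list
      |> List.flatten()            # drops all nesting, whatever its height
      |> Enum.reject(&is_nil/1)    # nils carry no type information
      |> Enum.map(&type_of/1)
      |> Enum.uniq()

    case leaf_types do
      [type] -> type
      [] -> raise ArgumentError, "cannot infer a leaf dtype without any non-nil value"
      types -> raise ArgumentError, "mixed leaf dtypes: #{inspect(types)}"
    end
  end

  defp type_of(value) when is_boolean(value), do: :boolean
  defp type_of(value) when is_integer(value), do: :integer
  defp type_of(value) when is_float(value), do: :float
  defp type_of(value) when is_binary(value), do: :string

  defp type_of(value),
    do: raise(ArgumentError, "unsupported leaf value: #{inspect(value)}")
end
```

Because the list is flattened before the check, an input with uneven nesting such as `[[1, nil], [[3]]]` is still reported as `:integer`; that is the trade-off of ignoring the height.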
@josevalim @cigrainger @billylanchantin I think this is ready for review! :D José asked to be a reviewer, so I requested him, but you all are welcome to review :) And just to give more context, I had to refactor the "translation" of dtypes because I needed to "parse" the …
This is so exciting. I'll give this a proper read through today.
lib/explorer/backend/data_frame.ex
A.container_doc(open, item, close, opts, &to_doc/2)
end

defp to_doc(item, _opts), do: Explorer.Shared.to_string(item)
Oh, it seems we only call `Shared.to_string` on inspect. So my suggestion is to move `to_doc` to `Explorer.Shared.to_doc` and merge the `to_string` implementation into it.
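A rough sketch of what a merged `to_doc/2` could look like, built on `Inspect.Algebra.container_doc/5` as in the diff above. The module name `ToDocSketch`, the `render/2` helper, and the leaf formatting rules are assumptions made for illustration; they do not describe what `Explorer.Shared` actually does.

```elixir
defmodule ToDocSketch do
  # Hypothetical module; leaf formatting below is assumed, not Explorer's behaviour.
  alias Inspect.Algebra, as: A

  # Lists recurse through container_doc/5, so any nesting depth is handled.
  def to_doc(item, opts) when is_list(item) do
    A.container_doc("[", item, "]", opts, &to_doc/2)
  end

  # Leaves: the former to_string logic lives directly in to_doc/2.
  def to_doc(nil, _opts), do: "nil"
  def to_doc(item, _opts) when is_binary(item), do: inspect(item)
  def to_doc(item, _opts), do: Kernel.to_string(item)

  # Convenience helper to turn the algebra document into a binary.
  def render(item, width \\ 80) do
    item
    |> to_doc(%Inspect.Opts{})
    |> A.format(width)
    |> IO.iodata_to_binary()
  end
end
```

For example, `ToDocSketch.render([[1, 2], [3]])` produces `"[[1, 2], [3]]"`.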
Done! 504531c
Beautiful work. Awesome refactoring on the dtype handling. I added just one tiny suggestion and we can ship it.
"cat" -> {:u, 32}
dtype -> raise "cannot convert dtype #{inspect(dtype)} to iotype"
end
Shared.apply_series(series, :s_iotype)
No changes required now but, for completeness, we want to support u8, u32, etc. as dtypes themselves, which means we will be able to move this function all the way up to `Explorer.Series` and remove the callback altogether in the future. :)
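As a sketch of that direction: once every dtype maps directly to a fixed-width type, the iotype lookup can be a pure function with no backend callback. The module `IotypeSketch` is hypothetical. Only the category to `{:u, 32}` pair comes from the diff above; the integer and float pairs are assumptions added for illustration.

```elixir
defmodule IotypeSketch do
  # Hypothetical module, not Explorer's code.

  # From the diff above: categories are backed by unsigned 32-bit integers.
  def iotype(:category), do: {:u, 32}

  # Assumed pairs, for illustration only.
  def iotype(:integer), do: {:s, 64}
  def iotype(:float), do: {:f, 64}

  # Anything else has no known mapping in this sketch.
  def iotype(dtype),
    do: raise(ArgumentError, "cannot convert dtype #{inspect(dtype)} to iotype")
end
```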
Loving the refactor. Tidies things up really nicely! Nothing jumping out at me.
This is incredible! In addition to the functionality, which unlocks a few things like `mode` and `explode`, I'm really liking the dtype refactor. It did feel like the string representations of the dtypes were repeated too much.
I left a few suggestions, but they're surface level. Please take 'em or leave 'em!
Also, I started to play with a property test in `test/explorer/series/list_test.exs`, but I don't think it's necessary for this pass. It may be worth coming back to though, especially if we find issues down the road.
Hey food for thought. This makes
df_list = DF.new(a: [[1, 2], [3, 4]], b: [[5], [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]])
DF.print(df_list)
# +-------------------------------------------+
# | Explorer DataFrame: [rows: 2, columns: 2] |
# +---------------------+---------------------+
# |          a          |          b          |
# |   <list[integer]>   |   <list[integer]>   |
# +=====================+=====================+
# |          1          |          5          |
# |          2          |                     |
# +---------------------+---------------------+
# |          3          |          6          |
# |          4          |          7          |
# |                     |          8          |
# |                     |          9          |
# |                     |         10          |
# |                     |         11          |
# |                     |         12          |
# |                     |         13          |
# |                     |         14          |
# |                     |         15          |
# +---------------------+---------------------+
We may want to display it more like this:
df_string = DF.new(a: ["[1, 2]", "[3, 4]"], b: ["[5]", "[6, 7, 8, 9, 10, 11, 12, 13, 14, 15]"])
DF.print(df_string)
# +-------------------------------------------------+
# |    Explorer DataFrame: [rows: 2, columns: 2]    |
# +----------+--------------------------------------+
# |    a     |                  b                   |
# | <string> |               <string>               |
# +==========+======================================+
# |  [1, 2]  |                 [5]                  |
# +----------+--------------------------------------+
# |  [3, 4]  | [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] |
# +----------+--------------------------------------+
I've actually been fiddling with making …
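One possible way to get the single-line cells shown in the second table is to turn each list into a string before it reaches the table printer. The helper below is a hypothetical sketch, not Explorer's printing code; it relies only on `Kernel.inspect/2` with the `:charlists` option.

```elixir
defmodule CellSketch do
  # Hypothetical helper; not part of Explorer.

  # charlists: :as_lists keeps integer lists such as [7, 8, 9] from being
  # rendered as a charlist instead of a list of numbers.
  def cell(value) when is_list(value), do: inspect(value, charlists: :as_lists)

  # Non-list values fall back to their plain string form.
  def cell(value), do: to_string(value)
end
```

With this, `CellSketch.cell([6, 7, 8, 9, 10, 11, 12, 13, 14, 15])` gives `"[6, 7, 8, 9, 10, 11, 12, 13, 14, 15]"`, so each row occupies one line regardless of list length.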
Thank you all! 💜 @billylanchantin oh, good catch! I'm going to merge this one and we can fix that in another PR. Thanks!
This is the initial work to support "list" datatypes from Arrow/Polars in Explorer.
It may close #296.
Tasks
{:list, :integer}
{:list, :boolean}
{:list, :string}
{:list, :float} (missing special values, like `:nan`)
{:list, :binary}
{:list, :date}
{:list, :time}
{:list, :category}
{:list, {:datetime, _precision}}
{:list, {:duration, _precision}}
PS: this is the second attempt to add the list dtype. The first one was: #401
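
Since the dtypes in the task list are nested tuples, a single recursive function can handle any depth. The sketch below renders them as strings in the `list[integer]` notation seen in the printed table header. `DtypeSketch` is a hypothetical module, and the format used for the precision of datetime and duration is an assumption.

```elixir
defmodule DtypeSketch do
  # Hypothetical module; not Explorer's implementation.

  # Recursive case: a list wraps another dtype, which may itself be a list.
  def render({:list, inner}), do: "list[" <> render(inner) <> "]"

  # Assumed format for dtypes that carry a precision.
  def render({kind, precision}) when kind in [:datetime, :duration],
    do: "#{kind}[#{precision}]"

  # Base case: plain atoms such as :integer or :string.
  def render(dtype) when is_atom(dtype), do: Atom.to_string(dtype)
end
```

For instance, `DtypeSketch.render({:list, {:list, :float}})` returns `"list[list[float]]"`.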