Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add initial support to the list dtype #725

Merged
merged 26 commits into from
Nov 11, 2023
Merged

Conversation

philss
Copy link
Member

@philss philss commented Oct 31, 2023

This is the initial work to support "list" datatypes from Arrow/Polars in Explorer.

It may close #296.

Tasks

  • {:list, :integer}
  • {:list, :boolean}
  • {:list, :string}
  • {:list, :float} (missing special values, like :nan)
  • {:list, :binary}
  • {:list, :date}
  • {:list, :time}
  • {:list, :category}
  • {:list, {:datetime, _precision}}
  • {:list, {:duration, _precision}}
  • support creation of homogeneous recursive series - that is: multiple level series, with the same list dtype
  • support creation of heterogeneous recursive series - that is: multiple level series, with different list dtype
  • support inspection of list series
  • support casting of list series
  • support reads/writes from and to Parquet / IPC / NDJSON files (CSV is not supported).

PS: this is the second attempt to add the list dtype. The first one was: #401

This is a way to create a series with nested lists.
Use a custom enum for better encode/decode between Rust and Elixir.
Recursive dtypes - such as lists dtype - are now easier to decode.
Also implement the conversion of ExSeriesDtype to Polars' Datatype.
It makes both sides much cleaner, once the conversions are done
in a centralized place.
This attempt is a simplified version that uses `List.flatten/1`
and check the "leaf" dtype without caring about the height of
the nested lists.
@philss philss marked this pull request as ready for review November 10, 2023 21:15
@philss
Copy link
Member Author

philss commented Nov 10, 2023

@josevalim @cigrainger @billylanchantin I think this is ready for review! :D

José asked to be a reviewer, so I requested him, but you all are welcome to review :)

And just to give more context, I had to refactor the "translation" of dtypes because I needed to "parse" the {:list, _} dtype that can be recursive. So I moved this translation to the Rust side, and I'm using Rustler features to encode/decode it properly.
Also, I think there are some edge cases in the check_dtypes! function - like lists of different depths -, but I chose to not worry about that for now.

@cigrainger
Copy link
Member

This is so exciting. I'll give this a proper read through today.

A.container_doc(open, item, close, opts, &to_doc/2)
end

defp to_doc(item, _opts), do: Explorer.Shared.to_string(item)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, it seems we only call Shared.to_string on inspect. So my suggestion is to move to_doc to Explorer.Shared.to_doc and merge the to_string implementation into it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! 504531c

Copy link
Member

@josevalim josevalim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beautiful work. Awesome refactoring on the dtype handling. I added just one tiny suggestion and we can ship it.

"cat" -> {:u, 32}
dtype -> raise "cannot convert dtype #{inspect(dtype)} to iotype"
end
Shared.apply_series(series, :s_iotype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No changes required now but, for completeness, we want to support u8, u32, etc as dtypes themselves, which means we will be able to move this function all the way up to Explorer.Series and remove the callback altogether in the future. :)

Copy link
Member

@cigrainger cigrainger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Loving the refactor. Tidies things up really nicely! Nothing jumping out at me. :shipit:

Copy link
Contributor

@billylanchantin billylanchantin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incredible! In addition to the functionality, which unlocks a few things like mode and explode, I'm really liking the dtype refactor. It did feel like the string representations of the dtypes were repeated too much.

I left a few suggestions, but they're surface level. Please take 'em or leave 'em!

Also, I started to play with a property test in test/explorer/series/list_test.exs, but I don't think it's necessary for this pass. It may be worth coming back to though, especially if we find issues down the road.

test/explorer/series/list_test.exs Outdated Show resolved Hide resolved
lib/explorer/shared.ex Outdated Show resolved Hide resolved
@billylanchantin
Copy link
Contributor

Hey food for thought. This makes DF.print behave a little oddly.

df_list = DF.new(a: [[1, 2], [3, 4]], b: [[5], [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]])
DF.print(df_list)
# +-------------------------------------------+
# | Explorer DataFrame: [rows: 2, columns: 2] |
# +---------------------+---------------------+
# |          a          |          b          |
# |   <list[integer]>   |   <list[integer]>   |
# +=====================+=====================+
# | 1                   | 5                   |
# | 2                   |                     |
# +---------------------+---------------------+
# | 3                   | 6                   |
# | 4                   | 7                   |
# |                     | 8                   |
# |                     | 9                   |
# |                     | 10                  |
# |                     | 11                  |
# |                     | 12                  |
# |                     | 13                  |
# |                     | 14                  |
# |                     | 15                  |
# +---------------------+---------------------+

We may want to display it more like this:

df_string = DF.new(a: ["[1, 2]", "[3, 4]"], b: ["[5]", "[6, 7, 8, 9, 10, 11, 12, 13, 14, 15]"])
DF.print(df_string)
# +-------------------------------------------------+
# |    Explorer DataFrame: [rows: 2, columns: 2]    |
# +----------+--------------------------------------+
# |    a     |                  b                   |
# | <string> |               <string>               |
# +==========+======================================+
# | [1, 2]   | [5]                                  |
# +----------+--------------------------------------+
# | [3, 4]   | [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] |
# +----------+--------------------------------------+

I've actually been fiddling with making DF.print more configurable. While I don't think we should do anything about it on this PR, I'll be keeping it in mind as I play with possibilities.

@philss
Copy link
Member Author

philss commented Nov 11, 2023

Thank you all! 💜

@billylanchantin oh, good catch! I'm going to merge this one and we can fix that in another PR. Thanks!

@philss philss merged commit 9950c0b into main Nov 11, 2023
4 checks passed
@philss philss deleted the ps-add-list-dtype-2nd-try branch November 11, 2023 18:08
This was referenced Nov 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add list dtype
4 participants