Add initial support to the list dtype #725

philss · 2023-10-31T02:11:31Z

This is the initial work to support "list" datatypes from Arrow/Polars in Explorer.

It may close #296.

Tasks

PS: this is the second attempt to add the list dtype. The first one was: #401

This is a way to create a series with nested lists.

Use a custom enum for better encoding/decoding between Rust and Elixir. Recursive dtypes, such as the list dtype, are now easier to decode.

Also implement the conversion of ExSeriesDtype to Polars' Datatype.

It makes both sides much cleaner, since the conversions are done in a centralized place.

This attempt is a simplified version that uses `List.flatten/1` and checks the "leaf" dtype without caring about the height of the nested lists.
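To make the "leaf" check above concrete, here is a minimal sketch of the same idea. It is a Rust analogue written for illustration only: the PR itself does this on the Elixir side with `List.flatten/1`, and the names `Value`, `Leaf`, `leaf_dtypes`, and `common_leaf_dtype` are hypothetical, not taken from Explorer. The sketch also does not model any promotion between integers and floats.

```rust
// Hypothetical stand-in for nested Elixir lists; not Explorer's real types.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Float(f64),
    Str(String),
    List(Vec<Value>),
}

// Hypothetical leaf dtypes, limited to three cases for the example.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Leaf {
    Integer,
    Float,
    String,
}

// Collect the dtype of every non-list element, ignoring nesting depth.
// This plays the role that flattening plays in the description above.
fn leaf_dtypes(value: &Value, out: &mut Vec<Leaf>) {
    match value {
        Value::Int(_) => out.push(Leaf::Integer),
        Value::Float(_) => out.push(Leaf::Float),
        Value::Str(_) => out.push(Leaf::String),
        Value::List(items) => {
            for item in items {
                leaf_dtypes(item, out);
            }
        }
    }
}

// Return the dtype shared by all leaves, or None when there are no
// leaves or when the leaves disagree.
fn common_leaf_dtype(value: &Value) -> Option<Leaf> {
    let mut leaves = Vec::new();
    leaf_dtypes(value, &mut leaves);
    let first = *leaves.first()?;
    if leaves.iter().all(|leaf| *leaf == first) {
        Some(first)
    } else {
        None
    }
}
```

Because only the leaves are inspected, an input that mixes depths, such as an integer next to a doubly nested list of integers, still yields a single leaf dtype in this sketch. That is the trade-off of ignoring the height of the nesting.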

philss · 2023-11-10T21:35:27Z

@josevalim @cigrainger @billylanchantin I think this is ready for review! :D

José asked to be a reviewer, so I requested him, but you all are welcome to review :)

And just to give more context, I had to refactor the "translation" of dtypes because I needed to "parse" the `{:list, _}` dtype, which can be recursive. So I moved this translation to the Rust side, and I'm using Rustler features to encode/decode it properly.
Also, I think there are some edge cases in the `check_dtypes!` function (like lists of different depths), but I chose not to worry about that for now.
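As a rough picture of why a recursive enum makes the `{:list, _}` case easier to handle in one place, here is a std-only sketch. It is not the PR's `ExSeriesDtype` and it uses no Rustler or Polars calls: the `Dtype` enum and `parse_dtype` function are hypothetical, and a text form such as `list[integer]` (the notation that `DF.print` shows in column headers) stands in for the Elixir terms that the real code encodes and decodes.

```rust
use std::fmt;

// Hypothetical recursive dtype; the List variant wraps another dtype,
// so arbitrarily nested lists need no extra cases.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Dtype {
    Integer,
    Float,
    String,
    List(Box<Dtype>),
}

// Single place that turns a dtype into its text form.
impl fmt::Display for Dtype {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Dtype::Integer => write!(f, "integer"),
            Dtype::Float => write!(f, "float"),
            Dtype::String => write!(f, "string"),
            // Recurse into the inner dtype.
            Dtype::List(inner) => write!(f, "list[{}]", inner),
        }
    }
}

// Single place that turns the text form back into a dtype.
// Returns None for anything it does not recognize.
fn parse_dtype(text: &str) -> Option<Dtype> {
    match text {
        "integer" => Some(Dtype::Integer),
        "float" => Some(Dtype::Float),
        "string" => Some(Dtype::String),
        _ => {
            // Peel one "list[...]" layer and recurse on what is inside.
            let inner = text.strip_prefix("list[")?.strip_suffix(']')?;
            Some(Dtype::List(Box::new(parse_dtype(inner)?)))
        }
    }
}
```

With both directions defined next to the enum, callers never match on dtype strings themselves, which is the kind of centralization the refactor is after.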

cigrainger · 2023-11-11T05:42:19Z

This is so exciting. I'll give this a proper read through today.

josevalim · 2023-11-11T08:48:31Z

lib/explorer/backend/data_frame.ex

+    A.container_doc(open, item, close, opts, &to_doc/2)
+  end
+
+  defp to_doc(item, _opts), do: Explorer.Shared.to_string(item)


Oh, it seems we only call Shared.to_string on inspect. So my suggestion is to move to_doc to Explorer.Shared.to_doc and merge the to_string implementation into it.

Done! 504531c

josevalim

Beautiful work. Awesome refactoring on the dtype handling. I added just one tiny suggestion and we can ship it.

josevalim · 2023-11-11T08:55:32Z

lib/explorer/polars_backend/series.ex

-      "cat" -> {:u, 32}
-      dtype -> raise "cannot convert dtype #{inspect(dtype)} to iotype"
-    end
+    Shared.apply_series(series, :s_iotype)


No changes required now but, for completeness, we want to support u8, u32, etc as dtypes themselves, which means we will be able to move this function all the way up to Explorer.Series and remove the callback altogether in the future. :)

cigrainger

Loving the refactor. Tidies things up really nicely! Nothing jumping out at me.

billylanchantin

This is incredible! In addition to the functionality, which unlocks a few things like mode and explode, I'm really liking the dtype refactor. It did feel like the string representations of the dtypes were repeated too much.

I left a few suggestions, but they're surface level. Please take 'em or leave 'em!

Also, I started to play with a property test in test/explorer/series/list_test.exs, but I don't think it's necessary for this pass. It may be worth coming back to though, especially if we find issues down the road.

test/explorer/series/list_test.exs

lib/explorer/shared.ex

billylanchantin · 2023-11-11T18:03:03Z

Hey, food for thought: this makes DF.print behave a little oddly.

df_list = DF.new(a: [[1, 2], [3, 4]], b: [[5], [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]])
DF.print(df_list)
# +-------------------------------------------+
# | Explorer DataFrame: [rows: 2, columns: 2] |
# +---------------------+---------------------+
# |          a          |          b          |
# |   <list[integer]>   |   <list[integer]>   |
# +=====================+=====================+
# | 1                   | 5                   |
# | 2                   |                     |
# +---------------------+---------------------+
# | 3                   | 6                   |
# | 4                   | 7                   |
# |                     | 8                   |
# |                     | 9                   |
# |                     | 10                  |
# |                     | 11                  |
# |                     | 12                  |
# |                     | 13                  |
# |                     | 14                  |
# |                     | 15                  |
# +---------------------+---------------------+

We may want to display it more like this:

df_string = DF.new(a: ["[1, 2]", "[3, 4]"], b: ["[5]", "[6, 7, 8, 9, 10, 11, 12, 13, 14, 15]"])
DF.print(df_string)
# +-------------------------------------------------+
# |    Explorer DataFrame: [rows: 2, columns: 2]    |
# +----------+--------------------------------------+
# |    a     |                  b                   |
# | <string> |               <string>               |
# +==========+======================================+
# | [1, 2]   | [5]                                  |
# +----------+--------------------------------------+
# | [3, 4]   | [6, 7, 8, 9, 10, 11, 12, 13, 14, 15] |
# +----------+--------------------------------------+

I've actually been fiddling with making DF.print more configurable. While I don't think we should do anything about it on this PR, I'll be keeping it in mind as I play with possibilities.

philss · 2023-11-11T18:07:01Z

Thank you all! 💜

@billylanchantin oh, good catch! I'm going to merge this one and we can fix that in another PR. Thanks!

philss added 22 commits October 28, 2023 12:02

WIP: adding lists dtype support

c27fa69

Fix test case for list

e2f8621

WIP: make lists series partially work for integers

d47d2ef

Make list of lists work for more dtypes

5d4095e

Fix formatting

3e2045d

WIP: Accept recursive :list dtype

bba4d47

This is a way to create a series with nested lists.

Accept Vec<Option<ExSeries>> for list series build

7218433

Small fix to "from_list/3" impl

33e03b4

Using "as_ref()" to be able to map the Option

e6d716e

WIP: solve {:list, :numeric} in one level deep

83e460a

Refactor the way we retrieve the dtypes

73fe8f2

Use a custom enum for better encode/decode between Rust and Elixir. Recursive dtypes - such as lists dtype - are now easier to decode.

Move dtype related things to its own file

553fdb9

Also implement the conversion of ExSeriesDtype to Polars' Datatype.

Introduce the usage of ExSeriesDtype in all remaining places

cf8a7ef

It makes both sides much cleaner, once the conversions are done in a centralized place.

Fix case with floats and integers

43af510

Fix algorithm for detecting dtypes of nested lists

bd3d0cb

This attempt is a simplified version that uses `List.flatten/1` and check the "leaf" dtype without caring about the height of the nested lists.

Add more tests covering nested lists

8e5006b

Fix formatting

e3d10cd

Fix preferable dtype selection

104ddac

Add tests covering cast of list dtype series

e896cb9

Add basic inspect for list dtype series

6e80c9f

Fix inspect of series of list dtype

bae7ab1

More tests and refactor of inspect for dataframes

57a6e7e

philss marked this pull request as ready for review November 10, 2023 21:15

philss requested a review from josevalim November 10, 2023 21:35

Remove restriction in doc

36f1700

josevalim reviewed Nov 11, 2023


josevalim approved these changes Nov 11, 2023


josevalim reviewed Nov 11, 2023


cigrainger approved these changes Nov 11, 2023


Move "to_doc/2" to Explorer.Shared

504531c

josevalim approved these changes Nov 11, 2023


billylanchantin approved these changes Nov 11, 2023


test/explorer/series/list_test.exs (outdated, resolved)

lib/explorer/shared.ex (outdated, resolved)

philss added 2 commits November 11, 2023 13:25

Remove duplication in leaf dtype logic

695a948

Remove usage of "String.trim_trailing/1" in tests

145c1e1

philss merged commit 9950c0b into main Nov 11, 2023
4 checks passed

philss deleted the ps-add-list-dtype-2nd-try branch November 11, 2023 18:08

This was referenced Nov 12, 2023

Add mode #453

Merged

compute mode on grouped DataFrame #452

Closed

billylanchantin mentioned this pull request Nov 26, 2023

Add support for grouping into series #741

Merged

costaraphael mentioned this pull request Dec 3, 2023

Add initial support for struct dtype #756

Merged
