Add initial support for struct dtype #756

Merged
merged 5 commits into from Dec 4, 2023

costaraphael
Contributor

This PR is heavily inspired by the work done here.

What works at this point:

  • defining basic K/V structures
  • defining nested K/V structures
  • casting
  • inspection
  • importing/exporting from parquet/IPC/NDJSON (CSV is not supported)

The goal is to add initial support to the type now and add extra functionality (like Series.field/2 and DataFrame.unnest/2) in the future.

I'm well aware that I'm opening a PR with some big changes without discussing it first, so if this is not a feature the Explorer team wants to add, or if you want to take a completely different approach than what I did, please feel free to scrap this. 🙂

Contributor Author

@costaraphael left a comment

Left a couple of comments on parts of the code where I made some judgment calls and wanted to add a bit more context or ask for feedback.

Also please let me know if I'm missing anything! 😄


@type time_unit :: :nanosecond | :microsecond | :millisecond
@type datetime_dtype :: {:datetime, time_unit}
@type duration_dtype :: {:duration, time_unit}
@type list_dtype :: {:list, dtype()}
@type struct_dtype :: {:struct, %{String.t() => dtype()}}
Contributor Author

I went with this typespec for a couple of reasons:

  • :struct - to use the same name used on Polars
  • a map as a second element - to keep track of the inner dtypes, so we can pluck them later on
  • using a String.t() instead of atom() for the field name - since we can import large JSON datasets, with potentially random keys, it felt prudent to use a string here to avoid leaking atoms

The trade-offs for these choices are:

  • :struct already has a meaning in Elixir. We could name it :map instead but we would lose the 1:1 naming parity with Polars.
  • The dtype gets very verbose (as you can see in the tests). I think this one is pretty minor though, as I don't expect people to deal with this dtype directly that often.

Please let me know if you have any other ideas regarding the dtype name and how to represent it.
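To make the shape of the typespec a bit more concrete, here is a minimal sketch in plain Elixir. `StructDtypeSketch.infer/1` is a hypothetical helper, not part of Explorer, and the scalar dtype atoms (`:integer`, `:float`, `:string`, `:boolean`) are placeholders for illustration. It only shows how a nested `{:struct, %{String.t() => dtype()}}` value could be built while keeping field names as strings, so arbitrary keys never have to become atoms.

```elixir
defmodule StructDtypeSketch do
  # Hypothetical helper (not Explorer's API): infers a dtype from one value.
  # Field names are always stored as strings, mirroring the typespec above.
  def infer(value) when is_map(value) do
    {:struct, Map.new(value, fn {key, inner} -> {to_string(key), infer(inner)} end)}
  end

  # For lists, the first element decides the inner dtype (a simplification).
  def infer([first | _rest]), do: {:list, infer(first)}

  # Placeholder scalar dtype names, assumed for illustration only.
  def infer(value) when is_integer(value), do: :integer
  def infer(value) when is_float(value), do: :float
  def infer(value) when is_binary(value), do: :string
  def infer(value) when is_boolean(value), do: :boolean
end

row = %{"user" => %{"id" => 1, "tags" => ["a"]}, "score" => 1.5}

StructDtypeSketch.infer(row)
# {:struct,
#  %{
#    "score" => :float,
#    "user" => {:struct, %{"id" => :integer, "tags" => {:list, :string}}}
#  }}
```

The verbosity trade-off mentioned above shows up quickly: two levels of nesting already produce a fairly long dtype term.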

Contributor

Hi! I really like this PR and think this missing functionality would be very helpful.

That said, I'm hesitant about the term "struct" since as you say it's overloaded. What do you think about DF.from_rows? Then it would pair with DF.to_rows just like Series.from_list/Series.to_list.

Member

So this is not importing data from Elixir structs, but literally having a struct column inside Explorer. You could even have lists of structs. :)

Contributor

  1. Whoops! Thank you, I was skimming 😅
  2. I still worry that "struct" is overloaded ("map" may be better?), but now less so. It's sorta like how we have {:f, 32} even though there's no real equivalent in Elixir.

Contributor Author

Yeah, I was a bit iffy about naming it struct as well because of the pre-existing meaning within Elixir. But naming it the more Elixir-appropriate map would mean people used to Polars would have to keep this "translation" in mind when using Explorer.

Feels like a "pick your poison" type of choice 😅

Member

I think we could solve the naming issue by documenting it (like I mentioned in my other comment). Since we are following Arrow data structs, we could just document that. WDYT?

Contributor

@philss Yeah, if struct aligns with Polars and Arrow then it's the right call 👍 And documenting the name in case of potential confusion is def the way to go.

Contributor Author

I've added the extra bits to the docs here: e565494

WDYT?

@@ -573,6 +574,12 @@ pub fn term_from_value<'b>(v: AnyValue, env: Env<'b>) -> Result<Term<'b>, Explor
AnyValue::Duration(v, time_unit) => encode_duration(v, time_unit, env),
AnyValue::Categorical(idx, mapping, _) => Ok(mapping.get(idx).encode(env)),
AnyValue::List(series) => list_from_series(ExSeries::new(series), env),
AnyValue::Struct(_, _, fields) => v
._iter_struct_av()
Contributor Author

This function looks very undocumented, but it is part of the public API from Polars: https://docs.rs/polars/latest/polars/datatypes/enum.AnyValue.html#method._iter_struct_av

The Python implementation deals with struct values through it as well.

Member

I see. Maybe it's an unstable API, since normally it would not have the _ at the beginning. We can ask them later if this should be used, or if there is an alternative.

Contributor Author

Yeah. I tried dealing with the enum entry directly as well, but it quickly required me to do a bunch of unsafe operations 😅

That function is still performing those unsafe calls, but I trust their unsafe Rust more than mine 😂

"""
#Explorer.Series<
Polars[2]
struct[1] [%{"a" => 1}, %{"a" => 2}]
Contributor Author

I followed the Polars approach here and only displayed the number of keys in the struct series. This made sense to me, as it would get out of hand pretty quick with large structs and nested values.

Let me know if you disagree and believe this should be more descriptive.

@@ -2,7 +2,7 @@ defmodule Explorer.Shared do
# A collection of **private** helpers shared in Explorer.
@moduledoc false

- @non_list_types [
+ @scalar_types [
Contributor Author

With struct type being added the name didn't seem right anymore. Hope I got the intention right here. 😅

@@ -27,7 +27,7 @@ defmodule Explorer.Shared do
within lists inside.
"""
def dtypes do
- @non_list_types ++ [{:list, :any}]
+ @scalar_types ++ [{:list, :any}, {:struct, :any}]
Contributor Author

I followed a similar approach from the list type implementation and used :any to denote "a struct of any shape". I'm not 100% happy with it though, please let me know if you have any suggestions.
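As a rough illustration of what "a struct of any shape" could mean in practice, here is a small sketch in plain Elixir. `DtypeMatchSketch.matches?/2` is a hypothetical helper and not how Explorer checks dtypes; it just treats `:any` as a wildcard when comparing a concrete dtype against a template such as `{:struct, :any}` or `{:list, :any}`.

```elixir
defmodule DtypeMatchSketch do
  # Hypothetical helper: does a concrete dtype fit a template that may use :any?
  def matches?(_dtype, :any), do: true

  # Lists recurse into the inner dtype, so {:list, :any} accepts any list.
  def matches?({:list, inner}, {:list, template}), do: matches?(inner, template)

  # {:struct, :any} accepts a struct dtype regardless of its fields.
  def matches?({:struct, fields}, {:struct, :any}) when is_map(fields), do: true

  # Otherwise the dtype and the template must be exactly equal.
  def matches?(dtype, dtype), do: true
  def matches?(_dtype, _template), do: false
end

DtypeMatchSketch.matches?({:struct, %{"a" => :integer}}, {:struct, :any})
# true
DtypeMatchSketch.matches?(:integer, {:struct, :any})
# false
```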

Enum.map(list, fn item ->
Map.new(item, fn {field, inner_value} ->
inner_dtype = Map.fetch!(dtypes, to_string(field))
[casted_value] = cast_numerics_deep([inner_value], inner_dtype)
Contributor Author

This feels kinda inefficient (it's going to wrap/run through Enum.map/unwrap each value of the list of maps), but I was not too worried about it since I don't see Series.from_list as being the main way people will interact with this dtype. I feel like building a struct series inside a dataframe, or reading a struct series from an external file would be far more common.

Let me know if you disagree though.
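For anyone reading along, the wrap/cast/unwrap pattern in question can be sketched in isolation like this. `CastSketch.cast_deep/2` is a simplified stand-in, not the actual `cast_numerics_deep` from the PR: it assumes the only cast needed is widening integers to floats where the dtype says `:float` (a placeholder dtype name), and it ignores `nil` handling inside lists and structs.

```elixir
defmodule CastSketch do
  # Simplified stand-in for the helper discussed above (not Explorer's code).
  # Assumption: integers are widened to floats wherever the dtype is :float.
  def cast_deep(list, :float) do
    Enum.map(list, fn
      value when is_integer(value) -> value * 1.0
      value -> value
    end)
  end

  def cast_deep(list, {:list, inner}) do
    Enum.map(list, fn item -> cast_deep(item, inner) end)
  end

  def cast_deep(list, {:struct, dtypes}) do
    Enum.map(list, fn item ->
      Map.new(item, fn {field, inner_value} ->
        inner_dtype = Map.fetch!(dtypes, to_string(field))
        # Each value is wrapped in a one-element list, cast, then unwrapped.
        # This per-value round trip is the overhead mentioned in the comment.
        [casted_value] = cast_deep([inner_value], inner_dtype)
        {field, casted_value}
      end)
    end)
  end

  # Every other dtype is left untouched in this sketch.
  def cast_deep(list, _dtype), do: list
end

CastSketch.cast_deep([%{a: 1, b: "x"}], {:struct, %{"a" => :float, "b" => :string}})
# [%{a: 1.0, b: "x"}]
```

Reusing the list-based function for single values keeps the recursion uniform across lists and structs, at the cost of one small list allocation per field.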

Member

@josevalim left a comment

Looks great to me, only some minor nits!

Member

@philss left a comment

This is a really great feature and awesome PR! 😺

I added some minor comments, but overall it looks great!

@josevalim merged commit 6c12034 into elixir-explorer:main on Dec 4, 2023
4 checks passed
@josevalim
Member

Thank you! 💚 💙 💜 💛 ❤️

I assume the next step is an API for adding/updating/removing struct fields? :D

@costaraphael
Contributor Author

> I assume the next step is an API for adding/updating/removing struct fields? :D

Something like that!

My next target would be to allow breaking a struct series into its component series (DataFrame.unnest/2) and building a struct from existing series (Series.build_struct/1). These two operations would unblock any workflow regarding parsing/exporting data with nested values.

Operations within struct series are kinda limited in Polars after that, but the interesting features I can see being added:

  • Plucking a single field as a series (Series.field/2)
  • Renaming struct fields (Series.rename_fields/2)
  • JSON encoding/decoding (Series.json_(encode|decode)/1)

Should I open an issue to keep track of this?

@josevalim
Member

No need for an issue, I think for these it is fine to contribute as the need arises :)
