Add initial support for struct dtype #756
Left a couple of comments in the parts of the code where I made some judgment calls and wanted to add a bit more context or ask for feedback.
Also please let me know if I'm missing anything! 😄
@type time_unit :: :nanosecond | :microsecond | :millisecond
@type datetime_dtype :: {:datetime, time_unit}
@type duration_dtype :: {:duration, time_unit}
@type list_dtype :: {:list, dtype()}
@type struct_dtype :: {:struct, %{String.t() => dtype()}}
I went with this typespec for a couple of reasons:
- `:struct` - to use the same name used in Polars
- a map as the second element - to keep track of the inner dtypes, so we can pluck them later on
- using a `String.t()` instead of `atom()` for the field name - since we can import large JSON datasets, with potentially random keys, it felt prudent to use a string here to avoid leaking atoms
The trade-offs for these choices are:
- `:struct` already has a meaning in Elixir. We could name it `:map` instead, but we would lose the 1:1 naming parity with Polars.
- The dtype gets very verbose (as you can see in the tests). I think this one is pretty minor though, as I think people won't be dealing with this dtype directly that often.
Please let me know if you have any other ideas regarding the dtype name and how to represent it.
Hi! I really like this PR and think this missing functionality would be very helpful.
That said, I'm hesitant about the term "struct" since, as you say, it's overloaded. What do you think about `DF.from_rows`? Then it would pair with `DF.to_rows` just like `Series.from_list`/`Series.to_list`.
So this is not importing data from Elixir structs, but literally having a struct column inside Explorer. You could even have lists of structs. :)
- Whoops! Thank you, I was skimming 😅
- I still worry that "struct" is overloaded ("map" may be better?), but now less so. It's sorta like how we have `{:f, 32}` even though there's no real equivalent in Elixir.
Yeah, I was a bit iffy about naming it `struct` as well because of the pre-existing meaning within Elixir, but naming it the more Elixir-appropriate `map` would mean people used to Polars would have to keep this "translation" in mind when using Explorer.
Feels like a "pick your poison" type of choice 😅
I think we could solve the naming issue by documenting it (like I mentioned in my other comment). Since we are following Arrow data structs, we could just document that. WDYT?
@philss Yeah, if struct aligns with Polars and Arrow then it's the right call 👍 And documenting the name in case of potential confusion is def the way to go.
I've added the extra bits to the docs here: e565494
WDYT?
@@ -573,6 +574,12 @@ pub fn term_from_value<'b>(v: AnyValue, env: Env<'b>) -> Result<Term<'b>, Explor
         AnyValue::Duration(v, time_unit) => encode_duration(v, time_unit, env),
         AnyValue::Categorical(idx, mapping, _) => Ok(mapping.get(idx).encode(env)),
         AnyValue::List(series) => list_from_series(ExSeries::new(series), env),
+        AnyValue::Struct(_, _, fields) => v
+            ._iter_struct_av()
This function looks very undocumented, but it is part of the public API from Polars: https://docs.rs/polars/latest/polars/datatypes/enum.AnyValue.html#method._iter_struct_av
The Python implementation deals with struct values through it as well.
I see. Maybe it's an unstable API, since normally it would not have the `_` at the beginning. We can ask them later if this should be used, or if there is an alternative.
Yeah. I tried dealing with the enum entry directly as well, but it quickly required me to do a bunch of unsafe operations 😅
That function is still performing those unsafe calls, but I trust their unsafe Rust more than mine 😂
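For readers following the thread, the recursion this match arm performs can be sketched without Polars or Rustler. The `Value` and `Term` enums and the `encode_value` function below are hypothetical stand-ins (they are not `AnyValue`, Rustler's `Term`, or the PR's `term_from_value`); the sketch only mirrors the shape of the conversion, where each struct field is encoded recursively and collected into a string-keyed map.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a Polars value; NOT the real `AnyValue`.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Str(String),
    List(Vec<Value>),
    /// Field names paired with their values.
    Struct(Vec<(String, Value)>),
}

/// Hypothetical stand-in for the Elixir term the encoder would produce.
#[derive(Debug, Clone, PartialEq)]
enum Term {
    Nil,
    Int(i64),
    Binary(String),
    List(Vec<Term>),
    /// String keys, matching the `%{String.t() => dtype()}` choice in the typespec.
    Map(BTreeMap<String, Term>),
}

/// Recursively encode a value; struct fields become entries of a string-keyed map.
fn encode_value(v: &Value) -> Term {
    match v {
        Value::Null => Term::Nil,
        Value::Int(i) => Term::Int(*i),
        Value::Str(s) => Term::Binary(s.clone()),
        // Lists may contain structs, so the recursion goes through them too.
        Value::List(items) => Term::List(items.iter().map(encode_value).collect()),
        Value::Struct(fields) => Term::Map(
            fields
                .iter()
                .map(|(name, inner)| (name.clone(), encode_value(inner)))
                .collect(),
        ),
    }
}
```

Because the function recurses through both `List` and `Struct`, nested shapes such as lists of structs need no special casing in this sketch.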
"""
#Explorer.Series<
  Polars[2]
  struct[1] [%{"a" => 1}, %{"a" => 2}]
I followed the Polars approach here and only displayed the number of keys in the struct series. This made sense to me, as it would get out of hand pretty quickly with large structs and nested values.
Let me know if you disagree and believe this should be more descriptive.
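A minimal sketch of that summarised rendering, assuming a hypothetical `DType` enum and made-up type labels (neither is taken from Explorer or Polars): the struct variant prints only its field count, so the label stays short however wide or deeply nested the struct is.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// Hypothetical dtype tree for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    Integer,
    String,
    List(Box<DType>),
    /// Field name mapped to the dtype of that field.
    Struct(BTreeMap<String, DType>),
}

impl fmt::Display for DType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DType::Integer => write!(f, "integer"),
            DType::String => write!(f, "string"),
            // Lists show their inner dtype in full.
            DType::List(inner) => write!(f, "list[{}]", inner),
            // Structs show only how many fields they have, not the fields themselves.
            DType::Struct(fields) => write!(f, "struct[{}]", fields.len()),
        }
    }
}
```

The trade-off is the one raised in the comment: a reader cannot tell the field names from the label alone and has to inspect the dtype itself for that.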
@@ -2,7 +2,7 @@ defmodule Explorer.Shared do
   # A collection of **private** helpers shared in Explorer.
   @moduledoc false

-  @non_list_types [
+  @scalar_types [
With the struct type being added, the name didn't seem right anymore. Hope I got the intention right here. 😅
@@ -27,7 +27,7 @@ defmodule Explorer.Shared do
   within lists inside.
   """
   def dtypes do
-    @non_list_types ++ [{:list, :any}]
+    @scalar_types ++ [{:list, :any}, {:struct, :any}]
I followed a similar approach to the list type implementation and used `:any` to denote "a struct of any shape". I'm not 100% happy with it though, so please let me know if you have any suggestions.
Enum.map(list, fn item ->
  Map.new(item, fn {field, inner_value} ->
    inner_dtype = Map.fetch!(dtypes, to_string(field))
    [casted_value] = cast_numerics_deep([inner_value], inner_dtype)
This feels kinda inefficient (it's going to wrap/run through `Enum.map`/unwrap each value of the list of maps), but I was not too worried about it since I don't see `Series.from_list` as being the main way people will interact with this dtype. I feel like building a struct series inside a dataframe, or reading a struct series from an external file, would be far more common.
Let me know if you disagree though.
Looks great to me, only some minor nits!
This is a really great feature and awesome PR! 😺
I added some minor comments, but overall it looks great!
Co-authored-by: Philip Sampaio <philip.sampaio@gmail.com>
Thank you! 💚 💙 💜 💛 ❤️ I assume the next step is an API for adding/updating/removing struct fields? :D
Something like that! My next target would be to allow breaking a struct series into its component series ( Operations within struct series are kinda limited in Polars after that, but the interesting features I can see being added:
Should I open an issue to keep track of this?
No need for an issue, I think for these it is fine to contribute as the need arises :)
This PR is heavily inspired by the work done here.
What works at this point:
The goal is to add initial support for the type now and add extra functionality (like `Series.field/2` and `DataFrame.unnest/2`) in the future.
I'm well aware that I'm opening a PR with some big changes without discussing it first, so if this is not a feature the Explorer team wants to add, or if you want to take a completely different approach than what I did, please feel free to scrap this. 🙂