Add initial support for struct dtype #756

Merged
merged 5 commits into from
Dec 4, 2023
1 change: 1 addition & 0 deletions lib/explorer/polars_backend/native.ex
@@ -324,6 +324,7 @@ defmodule Explorer.PolarsBackend.Native do
def s_from_list_binary(_name, _val), do: err()
def s_from_list_categories(_name, _val), do: err()
def s_from_list_of_series(_name, _val), do: err()
def s_from_list_of_series_as_structs(_name, _val), do: err()
def s_from_binary_f32(_name, _val), do: err()
def s_from_binary_f64(_name, _val), do: err()
def s_from_binary_i32(_name, _val), do: err()
12 changes: 12 additions & 0 deletions lib/explorer/polars_backend/shared.ex
@@ -118,6 +118,18 @@ defmodule Explorer.PolarsBackend.Shared do
Native.s_from_list_of_series(name, series)
end

def from_list(list, {:struct, fields} = _dtype, name) when is_list(list) do
series =
for {column, values} <- Table.to_columns(list) do
column = to_string(column)
inner_type = Map.fetch!(fields, column)

from_list(values, inner_type, column)
end

Native.s_from_list_of_series_as_structs(name, series)
end

def from_list(list, dtype, name) when is_list(list) do
case dtype do
:integer -> Native.s_from_list_i64(name, list)
11 changes: 8 additions & 3 deletions lib/explorer/series.ex
@@ -16,6 +16,9 @@ defmodule Explorer.Series do
* `:time` - Time type that unwraps to `Elixir.Time`
* `{:list, dtype}` - A recursive dtype that can store lists. Examples: `{:list, :integer}` or
a nested list dtype like `{:list, {:list, :integer}}`.
* `{:struct, %{key => dtype}}` - A recursive dtype that can store Arrow/Polars structs (not to be
confused with Elixir's struct). This type unwraps to Elixir maps with string keys. Examples:
`{:struct, %{"a" => :integer}}` or a nested struct dtype like `{:struct, %{"a" => {:struct, %{"b" => :integer}}}}`.

The following data type aliases are also supported:

@@ -77,7 +80,7 @@ defmodule Explorer.Series do
@numeric_dtypes [:integer | @float_dtypes]
@numeric_or_temporal_dtypes @numeric_dtypes ++ @temporal_dtypes

@io_dtypes Shared.dtypes() -- [:binary, :string, {:list, :any}]
@io_dtypes Shared.dtypes() -- [:binary, :string, {:list, :any}, {:struct, :any}]

@type dtype ::
:binary
@@ -92,11 +95,13 @@ defmodule Explorer.Series do
| :integer
| :string
| list_dtype
| struct_dtype

@type time_unit :: :nanosecond | :microsecond | :millisecond
@type datetime_dtype :: {:datetime, time_unit}
@type duration_dtype :: {:duration, time_unit}
@type list_dtype :: {:list, dtype()}
@type struct_dtype :: {:struct, %{String.t() => dtype()}}
Contributor Author

I went with this typespec for a couple of reasons:

  • :struct - to use the same name used in Polars
  • a map as a second element - to keep track of the inner dtypes, so we can pluck them later on
  • using a String.t() instead of atom() for the field name - since we can import large JSON datasets, with potentially random keys, it felt prudent to use a string here to avoid leaking atoms

The trade-offs for these choices are:

  • :struct already has a meaning in Elixir. We could name it :map instead but we would lose the 1:1 naming parity with Polars.
  • The dtype gets very verbose (as you can see in the tests). I think this one is pretty minor though, as I think people won't be dealing with this dtype directly that often.

Please let me know if you have any other ideas regarding the dtype name and how to represent it.
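
For comparison, the native side of this PR mirrors the same recursive, string-keyed shape with the `Struct(HashMap<String, ExSeriesDtype>)` variant in `ex_dtypes.rs`. The following is only a minimal std-only sketch of that idea: `Dtype`, `field_dtype`, and `nested_example` are hypothetical names made up for illustration, not part of Explorer.

```rust
use std::collections::HashMap;

// Hypothetical, trimmed-down stand-in for the real dtype enum; only the
// variants needed for the illustration are included.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Dtype {
    Integer,
    String,
    List(Box<Dtype>),
    // Field names are owned strings, so arbitrary keys coming from imported
    // data are stored as plain data rather than as interned identifiers.
    Struct(HashMap<String, Dtype>),
}

// Looks up ("plucks") the dtype of one field, if `dtype` is a struct.
fn field_dtype<'a>(dtype: &'a Dtype, field: &str) -> Option<&'a Dtype> {
    match dtype {
        Dtype::Struct(fields) => fields.get(field),
        _ => None,
    }
}

// Roughly the shape of {:struct, %{"a" => {:struct, %{"b" => :integer}}}}.
fn nested_example() -> Dtype {
    let inner = Dtype::Struct(HashMap::from([("b".to_string(), Dtype::Integer)]));
    Dtype::Struct(HashMap::from([("a".to_string(), inner)]))
}
```

Keeping the inner dtypes in a map is what makes the field lookup a single `get` call, at the cost of the verbosity mentioned above.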

Contributor

Hi! I really like this PR and think this missing functionality would be very helpful.

That said, I'm hesitant about the term "struct" since as you say it's overloaded. What do you think about DF.from_rows? Then it would pair with DF.to_rows just like Series.from_list/Series.to_list.

Member

So this is not importing data from Elixir structs, but literally having a struct column inside Explorer. You could even have lists of structs. :)

Contributor

  1. Whoops! Thank you, I was skimming 😅
  2. I still worry that "struct" is overloaded ("map" may be better?), but now less so. It's sorta like how we have {:f, 32} even though there's no real equivalent in Elixir.

Contributor Author

Yeah, I was a bit iffy about naming it struct as well because of the pre-existing meaning within Elixir, but also naming it the more Elixir-appropriate map would mean people used to Polars would have to keep this "translation" in mind when using Explorer.

Feels like a "pick your poison" type of choice 😅

Member

I think we could solve the naming issue by documenting it (like I mentioned in my other comment). Since we are following Arrow data structs, we could just document that. WDYT?

Contributor

@philss Yeah if struct aligns with polars and arrow then it's the right call 👍 And documenting the name in case of potential confusion is def the way to go.

Contributor Author

I've added the extra bits to the docs here: e565494

WDYT?


@type t :: %Series{data: Explorer.Backend.Series.t(), dtype: dtype()}
@type lazy_t :: %Series{data: Explorer.Backend.LazySeries.t(), dtype: dtype()}
@@ -2202,7 +2207,7 @@ defmodule Explorer.Series do

## Supported dtypes

All except `:list`.
All except `:list` and `:struct`.

## Examples

@@ -2230,7 +2235,7 @@ defmodule Explorer.Series do
@doc type: :aggregation
@spec mode(series :: Series.t()) :: Series.t() | nil
def mode(%Series{dtype: {:list, _} = dtype}),
do: dtype_error("mode/1", dtype, Shared.dtypes() -- [{:list, :any}])
do: dtype_error("mode/1", dtype, Shared.dtypes() -- [{:list, :any}, {:struct, :any}])

def mode(%Series{} = series),
do: Shared.apply_impl(series, :mode)
121 changes: 94 additions & 27 deletions lib/explorer/shared.ex
@@ -2,7 +2,7 @@ defmodule Explorer.Shared do
# A collection of **private** helpers shared in Explorer.
@moduledoc false

@non_list_types [
@scalar_types [
Contributor Author

With struct type being added the name didn't seem right anymore. Hope I got the intention right here. 😅

:binary,
:boolean,
:category,
@@ -27,7 +27,7 @@ defmodule Explorer.Shared do
within lists inside.
"""
def dtypes do
@non_list_types ++ [{:list, :any}]
@scalar_types ++ [{:list, :any}, {:struct, :any}]
Contributor Author

I followed a similar approach to the list type implementation and used :any to denote "a struct of any shape". I'm not 100% happy with it though, please let me know if you have any suggestions.

end

@doc """
@@ -37,7 +37,21 @@ defmodule Explorer.Shared do
if maybe_dtype = normalise_dtype(inner), do: {:list, maybe_dtype}
end

def normalise_dtype(dtype) when dtype in @non_list_types, do: dtype
def normalise_dtype({:struct, inner_types}) do
inner_types
|> Enum.reduce_while(%{}, fn {key, dtype}, normalized_dtypes ->
case normalise_dtype(dtype) do
nil -> {:halt, nil}
dtype -> {:cont, Map.put(normalized_dtypes, key, dtype)}
end
end)
|> then(fn
nil -> nil
normalized_dtypes -> {:struct, normalized_dtypes}
end)
end

def normalise_dtype(dtype) when dtype in @scalar_types, do: dtype
def normalise_dtype(dtype) when dtype in [:float, :f64], do: {:f, 64}
def normalise_dtype(:f32), do: {:f, 32}
def normalise_dtype(_dtype), do: nil
@@ -220,16 +234,11 @@ defmodule Explorer.Shared do
Enum.reduce(list, initial_type, fn el, type ->
new_type = type(el, type) || type

cond do
leaf_dtype(new_type) == :numeric and leaf_dtype(type) in [:integer, {:f, 32}, {:f, 64}] ->
new_type

new_type != type and type != nil ->
raise ArgumentError,
"the value #{inspect(el)} does not match the inferred series dtype #{inspect(type)}"

true ->
new_type
if new_type_matches?(type, new_type) do
new_type
else
raise ArgumentError,
"the value #{inspect(el)} does not match the inferred series dtype #{inspect(type)}"
end
end)

@@ -263,6 +272,25 @@ defmodule Explorer.Shared do
defp type(item, _type) when is_nil(item), do: nil
defp type([], _type), do: nil
defp type([_item | _] = items, type), do: {:list, result_list_type(items, type)}

defp type(%{} = item, type) do
preferable_inner_types =
case type do
{:struct, %{} = inner_types} -> inner_types
_ -> %{}
end

inferred_inner_types =
for {key, value} <- item, into: %{} do
key = to_string(key)
inner_type = Map.get(preferable_inner_types, key)

{key, type(value, inner_type) || Map.get(preferable_inner_types, key)}
end

{:struct, inferred_inner_types}
end

defp type(item, _type), do: raise(ArgumentError, "unsupported datatype: #{inspect(item)}")

defp result_list_type(nil, _type), do: nil
@@ -277,6 +305,25 @@ defmodule Explorer.Shared do
dtype_from_list!(items, leaf_dtype(type))
end

defp new_type_matches?(type, new_type)

defp new_type_matches?(type, type), do: true

defp new_type_matches?(nil, _new_type), do: true

defp new_type_matches?({:struct, types}, {:struct, new_types}) do
Enum.all?(types, fn {key, type} ->
case Map.fetch(new_types, key) do
{:ok, new_type} -> new_type_matches?(type, new_type)
:error -> false
end
end)
end

defp new_type_matches?(type, new_type) do
leaf_dtype(new_type) == :numeric and leaf_dtype(type) in [:integer, {:f, 32}, {:f, 64}]
end

@doc """
Returns the leaf dtype from a {:list, _} dtype, or itself.
"""
@@ -286,38 +333,43 @@ defmodule Explorer.Shared do
@doc """
Downcasts lists of mixed numeric types (float and int) to float.
"""
def cast_numerics(list, type) when type == :numeric do
{cast_numerics_to_floats(list), {:f, 64}}
end

def cast_numerics(list, {:list, _} = dtype) do
def cast_numerics(list, dtype) do
{cast_numerics_deep(list, dtype), cast_numeric_dtype_to_float(dtype)}
end

def cast_numerics(list, type), do: {list, type}
defp cast_numerics_deep(list, {:struct, dtypes}) when is_list(list) do
Enum.map(list, fn item ->
Map.new(item, fn {field, inner_value} ->
inner_dtype = Map.fetch!(dtypes, to_string(field))
[casted_value] = cast_numerics_deep([inner_value], inner_dtype)
Contributor Author

This feels kinda inefficient (it's going to wrap/run through Enum.map/unwrap each value of the list of maps), but I was not too worried about it since I don't see Series.from_list as being the main way people will interact with this dtype. I feel like building a struct series inside a dataframe, or reading a struct series from an external file would be far more common.

Let me know if you disagree though.
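
If the wrap/unwrap overhead ever matters, one possible alternative is to recurse per value rather than per list. The sketch below shows that idea in std-only Rust, since the PR's native code is Rust. The `Value` and `Dtype` enums and `cast_value` are hypothetical stand-ins, not Explorer code, and unlike the `Map.fetch!` call in the diff, fields with no known dtype are passed through unchanged here.

```rust
use std::collections::HashMap;

// Hypothetical value model: just enough variants to show the recursion.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Nil,
    Int(i64),
    Float(f64),
    List(Vec<Value>),
    Struct(HashMap<String, Value>),
}

// Hypothetical dtype model; `Numeric` marks mixed ints and floats.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Dtype {
    Integer,
    Numeric,
    List(Box<Dtype>),
    Struct(HashMap<String, Dtype>),
}

// Casts one value according to its dtype, recursing directly into nested
// values instead of wrapping each of them in a one-element list.
fn cast_value(value: Value, dtype: &Dtype) -> Value {
    match (value, dtype) {
        (Value::Int(i), Dtype::Numeric) => Value::Float(i as f64),
        (Value::List(items), Dtype::List(inner)) => {
            Value::List(items.into_iter().map(|v| cast_value(v, inner)).collect())
        }
        (Value::Struct(fields), Dtype::Struct(dtypes)) => Value::Struct(
            fields
                .into_iter()
                .map(|(name, v)| {
                    // Assumption: unknown fields are kept as they are.
                    let cast = match dtypes.get(&name) {
                        Some(inner) => cast_value(v, inner),
                        None => v,
                    };
                    (name, cast)
                })
                .collect(),
        ),
        // Nil, floats, and non-numeric leaves are returned untouched.
        (other, _) => other,
    }
}
```

Each value is visited once and no intermediate lists are allocated for struct fields; whether that is worth a second code path next to the list-based one is a separate question.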


defp cast_numerics_to_floats(list) do
Enum.map(list, fn
item when item in [nil, :infinity, :neg_infinity, :nan] or is_float(item) -> item
item -> item / 1
{field, casted_value}
end)
end)
end

defp cast_numerics_deep(nil, _), do: nil

defp cast_numerics_deep(list, {:list, inner_dtype}) when is_list(list) do
Enum.map(list, fn item -> cast_numerics_deep(item, inner_dtype) end)
end

defp cast_numerics_deep(list, :numeric), do: cast_numerics_to_floats(list)

defp cast_numerics_deep(list, _), do: list

defp cast_numeric_dtype_to_float({:list, :numeric}), do: {:list, {:f, 64}}
defp cast_numeric_dtype_to_float({:list, other} = dtype) when is_atom(other), do: dtype
defp cast_numerics_to_floats(list) do
Enum.map(list, fn
item when item in [nil, :infinity, :neg_infinity, :nan] or is_float(item) -> item
item -> item / 1
end)
end

defp cast_numeric_dtype_to_float({:struct, dtypes}),
do: {:struct, Map.new(dtypes, fn {f, inner} -> {f, cast_numeric_dtype_to_float(inner)} end)}

defp cast_numeric_dtype_to_float({:list, inner}),
do: {:list, cast_numeric_dtype_to_float(inner)}

defp cast_numeric_dtype_to_float(:numeric), do: {:f, 64}
defp cast_numeric_dtype_to_float(other), do: other

@doc """
@@ -329,6 +381,20 @@ defmodule Explorer.Shared do
Inspect.Algebra.container_doc(open, item, close, opts, &to_doc/2)
end

def to_doc(item, opts) when is_map(item) and not is_struct(item) do
open = Inspect.Algebra.color("%{", :map, opts)
close = Inspect.Algebra.color("}", :map, opts)
arrow = Inspect.Algebra.color(" => ", :map, opts)

Inspect.Algebra.container_doc(open, Enum.to_list(item), close, opts, fn {key, value}, opts ->
Inspect.Algebra.concat([
Inspect.Algebra.color(inspect(key), :string, opts),
arrow,
to_doc(value, opts)
])
end)
end

def to_doc(item, _opts) do
case item do
nil -> "nil"
@@ -379,6 +445,7 @@ defmodule Explorer.Shared do
def dtype_to_string({:duration, :microsecond}), do: "duration[μs]"
def dtype_to_string({:duration, :nanosecond}), do: "duration[ns]"
def dtype_to_string({:list, dtype}), do: "list[" <> dtype_to_string(dtype) <> "]"
def dtype_to_string({:struct, fields}), do: "struct[#{map_size(fields)}]"
def dtype_to_string({:f, size}), do: "f" <> Integer.to_string(size)
def dtype_to_string(other) when is_atom(other), do: Atom.to_string(other)

20 changes: 20 additions & 0 deletions native/explorer/src/datatypes/ex_dtypes.rs
@@ -1,7 +1,9 @@
use crate::ExplorerError;
use polars::datatypes::DataType;
use polars::datatypes::Field;
use polars::datatypes::TimeUnit;
use rustler::NifTaggedEnum;
use std::collections::HashMap;
use std::ops::Deref;

impl rustler::Encoder for Box<ExSeriesDtype> {
@@ -39,6 +41,7 @@ pub enum ExSeriesDtype {
Datetime(ExTimeUnit),
Duration(ExTimeUnit),
List(Box<ExSeriesDtype>),
Struct(HashMap<String, ExSeriesDtype>),
}

impl TryFrom<&DataType> for ExSeriesDtype {
@@ -79,6 +82,17 @@ impl TryFrom<&DataType> for ExSeriesDtype {
inner.as_ref(),
)?))),

DataType::Struct(fields) => {
let mut struct_fields = HashMap::new();

for field in fields {
struct_fields
.insert(field.name().to_string(), Self::try_from(field.data_type())?);
}

Ok(ExSeriesDtype::Struct(struct_fields))
}

_ => Err(ExplorerError::Other(format!(
"cannot cast to dtype: {value}"
))),
@@ -124,6 +138,12 @@ impl TryFrom<&ExSeriesDtype> for DataType {
ExSeriesDtype::List(inner) => {
Ok(DataType::List(Box::new(Self::try_from(inner.as_ref())?)))
}
ExSeriesDtype::Struct(fields) => Ok(DataType::Struct(
fields
.iter()
.map(|(k, v)| Ok(Field::new(k.as_str(), v.try_into()?)))
.collect::<Result<Vec<Field>, Self::Error>>()?,
)),
}
}
}
12 changes: 12 additions & 0 deletions native/explorer/src/encoding.rs
@@ -2,6 +2,7 @@ use chrono::prelude::*;
use polars::export::arrow::array::GenericBinaryArray;
use polars::prelude::*;
use rustler::{Encoder, Env, NewBinary, OwnedBinary, ResourceArc, Term};
use std::collections::HashMap;
use std::{mem, slice};

use crate::atoms::{
@@ -573,6 +574,12 @@ pub fn term_from_value<'b>(v: AnyValue, env: Env<'b>) -> Result<Term<'b>, Explor
AnyValue::Duration(v, time_unit) => encode_duration(v, time_unit, env),
AnyValue::Categorical(idx, mapping, _) => Ok(mapping.get(idx).encode(env)),
AnyValue::List(series) => list_from_series(ExSeries::new(series), env),
AnyValue::Struct(_, _, fields) => v
._iter_struct_av()
Contributor Author

This function looks very undocumented, but it is part of the public API from Polars: https://docs.rs/polars/latest/polars/datatypes/enum.AnyValue.html#method._iter_struct_av

The Python implementation deals with struct values through it as well.

Member

I see. Maybe it's an unstable API, since normally it would not have the _ at the beginning. We can ask them later if this should be used, or if there is an alternative.

Contributor Author

Yeah. I tried dealing with the enum entry directly as well, but it quickly required me to do a bunch of unsafe operations 😅

That function is still performing those unsafe calls, but I trust their unsafe Rust more than mine 😂
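
Apart from the Polars iterator itself, the encoding step in this hunk is a zip of values with field names, collected into a `Result` so that the first failure short-circuits. A minimal std-only sketch of that pattern follows. `encode` and `encode_struct` are hypothetical stand-ins (plain strings play the role of encoded terms), and nothing here calls Polars or Rustler.

```rust
use std::collections::HashMap;

// Hypothetical fallible per-value encoder; negative inputs are rejected
// only so that the error path can be exercised.
fn encode(value: i64) -> Result<String, String> {
    if value < 0 {
        Err(format!("cannot encode value {value}"))
    } else {
        Ok(value.to_string())
    }
}

// Pairs each value with its field name and collects into a map. The first
// encoding error stops the collection and is returned as the result.
fn encode_struct<'a>(
    names: &[&'a str],
    values: Vec<i64>,
) -> Result<HashMap<&'a str, String>, String> {
    values
        .into_iter()
        .zip(names)
        .map(|(value, name)| Ok((*name, encode(value)?)))
        .collect::<Result<HashMap<_, _>, String>>()
}
```

Collecting into `Result<HashMap<_, _>, _>` keeps the error handling in one place: the closure only needs `?`, and no partially built map escapes on failure.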

.zip(fields)
.map(|(value, field)| Ok((field.name.as_str(), term_from_value(value, env)?)))
.collect::<Result<HashMap<_, _>, ExplorerError>>()
.map(|map| map.encode(env)),
dt => panic!("cannot encode value {dt:?} to term"),
}
}
@@ -603,6 +610,11 @@ pub fn list_from_series(s: ExSeries, env: Env) -> Result<Term, ExplorerError> {
})
.collect::<Result<Vec<Term>, ExplorerError>>()
.map(|lists| lists.encode(env)),
DataType::Struct(_fields) => s
.iter()
.map(|value| term_from_value(value, env))
.collect::<Result<Vec<_>, ExplorerError>>()
.map(|values| values.encode(env)),
dt => panic!("to_list/1 not implemented for {dt:?}"),
}
}
1 change: 1 addition & 0 deletions native/explorer/src/lib.rs
@@ -399,6 +399,7 @@ rustler::init!(
s_from_list_binary,
s_from_list_categories,
s_from_list_of_series,
s_from_list_of_series_as_structs,
s_from_binary_f32,
s_from_binary_f64,
s_from_binary_i64,
15 changes: 15 additions & 0 deletions native/explorer/src/series.rs
@@ -173,6 +173,21 @@ pub fn s_from_list_of_series(name: &str, series_vec: Vec<Option<ExSeries>>) -> E
ExSeries::new(Series::new(name, lists))
}

#[rustler::nif(schedule = "DirtyCpu")]
pub fn s_from_list_of_series_as_structs(name: &str, series_vec: Vec<ExSeries>) -> ExSeries {
let struct_chunked = StructChunked::new(
name,
series_vec
.into_iter()
.map(|s| s.clone_inner())
.collect::<Vec<_>>()
.as_slice(),
)
.unwrap();

ExSeries::new(struct_chunked.into_series())
}

macro_rules! from_binary {
($name:ident, $type:ty, $bytes:expr) => {
#[rustler::nif(schedule = "DirtyCpu")]