Support datasets/feature columns #301

josevalim · 2021-03-02T09:10:00Z

No description provided.

cigrainger · 2021-03-16T04:57:48Z

@josevalim in reference to your comment in #335 'I am also not sure if a Pandas for Elixir would necessarily build on top of Nx', do you think it would be reasonable to build on top of something like polars via Rustler with some glue to yield Nx tensors from polars and vice versa? Similar to the relationship between Pandas and NumPy?

josevalim · 2021-03-16T06:30:08Z

@cigrainger I was thinking the same and there are bindings for polars in Elixir. I am planning to reach out to the author and collect his thoughts. :)

josevalim · 2021-03-22T09:37:23Z

I have been looking more into this and the scope of this feature. At this point, it is clear we won't provide a complete pandas/arrow backend as part of Nx: it is a very wide domain with a lot of problems to tackle. What is left to decide is how big of a scope we want to push as part of Nx.

The write-up below covers all of my findings on this topic. It is going to be a long one, but hopefully it is worth it.

Tables and formatting

There is at least one scope we want to consider here, which is working with tabular data and formatting. For example, if I want to classify the Iris dataset, the data is going to be grouped into columns and we need a way to surface that into Nx. Let's also assume the CSV has already been parsed into a list of tuples. How do we convert this data to tensors?

Well, the first step would be to build one-dimensional tensors for each column:

sepal_length = csv |> Enum.map(&elem(&1, 0)) |> Nx.tensor()
sepal_width = csv |> Enum.map(&elem(&1, 1)) |> Nx.tensor()
petal_length = csv |> Enum.map(&elem(&1, 2)) |> Nx.tensor()
petal_width = csv |> Enum.map(&elem(&1, 3)) |> Nx.tensor()
species = csv |> Enum.map(&elem(&1, 4)) |> Nx.tensor()

However, we already run into an issue here. Species is a list of strings. There is one straightforward solution to this problem, which is to convert each species to an integer. However, once we do this, we lose the data representation aspect: now, when inspecting the tensor, instead of seeing "setosa" or "versicolor" for the species, I will see integers like 0 or 1. So I would say, at a bare minimum, we want to add the ability to customize how tensor data is represented:

species = csv |> Enum.map(&elem(&1, 4)) |> convert_strings_to_integer() |> Nx.tensor()
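Note that convert_strings_to_integer/1 above is a placeholder, not an existing Nx function. A minimal sketch of what such a helper could look like in plain Elixir, assuming codes are assigned in order of first appearance and that the caller keeps the category list around for decoding (the module and function names are hypothetical):

```elixir
defmodule CategoryEncoder do
  # Hypothetical helper, not part of Nx.
  # Assigns each distinct string an integer code in order of first
  # appearance and returns the codes together with the category list,
  # so the original labels can be recovered later.
  def encode(values) do
    categories = Enum.uniq(values)
    index = categories |> Enum.with_index() |> Map.new()
    codes = Enum.map(values, &Map.fetch!(index, &1))
    {codes, categories}
  end

  # Maps integer codes back to their labels.
  def decode(codes, categories) do
    Enum.map(codes, &Enum.at(categories, &1))
  end
end

{codes, categories} =
  CategoryEncoder.encode(["setosa", "versicolor", "setosa", "virginica"])

# codes      == [0, 1, 0, 2]
# categories == ["setosa", "versicolor", "virginica"]
```

The codes list is what would be handed to Nx.tensor/1. The category list has to live somewhere outside the tensor, which is exactly the representation problem described above.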

The second issue is that, by splitting the data above into 5 distinct tensors, we now need to pass this data around as isolated variables or as a five-element tuple. Ideally we should be able to pass them together. Furthermore, if I pass all 5 of them to a defn but the defn code only uses two of them, we should send only the used columns to the GPU. One option for grouping is to allow maps in defns:

%{
  sepal_length: csv |> Enum.map(&elem(&1, 0)) |> Nx.tensor(),
  sepal_width: csv |> Enum.map(&elem(&1, 1)) |> Nx.tensor(),
  petal_length: csv |> Enum.map(&elem(&1, 2)) |> Nx.tensor(),
  petal_width: csv |> Enum.map(&elem(&1, 3)) |> Nx.tensor(),
  species: csv |> Enum.map(&elem(&1, 4)) |> convert_strings_to_integer() |> Nx.tensor()
}

However, by working with maps, we still don't get a nice inspected representation in the terminal. After all, each column will be printed individually. So we need an Nx.Dataset struct so that the columns can be inspected nicely together:

Nx.Dataset.new(
  sepal_length: Enum.map(csv, &elem(&1, 0)),
  sepal_width: Enum.map(csv, &elem(&1, 1)),
  petal_length: Enum.map(csv, &elem(&1, 2)),
  petal_width: Enum.map(csv, &elem(&1, 3)),
  species: Enum.map(csv, &elem(&1, 4))
)
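To make the inspection idea concrete, here is a rough sketch of a struct with a custom Inspect implementation. It is only an illustration under stated assumptions: DatasetSketch is a hypothetical stand-in for the proposed Nx.Dataset, the columns are plain lists rather than tensors, and the tab-separated layout is an arbitrary choice.

```elixir
defmodule DatasetSketch do
  # Hypothetical stand-in for the proposed Nx.Dataset; columns are plain
  # lists here rather than tensors, to keep the sketch dependency-free.
  defstruct columns: []

  def new(columns) when is_list(columns), do: %__MODULE__{columns: columns}

  # Renders the columns as a tab-separated table, one row per line.
  def to_table(%__MODULE__{columns: columns}) do
    header = columns |> Keyword.keys() |> Enum.join("\t")

    rows =
      columns
      |> Keyword.values()
      |> Enum.zip()
      |> Enum.map(fn row ->
        row |> Tuple.to_list() |> Enum.map_join("\t", &to_string/1)
      end)

    Enum.join([header | rows], "\n")
  end
end

defimpl Inspect, for: DatasetSketch do
  # Shows the whole table instead of each column individually.
  def inspect(dataset, _opts) do
    "#DatasetSketch<\n" <> DatasetSketch.to_table(dataset) <> "\n>"
  end
end

ds = DatasetSketch.new(sepal_length: [5.1, 7.0], species: ["setosa", "versicolor"])
table = DatasetSketch.to_table(ds)
# table == "sepal_length\tspecies\n5.1\tsetosa\n7.0\tversicolor"
```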

Implementation considerations

So far, we have said that:

1. We need a mechanism to customize how values in tensors are formatted
2. We need a mechanism to group related tensors and be able to use only some of them inside defn
3. We need a mechanism to inspect related tensors as a dataset/table

Nx.Dataset can solve all of 1, 2, and 3, but the question is: should the customization of tensor values (item 1) be a part of Nx.Dataset or of Nx.Tensor? For example, imagine we do this:

ds = Nx.Dataset.new(
  sepal_length: Enum.map(csv, &elem(&1, 0)),
  sepal_width: Enum.map(csv, &elem(&1, 1)),
  petal_length: Enum.map(csv, &elem(&1, 2)),
  petal_width: Enum.map(csv, &elem(&1, 3)),
  species: Enum.map(csv, &elem(&1, 4))
)

ds[:species]

What should ds[:species] return?

1. The underlying tensor representation, showing integers such as 0, 1, and 2 for each species instead of "setosa" or "versicolor"
2. The underlying tensor representation, but inspected as "setosa", "versicolor", etc.

The issue with 2 is that we need to add data inspection to Nx.Tensor. At first it seems like a positive, but how does it work in relation to all of the different operations in the Nx module? For example, imagine I have an Nx.Tensor holding [0, 1, 2], which represents the strings ["foo", "bar", "baz"]. What happens if I multiply this tensor by 2? Or what happens if I broadcast its dimensions? Therefore it feels like the representation should be part of Nx.Dataset, as the representation can quickly lose meaning when moved into the tensor world.
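The multiply-by-2 case can be illustrated without Nx at all. In this small sketch, plain lists stand in for tensors and the decoding step is an assumed lookup by position; it is not how Nx would necessarily implement it.

```elixir
# Plain lists stand in for tensors here; no Nx calls are involved.
categories = ["foo", "bar", "baz"]
codes = [0, 1, 2]

# Decoding the untouched codes recovers the labels.
decoded = Enum.map(codes, &Enum.at(categories, &1))
# decoded == ["foo", "bar", "baz"]

# After an elementwise multiply by 2 the codes are [0, 2, 4].
doubled = Enum.map(codes, &(&1 * 2))

# 0 still decodes to "foo", 2 now decodes to "baz", and 4 has no label.
decoded_after = Enum.map(doubled, &Enum.at(categories, &1))
# decoded_after == ["foo", "baz", nil]
```

The arithmetic is perfectly valid on the integers, but the labels attached to the result are either wrong or missing.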

The downside of having the representation only be part of Nx.Dataset is that we have no direct equivalent to Pandas' Series. However, I believe this is a decision we can make later. The options would be:

1. Introduce an Nx.Series
2. Actually add the tensor representation to Nx.Tensor
3. Do not add Nx.Series and instead work with single-column datasets

Other operations

Datasets require many other operations: encoding, decoding, joining, selecting, etc. However, I think the focus for Nx.Dataset should be integration with defn, instead of providing the whole range of operations. That is, the goal is to allow folks to process the data using another library and then feed it into Nx for the machine learning aspect of things.

Given the focus on defn and given the fact that defn only works with tensors of known shapes, many dataset operations are simply not possible in defn. You can't join, filter, etc. The few operations possible are aggregations and group_by, with the latter only achievable on categorical data. In other words, we should consider Nx.Dataset to be a mostly read-only structure, leaving the remaining problems to be tackled in separate tools.

seanmor5 · 2021-03-22T11:05:40Z

My thoughts are that it seems the purpose of Dataset support within defn would be more for writing idiomatic code (e.g. ability to slice on columns, easy conversion from tabular data to tensor representations, etc.) when working with this style of data and for inspection. Based on that, I think going with a mostly read-only approach is the way to go. So I guess essentially datasets would really just add some compile-time sugar into defn?

> Given the focus on defn and given the fact defn only works with tensors of known shapes, many dataset operations are simply not possible in defn. You can't join, filter, etc. The few operations possible are aggregations and group_by, with the latter only achievable on categorical data. In other words, we should consider the focus of Nx.Dataset to be mostly read-only structures, leaving the remaining problems to be tackled in separate tools.

I agree that this scope falls mostly outside of defn and I am wondering if we have a chicken and egg problem here. If we move forward with a dataset abstraction without a clear separate tool we're working to integrate, is it possible we'll end up being too restrictive, or placing unnecessary constraints on tools that need Nx integration?

josevalim · 2021-03-22T16:25:24Z

> If we move forward with a dataset abstraction without a clear separate tool we're working to integrate, is it possible we'll end up being too restrictive, or placing unnecessary constraints on tools that need Nx integration?

Exactly one of my concerns too. We will need to make the "hand-off" fairly simple.

arpieb · 2021-03-23T00:54:04Z

I agree that the concept of a pandas-like equivalent is well outside the scope of Nx core, and really should live somewhere else. That being said, it would make sense in the idiomatic vein for Nx to at least define a protocol for a dataset that Nx can operate on (purely numeric) and leave the details of how features get vectorized up to other implementations, as suggested above with regard to polars. That way they could be represented by streams, message queues, fixed data sizes, external data sources, a pipeline like Broadway, and so on, as long as they implement the protocol for their source.
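As a rough illustration of what such a protocol might look like, here is a sketch in plain Elixir. Everything in it is an assumption made for the example: the protocol name NumericDataset, its single batches/2 callback, and the InMemoryColumns source do not exist in Nx, and plain lists of numbers stand in for tensors.

```elixir
defprotocol NumericDataset do
  # Hypothetical protocol; neither the name nor the callback exists in Nx.
  @doc "Returns an enumerable of batches, each a map of column name to a list of numbers."
  def batches(source, batch_size)
end

defmodule InMemoryColumns do
  # Example source: fixed-size, purely numeric columns held in memory.
  defstruct columns: %{}
end

defimpl NumericDataset, for: InMemoryColumns do
  def batches(%InMemoryColumns{columns: columns}, batch_size) do
    names = Map.keys(columns)

    columns
    |> Map.values()
    |> Enum.zip()
    |> Stream.chunk_every(batch_size)
    |> Stream.map(fn rows ->
      # Turn a chunk of row tuples back into per-column lists.
      lists =
        rows
        |> Enum.map(&Tuple.to_list/1)
        |> Enum.zip()
        |> Enum.map(&Tuple.to_list/1)

      names |> Enum.zip(lists) |> Map.new()
    end)
  end
end

source = %InMemoryColumns{columns: %{a: [1, 2, 3], b: [4, 5, 6]}}
batches = source |> NumericDataset.batches(2) |> Enum.to_list()
# batches == [%{a: [1, 2], b: [4, 5]}, %{a: [3], b: [6]}]
```

A stream-backed or queue-backed source would implement the same callback, so the consumer never needs to know where the numbers come from.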

I would even go so far as to say Nx shouldn't even be concerned about trying to transform raw features into numeric data (e.g. categorical data to binary vectors, SDRs, embeddings, etc) and strictly focus on optimizing the mathematical manipulation of complex tensors, leaving the construction of featurized tensors to external libs, else you'll wind up with numpy or scipy... ;)

josevalim · 2021-11-23T07:38:43Z

Closing this for now. With Nx.Container, we already allow dataframes to be given as arguments to defn. :D
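For readers landing here later, a minimal sketch of passing a struct to defn via Nx.Container. The IrisColumns struct, the IrisMath module, and the sample values are made up for illustration, and the version requirement in Mix.install is an assumption; this is not the dataframe integration itself, only the container mechanism it relies on.

```elixir
# Assumes network access to fetch Nx; the version requirement is a guess.
Mix.install([{:nx, "~> 0.7"}])

defmodule IrisColumns do
  # Hypothetical struct: deriving Nx.Container marks which fields hold
  # tensors, so the struct can cross the defn boundary.
  @derive {Nx.Container, containers: [:sepal_length, :sepal_width]}
  defstruct [:sepal_length, :sepal_width]
end

defmodule IrisMath do
  import Nx.Defn

  # The struct is destructured in the function head, and the body works
  # on its tensor fields like any other defn arguments.
  defn sepal_area(%IrisColumns{sepal_length: length, sepal_width: width}) do
    length * width
  end
end

iris = %IrisColumns{
  sepal_length: Nx.tensor([5.1, 7.0]),
  sepal_width: Nx.tensor([3.5, 3.2])
}

area = IrisMath.sepal_area(iris)
# area is a one-dimensional tensor with one entry per row
```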

josevalim added the area:defn, area:nx, kind:feature, and note:discussion labels Mar 2, 2021

josevalim mentioned this issue Mar 14, 2021

Add support for strings #335

Closed

cigrainger mentioned this issue Mar 24, 2021

Introduce a training API elixir-nx/axon#5

Closed

seanmor5 mentioned this issue Mar 30, 2021

Integration with common dataset format elixir-nx/axon#25

Closed

josevalim closed this as completed Nov 23, 2021
