Support datasets/feature columns #301

Closed
josevalim opened this issue Mar 2, 2021 · 7 comments
Labels
area:defn Applies to defn area:nx Applies to nx kind:feature New feature or request note:discussion Details or approval are up for discussion

Comments

@josevalim
Collaborator

No description provided.

@josevalim josevalim added area:defn Applies to defn area:nx Applies to nx kind:feature New feature or request note:discussion Details or approval are up for discussion labels Mar 2, 2021
@cigrainger
Member

cigrainger commented Mar 16, 2021

@josevalim in reference to your comment in #335 'I am also not sure if a Pandas for Elixir would necessarily build on top of Nx', do you think it would be reasonable to build on top of something like polars via Rustler with some glue to yield Nx tensors from polars and vice versa? Similar to the relationship between Pandas and NumPy?

@josevalim
Collaborator Author

@cigrainger I was thinking the same and there are bindings for polars in Elixir. I am planning to reach out to the author and collect his thoughts. :)

@josevalim
Collaborator Author

josevalim commented Mar 22, 2021

I have been looking more into this and the scope of this feature. At this point, it is clear we won't provide a complete pandas/arrow backend as part of Nx: it is a very wide domain with a lot of problems to tackle. What is left to decide is how big of a scope we want to push as part of Nx.

The write-up below documents all of my findings on this topic. It is going to be a long one, but hopefully it is worth it.

Tables and formatting

There is at least one scope we want to consider here: working with tabular data and formatting. For example, if I want to classify the Iris dataset, the data is going to be grouped into columns and we need a way to surface that in Nx. Let's also assume the CSV has already been parsed into a list of tuples. How do we convert this data to tensors?

Well, the first step would be to build one-dimensional tensors for each column:

sepal_length = csv |> Enum.map(&elem(&1, 0)) |> Nx.tensor()
sepal_width = csv |> Enum.map(&elem(&1, 1)) |> Nx.tensor()
petal_length = csv |> Enum.map(&elem(&1, 2)) |> Nx.tensor()
petal_width = csv |> Enum.map(&elem(&1, 3)) |> Nx.tensor()
species = csv |> Enum.map(&elem(&1, 4)) |> Nx.tensor()
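Each `Enum.map/2` call above walks the whole list again. As a minimal sketch, assuming every row tuple has the same size, all columns could be extracted at once with the standard library. The module name `ColumnSketch` and the sample rows are made up for illustration and are not part of Nx:

```elixir
defmodule ColumnSketch do
  # Turns a list of equally sized row tuples into a list of column lists.
  def columns(rows) do
    rows
    |> Enum.map(&Tuple.to_list/1)
    |> Enum.zip_with(& &1)
  end
end

# Two hand-written rows standing in for the parsed CSV.
rows = [
  {5.1, 3.5, 1.4, 0.2, "setosa"},
  {7.0, 3.2, 4.7, 1.4, "versicolor"}
]

[sepal_length, sepal_width, petal_length, petal_width, species] =
  ColumnSketch.columns(rows)
```

Each numeric column list could then be passed to `Nx.tensor/1` as in the snippet above; the `species` column still holds strings.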

However, we already run into an issue here. Species is a list of strings. There is one straightforward solution to this problem, which is to convert each species to an integer. However, once we do this, we lose the data representation aspect: now, when inspecting the tensor, instead of seeing "setosa" or "versicolor" for the species, I will see integers like 0 or 1. So I would say, at a bare minimum, we want to add the ability to customize how tensor data is represented:

species = csv |> Enum.map(&elem(&1, 4)) |> convert_strings_to_integer() |> Nx.tensor()
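`convert_strings_to_integer/1` is not defined anywhere in this thread. One possible shape for it, as a sketch only, is to number the distinct strings in order of first appearance. The module name `CategorySketch` and the `categories/1` helper are assumptions made for this example:

```elixir
defmodule CategorySketch do
  # Distinct values in order of first appearance; the position is the integer code.
  def categories(values), do: Enum.uniq(values)

  # Replaces each string with the index of its category.
  def convert_strings_to_integer(values) do
    codes = values |> categories() |> Enum.with_index() |> Map.new()
    Enum.map(values, &Map.fetch!(codes, &1))
  end
end

species = ["setosa", "versicolor", "setosa", "virginica"]
codes = CategorySketch.convert_strings_to_integer(species)
```

Keeping the result of `categories/1` around is what would later allow the integers to be shown as strings again.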

The second issue is that, by splitting the data above into 5 distinct tensors, we now need to pass this data around as isolated variables or as a five-element tuple. Ideally we should be able to pass them together. Furthermore, if I pass all 5 of them to a defn but the defn code only uses two of them, we should send only the used columns to the GPU. One option for grouping is to allow maps in defns:

%{
  sepal_length: csv |> Enum.map(&elem(&1, 0)) |> Nx.tensor(),
  sepal_width: csv |> Enum.map(&elem(&1, 1)) |> Nx.tensor(),
  petal_length: csv |> Enum.map(&elem(&1, 2)) |> Nx.tensor(),
  petal_width: csv |> Enum.map(&elem(&1, 3)) |> Nx.tensor(),
  species: csv |> Enum.map(&elem(&1, 4)) |> convert_strings_to_integer() |> Nx.tensor()
}
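As a sketch of what using such a map inside a defn could look like, assuming a recent Nx release in which defn accepts maps of tensors as arguments (the version requirement below is an assumption, as are the module and function names):

```elixir
# Assumes network access to fetch Nx; the version is an assumption.
Mix.install([{:nx, "~> 0.7"}])

defmodule IrisSketch do
  import Nx.Defn

  # Pattern matches on the map and touches only two of its columns.
  defn sepal_area(%{sepal_length: l, sepal_width: w}) do
    l * w
  end
end

columns = %{
  sepal_length: Nx.tensor([5.0, 4.5]),
  sepal_width: Nx.tensor([3.5, 3.0]),
  petal_length: Nx.tensor([1.4, 1.5])
}

area = IrisSketch.sepal_area(columns)
```

Whether the unused `petal_length` column is skipped when transferring data to a device depends on the compiler in use and is not demonstrated by this sketch.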

However, by working with maps, we still don't get a nice inspected representation in the terminal. After all, each column will be printed individually. So we need an Nx.Dataset struct so that the columns can be inspected nicely together:

Nx.Dataset.new(
  sepal_length: Enum.map(csv, &elem(&1, 0)),
  sepal_width: Enum.map(csv, &elem(&1, 1)),
  petal_length: Enum.map(csv, &elem(&1, 2)),
  petal_width: Enum.map(csv, &elem(&1, 3)),
  species: Enum.map(csv, &elem(&1, 4))
)

Implementation considerations

So far, we have said that:

  1. We need a mechanism to customize how values in tensors are formatted
  2. We need a mechanism to group related tensors and be able to use only some of them inside defn
  3. We need a mechanism to inspect related tensors as a dataset/table

Nx.Dataset can solve all of 1, 2, and 3, but the question is: should the customization of tensor values (bullet 1) be a part of Nx.Dataset or of Nx.Tensor? For example, imagine we do this:

ds = Nx.Dataset.new(
  sepal_length: Enum.map(csv, &elem(&1, 0)),
  sepal_width: Enum.map(csv, &elem(&1, 1)),
  petal_length: Enum.map(csv, &elem(&1, 2)),
  petal_width: Enum.map(csv, &elem(&1, 3)),
  species: Enum.map(csv, &elem(&1, 4))
)

ds[:species]

What should ds[:species] return?

  1. The underlying tensor representation, showing integers such as 0, 1, and 2 for each species instead of "setosa" or "versicolor"
  2. A tensor with a customized representation, showing "setosa", "versicolor", etc.

The issue with 2 is that we need to add data inspection to Nx.Tensor. At first it seems like a positive, but how does it work in relation to all of the different operations in the Nx module? For example, imagine I have an Nx.Tensor holding [0, 1, 2], which represents the strings ["foo", "bar", "baz"]. What happens if I multiply this tensor by 2? Or what happens if I broadcast its dimensions? Therefore it feels like the representation should be part of Nx.Dataset - as the representation can quickly lose meaning when moved into the tensor world.
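To make the "representation lives in the dataset" idea concrete, here is a minimal sketch in plain Elixir. It is not the proposed Nx.Dataset: the module `DatasetSketch`, its fields, and its functions are hypothetical, and plain lists stand in for tensors so the example needs no dependencies:

```elixir
defmodule DatasetSketch do
  # codes:  column name => list of numbers (stand-in for a tensor)
  # labels: column name => list of category strings, only for encoded columns
  defstruct codes: %{}, labels: %{}

  def new(columns) do
    Enum.reduce(columns, %__MODULE__{}, fn {name, values}, ds ->
      if Enum.all?(values, &is_binary/1) do
        # String column: store integer codes plus the labels to decode them.
        categories = Enum.uniq(values)
        codes = Enum.map(values, fn v -> Enum.find_index(categories, &(&1 == v)) end)
        %{ds | codes: Map.put(ds.codes, name, codes), labels: Map.put(ds.labels, name, categories)}
      else
        %{ds | codes: Map.put(ds.codes, name, values)}
      end
    end)
  end

  # Raw numeric data, as numeric code would see it (option 1 in the list above).
  def codes(ds, name), do: Map.fetch!(ds.codes, name)

  # Human-readable view, resolved by the dataset rather than by the tensor.
  def display(ds, name) do
    case Map.fetch(ds.labels, name) do
      {:ok, categories} -> Enum.map(codes(ds, name), &Enum.at(categories, &1))
      :error -> codes(ds, name)
    end
  end
end

ds =
  DatasetSketch.new(
    sepal_length: [5.1, 7.0, 6.3],
    species: ["setosa", "versicolor", "setosa"]
  )
```

In this sketch the integer codes carry no labels of their own, so multiplying or broadcasting them raises no question about what the strings should become; only the dataset knows how to display them.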

The downside of having the representation only be part of Nx.Dataset is that we have no direct equivalent to Pandas' Series. However, I believe this is a decision we can make later. The options would be:

  1. Introduce an Nx.Series
  2. Actually add the tensor representation to Nx.Tensor
  3. Do not add Nx.Series - instead work with single column datasets

Other operations

Datasets require many other operations: encoding, decoding, joining, selecting, etc. However, I think the focus for Nx.Dataset should be integration with defn, instead of providing the whole range of operations. That is, the goal is to allow folks to process the data using another library and then feed it into Nx for the machine learning aspect of things.

Given the focus on defn and given the fact defn only works with tensors of known shapes, many dataset operations are simply not possible in defn. You can't join, filter, etc. The few operations possible are aggregations and group_by, with the latter only achievable on categorical data. In other words, we should consider the focus of Nx.Dataset to be mostly read-only structures, leaving the remaining problems to be tackled in separate tools.
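The shape argument can be illustrated without Nx. In the sketch below, plain lists stand in for tensors and the module `GroupSketch` is hypothetical. A grouped sum over integer category codes always produces exactly `num_categories` results, so its output size is known before looking at the data, whereas the size of a filter result depends on the values themselves:

```elixir
defmodule GroupSketch do
  # Sums `values` per category code. The output always has `num_categories`
  # entries, regardless of which codes actually occur in the data.
  def group_sum(values, codes, num_categories) do
    pairs = Enum.zip(values, codes)

    for category <- 0..(num_categories - 1)//1 do
      Enum.reduce(pairs, 0, fn {value, code}, acc ->
        if code == category, do: acc + value, else: acc
      end)
    end
  end
end

values = [1, 2, 3, 4]
codes = [0, 1, 0, 2]

sums = GroupSketch.group_sum(values, codes, 3)

# By contrast, the length of a filter result is only known after reading the data.
filtered = Enum.filter(values, &(&1 > 2))
```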

@seanmor5
Collaborator

My thoughts are that it seems the purpose of Dataset support within defn would be more for writing idiomatic code (e.g. ability to slice on columns, easy conversion from tabular data to tensor representations, etc.) when working with this style of data and for inspection. Based on that, I think going with a mostly read-only approach is the way to go. So I guess essentially datasets would really just add some compile-time sugar into defn?

> Given the focus on defn and given the fact defn only works with tensors of known shapes, many dataset operations are simply not possible in defn. You can't join, filter, etc. The few operations possible are aggregations and group_by, with the latter only achievable on categorical data. In other words, we should consider the focus of Nx.Dataset to be mostly read-only structures, leaving the remaining problems to be tackled in separate tools.

I agree that this scope falls mostly outside of defn and I am wondering if we have a chicken and egg problem here. If we move forward with a dataset abstraction without a clear separate tool we're working to integrate, is it possible we'll end up being too restrictive, or placing unnecessary constraints on tools that need Nx integration?

@josevalim
Collaborator Author

> If we move forward with a dataset abstraction without a clear separate tool we're working to integrate, is it possible we'll end up being too restrictive, or placing unnecessary constraints on tools that need Nx integration?

Exactly one of my concerns too. We will need to make the "hand-off" fairly simple.

@arpieb

arpieb commented Mar 23, 2021

I agree that the concept of a pandas-like equivalent is well outside the scope of Nx core, and really should live somewhere else. That being said, it would make sense in the idiomatic vein for Nx to at least define a protocol for a dataset that Nx can operate on (purely numeric) and leave the details of how features get vectorized up to other implementations, as suggested above with regard to polars. That way they could be represented by streams, message queues, fixed data sizes, external data sources, a pipeline like Broadway, etc., ad nauseam, as long as they implement the protocol for their source.
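A rough sketch of what such a protocol might look like follows. Everything here is hypothetical: the name `NumericSourceSketch`, the `batches/1` function, and the choice of "map of column name to list of numbers" as the batch format are assumptions made for illustration, not an existing Nx API:

```elixir
defprotocol NumericSourceSketch do
  @doc "Returns an enumerable of batches, each a map of column name to numbers."
  def batches(source)
end

defimpl NumericSourceSketch, for: Map do
  # An in-memory map of columns is treated as a single batch.
  def batches(columns), do: [columns]
end

defimpl NumericSourceSketch, for: Stream do
  # A lazy stream is assumed to already yield one batch per element.
  def batches(stream), do: stream
end

in_memory = %{sepal_length: [5.1, 7.0]}
lazy = Stream.map([5.1, 7.0], fn value -> %{sepal_length: [value]} end)

in_memory_batches = Enum.to_list(NumericSourceSketch.batches(in_memory))
lazy_batches = Enum.to_list(NumericSourceSketch.batches(lazy))
```

A consumer written against the protocol would not need to know whether the data came from memory, a stream, or another source that implements it.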

I would even go so far as to say Nx shouldn't even be concerned about trying to transform raw features into numeric data (e.g. categorical data to binary vectors, SDRs, embeddings, etc) and strictly focus on optimizing the mathematical manipulation of complex tensors, leaving the construction of featurized tensors to external libs, else you'll wind up with numpy or scipy... ;)

@josevalim
Collaborator Author

Closing this for now. With Nx.Container, we already allow dataframes to be given as arguments to defn. :D


4 participants