# Support datasets/feature columns #301
@josevalim in reference to your comment in #335, 'I am also not sure if a Pandas for Elixir would necessarily build on top of Nx', do you think it would be reasonable to build on top of something like polars via Rustler, with some glue to yield Nx tensors from polars and vice versa? Similar to the relationship between Pandas and NumPy?
@cigrainger I was thinking the same, and there are bindings for polars in Elixir. I am planning to reach out to the author and collect his thoughts. :)
I have been looking more into this and the scope of this feature. At this point, it is clear we won't provide a complete pandas/arrow backend as part of Nx: it is a very wide domain with a lot of problems to tackle. What is left to decide is how big of a scope we want to push as part of Nx. The document below collects all of my findings on this topic. It is going to be a long one, but hopefully it is worth it.

## Tables and formatting

There is at least one scope we want to consider here, which is working with tabular data and formatting. For example, if I want to classify the Iris dataset, the data is going to be grouped into columns and we need a way to surface that into Nx. Let's also assume the CSV has already been parsed into a list of tuples. How do we convert this data to tensors? Well, the first step would be to build one-dimensional tensors for each column:

```elixir
sepal_length = csv |> Enum.map(&elem(&1, 0)) |> Nx.tensor()
sepal_width = csv |> Enum.map(&elem(&1, 1)) |> Nx.tensor()
petal_length = csv |> Enum.map(&elem(&1, 2)) |> Nx.tensor()
petal_width = csv |> Enum.map(&elem(&1, 3)) |> Nx.tensor()
species = csv |> Enum.map(&elem(&1, 4)) |> Nx.tensor()
```

However, we already run into an issue here: species is a list of strings. There is one straightforward solution to this problem, which is to convert each species to an integer. However, once we do this, we lose the data representation aspect: now, when inspecting the tensor, instead of seeing "setosa" or "versicolor" for the species, I will see integers like 0 or 1. So I would say, at a bare minimum, we want to add the ability to customize how tensor data is represented:

```elixir
species = csv |> Enum.map(&elem(&1, 4)) |> convert_strings_to_integer() |> Nx.tensor()
```
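As an aside, `convert_strings_to_integer/1` is not an existing function; here is a minimal sketch of what such a helper could look like (the module name and implementation are just an assumption for illustration):

```elixir
defmodule IrisHelpers do
  # Hypothetical helper: assign each distinct string an integer based on
  # first appearance, e.g. "setosa" -> 0, "versicolor" -> 1, "virginica" -> 2.
  def convert_strings_to_integer(strings) do
    mapping = strings |> Enum.uniq() |> Enum.with_index() |> Map.new()
    Enum.map(strings, &Map.fetch!(mapping, &1))
  end
end
```

Note that the string-to-integer mapping is thrown away unless it is stored somewhere, which is exactly the representation problem described above.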
The second issue is that, by splitting the data above into 5 distinct tensors, we now need to pass this data around as isolated variables or as a five-element tuple. Ideally we should be able to pass them together, and a function that receives all 5 of them should be able to take them as a single value. One option is to group the columns in a map:

```elixir
%{
  sepal_length: csv |> Enum.map(&elem(&1, 0)) |> Nx.tensor(),
  sepal_width: csv |> Enum.map(&elem(&1, 1)) |> Nx.tensor(),
  petal_length: csv |> Enum.map(&elem(&1, 2)) |> Nx.tensor(),
  petal_width: csv |> Enum.map(&elem(&1, 3)) |> Nx.tensor(),
  species: csv |> Enum.map(&elem(&1, 4)) |> convert_strings_to_integer() |> Nx.tensor()
}
```

However, by working with maps, we still don't get a nice inspected representation in the terminal. After all, each column will be printed individually. So we need an `Nx.Dataset`:

```elixir
Nx.Dataset.new(
  sepal_length: Enum.map(csv, &elem(&1, 0)),
  sepal_width: Enum.map(csv, &elem(&1, 1)),
  petal_length: Enum.map(csv, &elem(&1, 2)),
  petal_width: Enum.map(csv, &elem(&1, 3)),
  species: Enum.map(csv, &elem(&1, 4))
)
```

## Implementation considerations

So far, we have said that we want to group the columns together, customize how their values are represented, and get a readable inspected output. Now consider this code:

```elixir
ds = Nx.Dataset.new(
  sepal_length: Enum.map(csv, &elem(&1, 0)),
  sepal_width: Enum.map(csv, &elem(&1, 1)),
  petal_length: Enum.map(csv, &elem(&1, 2)),
  petal_width: Enum.map(csv, &elem(&1, 3)),
  species: Enum.map(csv, &elem(&1, 4))
)

ds[:species]
```

What should `ds[:species]` return? Two options come to mind:

1. a plain tensor, with the string representation staying in the dataset
2. a tensor that carries the string representation with it
The issue with 2 is that we need to add data inspection to Nx.Tensor. At first it seems like a positive, but how does it work in relation to all of the different operations in the Nx module? For example, imagine I have an Nx.Tensor representing [0, 1, 2] which stands for the strings ["foo", "bar", "baz"]. What happens if I multiply this tensor by 2? Or what happens if I broadcast its dimensions? Therefore it feels like the representation should be part of Nx.Dataset, as the representation can quickly lose meaning once moved into the tensor world.

The downside of having the representation only be part of Nx.Dataset is that we have no direct equivalent to Pandas' Series. However, I believe this is a decision we can make later; there would be a few options for doing so.
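To make the "representation lives in the dataset" option concrete, here is a rough sketch assuming a plain struct with Access support; the module, fields, and `new/1` function are made up for illustration and are not part of Nx:

```elixir
defmodule MyDataset do
  # Hypothetical sketch: columns are stored as 1-D tensors and the
  # representation (integer -> label) lives in the dataset, not in the tensor.
  defstruct columns: %{}, representations: %{}

  @behaviour Access

  def new(columns) do
    Enum.reduce(columns, %__MODULE__{}, fn {name, values}, ds ->
      case values do
        # String columns: encode as integers and remember the labels.
        [first | _] when is_binary(first) ->
          labels = Enum.uniq(values)
          mapping = labels |> Enum.with_index() |> Map.new()
          tensor = values |> Enum.map(&Map.fetch!(mapping, &1)) |> Nx.tensor()

          %{
            ds
            | columns: Map.put(ds.columns, name, tensor),
              representations: Map.put(ds.representations, name, labels)
          }

        # Numerical columns become plain tensors.
        _ ->
          %{ds | columns: Map.put(ds.columns, name, Nx.tensor(values))}
      end
    end)
  end

  # ds[:species] returns the plain integer tensor; the labels stay in the dataset.
  @impl Access
  def fetch(%__MODULE__{columns: columns}, key), do: Map.fetch(columns, key)

  @impl Access
  def get_and_update(_ds, _key, _fun), do: raise("not implemented in this sketch")

  @impl Access
  def pop(_ds, _key), do: raise("not implemented in this sketch")
end
```

An `Inspect` implementation could then use `representations` to print "setosa"/"versicolor" instead of 0/1, while the tensors handed off to Nx stay purely numerical.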
## Other operations

Datasets require many other operations: encoding, decoding, joining, selecting, etc. However, I think the focus for Nx.Datasets should be integration with Nx itself; given that focus, the other operations can be left to dedicated libraries.
My thoughts are that it seems the purpose of Dataset support within Nx is mostly to get tabular data into tensors.
I agree that this scope falls mostly outside of Nx.
Exactly one of my concerns too. We will need to make the "hand-off" fairly simple.
I agree that the concept of a pandas-like equivalent is well outside the scope of Nx core and really should live somewhere else. That being said, it would make sense, in the idiomatic Elixir vein, for Nx to at least define a protocol for a dataset that Nx can operate on (purely numeric) and leave the details of how features get vectorized up to other implementations, as suggested above with regard to polars. That way datasets could be represented by streams, message queues, fixed data sizes, external data sources, a pipeline like Broadway, and so on ad nauseam, as long as they implement the protocol for their source. I would even go so far as to say Nx shouldn't be concerned with trying to transform raw features into numeric data (e.g. categorical data to binary vectors, SDRs, embeddings, etc.) and should strictly focus on optimizing the mathematical manipulation of complex tensors, leaving the construction of featurized tensors to external libs, else you'll wind up with numpy or scipy... ;)
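For the sake of discussion, a rough sketch of what such a protocol might look like; the protocol name, its function, and the Map implementation below are purely hypothetical and not part of Nx:

```elixir
# Any data source (in-memory columns, a polars binding, a stream, an
# external store, ...) could implement this to hand purely numeric
# tensors to Nx, keeping featurization outside of Nx itself.
defprotocol TensorSource do
  @doc "Returns a map of column name => 1-D Nx.Tensor."
  def to_tensors(source)
end

defimpl TensorSource, for: Map do
  def to_tensors(map) do
    Map.new(map, fn {name, values} -> {name, Nx.tensor(values)} end)
  end
end
```

The point is only that Nx would consume whatever implements the protocol, without caring how the data was featurized.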
Closing this for now. With Nx.Container, we already allow dataframes to be given as arguments to defn.
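For completeness, a small sketch of what that hand-off can look like with Nx.Container; the struct, field names, and functions below are invented for illustration, and the exact derive options should be checked against the Nx.Container docs:

```elixir
defmodule MyFrame do
  # Fields under :containers are traversed as tensors by defn;
  # fields under :keep travel along untouched as metadata.
  @derive {Nx.Container, containers: [:features, :labels], keep: [:column_names]}
  defstruct [:features, :labels, :column_names]
end

defmodule Example do
  import Nx.Defn

  # The whole struct can be passed to defn; Nx only traverses the tensor fields.
  defn feature_means(frame) do
    Nx.mean(frame.features, axes: [0])
  end
end

frame = %MyFrame{
  features: Nx.tensor([[5.1, 3.5], [4.9, 3.0]]),
  labels: Nx.tensor([0, 0]),
  column_names: [:sepal_length, :sepal_width]
}

Example.feature_means(frame)
```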