Add KNNImputer by srzeszut · Pull Request #303 · elixir-nx/scholar

srzeszut · 2024-10-20T13:28:02Z

I have added the KNNImputer and I am currently implementing tests to ensure that it behaves as expected across various scenarios, including edge cases.

lib/scholar/impute/knn_imputer.ex

josevalim · 2024-10-21T07:16:01Z

lib/scholar/impute/knn_imputer.ex

+    if opts[:missing_values] != :nan and
+         Nx.any(Nx.is_nan(x)) == Nx.tensor(1, type: :u8) do
+      raise ArgumentError,
+            ":missing_values other than :nan possible only if there is no Nx.Constant.nan() in the array"
+    end
+


This check does not really work in Nx. If you call fit inside Nx.Defn.jit, then x is an expression, and we can't read its values to find out if there is a nan or not. The best we can do is to remove this check and document it.

I found this check in simple imputer
https://github.com/elixir-nx/scholar/blob/main/lib/scholar/impute/simple_imputer.ex
Are you sure it won't work?

It is also broken there. :)

I have fixed it there: c024c5b

josevalim · 2024-10-21T07:17:44Z

lib/scholar/impute/knn_imputer.ex

+
+    all_nan_rows_count = Nx.sum(all_nan_rows)
+
+    if num_neighbors > rows - 1 - Nx.to_number(all_nan_rows_count) do


Same here, this code won't work because, when you have an expression, you can't get a number from it. Can we remove this check? What happens if we don't check for this condition?

You can test this by jitting fit with Nx.Defn.jit and then calling it.
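A minimal sketch of why value-dependent checks break under jit, assuming Nx is installed through Mix.install. `CheckDemo` and `has_nan?/1` are hypothetical names used only for illustration; they are not part of Scholar.

```elixir
Mix.install([{:nx, "~> 0.9"}])

defmodule CheckDemo do
  # Hypothetical helper shaped like the check under review:
  # it tries to read a concrete number out of `x`.
  def has_nan?(x) do
    Nx.to_number(Nx.any(Nx.is_nan(x))) == 1
  end
end

x = Nx.tensor([[1.0, :nan], [3.0, 4.0]])

# Called eagerly, `x` holds concrete data, so the value can be read.
IO.inspect(CheckDemo.has_nan?(x))

# Under jit, `x` is an expression without concrete data, so reading it raises.
result =
  try do
    Nx.Defn.jit(&CheckDemo.has_nan?/1).(x)
  rescue
    e -> {:raised, e.__struct__}
  end

IO.inspect(result)
```

This is the kind of failure a jitted test is expected to surface, which is why the thread suggests dropping the check and documenting the precondition instead.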

lib/scholar/impute/knn_imputer.ex

josevalim · 2024-10-21T07:21:11Z

lib/scholar/impute/knn_imputer.ex

+
+    # if potential neighbor has nan in nan_col, we don't want to calculate distance and the case if potential_neighbour is the row to impute
+    {potential_neighbor} =
+      if potential_neighbor[nan_col] == Nx.Constants.nan() do


I am not sure if this check is guaranteed to work, given two NaNs are not guaranteed to be equal. Using Nx.is_nan would be more appropriate.
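The difference between comparing against NaN and calling Nx.is_nan can be seen in a short sketch (assuming Nx is installed through Mix.install; the variable names are illustrative only):

```elixir
Mix.install([{:nx, "~> 0.9"}])

nan = Nx.Constants.nan()

# IEEE 754: NaN compares unequal to everything, including itself.
eq = Nx.equal(nan, nan)

# Nx.is_nan/1 detects NaN independently of that rule.
is_nan = Nx.is_nan(nan)

IO.inspect(Nx.to_number(eq))      # 0
IO.inspect(Nx.to_number(is_nan))  # 1

# The same function builds an element-wise mask for a whole row.
mask = Nx.is_nan(Nx.tensor([1.0, :nan, 3.0]))
IO.inspect(Nx.to_flat_list(mask)) # [0, 1, 0]
```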

lib/scholar/impute/knn_imputer.ex

msluszniak · 2024-10-21T12:36:56Z

lib/scholar/impute/knn_imputer.ex

+
+    x =
+      if opts[:missing_values] != :nan,
+        do: Nx.select(Nx.equal(x, opts[:missing_values]), Nx.Constants.nan(), x),


Use Nx.is_nan here; NaN is not equal to itself.

lib/scholar/impute/knn_imputer.ex

msluszniak · 2024-10-21T12:57:35Z

lib/scholar/impute/knn_imputer.ex

+    coordinates = coordinates - 1
+
+    # inputes zeros in nan_col to calculate distance with squared_euclidean
+    new_row = Nx.indexed_put(row, Nx.new_axis(nan_col, 0), Nx.tensor(0))


Generally, when you write in defn, you don't need to wrap this zero in Nx.tensor. I prefer to explicitly use Nx.<type> or Nx.tensor(x, type: type) to indicate the type of the tensor. Also, there are some places where the imputter uses a fixed type like :f32. I think this might cause undesired upcasts when, e.g., the input tensor has type :bf16. So I suggest checking whether there are any unwanted casts or upcasts.

I changed it, but I don't know how to change this line:
row_distances = Nx.iota({rows}, type: {:f, 32})
because I don't know what type the calculated distance will have at this point.
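A small sketch of the upcast concern, assuming Nx is installed through Mix.install. Deriving the distance type with Nx.Type.to_floating/1 is only one possible option and assumes the distance computation itself keeps the input's floating type; it is not necessarily what the PR ended up doing.

```elixir
Mix.install([{:nx, "~> 0.9"}])

x = Nx.tensor([[1.0, 2.0], [3.0, 4.0]], type: :bf16)
type = Nx.type(x)

# A constant created without a type defaults to :f32 ...
untyped_nan = Nx.Constants.nan()
# ... while passing the input type keeps the constant in :bf16.
typed_nan = Nx.Constants.nan(type)

mask = Nx.equal(x, 2.0)
upcast = Nx.select(mask, untyped_nan, x)
kept = Nx.select(mask, typed_nan, x)

IO.inspect(Nx.type(upcast)) # {:f, 32}
IO.inspect(Nx.type(kept))   # {:bf, 16}

# Assumption: distances share the floating version of the input type.
dist_type = Nx.Type.to_floating(type)
row_distances = Nx.iota({2}, type: dist_type)
IO.inspect(Nx.type(row_distances))
```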

msluszniak · 2024-10-21T13:02:04Z

lib/scholar/impute/knn_imputer.ex

+
+    # if row has all nans we skip it
+    {weight, potential_neighbor} =
+      if present_coordinates == 0 do


As mentioned in the comment above, try to replace "bare" numbers with typed tensors.

msluszniak · 2024-10-21T13:03:24Z

lib/scholar/impute/knn_imputer.ex

@@ -0,0 +1,256 @@
+defmodule Scholar.Impute.KNNImputer do


I think it should be written with a double t, KNNImputter, like formatter etc.

msluszniak

Thanks for the PR, I dropped some comments :))

krstopro · 2024-10-22T16:21:43Z

Hi @srzeszut and thanks for the pull request. I’m traveling now and don’t have my laptop with me. Will be back this Sunday, so I will have a look probably next week.

srzeszut · 2024-10-27T10:40:27Z

Thanks for the review. I applied the suggested changes and left some comments.

josevalim · 2024-10-28T10:06:33Z

lib/scholar/impute/knn_imputter.ex

+
+    if num_neighbors > rows - 1 - Nx.to_number(all_nan_rows_count) do
+      raise ArgumentError,
+            "Number of neighbors rows must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value)"


error messages start in lowercase. :)

Suggested change

- "Number of neighbors rows must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value)"
+ "number of neighbors rows must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value)"

josevalim · 2024-10-28T10:07:12Z

lib/scholar/impute/knn_imputter.ex

+
+    all_nan_rows_count = Nx.sum(all_nan_rows)
+
+    if num_neighbors > rows - 1 - Nx.to_number(all_nan_rows_count) do


Can you please add some tests? In particular, please add a test where you jit this function and then call it: Nx.Defn.jit(...).(arg1, arg2). It should reveal some errors around here. :)

I added tests and checked it. I removed those checks and documented the conditions in the description instead.

josevalim · 2024-10-29T19:01:37Z

lib/scholar/impute/knn_imputter.ex

+    `n_neighbors` nearest neighbors found in the training set. Two samples are
+    close if the features that neither is missing are close.


Suggested change

`n_neighbors` nearest neighbors found in the training set. Two samples are

close if the features that neither is missing are close.

`n_neighbors` nearest neighbors found in the training set. Two samples are

close if the features that neither is missing are close.

josevalim · 2024-10-29T19:01:57Z

lib/scholar/impute/knn_imputter.ex

+
+  Preconditions:
+    * `number_of_neighbors` is a positive integer.
+    *  number of neighbors must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value) otherwise it is better to use simple imputter


Please try to break this long line :)

lib/scholar/impute/knn_imputter.ex

josevalim · 2024-10-29T19:05:32Z

test/scholar/impute/knn_imputter_test.exs

+    test "Wrong impute rank" do
+      x = Nx.tensor([1, 2, 2, 3])
+
+      assert_raise ArgumentError,
+                   "Wrong input rank. Expected: 2, got: 1",
+                   fn ->
+                     KNNImputter.fit(x, missing_values: 1, number_of_neighbors: 2)
+                   end
+    end
+
+    test "Invalid n_neighbors value" do


Test names start in lowercase :)

Suggested change

-    test "Wrong impute rank" do
-      x = Nx.tensor([1, 2, 2, 3])
-
-      assert_raise ArgumentError,
-                   "Wrong input rank. Expected: 2, got: 1",
-                   fn ->
-                     KNNImputter.fit(x, missing_values: 1, number_of_neighbors: 2)
-                   end
-    end
-
-    test "Invalid n_neighbors value" do
+    test "invalid impute rank" do
+      x = Nx.tensor([1, 2, 2, 3])
+
+      assert_raise ArgumentError,
+                   "Wrong input rank. Expected: 2, got: 1",
+                   fn ->
+                     KNNImputter.fit(x, missing_values: 1, number_of_neighbors: 2)
+                   end
+    end
+
+    test "invalid n_neighbors value" do

josevalim

I dropped the last round of nitpicks and we are good to go!

krstopro

First review. Some features we might wanna have:

Make k-NN algorithm configurable.
Make the metric configurable.

You can leave these for another pull request. Have a look at e.g. KNNClassifier how it is done over there.
I should have another look tonight.

krstopro · 2024-10-30T09:39:38Z

lib/scholar/impute/knn_imputter.ex

+      The default value expects there are no NaNs in the input tensor.
+      """
+    ],
+    number_of_neighbors: [


I would suggest changing this to num_neighbors to be consistent with the rest of Scholar.

krstopro

Several minor comments for now. I have to go through the code at least once more as I don't exactly understand the logic here.

krstopro · 2024-10-30T14:36:48Z

lib/scholar/impute/knn_imputter.ex

+
+    x =
+      if opts[:missing_values] != :nan,
+        do: Nx.select(Nx.equal(x, opts[:missing_values]), Nx.Constants.nan(), x),


You should be able to use == instead of Nx.equal/2.

This is a deftransform, so Nx.equal is the proper function. == will be Elixir.Kernel.==
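To illustrate the distinction, here is a minimal sketch (assuming Nx is installed through Mix.install; `EqualDemo` and its functions are hypothetical names). Inside defn the comparison operators are rewritten to their Nx counterparts, while a regular function body, which is what a deftransform is, sees the plain Kernel operators.

```elixir
Mix.install([{:nx, "~> 0.9"}])

defmodule EqualDemo do
  import Nx.Defn

  # Inside defn, `==` is rewritten to Nx.equal/2.
  defn mask_defn(x, missing), do: x == missing

  # In a regular function, `==` is Kernel.==/2,
  # so Nx.equal/2 has to be called explicitly.
  def mask_plain(x, missing), do: Nx.equal(x, missing)
end

x = Nx.tensor([1, 2, 2, 3])

IO.inspect(Nx.to_flat_list(EqualDemo.mask_defn(x, 2)))  # [0, 1, 1, 0]
IO.inspect(Nx.to_flat_list(EqualDemo.mask_plain(x, 2))) # [0, 1, 1, 0]
```

The same reasoning applies to `<` versus Nx.less/2: the operator form is only available where the code is compiled as defn.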

krstopro · 2024-10-30T14:42:59Z

lib/scholar/impute/knn_imputter.ex

+    placeholder_value = Nx.Constants.nan() |> Nx.tensor()
+
+    statistics = knn_impute(x, placeholder_value, num_neighbors: num_neighbors)
+    missing_values = opts[:missing_values]


I would move this line above so that you don't access opts[:missing_values] multiple times.

krstopro · 2024-10-30T14:49:00Z

lib/scholar/impute/knn_imputter.ex

+
+    {_, values_to_impute} =
+      while {{row = 0, mask, num_neighbors, num_rows, x}, values_to_impute},
+            Nx.less(row, num_rows) do


You can use < instead of Nx.less/2 over here.

krstopro · 2024-10-30T14:49:14Z

lib/scholar/impute/knn_imputter.ex

+            Nx.less(row, num_rows) do
+        {_, values_to_impute} =
+          while {{col = 0, mask, num_neighbors, num_cols, row, x}, values_to_impute},
+                Nx.less(col, num_cols) do


krstopro · 2024-10-30T14:52:05Z

lib/scholar/impute/knn_imputter.ex

+        {_, values_to_impute} =
+          while {{col = 0, mask, num_neighbors, num_cols, row, x}, values_to_impute},
+                Nx.less(col, num_cols) do
+            if mask[row][col] > 0 do


I think if mask[row][col] do should work here.

polvalente · 2024-10-30T19:40:44Z

lib/scholar/impute/knn_imputter.ex

+    * `number_of_neighbors` is a positive integer.
+    *  number of neighbors must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value) otherwise it is better to use simple imputter


Suggested change

* `number_of_neighbors` is a positive integer.

* number of neighbors must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value) otherwise it is better to use simple imputter

* The number of neighbors must be less than the number of valid rows - 1.

A valid row is a row with more than 1 non-NaN values. Otherwise it is better to use a simpler imputer.

polvalente · 2024-10-30T19:40:58Z

lib/scholar/impute/knn_imputter.ex

+  Preconditions:
+    * `number_of_neighbors` is a positive integer.
+    *  number of neighbors must be less than number valid of rows - 1 (valid row is row with more than 1 non nan value) otherwise it is better to use simple imputter
+    *  when you set a value different than :nan in `missing_values` there should be no NaNs in the input tensor


Suggested change

- * when you set a value different than :nan in `missing_values` there should be no NaNs in the input tensor
+ * When you set a value different than `:nan` in `missing_values` there should be no NaNs in the input tensor

polvalente · 2024-10-30T19:41:48Z

lib/scholar/impute/knn_imputter.ex

+    * `:missing_values` - the same value as in `:missing_values`
+
+    * `:statistics` - The imputation fill value for each feature. Computing statistics can result in
+    [`Nx.Constant.nan/0`](https://hexdocs.pm/nx/Nx.Constants.html#nan/0) values.


Suggested change

- [`Nx.Constant.nan/0`](https://hexdocs.pm/nx/Nx.Constants.html#nan/0) values.
+ [`Nx.Constants.nan/0`](https://hexdocs.pm/nx/Nx.Constants.html#nan/0) values.

Do you need the explicit linking in hexdoc?

polvalente · 2024-10-30T19:42:23Z

lib/scholar/impute/knn_imputter.ex

+
+    The function returns a struct with the following parameters:
+
+    * `:missing_values` - the same value as in `:missing_values`


Suggested change

- * `:missing_values` - the same value as in `:missing_values`
+ * `:missing_values` - the same value as in the `:missing_values` option

polvalente · 2024-10-30T19:43:58Z

lib/scholar/impute/knn_imputter.ex

+
+    num_neighbors = opts[:number_of_neighbors]
+
+    placeholder_value = Nx.Constants.nan() |> Nx.tensor()


Suggested change

- placeholder_value = Nx.Constants.nan() |> Nx.tensor()
+ placeholder_value = Nx.Constants.nan()

you probably want to pass the input type here to avoid upcasts

polvalente · 2024-10-30T19:45:39Z

lib/scholar/impute/knn_imputter.ex

+
+  opts_schema = [
+    missing_values: [
+      type: {:or, [:float, :integer, {:in, [:nan]}]},


Suggested change

type: {:or, [:float, :integer, {:in, [:nan]}]},

type: {:or, [:float, :integer, {:in, [:nan]}]},

I believe this should allow :infinity and :neg_infinity too for completeness

polvalente · 2024-10-30T19:50:59Z

lib/scholar/impute/knn_imputter.ex

+              indices =
+                [Nx.stack(row), Nx.stack(col)]
+                |> Nx.concatenate()
+                |> Nx.stack()


Suggested change

- indices =
-   [Nx.stack(row), Nx.stack(col)]
-   |> Nx.concatenate()
-   |> Nx.stack()
+ indices = Nx.stack([row, col]) |> Nx.reshape({1, 2})

If I read the code correctly, row and col are scalars and this should yield the same result

polvalente · 2024-10-30T19:52:41Z

lib/scholar/impute/knn_imputter.ex

+                |> Nx.concatenate()
+                |> Nx.stack()
+
+              values_to_impute = Nx.indexed_put(values_to_impute, indices, Nx.stack(neighbor_avg))


Suggested change

- values_to_impute = Nx.indexed_put(values_to_impute, indices, Nx.stack(neighbor_avg))
+ values_to_impute = Nx.put_slice(values_to_impute, [row, col], Nx.reshape(neighbor_avg, {1, 1}))

I think this is even simpler
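A short sketch comparing the two ways of writing a single cell, assuming Nx is installed through Mix.install. The tensor contents and the integer `row`/`col` values are made up for illustration; inside the defn `while` loop they would be scalar tensors.

```elixir
Mix.install([{:nx, "~> 0.9"}])

values = Nx.broadcast(0.0, {2, 3})
row = 1
col = 2
avg = Nx.tensor(7.5)

# indexed_put expects indices of shape {n, rank} and updates of shape {n}.
indices = Nx.stack([row, col]) |> Nx.reshape({1, 2})
a = Nx.indexed_put(values, indices, Nx.reshape(avg, {1}))

# put_slice takes start indices plus a slice of the same rank as the target.
b = Nx.put_slice(values, [row, col], Nx.reshape(avg, {1, 1}))

IO.inspect(Nx.to_list(a))
IO.inspect(Nx.to_list(b))
```

Both calls write 7.5 at position {1, 2}; put_slice avoids building an index tensor for a single cell.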

polvalente · 2024-10-30T20:00:11Z

lib/scholar/impute/knn_imputter.ex

+    {_, row_distances} =
+      while {{i = 0, x, row_with_value_to_fill, rows, nan_row, nan_col}, row_distances},
+            Nx.less(i, rows) do
+        potential_donor = x[i]
+
+        distance =
+          if i == nan_row do
+            Nx.Constants.infinity(Nx.type(row_with_value_to_fill))
+          else
+            nan_euclidian(row_with_value_to_fill, nan_col, potential_donor)
+          end
+
+        row_distances = Nx.indexed_put(row_distances, Nx.new_axis(i, 0), distance)
+        {{i + 1, x, row_with_value_to_fill, rows, nan_row, nan_col}, row_distances}
+      end


try this:

potential_donors = Nx.vectorize(x, :rows)
distances = nan_euclidean(row_with_value_to_fill, nan_col, potential_donors) |> Nx.devectorize()
row_distances = Nx.indexed_put(distances, [i], Nx.Constants.infinity())
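The vectorization idea can be sketched on its own, assuming Nx is installed through Mix.install. A plain squared Euclidean distance stands in for the PR's NaN-aware distance function, and the data is made up for illustration.

```elixir
Mix.install([{:nx, "~> 0.9"}])

x = Nx.tensor([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
row = x[0]

# Treat every row of `x` as an independent entry along a vectorized :rows axis.
donors = Nx.vectorize(x, :rows)

# The expression is written for one donor row; Nx applies it to all of them.
distances =
  row
  |> Nx.subtract(donors)
  |> Nx.pow(2)
  |> Nx.sum()
  |> Nx.devectorize()

# Exclude the row itself by overwriting its own distance with infinity.
distances = Nx.indexed_put(distances, Nx.tensor([[0]]), Nx.tensor([:infinity]))

IO.inspect(Nx.to_list(distances))
```

This replaces the explicit `while` loop over rows with a single vectorized expression followed by one update for the row being imputed.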

srzeszut · 2024-11-28T11:49:43Z

Thanks for all the comments, I applied your suggested changes to the code.

Knn imputer

mix format

josevalim · 2024-11-28T21:10:29Z

💚 💙 💜 💛 ❤️

srzeszut added 5 commits October 20, 2024 15:17

add KNNImputer

d6c7a55

fix doctests

47b4a65

mix format

eb8f245

change placeholder_value to tensor

642b15e

fix doctest

520633a

josevalim reviewed Oct 21, 2024

josevalim reviewed Oct 21, 2024

josevalim reviewed Oct 21, 2024

josevalim reviewed Oct 21, 2024

josevalim requested review from krstopro and msluszniak October 21, 2024 07:22

msluszniak reviewed Oct 21, 2024

msluszniak reviewed Oct 21, 2024

msluszniak reviewed Oct 21, 2024

srzeszut added 2 commits October 27, 2024 11:20

apply suggested changes

926a1c7

added type tensors

a3e0eba

josevalim reviewed Oct 28, 2024

srzeszut and others added 2 commits October 28, 2024 12:49

Merge branch 'elixir-nx:main' into main

4757dd1

added tests and remove not working checks

108475d

josevalim reviewed Oct 29, 2024

josevalim reviewed Oct 29, 2024

krstopro reviewed Oct 30, 2024

change errors and refactor

1a3aae7

krstopro reviewed Oct 30, 2024

polvalente reviewed Oct 30, 2024

srzeszut added 3 commits November 27, 2024 13:26

apply suggestions

366584e

apply suggestions

f4b6c39

apply suggested changes

e23a9dd

srzeszut and others added 3 commits November 28, 2024 12:51

Merge pull request #1 from srzeszut/knn_imputer

071fb27

Knn imputer

mix format

d5913eb

Merge pull request #2 from srzeszut/knn_imputer

ac7fc1a

mix format

josevalim approved these changes Nov 28, 2024

msluszniak approved these changes Nov 28, 2024

josevalim merged commit c11afad into elixir-nx:main Nov 28, 2024

