[FLINK-24810] Add Estimator and Model for the k-means clustering algorithm #27

Closed
wants to merge 1 commit into from

Conversation

lindong28
Member

@lindong28 lindong28 commented Nov 6, 2021

What is the purpose of the change

This PR adds the Estimator and Model classes for the KMeans algorithm.

Brief change log

This PR makes the following changes:

  • Added KMeans, KMeansModel and a few other supporting classes.
  • Added KMeansTest to validate the behavior of KMeans and KMeansModel.
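For readers unfamiliar with the Estimator/Model split, here is a minimal in-memory sketch of the pattern: an estimator whose `fit` produces a model, and a model that assigns points to the nearest centroid. `SimpleKMeans` and `SimpleKMeansModel` are hypothetical names, the points are plain `double[]` arrays, and the training loop is a standard Lloyd's iteration seeded with the first k points. None of this is taken from the PR's code, which operates on Flink `Table` inputs.

```java
import java.util.Arrays;

// Hypothetical, in-memory sketch; not the classes added by this PR.
final class SimpleKMeansModel {
    final double[][] centroids;

    SimpleKMeansModel(double[][] centroids) {
        this.centroids = centroids;
    }

    // Returns the index of the centroid closest to the point (squared Euclidean distance).
    int predict(double[] point) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}

final class SimpleKMeans {
    final int k;
    final int maxIter;

    SimpleKMeans(int k, int maxIter) {
        this.k = k;
        this.maxIter = maxIter;
    }

    // Lloyd's iteration, seeded with the first k points so the result is deterministic.
    SimpleKMeansModel fit(double[][] points) {
        if (points.length < k) {
            throw new IllegalArgumentException("need at least k points");
        }
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = Arrays.copyOf(points[c], dim);
        }
        for (int iter = 0; iter < maxIter; iter++) {
            SimpleKMeansModel current = new SimpleKMeansModel(centroids);
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = current.predict(p);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            double[][] next = new double[k][dim];
            for (int c = 0; c < k; c++) {
                for (int d = 0; d < dim; d++) {
                    // Keep the old centroid if no point was assigned to it.
                    next[c][d] = counts[c] == 0 ? centroids[c][d] : sums[c][d] / counts[c];
                }
            }
            if (Arrays.deepEquals(next, centroids)) {
                break; // assignments no longer move the centroids
            }
            centroids = next;
        }
        return new SimpleKMeansModel(centroids);
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 1}, {10, 11}};
        SimpleKMeansModel model = new SimpleKMeans(2, 20).fit(points);
        System.out.println(model.predict(new double[] {0.0, 0.2})); // prints 0
        System.out.println(model.predict(new double[] {9.0, 9.0})); // prints 1
    }
}
```

Keeping training (`fit`) and inference (`predict`) in separate classes means the fitted centroids can be stored and reused without rerunning the training loop.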

Verifying this change

The changes are validated by the unit tests in KMeansTest.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (Java doc)

@lindong28
Member Author

@gaoyunhaii Could you review this PR?

@lindong28 lindong28 force-pushed the FLINK-24810 branch 2 times, most recently from fc1dd58 to 572eaf5 Compare November 15, 2021 15:09
@lindong28
Member Author

Thanks for your review @gaoyunhaii @yunfengzhou-hub! All comments have been addressed.

Contributor

@gaoyunhaii gaoyunhaii left a comment

Many thanks @lindong28 for opening the PR! Will merge after the tests turn green.

public KMeansModel fit(Table... inputs) {
    StreamTableEnvironment tEnv =
            (StreamTableEnvironment) ((TableImpl) inputs[0]).getTableEnvironment();
    DataStream<DenseVector> points =
Contributor

Maybe DataStream<Vector> is enough here?

Member Author

Simply making this change will break compilation. Maybe we can make this work by also changing many other places. But I am not sure it is useful to make this change given that we don't support SparseVector yet.

Maybe we can make this change after SparseVector is supported. Then we can write unit tests to validate that KMeans (as well as other algorithms) can support both SparseVector and DenseVector as inputs.
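One likely reason the one-line change breaks compilation is that Java generics are invariant: a container of `DenseVector` is not a container of `Vector`, even though `DenseVector` is a `Vector`. The sketch below illustrates this with `java.util.List` standing in for `DataStream`. `SimpleVector`, `SimpleDenseVector` and `VarianceDemo` are hypothetical stand-ins, not the Flink ML types, and the actual compile errors in the PR may have other causes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the vector types discussed above; not the Flink ML classes.
interface SimpleVector {
    double get(int i);

    int size();
}

final class SimpleDenseVector implements SimpleVector {
    final double[] values;

    SimpleDenseVector(double... values) {
        this.values = values;
    }

    public double get(int i) {
        return values[i];
    }

    public int size() {
        return values.length;
    }
}

final class VarianceDemo {
    // Accepts only List<SimpleDenseVector>; passing a List<SimpleVector> is a compile error.
    static double sumDense(List<SimpleDenseVector> points) {
        double sum = 0.0;
        for (SimpleDenseVector p : points) {
            for (double v : p.values) {
                sum += v;
            }
        }
        return sum;
    }

    // A bounded wildcard accepts a list of any SimpleVector subtype.
    static double sumAny(List<? extends SimpleVector> points) {
        double sum = 0.0;
        for (SimpleVector p : points) {
            for (int i = 0; i < p.size(); i++) {
                sum += p.get(i);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        List<SimpleDenseVector> dense = new ArrayList<>();
        dense.add(new SimpleDenseVector(1.0, 2.0));
        dense.add(new SimpleDenseVector(3.0, 4.0));

        // List<SimpleVector> widened = dense; // does not compile: generics are invariant
        List<? extends SimpleVector> widened = dense; // compiles

        System.out.println(sumDense(dense)); // prints 10.0
        System.out.println(sumAny(widened)); // prints 10.0
    }
}
```

Widening the element type therefore tends to ripple through every method signature that names the concrete type, which is consistent with the remark that many other places would need to change as well.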

@lindong28 lindong28 deleted the FLINK-24810 branch November 16, 2021 08:10