Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

knn classification #310

Merged
merged 3 commits into from
Nov 15, 2023
Merged

knn classification #310

merged 3 commits into from
Nov 15, 2023

Conversation

s-weil
Copy link
Contributor

@s-weil s-weil commented Oct 27, 2023

Thank you for contributing to FSharp.Stats. Please take the time to tell us a bit more about your PR.

Closes 300

Please list the changes introduced in this PR

Description

The KNN algorithm is implemented in different versions [see ML\Unsupervised\KNN.fs], namely

  • for arrays
  • sequences
  • via a classification object (with convencience methods)
    Code documentation, examples and unit tests are provided.

NOTE: further specialisations of the algorithm for improved performance exist and versions are possible (such as a vector version). Moreover one could consider the feature of parallelized predictions (may only make sense in case of a big point cloud).

  • The project builds without problems on your machine
  • Added unit tests regarding the added features

@codecov-commenter
Copy link

codecov-commenter commented Oct 27, 2023

Codecov Report

Attention: 32 lines in your changes are missing coverage. Please review.

Comparison is base (4719e96) 47.00% compared to head (8f887a5) 47.16%.
Report is 2 commits behind head on developer.

Files Patch % Lines
src/FSharp.Stats/ML/Unsupervised/KNN.fs 28.20% 25 Missing and 3 partials ⚠️
tests/FSharp.Stats.Tests/ML.fs 94.28% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           developer     #310      +/-   ##
=============================================
+ Coverage      47.00%   47.16%   +0.15%     
=============================================
  Files            148      149       +1     
  Lines          16458    16567     +109     
  Branches        2219     2230      +11     
=============================================
+ Hits            7736     7813      +77     
- Misses          8052     8077      +25     
- Partials         670      677       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

(x : 'a)
: 'l option =

if Seq.isEmpty points || Seq.length points <> Seq.length labels || k <= 0 then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to pass on the consumer the need to put both points and labels in a single sequence of tuples?

It would make the interface more conforming to the array implementation and save two calls to Seq.length.

Copy link
Contributor Author

@s-weil s-weil Oct 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks for the valuable feedback!
Indeed, I also believe it would be clearer to have a single sequence of tuples, and the array version uses the type 'LabeledPoint' (which is nothing more than a more descriptive tuple).

The reason for this outcome, was the motivation by the existing Python versions, see e.g. here (Step 2)
which uses the seperate arguments, as well as the need to construct the tuple seq in case you started with seperate data (ofc. the converse argument also holds).

That being said, I am happy to change it as suggested. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @smoothdeveloper that it would be beneficial if both points and labels are passed as single sequence. This reduces user errors by accidently resorting one of the collections. If you could make this small adjustment, I'll be happy to merge

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!
i replaced the seperated params by a tupled sequence.
also i removed the tupe LabeledPoint by just tuples, because it seemed as just overhead


open FSharp.Stats.DistanceMetrics

module Array =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, it would be better to put RequireQualifiedAccess attribute on the Array and Seq modules, as it is possible to open KNN.Array/Seq otherwise.

I'm assuming:

  • those members aren't destined for raw usage from client code
  • the attribute on KNN is to discourage ambiguous usage of predict to surface in client code
  • we are encouraging the usage of Classifier type

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, thanks!

So indeed I added the RequireQualifiedAccess to prevent misuse of the Classifier and predict functions.
Moving the RequireQualifiedAccess to Array and Seq would leave the Classifier 'unprotected' (currently it needs to constructued with KNN.Classifier) and I could move it to Array too or some new module or simply rename it to KnnClassifier.

The intention of the Classifier versus the Array.predict / Seq.predict versions was a more Python style version more functional style use. Moreover the Classifier provides convencience methods to construct the data in the required format, and to run single and multiple predictions.
Ultimately I am not sure of what the best versions are in terms of best 'user experience'.

Happy for your thoughts and feedback!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the option to have this convenience layer! I think it is okay to have the requiredAccess added to KNN even if one could open KNN.Array. It allows to have Classifier protected and prevent confusion

member this.fit(lps : LabeledPoint<'a, 'l> array) =
this.labeledPoints <- lps

member this.fit(points : 'a array, labels : 'l array) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggesting adding a comment that explains it will fail if both arrays aren't of same length

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, will add it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have XML comments here 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added XML comments


open FSharp.Stats.DistanceMetrics

module Array =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the option to have this convenience layer! I think it is okay to have the requiredAccess added to KNN even if one could open KNN.Array. It allows to have Classifier protected and prevent confusion

(x : 'a)
: 'l option =

if Seq.isEmpty points || Seq.length points <> Seq.length labels || k <= 0 then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @smoothdeveloper that it would be beneficial if both points and labels are passed as single sequence. This reduces user errors by accidently resorting one of the collections. If you could make this small adjustment, I'll be happy to merge

member this.fit(lps : LabeledPoint<'a, 'l> array) =
this.labeledPoints <- lps

member this.fit(points : 'a array, labels : 'l array) =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to have XML comments here 👍

@bvenn
Copy link
Member

bvenn commented Nov 15, 2023

Thanks for this awsome and well prepared addition @s-weil!

@bvenn bvenn merged commit f9971c5 into fslaborg:developer Nov 15, 2023
2 checks passed
@s-weil s-weil deleted the feature/knn branch November 15, 2023 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kNN classification
4 participants