Implementation of k-nearest neighbors (KNN) using PySpark. The classifier was applied to two separate datasets: iris (https://archive.ics.uci.edu/ml/datasets/iris) and fertility (https://archive.ics.uci.edu/ml/datasets/Fertility). The data was first normalized, also using PySpark. Euclidean distance was used as the distance measure, so the nearest neighbors are the training points with the smallest distance to the query. The optimal k found for both datasets was 5. The iris dataset had a test accuracy of 97% and the fertility dataset had a test accuracy of 88%.
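The pipeline described above (normalize, compute Euclidean distances, vote among the k nearest points) can be sketched as follows. This is a minimal illustration in plain Python rather than PySpark, so that it runs without a Spark installation; it is not the repository's code. The function names, the choice of min-max scaling as the normalization, and the toy data points are all assumptions made for the example.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Rescale each feature column to [0, 1] (assumed normalization scheme).

    Constant columns are mapped to 0.0 to avoid dividing by zero.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=5):
    """Return the majority label among the k training points nearest to query."""
    distances = sorted(
        ((euclidean(point, query), label)
         for point, label in zip(train_points, train_labels)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy data (not taken from the iris or fertility datasets).
features = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
            (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

# Normalize the query together with the training rows so both share one scale.
normalized = min_max_normalize(features + [(1.1, 1.0)])
train, query = normalized[:-1], normalized[-1]
prediction = knn_predict(train, labels, query, k=3)
```

In a Spark version, the distance computation would typically be a map over the training rows followed by taking the k smallest distances, which is the same logic as the `sorted(...)[:k]` step here.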
ZachPetroff/KNN-With-Spark
Implementation of K-Nearest Neighbors Algorithm Using PySpark