amitkgupta/nearest_neighbour

k-NN Algorithm in Golang, Haskell, Rust, F#, Julia, OCaml, Octave and Factor

This repository contains naive implementations of the k-NN algorithm in several languages (for k = 1), and an extra one in Golang with some optimizations.
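
As a rough illustration of what "naive" means here, the sketch below classifies a query by scanning every training observation and keeping the closest one. It is not the code from this repository: the `Observation` type and the function names are hypothetical, and squared Euclidean distance over the feature vector is an assumption.

```go
package main

// Observation is a hypothetical record type: a class label plus a feature
// vector (for example, the pixel intensities of one image).
type Observation struct {
	Label    string
	Features []float64
}

// squaredDistance returns the squared Euclidean distance between two
// equal-length vectors. The square root is skipped because it does not
// change which neighbour is closest.
func squaredDistance(a, b []float64) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// classify returns the label of the training observation closest to the
// query (k-NN with k = 1). It assumes training is non-empty.
func classify(training []Observation, query []float64) string {
	best := training[0]
	bestDist := squaredDistance(best.Features, query)
	for _, obs := range training[1:] {
		if d := squaredDistance(obs.Features, query); d < bestDist {
			best, bestDist = obs, d
		}
	}
	return best.Label
}

// accuracy is the fraction of validation observations whose predicted
// label matches their true label.
func accuracy(training, validation []Observation) float64 {
	correct := 0
	for _, v := range validation {
		if classify(training, v.Features) == v.Label {
			correct++
		}
	}
	return float64(correct) / float64(len(validation))
}
```

Calling `accuracy(training, validation)` from a `main` function would give the kind of fraction reported in the results below. Every query costs one full pass over the training set, which is why the naive version makes a reasonable cross-language benchmark.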

Blog Chain

This was inspired by a chain of blog posts:

This repository adds naive implementations in Golang and Haskell, an additional implementation in Golang with some optimizations, and, thanks to this guy, a couple of implementations in Julia and Octave.

Comparison

Performance comparisons between the naive implementations in each language were run on a freshly spun-up c3.xlarge EC2 instance as follows:

  1. Install Golang, Haskell, Rust, F#, OCaml and Octave, either with apt-get or by building from source (./configure && make && make install, etc.). Install the latest nightly Julia build with dpkg. Download Factor.
  2. Write the (naive) code for Golang and Haskell. Copy-paste the code for Rust, F#, OCaml, Julia, Octave, and Factor.
  3. Compile executable binaries for Haskell, Rust, F#, and OCaml. Run the Factor code in the scratchpad REPL and the Octave code in the Octave REPL. Run the Golang code with go run and the Julia code with julia.

Results

  1. Julia: 1.557s - 3.432s*
  2. Octave: 2.540s
  3. Golang: 4.701s
  4. Factor: 6.358s
  5. OCaml: 12.757s
  6. F#: 23.507s
  7. Rust: 78.138s
  8. Haskell: 91.581s

Julia has an asterisk because its initial run is slower than all subsequent runs. I'll have to rethink how to make this a fair experiment...

Golang

$ time go run golang-k-nn.go

0.944

real  0m4.701s
user  0m4.582s
sys 0m0.136s

Haskell

$ time ./haskell-k-nn

Percentage correct: 472

real  1m31.581s
user  1m29.191s
sys 0m2.384s

Rust

$ time ./rust-k-nn

Percentage correct: 94.4%

real	1m18.138s
user	1m17.980s
sys	0m0.155s

F#

$ time ./fsharp-k-nn.exe

start...
Percentage correct:94.400000

real	0m23.507s
user	0m22.751s
sys	0m0.798s

Julia

$ julia julia/julia-k-nn.jl

Percentage correct: 94.4%
elapsed time: 3.438748733 seconds (656566752 bytes allocated, 2.03% gc time)
Percentage correct: 94.4%
elapsed time: 1.560125781 seconds (566037680 bytes allocated, 6.06% gc time)

OCaml

$ time ./ocaml-k-nn

Percentage correct:94.400000

real	0m12.757s
user	0m12.500s
sys	0m0.257s

Octave

$ cp octave/octave-k-nn.m octaveknn.m
$ octave

octave:1> octaveknn;
Percentage correct: 94.400000%
Elapsed time is 2.53968 seconds.

$ rm octaveknn.m

Factor

$ mkdir -p $FACTOR_HOME/work/k-nn
$ cp factor/factor-k-nn.factor $FACTOR_HOME/work/k-nn
$ cp *.csv $FACTOR_HOME/work/k-nn
$ $FACTOR_HOME/factor

IN: scratchpad USE: factor-k-nn
Loading resource:work/k-nn/factor-k-nn.factor
Loading resource:basis/formatting/formatting.factor
Loading resource:basis/formatting/formatting-docs.factor

IN: scratchpad gc [ k-nn ] time
Percentage correct: 94.400000
Running time: 6.357621145 seconds

Optimized implementation in Golang

For Golang, an additional implementation is given which is significantly faster but suffers no loss in accuracy. It involves two optimizations:

  1. Short-circuit distance calculations between a test case and a training case that are necessarily suboptimal. In other words, if you know the distance to one potential nearest neighbour is 100, and half-way through calculating the distance to another potential nearest neighbour you already have a distance-so-far of 105, stop calculating and move on to the next candidate for nearest neighbour.
  2. Use goroutines to parallelize the computations. The way this was done is not ideal: the parallelism isn't in the classification algorithm itself; instead, it parallelizes the classification of the members of the validation sample. However, it is easy enough to "do it right", and what's currently there is good enough to see how significant the gains are when firing on all your cores.

$ time go run golang-k-nn-speedup.go

0.944

real  0m1.375s
user  0m3.314s
sys 0m0.117s
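
The two optimizations described above can be sketched roughly as follows. This is an illustration under assumptions, not the contents of golang-k-nn-speedup.go: the `Observation` type and the names `boundedDistance`, `classify` and `classifyAll` are hypothetical, squared Euclidean distance is assumed, and one goroutine is started per validation observation for simplicity.

```go
package main

import (
	"math"
	"sync"
)

// Observation is a hypothetical record type: a label plus a feature vector.
type Observation struct {
	Label    string
	Features []float64
}

// boundedDistance accumulates the squared Euclidean distance between a and b,
// but gives up as soon as the running total reaches bound, since that
// candidate can no longer beat the best neighbour found so far.
// The second result reports whether the full distance was computed.
func boundedDistance(a, b []float64, bound float64) (float64, bool) {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
		if sum >= bound {
			return sum, false
		}
	}
	return sum, true
}

// classify returns the label of the nearest training observation, using
// boundedDistance to skip the tail of hopeless candidates.
func classify(training []Observation, query []float64) string {
	label, best := "", math.Inf(1)
	for _, obs := range training {
		if d, ok := boundedDistance(obs.Features, query, best); ok {
			label, best = obs.Label, d
		}
	}
	return label
}

// classifyAll classifies each validation observation in its own goroutine.
// Each goroutine writes only to its own slot of the result slice, so the
// WaitGroup is the only synchronization needed.
func classifyAll(training, validation []Observation) []string {
	predicted := make([]string, len(validation))
	var wg sync.WaitGroup
	for i := range validation {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			predicted[i] = classify(training, validation[i].Features)
		}(i)
	}
	wg.Wait()
	return predicted
}
```

Because a candidate is abandoned only once its partial distance already matches or exceeds the current best, the chosen neighbour is the same one the naive scan would pick, which is why accuracy is unaffected. As in the description above, the parallelism here is across validation observations rather than inside a single classification.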

Contributing

  1. Tell me if I should use special compiler flags to improve performance for some of the languages.
  2. Tell me why this experiment is invalid.
  3. Improve naive implementations without changing the spirit of the algorithm (e.g. use eager evaluation in Haskell).
  4. Add optimized implementations of k-NN which improve performance at no cost to accuracy.
  5. Add implementations for other languages (with compilation and/or run instructions).

TODO

  • Make the Go stuff a usable package
  • Explore accuracy vs. time tradeoffs of only considering some of the training set or some of the pixels when classifying a test case
  • Make it easy to experiment with a matrix of different options
  • Do it in C
  • Do it in Python with scikit-learn
