k-NN Algorithm in Golang, Haskell, Rust, F#, Julia, OCaml, Octave and Factor
This repository contains naive implementations of the k-NN algorithm in several languages (for k = 1), and an extra one in Golang with some optimizations.
This was inspired by a chain of blog posts:
- the last one implemented it in Factor
- the one before that did it in Rust.
- that one was inspired by a blog post which had it in F# and OCaml, and a follow-up which improves the first OCaml implementation.
This repository adds the naive implementation in Golang and Haskell, an additional implementation in Golang with some optimizations, and thanks to this guy, a couple implementations in Julia and Octave.
Performance comparisons between the naive implementations in each language were performed on a freshly spun up c3.xlarge EC2 instance as follows:
- Install Golang, Haskell, Rust, F#, OCaml and Octave, either with
apt-getor building from source
./configure && make && make install # ...etc. Install the latest nightly Julia build with
dpkg. Download Factor.
- Write the (naive) code for Golang and Haskell. Copy-paste the code for Rust, F#, OCaml, Julia, Octave, and Factor.
- Compile executable binaries for Haskell, Rust, F#, and OCaml. Run the Factor code in the
scratchpadREPL and the Octave code in the Octave REPL. Run the Golang code with
go runand the Julia code with
- Julia: 1.557s - 3.432s*
- Octave: 2.540s
- Golang: 4.701s
- Factor: 6.358s
- OCaml: 12.757s
- F#: 23.507s
- Rust: 78.138s
- Haskell: 91.581s
Julia has an asterisk because its initial run is slower than all subsequent runs. I'll have to rethink how to make this a fair experiment...
$ time go run golang-k-nn.go 0.944 real 0m4.701s user 0m4.582s sys 0m0.136s
$ time ./haskell-k-nn Percentage correct: 472 real 1m31.581s user 1m29.191s sys 0m2.384s
$ time ./rust-k-nn Percentage correct: 94.4% real 1m18.138s user 1m17.980s sys 0m0.155s
$ time ./fsharp-k-nn.exe start... Percentage correct:94.400000 real 0m23.507s user 0m22.751s sys 0m0.798s
$ julia julia/julia-k-nn.jl Percentage correct: 94.4% elapsed time: 3.438748733 seconds (656566752 bytes allocated, 2.03% gc time) Percentage correct: 94.4% elapsed time: 1.560125781 seconds (566037680 bytes allocated, 6.06% gc time)
$ time ./ocaml-k-nn Percentage correct:94.400000 real 0m12.757s user 0m12.500s sys 0m0.257s
$ cp octave/octave-k-nn.m octaveknn.m $ octave octave:1> octaveknn; Percentage correct: 94.400000% Elapsed time is 2.53968 seconds. $ rm octaveknn.m
$ mkdir -p $FACTOR_HOME/work/k-nn $ cp factor/factor-k-nn.factor $FACTOR_HOME/work/k-nn $ cp *.csv $FACTOR_HOME/work/k-nn $ $FACTOR_HOME/factor IN: scratchpad USE: factor-k-nn Loading resource:work/k-nn/factor-k-nn.factor Loading resource:basis/formatting/formatting.factor Loading resource:basis/formatting/formatting-docs.factor IN: scratchpad gc [ k-nn ] time Percentage correct: 94.400000 Running time: 6.357621145 seconds
Optimized implementation in Golang
For Golang, an additional implementation is given which is signficantly faster, but suffers no loss in accuracy. It involves two optimizations:
- Short-circuit distance calculations between a test case and a training case that are necessarily suboptimal. In other words, if you know the distance to one potential nearest neighbour is 100, and half-way through calculating the distance to another potential nearest neighbour you already have a distance-so-far of 105, stop calculating and move on to the next candidate for nearest neighbour.
- Use goroutines to parallelize the computations. The way this was done was not ideal, because the parallelism isn't in the classification algorithm itself, instead it parellelizes the classification of the members of the validation sample. However, it's is easy enough to "do it right", and what's currently there is good enough to see how significant the gains are when firing on all your cores.
$ time go run golang-k-nn-speedup.go 0.944 real 0m1.375s user 0m3.314s sys 0m0.117s
- Tell me if I should use special compiler flags to improve performance for some of the languages.
- Tell my why this experiment is invalid.
- Improve naive implementations without changing the spirit of the algorithm (e.g. use eager evaluation in Haskell).
- Add optimized implementations of k-NN which improve performance at no cost to accuracy.
- Add implementations for other languages (with compilation and/or run instructions).
- Make the Go stuff a useable package
- Explore Accuracy vs Time tradeoffs of only considering some of the training set or some of the pixels when classifying a test case
- Make it easy to experiment with a matrix of different options
- Do it in C
- Do it in Python with scikit-learn