## Benchmarking Perceptron


#### About profiling Julia code

- https://thirld.com/blog/2015/05/30/julia-profiling-cheat-sheet/
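
For a quick start, here is a minimal sketch of the built-in sampling profiler. It assumes Julia 1.x, where the profiler lives in the `Profile` standard library; on the 0.6 release used in this notebook the profiler was part of Base, so the `using` line should not be needed. `sum_of_squares` is a hypothetical workload, not part of the perceptron code.

```julia
using Profile  # standard library in Julia 1.x

# Hypothetical workload: sums the squares of a vector in a plain loop.
function sum_of_squares(x)
    s = zero(eltype(x))
    for v in x
        s += v * v
    end
    return s
end

x = rand(Float32, 10^6)
sum_of_squares(x)             # run once so compilation is not profiled

Profile.clear()               # discard samples from earlier runs
@profile for _ in 1:100
    sum_of_squares(x)
end
Profile.print(maxdepth = 12)  # tree view of where the samples were taken
```

Lines that collect many samples are the ones worth optimizing first.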

#### Examples of speeding up code

A small number of "tricks" can reduce execution time and memory allocations. Applying them is essential for getting C-like speed from Julia code.

- https://discourse.julialang.org/t/speed-up-this-code-game/3666

In [1]:
peakflops()

7.021506268094257e10

In [2]:
versioninfo()

Julia Version 0.6.0-rc1.0
Commit 6bdb3950bd (2017-05-07 00:00 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i7-4650U CPU @ 1.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)


In [3]:
using MNIST
using BenchmarkTools

In [4]:
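# Path to the "source" directory that sits next to the current working directory
source_path = joinpath(dirname(pwd()), "source")

# Add it to LOAD_PATH only once so the modules below can be found
if !(source_path in LOAD_PATH)
    push!(LOAD_PATH, source_path)
end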

using MulticlassPerceptron4
using MulticlassPerceptron3
using MulticlassPerceptron2
using MulticlassPerceptron1

percep1 = MulticlassPerceptron1.MPerceptron(Float32, 10, 784)
percep2 = MulticlassPerceptron2.MPerceptron(Float32, 10, 784)
percep3 = MulticlassPerceptron3.MPerceptron(Float32, 10, 784)
percep4 = MulticlassPerceptron4.MPerceptron(Float32, 10, 784)

n_classes = 10
n_features = 784

784

In [5]:
X_train, y_train = MNIST.traindata();
X_test, y_test = MNIST.testdata();
y_train = y_train + 1   # shift labels from 0-9 to 1-10 for 1-based indexing
y_test = y_test + 1;

T = Float32
# Min-max scale the pixel values to [0, 1], then convert to T
X_train = Array{T}((X_train - minimum(X_train))/(maximum(X_train) - minimum(X_train)))
y_train = Array{Int64}(y_train)
X_test = Array{T}((X_test - minimum(X_test))/(maximum(X_test) - minimum(X_test)))
y_test = Array{Int64}(y_test);

Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64
 [2] Array(::Type{Float64}, ::Int64, ::Int64) at ./deprecated.jl:51
 [3] traindata() at /Users/david/.julia/v0.6/MNIST/src/MNIST.jl:88
 [4] include_string(::String, ::String) at ./loading.jl:498
 [5] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /Users/david/.julia/v0.6/IJulia/src/execute_request.jl:156
 [6] eventloop(::ZMQ.Socket) at /Users/david/.julia/v0.6/IJulia/src/eventloop.jl:8
 [7] (::IJulia.##9#12)() at ./task.jl:335
while loading In[5], in expression starting on line 1

#### MulticlassPerceptron1

- Baseline implementation

In [6]:
@benchmark MulticlassPerceptron1.fit!(percep1, X_train, y_train, 1, 0.0001)

BenchmarkTools.Trial: 
  memory estimate:  578.33 MiB
  allocs estimate:  653326
  --------------
  minimum time:     768.152 ms (11.51% GC)
  median time:      807.828 ms (11.13% GC)
  mean time:        805.262 ms (11.15% GC)
  maximum time:     828.906 ms (9.91% GC)
  --------------
  samples:          7
  evals/sample:     1

#### MulticlassPerceptron2

- Using views instead of copying examples
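
To illustrate the difference between copying and viewing an example, here is a minimal sketch. It is not the actual MulticlassPerceptron2 source; `dot_loop`, `score_copy` and `score_view` are hypothetical helpers, and the data is random with an MNIST-like shape.

```julia
# Hypothetical data: 784 features per column, 1000 examples.
X = rand(Float32, 784, 1000)
w = rand(Float32, 784)

# Dot product written as a loop so that no extra import is needed.
function dot_loop(w, x)
    s = zero(eltype(w))
    for i in eachindex(w, x)
        s += w[i] * x[i]
    end
    return s
end

# X[:, j] copies column j into a new array on every iteration.
function score_copy(w, X)
    total = zero(eltype(w))
    for j in 1:size(X, 2)
        total += dot_loop(w, X[:, j])
    end
    return total
end

# view(X, :, j) refers to the same memory without copying it.
function score_view(w, X)
    total = zero(eltype(w))
    for j in 1:size(X, 2)
        total += dot_loop(w, view(X, :, j))
    end
    return total
end

score_copy(w, X); score_view(w, X)   # warm up (compile) both versions
bytes_copy = @allocated score_copy(w, X)
bytes_view = @allocated score_view(w, X)
println("copy: ", bytes_copy, " bytes, view: ", bytes_view, " bytes")
```

Both functions compute the same value; only the number of bytes allocated per call should differ.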

In [7]:
@benchmark MulticlassPerceptron2.fit!(percep2, X_train, y_train, 1, 0.0001)

BenchmarkTools.Trial: 
  memory estimate:  166.20 MiB
  allocs estimate:  402395
  --------------
  minimum time:     177.274 ms (14.12% GC)
  median time:      183.643 ms (13.81% GC)
  mean time:        185.941 ms (13.47% GC)
  maximum time:     212.337 ms (9.90% GC)
  --------------
  samples:          27
  evals/sample:     1

#### MulticlassPerceptron3

- Using views instead of copying examples
- Using @inbounds to skip bounds checking
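
As an illustration of the `@inbounds` item, here is a minimal sketch with two hypothetical dot-product helpers (not taken from MulticlassPerceptron3). It assumes ordinary 1-based arrays and checks the lengths once up front, since `@inbounds` removes the per-access safety net.

```julia
# Every w[i] and x[i] access is bounds-checked.
function dot_checked(w, x)
    s = zero(eltype(w))
    for i in 1:length(w)
        s += w[i] * x[i]
    end
    return s
end

# Lengths are validated once; the loop body then skips bounds checks.
function dot_inbounds(w, x)
    length(w) == length(x) || throw(DimensionMismatch("w and x must have equal length"))
    s = zero(eltype(w))
    @inbounds for i in 1:length(w)
        s += w[i] * x[i]
    end
    return s
end

w = rand(Float32, 784)
x = rand(Float32, 784)
println(dot_checked(w, x), " ", dot_inbounds(w, x))
```

Without the explicit length check, an out-of-range index inside an `@inbounds` block is undefined behavior rather than an error.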


In [8]:
@benchmark MulticlassPerceptron3.fit!(percep3, X_train, y_train, 1, 0.0001)

BenchmarkTools.Trial: 
  memory estimate:  137.62 MiB
  allocs estimate:  162630
  --------------
  minimum time:     163.708 ms (11.87% GC)
  median time:      173.860 ms (12.80% GC)
  mean time:        175.112 ms (13.11% GC)
  maximum time:     188.375 ms (12.59% GC)
  --------------
  samples:          29
  evals/sample:     1

#### MulticlassPerceptron4

- Using views instead of copying examples
- Preallocated vector for predicting all datapoints
- Using .* syntax for loop fusion
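
The preallocation and loop-fusion items can be sketched as follows. This is an illustration under assumed names (`update_alloc!`, `update_fused!`, `predict!`), not the MulticlassPerceptron4 source; the shapes match the 10 classes and 784 features used in this notebook.

```julia
# Allocating update: W[k, :], lr * x and their sum each create a temporary vector.
function update_alloc!(W, x, k, lr)
    W[k, :] = W[k, :] + lr * x
    return W
end

# Fused update: every operator is dotted, so the right-hand side is computed
# in a single loop and written straight into the view of row k.
function update_fused!(W, x, k, lr)
    row = view(W, k, :)
    row .= row .+ lr .* x
    return W
end

# Writes the predicted class of every column of X into the preallocated y_hat.
function predict!(y_hat, W, X)
    for j in 1:size(X, 2)
        best_k, best_score = 1, typemin(eltype(W))
        for k in 1:size(W, 1)
            s = zero(eltype(W))
            for i in 1:size(W, 2)
                s += W[k, i] * X[i, j]
            end
            if s > best_score
                best_k, best_score = k, s
            end
        end
        y_hat[j] = best_k
    end
    return y_hat
end

W = rand(Float32, 10, 784)
X = rand(Float32, 784, 5)
y_hat = zeros(Int, size(X, 2))   # allocated once, reused on every call
predict!(y_hat, W, X)
update_fused!(W, view(X, :, 1), 3, 0.0001f0)
```

Reusing `y_hat` across epochs avoids allocating a fresh prediction vector each time the training error is evaluated.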

In [9]:
@benchmark MulticlassPerceptron4.fit!(percep4, X_train, y_train, 1, 0.0001)

BenchmarkTools.Trial: 
  memory estimate:  50.28 MiB
  allocs estimate:  215918
  --------------
  minimum time:     126.900 ms (4.86% GC)
  median time:      136.944 ms (5.96% GC)
  mean time:        139.021 ms (6.53% GC)
  maximum time:     154.809 ms (6.75% GC)
  --------------
  samples:          36
  evals/sample:     1

#### MulticlassPerceptron5

**What else can be improved?**

**Can we push the code to a memory estimate of 0?**

**Are we really using BLAS to its fullest potential?**
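
One possible direction, offered as a sketch rather than as an answer: compute the scores of all datapoints with a single in-place matrix product, which for `Float32` matrices dispatches to a BLAS gemm call. The code assumes Julia 1.x (`mul!` from `LinearAlgebra`; on 0.6 the equivalent was `A_mul_B!`), and `argmax_columns!` is a hypothetical helper.

```julia
using LinearAlgebra  # provides mul! in Julia 1.x

W = rand(Float32, 10, 784)     # one weight row per class
X = rand(Float32, 784, 1000)   # one example per column

scores = Matrix{Float32}(undef, size(W, 1), size(X, 2))  # allocated once
y_hat = zeros(Int, size(X, 2))                            # allocated once

# Index of the largest score in each column, written into y_hat.
function argmax_columns!(y_hat, scores)
    for j in 1:size(scores, 2)
        best_k = 1
        for k in 2:size(scores, 1)
            if scores[k, j] > scores[best_k, j]
                best_k = k
            end
        end
        y_hat[j] = best_k
    end
    return y_hat
end

mul!(scores, W, X)               # scores = W * X without a new output array
argmax_columns!(y_hat, scores)
```

Whether this beats a per-example loop depends on how often the weights change between predictions, so it is worth benchmarking before adopting.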
