## Benchmarking Perceptron


#### About profiling Julia code

- https://thirld.com/blog/2015/05/30/julia-profiling-cheat-sheet/
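
As a minimal sketch of the sampling profiler, assuming Julia 1.x (where `Profile` is a standard library that must be loaded explicitly; on the 0.6 release used in this notebook it lives in `Base`). The function `sum_of_squares` is a hypothetical workload, not part of the perceptron code.

```julia
using Profile

# Hypothetical workload, used only to have something to profile.
function sum_of_squares(n)
    total = 0.0
    for i in 1:n
        total += i^2
    end
    return total
end

sum_of_squares(10)             # run once so compilation is not profiled
Profile.clear()                # discard samples from earlier runs
@profile sum_of_squares(10^8)  # collect stack samples while the call runs
Profile.print(maxdepth = 12)   # print the call tree with sample counts
```

Lines with the highest sample counts are where most of the time is spent, which is usually the first place to look before applying any of the optimizations below.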

#### Examples of speeding up code

A small number of "tricks" can reduce execution time and memory allocations. Applying them is essential for getting C-like speed from Julia code.

- https://discourse.julialang.org/t/speed-up-this-code-game/3666

In [None]:
peakflops()

In [1]:
versioninfo()

Julia Version 0.6.0-rc2.0
Commit 68e911be53 (2017-05-18 02:31 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Xeon(R) CPU E5-1620 v2 @ 3.70GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)


In [2]:
using MNIST
using BenchmarkTools

# Add the sibling "source" directory to LOAD_PATH so the modules below can be found.
source_path = joinpath(dirname(pwd()), "source")

if !(source_path in LOAD_PATH)
    push!(LOAD_PATH, source_path)
end

using MulticlassPerceptron4
using MulticlassPerceptron3
using MulticlassPerceptron2
using MulticlassPerceptron1

percep1 = MulticlassPerceptron1.MPerceptron(Float32, 10, 784)
percep2 = MulticlassPerceptron2.MPerceptron(Float32, 10, 784)
percep3 = MulticlassPerceptron3.MPerceptron(Float32, 10, 784)
percep4 = MulticlassPerceptron4.MPerceptron(Float32, 10, 784)

n_classes = 10
n_features = 784

784

In [3]:
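X_train, y_train = MNIST.traindata();
X_test, y_test = MNIST.testdata();
# Shift the digit labels from 0-9 to 1-10 so they can be used as 1-based class indices.
y_train = y_train .+ 1
y_test = y_test .+ 1;

T = Float32
# Min-max scale the pixel values to [0, 1], then convert to T.
X_train = Array{T}((X_train .- minimum(X_train)) ./ (maximum(X_train) - minimum(X_train)))
y_train = Array{Int64}(y_train)
# The conversion wraps the whole expression so that X_test is also stored as T.
X_test = Array{T}((X_test .- minimum(X_test)) ./ (maximum(X_test) - minimum(X_test)))
y_test = Array{Int64}(y_test);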

Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64
 [2] Array(::Type{Float64}, ::Int64, ::Int64) at ./deprecated.jl:51
 [3] traindata() at /Users/macpro/.julia/v0.6/MNIST/src/MNIST.jl:88
 [4] include_string(::String, ::String) at ./loading.jl:515
 [5] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /Users/macpro/.julia/v0.6/IJulia/src/execute_request.jl:156
 [6] eventloop(::ZMQ.Socket) at /Users/macpro/.julia/v0.6/IJulia/src/eventloop.jl:8
 [7] (::IJulia.##9#12)() at ./task.jl:335
while loading In[3], in expression starting on line 1

#### MulticlassPerceptron1

- Baseline implementation, used as the reference for the versions below

In [4]:
@benchmark MulticlassPerceptron1.fit!(percep1, X_train, y_train, 1, 0.0001)

BenchmarkTools.Trial: 
  memory estimate:  585.20 MiB
  allocs estimate:  655558
  --------------
  minimum time:     1.057 s (4.24% GC)
  median time:      1.059 s (4.67% GC)
  mean time:        1.068 s (4.45% GC)
  maximum time:     1.102 s (3.81% GC)
  --------------
  samples:          5
  evals/sample:     1

#### MulticlassPerceptron2

- Using views instead of copying examples
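
To illustrate the idea, here is a minimal sketch assuming Julia 1.x and data stored with one example per column. The names `score_copy` and `score_view` are hypothetical and are not taken from `MulticlassPerceptron2`.

```julia
using LinearAlgebra

X = rand(Float32, 784, 1000)   # hypothetical data: one example per column
w = rand(Float32, 784)         # hypothetical weight vector

score_copy(w, X, j) = dot(w, X[:, j])         # X[:, j] copies the column into a new vector
score_view(w, X, j) = dot(w, view(X, :, j))   # view refers to the memory of X directly

score_copy(w, X, 1); score_view(w, X, 1)      # warm-up calls so compilation is not measured

bytes_copy = @allocated score_copy(w, X, 1)
bytes_view = @allocated score_view(w, X, 1)
println("copy: ", bytes_copy, " bytes, view: ", bytes_view, " bytes")
```

Both functions return the same score; the difference is that the slice allocates a fresh 784-element vector on every call, which adds up when it happens once per training example.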

In [5]:
@benchmark MulticlassPerceptron2.fit!(percep2, X_train, y_train, 1, 0.0001)

BenchmarkTools.Trial: 
  memory estimate:  195.33 MiB
  allocs estimate:  411863
  --------------
  minimum time:     746.240 ms (1.56% GC)
  median time:      751.973 ms (1.70% GC)
  mean time:        753.685 ms (1.80% GC)
  maximum time:     760.666 ms (1.67% GC)
  --------------
  samples:          7
  evals/sample:     1

#### MulticlassPerceptron3

- Using views instead of copying examples
- Using `@inbounds` to skip array bounds checks
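
A minimal sketch of how `@inbounds` is typically applied, assuming a hand-written loop. The helper `dot_inbounds` is hypothetical and only stands in for whatever inner loop `MulticlassPerceptron3` annotates.

```julia
# Hypothetical helper: dot product written as an explicit loop.
function dot_inbounds(w::AbstractVector{T}, x::AbstractVector{T}) where {T}
    acc = zero(T)
    # eachindex(w, x) checks once that both vectors share the same indices,
    # so the per-element bounds checks inside the loop can be skipped safely.
    @inbounds for i in eachindex(w, x)
        acc += w[i] * x[i]
    end
    return acc
end

dot_inbounds([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # 1*4 + 2*5 + 3*6 = 32.0
```

`@inbounds` removes a safety check, so it should only be used where the indices are known to be valid, as they are here.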


In [6]:
@benchmark MulticlassPerceptron3.fit!(percep3, X_train, y_train, 1, 0.0001)

BenchmarkTools.Trial: 
  memory estimate:  163.58 MiB
  allocs estimate:  171066
  --------------
  minimum time:     725.615 ms (1.62% GC)
  median time:      731.445 ms (1.84% GC)
  mean time:        732.291 ms (2.07% GC)
  maximum time:     741.522 ms (2.39% GC)
  --------------
  samples:          7
  evals/sample:     1

#### MulticlassPerceptron4

- Using views instead of copying examples
- Preallocated vector for predicting all datapoints
- Using dot syntax (`.*`) for loop fusion
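
The last two points can be sketched as follows, assuming Julia 1.x and a weight matrix laid out as features × classes. The functions `update!` and `predict!` and the layout are hypothetical illustrations, not the actual code of `MulticlassPerceptron4`.

```julia
using LinearAlgebra

# Hypothetical in-place update w <- w + lr * x.
# The dotted operators fuse into one loop that writes straight into w,
# so no temporary array is created for lr .* x.
function update!(w::AbstractVector, x::AbstractVector, lr::Real)
    w .= w .+ lr .* x
    return w
end

# Hypothetical prediction that fills a preallocated vector with the
# highest-scoring class for each column of X.
function predict!(y_pred::AbstractVector{Int}, W::AbstractMatrix, X::AbstractMatrix)
    S = promote_type(eltype(W), eltype(X))
    for j in axes(X, 2)
        best_class, best_score = 1, typemin(S)
        for c in axes(W, 2)
            s = dot(view(W, :, c), view(X, :, j))
            if s > best_score
                best_class, best_score = c, s
            end
        end
        y_pred[j] = best_class
    end
    return y_pred
end

w = zeros(Float32, 4)
update!(w, Float32[1, 2, 3, 4], 0.5f0)     # w is now Float32[0.5, 1.0, 1.5, 2.0]

W = Float32[1 0; 0 1]                      # 2 features × 2 classes
X = Float32[2 0; 0 3]                      # 2 features × 2 examples
y_pred = zeros(Int, size(X, 2))            # allocated once, reused on every call
predict!(y_pred, W, X)                     # y_pred is now [1, 2]
```

Allocating `y_pred` once outside the training loop means repeated prediction passes reuse the same memory instead of creating a new vector each time.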

In [7]:
@benchmark MulticlassPerceptron4.fit!(percep4, X_train, y_train, 1, 0.0001)

BenchmarkTools.Trial: 
  memory estimate:  61.61 MiB
  allocs estimate:  240446
  --------------
  minimum time:     710.364 ms (0.62% GC)
  median time:      715.198 ms (0.64% GC)
  mean time:        717.288 ms (0.67% GC)
  maximum time:     730.101 ms (0.77% GC)
  --------------
  samples:          7
  evals/sample:     1

#### MulticlassPerceptron5

**What else can be improved?**

**Can we push the memory estimate to 0?**

**Are we really using BLAS to its full potential?**
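
One possible direction for these questions, offered only as a sketch under assumptions: compute all class scores for an example with a single in-place matrix-vector product, and measure the allocations of that call directly. The code assumes Julia 1.x (on 0.6 the in-place product has a different name) and a hypothetical features × classes weight layout; `score!` is not part of any of the modules above.

```julia
using LinearAlgebra

W = rand(Float32, 784, 10)     # hypothetical weights: features × classes
x = rand(Float32, 784)         # hypothetical single example
scores = zeros(Float32, 10)    # preallocated output buffer

# In-place matrix-vector product: writes transpose(W) * x into scores.
# For dense Float32 arrays this is handled by the BLAS routine gemv.
score!(scores, W, x) = mul!(scores, transpose(W), x)

score!(scores, W, x)                       # warm-up call so compilation is not measured
bytes = @allocated score!(scores, W, x)    # bytes allocated by one scoring call
println("allocated per call: ", bytes, " bytes")
```

If the measured value is not zero, the remaining allocations point at the next place to optimize. Whether this beats the per-class loop on this data would need to be confirmed with `@benchmark`.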
