In [5]:
using Random
using StatsBase: sample, quantile
using Statistics: mean  # `mean` is used later to compare the two binarize outputs
using BenchmarkTools

In [43]:
function get_edges(X::AbstractMatrix{T}, nbins, rng = Random.MersenneTwister()) where {T}
    nobs = min(size(X, 1), 1000 * nbins)
    obs = rand(rng, 1:size(X, 1), nobs)
    edges = Vector{Vector{T}}(undef, size(X, 2))
    for i = 1:size(X, 2)
        edges[i] = quantile(view(X, obs, i), (1:nbins) / nbins)
        if length(edges[i]) == 0
            edges[i] = [minimum(view(X, obs, i))]
        end
    end
    return edges
end

get_edges (generic function with 2 methods)

Let us create a dataset with examples as rows and features as columns.

In [7]:
n_obs = 1_000_000
n_feat = 128

X = Random.rand(Float32,(n_obs, n_feat));

Note that the function that creates the histogram edges does the following:

- For each feature (index `i`), it takes `nobs` observations, sampled with replacement, whose indices are shared across features. It then uses `quantile` to compute the `nbins` bin edges.

Let us see how this is done for feature `i`.

In [8]:
i = 1
rng = Random.MersenneTwister()
nbins = 64
nobs = min(size(X, 1), 1000 * nbins)
obs = rand(rng, 1:size(X, 1), nobs);
edges = quantile(view(X, obs, i), (1:nbins) / nbins)

64-element Vector{Float64}:
 0.015616481192409992
 0.03124706633388996
 0.04708057828247547
 0.06305253878235817
 0.0787111148238182
 0.09383239969611168
 0.10966158472001553
 0.1262708231806755
 0.14198730792850256
 0.15775251388549805
 0.17343730572611094
 0.1880575492978096
 0.2040867004543543
 ⋮
 0.8283856455236673
 0.8434279188513756
 0.858756854198873
 0.8750673457980156
 0.8907013712450862
 0.9065513703972101
 0.9229941554367542
 0.9390546642243862
 0.9537015976384282
 0.969012301415205
 0.9844628907740116
 0.999945342540741
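
To make the role of the probabilities `(1:nbins) / nbins` concrete, here is a minimal sketch on a toy feature. The names `x` and `toy_edges` are hypothetical and only serve this illustration; the call shape mirrors the one used in `get_edges`. Since the last probability is `1.0`, the last edge coincides with the maximum of the sampled values.

```julia
using Statistics: quantile

# Hypothetical toy feature: the values 1.0, 2.0, ..., 100.0
x = collect(1.0:100.0)
nbins = 4

# Same call shape as in get_edges: one probability per bin, ending at 1.0
toy_edges = quantile(x, (1:nbins) / nbins)

@show toy_edges
# The last probability is 1.0, so the last edge is the sample maximum
@show toy_edges[end] == maximum(x)
```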

Note that, for a fixed number of features, the cost of `get_edges` stops growing once the dataset exceeds a minimum size. This happens because the number of observations used in the function is capped at `1000 * nbins`:

```
nobs = min(size(X, 1), 1000 * nbins)
```

So unless the dataset is quite small, the computational cost stops growing once the training data has more than `1000 * nbins` examples.


In [9]:
n_obs = 10_000
X = Random.rand(Float32,(n_obs, n_feat));
nobs = min(size(X, 1), 1000 * nbins)
println("Observations used with X with $n_obs data points is $nobs")

n_obs = 100_000
X = Random.rand(Float32,(n_obs, n_feat));
nobs = min(size(X, 1), 1000 * nbins)
println("Observations used with X with $n_obs data points is $nobs")

n_obs = 1_000_000
X = Random.rand(Float32,(n_obs, n_feat));
nobs = min(size(X, 1), 1000 * nbins)
println("Observations used with X with $n_obs data points is $nobs")

Observations used with X with 10000 data points is 10000
Observations used with X with 100000 data points is 64000
Observations used with X with 1000000 data points is 64000


Timing the routine with 10x the data gives very similar execution times:

In [10]:
n_obs = 100_000
n_feat = 128
nbins = 64

X = Random.rand(Float32,(n_obs, n_feat));
@btime get_edges($X, $nbins);

  590.855 ms (527 allocations: 31.89 MiB)


In [11]:
n_obs = 1_000_000
n_feat = 128
nbins = 64

X = Random.rand(Float32,(n_obs, n_feat));
@btime get_edges($X, $nbins);

  607.097 ms (527 allocations: 31.89 MiB)


## Feature quantization 

Once the edges for each feature are found, the data in `X` can be quantized (discretized).

In [12]:
function binarize(X, edges)
    x_bin = zeros(UInt8, size(X))
    for i = 1:size(X, 2)
        # Why is the last edge omitted here?
        @inbounds x_bin[:, i] .=
            searchsortedlast.(Ref(edges[i][1:end-1]), view(X, :, i)) .+ 1
    end
    return x_bin
end

function binarize2(X, edges)
    n_examples, n_features = size(X)
    x_bin = zeros(UInt8, (n_examples, n_features))
    @inbounds for j = 1:n_features
        edge_view = view(edges[j], 1:length(edges[j]) -1)
        for n in 1:n_examples
            x_bin[n, j] = searchsortedlast(edge_view, X[n, j]) + 1
        end
    end
    return x_bin
end

binarize2 (generic function with 1 method)
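
A likely reason for dropping the last edge in `binarize` and `binarize2` is to keep every bin index inside `1:nbins`. The sketch below uses hypothetical toy edges (`toy_edges`, `vals`) rather than the output of `get_edges`; the last edge stands in for the sample maximum.

```julia
# Hypothetical toy edges for nbins = 4; the last edge plays the role of the sample maximum
toy_edges = [0.25, 0.5, 0.75, 1.0]
vals = [0.1, 0.25, 0.6, 0.99, 1.0]

# Using every edge: a value equal to the last edge lands in bin nbins + 1
bins_all_edges = searchsortedlast.(Ref(toy_edges), vals) .+ 1

# Dropping the last edge keeps every bin index inside 1:nbins
bins_trimmed = searchsortedlast.(Ref(toy_edges[1:end-1]), vals) .+ 1

@show bins_all_edges
@show bins_trimmed
```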

#### Understanding **`searchsortedlast`** method

This method returns the index of the last value in `thresholds` that is less than or equal to the input, according to the specified order. If every value in `thresholds` is greater than the input, it returns 0.


In [142]:
thresholds = [x for x in (1:10)/10]
# returns 0 because 0.07 is < thresholds[1]
@show searchsortedlast(thresholds, 0.07) 

# returns 1 because 0.12 is >= thresholds[1] but < thresholds[2]
@show searchsortedlast(thresholds, 0.12) 

# returns 2 because 0.2 is >= thresholds[2] but < thresholds[3]
@show searchsortedlast(thresholds, 0.2) 

# returns 9 because 0.96 is >= thresholds[9] but < thresholds[10]
@show searchsortedlast(thresholds, 0.96) 

# returns 10 because 1.0 is >= thresholds[10], the last threshold
@show searchsortedlast(thresholds, 1.0) 

searchsortedlast(thresholds, 0.07) = 0
searchsortedlast(thresholds, 0.12) = 1
searchsortedlast(thresholds, 0.2) = 2
searchsortedlast(thresholds, 0.96) = 9
searchsortedlast(thresholds, 1.0) = 10


10

In [158]:
function custom_searchsortedlast(sorted_values, new_value)
    if new_value < sorted_values[1]
        return 0
    end
    
    @inbounds for k in 1:length(sorted_values)-1
        if new_value >= sorted_values[k] && new_value < sorted_values[k+1]
            return k
        end
    end
    return length(sorted_values)
end

custom_searchsortedlast (generic function with 1 method)

In [159]:
@show custom_searchsortedlast(thresholds, 0.07) 

# returns 1 because 0.12 is >= thresholds[1] but < thresholds[2]
@show custom_searchsortedlast(thresholds, 0.12) 

# returns 2 because 0.2 is >= thresholds[2] but < thresholds[3]
@show custom_searchsortedlast(thresholds, 0.2) 

# returns 9 because 0.96 is >= thresholds[9] but < thresholds[10]
@show custom_searchsortedlast(thresholds, 0.96) 

@show custom_searchsortedlast(thresholds, 1.0) 

custom_searchsortedlast(thresholds, 0.07) = 0
custom_searchsortedlast(thresholds, 0.12) = 1
custom_searchsortedlast(thresholds, 0.2) = 2
custom_searchsortedlast(thresholds, 0.96) = 9
custom_searchsortedlast(thresholds, 1.0) = 10


10

Note that our custom method is slower than the original: it scans the thresholds linearly, whereas `searchsortedlast` uses binary search.

In [171]:
thresholds = [x for x in (1:64)/100]

println("original method")
@btime searchsortedlast($thresholds, 0.17) 
@btime searchsortedlast($thresholds, 0.97)

println("custom method")
@btime custom_searchsortedlast($thresholds, 0.17) 
@btime custom_searchsortedlast($thresholds, 0.97) 

original method
  9.259 ns (0 allocations: 0 bytes)
  11.637 ns (0 allocations: 0 bytes)
custom method
  16.950 ns (0 allocations: 0 bytes)
  52.358 ns (0 allocations: 0 bytes)


64
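
The custom method takes longer for 0.97 than for 0.17, which is what a linear scan produces: the further the value lies in the vector, the more comparisons are needed. A binary search needs only about `log2(n)` comparisons wherever the value falls. The following is a minimal sketch of such a variant; `binary_searchsortedlast` is a hypothetical helper written for this illustration, not the Base implementation, and it assumes the input is sorted in ascending order.

```julia
# Hypothetical helper (not part of Base): binary-search counterpart of custom_searchsortedlast.
# Assumes sorted_values is sorted in ascending order.
function binary_searchsortedlast(sorted_values, new_value)
    lo = 0                          # invariant: every index <= lo holds a value <= new_value
    hi = length(sorted_values) + 1  # invariant: every index >= hi holds a value > new_value
    while hi - lo > 1
        mid = (lo + hi) ÷ 2         # midpoint, strictly between lo and hi
        if sorted_values[mid] <= new_value
            lo = mid
        else
            hi = mid
        end
    end
    return lo
end

# Check the sketch against Base on a few values, including both ends
toy_thresholds = collect((1:64) / 100)
for v in (0.0, 0.17, 0.64, 0.97)
    @show v, binary_searchsortedlast(toy_thresholds, v), searchsortedlast(toy_thresholds, v)
end
```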

In [82]:
thresholds = [x for x in (1:100)/100]
@btime searchsortedlast($thresholds, 0.95) 

  10.468 ns (0 allocations: 0 bytes)


95

In [84]:
thresholds = [x for x in (1:20)/20]
@btime searchsortedlast($thresholds, 0.95) 

  7.925 ns (0 allocations: 0 bytes)


19

In [44]:
thresholds = [x for x in (1:10)/10]
values_to_discretize = [0.12, 0.19, 0.2, 0.29, 0.989, 1.2]

@show thresholds;
@show values_to_discretize;
for value in values_to_discretize
    println("searchsortedlast(thresholds, $value) \t=$(searchsortedlast(thresholds, value))")
end

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
values_to_discretize = [0.12, 0.19, 0.2, 0.29, 0.989, 1.2]
searchsortedlast(thresholds, 0.12) 	=1
searchsortedlast(thresholds, 0.19) 	=1
searchsortedlast(thresholds, 0.2) 	=2
searchsortedlast(thresholds, 0.29) 	=2
searchsortedlast(thresholds, 0.989) 	=9
searchsortedlast(thresholds, 1.2) 	=10


We can broadcast with the dot syntax (`searchsortedlast.`), but `thresholds` must then be treated as a single element rather than iterated over. This is done by wrapping it in `Ref` or in a one-element vector.

In [408]:
length(thresholds), length([thresholds]), length(Ref(thresholds))

(100, 1, 1)

In [411]:
# won't work: broadcasting would pair the elements of `thresholds` and `values_to_discretize` one by one
# @show searchsortedlast.(thresholds, values_to_discretize);
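
A minimal sketch of what goes wrong without the wrapper, assuming the 10-element `thresholds` and the 6-element `values_to_discretize` used elsewhere in this notebook: broadcasting tries to match the two vectors element by element and fails because their lengths differ. The variable `err` is introduced here only to capture the exception.

```julia
thresholds = collect((1:10) / 10)
values_to_discretize = [0.12, 0.19, 0.2, 0.29, 0.989, 1.2]

# Without Ref, broadcasting tries to pair thresholds[k] with values_to_discretize[k];
# the lengths (10 and 6) differ, so it throws before any search happens
err = try
    searchsortedlast.(thresholds, values_to_discretize)
catch e
    e
end
@show typeof(err)

# With Ref, thresholds is treated as a single value shared by every element
bins = searchsortedlast.(Ref(thresholds), values_to_discretize)
@show bins
```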

In [393]:
thresholds = [x for x in (1:10)/10]
values_to_discretize = [0.12, 0.19, 0.2, 0.29, 0.989, 1.2]

@show thresholds;
@show values_to_discretize;
@show searchsortedlast.(Ref(thresholds), values_to_discretize);

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
values_to_discretize = [0.12, 0.19, 0.2, 0.29, 0.989, 1.2]
searchsortedlast.(Ref(thresholds), values_to_discretize) = [1, 1, 2, 2, 9, 10]


In [572]:
thresholds = [x for x in (1:10)/10]
values_to_discretize = [0.12, 0.19, 0.2, 0.29, 0.989, 1.2]

@show thresholds;
@show values_to_discretize;
@show searchsortedlast.([thresholds], values_to_discretize);

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
values_to_discretize = [0.12, 0.19, 0.2, 0.29, 0.989, 1.2]
searchsortedlast.([thresholds], values_to_discretize) = [1, 1, 2, 2, 9, 10]


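Timings are pretty similar. Note that the globals are not interpolated with `$` here, so both measurements include some global-variable overhead.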

In [402]:
@btime searchsortedlast.([thresholds], values_to_discretize);

  523.624 ns (4 allocations: 240 bytes)


In [403]:
@btime searchsortedlast.(Ref(thresholds), values_to_discretize);

  569.220 ns (4 allocations: 192 bytes)


In [475]:
@allocated Ref(edges[i][1:end-1])

384

In [479]:
@allocated view(edges[i],1:length(edges[i])-1)

80

Now let us benchmark `binarize` and `binarize2`.

In [594]:
n_obs = 100_000
n_feat = 128
nbins = 64

X = Random.rand(Float32,(n_obs, n_feat));
edges = get_edges(X, nbins);

In [595]:
Xbin1 = binarize(X, edges)
Xbin2 = binarize2(X, edges);

@show isequal(Xbin1, Xbin2)
@show mean(Xbin1 .== Xbin2)

isequal(Xbin1, Xbin2) = true
mean(Xbin1 .== Xbin2) = 1.0


1.0

In [596]:
@benchmark Xbin1 = binarize(X, edges)

BenchmarkTools.Trial: 12 samples with 1 evaluation.
 Range (min … max):  452.612 ms … 457.228 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     453.720 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   454.266 ms ±   1.462 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

In [4]:
@benchmark Xbin2 = binarize2(X, edges)

LoadError: LoadError: UndefVarError: @benchmark not defined
in expression starting at In[4]:1

This cell was run in a fresh session before `using BenchmarkTools` had been executed (note the `In [4]` counter), so no timing for `binarize2` was recorded.

#### A question arises here: is it faster to use `Ref` or `view`?

In [199]:
view(edges,4)[1]

64-element Vector{Float32}:
 0.016003508
 0.031283252
 0.046730824
 0.06210843
 0.0771625
 0.09262016
 0.10782797
 0.12365594
 0.13976175
 0.1546368
 0.16915064
 0.18447052
 0.20031089
 ⋮
 0.82807434
 0.8436005
 0.85880005
 0.8746571
 0.8899226
 0.9054942
 0.9215226
 0.9376114
 0.9524945
 0.9682254
 0.9840777
 0.9999978

In [180]:
@btime aux = Ref(edges[1][1:64]);

  202.422 ns (2 allocations: 352 bytes)


In [182]:
@btime aux = view(edges,1:64,1);

  78.173 ns (3 allocations: 160 bytes)
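
The two timings above are not a like-for-like comparison: `view(edges, 1:64, 1)` views the first 64 entries of the outer vector `edges`, not the elements of `edges[1]`. The cost in the first call also comes from the slice `edges[1][1:64]`, which copies the data, rather than from `Ref` itself. The sketch below illustrates the difference using only Base; `edge`, `sliced` and `viewed` are hypothetical names, and `edge` is a stand-in for one feature's edge vector.

```julia
# Hypothetical stand-in for one feature's edge vector
edge = Float32.((1:64) ./ 64)

sliced = edge[1:end-1]                  # slicing allocates a new Vector (a copy)
viewed = view(edge, 1:length(edge)-1)   # a view wraps the parent without copying its elements

# Mutating the parent is visible through the view but not through the copy
edge[1] = -1f0
@show sliced[1]
@show viewed[1]

# Ref only marks its argument as a scalar for broadcasting;
# it can wrap a view just as well as a copy
@show searchsortedlast.(Ref(viewed), Float32[0.5, 0.99])
```

Under these assumptions, a fair comparison for the question in the heading would time `Ref(view(edges[1], 1:64))` against `Ref(edges[1][1:64])`.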
