In [1]:
using Random
using StatsBase: sample, quantile
using BenchmarkTools

In [2]:
function get_edges(X::AbstractMatrix{T}, nbins, rng = Random.MersenneTwister()) where {T}
    nobs = min(size(X, 1), 1000 * nbins)
    obs = rand(rng, 1:size(X, 1), nobs)
    edges = Vector{Vector{T}}(undef, size(X, 2))
    for i = 1:size(X, 2)
        edges[i] = quantile(view(X, obs, i), (1:nbins) / nbins)
        if length(edges[i]) == 0
            edges[i] = [minimum(view(X, obs, i))]
        end
    end
    return edges
end

get_edges (generic function with 2 methods)

Let us create a dataset whose rows are examples and whose columns are features.

In [3]:
n_obs = 1_000_000
n_feat = 128

X = Random.rand(Float32,(n_obs, n_feat));

Note that the function that creates the edges of the histogram does the following:

- For each feature (index `i`), it takes `nobs` observations, sampled with replacement, where the sampled row indices are shared across features. It then uses `quantile` to compute `nbins` edges, one per quantile level `k / nbins` for `k = 1, ..., nbins`.
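To make the quantile step concrete, here is a minimal sketch on a toy feature. The names `toy_feature`, `toy_nbins` and `toy_edges` are hypothetical and not part of the notebook, and the sketch assumes the `quantile` function from the `Statistics` standard library behaves like the one imported through StatsBase above.

```julia
using Statistics: quantile

# Hypothetical toy feature: the values 1.0, 2.0, ..., 100.0
toy_feature = collect(1.0:100.0)
toy_nbins = 4

# One edge per quantile level 1/4, 2/4, 3/4, 4/4
toy_edges = quantile(toy_feature, (1:toy_nbins) / toy_nbins)
@show toy_edges

# The last level is 1.0, so the last edge is the maximum of the sample
@show toy_edges[end] == maximum(toy_feature)
```

The point to notice is that the last edge always coincides with the largest sampled value, which matters later when the data are assigned to bins.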

Let us see how this is done for feature `i`.

In [4]:
i = 1
rng = Random.MersenneTwister()
nbins = 64
nobs = min(size(X, 1), 1000 * nbins)
obs = rand(rng, 1:size(X, 1), nobs);
edges = quantile(view(X, obs, i), (1:nbins) / nbins)

64-element Vector{Float64}:
 0.015942998230457306
 0.03193836845457554
 0.04804190155118704
 0.06415528431534767
 0.07991458475589752
 0.0956811960786581
 0.1104482663795352
 0.1259417086839676
 0.14037126395851374
 0.155155211687088
 0.17153337504714727
 0.1877315305173397
 0.2034794483333826
 ⋮
 0.8292878717184067
 0.8454529084265232
 0.8618549490347505
 0.8770863637328148
 0.8922181064262986
 0.9079647399485111
 0.9230346521362662
 0.9396149516105652
 0.9547119196504354
 0.9699439462274313
 0.9846006790176034
 0.9999920725822449

Note that, for a fixed number of features, the cost of `get_edges` stops growing once the dataset exceeds a minimum size. This happens because the number of observations used in the function is capped at `1000 * nbins`:

```
nobs = min(size(X, 1), 1000 * nbins)
```

So unless the dataset has fewer than `1000 * nbins` rows, the cost of computing the edges essentially does not depend on the number of examples in the training data.


In [5]:
n_obs = 10_000
X = Random.rand(Float32,(n_obs, n_feat));
nobs = min(size(X, 1), 1000 * nbins)
println("Observations used with X with $n_obs data points is $nobs")

n_obs = 100_000
X = Random.rand(Float32,(n_obs, n_feat));
nobs = min(size(X, 1), 1000 * nbins)
println("Observations used with X with $n_obs data points is $nobs")

n_obs = 1_000_000
X = Random.rand(Float32,(n_obs, n_feat));
nobs = min(size(X, 1), 1000 * nbins)
println("Observations used with X with $n_obs data points is $nobs")

Observations used with X with 10000 data points is 10000
Observations used with X with 100000 data points is 64000
Observations used with X with 1000000 data points is 64000


## Feature quantization 

Once the edges for each feature have been found, the data in `X` can be quantized (discretized).

In [6]:
function binarize(X, edges)
    x_bin = zeros(UInt8, size(X))
    for i = 1:size(X, 2)
        # Why is the last bin omitted here?
        @inbounds x_bin[:, i] .=
            searchsortedlast.(Ref(edges[i][1:end-1]), view(X, :, i)) .+ 1
    end
    return x_bin
end

function binarize2(X, edges)
    n_examples, n_features = size(X)
    x_bin = zeros(UInt8, (n_examples, n_features))
    @inbounds for j = 1:n_features
        edge_view = view(edges[j], 1:length(edges[j])-1)
        for n in 1:n_examples
            x_bin[n, j] = searchsortedlast(edge_view, X[n, j]) + 1
        end
    end
    return x_bin
end

binarize2 (generic function with 1 method)
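A small worked example may help with the question raised in the comment inside `binarize`. The sketch below uses hypothetical toy edges (`toy_edges`, `toy_values` are illustrative names, not part of the notebook) and applies the same rule as `binarize`, once without the last edge and once with it.

```julia
# Hypothetical edges for one feature with nbins = 4; the last edge plays the
# role of the sample maximum, as produced by the 1.0 quantile level.
toy_edges = [0.25, 0.5, 0.75, 1.0]
toy_values = [0.1, 0.25, 0.6, 0.9, 1.0]

# Same rule as in `binarize`: search among all edges except the last, then add 1.
toy_bins = searchsortedlast.(Ref(toy_edges[1:end-1]), toy_values) .+ 1
@show toy_bins               # [1, 2, 3, 4, 4]

# Keeping the last edge sends values equal to the maximum to an extra bin.
toy_bins_all_edges = searchsortedlast.(Ref(toy_edges), toy_values) .+ 1
@show toy_bins_all_edges     # [1, 2, 3, 4, 5]
```

Under these assumptions, dropping the last edge keeps the bin indices in `1:nbins`, which is one plausible reason for the omission. Since `x_bin` stores `UInt8` values, the scheme also requires `nbins` to be at most 255.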

#### Understanding **`searchsortedlast`** method

This method returns the index of the last value in `thresholds` that is less than or equal to the input, according to the specified order. If the input is smaller than every value in `thresholds`, it returns 0.


In [7]:
thresholds = [x for x in (1:10)/10]
# returns 0 because 0.07 < thresholds[1]
@show searchsortedlast(thresholds, 0.07)

# returns 1 because 0.12 >= thresholds[1] but < thresholds[2]
@show searchsortedlast(thresholds, 0.12)

# returns 2 because 0.2 >= thresholds[2] but < thresholds[3]
@show searchsortedlast(thresholds, 0.2)

# returns 9 because 0.96 >= thresholds[9] but < thresholds[10]
@show searchsortedlast(thresholds, 0.96)

# returns 10 because 1.0 >= thresholds[10], the last threshold
@show searchsortedlast(thresholds, 1.0)

searchsortedlast(thresholds, 0.07) = 0
searchsortedlast(thresholds, 0.12) = 1
searchsortedlast(thresholds, 0.2) = 2
searchsortedlast(thresholds, 0.96) = 9
searchsortedlast(thresholds, 1.0) = 10


10

In [8]:
function custom_searchsortedlast(sorted_values, new_value)
    if new_value < sorted_values[1]
        return 0
    end
    
    @inbounds for k in 1:length(sorted_values)-1
        if new_value >= sorted_values[k] && new_value < sorted_values[k+1]
            return k
        end
    end
    return length(sorted_values)
end

custom_searchsortedlast (generic function with 1 method)

In [9]:
# returns 0 because 0.07 < thresholds[1]
@show custom_searchsortedlast(thresholds, 0.07)

# returns 1 because 0.12 >= thresholds[1] but < thresholds[2]
@show custom_searchsortedlast(thresholds, 0.12)

# returns 2 because 0.2 >= thresholds[2] but < thresholds[3]
@show custom_searchsortedlast(thresholds, 0.2)

# returns 9 because 0.96 >= thresholds[9] but < thresholds[10]
@show custom_searchsortedlast(thresholds, 0.96)

# returns 10 because 1.0 >= thresholds[10], the last threshold
@show custom_searchsortedlast(thresholds, 1.0)

custom_searchsortedlast(thresholds, 0.07) = 0
custom_searchsortedlast(thresholds, 0.12) = 1
custom_searchsortedlast(thresholds, 0.2) = 2
custom_searchsortedlast(thresholds, 0.96) = 9
custom_searchsortedlast(thresholds, 1.0) = 10


10

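Note that our custom method is slower than the original. It scans the thresholds linearly, so its cost grows with the position of the match, whereas `searchsortedlast` uses binary search.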

In [10]:
thresholds = [x for x in (1:64)/100]

println("original method")
@btime searchsortedlast($thresholds, 0.17) 
@btime searchsortedlast($thresholds, 0.97)

println("custom method")
@btime custom_searchsortedlast($thresholds, 0.17) 
@btime custom_searchsortedlast($thresholds, 0.97) 

original method
  12.638 ns (0 allocations: 0 bytes)
  13.889 ns (0 allocations: 0 bytes)
custom method
  19.140 ns (0 allocations: 0 bytes)
  47.318 ns (0 allocations: 0 bytes)


64

Can we make `custom_searchsortedlast` faster? Investigate the same solution as in the metaprogramming folder with the printing balls function.
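Independently of the metaprogramming idea, one direction worth trying is to replace the linear scan with a binary search. The following is only a sketch: `binary_searchsortedlast` and `toy_thresholds` are hypothetical names, the function assumes a sorted, 1-based indexed vector, and it has not been benchmarked here.

```julia
# Hypothetical helper (not part of Base): binary search for the last index k
# such that sorted_values[k] <= new_value; returns 0 if there is no such index.
function binary_searchsortedlast(sorted_values, new_value)
    lo = 0                           # every index <= lo holds a value <= new_value
    hi = length(sorted_values) + 1   # every index >= hi holds a value > new_value
    while lo < hi - 1
        mid = (lo + hi) >>> 1        # midpoint, strictly between lo and hi
        if sorted_values[mid] <= new_value
            lo = mid
        else
            hi = mid
        end
    end
    return lo
end

# Compare against the built-in method on the same inputs used earlier
toy_thresholds = collect((1:10) / 10)
for x in (0.07, 0.12, 0.2, 0.96, 1.0)
    @assert binary_searchsortedlast(toy_thresholds, x) == searchsortedlast(toy_thresholds, x)
end
```

The loop halves the search interval at each step, so the number of comparisons grows with the logarithm of the number of thresholds rather than with the position of the match. Timing it with `@btime` against both existing versions would show whether it actually closes the gap.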