# OnlineStats.jl

_git:_ https://github.com/joshday/OnlineStats.jl                          
_documentation:_ http://joshday.github.io/OnlineStats.jl/stable/index.html

## Summary

OnlineStats provides multiple analysis algorithms together with a framework to apply them to online data. This means that instead of requiring all the data to be loaded and analysed at once, one can feed it in over time and update the current _stat_.
(Note: we follow the documentation's convention, which uses _stat_ to name any statistic or model that the library can handle.)

This is an extremely interesting feature: it not only makes it possible to work with data coming through a live feed, it also lets the user split the input into smaller parts when it is too large to be handled at once, for instance because of memory usage. Additionally, the library allows merging of its _stats_, which makes parallelization extremely simple, since the data can be divided into subsets, each fitted separately, and the results merged at the end.
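As a rough illustration of what updating a _stat_ over time means, here is a minimal sketch of a running mean and variance (Welford's algorithm) in plain Julia. It does not use OnlineStats: `RunningStat`, `update!` and `variance` are hypothetical names chosen for this example, and the library's own implementation may differ.

```julia
# Running mean and variance in plain Julia (Welford's algorithm).
# `RunningStat`, `update!` and `variance` are hypothetical names, not OnlineStats API.
mutable struct RunningStat
    n::Int         # number of observations seen so far
    mean::Float64  # current mean
    m2::Float64    # sum of squared deviations from the current mean
end
RunningStat() = RunningStat(0, 0.0, 0.0)

# Update the state with a single new observation
function update!(s::RunningStat, x::Real)
    s.n += 1
    delta = x - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (x - s.mean)
    return s
end

# Unbiased sample variance (NaN until at least two observations are seen)
variance(s::RunningStat) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

# Feed the data in two chunks, as if it arrived over time
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = RunningStat()
for chunk in (data[1:3], data[4:8])
    for x in chunk
        update!(s, x)
    end
end
println("n = ", s.n, ", mean = ", s.mean, ", variance = ", variance(s))
```

Only `n`, `mean` and `m2` are stored, so the memory used does not grow with the number of observations.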


The library includes _stats_ ranging from mean and variance to statistical learning methods such as SVM and multivariate analysis such as PCA. A full list of the _stats_ it supports can be found [here](http://joshday.github.io/OnlineStats.jl/stable/stats_and_models.html#Statistics-and-Models-1).


## Usage

### Basic
The package is mostly used through three functions: _Series_, _fit!_ and _value_.

```julia
    # Arguments in [brackets] are optional
    s = Series([x,] [Weight(),] Stats1(), Stats2(), ...)
```

Here we instantiate a _Series_ from any number of _stats_ taken from the list linked above, optionally together with initial data $x$ and a _weight_ function, which determines how the weight of each observation varies as more observations are added.


Next, we fit new inputs $y$ to the already instantiated series $s$. We can additionally tell _fit!_ whether observations are stored in rows or in columns (defaults to rows).

```julia
    fit!(s, y[, major])  # `major` is optional
```
where _major_ should be either _Cols()_ or _Rows()_, and specifies which of the _row-major_ or _column-major_ conventions the input matrix follows.

Finally, we can get the value of a stat at any point by invoking the _value_ function:

```julia
    value(s)
```

### Weights

The choice of weights is quite important in an online method, since every update and merge uses it to decide how much influence a new observation has. OnlineStats already contains a large library of weights and additionally allows custom weight functions to be passed. More information can be found [here](http://joshday.github.io/OnlineStats.jl/stable/weights.html).

Weights are passed as an argument to the _Series_ constructor, for example using mean and variance as _stats_:

```julia
    Series(x, WeightFunction(), Mean(), Variance()) 
```
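To give an intuition of what a weight does, the sketch below updates a running mean as `m += γ * (x - m)`, where the weight `γ` depends on the number of observations seen. This is plain Julia written for illustration, under the assumption that a weighted update takes this form: `smooth_mean`, `equal_weight` and `constant_weight` are hypothetical helpers, not OnlineStats functions.

```julia
# Weighted running mean: m <- m + γ * (x - m).
# `smooth_mean`, `equal_weight` and `constant_weight` are hypothetical helpers.
equal_weight(n) = 1 / n        # γ = 1/n: every observation counts equally
constant_weight(λ) = n -> λ    # γ = λ: recent observations dominate

function smooth_mean(xs, weight)
    m = 0.0
    for (n, x) in enumerate(xs)
        γ = n == 1 ? 1.0 : weight(n)  # the first observation initialises the mean
        m += γ * (x - m)
    end
    return m
end

xs = [1.0, 1.0, 1.0, 1.0, 10.0]
m_equal = smooth_mean(xs, equal_weight)
m_const = smooth_mean(xs, constant_weight(0.5))
println("equal weight:        ", m_equal)
println("constant weight 0.5: ", m_const)
```

With the equal weight the result is the ordinary mean of `xs`, while the constant weight lets the last observation pull the estimate much further.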

The following plot shows the different weight functions and how they change with the number of data points added.

<img src="https://user-images.githubusercontent.com/8075494/29486708-a52b9de6-84ba-11e7-86c5-debfc5a80cca.png" width="500"/>

### Merging

If two series track the same _stats_, they can be merged. Additionally, an option can be given to select the merging method:

```julia  
    # Default
    merge!(s1, s2, :append)  # equivalent to fit!(s1, y1); fit!(s1, y2);

    # Using weighted average
    merge!(s1, s2, :weighted)

    # Treat s2 as a singleton
    merge!(s1, s2, :singleton)

    # Provide ratio of influence s2 should have
    merge!(s1, s2, .5)    
```


### Parallelization

Parallelization is then simply a matter of fitting several _series_ made of the same _stats_, each on a different subset of the dataset, and merging them. Note that not all _stats_ give exactly the same result when merged as when fitted on all the data at once. The ones that do allow exact merging are subtypes of _ExactStat_.

```julia
    y1 = randn(10_000)
    y2 = randn(10_000)
    y3 = randn(10_000)

    s1 = Series(Mean(), Variance(), Hist(50))
    s2 = Series(Mean(), Variance(), Hist(50))
    s3 = Series(Mean(), Variance(), Hist(50))

    fit!(s1, y1)
    fit!(s2, y2)
    fit!(s3, y3)

    merge!(s1, s2)  # merge information from s2 into s1
    merge!(s1, s3)  # merge information from s3 into s1

```
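To see why some _stats_ can be merged exactly, consider mean and variance: two partial summaries `(n, mean, m2)` can be combined without revisiting the data, using the pairwise update formula of Chan et al. The sketch below is plain Julia with hypothetical helper names (`summarise`, `merge_summaries`). It illustrates the idea only and is not how OnlineStats implements `merge!`.

```julia
# Summarise a chunk as (n, mean, m2), where m2 is the sum of squared deviations.
# `summarise` and `merge_summaries` are hypothetical names used for illustration.
function summarise(xs)
    n = length(xs)
    m = sum(xs) / n
    m2 = sum((x - m)^2 for x in xs)
    return (n, m, m2)
end

# Combine two summaries without touching the underlying data
function merge_summaries(a, b)
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    delta = mb - ma
    m = ma + delta * nb / n
    m2 = m2a + m2b + delta^2 * na * nb / n
    return (n, m, m2)
end

y1 = [1.0, 2.0, 3.0]
y2 = [4.0, 5.0, 6.0, 7.0]

merged = merge_summaries(summarise(y1), summarise(y2))  # fitted separately, then merged
direct = summarise(vcat(y1, y2))                        # fitted on all the data at once
println("merged: ", merged)
println("direct: ", direct)
```

Both routes give the same count, mean and sum of squared deviations, up to floating-point rounding.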

### Statistical Learning Methods

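The objective function that OnlineStats tries to minimize is

$$ \frac{1}{n} \sum_{i=1}^{n} f_i(\beta) + \sum_{j=1}^{p} \lambda_j g(\beta_j) $$

where $f_i$ is the loss on observation $i$, $g$ is the penalty, and $\lambda_j$ is the regularization parameter for coefficient $\beta_j$. Choosing the right penalties and losses can then produce a multitude of models, including Ridge, Lasso, SVMs and logistic regression.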
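As a concrete reading of this objective, the following plain-Julia sketch evaluates it for a linear model with a given loss and penalty. All names (`objective`, `half_squared_error`) are hypothetical, and the choice of `abs2` and `abs` as ridge-style and lasso-style penalties is an assumption made for illustration: the exact scaling used by PenaltyFunctions.jl may differ.

```julia
# Evaluate (1/n) * sum_i f_i(β) + sum_j λ_j * g(β_j) for a linear model.
# `objective` and `half_squared_error` are hypothetical names.
half_squared_error(yhat, y) = 0.5 * (yhat - y)^2

function objective(β, X, y, λ, loss, penalty)
    n, p = size(X)
    data_term = 0.0
    for i in 1:n
        yhat = sum(X[i, j] * β[j] for j in 1:p)  # linear prediction for row i
        data_term += loss(yhat, y[i])
    end
    penalty_term = sum(λ[j] * penalty(β[j]) for j in 1:p)
    return data_term / n + penalty_term
end

X = [1.0 0.0; 0.0 1.0; 1.0 1.0]
y = [1.0, 2.0, 3.0]
β = [0.5, 2.0]
λ = fill(0.1, 2)

ridge_value = objective(β, X, y, λ, half_squared_error, abs2)  # squared penalty
lasso_value = objective(β, X, y, λ, half_squared_error, abs)   # absolute-value penalty
println("ridge-style objective: ", ridge_value)
println("lasso-style objective: ", lasso_value)
```

Only the `penalty` argument changes between the two calls, which is the sense in which one framework covers several models.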

```julia
    ### Doc ###

    # StatLearn(p::Int, args)
    # args:
    #     loss = .5 * L2DistLoss() # any Loss from LossFunctions.jl
    #     penalty = L2Penalty() # any Penalty (which has a prox method) from PenaltyFunctions.jl.
    #     λ = fill(.1, p) # a Vector of element-wise regularization parameters
    #     updater = SGD() # SGD, ADAGRAD, ADAM, ADAMAX, MSPI

    ### Usage ### 

    o = StatLearn(10, .5 * L2DistLoss(), L1Penalty(), fill(.1, 10), SGD())
    s = Series(o)
    fit!(s, x, y)
    coef(o)
    predict(o, x)
```
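The mention of a penalty with a prox method suggests a proximal gradient scheme. As a hedged sketch of what one such update could look like, here is a single proximal SGD step for the squared loss with an L1 penalty, written in plain Julia. `soft_threshold` and `prox_sgd_step` are hypothetical names, and this is a textbook version of the technique, not OnlineStats' actual updater code.

```julia
# One proximal SGD step for loss 0.5 * (x'β - y)^2 with penalty λ_j * |β_j|.
# `soft_threshold` and `prox_sgd_step` are hypothetical names.

# Proximal operator of t * |b| (soft-thresholding)
soft_threshold(b, t) = sign(b) * max(abs(b) - t, 0.0)

function prox_sgd_step(β, x, y, λ, γ)
    p = length(β)
    residual = sum(x[j] * β[j] for j in 1:p) - y
    # Gradient step on the loss, then the prox of the penalty, coordinate by coordinate
    return [soft_threshold(β[j] - γ * residual * x[j], γ * λ[j]) for j in 1:p]
end

β0 = [0.0, 0.0]
β1 = prox_sgd_step(β0, [1.0, 2.0], 1.0, fill(0.1, 2), 0.1)
println("coefficients after one step: ", β1)
```

Here `γ` plays the role of the step size, and the soft-thresholding is what can set small coefficients exactly to zero in a Lasso-type model.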

### Other methods


Linear regression [_(documentation)_](http://joshday.github.io/OnlineStats.jl/stable/api.html#OnlineStats.LinRegBuilder)
```julia 
    lr = LinRegBuilder(p) # p = number of dimensions
    Series((x,y), lr) # Will do simple OLS
    value(lr)

    # Can provide additional arguments with coef function
    Series(x,lr)
    coef(lr, λ, y=[7], x=[1,2,3]) # Will use columns 1,2,3 to predict column 7 of x
    coef(lr, .5) # Ridge regression: specify λ (a float or a vector of p dimensions)
    value(lr)
```

Calculate covariance matrix and PCA
```julia
    y = randn(1000, 5)
    o = CovMatrix(5)
    Series(y, o)

    # PCA & transformation
    d_out = 2
    corr = cor(o)
    evals, evecs = eig(corr)
    pc = sortperm(evals, rev=true)[1:d_out]
    T = y * evecs[:,pc];
```

___

## Examples

### Weights comparison

```julia
using OnlineStats
using Plots
plotly()

steps = 0:0.1:6*pi
sins = Float64[]
means = zeros(Float64, 0, 6)  # one row per step, one column per weight

labels = ["EqualWeight", "ExponentialWeight", "Bounded(EqualWeight(),.2)", "LearningRate(.6)", 
        "HarmonicWeight(.1)", "McclainWeight(.1)"]

# One series per weight, each tracking a running mean
s_equal = Series(EqualWeight(), Mean() )
s_exp = Series(ExponentialWeight(), Mean() )
s_bounded = Series(Bounded(EqualWeight(),.2), Mean())
s_lr = Series( LearningRate(.6), Mean())
s_harmonic = Series(HarmonicWeight(.1), Mean())
s_mcclain = Series(McclainWeight(.1), Mean())

for (i, x) in enumerate(steps)
    append!(sins, sin(x))
    # Note: each call fits the whole accumulated vector, not only the newest value
    fit!(s_equal, sins)
    fit!(s_exp, sins)
    fit!(s_bounded, sins)
    fit!(s_lr, sins)
    fit!(s_harmonic, sins)
    fit!(s_mcclain, sins)

    # Record the current mean of every series as a new row
    means = vcat(means, [value(s_equal)[1] value(s_exp)[1] value(s_bounded)[1] value(s_lr)[1] value(s_harmonic)[1] value(s_mcclain)[1]])
end
```

**Results:**

![Weights comparison](resources/weights_comparison.gif)

### Linear Learning (Ridge as example)

```julia
rmse(x) = sqrt(mean(x.^2.))
f(x) = 2*x[1] - 3*x[2]+0.1*x[3] + x[4] # Ground truth

# Prepare data, with σ=0.1 normal noise
x_train = randn(1000, 4)
y_train = [f(x_train[i,:])+0.1*randn() for i=1:size(x_train,1) ]

x_test = randn(50, 4)
y_test = [f(x_test[i,:])+0.1*randn() for i=1:size(x_test,1) ];
```

```julia
# Specify model: Ridge with 4 input dimensions, using λ=0.1 for all features, and SGD optimiser
o = StatLearn(4, L2Penalty(), L2DistLoss(), fill(0.1, 4), SGD())

# Make it into a series and fit it
s = Series(o);
fit!(s, (x_train, y_train));

# Found coefficients
coef(o);
```

```julia
# Sanity check
score = rmse( y_test - predict(o, x_test) )
println("Score on the testing set: $score")
```

```
Score on the testing set: 0.19317829923435953
```


### Parallelized Linear Learning

```julia
rmse(x) = sqrt(mean(x.^2.))
f(x) = 2*x[1] - 3*x[2]+0.1*x[3] + x[4] - 7*x[5] + 2.4*x[6] # Ground truth

# Prepare data, with σ=0.1 normal noise
x_train = randn(30_000_000, 6)
noise = 0.1*randn(30_000_000)
y_train = [f(x_train[i,:])+noise[i] for i=1:size(x_train,1) ]

x_test = randn(50, 6)
y_test = [f(x_test[i,:])+0.1*randn() for i=1:size(x_test,1) ];
```

```julia
# Specify model: Ridge with 6 input dimensions, using λ=0.1 for all features, and SGD optimiser
o = StatLearn(6, L2Penalty(), L2DistLoss(), fill(0.1, 6), SGD())

# Make it into a series and fit it
s = Series(o);

### Single thread ###
tic()
fit!(s, (x_train, y_train));
_end = toq();

println("Single thread took $(_end) seconds")

### In parallel ###
s1 = Series(StatLearn(6, L2Penalty(), L2DistLoss(), fill(0.1, 6), SGD()))
s2 = Series(StatLearn(6, L2Penalty(), L2DistLoss(), fill(0.1, 6), SGD()))
s3 = Series(StatLearn(6, L2Penalty(), L2DistLoss(), fill(0.1, 6), SGD()))

tic()
# @sync blocks until all three spawned fits have finished, so the merges see fitted series
@sync begin
    @spawn fit!(s1, (x_train[1:10_000_000,:], y_train[1:10_000_000]))
    @spawn fit!(s2, (x_train[10_000_001:20_000_000,:], y_train[10_000_001:20_000_000]))
    @spawn fit!(s3, (x_train[20_000_001:30_000_000,:], y_train[20_000_001:30_000_000]))
end

merge!(s1, s2)  # merge information from s2 into s1
merge!(s1, s3)  # merge information from s3 into s1
_end = toq();

println("Parallel took $(_end) seconds\n")
```

```julia
println("### Sanity check ###\nCoefficients comparison:\n")
n_coefs = coef(o)
p_coefs = value(s1)[1]
for i in 1:6
    println("Normal: $(round(n_coefs[i],3)) \tParallel: $(round(p_coefs[i],3))")
end
```

```
### Sanity check ###
Coefficients comparison:

Normal: 1.908 	Parallel: 1.909
Normal: -2.856 	Parallel: -2.859
Normal: 0.093 	Parallel: 0.09
Normal: 0.953 	Parallel: 0.949
Normal: -6.664 	Parallel: -6.669
Normal: 2.282 	Parallel: 2.289

([1.90809, -2.85603, 0.0931034, 0.952989, -6.66414, 2.28211],)
```