## SparseRegression.jl
Git: https://github.com/joshday/SparseRegression.jl

---

#### Summary
SparseRegression.jl is a package for high-performance fitting of linear models on large datasets, where the coefficients often turn out to be sparse.
The main call has the form *SModel(x, y, args...)*, where the arguments include the loss, the penalty, the penalty factors $\lambda$, and the observation weights $w$.
Predictions are made through a *predict(model, X)* call.

The loss and penalty functions are based on the _LossFunctions_ and _PenaltyFunctions_ JuliaML core packages.

Additionally, one can use learning strategies from the _LearningStrategies_ package. This makes it possible to set parameters that are purely learning based, such as optimizers, max iterations or max items.
More on this in the documentation.

This design lets a single model type cover many linear models, such as OLS, ridge, and lasso, which all share the same underlying structure: a loss plus a penalty.
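
As a sketch of that shared structure (an assumption based on the usual penalized empirical-risk form, not a formula taken from the package documentation; the exact normalization and the way the weights $w$ enter may differ), the fitted coefficients can be thought of as solving:

$$
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{n} \sum_{i=1}^{n} w_i \, f(y_i, x_i^\top \beta) + \sum_{j=1}^{p} \lambda_j \, g(\beta_j)
$$

Here $f$ is the loss, $g$ the penalty, and $\lambda_j$ the per-coefficient penalty factor. With the squared-error loss, $g = 0$ would give OLS, $g(\beta_j) = \beta_j^2$ (up to a constant factor) ridge, and $g(\beta_j) = |\beta_j|$ lasso.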

Issues: run time degrades sharply as dimensionality increases; see the benchmark at the bottom.

---
#### Details

| Test | Results |
| ------------- |:-------------:|
| Package works | Yes |
| Deprecation warnings | No |
| Compatible with JuliaDB | If targets are transformed into an array |
| Contains documentation | Yes, but not great |
| Simplicity | Good |


---
#### Usage


SModel(x, y, args...)

Arguments

- loss::Loss = .5 * L2DistLoss()

- penalty::Penalty = L2Penalty()

- λ::Vector{Float64} = fill(.1, size(x, 2))

- w::Union{Void, AbstractWeights} = nothing

---
#### Sample code

In [1]:
using SparseRegression;
include("load_titanic.jl");

INFO: Recompiling stale cache file /home/edoardo/.julia/lib/v0.6/LearningStrategies.ji for module LearningStrategies.

In [2]:
X_train, y_train, X_test, y_test = load();

In [3]:
# Example using lasso regression
model = SModel(X_train,y_train, L2DistLoss(), L1Penalty());
learn!(model);
model

INFO: MaxIter(100) finished

█ SModel
  > β        : [2.52083e74 -4.01126e74 … -2.42626e75 -1.07446e73]
  > λ factor : [0.1 0.1 … 0.1 0.1]
  > Loss     : L2DistLoss
  > Penalty  : L1Penalty
  > Data
    - x : 634×8 Array{Float64,2}
    - y : 634-element Array{Int64,1}
    - w : Void

In [4]:
# Predicting new data
predict(model, X_test);

---
### Simple benchmark vs Python

(Only lasso regression is tested)

In [9]:
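function compute_regression(n_points::Int64, n_dims::Int64)
    # Synthetic data: Gaussian features, linearly spaced true coefficients, unit-variance noise
    x = randn(n_points, n_dims);
    y = x * linspace(-1, 1, n_dims) + randn(n_points);
    # Note: SModel(x, y) uses the default loss and penalty listed under Usage;
    # pass L1Penalty() explicitly (as in In [3]) to fit a lasso model.
    s = SModel(x, y);

    # Time only the fitting step (toc() also prints the elapsed time)
    tic();
    learn!(s);
    elapsed = toc();

    return elapsed
end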

compute_regression (generic function with 2 methods)

In [14]:
# This cell takes ~5 minutes to run on my laptop; I suggest trusting the results listed below instead of running it.
IJulia.set_verbose(false)

n_points = 10_000
n_dims = [10, 100, 1000, 3000]

avg_times = Float64[]

for n_dim in n_dims
    times = Float64[]
    for i in 1:5
        push!(times, compute_regression(n_points, n_dim));
    end
    # Store the mean over the 5 runs for this dimensionality
    push!(avg_times, mean(times));
end

IJulia.set_verbose(true)

elapsed time: 0.000451833 seconds
elapsed time: 0.000351916 seconds
elapsed time: 0.000330776 seconds
elapsed time: 0.000371106 seconds
elapsed time: 0.0004807 seconds
elapsed time: 0.014209522 seconds
elapsed time: 0.006058353 seconds
elapsed time: 0.006046312 seconds
elapsed time: 0.005942704 seconds
elapsed time: 0.006020863 seconds
elapsed time: 0.266935342 seconds
elapsed time: 0.26055046 seconds
elapsed time: 0.411826132 seconds
elapsed time: 0.350356857 seconds
elapsed time: 0.267703031 seconds
elapsed time: 12.303489593 seconds
elapsed time: 12.107999157 seconds
elapsed time: 12.238423989 seconds
elapsed time: 12.392077419 seconds
elapsed time: 12.459131971 seconds

INFO: Sweep finished

true

#### Results

| Dimensions | Julia | Python |
| ------------- |:-----:|:-----:|
| 10 | 0.00055s | 0.023s |
| 100 | 0.0073s | 0.19s |
| 1000 | 0.29s | 2.3s |
| 5000 | 58s | 17s |

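Up to 1000 dimensions the package is faster than Python (by roughly 8x to 40x), but at 5000 dimensions it is about 3x slower (58s vs 17s). Python's run time grows steadily, roughly in proportion to the number of dimensions, whereas the Julia run time jumps by a factor of about 200 between 1000 and 5000 dimensions, so something clearly goes wrong with the package above a certain threshold.
The code for the Python results can be found in *python_scripts.py*.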