## MLPreprocessing.jl

git: https://github.com/JuliaML/MLPreprocessing.jl

*Note:* Package is not yet registered in teh julia repo, therefore one needs to manually add it through Pkg.clone(git_url, "MLPreprocesing")

### Summary

Often in ML tasks, the data is required to be preprocessed in one way or another. $MLPreprocessing$ offers a few generic preprocessing, such as centering and scaling to any $\mu, \sigma$. The package also allows for a fixed range scaling on matrices (or dataframes). Finally, polynomial expansion is also supported.

### Details

| Test                      | Results                           |            
| :- | :- |
| Packages works            | yes                               |
| Deprecation warnings      | None                              |
| Compatible with JuliaDB   | If tables are converted to arrays or dataframes |
| Contains Documetation     | No, but sufficient examples             |
| Simplicity                | Fair               |


### Issues

Although the package says to support Vector types, it did not seem to work and 1-dimensional vectors had to be transformed into matrices (see end of notebook)

### Main Functions

```
fit(StandardScaler, X[, μ, σ; obsdim, operate_on])

    obsdim: like axis in python, whether to preprocess in rows or columns
    operate_on: Specify the indices of columns or rows to be transformed
    Operates in place on StandardScaler


fit_transform(StandardScaler, X[, μ, σ; obsdim, operate_on])

    @return: Scaled X, scaler object
    Note: if "!" suffix is used, the function will act on X in place, and only return the scaler
    
    
 fit(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])

     lower, upper: lower and upper boundaries to transform data to
 
 fit_transform(FixedRangeScaler, X[, lower, upper; obsdim, operate_on])
 
     @return: Scaled X, scaler object
     Note: if "!" suffix is used, the function will act on X in place, and only return the scaler     
  
  
 transform(X, scaler)

    Requires scaler to be fitted
    @return: scaled X
```

The lower level functions that compose these transformation can also be used independently. Refer to the github page to see more.


### Example

In [20]:
using MLPreprocessing
using Plots
include("load_titanic.jl");
pyplot();

Plots.PyPlotBackend()

In [21]:
# Loading titanic data
X_train, y_train, X_test, y_test = load();

---
#### Standardising

In [22]:
println("μ: $(mean(X_train, 1))")
println("σ: $(std(X_train, 1))")

μ: [447.267 2.23028 0.632492 29.5113 0.501577 0.430599 35.1627 0.269716]
σ: [258.502 0.835979 0.482507 14.5776 0.903301 0.831305 54.4096 0.52861]


In [23]:
scaler = fit(StandardScaler, X_train, obsdim=2);
Xt_train = transform(X_train, scaler);

In [24]:
println("μ: $(mean(Xt_train, 1))")
println("σ: $(std(Xt_train, 1))")

μ: [2.32711 -0.413391 -0.43731 -0.0899352 -0.436092 -0.439052 -0.0700642 -0.441263]
σ: [0.431171 0.0671616 0.082835 0.41062 0.07978 0.0826051 0.496512 0.0847349]


---
#### Transforming to range

In [25]:
Xt_train, scaler = fit_transform(FixedRangeScaler, X_train, 1, 3, operate_on=[4], obsdim=1);

([1.0 3.0 … 7.25 0.0; 2.0 1.0 … 71.2833 1.0; … ; 887.0 2.0 … 13.0 0.0; 888.0 1.0 … 30.0 0.0], MLPreprocessing.FixedRangeScaler{Int64,Int64,Float64,Float64,1,Int64}(1, 3, [0.42], [80.0], LearnBase.ObsDim.Constant{1}(), [4]))

In [34]:
# And now the 1st column values all are between 1 and 3
println("Before:\tmin=$(minimum(X_train[:,4])), max=$(maximum(X_train[:,4]))")
println("After:\tmin=$(minimum(Xt_train[:,4])), max=$(maximum(Xt_train[:,4]))")

Before:	min=0.42, max=80.0
After:	min=1.0, max=3.0


---
#### Issue with vectors / 1-d arrays

In [27]:
# Doesn't work although package supposedly compatible with vectors. No example code given for vector example
x = randn(100)
fit(StandardScaler, Vector{Float64}(x) )

LoadError: [91mMethodError: no method matching fit(::Type{MLPreprocessing.StandardScaler}, ::Array{Float64,1})[0m
Closest candidates are:
  fit([91m::Type{StatsBase.Histogram}[39m, ::AbstractArray{T,1} where T, [91m::StatsBase.AbstractWeights{W,T,V} where V<:AbstractArray{T,1} where T<:Real[39m, [91m::Any...[39m; kwargs...) where W at /home/edoardo/.julia/v0.6/StatsBase/src/hist.jl:233
  fit([91m::Type{StatsBase.Histogram{T,N,E} where E where N}[39m, ::AbstractArray{T,1} where T; closed, nbins) where T at /home/edoardo/.julia/v0.6/StatsBase/src/hist.jl:226
  fit([91m::Type{StatsBase.Histogram{T,N,E} where E where N}[39m, ::AbstractArray{T,1} where T, [91m::StatsBase.AbstractWeights[39m; closed, nbins) where T at /home/edoardo/.julia/v0.6/StatsBase/src/hist.jl:230
  ...[39m

In [28]:
# A fix is to simply make the 1-d array a one column-matrix
x = randn(100,1)
scler = fit(StandardScaler, x )
xt = transform(x, scaler)

println("\mu = $(mean(xt))")
println("\sigma = $(std(xt))")

mu = 0.0866453490584669
sigma = 0.9381369913673726
