### Unit II. Regressions and dimensionality reduction.

## Multiple Linear Regression by the Least Squares Method.

- Weighted Least Squares.
- Examination of residuals. Normality test for the residuals.
- Test of homogeneity of variance.
- Extreme observations.
- Search for the best regression equation. Stepwise regression.

In [1]:
using MultivariateStats

In [2]:
# prepare data
X = rand(1000, 3)
a0, b0 = rand(3), rand()
y = X * a0 .+ b0 .+ 0.1 * randn(1000)  # broadcast the scalar bias over the vector

1000-element Array{Float64,1}:
 1.69082 
 2.74528 
 1.93627 
 1.8578  
 1.20081 
 1.52956 
 2.06398 
 0.906657
 1.58807 
 1.87104 
 1.37745 
 1.31134 
 2.37205 
 ⋮       
 1.56564 
 1.65489 
 2.20615 
 0.991936
 1.75456 
 1.6685  
 2.05129 
 1.82039 
 1.94612 
 1.25767 
 1.62706 
 2.11277 

In [3]:
# solve using llsq
sol = llsq(X, y)

4-element Array{Float64,1}:
 0.597109
 0.309186
 0.969809
 0.794585

In [4]:
# extract results
a, b = sol[1:end-1], sol[end]

([0.5971089527294332,0.30918607484199245,0.9698086956968706],0.7945846246484379)
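As a cross-check, the same kind of fit can be sketched with Julia's standard library only. The block below assumes that `llsq` fits ordinary least squares with the bias stored as the last coefficient, as the output above suggests. It generates its own data with a fixed seed, so the numbers will differ from the cells above. The names `Xb`, `sol_qr`, `sol_ne`, `a_hat` and `b_hat` are illustrative.

```julia
using LinearAlgebra, Random

Random.seed!(1)                      # fixed seed so the sketch is reproducible
n, p = 1000, 3
X = rand(n, p)
a0, b0 = rand(p), rand()
y = X * a0 .+ b0 .+ 0.1 .* randn(n)

# Append a column of ones so that the last coefficient acts as the bias.
Xb = hcat(X, ones(n))

# `\` on a tall matrix returns the least-squares solution.
sol_qr = Xb \ y

# The same solution from the normal equations (Xb'Xb) β = Xb'y.
sol_ne = (Xb' * Xb) \ (Xb' * y)

a_hat, b_hat = sol_qr[1:end-1], sol_qr[end]
```

Both routes should agree up to rounding. The normal equations are shown only for illustration; they are numerically less stable than `\` when the columns of `X` are nearly collinear.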

In [5]:
# do prediction
yp = X * a .+ b

1000-element Array{Float64,1}:
 1.56625 
 2.53564 
 1.86689 
 2.05986 
 1.30457 
 1.52949 
 2.02524 
 0.894085
 1.49036 
 1.91566 
 1.44641 
 1.57973 
 2.34991 
 ⋮       
 1.68016 
 1.59811 
 2.02741 
 0.975066
 1.78062 
 1.61701 
 1.9392  
 1.85181 
 1.98969 
 1.30752 
 1.62812 
 2.06364 
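The unit outline lists the examination of residuals. A minimal sketch of what that could look like follows, using only the standard library on freshly simulated data of the same shape. The summary statistics chosen here (RMSE, R², sample skewness and excess kurtosis) are one possible informal check, not a formal normality test. All variable names are illustrative.

```julia
using LinearAlgebra, Random, Statistics

Random.seed!(2)
n = 1000
X = rand(n, 3)
a0, b0 = rand(3), rand()
y = X * a0 .+ b0 .+ 0.1 .* randn(n)

β = hcat(X, ones(n)) \ y             # least-squares fit with a bias term
yp = X * β[1:end-1] .+ β[end]

residuals = y .- yp
rmse = sqrt(mean(abs2, residuals))   # should be near the noise level 0.1
r2 = 1 - sum(abs2, residuals) / sum(abs2, y .- mean(y))

# Sample skewness and excess kurtosis; both are near 0 for normal residuals.
z = (residuals .- mean(residuals)) ./ std(residuals; corrected=false)
skew = mean(z .^ 3)
exkurt = mean(z .^ 4) - 3
```

Because the model includes a bias term, the residuals average to zero up to rounding. Values of `skew` or `exkurt` far from zero would be a hint to look at a formal test or a Q-Q plot.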

In [6]:
# repeat the fit on noise-free targets
y_sin_ruido = X * a0 .+ b0
sol_sin_ruido = llsq(X, y_sin_ruido)
predicción_sin_ruido = X * sol_sin_ruido[1:end-1] .+ sol_sin_ruido[end]

1000-element Array{Float64,1}:
 1.56431 
 2.52604 
 1.86076 
 2.04719 
 1.31064 
 1.52027 
 2.02462 
 0.908628
 1.49098 
 1.91704 
 1.45129 
 1.57384 
 2.33602 
 ⋮       
 1.68092 
 1.60373 
 2.03067 
 0.989743
 1.77409 
 1.61625 
 1.94349 
 1.86386 
 1.98882 
 1.32445 
 1.62856 
 2.05109 

In [7]:
using Plots
gr(size=(600,300))
plot(
scatter(y, yp, alpha=0.5, legend=false, 
    xlab="Values", ylab="Predictions", title="With noise"),
scatter(y_sin_ruido, predicción_sin_ruido, alpha=0.5, legend=false, 
    xlab="Values", ylab="Predictions", title="Without noise")
)



### Ridge Regression 

[Ridge regression](https://en.wikipedia.org/wiki/Tikhonov_regularization) in [MultivariateStats](http://multivariatestatsjl.readthedocs.io/en/latest/lreg.html#ridge-regression) and [in ScikitLearn](http://scikit-learn.org/stable/modules/linear_model.html#ridge-regression).  
This regression is much more robust than ordinary least squares when the explanatory variables are correlated (collinear).
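The effect of collinearity can be illustrated with a small sketch. It assumes the standard ridge formulation with an unpenalized intercept: center the data, then solve (XcᵀXc + λI) a = Xcᵀyc. The helper `ridge_fit` is hypothetical and written for this example; it is not the implementation used by either library.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: ridge fit with an unpenalized intercept.
function ridge_fit(X, y, λ)
    xmean, ymean = vec(mean(X, dims=1)), mean(y)
    Xc, yc = X .- xmean', y .- ymean          # center predictors and response
    a = (Xc' * Xc + λ * I) \ (Xc' * yc)       # penalized normal equations
    b = ymean - dot(xmean, a)                 # recover the intercept
    return a, b
end

Random.seed!(3)
n = 200
x1 = randn(n)
x2 = x1 .+ 1e-3 .* randn(n)          # nearly a copy of x1 (collinear)
X = hcat(x1, x2)
y = x1 .+ x2 .+ 0.1 .* randn(n)      # true coefficients are (1, 1)

a_ols, _ = ridge_fit(X, y, 0.0)      # λ = 0 reduces to ordinary least squares
a_ridge, _ = ridge_fit(X, y, 0.3)
```

With two nearly identical columns, the least-squares coefficients are poorly determined and can land far from (1, 1), while the penalized ones stay close to it. The norm of the ridge solution never exceeds that of the least-squares solution.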

*Ridge Regression* using **ScikitLearn.jl**; the value of α is selected by *cross-validation*:

In [8]:
using ScikitLearn
@sk_import linear_model: RidgeCV

In [9]:
rcv = RidgeCV(alphas=0.01:0.01:10.0)

PyObject RidgeCV(alphas=[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0....9.84, 9.85, 9.86, 9.87, 9.88, 9.89, 9.9, 9.91, 9.92, 9.93, 9.94, 9.95, 9.96, 9.97, 9.98, 9.99, 10.0],
    cv=None, fit_intercept=True, gcv_mode=None, loss_func=None,
    normalize=False, score_func=None, scoring=None, store_cv_values=False)

In [10]:
fit!(rcv, X, y)

PyObject RidgeCV(alphas=[0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0....9.84, 9.85, 9.86, 9.87, 9.88, 9.89, 9.9, 9.91, 9.92, 9.93, 9.94, 9.95, 9.96, 9.97, 9.98, 9.99, 10.0],
    cv=None, fit_intercept=True, gcv_mode=None, loss_func=None,
    normalize=False, score_func=None, scoring=None, store_cv_values=False)

In [11]:
rcv[:coef_]

3-element Array{Float64,1}:
 0.596958
 0.309117
 0.969565

In [12]:
rcv[:intercept_]

0.794822278874679

In [13]:
rcv[:alpha_]

0.02
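The idea behind choosing α by cross-validation can be sketched in plain Julia. This version uses a simple 5-fold split, which is an assumption made for illustration: with `cv=None`, `RidgeCV` uses an efficient leave-one-out scheme instead, so the selected value need not match the one above. The helpers `ridge_fit` and `cv_mse` are hypothetical, and the data are simulated again with a fixed seed.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper: ridge fit with an unpenalized intercept.
function ridge_fit(X, y, λ)
    xmean, ymean = vec(mean(X, dims=1)), mean(y)
    Xc, yc = X .- xmean', y .- ymean
    a = (Xc' * Xc + λ * I) \ (Xc' * yc)
    return a, ymean - dot(xmean, a)
end

# Hypothetical helper: mean squared prediction error averaged over k folds.
function cv_mse(X, y, λ; k=5)
    n = size(X, 1)
    errs = Float64[]
    for i in 1:k
        test = i:k:n                          # interleaved fold; rows are i.i.d. here
        train = setdiff(1:n, test)
        a, b = ridge_fit(X[train, :], y[train], λ)
        push!(errs, mean(abs2, y[test] .- (X[test, :] * a .+ b)))
    end
    return mean(errs)
end

Random.seed!(4)
X = rand(1000, 3)
a0, b0 = rand(3), rand()
y = X * a0 .+ b0 .+ 0.1 .* randn(1000)

alphas = 0.01:0.01:10.0                       # same grid as in the RidgeCV call
scores = [cv_mse(X, y, α) for α in alphas]
best_alpha = alphas[argmin(scores)]
```

On well-conditioned data like this, the cross-validated error is nearly flat for small α, so the exact minimizer depends on the split and the seed.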

*Ridge Regression* using **ScikitLearn.jl** with a fixed α:

In [14]:
@sk_import linear_model: Ridge

In [15]:
r = Ridge(alpha = .3)

PyObject Ridge(alpha=0.3, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, solver='auto', tol=0.001)

In [16]:
fit!(r, X, y)

PyObject Ridge(alpha=0.3, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, solver='auto', tol=0.001)

In [17]:
r[:coef_]

3-element Array{Float64,1}:
 0.594857
 0.308152
 0.966171

In [18]:
# Note: this queries rcv (the RidgeCV model), not r; use r[:intercept_] for the fixed-α fit
rcv[:intercept_]

0.794822278874679

*Ridge Regression* using **MultivariateStats**:

In [19]:
MultivariateStats.ridge(X, y, 0.3)

4-element Array{Float64,1}:
 0.594857
 0.308152
 0.966171
 0.798137

*Ridge Regression* using **R** and **RCall**:

In [20]:
using RCall

R"""
library(MASS)
lmr <- lm.ridge($y ~ $X, lambda=0.3)
"""

RCall.RObject{RCall.VecSxp}
                  `##RCall##11741`1 `##RCall##11741`2 `##RCall##11741`3 
        0.7948810         0.5969168         0.3090992         0.9695100 
