# Coordinate descent

CuML library can implement lasso and elastic net algorithms. The lasso model extends LinearRegression with L2 regularization and elastic net extends LinearRegression with a combination of L1 and L2 regularizations. 

We see tremendous speed up for datasets with large number of rows and less number of rows. Furthermore, the MSE value for the cuML implementation is much smaller than the scikit-learn implementation for very small datasets.


# Setup:
1.  Install most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
2.  Install RAPIDS libraries
3. Set necessary environment variables
4. Copy RAPIDS .so files into current working directory, a workaround for conda/colab interactions
    - may take a few minutes
    - long output (output display removed)


In [0]:
!wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
!bash rapids-colab.sh

import sys, os

sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

## Data & Imports

In [2]:
# Download data 
!mkdir data
!wget https://github.com/rapidsai/notebooks/raw/branch-0.8/cuml/data/mortgage.npy.gz -O data/mortgage.npy.gz

# Select a particular GPU to run the notebook  (if needed)
# %env CUDA_VISIBLE_DEVICES=2
# Import the required libraries

# rapids
import cudf, cuml, xgboost
import dask_cudf, dask_cuml
from cuml import Lasso as cuLasso
from cuml.linear_model import ElasticNet as cuElasticNet
# scikit
from sklearn.linear_model import Lasso
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet
# general 
import numpy as np
import pandas as pd

--2019-07-29 08:41:45--  https://github.com/rapidsai/notebooks/raw/branch-0.8/cuml/data/mortgage.npy.gz
Resolving github.com (github.com)... 140.82.118.3
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rapidsai/notebooks/branch-0.8/cuml/data/mortgage.npy.gz [following]
--2019-07-29 08:41:45--  https://raw.githubusercontent.com/rapidsai/notebooks/branch-0.8/cuml/data/mortgage.npy.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 6642646 (6.3M) [application/octet-stream]
Saving to: ‘data/mortgage.npy.gz’


2019-07-29 08:41:46 (78.8 MB/s) - ‘data/mortgage.npy.gz’ saved [6642646/6642646]



## Helper Functions

In [0]:
# Check if the mortgage dataset is present and then extract the data from it, else just create a random dataset for regression 
import gzip
def load_data(nrows, ncols, cached = 'data/mortgage.npy.gz'):
    # Split the dataset in a 80:20 split
    train_rows = int(nrows*0.8)
    if os.path.exists(cached):
        print('use mortgage data')

        with gzip.open(cached) as f:
            X = np.load(f)
        # The 4th column is 'adj_remaining_months_to_maturity'
        # used as the label
        X = X[:,[i for i in range(X.shape[1]) if i!=4]]
        y = X[:,4:5]
        rindices = np.random.randint(0,X.shape[0]-1,nrows)
        X = X[rindices,:ncols]
        y = y[rindices]
        df_y_train = pd.DataFrame({'fea%d'%i:y[0:train_rows,i] for i in range(y.shape[1])})
        df_y_test = pd.DataFrame({'fea%d'%i:y[train_rows:,i] for i in range(y.shape[1])})
    else:
        print('use random data')
        X,y = make_regression(n_samples=nrows,n_features=ncols,n_informative=ncols, random_state=0)
        df_y_train = pd.DataFrame({'fea0':y[0:train_rows,]})
        df_y_test = pd.DataFrame({'fea0':y[train_rows:,]})

    df_X_train = pd.DataFrame({'fea%d'%i:X[0:train_rows,i] for i in range(X.shape[1])})
    df_X_test = pd.DataFrame({'fea%d'%i:X[train_rows:,i] for i in range(X.shape[1])})

    return df_X_train, df_X_test, df_y_train, df_y_test

## Obtain and convert the dataset

In [4]:
%%time
# nrows = number of samples
# ncols = number of features of each sample 
nrows = 2*21
ncols = 500

# Split the dataset into training and testing sets, in the ratio of 80:20 respectively
X_train, X_test, y_train, y_test = load_data(nrows,ncols)
print('training data',X_train.shape)
print('training label',y_train.shape)
print('testing data',X_test.shape)
print('testing label',y_test.shape)
print('label',y_test.shape)

use mortgage data
training data (33, 500)
training label (33, 1)
testing data (9, 500)
testing label (9, 1)
label (9, 1)
CPU times: user 4.81 s, sys: 1.83 s, total: 6.64 s
Wall time: 6.72 s


In [5]:
%%time
# Convert the pandas dataframe to cudf format
X_cudf = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)
y_cudf = y_train.values
y_cudf = y_cudf[:,0]
y_cudf = cudf.Series(y_cudf)

CPU times: user 1.9 s, sys: 315 ms, total: 2.21 s
Wall time: 2.84 s


## Define the model parameters

In [0]:
# lr = learning rate
# algo = algorithm used in the model
lr = 0.001
algo = 'cyclic'

# Lasso

The lasso model implemented in cuml allows the user to change the following parameter values:

1. `alpha`: regularizing constant that is multiplied with L1 to control the extent of regularization. (default = 1)
2. `normalize`: variable decides if the predictors in X will be normalized or not. (default = False)
3. `fit_intercept`: if set to True the model tries to center the data. (default = True)
4. `max_iter`: maximum number of iterations for training (fitting) the data to the model. (default = 1000)
5. `tol`: the tolerance for optimization. (default = 1e-3)
6. `algorithm`: the user can set the algorithm value as 'cyclic' or 'random'


The model accepts only numpy arrays or cudf dataframes as the input. 
- In order to convert your dataset to cudf format please read the cudf [documentation](https://rapidsai.github.io/projects/cudf/en/latest/) 
- For additional information on the lasso model please refer to the [documentation](https://rapidsai.github.io/projects/cuml/en/latest/index.html)

## Scikit-learn model for lasso 

In [7]:
%%time
# Use the sklearn lasso model to fit the training dataset 
skols = Lasso(alpha=np.array([lr]), fit_intercept = True, normalize = False, max_iter = 1000, selection=algo, tol=1e-10)
skols.fit(X_train, y_train)

CPU times: user 2.51 ms, sys: 4.9 ms, total: 7.41 ms
Wall time: 19.5 ms


In [8]:
%%time
# Calculate the mean squared error for the sklearn lasso model on the testing dataset
sk_predict = skols.predict(X_test)
error_sk = mean_squared_error(y_test,sk_predict)

CPU times: user 6.73 ms, sys: 3.95 ms, total: 10.7 ms
Wall time: 8.68 ms


## CuML model for lasso

In [9]:
%%time
# Run the cuml linear regression model to fit the training dataset 
cuols = cuLasso(alpha=np.array([lr]), fit_intercept = True, normalize = False, max_iter = 1000, selection=algo, tol=1e-10)
cuols.fit(X_cudf, y_cudf)

CPU times: user 1.06 s, sys: 154 ms, total: 1.21 s
Wall time: 1.15 s


In [10]:
%%time
# Calculate the mean squared error of the testing dataset using the cuml linear regression model
cu_predict = cuols.predict(X_cudf_test).to_array()
error_cu = mean_squared_error(y_test,cu_predict)

CPU times: user 119 ms, sys: 1.38 ms, total: 121 ms
Wall time: 121 ms


In [11]:
# Print the mean squared error of the sklearn and cuml model to compare the two
print("SKL MSE(y):")
print(error_sk)
print("CUML MSE(y):")
print(error_cu)

SKL MSE(y):
1.4399155910280714e-05
CUML MSE(y):
1.4399175e-05


# Elastic Net

The elastic net model implemented in cuml contains the same parameters as the lasso model. In addition to the variable values that can be altered in lasso, elastic net has another variable who's value can be changed


- `l1_ratio`: decides the ratio of amount of L1 and L2 regularization that would be applied to the model
  - When L1 ratio = 0
    - the model will have only L2, regularization shall be applied to the model (default = 0.5)


The model accepts only numpy arrays or cudf dataframes as the input. 
- In order to convert your dataset to cudf format please read the cudf [documentation](https://rapidsai.github.io/projects/cudf/en/latest/) 
- For additional information on the lasso model please refer to the [documentation](https://rapidsai.github.io/projects/cuml/en/latest/index.html)

## Scikit-learn model for elastic net

In [12]:
%%time
# Use the sklearn linear regression model to fit the training dataset 
elastic_sk = ElasticNet(alpha=np.array([lr]), fit_intercept = True, normalize = False, max_iter = 1000, selection=algo, tol=1e-10)
elastic_sk.fit(X_train, y_train)

CPU times: user 4.89 ms, sys: 1.23 ms, total: 6.12 ms
Wall time: 11.1 ms


In [13]:
%%time
# Calculate the mean squared error of the sklearn linear regression model on the testing dataset
sk_predict_elas = elastic_sk.predict(X_test)
error_sk_elas = mean_squared_error(y_test,sk_predict_elas)

CPU times: user 2.36 ms, sys: 1.91 ms, total: 4.27 ms
Wall time: 3.77 ms


## CuML model for elastic net

In [14]:
%%time
# Run the cuml linear regression model to fit the training dataset 
elastic_cu = cuElasticNet(alpha=np.array([lr]), fit_intercept = True, normalize = False, max_iter = 1000, selection=algo, tol=1e-10)
elastic_cu.fit(X_cudf, y_cudf)

CPU times: user 357 ms, sys: 93 ms, total: 450 ms
Wall time: 379 ms


In [15]:
%%time
# Calculate the mean squared error of the testing dataset using the cuml linear regression model
cu_predict_elas = elastic_cu.predict(X_cudf_test).to_array()
error_cu_elas = mean_squared_error(y_test,cu_predict_elas)

CPU times: user 126 ms, sys: 2.43 ms, total: 128 ms
Wall time: 133 ms


In [16]:
# Print the mean squared error of the sklearn and cuml model to compare the two
print("SKL MSE(y):")
print(error_sk_elas)
print("CUML MSE(y):")
print(error_cu_elas)

SKL MSE(y):
1.4254451740114062e-05
CUML MSE(y):
1.4254318e-05
