<h1><center>Generalized Linear Models Using pai4sk and cuDF</center></h1>


In this example, we train Logistic Regression, Ridge Regression, Lasso Regression, and Support Vector Machine models on the epsilon dataset with both `scikit-learn` and `pai4sk`, loading the data through pandas and cuDF dataframes.


The epsilon dataset is from the [PASCAL Large Scale Learning Challenge](http://www.k4all.org/project/large-scale-learning-challenge/). 

We load the epsilon dataset into pandas dataframes and convert them into cuDF (RAPIDS) dataframes. Then we train the models using both pai4sk and scikit-learn. Update the `device_ids` list passed to the pai4sk estimators (for example `LogisticRegression`) according to the number of GPUs available to you.
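
As a minimal sketch of that adjustment, the list can be built from a GPU count. The helper name `make_device_ids` is hypothetical and not part of pai4sk; it only assumes that GPUs are addressed by consecutive zero-based ordinals, as the `device_ids=[0]` used throughout this notebook suggests.

```python
def make_device_ids(num_gpus):
    """Return the GPU ordinals [0, 1, ..., num_gpus - 1].

    Hypothetical helper: assumes consecutive zero-based device ordinals.
    """
    if num_gpus < 1:
        raise ValueError("at least one GPU is required when use_gpu=True")
    return list(range(num_gpus))


# Example: a machine with two GPUs
device_ids = make_device_ids(2)
print(device_ids)
```

The resulting list would then be passed as the `device_ids` argument of the pai4sk estimators below.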

### Imports

In [1]:
import time
import numpy as np
import sys
import argparse
import pandas as pd

import cudf
from cudf.dataframe import DataFrame
from sklearn.datasets import load_svmlight_file

defaultPath = "."

### Download input dataset into pandas dataframes

The commented-out commands below download the full epsilon dataset, split it, and save the splits as NumPy files. The training and test sets used here are only a small fraction of the full dataset, for a quick demonstration. Snap ML training typically shows a much larger performance advantage on bigger datasets.

Uncomment the download code below if you need to fetch the actual data.

In [2]:
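# Download the data file
# (each "!" command runs in its own subshell, so "!cd data" would not persist;
#  download and unpack directly into ./data instead)
#!mkdir -p data
#!wget -P data https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.bz2
#!bunzip2 data/epsilon_normalized.bz2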

#X,y = load_svmlight_file("./data/epsilon_normalized")

# Make the train-test split
#from sklearn.model_selection import train_test_split
#X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.1, test_size=0.05, random_state=42)

# Convert to dense
#X_train = np.array(X_train.todense())
#X_test  = np.array(X_test.todense())

# Write to binary numpy files
#np.save("./data/epsilon.X_train", X_train)
#np.save("./data/epsilon.X_test",  X_test)
#np.save("./data/epsilon.y_train", y_train)
#np.save("./data/epsilon.y_test",  y_test)

ytrain = np.load('./data/epsilon.y_train.npy')
ytest = np.load('./data/epsilon.y_test.npy')
Xtrain = np.load('./data/epsilon.X_train.npy')
Xtest = np.load('./data/epsilon.X_test.npy')

pdf_trainX = pd.DataFrame(Xtrain, dtype=np.float32)
pdf_testX = pd.DataFrame(Xtest, dtype=np.float32)
pdf_trainY = pd.DataFrame(ytrain, dtype=np.float32)
pdf_testY = pd.DataFrame(ytest, dtype=np.float32)

### Convert pandas dataframes into cuDF dataframes

In [3]:
df_trainX = DataFrame.from_pandas(pdf_trainX)
df_trainY = DataFrame.from_pandas(pdf_trainY)

# Host-side training data (used by scikit-learn and the pai4sk dual run):
# C-contiguous NumPy ndarrays
X_train_ndarray = np.ascontiguousarray(pdf_trainX.values)
y_train_ndarray = np.ascontiguousarray(pdf_trainY.values)

# Device-side training data: C-contiguous DeviceNDArray
# NOTE: copy_as_gpu_cmatrix is not defined or imported in this notebook;
# it must already be available in the environment for this cell to run.
X_train = copy_as_gpu_cmatrix(df_trainX)
y_train = copy_as_gpu_cmatrix(df_trainY)

# Data used for inference: C-contiguous NumPy ndarrays
X_test = np.ascontiguousarray(pdf_testX.values)
y_test = np.ascontiguousarray(pdf_testY.values)
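
The comments above stress C-contiguous (row-major) layout. The sketch below uses a small made-up array, not the epsilon data, to show what `np.ascontiguousarray` changes: the values stay the same, only the memory layout does. Whether `DataFrame.values` returns a column-major array in a given pandas version is not something this notebook establishes, so the Fortran-ordered input here is purely illustrative.

```python
import numpy as np

# Illustrative column-major (Fortran-ordered) 2x3 array
a = np.asfortranarray(np.arange(6, dtype=np.float32).reshape(2, 3))

# Row-major copy with identical values
c = np.ascontiguousarray(a)

print(a.flags['C_CONTIGUOUS'], c.flags['C_CONTIGUOUS'])
print(a.strides, c.strides)   # byte steps per axis differ between layouts
print(np.array_equal(a, c))   # the values themselves are unchanged

# If the input is already C-contiguous, no new buffer is needed
same = np.ascontiguousarray(c)
print(np.shares_memory(same, c))
```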

### Logistic Regression in pai4sk with primal formulation

In [4]:
num_threads = 256

# Create a LogisticRegression from pai4sk
from pai4sk import LogisticRegression

#primal formulation (dual - False)
lr = LogisticRegression(use_gpu=True, device_ids=[0],
                        num_threads=num_threads, class_weight=None,
                        fit_intercept=False, regularizer=100, dual=False)
# Training
t0 = time.time()
lr.fit(X_train, y_train)
print("[pai4sk] Training time (s) with pai4sk primal formulation:  {:.2f}".format(time.time()-t0))

# Evaluate average precision on the test set
pred = lr.predict_proba(X_test)[:,1]

from sklearn.metrics import average_precision_score
acc_snap = average_precision_score(y_test, pred)
print("[pai4sk] Average Precision Score :   {:.4f}".format(acc_snap))

[pai4sk] Training time (s) with pai4sk primal formulation:  0.26
[pai4sk] Average Precision Score :   0.8437
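
For readers unfamiliar with the metric, here is a minimal NumPy sketch of average precision, following the usual definition as the sum of precision values weighted by the increase in recall. It is not the library implementation: the function name `average_precision` is introduced here for illustration, labels greater than zero are treated as the positive class (an assumption that covers both 0/1 and -1/+1 labels), and tied scores are not grouped the way a library implementation would group them.

```python
import numpy as np


def average_precision(y_true, scores):
    """Average precision for binary labels, assuming no tied scores."""
    y_true = np.asarray(y_true).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    positives = (y_true > 0).astype(float)

    # Rank samples from highest to lowest score
    order = np.argsort(-scores, kind="stable")
    hits = positives[order]

    # Precision at each rank = true positives so far / rank
    precision = np.cumsum(hits) / np.arange(1, len(hits) + 1)

    # Recall rises by 1/n_pos at each positive, so AP is the mean
    # precision taken over the ranks where a positive appears
    return float(np.sum(precision * hits) / hits.sum())


ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print("{:.4f}".format(ap))
```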


### Logistic Regression in pai4sk with dual formulation

In [5]:
from pai4sk import LogisticRegression

#dual formulation
lr = LogisticRegression(use_gpu=True, device_ids=[0],
                        num_threads=num_threads, class_weight=None,
                        fit_intercept=False, regularizer=100, dual=True)
# Training
t0 = time.time()
lr.fit(X_train_ndarray, y_train_ndarray)
print("[pai4sk] Training time (s) with pai4sk dual formulation:  {:.2f}".format(time.time()-t0))


# Evaluate average precision on the test set
pred = lr.predict_proba(X_test)[:,1]

from pai4sk.metrics import average_precision_score
acc_snap = average_precision_score(y_test, pred)
print("[pai4sk] Average Precision Score :   {:.4f}".format(acc_snap))

[pai4sk] Training time (s) with pai4sk dual formulation:  0.09
[pai4sk] Average Precision Score :   0.8443
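
Every cell in this notebook repeats the same `t0 = time.time()` pattern. One possible alternative, shown as a sketch only, is a small context manager built on `time.perf_counter()`, which is a monotonic clock intended for measuring intervals. The name `timed` and the `results` dictionary are hypothetical, and the summed loop stands in for a call such as `lr.fit(...)`.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the elapsed wall-clock time of the enclosed block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed  # keep the number for later comparison
        print("[{}] Training time (s):  {:.2f}".format(label, elapsed))


timings = {}
with timed("demo", timings):
    # Placeholder workload standing in for a model's fit() call
    total = sum(i * i for i in range(100000))
```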


### Logistic Regression with scikit-learn (no native GPU support)

In [6]:
# Import sklearn's LogisticRegression from pai4sk module directly
from pai4sk.linear_model import LogisticRegressionSklearn as LogisticRegression
lr = LogisticRegression(fit_intercept=False, dual=True, tol=0.001,
                        class_weight=None, random_state=42, C=1.0/100)

# Training time
t0 = time.time()
lr.fit(X_train_ndarray, y_train_ndarray)
print("[sklearn] Training time (s) with scikit-learn (no GPU support):  {0:.2f}".format(time.time()-t0))

pred = lr.predict_proba(X_test)[:,1]

from pai4sk.metrics import average_precision_score
acc_snap = average_precision_score(y_test, pred)
print("[sklearn] Average Precision Score :   {:.4f}".format(acc_snap))


[sklearn] Training time (s) with scikit-learn (no GPU support):  2.35
[sklearn] Average Precision Score :   0.8442
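
The scikit-learn run uses `C=1.0/100` where the pai4sk runs use `regularizer=100`. The short derivation below shows why these correspond, starting from scikit-learn's documented objective for L2-penalized logistic regression without an intercept. That pai4sk's `regularizer` plays the role of $\lambda$ against an unnormalized sum of losses is an assumption inferred from the parameter choice in this notebook, not something the notebook states.

```latex
% scikit-learn (L2-penalized logistic regression, no intercept):
\min_{w}\; \frac{1}{2} w^{\top} w
  + C \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i w^{\top} x_i}\right)

% Dividing by C > 0 leaves the minimizer unchanged:
\min_{w}\; \frac{\lambda}{2} w^{\top} w
  + \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i w^{\top} x_i}\right),
\qquad \lambda = \frac{1}{C}

% With \lambda = 100: C = 1/100 = 0.01
```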


### SVM using pai4sk

In [7]:
# Create an SVM instance using pai4sk's SupportVectorMachine
from pai4sk import SupportVectorMachine
svm = SupportVectorMachine(use_gpu=True, num_threads=num_threads, 
                           class_weight=None, device_ids=[0], 
                           regularizer=2, fit_intercept=False)

# Training
t0 = time.time()
svm.fit(X_train, y_train)
print("[pai4sk] Training time (s):  {:.2f}".format(time.time()-t0))


# Inference
pred = svm.predict(X_test)

# Evaluate accuracy on test set
from pai4sk.metrics import accuracy_score
acc_snap = accuracy_score(y_test, pred)
print("[pai4sk] Accuracy: {:.3f}".format(acc_snap))

# Raw decision-function scores for the test set
pred = svm.decision_function(X_test)

[pai4sk] Training time (s):  0.22
[pai4sk] Accuracy: 0.885
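
The cell above calls both `predict` and `decision_function`. For a linear binary classifier these are commonly related by thresholding the score at zero. The sketch below illustrates that convention on made-up scores; it assumes -1/+1 labels, the helper names are hypothetical, and it does not claim to reproduce how pai4sk derives its predictions.

```python
import numpy as np


def predict_from_scores(scores, pos_label=1.0, neg_label=-1.0):
    """Map decision-function scores to class labels by their sign."""
    return np.where(np.asarray(scores) > 0, pos_label, neg_label)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return float(np.mean(np.asarray(y_true).ravel() == np.asarray(y_pred).ravel()))


# Made-up scores and labels, for illustration only
scores = np.array([-1.2, 0.3, 2.0, -0.1])
y_true = np.array([-1.0, 1.0, 1.0, 1.0])

y_pred = predict_from_scores(scores)
print(y_pred, accuracy(y_true, y_pred))
```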


### SVM using scikit-learn

In [8]:
# Another way to import: scikit-learn's LinearSVC, exposed by pai4sk as LinearSVCSklearn
from pai4sk.svm import LinearSVCSklearn as SupportVectorMachine 
svm = SupportVectorMachine(class_weight = None, fit_intercept=False)

# Training
t0 = time.time()
svm.fit(X_train, y_train)
print("[sklearn.svm] Training time (s):  {:.2f}".format(time.time()-t0))


# Inference
pred = svm.predict(X_test)


# Evaluate accuracy on test set
from pai4sk.metrics import accuracy_score
acc_snap = accuracy_score(y_test, pred)
print("[sklearn.svm] Accuracy: {:.3f}".format(acc_snap))

# Raw decision-function scores for the test set
pred = svm.decision_function(X_test)



[sklearn.svm] Training time (s):  5.71
[sklearn.svm] Accuracy: 0.890


### Ridge Regression using pai4sk

In [9]:
# Import Ridge from pai4sk.linear_model
from pai4sk.linear_model import Ridge
ridge = Ridge(use_gpu=True, device_ids=[0],
              num_threads=num_threads,
              fit_intercept=False, dual=True, tol=0.001)

# Training time
t0 = time.time()
ridge.fit(X_train, y_train)
print("[pai4sk.lmodel] Training time (s):  {0:.2f}".format(time.time()-t0))

# Inference
pred = ridge.predict(X_test)

from pai4sk.metrics import mean_squared_error
mse = mean_squared_error(y_test, pred)
print("[pai4sk.lmodel] Mean Squared Error :   {:.4f}".format(mse))


SnapML: Default values for these parameters are modified for 'snapml' solver: max_iter




[pai4sk.lmodel] Training time (s):  0.92
[pai4sk.lmodel] Mean Squared Error :   0.4459


### Ridge Regression using scikit-learn

In [10]:
# Import sklearn's Ridge from the pai4sk module directly
from pai4sk.linear_model import RidgeSklearn
ridge = RidgeSklearn(fit_intercept=False,
                     random_state=42)

# Training time
t0 = time.time()
ridge.fit(X_train_ndarray, y_train_ndarray)
print("[sklearn] Training time (s):  {0:.2f}".format(time.time()-t0))

# Inference
pred = ridge.predict(X_test)

from pai4sk.metrics import mean_squared_error
mse = mean_squared_error(y_test, pred)
print("[sklearn] Mean Squared Error :   {:.4f}".format(mse))

[sklearn] Training time (s):  6.62
[sklearn] Mean Squared Error :   0.4458
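
To put the timings recorded so far side by side, the snippet below computes the ratio of scikit-learn to pai4sk training time from the outputs printed above. These are single runs on a small subset, and the pai4sk dual logistic regression was trained on host ndarrays, so the ratios are rough indications only.

```python
# (pai4sk seconds, scikit-learn seconds), copied from the cell outputs above
timings = {
    "Logistic Regression (primal)": (0.26, 2.35),
    "Logistic Regression (dual)": (0.09, 2.35),
    "SVM": (0.22, 5.71),
    "Ridge Regression": (0.92, 6.62),
}

speedups = {name: cpu / gpu for name, (gpu, cpu) in timings.items()}

for name, ratio in speedups.items():
    print("{:<30s} {:5.1f}x".format(name, ratio))
```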


### Lasso Regression using pai4sk

In [11]:
# Import Lasso from pai4sk.linear_model
from pai4sk.linear_model import Lasso
lasso = Lasso(use_gpu=True, device_ids=[0],
              num_threads=num_threads,
              fit_intercept=False, tol=0.001)

# Training time
t0 = time.time()
lasso.fit(X_train, y_train)
print("[pai4sk.lmodel] Training time (s):  {0:.2f}".format(time.time()-t0))

# Inference
pred = lasso.predict(X_test)

from pai4sk.metrics import mean_squared_error
mse = mean_squared_error(y_test, pred)
print("[pai4sk.lmodel] Mean Squared Error :   {:.4f}".format(mse))



[pai4sk.lmodel] Training time (s):  0.66
[pai4sk.lmodel] Mean Squared Error :   0.4503


### Lasso Regression using scikit-learn

In [13]:
# Import sklearn's Lasso from the pai4sk module directly
from pai4sk.linear_model import LassoSklearn
lasso = LassoSklearn(fit_intercept=False,
                     random_state=42)

# Training time
t0 = time.time()
lasso.fit(X_train_ndarray, y_train_ndarray)
print("[sklearn] Training time (s):  {0:.2f}".format(time.time()-t0))

# Inference
pred = lasso.predict(X_test)

from pai4sk.metrics import mean_squared_error
mse = mean_squared_error(y_test, pred)
print("[sklearn] Mean Squared Error :   {:.4f}".format(mse))

[sklearn] Training time (s):  0.58
[sklearn] Mean Squared Error :   1.0000
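
An MSE of exactly 1.0000 is what a model would score if it predicted 0 for every sample and the labels were all -1 or +1, as is assumed here for the epsilon dataset. One plausible explanation, which this notebook does not confirm, is that scikit-learn's default `alpha=1.0` for Lasso is strong enough on this data to shrink every coefficient to zero. The arithmetic can be checked on made-up labels:

```python
import numpy as np

# Made-up -1/+1 labels, for illustration only
y = np.array([1.0, -1.0, -1.0, 1.0, 1.0])

# Predictions of a model whose coefficients are all zero (no intercept)
pred_zero = np.zeros_like(y)

# Every squared error is (+/-1 - 0)^2 = 1, so the mean is 1
mse_zero = float(np.mean((y - pred_zero) ** 2))
print("{:.4f}".format(mse_zero))
```

If that explanation holds, a smaller `alpha` would be needed for a like-for-like comparison with the pai4sk Lasso result.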
