# Linear Regression

**Linear Regression** is a simple machine learning model where the response y is modelled by a linear combination of the predictors in X.
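
To make "linear combination of the predictors" concrete, the model and the least-squares objective it minimises can be written as below. The symbols $n$ (number of samples), $p$ (number of features), $\beta_0$ (intercept) and $\beta_j$ (coefficients) are notation assumed here for illustration; they do not appear in the notebook itself.

```latex
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad
(\hat{\beta}_0, \hat{\beta}) =
  \arg\min_{\beta_0,\, \beta}
  \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
```

In this notebook, $n$ corresponds to `n_samples` and $p$ to `n_features`, and `fit_intercept=True` means $\beta_0$ is estimated along with the coefficients.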

The model accepts array-like inputs in host memory (NumPy arrays) or in device memory (Numba device arrays or other cuda_array_interface-compliant objects), as well as cuDF DataFrames.

For information about cuDF, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).

For information about cuML's linear regression API, refer to the [cuML documentation](https://docs.rapids.ai/api/cuml/stable/api.html#cuml.LinearRegression).

**NOTE:** With `n_samples = 2**20`, this notebook is not expected to run on a GPU with less than 16GB of memory. On such a GPU, set `n_samples` to `2**19`.
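
For a rough sense of scale behind this note, the sketch below estimates the size of the dense feature matrix alone. It assumes 4-byte `float32` values, which the notebook does not state, and the helper name `feature_matrix_bytes` is hypothetical. Splitting, copying to the host and fitting all need additional working memory, so treat the figures as a lower bound rather than a prediction of peak usage.

```python
n_features = 399
BYTES_PER_VALUE = 4  # assumption: float32 elements


def feature_matrix_bytes(n_samples, n_features, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to hold one dense n_samples x n_features matrix."""
    return n_samples * n_features * bytes_per_value


for exponent in (19, 20):
    size = feature_matrix_bytes(2**exponent, n_features)
    print(f"n_samples = 2**{exponent}: {size / 2**30:.2f} GiB")
# n_samples = 2**19: 0.78 GiB
# n_samples = 2**20: 1.56 GiB
```

Halving `n_samples` halves every buffer that scales with the number of rows, which is why the note suggests dropping one power of two.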

## Imports

In [1]:
import cudf
from cuml import make_regression, train_test_split
from cuml.linear_model import LinearRegression as cuLinearRegression
from cuml.metrics.regression import r2_score
from sklearn.linear_model import LinearRegression as skLinearRegression

## Define Parameters

In [4]:
n_samples = 2**19  # On a GPU with less than 16GB RAM, use 2**19 (as set here) or you could run out of memory
n_features = 399

random_state = 23

## Generate Data

In [5]:
%%time
X, y = make_regression(n_samples=n_samples, n_features=n_features, random_state=random_state)

X = cudf.DataFrame(X)
y = cudf.DataFrame(y)[0]

X_cudf, X_cudf_test, y_cudf, y_cudf_test = train_test_split(X, y, test_size=0.2, random_state=random_state)

CPU times: user 3.35 s, sys: 781 ms, total: 4.13 s
Wall time: 5.1 s


In [6]:
# Copy dataset from GPU memory to host memory.
# This is done to later compare CPU and GPU results.
X_train = X_cudf.to_pandas()
X_test = X_cudf_test.to_pandas()
y_train = y_cudf.to_pandas()
y_test = y_cudf_test.to_pandas()

## Scikit-learn Model

### Fit, predict and evaluate

In [7]:
%%time
ols_sk = skLinearRegression(fit_intercept=True,
                            normalize=True,
                            n_jobs=-1)

ols_sk.fit(X_train, y_train)

CPU times: user 47.7 s, sys: 7.05 s, total: 54.8 s
Wall time: 9.08 s


LinearRegression(n_jobs=-1, normalize=True)

In [8]:
%%time
predict_sk = ols_sk.predict(X_test)

CPU times: user 122 ms, sys: 28.4 ms, total: 151 ms
Wall time: 56.1 ms


In [9]:
%%time
r2_score_sk = r2_score(y_cudf_test, predict_sk)

CPU times: user 5.89 ms, sys: 11.6 ms, total: 17.5 ms
Wall time: 2.91 ms


## cuML Model

### Fit, predict and evaluate

In [10]:
%%time
ols_cuml = cuLinearRegression(fit_intercept=True,
                              normalize=True,
                              algorithm='eig')

ols_cuml.fit(X_cudf, y_cudf)

CPU times: user 620 ms, sys: 6.67 ms, total: 626 ms
Wall time: 634 ms


LinearRegression(algorithm='eig', fit_intercept=True, normalize=True, handle=<cuml.common.handle.Handle object at 0x7faf345095b0>, verbose=4, output_type='cudf')
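
The `algorithm='eig'` option selects a solver based on an eigendecomposition of the covariance matrix. The NumPy sketch below illustrates that general idea on the CPU: it solves the normal equations by eigendecomposing $X^\top X$ after centering. It is an illustration under assumptions, not cuML's implementation; the function name `fit_ols_eig`, the eigenvalue cutoff and the demo data are all hypothetical.

```python
import numpy as np


def fit_ols_eig(X, y):
    """Least-squares fit with an intercept via eigendecomposition of X^T X."""
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)

    # Centering X and y lets the intercept be recovered after solving for the slopes.
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean

    # X^T X is symmetric positive semi-definite, so eigh applies.
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)

    # Drop near-zero eigenvalues so a rank-deficient X does not divide by ~0.
    keep = eigvals > eigvals.max() * 1e-12
    V = eigvecs[:, keep]

    # coef = V diag(1/lambda) V^T X^T y
    coef = V @ ((V.T @ (Xc.T @ yc)) / eigvals[keep])
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Small noiseless demo so the recovered parameters can be checked directly.
rng = np.random.default_rng(23)
X_demo = rng.normal(size=(200, 3))
true_coef = np.array([1.5, -2.0, 0.5])
y_demo = X_demo @ true_coef + 4.0

coef, intercept = fit_ols_eig(X_demo, y_demo)
print(coef, intercept)
```

Working from $X^\top X$ means the decomposition operates on an `n_features` by `n_features` matrix rather than on the full data, which is cheap when there are many more samples than features. The trade-off is that forming $X^\top X$ squares the condition number of the problem.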

In [11]:
%%time
predict_cuml = ols_cuml.predict(X_cudf_test)

CPU times: user 22.6 ms, sys: 3.77 ms, total: 26.4 ms
Wall time: 25.9 ms


In [12]:
%%time
r2_score_cuml = r2_score(y_cudf_test, predict_cuml)

CPU times: user 404 µs, sys: 68 µs, total: 472 µs
Wall time: 478 µs


## Compare Results

In [13]:
print("R^2 score (SKL):  %s" % r2_score_sk)
print("R^2 score (cuML): %s" % r2_score_cuml)

R^2 score (SKL):  1.0
R^2 score (cuML): 0.6348239183425903
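
For reading the scores above, $R^2$ is the coefficient of determination: one minus the ratio of the residual sum of squares to the total sum of squares. The pure-Python sketch below shows that definition; the function name `r2` is hypothetical and this is not the `cuml.metrics` implementation.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    mean_true = sum(y_true) / len(y_true)
    # Squared error of the predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Squared error of always predicting the mean of y_true.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, y_true)        # predictions match exactly
mean_only = r2(y_true, [2.5] * 4)   # always predicts the mean
print(perfect, mean_only)
# 1.0 0.0
```

A score of 1.0 means the predictions reproduce the targets exactly, and a score of 0.0 means they do no better than predicting the mean of the targets.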
