# GPU Gradient Descent Solving

Many traditional machine learning algorithms can be implemented on GPU.

There may be more direct ways to use the GPU (e.g., RAPIDS) or faster solvers (e.g., L-BFGS, OWL-QN) for traditional problems.

But in some cases, deep learning toolkits can simplify otherwise challenging aspects of a problem, e.g., via embeddings:
* https://abhadury.com/articles/2020-03/embeddings-for-recommender-systems
* https://tech.instacart.com/deep-learning-with-emojis-not-math-660ba1ad6cdc

And in other cases, we may want to use a custom loss to construct a model that maximizes a business objective, rather than just the likelihood of the model params.

We'll take a look at a linear regression for the diamonds dataset using PyTorch where we want to
* minimize RMSE
* __but also__ minimize the number of undervalued diamonds

This is a simple example of a common business scenario.
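To make the two objectives concrete, here is a minimal sketch in plain Python that scores two made-up sets of predictions on both of them. The prices and the helper names `rmse` and `count_undervalued` are assumptions for illustration only; they are not part of the notebook's code.

```python
import math

def rmse(preds, labels):
    """Root mean squared error between predictions and true values."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels))

def count_undervalued(preds, labels):
    """Number of items whose predicted price is below the true price."""
    return sum(1 for p, y in zip(preds, labels) if p < y)

# Hypothetical prices, not taken from the diamonds data
true_prices = [500.0, 1200.0, 3300.0, 4100.0]
model_a = [450.0, 1150.0, 3250.0, 4050.0]  # always 50 too low
model_b = [560.0, 1260.0, 3360.0, 4160.0]  # always 60 too high

for name, preds in [("A", model_a), ("B", model_b)]:
    print(name, rmse(preds, true_prices), count_undervalued(preds, true_prices))
```

Model A wins on RMSE but undervalues every item, while model B gives up a little RMSE and undervalues none. That is the kind of trade-off a custom loss lets us steer.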

Start with data prep

In [None]:
import cudf

input_file = "data/diamonds.csv"

df = cudf.read_csv(input_file, header = 0)
df2 = df.drop(df.columns[0], axis=1)
df3 = cudf.get_dummies(df2, columns=['cut', 'color', 'clarity'])
y = df3['price'].astype('double')
X = df3.drop('price', axis=1)  # drop the target column (axis=1), not a row label

X[:3]

In [None]:
import cuml

X_train, X_test, y_train, y_test = cuml.preprocessing.model_selection.train_test_split(X, y, train_size=0.75, random_state=42)

Now we'll start with a regular linear regression

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

X_train_pyt, y_train_pyt, X_test_pyt, y_test_pyt = \
  torch.cuda.DoubleTensor(X_train.as_gpu_matrix()), \
  torch.cuda.DoubleTensor(y_train.to_gpu_array()),  \
  torch.cuda.DoubleTensor(X_test.as_gpu_matrix()),  \
  torch.cuda.DoubleTensor(y_test.to_gpu_array())

train_ds = TensorDataset(X_train_pyt, y_train_pyt[:, None])

print(len(train_ds))

__Notice: We were able to pass our data from RAPIDS to PyTorch without copying, via support for__ `__cuda_array_interface__`

### Linear Regression

In [None]:
batch_size, D_in, D_out = 200, 26, 1

model = torch.nn.Sequential(
  torch.nn.Linear(D_in, D_out).double()
).cuda()
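The model above is a single linear layer, i.e. `pred = X·w + b`. As a rough illustration of what a gradient-descent update does to such a model under MSE loss, here is a from-scratch sketch in plain Python for a one-feature case. It uses full-batch updates rather than mini-batches, and the data, the learning rate, and the helper names `sgd_step` and `mse` are all assumptions made for the example.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the line pred = w * x + b."""
    return sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sgd_step(w, b, xs, ys, lr):
    """One gradient-descent update for pred = w * x + b under MSE."""
    n = len(xs)
    errors = [(w * x + b) - y for x, y in zip(xs, ys)]
    grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n  # dMSE/dw
    grad_b = 2.0 * sum(errors) / n                             # dMSE/db
    return w - lr * grad_w, b - lr * grad_b

# Hypothetical one-feature data generated from y = 3x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]

w, b = 0.0, 0.0
loss_before = mse(w, b, xs, ys)
for _ in range(2000):
    w, b = sgd_step(w, b, xs, ys, lr=0.05)
loss_after = mse(w, b, xs, ys)

print(w, b, loss_before, loss_after)
```

In PyTorch, autograd computes the gradients and the optimizer applies the update, so none of this has to be written by hand.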

In [None]:
import pandas as pd
import numpy as np

loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

history = []
for epoch in range(250):
    batch_losses = []
    for i in range((len(train_ds) - 1) // batch_size + 1):
        xb, yb = train_ds[i * batch_size: i * batch_size + batch_size]
        pred = model(xb)
        loss = loss_fn(pred, yb)
        optimizer.zero_grad()
        loss.backward()        
        optimizer.step()
        batch_losses.append(loss.item()) # note the .item()
    epoch_loss = np.sqrt(pd.Series(batch_losses).mean()) #not 100% accurate due to batch size diff, we'll fix that later
    history.append(epoch_loss)
    if epoch % 10 == 0:
        print("Training RMSE for epoch {} = {}".format(epoch, epoch_loss))
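The comment in the loop above notes that the epoch figure is slightly off: it averages the per-batch losses equally, even though the last batch is usually smaller. One possible fix is to weight each batch loss by its batch size. The sketch below shows the idea in plain Python with toy numbers; the helpers `batch_slices` and `epoch_rmse` and the per-batch MSE values are hypothetical.

```python
import math

def batch_slices(n_rows, batch_size):
    """Yield (start, stop) index pairs covering n_rows, like the loop's slicing."""
    n_batches = (n_rows - 1) // batch_size + 1  # ceiling division, as in the loop
    for i in range(n_batches):
        start = i * batch_size
        yield start, min(start + batch_size, n_rows)

def epoch_rmse(batch_mse, batch_sizes, weighted=True):
    """Combine per-batch MSE values into a single epoch-level RMSE."""
    if weighted:
        total_sq_error = sum(m * s for m, s in zip(batch_mse, batch_sizes))
        return math.sqrt(total_sq_error / sum(batch_sizes))
    return math.sqrt(sum(batch_mse) / len(batch_mse))

# Toy setup (not the diamonds data): 450 rows in batches of 200
sizes = [stop - start for start, stop in batch_slices(450, 200)]
mse_per_batch = [4.0, 4.0, 16.0]  # hypothetical per-batch MSE values

print(sizes)
print(epoch_rmse(mse_per_batch, sizes, weighted=False))
print(epoch_rmse(mse_per_batch, sizes, weighted=True))
```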

Let's check our test set performance:

In [None]:
# make predictions on test set:
y_pred_pyt = model(X_test_pyt) #we're leaving out some "best practices" for simplicity
print(y_pred_pyt.shape)

In [None]:
print(y_test_pyt.shape)
print(y_test_pyt.unsqueeze(1).shape)

In [None]:
rmse1 = loss_fn(y_pred_pyt, y_test_pyt.unsqueeze(1)).sqrt()
print(f"Calculating Final Test RMSE: {rmse1}")

How does this compare to the std dev of the response?

In [None]:
y_test.std()
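One way to read this comparison: a model that always predicts the mean price has an RMSE equal to the population standard deviation of the prices, so the ratio RMSE/std says how much better than "predict the mean" we are doing. Below is a minimal sketch with made-up prices. The values and the helper `r_squared_from_rmse` are illustrative assumptions. Note also that pandas-style `std()` typically uses the sample formula (ddof=1), which is marginally larger; with thousands of rows the difference is negligible.

```python
import math
import statistics

def r_squared_from_rmse(rmse, std):
    """Approximate R^2 implied by an RMSE and the response's standard deviation."""
    return 1.0 - (rmse / std) ** 2

# Hypothetical prices, not taken from the diamonds data
prices = [300.0, 900.0, 2400.0, 5200.0, 11200.0]

mean_price = statistics.mean(prices)
baseline_rmse = math.sqrt(sum((mean_price - y) ** 2 for y in prices) / len(prices))
pop_std = statistics.pstdev(prices)

print(baseline_rmse, pop_std)                       # the two values agree
print(r_squared_from_rmse(0.5 * pop_std, pop_std))  # an RMSE of half the std
```

So an RMSE well below the standard deviation means the model explains most of the variance, while an RMSE close to it means the model is barely beating the mean.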

In [None]:
model[0].weight

Now, how many of those predictions were below the true price? (Recall, in our business scenario we want to minimize this case)

In [None]:
(y_pred_pyt < y_test_pyt.unsqueeze(1)).sum()

## Ok, No Surprises - Now Let's Try Something A Bit More Interesting

By customizing our loss function, we can optimize for *business loss* as opposed to the pure MSE loss.

In [None]:
batch_size, D_in, D_out = 200, 26, 1

model = torch.nn.Sequential(
  torch.nn.Linear(D_in, D_out).double()
).cuda()

Here we'll define our custom loss

In [None]:
def loss_fn(pred, label):
    sq = (label-pred)**2
    mask = (pred < label)
    sq[mask] = sq[mask]**(1.2)
    return sq.mean()

Because the penalty makes the losses (and their gradients) much larger, we need to lower the learning rate and run more epochs.
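To get a feel for the scale involved, here is a per-sample version of the custom loss in plain Python, along with the size of its derivative with respect to the prediction. The example prices and the helper names `penalized_sq_error` and `grad_magnitude` are assumptions for illustration; the 1.2 exponent is the one used in the loss above.

```python
def penalized_sq_error(pred, label, power=1.2):
    """Squared error, raised to `power` when the prediction is below the label."""
    sq = (label - pred) ** 2
    return sq ** power if pred < label else sq

def grad_magnitude(pred, label, power=1.2):
    """Absolute derivative of the per-sample loss with respect to pred."""
    err = abs(label - pred)
    if pred < label:
        # |d/dpred (err^2)^power| = 2 * power * err^(2 * power - 1)
        return 2 * power * err ** (2 * power - 1)
    return 2 * err

label = 5000.0  # hypothetical true price
over = penalized_sq_error(6000.0, label)   # overvalued by 1000
under = penalized_sq_error(4000.0, label)  # undervalued by 1000

print(under / over)                                                  # loss ratio
print(grad_magnitude(4000.0, label) / grad_magnitude(6000.0, label)) # gradient ratio
```

For an error of 1000, the penalized loss is roughly 16 times the plain squared error and its gradient roughly 19 times larger, which is in line with dropping the learning rate by an order of magnitude. One caveat of this form of penalty: for errors smaller than 1, raising the squared error to the power 1.2 actually shrinks it, so the penalty only bites when errors are measured in units where they are well above 1 (as with dollar prices here).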

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

history = []
for epoch in range(1500):
    batch_losses = []
    for i in range((len(train_ds) - 1) // batch_size + 1):
        xb, yb = train_ds[i * batch_size: i * batch_size + batch_size]      
        pred = model(xb)
        loss = loss_fn(pred, yb)
        optimizer.zero_grad()
        loss.backward()        
        optimizer.step()
        batch_losses.append(loss.item()) # note the .item()
    epoch_loss = np.sqrt(pd.Series(batch_losses).mean()) #not 100% accurate due to batch size diff, we'll fix that later
    history.append(epoch_loss)
    if epoch % 10 == 0:
        print("Training sqrt-loss for epoch {} = {}".format(epoch, epoch_loss))

In [None]:
model[0].weight

Ok, so we have this dual-objective loss...

Part 1: What is the RMSE?

In [None]:
y_pred_pyt = model(X_test_pyt)
rmse2 = torch.nn.MSELoss()(y_pred_pyt, y_test_pyt.unsqueeze(1)).sqrt()
rmse2.item()

Part 2: How many diamonds are undervalued by the model?

In [None]:
(y_pred_pyt < y_test_pyt.unsqueeze(1)).sum()

While we're here with PyTorch, we might as well try a more legit multilayer perceptron and see if we can do a little better

In [None]:
batch_size, D_in, H, D_out = 200, 26, 30, 1

model = torch.nn.Sequential(
  torch.nn.Linear(D_in, H).double(),
  torch.nn.ReLU(),
  torch.nn.Linear(H, D_out).double()
).cuda()
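For a sense of how much capacity the hidden layer adds, we can count trainable parameters by hand. This is a small arithmetic sketch in plain Python; `linear_params` is a hypothetical helper, and the layer sizes are the ones used above.

```python
def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

D_in, H, D_out = 26, 30, 1

linear_model = linear_params(D_in, D_out)                # 26 weights + 1 bias
mlp = linear_params(D_in, H) + linear_params(H, D_out)   # ReLU adds no parameters

print(linear_model)  # 27
print(mlp)           # 841
```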

In [None]:
optimizer = torch.optim.Adam(model.parameters())

history = []
for epoch in range(1500):
    batch_losses = []
    for i in range((len(train_ds) - 1) // batch_size + 1):
        xb, yb = train_ds[i * batch_size: i * batch_size + batch_size]      
        pred = model(xb)
        loss = loss_fn(pred, yb)
        optimizer.zero_grad()
        loss.backward()        
        optimizer.step()
        batch_losses.append(loss.item()) # note the .item()
    epoch_loss = np.sqrt(pd.Series(batch_losses).mean()) #not 100% accurate due to batch size diff, we'll fix that later
    history.append(epoch_loss)
    if epoch % 10 == 0:
        print("Training sqrt-loss for epoch {} = {}".format(epoch, epoch_loss))

In [None]:
y_pred_pyt = model(X_test_pyt)
rmse3 = torch.nn.MSELoss()(y_pred_pyt, y_test_pyt.unsqueeze(1)).sqrt()
rmse3.item()

In [None]:
(y_pred_pyt < y_test_pyt.unsqueeze(1)).sum()