# Intro to PyTorch

After using TensorFlow and Scikit-Learn, I'm finding that most of these frameworks have a similar feel. The key things to learn:

- How the framework wants its arrays: numpy, pandas, or a custom type?
- How to use the framework's wrappers (if there are any)
- Row or column orientation: are training examples stacked as rows or as columns?
- Does it expect one-hot encoded Y (expected) results?
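
For PyTorch, the last two questions can be sketched as follows. This is a minimal illustration on made-up data (the `features` and `labels` values are assumptions, not the Gaga data set used later): examples are stacked as rows, and `torch.nn.functional.one_hot` builds one-hot targets from integer class labels.

```python
import torch
import torch.nn.functional as tfun

# Hypothetical toy data: 4 training examples (rows) with 3 features (columns).
features = torch.tensor([[0.1, 0.2, 0.3],
                         [0.4, 0.5, 0.6],
                         [0.7, 0.8, 0.9],
                         [1.0, 1.1, 1.2]])
labels = torch.tensor([0, 1, 1, 0])   # integer class ids (int64 by default)

# One-hot encode the labels: one row per example, one column per class.
one_hot = tfun.one_hot(labels, num_classes=2)

print(features.shape)   # rows are examples, columns are features
print(one_hot)
```

The notebook cells further down use `pd.get_dummies` for the same purpose; either route gives one column per class.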

PyTorch has its own set of wrappers, so you have to convert between numpy ndarrays and torch structures. Everything in Torch should be in Tensor structures.
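
One detail worth knowing about the conversion is whether memory is shared. The sketch below, on a made-up array, contrasts `torch.from_numpy` (which wraps the existing numpy buffer) with `torch.tensor` (which copies); the variable names are illustrative only.

```python
import numpy as np
import torch

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)

shared = torch.from_numpy(arr)   # wraps the same memory as arr
copied = torch.tensor(arr)       # makes an independent copy

arr[0, 0] = 100                  # modify the numpy array in place

print(shared[0, 0].item())       # the shared tensor sees the change
print(copied[0, 0].item())       # the copy keeps the original value

back = shared.numpy()            # CPU tensor -> ndarray, again without copying
print(type(back), back.dtype)
```

If the numpy data will be modified later, the copying form is the safer default.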


In [2]:
import torch
import numpy

torchTensorA = torch.from_numpy(numpy.array([[1,2,3],[4,5,6]]))
torchTensorB = torch.from_numpy(numpy.array([8,8,8]))

ttC1 = torchTensorA + torchTensorB
ttC2 = torch.add(torchTensorA,torchTensorB)

print (ttC1, '\n', ttC2, '\n', ttC1==ttC2)

np1 = ttC1.numpy()
print (np1, type(np1))

tensor([[  9,  10,  11],
        [ 12,  13,  14]], dtype=torch.int32) 
 tensor([[  9,  10,  11],
        [ 12,  13,  14]], dtype=torch.int32) 
 tensor([[ 1,  1,  1],
        [ 1,  1,  1]], dtype=torch.uint8)
[[ 9 10 11]
 [12 13 14]] <class 'numpy.ndarray'>


In [3]:
# Torch has CUDA GPU support.
print ('cuda?', torch.cuda.is_available())

# Using it isn't too hard, but TF seems easier: https://pytorch.org/docs/stable/notes/cuda.html
cuda = torch.device("cuda:0")
if torch.cuda.is_available():
    x = torch.empty((8, 42), device=cuda)  # allocated on the GPU, only if one is present


cuda? False
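
A common way to keep a notebook runnable with or without a GPU is to pick the device once and pass it everywhere. The sketch below is an assumption about how one might structure that, not something taken from the cells above; the tensor shapes are arbitrary.

```python
import torch

# Fall back to the CPU when no CUDA device is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.ones((8, 42), device=device)
w = torch.ones((42, 1), device=device)
out = x.mm(w)                      # runs on whichever device was selected

# Tensors must be moved back to the CPU before converting to numpy.
result = out.cpu().numpy()
print(device, result.shape)
```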


## Basic Logistic Regression

Yet another logistic regression example!


In [7]:
import torch, os, sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import numpy as np
from sklearn.utils import shuffle
from myutils import *
    
# boilerplate: set up my data
data, yarr, features, fnames = getGagaData(maxrows=200, maxfeatures=2000, gtype=None, stopwords='english')
xMatrix = shuffle(data, random_state=0)
yArr = shuffle(np.array(yarr).reshape(-1,1), random_state=0)
partition = int(.70*len(yArr))
trainingX = xMatrix[:partition]
trainingY = yArr[:partition]
testX = xMatrix[partition:]
testY = yArr[partition:]

# Wrap the training/test data in torch tensors
dtype = torch.float
x = torch.tensor(trainingX, dtype=dtype)
y = torch.tensor(trainingY, dtype=dtype)  
testx = torch.tensor(testX, dtype=dtype)
testy = torch.tensor(testY, dtype=dtype)  

# Randomly initialize weights
torch.manual_seed(0)
w1 = torch.randn(len(features),1, dtype=dtype)

# gradient descent
learning_rate = 0.1
for t in range(25000):
    h = x.mm(w1)               # each feature * weight
    y_pred = h.sigmoid()

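    # Compute and print the squared-error loss (tracked for reporting only)
    loss = (y_pred - y).pow(2).sum().item()  # .item() unwraps the 1-element tensor into a Python float
    if (t % 5000 == 0):
        log.warn('loop %d, %.8f'%(t, loss))

    # gradient of the cross-entropy loss w.r.t. w1 (not of the squared error above)
    grad_w1 = x.t().mm(y_pred - y)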

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1

print('training complete ', loss)
print('test validation phase')

h = testx.mm(w1)               # matrix multiply: one dot product per test row
y_pred = h.sigmoid().round()

# y_pred = h.mm(w2)
# y_pred_sig = y_pred.sigmoid()
log.debug('ytest', pandas.DataFrame(testy.numpy()).head())
log.debug('ypred', pandas.DataFrame(y_pred.numpy()).head())

# Compute and print loss after rounding to 0/1's
testDiffs = (y_pred - testy)
p = pandas.DataFrame(testDiffs.numpy())
log.debug('diffs', p.head())
tests = len(p)
correct = len(p[(p[0] == 0)])
print('total correct/tests', correct, tests)
print('correct % =', round((correct/tests)*100, 2))




training complete  3.308371532284582e-08
test validation phase
total correct/tests 47 60
correct % = 78.33


I won't bore everyone with the details above since it's documented in prior notebooks.
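
One detail is worth a quick check, though. The update used in the loop above, `x.t().mm(y_pred - y)`, is the gradient of the summed cross-entropy loss rather than of the squared error that gets printed. The sketch below confirms this against autograd on synthetic data; the shapes, seed and variable names are assumptions for illustration.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in data: 50 examples, 3 features, random 0/1 labels.
x = torch.randn(50, 3, dtype=torch.float64)
y = (torch.rand(50, 1, dtype=torch.float64) > 0.5).double()
w = torch.randn(3, 1, dtype=torch.float64, requires_grad=True)

# Forward pass and summed binary cross-entropy loss.
p = x.mm(w).sigmoid()
bce = -(y * p.log() + (1 - y) * (1 - p).log()).sum()
bce.backward()                      # autograd fills in w.grad

# The closed-form gradient used in the training loop.
manual_grad = x.t().mm(p.detach() - y)

grads_match = torch.allclose(w.grad, manual_grad)
print('gradients match:', grads_match)
```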

-----

Onto the next step, <B>Torch Neural Networks</B>, doing it the manual way first (manually calculating gradients/backprop). First, though, a basic intro to neural networks.

If logistic regression looks like this:   
![title](img/log-reg.png)


A neural network adds one more layer (H<sub>1</sub> and H<sub>2</sub>), like this: 
![title](img/log-reg-nn.png)

----

That's not the clearest diagram, but I'll clean it up later. Calibrating all the weights to minimize the total error/cost is significantly more complicated with a multi-layer network. To calculate the effect of w1 or w7 on the cost function, you need the partial derivatives and the chain rule for dependent variables, e.g.:

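\begin{align}
 \frac{\partial C }{\partial w7} &= \frac{\partial C }{\partial g} \cdot \frac{\partial g }{\partial h} \cdot \frac{\partial h }{\partial w7} \\
 \frac{\partial C }{\partial w1} &= \frac{\partial C }{\partial g} \cdot \frac{\partial g }{\partial h} \cdot \frac{\partial h }{\partial H} \cdot \frac{\partial H }{\partial w1}
\end{align}

This assumes h is the weighted sum into the output node, g is its sigmoid activation, and H is the output of the hidden node that w1 feeds.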
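
To make the chain rule concrete, here is a small numeric check on a single-path network (one input, one hidden node, one output). The symbol meanings are assumptions for illustration: `H` is the hidden activation, `h` the weighted sum into the output, `g` the output activation, and the cost is the squared error. The analytic chain-rule gradients are compared with finite differences.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w1, w7, x=0.5, y=1.0):
    H = sigmoid(w1 * x)      # hidden activation
    h = w7 * H               # weighted sum into the output node
    g = sigmoid(h)           # output activation
    return (g - y) ** 2      # squared-error cost

def analytic_grads(w1, w7, x=0.5, y=1.0):
    H = sigmoid(w1 * x)
    h = w7 * H
    g = sigmoid(h)
    dC_dg = 2.0 * (g - y)
    dg_dh = g * (1.0 - g)                             # sigmoid derivative at the output
    dC_dw7 = dC_dg * dg_dh * H                        # dh/dw7 = H
    dC_dw1 = dC_dg * dg_dh * w7 * H * (1.0 - H) * x   # dh/dH = w7, dH/dw1 = H(1-H)x
    return dC_dw1, dC_dw7

def numeric_grads(w1, w7, eps=1e-6):
    # Central finite differences, one weight at a time.
    d1 = (cost(w1 + eps, w7) - cost(w1 - eps, w7)) / (2 * eps)
    d7 = (cost(w1, w7 + eps) - cost(w1, w7 - eps)) / (2 * eps)
    return d1, d7

a1, a7 = analytic_grads(0.3, -0.8)
n1, n7 = numeric_grads(0.3, -0.8)
print('dC/dw1 analytic vs numeric:', a1, n1)
print('dC/dw7 analytic vs numeric:', a7, n7)
```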

In [7]:
# example using some basic bits of PyTorch creates a 500-20-2 neural net for the Gaga text classifier
# and manually calculates the backprop gradients

dtype, device = torch.float,torch.device("cpu")

F,H, D_out = 500,20, 2 # features, hiddennodes, output nodes
D_in = F

# standard training/test data setup
data, yarr, features, fnames = getGagaData(maxrows=500, maxfeatures=F, gtype=None, stopwords='english')
xMatrix = shuffle(data, random_state=0)
yArr = shuffle(yarr, random_state=0)

partition = int(.70*len(yArr))
trainingX = xMatrix[:partition]
trainingY = yArr[:partition]
testX = xMatrix[partition:]
testY = yArr[partition:]

# add torch wrappers
x = torch.tensor(trainingX, dtype=dtype)
y = torch.tensor(pd.get_dummies(trainingY).values, dtype=dtype)  # onehot y's
testx = torch.tensor(testX, dtype=dtype)
testy = torch.tensor(pd.get_dummies(testY).values, dtype=dtype)  # onehot y's

# Randomly initialize weights, repeatable w/ seed
np.random.seed(0)
w1 = torch.tensor(np.random.rand(D_in, H), dtype=dtype, requires_grad=False)
w2 = torch.tensor(np.random.rand(H, D_out), dtype=dtype, requires_grad=False)

#gradient descent
learning_rate = 0.005
for t in range(1500):
    # Forward pass: compute predicted y
    h = x.mm(w1)               # matrix multiply: (m x F) by (F x H) gives (m x H)
    h_sig = h.sigmoid()        
    y_pred = h_sig.mm(w2).sigmoid()

    loss = (y_pred - y).pow(2).sum().item()  # .item() unwraps the tensor into a Python float
    if (t % 200 == 0):
        print(t, loss)
        if (loss < 0.0001):
            break

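    # Manual backprop. NOTE: the sigmoid derivative terms, y_pred*(1-y_pred) and
    # h_sig*(1-h_sig), are left out here, so these are not the exact gradients of
    # the loss above; the autograd version further down includes them.
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_sig.t().mm(grad_y_pred) 
    grad_h_sig = grad_y_pred.mm(w2.t())
    grad_h = grad_h_sig.clone()
    grad_w1 = x.t().mm(grad_h)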

    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2

print('training complete ',loss)

h = testx.mm(w1)               # matrix multiply, same as in training
h_sig = h.sigmoid()
y_pred = h_sig.mm(w2).sigmoid()
y_pred2 = torch.tensor(pd.get_dummies(y_pred.argmax(dim=1)).values, dtype=dtype)

print('predictions\n')
display(pandas.DataFrame(y_pred.numpy()).head())

# Compute and print loss after rounding to 0/1's
testDiffs = (y_pred2 - testy)
p = pandas.DataFrame(testDiffs.numpy())
tests = len(p)
correct = len(p[(p[0]==0) & (p[1]==0)])
print('total correct/tests',correct,tests)
print('correct % =', round((correct/tests)*100,2))
print('hidden nodes %d'%H)




0 212.7218780517578
200 15.184028625488281
400 13.640188217163086
600 12.888275146484375
800 13.629972457885742
1000 13.452261924743652
1200 13.511369705200195
1400 14.130924224853516
training complete  14.127641677856445
predictions



|   | 0 | 1 |
|---|---|---|
| 0 | 0.999103 | 0.000926 |
| 1 | 0.02302 | 0.978216 |
| 2 | 0.952783 | 0.051358 |
| 3 | 0.001234 | 0.99877 |
| 4 | 0.99972 | 0.000305 |


total correct/tests 64 92
correct % = 69.57
hidden nodes 20


In [9]:
## Same as above, but using PyTorch autodiff to do backprop and calculate gradients automatically using a dependency graph
## notice the results are slightly different: the manual version above leaves out the sigmoid derivative terms, autograd does not

dtype, device = torch.float, torch.device("cpu")
F, H = 500, 20   # features, hidden nodes
D_in, D_out = F, 2  # 500 inputs, 2 output nodes

#boilerplate stuff preparing the training/test data
data, yarr, features, fnames = getGagaData(maxrows=500, maxfeatures=F, gtype=None, stopwords='english')
xMatrix = shuffle(data, random_state=0)
yArr = shuffle(yarr, random_state=0)

partition = int(.70*len(yArr))
trainingX = xMatrix[:partition]
trainingY = yArr[:partition]
testX = xMatrix[partition:]
testY = yArr[partition:]

# add torch wrappers
x = torch.tensor(trainingX, dtype=dtype)
y = torch.tensor(pd.get_dummies(trainingY).values, dtype=dtype)  # onehot y's
testx = torch.tensor(testX, dtype=dtype)
testy = torch.tensor(pd.get_dummies(testY).values, dtype=dtype)  # onehot y's

# Randomly initialize weights, repeatable w/ seed, put w1/w2 on graph
np.random.seed(0)
w1 = torch.tensor(np.random.rand(D_in, H), dtype=dtype, requires_grad=True)   # requires_grad = True puts on graph
w2 = torch.tensor(np.random.rand(H, D_out), dtype=dtype, requires_grad=True)  # requires_grad = True puts on graph

#gradient descent using autograd
learning_rate = 0.005
for t in range(1500):
    # Forward pass: compute predicted y
    h = x.mm(w1).sigmoid()               # matrix multiply, then sigmoid
    y_pred = h.mm(w2).sigmoid()

    loss = (y_pred - y).pow(2).sum()  # kept as a tensor so backward() can run on it
    if (t % 200 == 0):
        print(t, loss.item())
        if (loss < 0.0001):
            break

    # autograd backprop ** only real difference
    loss.backward()  # goes thru graph and calculates all .grad values magically !!
    with torch.no_grad():  # take off graph for this step
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad
        w1.grad.zero_()
        w2.grad.zero_()

print('training complete ', loss)

h = testx.mm(w1).sigmoid()               # matrix multiply, then sigmoid
y_pred = h.mm(w2).sigmoid()
y_pred2 = torch.tensor(pd.get_dummies(y_pred.argmax(dim=1)).values, dtype=dtype)

print('ypred')
display(pandas.DataFrame(y_pred.detach().numpy()).head())
testDiffs = (y_pred2 - testy)
p = pandas.DataFrame(testDiffs.numpy())
tests = len(p)
correct = len(p[(p[0] == 0) & (p[1] == 0)])
print('total correct/tests', correct, tests)
print('correct % =', round((correct/tests)*100, 2))
print('hidden nodes %d' % H)




0 212.7218780517578
200 170.50454711914062
400 70.86431884765625
600 48.30420684814453
800 31.754566192626953
1000 23.829137802124023
1200 19.80072784423828
1400 17.226394653320312
training complete  tensor(16.4909)
ypred


|   | 0 | 1 |
|---|---|---|
| 0 | 0.954171 | 0.043588 |
| 1 | 0.003371 | 0.995483 |
| 2 | 0.977425 | 0.024113 |
| 3 | 0.005237 | 0.994935 |
| 4 | 0.999485 | 0.000576 |


total correct/tests 70 92
correct % = 76.09
hidden nodes 20


The only real change is setting requires_grad=True on w1 and w2, then removing the manual backprop steps and just reading the .grad values. Marking w1 and w2 this way means everything computed from them (h, y_pred, loss) is added to the graph automatically. Once a tensor is on the graph you can't update it in place until you step off the graph with torch.no_grad() (a headache, yes). Being on the graph is what lets PyTorch take the partial derivatives and calculate every gradient automatically, which is amazing/magical and just plain convenient.
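
The manual and autograd runs above report different losses even though they start from the same weights. A likely reason is that the manual cell skips the sigmoid derivative at both layers. The sketch below, on synthetic data with made-up shapes, adds those terms back and checks the hand-written gradients against autograd; it is an illustration under those assumptions, not a rerun of the Gaga experiment.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in data: 30 examples, 5 features, one-hot targets for 2 classes.
x = torch.rand(30, 5, dtype=torch.float64)
y = torch.eye(2, dtype=torch.float64)[torch.randint(0, 2, (30,))]
w1 = torch.rand(5, 4, dtype=torch.float64, requires_grad=True)
w2 = torch.rand(4, 2, dtype=torch.float64, requires_grad=True)

# Forward pass and autograd gradients.
h_sig = x.mm(w1).sigmoid()
y_pred = h_sig.mm(w2).sigmoid()
loss = (y_pred - y).pow(2).sum()
loss.backward()

# Manual backprop including the sigmoid derivative s * (1 - s) at each layer.
with torch.no_grad():
    grad_y_pred = 2.0 * (y_pred - y)
    grad_z2 = grad_y_pred * y_pred * (1 - y_pred)   # through the output sigmoid
    grad_w2 = h_sig.t().mm(grad_z2)
    grad_h_sig = grad_z2.mm(w2.t())
    grad_z1 = grad_h_sig * h_sig * (1 - h_sig)      # through the hidden sigmoid
    grad_w1 = x.t().mm(grad_z1)

w1_match = torch.allclose(w1.grad, grad_w1)
w2_match = torch.allclose(w2.grad, grad_w2)
print('w1 gradients match:', w1_match)
print('w2 gradients match:', w2_match)
```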

----

The results above aren't great (roughly 70-76% correct), but with only 1500 iterations that's OK. How neural network backpropagation works needs some focus if you want to understand it. It is complex enough that a few videos are worth watching (the series by 3Blue1Brown, https://www.youtube.com/watch?v=Ilg3gGewQ5U&feature=youtu.be, is highly recommended).



------

# PyTorch Neural Network

PyTorch has its own neural network framework in torch.nn -- Module and its subclasses like Linear, Conv1d/Conv2d/Conv3d, RNN, etc.

First define the Network: 500 input - 20 hidden - 1 output

In [11]:
import torch.nn as nn
import torch.nn.functional as tfun

# extends nn.Module with a fully connected 500:20:1 model, using sigmoid activations
class GagaNet(nn.Module):
    def __init__(self):
        super(GagaNet, self).__init__()
        self.inp = nn.Linear(500, 20, bias=False)   # fully connected linear layers - 500 input to 20 output
        self.hid = nn.Linear(20, 1, bias=False)     # fc layer for hidden 20 input nodes, 1 output

    # forward prop -wrap both steps in sigmoid
    def forward(self, x):
        x = tfun.sigmoid(self.inp(x))  # oddly works better w/o this sigmoid
        x = tfun.sigmoid(self.hid(x))  
        return x

net = GagaNet()
print(net)

GagaNet(
  (inp): Linear(in_features=500, out_features=20, bias=False)
  (hid): Linear(in_features=20, out_features=1, bias=False)
)
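
For a plain stack of layers like this, `nn.Sequential` is a shorter alternative to writing a `Module` subclass. The sketch below builds the same 500-20-1 shape; the name `seq_net` and the random sample input are assumptions for illustration, and the weights are freshly initialized rather than shared with `net`.

```python
import torch
import torch.nn as nn

# Same layer sizes as GagaNet: 500 inputs, 20 hidden nodes, 1 output, no bias terms.
seq_net = nn.Sequential(
    nn.Linear(500, 20, bias=False),
    nn.Sigmoid(),
    nn.Linear(20, 1, bias=False),
    nn.Sigmoid(),
)

# Count the trainable weights: 500*20 + 20*1.
n_weights = sum(p.numel() for p in seq_net.parameters())

# Push a made-up batch of 4 rows through the network.
sample = torch.rand(4, 500)
out = seq_net(sample)
print(n_weights, out.shape)
```

A subclass is still the better fit once the forward pass needs branching or anything beyond a straight chain of layers.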


Next, set up the training data -- same boring stuff, but wrapped in torch.tensor

In [12]:
# get the data (training and test rows together; split below)
data, yarr, features, fnames = getGagaData(maxrows=500, maxfeatures=500, gtype=None, stopwords='english')

xMatrix = shuffle(data, random_state=0)
yArr = shuffle(yarr, random_state=0)
partition = int(.70*len(yArr))
trainingX = xMatrix[:partition]
trainingY = yArr[:partition]
testX = xMatrix[partition:]
testY = yArr[partition:]

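# NOTE: input/target are built from the full xMatrix/yArr, so the test rows below are
# also seen during training; use trainingX/trainingY here for a clean hold-out.
input = torch.tensor(xMatrix, dtype=torch.float)             # m x 500
target = torch.tensor(yArr, dtype=torch.float).view(-1,1)    # m x 1
test_input = torch.tensor(testX, dtype=torch.float)          # m_test x 500
test_target = torch.tensor(testY, dtype=torch.float).view(-1,1)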




In [25]:
# run the NN training
import torch.optim as optim
torch.set_printoptions(threshold=10)   # limit tensor printing

criterion = nn.MSELoss()   # loss function mean-square-error
optimizer = optim.SGD(net.parameters(), lr=0.01) # SGD optimizer (each step here uses the full batch)

# training loop
for epoch in range(2500):
    optimizer.zero_grad()   # zero the gradient buffers
    output = net(input)     # forward prop __call__ 
    loss = criterion(output, target)   # compute the loss
    loss.backward()         # auto backward prop
    optimizer.step()        # does the update of applying gradients to weights
    if (epoch % 500 == 0):
        print('Epoch %s, mean loss: %s' % (epoch, loss))
        print('weights spot check',net.inp.weight[0])
print('-------------\ntraining done\n-----------\n')

# now do testing
test_res = net(test_input)   # call the module rather than .forward() directly
test_res_round = test_res.round()
test_diff = test_res_round - test_target
print ('predict sigmoid', test_res.view(1,-1))  
print ('predict rounded', test_res_round.view(1,-1))  
print ('expected output', test_target.view(1,-1))  
print ('err %d / total %d = %2f accuracy '%(test_diff.abs().sum().item(), len(test_diff),(len(test_diff)- test_diff.abs().sum().item()) / (len(test_diff))))


Epoch 0, mean loss: tensor(0.1015)
weights spot check tensor([ 0.0064, -0.0239, -0.0181,  ...,  0.0434, -0.0041,  0.0255])
Epoch 500, mean loss: tensor(0.1004)
weights spot check tensor([ 0.0060, -0.0246, -0.0188,  ...,  0.0434, -0.0036,  0.0255])
Epoch 1000, mean loss: tensor(1.00000e-02 *
       9.9337)
weights spot check tensor([ 0.0056, -0.0253, -0.0194,  ...,  0.0434, -0.0032,  0.0255])
Epoch 1500, mean loss: tensor(1.00000e-02 *
       9.8291)
weights spot check tensor([ 0.0051, -0.0260, -0.0200,  ...,  0.0434, -0.0028,  0.0256])
Epoch 2000, mean loss: tensor(1.00000e-02 *
       9.7262)
weights spot check tensor([ 0.0047, -0.0266, -0.0206,  ...,  0.0434, -0.0024,  0.0256])
-------------
training done
-----------

predict sigmoid tensor([[ 0.4157,  0.9996,  0.0714,  ...,  0.4837,  0.6227,  0.6707]])
predict rounded tensor([[ 0.,  1.,  0.,  ...,  0.,  1.,  1.]])
expected output tensor([[ 1.,  1.,  0.,  ...,  1.,  1.,  1.]])
err 10 / total 92 = 0.891304 accuracy 


Pretty easy -- similar to TF, it's all about figuring out the inputs, wrapping them, then making the macro calls to train your model.
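
Those steps can be sketched end to end on synthetic data. Everything below is an assumption for illustration (the data, layer sizes, learning rate and epoch count are made up), but it follows the same wrap, train, evaluate pattern, and it trains only on the training split so the test rows stay unseen.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic stand-in data: 200 rows, 10 features; label is 1 when the feature sum is positive.
X = torch.randn(200, 10)
Y = (X.sum(dim=1, keepdim=True) > 0).float()
train_X, train_Y = X[:140], Y[:140]     # 70% for training
test_X, test_Y = X[140:], Y[140:]       # 30% held out

model = nn.Sequential(nn.Linear(10, 8), nn.Sigmoid(), nn.Linear(8, 1), nn.Sigmoid())
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.5)

losses = []
for epoch in range(300):
    optimizer.zero_grad()                       # clear old gradients
    loss = criterion(model(train_X), train_Y)   # forward pass on training rows only
    loss.backward()                             # backprop
    optimizer.step()                            # apply the update
    losses.append(loss.item())

# Evaluate on the held-out rows without tracking gradients.
with torch.no_grad():
    predictions = model(test_X).round()
    accuracy = (predictions == test_Y).float().mean().item()

print('first/last training loss', losses[0], losses[-1])
print('held-out accuracy', accuracy)
```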

