[View in Colaboratory](https://colab.research.google.com/github/christopher-ell/Deep_Learning_Begin/blob/master/OPT5_DeepLearning_for_NLP.ipynb)

Official PyTorch Tutorials - Deep Learning for NLP

Source: https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html

In [1]:
## This file was created in Google Colaboratory, so the libraries and data need to be downloaded at the start of each session
!pip install torch

Collecting torch
  Downloading https://files.pythonhosted.org/packages/df/a4/7f5ec6e9df1bf13f1881353702aa9713fcd997481b26018f35e0be85faf7/torch-0.4.0-cp27-cp27mu-manylinux1_x86_64.whl (484.0MB)
Installing collected packages: torch
Successfully installed torch-0.4.0


In [2]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f087f977870>

**Creating Tensors**

- Tensors can be created from Python lists with the torch.tensor() function

In [3]:
# torch.tensor(data) creates a torch.Tensor object with the given data.
V_data = [1., 2., 3.]
V = torch.tensor(V_data)
print(V)

# Creates a Matrix
M_data = [[1., 2., 3.], [4., 5., 6.]]
M = torch.tensor(M_data)
print(M)

# Create a 3D tensor of size 2x2x2.
T_data = [[[1., 2.], [3., 4.]],
         [[5., 6.], [7., 8.]]]
T = torch.tensor(T_data)
print(T)

tensor([ 1.,  2.,  3.])
tensor([[ 1.,  2.,  3.],
        [ 4.,  5.,  6.]])
tensor([[[ 1.,  2.],
         [ 3.,  4.]],

        [[ 5.,  6.],
         [ 7.,  8.]]])
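
As a quick check on the tensors created above, the sketch below rebuilds them and prints each one's shape, number of axes and dtype. The loop and the integer tensor `I` are illustrative additions, not part of the original tutorial, and the dtypes shown assume PyTorch's default settings.

```python
import torch

# Same nested lists as in the cell above
V = torch.tensor([1., 2., 3.])
M = torch.tensor([[1., 2., 3.], [4., 5., 6.]])
T = torch.tensor([[[1., 2.], [3., 4.]],
                  [[5., 6.], [7., 8.]]])

for name, t in [("V", V), ("M", M), ("T", T)]:
    # .shape is the size along each axis, .dim() is the number of axes
    print(name, tuple(t.shape), t.dim(), t.dtype)

# Integer literals give an integer tensor instead of a float one
I = torch.tensor([1, 2, 3])
print(I.dtype)
```

The nesting depth of the Python list determines the number of axes, and the literal type (float or int) determines the dtype.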


In [4]:
# Index into V and get a scalar (0-dimensional tensor)
print(V[0])

# Get a python number from it
print(V[0].item())

# Index into M and get a vector
print(M[0])

# Index into T and get a matrix
print(T[0])

tensor(1.)
1.0
tensor([ 1.,  2.,  3.])
tensor([[ 1.,  2.],
        [ 3.,  4.]])


In [5]:
x = torch.randn((3, 4, 5))
print(x)

tensor([[[-1.5256, -0.7502, -0.6540, -1.6095, -0.1002],
         [-0.6092, -0.9798, -1.6091, -0.7121,  0.3037],
         [-0.7773, -0.2515, -0.2223,  1.6871,  0.2284],
         [ 0.4676, -0.6970, -1.1608,  0.6995,  0.1991]],

        [[ 0.8657,  0.2444, -0.6629,  0.8073,  1.1017],
         [-0.1759, -2.2456, -1.4465,  0.0612, -0.6177],
         [-0.7981, -0.1316,  1.8793, -0.0721,  0.1578],
         [-0.7735,  0.1991,  0.0457,  0.1530, -0.4757]],

        [[-0.1110,  0.2927, -0.1578, -0.0288,  0.4533],
         [ 1.1422,  0.2486, -1.7754, -0.0255, -1.0233],
         [-0.5962, -1.0055,  0.4285,  1.4761, -1.7869],
         [ 1.6103, -0.7040, -0.1853, -0.9962, -0.8313]]])


**Operations with Tensors**
- You can operate on tensors in the ways you would expect

In [6]:
x = torch.tensor([1., 2., 3.])
y = torch.tensor([4., 5., 6.])
z = x + y
print(z)

tensor([ 5.,  7.,  9.])


In [7]:
# By default, it concatenates along the first axis (concatenates rows)
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])
print(z_1)

# Concatenate columns:
x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
# Second arg specifies which axis to concat along
z_2 = torch.cat([x_2, y_2], 1)
print(z_2)

# If your tensors are not compatible, torch will complain. Uncomment to see the 
# error 
# torch.cat([x_1, x_2])

tensor([[-0.8029,  0.2366,  0.2857,  0.6898, -0.6331],
        [ 0.8795, -0.6842,  0.4533,  0.2912, -0.8317],
        [-0.5525,  0.6355, -0.3968, -0.6571, -1.6428],
        [ 0.9803, -0.0421, -0.8206,  0.3133, -1.1352],
        [ 0.3773, -0.2824, -2.5667, -1.4303,  0.5009]])
tensor([[ 0.5438, -0.4057,  1.1341, -0.1473,  0.6272,  1.0935,  0.0939,
          1.2381],
        [-1.1115,  0.3501, -0.7703, -1.3459,  0.5119, -0.6933, -0.1668,
         -0.9999]])
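
The rule behind the concatenation cell above is that all axes except the one being concatenated must match. The following is a minimal sketch of that rule using the same shapes; the variable names `rows` and `cols` and the `try`/`except` wrapper are illustrative, and the exact error message depends on the PyTorch version.

```python
import torch

x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
rows = torch.cat([x_1, y_1], dim=0)   # stack rows: 2 + 3 rows, 5 columns
print(rows.shape)

x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
cols = torch.cat([x_2, y_2], dim=1)   # stack columns: 2 rows, 3 + 5 columns
print(cols.shape)

# (2, 5) and (2, 3) disagree on the column count, so they cannot be
# concatenated along axis 0
try:
    torch.cat([x_1, x_2])
    failed = False
except RuntimeError as err:
    failed = True
    print("torch.cat raised:", type(err).__name__)
```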


**Reshaping Tensors**
- Use the .view() method to reshape tensors.
- .view() is heavily used because many neural networks expect their inputs to have a certain shape

In [8]:
x = torch.randn(2, 3, 4)
print(x)
print(x.view(2, 12)) # Reshape to 2 rows, 12 columns
# Same as above. If one of the dimensions is -1, its size can be inferred
print(x.view(2, -1))

tensor([[[ 0.4175, -0.2127, -0.8400, -0.4200],
         [-0.6240, -0.9773,  0.8748,  0.9873],
         [-0.0594, -2.4919,  0.2423,  0.2883]],

        [[-0.1095,  0.3126,  1.5038,  0.5038],
         [ 0.6223, -0.4481, -0.2856,  0.3880],
         [-1.1435, -0.6512, -0.1032,  0.6937]]])
tensor([[ 0.4175, -0.2127, -0.8400, -0.4200, -0.6240, -0.9773,  0.8748,
          0.9873, -0.0594, -2.4919,  0.2423,  0.2883],
        [-0.1095,  0.3126,  1.5038,  0.5038,  0.6223, -0.4481, -0.2856,
          0.3880, -1.1435, -0.6512, -0.1032,  0.6937]])
tensor([[ 0.4175, -0.2127, -0.8400, -0.4200, -0.6240, -0.9773,  0.8748,
          0.9873, -0.0594, -2.4919,  0.2423,  0.2883],
        [-0.1095,  0.3126,  1.5038,  0.5038,  0.6223, -0.4481, -0.2856,
          0.3880, -1.1435, -0.6512, -0.1032,  0.6937]])
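
One detail worth knowing about .view() is that it does not copy data: the reshaped tensor looks at the same underlying storage. The sketch below illustrates this with a tensor built from `torch.arange` (chosen here only so the values are easy to follow; it is not part of the original tutorial).

```python
import torch

# 24 consecutive values arranged as 2 x 3 x 4
x = torch.arange(24.).view(2, 3, 4)

# -1 asks PyTorch to infer the remaining size: 24 / 2 = 12
flat = x.view(2, -1)
print(flat.shape)

# Writing through the view changes the original tensor as well
flat[0, 0] = 100.
print(x[0, 0, 0].item())
```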


**Computation Graphs and Automatic Differentiation**
- A computation graph matters because it means you do not have to write the backpropagation gradients yourself
- A computation graph is a specification of how data is combined to give your output, so it contains enough information to compute derivatives
- You can see what is going on by using the requires_grad flag

- A torch.Tensor object stores its data, its shape and a few other things
- When two tensors are added, we get an output tensor. By default, all the output tensor knows is its data and shape; it has no idea that it was the sum of two other tensors
- If requires_grad=True, the tensor also keeps track of how it was created

In [9]:
# Tensor factory methods have a "requires_grad" flag
x = torch.tensor([1., 2., 3.], requires_grad=True)

# With requires_grad=True, you can still do all the operations you previously 
# could
y = torch.tensor([4., 5., 6.], requires_grad=True)
z = x + y
print(z)

# But z knows something extra
print(z.grad_fn)

tensor([ 5.,  7.,  9.])
<AddBackward1 object at 0x7f084f7798d0>


In [10]:
# Let's sum all the entries in z
s = z.sum()
print(s)
print(s.grad_fn)

tensor(21.)
<SumBackward0 object at 0x7f087f89f750>


In [11]:
# Calling .backward() on any variable will run backprop, starting from it.
s.backward()
print(x.grad)

tensor([ 1.,  1.,  1.])
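
The gradient above is all ones because s is the sum of x + y, so the derivative of s with respect to each x_i is 1. A slightly less trivial case, sketched below as an illustration (the product example is not in the original tutorial), is s = sum(x_i * y_i), where the derivative with respect to x_i is y_i and vice versa.

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)

# s = x_1*y_1 + x_2*y_2 + x_3*y_3
s = (x * y).sum()
s.backward()

print(x.grad)  # d s / d x_i = y_i
print(y.grad)  # d s / d y_i = x_i
```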


In [12]:
x = torch.randn(2, 2)
y = torch.randn(2, 2)
# By default, user created Tensors have "requires_grad=False"
print(x.requires_grad, y.requires_grad)
z = x + y
# So you can't backprop through z
print(z.grad_fn)

# ".requires_grad_( ... )" changes an existing Tensor's "requires_grad"
# flag in place. The input flag defaults to "True" if not given.
x = x.requires_grad_()
y = y.requires_grad_()
# z contains enough information to compute gradients, as we saw above
z = x + y
print(z.grad_fn)
# If any input to an operation has "requires_grad=True", so will the output
print(z.requires_grad)

# Now z has the computation history that relates itself to x and y
# Can we just take its values, and **detach** it from its history?
new_z = z.detach()

# ... does new_z have information to backprop to x and y?
# NO!
print(new_z.grad_fn)
# and how could it? "z.detach()" returns a tensor that shares the same storage 
# as "z", but with the computation history forgotten. It doesn't know anything
# about how it was computed.
# In essence, we have broken the Tensor away from its past history.

(False, False)
None
<AddBackward1 object at 0x7f084f799210>
True
None


In [13]:
print(x.requires_grad)
print((x**2).requires_grad)

with torch.no_grad():
  print((x**2).requires_grad)

True
True
False
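
The two ways of switching off history tracking shown above can be compared side by side. This is a small sketch under the assumption that `data_ptr()` is an acceptable way to check that two tensors share memory; the names `new_z` and `w` are illustrative.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
z = x * 2

# detach(): same memory as z, but no link back to x
new_z = z.detach()
print(new_z.data_ptr() == z.data_ptr())
print(new_z.requires_grad, new_z.grad_fn)

# no_grad(): operations inside the block are not recorded at all
with torch.no_grad():
    w = x * 2
print(w.requires_grad, w.grad_fn)
```

detach() is applied to a tensor that already has history, while torch.no_grad() prevents the history from being recorded in the first place.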


**Deep Learning with PyTorch**

In [15]:
lin = nn.Linear(5,3) # Maps from R^5 to R^3, parameters A, b
# data is 2x5. A maps from 5 to 3... can we map data under A
data = torch.randn(2, 5)
print(data)
print(lin(data)) # Yes

tensor([[ 0.5629, -0.6205, -0.1024, -0.8491,  0.1112],
        [ 0.1618, -1.4105, -0.3404, -3.0121,  0.5710]])
tensor([[-0.1176,  0.0377,  0.4498],
        [-0.5548,  0.5594,  0.2233]])
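
To make the affine map concrete, the sketch below recomputes the output of nn.Linear by hand as data times the transposed weight plus the bias. It relies on nn.Linear storing A as `weight` with shape (out_features, in_features) and b as `bias`; the variable name `manual` and the tolerance are illustrative choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(1)

lin = nn.Linear(5, 3)        # parameters: weight (3 x 5) and bias (3)
data = torch.randn(2, 5)     # two input rows, each in R^5

# Each row x of data is mapped to A x + b
manual = data @ lin.weight.t() + lin.bias

print(lin.weight.shape, lin.bias.shape)
print(torch.allclose(manual, lin(data), atol=1e-6))
```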


In [16]:
# In PyTorch most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated by training.
data = torch.randn(2,2)
print(data)
print(F.relu(data))

tensor([[ 1.4330,  1.6689],
        [ 1.8068, -0.6527]])
tensor([[ 1.4330,  1.6689],
        [ 1.8068,  0.0000]])


In [17]:
data = torch.randn(5)
print(data)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum()) # Sums to 1 because it is a distribution
print(F.log_softmax(data, dim=0)) # There's also log_softmax

tensor([ 1.0488,  0.4975,  0.3865, -1.5278,  0.3892])
tensor([ 0.3724,  0.2146,  0.1921,  0.0283,  0.1926])
tensor(1.0000)
tensor([-0.9877, -1.5389, -1.6499, -3.5643, -1.6473])
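
Softmax exponentiates each entry and divides by the sum of the exponentials, which is why the result is a probability distribution. The sketch below recomputes it by hand and compares it with F.softmax and F.log_softmax. The input values are made up for illustration, and subtracting the maximum is a common numerical-stability trick rather than something the tutorial shows.

```python
import torch
import torch.nn.functional as F

data = torch.tensor([1.0, 0.5, 0.4, -1.5, 0.4])

# exp(x_i) / sum_j exp(x_j); subtracting the max leaves the result unchanged
exp = torch.exp(data - data.max())
manual_softmax = exp / exp.sum()

print(manual_softmax.sum().item())
print(torch.allclose(manual_softmax, F.softmax(data, dim=0), atol=1e-6))
print(torch.allclose(torch.log(manual_softmax), F.log_softmax(data, dim=0), atol=1e-6))
```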


In [18]:
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
       ("Give it to me".split(), "ENGLISH"),
       ("No creo que sea una buena idea".split(), "SPANISH"),
       ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
            ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
  for word in sent:
    if word not in word_to_ix:
      word_to_ix[word] = len(word_to_ix)

print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2


class BoWClassifier(nn.Module):   # inheriting from nn.Module
  
  def __init__(self, num_labels, vocab_size):
    # Calls the init function of nn.Module. Don't get confused by the syntax,
    # just always do it in an nn.Module
    super(BoWClassifier, self).__init__()
    
    # Define the parameters that you will need. In this case, we need A and b,
    # the parameters of the affine mapping.
    # Torch defines nn.Linear(), which provides the affine map.
    # Make sure you understand why the input dimension is vocab_size
    # and the output is num_labels!
    self.linear = nn.Linear(vocab_size, num_labels)
    
    # NOTE: The non-linearity log softmax does not have parameters! So we don't need 
    # to worry about that here. 
    
  def forward(self, bow_vec):
    # Pass the input through the linear layer, 
    # then pass that through log_softmax.
    # Many non-linearities and other functions are in torch.nn.functional
    return F.log_softmax(self.linear(bow_vec), dim=1)
    
def make_bow_vector(sentence, word_to_ix):
  vec = torch.zeros(len(word_to_ix))
  for word in sentence:
    vec[word_to_ix[word]] += 1
  return vec.view(1, -1)
    
def make_target(labels, label_to_ix):
  return torch.LongTensor([label_to_ix[label]])
    
model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# The model knows its parameters. The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
#     self.linear = nn.Linear(vocab_size, num_labels)
# then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
  print(param)
  
# To run the model, pass in the BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
  sample = data[0]
  bow_vector = make_bow_vector(sample[0], word_to_ix)
  log_probs = model(bow_vector)
  print(log_probs)

{'en': 3, 'No': 9, 'buena': 14, 'it': 7, 'at': 22, 'sea': 12, 'cafeteria': 5, 'Yo': 23, 'la': 4, 'to': 8, 'creo': 10, 'is': 16, 'a': 18, 'good': 19, 'get': 20, 'idea': 15, 'que': 11, 'not': 17, 'me': 0, 'on': 25, 'gusta': 1, 'lost': 21, 'Give': 6, 'una': 13, 'si': 24, 'comer': 2}
Parameter containing:
tensor([[-0.1710,  0.1650, -0.0372,  0.0396,  0.0073, -0.1250,  0.1104,
          0.1099,  0.0099, -0.1115, -0.0833,  0.0027, -0.1120, -0.1094,
         -0.0293, -0.0565,  0.0481, -0.0515, -0.0260, -0.0749, -0.1792,
          0.1710,  0.0374,  0.1754, -0.0316, -0.0493],
        [-0.1844, -0.0744,  0.1286, -0.1921, -0.0686,  0.1195,  0.1130,
          0.0724, -0.0388, -0.0148, -0.0372, -0.0723,  0.0818, -0.0668,
         -0.1102,  0.0445, -0.1418, -0.0419,  0.1002,  0.0733,  0.1670,
         -0.1338,  0.0017, -0.0579, -0.1097, -0.1103]])
Parameter containing:
tensor(1.00000e-02 *
       [ 4.9409,  2.0519])
tensor([[-0.6077, -0.7866]])
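
A bag-of-words vector simply counts how often each vocabulary word occurs in the sentence. The sketch below reuses make_bow_vector from the cell above on a small made-up vocabulary (`toy_vocab` is hypothetical and only there to keep the output readable).

```python
import torch

def make_bow_vector(sentence, word_to_ix):
    # One slot per vocabulary word, incremented once per occurrence
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)

# Hypothetical 4-word vocabulary for illustration
toy_vocab = {"me": 0, "gusta": 1, "comer": 2, "it": 3}

bow = make_bow_vector("me gusta me".split(), toy_vocab)
print(bow)        # "me" occurs twice, "gusta" once
print(bow.shape)  # one row, one column per vocabulary word
```

The .view(1, -1) at the end adds a batch axis of size 1, which is why the model's log_softmax is taken over dim=1.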


In [0]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

In [20]:
# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
  for instance, label in test_data:
    bow_vec = make_bow_vector(instance, word_to_ix)
    log_probs = model(bow_vec)
    print(log_probs)
    
# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

loss_function = nn.NLLLoss()
optimiser = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 epochs is far more than you would use on a real data set, but real data
# sets also have far more than a handful of instances. Usually, somewhere
# between 5 and 30 epochs is reasonable.
for epoch in range(100):
  for instance, label in data:
    # Step 1: Remember that PyTorch accumulates gradients.
    # We need to clear them out before each instance 
    model.zero_grad()
    
    # Step 2: Make our BoW vector and wrap the target in a tensor as an
    # integer. For example, if the target is SPANISH, then we wrap the
    # integer 0. The loss function then knows that the 0th element of the
    # log probabilities is the log probability corresponding to SPANISH
    bow_vec = make_bow_vector(instance, word_to_ix)
    target = make_target(label, label_to_ix)
    
    # Step 3: Run our forward pass
    log_probs = model(bow_vec)
    
    # Step 4: Compute the loss, gradients, and update the parameters by
    # calling optimiser.step()
    loss = loss_function(log_probs, target)
    loss.backward()
    optimiser.step()

with torch.no_grad():
  for instance, label in test_data:
    bow_vec = make_bow_vector(instance, word_to_ix)
    log_probs = model(bow_vec)
    print(log_probs)
    
# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])

tensor([[-0.5255, -0.8947]])
tensor([[-0.4251, -1.0605]])
tensor(1.00000e-02 *
       [-8.3331, -3.7244])
tensor([[-0.0927, -2.4247]])
tensor([[-2.1196, -0.1279]])
tensor([ 0.3637, -0.4843])
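
To read the after-training output as predictions, take the index of the largest log probability in each row and map it back to a label. The sketch below does this using the two log-probability rows printed above, copied in as constants. `ix_to_label` and `predicted` are illustrative names, and the loss value is only what NLLLoss gives for these copied numbers, not a result reported by the tutorial.

```python
import torch
import torch.nn as nn

label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
ix_to_label = {ix: label for label, ix in label_to_ix.items()}

# Log probabilities printed after training for the two test sentences
log_probs = torch.tensor([[-0.0927, -2.4247],
                          [-2.1196, -0.1279]])
targets = torch.tensor([label_to_ix["SPANISH"], label_to_ix["ENGLISH"]])

# The predicted class is the index with the highest log probability
predicted = [ix_to_label[i] for i in log_probs.argmax(dim=1).tolist()]
print(predicted)

# NLLLoss averages -log_probs[row, target[row]] over the rows
loss = nn.NLLLoss()(log_probs, targets)
print(loss.item())
```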


**Word Embeddings: Encoding Lexical Semantics**

In [21]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5) #2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 1.4697, -0.3951, -0.5101,  1.1163, -0.5926]])
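
An embedding layer is a lookup table: row i of its weight matrix is the vector for word index i. The sketch below looks up both words at once and checks the result against direct indexing into `embeds.weight`; the names `idx` and `out` are illustrative.

```python
import torch
import torch.nn as nn

word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5-dimensional embeddings

# Look up several indices in one call
idx = torch.tensor([word_to_ix["hello"], word_to_ix["world"]], dtype=torch.long)
out = embeds(idx)
print(out.shape)

# The lookup returns the corresponding rows of the weight matrix
print(torch.equal(out, embeds.weight[idx]))
```

Because the rows are ordinary parameters, they are updated by the optimiser like any other weight during training.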


In [23]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# We should tokenize the input, but we will ignore that for now
# build a list of tuples. Each tuple is ([word_i-2, word_i-1], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
           for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}

class NGramLanguageModeler(nn.Module):
  
  def __init__(self, vocab_size, embedding_dim, context_size):
    super(NGramLanguageModeler, self).__init__()
    self.embeddings = nn.Embedding(vocab_size, embedding_dim)
    self.linear1 = nn.Linear(context_size * embedding_dim, 128)
    self.linear2 = nn.Linear(128, vocab_size)
    
  def forward(self, inputs):
    embeds = self.embeddings(inputs).view((1, -1))
    out = F.relu(self.linear1(embeds))
    out = self.linear2(out)
    log_probs = F.log_softmax(out, dim=1)
    return log_probs

losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimiser = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
  total_loss = 0
  for context, target in trigrams:
    
    # Step 1. Prepare the inputs to be passed to the model (i.e. turn the words
    # into integer indices and wrap them in tensors)
    context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)
    
    # Step 2. Recall that torch *accumulates* gradients. Before passing in a 
    # new instance, you need to zero out the gradients from the old 
    # instance.
    model.zero_grad()
    
    # Step 3. Run the forward pass, getting log probabilities over next
    # words
    log_probs = model(context_idxs)
    
    # Step 4. Compute your loss function (Again, torch wants the target 
    # word wrapped in a tensor)
    loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))
    
    # Step 5. Do the backward pass and update the parameters
    loss.backward()
    optimiser.step()
    
    # Get the Python number from a 1-element tensor by calling tensor.item()
    total_loss += loss.item()
  losses.append(total_loss)
print(losses) # The loss decreased every iteration over the training data.

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
[518.8717761039734, 516.1813304424286, 513.5089640617371, 510.8543493747711, 508.21565222740173, 505.59266996383667, 502.9842948913574, 500.3891382217407, 497.8059902191162, 495.23540782928467]
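
It can help to follow the tensor shapes through the forward pass of NGramLanguageModeler. The sketch below runs the same layers step by step on a single context; the six-word vocabulary is a hypothetical stand-in for the sonnet's vocabulary, and the layers are untrained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

# Hypothetical small vocabulary, only to keep the shapes easy to read
vocab = ["When", "forty", "winters", "shall", "besiege", "thy"]
word_to_ix = {word: i for i, word in enumerate(vocab)}

embeddings = nn.Embedding(len(vocab), EMBEDDING_DIM)
linear1 = nn.Linear(CONTEXT_SIZE * EMBEDDING_DIM, 128)
linear2 = nn.Linear(128, len(vocab))

context = torch.tensor([word_to_ix["When"], word_to_ix["forty"]], dtype=torch.long)

embedded = embeddings(context)        # one row per context word
flat = embedded.view((1, -1))         # context embeddings laid side by side
hidden = F.relu(linear1(flat))
log_probs = F.log_softmax(linear2(hidden), dim=1)

for name, t in [("embedded", embedded), ("flat", flat),
                ("hidden", hidden), ("log_probs", log_probs)]:
    print(name, tuple(t.shape))
```

The final row holds one log probability per vocabulary word, which is what NLLLoss expects together with the index of the target word.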


In [24]:
## NEED TO FINISH EXERCISE

CONTEXT_SIZE = 2 # 2 words to the left and 2 words to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from "raw_text," we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
  context = [raw_text[i - 2], raw_text[i - 1], 
             raw_text[i + 1], raw_text[i + 2]]
  target = raw_text[i]
  data.append((context, target))
print(data[:5])

class CBOW(nn.Module):
  
  def __init__(self):
    pass
  
  def forward(self, inputs):
    pass
  
  
# Create the model and train. Here are some functions to make the data ready 
# for use by your module.

def make_context_vector(context, word_to_ix):
  idxs = [word_to_ix[w] for w in context]
  return torch.tensor(idxs, dtype=torch.long)

make_context_vector(data[0][0], word_to_ix) # example
  

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


tensor([ 13,   7,  18,  42])
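
The CBOW class above is left as an exercise. What follows is one possible way to fill it in, not the tutorial's reference solution. It sums the context embeddings, passes the sum through a single linear layer and applies log_softmax. The constructor arguments, `EMBEDDING_DIM = 10`, the learning rate, the epoch count and the use of only the first sentence of the text are all assumptions made to keep the sketch short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

CONTEXT_SIZE = 2    # 2 words to the left and 2 words to the right
EMBEDDING_DIM = 10  # assumed value, reused from the n-gram example

# First sentence of the exercise text only, to keep the sketch fast
raw_text = "We are about to study the idea of a computational process.".split()
vocab = sorted(set(raw_text))  # sorted so that indices are reproducible
word_to_ix = {word: i for i, word in enumerate(vocab)}

data = []
for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    data.append((context, raw_text[i]))


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dim):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)

    def forward(self, inputs):
        # Sum the context embeddings into one vector, then score every word
        summed = self.embeddings(inputs).sum(dim=0).view(1, -1)
        return F.log_softmax(self.linear(summed), dim=1)


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


model = CBOW(len(vocab), EMBEDDING_DIM)
loss_function = nn.NLLLoss()
optimiser = optim.SGD(model.parameters(), lr=0.01)

losses = []
for epoch in range(100):
    total_loss = 0.0
    for context, target in data:
        model.zero_grad()
        log_probs = model(make_context_vector(context, word_to_ix))
        target_ix = torch.tensor([word_to_ix[target]], dtype=torch.long)
        loss = loss_function(log_probs, target_ix)
        loss.backward()
        optimiser.step()
        total_loss += loss.item()
    losses.append(total_loss)

print(losses[0], losses[-1])
```

The training loop follows the same five steps as the n-gram model above. Only the forward pass differs, because the context words are summed rather than concatenated.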