# Deep Learning for NLP with PyTorch

This tutorial will walk you through the key ideas of deep learning programming using PyTorch. Many of the concepts (such as the computation graph abstraction and autograd) are not unique to PyTorch and are relevant to any deep learning toolkit out there.

I am writing this tutorial to focus specifically on NLP for people who have never written code in any deep learning framework (e.g., TensorFlow, Theano, Keras, DyNet). It assumes working knowledge of core NLP problems: part-of-speech tagging, language modeling, etc. It also assumes familiarity with neural networks at the level of an intro AI class (such as one from the Russell and Norvig book). Usually, these courses cover the basic backpropagation algorithm on feed-forward neural networks, and make the point that they are chains of compositions of linearities and non-linearities. This tutorial aims to get you started writing deep learning code, given you have this prerequisite knowledge.

Note this is about models, not data. For all of the models, I just create a few test examples with small dimensionality so you can see how the weights change as the model trains. If you have some real data you want to try, you should be able to rip out any of the models from this notebook and use them on it.

# 1 Introduction to PyTorch

## 1.1 Introduction to Torch’s tensor library

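All of deep learning is computation on tensors, which generalize matrices so that they can be indexed along more than two dimensions. We will look at what this means in more depth later. First, let's see what we can do with tensors.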

In [1]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x1f585797870>

### Creating Tensors

Tensors can be created from Python lists with the torch.Tensor() function.

In [8]:
# torch.Tensor(data) creates a torch.Tensor object with the given data.
V_data = [1., 2., 3.]
V = torch.Tensor(V_data)

print(V)

print(V.size())
print(V.shape)


 1
 2
 3
[torch.FloatTensor of size 3]

torch.Size([3])
torch.Size([3])


In [10]:
# Creates a matrix
M_data = [[1., 2., 3.], [4., 5., 6.]]
M = torch.Tensor(M_data)

print(M)

print(M.shape)


 1  2  3
 4  5  6
[torch.FloatTensor of size 2x3]

torch.Size([2, 3])


In [11]:
# Create a 3D tensor of size 2x2x2.
T_data = [[[1., 2.], [3., 4.]],
          [[5., 6.], [7., 8.]]]
T = torch.Tensor(T_data)

print(T)

print(T.shape)


(0 ,.,.) = 
  1  2
  3  4

(1 ,.,.) = 
  5  6
  7  8
[torch.FloatTensor of size 2x2x2]

torch.Size([2, 2, 2])


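What is a 3D tensor anyway? Think about it like this. If you have a vector, indexing into the vector gives you a scalar. If you have a matrix, indexing into the matrix gives you a vector. If you have a 3D tensor, then indexing into the tensor gives you a matrix!

A note on terminology: when I say "tensor" in this tutorial, it refers to any torch.Tensor object. Vectors and matrices are special cases of torch.Tensors, with 1 and 2 dimensions respectively. When I am talking about 3D tensors, I will explicitly use the term "3D tensor".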

In [13]:
# Index into V and get a scalar (0 dimensional tensor)
print(V[0])

# Index into M and get a vector
print(M[0])

# Index into T and get a matrix
print(T[0])

1.0

 1
 2
 3
[torch.FloatTensor of size 3]


 1  2
 3  4
[torch.FloatTensor of size 2x2]



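You can also create tensors of other data types. The default, as you can see, is Float. To create a tensor of integer type, use torch.LongTensor(). Check the documentation for more data types, but Float and Long will be the most common.

You can create a tensor with random data and the supplied dimensionality with torch.randn().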

In [14]:
x = torch.randn((3, 4, 5))
print(x)


(0 ,.,.) = 
  0.6614  0.2669  0.0617  0.6213 -0.4519
 -0.1661 -1.5228  0.3817 -1.0276 -0.5631
 -0.8923 -0.0583 -0.1955 -0.9656  0.4224
  0.2673 -0.4212 -0.5107 -1.5727 -0.1232

(1 ,.,.) = 
  3.5870 -1.8313  1.5987 -1.2770  0.3255
 -0.4791  1.3790  2.5286  0.4107 -0.9880
 -0.9081  0.5423  0.1103 -2.2590  0.6067
 -0.1383  0.8310 -0.2477 -0.8029  0.2366

(2 ,.,.) = 
  0.2857  0.6898 -0.6331  0.8795 -0.6842
  0.4533  0.2912 -0.8317 -0.5525  0.6355
 -0.3968 -0.6571 -1.6428  0.9803 -0.0421
 -0.8206  0.3133 -1.1352  0.3773 -0.2824
[torch.FloatTensor of size 3x4x5]



### Operations with Tensors

You can operate on tensors in the ways you would expect.

In [15]:
x = torch.Tensor([1., 2., 3.])
y = torch.Tensor([4., 5., 6.])
z = x + y
print(z)


 5
 7
 9
[torch.FloatTensor of size 3]



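See the documentation (http://pytorch.org/docs/torch.html) for a complete list of the many operations available to you. They extend beyond just mathematical operations.

One helpful operation that we will make use of later is concatenation.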

In [16]:
# By default, it concatenates along the first axis (concatenates rows)
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])
print(z_1)

# Concatenate columns:
x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
# second arg specifies which axis to concat along
z_2 = torch.cat([x_2, y_2], 1)
print(z_2)

# If your tensors are not compatible, torch will complain.  Uncomment to see the error
# torch.cat([x_1, x_2])


-2.5667 -1.4303  0.5009  0.5438 -0.4057
 1.1341 -1.1115  0.3501 -0.7703 -0.1473
 0.6272  1.0935  0.0939  1.2381 -1.3459
 0.5119 -0.6933 -0.1668 -0.9999 -1.6476
 0.8098  0.0554  1.1340 -0.5326  0.6592
[torch.FloatTensor of size 5x5]


-1.5964 -0.3769 -3.1020 -0.0020 -1.0952  0.6016  0.6984 -0.8005
-0.0995 -0.7213  1.2708  1.5381  1.4673  1.5951 -1.5279  1.0156
[torch.FloatTensor of size 2x8]



### Reshaping Tensors

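Use the .view() method to reshape a tensor. This method receives heavy use, because many neural network components expect their inputs to have a certain shape. Often you will need to reshape your data before passing it to the component.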

In [17]:
x = torch.randn(2, 3, 4)
print(x)
print(x.view(2, 12))  # Reshape to 2 rows, 12 columns
# Same as above.  If one of the dimensions is -1, its size can be inferred
print(x.view(2, -1))


(0 ,.,.) = 
 -0.2020 -1.2865  0.8231 -0.6101
 -1.2960 -0.9434  0.6684  1.1628
 -0.3229  1.8782 -0.5666  0.4016

(1 ,.,.) = 
 -0.1153  0.3170  0.5629  0.8662
 -0.3528  0.3482  1.1371 -0.3339
 -1.4724  0.7296 -0.1312 -0.6368
[torch.FloatTensor of size 2x3x4]



Columns 0 to 9 
-0.2020 -1.2865  0.8231 -0.6101 -1.2960 -0.9434  0.6684  1.1628 -0.3229  1.8782
-0.1153  0.3170  0.5629  0.8662 -0.3528  0.3482  1.1371 -0.3339 -1.4724  0.7296

Columns 10 to 11 
-0.5666  0.4016
-0.1312 -0.6368
[torch.FloatTensor of size 2x12]



Columns 0 to 9 
-0.2020 -1.2865  0.8231 -0.6101 -1.2960 -0.9434  0.6684  1.1628 -0.3229  1.8782
-0.1153  0.3170  0.5629  0.8662 -0.3528  0.3482  1.1371 -0.3339 -1.4724  0.7296

Columns 10 to 11 
-0.5666  0.4016
-0.1312 -0.6368
[torch.FloatTensor of size 2x12]



## 1.2 Computation Graphs and Automatic Differentiation

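The concept of a computation graph is essential to efficient deep learning programming, because it allows you to not have to write the backpropagation gradients yourself. A computation graph is simply a specification of how your data is combined to give you the output. Since the graph totally specifies which parameters were involved in which operations, it contains enough information to compute derivatives. This probably sounds vague, so let's see what is going on using the fundamental class of PyTorch: autograd.Variable.

First, think from a programmer's perspective. What is stored in the torch.Tensor objects we were creating above? Obviously the data and the shape, and maybe a few other things. But when we added two tensors together, we got an output tensor. All this output tensor knows is its data and shape. It has no idea that it was the sum of two other tensors (it could have been read in from a file, it could be the result of some other operation, etc.).

The Variable class keeps track of how it was created. Let's see it in action.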

In [31]:
x = autograd.Variable(torch.Tensor([1., 2., 3.]), requires_grad=True)
print(x)

Variable containing:
 1
 2
 3
[torch.FloatTensor of size 3]



In [32]:
print(x.data)


 1
 2
 3
[torch.FloatTensor of size 3]



In [33]:
# With requires_grad=True, you can still do all the operations you previously could
y = autograd.Variable(torch.Tensor([4., 5., 6.]), requires_grad=True)
z = x+y
print(z)

Variable containing:
 5
 7
 9
[torch.FloatTensor of size 3]



In [36]:
# BUT z knows something extra.
# If neither x nor y had requires_grad=True, grad_fn would be None
print(z.grad_fn)

<AddBackward1 object at 0x000001F587E390F0>


In [37]:
print(z.grad_fn.next_functions)

((<AccumulateGrad object at 0x000001F587E392B0>, 0), (<AccumulateGrad object at 0x000001F587E396D8>, 0))


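So Variables know what created them. z knows that it wasn't read in from a file, and that it wasn't the result of a multiplication or exponential or whatever. And if you keep following z.grad_fn, you will find yourself at x and y.

But how does that help us compute a gradient?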

In [39]:
s = z.sum()
print(s)
print(s.grad_fn)

Variable containing:
 21
[torch.FloatTensor of size 1]

<SumBackward0 object at 0x000001F587E39400>


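Now, what is the derivative of this sum with respect to the first component of x? In math, we want ∂s/∂x_0. Well, s knows that it was created as a sum of the tensor z, and z knows that it was the sum x + y. So s = (x_0 + y_0) + (x_1 + y_1) + (x_2 + y_2), and s contains enough information to determine that the derivative we want is 1!

Of course this glosses over the challenge of how to actually compute that derivative. The point here is that s is carrying along enough information that it is possible to compute it. In reality, the developers of PyTorch program the sum() and + operations to know how to compute their gradients, and run the backpropagation algorithm. An in-depth discussion of that algorithm is beyond the scope of this tutorial.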
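As a sanity check on the claim that each component of ds/dx equals 1, the derivative can be approximated numerically without any deep learning library. The sketch below uses plain Python lists and a hypothetical helper `s_of` (not part of PyTorch), with the same x and y values as above.

```python
# Finite-difference check that d(sum(x + y)) / dx_i is 1 for every i.
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

def s_of(x_vals, y_vals):
    """s = sum(z) where z = x + y, computed elementwise."""
    return sum(a + b for a, b in zip(x_vals, y_vals))

eps = 1e-6
numeric_grad = []
for i in range(len(x)):
    bumped = list(x)
    bumped[i] += eps          # nudge only the i-th component of x
    numeric_grad.append((s_of(bumped, y) - s_of(x, y)) / eps)

print(s_of(x, y))      # 21.0
print(numeric_grad)    # every entry is approximately 1.0
```

Autograd reaches the same answer exactly and without perturbing the inputs, because each operation in the graph knows its own derivative.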

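Let's have PyTorch compute the gradient, and see that we were right. (Note: if you run this block multiple times, the gradient will increase. That is because PyTorch *accumulates* the gradient into the .grad property, since for many models this is very convenient.)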

In [40]:
# calling .backward() on any variable will run backprop, starting from it.
s.backward()
print(x.grad)

Variable containing:
 1
 1
 1
[torch.FloatTensor of size 3]



In [41]:
print(x.grad.data)


 1
 1
 1
[torch.FloatTensor of size 3]
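The accumulation behaviour mentioned above is easy to observe directly. This is a minimal sketch that assumes a recent PyTorch release, where plain tensors accept `requires_grad=True` and `autograd.Variable` is no longer needed; the variable names `first` and `second` are only for illustration.

```python
import torch

# Assumption: a recent PyTorch release where tensors track gradients directly.
x = torch.tensor([1., 2., 3.], requires_grad=True)
y = torch.tensor([4., 5., 6.], requires_grad=True)

(x + y).sum().backward()
first = x.grad.clone()      # gradient after one backward pass

(x + y).sum().backward()    # a second backward pass adds to x.grad
second = x.grad.clone()

x.grad.zero_()              # reset before the next independent computation
print(first, second, x.grad)
```

This is why the training loops later in the tutorial clear the gradients before processing each instance.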



Understanding what is going on in the block below is crucial for being a successful programmer in deep learning.

In [42]:
x = torch.randn((2, 2))
y = torch.randn((2, 2))
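z = x + y  # These are Tensor types, so backprop is not possible

var_x = autograd.Variable(x, requires_grad=True)
var_y = autograd.Variable(y, requires_grad=True)
# var_z contains enough information to compute gradients, as we saw above
var_z = var_x + var_y
print(var_z.grad_fn)

var_z_data = var_z.data  # Get the wrapped Tensor object out of var_z...
# Re-wrap the tensor in a new Variable
new_var_z = autograd.Variable(var_z_data)

# ... does new_var_z have the information to backprop to x and y?
# NO!
print(new_var_z.grad_fn)
# And how could it? We yanked the tensor out of var_z (that is what
# var_z.data is). This tensor doesn't know anything about how it was
# computed. We pass it into new_var_z, and this is all the information
# new_var_z gets. If var_z_data doesn't know how it was computed,
# there is no way new_var_z will.
# In essence, we have broken the Variable away from its past history.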

<AddBackward1 object at 0x000001F587E48208>
None


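Here is the basic, but extremely important, rule for computing with autograd.Variables (note this is more general than PyTorch; there is an equivalent object in every major deep learning toolkit):

**If you want the error from your loss function to backpropagate to a component of your network, you MUST NOT break the Variable chain from that component to your loss Variable. If you do, the loss will have no idea your component exists, and its parameters can't be updated.**

I say this in bold, because this error can creep up on you in very subtle ways (I will show some such ways below), and it will not cause your code to crash or complain, so you must be careful.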
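The same rule can be seen in a few lines with the tensor API. This sketch assumes a recent PyTorch release and uses `.detach()`, which cuts a tensor from its history in the same way that re-wrapping `.data` does in the block above; the names `w`, `connected` and `detached` are illustrative.

```python
import torch

# Assumption: a recent PyTorch release where tensors carry requires_grad.
w = torch.tensor([1., 2.], requires_grad=True)

connected = (w * 3).sum()          # still linked to w through the graph
detached = (w * 3).detach().sum()  # history cut before the sum

print(connected.grad_fn is not None)  # True
print(detached.grad_fn)               # None
print(detached.requires_grad)         # False

connected.backward()                  # gradients flow back to w
print(w.grad)
```

No error is raised when the chain is broken. The only symptom is a missing `grad_fn`, which is why the mistake is easy to overlook.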

# 2 Deep Learning with PyTorch

## 2.1 Deep Learning Building Blocks: Affine maps, non-linearities and objectives

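Deep learning consists of composing linearities with non-linearities in clever ways. The introduction of non-linearities allows for powerful models. In this section, we will play with these core components, make up an objective function, and see how the model is trained.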

### Affine Maps

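One of the core workhorses of deep learning is the affine map, which is a function f(x) where

f(x) = Ax + b

for a matrix A and vectors x, b. The parameters to be learned here are A and b. Often, b is referred to as the *bias* term.

PyTorch and most other deep learning frameworks do things a little differently than traditional linear algebra. They map the rows of the input instead of the columns. That is, the i-th row of the output below is the mapping of the i-th row of the input under A, plus the bias term. Look at the example below.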

In [43]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x1f585797870>

In [48]:
lin = nn.Linear(5,3)
data = autograd.Variable(torch.randn(2, 5))
print(data)
print(lin(data))

Variable containing:
-1.1115  0.3501 -0.7703 -0.1473  0.6272
 1.0935  0.0939  1.2381 -1.3459  0.5119
[torch.FloatTensor of size 2x5]

Variable containing:
-0.0236 -0.3005  0.2450
 0.3692 -1.3302  0.0736
[torch.FloatTensor of size 2x3]
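To make the row-wise convention concrete, here is a sketch in plain Python with made-up numbers. The matrix `A` is laid out as (out_features, in_features), which is also how nn.Linear stores its weight; the helper `affine` is hypothetical.

```python
# Row-wise affine map: each input row x is mapped to A x + b.
# A has shape (out_features, in_features) = (2, 3); the values are made up.
A = [[1.0, 0.0, -1.0],
     [0.5, 2.0, 0.0]]
b = [0.1, -0.2]

def affine(x):
    """Apply the affine map to a single input row."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(A, b)]

batch = [[1.0, 2.0, 3.0],
         [0.0, 1.0, 0.0]]
out = [affine(x) for x in batch]   # one output row per input row
print(out)
```

A batch of 2 rows with 3 features each produces 2 rows with 2 features each, matching the 2x5 to 2x3 shapes in the nn.Linear(5, 3) output above.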



### Non-Linearities

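First, note the following fact, which will explain why we need non-linearities in the first place. Suppose we have two affine maps f(x) = Ax + b and g(x) = Cx + d. What is f(g(x))?
- f(g(x)) = A(Cx + d) + b = ACx + (Ad + b)

AC is a matrix and Ad + b is a vector, so we see that composing affine maps gives you an affine map.

From this, you can see that if you wanted your neural network to be a long chain of affine compositions, this adds no new power to your model beyond just doing a single affine map.

If we introduce non-linearities in between the affine layers, this is no longer the case, and we can build much more powerful models.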
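The identity f(g(x)) = ACx + (Ad + b) can be checked numerically. The sketch below uses small made-up 2x2 matrices and hypothetical list-based helpers (`matvec`, `matmul`, `vadd`) rather than any library.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(M, N):
    cols = list(zip(*N))
    return [[sum(m * n for m, n in zip(row, col)) for col in cols] for row in M]

def vadd(u, v):
    return [a + b for a, b in zip(u, v)]

A, b = [[1.0, 2.0], [0.0, 1.0]], [1.0, -1.0]
C, d = [[3.0, 0.0], [1.0, 1.0]], [0.5, 2.0]
x = [2.0, -3.0]

# Two affine layers applied in sequence: f(g(x)) = A(Cx + d) + b
stacked = vadd(matvec(A, vadd(matvec(C, x), d)), b)

# One affine layer with matrix AC and bias Ad + b
AC = matmul(A, C)
bias = vadd(matvec(A, d), b)
single = vadd(matvec(AC, x), bias)

print(stacked, single)   # the two results coincide
```

However many affine layers are stacked, they can always be collapsed into one in this way, so depth alone adds nothing without a non-linearity in between.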

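There are a few core non-linearities. tanh(x), σ(x), ReLU(x) are the most common. You are probably wondering: "why these functions? I can think of plenty of other non-linearities." The reason for this is that they have gradients that are easy to compute, and computing gradients is essential for learning. For example
- dσ/dx = σ(x)(1 − σ(x))

A quick note: although you may have learned some neural networks in your intro to AI class where σ(x) was the default non-linearity, typically people shy away from it in practice. This is because the gradient vanishes very quickly as the absolute value of the argument grows. Small gradients mean it is hard to learn. Most people default to tanh or ReLU.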
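To see how quickly the sigmoid gradient shrinks, the formula dσ/dx = σ(x)(1 − σ(x)) can be evaluated at a few points. This is a plain Python sketch; the helper names and the sample points are chosen only for illustration, and the ReLU gradient is shown for comparison on positive inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # d(sigma)/dx = sigma(x) * (1 - sigma(x))

def relu_grad(x):
    return 1.0 if x > 0 else 0.0  # slope of max(0, x) away from 0

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x), relu_grad(x))
```

The sigmoid gradient peaks at 0.25 for x = 0 and drops below 1e-4 by x = 10, while the ReLU gradient stays at 1 for any positive input.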

In [50]:
# In PyTorch, most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = autograd.Variable(torch.randn(2, 2))
print(data)
print(F.relu(data))

Variable containing:
 0.8098  0.0554
 1.1340 -0.5326
[torch.FloatTensor of size 2x2]

Variable containing:
 0.8098  0.0554
 1.1340  0.0000
[torch.FloatTensor of size 2x2]



### Softmax and Probabilities

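The function Softmax(x) is also just a non-linearity, but it is special in that it usually is the last operation done in a network. This is because it takes in a vector of real numbers and returns a probability distribution. Its definition is as follows. Let x be a vector of real numbers (positive, negative, whatever, there are no constraints). Then the i-th component of Softmax(x) is
- exp(x_i) / Σ_j exp(x_j)

It should be clear that the output is a probability distribution: each element is non-negative and the sum over all components is 1.

You could also think of it as just applying an element-wise exponentiation operator to the input to make everything non-negative, and then dividing by the normalization constant.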
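The definition translates almost directly into code. The following is a plain Python sketch with a hypothetical `softmax` helper and made-up inputs. It subtracts the maximum before exponentiating, a common precaution that leaves the result unchanged because the same factor cancels in the numerator and denominator.

```python
import math

def softmax(xs):
    m = max(xs)                           # shift by the max to avoid overflow
    exps = [math.exp(x - m) for x in xs]  # element-wise exponentiation
    total = sum(exps)                     # normalization constant
    return [e / total for e in exps]

probs = softmax([2.0, -1.0, 0.5])
print(probs)
print(sum(probs))   # approximately 1
```

Larger inputs receive larger probabilities, and every output is strictly positive even when the input is negative.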

In [51]:
# Softmax is also in torch.nn.functional
data = autograd.Variable(torch.randn(5))
print(data)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # there's also log_softmax

Variable containing:
 0.6592
-1.5964
-0.3769
-3.1020
-0.0995
[torch.FloatTensor of size 5]

Variable containing:
 0.5125
 0.0537
 0.1819
 0.0119
 0.2400
[torch.FloatTensor of size 5]

Variable containing:
 1
[torch.FloatTensor of size 1]

Variable containing:
-0.6684
-2.9240
-1.7045
-4.4297
-1.4271
[torch.FloatTensor of size 5]



log_softmax
- While mathematically equivalent to log(softmax(x)), doing these two operations separately is slower, and numerically unstable. This function uses an alternative formulation to compute the output and gradient correctly.
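The instability is easy to reproduce in plain Python. The sketch below compares a naive log(softmax(x)) with the log-sum-exp formulation, which is one common stable alternative; it is an assumption that this resembles what the library does internally, since the note above does not say which formulation is used. Both helper names are hypothetical.

```python
import math

def naive_log_softmax(xs):
    exps = [math.exp(x) for x in xs]           # overflows for large x
    total = sum(exps)
    return [math.log(e / total) for e in exps]

def stable_log_softmax(xs):
    m = max(xs)
    log_sum = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - log_sum for x in xs]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]   # same gaps between entries, shifted by 999

print(stable_log_softmax(small))
print(stable_log_softmax(large))   # same values as for `small`

try:
    naive_log_softmax(large)
    naive_failed = False
except OverflowError:
    naive_failed = True            # math.exp(1000.0) is out of range
print(naive_failed)
```

Because softmax depends only on the differences between entries, the shifted input should give the same answer, and the stable version does.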

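### Objective Functions

The objective function is the function that your network is being trained to minimize (in which case it is often called a *loss function* or *cost function*). This proceeds by first choosing a training instance, running it through your neural network, and then computing the loss of the output. The parameters of the model are then updated by taking the derivative of the loss function. Intuitively, if your model is completely confident in its answer, and its answer is wrong, your loss will be high. If it is very confident in its answer, and its answer is correct, the loss will be low.

The idea behind minimizing the loss function on your training examples is that your network will hopefully generalize well and have small loss on unseen examples in your dev set, test set, or in production. An example loss function is the *negative log likelihood loss*, which is a very common objective for multi-class classification. For supervised multi-class classification, this means training the network to minimize the negative log probability of the correct output (or equivalently, maximize the log probability of the correct output).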
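The intuition about confidence and correctness can be checked with a few numbers. This plain Python sketch uses a hypothetical `nll` helper and made-up two-class probabilities; it is not the library implementation.

```python
import math

def nll(log_probs, target):
    """Negative log likelihood of the correct class."""
    return -log_probs[target]

confident_right = [math.log(p) for p in (0.98, 0.02)]
unsure = [math.log(p) for p in (0.55, 0.45)]
confident_wrong = [math.log(p) for p in (0.02, 0.98)]

target = 0   # index of the correct label
for name, lp in [("confident and right", confident_right),
                 ("unsure", unsure),
                 ("confident and wrong", confident_wrong)]:
    print(name, nll(lp, target))
```

The loss is close to 0 when the model puts nearly all its probability on the correct label and grows without bound as that probability approaches 0.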

## 2.2 Optimization and Training

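So we can compute a loss function for an instance. What do we do with that? We saw earlier that autograd.Variables know how to compute gradients with respect to the things that were used to compute them. Well, since our loss is an autograd.Variable, we can compute gradients with respect to all of the parameters used to compute it! Then we can perform standard gradient updates. Let θ be our parameters, L(θ) the loss function, and η a positive learning rate. Then:
- θ^(t+1) = θ^(t) − η ∇_θ L(θ)

There is a huge collection of algorithms and active research in attempting to do something more than just this vanilla gradient update. Many attempt to vary the learning rate based on what is happening at train time. You don't need to worry about what specifically these algorithms are doing unless you are really interested. Torch provides many of them in the torch.optim package, and they all expose the same interface, so using the simplest gradient update looks the same in code as using one of the more complicated algorithms. Trying different update algorithms and different parameters for the update algorithms (like different initial learning rates) is important in optimizing your network's performance. Often, just replacing vanilla SGD with an optimizer like Adam or RMSProp will boost performance noticeably.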
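The vanilla update rule is short enough to write out by hand. The sketch below applies it to a made-up one-parameter loss L(θ) = (θ − 3)², whose gradient is known in closed form; the function names, learning rate and step count are illustrative choices.

```python
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)          # dL/dtheta

theta = 0.0
eta = 0.1                               # learning rate
history = [loss(theta)]
for step in range(50):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * gradient
    history.append(loss(theta))

print(theta)          # close to the minimizer 3.0
print(history[:3])    # the loss shrinks at every step
```

In a real network the gradient comes from loss.backward() rather than a hand-written formula, and the update line is what optimizer.step() performs for SGD.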

## 2.3 Creating Network Components in PyTorch

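Before we move on to our focus on NLP, let's do an annotated example of building a network in PyTorch using only affine maps and non-linearities. We will also see how to compute a loss function, using PyTorch's built-in negative log likelihood, and update parameters by backpropagation.

All network components should inherit from nn.Module and override the forward() method. That is about it, as far as the boilerplate is concerned. Inheriting from nn.Module provides functionality to your component. For example, it makes it keep track of its trainable parameters, and you can swap it between CPU and GPU with the .cuda() or .cpu() functions, etc.

Let's write an annotated example of a network that takes in a sparse bag-of-words representation and outputs a probability distribution over two labels: "English" and "Spanish". This model is just logistic regression.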

### Example: Logistic Regression Bag-of-Words classifier

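Our model will map a sparse BoW representation to log probabilities over labels. We assign each word in the vocab an index. For example, say our entire vocab is two words, "hello" and "world", with indices 0 and 1 respectively. The BoW vector for the sentence "hello hello hello hello" is
- [4, 0]

For "hello world world hello", it is
- [2, 2]

etc. In general, it is
- [Count(hello), Count(world)]

Denote this BoW vector as x. The output of our network is:
- log Softmax(Ax + b)

That is, we pass the input through an affine map and then do log softmax.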

In [52]:
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

In [54]:
test_data

[(['Yo', 'creo', 'que', 'si'], 'SPANISH'),
 (['it', 'is', 'lost', 'on', 'me'], 'ENGLISH')]

In [61]:
# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data+test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
print()
print(len(word_to_ix))

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}

26


In [62]:
VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2

In [66]:
class BoWClassifier(nn.Module): # inheriting from nn.Module!
    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Don't get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()
        
        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)
        
        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here
        
    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)

In [75]:
def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1) # Reshape to the input shape

In [67]:
model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

In [68]:
# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print(param)

Parameter containing:

Columns 0 to 9 
-0.0490 -0.0159 -0.1690 -0.0387 -0.1265  0.1802 -0.1695 -0.1529 -0.0067 -0.1060
-0.0153 -0.0063  0.0333  0.0924  0.0315  0.0598 -0.1764  0.1429  0.1710  0.1621

Columns 10 to 19 
 0.0702 -0.0755 -0.0921  0.0111  0.1420 -0.1380  0.0921  0.1260  0.1918 -0.1373
 0.1450 -0.1415 -0.0727  0.1729 -0.1494  0.1779 -0.1542 -0.1381  0.0959 -0.1409

Columns 20 to 25 
 0.0475 -0.1450  0.1674 -0.0761  0.1181  0.0058
-0.0449  0.1427  0.1553  0.1855 -0.0398 -0.1524
[torch.FloatTensor of size 2x26]

Parameter containing:
 0.1931
-0.0418
[torch.FloatTensor of size 2]



In [73]:
# To run the model, pass in a BoW vector; it needs to be wrapped in an autograd.Variable
sample = data[0]
bow_vector = make_bow_vector(sample[0], word_to_ix); bow_vector



Columns 0 to 12 
    1     1     1     1     1     1     0     0     0     0     0     0     0

Columns 13 to 25 
    0     0     0     0     0     0     0     0     0     0     0     0     0
[torch.FloatTensor of size 1x26]

In [74]:
log_probs = model(autograd.Variable(bow_vector))
print(log_probs)

Variable containing:
-0.7869 -0.6074
[torch.FloatTensor of size 1x2]



Which of the above values corresponds to the log probability of ENGLISH, and which to SPANISH? We never defined it, but we need to if we want to train the model.

In [86]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])

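So let's train! To do this, we pass instances through to get log probabilities, compute a loss function, compute the gradient of the loss function, and then update the parameters with a gradient step. Loss functions are provided by Torch in the nn package. nn.NLLLoss() is the negative log likelihood loss we want. Torch also defines optimization functions in torch.optim. Here, we will just use SGD.

Note that the *input* to NLLLoss is a vector of log probabilities and a target label. It doesn't compute a log probability for us. This is why the last layer of our network is log softmax. The loss function nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log softmax for you.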
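The relationship between the two loss functions can be verified on a single example. This is a minimal sketch assuming a recent PyTorch release (plain tensors, no Variable wrapper); the scores are random and the target index is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(1, 2)     # raw, unnormalized scores for 2 labels
target = torch.tensor([1])     # index of the correct label

# NLLLoss expects log probabilities, so log_softmax is applied first.
nll = nn.NLLLoss()(F.log_softmax(scores, dim=1), target)

# CrossEntropyLoss takes the raw scores and applies log softmax itself.
ce = nn.CrossEntropyLoss()(scores, target)

print(nll.item(), ce.item())   # the two values agree
```

A practical consequence is that a model ending in log_softmax should be paired with NLLLoss, while a model that returns raw scores should be paired with CrossEntropyLoss.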

In [77]:
# Run on test data before we train, just to see a before-and-after
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)

Variable containing:
-0.6338 -0.7563
[torch.FloatTensor of size 1x2]

Variable containing:
-0.6821 -0.7044
[torch.FloatTensor of size 1x2]



In [78]:
# generator object, so we can directly call next() w/o iter()
model.parameters()

<generator object Module.parameters at 0x000001F587E2BD58>

In [79]:
# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

Variable containing:
 0.0702
 0.1450
[torch.FloatTensor of size 2]



In [80]:
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [90]:
instance, label = data[1]
print(instance)
print(label)

['Give', 'it', 'to', 'me']
ENGLISH


In [91]:
bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
target = autograd.Variable(make_target(label, label_to_ix))
print()
print(bow_vec)
print(target) # Notice this is one number, not a vector with #class dim


Variable containing:

Columns 0 to 12 
    1     0     0     0     0     0     1     1     1     0     0     0     0

Columns 13 to 25 
    0     0     0     0     0     0     0     0     0     0     0     0     0
[torch.FloatTensor of size 1x26]

Variable containing:
 1
[torch.LongTensor of size 1]



In [89]:
# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have more than
# two instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()
        
        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Variable as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
        target = autograd.Variable(make_target(label, label_to_ix))
        
        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)
        
        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = criterion(log_probs, target)
        loss.backward()
        optimizer.step()

In [92]:
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)

Variable containing:
-0.1113 -2.2507
[torch.FloatTensor of size 1x2]

Variable containing:
-2.4659 -0.0888
[torch.FloatTensor of size 1x2]



In [93]:
# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])

Variable containing:
 0.5102
-0.2951
[torch.FloatTensor of size 2]



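We got the right answer! You can see that the log probability for Spanish is much higher for the first test example, and the log probability for English is much higher for the second, as it should be.

Now you see how to make a PyTorch component, pass some data through it and do gradient updates. We are ready to dig deeper into what deep NLP has to offer.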

# 3 Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your vocabulary. In NLP, it is almost always the case that your features are words! But how should you represent a word in a computer? You could store its ASCII character representation, but that only tells you what the word is; it doesn’t say much about what it means (you might be able to derive its part of speech from its affixes, or properties from its capitalization, but not much). Even more, in what sense could you combine these representations? We often want dense outputs from our neural networks, where the inputs are |V| dimensional, where V is our vocabulary, but often the outputs are only a few dimensional (if we are only predicting a handful of labels, for instance). How do we get from a massive dimensional space to a smaller dimensional space?

How about instead of ASCII representations, we use a one-hot encoding? That is, we represent the word w by
- [0,0,…,1,…,0,0], |V| elements

where the 1 is in a location unique to w. Any other word will have a 1 in some other location, and a 0 everywhere else.

There is an enormous drawback to this representation, besides just how huge it is. It basically treats all words as independent entities with no relation to each other. What we really want is some notion of similarity between words. Why? Let’s see an example.
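The independence problem shows up as soon as two one-hot vectors are compared. The sketch below uses a made-up three-word vocabulary and hypothetical helpers `one_hot` and `dot`.

```python
vocab = ["mathematician", "physicist", "store"]   # toy vocabulary (assumed)
word_to_ix = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

print(one_hot("physicist"))                                  # [0, 1, 0]
print(dot(one_hot("physicist"), one_hot("mathematician")))   # 0
print(dot(one_hot("physicist"), one_hot("store")))           # 0
```

Under this encoding "physicist" is exactly as unrelated to "mathematician" as it is to "store", which is the drawback described above.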

Suppose we are building a language model. Suppose we have seen the sentences
- The mathematician ran to the store.
- The physicist ran to the store.
- The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before seen in our training data:
- The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn’t it be much better if we could use the following two facts:
- We have seen mathematician and physicist in the same role in a sentence. Somehow they have a semantic relation.
- We have seen mathematician in the same role in this new unseen sentence as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen sentence? This is what we mean by a notion of similarity: we mean semantic similarity, not simply having similar orthographic representations. It is a technique to combat the sparsity of linguistic data, by connecting the dots between what we have seen and what we haven’t. This example of course relies on a fundamental linguistic assumption: that words appearing in similar contexts are related to each other semantically. This is called the **distributional hypothesis** (https://en.wikipedia.org/wiki/Distributional_semantics).

## 3.1 Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode semantic similarity in words? Maybe we think up some semantic attributes. For example, we see that both mathematicians and physicists can run, so maybe we give these words a high score for the “is able to run” semantic attribute. Think of some other attributes, and imagine what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector, like this:
- q_mathematician=[2.3 (can run),9.4 (likes coffee),−5.5 (majored in Physics),…]
- q_physicist=[2.5 (can run),9.1 (likes coffee),6.4 (majored in Physics),…]

Then we can get a measure of similarity between these words by doing:
- Similarity(physicist,mathematician)=q_physicist⋅q_mathematician

Although it is more common to normalize by the lengths:
- Similarity(physicist,mathematician)=(q_physicist⋅q_mathematician)/(||q_physicist|| ||q_mathematician||)=cos(ϕ)

Where ϕ is the angle between the two vectors. That way, extremely similar words (words whose embeddings point in the same direction) will have similarity 1. Extremely dissimilar words should have similarity -1.
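Both similarity measures are a few lines of plain Python. The sketch below uses only the three attribute scores listed above; the elided remaining attributes are simply left out, so the printed numbers illustrate the formula rather than any real embedding. The helpers `dot` and `cosine` are hypothetical.

```python
import math

# Only the three listed attributes: can run, likes coffee, majored in Physics.
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

print(dot(q_physicist, q_mathematician))      # unnormalized similarity
print(cosine(q_physicist, q_mathematician))   # normalized, between -1 and 1
print(cosine(q_physicist, q_physicist))       # same direction
print(cosine(q_physicist, [-c for c in q_physicist]))   # opposite direction
```

Normalizing by the lengths keeps the score in [-1, 1], so a word with large attribute values does not look similar to everything just because its vector is long.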

You can think of the sparse one-hot vectors from the beginning of this section as a special case of these new vectors we have defined, where any two distinct words have similarity 0, and we gave each word some unique semantic attribute. These new vectors are dense, which is to say their entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of different semantic attributes that might be relevant to determining similarity, and how on earth would you set the values of the different attributes? Central to the idea of deep learning is that the neural network learns representations of the features, rather than requiring the programmer to design them herself. So why not just let the word embeddings be parameters in our model, and then be updated during training? This is exactly what we will do. We will have some latent semantic attributes that the network can, in principle, learn. Note that the word embeddings will probably not be interpretable. That is, although with our hand-crafted vectors above we can see that mathematicians and physicists are similar in that they both like coffee, if we allow a neural network to learn the embeddings and see that both mathematicians and physicists have a large value in the second dimension, it is not clear what that means. They are similar in some latent semantic dimension, but this probably has no interpretation to us.

In summary, **word embeddings are a representation of the *semantics* of a word, efficiently encoding semantic information that might be relevant to the task at hand.** You can embed other things too: part of speech tags, parse trees, anything! The idea of feature embeddings is central to the field.

## 3.2 Word Embeddings in PyTorch

Before we get to a worked example and an exercise, a few quick notes about how to use embeddings in PyTorch and in deep learning programming in general. Similar to how we defined a unique index for each word when making one-hot vectors, we also need to define an index for each word when using embeddings. These will be keys into a lookup table. That is, embeddings are stored as a **|V|×D matrix**, where D is the dimensionality of the embeddings, such that the word assigned index i has its embedding stored in the i-th row of the matrix. In all of my code, the mapping from words to indices is a dictionary named word_to_ix.

The module that allows you to use embeddings is **torch.nn.Embedding**, which takes two arguments: the vocabulary size, and the dimensionality of the embeddings.

To index into this table, you must use **torch.LongTensor** (since the indices are integers, not floats).
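Conceptually, the lookup is just row selection from the |V|×D matrix, which gives the same result as multiplying a one-hot vector by that matrix. The sketch below shows this with plain Python lists; the table values are made up and the helper names are hypothetical.

```python
word_to_ix = {"hello": 0, "world": 1}

# A |V| x D table with |V| = 2 and D = 3; the numbers are made up.
embedding_table = [[0.1, -0.4, 0.7],
                   [0.9, 0.2, -0.3]]

def lookup(word):
    return embedding_table[word_to_ix[word]]   # row i for the word with index i

def one_hot_times_table(word):
    rows = len(embedding_table)
    dims = len(embedding_table[0])
    one_hot = [1.0 if i == word_to_ix[word] else 0.0 for i in range(rows)]
    return [sum(one_hot[i] * embedding_table[i][j] for i in range(rows))
            for j in range(dims)]

print(lookup("world"))
print(one_hot_times_table("world"))   # same row, computed the slow way
```

Indexing avoids building the |V|-dimensional one-hot vector at all, which is why integer indices are passed to the embedding layer.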

In [94]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x1f585797870>

In [108]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.LongTensor([word_to_ix["hello"], word_to_ix["world"]])
# Wrap the index tensor in a Variable. The result has one row per index;
# flatten it with .view(1, -1) if the next layer expects a single row.
hello_embed = embeds(autograd.Variable(lookup_tensor))
print(lookup_tensor)
print(hello_embed)


 0
 1
[torch.LongTensor of size 2]

Variable containing:
-2.5667 -1.4303  0.5009  0.5438 -0.4057
 1.1341 -1.1115  0.3501 -0.7703 -0.1473
[torch.FloatTensor of size 2x5]



## 3.3 An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words w, we want to compute
- P(w_i | w_{i−1}, w_{i−2}, …, w_{i−n+1})

Where w_i is the i-th word of the sequence.

In this example, we will compute the loss function on some training examples and update the parameters with backpropagation.

In [121]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10

# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()

# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i+1]], test_sentence[i+2]) for i in range(len(test_sentence)-2)]

# print the first 3, just so you can see what they look like
print(trigrams[:3])

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]


In [122]:
vocab = set(test_sentence)
word_to_ix = {word:i for i,word in enumerate(vocab)}
print(word_to_ix)

{'praise': 0, 'winters': 1, 'held:': 2, 'Were': 3, 'days;': 4, 'How': 5, 'within': 6, 'thriftless': 7, 'new': 8, 'say,': 9, 'see': 10, 'were': 11, 'thou': 12, "totter'd": 13, 'asked,': 14, 'own': 15, 'now,': 16, 'worth': 17, 'field,': 18, 'and': 19, 'child': 20, 'blood': 21, 'more': 22, 'lies,': 23, 'answer': 24, 'of': 25, 'thine': 26, 'all-eating': 27, 'fair': 28, 'sum': 29, 'treasure': 30, "excuse,'": 31, 'all': 32, 'dig': 33, 'by': 34, 'besiege': 35, 'sunken': 36, 'on': 37, 'be': 38, 'much': 39, 'when': 40, 'And': 41, 'mine': 42, 'thy': 43, 'small': 44, 'it': 45, 'brow,': 46, 'thine!': 47, 'count,': 48, 'old': 49, 'livery': 50, 'Thy': 51, 'forty': 52, 'eyes,': 53, 'To': 54, 'trenches': 55, 'so': 56, 'If': 57, 'in': 58, "feel'st": 59, 'shall': 60, 'Shall': 61, 'make': 62, 'gazed': 63, 'succession': 64, 'the': 65, 'use,': 66, 'an': 67, 'Will': 68, 'shame,': 69, 'couldst': 70, 'his': 71, "beauty's": 72, 'art': 73, 'where': 74, "youth's": 75, 'a': 76, 'made': 77, 'warm': 78, 'This': 79,

In [127]:
class NGramLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size*embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)
        
    def forward(self, inputs):
        embeds = self.embeddings(inputs).view(1, -1)
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1) #(batch, V)
        return log_probs
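
To make the `.view(1, -1)` in `forward` less mysterious, here is a minimal sketch that traces the tensor shapes through the same layers outside of a module. The sizes match the model printed in this section; the two word indices are arbitrary picks of mine, not anything special.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sizes chosen to match the model printed in this section.
vocab_size, embedding_dim, context_size = 97, 10, 2

embeddings = nn.Embedding(vocab_size, embedding_dim)
linear1 = nn.Linear(context_size * embedding_dim, 128)
linear2 = nn.Linear(128, vocab_size)

context_idxs = torch.tensor([52, 1])       # two arbitrary word indices
embeds = embeddings(context_idxs)          # (2, 10): one row per context word
flat = embeds.view(1, -1)                  # (1, 20): rows concatenated, batch of one
hidden = F.relu(linear1(flat))             # (1, 128)
log_probs = F.log_softmax(linear2(hidden), dim=1)  # (1, 97)

print(embeds.shape, flat.shape, hidden.shape, log_probs.shape)
# The row of log_probs is a log-distribution, so its exponentials sum to 1.
print(log_probs.exp().sum().item())
```

The flattening is what lets `linear1` see both context embeddings at once, which is why its input size is `context_size * embedding_dim`.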

In [128]:
losses = []
criterion = nn.NLLLoss()
model = NGramLanguageModel(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [129]:
model

NGramLanguageModel(
  (embeddings): Embedding(97, 10)
  (linear1): Linear(in_features=20, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=97, bias=True)
)

In [130]:
for epoch in range(10):
    total_loss = 0.0
    for context, target in trigrams:
        
        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in Variable)
        context_idxs = [word_to_ix[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idxs))
        
        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old instance
        model.zero_grad()
        
        # Step 3. Run the forward pass, getting log probabilities over next words
        log_probs = model(context_var)
        
        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a Variable)
        loss = criterion(log_probs, autograd.Variable(torch.LongTensor([word_to_ix[target]])))
        
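        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()
        
        # loss.item() extracts the Python number from the one-element loss
        # (PyTorch releases before 0.4 used loss.data[0] here)
        total_loss += loss.item()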
    losses.append(total_loss)

# The loss decreased every iteration over the training data!
print(losses) 

[519.310714006424, 516.8290419578552, 514.3636934757233, 511.9132571220398, 509.47750878334045, 507.0546727180481, 504.6437327861786, 502.2436821460724, 499.853636264801, 497.4737927913666]
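
Step 2 of the loop above relies on the fact that gradients accumulate. A minimal sketch of that behavior on a single tensor, independent of the n-gram model (the tensor `w` and the toy function are my own illustration):

```python
import torch

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# First backward pass: the derivative of sum(2 * w) is 2 for every element.
(2 * w).sum().backward()
first = w.grad.clone()

# Second backward pass without zeroing: the new gradient is added to the old one.
(2 * w).sum().backward()
accumulated = w.grad.clone()

# Zeroing gives a clean slate again. model.zero_grad() resets the gradients
# of every parameter of a module in the same spirit (to zero or to None,
# depending on the PyTorch version).
w.grad.zero_()
(2 * w).sum().backward()
after_zero = w.grad.clone()

print(first, accumulated, after_zero)
```

If you forget the reset, each update uses the sum of the gradients of every instance seen so far, which is rarely what you want.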


## 3.4 Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep learning. It is a model that tries to predict words given the context of a few words before and a few words after the target word. This is distinct from language modeling, since CBOW is not sequential and does not have to be probabilistic. Typically, CBOW is used to quickly train word embeddings, which are then used to initialize the embeddings of some more complicated model. This is usually referred to as pretraining embeddings. It almost always improves performance by a couple of percent.

The CBOW model is as follows. Given a target word $w_i$ and a context window of $N$ words on each side, $w_{i-1}, \dots, w_{i-N}$ and $w_{i+1}, \dots, w_{i+N}$, referring to all context words collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
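
To see the formula as tensor operations, here is a minimal sketch that evaluates the loss once with plain layers. It is not the module the exercise asks for; the sizes and word indices are arbitrary assumptions of mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, embedding_dim = 49, 10             # hypothetical sizes

q = nn.Embedding(vocab_size, embedding_dim)    # one embedding q_w per word
affine = nn.Linear(embedding_dim, vocab_size)  # plays the role of A and b

context = torch.tensor([3, 7, 11, 20])         # indices of the words in C (arbitrary)
target = torch.tensor([5])                     # index of the target word (arbitrary)

summed = q(context).sum(dim=0).view(1, -1)     # sum of q_w over C -> (1, embedding_dim)
log_probs = F.log_softmax(affine(summed), dim=1)  # (1, vocab_size)
loss = F.nll_loss(log_probs, target)           # picks out -log p(target | C)
print(loss.item())
```

Note that the sum throws away word order, which is exactly why the model is called a bag of words.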

Implement this model in Pytorch by filling in the class below. Some tips:
- Think about which parameters you need to define.
- Make sure you know what shape each operation expects. Use .view() if you need to reshape.

In [None]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self):
        pass

    def forward(self, inputs):
        pass

# create your model and train.  here are some functions to help you make
# the data ready for use by your module

def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)


make_context_vector(data[0][0], word_to_ix)  # example

# 4 Sequence Models and Long Short-Term Memory Networks

At this point, we have seen various feed-forward networks. That is, there is no state maintained by the network at all. This might not be the behavior we want. Sequence models are central to NLP: they are models where there is some sort of dependence through time between your inputs. The classical example of a sequence model is the Hidden Markov Model for part-of-speech tagging. Another example is the conditional random field.

A recurrent neural network is a network that maintains some kind of state. For example, its output could be used as part of the next input, so that information can propagate along as the network passes over the sequence. In the case of an LSTM, for each element in the sequence, there is a corresponding hidden state $h_t$, which in principle can contain information from arbitrary points earlier in the sequence. We can use the hidden state to predict words in a language model, part-of-speech tags, and a myriad of other things.

## 4.1 LSTMs in Pytorch

Before getting to the example, note a few things. Pytorch’s LSTM expects all of its inputs to be 3D tensors. The semantics of the axes of these tensors is important. The input has shape **(seq_len, batch, input_size)**:
- The first axis is the sequence itself,
- the second indexes instances in the mini-batch,
- and the third indexes elements of the input.

The input can also be a packed **variable length** sequence. See **torch.nn.utils.rnn.pack_padded_sequence()** or **torch.nn.utils.rnn.pack_sequence()** for details.
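
For reference, here is a minimal sketch of the packed route, assuming two sequences of lengths 5 and 2 that have already been zero-padded to a common length. The sizes are arbitrary; only `pack_padded_sequence` and `pad_packed_sequence` come from the library.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=4)

# Two sequences padded to (seq_len=5, batch=2, input_size=3).
# Lengths are listed in decreasing order, which is what
# pack_padded_sequence expects by default.
lengths = [5, 2]
padded = torch.randn(5, 2, 3)
padded[2:, 1, :] = 0.0          # positions past the end of the shorter sequence

packed = pack_padded_sequence(padded, lengths)
packed_out, (h_n, c_n) = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out)

print(out.shape)                # (5, 2, 4)
print(out_lengths)              # the original lengths, 5 and 2
# h_n holds each sequence's state at its own last real step, not at the padding.
print(torch.allclose(h_n[0, 1], out[1, 1]))
```

The point of packing is that the LSTM never runs over the padding, so the final state of the short sequence is not polluted by it.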

We haven’t discussed mini-batching, so let's just ignore that and assume we will always have just 1 dimension on the second axis. If we want to run the sequence model over the sentence “The cow jumped”, our input should look like

\begin{align}\begin{bmatrix}
   \overbrace{q_\text{The}}^\text{row vector} \\
   q_\text{cow} \\
   q_\text{jumped}
   \end{bmatrix}\end{align}


Except remember there is an additional 2nd dimension with size 1.

In addition, you could go through the sequence one at a time, in which case the 1st axis will have size 1 also.
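
As a small sketch of those two layouts for “The cow jumped”, assuming a toy three-word vocabulary and 4-dimensional embeddings (both made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
word_to_ix = {"The": 0, "cow": 1, "jumped": 2}   # hypothetical toy vocabulary
embedding = nn.Embedding(len(word_to_ix), 4)      # embedding size 4 is arbitrary

idxs = torch.tensor([word_to_ix[w] for w in "The cow jumped".split()])
rows = embedding(idxs)                  # (3, 4): one row vector per word
whole_sequence = rows.view(3, 1, -1)    # (seq_len=3, batch=1, input_size=4)
one_step = rows[0].view(1, 1, -1)       # (seq_len=1, batch=1, input_size=4)

print(whole_sequence.shape, one_step.shape)
```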

Let’s see a quick example.

In [131]:
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x1f585797870>

In [135]:
# make a sequence of length 5
inputs = [autograd.Variable(torch.randn((1,3))) for _ in range(5)]
print(inputs) # 5*1*3

[Variable containing:
 0.4533  0.2912 -0.8317
[torch.FloatTensor of size 1x3]
, Variable containing:
-0.5525  0.6355 -0.3968
[torch.FloatTensor of size 1x3]
, Variable containing:
-0.6571 -1.6428  0.9803
[torch.FloatTensor of size 1x3]
, Variable containing:
-0.0421 -0.8206  0.3133
[torch.FloatTensor of size 1x3]
, Variable containing:
-1.1352  0.3773 -0.2824
[torch.FloatTensor of size 1x3]
]


In [136]:
# initialize the hidden state.
hidden = (autograd.Variable(torch.randn(1, 1, 3)), # 1*1*3 for h
          autograd.Variable(torch.randn(1, 1, 3))) # 1*1*3 for c
print(hidden)

(Variable containing:
(0 ,.,.) = 
 -2.5667 -1.4303  0.5009
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
  0.5438 -0.4057  1.1341
[torch.FloatTensor of size 1x1x3]
)


In [137]:
lstm = nn.LSTM(3, 3)
for i in inputs:
    # Step through the sequence one element at a time.
    # after each step, hidden contains the hidden state.
    out, hidden = lstm(i.view(1, 1, -1), hidden) # pass hidden back in so the LSTM carries (hidden state, cell state) forward
    print(out)
    print(hidden)
    print('=======')

Variable containing:
(0 ,.,.) = 
  0.5158 -0.0452  0.5860
[torch.FloatTensor of size 1x1x3]

(Variable containing:
(0 ,.,.) = 
  0.5158 -0.0452  0.5860
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
  1.2370 -0.0902  0.9707
[torch.FloatTensor of size 1x1x3]
)
Variable containing:
(0 ,.,.) = 
  0.2656 -0.2105  0.3044
[torch.FloatTensor of size 1x1x3]

(Variable containing:
(0 ,.,.) = 
  0.2656 -0.2105  0.3044
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
  0.7071 -0.2950  0.5684
[torch.FloatTensor of size 1x1x3]
)
Variable containing:
(0 ,.,.) = 
  0.1058 -0.0410  0.4638
[torch.FloatTensor of size 1x1x3]

(Variable containing:
(0 ,.,.) = 
  0.1058 -0.0410  0.4638
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
  0.1456 -0.1012  0.8353
[torch.FloatTensor of size 1x1x3]
)
Variable containing:
(0 ,.,.) = 
 -0.1966 -0.0293  0.4417
[torch.FloatTensor of size 1x1x3]

(Variable containing:
(0 ,.,.) = 
 -0.1966 -0.0293  0.4417
[t

torch.nn.LSTM(*args, **kwargs)
- Parameters:	
 - input_size – The number of expected features in the input x
 - hidden_size – The number of features in the hidden state h
 - num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
 - bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
 - batch_first – If True, then the input and output tensors are provided as (batch, seq, feature)
 - dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
 - bidirectional – If True, becomes a bidirectional LSTM. Default: False
- Inputs: input, (h_0, c_0)
 - input of shape (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() or torch.nn.utils.rnn.pack_sequence() for details.
 - h_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch.
 - c_0 of shape (num_layers * num_directions, batch, hidden_size): tensor containing the initial cell state for each element in the batch.
 - If (h_0, c_0) is not provided, both h_0 and c_0 default to zero.

- Outputs: output, (h_n, c_n)
 - output of shape (seq_len, batch, hidden_size * num_directions): tensor containing the output features (h_t) from the last layer of the LSTM, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
 - h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len
 - c_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t = seq_len
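
The shape rules above are easy to check. A minimal sketch with a two-layer bidirectional LSTM, using arbitrary sizes of my choosing:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch, input_size, hidden_size = 5, 2, 3, 4   # arbitrary sizes
num_layers, num_directions = 2, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
x = torch.randn(seq_len, batch, input_size)

# No (h_0, c_0) is passed, so both default to zeros.
output, (h_n, c_n) = lstm(x)

print(output.shape)  # (seq_len, batch, hidden_size * num_directions)
print(h_n.shape)     # (num_layers * num_directions, batch, hidden_size)
print(c_n.shape)     # same shape as h_n
```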

In [138]:
# alternatively, we can do the entire sequence all at once.
# the first value returned by LSTM is all of the hidden states throughout
# the sequence. the second is just the most recent hidden state
# (compare the last slice of "out" with "hidden" below, they are the same)
# The reason for this is that:
# "out" will give you access to all hidden states in the sequence
# "hidden" will allow you to continue the sequence and backpropagate,
# by passing it as an argument  to the lstm at a later time

# Add the extra 2nd dimension

In [139]:
inputs = torch.cat(inputs) # default dim=0
print(inputs)

Variable containing:
 0.4533  0.2912 -0.8317
-0.5525  0.6355 -0.3968
-0.6571 -1.6428  0.9803
-0.0421 -0.8206  0.3133
-1.1352  0.3773 -0.2824
[torch.FloatTensor of size 5x3]



In [143]:
inputs = inputs.view(5, 1, -1)
print(inputs)

Variable containing:
(0 ,.,.) = 
  0.4533  0.2912 -0.8317

(1 ,.,.) = 
 -0.5525  0.6355 -0.3968

(2 ,.,.) = 
 -0.6571 -1.6428  0.9803

(3 ,.,.) = 
 -0.0421 -0.8206  0.3133

(4 ,.,.) = 
 -1.1352  0.3773 -0.2824
[torch.FloatTensor of size 5x1x3]



In [144]:
hidden = (autograd.Variable(torch.randn(1, 1, 3)),
          autograd.Variable(torch.randn(1, 1, 3)))
print(hidden)

(Variable containing:
(0 ,.,.) = 
  0.3170  0.5629  0.8662
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
 -0.3528  0.3482  1.1371
[torch.FloatTensor of size 1x1x3]
)


In [145]:
out, hidden = lstm(inputs, hidden) #input: 5*1*3
print(out) # out: 5*1*3
print(hidden) # hidden: 2*(1*1*3), one for h and one for c

Variable containing:
(0 ,.,.) = 
 -0.2440 -0.0258  0.3401

(1 ,.,.) = 
 -0.0774 -0.1039  0.2000

(2 ,.,.) = 
 -0.3938 -0.0046  0.4113

(3 ,.,.) = 
 -0.4511  0.0169  0.4237

(4 ,.,.) = 
 -0.1882 -0.0359  0.3567
[torch.FloatTensor of size 5x1x3]

(Variable containing:
(0 ,.,.) = 
 -0.1882 -0.0359  0.3567
[torch.FloatTensor of size 1x1x3]
, Variable containing:
(0 ,.,.) = 
 -0.3341 -0.0658  0.5682
[torch.FloatTensor of size 1x1x3]
)
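
The claim that the last slice of `out` equals the returned hidden state, and that stepping through the sequence gives the same result as one call, can be checked directly. A minimal sketch with fresh random inputs and a zero initial state (not the exact tensors printed above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(3, 3)
inputs = torch.randn(5, 1, 3)
h0 = (torch.zeros(1, 1, 3), torch.zeros(1, 1, 3))

# Whole sequence in one call.
out_all, (h_all, c_all) = lstm(inputs, h0)

# Same sequence, one element at a time, carrying the state forward.
hidden = h0
step_outs = []
for x in inputs:
    out, hidden = lstm(x.view(1, 1, -1), hidden)
    step_outs.append(out)
out_steps = torch.cat(step_outs)

print(torch.allclose(out_all[-1], h_all[0]))          # last output is the final hidden state
print(torch.allclose(out_all, out_steps, atol=1e-6))  # both routes agree
```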


## 4.2 Example: An LSTM for Part-of-Speech Tagging

In this section, we will use an LSTM to get part of speech tags. We will
not use Viterbi or Forward-Backward or anything like that, but as a
(challenging) exercise to the reader, think about how Viterbi could be
used after you have seen what is going on.

The model is as follows: let our input sentence be
$w_1, \dots, w_M$, where $w_i \in V$, our vocab. Also, let
$T$ be our tag set, and $y_i$ the tag of word $w_i$.
Denote our prediction of the tag of word $w_i$ by
$\hat{y}_i$.

This is a structured prediction model, where our output is a sequence
$\hat{y}_1, \dots, \hat{y}_M$, where $\hat{y}_i \in T$.

To do the prediction, pass an LSTM over the sentence. Denote the hidden
state at timestep $i$ as $h_i$. Also, assign each tag a
unique index (like how we had word\_to\_ix in the word embeddings
section). Then our prediction rule for $\hat{y}_i$ is

\begin{align}\hat{y}_i = \text{argmax}_j \  (\log \text{Softmax}(Ah_i + b))_j\end{align}

That is, take the log softmax of the affine map of the hidden state,
and the predicted tag is the tag that has the maximum value in this
vector. Note this implies immediately that the dimensionality of the
target space of $A$ is $|T|$.
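
A minimal sketch of this prediction rule on its own, with random vectors standing in for the LSTM hidden states (the sizes and the stand-in states are assumptions for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
hidden_dim, num_tags, M = 6, 3, 5         # arbitrary sizes; num_tags plays the role of |T|

h = torch.randn(M, hidden_dim)            # stand-in for the hidden states h_1..h_M
affine = nn.Linear(hidden_dim, num_tags)  # A and b; the output size must be |T|

scores = F.log_softmax(affine(h), dim=1)  # (M, |T|)
y_hat = scores.argmax(dim=1)              # predicted tag index for each word

# Log softmax only subtracts a per-row constant, so the argmax of the
# raw affine scores is the same.
same = torch.equal(y_hat, affine(h).argmax(dim=1))
print(y_hat, same)
```

The log softmax matters for training with `NLLLoss`, not for picking the best tag.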


Prepare data:

In [147]:
training_data = [
    ("The dog ate the apple".split(), ["DET", "NN", "V", "DET", "NN"]),
    ("Everybody read that book".split(), ["NN", "V", "DET", "NN"])
]
print(training_data)

[(['The', 'dog', 'ate', 'the', 'apple'], ['DET', 'NN', 'V', 'DET', 'NN']), (['Everybody', 'read', 'that', 'book'], ['NN', 'V', 'DET', 'NN'])]


In [148]:
word_to_ix = {}
for sent, tags in training_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

{'The': 0, 'dog': 1, 'ate': 2, 'the': 3, 'apple': 4, 'Everybody': 5, 'read': 6, 'that': 7, 'book': 8}


In [149]:
tag_to_ix = {"DET": 0, "NN": 1, "V": 2}

In [150]:
def prepare_sequence(seq, to_ix):
    idxs = [to_ix[w] for w in seq]
    tensor = torch.LongTensor(idxs)
    return autograd.Variable(tensor)

In [152]:
# These will usually be more like 32 or 64 dimensional.
# We will keep them small, so we can see how the weights change as we train.
EMBEDDING_DIM = 6
HIDDEN_DIM = 6

Create the model:

In [None]:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, target_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        
        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        
        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, target_size)
        self.hidden = self.init_hidden()
        
    def init_hidden(self):
        # Before we've done anything, we don't have any hidden state.
        # Refer to the Pytorch documentation to see exactly
        # why they have this dimensionality.
        # The axes semantics are (num_layers, minibatch_size, hidden_dim)
        return (autograd.Variable(torch.zeros(1, 1, self.hidden_dim)),
                autograd.Variable(torch.zeros(1, 1, self.hidden_dim)))
    
    def forward(self, sentence):
        embeds = self.word_embeddings(sentence).view(len(sentence), 1, -1)
        lstm_out, self.hidden = self.lstm(embeds, self.hidden)
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

Train the model:

In [153]:
model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_ix), len(tag_to_ix))
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

In [154]:
# See what the scores are before training
# Note that element i,j of the output is the score for tag j for word i.
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)
print(tag_scores)

Variable containing:
-1.2976 -0.8606 -1.1910
-1.3122 -0.8940 -1.1339
-1.2810 -0.8799 -1.1795
-1.2906 -0.8681 -1.1870
-1.3309 -0.8544 -1.1704
[torch.FloatTensor of size 5x3]



In [158]:
for epoch in range(300):
    for sentence, tags in training_data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()
        
        # Also, we need to clear out the hidden state of the LSTM,
        # detaching it from its history on the last instance.
        model.hidden = model.init_hidden()
        
        # Step 2. Get our inputs ready for the network, that is, turn them into
        # Tensors of word indices.
        sentence_in = prepare_sequence(sentence, word_to_ix)
        targets = prepare_sequence(tags, tag_to_ix)
        
        # Step 3. Run our forward pass.
        tag_scores = model(sentence_in)
        
        # Step 4. Compute the loss, gradients, and update the parameters by
        #  calling optimizer.step()
        loss = criterion(tag_scores, targets)
        loss.backward()
        optimizer.step()

In [159]:
# See what the scores are after training
inputs = prepare_sequence(training_data[0][0], word_to_ix)
tag_scores = model(inputs)

# The sentence is "the dog ate the apple".  i,j corresponds to score for tag j
# for word i. The predicted tag is the maximum scoring tag.
# Here, we can see the predicted sequence below is 0 1 2 0 1
# since 0 is index of the maximum value of row 1,
# 1 is the index of maximum value of row 2, etc.
# Which is DET NOUN VERB DET NOUN, the correct sequence!
print(tag_scores)

Variable containing:
-0.0322 -3.5142 -6.2351
-4.9764 -0.0133 -5.0654
-4.4516 -3.6952 -0.0372
-0.0138 -4.7282 -5.3295
-4.7061 -0.0194 -4.5925
[torch.FloatTensor of size 5x3]
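
Reading the argmax off by eye works for five words, but here is a small sketch that does it in code. The scores are the ones printed above, copied in as a literal tensor; `ix_to_tag` is a helper name I am introducing.

```python
import torch

tag_to_ix = {"DET": 0, "NN": 1, "V": 2}
ix_to_tag = {ix: tag for tag, ix in tag_to_ix.items()}  # hypothetical reverse lookup

# The log-probabilities printed above, copied in as a literal tensor.
tag_scores = torch.tensor([
    [-0.0322, -3.5142, -6.2351],
    [-4.9764, -0.0133, -5.0654],
    [-4.4516, -3.6952, -0.0372],
    [-0.0138, -4.7282, -5.3295],
    [-4.7061, -0.0194, -4.5925],
])

predicted_ix = tag_scores.argmax(dim=1).tolist()
predicted_tags = [ix_to_tag[ix] for ix in predicted_ix]
print(predicted_tags)  # ['DET', 'NN', 'V', 'DET', 'NN']
```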



##  4.3 Exercise: Augmenting the LSTM part-of-speech tagger with character-level features

In the example above, each word had an embedding, which served as the
inputs to our sequence model. Let's augment the word embeddings with a
representation derived from the characters of the word. We expect that
this should help significantly, since character-level information like
affixes has a large bearing on part-of-speech. For example, words with
the affix *-ly* are almost always tagged as adverbs in English.

To do this, let $c_w$ be the character-level representation of
word $w$. Let $x_w$ be the word embedding as before. Then
the input to our sequence model is the concatenation of $x_w$ and
$c_w$. So if $x_w$ has dimension 5, and $c_w$
dimension 3, then our LSTM should accept an input of dimension 8.

To get the character level representation, do an LSTM over the
characters of a word, and let $c_w$ be the final hidden state of
this LSTM. Hints:

* There are going to be two LSTM's in your new model.
  The original one that outputs POS tag scores, and the new one that
  outputs a character-level representation of each word.
* To do a sequence model over characters, you will have to embed characters.
  The character embeddings will be the input to the character LSTM.
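
To make the dimensions concrete, here is a rough sketch of just the character-level piece for a single word, leaving the tagger itself to you. The character vocabulary, the character embedding size, and the random stand-in for the word embedding are all assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
WORD_DIM, CHAR_EMB_DIM, CHAR_HIDDEN_DIM = 5, 4, 3   # 5 and 3 follow the text; 4 is arbitrary

word = "quickly"
char_to_ix = {c: i for i, c in enumerate(sorted(set(word)))}  # hypothetical character vocabulary

char_embeddings = nn.Embedding(len(char_to_ix), CHAR_EMB_DIM)
char_lstm = nn.LSTM(CHAR_EMB_DIM, CHAR_HIDDEN_DIM)

char_idxs = torch.tensor([char_to_ix[c] for c in word])
char_embeds = char_embeddings(char_idxs).view(len(word), 1, -1)
_, (h_n, _) = char_lstm(char_embeds)     # initial state defaults to zeros
c_w = h_n.view(-1)                       # final hidden state, dimension 3

x_w = torch.randn(WORD_DIM)              # stand-in for the word embedding
lstm_input = torch.cat([x_w, c_w])       # dimension 5 + 3 = 8
print(lstm_input.shape)
```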

# 5 Advanced: Making Dynamic Decisions and the Bi-LSTM CRF

## 5.1 Dynamic versus Static Deep Learning Toolkits

Pytorch is a *dynamic* neural network kit. Another example of a dynamic
kit is [Dynet](https://github.com/clab/dynet) (I mention this because
working with Pytorch and Dynet is similar. If you see an example in
Dynet, it will probably help you implement it in Pytorch). The opposite
is the *static* tool kit, which includes Theano, Keras, TensorFlow, etc.
The core difference is the following:

* In a static toolkit, you define
  a computation graph once, compile it, and then stream instances to it.
* In a dynamic toolkit, you define a computation graph *for each
  instance*. It is never compiled and is executed on-the-fly.
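
A toy sketch of what "a graph for each instance" means in practice. The `RepeatLayer` module is hypothetical and does nothing useful; it only shows that ordinary Python control flow decides the graph's shape on every call.

```python
import torch
import torch.nn as nn

class RepeatLayer(nn.Module):
    """Hypothetical module: applies one linear layer once per token."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, tokens):
        # A plain Python loop: its length depends on the input sentence.
        for _ in tokens:
            x = torch.tanh(self.linear(x))
        return x

torch.manual_seed(0)
model = RepeatLayer(4)
x = torch.randn(1, 4)

for sentence in ["The green cat scratched the wall".split(),
                 "Somewhere the big fat cat scratched the wall".split()]:
    model.zero_grad()
    out = model(x, sentence)       # graph depth equals len(sentence)
    out.sum().backward()           # autograd follows whatever graph was just built
    print(len(sentence), model.linear.weight.grad.norm().item())
```

Nothing is declared ahead of time here: the six-word and eight-word sentences simply produce graphs of different depth.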

Without a lot of experience, it is difficult to appreciate the
difference. One example is to suppose we want to build a deep
constituent parser. Suppose our model involves roughly the following
steps:

* We build the tree bottom up
* Tag the leaf nodes (the words of the sentence)
* From there, use a neural network and the embeddings
  of the words to find combinations that form constituents. Whenever you
  form a new constituent, use some sort of technique to get an embedding
  of the constituent.

In this case, our network architecture will depend
completely on the input sentence. In the sentence "The green cat
scratched the wall", at some point in the model, we will want to combine
the span $(i,j,r) = (1, 3, \text{NP})$ (that is, an NP constituent
spans word 1 to word 3, in this case "The green cat").

However, another sentence might be "Somewhere, the big fat cat scratched
the wall". In this sentence, we will want to form the constituent
$(2, 5, \text{NP})$ at some point (counting words only, "the big fat cat" spans word 2 to word 5). The constituents we will want to form
will depend on the instance. If we just compile the computation graph
once, as in a static toolkit, it will be exceptionally difficult or
impossible to program this logic. In a dynamic toolkit though, there
isn't just 1 pre-defined computation graph. There can be a new
computation graph for each instance, so this problem goes away.

Dynamic toolkits also have the advantage of being easier to debug and
the code more closely resembling the host language (by that I mean that
Pytorch and Dynet look more like actual Python code than Keras or
Theano).

## 5.2 Bi-LSTM Conditional Random Field Discussion


For this section, we will see a full, complicated example of a Bi-LSTM
Conditional Random Field for named-entity recognition. The LSTM tagger
above is typically sufficient for part-of-speech tagging, but a sequence
model like the CRF is really essential for strong performance on NER.
Familiarity with CRFs is assumed. Although this name sounds scary, the
model is just a CRF in which an LSTM provides the features. This is
an advanced model though, far more complicated than any earlier model in
this tutorial. If you want to skip it, that is fine. To see if you're
ready, see if you can:

-  Write the recurrence for the Viterbi variable at step i for tag k.
-  Modify the above recurrence to compute the forward variables instead.
-  Modify again the above recurrence to compute the forward variables in
   log-space (hint: log-sum-exp)
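
On the log-sum-exp hint: a minimal sketch of why the naive version is not good enough, with scores deliberately chosen large enough to overflow. The helper `log_sum_exp` is my own illustration; `torch.logsumexp` is the built-in equivalent.

```python
import torch

def log_sum_exp(vec):
    """Stable log(sum(exp(vec))) for a 1-D tensor."""
    max_score = vec.max()
    return max_score + torch.log(torch.exp(vec - max_score).sum())

scores = torch.tensor([1000.0, 1001.0, 1002.0])

naive = torch.log(torch.exp(scores).sum())   # exp overflows to inf in float32
stable = log_sum_exp(scores)

print(naive.item())    # inf
print(stable.item())
print(torch.allclose(stable, torch.logsumexp(scores, dim=0)))
```

Subtracting the maximum before exponentiating keeps every exponent at or below zero, so nothing overflows.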

If you can do those three things, you should be able to understand the
code below. Recall that the CRF computes a conditional probability. Let
$y$ be a tag sequence and $x$ an input sequence of words.
Then we compute

\begin{align}P(y|x) = \frac{\exp(\text{Score}(x, y))}{\sum_{y'} \exp(\text{Score}(x, y'))}\end{align}

Where the score is determined by defining some log potentials
$\log \psi_i(x,y)$ such that

\begin{align}\text{Score}(x,y) = \sum_i \log \psi_i(x,y)\end{align}

To make the partition function tractable, the potentials must look only
at local features.

In the Bi-LSTM CRF, we define two kinds of potentials: emission and
transition. The emission potential for the word at index $i$ comes
from the hidden state of the Bi-LSTM at timestep $i$. The
transition scores are stored in a $|T| \times |T|$ matrix
$\textbf{P}$, where $T$ is the tag set. In my
implementation, $\textbf{P}_{j,k}$ is the score of transitioning
to tag $j$ from tag $k$. So:

\begin{align}\text{Score}(x,y) = \sum_i \log \psi_\text{EMIT}(y_i \rightarrow x_i) + \log \psi_\text{TRANS}(y_{i-1} \rightarrow y_i)\end{align}

\begin{align}= \sum_i h_i[y_i] + \textbf{P}_{y_i, y_{i-1}}\end{align}

where in this second expression, we think of the tags as being assigned
unique non-negative indices.
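
Here is a rough sketch of these quantities on a tiny example, with random numbers standing in for the Bi-LSTM emissions. It is not the implementation this section builds toward. I assume there are no start or stop tags, so the first position has no transition term, and I check the forward-algorithm partition function against brute-force enumeration.

```python
import itertools
import torch

torch.manual_seed(0)
num_tags, seq_len = 3, 4                 # arbitrary sizes
h = torch.randn(seq_len, num_tags)       # stand-in emission scores: h[i, j] = h_i[j]
P = torch.randn(num_tags, num_tags)      # P[j, k]: score of moving to tag j from tag k

def score(h, P, y):
    """Score(x, y); assumes the first position has no transition term."""
    total = h[0, y[0]]
    for i in range(1, len(y)):
        total = total + h[i, y[i]] + P[y[i], y[i - 1]]
    return total

def log_partition(h, P):
    """log of the sum over all y' of exp(Score(x, y')), by the forward algorithm."""
    alpha = h[0]                                     # (num_tags,)
    for i in range(1, h.shape[0]):
        # alpha_new[j] = h[i, j] + logsumexp over k of (P[j, k] + alpha[k])
        alpha = h[i] + torch.logsumexp(P + alpha.unsqueeze(0), dim=1)
    return torch.logsumexp(alpha, dim=0)

# Brute force over all |T|^M tag sequences, feasible only for tiny inputs.
all_scores = torch.stack([score(h, P, y)
                          for y in itertools.product(range(num_tags), repeat=seq_len)])
brute = torch.logsumexp(all_scores, dim=0)

y = (0, 2, 1, 1)                                     # an arbitrary tag sequence
log_prob = score(h, P, y) - log_partition(h, P)      # log P(y | x)
print(torch.allclose(brute, log_partition(h, P), atol=1e-5), log_prob.item())
```

The brute-force sum has $|T|^M$ terms while the forward algorithm needs only on the order of $M|T|^2$ operations, which is the practical meaning of "tractable" here.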

If the above discussion was too brief, you can check out
[this](http://www.cs.columbia.edu/%7Emcollins/crf.pdf) write up from
Michael Collins on CRFs.

## 5.3 Implementation Notes


In [160]:
# TBD