In [1]:
%matplotlib inline

# Tensorflow

Some useful links:
- https://www.tensorflow.org/tutorials (read everything in the 'BEGINNER' section)
- https://sourcedexter.com/tensorflow-text-classification-python/ (try classifying sentences with BoW)

# Torch

If you choose the Torch way, follow the steps below.

Introduction to Torch's tensor library
======================================

All of deep learning is computations on tensors, which are
generalizations of a matrix that can be indexed in more than 2
dimensions. We will see exactly what this means in-depth later. First,
let's look at what we can do with tensors.



In [1]:
# http://pytorch.org/
# from os.path import exists
# platform = '{}{}-{}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag())
# cuda_output = !ldconfig -p|grep cudart.so|sed -e 's/.*\.\([0-9]*\)\.\([0-9]*\)$/cu\1\2/'
# accelerator = cuda_output[0] if exists('/dev/nvidia0') else 'cpu'

# !pip install -q http://download.pytorch.org/whl/{accelerator}/torch-0.4.1-{platform}-linux_x86_64.whl torchvision
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7fc02cc5f690>

In [2]:
# torch.tensor(data) creates a torch.Tensor object with the given data.
V_data = [1., 2., 3.]
V = torch.tensor(V_data)
print(V)

# Creates a matrix
M_data = [[1., 2., 3.], [4., 5., 6]]
M = torch.tensor(M_data)
print(M)

# Create a 3D tensor of size 2x2x2.
T_data = [[[1., 2.], [3., 4.]],
          [[5., 6.], [7., 8.]]]
T = torch.tensor(T_data)
print(T)

tensor([1., 2., 3.])
tensor([[1., 2., 3.],
        [4., 5., 6.]])
tensor([[[1., 2.],
         [3., 4.]],

        [[5., 6.],
         [7., 8.]]])


Creating Tensors
------------------

Tensors can be created from Python lists with the torch.tensor()
function.




What is a 3D tensor anyway? Think about it like this. If you have a
vector, indexing into the vector gives you a scalar. If you have a
matrix, indexing into the matrix gives you a vector. If you have a 3D
tensor, then indexing into the tensor gives you a matrix!

A note on terminology:
when I say "tensor" in this tutorial, it refers
to any torch.Tensor object. Vectors and matrices are special cases of
torch.Tensors, with 1 and 2 dimensions respectively. When I am
talking about 3D tensors, I will explicitly use the term "3D tensor".




In [3]:
# Index into V and get a scalar (0 dimensional tensor)
print(V[0])
# Get a Python number from it
print(V[0].item())

# Index into M and get a vector
print(M[0])

# Index into T and get a matrix
print(T[0])

tensor(1.)
1.0
tensor([1., 2., 3.])
tensor([[1., 2.],
        [3., 4.]])


You can also create tensors of other datatypes. The default, as you can
see, is Float. To create a tensor of integer types, try
torch.LongTensor(). Check the documentation for more data types, but
Float and Long will be the most common.
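
As a small sketch of the point about datatypes (the variable names here are only illustrative), the dtype of a tensor can be inspected, requested explicitly, or converted afterwards. This assumes the default floating-point dtype has not been changed from float32:

```python
import torch

# Floating-point literals give the default float dtype (torch.float32).
floats = torch.tensor([1., 2., 3.])
print(floats.dtype)

# Integer literals give a 64-bit integer ("Long") tensor.
ints = torch.tensor([1, 2, 3])
print(ints.dtype)

# The dtype can also be requested explicitly, or converted afterwards.
explicit = torch.tensor([1, 2, 3], dtype=torch.float32)
converted = floats.long()
print(explicit.dtype, converted.dtype)
```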




You can create a tensor with random data and the supplied dimensionality
with torch.randn()




In [4]:
x = torch.randn((3, 4, 5))
print(x)

tensor([[[-1.5256, -0.7502, -0.6540, -1.6095, -0.1002],
         [-0.6092, -0.9798, -1.6091, -0.7121,  0.3037],
         [-0.7773, -0.2515, -0.2223,  1.6871,  0.2284],
         [ 0.4676, -0.6970, -1.1608,  0.6995,  0.1991]],

        [[ 0.8657,  0.2444, -0.6629,  0.8073,  1.1017],
         [-0.1759, -2.2456, -1.4465,  0.0612, -0.6177],
         [-0.7981, -0.1316,  1.8793, -0.0721,  0.1578],
         [-0.7735,  0.1991,  0.0457,  0.1530, -0.4757]],

        [[-0.1110,  0.2927, -0.1578, -0.0288,  0.4533],
         [ 1.1422,  0.2486, -1.7754, -0.0255, -1.0233],
         [-0.5962, -1.0055,  0.4285,  1.4761, -1.7869],
         [ 1.6103, -0.7040, -0.1853, -0.9962, -0.8313]]])


Operations with Tensors
--------------------------

You can operate on tensors in the ways you would expect.



In [5]:
x = torch.tensor([1., 2., 3.])
y = torch.tensor([4., 5., 6.])
z = x + y
print(z)

tensor([5., 7., 9.])


See [the documentation](http://pytorch.org/docs/torch.html) for a
complete list of the massive number of operations available to you. They
expand beyond just mathematical operations.

One helpful operation that we will make use of later is concatenation.




In [6]:
# By default, it concatenates along the first axis (concatenates rows)
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])
print(z_1)

# Concatenate columns:
x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
# second arg specifies which axis to concat along
z_2 = torch.cat([x_2, y_2], 1)
print(z_2)

# If your tensors are not compatible, torch will complain.  Uncomment to see the error
# torch.cat([x_1, x_2])

tensor([[-0.8029,  0.2366,  0.2857,  0.6898, -0.6331],
        [ 0.8795, -0.6842,  0.4533,  0.2912, -0.8317],
        [-0.5525,  0.6355, -0.3968, -0.6571, -1.6428],
        [ 0.9803, -0.0421, -0.8206,  0.3133, -1.1352],
        [ 0.3773, -0.2824, -2.5667, -1.4303,  0.5009]])
tensor([[ 0.5438, -0.4057,  1.1341, -0.1473,  0.6272,  1.0935,  0.0939,  1.2381],
        [-1.1115,  0.3501, -0.7703, -1.3459,  0.5119, -0.6933, -0.1668, -0.9999]])


Reshaping Tensors
--------------------

Use the .view() method to reshape a tensor. This method receives heavy
use, because many neural network components expect their inputs to have
a certain shape. Often you will need to reshape before passing your data
to the component.




In [7]:
x = torch.randn(2, 3, 4)
print(x)
print(x.view(2, 12))  # Reshape to 2 rows, 12 columns
# Same as above.  If one of the dimensions is -1, its size can be inferred
print(x.view(2, -1))

tensor([[[ 0.4175, -0.2127, -0.8400, -0.4200],
         [-0.6240, -0.9773,  0.8748,  0.9873],
         [-0.0594, -2.4919,  0.2423,  0.2883]],

        [[-0.1095,  0.3126,  1.5038,  0.5038],
         [ 0.6223, -0.4481, -0.2856,  0.3880],
         [-1.1435, -0.6512, -0.1032,  0.6937]]])
tensor([[ 0.4175, -0.2127, -0.8400, -0.4200, -0.6240, -0.9773,  0.8748,  0.9873,
         -0.0594, -2.4919,  0.2423,  0.2883],
        [-0.1095,  0.3126,  1.5038,  0.5038,  0.6223, -0.4481, -0.2856,  0.3880,
         -1.1435, -0.6512, -0.1032,  0.6937]])
tensor([[ 0.4175, -0.2127, -0.8400, -0.4200, -0.6240, -0.9773,  0.8748,  0.9873,
         -0.0594, -2.4919,  0.2423,  0.2883],
        [-0.1095,  0.3126,  1.5038,  0.5038,  0.6223, -0.4481, -0.2856,  0.3880,
         -1.1435, -0.6512, -0.1032,  0.6937]])
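
Two properties of .view() are worth keeping in mind, sketched below on a small hypothetical tensor: the result shares its data with the original tensor, and the requested shape must account for exactly the same number of elements.

```python
import torch

x = torch.arange(6.)      # tensor([0., 1., 2., 3., 4., 5.])
m = x.view(2, 3)          # same data, seen as 2 rows of 3

# Writing through the view changes the original tensor as well.
m[0, 0] = 100.
print(x[0].item())

# The total number of elements must be preserved.
view_failed = False
try:
    x.view(4, -1)         # 6 elements cannot be split into 4 equal rows
except RuntimeError as err:
    view_failed = True
    print("view failed:", err)
```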


Computation Graphs and Automatic Differentiation
================================================

The concept of a computation graph is essential to efficient deep
learning programming, because it allows you to not have to write the
back propagation gradients yourself. A computation graph is simply a
specification of how your data is combined to give you the output. Since
the graph totally specifies what parameters were involved with which
operations, it contains enough information to compute derivatives. This
probably sounds vague, so let's see what is going on using the
fundamental flag ``requires_grad``.

First, think from a programmer's perspective. What is stored in the
torch.Tensor objects we were creating above? Obviously the data and the
shape, and maybe a few other things. But when we added two tensors
together, we got an output tensor. All this output tensor knows is its
data and shape. It has no idea that it was the sum of two other tensors
(it could have been read in from a file, it could be the result of some
other operation, etc.)

If ``requires_grad=True``, the Tensor object keeps track of how it was
created. Let's see it in action.




In [8]:
# Tensor factory methods have a ``requires_grad`` flag
x = torch.tensor([1., 2., 3], requires_grad=True)

# With requires_grad=True, you can still do all the operations you previously
# could
y = torch.tensor([4., 5., 6], requires_grad=True)
z = x + y
print(z)

# BUT z knows something extra.
print(z.grad_fn)

tensor([5., 7., 9.], grad_fn=<AddBackward0>)
<AddBackward0 object at 0x7fc02e64fb10>


So Tensors know what created them. z knows that it wasn't read in from
a file, it wasn't the result of a multiplication or exponential or
whatever. And if you keep following z.grad_fn, you will find yourself at
x and y.

But how does that help us compute a gradient?




In [9]:
# Let's sum up all the entries in z
s = z.sum()
print(s)
print(s.grad_fn)

tensor(21., grad_fn=<SumBackward0>)
<SumBackward0 object at 0x7fc02e631410>


So now, what is the derivative of this sum with respect to the first
component of x? In math, we want

\begin{align}\frac{\partial s}{\partial x_0}\end{align}



Well, s knows that it was created as a sum of the tensor z. z knows
that it was the sum x + y. So

\begin{align}s = \overbrace{x_0 + y_0}^\text{$z_0$} + \overbrace{x_1 + y_1}^\text{$z_1$} + \overbrace{x_2 + y_2}^\text{$z_2$}\end{align}

And so s contains enough information to determine that the derivative
we want is 1:

\begin{align}\frac{\partial s}{\partial x_0} = \overbrace{\frac{\partial x_0}{\partial x_0} + 0}^\text{$z_0$} + \overbrace{0 + 0}^\text{$z_1$} + \overbrace{0 + 0}^\text{$z_2$} = 1\end{align}


Of course this glosses over the challenge of how to actually compute
that derivative. The point here is that s is carrying along enough
information that it is possible to compute it. In reality, the
developers of PyTorch program the sum() and + operations to know how to
compute their gradients, and run the back propagation algorithm. An
in-depth discussion of that algorithm is beyond the scope of this
tutorial.




Let's have PyTorch compute the gradient and see that we were right.
(Note that if you run this block multiple times, the gradient will increment.
That is because PyTorch *accumulates* the gradient into the .grad
property, since for many models this is very convenient.)




In [10]:
# calling .backward() on any variable will run backprop, starting from it.
s.backward()
print(x.grad)

tensor([1., 1., 1.])
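
To make the accumulation behaviour concrete, here is a minimal sketch on a fresh tensor (the names `first` and `second` are only for illustration). Each call to backward() adds to .grad rather than overwriting it, so the gradient has to be reset by hand when a fresh value is wanted:

```python
import torch

x = torch.tensor([1., 2., 3.], requires_grad=True)

# First backward pass: d(sum)/dx is 1 for every component.
x.sum().backward()
first = x.grad.clone()
print(first)

# Second backward pass: the new gradient is added to the old one.
x.sum().backward()
second = x.grad.clone()
print(second)

# Resetting .grad in-place gives a clean slate for the next pass.
x.grad.zero_()
x.sum().backward()
print(x.grad)
```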


Understanding what is going on in the block below is crucial for being a
successful programmer in deep learning.




In [11]:
x = torch.randn(2, 2)
y = torch.randn(2, 2)
# By default, user created Tensors have ``requires_grad=False``
print(x.requires_grad, y.requires_grad)
z = x + y
# So you can't backprop through z
print(z.grad_fn)
print(z.requires_grad)
print()

# ``.requires_grad_( ... )`` changes an existing Tensor's ``requires_grad``
# flag in-place. The input flag defaults to ``True`` if not given.
x = x.requires_grad_()
y = y.requires_grad_()
# z contains enough information to compute gradients, as we saw above
z = x + y
print(z.grad_fn)
# If any input to an operation has ``requires_grad=True``, so will the output
print(z.requires_grad)
print()

# Now z has the computation history that relates itself to x and y
# Can we just take its values, and **detach** it from its history?
new_z = z.detach()

# ... does new_z have information to backprop to x and y?
# NO!
print(new_z.grad_fn)
print(new_z.requires_grad)
# And how could it? ``z.detach()`` returns a tensor that shares the same storage
# as ``z``, but with the computation history forgotten. It doesn't know anything
# about how it was computed.
# In essence, we have broken the Tensor away from its past history

False False
None
False

<AddBackward0 object at 0x7fc02e667710>
True

None
False


You can also stop autograd from tracking history on Tensors
with ``requires_grad=True`` by wrapping the code block in
``with torch.no_grad():``



In [12]:
print(x.requires_grad)
print((x ** 2).requires_grad)

with torch.no_grad():
    print((x ** 2).requires_grad)

True
True
False


Deep Learning Building Blocks: Affine maps, non-linearities and objectives
==========================================================================

Deep learning consists of composing linearities with non-linearities in
clever ways. The introduction of non-linearities allows for powerful
models. In this section, we will play with these core components, make
up an objective function, and see how the model is trained.


Affine Maps
-------------

One of the core workhorses of deep learning is the affine map, which is
a function $f(x)$ where

\begin{align}f(x) = Ax + b\end{align}

for a matrix $A$ and vectors $x, b$. The parameters to be learned here are $A$ and $b$. Often, $b$ is referred to
as the *bias* term.


PyTorch and most other deep learning frameworks do things a little
differently than traditional linear algebra. They map the rows of the
input instead of the columns. That is, the $i$'th row of the
output below is the mapping of the $i$'th row of the input under
$A$, plus the bias term. Look at the example below.




In [13]:
lin = nn.Linear(5, 3)  # maps from R^5 to R^3, parameters A, b
# data is 2x5.  A maps from 5 to 3... can we map "data" under A?
data = torch.randn(2, 5)
print(lin(data))  # yes

tensor([[-0.6831,  0.3639, -0.7709],
        [ 0.6161,  1.2096, -0.3063]], grad_fn=<AddmmBackward>)
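
The row-wise convention can be checked by hand. The sketch below assumes nothing beyond the `weight` and `bias` attributes of nn.Linear; the seed and the name `manual` are only for illustration. The weight matrix is stored with shape (out_features, in_features), so each input row is multiplied by its transpose:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lin = nn.Linear(5, 3)          # weight has shape (3, 5), bias has shape (3,)
data = torch.randn(2, 5)       # two input rows, each in R^5

# Apply the affine map to every row: x A^T + b.
manual = data @ lin.weight.t() + lin.bias

# Agrees with the layer's own output up to floating-point rounding.
print(torch.allclose(lin(data), manual, atol=1e-6))
```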


Non-Linearities
---------------

First, note the following fact, which will explain why we need
non-linearities in the first place. Suppose we have two affine maps
$f(x) = Ax + b$ and $g(x) = Cx + d$. What is
$f(g(x))$?

\begin{align}f(g(x)) = A(Cx + d) + b = ACx + (Ad + b)\end{align}

$AC$ is a matrix and $Ad + b$ is a vector, so we see that
composing affine maps gives you an affine map.

From this, you can see that building your neural network as a long
chain of affine compositions adds no new power to your model beyond
what a single affine map already provides.

If we introduce non-linearities in between the affine layers, this is no
longer the case, and we can build much more powerful models.

There are a few core non-linearities.
$\tanh(x), \sigma(x), \text{ReLU}(x)$ are the most common. You are
probably wondering: "why these functions? I can think of plenty of other
non-linearities." The reason for this is that they have gradients that
are easy to compute, and computing gradients is essential for learning.
For example

\begin{align}\frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))\end{align}

A quick note: although you may have learned some neural networks in your
intro to AI class where $\sigma(x)$ was the default non-linearity,
typically people shy away from it in practice. This is because the
gradient *vanishes* very quickly as the absolute value of the argument
grows. Small gradients means it is hard to learn. Most people default to
tanh or ReLU.
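
As a rough check of both claims, the sketch below compares the gradient autograd computes for $\sigma$ with the formula $\sigma(x)(1 - \sigma(x))$, and shows how small it becomes for a large argument. The sample points 0, 2 and 10 are arbitrary choices:

```python
import torch

x = torch.tensor([0., 2., 10.], requires_grad=True)
s = torch.sigmoid(x)

# Summing lets a single backward() call yield d(sigma)/dx at each point.
s.sum().backward()

# The closed-form derivative from the text.
analytic = (s * (1 - s)).detach()

print(x.grad)      # gradient from autograd
print(analytic)    # gradient from the formula
```

The gradient is largest at 0 and is already tiny at 10, which is the vanishing behaviour described above.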




In [14]:
# In PyTorch, most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearities typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = torch.randn(2, 2)
print(data)
print(f'relu: {F.relu(data)}')
print(f'tanh: {torch.tanh(data)}')

tensor([[-0.1024, -0.8491],
        [ 0.1112,  0.1618]])
relu: tensor([[0.0000, 0.0000],
        [0.1112, 0.1618]])
tanh: tensor([[-0.1020, -0.6906],
        [ 0.1108,  0.1604]])


Softmax and Probabilities
-------------------------

The function $\text{Softmax}(x)$ is also just a non-linearity, but
it is special in that it usually is the last operation done in a
network. This is because it takes in a vector of real numbers and
returns a probability distribution. Its definition is as follows. Let
$x$ be a vector of real numbers (positive, negative, whatever,
there are no constraints). Then the i'th component of
$\text{Softmax}(x)$ is

\begin{align}\frac{\exp(x_i)}{\sum_j \exp(x_j)}\end{align}

It should be clear that the output is a probability distribution: each
element is non-negative and the sum over all components is 1.

You could also think of it as just applying an element-wise
exponentiation operator to the input to make everything non-negative and
then dividing by the normalization constant.


In [15]:
# Softmax is also in torch.nn.functional
data = torch.randn(5)
print(data)
print(F.softmax(data, dim=0))
print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!
print(F.log_softmax(data, dim=0))  # there's also log_softmax

tensor([-1.4105, -0.3404, -3.0121,  0.5710,  1.4330])
tensor([0.0350, 0.1021, 0.0071, 0.2541, 0.6017])
tensor(1.)
tensor([-3.3515, -2.2815, -4.9531, -1.3700, -0.5080])
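
The "exponentiate, then normalize" reading of Softmax can be written out directly. This is only a sketch on a small hypothetical input; in practice F.softmax is preferable because it is implemented in a numerically safer way:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1., 2., 3.])

# Element-wise exponentiation makes everything positive...
exps = torch.exp(x)
# ...and dividing by the sum makes the components add up to 1.
manual = exps / exps.sum()

print(manual)
print(F.softmax(x, dim=0))
```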


Objective Functions (Loss Functions/Cost Functions)
---------------------------------------------------

The objective function is the function that your network is being
trained to minimize (in which case it is often called a *loss function*
or *cost function*). This proceeds by first choosing a training
instance, running it through your neural network, and then computing the
loss of the output. The parameters of the model are then updated by
taking the derivative of the loss function. Intuitively, if your model
is completely confident in its answer, and its answer is wrong, your
loss will be high. If it is very confident in its answer, and its answer
is correct, the loss will be low.

The idea behind minimizing the loss function on your training examples
is that your network will hopefully generalize well and have small loss
on unseen examples in your dev set, test set, or in production. An
example loss function is the *negative log likelihood loss*, which is a
very common objective for multi-class classification. For supervised
multi-class classification, this means training the network to minimize
the negative log probability of the correct output (or equivalently,
maximize the log probability of the correct output).
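
For a single instance, the negative log likelihood is just minus the log probability assigned to the correct class. The sketch below uses a made-up three-class distribution to show that the loss is low when the correct class has high probability and high when it does not:

```python
import math
import torch
import torch.nn.functional as F

# A hypothetical predicted distribution over three classes.
probs = torch.tensor([[0.7, 0.2, 0.1]])
log_probs = torch.log(probs)

# If class 0 is correct, the loss is -log(0.7): fairly small.
loss_likely = F.nll_loss(log_probs, torch.tensor([0]))
# If class 2 is correct, the loss is -log(0.1): much larger.
loss_unlikely = F.nll_loss(log_probs, torch.tensor([2]))

print(loss_likely.item(), -math.log(0.7))
print(loss_unlikely.item(), -math.log(0.1))
```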




Optimization and Training
=========================

So what if we can compute a loss function for an instance? What do we do
with that? We saw earlier that Tensors know how to compute gradients
with respect to the things that were used to compute them. Well,
since our loss is a Tensor, we can compute gradients with
respect to all of the parameters used to compute it! Then we can perform
standard gradient updates. Let $\theta$ be our parameters,
$L(\theta)$ the loss function, and $\eta$ a positive
learning rate. Then:

\begin{align}\theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta^{(t)})\end{align}

There is a huge collection of algorithms, and much active research, that
attempts to do something more than just this vanilla gradient update.
Many attempt to vary the learning rate based on what is happening at
train time. Torch provides
many of them in the torch.optim package, and you do not need to know
their internals to use them. In code, using the simplest gradient update
looks the same as using the more complicated algorithms. Trying different
update algorithms and different
parameters for the update algorithms (like different initial learning
rates) is important in optimizing your network's performance. Often,
just replacing vanilla SGD with an optimizer like Adam or RMSProp will
boost performance noticeably.
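
To connect the update rule to torch.optim, here is a minimal sketch with a toy loss $L(\theta) = \sum_i \theta_i^2$ (chosen only because its gradient, $2\theta$, is easy to verify). One step of optim.SGD without momentum should match the hand-computed update:

```python
import torch
import torch.optim as optim

theta = torch.tensor([1., -2.], requires_grad=True)
eta = 0.1

# Toy loss with gradient 2 * theta = [2, -4].
loss = (theta ** 2).sum()
loss.backward()

# The vanilla update written out by hand (not tracked by autograd).
with torch.no_grad():
    manual = theta - eta * theta.grad

# The same update performed by the optimizer, in-place on theta.
optimizer = optim.SGD([theta], lr=eta)
optimizer.step()

print(manual)
print(theta)
```

Swapping in another optimizer such as optim.Adam would only change the line that constructs `optimizer`; the backward() and step() calls stay the same.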




Creating Network Components in PyTorch
======================================

Before we move on to our focus on NLP, let's do an annotated example of
building a network in PyTorch using only affine maps and
non-linearities. We will also see how to compute a loss function, using
PyTorch's built in negative log likelihood, and update parameters by
backpropagation.

All network components should inherit from nn.Module and override the
forward() method. That is about it, as far as the boilerplate is
concerned. Inheriting from nn.Module provides functionality to your
component. For example, it makes the component keep track of its trainable
parameters, and it lets you move the component between CPU and GPU with the
``.to(device)`` method, where device can be a CPU device
``torch.device("cpu")`` or a CUDA device ``torch.device("cuda:0")``.
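
A short sketch of both features, using a bare nn.Linear as a stand-in for a full component. The fallback to the CPU when no GPU is present is an assumption of this example, not something the tutorial requires:

```python
import torch
import torch.nn as nn

# Use the first CUDA device if one is available, otherwise the CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# .to(device) moves all of the module's parameters to that device.
layer = nn.Linear(4, 2).to(device)

# Inputs must live on the same device as the parameters.
x = torch.randn(1, 4, device=device)
out = layer(x)

# The module tracks its trainable parameters: a 2x4 weight plus a bias of 2.
n_params = sum(p.numel() for p in layer.parameters())
print(out.shape, n_params)
```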

Let's write an annotated example of a network that takes in a sparse
bag-of-words representation and outputs a probability distribution over
two labels: "English" and "Spanish". This model is just logistic
regression.




Example: Logistic Regression Bag-of-Words classifier
----------------------------------------------------

Our model will map a sparse BoW representation to log probabilities over
labels. We assign each word in the vocab an index. For example, say our
entire vocab is two words "hello" and "world", with indices 0 and 1
respectively. The BoW vector for the sentence "hello hello hello hello"
is

\begin{align}\left[ 4, 0 \right]\end{align}

For "hello world world hello", it is

\begin{align}\left[ 2, 2 \right]\end{align}

etc. In general, it is

\begin{align}\left[ \text{Count}(\text{hello}), \text{Count}(\text{world}) \right]\end{align}

Denote this BoW vector as $x$. The output of our network is:

\begin{align}\log \text{Softmax}(Ax + b)\end{align}

That is, we pass the input through an affine map and then do log
softmax.




In [16]:
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
        ("Give it to me".split(), "ENGLISH"),
        ("No creo que sea una buena idea".split(), "SPANISH"),
        ("No it is not a good idea to get lost at sea".split(), "ENGLISH")]

test_data = [("Yo creo que si".split(), "SPANISH"),
             ("it is lost on me".split(), "ENGLISH")]

# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)

VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2


class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module.  Don't get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need.  In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec), dim=1)


def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])


model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)

# the model knows its parameters.  The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, which was done with the line
# self.linear = nn.Linear(...)
# Then through some Python magic from the PyTorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters
for param in model.parameters():
    print(f'param: {param}')

# To run the model, pass in a BoW vector
# Here we don't need to train, so the code is wrapped in torch.no_grad()
with torch.no_grad():
    sample = test_data[0]
    bow_vector = make_bow_vector(sample[0], word_to_ix)
    # As we have inherited from nn.Module, we can just use model(vector) instead of model.forward(vector)
    log_probs = model(bow_vector)
    print(f'logprobs: {log_probs}')

{'me': 0, 'gusta': 1, 'comer': 2, 'en': 3, 'la': 4, 'cafeteria': 5, 'Give': 6, 'it': 7, 'to': 8, 'No': 9, 'creo': 10, 'que': 11, 'sea': 12, 'una': 13, 'buena': 14, 'idea': 15, 'is': 16, 'not': 17, 'a': 18, 'good': 19, 'get': 20, 'lost': 21, 'at': 22, 'Yo': 23, 'si': 24, 'on': 25}
param: Parameter containing:
tensor([[ 0.0555,  0.0597,  0.0466,  0.1627, -0.0815, -0.0828, -0.1699, -0.0080,
         -0.0929,  0.0079, -0.0402,  0.0651,  0.1697,  0.0579, -0.0632, -0.0962,
         -0.1710,  0.1650, -0.0372,  0.0396,  0.0073, -0.1250,  0.1104,  0.1099,
          0.0099, -0.1115],
        [-0.0833,  0.0027, -0.1120, -0.1094, -0.0293, -0.0565,  0.0481, -0.0515,
         -0.0260, -0.0749, -0.1792,  0.1710,  0.0374,  0.1754, -0.0316, -0.0493,
         -0.1844, -0.0744,  0.1286, -0.1921, -0.0686,  0.1195,  0.1130,  0.0724,
         -0.0388, -0.0148]], requires_grad=True)
param: Parameter containing:
tensor([-0.0372, -0.0723], requires_grad=True)
logprobs: tensor([[-0.6190, -0.7733]])


Which of the above values corresponds to the log probability of ENGLISH,
and which to SPANISH? We never defined it, but we need to if we want to
train the thing.

In [17]:
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}

So let's train! To do this, we pass instances through to get log
probabilities, compute a loss function, compute the gradient of the loss
function, and then update the parameters with a gradient step. Loss
functions are provided by Torch in the nn package. nn.NLLLoss() is the
negative log likelihood loss we want. It also defines optimization
functions in torch.optim. Here, we will just use SGD.

Note that the *input* to NLLLoss is a vector of log probabilities, and a
target label. It doesn't compute the log probabilities for us. This is
why the last layer of our network is log softmax. The loss function
nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log
softmax for you.
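
The relationship between the two losses can be checked on a single hypothetical instance. In this sketch `logits` stands for the raw output of the affine map, before any softmax:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[1.0, -0.5]])   # raw scores for two labels
target = torch.tensor([0])             # index of the correct label

# NLLLoss expects log probabilities, so log_softmax is applied first.
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

# CrossEntropyLoss takes the raw scores and does the log softmax itself.
ce = nn.CrossEntropyLoss()(logits, target)

print(nll.item(), ce.item())
```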

In [18]:
# Run on test data before we train, just to see a before-and-after
with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])

loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have more than
# four instances.  Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Tensor as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = make_bow_vector(instance, word_to_ix)
        target = make_target(label, label_to_ix)

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()

with torch.no_grad():
    for instance, label in test_data:
        bow_vec = make_bow_vector(instance, word_to_ix)
        log_probs = model(bow_vec)
        print(log_probs)

# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])

tensor([[-0.6190, -0.7733]])
tensor([[-0.7499, -0.6395]])
tensor([-0.0402, -0.1792], grad_fn=<SelectBackward>)
tensor([[-0.1241, -2.1481]])
tensor([[-2.9001, -0.0566]])
tensor([ 0.3966, -0.6160], grad_fn=<SelectBackward>)


We got the right answer! You can see that the log probability for
Spanish is much higher in the first example, and the log probability for
English is much higher in the second for the test data, as it should be.

Now you see how to make a PyTorch component, pass some data through it
and do gradient updates. We are ready to dig deeper into what deep NLP
has to offer.

# Exercise 1. (2pt)
Read the tutorial above and do the following:

1) Write down the formulas for the tanh and ReLU functions, and for their derivatives.

2) Based on the example above, train a neural network that classifies sentences into the classes {'hp', 'lotr'}. Write code that evaluates accuracy on the train and test sets after every epoch. Obtain **75%** accuracy.

hints:
- To prepare the train/test datasets, you can build on this code:

```
lotr_data = [(normalize(word_tokenize(el)), 'lotr') for el in sent_tokenize(lotr)]
data = lotr_data + hp_data
test_data = test_lotr + test_hp
```
- try to use preprocessing to make the data more suitable for training. (lab-2)

- optionally, you can modify the neural network architecture a little (playing with preprocessing alone should give 75%), for example:
  - you can modify the number of training epochs,
  - you can modify the optimizer (e.g. the Adam optimizer) and learning rate,
  - you can use batches instead of learning one example at a time,
  - you can add hidden layer(s), activation functions (like relu) etc., but be careful with overfitting! Dropout or other regularization might help to mitigate it,
  - finally, you can play with the loss function and parameter initialization
 

Due to the small dataset size (for a neural network) and the specificity of the task, it is hard to obtain higher accuracy than Naive Bayes, and some of the above tricks might not work well. Don't worry - soon we will learn what neural nets are capable of!

Use the data located in https://drive.google.com/open?id=1xLS_gxqGidnFk24as498_SWNV4xF79JI


In [None]:
#  This time we will load the texts from google drive, to save some space in the notebook :)

from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

In [None]:
import os

# Download the folder to your drive and paste the path to your folder below
path_to_folder = '/content/gdrive/My Drive/nlp/lab-5/' # this should be path to the folder in your drive

with open(os.path.join(path_to_folder, 'hp.txt'), 'r') as f:
  hp = f.read().replace('\n', ' ')
with open(os.path.join(path_to_folder, 'lotr.txt'), 'r') as f:
  lotr = f.read().replace('\n', ' ')
with open(os.path.join(path_to_folder, 'hp_test.txt'), 'r') as f:
  test_hp = f.read().replace('\n', ' ')
with open(os.path.join(path_to_folder, 'lotr_test.txt'), 'r') as f:
  test_lotr = f.read().replace('\n', ' ')
  