# What is a deep learning framework?
*Good tools reduce errors, speed development, and increase runtime performance.*

# Introduce Tensor
*Tensors are an abstract form of vectors and matrices.*

Up to this point, we’ve been working exclusively with vectors and matrices as the basic data
structures for deep learning. Recall that a matrix is a list of vectors, and a vector is a list
of scalars (single numbers). A tensor is the abstract version of this form of nested lists of
numbers. A vector is a one-dimensional tensor. A matrix is a two-dimensional tensor, and
higher dimensions are referred to as n-dimensional tensors. Thus, the beginning of a new
deep learning framework is the construction of this basic type, which we’ll call `Tensor`:

In [None]:
import numpy as np

class Tensor (object):
    def __init__(self, data):
        self.data = np.array(data)
    def __add__(self, other):
        return Tensor(self.data + other.data)
    def __repr__(self):
        return str(self.data.__repr__())
    def __str__(self):
        return str(self.data.__str__())

x = Tensor([1,2,3,4,5])
print(x)
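
To make the "nested lists of numbers" description concrete, here is a small sketch using plain NumPy arrays rather than the `Tensor` class. The variable names (`scalar`, `vector`, `matrix`, `cube`) are illustrative only; `ndim` and `shape` are standard NumPy attributes.

```python
import numpy as np

# Each extra level of nesting adds one dimension (axis).
scalar = np.array(5)                       # 0-D: a single number
vector = np.array([1, 2, 3])               # 1-D: a list of scalars
matrix = np.array([[1, 2, 3], [4, 5, 6]])  # 2-D: a list of vectors
cube = np.array([[[1, 2], [3, 4]],
                 [[5, 6], [7, 8]]])        # 3-D: a list of matrices

ranks = [t.ndim for t in (scalar, vector, matrix, cube)]
shapes = [t.shape for t in (scalar, vector, matrix, cube)]

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("cube", cube)]:
    print(name, "ndim:", t.ndim, "shape:", t.shape)
```

The `Tensor` class above simply wraps one of these arrays in `self.data`, so it inherits whatever dimensionality the wrapped array has.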

# Introduction to automatic gradient computation (autograd)
*Previously, you performed backpropagation by hand.*

Recall that this is done by moving backward
through the neural network: first compute the gradient at the output of the network, then
use that result to compute the derivative at the next-to-last component, and so on until all
weights in the architecture have correct gradients. This logic for computing gradients can
also be added to the tensor object. Let me show you what I mean.


In [None]:
class Tensor (object):
    def __init__(self, data, creators=None, creation_op=None):
        self.data = np.array(data)
        self.creation_op = creation_op
        self.creators = creators
        self.grad = None

    def backward(self, grad):
        self.grad = grad
        if(self.creation_op == "add"):
            self.creators[0].backward(grad)
            self.creators[1].backward(grad)
    
    def __add__(self, other):
        return Tensor(self.data + other.data,
                     creators=[self, other],
                     creation_op="add")

    def __repr__(self):
        return str(self.data.__repr__())
        
    def __str__(self):
        return str(self.data.__str__())

x = Tensor([1,2,3,4,5])
y = Tensor([2,2,2,2,2])

z = x + y
z.backward(Tensor(np.array([1,1,1,1,1])))

This version introduces two new concepts. First, each tensor gets two new attributes.
`creators` is a list containing any tensors used in the creation of the current tensor (it
defaults to `None`). Thus, when the two tensors `x` and `y` are added together, `z` has two
creators, `x` and `y`. `creation_op` is a related attribute that stores the operation that
was applied to the creators. Thus, performing `z = x + y` creates a computation graph with
three nodes (`x`, `y`, and `z`) and two edges (`z -> x` and `z -> y`). Each edge is labeled by the
`creation_op` `add`. This graph allows you to recursively backpropagate gradients.

The second new concept introduced in this version of Tensor is the ability to use this graph
to compute gradients. When you call `z.backward()`, it sends the correct gradient to `x`
and `y` given the function that was applied to create `z` (`add`).
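
The graph and the gradients can be inspected directly. The following is a minimal sketch that repeats the addition-only class in condensed form under the hypothetical name `GraphTensor` (chosen here so it does not overwrite `Tensor` in a notebook), then looks at what `z` recorded and what `x` and `y` received.

```python
import numpy as np

class GraphTensor:
    """Condensed copy of the addition-only Tensor (hypothetical name)."""
    def __init__(self, data, creators=None, creation_op=None):
        self.data = np.array(data)
        self.creators = creators
        self.creation_op = creation_op
        self.grad = None

    def backward(self, grad):
        self.grad = grad
        if self.creation_op == "add":
            # Addition passes the incoming gradient through unchanged.
            self.creators[0].backward(grad)
            self.creators[1].backward(grad)

    def __add__(self, other):
        return GraphTensor(self.data + other.data,
                           creators=[self, other],
                           creation_op="add")

x = GraphTensor([1, 2, 3, 4, 5])
y = GraphTensor([2, 2, 2, 2, 2])
z = x + y
z.backward(GraphTensor(np.array([1, 1, 1, 1, 1])))

print(z.creation_op)                             # operation that produced z
print(z.creators[0] is x, z.creators[1] is y)    # edges z -> x and z -> y
print(x.grad.data, y.grad.data)                  # gradients delivered to the leaves
```

Because `x` and `y` were created directly from data, their `creators` stay `None`, which is what stops the recursion.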

# A quick checkpoint
*Everything in Tensor is another form of lessons already learned.*

You had a list of layers (dictionaries) and hand-coded the correct order of forward and
backpropagation operations. Now you’re building a nice interface so you don’t have to
write as much code.

- In particular, this notion of a graph that gets
built during forward propagation is called a *dynamic computation graph* because it’s built
on the fly during forward prop. 
- In general, dynamic computation graphs are easier to write/experiment with, and static
computation graphs have faster runtimes because of some fancy logic under the hood.
- But note that dynamic and static frameworks have lately been moving toward the middle:
    - allowing dynamic graphs to compile to static ones (for faster runtimes) or
    - allowing static graphs to be built dynamically (for easier experimentation)
- The primary difference is whether forward propagation is happening during graph construction or after the graph is already defined.
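
As a rough illustration of that difference, the toy sketch below contrasts the two styles using plain NumPy. It is not how any real framework implements either approach; `dyn_add`, `trace`, and `build_static_graph` are hypothetical names made up for this example.

```python
import numpy as np

# Dynamic style: every operation runs immediately, and the record of the
# graph (here just a list of op names) grows as the Python code executes.
trace = []

def dyn_add(a, b):
    trace.append("add")
    return a + b

x = np.array([1, 2, 3])
out = x
for _ in range(3):            # ordinary Python control flow shapes the graph
    out = dyn_add(out, x)
dynamic_result = out

# Static style: describe the whole computation first, run it on data later.
def build_static_graph(n_adds):
    ops = ["add"] * n_adds    # the graph is fixed before any data arrives
    def run(value):
        result = value
        for _ in ops:
            result = result + value
        return result
    return ops, run

static_ops, run_graph = build_static_graph(3)
static_result = run_graph(np.array([1, 2, 3]))

print(trace, dynamic_result)
print(static_ops, static_result)
```

Both versions compute the same values. In the dynamic version the forward pass and graph construction happen together; in the static version the graph exists before any numbers flow through it.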

# Tensors that are used multiple times
*The basic autograd has a rather pesky bug. Let’s squish it!*

The current version of Tensor supports backpropagating into a variable only once. But
sometimes, during forward propagation, you’ll use the same tensor multiple times (the
weights of a neural network), and thus multiple parts of the graph will backpropagate
gradients into the same tensor.

<img src="../../images/tensor_multiuse.png">

The code will currently compute the incorrect gradient when backpropagating into a variable that was used multiple times (is the parent of multiple
children).

The current implementation of Tensor merely overwrites each gradient with the most recent
one. Suppose a tensor is used to create two children, d and e: first, d applies its gradient,
and then it gets overwritten with the gradient from e. We need to change the way gradients are written.
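
The bug can be reproduced in a few lines. This sketch repeats the first autograd class in condensed form under the hypothetical name `OverwritingTensor` and uses `b` in two additions, so the correct gradient for `b` would be 2 in every position.

```python
import numpy as np

class OverwritingTensor:
    """Condensed copy of the first autograd Tensor (hypothetical name)."""
    def __init__(self, data, creators=None, creation_op=None):
        self.data = np.array(data)
        self.creators = creators
        self.creation_op = creation_op
        self.grad = None

    def backward(self, grad):
        self.grad = grad          # overwrites whatever was stored before
        if self.creation_op == "add":
            self.creators[0].backward(grad)
            self.creators[1].backward(grad)

    def __add__(self, other):
        return OverwritingTensor(self.data + other.data,
                                 creators=[self, other],
                                 creation_op="add")

a = OverwritingTensor([1, 2, 3, 4, 5])
b = OverwritingTensor([2, 2, 2, 2, 2])
c = OverwritingTensor([5, 4, 3, 2, 1])

d = a + b
e = b + c      # b is used a second time
f = d + e      # f = a + 2*b + c, so df/db should be 2
f.backward(OverwritingTensor(np.array([1, 1, 1, 1, 1])))

print(b.grad.data)   # holds only the last gradient written, not the sum
```

Here `b.grad` ends up as all ones: the gradient arriving through `e` replaced the one that arrived through `d` instead of being added to it.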

# Upgrading autograd to support multiuse tensors
*Add one new function, and update three old ones.*

This update to the Tensor object adds two new features. First, gradients can be accumulated so
that when a variable is used more than once, it receives gradients from all of its children.

- Create a `self.children` counter that tracks how many gradients are still expected
from each child during backpropagation. This way, you also prevent a variable from
accidentally backpropagating from the same child twice (which throws an exception).
- The second added feature is a new function with the rather verbose name `all_children_grads_accounted_for()`.
- The purpose of this function is to check whether a tensor has received gradients from all of its children in the graph. Normally, whenever `.backward()` is called on an intermediate variable in a graph, it immediately calls `.backward()` on its parents.
- But because some variables receive their gradient value from multiple children, each variable
needs to wait to call `.backward()` on its parents until it has the final gradient locally.
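
The two features can be sketched in isolation before looking at a full class. The following addition-only version is a simplified illustration, not the chapter's implementation: the name `CountingTensor` is hypothetical, ids are handed out sequentially, and gradients are plain NumPy arrays to keep the example short.

```python
import numpy as np

class CountingTensor:
    """Addition-only sketch of gradient accumulation (hypothetical name)."""
    _next_id = 0  # sequential ids: an assumption made for this sketch

    def __init__(self, data, creators=None, creation_op=None):
        self.data = np.array(data)
        self.creators = creators
        self.creation_op = creation_op
        self.grad = None
        self.children = {}  # child id -> number of gradients still expected
        self.id = CountingTensor._next_id
        CountingTensor._next_id += 1
        if creators is not None:
            for c in creators:
                c.children[self.id] = c.children.get(self.id, 0) + 1

    def all_children_grads_accounted_for(self):
        return all(cnt == 0 for cnt in self.children.values())

    def backward(self, grad, grad_origin=None):
        if grad_origin is not None:
            if self.children[grad_origin.id] == 0:
                raise Exception("cannot backprop more than once")
            self.children[grad_origin.id] -= 1
        # Accumulate instead of overwrite.
        self.grad = np.array(grad) if self.grad is None else self.grad + grad
        ready = self.all_children_grads_accounted_for() or grad_origin is None
        if self.creators is not None and ready and self.creation_op == "add":
            for c in self.creators:
                c.backward(self.grad, self)

    def __add__(self, other):
        return CountingTensor(self.data + other.data,
                              creators=[self, other], creation_op="add")

p = CountingTensor([1, 1, 1])
q = CountingTensor([1, 1, 1])
a = CountingTensor([1, 2, 3])
c = CountingTensor([3, 2, 1])
b = p + q        # b is an intermediate tensor ...
d = a + b
e = b + c        # ... that is used twice
f = d + e
f.backward(np.ones(3))

print(b.grad)    # contributions from d and e are summed
print(p.grad)    # b waited for both children before passing its gradient on
```

When the gradient from `d` reaches `b`, one entry in `b.children` is still nonzero, so `b` holds its gradient. Only after `e` delivers the second contribution does `b` forward the accumulated value to `p` and `q`.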


## Adding support for negation
*Let’s modify the support for addition to support negation.*

## Adding support for additional functions
*Subtraction, multiplication, sum, expand, transpose, and matrix multiplication*

In [None]:
class Tensor (object):
    def __init__(self, data, 
                autograd=False,
                creators=None, 
                creation_op=None,
                id=None):
        self.data = np.array(data)
        self.creation_op = creation_op
        self.creators = creators
        self.grad = None
        self.autograd = autograd
        self.children = {}
        if(id is None):
            id = np.random.randint(0,1000000)
        self.id = id

        if(creators is not None):
            for c in creators:
                # Keeps track of how many children a tensor has
                if(self.id not in c.children):
                    c.children[self.id] = 1
                else:
                    c.children[self.id] += 1
                    
    #Check whether a tensor has received the correct num of gradients from each child
    def all_children_grads_accounted_for(self):
        for id, cnt in self.children.items():
            if(cnt != 0):
                return False
        return True

    def backward(self, grad=None, grad_origin=None):
        if(self.autograd):
            if(grad_origin is not None):
                # Make sure a gradient is still expected from this child;
                # if so, decrement its counter
                if(self.children[grad_origin.id] == 0):
                    raise Exception("cannot backprop more than once")
                else:
                    self.children[grad_origin.id] -= 1
            # Accumulates gradients from several children        
            if(self.grad is None):
                self.grad = grad
            else:
                self.grad += grad
            
            if (self.creators is not None and 
                (self.all_children_grads_accounted_for() or
                grad_origin is None)):
                if(self.creation_op == "add"):
                    self.creators[0].backward(self.grad, self)
                    self.creators[1].backward(self.grad, self)
                if(self.creation_op == "neg"):
                    self.creators[0].backward(self.grad.__neg__())
                if(self.creation_op == "sub"):
                    new = Tensor(self.grad.data)
                    self.creators[0].backward(new, self)
                    new = Tensor(self.grad.__neg__().data)
                    self.creators[1].backward(new, self)
                if(self.creation_op == "mul"):
                    new = self.grad * self.creators[1]
                    self.creators[0].backward(new , self)
                    new = self.grad * self.creators[0]
                    self.creators[1].backward(new, self)
                if(self.creation_op == "mm"):
                    #Usually an activation
                    act = self.creators[0] 
                    # weight matrix
                    weights = self.creators[1]
                    new = self.grad.mm(weights.transpose())
                    act.backward(new)
                    new = self.grad.transpose().mm(act).transpose()
                    weights.backward(new)
                if(self.creation_op == "transpose"):
                    self.creators[0].backward(self.grad.transpose())
                if("sum" in self.creation_op):
                    dim = int(self.creation_op.split("_")[1])
                    ds = self.creators[0].data.shape[dim]
                    self.creators[0].backward(self.grad.expand(dim,ds))
                if("expand" in self.creation_op):
                    dim = int(self.creation_op.split("_")[1])
                    self.creators[0].backward(self.grad.sum(dim))

    def __add__(self, other):
        if(self.autograd and other.autograd):
            return Tensor(self.data + other.data,
                        autograd=True,
                        creators=[self, other],
                        creation_op="add")
        return Tensor(self.data + other.data)

    def __neg__(self):
        if(self.autograd):
            return Tensor(self.data * -1,
                        autograd=True,
                        creators=[self],
                        creation_op="neg")
        return Tensor(self.data * -1)

    def __sub__(self, other):
        if(self.autograd and other.autograd):
            return Tensor(self.data - other.data,
                            autograd=True,
                            creators=[self,other],
                            creation_op="sub")
        
        return Tensor(self.data - other.data)

    def __mul__(self, other):
        if(self.autograd and other.autograd):
            return Tensor(self.data * other.data,
                            autograd=True,
                            creators=[self,other],
                            creation_op="mul")
        return Tensor(self.data * other.data)

    def sum(self, dim):
        if(self.autograd):
            return Tensor(self.data.sum(dim),
                            autograd=True,
                            creators=[self],
                            creation_op="sum_"+str(dim))
        return Tensor(self.data.sum(dim))
    
    def expand(self, dim,copies):
        trans_cmd = list(range(0,len(self.data.shape)))
        trans_cmd.insert(dim,len(self.data.shape))
        new_shape = list(self.data.shape) + [copies]
        new_data = self.data.repeat(copies).reshape(new_shape)
        new_data = new_data.transpose(trans_cmd)
        if(self.autograd):
            return Tensor(new_data,
                            autograd=True,
                            creators=[self],
                            creation_op="expand_"+str(dim))
        return Tensor(new_data)


    def transpose(self):
        if(self.autograd):
            return Tensor(self.data.transpose(),
                            autograd=True,
                            creators=[self],
                            creation_op="transpose")
        return Tensor(self.data.transpose())

    def mm(self, x):
        if(self.autograd):
            return Tensor(self.data.dot(x.data),
                            autograd=True,
                            creators=[self,x],
                            creation_op="mm")
        return Tensor(self.data.dot(x.data))


    def __repr__(self):
        return str(self.data.__repr__())
        
    def __str__(self):
        return str(self.data.__str__())


In [None]:
x = Tensor(np.array([[1,2,3], [4,5,6]]))
x.sum(0)

x.expand(dim=0, copies=4)
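
To see what `sum` and `expand` produce, and why each serves as the other's backward step, here is a sketch on a plain NumPy array. The standalone `expand` function copies the array logic from `Tensor.expand()`; pulling it out into a free function is done only for this illustration.

```python
import numpy as np

def expand(data, dim, copies):
    """Standalone copy of the array logic inside Tensor.expand()."""
    trans_cmd = list(range(0, len(data.shape)))
    trans_cmd.insert(dim, len(data.shape))
    new_shape = list(data.shape) + [copies]
    new_data = data.repeat(copies).reshape(new_shape)
    return new_data.transpose(trans_cmd)

x = np.array([[1, 2, 3], [4, 5, 6]])

col_sums = x.sum(0)                      # collapses dimension 0
row_sums = x.sum(1)                      # collapses dimension 1
expanded = expand(x, dim=0, copies=4)    # adds a new dimension 0 of size 4

print(col_sums, row_sums)
print(expanded.shape)
print(np.array_equal(expanded.sum(0), 4 * x))
```

`sum(dim)` removes a dimension by adding along it, and `expand(dim, copies)` inserts a dimension by copying along it. Summing the expanded array along the new dimension adds the copies back together, which is the relationship the `sum_` and `expand_` branches of `backward()` rely on.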

# Using autograd to train a neural network
*You no longer have to write backpropagation logic!*

In [None]:
import numpy as np
np.random.seed(0)
data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)

w = list()
w.append(Tensor(np.random.rand(2,3), autograd=True))
w.append(Tensor(np.random.rand(3,1), autograd=True))

for i in range(10):
    pred = data.mm(w[0]).mm(w[1])

    loss = ((pred - target)*(pred - target)).sum(0)
    loss.backward(Tensor(np.ones_like(loss.data)))

    for w_ in w:
        w_.data -= w_.grad.data * 0.1
        w_.grad.data *= 0
    print(loss)
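
One way to build trust in what autograd computes for this network is to compare the backpropagation rules against finite differences. The sketch below does that with plain NumPy for the same two-layer linear model and squared-error loss; it does not use the `Tensor` class, and `numeric_grad` is a hypothetical helper written for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
target = np.array([[0], [1], [0], [1]], dtype=float)
w0 = rng.random((2, 3))
w1 = rng.random((3, 1))

def loss_fn():
    pred = data @ w0 @ w1
    return ((pred - target) ** 2).sum()

# Gradients from the same rules the "mul" and "mm" branches apply.
hidden = data @ w0
pred = hidden @ w1
d_pred = 2 * (pred - target)          # from (pred - target) * (pred - target)
grad_w1 = hidden.T @ d_pred           # weights side of mm
grad_w0 = data.T @ (d_pred @ w1.T)    # activation side of mm, then weights side

def numeric_grad(w, eps=1e-6):
    """Central finite differences of loss_fn with respect to array w."""
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        up = loss_fn()
        w[idx] = old - eps
        down = loss_fn()
        w[idx] = old                  # restore the original weight
        g[idx] = (up - down) / (2 * eps)
    return g

num_w0 = numeric_grad(w0)
num_w1 = numeric_grad(w1)
print(np.allclose(grad_w0, num_w0, atol=1e-5))
print(np.allclose(grad_w1, num_w1, atol=1e-5))
```

If the two sets of gradients agree, the rules are consistent with the definition of the derivative. The same style of check can be pointed at any new operation added to `backward()`.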

# Adding automatic optimization
*Let’s make a stochastic gradient descent optimizer.*

In [None]:
class SGD(object):
    def __init__(self, parameters, alpha=0.1):
        self.parameters = parameters
        self.alpha = alpha
    def zero(self):
        for p in self.parameters:
            p.grad.data *= 0
    def step(self, zero=True):
        for p in self.parameters:
            p.data -= p.grad.data * self.alpha
            if(zero):
                p.grad.data *= 0

In [None]:
np.random.seed(0)

data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)

w = list()
w.append(Tensor(np.random.rand(2,3), autograd=True))
w.append(Tensor(np.random.rand(3,1), autograd=True))

optim = SGD(parameters=w, alpha=0.1)

for i in range(10):
    pred = data.mm(w[0]).mm(w[1])
    loss = ((pred - target)*(pred - target)).sum(0)
    loss.backward(Tensor(np.ones_like(loss.data)))
    optim.step()
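
The optimizer only touches `.data` and `.grad.data`, so its behavior can be sketched without any autograd at all. In the example below, `Holder` and `Param` are hypothetical stand-ins that expose just those attributes, and the gradient of a simple quadratic is written by hand.

```python
import numpy as np

class Holder:
    """Hypothetical stand-in exposing .data, like a gradient Tensor."""
    def __init__(self, data):
        self.data = np.array(data, dtype=float)

class Param:
    """Hypothetical stand-in for a parameter with .data and .grad.data."""
    def __init__(self, data):
        self.data = np.array(data, dtype=float)
        self.grad = Holder(np.zeros_like(self.data))

class SGD:
    def __init__(self, parameters, alpha=0.1):
        self.parameters = parameters
        self.alpha = alpha

    def step(self, zero=True):
        for p in self.parameters:
            p.data -= p.grad.data * self.alpha   # move against the gradient
            if zero:
                p.grad.data *= 0                 # reset for the next iteration

p = Param([0.0, 10.0])
optim = SGD(parameters=[p], alpha=0.1)

for _ in range(100):
    p.grad.data += 2 * (p.data - 3.0)   # gradient of sum((p - 3) ** 2)
    optim.step()

print(p.data)
```

Each step shrinks the distance to the minimum at 3 by a constant factor, and zeroing the gradient inside `step()` keeps one iteration's gradient from leaking into the next.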

# Adding support for layer types
*You may be familiar with layer types in Keras or PyTorch.*

- The weights are organized into a class (and I added bias
weights because this is a true linear layer). You can initialize the layer all together, such
that both the weights and bias are initialized with the correct sizes, and the correct forward
propagation logic is always employed.

- Also notice that I created an abstract class Layer , which has a single getter. This allows for
more-complicated layer types (such as layers containing other layers). All you need to do is
override get_parameters() to control what tensors are later passed to the optimizer (such
as the SGD class created in the previous section).

In [None]:
class Layer(object):
    def __init__(self):
        self.parameters = list()
    def get_parameters(self):
        return self.parameters
    
class Linear(Layer):
    def __init__(self, n_inputs, n_outputs):
        super().__init__()
        W = np.random.randn(n_inputs, n_outputs) * np.sqrt(2.0/(n_inputs))
        self.weight = Tensor(W, autograd=True)
        self.bias = Tensor(np.zeros(n_outputs), autograd=True)

        self.parameters.append(self.weight)
        self.parameters.append(self.bias)

    def forward(self, input):
        return input.mm(self.weight) + self.bias.expand(0, len(input.data))
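
The shapes involved in `Linear.forward()` are easier to follow on plain arrays. This sketch mirrors the computation with NumPy only; the nonzero bias values are chosen here so the effect is visible, and `np.repeat` stands in for `self.bias.expand(0, len(input.data))`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_inputs, n_outputs = 4, 2, 3

W = rng.standard_normal((n_inputs, n_outputs)) * np.sqrt(2.0 / n_inputs)
bias = np.array([0.1, 0.2, 0.3])
x = rng.standard_normal((batch, n_inputs))

# One copy of the bias per row of the batch: shape (batch, n_outputs).
expanded_bias = np.repeat(bias[np.newaxis, :], batch, axis=0)
out = x @ W + expanded_bias

# Backward through the expansion sums over the batch dimension, so the
# bias receives one gradient per output unit.
grad_out = np.ones_like(out)
grad_bias = grad_out.sum(0)

print(out.shape, expanded_bias.shape, grad_bias)
```

Expanding the bias explicitly gives the addition two operands of identical shape, which is what the `add` branch of `backward()` assumes. The sum over dimension 0 is what the `expand_0` branch performs on the way back.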

# Layers that contain layers
*Layers can also contain other layers*

The most popular layer is a sequential layer that forward propagates a list of layers, where
each layer feeds its outputs into the inputs of the next layer:

In [None]:
class Sequential(Layer):
    def __init__(self, layers=None):
        super().__init__()
        # Avoid a shared mutable default: each instance gets its own list
        self.layers = list() if layers is None else layers

    def add(self, layer):
        self.layers.append(layer)

    def forward(self, input):
        for layer in self.layers:
            input = layer.forward(input)
        return input

    def get_parameters(self):
        params = list()
        for l in self.layers:
            params += l.get_parameters()
        return params

In [None]:
data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)
model = Sequential([Linear(2,3), Linear(3,1)])
optim = SGD(parameters=model.get_parameters(), alpha=0.05)
for i in range(10):
    pred = model.forward(data)
    loss = ((pred - target)*(pred - target)).sum(0)
    loss.backward(Tensor(np.ones_like(loss.data)))
    optim.step()
    print(loss)
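
Because `get_parameters()` is overridden to walk the contained layers, sequential layers can be nested to any depth. The sketch below shows this with a toy layer; `Scale` is a hypothetical layer invented for the example, and it stores a plain number instead of a `Tensor` to keep the block self-contained.

```python
class Layer:
    def __init__(self):
        self.parameters = list()

    def get_parameters(self):
        return self.parameters

class Sequential(Layer):
    def __init__(self, layers=None):
        super().__init__()
        self.layers = list() if layers is None else layers

    def forward(self, input):
        for layer in self.layers:
            input = layer.forward(input)
        return input

    def get_parameters(self):
        params = list()
        for l in self.layers:
            params += l.get_parameters()   # recurses into nested Sequentials
        return params

class Scale(Layer):
    """Hypothetical toy layer: multiplies its input by one stored number."""
    def __init__(self, factor):
        super().__init__()
        self.factor = factor
        self.parameters.append(factor)

    def forward(self, input):
        return input * self.factor

inner = Sequential([Scale(2), Scale(3)])
outer = Sequential([inner, Scale(5)])

result = outer.forward(1)
params = outer.get_parameters()
print(result, params)
```

The outer model never needs to know how deep the nesting goes: each container flattens the parameters of whatever it holds, and the optimizer receives a single list.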

# Loss-function layers
You can also create layers that are functions on the input. The most popular version of this
kind of layer is probably the loss-function layer, such as mean squared error:

In [None]:
class MSELoss(Layer):
    def __init__(self):
        super().__init__()

    def forward(self, pred, target):
        return ((pred - target)*(pred - target)).sum(0)

In [None]:
data = Tensor(np.array([[0,0],[0,1],[1,0],[1,1]]), autograd=True)
target = Tensor(np.array([[0],[1],[0],[1]]), autograd=True)
model = Sequential([Linear(2,3), Linear(3,1)])
optim = SGD(parameters=model.get_parameters(), alpha=0.05)

criterion = MSELoss()

for i in range(10):
    pred = model.forward(data)
    loss = criterion.forward(pred, target)
    loss.backward(Tensor(np.ones_like(loss.data)))
    optim.step()
    print(loss)

# Nonlinearity layers

Let’s add nonlinear functions to Tensor and then create some
layer types.

The two nonlinearities added here are sigmoid and tanh. Let’s add them to the Tensor
class. You learned about the derivative for both quite some time ago, so this should be easy:

```py
def sigmoid(self):
    if(self.autograd):
        return Tensor(1 / (1 + np.exp(-self.data)),
                    autograd=True,
                    creators=[self],
                    creation_op='sigmoid')
    # Non-autograd path: return a plain tensor
    return Tensor(1 / (1 + np.exp(-self.data)))

def tanh(self):
    if(self.autograd):
        return Tensor(np.tanh(self.data),
                    autograd=True,
                    creators=[self],
                    creation_op='tanh')
    return Tensor(np.tanh(self.data))

```

The following code shows the backprop logic added to the Tensor.backward() method:
```py
if(self.creation_op == "sigmoid"):
    ones = Tensor(np.ones_like(self.grad.data))
    self.creators[0].backward(self.grad * (self * (ones - self)))

if(self.creation_op == "tanh"):
    ones = Tensor(np.ones_like(self.grad.data))
    self.creators[0].backward(self.grad * (ones - (self * self)))
```
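
The two derivative rules used in those branches, `s * (1 - s)` for sigmoid and `1 - t * t` for tanh, can be checked numerically. The following sketch uses plain NumPy and central finite differences; it is a sanity check on the formulas, not part of the `Tensor` class.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-3, 3, 7)
eps = 1e-6

# Sigmoid: the derivative is expressed in terms of the output s.
s = sigmoid(x)
sigmoid_rule = s * (1 - s)
sigmoid_numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Tanh: the derivative is expressed in terms of the output t.
t = np.tanh(x)
tanh_rule = 1 - t * t
tanh_numeric = (np.tanh(x + eps) - np.tanh(x - eps)) / (2 * eps)

print(np.allclose(sigmoid_rule, sigmoid_numeric, atol=1e-6))
print(np.allclose(tanh_rule, tanh_numeric, atol=1e-6))
```

Both rules depend only on the output of the function, which is why the backward branches can use `self` (the output tensor) and never need the original input.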

In [None]:
class Tanh(Layer):
    def __init__(self):
        super().__init__()
    def forward(self, input):
        return input.tanh()

class Sigmoid(Layer):
    def __init__(self):
        super().__init__()
    def forward(self, input):
        return input.sigmoid()

In [None]:
model = Sequential([Linear(2,3), Tanh(), Linear(3,1), Sigmoid()])

# Embedding Layer
*An embedding layer translates indices into activations.*

So far, so good. The embedding matrix has a row (vector) for each word in the vocabulary. Now, how
will you forward propagate? Well, forward propagation always starts with the question,
“How will the inputs be encoded?” In the case of word embeddings, you obviously can’t pass
in the words themselves, because the words don’t tell you which rows in `self.weight` to
forward propagate with. Instead, you pass in indices, one per word, and each index selects a row of `self.weight`.

## Adding indexing to autograd
*Before you can build the embedding layer, autograd needs to
support indexing.*

In order to support the new embedding strategy (which assumes words are forward
propagated as matrices of indices), the indexing you played around with in the previous
section must be supported by autograd. This is a pretty simple idea:
You need to make sure that during backpropagation, the gradients are placed in the same rows as were indexed into for forward propagation.

```py
def index_select(self, indices):
    if(self.autograd):
        new = Tensor(self.data[indices.data],
                    autograd=True,
                    creators=[self],
                    creation_op="index_select")
        new.index_select_indices = indices
        return new
    return Tensor(self.data[indices.data])
```

```py
if(self.creation_op == "index_select"):
    new_grad = np.zeros_like(self.creators[0].data)
    indices_ = self.index_select_indices.data.flatten()
    grad_ = grad.data.reshape(len(indices_), -1)
    for i in range(len(indices_)):
        new_grad[indices_[i]] += grad_[i]
    self.creators[0].backward(Tensor(new_grad))
```
- First, in `index_select()`, use the NumPy indexing trick to select the correct rows, and store the indices on the new tensor so they are available during backpropagation.
- Then, during `backward()`, initialize a new gradient of the correct size (the size of the
original matrix that was being indexed into).
- Next, flatten the indices so you can iterate
through them.
- Then, collapse `grad_` to a simple list of rows. (The subtle part is that the list
of indices in `indices_` and the list of vectors in `grad_` will be in the corresponding order.)
- Finally, iterate through each index, add its gradient row into the correct row of the new gradient you’re
creating, and backpropagate the result into `self.creators[0]`.
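
The steps above can be traced on plain arrays. In this sketch the weight matrix, indices, and incoming gradient are made-up values, with row 2 deliberately selected twice so the accumulation is visible. The `np.add.at` call at the end is an alternative not used in the chapter's code; it performs the same unbuffered accumulation without a Python loop.

```python
import numpy as np

weight = np.arange(12, dtype=float).reshape(4, 3)   # 4 rows ("words"), 3 dims
indices = np.array([[1, 2], [2, 3]])                # row 2 is selected twice

# Forward: NumPy integer indexing picks out one row per index.
selected = weight[indices]                          # shape (2, 2, 3)

# Backward: scatter the incoming gradient rows back to where they came from.
grad = np.ones_like(selected)
new_grad = np.zeros_like(weight)
indices_ = indices.flatten()
grad_ = grad.reshape(len(indices_), -1)
for i in range(len(indices_)):
    new_grad[indices_[i]] += grad_[i]

# Vectorized alternative (an assumption, not the chapter's code).
alt_grad = np.zeros_like(weight)
np.add.at(alt_grad, indices_, grad_)

print(new_grad)
print(np.array_equal(new_grad, alt_grad))
```

Row 0 was never selected and keeps a zero gradient, while row 2 receives two contributions. A plain `new_grad[indices_] += grad_` would not accumulate repeated indices, which is why the loop (or `np.add.at`) is needed.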


In [None]:
class Embedding(Layer):
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.vocab_size = vocab_size
        self.dim = dim
        
        weight = (np.random.rand(vocab_size, dim) - 0.5) / dim
        self.weight = Tensor(weight, autograd=True)
        self.parameters.append(self.weight)

    def forward(self, input):
        return self.weight.index_select(input)

In [None]:
embed = Embedding(5,3)
model = Sequential([embed, Tanh(), Linear(3,1), Sigmoid()])

# The cross-entropy layer
*Let’s add cross entropy to the autograd and create a layer.*

```py
def cross_entropy(self, target_indices):
    temp = np.exp(self.data)
    softmax_output = temp / np.sum(temp,
                                    axis=len(self.data.shape)-1,
                                    keepdims=True)
    t = target_indices.data.flatten()
    p = softmax_output.reshape(len(t),-1)
    target_dist = np.eye(p.shape[1])[t]
    loss = -(np.log(p) * (target_dist)).sum(1).mean()
    if(self.autograd):
        out = Tensor(loss,
                    autograd=True,
                    creators=[self],
                    creation_op="cross_entropy")
        out.softmax_output = softmax_output
        out.target_dist = target_dist
        return out
    return Tensor(loss)
```

One noticeable thing about this loss differs from the others:
both the final softmax and the computation of the loss are within the loss class. This is an
extremely common convention in deep neural networks. Nearly every framework works
this way.

When you want to finish a network and train with cross entropy, you can leave
off the softmax from the forward propagation step and call a cross-entropy class that will
automatically perform the softmax as a part of the loss function.

The reason these are combined so consistently is performance. It’s much faster to calculate
the gradient of softmax and negative log likelihood together in a cross-entropy function
than to forward propagate and backpropagate them separately in two different modules.
This has to do with a shortcut in the gradient math.
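
The shortcut is that the gradient of the combined softmax and negative log likelihood with respect to the raw scores is the softmax output minus the one-hot target distribution. The sketch below checks this numerically with plain NumPy. Since `cross_entropy()` averages over rows, the check divides by the number of rows; the backward branch for `cross_entropy` is not shown in this section, so how it scales the gradient is left open here.

```python
import numpy as np

def softmax(logits):
    temp = np.exp(logits)
    return temp / np.sum(temp, axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    p = softmax(logits)
    target_dist = np.eye(p.shape[1])[targets]
    return -(np.log(p) * target_dist).sum(1).mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 3))     # 4 examples, 3 classes (made up)
targets = np.array([0, 2, 1, 2])

# Shortcut: softmax output minus one-hot targets (scaled for the mean).
target_dist = np.eye(3)[targets]
shortcut_grad = (softmax(logits) - target_dist) / len(targets)

# Reference: central finite differences on the full loss.
eps = 1e-6
numeric_grad = np.zeros_like(logits)
for idx in np.ndindex(*logits.shape):
    up = logits.copy()
    up[idx] += eps
    down = logits.copy()
    down[idx] -= eps
    numeric_grad[idx] = (cross_entropy(up, targets)
                         - cross_entropy(down, targets)) / (2 * eps)

print(np.allclose(shortcut_grad, numeric_grad, atol=1e-6))
```

No separate backward pass through the exponentials, the normalization, or the logarithm is required, which is where the savings come from. This also explains why `cross_entropy()` stores `softmax_output` and `target_dist` on the output tensor: they are the only two quantities the shortcut needs.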

In [None]:
class CrossEntropyLoss(object):
    def __init__(self):
        super().__init__()
    def forward(self, input, target):
        return input.cross_entropy(target)

In [None]:
criterion = CrossEntropyLoss()
loss = criterion.forward(pred, target)


# The recurrent neural network layer


In [None]:
class RNNCell(Layer):
    def __init__(self, n_inputs,n_hidden,n_output,activation='sigmoid'):
        super().__init__()
        self.n_inputs = n_inputs
        self.n_hidden = n_hidden
        self.n_output = n_output
        if(activation == 'sigmoid'):
            self.activation = Sigmoid()
        elif(activation == 'tanh'):
            self.activation = Tanh()
        else:
            raise Exception("Non-linearity not found")
        self.w_ih = Linear(n_inputs, n_hidden)
        self.w_hh = Linear(n_hidden, n_hidden)
        self.w_ho = Linear(n_hidden, n_output)
        
        self.parameters += self.w_ih.get_parameters()
        self.parameters += self.w_hh.get_parameters()
        self.parameters += self.w_ho.get_parameters()
    def forward(self, input, hidden):
        from_prev_hidden = self.w_hh.forward(hidden)
        combined = self.w_ih.forward(input) + from_prev_hidden
        new_hidden = self.activation.forward(combined)
        output = self.w_ho.forward(new_hidden)
        return output, new_hidden
    def init_hidden(self, batch_size=1):
        return Tensor(np.zeros((batch_size,self.n_hidden)),autograd=True)
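
The data flow in `RNNCell.forward()` can be followed with plain arrays. This is a simplified sketch that leaves out the bias terms the `Linear` layers would add and uses made-up sizes and random inputs; `rnn_step` is a hypothetical function name.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, n_inputs, n_hidden, n_output = 2, 4, 3, 5

W_ih = rng.standard_normal((n_inputs, n_hidden))   # input  -> hidden
W_hh = rng.standard_normal((n_hidden, n_hidden))   # hidden -> hidden
W_ho = rng.standard_normal((n_hidden, n_output))   # hidden -> output

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def rnn_step(x, hidden):
    # Combine the current input with the previous hidden state.
    combined = x @ W_ih + hidden @ W_hh
    new_hidden = sigmoid(combined)
    output = new_hidden @ W_ho
    return output, new_hidden

hidden = np.zeros((batch_size, n_hidden))            # like init_hidden()
sequence = rng.standard_normal((3, batch_size, n_inputs))
for x_t in sequence:                                 # one step per timestep
    output, hidden = rnn_step(x_t, hidden)

print(output.shape, hidden.shape)
```

The same three weight matrices are reused at every timestep, and only the hidden state carries information forward. Under autograd, this reuse is the multiuse-tensor situation that gradient accumulation was added to handle.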

In [None]:
import sys,random,math
from collections import Counter
import numpy as np
with open('tasksv11/en/qa1_single-supporting-fact_train.txt','r') as f:
    raw = f.readlines()
tokens = list()
for line in raw[0:1000]:
    tokens.append(line.lower().replace("\n","").split(" ")[1:])
new_tokens = list()
for line in tokens:
    new_tokens.append(['-'] * (6 - len(line)) + line)
tokens = new_tokens

vocab = set()
for sent in tokens:
    for word in sent:
        vocab.add(word)
vocab = list(vocab)

word2index = {}
for i,word in enumerate(vocab):
    word2index[word]=i
def words2indices(sentence):
    idx = list()
    for word in sentence:
        idx.append(word2index[word])
    return idx
indices = list()
for line in tokens:
    idx = list()
    for w in line:
        idx.append(word2index[w])
    indices.append(idx)
data = np.array(indices)
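
To see what this preprocessing produces without the dataset file, the sketch below runs the same steps on two in-memory lines. The lines are made up in the same "number followed by words" layout, and the vocabulary is sorted here (unlike `list(vocab)` above) only so the indices are reproducible.

```python
import numpy as np

# Two made-up lines in the same layout as the training file.
raw = ["1 Mary moved to the bathroom.\n",
       "2 John went home.\n"]

# Lowercase, strip the newline, split on spaces, drop the leading line number.
tokens = [line.lower().replace("\n", "").split(" ")[1:] for line in raw]

# Left-pad every sentence with '-' to a fixed length of 6.
tokens = [['-'] * (6 - len(line)) + line for line in tokens]

# Build the vocabulary and the word -> index lookup.
vocab = sorted({word for sent in tokens for word in sent})
word2index = {word: i for i, word in enumerate(vocab)}

# Replace each word by its index: one row per sentence.
data = np.array([[word2index[w] for w in line] for line in tokens])

print(tokens)
print(data)
```

Padding on the left keeps the last word of every sentence in the final column, which is the position the training loop later uses as the prediction target.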

In [None]:
embed = Embedding(vocab_size=len(vocab),dim=16)
model = RNNCell(n_inputs=16, n_hidden=16, n_output=len(vocab))
criterion = CrossEntropyLoss()
params = model.get_parameters() + embed.get_parameters()
optim = SGD(parameters=params, alpha=0.05)

In [None]:
for iter in range(1000):
    batch_size = 100
    total_loss = 0
    hidden = model.init_hidden(batch_size=batch_size)
    for t in range(5):
        input = Tensor(data[0:batch_size,t], autograd=True)
        rnn_input = embed.forward(input=input)
        output, hidden = model.forward(input=rnn_input, hidden=hidden)
    target = Tensor(data[0:batch_size,t+1], autograd=True)
    loss = criterion.forward(output, target)
    loss.backward()
    optim.step()
    total_loss += loss.data
    if(iter % 200 == 0):
        p_correct = (target.data == np.argmax(output.data,axis=1)).mean()
        print_loss = total_loss / (len(data)/batch_size)
        print("Loss:",print_loss,"% Correct:",p_correct)

In [None]:
batch_size = 1
hidden = model.init_hidden(batch_size=batch_size)
for t in range(5):
    input = Tensor(data[0:batch_size,t], autograd=True)
    rnn_input = embed.forward(input=input)
    output, hidden = model.forward(input=rnn_input, hidden=hidden)
target = Tensor(data[0:batch_size,t+1], autograd=True)
loss = criterion.forward(output, target)
ctx = ""
for idx in data[0:batch_size][0][0:-1]:
    ctx += vocab[idx] + " "
print("Context:",ctx)
print("Pred:", vocab[output.data.argmax()])