# Convolutional NNs in NumPy and PyTorch
Notebook created by Laura Dietz for ML4Seq

Tested with Python 3.7.6 (Anaconda), NumPy 1.19.1, and PyTorch 1.6.0 and 1.7.0.

If you find bugs in this notebook (or something that could be a bug), please let me know via Piazza (I am also awarding bounty points for big bugs). 


The goal is for you to work through the basics of convolutional neural networks in three ways:

- as NumPy matrix operations (without training)
- with every matrix operation implemented as a single PyTorch layer
- as a PyTorch model with training.

Here is a [good introduction to PyTorch](https://towardsdatascience.com/understanding-pytorch-with-an-example-a-step-by-step-tutorial-81fc5f8c4e8e) that covers more technical details than needed for this assignment.

## Prediction Task:

Given an example sentence $words$ such as

```["Fall","colors", "in", "New", "Hampshire ", "are", "best", "seen", "right", "now","!"] ```

the task is to mark selected words such as "Hampshire hires". Here $Y$ is a sequence that, for each word $w \in words$, marks whether the word is selected ($Y_w=1$) or not ($Y_w=0$).




## Your Assignment


You will be implementing the following convolutional neural network:

(bottom to top)

- X is the one-hot encoding of each word (provided)
- Projection: $X \rightarrow_{\Theta} E$
- Convolution $E \rightarrow_{\text{conv }W} H$
- Detection $H \rightarrow_{\Psi, \text{ReLU}} D$
- Prediction $D \rightarrow_{\text{max pool}} Y$
- words $i$ with $Y_i>0$ will be selected

We will ignore the bias term for all of these layers. The model is a variant on CNNs, in that the detector layer has an additional linear component $\Psi$.

See section "Network Setup" for sizes of convolution windows etc.


### Part 1: Prediction with given parameters using Numpy/Einsum

You are given parameters for both examples (under "Fixed Parameters for Example") and settings for the embedding dimension and convolution window (under "Network Setup").

Your task is to emulate the CNN using Numpy's matrix operations and non-linear activations, to predict which words are selected.

Please give results for both examples (we provide reference results for one example)


### Part 2: Layer-by-Layer in PyTorch

Using the parameters from Part 1, your task is to implement every layer of the network using an individual PyTorch layer object. (No training yet!)

For each layer, you will take its input as computed in Part 1 and the corresponding parameter from Part 1, and produce an output with a PyTorch layer. Then you will verify whether the output matches your result from Part 1.



### Part 3: Training and Prediction with PyTorch

Now we add the layers from Part 1 to a neural network and train it "end-to-end" (i.e. $X \rightarrow Y$).

Here we will replace the one-hot encoding with a pre-trained word embedding. (Setup code for word embeddings is below under "Pre-trained Word embeddings".)


### Part 4: Replace One-hot Encoding with Pre-trained Word Embeddings

We provide stub code to download and use pre-trained word embeddings like word2vec or GloVe.

All you have to do is to change the inputs to use the embedding vectors in your model.

(If your project involves NLP, you probably want to do that anyway)


# Notebook setup and PyTorch Installation

In [1]:
import sys
import numpy as np

# uncomment one of these versions (depending on whether you are on a computer with a GPU or not)

# GPU version
#!conda install --yes --prefix {sys.prefix} pytorch torchvision cudatoolkit=10.2 -c pytorch

# Just CPU
#!conda install --yes --prefix {sys.prefix} pytorch torchvision cpuonly -c pytorch



# install `Einops` for einstein-style tensor manipulation in pytorch
# Also see https://github.com/arogozhnikov/einops
#!conda install --yes --prefix {sys.prefix} einops  -c conda-forge


In [2]:
# torch test

import torch
x = torch.rand(5, 3)
print(x)


print("GPU/CUDA available? ", torch.cuda.is_available())


print("Torch version", torch.__version__)

tensor([[0.0176, 0.0416, 0.3021],
        [0.7034, 0.1340, 0.6787],
        [0.8742, 0.8389, 0.0737],
        [0.6353, 0.2256, 0.0679],
        [0.8472, 0.1184, 0.3079]])
GPU/CUDA available?  False
Torch version 1.7.0




In [3]:
import einops

flavor="pytorch"

In [4]:
# helper print function for numpy matrices of dtype float

def pretty(matrix):
    return np.round(matrix,2)

# Example Input Sentence

In [5]:
words = ["Fall","colors", "in", "New", "Hampshire ", "are", "best", "seen", "right", "now","!"]
Y_star=np.array([0,0,0,1,1,0,0,0,0,0,0], dtype=float)

list(zip(words,Y_star))  # print (word, label)

[('Fall', 0.0),
 ('colors', 0.0),
 ('in', 0.0),
 ('New', 1.0),
 ('Hampshire ', 1.0),
 ('are', 0.0),
 ('best', 0.0),
 ('seen', 0.0),
 ('right', 0.0),
 ('now', 0.0),
 ('!', 0.0)]

# Network Setup

Hyperparameters and a set of example parameters you will be using throughout the exercise

(For layers in the network see "Your Assignment" above)

In [6]:
# Basic setup of inputs and dimensions of hidden layers / convolution window / channels

maxw = len(words) # length of input sequence  (could also be characters)

vocab_dim = 11 # length of one-hot encoding
# this will be overwritten when we construct the vocabulary
# alternatively set it to the dimension of your pre-trained word embedding

embed_dim = 3 # project 1-hot into a 3D space

conv_dim = 3  # convolution window of input around t: [t-conv_dim+1: t+1]

conv_channels = 2 # number of output channels from convolution

pooling_window = 1  # pooling window of t: [t-pooling_window : t+pooling_window] 
# is supposed to mean: pooling_window-many before and pooling_window-many after
# note this is not how pytorch defines it, you will need to transform it appropriately


# Fixed Parameters for Example

I just created random parameters with the code below. These parameters will not lead to the optimal result, they are just here so you can verify each of your layers.

I documented the dimensions for each parameter (and verified it in code). Please document your parts of the code accordingly.

In [7]:
# Embedding parameter: vocab_dim x embed_dim
Theta=np.array(
    [
        [0.14, 0.23, 0.19],
        [0.25, 0.7 , 0.54],
        [0.8 , 0.46, 0.92],
        [0.45, 0.53, 0.81],
        [0.83, 0.04, 0.12],
        [0.32, 0.88, 0.99],
        [0.89, 0.67, 0.64],
        [0.22, 0.33, 0.56],
        [0.25, 0.3 , 0.3 ],
        [0.44, 0.18, 0.97],
        [0.7 , 0.87, 0.9 ]
    ]
)

print("Theta dimension check:", Theta.shape == (vocab_dim, embed_dim), Theta.shape)

# convolution filter:  embed_dim x conv_dim x conv_channels
W=np.array([[[0.03668675,1], [0.5651157,1] , [1.94466826,1]],
            [[0.03668675,0.03], [0.5651157,0.56] , [1.94466826, 1.94]],
            [[0.03668675,-0.3], [0.5651157,-1] , [1.94466826,-2]]])

print("W dimension check:", W.shape == (embed_dim, conv_dim, conv_channels), W.shape)


# detector parameter: 1 x conv_channels 
Psi=np.array([[ 0.43,-0.3 ]])
print("Psi dimension check:", Psi.shape == (1, conv_channels) ,Psi.shape)


Theta dimension check: True (11, 3)
W dimension check: True (3, 3, 2)
Psi dimension check: True (1, 2)


# Part 1: Implement with Matrix Operations

In this section, implement the neural net using only regular matrix operators.

You can use 

- Numpy or 
- low level pytorch data structure [tensors](https://pytorch.org/docs/stable/tensors.html) or [torch.einsum](https://pytorch.org/docs/stable/generated/torch.einsum.html) --- [explanation of einsum](https://stackoverflow.com/questions/55894693/understanding-pytorch-einsum)
- or the Einstein Sum package [einops](https://github.com/arogozhnikov/einops) which works across numpy, pytorch, and tensorflow.


Please document and check your dimensions. The assignment becomes a lot harder if you disregard this advice.

# Inputs & Setup: $words \rightarrow_{\text{one hot}} X$

In [8]:
# setup the design matrix for words using one-hot encoding

# 1. use an ordered dict to assign unique words to consecutive indices
# 2. generate a 1-hot encoding for each word
# 3. compose the design matrix X by loading the encoding of each word
# dimension of X = maxw x vocab_dim

# at the end of the assignment, you will replace this with vectors from a pre-trained word embedding.

from collections import OrderedDict 
dictionary = OrderedDict.fromkeys(words)
dictionary = {w: idx for idx, w in enumerate (dictionary.keys())}  # Step 1

vocab_dim = len(dictionary)

dictionary_one_hot = np.eye(vocab_dim, dtype=float) # Step 2

X = dictionary_one_hot[ [dictionary[w] for w in words] ] # Step 3

X

array([[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])

# Projection (Embedding): $X \rightarrow_\theta E$

I recommend first getting it right for one word (t=5), then creating the tensor for all words.
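The one-word-first approach can be sketched as follows. This is a minimal illustration with a hypothetical 4-word vocabulary (`theta_demo`, `word_ids`, and `X_demo` are made up for the example and are not the notebook's `Theta` or `X`). It also shows that multiplying a one-hot row by the parameter matrix is the same as looking up one row of it.

```python
import numpy as np

# Hypothetical toy setup: 4-word vocabulary, 3-dimensional embedding.
theta_demo = np.arange(12, dtype=float).reshape(4, 3)  # vocab_dim x embed_dim
word_ids = [2, 0, 2]                                   # a 3-word "sentence"
X_demo = np.eye(4)[word_ids]                           # maxw x vocab_dim (one-hot rows)

# Step 1: a single word (t=0). A one-hot vector times theta picks out one row.
e_single = X_demo[0] @ theta_demo                      # shape: (embed_dim,)

# Step 2: all words at once.
E_demo = X_demo @ theta_demo                           # maxw x embed_dim

# The matrix product is equivalent to a plain row lookup.
E_lookup = theta_demo[word_ids]

print(e_single)                          # row 2 of theta_demo
print(np.array_equal(E_demo, E_lookup))
```

Checking the single-word case first makes it easier to spot a swapped axis before the full tensor hides it.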



In [9]:
# == Your Code! === 

# Parameter theta: vocab_dim x embed_dim
# Input X : maxw x vocab_dim
# Output E = maxw x embed_dim

def projection(theta, X):
    
    return np.einsum('ij, jk -> ik', X, theta)


E = projection(Theta, X)
E[5]
# Partial Reference result
# E[5] = [0.3200, 0.8800, 0.9900]

array([0.32, 0.88, 0.99])

# Convolution:  $E \rightarrow_{\text{conv } W} H$

In [10]:
# If you need an integer division, this might be useful

import math

# integer ceil/floor
math.floor(7/2), math.ceil(7/2)


(3, 4)

In [11]:
# == Your Code! === 

# Input E: maxw x embed_dim
# Parameter W: embed_dim x conv_dim x conv_channels
# Hyperparameter: conv_dim 
# Output H: maxw x conv_channels

def convolution(W, E, conv_dim):
    
    pad = np.zeros((math.floor(conv_dim/2), E.shape[1]))
    padE = np.vstack((pad, E, pad))

    return np.array(
        [
            np.einsum(
                'ji, ijk -> k',
                padE[i:conv_dim+i],
                W
            )
            for i in range(0, E.shape[0])
        ]
    )

H = convolution(W, E, conv_dim)
H[5]
# Partial Reference result
# H[5] = array([5.55219344, 1.5278  ])

array([5.55219344, 1.5278    ])
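As a sanity check on the reference value, the sketch below recomputes `H[5]` with explicit loops instead of `einsum`. It assumes the window is centered on `t` (rows `E[4]`, `E[5]`, `E[6]`), which is what the zero padding in `convolution` above implies. The names `window`, `H5`, and `H5_einsum` are introduced only for this illustration.

```python
import numpy as np

# Rows E[4], E[5], E[6] (identical to rows 4..6 of Theta, since X is one-hot
# with one distinct index per word).
window = np.array([[0.83, 0.04, 0.12],
                   [0.32, 0.88, 0.99],
                   [0.89, 0.67, 0.64]])          # conv_dim x embed_dim

# W as given above: embed_dim x conv_dim x conv_channels
W = np.array([[[0.03668675, 1], [0.5651157, 1], [1.94466826, 1]],
              [[0.03668675, 0.03], [0.5651157, 0.56], [1.94466826, 1.94]],
              [[0.03668675, -0.3], [0.5651157, -1], [1.94466826, -2]]])

embed_dim, conv_dim, conv_channels = W.shape
H5 = np.zeros(conv_channels)
for c in range(conv_channels):          # output channel
    for k in range(conv_dim):           # position inside the window
        for e in range(embed_dim):      # embedding coordinate
            H5[c] += window[k, e] * W[e, k, c]

# The same contraction written as an einsum over (k, e).
H5_einsum = np.einsum('ke,ekc->c', window, W)

print(H5)
```

Writing the sum out once with named loop indices is a cheap way to confirm which axis of `W` pairs with which axis of the window.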

# Detector: $H \rightarrow_\psi D$ with ReLU

See discussion on [fastest way to implement ReLU in numpy](https://stackoverflow.com/questions/32109319/how-to-implement-the-relu-function-in-numpy)
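A minimal sketch of the detector for a single position, using the reference value `H[5]` and the `Psi` given above. `np.maximum` is one of the options discussed in the linked thread. Unlike `x * (x > 0)`, it returns `0.0` rather than `-0.0` for negative inputs. The variable names here are illustrative only.

```python
import numpy as np

H5 = np.array([5.55219344, 1.5278])     # reference H[5], shape: (conv_channels,)
Psi = np.array([[0.43, -0.3]])          # 1 x conv_channels

pre_activation = Psi @ H5               # linear part, shape: (1,)
D5 = np.maximum(pre_activation, 0.0)    # ReLU

print(D5)

# Sign behaviour of two common ReLU formulations on a negative input:
x = np.array([-1.5, 2.0])
relu_mask = x * (x > 0)                 # first entry is -0.0
relu_max = np.maximum(x, 0.0)           # first entry is 0.0
print(relu_mask, relu_max)
```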

In [12]:
# == Your Code! === 

# Input H: maxw x conv_channels
# Parameter Psi: 1 x conv_channels
# Output D: maxw x 1

def ReLU(H, psi):
    
    x = np.einsum('ij, kj -> i', H, psi)
    
    return (x * (x > 0)).reshape(-1, 1)


D = ReLU(H, Psi)
D[5]
# Partial Reference Result
# D[5] = 1.9291

# I realize that my Psi parameter will not produce results below 0, hence ReLU does not filter. 
# Question for you, which Psi would be able to produce a result below 0?


array([1.92910318])

I realize that my Psi parameter will not produce results below 0, hence ReLU does not filter. 

Question for you, which Psi would be able to produce a result below 0?


In [13]:
# your Psi, your code again

psi = np.array([[1,-4]])

d = ReLU(H, psi)
d

array([[ 0.7868205 ],
       [ 4.92474378],
       [ 4.34837167],
       [-0.        ],
       [ 0.87395731],
       [-0.        ],
       [ 1.82298029],
       [-0.        ],
       [ 7.49669317],
       [ 4.4226483 ],
       [-0.        ]])

# Max Pooling: $D \rightarrow Y$

In [14]:
# == Your Code! === 

# Input D: maxw x 1 
# Parameter pooling_window: 1
# Output Y: maxw, 

def max_pool(pooling_window, D):
    
    pad = np.full((pooling_window, 1), np.min(D))
    dPad = np.vstack((pad, D, pad))
    return np.array(
        [
            np.max(dPad[i-pooling_window:i+pooling_window])
            for i in range(1, D.shape[0]+1)
        ]
    )

Y = max_pool(pooling_window, D)
Y[5]
# Partial reference result
# Y[5] = 1.9291031781249997

1.9291031781249997

# Word Selection and Evaluation

In [15]:
# use the predicted Y variable to tell us which words were selected

# replace Y with the variable name of your prediction

words_array = np.array(words, dtype=object)

print("selected words \n",words_array[Y>2]) 


# Reference result
# selected words   ['colors' 'in' 'New' 'now']

selected words 
 ['colors' 'in' 'New' 'now' '!']


In [16]:
# Measure Mean-squared-Error to a ground truth

print('Y',pretty(Y))
print('Y_star', Y_star)

mse=np.mean( (Y-Y_star)**2)
print("MSE for ground truth y_star", mse)


# Reference result
# MSE for ground truth y_star 2.683622245712982

Y [1.2  2.18 2.18 2.02 1.8  1.93 1.93 1.37 1.84 2.37 2.37]
Y_star [0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
MSE for ground truth y_star 3.3238344662803327



... well, that is not a great result. But, hey, what did you expect from random parameters!?




# $\star\star\star$ Now Everything in PyTorch  $\star\star\star$

[Tutorial using PyTorch for Convolutional Nets](https://www.tutorialspoint.com/pytorch/pytorch_convolutional_neural_network.htm)

[Cheat Sheet of PyTorch utilities](https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/?utm_source=blog&utm_medium=building-image-classification-models-cnn-pytorch)


In [17]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F

from einops.layers.torch import Rearrange, Reduce # Maybe you are brave and are going to use these...



## Conv1d

My suggestion is to first understand the **convolution operation**:

- [nn.functional.conv1d](https://pytorch.org/docs/stable/nn.functional.html#conv1d)(inputs, filters)

Let's first create a new toy example:


In [18]:
num_batches=1; batch=0
seqlen=10
inchannels=2
outchannels=1
# with a filter of conv_dim 3, the shrunk_seqlen=8

filters = torch.ones(outchannels,inchannels,conv_dim)  # filter shape: outchannels x inchannels x conv_dim
filters[None][:][:]=torch.arange(-1,2)
filters[0][1]= filters[0][0]+1
                     
inputs = torch.ones(num_batches,inchannels,seqlen)
inputs[None][:][:]=torch.arange(1,11)

print("filters\n",pretty(filters))
print("inputs\n",pretty(inputs))

filters.shape, inputs.shape

filters
 tensor([[[-1.,  0.,  1.],
         [ 0.,  1.,  2.]]])
inputs
 tensor([[[ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.],
         [ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10.]]])


(torch.Size([1, 2, 3]), torch.Size([1, 2, 10]))

In [19]:
# Your code using nn.functional.conv1d (no padding)

C = nn.functional.conv1d(inputs, filters, padding=(0,))
C
#Reference result:
#    [[[10., 13., 16., 19., 22., 25., 28., 31.]]])

tensor([[[10., 13., 16., 19., 22., 25., 28., 31.]]])

In [20]:
inputs[0].T[0:3].T

tensor([[1., 2., 3.],
        [1., 2., 3.]])


Then reproduce the first and second entries of the previous result with a **matrix operation**. I used [torch.einsum](https://pytorch.org/docs/stable/generated/torch.einsum.html) because I find it easier to use than transposed dot products -- you can use dot products if this works better for you.


In [21]:
# Your code  implementing the same with einsum for the first and second example

first = torch.einsum('ijk,jk->i', filters, inputs[0].T[0:3].T)
second = torch.einsum('ijk,jk->i', filters, inputs[0].T[1:4].T)

first, second
# Reference result 10, 13


(tensor([10.]), tensor([13.]))

Next, use a **convolution layer** 

*if you are confused about layers, first work through the next section then return here*

The input to Conv1d is a 3-axis tensor of $num\_batches \times inchannels \times seqlen$.

- Num\_batches = 1, when we don't use minibatches.
- Inchannels will be the dimensionality of the embedding of your input.
- Seqlen ($L$) is the number of words (or characters)

Use a layer without bias and with padding=2

The output will be $num\_batches \times outchannels \times shrunk\_seqlen$

- Num\_batches is same as input
- Outchannel is a hyperparameter for you to set, it will be the dimensionality resulting from a matrix product of your filter $W$ and one input entry (of dimensionality $inchannel$). In this example we can set it to 1.
- Shrunk\_seqlen ($L_{out}$) is roughly seqlen, but the convolution will chop off entries at the boundary, unless you compensate with padding. See [pytorch doc section "shape"](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html) for details.


The parameter will be a tensor of shape $outchannels \times inchannels \times conv\_dim$
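The shape bookkeeping above can be checked in a few lines. The sketch below rebuilds the toy `inputs` and `filters` from the earlier cell and compares the output length of `conv1d` with the formula from the PyTorch documentation for stride 1 and dilation 1, $L_{out} = L + 2 \cdot padding - (conv\_dim - 1)$. The helper `conv_out_len` is a name made up for this illustration.

```python
import torch
import torch.nn.functional as F

seqlen, inchannels, outchannels, conv_dim = 10, 2, 1, 3

# num_batches x inchannels x seqlen
inputs = torch.arange(1., 11.).repeat(1, inchannels, 1)
# outchannels x inchannels x conv_dim
filters = torch.tensor([[[-1., 0., 1.],
                         [ 0., 1., 2.]]])

def conv_out_len(seqlen, conv_dim, padding):
    """Output length of a 1D convolution with stride 1 and dilation 1."""
    return seqlen + 2 * padding - (conv_dim - 1)

for padding in (0, 1, 2):
    out = F.conv1d(inputs, filters, padding=padding)
    # out: num_batches x outchannels x shrunk_seqlen
    print(padding, tuple(out.shape), conv_out_len(seqlen, conv_dim, padding))
```

With `conv_dim=3`, `padding=1` is the setting that keeps the output as long as the input.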


In [22]:
# Your code using nn.Conv1d

conv = nn.Conv1d(2, 1, 3, padding=2, padding_mode='zeros', bias=False)
conv.weight.data = filters
out = conv(inputs)
out
# Reference Result (you see two new elements on either end)
#[[[  3.,   7.,  10.,  13.,  16.,  19.,  22.,  25.,  28.,  31.,   1., -10.]]]

tensor([[[  3.,   7.,  10.,  13.,  16.,  19.,  22.,  25.,  28.,  31.,   1.,
          -10.]]], grad_fn=<SqueezeBackward1>)

I highly recommend that you create your own examples with different windows, channels, etc. in a new notebook cell to make sure you understand how the convolution behaves.

You may also want to think ahead about how to cut the left-over padding out of the output. If you do it wrong, your results will be shifted and you get some nasty off-by-one errors in your network. (Maybe max pooling was only invented because folks never got the cut-out right.)
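One way to cut the padding back out, sketched under the assumption that each output should be aligned with the center of its window: with `conv_dim=3` and `padding=2` the output has one extra entry on each side, so slicing `[1:-1]` along the sequence axis recovers one value per input position. The helper `make_conv` is hypothetical, and the comparison against `padding=1` is only there to illustrate the alignment.

```python
import torch
import torch.nn as nn

inputs = torch.arange(1., 11.).repeat(1, 2, 1)        # 1 x 2 x 10
filters = torch.tensor([[[-1., 0., 1.],
                         [ 0., 1., 2.]]])             # 1 x 2 x 3

def make_conv(padding):
    """Conv1d layer without bias whose weight is set to the toy filters."""
    conv = nn.Conv1d(2, 1, 3, padding=padding, bias=False)
    conv.weight.data = filters.clone()
    return conv

with torch.no_grad():
    out_pad2 = make_conv(2)(inputs)                   # 1 x 1 x 12
    out_pad1 = make_conv(1)(inputs)                   # 1 x 1 x 10
    trimmed = out_pad2[:, :, 1:-1]                    # drop one entry on each side

print(trimmed.shape)
print(torch.allclose(trimmed, out_pad1))
```

If the window is meant to look only backwards from `t`, a different slice is needed, which is exactly where off-by-one errors creep in.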

# Check Each Layer on its Own

Next we look at each layer on its own, to make sure we get the in/out dimensions right.

Look at the pytorch documentation for more info:

- [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)(in_features: int, out_features: int, bias: bool = True)  -- param: out_features x in_features
- [nn.Conv1d](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html)(in_channels: int, out_channels: int, kernel_size: Union[T, Tuple[T]], padding: Union[T, Tuple[T]] = 0, ...)
- [nn.ReLU](https://pytorch.org/docs/stable/generated/torch.nn.ReLU.html)()
- [nn.MaxPool1d](https://pytorch.org/docs/stable/generated/torch.nn.MaxPool1d.html)(kernel_size:int, padding: int, ....)

- composing neural layers with [nn.Sequential](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html)

Whenever you apply a layer to a tensor with `$layer(tensor)`, the layer's `forward` method is called (via `__call__` of [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html)).
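A minimal sketch of that call mechanism, using a hypothetical toy layer `Doubler` that is not part of the assignment: calling the module like a function ends up in its `forward` method.

```python
import torch
import torch.nn as nn

class Doubler(nn.Module):
    """Hypothetical toy layer without parameters: multiplies its input by 2."""
    def forward(self, x):
        return 2 * x

layer_demo = Doubler()
x_demo = torch.tensor([1.0, 2.0])

y_call = layer_demo(x_demo)             # the usual way to apply a layer
y_forward = layer_demo.forward(x_demo)  # what the call dispatches to

print(y_call, torch.equal(y_call, y_forward))
```

In practice you call the layer object rather than `forward` directly, because the call also runs any registered hooks.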



In [23]:
# convert inputs and example parameters to torch tensors

X_torch = torch.unsqueeze(torch.tensor(X, dtype=torch.float), 0)
Theta_torch = torch.tensor(Theta, dtype=torch.float)
W_torch = torch.tensor(W, dtype=torch.float)
Psi_torch = torch.tensor(Psi, dtype=torch.float)

X_torch

tensor([[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]])

In [24]:
# We will now implement each layer on its own, using results from part 1 as inputs/outputs

# We want to obtain the same outputs as above

print("\nE",E,"\nH", H,"\nD",D,"\nY",Y)


# let's first convert our numpy results from part 1 to torch tensors
# if you implemented Part 1 using pytorch tensors, a conversion is not necessary (you will get some warnings)

### I am converting these tensors to the expected size of the output of each layer
H_torch = torch.einsum('ijk->ikj', torch.unsqueeze(torch.tensor(H, dtype=torch.float), 0))
E_torch = torch.unsqueeze(torch.tensor(E, dtype=torch.float), 0)
D_torch = torch.einsum('ijk->ikj', torch.unsqueeze(torch.tensor(D, dtype=torch.float), 0))

Y_torch = torch.unsqueeze(torch.unsqueeze(torch.tensor(Y, dtype=torch.float), 0), 0)
Y_star_torch = torch.unsqueeze(torch.unsqueeze(torch.tensor(Y_star, dtype=torch.float), 0), 0)




E [[0.14 0.23 0.19]
 [0.25 0.7  0.54]
 [0.8  0.46 0.92]
 [0.45 0.53 0.81]
 [0.83 0.04 0.12]
 [0.32 0.88 0.99]
 [0.89 0.67 0.64]
 [0.22 0.33 0.56]
 [0.25 0.3  0.3 ]
 [0.44 0.18 0.97]
 [0.7  0.87 0.9 ]] 
H [[ 3.2140205   0.6068    ]
 [ 5.10194378  0.0443    ]
 [ 4.76757167  0.1048    ]
 [ 3.0167558   1.1422    ]
 [ 4.88395731  1.0025    ]
 [ 5.55219344  1.5278    ]
 [ 3.48218029  0.4148    ]
 [ 2.3609573   0.7949    ]
 [ 3.61309317 -0.9709    ]
 [ 5.7330483   0.3276    ]
 [ 1.45416771  0.4416    ]] 
D [[1.19998881]
 [2.18054583]
 [2.01861582]
 [0.95454499]
 [1.79935165]
 [1.92910318]
 [1.37289753]
 [0.77674164]
 [1.84490006]
 [2.36693077]
 [0.49281212]] 
Y [1.19998881 2.18054583 2.18054583 2.01861582 1.79935165 1.92910318
 1.92910318 1.37289753 1.84490006 2.36693077 2.36693077]


In [25]:
# verification helper using torch's MSE

def verify(predicted, expected, detail=False):
    """Verify that expected dimensions match and the entries in tensor are sufficiently close"""
    if detail:
           print("predicted \n",predicted, "\n expected \n", expected)
    if (not (predicted.size() == expected.size())):
        print ("Verify: Shapes don't match. Predicted ",predicted.size(), "Expected", expected.size())
        return -1
    else :
        diff = (predicted.detach()-expected)**2
        if detail: 
            print("diff \n", diff)
        mse = torch.mean(diff)
        return mse.item()    


# Part 2: Your Layer-by-Layer Implementation

Now we are reimplementing each layer of the network using torch's layers that we wire to

- inputs (data or from previous layer)
- parameters (the ones given above), set with `$layer.weight.data`
- outputs

then we verify that the outputs are the same as earlier.


I give you example code for the first layer.

**You will implement** the remaining layers, and verify that they are correct.

With your implementation:

- indicate the expected dimensions of your inputs, outputs, and parameter 
- take note how your input needs to be rotated/transposed to fit (I recommend [torch.einsum](https://pytorch.org/docs/stable/generated/torch.einsum.html) )
- verify that you obtain the expected result  (if applicable: describe the issue in a comment)
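The re-arrangements mentioned above can be sketched with `torch.einsum`. The tensors below are placeholders with the notebook's dimensions (filled with `arange` so the axes are easy to follow), not the actual parameters.

```python
import torch

vocab_dim, embed_dim, maxw = 11, 3, 11

# Parameter stored as vocab_dim x embed_dim,
# but nn.Linear keeps its weight as out_features x in_features.
theta_demo = torch.arange(vocab_dim * embed_dim, dtype=torch.float).reshape(vocab_dim, embed_dim)
theta_for_linear = torch.einsum('ve->ev', theta_demo)      # embed_dim x vocab_dim

# Activations stored as batch x seq x embed_dim,
# but nn.Conv1d expects batch x channels x seq.
E_demo = torch.arange(maxw * embed_dim, dtype=torch.float).reshape(1, maxw, embed_dim)
E_for_conv = torch.einsum('bse->bes', E_demo)              # 1 x embed_dim x maxw

print(theta_for_linear.shape, E_for_conv.shape)
```

Naming the einsum indices after the axes they stand for (`v`, `e`, `s`) doubles as documentation of the expected dimensions.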

## First layer: Embedding  X --theta--> E

no bias, use rearranged `Theta_torch` as parameter, output `E_out`

(given as an example)

In [26]:
# Example First layer:  X --theta--> E

# Inputs X: seq x vocab_dim
# Output E: seq x embed_dim
# Theta: vocab_dim x embed_dim
# but param of this layer needs to be: embed_dim x vocab_dim


# init the layer
layer=nn.Linear(vocab_dim, embed_dim, bias=False)

# convert Theta and set the layers' parameter
layer.weight.data=torch.einsum('ve->ev',Theta_torch) # v: vocab, e: embedding

# produce outputs of the layer
E_out=layer(X_torch)


# verify that we obtain the same results as before
verify(E_out, E_torch )

# 0.0 is perfect! (close to 0 is also okay)

0.0

## Second layer: Convolution: E  --W--> H



Next, the **convolution layer** (if you are confused about layers, first work through the next section then return here):

Use a window of length `conv_dim`, with `conv_channels` output channels, no bias, padding=1 


The input to Conv1d is a 3-axis tensor of $num\_batches \times inchannels \times seqlen$.

- Num\_batches = 1, when we don't use minibatches.
- Inchannels will be the dimensionality of the embedding of your input.
- Seqlen ($L$) is the number of words (or characters)


The output will be $num\_batches \times outchannels \times shrunk\_seqlen$

- Num\_batches is same as input
- Outchannel is a hyperparameter for you to set, it will be the dimensionality resulting from a matrix product of your filter $W$ and one input entry (of dimensionality $inchannel$). Here it is `conv_channels`.
- Shrunk\_seqlen ($L_{out}$) is roughly seqlen, but the convolution will chop off entries at the boundary, unless you compensate with padding. See [pytorch doc section "shape"](https://pytorch.org/docs/stable/generated/torch.nn.Conv1d.html) for details.


The parameter will be a tensor of shape $outchannels \times inchannels \times conv\_dim$
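To make the parameter layout concrete, the sketch below feeds a single unpadded window (rows `E[4]`, `E[5]`, `E[6]`) through `conv1d` with `W` re-arranged to $outchannels \times inchannels \times conv\_dim$. Under the assumption that the window is centered on `t=5`, the single output should agree with the reference `H[5]` from Part 1 up to float32 precision. The names `window`, `weight`, and `H5` are illustrative.

```python
import torch
import torch.nn.functional as F

# W as given above: embed_dim x conv_dim x conv_channels
W = torch.tensor([[[0.03668675, 1], [0.5651157, 1], [1.94466826, 1]],
                  [[0.03668675, 0.03], [0.5651157, 0.56], [1.94466826, 1.94]],
                  [[0.03668675, -0.3], [0.5651157, -1], [1.94466826, -2]]],
                 dtype=torch.float)

# Rows E[4], E[5], E[6]: conv_dim x embed_dim
window = torch.tensor([[0.83, 0.04, 0.12],
                       [0.32, 0.88, 0.99],
                       [0.89, 0.67, 0.64]])

weight = torch.einsum('ekc->cek', W)              # conv_channels x embed_dim x conv_dim
x = torch.einsum('se->es', window).unsqueeze(0)   # 1 x embed_dim x conv_dim

H5 = F.conv1d(x, weight)                          # 1 x conv_channels x 1
print(weight.shape, H5.flatten())
```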


In [27]:
W_torch.shape

torch.Size([3, 3, 2])

In [28]:
# Convolutional layer

# Input E_out: maxw x embed_dim
# Parameter W_torch: embed_dim x conv_dim x conv_channels
# Hyperparameter: conv_dim 
# Output H: maxw x conv_channels

conv = nn.Conv1d(in_channels=embed_dim, out_channels=conv_channels, kernel_size=conv_dim, padding=1, padding_mode='zeros', bias=False)
conv.weight.data = torch.einsum('edc->ced', W_torch)
H_out = conv(torch.einsum('bsi->bis', E_out))

 

print('H_out', H_out, H_out.size())
verify(H_out, H_torch, True ) # verify with previous result. This should return a scalar close to 0


H_out tensor([[[ 3.2140,  5.1019,  4.7676,  3.0168,  4.8840,  5.5522,  3.4822,
           2.3610,  3.6131,  5.7330,  1.4542],
         [ 0.6068,  0.0443,  0.1048,  1.1422,  1.0025,  1.5278,  0.4148,
           0.7949, -0.9709,  0.3276,  0.4416]]], grad_fn=<SqueezeBackward1>) torch.Size([1, 2, 11])
predicted 
 tensor([[[ 3.2140,  5.1019,  4.7676,  3.0168,  4.8840,  5.5522,  3.4822,
           2.3610,  3.6131,  5.7330,  1.4542],
         [ 0.6068,  0.0443,  0.1048,  1.1422,  1.0025,  1.5278,  0.4148,
           0.7949, -0.9709,  0.3276,  0.4416]]], grad_fn=<SqueezeBackward1>) 
 expected 
 tensor([[[ 3.2140,  5.1019,  4.7676,  3.0168,  4.8840,  5.5522,  3.4822,
           2.3610,  3.6131,  5.7330,  1.4542],
         [ 0.6068,  0.0443,  0.1048,  1.1422,  1.0025,  1.5278,  0.4148,
           0.7949, -0.9709,  0.3276,  0.4416]]])
diff 
 tensor([[[5.6843e-14, 0.0000e+00, 2.2737e-13, 5.6843e-14, 0.0000e+00,
          0.0000e+00, 0.0000e+00, 0.0000e+00, 5.6843e-14, 0.0000e+00,
          1.4211e

2.063564027770734e-14

In [29]:
linear = nn.Linear(
        in_features=conv_channels,
        out_features=1,
        bias=False
    )
linear.weight.data = Psi_torch
out = linear(torch.einsum('ijk->ikj', H_out))
print(out.shape)
relu = nn.ReLU()
out = relu(out)
print(out.shape)
out = torch.einsum('ijk->ikj', out)
max_pool = nn.MaxPool1d(pooling_window+1, padding=1, stride=1)
out = max_pool(out)
print(out.shape)
#[1.2000, 2.1805, 2.1805, 2.0186, 1.7994, 1.9291, 1.9291, 1.3729, 1.8449,   2.3669, 2.3669])

torch.Size([1, 11, 1])
torch.Size([1, 11, 1])
torch.Size([1, 1, 12])


## Third layer: Detector H --Psi--> D with ReLU

No bias

In [30]:
## Your job

# Input H: maxw x conv_channels
# Parameter Psi: 1 x conv_channels
# Output D: maxw x 1

relu = nn.ReLU()
D_out = relu(torch.einsum('bil,oi->bol', H_out, Psi_torch))

verify(D_out, D_torch, True)

predicted 
 tensor([[[1.2000, 2.1805, 2.0186, 0.9545, 1.7994, 1.9291, 1.3729, 0.7767,
          1.8449, 2.3669, 0.4928]]], grad_fn=<ReluBackward0>) 
 expected 
 tensor([[[1.2000, 2.1805, 2.0186, 0.9545, 1.7994, 1.9291, 1.3729, 0.7767,
          1.8449, 2.3669, 0.4928]]])
diff 
 tensor([[[0.0000e+00, 5.6843e-14, 5.6843e-14, 3.5527e-15, 0.0000e+00,
          1.4211e-14, 1.4211e-14, 3.5527e-15, 5.6843e-14, 5.6843e-14,
          3.5527e-15]]])


2.4223048426027377e-14

## Read off predictions D--maxpool--> Y

Use a pooling window of length `pooling_window`. Note that the parameters of MaxPool1d are defined differently from my definition (see the comment in the definition of `pooling_window` above).

use stride=1, output variable `Y_out`

Use appropriate padding and cut out the result to get a single prediction for each word.
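How `kernel_size` and `padding` translate into output length can be sketched on a toy sequence. For stride 1, `nn.MaxPool1d` returns $L + 2 \cdot padding - kernel\_size + 1$ entries and pads implicitly with negative infinity. The values below are made up, and which of the two windows corresponds to the intended `pooling_window` is for you to decide.

```python
import torch
import torch.nn as nn

d = torch.tensor([[[1., 3., 2., 0., 5.]]])            # batch x channels x seq (seq=5)

# kernel_size=2, padding=1: one entry too many.
pool2 = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)
y2 = pool2(d)                                         # length 5 + 2 - 2 + 1 = 6
y2_trimmed = y2[:, :, :-1]                            # entry t is max(d[t-1], d[t])

# kernel_size=3, padding=1: symmetric window max(d[t-1], d[t], d[t+1]), no trimming.
pool3 = nn.MaxPool1d(kernel_size=3, stride=1, padding=1)
y3 = pool3(d)                                         # length 5 + 2 - 3 + 1 = 5

print(y2.flatten().tolist())
print(y2_trimmed.flatten().tolist())
print(y3.flatten().tolist())
```

Slicing along the last axis (`[:, :, :-1]`) keeps the batch and channel axes in place, so no transposes are needed for the cut.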

In [31]:
# your code

# Input D: maxw x 1 
# Parameter pooling_window: 1
# Output Y: maxw, 

maxpool = nn.MaxPool1d(pooling_window+1, padding=1, stride=1)
Y_out = torch.einsum('ijk->kji', torch.einsum('ijk->kji', maxpool(D_out))[:-1])

print(verify(Y_out, Y_torch, True))
verify(Y_out, Y_star_torch, True)
#it might not perfectly match up because of the padding, below my reference result
#[1.2000, 2.1805, 2.1805, 2.0186, 1.7994, 1.9291, 1.9291, 1.3729, 1.8449,   2.3669, 2.3669])

predicted 
 tensor([[[1.2000, 2.1805, 2.1805, 2.0186, 1.7994, 1.9291, 1.9291, 1.3729,
          1.8449, 2.3669, 2.3669]]], grad_fn=<ViewBackward>) 
 expected 
 tensor([[[1.2000, 2.1805, 2.1805, 2.0186, 1.7994, 1.9291, 1.9291, 1.3729,
          1.8449, 2.3669, 2.3669]]])
diff 
 tensor([[[0.0000e+00, 5.6843e-14, 5.6843e-14, 5.6843e-14, 0.0000e+00,
          1.4211e-14, 1.4211e-14, 1.4211e-14, 5.6843e-14, 5.6843e-14,
          5.6843e-14]]])
3.488118946242888e-14
predicted 
 tensor([[[1.2000, 2.1805, 2.1805, 2.0186, 1.7994, 1.9291, 1.9291, 1.3729,
          1.8449, 2.3669, 2.3669]]], grad_fn=<ViewBackward>) 
 expected 
 tensor([[[0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.]]])
diff 
 tensor([[[1.4400, 4.7548, 4.7548, 1.0376, 0.6390, 3.7214, 3.7214, 1.8848,
          3.4037, 5.6024, 5.6024]]])


3.3238351345062256


# Part 3: Full model

Starting with a one-hot encoding, we could train our own embedding as part of the end-to-end network model. However, the length of the one-hot vectors grows with the size of the vocabulary. Instead, we will use a pre-trained embedding called word2vec, which maps most words in the English language to a dense fixed-length vector.

(Dense means, each vector has very few zeros, in contrast to a one-hot embedding, 
fixed length means that if we grow our vocabulary, the embedding vectors will still have the same length, such as 300)





## Build a PyTorch model (use default initialization)

Here is an example code stub to follow.

Please please please: instead of calling the layers `layer1` give them informative names.

Don't forget to re-arrange the output tensors as inputs for the next layer.

```python

# I suggest following this code stub

class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.layer1 = ...
        self.layer2 = ....

    def forward(self, x):
        out = self.layer1(x)
        out = einops.rearrange(out,'i j k-> j k i') # re-arrange the output tensor to match the input
        out = self.layer2(out)
        ...
        out = out[:][0].narrow(1,1,out.shape[3]-1) # drop second dimension, then cut off last element from pooling with padding
        return out

model = MyModel()   
```

Alternatively you can use `nn.Sequential()` with a Rearrange layer.
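A minimal sketch of the sequential variant. To keep it free of extra dependencies it uses a small hypothetical module `SwapLastTwo` as a stand-in for einops' `Rearrange('b i j -> b j i')`. The layer sizes are the ones from "Network Setup", the parameters are left at their default initialization, and `model_demo`, `X_demo`, and `Y_demo` are illustrative names.

```python
import torch
import torch.nn as nn

vocab_dim, embed_dim, conv_dim, conv_channels = 11, 3, 3, 2

class SwapLastTwo(nn.Module):
    """Hypothetical helper: swaps axes 1 and 2 of a 3-axis tensor."""
    def forward(self, x):
        return x.transpose(1, 2)

model_demo = nn.Sequential(
    nn.Linear(vocab_dim, embed_dim, bias=False),       # batch x seq x embed_dim
    SwapLastTwo(),                                     # batch x embed_dim x seq
    nn.Conv1d(embed_dim, conv_channels, conv_dim,
              padding=1, bias=False),                  # batch x conv_channels x seq
    SwapLastTwo(),                                     # batch x seq x conv_channels
    nn.Linear(conv_channels, 1, bias=False),           # batch x seq x 1
    nn.ReLU(),
    SwapLastTwo(),                                     # batch x 1 x seq
    nn.MaxPool1d(kernel_size=2, stride=1, padding=1),  # batch x 1 x (seq + 1)
)

X_demo = torch.eye(vocab_dim).unsqueeze(0)             # 1 x seq x vocab_dim (one-hot)
Y_demo = model_demo(X_demo)[:, :, :-1]                 # cut off the extra pooling entry
print(Y_demo.shape)
```

The final slice sits outside the `nn.Sequential` here; wrapping it in another small module would let the whole pipeline live in one object.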



In [32]:
# Your task!

# model = nn.Sequential(
#     nn.Linear(
#         in_features=vocab_dim,
#         out_features=embed_dim,
#         bias=False
#     ),
#     Rearrange(
#         'i j k -> i k j'
#     ),
#     nn.Conv1d(
#         in_channels=embed_dim,
#         out_channels=conv_channels,
#         kernel_size=conv_dim,
#         padding=1,
#         padding_mode='zeros',
#         bias=False
#     ),
#     Rearrange(
#         'i j k -> i k j'
#     ),
#     nn.Linear(
#         in_features=conv_channels,
#         out_features=1,
#         bias=False
#     ),
#     Rearrange(
#         'i j k -> i k j'
#     ),
#     nn.ReLU(),
#     nn.MaxPool1d(
#         kernel_size=pooling_window,
#         stride=1
#     ),
#     Reduce
# )

class MyModel(nn.Module):
    
    def __init__(self):
        super(MyModel, self).__init__()
        self.embed = nn.Linear(
            in_features=vocab_dim,
            out_features=embed_dim,
            bias=False
        )
        self.convolve = nn.Conv1d(
            in_channels=embed_dim,
            out_channels=conv_channels,
            kernel_size=conv_dim,
            padding=1,
            padding_mode='zeros',
            bias=False
        )
        self.linear = nn.Linear(
            in_features=conv_channels,
            out_features=1,
            bias=False
        )
        self.detect = nn.ReLU()
        self.pool = nn.MaxPool1d(
            kernel_size=pooling_window+1,
            padding=1,
            stride=1
        )

    def forward(self, x):
        out = self.embed(x)
        out = einops.rearrange(out,'i j k -> i k j') # re-arrange the output tensor to match the input
        out = self.convolve(out)
        out = einops.rearrange(out,'i j k -> i k j')
        out = self.linear(out)
        out = einops.rearrange(out,'i j k -> i k j')
        out = self.detect(out)
        out = self.pool(out)
        out = einops.rearrange(einops.rearrange(out, 'i j k -> k j i')[:-1], 'i j k -> k j i')
        #out = einops.rearrange(einops.rearrange(out, 'i j k -> k j i')[0:11], 'i j k -> k j i')
        return out

model = MyModel() 
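
The pooling arithmetic in `forward` can be checked in isolation. The following is a minimal sketch, assuming the pooling settings printed in the layer summary of this notebook (`kernel_size=2`, `padding=1`, `stride=1`) and a made-up input of length 11. The names `d`, `pooled`, and `trimmed` are illustrative only.

```python
import torch
import torch.nn as nn

# Assumed settings, copied from the printed layer summary of the model.
pool = nn.MaxPool1d(kernel_size=2, stride=1, padding=1)

# Hypothetical detector output: batch of 1, 1 channel, 11 positions.
d = torch.arange(11, dtype=torch.float).reshape(1, 1, 11)

pooled = pool(d)
# Output length = floor((L + 2*padding - kernel_size) / stride) + 1
#               = (11 + 2 - 2) / 1 + 1 = 12
print(pooled.shape)

# Dropping the last position restores one output per word.
trimmed = pooled[..., :-1]
print(trimmed.shape)
```

Slicing with `out[..., :-1]` removes the last position along the final axis, which is the same effect as the double `rearrange` around `[:-1]` in `forward`.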

In [33]:
# get access to the internal state of your model

layers=list(model.modules())
print('layers \n', layers)
print()


print('Parameters \n', list(model.parameters()))

print()
print('Access parameters via their names \n')
for name, param in model.named_parameters():
    print(name,'\n', param.detach())


layers 
 [MyModel(
  (embed): Linear(in_features=11, out_features=3, bias=False)
  (convolve): Conv1d(3, 2, kernel_size=(3,), stride=(1,), padding=(1,), bias=False)
  (linear): Linear(in_features=2, out_features=1, bias=False)
  (detect): ReLU()
  (pool): MaxPool1d(kernel_size=2, stride=1, padding=1, dilation=1, ceil_mode=False)
), Linear(in_features=11, out_features=3, bias=False), Conv1d(3, 2, kernel_size=(3,), stride=(1,), padding=(1,), bias=False), Linear(in_features=2, out_features=1, bias=False), ReLU(), MaxPool1d(kernel_size=2, stride=1, padding=1, dilation=1, ceil_mode=False)]

Parameters 
 [Parameter containing:
tensor([[-0.2572, -0.1168,  0.2335, -0.0669,  0.0920, -0.0185, -0.0072, -0.2580,
         -0.2038,  0.0268, -0.0545],
        [ 0.1234,  0.1247, -0.2031,  0.1124, -0.2242,  0.0449, -0.2339, -0.1328,
         -0.2193,  0.0307,  0.1619],
        [-0.2972, -0.2047, -0.1752,  0.0485,  0.1938, -0.0009,  0.1328,  0.0664,
          0.1533, -0.2809, -0.1320]], requires_grad=Tr

In [34]:
# Now the parameters are set to some random-ish initialization.
# Before we train it, let's produce some random predictions.

# Forward pass: Compute predicted y by passing x to the model
# If your X is called differently, just change the variable

print("X_torch",X_torch, X_torch.shape)

y_pred = model(X_torch)
y_pred.detach()

X_torch tensor([[[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]]]) torch.Size([1, 11, 11])


tensor([[[0.0489, 0.0489, 0.0000, 0.0406, 0.0406, 0.0857, 0.0857, 0.0718,
          0.0718, 0.0181, 0.0181]]])

This might be a good opportunity to familiarize yourself with `torch.utils.data.Dataset` and `TensorDataset`
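
As a starting point, here is a minimal sketch of wrapping tensors in a `TensorDataset` and iterating over them with a `DataLoader`. The tensors `X_demo` and `Y_demo` are hypothetical stand-ins with the same shapes as `X_torch` and `Y_star_torch` in this notebook; with a single sentence the loader yields exactly one batch.

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical stand-ins: one sentence, 11 words, vocabulary of 11.
X_demo = torch.eye(11).unsqueeze(0)      # shape (1, 11, 11), one-hot rows
Y_demo = torch.zeros(1, 1, 11)           # shape (1, 1, 11)
Y_demo[0, 0, 3:5] = 1.0                  # mark words 3 and 4 as selected

# TensorDataset pairs up the tensors along their first dimension.
dataset = TensorDataset(X_demo, Y_demo)
loader = DataLoader(dataset, batch_size=1, shuffle=True)

for x_batch, y_batch in loader:
    print(x_batch.shape, y_batch.shape)
```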


### Training Code Stub

Construct an `MSELoss` function using the ground truth $Y^\star$ (`Y_star`).
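
To see what `MSELoss` computes, here is a small sketch with made-up numbers. The tensors `y_pred_demo` and `y_star_demo` are hypothetical and only mimic the `(batch, channel, words)` shape of the model output.

```python
import torch
import torch.nn as nn

criterion = nn.MSELoss()  # averages the squared differences over all elements

# Hypothetical prediction and ground truth for a 4-word sentence.
y_pred_demo = torch.tensor([[[0.0, 0.5, 1.0, 0.0]]])
y_star_demo = torch.tensor([[[0.0, 1.0, 1.0, 0.0]]])

loss_demo = criterion(y_pred_demo, y_star_demo)

# The same quantity computed by hand: only one entry differs, by 0.5.
manual = ((y_pred_demo - y_star_demo) ** 2).mean()
print(loss_demo.item(), manual.item())  # both 0.0625 = 0.25 / 4
```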

If your variables are called differently, you may have to change them below.

(code stub below)

In [35]:
def train(xdata, ydata, model):
    '''Train the neural model with the given training data'''

    #Construct the loss function
    criterion = torch.nn.MSELoss()
    # Construct the optimizer (Stochastic Gradient Descent in this case)
    optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)  # lr is learning rate

    # Gradient Descent
    for epoch in range(500):
        # Forward pass: Compute predicted y by passing x to the model
        Y_pred = model(xdata)

        # Compute and print loss
        loss = criterion(Y_pred, ydata)
        if epoch % 100 == 0: print('epoch: ', epoch,' loss: ', loss.item()) 

        # Zero gradients, perform a backward pass, and update the weights.
        optimizer.zero_grad()

        # perform a backward pass (backpropagation)
        loss.backward()

        # Update the parameters
        optimizer.step()
        
    return model
        
model = train(X_torch, Y_star_torch, model)

epoch:  0  loss:  0.17012378573417664
epoch:  100  loss:  0.14087682962417603
epoch:  200  loss:  0.10477027297019958
epoch:  300  loss:  0.062069013714790344
epoch:  400  loss:  0.02598913200199604


In [36]:
# Print predictions after training finishes
Y_pred = model(X_torch)
print("\n Y_pred (learned) \n", Y_pred.detach())
print("\n Y_star_torch (truth) \n", Y_star_torch.detach())
print("\n Y_torch (result of fixed params) \n", Y_torch.detach())
print("\n trained params \n",  list(model.parameters()))


 Y_pred (learned) 
 tensor([[[0.0254, 0.0254, 0.0000, 0.8331, 0.8331, 0.0863, 0.0863, 0.0796,
          0.0796, 0.0000, 0.0589]]])

 Y_star_torch (truth) 
 tensor([[[0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0.]]])

 Y_torch (result of fixed params) 
 tensor([[[1.2000, 2.1805, 2.1805, 2.0186, 1.7994, 1.9291, 1.9291, 1.3729,
          1.8449, 2.3669, 2.3669]]])

 trained params 
 [Parameter containing:
tensor([[-0.2524, -0.1062,  0.5303, -0.1435, -0.1438, -0.0032, -0.0267, -0.2453,
         -0.1774,  0.0172, -0.0517],
        [ 0.1109,  0.1340, -0.2241,  0.3049, -0.3584,  0.0049, -0.1987, -0.1661,
         -0.1942,  0.0314,  0.1559],
        [-0.2824, -0.2255,  0.0032, -0.1869,  0.4837,  0.0449,  0.0407,  0.1044,
          0.0976, -0.2842, -0.1261]], requires_grad=True), Parameter containing:
tensor([[[ 0.1424,  0.2035,  0.2659],
         [ 0.0981, -0.0915, -0.0934],
         [ 0.0934, -0.3086, -0.1798]],

        [[-0.5467,  0.1088,  0.2126],
         [ 0.0612, -0.3350,  0.2650],
      

In [37]:
# Reference Result

# [...]
# epoch:  498  loss:  0.001912423176690936
# epoch:  499  loss:  0.0019004556816071272
# Y_pred (learned) tensor([0.0363, 0.0248, 0.0248, 0.0000, 0.9709, 0.9709, 0.1259, 0.0000, 0.0131,
#         0.0179, 0.0179], grad_fn=<SliceBackward>)
# Y_star_torch (truth) tensor([0., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0.])
# trained params [Parameter containing:
# tensor([[ 3.3846e-02,  1.5053e-01,  5.3871e-02, -2.7892e-01,  4.4570e-02,
#           2.3433e-01,  5.0296e-01, -2.8384e-01,  2.9082e-01,  4.3104e-02,
#          -5.5689e-02],
#         [ 1.8341e-01,  4.4496e-02,  6.9729e-02, -1.8623e-01,  5.0794e-01,
#          -9.3826e-02, -6.1965e-01, -2.6323e-01,  1.2437e-01,  4.4976e-02,
#          -6.0869e-02],
#         [-5.7301e-04, -1.2836e-01,  1.8387e-01,  1.3956e-02,  2.2719e-01,
#           7.8836e-02, -2.7487e-01,  2.9157e-01, -2.7401e-03, -2.8299e-01,
#          -4.6545e-03]], requires_grad=True), Parameter containing:
# tensor([[[ 0.0102, -0.1001,  0.2383],
#          [ 0.0295, -0.1764, -0.1163],
#          [ 0.1335,  0.1075, -0.2252]],

#         [[ 0.0554,  0.2765,  0.5179],
#          [ 0.5025, -0.2487, -0.6356],
#          [-0.0068,  0.2304, -0.1850]]], requires_grad=True), Parameter containing:
# tensor([[0.1969, 0.8541]], requires_grad=True)]

After training, produce some predictions 
(here we are using the training sequence again, something you should of course never do...)

In [38]:
# Which words were selected?


Y_pred = model(X_torch)

words_array = np.array(words, dtype=object)
print("selected words \n",words_array[Y_pred[0][0]>0.5])

print('Y_pred',Y_pred)
print('Y_star', Y_star)

criterion = nn.MSELoss()
print('loss', criterion(Y_pred, Y_star_torch))

mse=torch.mean( (Y_pred.detach()-Y_star)**2)  # detach takes the tensor out of the network
print("MSE for ground truth y_star", mse)

selected words 
 ['New' 'Hampshire ']
Y_pred tensor([[[0.0254, 0.0254, 0.0000, 0.8331, 0.8331, 0.0863, 0.0863, 0.0796,
          0.0796, 0.0000, 0.0589]]], grad_fn=<ViewBackward>)
Y_star [0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
loss tensor(0.0080, grad_fn=<MseLossBackward>)
MSE for ground truth y_star tensor(0.0080, dtype=torch.float64)


In [39]:
# Reference Results

# selected words 
#  ['New' 'Hampshire ']
# Y_pred tensor([0.0415, 0.0415, 0.0773, 0.7233, 0.7233, 0.1414, 0.1414, 0.0286, 0.0780,
#         0.0780, 0.0000], grad_fn=<SliceBackward>)
# Y_star [0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
# loss tensor(0.0196, grad_fn=<MseLossBackward>)
# MSE for ground truth y_star tensor(0.0196, dtype=torch.float64)


# Part 4: Replace One-Hot-Encoding with Pre-trained Word embeddings


We will follow the standard approach of first embedding all words with a pre-trained embedding, such as word2vec, and then using this as input for the network.

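Below is some template code to look up words in a pre-trained embedding, which yields a 50-dim vector per word. To start with a smaller example, we reduce each vector to 10 dimensions: the 50-dim word vector is subdivided into 5 segments of 10 entries, each segment is summed, and the result is padded with the leading entries of the original vector.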
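
A minimal NumPy sketch of this segment-sum downsampling, assuming the settings used by `glove_embed` in this notebook (a 50-dim vector, 5 segments of 10 entries, target dimension 10). The vector `vec` is made up so that the sums are easy to check by hand.

```python
import numpy as np

# Hypothetical 50-dim vector standing in for a GloVe lookup.
vec = np.arange(50, dtype=float)

# Split into 5 consecutive segments of 10 entries and sum each segment.
downsampled = vec.reshape(5, 10).sum(axis=1)
print(downsampled)  # segment sums: 45, 145, 245, 345, 445

# Pad with the leading entries of the original vector to reach 10 dimensions.
target_dim = 10
padded = np.hstack([downsampled, vec[: target_dim - len(downsampled)]])
print(padded.shape)
```

The `reshape(5, 10).sum(axis=1)` call groups entries in the same way as the einops pattern `'(d seg) -> d'` with `seg=10`.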

More info on word2vec with gensim:
- [Gensim - Topic modelling for humans: word2vec](https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html)
- [Word Embedding: Word2Vec with Genism, NLTK, and t-SNE Visualization](https://medium.com/analytics-vidhya/word-embedding-word2vec-with-genism-nltk-and-t-sne-visualization-43eae8ab3e2e).


We will be downloading a 50-dim GloVe embedding (because it does not take up much space).
I recommend to save the model file locally once, then load it with 

You can also download alternative models; call `gensim.downloader.info()` for a list.

Below is a code stub to access the word embeddings. All you have to do is to form a substitute for $X$.
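
One way to form such a substitute is sketched below. To keep the example self-contained, it uses a hypothetical toy lookup (`toy_embed`, `toy_vectors`, `toy_words`) instead of the real embedding. The point is the layout: `nn.Conv1d` expects input of shape `(batch, channels, length)`, so the embedding dimension comes before the word position.

```python
import numpy as np
import torch

# Hypothetical toy lookup standing in for a real word-embedding function.
toy_dim = 3
toy_vectors = {"New": np.ones(toy_dim), "Hampshire": np.full(toy_dim, 2.0)}

def toy_embed(word, dim):
    """Return the toy vector for a word, or a zero vector if it is unknown."""
    return toy_vectors.get(word, np.zeros(dim))

toy_words = ["in", "New", "Hampshire", "are"]

# Stack one embedding per word: shape (n_words, dim).
E = np.stack([toy_embed(w, toy_dim) for w in toy_words])

# Transpose to (dim, n_words) and add a batch axis: shape (1, dim, n_words).
X_sub = torch.tensor(E, dtype=torch.float).T.unsqueeze(0)
print(X_sub.shape)
```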


In [40]:
# Install gensim, to use word2vec word embeddings

# Install gensim (for pre-trained word embeddings)
#!conda install --yes --prefix {sys.prefix} gensim 


# test:

#import gensim

# ONLY if you get an error after `import gensim`:
#
# update your smart_open library
#!conda install --yes --prefix {sys.prefix} smart_open
# restart your notebook
# see if `import gensim` works now


In [41]:
import gensim
import gensim.downloader

wv = gensim.downloader.load("glove-wiki-gigaword-50")

# # lookup the word vector for a word "king"
# wv['king']


In [42]:
# downsampled embedding and zero vector for unknown words
# note: the following code assumes that the word embedding dimension is divisible by sample_batches (10)

import einops
import numpy as np



def glove_embed(word:str, target_dim)->np.array:
    '''
        Looks up word in embedding (downsampled to five dimensions), pads with beginning of embedding.
        Returns zero vector for unknown words.
    '''
    # these parameters work for 50-dim glove embeddings (adjust for other embeddings)
    sampled_dim = 5
    sample_batches = 10
    
    empty_vec=np.zeros(target_dim)
    if word in wv:
        w2v = wv[word] # lookup 50 dim vector
        a=einops.reduce(w2v,'(d seg)-> d', "sum", seg=sample_batches)  # downsample
        b=w2v[0:target_dim-sampled_dim]
        return np.hstack([a,b])
    else:
        return empty_vec
    


    
print('embedding of `king`', glove_embed('king',7))
print('embedding of `a`', glove_embed('a',7))
print('embedding of `not a valid word`', glove_embed('not a valid word',7))


embedding of `king` [ 0.49562895  1.1722751  -4.3306704   2.07816    -2.7060602   0.50451
  0.68607   ]
embedding of `a` [ 1.58374    -1.50257    -2.3603408   3.70191     0.25554505  0.21705
  0.46515   ]
embedding of `not a valid word` [0. 0. 0. 0. 0. 0. 0.]


In [43]:
embed_dim=10

class MyModel(nn.Module):
    
    def __init__(self):
        
        super(MyModel, self).__init__()
        
        self.convolve = nn.Conv1d(
            in_channels=embed_dim,
            out_channels=conv_channels,
            kernel_size=conv_dim,
            padding=1,
            padding_mode='zeros',
            bias=False
        )
        self.linear = nn.Linear(
            in_features=conv_channels,
            out_features=1,
            bias=False
        )
        self.detect = nn.ReLU()
        self.pool = nn.MaxPool1d(
            kernel_size=pooling_window+1,
            padding=1,
            stride=1
        )

    def forward(self, x):
        
        out = self.convolve(x)
        out = einops.rearrange(out,'i j k -> i k j')
        out = self.linear(out)
        out = einops.rearrange(out,'i j k -> i k j')
        out = self.detect(out)
        out = self.pool(out)
        out = einops.rearrange(einops.rearrange(out, 'i j k -> k j i')[:-1], 'i j k -> k j i')

        return out

model = MyModel() 

In [44]:
# Your code

X2_torch = torch.einsum(
    'ijk->ikj',
    torch.unsqueeze(
        torch.tensor(
            np.array(
                [
                    glove_embed(word, embed_dim)
                    for word in words
                ]
            ),
            dtype=torch.float
        ),
        0
    )
)
X2_torch 

def train(xdata, ydata, model):
    '''Train the neural model with the given training data'''

    #Construct the loss function
    criterion = torch.nn.MSELoss()
    # Construct the optimizer (Stochastic Gradient Descent in this case)
    optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)  # lr is learning rate

    # Gradient Descent
    for epoch in range(500):
        # Forward pass: Compute predicted y by passing x to the model
        Y_pred = model(xdata)

        # Compute and print loss
        loss = criterion(Y_pred, ydata)
        if epoch % 100 == 0: print('epoch: ', epoch,' loss: ', loss.item()) 

        # Zero gradients, perform a backward pass, and update the weights.
        optimizer.zero_grad()

        # perform a backward pass (backpropagation)
        loss.backward()

        # Update the parameters
        optimizer.step()
        
    return model
model = train(X2_torch, Y_star_torch, model)
Y2 = model(X2_torch)



epoch:  0  loss:  0.37963974475860596
epoch:  100  loss:  0.015953142195940018
epoch:  200  loss:  0.0031155706383287907
epoch:  300  loss:  0.0005920543917454779
epoch:  400  loss:  0.00010298789129592478


In [45]:
words_array = np.array(words, dtype=object)

print("selected words \n",words_array[Y2[0][0]>0.5]) 


selected words 
 ['New' 'Hampshire ']


#  Bonus

How do the results change when you use the loss function [nn.CrossEntropyLoss](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html)?

Currently the model is trained with a single sentence.
How would you extend the model to train from multiple sentences?

If you feel inclined, I recommend implementing the POS tagging example from the previous programming assignment in this model.
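
For the multi-sentence question, one possible direction (a sketch, not the only approach) is to pad sentences to a common length and stack them into one batch. The sentences below are random hypothetical tensors, and the mask is an assumption about how padded positions could later be excluded from the loss.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical embedded sentences of different lengths: (n_words, embed_dim).
embed_dim_demo = 10
sentences = [torch.randn(11, embed_dim_demo), torch.randn(7, embed_dim_demo)]
labels = [torch.zeros(11), torch.zeros(7)]

# Pad with zeros to the longest sentence: (batch, max_len, embed_dim).
X_batch = pad_sequence(sentences, batch_first=True)
Y_batch = pad_sequence(labels, batch_first=True)

# nn.Conv1d expects (batch, channels, length).
X_batch = X_batch.transpose(1, 2)

# Boolean mask of real (non-padding) positions, one row per sentence.
lengths = torch.tensor([len(s) for s in sentences])
mask = torch.arange(X_batch.shape[2])[None, :] < lengths[:, None]
print(X_batch.shape, Y_batch.shape, mask.sum(dim=1))
```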

In [46]:
class MyModel(nn.Module):
    
    def __init__(self):
        
        super(MyModel, self).__init__()
        
        self.convolve = nn.Conv1d(
            in_channels=embed_dim,
            out_channels=conv_channels,
            kernel_size=conv_dim,
            padding=1,
            padding_mode='zeros',
            bias=False
        )
        self.pool = nn.MaxPool1d(
            kernel_size=pooling_window+1,
            padding=1,
            stride=1
        )
        self.detect = nn.Softmax(dim=2)

    def forward(self, x):
        
        out = self.convolve(x)
        out = self.pool(out)
        out = einops.rearrange(einops.rearrange(out, 'i j k -> k j i')[:-1], 'k j i -> i k j')
        out = self.detect(out)
        out = torch.squeeze(out, 0)

        return out

model = MyModel() 

X2_torch = torch.einsum(
    'ijk->ikj',
    torch.unsqueeze(
        torch.tensor(
            np.array(
                [
                    glove_embed(word, embed_dim)
                    for word in words
                ]
            ),
            dtype=torch.float
        ),
        0
    )
)


def train(xdata, ydata, model):
    '''Train the neural model with the given training data'''

    #Construct the loss function
    criterion = torch.nn.CrossEntropyLoss()
    # Construct the optimizer (Stochastic Gradient Descent in this case)
    optimizer = torch.optim.SGD(model.parameters(), lr = 0.01)  # lr is learning rate

    # Gradient Descent
    for epoch in range(500):
        # Forward pass: Compute predicted y by passing x to the model
        Y_pred = model(xdata)
        
        
        # Compute and print loss
        loss = criterion(Y_pred, torch.squeeze(ydata).long())
        if epoch % 100 == 0: print('epoch: ', epoch,' loss: ', loss.item()) 

        # Zero gradients, perform a backward pass, and update the weights.
        optimizer.zero_grad()

        # perform a backward pass (backpropagation)
        loss.backward()

        # Update the parameters
        optimizer.step()
        
    return model

model = train(X2_torch, Y_star_torch, model)
Y2 = model(X2_torch)
Y2.T[1], Y2.T[0]

epoch:  0  loss:  0.9274263978004456
epoch:  100  loss:  0.4716719686985016
epoch:  200  loss:  0.43585205078125
epoch:  300  loss:  0.4091963469982147
epoch:  400  loss:  0.3729363679885864


(tensor([0.0400, 0.0309, 0.0425, 0.8390, 0.7686, 0.0643, 0.0095, 0.0147, 0.0242,
         0.0334, 0.1353], grad_fn=<SelectBackward>),
 tensor([0.9600, 0.9691, 0.9575, 0.1610, 0.2314, 0.9357, 0.9905, 0.9853, 0.9758,
         0.9666, 0.8647], grad_fn=<SelectBackward>))

In [47]:
words_array = np.array(words, dtype=object)

print("selected words \n",words_array[Y2.T[1]>0.5]) 

selected words 
 ['New' 'Hampshire ']


As can be seen above, I created a new model to use cross-entropy. You cannot directly use the cross-entropy loss on our previous models, as they output a value for only a single class, but we have two. I started by simply adding a linear layer that expanded our output to 2 dimensions and then put that through a softmax activation to get probabilities for both classes. This did not work well, as we are over-constraining our problem and the model gets stuck. I realized that mapping the results of the convolution to a single output was getting rid of important information in our new context, so I removed this layer. Next, I surmised that the ReLU was redundant, as we are using a softmax at the end as a detector, so I removed that too. Now our pooling layer feeds directly into the softmax, which gives us a probability for each word for each class. This is now the output we desire, and using the cross-entropy we can detect the words correctly. The value for the loss is much worse than for our other models, but I think that is due to the cross-entropy loss being less forgiving than MSE.
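
One additional factor that may contribute to the higher loss value, offered as a sketch rather than a diagnosis of this particular run: `nn.CrossEntropyLoss` applies a log-softmax internally and expects raw scores (logits). If the model already applies a softmax, the loss sees inputs confined to $[0, 1]$, and for two classes it cannot fall below $\log(1 + e^{-1}) \approx 0.313$. The scores and labels below are made up for illustration.

```python
import math
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
target = torch.tensor([1, 0])  # hypothetical class labels for two words

# Raw scores (logits) that strongly favor the correct class.
logits = torch.tensor([[-10.0, 10.0], [10.0, -10.0]])
loss_logits = criterion(logits, target)

# The same scores pushed through a softmax before the loss.
probs = torch.softmax(logits, dim=1)
loss_probs = criterion(probs, target)

# Lower bound of the two-class loss when inputs lie in [0, 1].
floor = math.log(1 + math.exp(-1))
print(loss_logits.item(), loss_probs.item(), floor)
```

In this sketch the loss on raw logits is close to zero, while the loss on softmax outputs stays near the floor even though the predictions are essentially perfect.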