# HW 1 Classification

In this homework you will be building several varieties of text classifiers.

## Goal

We ask that you construct the following models in PyTorch:

1. A naive Bayes unigram classifier (follow Wang and Manning http://www.aclweb.org/anthology/P/P12/P12-2.pdf#page=118: you should only implement Naive Bayes, not the combined classifier with SVM).
2. A logistic regression model over word types (you can implement this as $y = \sigma(\sum_i W x_i + b)$) 
3. A continuous bag-of-words neural network with embeddings (similar to CBOW in Mikolov et al https://arxiv.org/pdf/1301.3781.pdf).
4. A simple convolutional neural network (any variant of CNN as described in Kim http://aclweb.org/anthology/D/D14/D14-1181.pdf).
5. Your own extensions to these models...

Consult the papers provided for hyperparameters. 


## Setup

This notebook provides a working definition of the setup of the problem itself. You may construct your models inline or use an external setup (preferred) to build your system.

In [1]:
!pip install torchtext



In [39]:
# Text processing library and methods for pretrained word embeddings
import torchtext
from torchtext.vocab import Vectors, GloVe

import numpy as np
import torch
from torch.autograd import Variable
import torch.nn as nn

The dataset for this problem is the Stanford Sentiment Treebank (https://nlp.stanford.edu/~socherr/EMNLP2013_RNTN.pdf), a variant of a standard sentiment classification task. For simplicity, we use its most basic form: classifying a sentence as positive or negative in sentiment.

To start, `torchtext` requires that we define a mapping from the raw text data to featurized indices. These fields make it easy to map back and forth between readable data and math, which helps for debugging.

In [2]:
# Our input $x$
TEXT = torchtext.data.Field()

# Our labels $y$
LABEL = torchtext.data.Field(sequential=False)

Next we load our data. Here we use the standard SST train, validation and test splits, pass in the fields, and filter out the examples labeled neutral.

In [3]:
train, val, test = torchtext.datasets.SST.splits(
    TEXT, LABEL,
    filter_pred=lambda ex: ex.label != 'neutral')

Let's look at this data. It is still in its original form: each example consists of a label and the original words.

In [4]:
type(TEXT), type(train)

(torchtext.data.field.Field, torchtext.datasets.sst.SST)

In [5]:
print('len(train)', len(train))
print('vars(train[0])', vars(train[0]))

len(train) 6920
vars(train[0]) {'label': 'positive', 'text': ['The', 'Rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'Century', "'s", 'new', '``', 'Conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'Arnold', 'Schwarzenegger', ',', 'Jean-Claud', 'Van', 'Damme', 'or', 'Steven', 'Segal', '.']}


In order to map this data to features, we need to assign an index to each word and label. The `build_vocab` function does this and provides useful options that we will need in future assignments.

In [6]:
TEXT.build_vocab(train)
LABEL.build_vocab(train)
print('len(TEXT.vocab)', len(TEXT.vocab))
print('len(LABEL.vocab)', len(LABEL.vocab))

len(TEXT.vocab) 16286
len(LABEL.vocab) 3
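To make the index mapping concrete, here is a minimal pure-Python sketch of what a vocabulary does. It is not torchtext's implementation: `build_toy_vocab` and the toy sentences are hypothetical, and only the `itos`/`stoi` names mirror the attributes used in this notebook.

```python
from collections import Counter

def build_toy_vocab(sentences, specials=("<unk>", "<pad>")):
    """Return (itos, stoi) for tokenized sentences; hypothetical helper."""
    counts = Counter(tok for sent in sentences for tok in sent)
    # Special tokens first, then words by descending frequency (ties alphabetical).
    words = sorted(counts, key=lambda w: (-counts[w], w))
    itos = list(specials) + words
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

toy = [["a", "good", "film"], ["a", "bad", "film"]]
itos, stoi = build_toy_vocab(toy)

# Words that were never seen fall back to the <unk> index.
ids = [stoi.get(tok, stoi["<unk>"]) for tok in ["a", "great", "film"]]
print(ids)                     # [2, 0, 3]
print([itos[i] for i in ids])  # ['a', '<unk>', 'film']
```

Being able to go from indices back to strings with `itos` is what makes the debugging printouts later in this notebook possible.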


Finally we are ready to create batches of our training data that can be used for training and validating the model. This function produces 3 iterators that will let us go through the train, val and test data. 

In [7]:
train_iter, val_iter, test_iter = torchtext.data.BucketIterator.splits(
    (train, val, test), batch_size=10, device=-1, repeat=False) # added repeat=False based on piazza comment

Let's look at a single batch from one of these iterators. The library automatically converts the underlying words into indices and produces tensors for batches of x and y. The text tensor has shape [max sentence length, batch size]: the first dimension is the number of words in the longest sentence of the batch (shorter sentences are padded), and the second is the number of sentences in the batch. We can use the vocabulary to convert these indices back to words.

In [8]:
batch = next(iter(train_iter))
print("Size of text batch [max sent length, batch size]", batch.text.size())
print("First in batch", batch.text[:, 0])
print("Converted back to string: ", " ".join([TEXT.vocab.itos[i] for i in batch.text[:, 0].data]))
print(batch.label)

Size of text batch [max sent length, batch size] torch.Size([29, 10])
First in batch Variable containing:
    28
  6294
   371
     3
   300
   918
     5
   167
 11872
     3
    17
     7
   398
   136
     3
 13622
    69
     3
  2210
   217
     5
     7
  5057
  7308
     6
  1005
  1730
   609
     2
[torch.LongTensor of size 29]

Converted back to string:  ... flat-out amusing , sometimes endearing and often fabulous , with a solid cast , noteworthy characters , delicious dialogue and a wide supply of effective sight gags .
Variable containing:
 1
 1
 2
 1
 1
 1
 1
 1
 1
 2
[torch.LongTensor of size 10]
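The padding step can be sketched by hand as follows. This is an illustration only, not how torchtext builds its batches: `pad_batch` is a hypothetical helper, the padding index of 1 is an assumption about where `<pad>` sits in the vocabulary, and the code assumes a recent PyTorch release rather than the `Variable` API used in this notebook.

```python
import torch

def pad_batch(sentences, pad_idx=1):
    """Stack lists of word indices into a LongTensor of shape [max sent length, batch size]."""
    max_len = max(len(s) for s in sentences)
    # Start from a tensor filled with the padding index (assumed to be 1 here).
    batch = torch.full((max_len, len(sentences)), pad_idx, dtype=torch.long)
    for j, sent in enumerate(sentences):
        # Each sentence occupies one column; shorter ones keep trailing padding.
        batch[:len(sent), j] = torch.tensor(sent, dtype=torch.long)
    return batch

toy_batch = pad_batch([[5, 8, 2], [7, 2]])
print(toy_batch.size())          # torch.Size([3, 2])
print(toy_batch[:, 1].tolist())  # [7, 2, 1]
```

Note that sentences run down the columns, which is why the cells in this notebook index a single sentence as `batch.text[:, 0]`.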



In [10]:
i = 0
for batch in val_iter:
    if i > 5:
        break
    i += 1
    print(batch.text.size())
    print(' '.join([TEXT.vocab.itos[i] for i in batch.text[:, 0].data]))
    print(' '.join([TEXT.vocab.itos[i] for i in batch.text[:, 9].data]))

torch.Size([5, 10])
It 's fun <unk> .
Cool ? <pad> <pad> <pad>
torch.Size([5, 10])
<unk> inept and ridiculous .
One from the heart .
torch.Size([6, 10])
Good old-fashioned <unk> is back !
<unk> a real downer ? <pad>
torch.Size([6, 10])
One long string of cliches .
It 's a beautiful madness .
torch.Size([7, 10])
A tender , heartfelt family drama .
At once half-baked and <unk> . <pad>
torch.Size([7, 10])
Almost gags on its own gore .
Good film , but very glum .


Similarly, it produces a vector holding the label index of each example in the batch.

In [11]:
print("Size of label batch [batch size]", batch.label.size())
print("First in batch", batch.label[0])
print("Converted back to string: ", LABEL.vocab.itos[batch.label.data[0]])

Size of label batch [batch size] torch.Size([10])
First in batch Variable containing:
 1
[torch.LongTensor of size 1]

Converted back to string:  positive
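Since the label indices seen above are 1 and 2 rather than 0 and 1, a two-class loss needs them remapped. A minimal sketch, assuming that index 1 is positive (as printed above) and index 2 is negative, with a toy tensor standing in for `batch.label` and a recent PyTorch release:

```python
import torch

labels = torch.tensor([1, 1, 2, 1])  # toy stand-in for batch.label

# positive (index 1) -> class 1, negative (index 2) -> class 0
targets = (labels == 1).long()
print(targets.tolist())  # [1, 1, 0, 1]
```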


Finally, the Vocab object can be used to map pretrained word vectors to the indices in the vocabulary. This will be very useful for parts 3 and 4 of the problem.

In [68]:
# Build the vocabulary with word embeddings
url = 'https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.simple.vec'
TEXT.vocab.load_vectors(vectors=Vectors('wiki.simple.vec', url=url))

print("Word embeddings size ", TEXT.vocab.vectors.size())
print("Word embedding of 'follows', first 10 dim ", TEXT.vocab.vectors[TEXT.vocab.stoi['follows']][:10])



Word embeddings size  torch.Size([16286, 300])
Word embedding of 'follows', first 10 dim  
 0.3925
-0.4770
 0.1754
-0.0845
 0.1396
 0.3722
-0.0878
-0.2398
 0.0367
 0.2800
[torch.FloatTensor of size 10]
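One way to use such a matrix in a model is to copy it into an `nn.Embedding` layer. The sketch below uses a small random tensor as a stand-in for `TEXT.vocab.vectors` (the sizes are hypothetical) and assumes a recent PyTorch release.

```python
import torch
import torch.nn as nn

# Stand-in for TEXT.vocab.vectors: 6 word types, 4 dimensions (hypothetical sizes).
toy_vectors = torch.randn(6, 4)

embedding = nn.Embedding(toy_vectors.size(0), toy_vectors.size(1))
# Copy in place so that the weight stays a trainable nn.Parameter.
embedding.weight.data.copy_(toy_vectors)

# Look up a [sent length, batch size] batch of word indices.
toy_batch = torch.tensor([[2, 3], [4, 1], [5, 1]])
embedded = embedding(toy_batch)
print(embedded.size())  # torch.Size([3, 2, 4])
```

Whether to keep fine-tuning the copied vectors or freeze them is a modeling choice; setting `embedding.weight.requires_grad = False` would freeze them.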



## Assignment

Now it is your turn to build the models described at the top of the assignment. 

Using the data given by this iterator, you should construct 4 different torch models that take in batch.text and produce a distribution over labels. 
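As an illustration of that interface, here is a rough sketch of a model that maps a `batch.text`-shaped tensor to a distribution over two labels. It is loosely in the spirit of the CBOW model from the list above, not a reference solution: `ToyCBOW`, the vocabulary size, the embedding size and the padding index of 1 are all assumptions made for the example, and it targets a recent PyTorch release.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCBOW(nn.Module):
    """Average the word embeddings of a sentence, then apply a linear layer."""
    def __init__(self, vocab_size, embed_dim, n_classes=2, pad_idx=1):
        super().__init__()
        # padding_idx keeps the <pad> embedding fixed at zero.
        self.embeddings = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.linear = nn.Linear(embed_dim, n_classes)

    def forward(self, text):
        # text: [sent length, batch size] -> embeds: [sent length, batch size, embed dim]
        embeds = self.embeddings(text)
        # Mean over the sentence dimension; <pad> rows are zero vectors
        # but still count toward the length.
        pooled = embeds.mean(dim=0)
        # Log-probabilities over labels, shape [batch size, n_classes].
        return F.log_softmax(self.linear(pooled), dim=1)

model = ToyCBOW(vocab_size=20, embed_dim=8)
toy_text = torch.randint(0, 20, (7, 10))  # stands in for batch.text
log_probs = model(toy_text)
print(log_probs.size())  # torch.Size([10, 2])
```

Returning log-probabilities pairs naturally with `nn.NLLLoss`; a model that returns raw scores would instead be trained with `nn.CrossEntropyLoss`, as the logistic regression in this notebook is.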

When a model is trained, use the following test function to produce predictions, and then upload to the kaggle competition:  https://www.kaggle.com/c/harvard-cs281-hw1

In [71]:
def test(model):
    "All models should be able to be run with following command."
    upload = []
    # Update: for kaggle the bucket iterator needs to have batch_size 10
    test_iter = torchtext.data.BucketIterator(test, train=False, batch_size=10)
    for batch in test_iter:
        # Your prediction data here (don't cheat!)
        probs = model(batch.text)
        _, argmax = probs.max(1)
        upload += list(argmax.data)

    with open("predictions.txt", "w") as f:
        for u in upload:
            f.write(str(u) + "\n")

In addition, you should put up a (short) write-up following the template provided in the repository:  https://github.com/harvard-ml-courses/cs287-s18/blob/master/template/

# 1

In [21]:
alpha = 1 # smoothing; alpha = 0 leaves zero counts, so the log-count ratio would contain log(0) and divisions by zero
p = np.zeros(len(TEXT.vocab)) + alpha
q = np.zeros(len(TEXT.vocab)) + alpha
ngood = 0
nbad = 0

In [22]:
for batch in train_iter:
	for i in range(batch.text.size()[1]):
		x = batch.text.data.numpy()[:,i]
		y = batch.label.data.numpy()[i]
		sparse_x = np.zeros(len(TEXT.vocab))
		for word in x:
			sparse_x[word] = 1 # += 1
		if y == 1:
			p += sparse_x
			ngood += 1
		elif y == 2:
			q += sparse_x
			nbad += 1
		else:
			pass

In [27]:
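# Log-count ratio from Wang and Manning, normalizing the counts by their L1 norm
# (np.linalg.norm defaults to the L2 norm)
r = np.log((p/np.linalg.norm(p, 1))/(q/np.linalg.norm(q, 1)))
# Log ratio of positive to negative training examples
b = np.log(float(ngood)/nbad)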



In [36]:
# model needs to take in a batch.text and return a bs*2 tensor of probs
def predict(text):
	ys = torch.zeros(text.size()[1],2)
	for i in range(text.size()[1]):
		x = text.data.numpy()[:,i]
		sparse_x = np.zeros(len(TEXT.vocab))
		for word in x:
			sparse_x[word] = 1
		y = np.dot(r,sparse_x) + b
		if y > 0:
			ys[i,1] = 1
		else:
			ys[i,0] = 1
	return ys
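The same computation can be checked on a toy problem small enough to verify by hand. This is a sketch of the log-count ratio classifier described by Wang and Manning, with made-up features and labels; the names `X`, `y` and `pred` are introduced only for this example.

```python
import numpy as np

# Toy binarized bag-of-words features: 4 sentences over a 3-word vocabulary.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
y = np.array([1, 1, -1, -1])  # +1 = positive, -1 = negative

alpha = 1.0  # smoothing
p = alpha + X[y == 1].sum(axis=0)   # smoothed positive counts: [3, 2, 1]
q = alpha + X[y == -1].sum(axis=0)  # smoothed negative counts: [1, 2, 3]

# Log-count ratio with L1 normalization, and log ratio of class sizes.
r = np.log((p / p.sum()) / (q / q.sum()))
b = np.log((y == 1).sum() / (y == -1).sum())

# Predict the sign of the linear score for each sentence.
pred = np.sign(X @ r + b)
print(pred.tolist())  # [1.0, 1.0, -1.0, -1.0]
```

Word 0 appears only in positive sentences, so its ratio is positive; word 2 appears only in negative ones, so its ratio is negative; word 1 appears equally in both and gets a ratio of zero.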

In [121]:
# Training

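# Count how often each word occurs with each label.
# Rows are indexed by label id, columns by word id.
mat = np.zeros([3,len(TEXT.vocab)])

cntr = 0
    
for batch in train_iter:
    # Convert once per batch; indexing Variables element by element is very slow
    text = batch.text.data.numpy()
    labels = batch.label.data.numpy()
    
    for i in range(text.shape[1]):
        # np.add.at accumulates correctly when a word repeats within a sentence
        np.add.at(mat, (labels[i], text[:,i]), 1)
            
    cntr += 1
    
    if not cntr % 100:
        print(cntr)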

100
200
...
12100

KeyboardInterrupt: 

In [None]:
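def NBunigram(batchtext, alpha=1.0):
    # Smoothed per-label word log-probabilities computed from the count matrix `mat`
    smoothed = mat + alpha
    logp = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    # Row 0 is the <unk> label, which never occurs in training; never predict it
    logp[0] = -np.inf
    
    text = batchtext.data.numpy()
    probs = []
    
    for i in range(text.shape[1]):
        # Sum log-probabilities instead of multiplying probabilities to avoid underflow
        # (no class prior is included, as in the original version)
        probs.append(logp[:,text[:,i]].sum(axis=1))
        
    # Shape [batch size, 3]: one log-score per label id
    return np.array(probs)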

# 2

In [48]:
learning_rate = 0.001
bs = 10
num_epochs = 12
input_size = len(TEXT.vocab)

In [40]:
class LogisticRegression(nn.Module):
    def __init__(self, input_size):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(input_size, 2)
    
    def forward(self, x):
        out = self.linear(x)
        return out

model = LogisticRegression(input_size)
criterion = nn.CrossEntropyLoss()  
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

In [49]:
for epoch in range(num_epochs):
	ctr = 0
	for batch in train_iter:
		# TODO: is there a better way to sparsify?
		bsz = batch.text.size()[1] # the last batch may be smaller than bs
		sentences = Variable(torch.zeros(bsz,input_size))
		for i in range(bsz):
			x = batch.text.data.numpy()[:,i]
			for word in x:
				sentences[i,word] = 1 # += 1
		# map label indices 1 (positive) and 2 (negative) to classes 1 and 0
		labels = (batch.label==1).type(torch.LongTensor)
		optimizer.zero_grad()
		outputs = model(sentences)
		loss = criterion(outputs, labels)
		loss.backward()
		optimizer.step()
		ctr += 1
		if ctr % 100 == 0:
			print ('Epoch [%d/%d], Iter [%d/%d] Loss: %.4f' 
				%(epoch+1, num_epochs, ctr, len(train)//bs, loss.data[0]))

Epoch [1/12], Iter [100/692] Loss: 0.6881
Epoch [1/12], Iter [200/692] Loss: 0.6956
Epoch [1/12], Iter [300/692] Loss: 0.7239
Epoch [1/12], Iter [400/692] Loss: 0.6681
Epoch [1/12], Iter [500/692] Loss: 0.6972
Epoch [1/12], Iter [600/692] Loss: 0.7245
Epoch [2/12], Iter [700/692] Loss: 0.6843
Epoch [2/12], Iter [800/692] Loss: 0.6959
Epoch [2/12], Iter [900/692] Loss: 0.6603
Epoch [2/12], Iter [1000/692] Loss: 0.7038
Epoch [2/12], Iter [1100/692] Loss: 0.6991
Epoch [2/12], Iter [1200/692] Loss: 0.6993
Epoch [2/12], Iter [1300/692] Loss: 0.6934
Epoch [3/12], Iter [1400/692] Loss: 0.7083
Epoch [3/12], Iter [1500/692] Loss: 0.7116
Epoch [3/12], Iter [1600/692] Loss: 0.6904
Epoch [3/12], Iter [1700/692] Loss: 0.6993
Epoch [3/12], Iter [1800/692] Loss: 0.6716
Epoch [3/12], Iter [1900/692] Loss: 0.6661
Epoch [3/12], Iter [2000/692] Loss: 0.6451
Epoch [4/12], Iter [2100/692] Loss: 0.6495
Epoch [4/12], Iter [2200/692] Loss: 0.6913
Epoch [4/12], Iter [2300/692] Loss: 0.6946
Epoch [4/12], Iter [

In [64]:
correct = 0
total = 0
for batch in val_iter:
	bsz = batch.text.size()[1] # batch size might change
	sentences = Variable(torch.zeros(bsz,input_size))
	for i in range(bsz):
		x = batch.text.data.numpy()[:,i]
		for word in x:
			sentences[i,word] = 1 # += 1
	labels = (batch.label==1).type(torch.LongTensor).data
	# change labels from 1,2 to 1,0
	outputs = model(sentences)
	_, predicted = torch.max(outputs.data, 1)
	total += labels.size(0)
	correct += (predicted == labels).sum()

In [None]:
# accuracy is computed on val_iter, so this is validation accuracy
print('validation accuracy', correct/total)
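On the question of a better way to build the bag-of-words features: one option is to let `scatter_` write all the indicators at once instead of looping over words in Python. This is a sketch under assumptions, not the notebook's method: `bag_of_words` is a hypothetical helper, the toy indices are made up, and the code assumes a recent PyTorch release.

```python
import torch

def bag_of_words(text, vocab_size):
    """Binary indicators of shape [batch size, vocab size] from [sent length, batch size] indices."""
    bow = torch.zeros(text.size(1), vocab_size)
    # Row i of text.t() holds the word indices of sentence i;
    # scatter_ writes a 1 into bow[i, word] for each of them.
    bow.scatter_(1, text.t().contiguous(), 1.0)
    return bow

toy_text = torch.tensor([[2, 4], [3, 4], [2, 1]])  # [sent length 3, batch size 2]
print(bag_of_words(toy_text, vocab_size=5).tolist())
# [[0.0, 0.0, 1.0, 1.0, 0.0], [0.0, 1.0, 0.0, 0.0, 1.0]]
```

Like the explicit loops above, this marks every index that appears in a column, including the padding index, so `<pad>` still shows up as a feature.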

# 3

# 4

In [69]:
filter_window = 3
n_featmaps = 100
bs = 10
dropout_rate = 0.5
num_epochs = 1
learning_rate = 0.001

In [None]:
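import torch.nn.functional as F

class CNN(nn.Module):
	def __init__(self):
		super(CNN, self).__init__()
		vocab_size, embed_dim = TEXT.vocab.vectors.size()
		self.embeddings = nn.Embedding(vocab_size, embed_dim)
		# initialize from the pretrained vectors; copy_ keeps weight an nn.Parameter
		self.embeddings.weight.data.copy_(TEXT.vocab.vectors)
		self.conv = nn.Conv2d(1,n_featmaps,kernel_size=(filter_window,embed_dim))
		self.maxpool = nn.AdaptiveMaxPool1d(1)
		self.linear = nn.Linear(n_featmaps, 2)
		self.dropout = nn.Dropout(dropout_rate)

	def forward(self, inputs):
		# inputs: [sent length, batch size] -> embeds: [batch size, sent length, embed dim]
		embeds = self.embeddings(inputs.t().contiguous())
		# add a channel dimension for Conv2d: [batch size, 1, sent length, embed dim]
		out = F.tanh(self.conv(embeds.unsqueeze(1)))
		# drop the collapsed embedding dimension: [batch size, n_featmaps, sent length - filter_window + 1]
		out = out.squeeze(3)
		# max over time, then flatten to [batch size, n_featmaps]
		out = self.maxpool(out).squeeze(2)
		# dropout on the pooled features, before the output layer
		out = self.dropout(out)
		return self.linear(out)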

# 5