# Homework 1

- Deadline: 4:59pm, Monday, July 9th, 2018
- Name: [Write down your name here]

In this project, you will work on sentiment classification using the Large Movie Review Dataset (http://ai.stanford.edu/~amaas/data/sentiment/). The task is to classify movie reviews into two categories, POSITIVE or NEGATIVE.

You are provided with a training set (TRAIN), a development set (DEV), and a test set (TEST). Your classifier is trained on TRAIN, evaluated and tuned on DEV, and tested on TEST. 

You will build two classifiers in this homework: a naive Bayes classifier and a logistic regression classifier with bag-of-words features. You learned these two models in the lecture. This assignment gives some additional background to help you implement them.

In [1]:
import torch
import torch.utils.data as tud
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import Counter, defaultdict
import operator
import os, math
import numpy as np
import random
import copy
# from nltk import word_tokenize

def word_tokenize(s):
    return s.split()

# set the random seeds so the experiments can be replicated exactly
random.seed(53113)
np.random.seed(53113)
torch.manual_seed(53113)
if torch.cuda.is_available():
    torch.cuda.manual_seed(53113)

# Global class labels.
POS_LABEL = 'pos'
NEG_LABEL = 'neg'     

In [2]:
def load_data(data_file):
    data = []
    with open(data_file, 'r', encoding='utf-8') as fin:  # explicit encoding; assumes the data files are UTF-8
        for line in fin:
            label, content = line.split(",", 1)
            data.append((content.lower(), label))
    return data
data_dir = "large_movie_review_dataset"
train_data = load_data(os.path.join(data_dir, "train.txt"))
dev_data = load_data(os.path.join(data_dir, "dev.txt"))

In [3]:
print("number of TRAIN data", len(train_data))
print("number of DEV data", len(dev_data))

number of TRAIN data 25000
number of DEV data 5000
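
As a small illustration of the parsing in `load_data`: each line is split at the first comma only, so commas inside the review text are preserved. The two sample lines below are made up for this sketch and are not taken from the dataset.

```python
# Hypothetical lines in the "label,content" layout that load_data expects.
lines = [
    "pos,A warm, funny, and moving film.\n",
    "neg,Dull plot, wooden acting.\n",
]

data = []
for line in lines:
    # maxsplit=1 splits only at the first comma, so commas in the review survive.
    label, content = line.split(",", 1)
    data.append((content.lower(), label))

print(data[0])
```

Note that the trailing newline stays in `content`; this is harmless here because `word_tokenize` splits on any whitespace.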


We define a generic model class below. The model has two methods, `train_model` and `classify`.

In [4]:
VOCAB_SIZE = 5000
class Model:
    def __init__(self, data):
        # Keep the (VOCAB_SIZE - 1) most frequent words in the training data; the remaining slot is reserved for UNK
        self.vocab = Counter([word for content, label in data for word in word_tokenize(content)]).most_common(VOCAB_SIZE-1) 
        self.word_to_idx = {k[0]: v+1 for v, k in enumerate(self.vocab)} # word to index mapping
        self.word_to_idx["UNK"] = 0 # all the unknown words will be mapped to index 0
        self.idx_to_word = {v:k for k, v in self.word_to_idx.items()}
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.idx_to_label = [POS_LABEL, NEG_LABEL]
        self.vocab = set(self.word_to_idx.keys())
        
    def train_model(self, data):
        '''
        Train the model with the provided training data
        '''
        raise NotImplementedError
        
    def classify(self, data):
        '''
        classify the documents with the model
        '''
        raise NotImplementedError
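
The vocabulary construction in `Model.__init__` can be traced on a toy corpus. This is a minimal sketch under assumptions: the three documents and the tiny `VOCAB_SIZE` are invented for illustration, and `str.split` stands in for `word_tokenize`.

```python
from collections import Counter

VOCAB_SIZE = 4  # tiny value for illustration; the assignment uses 5000

docs = ["good good movie", "bad movie", "good plot bad acting"]  # toy corpus
counts = Counter(word for doc in docs for word in doc.split())

# Keep the VOCAB_SIZE - 1 most frequent words; index 0 is reserved for "UNK".
most_common = counts.most_common(VOCAB_SIZE - 1)
word_to_idx = {word: i + 1 for i, (word, _) in enumerate(most_common)}
word_to_idx["UNK"] = 0

# Any word outside the vocabulary falls back to index 0.
print(word_to_idx.get("plot", 0))
```

Words that did not make the cut, such as "plot" and "acting" here, all share the UNK index, so their counts are pooled into a single feature.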

# Logistic Regression with Bag of Words

You will implement logistic regression with bag of words features in the following. 
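
To make the model concrete before using PyTorch, here is a plain-Python sketch of what a linear layer followed by cross-entropy loss computes for a single document. The feature vector, weights, and biases are arbitrary numbers chosen for illustration; `nn.Linear` and `nn.CrossEntropyLoss` perform the same arithmetic on batches of tensors.

```python
import math

def linear(bow, weight, bias):
    """Compute one logit per class: weight[c] . bow + bias[c]."""
    return [sum(w * x for w, x in zip(row, bow)) + b for row, b in zip(weight, bias)]

def cross_entropy(logits, target):
    """Negative log of the softmax probability assigned to the target class."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

bow = [2.0, 0.0, 1.0]                            # toy bag-of-words counts
weight = [[0.5, -0.2, 0.1], [-0.5, 0.2, -0.1]]   # one row per class (illustrative values)
bias = [0.0, 0.0]

logits = linear(bow, weight, bias)
loss = cross_entropy(logits, target=0)
print(logits, loss)
```

The loss is small when the logit of the correct class is much larger than the other one, and equals log 2 when both logits are equal.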

In [28]:
class TextClassificationDataset(tud.Dataset):
    '''
    PyTorch provides a common dataset interface. 
    https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    The dataset encodes documents into indices. 
    With the PyTorch dataloader, you can easily get batched data for training and evaluation. 
    '''
    def __init__(self, word_to_idx, data):
        
        self.data = data
        self.word_to_idx = word_to_idx
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.vocab_size = VOCAB_SIZE
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = np.zeros(self.vocab_size)
        
        item = torch.from_numpy(item)
        if len(self.data[idx]) == 2: # in training or evaluation, we have both the document and label
            for word in word_tokenize(self.data[idx][0]):
                item[self.word_to_idx.get(word, 0)] += 1
            label = self.label_to_idx[self.data[idx][1]]
            return item, label
        else: # in testing, we only have the document without label
            for word in word_tokenize(self.data[idx]):
                item[self.word_to_idx.get(word, 0)] += 1
            return item
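
The encoding done in `__getitem__` can be sketched without PyTorch. The helper name `bag_of_words` and the four-word vocabulary below are hypothetical and exist only for this example.

```python
def bag_of_words(text, word_to_idx, vocab_size):
    """Count how often each vocabulary word occurs; unknown words go to index 0."""
    vec = [0] * vocab_size
    for word in text.split():
        vec[word_to_idx.get(word, 0)] += 1
    return vec

word_to_idx = {"UNK": 0, "good": 1, "movie": 2, "bad": 3}  # toy vocabulary
vec = bag_of_words("good movie , really good", word_to_idx, 4)
print(vec)
```

Here "," and "really" are not in the vocabulary, so both are counted at index 0, while "good" is counted twice at index 1.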

In [30]:
best_model = None
class BoWLRClassifier(nn.Module, Model):
    '''
    Define your logistic regression model with bag of words features.
    '''
    def __init__(self, data):
        nn.Module.__init__(self)
        Model.__init__(self, data)
        
        '''
        In this model initialization phase, you will do the following: 
        1. Define a linear layer to transform bag of words features into 2 classes. 
        2. Define the loss function, you will use cross entropy loss
            https://pytorch.org/docs/stable/nn.html?highlight=crossen#torch.nn.CrossEntropyLoss
        3. Define an optimizer for the model, you may choose to use SGD, Adam or other optimizers you know
            https://pytorch.org/docs/stable/optim.html?highlight=sgd#torch.optim.SGD
        '''
        # Linear layer, loss function, and optimizer, as described above.
        self.linear = nn.Linear(VOCAB_SIZE, 2)
        self.loss_fn = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(self.parameters(), lr=0.001, weight_decay=0.01)
        
    def forward(self, bow):
        '''
        Run the model. You may only need to run the linear layer defined in the init function. 
        '''
        return self.linear(bow)
    
    def train_epoch(self, train_data):
        '''
        Train the model for one epoch with the training data
        When training a model, you will repeat the following procedures:
        1. get one batch of features and labels
        2. make a forward pass with the features to get predictions
        3. calculate the loss with the predictions and target labels
        4. run a backward pass from the loss function to get the gradients
        5. apply the optimizer step to update the model parameters
        '''
        dataset = TextClassificationDataset(self.word_to_idx, train_data)
        dataloader = tud.DataLoader(dataset, batch_size=8, shuffle=True)
        self.train()
        for i, (X, y) in enumerate(dataloader):
            X = X.float()
            y = y.long()
            if torch.cuda.is_available():
                X = X.cuda()
                y = y.cuda()
            self.optimizer.zero_grad()
            preds = self.forward(X)
            loss = self.loss_fn(preds, y)
            loss.backward()
            if i % 500 == 0:
                print("loss: {}".format(loss.item()))
            self.optimizer.step()
    
    def train_model(self, train_data, dev_data):
        """
        This function processes the entire training set for multiple epochs.
        After each training epoch, you will evaluate your model on the DEV set. 
        The best performing model on the DEV set shall be saved to best_model
        """  
        global best_model  # without this, the assignment below would only create a local variable
        dev_accs = [0.]
        for epoch in range(2):
            self.train_epoch(train_data)
            dev_acc = self.evaluate(dev_data)
            print("dev acc: {}".format(dev_acc))
            if dev_acc > max(dev_accs):
                best_model = copy.deepcopy(self)
            dev_accs.append(dev_acc)

    def classify(self, docs):
        '''
        This function classifies documents into their categories. 
        docs are documents only, without labels.
        '''
        dataset = TextClassificationDataset(self.word_to_idx, docs)
        dataloader = tud.DataLoader(dataset, batch_size=1, shuffle=False)
        results = []
        with torch.no_grad():
            for i, X in enumerate(dataloader):
                X = X.float()
                if torch.cuda.is_available():
                    X = X.cuda()
                preds = self.forward(X)
                results.append(preds.max(1)[1].cpu().numpy().reshape(-1))
        results = np.concatenate(results)
        results = [self.idx_to_label[p] for p in results]
        return results
                
    def evaluate(self, data):
        '''
        This function evaluates the current model on the data.
        data contains documents and labels.
        It calls the function "classify" to make predictions
        and compares them with the correct labels to return the model accuracy on "data".
        '''
        self.eval()
        preds = self.classify([d[0] for d in data])
        targets = [d[1] for d in data]
        correct = 0.
        total = 0.
        for p, t in zip(preds, targets):
            if p == t: 
                correct += 1
            total += 1
        return correct/total
        

In [31]:
lr_model = BoWLRClassifier(train_data)
if torch.cuda.is_available():
    lr_model = lr_model.cuda()
lr_model.train_model(train_data, dev_data)

loss: 0.7611655592918396
loss: 0.45513585209846497
loss: 0.4699181616306305
loss: 0.6984012126922607
loss: 0.169261172413826
loss: 0.38961079716682434
loss: 0.3471197187900543
dev acc: 0.859
loss: 0.25644347071647644
loss: 0.5411393046379089
loss: 0.4370480477809906
loss: 0.23390710353851318
loss: 0.3318834900856018
loss: 0.22103242576122284
loss: 0.6189835071563721
dev acc: 0.8514


Now spend some time tuning your model. At least try the following:

- try another optimizer
- change the learning rate
- change the number of epochs to train

Report your results and analysis in the writeup. 
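
One way to organize the experiments listed above is a small grid search that tries every combination of settings and records the DEV score of each. The sketch below is only a harness: `grid_search` is a hypothetical helper, and `placeholder_train_and_eval` returns a made-up score so the example runs on its own. In practice you would replace it with a function that builds a `BoWLRClassifier`, trains it with the given settings, and returns its DEV accuracy.

```python
import itertools
import math

def grid_search(train_and_eval, grid):
    """Try every combination in `grid`; return (best_score, best_config, all_results)."""
    keys = sorted(grid)
    results = []
    for values in itertools.product(*(grid[k] for k in keys)):
        config = dict(zip(keys, values))
        results.append((train_and_eval(**config), config))
    best_score, best_config = max(results, key=lambda r: r[0])
    return best_score, best_config, results

def placeholder_train_and_eval(optimizer, lr, epochs):
    # PLACEHOLDER: an arbitrary formula, not a real experiment or a real result.
    # Replace with code that trains the model and returns DEV accuracy.
    return 1.0 - 0.1 * abs(math.log10(lr) + 3) - 0.01 * abs(epochs - 3)

grid = {
    "optimizer": ["sgd", "adam"],
    "lr": [0.1, 0.01, 0.001],
    "epochs": [1, 3, 5],
}
best_score, best_config, results = grid_search(placeholder_train_and_eval, grid)
print(len(results), best_config)
```

Keeping the full `results` list, rather than only the best configuration, makes it easier to report and analyze the trends in the writeup.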

Finally, make predictions on the TEST set, and submit your predictions to our [Kaggle competition page](https://www.kaggle.com/c/mpcs-53113-hw1-logistic-regression).

In [32]:
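# test_data (the unlabeled TEST documents) and write_to_file are assumed to be defined elsewhere in the notebook
preds = lr_model.classify(test_data)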
write_to_file(preds, "lr_test_preds.txt")

Identify the top 10 features with the largest weights for the POSITIVE category. Explain your findings. 

In [33]:
weights = lr_model.linear.weight.data.cpu().numpy()[0]
pos_indices = weights.argsort()[-10:][::-1]
[lr_model.idx_to_word[i] for i in pos_indices]

['excellent',
 'great',
 'wonderful',
 'favorite',
 'amazing',
 'perfect',
 'definitely',
 'best',
 'loved',
 'highly']

Identify the top 10 features with the most negative weights for the POSITIVE category. Explain your findings. 

In [34]:
pos_indices = weights.argsort()[:10]
[lr_model.idx_to_word[i] for i in pos_indices]

['worst',
 'waste',
 'awful',
 'bad',
 'boring',
 'poorly',
 'poor',
 'nothing',
 'bad.',
 'worse']

In [35]:
pos_indices = lr_model.linear.weight.data.cpu().numpy()[0].argsort()[:10][::-1]
idx_to_word = {v:k for k, v in lr_model.word_to_idx.items()}
[idx_to_word[i] for i in pos_indices]

['worse',
 'bad.',
 'nothing',
 'poor',
 'poorly',
 'boring',
 'bad',
 'awful',
 'waste',
 'worst']

Identify the top 10 features with the largest positive weights for the NEGATIVE category. Explain your findings. 

In [64]:
weights = lr_model.linear.weight.data.cpu().numpy()[1]
pos_indices = weights.argsort()[-10:][::-1]
[lr_model.idx_to_word[i] for i in pos_indices]

['waste',
 'worst',
 'awful.',
 'poorly',
 'terrible.',
 'forgettable',
 'fails',
 'horrible.',
 'awful',
 'disappointing']

Identify the top 10 features with the most negative weights for the NEGATIVE category. Explain your findings. 

In [68]:
pos_indices = lr_model.linear.weight.data.cpu().numpy()[1].argsort()[:10][::-1]
idx_to_word = {v:k for k, v in lr_model.word_to_idx.items()}
[idx_to_word[i] for i in pos_indices]

['superbly',
 'excellent',
 'favorite',
 'excellent.',
 'perfect.',
 'refreshing',
 'perfect,',
 'amazing.',
 '8/10',
 'wonderfully']
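
One observation that may help when explaining these lists: with two classes, the softmax probability of POSITIVE depends only on the difference between the two logits, and therefore only on the difference between the two rows of the weight matrix. A word that pushes one row up tends to push the other row down, which is consistent with the overlap between the lists above. The sketch below checks the identity numerically; the logit values are arbitrary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [0.8, -0.3]  # [POSITIVE logit, NEGATIVE logit], arbitrary values

p_pos_softmax = softmax(logits)[0]
p_pos_sigmoid = sigmoid(logits[0] - logits[1])  # uses only the difference
print(p_pos_softmax, p_pos_sigmoid)
```

Both expressions give the same probability, so a two-class softmax classifier behaves like a binary logistic regression whose weight vector is the difference of the two rows.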