# Homework 1

- Due: 11:59pm, April 26, 2019

In this project, you will work on sentiment classification with a logistic regression classifier in Python 3.  Using a large movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/), you will classify movie reviews into two categories, POSITIVE or NEGATIVE. 

You are provided with a training set (TRAIN), a development set (DEV), and a test set (TEST). Your classifier will be trained on TRAIN, evaluated and tuned on DEV, and tested on TEST. 

Using the PyTorch library, you will build the logistic regression classifier with bag of words features.  Some code has been provide  to help get you started.

You need to fill in the missing code, run all cells, and submit this notebook along with a PDF with a writeup on your model tuning results and  solutions to the other problems in Homework 1.

Credits: This assignment and notebook was originally created by Zewei Chu (zeweichu@uchicago.edu)

In [10]:
import torch
import torch.utils.data as tud
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from collections import Counter, defaultdict
import operator
import os, math
import numpy as np
import random
import copy

# Feel free to define your own word_tokenizer instead of this naive 
# implementation. You may also use word_tokenize from nltk library 
# (from nltk import word_tokenize), which works better but slower. 
def word_tokenize(s):
    return s.split()

# set the random seeds so the experiments can be replicated exactly
seed = 30255
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)

# Global class labels.
POS_LABEL = 'pos'
NEG_LABEL = 'neg'     

In [2]:
def load_data(data_file):
    data = []
    with open(data_file,'r') as fin:
        for line in fin:
            label, content = line.split(",", 1)
            data.append((content.lower(), label))
    return data
data_dir = "large_movie_review_dataset"
train_data = load_data(os.path.join(data_dir, "train.txt"))
dev_data = load_data(os.path.join(data_dir, "dev.txt"))

In [3]:
print("number of TRAIN data", len(train_data))
print("number of DEV data", len(dev_data))

number of TRAIN data 25000
number of DEV data 5000


We have defined a generic model class as below. The model has 2 functions, train and classify. 

In [4]:
VOCAB_SIZE = 5000
class Model:
    def __init__(self, data):
        # Vocabulary is a set that stores every word seen in the 
        # training data
        self.vocab = Counter([word for content, label in data 
                              for word in word_tokenize(content)]
                            ).most_common(VOCAB_SIZE-1)
        # word to index mapping
        self.word_to_idx = {k[0]: v+1 for v, k in 
                            enumerate(self.vocab)}
        # all the unknown words will be mapped to index 0
        self.word_to_idx["UNK"] = 0 
        self.idx_to_word = {v:k for k, v in self.word_to_idx.items()}
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.idx_to_label = [POS_LABEL, NEG_LABEL]
        self.vocab = set(self.word_to_idx.keys())
        
    def train_model(self, data):
        '''
        Train the model with the provided training data
        '''
        raise NotImplementedError 

        
    def classify(self, data):
        '''
        Classify the documents with the model
        '''
        raise NotImplementedError

In [5]:
model = Model(train_data)
vocab = Counter([word for content, label in train_data 
                              for word in word_tokenize(content)]
                            ).most_common(VOCAB_SIZE-1)

# Logistic Regression with Bag of Words

(65 points)

You will implement logistic regression with bag of words features. The code template is written with PyTorch. Reading the first two sections of the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html) will give you enough knowledge to code a logistic regression model with PyTorch. 

(When used for deep learning PyTorch code is usually run on GPUs (via the CUDA system).  In this homework, however, we'll use regular CPUs.)


In [6]:
class TextClassificationDataset(tud.Dataset):
    '''
    PyTorch provides a common dataset interface. 
    See https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    The dataset encodes documents into indices. 
    With the PyTorch dataloader, you can easily get batched data for 
    training and evaluation. 
    '''
    def __init__(self, word_to_idx, data):
        
        self.data = data
        self.word_to_idx = word_to_idx
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.vocab_size = VOCAB_SIZE
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = np.zeros(self.vocab_size)
        
        item = torch.from_numpy(item)
        # in training or tuning, we use both the document (review)
        # and its corresponding label
        if len(self.data[idx]) == 2: 
            for word in word_tokenize(self.data[idx][0]):
                item[self.word_to_idx.get(word, 0)] += 1
            label = self.label_to_idx[self.data[idx][1]]
            return item, label
        else: # in testing, we only use the document without label
            for word in word_tokenize(self.data[idx]):
                item[self.word_to_idx.get(word, 0)] += 1
            return item

In [55]:
best_model = None
class BoWLRClassifier(nn.Module, Model):
    '''
    Define your logistic regression model with bag of words features.
    '''
    def __init__(self, train_data):
        nn.Module.__init__(self)
        Model.__init__(self, train_data)
     
        '''
        In this model initialization phase, write code to do the 
        following: 
        1. Define a linear layer to transform bag of words features 
           into 2 classes. 
        2. Define the loss function; use cross entropy loss (see
            https://pytorch.org/docs/stable/nn.html?highlight=crossen#torch.nn.CrossEntropyLoss)
        3. Define an optimizer for the model; choose the Adam optimizer,
           which uses a version of the stochastic gradient descent 
           algorithm. (See https://pytorch.org/docs/stable/optim.html?highlight=sgd#torch.optim.Adam)
        '''
        
        self.linear = nn.Linear(VOCAB_SIZE, 2)
        self.loss = nn.CrossEntropyLoss()
        self.optimizer = optim.Adam(self.parameters())
        self.train_data = TextClassificationDataset(self.word_to_idx, train_data)
        
    def forward(self, bow):
        '''
        Run the linear layer in the model for a single bag of words vector. 
        '''
        # WRITE YOUR CODE HERE
        # (You might be wondering why we don't explicitly have a
        # softmax component in our model. It is included in something
        # defined earlier. In what?)
        # Note: the softmax component is included in the loss function
        
        bow = bow.float()
        return self.linear(bow)
    
    def train_epoch(self):
        '''
        Train the model for one epoch with the training data
        When training a model, you repeat the following procedure:
        1. Get one batch of features and labels
        2. Make a forward pass with the features to get predictions
        3. Calculate the loss with the predictions and target labels
        4. Run a backward pass from the loss function to get the gradients
        5. Apply the optimizer step to update the model paramters
        
        For (1) you will have to understand how the PyTorch dataloader
        functions.
        '''
        # Step 1. Remember that PyTorch accumulates gradients.
        # We need to clear them out before each instance (from tutorial)
        #model.zero_grad()
        
        dataloader = DataLoader(self.train_data, batch_size=8,
                        shuffle=True, num_workers=4)
        
        
        for bows, targets in dataloader:
            prediction = self.forward(bows)
            loss = self.loss(prediction, targets)
            loss.backward()
            self.optimizer.step()

    def classify(self, doc):
        '''
        This function classifies a single document into its categories. 
        the input is a document that has been processed into a bag of words.
        '''
                
        return self.forward(doc)
        
    def evaluate_classifier_accuracy(self, data):
        '''
        This function evaluates the data with the current model. 
        data contains both documents and labels. 
        It calls classify() to make predictions, 
        and compares with the correct labels to return 
        the model accuracy on "data". 
        '''
        
        denom = len(data)
        correct = 0
        
        # need absolute value?
        for bow, target in data:
            prediction = self.classify(bow.float())
            
            if prediction[0] < prediction[1]:
                classification = 1
            else:
                classification = 0
            if classification == target:
                correct += 1
        
        return correct/denom
                
    
    def train_model(self, train_data, dev_data):
        """
        This function processes the entire training set for multiple epochs.
        After each training epoch, evaluate your model on the DEV set. 
        Save the best performing model on the DEV set to best_model
        """  
        dev = TextClassificationDataset(self.word_to_idx, dev_data)
        
        for epoch in range(5):
            self.train_epoch()
            accuracy = self.evaluate_classifier_accuracy(dev)
            print("The accuracy for epoch {} is {}".format(epoch, accuracy))

# zero grad? 

Train the model

In [9]:
len(train_data)

25000

In [56]:
lr_model = BoWLRClassifier(train_data)

In [57]:
lr_model.train_model(train_data, dev_data)

The accuracy for epoch 0 is 0.8072
The accuracy for epoch 1 is 0.823
The accuracy for epoch 2 is 0.8176
The accuracy for epoch 3 is 0.8306
The accuracy for epoch 4 is 0.8054


In [65]:
param_list = []
for param in lr_model.parameters():
    param_list.append(param)

In [82]:
pos = param_list[0][]
val, idx = pos.max(0)
pos

tensor([  0.4454,  -0.4569,  -0.7188,  ..., -13.5333,  14.6355, -14.9937],
       grad_fn=<SelectBackward>)

In [85]:
neg = param_list[0][1]
idx_t = torch.topk(pos, k=10, dim=0)[1]
for idx in idx_t:
    print(lr_model.idx_to_word[idx.item()])

seagal
awful.
lacks
uninteresting
mst3k
forgettable
struggling
disappointing
unconvincing
pathetic.


In [87]:
pos = param_list[0][0]
idx_t = torch.topk(pos, k=10, dim=0)[1]
for idx in idx_t:
    print(lr_model.idx_to_word[idx.item()])

criticism
r
8/10
amazing.
perfect.
tight
excellent.
faced
contrast
rural


In [89]:
pos = param_list[0][0]
idx_t = torch.topk(pos, k=10, dim=0, smallest=True)[1]
for idx in idx_t:
    print(lr_model.idx_to_word[idx.item()])

TypeError: topk() got an unexpected keyword argument 'smallest'

In [None]:
lr_model.train_model(train_data, dev_data)

# Tuning the model

(25 points)

Now tune your model, by experimenting with

- another optimizer
- changing the learning rate
- changing the number of epochs to train
- adding regularization into your optimzer.

Finally evaluate your tuned model on the TEST set.

Report your results in a writeup, and submit that as a
separate PDF file.



In [2]:
# store best model, accuracy, epoch

optimizers = []
best_model = copy.deepcopy(self)

# Feature analysis

(10 points)

Write code for each of the following, and include an analysis of the results in your writup.


- Identify the top 10 features with the maximum weights for POSITIVE category. 

- Identify the top 10 features with the maximum negative weights for POSITIVE category. 

- Identify the top 10 features with the maximum positive weights for NEGATIVE category. 

- Identify the top 10 features with the maximum negative weights for NEGATIVE category. 

In [None]:
# WRITE YOUR CODE HERE FOR FEATURE ANALYSIS