# Homework 1

- Due: 11:59pm, April 26, 2019

In this project, you will work on sentiment classification with a logistic regression classifier in Python 3.  Using a large movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/), you will classify movie reviews into two categories, POSITIVE or NEGATIVE. 

You are provided with a training set (TRAIN), a development set (DEV), and a test set (TEST). Your classifier will be trained on TRAIN, evaluated and tuned on DEV, and tested on TEST. 

Using the PyTorch library, you will build the logistic regression classifier with bag of words features.  Some code has been provide  to help get you started.

You need to fill in the missing code, run all cells, and submit this notebook along with a PDF with a writeup on your model tuning results and  solutions to the other problems in Homework 1.

Credits: This assignment and notebook was originally created by Zewei Chu (zeweichu@uchicago.edu)

In [1]:
import torch
import torch.utils.data as tud
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import Counter, defaultdict
import operator
import os, math
import numpy as np
import random
import copy
import sys
#UnicodeDecodeError

# Feel free to define your own word_tokenizer instead of this naive 
# implementation. You may also use word_tokenize from nltk library 
# (from nltk import word_tokenize), which works better but slower. 
def word_tokenize(s):
    return s.split()

# set the random seeds so the experiments can be replicated exactly
seed = 30255
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)

# Global class labels.
POS_LABEL = 'pos'
NEG_LABEL = 'neg'     

In [2]:
def load_data(data_file):
    data = []
    with open(data_file,'r', encoding=sys.getdefaultencoding()) as fin:
        for line in fin:
            label, content = line.split(",", 1)
            data.append((content.lower(), label))
    return data

data_dir = "large_movie_review_dataset"
train_data = load_data(os.path.join(data_dir, "train.txt"))
dev_data = load_data(os.path.join(data_dir, "dev.txt"))
test_data = load_data(os.path.join(data_dir, "test.txt"))

In [3]:
print("number of TRAIN data", len(train_data))
print("number of DEV data", len(dev_data))
print("number of TEST data", len(test_data))

number of TRAIN data 25000
number of DEV data 5000
number of TEST data 20000


We have defined a generic model class as below. The model has 2 functions, train and classify. 

In [4]:
VOCAB_SIZE = 5000
class Model:
    def __init__(self, data):
        # Vocabulary is a set that stores every word seen in the 
        # training data
        self.vocab = Counter([word for content, label in data 
                              for word in word_tokenize(content)]
                            ).most_common(VOCAB_SIZE-1)
        # word to index mapping
        self.word_to_idx = {k[0]: v+1 for v, k in 
                            enumerate(self.vocab)}
        # all the unknown words will be mapped to index 0
        self.word_to_idx["UNK"] = 0 
        self.idx_to_word = {v:k for k, v in self.word_to_idx.items()}
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.idx_to_label = [POS_LABEL, NEG_LABEL]
        self.vocab = set(self.word_to_idx.keys())
        
    def train_model(self, data):
        '''
        Train the model with the provided training data
        '''
        raise NotImplementedError
        
    def classify(self, data):
        '''
        Classify the documents with the model
        '''
        raise NotImplementedError

# Logistic Regression with Bag of Words

(65 points)

You will implement logistic regression with bag of words features. The code template is written with PyTorch. Reading the first two sections of the [PyTorch tutorial](https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html) will give you enough knowledge to code a logistic regression model with PyTorch. 

(When used for deep learning PyTorch code is usually run on GPUs (via the CUDA system).  In this homework, however, we'll use regular CPUs.)


In [5]:
class TextClassificationDataset(tud.Dataset):
    '''
    PyTorch provides a common dataset interface. 
    See https://pytorch.org/tutorials/beginner/data_loading_tutorial.html
    The dataset encodes documents into indices. 
    With the PyTorch dataloader, you can easily get batched data for 
    training and evaluation. 
    '''
    def __init__(self, word_to_idx, data):
        
        self.data = data
        self.word_to_idx = word_to_idx
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.vocab_size = VOCAB_SIZE
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = np.zeros(self.vocab_size)
        item = torch.from_numpy(item)
        # in training or tuning, we use both the document (review)
        # and its corresponding label
        if len(self.data[idx]) == 2: 
            for word in word_tokenize(self.data[idx][0]):
                item[self.word_to_idx.get(word, 0)] += 1
            label = self.label_to_idx[self.data[idx][1]]
            return item, label
        else: # in testing, we only use the document without label
            for word in word_tokenize(self.data[idx]):
                item[self.word_to_idx.get(word, 0)] += 1
            return item

In [13]:
best_model = None
best_model = None
class BoWLRClassifier(nn.Module, Model):
    '''
    Define your logistic regression model with bag of words features.
    '''
    def __init__(self, data, loss="Cross", optimizer="Adam", learning_rate=1e-3):
        nn.Module.__init__(self)
        Model.__init__(self, data)
        
        '''
        In this model initialization phase, write code to do the 
        following: 
        1. Define a linear layer to transform bag of words features 
           into 2 classes. 
        2. Define the loss function; use cross entropy loss (see
            https://pytorch.org/docs/stable/nn.html?highlight=crossen#torch.nn.CrossEntropyLoss)
        3. Define an optimizer for the model; choose the Adam optimizer,
           which uses a version of the stochastic gradient descent 
           algorithm. (See https://pytorch.org/docs/stable/optim.html?highlight=sgd#torch.optim.Adam)
        '''
        # linear layer
        self.linear = nn.Linear(VOCAB_SIZE, 2)
        
        # define loss function
        if loss == "Cross":
            self.loss_function = nn.CrossEntropyLoss()
        elif loss == "NLL":
            self.loss_function = nn.NLLLoss()
        
        # define optimizer
        if optimizer == "Adam":
            self.optimizer = optim.Adam(self.parameters(), lr=learning_rate)
        elif optimizer == "SGD":
            self.optimizer = optim.SGD(self.parameters(), lr=learning_rate)
        
        
        
    def forward(self, bow):
        '''
        Run the linear layer in the model for a single bag of words vector. 
        '''
        return F.log_softmax(self.linear(bow), dim=1)
    
    def train_epoch(self, train_data):
        '''
        Train the model for one epoch with the training data
        When training a model, you repeat the following procedure:
        1. Get one batch of features and labels
        2. Make a forward pass with the features to get predictions
        3. Calculate the loss with the predictions and target labels
        4. Run a backward pass from the loss function to get the gradients
        5. Apply the optimizer step to update the model paramters
        
        For (1) you will have to understand how the PyTorch dataloader
        functions.
        '''
        indexed_data = TextClassificationDataset(self.word_to_idx, train_data)
        #print(indexed_data)
        #trainloader = tud.DataLoader(indexed_data, batch_size=batch_size, shuffle=True, num_workers=4)
        
        for feature, label in indexed_data:
            log_probs = self.forward(feature.float().view(1,-1))
            loss = self.loss_function(log_probs, torch.LongTensor([label]))
            loss.backward()
            self.optimizer.step()
    
    def classify(self, docs):
        '''
        This function classifies documents into their categories. 
        docs are documents without labels.
        '''
        log_probs = self.forward(docs.view(1, -1).float())
        max_value, max_index = torch.max(log_probs, 1)
        return max_value.item(), max_index.item()
                
    def evaluate_classifier_accuracy(self, data):
        '''
        This function evaluates the data with the current model. 
        data contains both documents and labels. 
        It calls classify() to make predictions, 
        and compares with the correct labels to return 
        the model accuracy on "data". 
        '''
        num_right = 0
        total = 0
        correct_indexes = []
        
        for feature, index in data:
            total += 1
            pred_value, pred_index = self.classify(feature)
            if pred_index == index:
                num_right += 1
                correct_indexes.append(index)

        return num_right/total
    
    def train_model(self, train_data, dev_data, epochs):
        """
        This function processes the entire training set for multiple epochs.
        After each training epoch, evaluate your model on the DEV set. 
        Save the best performing model on the DEV set to best_model
        """ 
        best_acc = 0
        save_list = []
        for epoch in range(epochs):
            print(epoch)
            
            self.train_epoch(train_data)
            print("trained")
        
            indexed_data = TextClassificationDataset(self.word_to_idx, dev_data)
            print("data indexed")
            
            mod_acc = self.evaluate_classifier_accuracy(indexed_data)
            print("accurracy calced")
            
            if mod_acc > best_acc:
                best_acc = mod_acc
                print(best_acc)
                save_list.append((epoch, best_acc*100)) 
                
                best_mod = copy.deepcopy(self)
            
        return best_mod, best_acc
            


Train the model

In [8]:
lr_model = BoWLRClassifier(train_data, "Cross", "Adam", 1e-3)
best_model, best_acc = lr_model.train_model(train_data, dev_data, 3)

0
trained
data indexed
accurracy calced
0.8134
1
trained
data indexed
accurracy calced
0.8196
2
trained
data indexed
accurracy calced


# Tuning the model

(25 points)

Now tune your model, by experimenting with

- another optimizer
- changing the learning rate
- changing the number of epochs to train
- adding regularization into your optimzer.

Finally evaluate your tuned model on the TEST set.

Report your results in a writeup, and submit that as a
separate PDF file.



Change Optimizer

In [14]:
lr_model = BoWLRClassifier(train_data, "Cross", "SGD", 1e-3)
best_model_2, best_acc_2 = lr_model.train_model(train_data, dev_data, 3)

0
trained
data indexed
accurracy calced
0.7804
1
trained
data indexed
accurracy calced
2
trained
data indexed
accurracy calced
0.827


Change Learning Rate

In [15]:
lr_model = BoWLRClassifier(train_data, "Cross", "Adam", 0.01)
best_model_3, best_acc_3 = lr_model.train_model(train_data, dev_data, 3)

0
trained
data indexed
accurracy calced
0.8162
1
trained
data indexed
accurracy calced
0.8274
2
trained
data indexed
accurracy calced


Change Number of Epochs

In [30]:
lr_model = BoWLRClassifier(train_data, "Cross", "Adam", 1e-3)
best_model_4, best_acc_4 = lr_model.train_model(train_data, dev_data, 5)

0
trained
data indexed
accurracy calced
0.8166
1
trained
data indexed
accurracy calced
0.8268
2
trained
data indexed
accurracy calced
3
trained
data indexed
accurracy calced
4
trained
data indexed
accurracy calced
0.831


In [31]:
accs = [best_acc, best_acc_2, best_acc_3, best_acc_4]
mods = [best_model, best_model_2, best_model_3, best_model_4]
high = np.argmax(accs)

Test the accurracy of the chosen model

In [32]:
chosen_model = mods[high]
correct = 0
total = 0

features = [i[0] for i in test_data]
indices = [i[1] for i in test_data]

indexed_features = TextClassificationDataset(chosen_model.word_to_idx, features)
for info in indexed_features:
    val, idx = chosen_model.classify(info)
    classification = chosen_model.idx_to_label[idx]
    if classification == indices[total]:
        correct += 1
    
    total += 1
    
    if total % 1000  == 0:
        print(total)

print(correct)
print(total)
print(correct/total)

1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
16537
20000
0.82685


# Feature analysis

(10 points)

Write code for each of the following, and include an analysis of the results in your writup.


- Identify the top 10 features with the maximum positive weights for POSITIVE category. 

- Identify the top 10 features with the maximum negative weights for POSITIVE category. 

- Identify the top 10 features with the maximum positive weights for NEGATIVE category. 

- Identify the top 10 features with the maximum negative weights for NEGATIVE category. 

In [33]:
mod = chosen_model.state_dict()
pos = mod['linear.weight'][0]
neg = mod['linear.weight'][1]

def get_words(idx, words_to_idx):
    top_words = []
    for word, val in words_to_idx.items():
        if val in idx:
            top_words.append(word)
    return top_words

In [34]:
top_pos_wt_pos_cat_idx = np.argsort(pos)[-10:]
words = get_words(top_pos_wt_pos_cat_idx, chosen_model.word_to_idx)
words

['wonderfully',
 'noir',
 'delightful',
 'excellent,',
 'lonely',
 'perfect.',
 'perfect,',
 'complaint',
 '8/10',
 'can.']

In [35]:
top_neg_wt_pos_cat_idx = np.argsort(pos)[:10]
words = get_words(top_neg_wt_pos_cat_idx, chosen_model.word_to_idx)
words

['redeeming',
 'insult',
 'disappointment',
 'unfunny',
 'horrible.',
 'dull,',
 'wasting',
 'garbage.',
 'unconvincing',
 'pathetic.']

In [36]:
top_pos_wt_neg_cat_idx = np.argsort(neg)[-10:]
words = get_words(top_pos_wt_neg_cat_idx, chosen_model.word_to_idx)
words

['redeeming',
 'insult',
 'disappointment',
 'unfunny',
 'horrible.',
 'dull,',
 'wasting',
 'garbage.',
 'unconvincing',
 'pathetic.']

In [37]:
top_neg_wt_neg_cat_idx = np.argsort(neg)[:10]
words = get_words(top_neg_wt_neg_cat_idx, chosen_model.word_to_idx)
words

['wonderfully',
 'noir',
 'delightful',
 'excellent,',
 'lonely',
 'perfect.',
 'perfect,',
 'complaint',
 '8/10',
 'can.']