# Homework 1 Problem 4

In this programming assignment, you will work on sentiment analysis using logistic regression with bag of words.  To help you quickly get started much of the required code has already been provided.  You primary task is to understand the provided code and fill in the gaps.  In particular, explicit preprocessing code has been provided so help you understand it clearly.  Gaps that you have to fill are marked with "##YOUR CODE HERE ##".

This assignment will get you started with PyTorch.  We strongly recommend reading the first two tutorials on "Deep Learning for NLP with PyTorch" by Robert Guthrie https://pytorch.org/tutorials/beginner/deep_learning_nlp_tutorial.html before you begin this assignment.  

You should also seek our help unhesitatingly.  We want you to learn a reasonable amount of material in a short period of time, and our help will make it easier.  Please post your questions on Piazza or meet with us during office hours.

You will work on part of the large movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/). The task is to classify movie reviews into two categories, POSITIVE or NEGATIVE. 

You are provided with a training set (TRAIN), a development set (DEV), and a test set (TEST). Your classifier is trained on TRAIN, evaluated and tuned on DEV, and tested on TEST.  It is best to work with a small sample training set while you are developing the code, but you should report your results on the full training set.

OPTIONAL EXTENSIONS (Do not submit for grading):
- Try to improve the model's performance by tuning the hyper parameters such as number of epochs, batch size, learning rate, choice of optimizer, etc. Also try to add regularization to the loss function.
- The provided code takes advantage of a GPU (via cuda) if one is available.  If you find that your machine takes 10-15 minutes for one epoch of training, you can optionally run your code on Google collab.  Go to https://colab.research.google.com and log in using your Google account.  To upload a python notebook, click on "Files" dropdown menu and the upload notebook. To use GPU click the “Runtime” dropdown menu. Select “Change runtime type”. Select Python3 from “Runtime type” dropdown menu and choose hardware accelerator as GPU. You can find further instructions on how to use google collab in, e.g., the following pages: https://www.geeksforgeeks.org/how-to-use-google-colab/ and https://towardsdatascience.com/downloading-datasets-into-google-drive-via-google-colab-bcb1b30b0166

In [3]:
import torch
import torch.utils.data as tud
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from collections import Counter
import os
import numpy as np
import random
import copy
import pandas as pd

def word_tokenize(s):
    return s.split()

# set the random seeds so the experiments can be replicated exactly
random.seed(30255)
np.random.seed(30255)
torch.manual_seed(30255)
if torch.cuda.is_available():
    torch.cuda.manual_seed(30255)

# Global class labels.
POS_LABEL = 'pos'
NEG_LABEL = 'neg'  

In [4]:
training = pd.read_csv("training_with_text.csv")

In [9]:
training.columns

Index(['_allegation_id', 'title', 'url', 'text_bad', 'incident_date',
       'category', 'allegation_name', 'tag', 'nudity_penetration',
       'sexual_relations_with_a_minor_', 'sexual_harassment_sexual_remarks',
       'domestic_violence_police_committing_',
       'sexual_humiliation_sexual_extortion_prostitution_sex_work',
       'tasers_baton_aggressive_physical_touch_gun', 'trespass_robbery',
       'biometric_surveillance_fitting_a_description_gang_related_',
       'racial_slurs_xenophobic_remarks_',
       'undocumented_status_asking_for_someone_s_status_calling_ice_',
       'planting_drug_guns', 'neglect_of_duty_failure_to_serve',
       'refusing_to_provide_medical_assistance', 'workplace_harassment',
       '_irrational_aggressive_unstable_', 'suicide_in_jail_improper_care_',
       'dcfs_threats', 'pregnant_women', 'school',
       'searching_patting_down_arresting_minors', 'allegation_id', 'title.1',
       'text_content', 'incident_date.1', 'most_common_category_id',
  

In [12]:
#Look at average of each category

for cat in ['nudity_penetration',
       'sexual_relations_with_a_minor_', 'sexual_harassment_sexual_remarks',
       'domestic_violence_police_committing_',
       'sexual_humiliation_sexual_extortion_prostitution_sex_work',
       'tasers_baton_aggressive_physical_touch_gun', 'trespass_robbery',
       'biometric_surveillance_fitting_a_description_gang_related_',
       'racial_slurs_xenophobic_remarks_',
       'undocumented_status_asking_for_someone_s_status_calling_ice_',
       'planting_drug_guns', 'neglect_of_duty_failure_to_serve',
       'refusing_to_provide_medical_assistance', 'workplace_harassment',
       '_irrational_aggressive_unstable_', 'suicide_in_jail_improper_care_',
       'dcfs_threats', 'pregnant_women', 'school',
       'searching_patting_down_arresting_minors']:
    print(cat)
    print(training[cat].mean())
    print()

nudity_penetration
0.010810810810810811

sexual_relations_with_a_minor_
0.0

sexual_harassment_sexual_remarks
0.010810810810810811

domestic_violence_police_committing_
0.013513513513513514

sexual_humiliation_sexual_extortion_prostitution_sex_work
0.010810810810810811

tasers_baton_aggressive_physical_touch_gun
0.34864864864864864

trespass_robbery
0.16216216216216217

biometric_surveillance_fitting_a_description_gang_related_
0.008108108108108109

racial_slurs_xenophobic_remarks_
0.07567567567567568

undocumented_status_asking_for_someone_s_status_calling_ice_
0.0

planting_drug_guns
0.06504065040650407

neglect_of_duty_failure_to_serve
0.03523035230352303

refusing_to_provide_medical_assistance
0.02168021680216802

workplace_harassment
0.0

_irrational_aggressive_unstable_
0.008130081300813009

suicide_in_jail_improper_care_
0.0027100271002710027

dcfs_threats
0.0

pregnant_women
0.0027100271002710027

school
0.0027100271002710027

searching_patting_down_arresting_minors
0.008130081

In [37]:
#Use tasers_baton_aggressive_physical_touch_gun
training['tasers_baton_aggressive_physical_touch_gun'] = training.apply(lambda x: 'pos' if x['tasers_baton_aggressive_physical_touch_gun'] == 1 else 'neg', axis=1)

In [58]:
training['text_content'] = training['text_content'].fillna('')

In [59]:
from string import digits

def remove_digits(row):
    remove_digits = str.maketrans('', '', digits)
    return row['text_content'].translate(remove_digits)

training['text_content'] = training.apply(remove_digits, axis=1)

In [61]:
split_1 = np.random.rand(len(training)) < 0.8
train_dev = training[split_1]
test_data = training[~split_1]
split_2 = np.random.rand(len(train_dev)) < 0.6
train_data = train_dev[split_2]
dev_data = train_dev[~split_2]

In [62]:
train_data = [(a[1], a[0]) for a in list(train_data[['tasers_baton_aggressive_physical_touch_gun', 'text_content']].to_records(index=False))]
test_data = [(a[1], a[0]) for a in list(test_data[['tasers_baton_aggressive_physical_touch_gun', 'text_content']].to_records(index=False))]
dev_data = [(a[1], a[0]) for a in list(dev_data[['tasers_baton_aggressive_physical_touch_gun', 'text_content']].to_records(index=False))]

In [43]:
print("number of TRAIN data", len(train_data))
print("number of TEST data", len(test_data))
print("number of DEV data", len(dev_data))

number of TRAIN data 182
number of TEST data 71
number of DEV data 117


#### We define a abstract model class as below. The model preforms preprocessing in its __init__ and has 2 functions, train and classify, which are implemented in subclasses.

In [45]:
VOCAB_SIZE = 5000
class Model:
    def __init__(self, data):
        # Vocabulary is a set that stores every word seen in the training data
        self.vocab = Counter([word for content, label in data for word in word_tokenize(content)]).most_common(VOCAB_SIZE-1) 
        self.word_to_idx = {k[0]: v+1 for v, k in enumerate(self.vocab)} # word to index mapping
        self.word_to_idx["UNK"] = 0 # all the unknown words will be mapped to index 0
        self.idx_to_word = {v:k for k, v in self.word_to_idx.items()}
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.idx_to_label = [POS_LABEL, NEG_LABEL]
        self.vocab = set(self.word_to_idx.keys())
        
    def train_model(self, data):
        '''
        Train the model with the provided training data
        '''
        raise NotImplementedError
        
    def classify(self, data):
        '''
        classify the documents with the model
        '''
        raise NotImplementedError

#### When training it helps to process multiple examples in a minibatch.  We shall use dataloading tools provided by PyTorch to create such minibatches.  The following class helps us interface with these PyTorch tools.

You may optionally wish to see a tutorial on these tools: https://pytorch.org/tutorials/beginner/data_loading_tutorial.html 


In [46]:
class TextClassificationDataset(tud.Dataset):
    '''
    Our customized Dataset class.
    '''
    def __init__(self, word_to_idx, data):
        self.data = data
        self.word_to_idx = word_to_idx
        self.label_to_idx = {POS_LABEL: 0, NEG_LABEL: 1}
        self.vocab_size = VOCAB_SIZE
        
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        bowvector = torch.zeros(self.vocab_size)
        for word in word_tokenize(self.data[idx][0]):
            bowvector[self.word_to_idx.get(word, 0)] += 1
        label = self.label_to_idx[self.data[idx][1]]
        return bowvector, label

In [47]:
# We use early stopping to store the model that achieves that best performance
# on DEV during training
best_model = None

class BoWLRClassifier(nn.Module, Model):
    '''
    Define your logistic regression model with bag of words features.
    '''
    def __init__(self, data):
        nn.Module.__init__(self)
        Model.__init__(self, data)
        
        '''
        In this model initialization phase, do the following: 
        1. Define a linear layer to transform bag of words features into 2 classes. 
        2. Define the loss function corresponding to cross entropy loss.
           See https://pytorch.org/docs/stable/nn.html?highlight=crossen#torch.nn.CrossEntropyLoss
        3. Define an optimizer for the model.  Choose SGD or Adam.
            https://pytorch.org/docs/stable/optim.html?highlight=sgd#torch.optim.SGD
        '''
        self.linear = nn.Linear(len(self.word_to_idx), 2)  ##YOUR CODE HERE##
        self.loss_function = nn.CrossEntropyLoss() ##YOUR CODE HERE##
        self.optimizer = optim.SGD(self.parameters(), lr=.1) ##YOUR CODE HERE##
        
    def forward(self, x):
        '''
        Run the model, which is just the linear layer and return the output.
        '''
        return self.linear(x)
    
    def train_epoch(self, train_data):
        '''
        Train the model for one epoch (iterate through each example once).
        '''
        # For each minibatch:
        # make a forward pass to get predictions
        # compute loss using predictions and true y
        # make a backward pass to get gradients
        # update parameters by calling optimizer step
        
        dataset = TextClassificationDataset(self.word_to_idx, train_data)
        dataloader = tud.DataLoader(dataset, batch_size=8)
        self.train()
        for i, (x, y) in enumerate(dataloader):
            x = x.float()
            y = y.long()
            if torch.cuda.is_available():
                x = x.cuda()
                y = y.cuda()
            self.optimizer.zero_grad()
            predictions = self.forward(x) ##YOUR CODE HERE##
            loss = self.loss_function(predictions, y)
            loss.backward()
            if i % 500 == 0:
                print(f"loss at {i}: {loss.item()}")
            self.optimizer.step()
    
    def train_model(self, train_data, dev_data):
        """
        Train for multiple epochs and after each evaluate DEV accuracy.
        Store the model with best DEV accuracy in best_model.
        """
        global best_model
        dev_accuracies = [0.]
        highest_acc = 0
        for epoch in range(10):
            print(f"Epoch {epoch}")
            self.train_epoch(train_data)
            dev_acc = self.evaluate(dev_data)
            print(f"DEV accuracy: {dev_acc}")
            
            # The following code copies the current model to best_model
            # if the DEV accuracy is the best so far
            if dev_acc > highest_acc: ##YOUR CODE HERE##
                highest_acc = dev_acc
                best_model = copy.deepcopy(self)
            dev_accuracies.append(dev_acc)
            
    def evaluate(self, data):
        '''
        Compute the accuracy for data, i.e., the fraction of examples in 
        data for which the current model correctly predicts the class.
        '''
        
        self.eval()
        predictions = self.predict(data)
        ys = [d[1] for d in data]
        
        correct = 0
        for i, prediction in enumerate(predictions):
            if prediction == ys[i]:
                correct += 1
                
        return correct / len(predictions)
        
                

    def predict(self, data):
        '''
        Predict the classes for the examples in data.
        '''
        dataset = TextClassificationDataset(self.word_to_idx, data)
        dataloader = tud.DataLoader(dataset, batch_size=1, shuffle=False)
        results = []
        with torch.no_grad():
            for i, (x, y) in enumerate(dataloader):
                x = x.float()
                if torch.cuda.is_available():
                    x = x.cuda()
                predictions = self.forward(x)
                results.append(predictions.max(1)[1].cpu().numpy().reshape(-1))
        results = np.concatenate(results)
        results = [self.idx_to_label[p] for p in results]
        return results
                
    

In [63]:
train_d = train_data # Uncomment this once your code is working properly
dev_d = dev_data

lr_model = BoWLRClassifier(train_d)
if torch.cuda.is_available():
    lr_model = lr_model.cuda()
lr_model.train_model(train_d, dev_d)
best_model.evaluate(test_data ) # Uncomment this once your code is working properly


Epoch 0


RuntimeError: size mismatch, m1: [8 x 5000], m2: [4999 x 2] at C:\w\1\s\tmp_conda_3.7_100118\conda\conda-bld\pytorch_1579082551706\work\aten\src\TH/generic/THTensorMath.cpp:136

#### Identify the top 10 features with the maximum weights for POSITIVE category. Explain your findings. 

These were the 10 words which had the highest weights indicating label=0 which was POSITIVE, meaning that these words were associated with positive movies.

In [8]:
values, idxs = torch.topk(best_model.state_dict()['linear.weight'][0], 10)

In [9]:
[best_model.idx_to_word[a] for a in idxs.tolist()]

['excellent',
 'great',
 'best',
 'beautiful',
 'job',
 'wonderful',
 'perfect',
 'both',
 'loved',
 'amazing']

#### Identify the top 10 features with the maximum negative weights for POSITIVE category. Explain your findings. 

These were the 10 words which had the lowest weights indicating label=0 which was POSITIVE, meaning that these words were least associated with positive movies.

In [10]:
values, idxs = torch.topk(best_model.state_dict()['linear.weight'][0], 10, largest=False)

In [11]:
[best_model.idx_to_word[a] for a in idxs.tolist()]

['worst',
 'waste',
 'poor',
 'nothing',
 'bad',
 'script',
 'poorly',
 'boring',
 'awful',
 'supposed']