<a href="https://colab.research.google.com/github/emmarogge/cs280r/blob/master/key_classification_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: Text classification with Colab and PyTorch

Emma Rogge, Tasha Schoenstein & Zilin Ma

## Set up

### Import relevant libraries and dependencies

In [2]:
import torch
import torch.nn as nn
from torch import optim
from torchtext import data
import math
from matplotlib import pyplot as plt
import os
from collections import Counter

## GPU check, make sure to set runtime type to "GPU"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (device)

cpu


### Read in data files from GitHub

In [3]:
!wget https://raw.githubusercontent.com/sriniiyer/nl2sql/master/data/atis/train.nl
!wget https://raw.githubusercontent.com/sriniiyer/nl2sql/master/data/atis/train.sql
!wget https://raw.githubusercontent.com/sriniiyer/nl2sql/master/data/atis/test.nl
!wget https://raw.githubusercontent.com/sriniiyer/nl2sql/master/data/atis/test.sql

--2019-12-10 19:46:20--  https://raw.githubusercontent.com/sriniiyer/nl2sql/master/data/atis/train.nl
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 281581 (275K) [text/plain]
Saving to: ‘train.nl’


2019-12-10 19:46:20 (5.52 MB/s) - ‘train.nl’ saved [281581/281581]

--2019-12-10 19:46:21--  https://raw.githubusercontent.com/sriniiyer/nl2sql/master/data/atis/train.sql
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2840349 (2.7M) [text/plain]
Saving to: ‘train.sql’


2019-12-10 19:46:21 (27.1 MB/s) - ‘train.sql’ saved [2840349/2840349]

--

## Data format

We're going to use `torchtext` to handle processing the data. This library is useful for processing and batching text data in Python. More information on `torchtext` can be found [in this tutorial](https://mlexplained.com/2018/02/08/a-comprehensive-tutorial-to-torchtext/).

To begin, we create two instances of torchtext's [`data.Field`](https://torchtext.readthedocs.io/en/latest/data.html#fields) class: one for the input (the natural-language queries) and one for the output (the SQL query intent labels).
A `Field` defines a common text-processing datatype, together with instructions for converting it to a tensor.

HINT: Call the `preprocess` method of your text field object in your dataset implementation. With `tokenize="spacy"` and `lower=True`, this method tokenizes the raw text of an example using `spacy` and lowercases it.

In [0]:
# We set `batch_first` = True so that tensors are produced with the batch dimension first.
TEXT = data.Field(lower=True, sequential=True, include_lengths=False, batch_first=True, tokenize="spacy") 
LABEL = data.Field(batch_first=True, sequential=False, unk_token=None)

### Implement torchtext Dataset
Implement the class below to prepare the data for classification. It is highly recommended that you make use of the [`Example class`](https://github.com/pytorch/text/blob/master/torchtext/data/example.py) to store each corresponding text and label.

#### Hints:
- Start by populating a list with each processed, tokenized query in your dataset.
- Each example in your list should then have its `label` field populated with the appropriate label.
- Leverage the `__init__` method of the parent class, torchtext's `data.Dataset`, which is itself a subclass of [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset).
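The hints above can be sketched without torchtext as follows. This is only an illustration, not the required implementation: the helper name `label_from_query`, the plain dictionaries standing in for `Example` objects, and the two sample query pairs are hypothetical (they are not copied from the ATIS files), and whitespace splitting stands in for `spacy` tokenization.

```python
def label_from_query(query):
    """Return the first selected column of a SQL query as its intent label."""
    parts = query.split()
    # Skip the optional DISTINCT keyword that may follow SELECT
    label = parts[2] if parts[1] == 'DISTINCT' else parts[1]
    # Drop a table alias such as "flight_1." if one is present
    return label.split('.')[-1]

# Hypothetical parallel lines, standing in for the .nl and .sql files
nl_lines = ["Show me flights from Boston to Denver",
            "What is the cheapest fare to Dallas"]
sql_lines = ["SELECT DISTINCT flight_1.flight_id FROM flight flight_1",
             "SELECT fare_1.fare_id FROM fare fare_1"]

# Pair each tokenized query with the label derived from its SQL
examples = []
for text, query in zip(nl_lines, sql_lines):
    examples.append({'text': text.lower().split(),
                     'label': label_from_query(query)})

for ex in examples:
    print(ex['label'], ex['text'])
```

Iterating over both files with `zip` keeps each query aligned with its label without indexing into the list a second time.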

---

In [0]:
## Convert to standard format
class ATIS(data.Dataset):
    dirname = 'data'
    name = 'atis'

    @staticmethod
    def sort_key(ex):
        return len(ex.text)

    def __init__(self, path, text_field, label_field, **kwargs):
        """Create an ATIS dataset instance given a path and fields.
        Arguments:
            path: Path to the data file
            text_field: The field that will be used for text data.
            label_field: The field that will be used for label data.
            Remaining keyword arguments: Passed to the constructor of
                data.Dataset.
        """
        fields = [('text', text_field), ('label', label_field)]
        
        examples = []
        # Get text
        with open(path+'.nl', 'r') as f:
            for line in f:
                ex = data.Example()
                # Preprocess automatically does spacy tokenization
                ex.text = text_field.preprocess(line.strip()) 
                examples.append(ex)
        
        # Get labels
        with open(path+'.sql', 'r') as f:
            for i, line in enumerate(f):
                label = self._get_label_from_query(line.strip())
                examples[i].label = label
                
        super(ATIS, self).__init__(examples, fields, **kwargs)
    
    # Simple function to get question labels from query
    def _get_label_from_query(self, query):
        parts = query.split(' ')
        if parts[1] == 'DISTINCT':
            label = parts[2]
        else:
            label = parts[1]
        
        if '.' in label:
            label = label.split('.')[-1]
        
        return label

    @classmethod
    def splits(cls, text_field, label_field, path='./',
               train='train', validation='dev', test='test',
               **kwargs):
        """Create dataset objects for splits of the ATIS dataset.
        Arguments:
            text_field: The field that will be used for the sentence.
            label_field: The field that will be used for label data.
            path: The directory containing the data files. Default: './'.
            train: The filename prefix of the train data (the '.nl' and
                '.sql' extensions are appended). Default: 'train'.
            validation: The filename prefix of the validation data, or None
                to not load the validation set. Default: 'dev'.
            test: The filename prefix of the test data, or None to not load
                the test set. Default: 'test'.
            Remaining keyword arguments: Passed to the splits method of
                Dataset.
        """

        train_data = None if train is None else cls(
            os.path.join(path, train), text_field, label_field, **kwargs)
        val_data = None if validation is None else cls(
            os.path.join(path, validation), text_field, label_field, **kwargs)
        test_data = None if test is None else cls(
            os.path.join(path, test), text_field, label_field, **kwargs)
        return tuple(d for d in (train_data, val_data, test_data)
                     if d is not None)

### Implement torchtext Iterators





We will use the `ATIS.splits` class method to build the `ATIS` instances for the train and test data. This method loads up to three pre-split subsets (train, validation and test) and returns only those that were requested; passing `validation=None` below yields just the train and test sets.

In [0]:
# Make splits for data
train_data, test_data = ATIS.splits(TEXT, LABEL, validation=None)

# Build vocabulary for data fields
TEXT.build_vocab(train_data)
LABEL.build_vocab(train_data)

Once the data is processed we build the vocabulary and then construct iterators which loop over the datasets in batches. This will be important for SGD for logistic regression and for other models later in the course.
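To make the idea of batching concrete, here is a simplified sketch in plain Python. It only approximates what a bucketing iterator does (group examples of similar length so that little padding is needed); the function name `make_batches` and the sample queries are hypothetical, and the real iterators also shuffle and convert tokens to index tensors.

```python
def make_batches(examples, batch_size, pad_token='<pad>'):
    """Group token lists of similar length into padded batches."""
    # Sorting by length keeps the padding within each batch small
    ordered = sorted(examples, key=len)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start:start + batch_size]
        max_len = max(len(tokens) for tokens in chunk)
        # Pad every example in the batch to the same length
        batches.append([tokens + [pad_token] * (max_len - len(tokens))
                        for tokens in chunk])
    return batches

queries = [['list', 'flights'],
           ['show', 'me', 'all', 'fares', 'to', 'denver'],
           ['flights', 'to', 'boston'],
           ['what', 'is', 'the', 'cheapest', 'fare']]
batches = make_batches(queries, batch_size=2)
for batch in batches:
    print(batch)
```

Because every row in a batch has the same length, a batch can be stored as a single rectangular tensor.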

In [0]:
# Make iterator for splits
BATCH_SIZE = 32
train_iter = data.BucketIterator(
    train_data,
    batch_size=BATCH_SIZE,
    device=device)

test_iter = data.Iterator(test_data, batch_size=BATCH_SIZE, sort=False, device=device)

### Bag-of-Words Text Representation
Your Naive Bayes, logistic regression and MLP classifiers will use a bag-of-words representation for the data. The `torchtext` iterators output batches of tokenized natural language (as tensors of vocabulary indices), which you must convert into a bag-of-words representation.

---

HINT: Your vocabulary should be derived ONLY from your training set, not your entire dataset. Make certain that your bag-of-words representations account for this. You may have unknown words in your test set and we leave it up to you to decide the best way of handling this.
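As a rough illustration of this hint, the sketch below builds a vocabulary from training texts only and maps any unseen test word to a shared unknown-word slot. This is one possible choice, not the only one; the names `build_vocab` and `to_bow` and the sample sentences are hypothetical, and plain lists stand in for tensors.

```python
from collections import Counter

def build_vocab(train_texts, unk_token='<unk>'):
    """Map each training word to an index; index 0 is reserved for unknowns."""
    vocab = {unk_token: 0}
    for tokens in train_texts:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab, unk_token='<unk>'):
    """Return a count vector with one entry per vocabulary word."""
    bow = [0] * len(vocab)
    for token, count in Counter(tokens).items():
        # Words never seen in training fall back to the unknown index
        bow[vocab.get(token, vocab[unk_token])] += count
    return bow

train_texts = [['flights', 'to', 'boston'], ['fares', 'to', 'denver']]
vocab = build_vocab(train_texts)
print(to_bow(['flights', 'to', 'to', 'denver'], vocab))   # all words known
print(to_bow(['flights', 'to', 'seattle'], vocab))        # 'seattle' is unseen
```

The resulting vector always has length `len(vocab)`, regardless of how long the query is or whether it contains unseen words.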

Additionally, you may find the following methods for tensor manipulation useful to ensure your tensors are of the appropriate dimensions.



* https://pytorch.org/docs/stable/torch.html#torch.unsqueeze
* https://pytorch.org/docs/stable/tensors.html#torch.Tensor.scatter_
* https://pytorch.org/docs/stable/torch.html#torch.sum



In [8]:
# Compute size of vocabulary
vocab_size = len(TEXT.vocab.itos)
labels_size = len(LABEL.vocab.itos)
print("Size of vocab: {}".format(vocab_size))

# Given a batch, provide a bag-of-words vector.
def batch_to_bow(batch, vocab_size):
    # Create a tensor with dimensions (number of examples, max example length, vocab size)
    batch_one_hot = torch.zeros(batch.shape[0], batch.shape[1], vocab_size).to(batch.device)
    batch_one_hot.scatter_(2, batch.unsqueeze(2), 1)
    batch_bow = batch_one_hot.sum(1)
    return batch_bow

Size of vocab: 862


## Establish a majority baseline

By defining a lower bound on performance, we know at minimum what to expect from any reasonable system. A simple baseline for classification tasks is to measure the accuracy of prediction when the most common class is always predicted. 

**Write code in the cell below that, given train and test data, prints information concerning the majority baseline.**

HINT: Use the `Counter` class from Python's `collections` library.
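For instance, on a handful of made-up labels (not taken from the dataset), the baseline can be sketched as below. The function name `majority_accuracy` and the label lists are hypothetical.

```python
from collections import Counter

def majority_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return majority_label, hits / len(test_labels)

train_labels = ['flight_id', 'flight_id', 'fare_id', 'flight_id', 'airline_code']
test_labels = ['flight_id', 'fare_id', 'flight_id', 'flight_id']
print(majority_accuracy(train_labels, test_labels))
```

Note that the majority label is chosen from the training labels only and then scored against the test labels.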

In [9]:
def majority_baseline_accuracy(train, test):
  # Find majority on training data
  counts = Counter()
  for ex in train:
      counts[ex.label] += 1

  most_common = counts.most_common(1)[0][0]
  print("Most common label: {}".format(most_common))
  # Evaluate accuracy on test data
  test_counts = Counter()
  for ex in test:
      test_counts[ex.label] += 1

  total_count = sum(test_counts.values())
  most_common_count = test_counts[most_common]
  print('Count of most common label:', most_common_count)
  print('Count of total things labelled:', total_count)
  print('Portion of labels that are the most common one:', most_common_count/total_count)

# Call method on train, test data 
majority_baseline_accuracy(train_data, test_data)

Most common label: flight_id
Count of most common label: 306
Count of total things labelled: 448
Portion of labels that are the most common one: 0.6830357142857143


# Naive Bayes

Naive Bayes classification is based on the "naive" assumption that all features are conditionally independent given the label. This dramatically reduces the number of parameters required for Bayesian classification, which applies Bayes' theorem to known quantities ($P(Y)$, $P(X)$ and $P(X_i|Y)$) to obtain the desired unknown probability $P(Y|X)$. This probability is evaluated for each possible label, and the label with the greatest probability is the prediction for a given text.

---
Let $c_{NB}$ be the label that naive Bayes predicts for a given text. We compute it by scoring every label $c$ with the log of its overall (prior) probability plus the log probability of each word in the text given that label, and then taking the highest-scoring label:
$$ c_{NB} = \text{argmax}_{c \in C} \left( \log P(c) + \sum_{w \in W}\log P(w|c) \right) $$

where $c_{NB}$ is the naive Bayes classification of a bag of words, $C$ is the set of possible labels, and $W$ is the bag of words.
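The argmax can be sketched in a few lines of plain Python. The probability values below are invented purely for illustration, and the function name `naive_bayes_predict` is hypothetical; words with no stored probability are simply skipped, which is one of several possible ways to treat unknown words.

```python
import math

def naive_bayes_predict(tokens, log_prior, log_likelihood):
    """Return the label maximizing log P(c) + sum of log P(w | c)."""
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for token in tokens:
            # Words without a stored probability are skipped
            if token in log_likelihood[label]:
                score += log_likelihood[label][token]
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical probabilities for two labels and three words
log_prior = {'flight_id': math.log(0.7), 'fare_id': math.log(0.3)}
log_likelihood = {
    'flight_id': {'flights': math.log(0.5), 'cheapest': math.log(0.1),
                  'fare': math.log(0.1)},
    'fare_id': {'flights': math.log(0.1), 'cheapest': math.log(0.4),
                'fare': math.log(0.4)},
}
prediction, scores = naive_bayes_predict(['cheapest', 'fare'],
                                         log_prior, log_likelihood)
print(prediction, scores)
```

Summing logarithms rather than multiplying probabilities avoids numerical underflow on long texts and does not change which label wins.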

We can calculate $P(c) = \frac{N_c}{N}$ where $N$ is the total number of data points in our training data and $N_c$ is the total number of data points in our training data with classification $c$. 

We can calculate $P(w_0 | c)$ using Laplace smoothing such that $$P(w_0 | c) = \frac{count(w_0, c) + 1}{\left( \sum_{w \in V} count(w,c)\right) + |V|}$$ where $V$ is the vocabulary.
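As a small worked example of the smoothed estimate, the sketch below computes $P(w|c)$ for a toy vocabulary of four words. The function name `smoothed_likelihoods` and the sample texts are hypothetical.

```python
from collections import Counter

def smoothed_likelihoods(texts_for_label, vocab):
    """Estimate P(w | c) for every vocabulary word with add-one smoothing."""
    counts = Counter(token for tokens in texts_for_label for token in tokens)
    # Denominator: total count of vocabulary words in this class, plus |V|
    total = sum(counts[word] for word in vocab)
    return {word: (counts[word] + 1) / (total + len(vocab)) for word in vocab}

vocab = ['flights', 'to', 'boston', 'fare']
texts_for_label = [['flights', 'to', 'boston'],
                   ['flights', 'to', 'boston', 'to']]
probs = smoothed_likelihoods(texts_for_label, vocab)
print(probs)
```

Here the class contains 7 tokens and $|V| = 4$, so `'to'` (seen 3 times) gets $4/11$ while `'fare'` (never seen) still gets $1/11$ rather than zero, and the four probabilities sum to 1.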

----
## Below, implement the NaiveBayes class methods.

1. `train`: Populates the log probabilities table to contain $\log P(c)$ for each label and $\log P(w_i|c)$ for each label and each word in the vocabulary.
2. `evaluate_performance`: Evaluates the performance of the model on a given dataset and prints the accuracy.

In [0]:
class NaiveBayes():
    def __init__ (self, text, label):
        self.text = text
        self.label = label
        self._vocab_size = len(text.vocab.stoi)
        self._num_labels = len(label.vocab.stoi)
        self.log_probs = {}
    
    ''' 
    Helper methods which students may or may not choose to implement.
    '''
    def get_vocab_size(self):
        return self._vocab_size

    def get_vocab(self):
        return self.text.vocab.stoi

    def get_num_labels(self):
        return self._num_labels

    def get_labels(self):
        return self.label.vocab.stoi

    def train(self, data):
        """
        Populates log probabilities table for training data.
        """
        vocab = self.get_vocab()
        labels = self.get_labels()
        for label in labels:
            self.log_probs[label] = {}

            # Calculate the log prior (log P(c)); N is the number of training examples
            label_examples = [ex for ex in data.examples if ex.label == label]
            N = len(data.examples)
            Nc = len(label_examples)
            self.log_probs[label]['log_prior'] = math.log(Nc / N)

            # Count every token occurring in the examples with this label
            token_counts = Counter(
                token for ex in label_examples for token in ex.text)
            total_tokens = sum(token_counts.values())

            # Calculate the log likelihood (log P(w | c)) with Laplace smoothing
            # for every word in the vocabulary
            self.log_probs[label]['log_likelihood'] = {}
            for word in vocab:
                Pwc = (token_counts[word] + 1) / (total_tokens + len(vocab))
                self.log_probs[label]['log_likelihood'][word] = math.log(Pwc)
    
    def evaluate_performance(self, dataset):
      """
      Takes a dataset and prints the model's performance on that dataset.
      """
      # Count the number of correct guesses
      correct_guesses = 0
      vocab = self.get_vocab()
      for example in dataset.examples:
       # For each example, find the score of each label 
        scores = {}
        for label in self.log_probs:
          class_prediction = self.log_probs[label]['log_prior']
          for word in example.text:
            if word in vocab:
              class_prediction += self.log_probs[label]['log_likelihood'][word]
          scores[label] = class_prediction

        # Find the maximum score to determine our guess for the label
        argmax = max(scores, key=scores.get)

        # If it matches the actual label, we guessed correctly!
        if argmax == example.label:
          correct_guesses += 1

      # Print our accuracy
      print('Accuracy: ', correct_guesses / len(dataset.examples))
      return correct_guesses/len(dataset.examples)

## Putting it all together

If you have implemented the class methods, the following should result in a trained model.

In [15]:
# Instantiate and train classifier
nb_classifier = NaiveBayes(TEXT, LABEL)
nb_classifier.train(train_data)

# Evaluate model performance
print("Train: ")
nb_classifier.evaluate_performance(train_data)
print("Test: ")
nb_classifier.evaluate_performance(test_data)

Train: 
Accuracy:  0.8942680977392099
Test: 
Accuracy:  0.84375


0.84375

# Logistic Regression

Unlike Naive Bayes, logistic regression models the conditional probabilities directly. Let $c\in C$ be a label, $\mathbf{w}$ be a bag-of-words representation of a natural language query, $\mathbf{d}$ be the model weights, and $f(\mathbf{w},c,\mathbf{d})$ be a score for the compatibility of $c$ and $\mathbf{w}$, computed as the linear function $\mathbf{d}^T \mathbf{w}$ using the weights associated with $c$. Applying the softmax to these scores gives:
$$ p(c|\mathbf{w}, \mathbf{d})= \frac{\exp (f(\mathbf{w},c,\mathbf{d}))}{\sum_{c'\in C}\exp(f(\mathbf{w},c',\mathbf{d}))}. $$
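A minimal numeric sketch of this formula follows, assuming one weight vector per label; the weights, the input vector, and the function names are hypothetical. The last function shows the negative log-probability of the correct label for a single example, which is the quantity a cross-entropy loss measures.

```python
import math

def label_scores(bow, weights):
    """One linear score per label: the dot product of its weights and the input."""
    return [sum(d * w for d, w in zip(row, bow)) for row in weights]

def softmax(scores):
    """Convert a list of real-valued scores into probabilities."""
    # Subtracting the maximum avoids overflow without changing the result
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[target])

bow = [1, 0, 2]
weights = [[0.5, 0.0, 1.0],    # hypothetical weights for label 0
           [0.0, 1.0, -1.0]]   # hypothetical weights for label 1
probs = softmax(label_scores(bow, weights))
print(probs)
print(cross_entropy(probs, target=0), cross_entropy(probs, target=1))
```

The loss is small when the model puts most of its probability on the correct label and large otherwise. PyTorch's `nn.CrossEntropyLoss` applies the log-softmax internally, so a model trained with it should output raw scores rather than probabilities.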

The weights are learned during training by using a loss function (here, the cross-entropy loss) to compare the results produced by the current version of the model with the target results.

---
### Below, implement the LogisticRegression class methods.
1. `__init__`: Takes the LABEL and TEXT `data.Field` instances and initializes the model.
2. `forward`: Given an input tensor, performs the forward step of the logistic regression.

HINT: Your LogisticRegression implementation may inherit from PyTorch's [nn.Module](https://pytorch.org/docs/stable/nn.html#module) class.

In [0]:
class LogisticRegression(nn.Module):
    def __init__ (self, label, text):
        super (LogisticRegression, self).__init__ ()
        self._num_labels = len(label.vocab.itos)
        self._vocab_size = len(text.vocab.itos)
        # Linear layer
        self.fc = nn.Linear(self._vocab_size, self._num_labels)
    
    def get_num_labels(self):
        return self._num_labels

    def get_vocab_size(self):
        return self._vocab_size

    def forward (self, input):
        # Apply the linear layer
        output = self.fc(input) # Batch size by number of labels
        return output

### Implement the method `train` for the LogisticRegression model.
<b>Parameters:</b> LogisticRegression model, data iterator, criterion, optimizer and # of epochs.

Trains the model for n epochs with provided optimizer and learning rate.


In [0]:
def train (model, data_iter, criterion, optim, n_epochs = 8):
    loss_values = []
    epochs = []
    for epoch in range (n_epochs):
        c_num = 0
        total = 0
        running_loss = 0.0
        for index, batch in enumerate(data_iter):
            # Zero parameter gradients
            optim.zero_grad()

            # Input and target
            input = batch_to_bow(batch.text, model.get_vocab_size())
            target = batch.label.long()

            # Feed the input to the model to get one score per label
            scores = model(input)

            # Compute the loss
            loss = criterion(scores, target)

            # Perform backpropagation
            loss.backward()
            optim.step()

            # Prepare to compute the accuracy
            predictions = torch.argmax(scores, dim = 1)
            total += len(target)
            c_num += (predictions == target).sum().item()        
            running_loss += loss.item() * len(input)
        
            # Report the loss every 200 steps
            if index % 200 == 0:
                print ('Epoch :', epoch,
                        'Step: ', index,
                        'Loss: ', loss.item(),
                        'Accuracy:', float (c_num)/total)
        # Average the loss over the examples seen in this epoch
        epoch_loss = running_loss / total
        loss_values.append(epoch_loss)
        epochs.append(epoch)

    p0 = plt.figure(0)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.plot(epochs, loss_values)
    plt.title("Epoch vs Loss")
    p0.show()

### Implement the method `evaluate_performance`.
This method takes a model & dataset, and returns the accuracy of the model on the dataset.
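The bookkeeping involved can be sketched in plain Python, with nested lists standing in for score tensors. The function name `accuracy` and the sample scores are hypothetical. The key point is that the correct and total counts are accumulated inside the loop over batches, so every batch contributes to the final figure.

```python
def accuracy(batches):
    """batches: iterable of (scores, targets); scores holds one row per example."""
    correct = 0
    total = 0
    for scores, targets in batches:
        for row, target in zip(scores, targets):
            # The predicted label is the index of the highest score
            prediction = max(range(len(row)), key=row.__getitem__)
            correct += int(prediction == target)
            total += 1
    return correct / total

batches = [([[2.0, 0.1], [0.3, 0.9]], [0, 1]),   # both predictions correct
           ([[0.2, 0.4]], [0])]                  # prediction is 1, target is 0
print(accuracy(batches))
```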

In [0]:
def evaluate_performance(model, data_iter):
    # Turn on eval mode
    model.eval()
    c_num = 0
    total = 0
    with torch.no_grad():
        for index, batch in enumerate(data_iter):
            # Input and target
            input = batch_to_bow(batch.text, model.get_vocab_size())
            target = batch.label.long()

            # Feed the input to the model to get one score per label
            scores = model(input)

            # Determine the index of the maximum value for each test item
            predictions = torch.argmax(scores, dim=1)

            # Accumulate counts over every batch to compute the accuracy
            total += len(target)
            c_num += (predictions == target).sum().item()

    # Return the accuracy
    return float (c_num)/total

## Putting it all together

If you have implemented the LogisticRegression class methods, the following should result in a trained model.

In [0]:
# Instantiate classifier
logreg_model = LogisticRegression(LABEL, TEXT).to(device) 
print(logreg_model)

# Build criterion (loss), optimizer
loss = nn.CrossEntropyLoss()
learning_rate = 0.01
optimizer = torch.optim.Adam(logreg_model.parameters(), lr=learning_rate)

# Train classifier model on training split
epochs = []
train_acc = []
test_acc = []
for n in range (5, 15):
    # Train model for n more epochs (the model is not re-initialized, so training accumulates)
    train(logreg_model, train_iter, loss, optimizer, n)
    epochs.append(n)
    # Evaluate model performance on training, test splits
    train_acc.append(evaluate_performance(logreg_model, train_iter))
    test_acc.append(evaluate_performance(logreg_model, test_iter))

# Graph epochs vs accuracy for train, test
p1 = plt.figure(1)
plt.title("Epochs vs Accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
train_line = plt.plot(epochs, train_acc, 'b', label="Train")
test_line = plt.plot(epochs, test_acc, 'r', label="Test")
plt.legend()
p1.show()

# Multilayer Perceptron

A multilayer perceptron (MLP) is composed of at least three fully connected layers of nodes: an input layer, one or more "hidden" layers, and an output layer. Learning occurs by adjusting the connection weights between nodes based on the amount of error in the output compared to the true label.
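To illustrate how an input flows through these layers, here is a tiny forward pass in plain Python. It is only a sketch: the weights are made up, the helper names are hypothetical, and the ReLU and log-softmax steps are chosen to mirror the layer types used in the `MultiLayerPerceptron` class later in this notebook.

```python
import math

def linear(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Replace negative activations with zero."""
    return [max(0.0, v) for v in values]

def log_softmax(values):
    """Log of the softmax, computed in a numerically stable way."""
    shift = max(values)
    log_norm = shift + math.log(sum(math.exp(v - shift) for v in values))
    return [v - log_norm for v in values]

def mlp_forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer with ReLU -> output layer with log-softmax."""
    hidden = relu(linear(inputs, w_hidden, b_hidden))
    return log_softmax(linear(hidden, w_out, b_out))

# Hypothetical weights: 3 inputs, 2 hidden nodes, 2 output labels
w_hidden = [[1.0, -1.0, 0.5], [-0.5, 0.5, 1.0]]
b_hidden = [0.0, 0.1]
w_out = [[1.0, -1.0], [-1.0, 1.0]]
b_out = [0.0, 0.0]
log_probs = mlp_forward([1.0, 0.0, 2.0], w_hidden, b_hidden, w_out, b_out)
print(log_probs)
```

The outputs are log-probabilities, so exponentiating them gives values that sum to 1, and the largest entry marks the predicted label.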

---
Let the degree of error in an output node $j$ in the $n$th training query be $e_j(n) = d_j(n) - y_j(n)$, where $d$ is the true label and $y$ is the predicted label.

Then we can adjust the weights to minimize the entire output layer's cumulative error:
$$\mathcal{E}(n)=\frac{1}{2}\sum_j e_j^2(n)$$

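The change in each weight according to gradient descent is 
$$\Delta w_{ji} (n) = -\eta\frac{\partial\mathcal{E}(n)}{\partial v_j(n)} y_i(n),$$
where $\eta$ is the learning rate, $v_j(n)$ is the weighted sum of the inputs to node $j$, and $y_i(n)$ is the output of node $i$ in the previous layer.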
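As a worked instance of this update rule, consider a single output node with a linear activation, so that its output equals its weighted input sum and $\partial\mathcal{E}/\partial v_j = -e_j$. The sketch below applies the update three times, using an assumed learning rate of 0.1; the function name and the numbers are hypothetical.

```python
def output_weight_update(weights, inputs, target, eta):
    """One gradient-descent step for a linear output node with squared error."""
    # v is the weighted input sum; for a linear node the output y equals v
    v = sum(w * y_i for w, y_i in zip(weights, inputs))
    error = target - v          # e_j = d_j - y_j
    d_loss_d_v = -error         # derivative of 0.5 * e_j**2 with respect to v_j
    # Apply Delta w_ji = -eta * dE/dv_j * y_i to every incoming weight
    return [w - eta * d_loss_d_v * y_i for w, y_i in zip(weights, inputs)]

weights = [0.0, 0.0]
inputs = [1.0, 2.0]
errors = []
for step in range(3):
    weights = output_weight_update(weights, inputs, target=1.0, eta=0.1)
    prediction = sum(w * y_i for w, y_i in zip(weights, inputs))
    errors.append(1.0 - prediction)
    print(step, weights, errors[-1])
```

With these particular inputs and learning rate the remaining error halves at every step, which shows the weights moving toward values that reproduce the target.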

---
## Implement the methods of the class MultiLayerPerceptron below.
1. `__init__`: Takes the LABEL and TEXT `data.Field` instances and initializes the model.
2. `forward`: Given an input tensor, performs the forward steps of the multilayer perceptron.

In [0]:
class MultiLayerPerceptron(nn.Module):
    def __init__(self, label, text, n_hidden=128):
        super(MultiLayerPerceptron, self).__init__()
        self._labels_size = len(label.vocab.stoi)
        self._vocab_size = len(text.vocab.stoi)
        self.input2hidden = nn.Linear(self._vocab_size, n_hidden)
        self.relu = nn.ReLU()
        self.hidden2output = nn.Linear(n_hidden, self._labels_size)
        self.softmax = nn.LogSoftmax(dim=1)
    
    def get_vocab_size(self):
        return self._vocab_size
    
    def get_labels_size(self):
        return self._labels_size

    def forward(self, data):
        output = self.input2hidden(data)
        output = self.relu(output)
        output = self.hidden2output(output)
        output = self.softmax(output)
        return output

### Implement the method `train` for the MultiLayerPerceptron model.
<b>Parameters:</b> MultiLayerPerceptron model, data iterator, criterion, optimizer and # of epochs.

Trains the model for n epochs with provided optimizer and learning rate.


In [0]:
def train(model, data_iter, criterion, optim, n_epochs = 10):
    loss_values = []
    epochs = []
    for epoch in range (n_epochs):
        # Make sure the model is in training mode
        model.train()
        running_loss = 0.0
        c_num = 0
        total = 0
        for index, batch in enumerate(data_iter):
            # Zero the parameter gradients
            optim.zero_grad()

            # Input and target
            input = batch_to_bow(batch.text, model.get_vocab_size())
            target = batch.label.long()

            # Forward step: the model outputs log-probabilities per label
            log_probs = model(input)

            # Compute loss with the criterion supplied by the caller
            loss = criterion(log_probs, target)

            # Backward step
            loss.backward()
            optim.step()

            # Prepare to compute the accuracy
            predictions = torch.argmax(log_probs, dim=1)
            total += len(target)
            c_num += (predictions == target).sum().item()
            running_loss += loss.item() * len(target)

            # Report the loss every 200 steps
            if index % 200 == 0:
                print ('Epoch :', epoch,
                       'Step: ', index,
                       'Loss: ', loss.item(),
                       'Accuracy:', float (c_num)/total)

        # Record the average loss once per epoch
        epoch_loss = running_loss / total
        loss_values.append(epoch_loss)
        epochs.append(epoch)

    p2 = plt.figure(2)
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.plot(epochs, loss_values)
    plt.title("Epoch vs Loss")
    p2.show()

### Implement the method `evaluate_performance`.
This method takes a model & dataset, and returns the accuracy of the model on the dataset.

In [0]:
def evaluate_performance(model, data_iter):
    c_num = 0
    total = 0
    with torch.no_grad():
        for index, batch in enumerate(data_iter):
            # Input and target
            input = batch_to_bow(batch.text, model.get_vocab_size())
            target = batch.label.long()

            # Feed the input to the model, then determine the
            # index of the maximum value for each test item
            scores = model(input)
            predictions = torch.argmax(scores, dim=1)

            # Accumulate counts over every batch to compute the accuracy
            total += len(target)
            c_num += (predictions == target).sum().item()

    # Return the accuracy once all batches have been processed
    print("Accuracy: {}".format(float(c_num)/total))
    return float (c_num)/total

## Putting it all together

If you have implemented the MultiLayerPerceptron class and associated methods, the following code trains the model on the training set and evaluates its performance on both the train and test sets.

In [0]:
# Instantiate classifier
mlp_model = MultiLayerPerceptron(LABEL, TEXT).to(device)
print(mlp_model)

# Loss and optimizer (NLLLoss pairs with the model's LogSoftmax output)
criterion = nn.NLLLoss()
learning_rate = 0.005
optimizer = torch.optim.Adam(mlp_model.parameters(), lr=learning_rate)

# Train classifier model on training split
epochs = []
train_acc = []
test_acc = []
for n in range (5, 50, 5):
    # Train model for n more epochs (the model is not re-initialized, so training accumulates)
    train(mlp_model, train_iter, criterion, optimizer, n)
    epochs.append(n)
    # Evaluate model performance on training, test splits
    train_acc.append(evaluate_performance(mlp_model, train_iter))
    test_acc.append(evaluate_performance(mlp_model, test_iter))

# Graph epochs vs accuracy for train, test
p3 = plt.figure(3)
plt.title("MLP - Epochs vs Accuracy")
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
train_line = plt.plot(epochs, train_acc, 'b', label="Train")
test_line = plt.plot(epochs, test_acc, 'r', label="Test")
plt.legend()
p3.show()