# Quora Questions Classification with Recurrent Neural Networks
---

<img src="assets/word_cloud.png" width=60%>

An existential problem for any major website today is how to handle toxic and divisive content. Quora wants to tackle this problem head-on to keep their platform a place where users can feel safe sharing their knowledge with the world.

[Quora](https://www.quora.com) is a platform that empowers people to learn from each other. On Quora, people can ask questions and connect with others who contribute unique insights and quality answers. A key challenge is to weed out insincere questions -- those founded upon false premises, or that intend to make a statement rather than look for helpful answers.

In this notebook, we will be predicting whether a question asked on Quora is sincere or not. An insincere question is defined as a question intended to make a statement rather than look for helpful answers. Some characteristics that can signify that a question is insincere:
- Has a non-neutral tone
  - Has an exaggerated tone to underscore a point about a group of people
  - Is rhetorical and meant to imply a statement about a group of people
- Is disparaging or inflammatory
  - Suggests a discriminatory idea against a protected class of people, or seeks confirmation of a stereotype
  - Makes disparaging attacks/insults against a specific person or group of people
  - Based on an outlandish premise about a group of people
  - Disparages against a characteristic that is not fixable and not measurable
- Isn't grounded in reality
  - Based on false information, or contains absurd assumptions
- Uses sexual content (incest, bestiality, pedophilia) for shock value, and not to seek genuine answers

The training data includes the question that was asked, and whether it was identified as insincere (target = 1). The ground-truth labels contain some amount of noise: they are not guaranteed to be perfect.

Before moving to the next section, we need to import all packages required to do the analysis by calling the following:

In [32]:
# Data analysis packages
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Deep learning packages
import torch
import torch.nn as nn
import torch.utils.data
from torch.utils.data import TensorDataset, DataLoader

# Miscellaneous
import bcolz
import pickle
import re

---
## 1.0. Import the Data
The data is acquired from [here](https://www.kaggle.com/c/quora-insincere-questions-classification/data).

In [13]:
# Import dataset
df = pd.read_csv('data/train.csv')
df = df.iloc[0:1300000]

# Show the first 5 rows the dataset
df.head()

Unnamed: 0,qid,question_text,target
0,00002165364db923c7e6,How did Quebec nationalists see their province...,0
1,000032939017120e6e44,"Do you have an adopted dog, how would you enco...",0
2,0000412ca6e4628ce2cf,Why does velocity affect time? Does velocity a...,0
3,000042bf85aa498cd78e,How did Otto von Guericke used the Magdeburg h...,0
4,0000455dfa3e01eae3af,Can I convert montra helicon D to a mountain b...,0


We only need data from `question_text` and `target` column. So, we put them in the `sentences` and `labels` variable, respectively. 

In [14]:
sentences = df['question_text']
labels = df['target'].values

---
## 2.0. Clean the Data
To improve the predictive performance of the model, the following data cleaning are performed:
- Lowercase all the words in the data
- Remove short forms and mispellings
- Ensure symbols and punctuations not to be attached to a certain word
- Transform numbers

### 2.1. Lowercase All the Words
We will make all words in our questions to be lowercases.

In [15]:
# Lowercase all words
sentences = sentences.apply(lambda x: x.lower())

### 2.2. Remove Short Forms and Mispellings
There are some short form words and mispellings in the dataset. We will fix this.

In [16]:
# Dictionary of short form words and mispellings
mispell_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", 
                "'cause": "because", "could've": "could have", "couldn't": "could not", 
                "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", 
                "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", 
                "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", 
                "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", 
                "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", 
                "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", 
                "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", 
                "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", 
                "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not",
                "mightn't've": "might not have", "must've": "must have", "mustn't": "must not", 
                "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have",
                "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", 
                "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", 
                "she'd": "she would", "she'd've": "she would have", "she'll": "she will", 
                "she'll've": "she will have", "she's": "she is", "should've": "should have", 
                "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have",
                "so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", 
                "that's": "that is", "there'd": "there would", "there'd've": "there would have", 
                "there's": "there is", "here's": "here is","they'd": "they would", 
                "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", 
                "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", 
                "we'd": "we would", "we'd've": "we would have", "we'll": "we will", 
                "we'll've": "we will have", "we're": "we are", "we've": "we have", 
                "weren't": "were not", "what'll": "what will", "what'll've": "what will have", 
                "what're": "what are",  "what's": "what is", "what've": "what have", 
                "when's": "when is", "when've": "when have", "where'd": "where did", 
                "where's": "where is", "where've": "where have", "who'll": "who will", 
                "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", 
                "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", 
                "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", 
                "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have",
                "y'all're": "you all are","y'all've": "you all have","you'd": "you would", 
                "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", 
                "you're": "you are", "you've": "you have", 'colour': 'color', 'centre': 'center', 
                'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 
                'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 
                'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 
                'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 
                'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 
                'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 
                'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 
                'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'vegina': 'vagina',
                'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', 
                '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', 
                "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 
                'demonitization': 'demonetization', 'demonetisation': 'demonetization'}

We define `clean_mispell` function to fix short forms and mispellings.

In [17]:
def clean_mispell(text):
    clean_text = text
    for mispell in mispell_dict.keys():
        if re.search(mispell, text):
            clean_text = re.sub(mispell, mispell_dict[mispell], text)
    return clean_text

# remove short forms and mispellings
sentences = sentences.apply(lambda x: clean_mispell(x))

### 2.3. Punctuations and Symbols
Punctuations and symbols are often attached to a particular word. We define `clean_symbol` to ensure symbols and punctuations not to be attached to a certain word.  

In [18]:
symbols = [',', '.', '"', ':', ')', '(', '-', '!', '?', '|', 
           ';', "'", '$', '&', '/', '[', ']', '>', '%', '=', 
           '#', '*', '+', '\\', '•',  '~', '@', '£', '·', '_', 
           '{', '}', '©', '^', '®', '`',  '<', '→', '°', '€', 
           '™', '›',  '♥', '←', '×', '§', '″', '′', 'Â', '█', 
           '½', 'à', '…', '“', '★', '”', '–', '●', 'â', '►', 
           '−', '¢', '²', '¬', '░', '¶', '↑', '±', '¿', '▾', 
           '═', '¦', '║', '―', '¥', '▓', '—', '‹', '─', '▒', '：', 
           '¼', '⊕', '▼', '▪', '†', '■', '’', '▀', '¨', '▄', '♫', 
           '☆', 'é', '¯', '♦', '¤', '▲', 'è', '¸', '¾', 'Ã', '⋅', 
           '‘', '∞', '∙', '）', '↓', '、', '│', '（', '»', '，', '♪', 
           '╩', '╚', '³', '・', '╦', '╣', '╔', '╗', '▬', '❤', 'ï', 'Ø', 
           '¹', '≤', '‡', '√', ]

In [19]:
def clean_symbol(text):
    text = str(text)
    for symbol in symbols:
        text = text.replace(symbol, f' {symbol} ')
    return text

# ensure symbols and punctuations not to be attached to a certain word
sentences = sentences.apply(lambda x: clean_symbol(x))

### 2.4. Transform Numbers
All number with more than 5 digits are transformed into '###'.

In [20]:
def clean_numbers(x):
    x = re.sub('[0-9]{5,}', '###', x)
    return x

sentences = sentences.apply(lambda x: clean_numbers(x))

---
## 3.0. Pre-process the Data
We will implement the following pre-processing functions:
- Tokenize the questions
- Track vocabulary
- Encode the data
- Pad the questions

### 3.1. Tokenize the Questions
We will be splitting the questions into a word array using spaces as delimiters.

In [21]:
# tokenize all questions in the data
sentences_token = sentences.apply(lambda x: x.split())

In [22]:
print('Average word length of questions is {0:.0f}.'.format(np.mean(df['question_text'].apply(lambda x: len(x)))))
print('Max word length of questions is {0:.0f}.'.format(np.max(df['question_text'].apply(lambda x: len(x)))))
print('Min word length of questions is {0:.0f}.'.format(np.min(df['question_text'].apply(lambda x: len(x)))))

Average word length of questions is 71.
Max word length of questions is 1017.
Min word length of questions is 1.


### 3.2. Track Vocabulary
We define `track_vocab` to track our training vocabulary, which goes through all our text and counts the occurence of the contained words.

In [23]:
def track_vocab(sentences, verbose =  True):
    
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
                
    return vocab

In [24]:
# count the occurrence of all words in the data
vocab_count = track_vocab(sentences_token)
print({k: vocab_count[k] for k in list(vocab_count)[:5]})

{'how': 289121, 'did': 44478, 'quebec': 167, 'nationalists': 151, 'see': 9668}


### 3.3. Encode the Data
Since we're using embedding layers, we'll need to encode each word with an integer. To create a word embedding, we first need to transform the words to ids.  In this function, we create two dictionaries:
- Dictionary to go from the words to an id, we'll call `vocab_to_int`
- Dictionary to go from the id to word, we'll call `int_to_vocab`

We return these dictionaries in the **tuple** `(vocab_to_int, int_to_vocab)`

In [25]:
def create_lookup_tables(vocab_count):
    
    # sorting the words from most to least frequent in text occurrence
    sorted_vocab = sorted(vocab_count, key=vocab_count.get, reverse=True)
    # create vocab_to_int dictionary
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
    
    # return tuple
    return (vocab_to_int, int_to_vocab)

vocab_to_int, int_to_vocab = create_lookup_tables(vocab_count)

Then, we can convert our data into integers, so they can be passed into the network.

In [26]:
# encode the data
sentence_ints = []
for sentence in sentences_token:
    sentence_ints.append([vocab_to_int[word] for word in sentence])

### 3.4. Pad the Questions
To deal with both short and very long question, we'll pad or truncate all our questions to a specific length. For questions shorter than some `seq_length`, we'll pad with 0s. For questions longer than `seq_length`, we can truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 71, because the average question length from the data is 71.  

In [27]:
def pad_features(sentences_token, seq_length):
    # getting the correct rows x cols shape
    features = np.zeros((len(sentences_token), seq_length), dtype=int)

    # for each review, I grab that review and 
    for i, row in enumerate(sentences_token):
        features[i, -len(row):] = np.array(row)[:seq_length]
    
    return features

# pad the questions
seq_length = 71
features = pad_features(sentence_ints, seq_length)

---
## 4.0. Define Training, Validation, and Test Set
With our data in nice shape, we'll split it into training, validation, and test sets.

In [28]:
train_X, val_test_X, train_y, val_test_y = train_test_split(features, labels, 
                                                            test_size=0.4, 
                                                            random_state=42, shuffle=True,
                                                            stratify=labels)

val_X, test_X, val_y, test_y = train_test_split(val_test_X, val_test_y, 
                                                test_size=0.5, 
                                                random_state=42, shuffle=True,
                                                stratify=val_test_y)

### DataLoaders and Batching
After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:
1. Create a known format for accessing our data, using [TensorDataset](https://pytorch.org/docs/stable/data.html#) which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

In [29]:
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_X), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_X), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_X), torch.from_numpy(test_y))

# dataloaders
batch_size = 50
num_workers = 8

# make sure to SHUFFLE the training data
train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, num_workers=num_workers)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size, num_workers=num_workers)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size, num_workers=num_workers)

In [30]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = dataiter.next()

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 71])
Sample input: 
 tensor([[    0,     0,     0,  ..., 28272,  1899,     0],
        [    0,     0,     0,  ...,  1908,   316,     0],
        [    0,     0,     0,  ...,   516,   222,     0],
        ...,
        [    0,     0,     0,  ..., 18328,    25,     0],
        [    0,     0,     0,  ...,    25,   113,     0],
        [    0,     0,     0,  ...,    25,   123,     0]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1,
        0, 0])


## 5.0. Build Network Architecture
### 5.1. Import Pre-Trained Word Embeddings
**Note**: This part is taken from a Medium article [How to use Pre-trained Word Embeddings in PyTorch](https://medium.com/@martinpella/how-to-use-pre-trained-word-embeddings-in-pytorch-71ca59249f76). 

Rather than training our own word vectors from scratch for word embedding, we will leverage on GloVe. Its authors have released four text files with word vectors trained on different massive web datasets. They are available for download [here](https://nlp.stanford.edu/projects/glove/). We will use “Wikipedia 2014 + Gigaword 5” which is the smallest file (“ [glove.6B.zip](http://nlp.stanford.edu/data/wordvecs/glove.6B.zip)”) with 822 MB. It was trained on a corpus of 6 billion tokens and contains a vocabulary of 400 thousand tokens.

We need to parse the file to get as output: list of words, dictionary mapping each word to their id (position) and array of vectors. Given that the vocabulary have 400k tokens, we will use [bcolz](https://github.com/Blosc/bcolz) to store the array of vectors. It provides columnar, chunked data containers that can be compressed either in-memory and on-disk. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects. We then save the outputs to disk for future uses.

In [33]:
words = []
idx = 0
word2idx = {}
vectors = bcolz.carray(np.zeros(1), rootdir=f'embedding/glove/6B.50.dat', mode='w')

with open(f'embedding/glove/glove.6B.50d.txt', 'rb') as f:
    for l in f:
        line = l.decode().split()
        word = line[0]
        words.append(word)
        word2idx[word] = idx
        idx += 1
        vect = np.array(line[1:]).astype(np.float)
        vectors.append(vect)
    
vectors = bcolz.carray(vectors[1:].reshape((400000, 50)), rootdir=f'embedding/glove/6B.50.dat', mode='w')
vectors.flush()
pickle.dump(words, open(f'embedding/glove/6B.50_words.pkl', 'wb'))
pickle.dump(word2idx, open(f'embedding/glove/6B.50_idx.pkl', 'wb'))

Using those objects we can now create a dictionary that given a word returns its vector.

In [34]:
vectors = bcolz.open(f'embedding/glove/6B.50.dat')[:]
words = pickle.load(open(f'embedding/glove/6B.50_words.pkl', 'rb'))
word2idx = pickle.load(open(f'embedding/glove/6B.50_idx.pkl', 'rb'))

glove = {w: vectors[word2idx[w]] for w in words}

What we need to do at this point is to create an embedding layer, that is a dictionary mapping integer indices (that represent words) to dense vectors. It takes as input integers, it looks up these integers into an internal dictionary, and it returns the associated vectors.

We have already built a Python dictionary with similar characteristics, but it does not support auto differentiation so can not be used as a neural network layer and was also built based on GloVe’s vocabulary, likely different from our dataset’s vocabulary. In PyTorch an embedding layer is available through torch.nn.Embedding class.

We must build a matrix of weights that will be loaded into the PyTorch embedding layer. Its shape will be equal to (dataset’s vocabulary length, word vectors dimension).

For each word in dataset’s vocabulary, we check if it is on GloVe’s vocabulary. If it do it, we load its pre-trained word vector. Otherwise, we initialize a random vector.

In [35]:
sorted_vocab = sorted(vocab_count, key=vocab_count.get, reverse=True)
target_vocab = sorted_vocab
emb_dim = 50

matrix_len = len(target_vocab)
weights_matrix = np.zeros((matrix_len, 50))
words_found = 0

for i, word in enumerate(target_vocab):
    try: 
        weights_matrix[i] = glove[word]
        words_found += 1
    except KeyError:
        weights_matrix[i] = np.random.normal(scale=0.6, size=(emb_dim, ))

In [36]:
def create_emb_layer(weights_matrix, non_trainable=False):
    num_embeddings, embedding_dim = weights_matrix.shape
    emb_layer = nn.Embedding(num_embeddings, embedding_dim)
    if non_trainable:
        emb_layer.weight.requires_grad = False

    return emb_layer, num_embeddings, embedding_dim

### 5.2. Define RNN Architecture

In this model, we use multiple bidirectional GRU/LSTM layers in the network. The bidirectional LSTM/GRU keeps the contextual information in both directions which is pretty useful in text classification task. 

We also use attention model to build our model architecture. In the past conventional methods like TFIDF/CountVectorizer etc., we used to find features from text by doing a keyword extraction. Some word are more helpful in determining the category of a text than others. But in this method we sort of lost the sequential structure of text. With LSTM and deep learning methods while we are able to take case of the sequence structure we lose the ability to give higher weightage to more important words. Attention mechanism is introduced to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector.

In [37]:
# First checking if GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

No GPU available, training on CPU.


In [38]:
# implementation of attention layer
class Attention(nn.Module):
    def __init__(self, feature_dim, step_dim, bias=True, **kwargs):
        super(Attention, self).__init__(**kwargs)
        
        self.supports_masking = True

        self.bias = bias
        self.feature_dim = feature_dim
        self.step_dim = step_dim
        self.features_dim = 0
        
        weight = torch.zeros(feature_dim, 1)
        nn.init.kaiming_uniform_(weight)
        self.weight = nn.Parameter(weight)
        
        if bias:
            self.b = nn.Parameter(torch.zeros(step_dim))
        
    def forward(self, x, mask=None):
        feature_dim = self.feature_dim 
        step_dim = self.step_dim

        eij = torch.mm(
            x.contiguous().view(-1, feature_dim), 
            self.weight
        ).view(-1, step_dim)
        
        if self.bias:
            eij = eij + self.b
            
        eij = torch.tanh(eij)
        a = torch.exp(eij)
        
        if mask is not None:
            a = a * mask

        a = a / (torch.sum(a, 1, keepdim=True) + 1e-10)

        weighted_input = x * torch.unsqueeze(a, -1)
        return torch.sum(weighted_input, 1)

>**First, we'll pass in words to an embedding layer.** We need an embedding layer because we have tens to hundreds of thousands of words, so we will need a more efficient representation for our input data than one-hot encoded vectors. 

>**After input words are passed to an embedding layer, the new embeddings will be passed to bidirectional LSTM/GRU layers.** These layers will add *recurrent* connections to the network and give us the ability to include information about the *sequence* of words in our data. The bidirectional LSTM/GRU layers keep the contextual information in both directions which is pretty useful in text classification task. 

>**The outputs of the bidirectional LSTM/GRU layers will be passed to the attention layer.** Attention mechanism is introduced to extract such words that are important to the meaning of the sentence and aggregate the representation of those informative words to form a sentence vector.

>**Finally, the outputs will go to a output layer.** We are using a fully-connected neural network layer.

In [39]:
import torch.nn as nn

class RNN(nn.Module):
    """
    The RNN model that will be used to perform classification.
    """

    def __init__(self, weights_matrix, output_size, hidden_dim, drop_prob=0.3):
        """
        Initialize the model by setting up the layers.
        """
        super(RNN, self).__init__()

        self.output_size = output_size
        self.hidden_dim = hidden_dim
        
        # embedding layers
        self.embedding, self.num_embeddings, self.embedding_dim = create_emb_layer(weights_matrix, True)
        
        # embedding dropout
        self.dropout = nn.Dropout2d(0.1)
        
        # lstm and GRU
        self.lstm = nn.LSTM(self.embedding_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.gru = nn.GRU(hidden_dim * 2, hidden_dim, bidirectional=True, batch_first=True)
        
        # attention layer
        self.attention = Attention(hidden_dim*2, seq_length)
        
        # linear
        self.fc = nn.Linear(hidden_dim*2, 64)
        self.out = nn.Linear(64, 1)
        
        self.relu = nn.ReLU()
        self.sig = nn.Sigmoid()
        

    def forward(self, x):
        """
        Perform a forward pass of our model on some inputs.
        """
        batch_size = x.size(0)

        # embedding output
        x = x.long()
        embeds = self.embedding(x)
        embeds = torch.squeeze(torch.unsqueeze(embeds, 0))
        
        # lstm, gru, and attention outputs
        lstm_out, _ = self.lstm(embeds)
        gru_out, _ = self.gru(lstm_out)
        attention_out = self.attention(gru_out, 256)
        
        # linear outputs
        fc_out = self.relu(self.fc(attention_out))
        final_out = self.out(fc_out)
        
        # sigmoid function
        sig_out = self.sig(final_out)
        
        # reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # get last batch of labels
    
        return sig_out

### 5.3. Instantiate the network
Here, we'll instantiate the network. First up, defining the hyperparameters.

* `weights_matrix`: The pre-trained word vector.
* `output_size`: Size of our desired output.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network. Typically between 1-3.

In [40]:
# Instantiate the model w/ hyperparams
weights_matrix = weights_matrix
output_size = 1
hidden_dim = 60

net = RNN(weights_matrix, output_size, hidden_dim)

print(net)

RNN(
  (embedding): Embedding(194090, 50)
  (dropout): Dropout2d(p=0.1)
  (lstm): LSTM(50, 60, batch_first=True, bidirectional=True)
  (gru): GRU(120, 60, batch_first=True, bidirectional=True)
  (attention): Attention()
  (fc): Linear(in_features=120, out_features=64, bias=True)
  (out): Linear(in_features=64, out_features=1, bias=True)
  (relu): ReLU()
  (sig): Sigmoid()
)


### 5.4. Training
We'll use a cross entropy loss, which is designed to work with a single Sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss), or **Binary Cross Entropy Loss**, applies cross entropy loss to a single value between 0 and 1. We also have some data and training hyparameters:

* `lr`: Learning rate for our optimizer.
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum gradient value to clip at (to prevent exploding gradients).

In [41]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=lr)

In [32]:
# training params

epochs = 1

counter = 0
print_every = 100
clip=5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output = net(inputs)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs.
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output = net(inputs)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/1... Step: 100... Loss: 0.180951... Val Loss: 0.228777
Epoch: 1/1... Step: 200... Loss: 0.122095... Val Loss: 0.217856
Epoch: 1/1... Step: 300... Loss: 0.246794... Val Loss: 0.193550
Epoch: 1/1... Step: 400... Loss: 0.241540... Val Loss: 0.189776
Epoch: 1/1... Step: 500... Loss: 0.237808... Val Loss: 0.181189
Epoch: 1/1... Step: 600... Loss: 0.133767... Val Loss: 0.179677
Epoch: 1/1... Step: 700... Loss: 0.193125... Val Loss: 0.175078
Epoch: 1/1... Step: 800... Loss: 0.178750... Val Loss: 0.171360
Epoch: 1/1... Step: 900... Loss: 0.201808... Val Loss: 0.169327
Epoch: 1/1... Step: 1000... Loss: 0.079640... Val Loss: 0.167092
Epoch: 1/1... Step: 1100... Loss: 0.069104... Val Loss: 0.163757
Epoch: 1/1... Step: 1200... Loss: 0.063664... Val Loss: 0.168971
Epoch: 1/1... Step: 1300... Loss: 0.099035... Val Loss: 0.162384
Epoch: 1/1... Step: 1400... Loss: 0.062179... Val Loss: 0.160535
Epoch: 1/1... Step: 1500... Loss: 0.129470... Val Loss: 0.157751
Epoch: 1/1... Step: 1600... Loss: 

Epoch: 1/1... Step: 12700... Loss: 0.271732... Val Loss: 0.127527
Epoch: 1/1... Step: 12800... Loss: 0.161472... Val Loss: 0.127624
Epoch: 1/1... Step: 12900... Loss: 0.122112... Val Loss: 0.128823
Epoch: 1/1... Step: 13000... Loss: 0.090110... Val Loss: 0.127764
Epoch: 1/1... Step: 13100... Loss: 0.181880... Val Loss: 0.129444
Epoch: 1/1... Step: 13200... Loss: 0.056729... Val Loss: 0.126862
Epoch: 1/1... Step: 13300... Loss: 0.151670... Val Loss: 0.126507
Epoch: 1/1... Step: 13400... Loss: 0.075217... Val Loss: 0.128032
Epoch: 1/1... Step: 13500... Loss: 0.126853... Val Loss: 0.130052
Epoch: 1/1... Step: 13600... Loss: 0.056019... Val Loss: 0.125949
Epoch: 1/1... Step: 13700... Loss: 0.113785... Val Loss: 0.128121
Epoch: 1/1... Step: 13800... Loss: 0.191027... Val Loss: 0.129054
Epoch: 1/1... Step: 13900... Loss: 0.170780... Val Loss: 0.127327
Epoch: 1/1... Step: 14000... Loss: 0.083526... Val Loss: 0.125631
Epoch: 1/1... Step: 14100... Loss: 0.193275... Val Loss: 0.126956
Epoch: 1/1

#### Save the Model
After running the above training cell, the trained model will be saved by name, `trained_rnn`.

In [38]:
import os

def save_model(filename, decoder):
    save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
    torch.save(decoder, save_filename)

In [39]:
# saving the trained model
save_model('./save/trained_rnn', trained_rnn)
print('Model Trained and Saved')

  "type " + obj.__name__ + ". It won't be checked "
  "type " + obj.__name__ + ". It won't be checked "


Model Trained and Saved


#### Load the Model
We can load the trained model that have been saved to disk.

In [44]:
import os
import pickle
import torch

def load_model(filename):
    save_filename = os.path.splitext(os.path.basename(filename))[0] + '.pt'
    return torch.load(save_filename, map_location='cpu')

In [45]:
net = load_model('./save/trained_rnn')



### 5.5. Testing

We'll see how our trained model performs on all of our defined test_data, above. We'll calculate the average loss and accuracy over the test data.

In [46]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    # get predicted outputs
    output = net(inputs)
    
    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # convert output probabilities to predicted class (0 or 1)
    pred = torch.round(output.squeeze())  # rounds to the nearest integer
    
    # compare predictions to true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)


# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.125
Test accuracy: 0.952
