# Deep Learning in Medicine

## BMSC-GA 4493, BMIN-GA 3007

## Lab 7: Word Embedding & RNNs

## Goal:
- Learn the basics of word embeddings and how to use them
- Understand the mechanics of RNNs in PyTorch
- Train RNN-based neural networks on text data

### Word Embedding

When applying deep learning to NLP tasks, we usually rely on [word embeddings](https://en.wikipedia.org/wiki/Word_embedding). Briefly, a word embedding represents each word, or token, in a vocabulary as a dense numerical vector. There are many methods for learning word embeddings; some of the best known are Word2Vec, GloVe, and fastText. General-purpose embeddings trained with one of these methods on a massive amount of data are easy to find online, and using such a pre-trained embedding is usually a good way to save time and computing resources.

In this lab, we will use the [GloVe embedding](https://nlp.stanford.edu/projects/glove/) developed at Stanford, one of the most widely used pre-trained word embeddings. Please download the file ```glove.6B.50d.txt``` from the Resources tab in NYUClasses.

In [1]:
import numpy as np
from tqdm import tqdm
# load embedding
emb_dim = 50
with open('./glove.6B.50d.txt', encoding='utf-8') as f:
    loaded_embeddings = []
    words = {}
    chars = {}
    idx2words = {}
    ordered_words = []

    for i, line in tqdm(enumerate(f)):
        s = line.split()
        loaded_embeddings.append(np.asarray(s[1:]))
        
        words[s[0]] = len(words)
        idx2words[i] = s[0]
        ordered_words.append(s[0])
        
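# add an <UNK> (unknown) token with a random vector
loaded_embeddings.append(np.random.rand(emb_dim))
idx2words[len(words)] = "<UNK>"
words["<UNK>"] = len(words)

# add a <PAD> (padding) token with an all-zero vector
loaded_embeddings.append(np.zeros(emb_dim))
idx2words[len(words)] = "<PAD>"
words["<PAD>"] = len(words)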

chars["<UNK>"] = len(chars)
chars["<PAD>"] = len(chars)

loaded_embeddings = np.array(loaded_embeddings).astype(float)

400000it [00:07, 54212.32it/s]


Now we have three variables:
- ```loaded_embeddings``` of shape [400002, 50] (the 400,000 GloVe vectors plus ```<UNK>``` and ```<PAD>```), consisting of the actual vectors,
- ```words```, a dictionary mapping each token in the vocabulary to its corresponding row in ```loaded_embeddings```, and
- ```idx2words```, a dictionary mapping a row index of ```loaded_embeddings``` back to its word.

With these, we can explore how a word embedding represents words. To look up a word's vector:

In [93]:
loaded_embeddings[words['this']]

array([ 5.3074e-01,  4.0117e-01, -4.0785e-01,  1.5444e-01,  4.7782e-01,
        2.0754e-01, -2.6951e-01, -3.4023e-01, -1.0879e-01,  1.0563e-01,
       -1.0289e-01,  1.0849e-01, -4.9681e-01, -2.5128e-01,  8.4025e-01,
        3.8949e-01,  3.2284e-01, -2.2797e-01, -4.4342e-01, -3.1649e-01,
       -1.2406e-01, -2.8170e-01,  1.9467e-01,  5.5513e-02,  5.6705e-01,
       -1.7419e+00, -9.1145e-01,  2.7036e-01,  4.1927e-01,  2.0279e-02,
        4.0405e+00, -2.4943e-01, -2.0416e-01, -6.2762e-01, -5.4783e-02,
       -2.6883e-01,  1.8444e-01,  1.8204e-01, -2.3536e-01, -1.6155e-01,
       -2.7655e-01,  3.5506e-02, -3.8211e-01, -7.5134e-04, -2.4822e-01,
        2.8164e-01,  1.2819e-01,  2.8762e-01,  1.4440e-01,  2.3611e-01])

These numbers are hard to interpret on their own, but the similarity between two vectors is meaningful. Please complete the function below, which finds the words most similar to a given word vector:

In [2]:
from sklearn.metrics.pairwise import cosine_similarity
def find_nearest(ref_vec, words, embedding, topk=5):
    """
    Finds the top-k words most similar to the reference vector in terms of cosine similarity in the given embedding
    :param ref_vec: reference word vector
    :param words: dict, word to its index in the embedding
    :param embedding: numpy array of shape [V, embedding_dim]
    :param topk: number of top candidates to return
    :return a list of the top-k (word, similarity) tuples, most similar first
    """
    # TODO: compute cosine similarities
    scored_words = None
    
    # TODO: sort the words by similarity and return the topk
    sorted_words = None
    
    return [(idx2words[w], scored_words[w]) for w in sorted_words[:topk]]
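
One possible way to approach the two TODOs is sketched below with plain NumPy instead of `sklearn`. It is only an illustration, not the lab's reference solution: the function name `nearest_by_cosine`, the list `toy_words`, and the 2-dimensional `toy_embedding` are made up for this example and are not GloVe values.

```python
import numpy as np

def nearest_by_cosine(ref_vec, embedding, topk=5):
    """Return (row index, cosine similarity) pairs for the topk most similar rows."""
    ref_vec = np.asarray(ref_vec, dtype=float)
    embedding = np.asarray(embedding, dtype=float)
    # cosine similarity = dot product divided by the product of the two norms
    norms = np.linalg.norm(embedding, axis=1) * np.linalg.norm(ref_vec)
    # guard against all-zero rows (such as a padding vector) to avoid dividing by zero
    scores = embedding @ ref_vec / np.maximum(norms, 1e-12)
    # argsort sorts ascending, so negate the scores to rank the largest first
    order = np.argsort(-scores)[:topk]
    return [(int(i), float(scores[i])) for i in order]

# hypothetical 2-d "embedding" in which directions stand in for word meanings
toy_words = ["north", "northeast", "east", "south"]
toy_embedding = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, -1.0]])

result = nearest_by_cosine(toy_embedding[0], toy_embedding, topk=2)
print([(toy_words[i], round(s, 3)) for i, s in result])
```

The reference vector itself always comes back first with a similarity of 1, which matches the `'hate'` example in the next cell.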

In [3]:
find_nearest(loaded_embeddings[words['hate']], words, loaded_embeddings, topk=5)

[('hate', 0.9999999999999998),
 ('hatred', 0.7746837233748827),
 ('shame', 0.7489536581704521),
 ('racist', 0.7371559111440316),
 ('anyone', 0.7364716727627106)]

In [4]:
find_nearest(loaded_embeddings[words['worse']] - loaded_embeddings[words['better']] + loaded_embeddings[words['best']],
            words, loaded_embeddings, topk=1)

[('worst', 0.8109660213826736)]

In [5]:
find_nearest(loaded_embeddings[words['king']] - loaded_embeddings[words['queen']] + loaded_embeddings[words['woman']],
            words, loaded_embeddings, topk=1)

[('man', 0.8706067438874707)]

## Preprocessing

First, let's read in the data. We will be using the [First GOP Debate Twitter Sentiment dataset](https://www.kaggle.com/ngyptr/lstm-sentiment-analysis-keras/data), which contains tweets posted after the first GOP debate together with their sentiment labels (among other fields). The dataset can be downloaded [here](https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment/downloads/Sentiment.csv/2). Please save it in the same directory as this notebook.

In [1]:
import pandas as pd
import numpy as np
import re
import os

np.random.seed(1111)

df = pd.read_csv('Sentiment.csv')
df.head()

Unnamed: 0,id,candidate,candidate_confidence,relevant_yn,relevant_yn_confidence,sentiment,sentiment_confidence,subject_matter,subject_matter_confidence,candidate_gold,...,relevant_yn_gold,retweet_count,sentiment_gold,subject_matter_gold,text,tweet_coord,tweet_created,tweet_id,tweet_location,user_timezone
0,1,No candidate mentioned,1.0,yes,1.0,Neutral,0.6578,None of the above,1.0,,...,,5,,,RT @NancyLeeGrahn: How did everyone feel about...,,2015-08-07 09:54:46 -0700,629697200650592256,,Quito
1,2,Scott Walker,1.0,yes,1.0,Positive,0.6333,None of the above,1.0,,...,,26,,,RT @ScottWalker: Didn't catch the full #GOPdeb...,,2015-08-07 09:54:46 -0700,629697199560069120,,
2,3,No candidate mentioned,1.0,yes,1.0,Neutral,0.6629,None of the above,0.6629,,...,,27,,,RT @TJMShow: No mention of Tamir Rice and the ...,,2015-08-07 09:54:46 -0700,629697199312482304,,
3,4,No candidate mentioned,1.0,yes,1.0,Positive,1.0,None of the above,0.7039,,...,,138,,,RT @RobGeorge: That Carly Fiorina is trending ...,,2015-08-07 09:54:45 -0700,629697197118861312,Texas,Central Time (US & Canada)
4,5,Donald Trump,1.0,yes,1.0,Positive,0.7045,None of the above,1.0,,...,,156,,,RT @DanScavino: #GOPDebate w/ @realDonaldTrump...,,2015-08-07 09:54:45 -0700,629697196967903232,,Arizona


In [2]:
df = df[['sentiment', 'text']]
df.groupby('sentiment').count()

Unnamed: 0_level_0,text
sentiment,Unnamed: 1_level_1
Negative,8493
Neutral,3142
Positive,2236


For simplicity, we're only going to look at positive and negative tweets.

In [3]:
df = df[df['sentiment'] != 'Neutral'].copy()  # copy so the column assignment below acts on a new frame
df['sentiment'] = [1 if s == "Positive" else 0 for s in df['sentiment']]
df.groupby('sentiment').count()

Unnamed: 0_level_0,text
sentiment,Unnamed: 1_level_1
0,8493
1,2236


In [4]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(df, test_size=0.15)
print(len(train_data), len(test_data))

9119 1610




In [5]:
train_data.groupby('sentiment').count().apply(lambda x: 100 * x / float(x.sum()))

Unnamed: 0_level_0,text
sentiment,Unnamed: 1_level_1
0,79.471433
1,20.528567


In [6]:
test_data.groupby('sentiment').count().apply(lambda x: 100 * x / float(x.sum()))

Unnamed: 0_level_0,text
sentiment,Unnamed: 1_level_1
0,77.391304
1,22.608696


In [7]:
train_X, train_y = train_data['text'], train_data['sentiment']
test_X, test_y = test_data['text'], test_data['sentiment']

In [8]:
train_data.iloc[0]

sentiment                                                    0
text         RT @ericstonestreet: Trump has Cam hands. #GOP...
Name: 12241, dtype: object

## Vocabulary

To use word embeddings, we need to build a vocabulary for our model. We will build our vocabulary from the tokens in the training data only. Any word that is not in our vocabulary will be replaced with an ```<UNK>``` token. We will also add a ```<PAD>``` token for padding.

To keep the vocabulary small, we'll only keep words that appear more than 5 times in the training data.

In [9]:
from collections import Counter
import re

UNK = "<UNK>"
PAD = "<PAD>"
def build_vocab(sentences, min_count=5, max_vocab=None):
    """
    Build vocabulary from sentences (list of strings)
    """
    # count how many times each token appears
    word_count = Counter()
    
    for s in sentences:
        word_count.update(re.findall(r"[\w']+|[.,!?;]", s.lower()))
    
    vocabulary = list([w for w in word_count if word_count[w] > min_count]) + [UNK, PAD]
    indices = dict(zip(vocabulary, range(len(vocabulary))))

    return vocabulary, indices
    
vocabulary, vocab_indices = build_vocab(train_X)
print(len(vocabulary))

2189


Next, we'll have to convert each token in the sentences into its index in our vocabulary so that PyTorch can use it. We'll also pad (or truncate) our sentences to a fixed length of 25 tokens so that we can do batch processing. We will do this for both the training and test sets.

In [10]:
def sentences_to_padded_index_sequences(words, sentences, pad_length=100):
    padded_sequences = np.zeros((len(sentences), pad_length))
    for i, s in enumerate(sentences):
        indices = np.ones(pad_length) * words['<PAD>']
        # keep at most pad_length tokens
        token_indices = np.array([words[w] if w in words else words['<UNK>'] for w in re.findall(r"[\w']+|[.,!?;]", s.lower())[:pad_length]])
        indices[:len(token_indices)] = token_indices
        padded_sequences[i] = indices
    return padded_sequences
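
To make the effect of padding and truncation concrete, here is a small self-contained sketch on a made-up vocabulary. The names `toy_indices` and `encode` are hypothetical and exist only for this illustration; the tokenizer regex is the same one used above.

```python
import re

# hypothetical miniature vocabulary; <UNK> and <PAD> come last, as in build_vocab
toy_indices = {"trump": 0, "has": 1, "hands": 2, ".": 3, "<UNK>": 4, "<PAD>": 5}

def encode(sentence, indices, pad_length):
    """Tokenize, map tokens to indices, then truncate or pad to pad_length."""
    tokens = re.findall(r"[\w']+|[.,!?;]", sentence.lower())
    # out-of-vocabulary tokens fall back to the <UNK> index
    ids = [indices.get(t, indices["<UNK>"]) for t in tokens[:pad_length]]
    # fill the remaining positions with the <PAD> index
    return ids + [indices["<PAD>"]] * (pad_length - len(ids))

short = encode("Trump has Cam hands.", toy_indices, pad_length=8)
long = encode("Trump has Cam hands.", toy_indices, pad_length=3)
print(short)  # [0, 1, 4, 2, 3, 5, 5, 5]
print(long)   # [0, 1, 4]
```

Here "cam" is not in the toy vocabulary, so it maps to the `<UNK>` index, and the shorter `pad_length` simply cuts the sentence off.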

In [11]:
train_X = sentences_to_padded_index_sequences(vocab_indices, train_X, 25)
test_X = sentences_to_padded_index_sequences(vocab_indices, test_X, 25)

In [12]:
train_X.shape

(9119, 25)

## For PyTorch

In [13]:
import torch
from torch.utils.data import DataLoader, Dataset

class TweetDataset(Dataset):
    def __init__(self, sentences, labels):
        self.sentences = sentences.astype(int)
        self.labels = np.array(labels).astype(int)
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, key):
        return (torch.LongTensor(self.sentences[key]), self.labels[key])

BATCH_SIZE = 32
train_loader = DataLoader(TweetDataset(train_X, train_y),
                          batch_size=BATCH_SIZE,
                          shuffle=True)
test_loader = DataLoader(TweetDataset(test_X, test_y),
                          batch_size=BATCH_SIZE,
                          shuffle=True)

## Model

In [14]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
torch.manual_seed(1111)

<torch._C.Generator at 0x1a0cc1c9f0>

In [15]:
from sklearn.metrics import roc_auc_score

def test_model(loader, model, limit=None):
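    """
    Helper function that evaluates the model's performance on a dataset
    :param loader: data loader for the dataset to test against
    :param model: the model to evaluate
    :param limit: optional maximum number of batches to evaluate
    :return (accuracy in percent, AUC computed on the thresholded predictions)
    """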
    correct = 0
    total = 0
    model.eval()
    
    predictions = []
    truths = []
    
    for i, (data, labels) in enumerate(loader):
        if i and i==limit:
            break
        outputs = model(Variable(data)).squeeze()
        predicted = ((outputs.data > 0.5).long()).view(-1)
        predictions += list(predicted.numpy())
        truths += list(labels.numpy())
        total += labels.size(0)
        # .item() turns the 0-dim tensor into a plain Python number
        correct += (predicted == labels).sum().item()
    model.train()
    return (100 * correct / total), roc_auc_score(truths, predictions)

In [16]:
LOG_INTERVAL = 100

def train(num_epoch, model):
    for epoch in range(num_epoch):
        for i, (data, labels) in enumerate(train_loader):
            outputs = model(Variable(data))
            model.zero_grad()
            loss = loss_function(outputs.squeeze(), Variable(labels.float()))
            loss.backward()
            optimizer.step()

            # report performance
            if (i + 1) % LOG_INTERVAL == 0:
                test_acc, test_auc = test_model(test_loader, model)
                print('Epoch: [{0}/{1}], Step: [{2}/{3}], Loss: {4}, Validation Acc:{5}, AUC:{6}'.format(
                    epoch + 1, num_epoch, i + 1, len(train_loader), loss.item(), test_acc, test_auc))

For this lab, we will be exploring two variants of RNN: the vanilla (or Elman) RNN and the LSTM (Long Short-Term Memory). In the following blocks, please try to define your own model. Some modules in ```torch.nn``` might be really helpful for this.
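
As a reminder of the mechanics, a vanilla (Elman) RNN updates its hidden state at every time step with h_t = tanh(W_ih x_t + W_hh h_(t-1) + b). The NumPy sketch below unrolls this recurrence over one short sequence. It is only an illustration under simplifying assumptions: the weights are random rather than trained, the two bias vectors that `torch.nn.RNN` keeps are merged into a single `b`, and all variable names are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, input_dim, hidden_dim = 4, 3, 5

# randomly initialised parameters standing in for learned weights
W_ih = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

x = rng.normal(size=(seq_len, input_dim))  # one sequence of 4 token vectors
h = np.zeros(hidden_dim)                   # initial hidden state
hidden_states = []
for x_t in x:
    # the new state depends on the current input and the previous state
    h = np.tanh(W_ih @ x_t + W_hh @ h + b)
    hidden_states.append(h)

last_hidden = hidden_states[-1]            # summary of the whole sequence
print(np.stack(hidden_states).shape, last_hidden.shape)
```

The last hidden state is what the models in this lab pass to the final linear layer.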

### Vanilla RNN

In [17]:
class RNN(nn.Module):
    def __init__(self, hidden_dim, output_dim, 
                 vocab_size, embedding_dim, dropout=0.2):
        
        super(RNN, self).__init__()
        
        self.emb = nn.Embedding(vocab_size, embedding_dim, padding_idx=vocab_size-1)
        self.hidden_dim = hidden_dim
        # TODO: RNN with dropout, remember to set batch_first to True
        self.rnn = None
        
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def init_hidden(self, bsz):
        """
        Initialize the hidden state values
        """
        # TODO: initialize RNN values
        
        return None
        
    def forward(self, x):
        x = self.emb(x)
        
        # TODO: take last hidden state of RNN
        
        out = torch.sigmoid(self.fc(last_hidden))
        return out

Because the dataset is imbalanced, we will also report AUC as a metric alongside accuracy. If everything checks out, you should be able to get about 82% accuracy and an AUC of about 0.6 after 10 epochs of training a vanilla RNN with the provided parameters. It might be difficult to do better than this with a vanilla RNN.
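
To see why accuracy alone is misleading here, consider a classifier that predicts "negative" for every tweet. The sketch below uses a made-up 80/20 label split and a simple pairwise definition of AUC (the probability that a random positive is scored above a random negative, counting ties as one half); `pairwise_auc` is a hypothetical helper written for this illustration, not the `roc_auc_score` function used in the lab.

```python
import numpy as np

def pairwise_auc(y_true, y_score):
    """AUC as P(score of a positive > score of a negative), ties counted as 0.5."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # compare every positive score with every negative score
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

y_true = np.array([0] * 8 + [1] * 2)   # made-up labels, 80% negative
always_negative = np.zeros(10)         # predicts class 0 for everything

accuracy = float((always_negative == y_true).mean())
auc = pairwise_auc(y_true, always_negative)
print(accuracy, auc)  # 0.8 0.5
```

This is consistent with the first logged step of the training run below, where the accuracy of 77.39% equals the share of negative tweets in the test set and the AUC is exactly 0.5.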

In [58]:
EPOCHS = 10
LR = 0.002

rnn_model = RNN(40, 1, len(vocabulary), 20)

loss_function = nn.BCELoss()
optimizer = torch.optim.Adam(rnn_model.parameters(), lr=LR)

train(EPOCHS, rnn_model)

Epoch: [1/10], Step: [100/285], Loss: 0.4389716684818268, Validation Acc:77.3913043478261, AUC:0.5
Epoch: [1/10], Step: [200/285], Loss: 0.5644798874855042, Validation Acc:77.70186335403727, AUC:0.5107575009260402
Epoch: [2/10], Step: [100/285], Loss: 0.7684984803199768, Validation Acc:78.75776397515529, AUC:0.5448049141869367
Epoch: [2/10], Step: [200/285], Loss: 0.32810139656066895, Validation Acc:78.01242236024845, AUC:0.5185979750586492
Epoch: [3/10], Step: [100/285], Loss: 0.4594899117946625, Validation Acc:78.75776397515529, AUC:0.5360538338066427
Epoch: [3/10], Step: [200/285], Loss: 0.429439902305603, Validation Acc:78.94409937888199, AUC:0.5430917397209533
Epoch: [4/10], Step: [100/285], Loss: 0.49390771985054016, Validation Acc:79.1304347826087, AUC:0.5540190146931719
Epoch: [4/10], Step: [200/285], Loss: 0.4100034236907959, Validation Acc:79.00621118012423, AUC:0.5600228423262131
Epoch: [5/10], Step: [100/285], Loss: 0.42685821652412415, Validation Acc:79.25465838509317, AUC

Play around with the trained model:

In [50]:
def test_sentence(sentence, model):
    model.eval()
    test_tensor = torch.LongTensor(sentences_to_padded_index_sequences(vocab_indices, [sentence]).astype(int))
    score = model(Variable(test_tensor)).data.numpy()[0][0][0]
    
    return ("positive" if score > 0.5 else "negative", score)

In [59]:
test_sentence("her speech was great", rnn_model)

('negative', 0.15449949)

In [60]:
test_sentence("His speech sucked", rnn_model)

('negative', 0.28567505)

### LSTM

In [55]:
class LSTM(nn.Module):
    def __init__(self, hidden_dim, output_dim, 
                 vocab_size, embedding_dim, dropout=0.2):
        
        super(LSTM, self).__init__()
        
        self.emb = nn.Embedding(vocab_size, embedding_dim, padding_idx=vocab_size-1)
        self.hidden_dim = hidden_dim
        # TODO: LSTM with dropout, remember to set batch_first to True
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True, dropout=dropout)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def init_hidden(self, bsz):
        """
        Initialize the hidden state values and C matrix for LSTM
        :return (hidden, c)
        """
        # TODO: initialize LSTM values
        
        return None
        
    def forward(self, x):
        """
        Forward function of the network:
        1. take the embedding vectors of each token index
        2. pass embedding vectors to RNN/LSTM module
        3. take last hidden state from RNN/LSTM module
        4. pass last hidden state to linear layer to get output
        :param x: LongTensor of shape (batch_size, pad_length)
        :return a FloatTensor of probabilities of shape (batch_size, 1)
        """
        x = self.emb(x)
        # TODO: take last hidden state of RNN
        
        out = torch.sigmoid(self.fc(last_hidden))
        return out
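
The four steps listed in the `forward` docstring can be traced on random data to check the tensor shapes at each stage. The snippet below is a rough sketch rather than the lab's reference solution: the sizes and variable names are arbitrary, no initial state is passed (so `nn.LSTM` starts from zeros), and it takes the final hidden state of the last layer via `h_n[-1]`, which is one of several possible conventions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, pad_length = 4, 25
vocab_size, embedding_dim, hidden_dim = 100, 20, 40

emb = nn.Embedding(vocab_size, embedding_dim, padding_idx=vocab_size - 1)
lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
fc = nn.Linear(hidden_dim, 1)

x = torch.randint(0, vocab_size, (batch_size, pad_length))  # fake token indices
embedded = emb(x)                       # (batch_size, pad_length, embedding_dim)
output, (h_n, c_n) = lstm(embedded)     # output holds the hidden state at every step
last_hidden = h_n[-1]                   # (batch_size, hidden_dim)
probs = torch.sigmoid(fc(last_hidden))  # (batch_size, 1), values between 0 and 1

print(embedded.shape, output.shape, h_n.shape, probs.shape)
```

Note that `h_n` keeps the layout (num_layers, batch_size, hidden_dim) even when `batch_first=True`; for a single-layer, unidirectional LSTM, `h_n[-1]` equals `output[:, -1, :]`.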

If everything is done correctly, you should be able to get about 85% accuracy and an AUC of about 0.75 after 10 epochs of training an LSTM with the provided parameters. Note the significant improvement with an LSTM even without changing the parameters much.

In [57]:
EPOCHS = 10
LR = 0.002

lstm_model = LSTM(40, 1, len(vocabulary), 20)

loss_function = nn.BCELoss()
optimizer = torch.optim.Adam(lstm_model.parameters(), lr=LR)

train(EPOCHS, lstm_model)

Epoch: [1/10], Step: [100/285], Loss: 0.45670372247695923, Validation Acc:77.3913043478261, AUC:0.5
Epoch: [1/10], Step: [200/285], Loss: 0.3691979646682739, Validation Acc:77.70186335403727, AUC:0.5107575009260402
Epoch: [2/10], Step: [100/285], Loss: 0.44790759682655334, Validation Acc:79.5031055900621, AUC:0.6439375231510064
Epoch: [2/10], Step: [200/285], Loss: 0.395826518535614, Validation Acc:80.55900621118012, AUC:0.6138103469564143
Epoch: [3/10], Step: [100/285], Loss: 0.3412517309188843, Validation Acc:80.8695652173913, AUC:0.6576274848746758
Epoch: [3/10], Step: [200/285], Loss: 0.46227243542671204, Validation Acc:82.04968944099379, AUC:0.6915051240893937
Epoch: [4/10], Step: [100/285], Loss: 0.34477517008781433, Validation Acc:82.36024844720497, AUC:0.6400327200888999
Epoch: [4/10], Step: [200/285], Loss: 0.3076227605342865, Validation Acc:82.29813664596273, AUC:0.6629676503272008
Epoch: [5/10], Step: [100/285], Loss: 0.34102505445480347, Validation Acc:82.36024844720497, AU

Play around with the trained model:

In [61]:
test_sentence("I love this guy!", lstm_model)

('positive', 0.964622)

In [62]:
test_sentence("her speech was great", lstm_model)

('positive', 0.96462125)

In [63]:
test_sentence("His speech sucked", lstm_model)

('negative', 0.4483167)

### Try at home:
- Use the GloVe embedding as a pretrained embedding, and train it on our dataset (hint: nn.Embedding.weight). Remember to re-index the tokens!
- Use the GloVe embedding, but keep the embedding fixed throughout training (hint: requires_grad=False)
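
As a starting point for both exercises, the sketch below shows one way to re-index pretrained vectors to a task vocabulary and load them into an embedding layer. Everything in it is a toy stand-in: `toy_glove_index`, `toy_glove_vectors`, and `toy_vocabulary` are hypothetical names with invented values, and it uses `nn.Embedding.from_pretrained`, which is an alternative to assigning `nn.Embedding.weight` by hand as the hints suggest.

```python
import numpy as np
import torch
import torch.nn as nn

emb_dim = 3
# hypothetical stand-ins for the GloVe lookup structures
toy_glove_index = {"the": 0, "debate": 1, "great": 2}
toy_glove_vectors = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]])

# task vocabulary with <UNK> and <PAD> last, as in build_vocab
toy_vocabulary = ["debate", "gop", "great", "<UNK>", "<PAD>"]

rng = np.random.default_rng(0)
weights = np.zeros((len(toy_vocabulary), emb_dim))
for row, token in enumerate(toy_vocabulary):
    if token in toy_glove_index:
        # re-index: copy the pretrained vector into this vocabulary's row
        weights[row] = toy_glove_vectors[toy_glove_index[token]]
    elif token != "<PAD>":
        # tokens without a pretrained vector get a small random one; <PAD> stays zero
        weights[row] = rng.normal(scale=0.1, size=emb_dim)

weights = torch.tensor(weights, dtype=torch.float32)
pad_idx = len(toy_vocabulary) - 1

# freeze=True keeps the vectors fixed; freeze=False lets training update them
frozen_emb = nn.Embedding.from_pretrained(weights, freeze=True, padding_idx=pad_idx)
tuned_emb = nn.Embedding.from_pretrained(weights.clone(), freeze=False, padding_idx=pad_idx)
print(frozen_emb.weight.requires_grad, tuned_emb.weight.requires_grad)
```

When the embedding is frozen, its weights no longer need gradients, so an optimizer can be given only the parameters for which `requires_grad` is true.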