<a href="https://colab.research.google.com/github/antsh3k/NN-learning/blob/master/6_Language_Modeling_RNN_20191105_MedMentions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Switch to GPU
---

## Imports and globals

In [0]:
# IMPORTS (try to organize/group your imports)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
import re
import json
import spacy
from os import path

import torch
from torch import nn
import torch.nn.functional as F
import torch.optim as optim

import sklearn.metrics
from sklearn.metrics import classification_report

In [0]:
# Any global variables
SEED = 15
DATA_PATH = '/content/' # Used for Colab
MAX_SEQ_LEN = 12 # We go with something small as the dataset is small
nlp = spacy.load('en_core_web_sm')
DEVICE = 'cuda'
BATCH_SIZE = 64 # Again small as no powerful GPUs

# Set SEEDs
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

## Download the data

---
Notes:
- Change directory accordingly if working locally or not on Colab

In [0]:
!wget https://github.com/w-is-h/DeepLearningNLP/raw/master/Session_7/data/text_medmentions.txt -P /content/

--2019-11-05 13:30:53--  https://github.com/w-is-h/DeepLearningNLP/raw/master/Session_7/data/text_medmentions.txt
Resolving github.com (github.com)... 140.82.118.3
Connecting to github.com (github.com)|140.82.118.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/w-is-h/DeepLearningNLP/master/Session_7/data/text_medmentions.txt [following]
--2019-11-05 13:30:55--  https://raw.githubusercontent.com/w-is-h/DeepLearningNLP/master/Session_7/data/text_medmentions.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7157528 (6.8M) [text/plain]
Saving to: ‘/content/text_medmentions.txt’


2019-11-05 13:30:55 (118 MB/s) - ‘/content/text_medmentions.txt’ saved [7157528/7157528]



In [0]:
# Display a couple of lines from the downloaded file
#?TODO
!head /content/text_medmentions.txt

DCTN4	as	a	modifier	of	chronic	Pseudomonas	aeruginosa	infection	in	cystic	fibrosis	Pseudomonas	aeruginosa	(	Pa	)	infection	in	cystic	fibrosis	(	CF	)	patients	is	associated	with	worse	long	-	term	pulmonary	disease	and	shorter	survival	,	and	chronic	Pa	infection	(	CPA	)	is	associated	with	reduced	lung	function	,	faster	rate	of	lung	decline	,	increased	rates	of	exacerbations	and	shorter	survival	.	By	using	exome	sequencing	and	extreme	phenotype	design	,	it	was	recently	shown	that	isoforms	of	dynactin	4	(	DCTN4	)	may	influence	Pa	infection	in	CF	,	leading	to	worse	respiratory	disease	.	The	purpose	of	this	study	was	to	investigate	the	role	of	DCTN4	missense	variants	on	Pa	infection	incidence	,	age	at	first	Pa	infection	and	chronic	Pa	infection	incidence	in	a	cohort	of	adult	CF	patients	from	a	single	centre	.	Polymerase	chain	reaction	and	direct	sequencing	were	used	to	screen	DNA	samples	for	DCTN4	variants	.	A	total	of	121	adult	CF	patients	from	the	Cochin	Hospital	CF	centre	have	been	includ

## We need to put the data into the right format (x, y)

The input data is in the following format:
```
<token>\t<token>\t<token>....<all_tokens_for_doc_1>
<token>\t<token>\t<token>....<all_tokens_for_doc_2>
.
.
.
```
We want to put the data into x so that:
```
x = [[<token>, <token>, <token>, ..., <all_tokens_for_doc_1], 
     [<token>, <token>, <token>, ..., <all_tokens_for_doc_2], 
     ...]
```

---
Notes:
- Usually we would not load the data into memory, but in this case it is small so who cares.
- We also lowercase the text
- There is no `y`

In [0]:
# Load the data into x
x = [] #? TODO
"""
filename = open("/content/text_medmentions.txt")
for token in filename:
  token = token.lower()
  for tkns in [token]:
    x.append(token)

"""
x = [[token.lower() for token in sentence.split("\t")] for sentence in open(DATA_PATH + "text_medmentions.txt")]
  
  

# Sanity
print(x[1])

['nonylphenol', 'diethoxylate', 'inhibits', 'apoptosis', 'induced', 'in', 'pc12', 'cells', 'nonylphenol', 'and', 'short', '-', 'chain', 'nonylphenol', 'ethoxylates', 'such', 'as', 'np2', 'eo', 'are', 'present', 'in', 'aquatic', 'environment', 'as', 'wastewater', 'contaminants', ',', 'and', 'their', 'toxic', 'effects', 'on', 'aquatic', 'species', 'have', 'been', 'reported', '.', 'apoptosis', 'has', 'been', 'shown', 'to', 'be', 'induced', 'by', 'serum', 'deprivation', 'or', 'copper', 'treatment', '.', 'to', 'understand', 'the', 'toxicity', 'of', 'nonylphenol', 'diethoxylate', ',', 'we', 'investigated', 'the', 'effects', 'of', 'np2', 'eo', 'on', 'apoptosis', 'induced', 'by', 'serum', 'deprivation', 'and', 'copper', 'by', 'using', 'pc12', 'cell', 'system', '.', 'nonylphenol', 'diethoxylate', 'itself', 'showed', 'no', 'toxicity', 'and', 'recovered', 'cell', 'viability', 'from', 'apoptosis', '.', 'in', 'addition', ',', 'nonylphenol', 'diethoxylate', 'decreased', 'dna', 'fragmentation', 'caus

## Download the word embeddings

THe links on AWS contain the glove trained model (glove.840B.300d.zip), I've only converted it into keyed_vectors for gensim.


---
Notes: 
- Change directory accordingly if working locally or not on Colab
- Pretrained word embeddings taken from: https://nlp.stanford.edu/projects/glove/

In [0]:
!wget https://zkcl.s3-eu-west-1.amazonaws.com/keyed_vectors_840_300.dat -P /content/
!wget https://zkcl.s3-eu-west-1.amazonaws.com/keyed_vectors_840_300.dat.vectors.npy -P /content/

--2019-11-05 13:32:01--  https://zkcl.s3-eu-west-1.amazonaws.com/keyed_vectors_840_300.dat
Resolving zkcl.s3-eu-west-1.amazonaws.com (zkcl.s3-eu-west-1.amazonaws.com)... 52.218.104.19
Connecting to zkcl.s3-eu-west-1.amazonaws.com (zkcl.s3-eu-west-1.amazonaws.com)|52.218.104.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 120497892 (115M) [application/x-www-form-urlencoded]
Saving to: ‘/content/keyed_vectors_840_300.dat’


2019-11-05 13:32:03 (93.8 MB/s) - ‘/content/keyed_vectors_840_300.dat’ saved [120497892/120497892]

--2019-11-05 13:32:04--  https://zkcl.s3-eu-west-1.amazonaws.com/keyed_vectors_840_300.dat.vectors.npy
Resolving zkcl.s3-eu-west-1.amazonaws.com (zkcl.s3-eu-west-1.amazonaws.com)... 52.218.104.19
Connecting to zkcl.s3-eu-west-1.amazonaws.com (zkcl.s3-eu-west-1.amazonaws.com)|52.218.104.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2635219328 (2.5G) [application/x-www-form-urlencoded]
Saving to: ‘/content/keyed_

In [0]:
from gensim.models import KeyedVectors

DATA_PATH = "/content/"
keyed_vectors = KeyedVectors.load(DATA_PATH + "keyed_vectors_840_300.dat") #?

# Sanity check
#? TODO: Get vectors most similar to the word "fibrosis"
print(keyed_vectors.most_similar("fibrosis"))

  'See the migration notes for details: %s' % _MIGRATION_NOTES_URL
  if np.issubdtype(vec.dtype, np.int):


[('cystic', 0.8054730892181396), ('pulmonary', 0.7173466086387634), ('emphysema', 0.699813961982727), ('cirrhosis', 0.6976104974746704), ('lung', 0.6918838620185852), ('Fibrosis', 0.6711395978927612), ('sarcoidosis', 0.6456377506256104), ('bronchiectasis', 0.6421423554420471), ('renal', 0.6387313604354858), ('sclerosis', 0.6375002861022949)]


## Subset the pretrained embeddings to only the ones we need
Create a vocabulary based on our current datasete `x`. We use this to subset the full keyed_vectors from glove.

---
Notes:
- Usually we would not do this, unles we know that our current dataset contains all words that we are ever going to see in the future. Here we are only doing it to speed things up.

In [0]:
# TODO
current_vocab = [token for sublist in x for token in sublist]

"""
for sublist in x:
  for token in sublist:
    current_vocab.append(token)
    """
# Convert to set to remove duplicates
current_vocab = set(current_vocab)
print(len(current_vocab))


48834


## From the gensim model take only what we need

`embeddings` - a list of vectors, where each row represents the embedding of one word

`id2word` - map from index in the embeddings list to words

`word2id` - map from word to index in the embeddings list


Once done we should be able to get the embedding for word "house" like this:
`embeddings[word2id['house']]`

---
Notes:
- We only want to have the embeddings for the words in `current_vocab`

In [0]:
embeddings = [] # A list of embeddings for each word in the word2vec vocab

# Embeddings is a list, meaning we know that embeddings[1] is a vector for the 
#word with ID=1, but we don't know what word is that. That is why we need 
#the id2word and word2id mappings.
id2word = {}
word2id = {}

# Loop over all words in the vocabulary and add the values
for word in keyed_vectors.vocab.keys():
  # Get only the words that are in our current_vocab
  if word in current_vocab:
    id2word[len(embeddings)] = word
    word2id[word] = len(embeddings)
    embeddings.append(keyed_vectors[word])

# Add <START> and <END>
word = "<START>"
id2word[len(embeddings)] = word
word2id[word] = len(embeddings)
embeddings.append(np.random.rand(len(embeddings[0])))
word = "<END>"
id2word[len(embeddings)] = word
word2id[word] = len(embeddings)
embeddings.append(np.random.rand(len(embeddings[0])))

# Add <UNK> and <PAD>
word = "<UNK>"
id2word[len(embeddings)] = word
word2id[word] = len(embeddings)
embeddings.append(np.random.rand(len(embeddings[0])))
word = "<PAD>"
id2word[len(embeddings)] = word
word2id[word] = len(embeddings)
embeddings.append(np.zeros(len(embeddings[0])))

# Convert the embeddings list into a tensor
embeddings = torch.tensor(embeddings, dtype=torch.float32)

# Sanity
assert len(embeddings) == len(id2word) == len(word2id)
assert keyed_vectors['house'][0] == embeddings[word2id['house']][0]

## Convert tokens to indices

---
Notes:
- At the beginning of each sample add the index of the `<START>` token
- If a word does not exist in our `word2id` map we use the index for `<UNK>`

In [0]:
x_ind = []

#? TODO
i_start = word2id["<START>"]
i_unk = word2id["<UNK>"]

for sample in x:
  new_sample = [i_start]
  for token in sample:
    # Get index for the token, if not there, get i_unk
    i_token = word2id.get(token, i_unk)
    # Append to the new sample with indices
    new_sample.append(i_token)
  x_ind.append(new_sample)


print(x_ind[0])

[34805, 34807, 26, 6, 12185, 5, 3714, 24692, 18750, 2982, 7, 12786, 12058, 24692, 18750, 13, 6305, 12, 2982, 7, 12786, 12058, 13, 11420, 12, 965, 10, 1187, 18, 1646, 197, 15, 794, 8282, 1183, 3, 4376, 3712, 0, 3, 3714, 6305, 2982, 13, 17721, 12, 10, 1187, 18, 2067, 5162, 977, 0, 1803, 618, 5, 5162, 3501, 0, 1241, 945, 5, 19901, 3, 4376, 3712, 1, 22, 210, 27268, 10018, 3, 2532, 12473, 357, 0, 20, 28, 737, 967, 14, 17244, 5, 32679, 120, 13, 34807, 12, 110, 1903, 6305, 2982, 7, 11420, 0, 913, 4, 1646, 6461, 1183, 1, 2, 1172, 5, 25, 582, 28, 4, 4187, 2, 820, 5, 34807, 22578, 7697, 16, 6305, 2982, 6715, 0, 577, 23, 98, 6305, 2982, 3, 3714, 6305, 2982, 6715, 7, 6, 9518, 5, 1423, 11420, 965, 27, 6, 497, 1814, 1, 13125, 2044, 2421, 3, 1080, 10018, 82, 167, 4, 883, 14041, 2873, 11, 34807, 7697, 1, 6, 654, 5, 4099, 1423, 11420, 965, 27, 2, 24312, 1625, 11420, 1814, 29, 79, 745, 0, 41, 5, 86, 2673, 124, 34807, 5962, 9, 3669, 1084, 23, 325, 63, 8282, 2982, 18, 6305, 0, 3, 2777, 965, 5, 86, 76, 177

In [0]:
# Sanity: convert the indexes for x_ind[0] back to words
#? TODO
" ".join([id2word[id] for id in x_ind[0]])

'<START> <UNK> as a modifier of chronic pseudomonas aeruginosa infection in cystic fibrosis pseudomonas aeruginosa ( pa ) infection in cystic fibrosis ( cf ) patients is associated with worse long - term pulmonary disease and shorter survival , and chronic pa infection ( cpa ) is associated with reduced lung function , faster rate of lung decline , increased rates of exacerbations and shorter survival . by using exome sequencing and extreme phenotype design , it was recently shown that isoforms of dynactin 4 ( <UNK> ) may influence pa infection in cf , leading to worse respiratory disease . the purpose of this study was to investigate the role of <UNK> missense variants on pa infection incidence , age at first pa infection and chronic pa infection incidence in a cohort of adult cf patients from a single centre . polymerase chain reaction and direct sequencing were used to screen dna samples for <UNK> variants . a total of 121 adult cf patients from the cochin hospital cf centre have be

## Flatten x_ind

Concatentate all samples into one flat list

In [0]:
# Flatten x
x_ind_flat = [token for sublist in x_ind for token in sublist] #? TODO

# sanity 
print(len(x_ind_flat))

1214137


## Write the batchify function

The function takes a flat `x` dataset and batch_size. It should output a matrix where `batch_size` is number of culumns.

In [0]:
def batchify(x, batch_size):
    nbatch = len(x) // batch_size
    
    # Trim off any extra elements that wouldn't cleanly fit.
    x = x[:nbatch * batch_size]
    
    # Convert x to tensor of type long
    x = torch.tensor(x, dtype=torch.long)
    
    # Evenly divide the data across the batches. Transpose is needed to keep 
    #the data sequential over rows. 
    x = x.view(batch_size, -1).t() #? TODO
    
    return x

## Split the dataset into train/test 

---
Notes:
- Usually we would have train/dev/test but as this is mainly for demonstration we skip the `dev` set

In [0]:
# Get the proportion for train/test
end_train = int(len(x_ind_flat) * 0.9)

x_train = batchify(x_ind_flat[:end_train], BATCH_SIZE)
x_test = batchify(x_ind_flat[end_train:], BATCH_SIZE)

# Sanity
print(x_train.shape, x_test.shape)

torch.Size([17073, 64]) torch.Size([1897, 64])


## Write the get_batch function

It takes:
- `source` - that is the training/test set  
- `i` - that denotes where we are in the source dataset, which is in fact the batch number.

In [0]:
def get_batch(source, i, device):
  # Sequence length is either the MAX_SEQ_LEN or smaller if the dataset/current_batch
  #does not have sequences of length MAX_SEQ_LEN
  seq_len = min(MAX_SEQ_LEN, len(source) - 1 - i)
  
  # Get the data and targets
  data = source[i:i+seq_len]
  # Targets are flat
  target =  source[i+1:i+seq_len+1] #? TODO
  
  # Move to device and return
  return data.to(device).contiguous(), target.to(device).contiguous().view(-1)

In [0]:
embeddings.size()

torch.Size([34809, 300])

## Create the network

---
Notes:
- We need to add the init_hidden function

In [0]:
class RNN(nn.Module):
  def __init__(self, embeddings, padding_idx):
    super(RNN, self).__init__()
    # Get the required sizes
    self.vocab_size = len(embeddings)
    self.embedding_size = len(embeddings[0])
    
    # Initialize embeddings
    self.embeddings = nn.Embedding(self.vocab_size, self.embedding_size, padding_idx=padding_idx)
    self.embeddings.load_state_dict({'weight': embeddings})

    # Disable training for the embeddings - IMPORTANT
    self.embeddings.weight.requires_grad = False
    
    self.num_layers = 2
    self.hidden_size = 300
    self.dropout = 0.5
    


    # Create the RNN cell
    self.rnn = nn.LSTM(input_size=self.embedding_size, 
                       hidden_size=self.hidden_size, 
                       num_layers=self.num_layers, 
                       dropout=self.dropout)
    
    # Create the FC layer which is in fact the decoder in this case
    self.fc1 = nn.Linear(self.hidden_size, self.vocab_size) #? TODO

  def init_hidden(self, batch_size):
    # Initialize the hidden for the case when we are at batch 0
    weight = next(self.parameters())
    
    return (weight.new_zeros(self.num_layers, batch_size, self.hidden_size),
            weight.new_zeros(self.num_layers, batch_size, self.hidden_size))

  def forward(self, x, hidden):
    # Embed the input: from id -> vec
    x = self.embeddings(x) # x.shape = batch_size x sequence_length x emb_size
    
    # Run 'x' through the RNN, but also provide hidden init
    x, hidden = self.rnn(x, hidden) #? TODO

    # Push x through the fc network
    x = self.fc1(x)
    return x, hidden

## We need one additional function, detach hidden states

As for each batch we provide the hidden states from the previous batch, if we do not detach them pytorch will try to backpropagate the error to the beginning of the dataset.

In [0]:
def repackage_hidden(h):
  """Wraps hidden states in new Tensors, to detach them from their history."""

  if isinstance(h, torch.Tensor):
    return h.detach()
  else:
    return tuple(repackage_hidden(v) for v in h) # Recursive calls

## Instantiate the device, network, criterion and optimizer

In [0]:
device = torch.device(DEVICE) # Create a torch device
net = RNN(embeddings, padding_idx=word2id['<PAD>']) # Create an instance of the RNN, take care what input parameters does it require
criterion = nn.CrossEntropyLoss() # Set the criterion to Cross Entropy Loss
parameters = filter(lambda p: p.requires_grad, net.parameters()) # Get only the parameters that require training
optimizer = optim.Adam(parameters, lr=0.001) # Set the optimizer to Adam with lr = 0.001
net.to(device) # Move the network to device

RNN(
  (embeddings): Embedding(34809, 300, padding_idx=34808)
  (rnn): LSTM(300, 300, num_layers=2, dropout=0.5)
  (fc1): Linear(in_features=300, out_features=34809, bias=True)
)

In [0]:
for epoch in range(20):
  # Switch network to train mode
  net.train()

  # Create the running loss array
  running_loss = []

  # Get the initial values for the hidden layer of RNNs
  hidden = net.init_hidden(BATCH_SIZE) #?
  for i in range(0, x_train.size(0) - 1, MAX_SEQ_LEN):
    x_train_batch, y_train_batch = get_batch(x_train, i, device) #? TODO: get batch

    # zero gradients
    optimizer.zero_grad()
    # Get outputs for our batch
    outputs, hidden = net(x_train_batch, hidden) #? TODO

    # Repackage hidden
    hidden = repackage_hidden(hidden)

    # Get loss
    loss = criterion(outputs.view(-1, len(embeddings)), y_train_batch)
    # Do the backward step
    loss.backward()
    
    # Clip grads, so we dont have exploding gradients
    #? TODO
    parameters = filter(lambda p: p.requires_grad, net.parameters())
    torch.nn.utils.clip_grad_norm_(parameters, 0.25)
    
    # Do the optimizer step
    optimizer.step()

    # Add the loss to the running_loss
    running_loss.append(loss.item())
  # For now we only print the train loss - to speed things up, but you would
  #usually have the test/dev loss also here.
  print("LOSS: {}\n\n".format(np.average(running_loss)))
print('Finished Training')

LOSS: 6.531473404072743


LOSS: 5.584591087118807


LOSS: 5.182388572397989


LOSS: 4.9051427402214856


LOSS: 4.6919456382746105


LOSS: 4.516093103297563


LOSS: 4.376480452878662


LOSS: 4.260773102665818


LOSS: 4.156516642949151


LOSS: 4.06878418748176




KeyboardInterrupt: ignored

## Let's generate a sample from our model

In [0]:
hidden = net.init_hidden(1)
temp = 1 # The higher the temp the higher the diversity in the output

text = "<START> test" # Initialise the hidden state
inds = [word2id[w] for w in text.split(" ")]
for i in range(len(inds)):
  input = torch.tensor([[inds[i]]], dtype=torch.long).to(device)
  output, hidden = net(input, hidden)

out = ""
for i in range(100):
  # Get word weights from the output
  word_weights = output.squeeze().data.div(temp).exp().cpu()
  # Sample one word from multinomial
  word_idx = torch.multinomial(word_weights, 1)[0]
  # Replace existing word with the the new one
  input.data.fill_(word_idx)
  # Convert index to word
  word = id2word[int(word_idx.detach().cpu().numpy())]
  # Concat to output
  out += word + " "
  output, hidden = net(input, hidden)
print(out)

and changes in outcomes in patients assigned to <UNK> cognitions in pune , global symptom of workers who assessed during life care staff th may be a major number of performing available and relative interventions . <UNK> <START> antiproliferative effects of berberine on specific heating from diameter , organic wastes , chloride dehydrogenase is already acquired through respiratory and inflammatory disorders and quantified on relationship , autonomy , developmental patterns , and functional mechanisms ; extremely physiologic conditions do not alter response by movement - emotional symptomatology risks . we possesses a experimental knowledge using the diagnosis of environmental responses 
