<a href="https://colab.research.google.com/github/daveDoesData/IS7033/blob/master/IS7033_NLPTransferLearning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro



This is my first attempt at transfer learning with NLP.
At this time, I am able to get ELMo's word embeddings out of the final layer of the ELMo network.
Once I get over the shape hurdle (reading the paper and examples again), these embeddings will feed into a simple one-layer neural network built with PyTorch to classify text documents.

I have included a visual representation of the notebook's current state below:

![alt text](https://media.giphy.com/media/DEEG4drtolbGM/giphy.gif)

### So, What is ELMo (Besides Adorable)?

{Better description to be added}

ELMo is one of the newer word embedding options available to researchers and practitioners alike. Produced by the Allen Institute for AI (the group behind AllenNLP), ELMo goes beyond traditional word embeddings because, with ELMo, context matters: it cannot provide a word embedding without seeing the whole sentence. Unfortunately for ELMo, BERT showed up shortly afterwards and stole the spotlight. However, ELMo is a sensible place to start when learning deep word representation models, since it uses more "traditional" LSTMs instead of transformers like BERT.

![https://jalammar.github.io/illustrated-bert/](https://jalammar.github.io/images/elmo-forward-backward-language-model-embedding.png)

![https://jalammar.github.io/illustrated-bert/](https://jalammar.github.io/images/elmo-embedding.png)
source: https://jalammar.github.io/illustrated-bert/

# Development

To get started, let's install and import the extra goodies we'll need beyond Colab's starter set.

In [0]:
%%capture
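# Assumption: allennlp.commands.elmo (used below) only exists in pre-1.0 releases, hence the version bound.
!pip install "allennlp<1.0"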

In [0]:
import numpy as np
import pandas as pd
import re
from sklearn.preprocessing import LabelEncoder
from torch.autograd import Variable
import torch
import torch.nn as nn
import torch.nn.functional as F
from allennlp.commands.elmo import ElmoEmbedder, batch_to_ids
torch.cuda.get_device_name(0)

'Tesla K80'

# Stance Data Set

In [0]:
%%bash
wget https://saifmohammad.com/WebDocs/stance-data-all-annotations.zip

--2019-04-02 02:02:28--  https://saifmohammad.com/WebDocs/stance-data-all-annotations.zip
Resolving saifmohammad.com (saifmohammad.com)... 192.185.17.122
Connecting to saifmohammad.com (saifmohammad.com)|192.185.17.122|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 320467 (313K) [application/zip]
Saving to: ‘stance-data-all-annotations.zip’

     0K .......... .......... .......... .......... .......... 15% 1.59M 0s
    50K .......... .......... .......... .......... .......... 31% 1.59M 0s
   100K .......... .......... .......... .......... .......... 47% 77.7M 0s
   150K .......... .......... .......... .......... .......... 63% 1.62M 0s
   200K .......... .......... .......... .......... .......... 79% 66.5M 0s
   250K .......... .......... .......... .......... .......... 95%  126M 0s
   300K .......... ..                                         100% 77.7M=0.09s

2019-04-02 02:02:28 (3.27 MB/s) - ‘stance-data-all-annotations.zip’ saved [320467/320467]



In [0]:
%%bash
unzip stance-data-all-annotations.zip

Archive:  stance-data-all-annotations.zip
   creating: data-all-annotations/
  inflating: data-all-annotations/.DS_Store  
   creating: __MACOSX/
   creating: __MACOSX/data-all-annotations/
  inflating: __MACOSX/data-all-annotations/._.DS_Store  
  inflating: data-all-annotations/readme.txt  
  inflating: data-all-annotations/testdata-taskA-all-annotations.txt  
  inflating: __MACOSX/data-all-annotations/._testdata-taskA-all-annotations.txt  
  inflating: data-all-annotations/testdata-taskA-ids.txt  
  inflating: __MACOSX/data-all-annotations/._testdata-taskA-ids.txt  
  inflating: data-all-annotations/testdata-taskB-all-annotations.txt  
  inflating: __MACOSX/data-all-annotations/._testdata-taskB-all-annotations.txt  
  inflating: data-all-annotations/testdata-taskB-ids.txt  
  inflating: __MACOSX/data-all-annotations/._testdata-taskB-ids.txt  
  inflating: data-all-annotations/trainingdata-all-annotations.txt  
  inflating: __MACOSX/data-all-annotations/._trainingdata-all-annotations

In [0]:
trainEvalCombo = pd.read_csv('data-all-annotations/trainingdata-all-annotations.txt', delimiter='\t', header=0, encoding = 'latin-1')
testTaskA = pd.read_csv('data-all-annotations/testdata-taskA-all-annotations.txt', delimiter='\t', header=0, encoding = 'latin-1')
testTaskB = pd.read_csv('data-all-annotations/testdata-taskB-all-annotations.txt', delimiter='\t', header=0, encoding = 'latin-1')

In [0]:
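def clean_tweets(text):
    """Keep ASCII letters, digits, and spaces, then lowercase the tweet."""
    ascii_only = ''.join(ch for ch in text if ord(ch) < 128)  # drop non-ASCII characters
    alnum_only = re.sub(r'[^a-zA-Z0-9 ]', '', ascii_only)  # drop punctuation such as '#' and '@'
    return alnum_only.lower()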

trainEvalCombo['Tweet'] = trainEvalCombo['Tweet'].apply(clean_tweets)
testTaskA['Tweet'] = testTaskA['Tweet'].apply(clean_tweets)
testTaskB['Tweet'] = testTaskB['Tweet'].apply(clean_tweets)
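
As a quick sanity check, the sketch below applies the same cleaning steps to a made-up tweet (the sample string is an assumption, not a row from the data set). It shows that punctuation is deleted rather than replaced with a space, which is presumably why a hashtag like `#SemST` ends up as the plain token `semst` in the tokenized training data.

```python
import re

def clean_tweets(text):
    """Keep ASCII letters, digits, and spaces, then lowercase."""
    ascii_only = ''.join(ch for ch in text if ord(ch) < 128)
    alnum_only = re.sub(r'[^a-zA-Z0-9 ]', '', ascii_only)
    return alnum_only.lower()

sample = "Dear Lord, thank u! #SemST"  # hypothetical tweet for illustration
cleaned = clean_tweets(sample)
print(cleaned)          # dear lord thank u semst
print(cleaned.split())  # ['dear', 'lord', 'thank', 'u', 'semst']
```

One side effect worth keeping in mind: because characters are removed outright, contractions and hyphenated words are fused (`can't` becomes `cant`).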

In [0]:
np.random.seed(1776)
# Boolean mask: each row has an ~80% chance of landing in the training split
splitter = np.random.rand(len(trainEvalCombo)) < 0.8
trainDF = trainEvalCombo[splitter]
evalDF = trainEvalCombo[~splitter]
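
The mask-based split above draws one uniform random number per row, so the 80/20 ratio holds only approximately. A minimal sketch on toy data (the row count and the `rows` array are made up and stand in for DataFrame rows) makes that visible and checks that the two parts are disjoint and together cover every row.

```python
import numpy as np

np.random.seed(1776)                 # same seed as the notebook
n_rows = 1000                        # hypothetical data set size
mask = np.random.rand(n_rows) < 0.8  # True -> training row, False -> evaluation row

rows = np.arange(n_rows)             # stand-in for the DataFrame rows
train_rows, eval_rows = rows[mask], rows[~mask]

# Sizes are close to 800 / 200 but not guaranteed to hit them exactly
print(len(train_rows), len(eval_rows))
```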

In [0]:
trainX_words = [words.split() for words in trainDF['Tweet']] 
evalX_words = [words.split() for words in evalDF['Tweet']]

In [0]:
trainY_string = trainDF['Stance']
convert_label_to_int = LabelEncoder()
trainY = convert_label_to_int.fit_transform(trainY_string)
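
`LabelEncoder` assigns integers in sorted order of the class names, which is why `AGAINST` maps to 0 and `FAVOR` to 1 in the output below. The following sketch, using small made-up label lists, shows how to inspect the mapping and how evaluation labels could be encoded with the already fitted encoder (`transform`, not `fit_transform`) so that both splits share the same mapping. `NONE` is assumed here as the third stance class.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["AGAINST", "FAVOR", "NONE", "AGAINST"]  # made-up examples
eval_labels = ["FAVOR", "NONE"]                          # made-up examples

encoder = LabelEncoder()
train_y = encoder.fit_transform(train_labels)  # learn the mapping on the training labels
eval_y = encoder.transform(eval_labels)        # reuse the same mapping for evaluation

print(encoder.classes_.tolist())               # ['AGAINST', 'FAVOR', 'NONE']
print(train_y.tolist(), eval_y.tolist())       # [0, 1, 2, 0] [1, 2]
print(encoder.inverse_transform([0, 1, 2]))    # back from integers to label strings
```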

In [0]:
print(trainY[:10])

[0 0 0 0 0 0 0 1 1 0]


In [0]:
print(trainY_string[:10])

0     AGAINST
1     AGAINST
2     AGAINST
3     AGAINST
4     AGAINST
5     AGAINST
6     AGAINST
7       FAVOR
10      FAVOR
11    AGAINST
Name: Stance, dtype: object


In [0]:
print(trainX_words[0])

['dear', 'lord', 'thank', 'u', 'for', 'all', 'of', 'ur', 'blessings', 'forgive', 'my', 'sins', 'lord', 'give', 'me', 'strength', 'and', 'energy', 'for', 'this', 'busy', 'day', 'ahead', 'blessed', 'hope', 'semst']


# ELMo

### Using ELMo interactively
You can use ELMo interactively (or programmatically) with iPython. The allennlp.commands.elmo.ElmoEmbedder class provides the easiest way to process one or many sentences with ELMo, but it returns numpy arrays so it is meant for use as a standalone command and not within a larger model. For example, if you would like to learn a weighted average of the ELMo vectors then you need to use allennlp.modules.elmo.Elmo instead.

The ElmoEmbedder class returns three vectors for each word, each vector corresponding to a layer in the ELMo LSTM output. The first layer corresponds to the context insensitive token representation, followed by the two LSTM layers. See the ELMo paper or follow-up work at EMNLP 2018 for a description of what types of information are captured in each layer.

source: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md
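
To make the layer layout concrete, here is a small sketch that uses a random array as a stand-in for the output of `embed_sentence` (no ELMo weights are loaded; the `(3, num_tokens, 1024)` layout is the one described in this section). It shows how to pick one layer, pick one token, and average the three layers with equal weights, which is only one simple way to combine them.

```python
import numpy as np

tokens = ["I", "ate", "an", "apple", "for", "breakfast"]
rng = np.random.default_rng(0)
# Stand-in for elmo.embed_sentence(tokens): (layers, tokens, embedding_dim)
vectors = rng.standard_normal((3, len(tokens), 1024)).astype(np.float32)

token_layer = vectors[0]           # context-insensitive token representation
top_lstm_layer = vectors[2]        # second (last) LSTM layer
apple_top = vectors[2][3]          # last-layer vector for "apple"
layer_mean = vectors.mean(axis=0)  # equal-weight average over the three layers

print(top_lstm_layer.shape)  # (6, 1024)
print(apple_top.shape)       # (1024,)
print(layer_mean.shape)      # (6, 1024)
```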

The ``elmo`` subcommand allows you to make bulk ELMo predictions.
Given a pre-processed input text file, this command outputs the internal
layers used to compute ELMo representations to a single (potentially large) file.
The input file is previously tokenized, whitespace separated text, one sentence per line.
The output is a hdf5 file (<http://docs.h5py.org/en/latest/>) where, with the --all flag, each
sentence is a size (3, num_tokens, 1024) array with the biLM representations.
For information, see "Deep contextualized word representations", Peters et al 2018.
https://arxiv.org/abs/1802.05365

source: https://github.com/allenai/allennlp/blob/master/allennlp/commands/elmo.py


In [0]:
elmo = ElmoEmbedder()
tokens = ["I", "ate", "an", "apple", "for", "breakfast"]
vectors = elmo.embed_sentence(tokens)

assert len(vectors) == 3  # one for each layer in the ELMo output
assert len(vectors[0]) == len(tokens)  # the vector elements correspond with the input tokens

# Import the function directly; a bare "import scipy" does not reliably load scipy.spatial
from scipy.spatial.distance import cosine
vectors2 = elmo.embed_sentence(["I", "ate", "a", "carrot", "for", "breakfast"])
cosine(vectors[2][3], vectors2[2][3])  # cosine distance between "apple" and "carrot" in the last layer

0.18020617961883545
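
For reference, cosine distance is one minus the cosine similarity, so 0 means the two vectors point the same way and 1 means they are orthogonal. The plain-Python sketch below (the helper name `cosine_distance` is made up here) reproduces that definition on toy vectors; for real embeddings the SciPy function is the more practical choice.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cos(angle between u and v) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))  # 1.0 (orthogonal)
```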

In [0]:
elmo = ElmoEmbedder(
    options_file="https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json",
    weight_file="https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5",
    cuda_device=0,
)
# Batched alternative (pass the list of token lists directly and materialize the generator):
# trainX_vectors = list(elmo.embed_sentences(trainX_words, batch_size=100))
trainX_vectors = [elmo.embed_sentence(tokens) for tokens in trainX_words]  # one (3, num_tokens, 1024) array per tweet
trainX_ELMo_vectors_last_layer = [vectors[2] for vectors in trainX_vectors]  # keep only the last LSTM layer

AllenNLP's `elmo.embed_sentences` function is lazy: it returns a generator rather than a list of numpy arrays, so the result has to be consumed (for example with `list(...)`) before it can be indexed or passed to `len`.
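
The practical consequence of a lazy generator is easy to reproduce without ELMo. In the sketch below, `fake_embed_sentences` is a hypothetical stand-in that yields one placeholder per sentence the way a generator-based API would; it is not AllenNLP code.

```python
def fake_embed_sentences(sentences):
    """Hypothetical lazy embedder: yields one result per sentence."""
    for sentence in sentences:
        yield [len(word) for word in sentence]  # placeholder "embedding"

sentences = [["i", "ate", "an", "apple"], ["hello", "world"]]

lazy = fake_embed_sentences(sentences)
try:
    len(lazy)
except TypeError as err:
    print("generator has no len():", err)

embedded = list(lazy)  # consuming the generator does the actual work
print(len(embedded))   # 2
print(embedded[0])     # [1, 3, 2, 5]
```

Note that a generator can be consumed only once, so the materialized list is the thing to keep around.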

In [0]:
len(trainX_ELMo_vectors_last_layer[0])

26

In [0]:
trainX_ELMo_vectors_last_layer[0]


array([[ 0.17908666, -1.574388  ,  0.9519706 , ...,  0.19991188,
         0.41037235,  0.92734903],
       [ 0.44663948, -0.6578803 ,  0.53522146, ...,  0.29883933,
         2.2307096 ,  0.03350666],
       [ 1.4313195 , -1.4037013 , -0.43510285, ...,  0.11510789,
         1.6071985 , -0.720706  ],
       ...,
       [ 0.13420373, -0.34323138,  0.5758524 , ..., -0.19494513,
         0.4945613 ,  0.19315866],
       [-0.0363926 , -0.4113907 ,  0.19109306, ...,  0.32084963,
         1.2030481 , -0.11308239],
       [-0.13918735, -0.16288519,  0.682714  , ...,  0.11081195,
         0.28356946,  0.31483012]], dtype=float32)

In [0]:
len(trainX_ELMo_vectors_last_layer)

2252

In [0]:
#import torch.utils.data as utils

#tensor_x = []
#for sentence in trainX_ELMo_vectors_last_layer:
#  sent_output_embed = torch.stack([torch.Tensor(i) for i in sentence])
#  tensor_x.append(sent_output_embed)
#tensor_y = torch.from_numpy(trainY)

#my_dataset = utils.TensorDataset(tensor_x,tensor_y) # create your dataset (would fail as written: tensor_x is a list of variable-length tensors, not a single tensor)
#my_dataloader = utils.DataLoader(my_dataset) # create your dataloader
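
One possible way past the shape hurdle, sketched here under assumptions rather than taken from the notebook: each tweet yields a `(num_tokens, 1024)` array with a different `num_tokens`, so the arrays cannot be stacked directly. Two common options are to average over tokens, giving one fixed-size vector per tweet, or to zero-pad every tweet to the longest length and keep a mask of the real tokens. The random arrays below stand in for `trainX_ELMo_vectors_last_layer`, and the helper names `mean_pool` and `pad_sentences` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1024
# Stand-ins for per-tweet last-layer ELMo outputs with 26, 9, and 14 tokens
sentence_vectors = [rng.standard_normal((n, dim)).astype(np.float32) for n in (26, 9, 14)]

def mean_pool(vectors):
    """Average over tokens: list of (num_tokens, dim) -> (num_sentences, dim)."""
    return np.stack([v.mean(axis=0) for v in vectors])

def pad_sentences(vectors):
    """Zero-pad to the longest sentence and return (padded, mask)."""
    max_len = max(v.shape[0] for v in vectors)
    padded = np.zeros((len(vectors), max_len, vectors[0].shape[1]), dtype=np.float32)
    mask = np.zeros((len(vectors), max_len), dtype=bool)
    for i, v in enumerate(vectors):
        padded[i, : v.shape[0]] = v
        mask[i, : v.shape[0]] = True  # True marks real tokens, False marks padding
    return padded, mask

pooled = mean_pool(sentence_vectors)
padded, mask = pad_sentences(sentence_vectors)
print(pooled.shape)                # (3, 1024)
print(padded.shape)                # (3, 26, 1024)
print(mask.sum(axis=1).tolist())   # [26, 9, 14]
```

Either result is a single array whose first dimension equals the number of tweets, so converting it with `torch.from_numpy` should give a tensor that lines up with the label tensor, which is what `TensorDataset` requires. Mean pooling is the simpler fit for a one-layer classifier; padding keeps token order for models that need it.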