## Transfer learning

2018 showed us that language modelling is a good task for training powerful text representations. There are two ways to use these representations: **feature extraction** and **fine-tuning**.

  * One great example of feature extraction is ELMo ([allennlp tutorial](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md), [tf_hub example](https://tfhub.dev/google/elmo/2), [deeppavlov documentation](http://docs.deeppavlov.ai/en/master/apiref/models/embedders.html?highlight=elmo#deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder))
  * One great example of fine-tuning is ULMFiT ([fastai lesson](https://course.fast.ai/videos/?lesson=4), [example notebook](https://nbviewer.jupyter.org/github/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb))
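
In practice, the difference between the two approaches comes down to which parameters receive gradient updates. The sketch below illustrates this in PyTorch with a toy encoder. `TinyEncoder` and `build_classifier` are hypothetical names introduced only for this illustration; they stand in for a pretrained model and are not ELMo or ULMFiT.

```python
import torch.nn as nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for a pretrained language-model encoder."""

    def __init__(self, vocab_size=100, hidden=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, token_ids):
        out, _ = self.rnn(self.embed(token_ids))
        return out[:, -1]  # last time step as the text representation


def build_classifier(mode, n_classes=2, hidden=16):
    """Return (model, number of trainable parameters) for a transfer mode."""
    encoder = TinyEncoder(hidden=hidden)
    if mode == "feature_extraction":
        # Freeze the encoder: only the new classification head is trained.
        for p in encoder.parameters():
            p.requires_grad = False
    elif mode != "fine_tuning":
        raise ValueError("unknown mode: {}".format(mode))
    model = nn.Sequential(encoder, nn.Linear(hidden, n_classes))
    n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return model, n_trainable


_, n_frozen = build_classifier("feature_extraction")
_, n_full = build_classifier("fine_tuning")
print("trainable parameters (feature extraction):", n_frozen)
print("trainable parameters (fine-tuning):", n_full)
```

With feature extraction only the head's weights are trainable, so training is cheaper but the representations cannot adapt to the task. Fine-tuning updates everything.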

What should you do?

  * Apply ELMo to build a named entity recognition system. You can use the [CoNLL 2003 dataset (en)](http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz), the [Persons1000 dataset (ru)](http://labinform.ru/pub/named_entities/descr_ne.htm), or any other dataset.
  * Apply ULMFiT to build a text classifier (any dataset except IMDB)
  * Apply ELMo to build a text classifier (on the same dataset)
  * Play with various models and hyperparameters
  * Compare results


**Results of this task:**
  * NER model
  * Two classification models
  * For each model:
    * metrics on the test set (quantitative evaluation)
    * successful and _unsuccessful_ examples (qualitative evaluation)
    * plots showing that the model is training
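
For the NER model, token-level accuracy can look high simply because most tokens are `O`, so an entity-level score is often more informative. Here is a minimal sketch of entity-level precision, recall and F1, assuming BIO-style tags such as `B-geo` / `I-geo`. The helper names are hypothetical, and stray `I-` tags that have no preceding `B-` are ignored, which is one possible convention among several.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities, start, etype = set(), None, None
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_seqs, pred_seqs):
    """Entity-level precision, recall and F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = extract_entities(gold), extract_entities(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = [["O", "B-geo", "I-geo", "O", "B-per"]]
pred = [["O", "B-geo", "I-geo", "O", "O"]]
precision, recall, f1 = entity_f1(gold, pred)
print(precision, recall, round(f1, 3))
```

An entity counts as correct only when its boundaries and type both match, so a partially recognised name counts as a miss.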


**Additional points:**
  * Early stopping
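
Early stopping can be sketched as a small framework-independent helper that watches the validation loss. The class name, `patience` and `min_delta` below are illustrative choices, not part of the assignment.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, only to show the control flow.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.50]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at, "with best loss", stopper.best)
```

In a real training loop you would also save the model weights whenever `best` improves, so that the best checkpoint can be restored after stopping.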

In [119]:
import csv
import random
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sentencepiece as spm
import tensorflow as tf
import tensorflow_hub as hub
import torch
import torch.nn as nn
import torch.nn.functional as func
import torch.utils.data as utils
from keras import backend as K
from keras.preprocessing.sequence import pad_sequences  # used below to pad the tag sequences
from sklearn.model_selection import train_test_split
from torch.autograd import Variable

In [94]:
USE_GPU = True

dtype = torch.float32

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print('using device:', device)

using device: cpu


# NER

Read the training corpus.

In [95]:
data = pd.read_csv("entity-annotated-corpus/ner_dataset.csv", encoding="latin1")
# Forward-fill missing values so that every row carries its "Sentence #" label
data = data.ffill()
data.head(10)

| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | Sentence: 1 | of | IN | O |
| 2 | Sentence: 1 | demonstrators | NNS | O |
| 3 | Sentence: 1 | have | VBP | O |
| 4 | Sentence: 1 | marched | VBN | O |
| 5 | Sentence: 1 | through | IN | O |
| 6 | Sentence: 1 | London | NNP | B-geo |
| 7 | Sentence: 1 | to | TO | O |
| 8 | Sentence: 1 | protest | VB | O |
| 9 | Sentence: 1 | the | DT | O |


In [96]:
words = list(set(data["Word"].values))
words.append("ENDPAD")
print("Number of words: {}".format(len(words)))
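# Sort the tags so the tag-to-index mapping is identical on every run (set order is not stable)
tags = sorted(set(data["Tag"].values))
print("Number of tags: {}".format(len(tags)))
tag2idx = {t: i for i, t in enumerate(tags)}  # map each tag to an integer index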

Number of words: 35179
Number of tags: 17
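
To inspect predictions later (for the qualitative evaluation), the tag-to-index mapping has to be inverted. A minimal sketch on a toy tag list follows. The names `toy_tags`, `toy_tag2idx`, `idx2tag` and `decode` are assumptions made for illustration, and the five tags are only a subset of the 17 in the dataset.

```python
# Toy subset of tags; the notebook builds the real list from data["Tag"].
toy_tags = sorted({"O", "B-geo", "I-geo", "B-per", "I-per"})
toy_tag2idx = {t: i for i, t in enumerate(toy_tags)}
idx2tag = {i: t for t, i in toy_tag2idx.items()}  # inverse mapping


def decode(indices):
    """Convert a sequence of tag indices back to tag strings."""
    return [idx2tag[i] for i in indices]


encoded = [toy_tag2idx[t] for t in ["O", "B-geo", "I-geo", "O"]]
print(encoded, "->", decode(encoded))
```

Building the tag list with `sorted` rather than taking the set order directly keeps the indices stable between interpreter sessions, which matters if a trained model is saved and reloaded.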


Class for collecting each sentence together with its POS and NER tags

In [97]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:  # no sentence with this number
            return None

In [98]:
getter = SentenceGetter(data)
sent = getter.get_next()

Get all sentences

In [99]:
sentences = getter.sentences
print("Number of sentences: {}".format(len(sentences)))
print("Max len of sentence: {}".format(max([len(i) for i in sentences])))

Number of sentences: 47959
Max len of sentence: 104


We need to choose `max_len`, the fixed length to which every sentence is truncated or padded before it is fed to the network

In [100]:
max_len = 50
X = [[w[0] for w in s] for s in sentences]
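
One way to sanity-check a choice of `max_len` is to measure how many sentences fit without truncation. The sketch below uses made-up lengths. In the notebook, the lengths would come from `[len(s) for s in sentences]`. The function name `coverage` is hypothetical.

```python
def coverage(lengths, max_len):
    """Fraction of sequences whose length is at most max_len."""
    return sum(1 for n in lengths if n <= max_len) / len(lengths)


# Made-up sentence lengths, for illustration only.
toy_lengths = [5, 12, 20, 22, 31, 48, 50, 63, 104]
for candidate in (30, 50, 104):
    print(candidate, round(coverage(toy_lengths, candidate), 2))
```

A value that covers the large majority of sentences keeps most of the data intact while avoiding the cost of padding everything to the longest sentence.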

In [102]:
# Truncate each sentence to max_len tokens and right-pad shorter ones with "__PAD__"
X = [seq[:max_len] + ["__PAD__"] * (max_len - len(seq)) for seq in X]

Convert tags to indices

In [110]:
y = [[tag2idx[w[2]] for w in s] for s in sentences]
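# Truncate at the end (the default is the start) so the tags stay aligned with the tokens in X
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", truncating="post", value=tag2idx["O"])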
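
Tokens and tags must be cut and padded on the same side. Otherwise sentences longer than `max_len` end up with labels that belong to different words. Keras `pad_sequences` truncates from the front by default (`truncating="pre"`), whereas the tokens above keep the first `max_len` words. The following pure-Python sketch handles both sequences together and also returns a mask marking real tokens. `pad_pair` and its defaults are hypothetical, with `pad_tag=0` standing in for `tag2idx["O"]`.

```python
def pad_pair(tokens, tag_ids, max_len, pad_token="__PAD__", pad_tag=0):
    """Truncate/pad tokens and tags identically; also return a 1/0 mask."""
    tokens = list(tokens)[:max_len]
    tag_ids = list(tag_ids)[:max_len]
    n_real = len(tokens)
    n_pad = max_len - n_real
    mask = [1] * n_real + [0] * n_pad  # 1 = real token, 0 = padding
    return tokens + [pad_token] * n_pad, tag_ids + [pad_tag] * n_pad, mask


short = pad_pair(["London", "calling"], [3, 0], max_len=4)
long_ = pad_pair(list("abcdef"), [1, 2, 3, 4, 5, 6], max_len=4)
print(short)
print(long_)
```

The mask is useful when computing the loss or metrics, so that padded positions labelled `O` do not inflate the scores.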

In [112]:
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=2018)

In [115]:
batch_size = 32
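
The batch size determines how the training data is sliced on each step. A minimal batching generator might look like the sketch below. `iter_batches` and `drop_last` are illustrative names. Whether an incomplete final batch has to be dropped depends on the model, and is an assumption here rather than something the notebook states.

```python
def iter_batches(X, y, batch_size, drop_last=False):
    """Yield consecutive (X_batch, y_batch) slices of size batch_size."""
    for start in range(0, len(X), batch_size):
        xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]
        if drop_last and len(xb) < batch_size:
            break  # skip the incomplete final batch
        yield xb, yb


toy_X = list(range(10))
toy_y = [v % 2 for v in toy_X]
sizes_all = [len(xb) for xb, _ in iter_batches(toy_X, toy_y, 4)]
sizes_full = [len(xb) for xb, _ in iter_batches(toy_X, toy_y, 4, drop_last=True)]
print(sizes_all, sizes_full)
```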

In [120]:
sess = tf.Session()
K.set_session(sess)