## Transfer learning

2018 showed us that language modelling is a good task for training powerful text representations. There are two different ways to use these representations: **feature extraction** and **fine-tuning**.

  * One great example of feature extraction is ELMo ([allennlp tutorial](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md), [tf_hub example](https://tfhub.dev/google/elmo/2), [deeppavlov documentation](http://docs.deeppavlov.ai/en/master/apiref/models/embedders.html?highlight=elmo#deeppavlov.models.embedders.elmo_embedder.ELMoEmbedder))
  * One great example of fine-tuning is ULMFiT ([fastai lesson](https://course.fast.ai/videos/?lesson=4), [example notebook](https://nbviewer.jupyter.org/github/fastai/course-v3/blob/master/nbs/dl1/lesson3-imdb.ipynb))
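
The practical difference between the two approaches is which parameters receive gradient updates. The sketch below illustrates this with a tiny stand-in encoder in PyTorch; `encoder`, `head`, and `trainable_parameters` are hypothetical names used only for illustration and are not ELMo or ULMFiT themselves.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained language-model encoder.
encoder = nn.Sequential(nn.Embedding(100, 16), nn.Linear(16, 16), nn.Tanh())
# Task-specific classifier trained on top of the encoder output.
head = nn.Linear(16, 2)


def trainable_parameters(encoder, head, fine_tune):
    """Return the parameters an optimizer should update."""
    for p in encoder.parameters():
        # Feature extraction freezes the encoder; fine-tuning updates it.
        p.requires_grad = fine_tune
    params = list(head.parameters())
    if fine_tune:
        params += list(encoder.parameters())
    return params


n_feature_extraction = sum(
    p.numel() for p in trainable_parameters(encoder, head, fine_tune=False)
)
n_fine_tuning = sum(
    p.numel() for p in trainable_parameters(encoder, head, fine_tune=True)
)
print("trainable (feature extraction):", n_feature_extraction)
print("trainable (fine-tuning):", n_fine_tuning)
```

With feature extraction only the small head is trained, so training is cheaper; fine-tuning adapts the whole network to the target task.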

What should you do?

  * Apply ELMo to build a named entity recognition system. You can use the [CONLL 2003 dataset (en)](http://files.deeppavlov.ai/deeppavlov_data/conll2003_v2.tar.gz), the [Persons1000 dataset (ru)](http://labinform.ru/pub/named_entities/descr_ne.htm), or any other dataset.
  * Apply ULMFiT to build a text classifier (any dataset except IMDB)
  * Apply ELMo to build a text classifier (on the same dataset)
  * Play with various models and hyperparameters
  * Compare results


**Results of this task:**
  * NER model
  * Two classification models
  * for each model:
    * metrics on the test set (quantitative evaluation)
    * successful and _unsuccessful_ examples (qualitative evaluation)
    * plots showing that the model is training
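
For the NER model, token-level accuracy can be misleading because most tokens carry the `O` tag. A common alternative is entity-level precision, recall and F1 over BIO spans. The following is a minimal sketch under that assumption; `extract_spans` and `span_f1` are hypothetical helpers, and the tag sequences are made up for illustration.

```python
def extract_spans(tags):
    """Return a set of (start, end, type) entity spans from BIO tags.

    A stray I- tag without a preceding matching B- tag is ignored.
    """
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        inside = tag.startswith("I-") and kind == tag[2:]
        if inside:
            continue
        if kind is not None:
            spans.add((start, i, kind))
        if tag.startswith("B-"):
            start, kind = i, tag[2:]
        else:
            start, kind = None, None
    return spans


def span_f1(true_tags, pred_tags):
    """Entity-level precision, recall and F1 for one tagged sequence."""
    true_spans = extract_spans(true_tags)
    pred_spans = extract_spans(pred_tags)
    tp = len(true_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(true_spans) if true_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the location is found, the person is mislabelled.
true_tags = ["O", "B-geo", "I-geo", "O", "B-per", "O"]
pred_tags = ["O", "B-geo", "I-geo", "O", "B-org", "O"]
precision, recall, f1 = span_f1(true_tags, pred_tags)
print(precision, recall, f1)
```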


**Additional points:**
  * Early stopping
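
Early stopping can be sketched independently of the training framework: track the best validation loss and stop once it has not improved for a fixed number of epochs. The `EarlyStopping` class, its `patience` and `min_delta` parameters, and the loss values below are illustrative assumptions, not part of the assignment.

```python
class EarlyStopping:
    """Signal a stop when the validation loss stops improving."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to wait without improvement
        self.min_delta = min_delta  # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, for illustration only.
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68, 0.60]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
print("stopped at epoch", stopped_at, "with best loss", stopper.best)
```

In a real training loop one would also save the model weights whenever `best` improves, so that the best checkpoint can be restored after stopping.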

In [2]:
import csv
import random
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sentencepiece as spm
import tensorflow as tf
import tensorflow_hub as hub
import torch
import torch.nn as nn
import torch.nn.functional as func
import torch.utils.data as utils
from keras import backend as K
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from torch.autograd import Variable

In [3]:
from torch.nn.utils.rnn import pad_sequence

ModuleNotFoundError: No module named 'allennlp'

In [94]:
USE_GPU = True

dtype = torch.float32

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')
print('using device:', device)

using device: cpu


# NER

Read the training corpus

In [7]:
data = pd.read_csv("entity-annotated-corpus/ner_dataset.csv", encoding="latin1")
# Forward-fill missing values (fillna(method="ffill") is deprecated in recent pandas)
data = data.ffill()
data.head(10)

Unnamed: 0,Sentence #,Word,POS,Tag
0,Sentence: 1,Thousands,NNS,O
1,Sentence: 1,of,IN,O
2,Sentence: 1,demonstrators,NNS,O
3,Sentence: 1,have,VBP,O
4,Sentence: 1,marched,VBN,O
5,Sentence: 1,through,IN,O
6,Sentence: 1,London,NNP,B-geo
7,Sentence: 1,to,TO,O
8,Sentence: 1,protest,VB,O
9,Sentence: 1,the,DT,O


In [8]:
words = list(set(data["Word"].values))
words.append("ENDPAD")
print("Number of words: {}".format(len(words)))
tags = list(set(data["Tag"].values))
print("Number of tags: {}".format(len(tags)))
tag2idx = {t: i for i, t in enumerate(tags)}  # map each tag to an integer index

Number of words: 35179
Number of tags: 17


Class for getting sentences together with their tags

In [9]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, p, t) for w, p, t in zip(s["Word"].values.tolist(),
                                                           s["POS"].values.tolist(),
                                                           s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]
    
    def get_next(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except KeyError:  # no sentence with this number
            return None

In [10]:
getter = SentenceGetter(data)
sent = getter.get_next()

Get all sentences

In [11]:
sentences = getter.sentences
print("Number of sentences: {}".format(len(sentences)))
print("Max len of sentence: {}".format(max([len(i) for i in sentences])))

Number of sentences: 47959
Max len of sentence: 104


We need to choose `max_len`, the fixed length to which every sentence is truncated or padded before it is fed to the network

In [12]:
max_len = 50
X = [[w[0] for w in s] for s in sentences]
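
One way to sanity-check a choice of `max_len` is to measure what fraction of sentences fit without truncation. The sketch below uses a hypothetical `coverage` helper and made-up lengths; in the notebook it could be called as `coverage([len(s) for s in sentences], max_len)`.

```python
def coverage(lengths, max_len):
    """Fraction of sequences that are no longer than max_len."""
    return sum(n <= max_len for n in lengths) / len(lengths)


# Made-up sentence lengths, for illustration only.
toy_lengths = [5, 12, 20, 33, 48, 50, 51, 104]
for candidate in (30, 50, 104):
    print(candidate, coverage(toy_lengths, candidate))
```

A small `max_len` keeps batches compact but truncates long sentences, so the entities near their ends are lost.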

In [13]:
# Truncate each sentence to max_len tokens and pad shorter ones with "__PAD__"
X = [seq[:max_len] + ["__PAD__"] * (max_len - len(seq)) for seq in X]
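
The tag sequences need the same fixed length as the padded word sequences. A minimal sketch, assuming padding positions are labelled with the `O` tag (an assumption, since the notebook does not show this step); `pad_tags` and `toy_tag2idx` are hypothetical names.

```python
def pad_tags(tag_seq, tag2idx, max_len, pad_tag="O"):
    """Map tags to indices, then truncate or pad to exactly max_len."""
    ids = [tag2idx[t] for t in tag_seq[:max_len]]
    return ids + [tag2idx[pad_tag]] * (max_len - len(ids))


# Toy mapping for illustration; the notebook builds tag2idx from the data.
toy_tag2idx = {"O": 0, "B-geo": 1, "I-geo": 2}
padded = pad_tags(["O", "B-geo", "I-geo"], toy_tag2idx, max_len=5)
truncated = pad_tags(["O", "B-geo", "I-geo"], toy_tag2idx, max_len=2)
print(padded, truncated)
```

In the notebook this could be applied as `[pad_tags([w[2] for w in s], tag2idx, max_len) for s in sentences]`, since each sentence is a list of `(word, pos, tag)` tuples.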

# ULMFiT

In [3]:
from fastai.text import *

In [4]:
from sklearn.datasets import fetch_20newsgroups
dataset = fetch_20newsgroups(shuffle=True, random_state=1, remove=('headers', 'footers', 'quotes'))

In [6]:
documents = dataset.data
df = pd.DataFrame({'label':dataset.target, 'text':dataset.data})

In [22]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

[PosixPath('/Users/pavel/.fastai/data/imdb_sample/texts.csv')]

In [23]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,Un-bleeping-believable! Meg Ryan doesn't even ...,False
1,positive,This is a extremely well-made film. The acting...,False
2,negative,Every once in a long while a movie will come a...,False
3,positive,Name just says it all. I watched this movie wi...,False
4,negative,This movie succeeds at being one of the most u...,False


In [27]:
data_lm = TextDataBunch.from_csv(path, 'texts.csv')

In [28]:
data_lm.save()

In [29]:
data = TextClasDataBunch.from_csv(path, 'texts.csv')
data.show_batch()

text,target
"xxbos xxmaj raising xxmaj victor xxmaj vargas : a xxmaj review \n \n xxmaj you know , xxmaj raising xxmaj victor xxmaj vargas is like sticking your hands into a big , steaming bowl of xxunk . xxmaj it 's warm and gooey , but you 're not sure if it feels right . xxmaj try as i might , no matter how warm and gooey xxmaj raising xxmaj",negative
"xxbos xxmaj now that xxmaj che(2008 ) has finished its relatively short xxmaj australian cinema run ( extremely limited xxunk screen in xxmaj sydney , after xxunk ) , i can xxunk join both xxunk of "" xxmaj at xxmaj the xxmaj movies "" in taking xxmaj steven xxmaj soderbergh to task . \n \n xxmaj it 's usually satisfying to watch a film director change his style /",negative
"xxbos xxmaj this film sat on my xxmaj xxunk for weeks before i watched it . i dreaded a self - indulgent xxunk flick about relationships gone bad . i was wrong ; this was an xxunk xxunk into the screwed - up xxunk of xxmaj new xxmaj yorkers . \n \n xxmaj the format is the same as xxmaj max xxmaj xxunk ' "" xxmaj la xxmaj ronde",positive
"xxbos xxmaj many neglect that this is n't just a classic due to the fact that it 's the first xxup 3d game , or even the first xxunk - up . xxmaj it 's also one of the first stealth games , one of the xxunk definitely the first ) truly claustrophobic games , and just a pretty well - rounded gaming experience in general . xxmaj with graphics",positive
"xxbos i really wanted to love this show . i truly , honestly did . \n \n xxmaj for the first time , gay viewers get their own version of the "" xxmaj the xxmaj bachelor "" . xxmaj with the help of his obligatory "" hag "" xxmaj xxunk , xxmaj james , a good looking , well - to - do thirty - something has the chance",negative


In [34]:
data.vocab.itos[0:10]

['xxunk',
 'xxpad',
 'xxbos',
 'xxeos',
 'xxfld',
 'xxmaj',
 'xxup',
 'xxrep',
 'xxwrep',
 'the']

In [42]:
data.train_ds[0][0].data[:10]

array([   2,    5,    9,  300, 5990,  239,   70,    0,   95,  372])
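
The integer array above is the numericalized form of the first training text: each token is replaced by its position in `itos`, and tokens outside the vocabulary map to index 0 (`xxunk`). The following plain-Python sketch mimics that mapping; it is not fastai's implementation, and `numericalize` and the example tokens are hypothetical.

```python
# First ten vocabulary entries, copied from the output above.
itos = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'xxfld',
        'xxmaj', 'xxup', 'xxrep', 'xxwrep', 'the']

# Inverse mapping: token string to integer index.
stoi = {token: index for index, token in enumerate(itos)}


def numericalize(tokens, stoi, unk_index=0):
    """Replace each token by its vocabulary index, or unk_index if unknown."""
    return [stoi.get(token, unk_index) for token in tokens]


ids = numericalize(['xxbos', 'xxmaj', 'the', 'zebra'], stoi)
print(ids)
```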

In [43]:
data = (TextList.from_csv(path, 'texts.csv', cols='text')
                .split_from_df(col=2)
                .label_from_df(cols=0)
                .databunch())

In [45]:
bs=48

In [47]:
path = untar_data(URLs.IMDB)
path.ls()

Downloading https://s3.amazonaws.com/fast-ai-nlp/imdb


KeyboardInterrupt: 

In [48]:
path.ls()

[PosixPath('/Users/pavel/.fastai/data/imdb_sample/texts.csv'),
 PosixPath('/Users/pavel/.fastai/data/imdb_sample/data_save.pkl')]