<a href="https://colab.research.google.com/github/delhian/NLP_course/blob/master/week3/seminar_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Description

In this lecture we will look at a very popular NLP task: Named Entity Recognition (NER).<br>Our goals are to:
- build a good baseline solution
- modify the data markup
- learn how to solve this problem using neural network methods.

In the first part we will explore how to get a quick solution to this task, how to evaluate it with metrics, and how to convert between labeling schemes.<br>
In the second part we will look at how to solve this task with different architectures and compare them.

What we will learn:
- non-neural approaches to the NER task;
- how to measure model quality on the NER task;
- different markup schemes for the NER task;
- data preparation for a neural network solution of NER;
- different neural approaches to NER.

# Part 1

## Solving the NER task without neural networks

In [1]:
!pip install datasets > /dev/null

In [2]:
import pytest
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from datasets import load_dataset
import torch
import torch.nn as nn
from torch import LongTensor, FloatTensor
from torch.nn import functional as F
from typing import List, Dict, Tuple, Optional
from torch.utils.data import Dataset
from torch.optim import Adam
import time
from tqdm import tqdm

from collections import Counter
from sklearn.metrics import classification_report

### Look at the data

For this task we will use CoNLL-2003, a common NER dataset that appears in most benchmarks when researchers measure the quality of SOTA solutions for NER.<br>
The CoNLL-2003 shared task concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations, and miscellaneous entities that do not belong to the previous three groups.

In [3]:
dataset_base = load_dataset("conll2003")


In [4]:
dataset_base['train']

Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 14041
})

In [5]:
import json
# use the public `names` list of the ClassLabel feature instead of the private `_str2int`
mapping_ = dict(enumerate(dataset_base["train"].features["ner_tags"].feature.names))

with open('mapping.json', 'w') as f:
  json.dump(mapping_, f)
mapping_

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}

In [6]:
for i in range(10):
  print(i + 1, ' '.join(dataset_base["train"]['tokens'][i]))

1 EU rejects German call to boycott British lamb .
2 Peter Blackburn
3 BRUSSELS 1996-08-22
4 The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep .
5 Germany 's representative to the European Union 's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer .
6 " We do n't support any such recommendation because we do n't see any grounds for it , " the Commission 's chief spokesman Nikolaus van der Pas told a news briefing .
7 He said further scientific study was required and if it was found that action was needed it should be taken by the European Union .
8 He said a proposal last month by EU Farm Commissioner Franz Fischler to ban sheep brains , spleens and spinal cords from the human and animal food chains was a highly specific and precautionary move

#### Task 1

Count the occurrences of each entity tag and print them. The result must be a dictionary whose keys are the entity tags from `dataset_base["train"]['ner_tags']` and whose values are the total number of occurrences of each tag.

In [7]:
%%time
counter = {}

for tags in dataset_base["train"]['ner_tags']:
  for tag in tags:
    counter[mapping_[tag]] = counter.get(mapping_[tag], 0) + 1

counter

CPU times: user 181 ms, sys: 479 µs, total: 182 ms
Wall time: 183 ms


In [8]:
assert len(counter) == 9
assert counter['O'] > 169000

In [9]:
counter

{'B-LOC': 7140,
 'B-MISC': 3438,
 'B-ORG': 6321,
 'B-PER': 6600,
 'I-LOC': 1157,
 'I-MISC': 1155,
 'I-ORG': 3704,
 'I-PER': 4528,
 'O': 169578}

As you can see, the class `O` dominates. Our main goal is to build a model that does not collapse into always predicting the `O` tag.<br>
Which metrics are more appropriate for measuring the quality of NER models?
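
To see why plain accuracy is misleading here, the sketch below scores a hypothetical "model" that labels every token as `O`. It only reuses the tag counts printed above; the variable names are illustrative.

```python
# Tag counts copied from the `counter` output above.
counter = {
    'B-LOC': 7140, 'B-MISC': 3438, 'B-ORG': 6321, 'B-PER': 6600,
    'I-LOC': 1157, 'I-MISC': 1155, 'I-ORG': 3704, 'I-PER': 4528,
    'O': 169578,
}

total = sum(counter.values())
entity_tokens = total - counter['O']

# A degenerate "model" that predicts 'O' for every token.
always_o_accuracy = counter['O'] / total     # share of tokens it labels correctly
always_o_entity_recall = 0 / entity_tokens   # it never finds a single entity token

print(f"accuracy: {always_o_accuracy:.3f}, entity recall: {always_o_entity_recall:.3f}")
```

Such a model is right on roughly 83% of the tokens while finding no entities at all, which is why the rest of this part reports precision, recall and F1 over the non-`O` tags only.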

### sklearn-crfsuite

Now I'd like to introduce a great library that provides a light and easy implementation for solving the NER task. Its name is `sklearn-crfsuite`. It has an interface similar to scikit-learn, but it is based on a very powerful tool for NER: the CRF (Conditional Random Field). <br>
CRFs are a long-standing standard for sequence labeling problems such as NER. Even in modern SOTA neural network approaches, a CRF layer is often used as the output layer.

In [10]:
!pip install sklearn_crfsuite > /dev/null

In [11]:
import sklearn
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics

`sklearn_crfsuite` itself takes lists of per-token feature dictionaries, one list per sentence. To build them conveniently, let's first put the data into a pandas.DataFrame.<br>
Each row of the DataFrame will hold one word, its entity tag, and its sentence id.

In [12]:
df = pd.DataFrame({'sent_id': [i for j in [[i] * len(s['tokens']) for i, s in enumerate(dataset_base['train'])] for i in j],
                   'data': [i for j in dataset_base['train'] for i in j['tokens']],
                   'entities': [mapping_[i] for j in dataset_base['train'] for i in j['ner_tags']]})
df.head(20)

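| | sent_id | data | entities |
|---|---|---|---|
| 0 | 0 | EU | B-ORG |
| 1 | 0 | rejects | O |
| 2 | 0 | German | B-MISC |
| 3 | 0 | call | O |
| 4 | 0 | to | O |
| 5 | 0 | boycott | O |
| 6 | 0 | British | B-MISC |
| 7 | 0 | lamb | O |
| 8 | 0 | . | O |
| 9 | 1 | Peter | B-PER |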


Now we have a DataFrame with only 3 columns:
 - `sent_id` marks which sentence each word belongs to;
 - `data` contains one word per row;
 - `entities` marks which entity each word refers to.

We also need a class that will process each sentence and aggregate its words and entities.

In [13]:
class SentenceGetter(object):
    
    def __init__(self, data):
        self.n_sent = 0  # sent_id values start at 0
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s['data'].values.tolist(), 
                                                     s['entities'].values.tolist())]
        self.grouped = self.data.groupby('sent_id').apply(agg_func)
        self.sentences = [s for s in self.grouped]
        
    def get_next(self):
        # the grouped Series is indexed by the integer sent_id
        try: 
            s = self.grouped.loc[self.n_sent]
            self.n_sent += 1
            return s 
        except KeyError:
            return None

In [14]:
getter = SentenceGetter(df)
sentences = getter.sentences
sentences[0]

[('EU', 'B-ORG'),
 ('rejects', 'O'),
 ('German', 'B-MISC'),
 ('call', 'O'),
 ('to', 'O'),
 ('boycott', 'O'),
 ('British', 'B-MISC'),
 ('lamb', 'O'),
 ('.', 'O')]

In [15]:
def word2features(sent, i):
    word = sent[i][0]
    # sent[i][1] is the gold entity tag, so it must not be used as a feature
    
    features = {
        'bias': 1.0, 
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit()
    }
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper()
        })
    else:
        features['BOS'] = True
    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper()
        })
    else:
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]

In [16]:
X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
len(X)

14041

In [17]:
X_train = X[:10000]
X_test = X[10000:]
y_train = y[:10000]
y_test = y[10000:]

In [18]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
    verbose=True
)
crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|██████████| 10000/10000 [00:00<00:00, 14019.78it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 2954
Seconds required: 0.085

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=0.23  loss=173256.49 active=2930  feature_norm=1.00
Iter 2   time=0.24  loss=132654.60 active=2796  feature_norm=3.04
Iter 3   time=0.12  loss=110485.96 active=2699  feature_norm=2.59
Iter 4   time=0.25  loss=97099.18 active=2747  feature_norm=2.22
Iter 5   time=0.12  loss=88075.12 active=2874  feature_norm=2.58
Iter 6   time=0.12  loss=80585.90 active=2849  feature_norm=3.05
Iter 7   time=0.12  loss=62716.54 active=2801  feature_norm=5.39
Iter 8   time=0.12  loss=56430.24 active=2862  feature_norm=6.07
Iter 9   time=0.13  loss=50203.36 active=2877  feature_norm=7.93
Iter 10  tim

In [21]:
all_entities = sorted(df.entities.unique().tolist())
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=[i for i in all_entities if i != 'O'])

0.5996010363706731

#### Task 2

Print a classification report for all useful tags (excluding the `O` tag).

In [25]:
# YOUR CODE HERE

print(metrics.flat_classification_report(y_test, y_pred, labels = [i for i in all_entities if i != 'O']))

              precision    recall  f1-score   support

       B-LOC       0.62      0.64      0.63      2205
      B-MISC       0.64      0.61      0.63      1103
       B-ORG       0.52      0.49      0.50      1739
       B-PER       0.64      0.57      0.60      1976
       I-LOC       0.53      0.41      0.46       422
      I-MISC       0.55      0.37      0.45       370
       I-ORG       0.60      0.62      0.61      1149
       I-PER       0.68      0.80      0.73      1297

   micro avg       0.61      0.60      0.60     10261
   macro avg       0.60      0.56      0.58     10261
weighted avg       0.61      0.60      0.60     10261



#### Task 3

Add some features to reach a weighted F1-score of at least 0.82 on all useful tags.

##### help

In [None]:
# 1. You can add the lowercased form of each word, word.lower()
# 2. You can add longer suffixes to the features, for example the last 3 characters, word[-3:]
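
The hints above can be sketched as a small helper. `extra_word_features` is a hypothetical name, not part of the notebook or of `sklearn_crfsuite`, and the prefix, hyphen and digit features are additional assumptions; whether they are enough to reach the target score has to be checked by retraining.

```python
def extra_word_features(word: str) -> dict:
    """Hypothetical helper: extra per-token features in the style of word2features."""
    return {
        'word.lower()': word.lower(),                         # the lowercased word itself
        'word[-3:]': word[-3:],                               # 3-character suffix
        'word[:3]': word[:3],                                 # 3-character prefix
        'word.has_hyphen': '-' in word,                       # e.g. "German-born"
        'word.has_digit': any(ch.isdigit() for ch in word),   # e.g. "1996-08-22"
    }

features = extra_word_features('Brussels')
print(features)
```

Inside `word2features` these entries could be merged with `features.update(extra_word_features(word))`, and the same helper could be applied to neighbouring words with a `-1:` or `+1:` key prefix.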

##### continue work

In [60]:
# YOUR CODE HERE
def word2features(sent, i):
    word = sent[i][0]
    # sent[i][1] is the gold entity tag, so it must not be used as a feature
    checks = ('istitle', 'isupper', 'isdigit', 'isalnum', 'islower')

    features = {
        'bias': 1.0, 
        # add some here
        'word[-2:]': word[-2:],
    }
    for check in checks:
        features[f'word.{check}()'] = getattr(word, check)()

    # the same checks for up to 4 neighbours on each side of the current word
    for offset in range(1, 5):
        if i - offset >= 0:
            prev_word = sent[i - offset][0]
            for check in checks:
                features[f'-{offset}:word.{check}()'] = getattr(prev_word, check)()
        else:
            features['BOS'] = True
        if i + offset < len(sent):
            next_word = sent[i + offset][0]
            for check in checks:
                features[f'+{offset}:word.{check}()'] = getattr(next_word, check)()
        else:
            features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, label in sent]

def sent2tokens(sent):
    return [token for token, label in sent]

In [61]:
# explore quality for your new features

X = [sent2features(s) for s in sentences]
y = [sent2labels(s) for s in sentences]
len(X)

14041

In [62]:
X_train = X[:10000]
X_test = X[10000:]
y_train = y[:10000]
y_test = y[10000:]

In [63]:
%%time
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
    verbose=False
)
crf.fit(X_train, y_train)

CPU times: user 31.1 s, sys: 110 ms, total: 31.2 s
Wall time: 31.1 s


In [64]:
all_entities = sorted(df.entities.unique().tolist())
y_pred = crf.predict(X_test)
metrics.flat_f1_score(y_test, y_pred, average='weighted', labels=[i for i in all_entities if i != 'O'])

0.6613944287508109

In [65]:
print(metrics.flat_classification_report(y_test, y_pred, labels=[i for i in all_entities if i != 'O']),)

              precision    recall  f1-score   support

       B-LOC       0.68      0.73      0.70      2205
      B-MISC       0.71      0.65      0.68      1103
       B-ORG       0.64      0.57      0.61      1739
       B-PER       0.68      0.64      0.66      1976
       I-LOC       0.55      0.50      0.53       422
      I-MISC       0.55      0.42      0.48       370
       I-ORG       0.62      0.64      0.63      1149
       I-PER       0.74      0.83      0.79      1297

   micro avg       0.67      0.66      0.66     10261
   macro avg       0.65      0.62      0.63     10261
weighted avg       0.67      0.66      0.66     10261



### Converting markup

Now it's time to get acquainted with NER markup, also called NER data labeling.<br>
For almost every NLP task we need labeled data, and for NER the labeling is often rather expensive. Annotators are usually asked to mark entities directly in the text, and then every token is labeled with `BIO` markup.<br>
But in tasks where we need to delimit separate entities very precisely, `BILUO` markup may come to the rescue.

Our dataset uses `BIO` markup.
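
To make the difference concrete, the sketch below writes the same toy token sequence in both schemes and then reads entity spans back from the `BILUO` tags. `biluo_to_spans` is a hypothetical helper written for this illustration and assumes well-formed tags.

```python
def biluo_to_spans(tags):
    """Collect (start, end, label) token spans from BILUO tags; `end` is exclusive."""
    spans, start = [], None
    for idx, tag in enumerate(tags):
        prefix, _, label = tag.partition('-')
        if prefix == 'U':                     # single-token entity
            spans.append((idx, idx + 1, label))
        elif prefix == 'B':                   # remember where the entity starts
            start = idx
        elif prefix == 'L' and start is not None:
            spans.append((start, idx + 1, label))  # explicit end marker closes the span
            start = None
    return spans

tokens = ['EU', 'Farm', 'Commissioner', 'Franz', 'Fischler']
bio    = ['B-ORG', 'O', 'O', 'B-PER', 'I-PER']
biluo  = ['U-ORG', 'O', 'O', 'B-PER', 'L-PER']

spans = biluo_to_spans(biluo)
print(spans)
```

In `BIO` the end of an entity is only implied by the next tag, while `BILUO` marks single-token entities (`U`) and entity ends (`L`) explicitly, so each boundary can be read from a single tag.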

#### Task 4

Write a function to convert `BIO` markup into `BILUO` markup.

In [66]:
entities_list = [[mapping_[token] for token in tokens] for tokens in dataset_base["train"]['ner_tags']]
# entities_list

In [67]:
# B - 'beginning'
# I - 'inside'
# L - 'last'
# O - 'outside'
# U - 'unit'

In [68]:
def bio_2_biluo(entities_list, missing: str = 'O'):
  result = list()
  for entities in entities_list:
    current_new_markup = []
# YOUR CODE HERE
    for idx, tag in enumerate(entities):
      # an entity continues only if the next tag is an I- tag;
      # the end of the sentence is treated like the `missing` tag
      next_tag = entities[idx + 1] if idx + 1 < len(entities) else missing
      continues = next_tag.startswith('I-')
      if tag.startswith('B-') and not continues:
        current_new_markup.append('U' + tag[1:])  # single-token entity
      elif tag.startswith('I-') and not continues:
        current_new_markup.append('L' + tag[1:])  # last token of an entity
      else:
        current_new_markup.append(tag)  # O, or B-/I- inside a longer entity
    result.append(current_new_markup)
  return result

In [69]:
assert len(bio_2_biluo(entities_list)) == len(entities_list)
assert set(bio_2_biluo([entities_list[1]])[0]) == {'B-PER', 'L-PER'}
assert len(set(bio_2_biluo([entities_list[7]])[0])) == 4

Sometimes after annotation the data is labeled with offsets: for plain text we get the start and end position of each entity.<br>
In this situation we can use the spaCy function `offsets_to_biluo_tags`. But be careful, because it sometimes produces incorrect results, for example when entity offsets do not line up with token boundaries. In that case you need to check the converted markup or write your own conversion function.
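
One simple way to check a conversion is to verify that every entity starts and ends exactly on a token boundary. The sketch below is a plain-Python illustration with a hypothetical helper, `misaligned_entities`; it assumes the text is the tokens joined by single spaces and does not call spaCy.

```python
def misaligned_entities(tokens, entities):
    """Return entities whose character offsets do not match token boundaries.

    Assumes the text is the tokens joined by single spaces.
    """
    starts, ends, cursor = set(), set(), 0
    for token in tokens:
        starts.add(cursor)
        ends.add(cursor + len(token))
        cursor += len(token) + 1  # +1 for the joining space
    return [(s, e, label) for s, e, label in entities
            if s not in starts or e not in ends]

tokens = ['Peter', 'Blackburn', 'in', 'BRUSSELS']
# text: "Peter Blackburn in BRUSSELS"
entities = [(0, 15, 'PER'),   # "Peter Blackburn": aligned with token boundaries
            (19, 26, 'LOC')]  # "BRUSSEL": ends one character too early
bad = misaligned_entities(tokens, entities)
print(bad)
```

Entities reported by such a check are the ones worth inspecting by hand before training on the converted labels.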

Further reading

In [None]:
# You can also try the built-in NER model from the spaCy Python library. Example of usage is here -> https://spacy.io/api/cli

# Part 2

In this part we will try some basic neural approaches to the NER task. The dataset will be the same as above. In this part, don't forget to change the runtime of your notebook to `GPU`.

In [70]:
!pip install spacy==3.1


In [71]:
!pip install spacy-transformers


In [72]:
from IPython.display import Image, display, HTML
from collections import Counter
import random
import json
import time
from typing import List, Dict, Tuple, Optional, Union

import numpy as np
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset

import spacy
from spacy.training import offsets_to_biluo_tags
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report

import torch
import torch.nn as nn
from torch import LongTensor, FloatTensor
from torch.nn import functional as F
from torch.utils.data import Dataset

Let's look at the distribution of our data. Maybe we can deal with our problem using just simple neural networks.

In [73]:
!wget https://raw.githubusercontent.com/snv-ds/NLP_course/master/week3/restauranttrain_updated.json




In [74]:
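# NOTE: `new_data` is only created in the section on BILUO conversion below,
# so run that section first; otherwise this cell raises a NameError.
max_lens = pd.Series([len(row) for row in new_data])
max_lens.plot();
max_lens.describe()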

NameError: ignored

As we can see, there are not many long texts, so we can predict all tokens at once.

### FCNN for NER

As a first approach we can just use a basic FCNN (fully connected neural network). You will rarely see this in production, but it is useful to explore for learning purposes.

As usual, we will fix the word order in our vocabulary.

In [None]:
import torch
from torch import nn
from torch.nn import functional as F
from torch import LongTensor, FloatTensor
from torch.nn.parameter import Parameter
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from typing import Any

### To BILUO

This time we want to change our markup and train some models.

For this purpose we will use the spaCy library. It contains a built-in method that converts markup, but its output sometimes needs correcting. That's why we wrote our own function that converts markup to `BILUO`.

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
!python -m spacy init config base_config.cfg -p ner --force

In [None]:
nlp = spacy.load("en_core_web_sm")  # without vocabulary spacy can not work

In [None]:
with open('restauranttrain_updated.json', 'r') as f:
    d = json.load(f)
d[0]['paragraphs'][34]['sentences']

We will join our data, which is stored as a list of tokens, and then pass it to the spaCy method.

In [None]:
tokens_dict = d[0]['paragraphs'][34]['sentences'][0]['tokens']
tokens = [i['orth'] for i in tokens_dict]

text = ' '.join(tokens)
doc = nlp(text)
entities = d[0]['paragraphs'][34]['entities']

For further use, you can download this function and use it in your own work.

In [None]:
from typing import List, Tuple, Union
def convert_to_biluo(text: str = '',
                     entities: List[Tuple] = None,
                     tokens: list = None,
                     missing: str = 'O') -> Tuple[Union[List[str], list, None], List[str]]:
    """
    Tokenize text and return text tokens and ner labels.

    Args:
        text: text
        entities: labels in spacy format
        tokens: already tokenized text, if you want it
        missing: label for tokens without entities

    Returns:
        tokenized text and labels
    """

    # create dicts with start/end position of token and its index
    starts = []
    ends = []
    cur_index = 0
    tokens = text.split() if tokens is None else tokens

    for token in tokens:
        starts.append(cur_index)
        ends.append(cur_index + len(token))
        cur_index += len(token) + 1

    starts = {k: v for v, k in enumerate(starts)}
    ends = {k: v for v, k in enumerate(ends)}

    # this will be a list with token labels, one per token (also when `tokens` is passed in)
    biluo = ["-" for _ in tokens]

    # check that there are no overlapping entities
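    entities = entities or []  # treat a missing entity list as "no entities"
    entities_indexes = [list(range(i[0], i[1])) for i in entities]
    char_counts = Counter([i for j in entities_indexes for i in j])
    if char_counts and max(char_counts.values()) > 1:  # max() of an empty sequence would fail
        raise ValueError('You have overlapping entities')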

    tokens_in_ents = {}

    # Handle entity cases
    for start_char, end_char, label in entities:
        for token_index in range(start_char, end_char):
            tokens_in_ents[token_index] = (start_char, end_char, label)
        start_token = starts.get(start_char)
        end_token = ends.get(end_char)
        # Only interested if the tokenization is correct
        if start_token is not None and end_token is not None:
            if start_token == end_token:
                biluo[start_token] = f"U-{label}"
            else:
                biluo[start_token] = f"B-{label}"
                for i in range(start_token + 1, end_token):
                    biluo[i] = f"I-{label}"
                biluo[end_token] = f"L-{label}"

    # put missing value for tokens without labels
    entity_chars = set()
    for start_char, end_char, label in entities:
        for i in range(start_char, end_char):
            entity_chars.add(i)

    for ind, token in enumerate(tokens):
        for i in range(list(starts.keys())[ind], list(ends.keys())[ind]):
            if i in entity_chars:
                break
        else:
            biluo[ind] = missing

    return tokens, biluo

In [None]:
%%time
# convert the data (the %%time cell magic has to be the first line of the cell)
new_data = []
biluo_labels = []
for paragraph in d[0]['paragraphs']:
    tokens = [t['orth'] for t in paragraph['sentences'][0]['tokens']]
    if len(tokens) > 1:
        text = ' '.join(tokens)
        doc = nlp(text)
        entities = paragraph['entities']

        if entities == []:
            new_ents = ['O'] * len(tokens)
        else:
            new_ents = offsets_to_biluo_tags(doc, entities)  # using spacy function
        if len(tokens) != len(new_ents):  # if lists from 2 methods don't match
            new_ents = convert_to_biluo(text, entities)[1]

        new_data.append(tokens)
        biluo_labels.append(new_ents)

NameError: ignored

In [None]:
biluo_labels[0], new_data[0]

In [None]:
import json
from collections import Counter
from tqdm.notebook import tqdm
import joblib
from typing import List, Tuple, Union, Dict

#### Task 5

Create two variables that will contain the mappings between entities and indices. Each dictionary must include the entities `O` and `PAD`.
Initialize the variable `tag_to_idx`.

In [None]:
# YOUR CODE HERE

tags = sorted(list({i for j in biluo_labels for i in j}))

tag_to_idx = {}

with open('mapping.json', 'w') as f:
  json.dump(tag_to_idx, f)

idx_to_tag = {second: first for first, second in tag_to_idx.items()}

tag_to_idx
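
One possible shape for these mappings is sketched below on toy labels. Putting `PAD` at index 0 is an assumption made for this illustration, not something the task prescribes, and `toy_labels` stands in for the real `biluo_labels`.

```python
# Toy stand-in for `biluo_labels`; the real list is built earlier in the notebook.
toy_labels = [['U-ORG', 'O', 'B-PER', 'L-PER'], ['O', 'O']]

tags = sorted({tag for sentence in toy_labels for tag in sentence})

# Assumption: 'PAD' takes index 0, and 'O' is included even if it never occurs in the data.
all_tags = ['PAD'] + sorted(set(tags) | {'O'})
tag_to_idx = {tag: idx for idx, tag in enumerate(all_tags)}
idx_to_tag = {idx: tag for tag, idx in tag_to_idx.items()}

print(tag_to_idx)
```

Keeping `PAD` as a separate index makes it easy to exclude padded positions later, for example when computing the loss.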

In [None]:
def get_word_to_idx(count: List[Tuple[str, int]],
                   min_words: Union[int, float] = 0.0,
                   max_words: Union[int, float] = 1.0) -> Dict[str, int]:
    max_count = count[0][1]
    if isinstance(min_words, float):
        min_words = max_count * min_words
    if isinstance(max_words, float):
        max_words = max_count * max_words
    
    all_words = [w[0] for w in count if max_words >= w[1] >= min_words]
    
    all_words = ['<pad>', '<unk>'] + all_words
    
    word_to_idx = {k: v for k, v in zip(all_words, range(0, len(all_words)))}
    return word_to_idx

#### Task 6

Count how many unique words there are in our train dataset. Leave the parameters `min_words` and `max_words` at their default values. Initialize the variable `word_to_idx` using the function `get_word_to_idx`.

##### help

In [None]:
# 1. first, count the occurrences of each word
# 2. then pass a list of (word, num_of_occurrences) tuples to the function get_word_to_idx

##### Continue work

In [None]:
count = Counter()
word_to_idx = None
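
The counting step from the hints can be sketched on toy data as follows. `toy_data` is a stand-in for `new_data`, and lowercasing while counting is an assumption based on `create_matrix_of_texts` in this notebook, which looks words up in lowercase.

```python
from collections import Counter

# Toy stand-in for `new_data` (a list of token lists).
toy_data = [['Any', 'good', 'pizza', 'places'], ['cheap', 'pizza', 'nearby']]

# Lowercase while counting so the vocabulary keys match the later lowercase lookups.
count = Counter(token.lower() for sentence in toy_data for token in sentence)

# most_common() returns (word, n) pairs sorted by frequency, highest first,
# which fits how get_word_to_idx reads count[0][1] as the maximum count.
pairs = count.most_common()
print(pairs[0], len(pairs))
```

A list like `pairs` has the `List[Tuple[str, int]]` shape that `get_word_to_idx` expects as its first argument.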

In [None]:
assert len(word_to_idx) == 3805

In [None]:
def create_matrix_of_texts(dataset, max_sequence_length, 
                           pad_token, word2index):
    texts = np.full((len(dataset), max_sequence_length),
                    word2index[pad_token], dtype=np.int64)  # creating empty matrix

    for ind, row in enumerate(dataset):
          trim_length = min(max_sequence_length, len(row))
          text = row[:trim_length]
          texts[ind, :trim_length] = [word2index[item.lower()] for item in text]
    return texts

def create_matrix_of_tags(dataset, max_sequence_length, pad_index, tag2idx):
    tags = np.full((len(dataset), max_sequence_length),
                    pad_index, dtype=np.int64)  # creating empty matrix

    for ind, row in enumerate(dataset):
          trim_length = min(max_sequence_length, len(row))
          labels = row[: trim_length]
          tags[ind, : trim_length] = [tag2idx[item] for item in labels]
    return tags
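
A small toy run shows what the padding and trimming produce. The variant below also falls back to `<unk>` for words missing from the vocabulary, which is an assumption added for this sketch; the function above would raise a `KeyError` in that case.

```python
import numpy as np

word2index = {'<pad>': 0, '<unk>': 1, 'pizza': 2, 'cheap': 3}  # toy vocabulary
dataset = [['Cheap', 'pizza', 'nearby', 'tonight'], ['pizza']]
max_sequence_length = 3

# Start from a matrix filled with the padding index.
texts = np.full((len(dataset), max_sequence_length), word2index['<pad>'], dtype=np.int64)
for ind, row in enumerate(dataset):
    trimmed = row[:max_sequence_length]  # long rows are cut, short rows keep their padding
    # .get falls back to <unk> instead of raising KeyError for unseen words
    texts[ind, :len(trimmed)] = [word2index.get(w.lower(), word2index['<unk>']) for w in trimmed]

print(texts)
```

The first row loses its fourth token and maps the unseen word to `<unk>`; the second row is filled up with the padding index.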

In [None]:
texts = create_matrix_of_texts(new_data, 
                               int(max_lens.quantile(0.97)),
                               '<pad>', word_to_idx)
tags = create_matrix_of_tags(biluo_labels,
                             int(max_lens.quantile(0.97)),
                             tag_to_idx['PAD'],
                             tag_to_idx)

In [None]:
class NerDataset(Dataset):
    def __init__(self,
                 texts: np.array,
                 tags: np.array):
        self.tags = tags
        self.texts = texts
        

    def __getitem__(self, idx: int) -> Tuple[torch.LongTensor, torch.LongTensor]:
        tokens_tensor = torch.tensor(self.texts[idx], dtype=torch.int64)
        return tokens_tensor, torch.tensor(self.tags[idx], dtype=torch.int64)

    def __len__(self) -> int:
        dataset_len = self.texts.shape[0]
        return dataset_len

In [None]:
ner_dataset = NerDataset(texts, tags)
assert len(ner_dataset) == 7634

In [None]:
from torch.utils.data import DataLoader, random_split

In [None]:
BATCH_SIZE = 32

#### Task 8

Initialize dataloaders for the train and validation splits. The validation dataloader does not need shuffling, but the train dataloader should shuffle its data and drop the last incomplete batch.
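
As a rough illustration of these two settings, the sketch below builds loaders over a toy `TensorDataset` instead of `ner_dataset`. All `toy_*` names and sizes are assumptions chosen for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Toy stand-in for the real dataset: 10 "sentences" of length 4 with dummy tags.
toy_dataset = TensorDataset(torch.zeros(10, 4, dtype=torch.int64),
                            torch.zeros(10, 4, dtype=torch.int64))
toy_train, toy_valid = random_split(toy_dataset, [8, 2])

# Training loader: reshuffle every epoch and drop the incomplete last batch.
toy_train_loader = DataLoader(toy_train, batch_size=3, shuffle=True, drop_last=True)
# Validation loader: keep the order fixed and keep every example.
toy_valid_loader = DataLoader(toy_valid, batch_size=3, shuffle=False)

print(len(toy_train_loader), len(toy_valid_loader))
```

With 8 training examples and a batch size of 3, `drop_last=True` leaves two full batches, so every training batch has the same shape.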

In [None]:
# YOUR CODE HERE

num_train = int(len(ner_dataset) * 0.95)
split_train_, split_valid_ = \
    random_split(ner_dataset, [num_train, len(ner_dataset) - num_train])

train_dataloader = None  # YOUR CODE HERE
valid_dataloader = None  # YOUR CODE HERE

In this toy example we will first try a simple fully connected neural network (FCNN) on our problem. Let's see how well it fits our data.

#### Task 9

Initialize the sequential layers of our FCNN.
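
One possible shape for such a stack is sketched below on toy sizes. The layer choice (embedding, linear, sigmoid, linear) and all `toy_*` names are assumptions for illustration and only one of many valid answers to the task.

```python
import torch
import torch.nn as nn

toy_vocab_size, toy_emb_dim, toy_hidden, toy_num_tags = 50, 8, 16, 5

# Embedding -> Linear -> Sigmoid -> Linear, applied to every token independently.
toy_stack = nn.Sequential(
    nn.Embedding(toy_vocab_size, toy_emb_dim),
    nn.Linear(toy_emb_dim, toy_hidden),
    nn.Sigmoid(),
    nn.Linear(toy_hidden, toy_num_tags),
)

toy_tokens = torch.randint(0, toy_vocab_size, (2, 7))  # (batch, seq_len)
toy_logits = toy_stack(toy_tokens)                     # (batch, seq_len, num_tags)
flat_logits = toy_logits.view(-1, toy_num_tags)        # (batch * seq_len, num_tags)
print(toy_logits.shape, flat_logits.shape)
```

Because the linear layers act on the last dimension only, each token is classified without looking at its neighbours.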

In [None]:
# YOUR CODE HERE

class NerModel(nn.Module):
    def __init__(
        self,
        word2idx: Dict,
        embedding_dim: int = 100,
        mapping: Dict[int, str] = None,
        hidden_size: int = 256,
    ):
        super(NerModel, self).__init__()
        if not mapping:
            raise RuntimeError('Empty labels')
        self.word2idx = word2idx
        self.labels = mapping

        self.linear_sigmoid_stack = nn.Sequential(
            # FILL YOUR CODE HERE
        )

    def forward(self, tokens: LongTensor) -> FloatTensor:

        return self.linear_sigmoid_stack(tokens).view(-1, len(self.labels))

Now we will create a basic network and check how its loss is calculated.

In [None]:
model = NerModel(word_to_idx, 100, {idx: str(idx) for idx in range(10)})
assert (
        len(list(name for name, module in model.named_modules())) > 3
    ), "Not enough layers created"

In [None]:
num_classes = len(tag_to_idx)
model = NerModel(word_to_idx, 30, {idx: str(idx) for idx in range(num_classes)})
seq_len = 32
example_input = torch.randint(0, 2, (BATCH_SIZE, seq_len), dtype=torch.int64)
logits = model(example_input)
assert isinstance(logits, torch.FloatTensor)
assert logits.shape == (BATCH_SIZE * seq_len, num_classes), f"current size of model output {logits.shape}"

In [None]:
i = iter(train_dataloader)
text, label = next(i)
logits = model(text)
loss_function = nn.CrossEntropyLoss()
loss_function(logits, label.view(-1))
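
The call above relies on `nn.CrossEntropyLoss` receiving logits of shape `(N, C)` and class indices of shape `(N,)`. The sketch below shows this on toy tensors and also an optional variant with `ignore_index` that skips padded positions. The padding index `toy_pad_index = 0` is an assumption for the example, and the notebook itself does not mask padding in the loss.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_pad_index = 0  # hypothetical padding tag index
toy_num_tags = 4

# Two padded "sentences" of length 3; zeros mark padded positions.
toy_labels = torch.tensor([[1, 2, 0], [3, 0, 0]])
# Model output is already flattened to (batch * seq_len, num_tags).
toy_logits = torch.randn(toy_labels.numel(), toy_num_tags)

# Targets must be flattened the same way: one class index per row of logits.
plain_loss = nn.CrossEntropyLoss()(toy_logits, toy_labels.view(-1))

# Optional variant: leave padded positions out of the averaged loss.
masked_loss = nn.CrossEntropyLoss(ignore_index=toy_pad_index)(toy_logits, toy_labels.view(-1))
print(plain_loss.item(), masked_loss.item())
```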

Everything looks correct. Let's now write an evaluation function and begin training.

#### Task 10

Fill in the missing lines of code. First, initialize the `correct_labels` variable (the labels that are not special ones such as `PAD` and `O`). Then compute the `true_labels` and `predicted` variables.
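
The idea can be sketched on toy predictions as follows. The tag inventory `toy_idx_to_tag` and the index lists are invented for the example. The real mapping comes from the notebook's own tag dictionary.

```python
from sklearn.metrics import classification_report

# Hypothetical tag inventory used only in this sketch.
toy_idx_to_tag = {0: "PAD", 1: "O", 2: "U-PER", 3: "U-LOC"}

# Labels to report on: everything except padding and the "outside" tag.
toy_report_labels = [tag for tag in toy_idx_to_tag.values() if tag not in ("PAD", "O")]

toy_true_idx = [2, 1, 3, 0, 0]
toy_pred_idx = [2, 1, 1, 0, 0]

# Convert indices back to tag names before building the report.
toy_true = [toy_idx_to_tag[i] for i in toy_true_idx]
toy_pred = [toy_idx_to_tag[i] for i in toy_pred_idx]

toy_report = classification_report(toy_true, toy_pred, labels=toy_report_labels,
                                   output_dict=True, zero_division=0)
print(toy_report["U-PER"], toy_report["U-LOC"])
```

Restricting `labels` keeps the frequent `O` and `PAD` classes from dominating the averaged scores.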

In [None]:
# YOUR CODE HERE
def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0
    correct_labels = None  # Fill your code here
    predicted, true_labels = list(), list()

    with torch.no_grad():
        for idx, batch in enumerate(dataloader):
            tokens, label = batch
            tokens = tokens.to(device)
            
            logits = model(tokens)
            predictions = F.log_softmax(logits, dim=1).reshape(-1,
                                                               int(max_lens.quantile(0.97)),
                                                               len(tag_to_idx)).argmax(dim=2).flatten().detach().cpu().numpy()
            predicted.extend(predictions)
            true_labels.extend(label.flatten().detach().cpu().numpy())
    
    true_labels = None # Fill your code here
    
    predicted = None  # Fill your code here
    print('\n', classification_report(true_labels,
                                      predicted,
                                      labels=correct_labels))

Now we can create our model and start training.

In [None]:
model = NerModel(word_to_idx, 300, tag_to_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

In [None]:
for e in range(6):
    total_loss = 0.0
    model.train()
    loss_function = nn.CrossEntropyLoss()
    for batch in tqdm(train_dataloader):
        # (1) Reset gradients left over from the previous batch
        model.zero_grad()

        # (2) Unpack the batch of token indices and gold tag indices
        input_sent, gold_tags = batch

        # (3) Predict tag scores for every token in the batch
        if len(input_sent) > 0:
            pred_scores = model(input_sent.to(device))

            # (4) Compute loss and do backward step
            loss = loss_function(pred_scores, gold_tags.view(-1).to(device))
            loss.backward()

            # (5) Optimize parameter values
            optimizer.step()

            # (6) Accumulate loss as a Python float rather than a tensor
            total_loss += loss.item()
    print('\nEpoch: %d, loss: %.4f' % (e, total_loss / len(train_dataloader)))
    evaluate(valid_dataloader)

We did it, but the quality of the model is rather poor. Now you can try RNNs.

### RNNs

The whole pipeline from the FCNN part works fine, but we need a new architecture. Let's write a new model that processes the data with some kind of recurrent neural network.

#### Task 11

Fill in the missing layers of the model. You can use any RNN.
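
When choosing the recurrent layer, the input layout matters: an embedding applied to a `(batch, seq_len)` index matrix yields `(batch, seq_len, embedding_dim)`, so the layer should be told that the batch dimension comes first. A minimal sketch with `nn.LSTM` on toy sizes (all `toy_*` names are assumptions):

```python
import torch
import torch.nn as nn

toy_emb_dim, toy_hidden = 8, 16

# batch_first=True makes the layer read its input as (batch, seq_len, features).
toy_encoder = nn.LSTM(input_size=toy_emb_dim, hidden_size=toy_hidden, batch_first=True)

toy_emb = torch.randn(2, 7, toy_emb_dim)  # stand-in for embedded tokens
toy_out, _ = toy_encoder(toy_emb)         # one hidden state per token
print(toy_out.shape)
```

The first returned value holds a hidden state for every token, which is what a per-token projection layer needs.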

In [None]:
# YOUR CODE HERE

class NerRNNModel(nn.Module):
    def __init__(
        self,
        word2idx: Dict,
        embedding_dim: int = 100,
        mapping: Dict[int, str] = None,
        hidden_size: int = 256
    ):
        super(NerRNNModel, self).__init__()
        if not mapping:
            raise RuntimeError('Empty labels')
        self.word2idx = word2idx
        self.labels = mapping
        # Size the embedding from the constructor argument, not the global vocabulary
        self.embedding = nn.Embedding(len(word2idx), embedding_dim)
        self.encoder = nn.RNN(
            # FILL YOUR CODE HERE
        )
        self.projection = nn.Linear(hidden_size, len(mapping))

    def forward(self, tokens: LongTensor) -> FloatTensor:

        emb = self.embedding(tokens)        
        h, _ = self.encoder(emb)
        pred = self.projection(h)
        return pred.view(-1, len(self.labels))

Now we can duplicate all the cells from above and simply start a new training run with the new model.

In [None]:
model = NerRNNModel(word_to_idx, 100, {idx: str(idx) for idx in range(10)})
assert (
        len(list(name for name, module in model.named_modules())) > 3
    ), "Not enough layers created"

In [None]:
num_classes = len(tag_to_idx)
model = NerRNNModel(word_to_idx, 30, {idx: str(idx) for idx in range(num_classes)})
seq_len = 32
example_input = torch.randint(0, 2, (BATCH_SIZE, seq_len), dtype=torch.int64)
logits = model(example_input)
assert isinstance(logits, torch.FloatTensor)
assert logits.shape == (BATCH_SIZE * seq_len, num_classes), f"current size of model output {logits.shape}"

In [None]:
i = iter(train_dataloader)
text, label = next(i)
logits = model(text)
loss_function = nn.CrossEntropyLoss()
loss_function(logits, label.view(-1))

In [None]:
from sklearn.metrics import classification_report

def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0
    correct_labels = [value for value in idx_to_tag.values() if value != 'O' and value != 'PAD']
    predicted, true_labels = list(), list()

    with torch.no_grad():
        for idx, batch in enumerate(dataloader):
            tokens, label = batch
            tokens = tokens.to(device)
            
            logits = model(tokens)
            predictions = F.log_softmax(logits, dim=1).reshape(-1,
                                                               int(max_lens.quantile(0.97)),
                                                               len(tag_to_idx)).argmax(dim=2).flatten().detach().cpu().numpy()
            predicted.extend(predictions)
            true_labels.extend(label.flatten().detach().cpu().numpy())
    
    true_labels = [idx_to_tag[val] for val in true_labels]
    
    predicted = [idx_to_tag[val] for val in predicted]
    print('\n', classification_report(true_labels,
                                      predicted,
                                      labels=correct_labels))

In [None]:
model = NerRNNModel(word_to_idx, 300, tag_to_idx)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

In [None]:
sum([params.numel() for params in model.parameters() if params.requires_grad])

In [None]:
for e in range(6):
    total_loss = 0.0
    model.train()
    loss_function = nn.CrossEntropyLoss()
    for batch in tqdm(train_dataloader):
        # (1) Reset gradients left over from the previous batch
        model.zero_grad()

        # (2) Unpack the batch of token indices and gold tag indices
        input_sent, gold_tags = batch

        # (3) Predict tag scores for every token in the batch
        if len(input_sent) > 0:
            pred_scores = model(input_sent.to(device))

            # (4) Compute loss and do backward step
            loss = loss_function(pred_scores, gold_tags.view(-1).to(device))
            loss.backward()

            # (5) Optimize parameter values
            optimizer.step()

            # (6) Accumulate loss as a Python float rather than a tensor
            total_loss += loss.item()
    print('\nEpoch: %d, loss: %.4f' % (e, total_loss / len(train_dataloader)))
    evaluate(valid_dataloader)

Further work:
- try a more complex architecture
- try bidirectional RNNs
- try other hyperparameters
- try pretrained embeddings
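
For the bidirectional option, one detail is easy to miss: the forward and backward hidden states are concatenated, so the projection layer must accept twice the hidden size. A minimal sketch with `nn.GRU` on toy sizes (layer choice and `toy_*` names are assumptions):

```python
import torch
import torch.nn as nn

toy_emb_dim, toy_hidden, toy_num_tags = 8, 16, 5

toy_bi_encoder = nn.GRU(toy_emb_dim, toy_hidden, batch_first=True, bidirectional=True)
# Forward and backward states are concatenated, hence 2 * hidden size as input.
toy_projection = nn.Linear(2 * toy_hidden, toy_num_tags)

toy_emb = torch.randn(2, 7, toy_emb_dim)  # stand-in for embedded tokens
toy_states, _ = toy_bi_encoder(toy_emb)
toy_logits = toy_projection(toy_states).view(-1, toy_num_tags)
print(toy_states.shape, toy_logits.shape)
```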