# Sequence Labelling and Classification

In this session we'll first investigate Part-of-Speech (POS) tagging and Named-entity recognition (NER). 
- For this we will make use of the spaCy natural langauge processing API: https://spacy.io/
- spaCy is an opensource API that provides state-of-the-art performance on sequence labeling tasks such as POS tagging and NER. 
- Parts of this tutorial are based on code from: https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

In the second part of the tutorial we will train a Text classifier that makes use of a Bidirectional LSTM (Long Short-term Memory) model.


Before starting I need to connect the drive storage to the notebook.

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
import os
os.chdir('/content/drive/MyDrive/Colab Notebooks/NLP')
os.getcwd()

'/content/drive/MyDrive/Colab Notebooks/NLP'

## Installing spaCy and downloading models

First we need to check whether the spaCy library is installed:

In [None]:
!pip install -U spacy

Then we need to download pretrained models for use with spaCy. We will load models for both English and Italian:
- The models are called 'en_core_web_sm' and 'it_core_news_sm', where the 'web'/'news' indicates what type of collection the model was trained on and the 'sm' at the end indicates that we are using the 'small' version of the models
- Other models are available here: https://spacy.io/models
- The following code calls the python executable instructing it to run the module 'spacy', which in turn download the models. See discussion here: https://stackoverflow.com/questions/46148033/unable-to-load-en-from-spacy-in-jupyter-notebook

In [None]:
import sys
!{sys.executable} -m spacy download en_core_web_sm
!{sys.executable} -m spacy download it_core_news_sm;

We are now ready to import spacy and load a model. Let's start with the English model:

In [None]:
import spacy
import en_core_web_sm
nlp_model = en_core_web_sm.load()

Consider the following piece of text:

In [None]:
text = 'Melbourne is to re-enter Stage 3 lockdown after a record increase in cases. Victorian state premier Daniel Andrews said there was “simply no alternative” to reimposing stay at home restrictions in Australia’s second-biggest city.'
text = "Good evening, London. Allow me first to apologize for this interruption. I do, like many of you, appreciate the comforts of everyday routine, the security of the familiar, the tranquillity of repetition. I enjoy them as much as any bloke. But in the spirit of commemoration, whereby those important events of the past, usually associated with someone's death or the end of some awful bloody struggle, are celebrated with a nice holiday, I thought we could mark this November the fifth, a day that is sadly no longer remembered, by taking some time out of our daily lives to sit down and have a little chat."
print(text)

Parse the text using the NLP engine:

In [None]:
parsed_text = nlp_model(text)
print(parsed_text)

Did it do something? It looks like it has just output the same text.
- Actually, yes. It has parsed the input and built its internal datastructure from it. 
- Note that the length of the parsed object is in words, not characters:

In [None]:
print(f'The length of the original text is {len(text)} chacacters')
print(f'The length of the parsed text is {len(parsed_text)} words')

## Part-of-Speech Tagging

While parsing the text, spaCy performs part-of-speech (POS) tagging. 
- We can see the POS tag for each token as follows:

In [None]:
[(w,w.pos_) for w in parsed_text]

Who remembers their grammar from high school? What do all those POS symbols mean?
- You can find an explanation of the POS tags on this website https://spacy.io/api/annotation in the section "Universal Part-of-speech Tags" 

What can we do with POS tags? 
- Well, we could select all terms that have a certain tag, such as all adjectives:

In [None]:
set(w for w in parsed_text if w.pos_=='ADJ')

That was a little underwhelming. 
- Let's try it on Alice in Wonderland chapter 1 text. (You'll need to upload it again to Google Colab)

In [None]:
adjectives = sorted(set(w.text for w in nlp_model(open("docs/Alice_Chapter1.txt", "r").read()) if w.pos_=='ADJ'))
print(adjectives)

You can see how descriptive a writer Lewis Carroll was! 

This leads us to one explanation as to why we might want to extract POS tags from text: 
- They can sometimes be useful for **extracting features** (often handcrafted ones) for certain text classification tasks (such as authorship identification).
- This is particularly the case if only a small amount of training data is available.  
- For example, in this article (https://towardsdatascience.com/automatically-detect-covid-19-misinformation-f7ceca1dc1c7) hand-crafted features are extracted for classifying covid misinformation. 

Another reason why we might consider POS tagging is to **reduce ambiguity** in our bag-of-words representation by appending POS tags to word occurrences. 
- Consider the following two sentences:

In [None]:
ex1 = 'I catch the train to and from work.'       # This is Prof. Mark Carman speaking
ex2 = 'I like to train at least 6 times a week.'  # This is Prof. Jacked Carman speaking

print(ex1, '     <-- \'train\' is a', nlp_model(ex1)[3].pos_)
print(ex2, '<-- \'train\' is a', nlp_model(ex2)[3].pos_)

The second sentence has nothing to do with trains, despite containing the word 'train'!
- We could deal with this issue by appending the POS tag to the observed literal to form vocabulary elements: train_NOUN, train_VERB

A final reason why we might think about running POS tagging would be to extract proper nouns from the text, since they refer to real entities that are being discussed in it:

In [None]:
[w.text for w in parsed_text if w.pos_=='PROPN']

Shortly though, we will talk about Entity-extraction, which is the task of identifying and categorising the entities discussed in the text.

## Lemmatization

While parsing, spaCy also performs lemmatization. 
- Lemmatization is the process of extracting the 'lemma' for each token, which is the canonical form of the word that would be found in the dictionary, (see https://en.wikipedia.org/wiki/Lemma_(morphology))
- Basically, verbs converted to their root form, e.g.: **went, going, goes, gone => go**
- And nouns are retuned to singular form: **kittens => kitten**
- Lemmatization is a more complicated POS-aware process than stemming (https://en.wikipedia.org/wiki/Stemming). Stemmers simply apply a set of language-specific syntax rules to recover the stem of the word

In [None]:
[(x, x.lemma_) for x in parsed_text]

Why would one want to perfom lemmatization? -- Or stemming for that matter?
- to **reduce the vocabulary size** and thereby generalise the representation. -- This used to be very important for improving performance of search engine performance (better similarity measures between documents) and also classifiers on small datasets, (before word embeddings came along).
- to **look-up information** about the word in a dictionary/ontology, such as WordNet (https://en.wikipedia.org/wiki/WordNet). This used to be an important way to compute semantic similarity between words, but again, word embeddngs probably do a better job.

## Dependency Parsing

Tradititonally in Natural Language Processing, text is processed in a pipeline that first tokenizes, then POS tags, lemmatizes and finaly dependency parses a piece of text. 
- The idea with dependency parsing is to determine what function each of the word instances is fulfilling in the sentence. 
- What is the subject and object of the sentence? 
- Which noun is each adjective referring to?

So while parsing the text, the spaCy model also generates a **dependency parse tree**, which can be displayed using 'displacy':

In [None]:
from spacy import displacy
displacy.render(parsed_text, jupyter=True, style='dep')

Such dependency trees are interesting for understanding and visualising language (particularly for linguists) and could possibly be used for some downstream tasks (say checking ambiguity in legal documents).  

Consider the sentences:
- *The girl saw a man carrying a telescope.*
- *The girl saw a man with a telescope.*

Who had the telescope?

In [None]:
displacy.render(nlp_model('The girl saw a man carrying a telescope.'),jupyter=True,style='dep')

In [None]:
displacy.render(nlp_model('The girl saw a man with a telescope.'),jupyter=True,style='dep')

The second sentence is ambiguous: The girl may have made use of her telescope or the man may have been using the girl's telescope...
- Language is full of such ambiguities which we as humans naturally deal with using our prior knowledge and abilty to construct mental models of the situations described.
- This process is not without its biases:
  - *The doctor went over to talk to the nurse. She told him that she had just given the patient 5mg of Vicodin and the child had started convulsing. He listened attentively as she explained what had happened. The doctor was worried that the patient should not be given any more painkillers. The nurse told the doctor not to worry, that the patient was in good hands, and that he would let her know immediately if the child's condition changed.*
  - What gender are the doctor and the nurse?

## Extracting Entities

A more important output than the depency parse, from a text mining perspective, is the list of named-entities present in the text
- **named entities** are objects in the real world, e.g. persons, products, organizations, locations, etc. 
  - see https://en.wikipedia.org/wiki/Named_entity
- if spacy has found any named entities while parsing the text, we can access them as follows:

In [None]:
parsed_text.ents

Note that the entities are not single word tokens but short sequences of words: 'Stage 3' and 'Daniel Andrews'.

Not only does spacy extract the entities, but also categorises them based on their type:

In [None]:
print([(ent.text, ent.label_) for ent in parsed_text.ents])

The city and country locations have been labeled 'GPE' for 'geopolitical entity', while the Premier of Victoria has been correctly identified as a person. 
- Here is the list of all entity types that spaCy looks for: https://spacy.io/api/annotation#section-named-entities

Internally, the output of the Named Entity Recogniser is a sequence annotated with entities using inside-outside-beginning encoding: 
- see https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
- We can print out this labeling as follows:

In [None]:
[(X, X.ent_iob_, X.ent_type_) for X in parsed_text]

The above format is a bit hard to read though, so spaCy also provides a far more natural visualisation of the tags:

In [None]:
displacy.render(parsed_text, jupyter=True, style='ent')

## Extracting entities from a web document

Now that we know how to perform entity recognition on text, let's apply it to a full document

In [None]:
url = 'https://www.bbc.com/news/world-latin-america-53319517'
#url = 'https://en.wikipedia.org/wiki/Apple_(disambiguation)'

import requests
html_doc = requests.get(url).text

from bs4 import BeautifulSoup
parsed_doc = BeautifulSoup(html_doc, 'lxml')

Now lets extract the title and paragraph text:

In [None]:
title = parsed_doc.find('title').text
paragraphs = [p.text for p in parsed_doc.find_all('p')]

# Combine the title and paragraphs into a single text:
article_text = title + '\n\n' + '\n'.join(paragraphs)
print(article_text)

#article_text = parsed_doc.get_text()
#print(article_text)

---

Parse the article to identify the entities and display them: 

In [None]:
parsed_article = nlp_model(article_text)
displacy.render(parsed_article,jupyter=True,style='ent')

---

What do you think? Did it work?

Let's have a bit of a better look at the entities found
- List all the distinct entities found in the article, sorted alphabetically:

In [None]:
sorted(set(x.text for x in parsed_article.ents))

We can count the number of times each **entity type** occurs:

In [None]:
from collections import Counter

labels = [x.label_ for x in parsed_article.ents]
Counter(labels)

We can also count the number of times each **entity name** occurs

In [None]:
items = [x.text for x in parsed_article.ents]
Counter(items).most_common()

Note that some of the phrases refer to the same entity, e.g. 'Mr Bolsonaro' and just 'Bolsonaro'.
- Entity Linking and Reference Resolution are the NLP problems that deal with resolving the different references to the same entity in the text.

If we were only interested in what was being said about Bolsonaro, 
- we could select only sentences refering to him:

In [None]:
sentences_containing_Bolsonaro = [x for x in parsed_article.sents if 'Bolsonaro' in x.text]
displacy.render(sentences_containing_Bolsonaro,jupyter=True,style='ent')

## Named Entity Extraction in Italian

But wait, SpaCy can speak Italian too!
- Let's make use of the pretrained italian model that we downloaded earlier: https://spacy.io/models/it
- to recognise entities in an article from 'Il Corriere'

First download the article:

In [None]:
url = 'https://www.ansa.it/sito/notizie/mondo/2020/07/07/bolsonaro-ha-i-sintomi-del-coronavirus_40d26967-e377-4455-9b42-83c2756cf5f1.html'
html_doc = requests.get(url).text
parsed_doc = BeautifulSoup(html_doc, 'lxml')

Now let's extract the title and paragraph text:

In [None]:
title = parsed_doc.find('title').text
paragraphs = [p.text for p in parsed_doc.find_all('p')]

# Combine the title and paragraphs into a single text:
article_text = title + '\n\n' + '\n'.join(paragraphs)
print(article_text)

---

Now we'll parse the text of the article with an Italian NLP engine to extract Named Entities.
- First load the italian model 'it_core_news_sm' that we downloaded earlier

In [None]:
import it_core_news_sm
nlp_it = it_core_news_sm.load()

Parse article and extract the entities:

In [None]:
parsed_article = nlp_it(article_text)
displacy.render(parsed_article, jupyter=True, style='ent')

That looks not great. 
- Here are the entities found in the news article: 

In [None]:
sorted(set(x.text for x in parsed_article.ents))

Alterantively you can use Stanza (https://stanfordnlp.github.io/stanza/). It's very similar to spaCy:
- Python package
- Supports multiple languages
- Uses deep neural network modules

Let's start installing the library

In [None]:
!pip install stanza

Now we can import Stanza and create a pipeline for Italian

In [None]:
import stanza

stanza_nlp_model = stanza.Pipeline(lang='it', processors='tokenize,ner')

As before we need to parse the document

In [None]:
stanza_parsed_article = stanza_nlp_model(article_text)

Given a document, Stanza breaks it into sentences and then tokens.
For each token adds the tags using the sleected processors (here we are using only the NER processors).

Let's give a look at the identified entities:

In [None]:
for sentence in stanza_parsed_article.sentences:
    for entity in sentence.ents:  # Hello, Treebeard!
        print(f"{entity.text}: {entity.type}")

That's a bit better than before, don't you think?

## Fine-tuning your own NER Model

What if you want to update the Named Entity Extraction model yourself in order to customize it to your set of entities? We'll have a look at that now based on:
- this instructions page: https://spacy.io/usage/training#ner
- and this blog post: https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718

In order to fine-tune the model, we need to prepare data in the following format: 
- a piece of text, 
- plus a list of entity types that occur it along with their positions.

In [None]:
my_data = [
    ("Have you heard of an associate professor from the Politecnico di Milano called Mark Carman?", {"entities": [(50, 71, "ORG"),(79, 90, "PERSON")]}),
    ("No, I haven't, but I don't know many people at the Politecnico. What does he work on?", {"entities": [(51, 62, "ORG")]}),
    ("Mainly machine learning and text mining. I met him a couple of years ago at SIGIR in Tokyo.", {"entities": [(76, 81, "EVENT"),(85, 90, "GPE")]}),
]
my_data

Where would this data come from? 
- either created manually, perhaps by searching for known individuals in a text collection,
- or by using an annotation tool such as https://doccano.herokuapp.com/, see for example: https://medium.com/@justindavies/training-spacy-ner-models-with-doccano-8d8203e29bfa


The following code comes from here: https://github.com/explosion/spaCy/blob/master/examples/training/train_ner.py
- The only change made was to remove the training data

Before starting we need to install the plac package

In [None]:
!pip install plac

Now we define function with the training loop:

In [None]:
from __future__ import unicode_literals, print_function

import plac
import random
import warnings
from pathlib import Path
import spacy
from spacy.util import minibatch, compounding


@plac.annotations(
    model=("Model name. Defaults to blank 'en' model.", "option", "m", str),
    output_dir=("Optional output directory", "option", "o", Path),
    n_iter=("Number of training iterations", "option", "n", int),
)
def main(model=None, output_dir=None, n_iter=100):
    """Load the model, set up the pipeline and train the entity recognizer."""
    if model is not None:
        nlp = spacy.load(model)  # load existing spaCy model
        print("Loaded model '%s'" % model)
    else:
        nlp = spacy.blank("en")  # create blank Language class
        print("Created blank 'en' model")

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if "ner" not in nlp.pipe_names:
        ner = nlp.create_pipe("ner")
        nlp.add_pipe(ner, last=True)
    # otherwise, get it so we can add labels
    else:
        ner = nlp.get_pipe("ner")

    # add labels
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get("entities"):
            ner.add_label(ent[2])

    # get names of other pipes to disable them during training
    pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]
    # only train NER
    with nlp.disable_pipes(*other_pipes), warnings.catch_warnings():
        # show warnings for misaligned entity spans once
        warnings.filterwarnings("once", category=UserWarning, module='spacy')

        # reset and initialize the weights randomly – but only if we're
        # training a new model
        if model is None:
            nlp.begin_training()
        for itn in range(n_iter):
            random.shuffle(TRAIN_DATA)
            losses = {}
            # batch up the examples using spaCy's minibatch
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
            print("Losses", losses)

    # test the trained model
    for text, _ in TRAIN_DATA:
        doc = nlp(text)
        print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
        print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

    # save model to output directory
    if output_dir is not None:
        output_dir = Path(output_dir)
        if not output_dir.exists():
            output_dir.mkdir()
        nlp.to_disk(output_dir)
        print("Saved model to", output_dir)

        # test the saved model
        print("Loading from", output_dir)
        nlp2 = spacy.load(output_dir)
        for text, _ in TRAIN_DATA:
            doc = nlp2(text)
            print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
            print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

Once the above model has been defined, we can update and save the model
- Note that this code doesn't currently run in Google colab. 

In [None]:
TRAIN_DATA=my_data  # The method expects the training data to have this name
main(model='en_core_web_sm',output_dir='spacy_model',n_iter=5)

## Entity Linking in spaCy

We don't want to just find entity mentions in a document but link them to a known entity in a knowledge base. 
- The task of linking the entity mentions to the corresponding entity in the knowledge base is called 'Entity Linking'.
- I don't have time here, but watch this video to learn more: 
https://spacy.io/universe/project/video-spacy-irl-entity-linking

In [None]:
# TODO

## Sequence labelling with embeddings and a Recurrent Neural Network

In this section of the notebook I will run through an example of using LSTM (Long Short-term Memory) network for text sequence labelling.
We can train our own model for POS-tagging or NER.
Moreover, we can use pre-trained embedding models to encode the input text.

- We are going to use PyTorch (https://pytorch.org) to build and train our model. Pytorch is a state-of-the-art framework for deep leaning (much better than TensorFlow IMHO).

### Data preparation

As usula we start from data preparation. 
We can use the [WikiNER](https://www.sciencedirect.com/science/article/pii/S0004370212000276) corpus, which provides corpora for POS-tagging and NER in multiple languages.
Today we are going to focus on NER.

You can find a copy the English split in the `doc/` directory. 
All the parts of the corpus are avaialble here: https://figshare.com/articles/dataset/Learning_multilingual_named_entity_recognition_from_Wikipedia/5462500

Let's start by loading the file in memory and reading it line by line. Each line corresponds to a sentence.

In [1]:
with open('docs/aijwikinerenwp3') as f:
    wiki_ner_data = f.read().strip().split('\n')
wiki_ner_data[:10]

['The|DT|I-MISC Oxford|NNP|I-MISC Companion|NNP|I-MISC to|TO|I-MISC Philosophy|NNP|I-MISC says|VBZ|O ,|,|O "|LQU|O there|EX|O is|VBZ|O no|DT|O single|JJ|O defining|VBG|O position|NN|O that|IN|O all|DT|O anarchists|NNS|O hold|VBP|O ,|,|O and|CC|O those|DT|O considered|VBN|O anarchists|NNS|O at|IN|O best|JJS|O share|NN|O a|DT|O certain|JJ|O family|NN|O resemblance|NN|O .|.|O "|RQU|O',
 'In|IN|O the|DT|O end|NN|O ,|,|O for|IN|O anarchist|JJ|O historian|JJ|O Daniel|NNP|I-PER Guerin|NNP|I-PER "|LQU|O Some|DT|O anarchists|NNS|O are|VBP|O more|RBR|O individualistic|JJ|O than|IN|O social|JJ|O ,|,|O some|DT|O more|JJR|O social|JJ|O than|IN|O individualistic|JJ|O .|.|O',
 'From|IN|O this|DT|O climate|NN|O William|NNP|I-PER Godwin|NNP|I-PER developed|VBD|O what|WP|O many|NN|O consider|VBP|O the|DT|O first|JJ|O expression|NN|O of|IN|O modern|JJ|O anarchist|NN|O thought|NN|O .|.|O',
 'Godwin|NNP|I-PER was|VBD|O ,|,|O according|VBG|O to|TO|O Peter|NNP|I-PER Kropotkin|NNP|I-PER ,|,|O "|LQU|O the|DT|O

Now we can parse the data. 
Tokens are separated by spaces and for each token we have the associated POS tag and the NER tag written as: `<token>|<POS tag>|<NER tag>`.

Here NER tags are written using a system called BIO-tagging.
The 'B' stands for "begin" and introduces (starts) a new named entity, the tags are written as "B-PER" to indicate a person or "B-LOC" to indicate a location and so on. 
The 'I' stands for "inside" and continues a started named entity, the tags are written as "I-PER" to indicate a person or "I-LOC" to indicate a location and so on. 
The 'O' stands for outside, it means that the token is outside any named entity. 
There are other tagging systems.  

In [2]:
keys = ['text', 'pos_tag', 'ner_tag']
wiki_ner_data = [[dict(zip(keys, token.split('|'))) for token in sentence.split()] for sentence in wiki_ner_data]

wiki_ner_data[0]

[{'text': 'The', 'pos_tag': 'DT', 'ner_tag': 'I-MISC'},
 {'text': 'Oxford', 'pos_tag': 'NNP', 'ner_tag': 'I-MISC'},
 {'text': 'Companion', 'pos_tag': 'NNP', 'ner_tag': 'I-MISC'},
 {'text': 'to', 'pos_tag': 'TO', 'ner_tag': 'I-MISC'},
 {'text': 'Philosophy', 'pos_tag': 'NNP', 'ner_tag': 'I-MISC'},
 {'text': 'says', 'pos_tag': 'VBZ', 'ner_tag': 'O'},
 {'text': ',', 'pos_tag': ',', 'ner_tag': 'O'},
 {'text': '"', 'pos_tag': 'LQU', 'ner_tag': 'O'},
 {'text': 'there', 'pos_tag': 'EX', 'ner_tag': 'O'},
 {'text': 'is', 'pos_tag': 'VBZ', 'ner_tag': 'O'},
 {'text': 'no', 'pos_tag': 'DT', 'ner_tag': 'O'},
 {'text': 'single', 'pos_tag': 'JJ', 'ner_tag': 'O'},
 {'text': 'defining', 'pos_tag': 'VBG', 'ner_tag': 'O'},
 {'text': 'position', 'pos_tag': 'NN', 'ner_tag': 'O'},
 {'text': 'that', 'pos_tag': 'IN', 'ner_tag': 'O'},
 {'text': 'all', 'pos_tag': 'DT', 'ner_tag': 'O'},
 {'text': 'anarchists', 'pos_tag': 'NNS', 'ner_tag': 'O'},
 {'text': 'hold', 'pos_tag': 'VBP', 'ner_tag': 'O'},
 {'text': ',', 

Now all the labels are properly organised

At this point we need a system ti encode and decode the labels into categorical entities. 
We can use the label encoder from Scikit-Learn for that (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html).

In [3]:
from sklearn.preprocessing import LabelEncoder

pos_le = LabelEncoder().fit([token['pos_tag'] for sentence in wiki_ner_data for token in sentence])
ner_le = LabelEncoder().fit([token['ner_tag'] for sentence in wiki_ner_data for token in sentence])

Now we have a module mapping from tags to IDs and vice-versa

In [4]:
ner_tag = ['I-PER']
# ner_tag = ['I-LOC']
# ner_tag = ['B-PER']
# ner_tag = ['O']

ner_le.transform(ner_tag)

array([7])

In [5]:
ner_tag_id = [0]

ner_le.inverse_transform(ner_tag_id)

array(['B-LOC'], dtype='<U6')

How many NER tags do we have? 

In [6]:
len(ner_le.classes_)

9

Which are those tags?

In [7]:
ner_le.classes_

array(['B-LOC', 'B-MISC', 'B-ORG', 'B-PER', 'I-LOC', 'I-MISC', 'I-ORG',
       'I-PER', 'O'], dtype='<U6')

Collect the same info for POS tags

In [8]:
# TODO

Finally we can do our train-validation-test split

In [13]:
from sklearn.model_selection import train_test_split

tmp_data, test_data = train_test_split(wiki_ner_data)
train_data, valid_data = train_test_split(wiki_ner_data)

### Defining and training the RNN model for NER

We start by installing PyTorch and importing the required modules

In [14]:
!pip install torch



In [15]:
import torch
import torch.nn as nn
import torch.nn.functional as F

The we load the Word Embedding model we want to use. We can re-use the 50 dimensional GloVe emebddings from last time

In [16]:
import gensim.downloader as api

we_model = api.load("glove-wiki-gigaword-50")

Before creating the LSTM we decide where to train our model, either cpu or gpu, depending on which is avaialble.

In [17]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Now we can create an RNN module.
Here are the available modules for PyTorch: https://pytorch.org/docs/stable/nn.html#recurrent-layers

In [18]:
lstm = nn.LSTM(
    input_size=50,  # We use the dimensions of the input embeddings
    hidden_size=len(ner_le.classes_),  # We use the otput of the LSTM directly for classification
    batch_first=True,  # To specify that the first dimension of the input tensor in the batch size (more on this later)
)

Now we can move the LSTM to the target device

In [19]:
lstm = lstm.to(device)

Once we have the model we need to create an optimizer that will take care of updating the weights. Here are the available optimizers: https://pytorch.org/docs/stable/optim.html.
I'm going to use RMSProp.

The optimizer needs to receive the parameters of the neural network and the selected learning rate

In [20]:
lr = 0.001
optimizer = torch.optim.RMSprop(params=lstm.parameters(), lr=lr)

Now we need to prepare our data. 
Before doing so we prepare a function that maps a mini-batch of samples (i.e., a subset of the lists of dictionaries we prepared) to input tensors to use with our model. The function willl take care of:
- Mapping all words to their embeddings with size $d$
- Collect the emebddings of the same sentence into a matrix with shape $(n_\textit{words in sentence}, d)$
- Collect the different matrices of the same batch into a single tensor with shape $(n_\textit{batch elements}, n_\textit{words in longest sentence}, d)$ (this will be input).
- Mapping all the labels to their IDs 
- Collecting all the IDs of the same sentence into a vector with size $n_\textit{words in sentence}$ 
- Collecting all the different vectors into a matrix with shape $(n_\textit{batch elements}, n_\textit{words in longest sentence})$ (this will be target output).

Note that different sentences have different lengths, to cope with this issue we apply a process called padding: we are going to add to the input tensor and the target output matrix  dummy values to have all the same "sentence length".

In [21]:
import numpy as np

def collate(mini_batch):
    # Get the length of the longest sentence
    longest_len = max(len(sample) for sample in mini_batch)
    # Create an input tensor with all zero values
    input_embeds = np.zeros((len(mini_batch), longest_len, 50))
    # Create a target output matrix with all -100 (PyTorch ignores this value by default)
    output_lbl = np.full((len(mini_batch), longest_len), -100)
    # Fill the tensor and the matrix
    for i, sample in enumerate(mini_batch):
        for j, token in enumerate(sample):
            # Manage missing tokens in vocabulary
            if token['text'].lower() in we_model:
                input_embeds[i,j] = we_model[token['text'].lower()]
            output_lbl[i,j] = ner_le.transform([token['ner_tag']])
    # Convert to PyTorch tensor
    input_embeds = torch.tensor(input_embeds, dtype=torch.float)
    output_lbl = torch.tensor(output_lbl)

    return input_embeds, output_lbl

How does an encoded batch looks like? Let's enode the first three sentences

In [22]:
train_data[:3]

[[{'text': 'The', 'pos_tag': 'DT', 'ner_tag': 'O'},
  {'text': 'A300', 'pos_tag': 'NNP', 'ner_tag': 'I-MISC'},
  {'text': 'provided', 'pos_tag': 'VBD', 'ner_tag': 'O'},
  {'text': 'Airbus', 'pos_tag': 'NNP', 'ner_tag': 'I-ORG'},
  {'text': 'the', 'pos_tag': 'DT', 'ner_tag': 'O'},
  {'text': 'experience', 'pos_tag': 'NN', 'ner_tag': 'O'},
  {'text': 'of', 'pos_tag': 'IN', 'ner_tag': 'O'},
  {'text': 'manufacturing', 'pos_tag': 'VBG', 'ner_tag': 'O'},
  {'text': 'and', 'pos_tag': 'CC', 'ner_tag': 'O'},
  {'text': 'selling', 'pos_tag': 'VBG', 'ner_tag': 'O'},
  {'text': 'airliners', 'pos_tag': 'NNS', 'ner_tag': 'O'},
  {'text': 'competitively', 'pos_tag': 'RB', 'ner_tag': 'O'},
  {'text': '.', 'pos_tag': '.', 'ner_tag': 'O'}],
 [{'text': 'After', 'pos_tag': 'IN', 'ner_tag': 'O'},
  {'text': 'spending', 'pos_tag': 'VBG', 'ner_tag': 'O'},
  {'text': 'over', 'pos_tag': 'IN', 'ner_tag': 'O'},
  {'text': 'a', 'pos_tag': 'DT', 'ner_tag': 'O'},
  {'text': 'year', 'pos_tag': 'NN', 'ner_tag': 'O'}

In [23]:
embeds, lbl = collate(train_data[:3])

print(f"The shape of the input is: {embeds.size()}")
print(f"The shape of the output is: {lbl.size()}")

The shape of the input is: torch.Size([3, 38, 50])
The shape of the output is: torch.Size([3, 38])


In [24]:
embeds

tensor([[[ 4.1800e-01,  2.4968e-01, -4.1242e-01,  ..., -1.8411e-01,
          -1.1514e-01, -7.8581e-01],
         [ 1.3736e+00, -7.6518e-01, -2.2713e-01,  ...,  4.5972e-01,
           2.6261e-01, -5.0583e-01],
         [ 5.2014e-01,  3.8829e-01,  1.1381e-01,  ...,  8.4034e-01,
          -2.9246e-02,  1.8248e-01],
         ...,
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00],
         [ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
           0.0000e+00,  0.0000e+00]],

        [[ 3.8315e-01, -3.5610e-01, -1.2830e-01,  ..., -4.2580e-01,
           1.3681e-01, -7.7731e-01],
         [ 1.2570e-03, -4.6617e-01,  7.5832e-01,  ...,  6.0802e-01,
           4.2680e-01,  9.9916e-01],
         [ 1.2972e-01,  8.8073e-02,  2.4375e-01,  ...,  9.0912e-02,
          -6.0515e-01, -9.8270e-01],
         ...,
         [ 0.0000e+00,  0

In [25]:
lbl

tensor([[   8,    5,    8,    6,    8,    8,    8,    8,    8,    8,    8,    8,
            8, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100],
        [   8,    8,    8,    8,    8,    8,    8,    8,    8,    8,    8,    8,
            8,    8,    8,    8,    8,    8,    8,    7,    7,    8,    8,    8,
            8,    8,    8, -100, -100, -100, -100, -100, -100, -100, -100, -100,
         -100, -100],
        [   8,    8,    8,    8,    8,    8,    8,    8,    8,    8,    8,    8,
            8,    8,    8,    8,    8,    8,    8,    8,    8,    8,    4,    8,
            8,    8,    8,    8,    8,    8,    8,    8,    8,    8,    8,    4,
            8,    8]])

Now we can finally wrap a DataLoader around our samples. PyTorch data loaders take care of generating batches on a given data set. We just need to set the batch size.

In [26]:
batch_size = 32

train_loader = torch.utils.data.DataLoader(train_data, batch_size=batch_size, collate_fn=collate, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_data, batch_size=batch_size, collate_fn=collate)
test_loader = torch.utils.data.DataLoader(test_data, batch_size=batch_size, collate_fn=collate)

We can train the lst iterating over the data set for a given number of epochs

In [27]:
# Import nice loading bar using tqdm
from tqdm import tqdm

# Set model in training mode
lstm.train()

n_epochs = 3

# Iterate over epochs
for i in range(n_epochs):
    print(f"Starting epoch {i + 1}/{n_epochs}")
    # Iterate over training batches
    for embeds, lbl in tqdm(train_loader):
        # Zero your gradients for every batch!
        optimizer.zero_grad()
        # Move input and output to target device
        embeds = embeds.to(device)
        lbl = lbl.to(device)
        # Compute logits (i.e., softmax values before exponential normalisation)
        logits, _ = lstm(embeds)
        # Flatten logits to a shape (batch_size * max_sentence_len, n_classes)
        logits = logits.reshape(-1, len(ner_le.classes_))
        # Flatten targets to a shape (batch_size * max_sentence_len)
        lbl = lbl.reshape(-1)
        # Compute loss
        loss = F.cross_entropy(logits, lbl)
        # Compute gradients
        loss.backward()
        # Update weights
        optimizer.step()


  0%|                                          | 1/3483 [00:00<07:26,  7.79it/s]

Starting epoch 1/3


100%|███████████████████████████████████████| 3483/3483 [01:34<00:00, 36.66it/s]
  0%|                                          | 4/3483 [00:00<01:37, 35.72it/s]

Starting epoch 2/3


100%|███████████████████████████████████████| 3483/3483 [01:37<00:00, 35.90it/s]
  0%|                                          | 3/3483 [00:00<01:59, 29.15it/s]

Starting epoch 3/3


100%|███████████████████████████████████████| 3483/3483 [01:34<00:00, 36.83it/s]


We can also test out model using the dedicated split. We iterate over the mini batches, colelct the prediction as the value with the highest logit value and we store these values until we go through the entire data set. Then we compute the classification report

In [28]:
from sklearn.metrics import classification_report

# Set model in evaluation mode
lstm.eval()

y_true = []
y_pred = []

# Disable gradients
with torch.no_grad():
    # Iterate over validation batches
    for embeds, lbl in tqdm(test_loader):
        # Move input and output to target device
        embeds = embeds.to(device)
        # Compute logits (i.e., softmax values before exponential normalisation)
        logits, _ = lstm(embeds)
        # Get predictions as the index corresponding to the highest logit score
        pred_lbl = torch.argmax(logits, dim=-1)
        # Append predicted labels
        y_pred.append(pred_lbl.reshape(-1).cpu().numpy())  
        # Append target labels
        y_true.append(lbl.reshape(-1).numpy())

# Concatenate all the vectors of target labels and predicted labels
y_true = np.concatenate(y_true)
y_pred = np.concatenate(y_pred)
# Remove elements to ignore (the -100 labels)
mask = y_true == -100
y_true = y_true[~mask]
y_pred = y_pred[~mask]

# Finally compute classification report
print(classification_report(y_true, y_pred, target_names=ner_le.classes_))

100%|███████████████████████████████████████| 1161/1161 [00:24<00:00, 47.66it/s]
  mask &= (ar1 != a)


TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

We can play a bit with the model directly.

Let's define a function to compute the predictions give a sentence. We will use the NLTK tokenizer to split the sentence into word tokens

In [43]:
from nltk.tokenize import word_tokenize

def predict(sample):
    # Tokenize sample
    tokenized_sample = word_tokenize(sample)
    # Create an input tensor with all zero values
    input_embeds = np.zeros((1, len(tokenized_sample), 50))
    # Fill the tensor and the matrix
    for i, token in enumerate(tokenized_sample):
        # Manage missing tokens in vocabulary
        if token.lower() in we_model:
            input_embeds[0, i] = we_model[token.lower()]
    # Convert to PyTorch tensor
    input_embeds = torch.tensor(input_embeds, dtype=torch.float, device=device)
    # Run model over input
    logits, _ = lstm(embeds)
    # Get predictions as the index corresponding to the highest logit score
    pred_lbl = torch.argmax(logits, dim=-1)
    # Decode labels
    pred_labels = ner_le.inverse_transform(pred_lbl.reshape(-1).cpu().numpy())
    # Group together tokens and predicted NER labels
    labelled_sample = [{'text': token, 'ner_tag': str(lbl)} for token, lbl in zip(tokenized_sample, pred_labels)] 

    return labelled_sample

And now call it on a custom sample

In [44]:
sample = "Hello, my name is Vincenzo and I like pizza."

predict(sample)

[{'text': 'Hello', 'ner_tag': 'O'},
 {'text': ',', 'ner_tag': 'O'},
 {'text': 'my', 'ner_tag': 'O'},
 {'text': 'name', 'ner_tag': 'O'},
 {'text': 'is', 'ner_tag': 'O'},
 {'text': 'Vincenzo', 'ner_tag': 'O'},
 {'text': 'and', 'ner_tag': 'O'},
 {'text': 'I', 'ner_tag': 'I-LOC'},
 {'text': 'like', 'ner_tag': 'O'},
 {'text': 'pizza', 'ner_tag': 'O'},
 {'text': '.', 'ner_tag': 'O'}]

### Defining and training the RNN model for POS-tagging

You can do this at home to start getting familiar with PyTorch and RNNs

In [None]:
# TODO

## Text Classification with a Recurrent Neural Network

In this last section of the notebook I will run through a quick example of using a Bidirectional LSTM (Long Short-term Memory) network for text classification. 
- RNNs extend embedding-based classification of text by taking word-order into account. They were, until relatively recently, the state-of-the-art when it came to training text classifiers.
- Tensorflow is sophisticated toolkit for building Deep Neural Network models. We will use it to build the model. The tutorial follows mostly this Tensorflow tutorial: https://www.tensorflow.org/tutorials/text/text_classification_rnn


### Data preparation

First let's load the Twitter dataset we used in the second session:

In [None]:
import nltk
nltk.download('twitter_samples')

from nltk.corpus import twitter_samples
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

Remove emoticons from the positive and negative examples:

In [None]:
import re 
emoticon_regex = '(\:\w+\:|\<[\/\\]?3|[\(\)\\\D|\*\$][\-\^]?[\:\;\=]|[\:\;\=B8][\-\^]?[3DOPp\@\$\*\\\)\(\/\|])(?=\s|[\!\.\?]|$)'
positive_tweets_noemoticons = [re.sub(emoticon_regex,'',tweet) for tweet in positive_tweets]
negative_tweets_noemoticons = [re.sub(emoticon_regex,'',tweet) for tweet in negative_tweets]

And create the examples and labels as we did before. This time we will use numeric labels (0,1) instead of text labels ('negative','positive'), since the deep learning library we will use requires numeric class labels.

In [None]:
tweets_x = positive_tweets_noemoticons + negative_tweets_noemoticons
tweets_y = [1]*len(positive_tweets) + [0]*len(negative_tweets)

And again, split the data into training, validation and test:

In [None]:
from sklearn.model_selection import train_test_split
temp_x, test_x, temp_y, test_y = train_test_split(tweets_x, tweets_y, test_size=0.2)
train_x, valid_x, train_y, valid_y = train_test_split(temp_x, temp_y, test_size=0.2)

Now that we have the training and validation data prepared, we can import the Tensorflow library, and use it to load the training and validaton datasets into the tensorflow format. Note that:
- Tensorflow comes installed on Google Colab. 
- If you run this notebook on your own machine you will need to first install tensorflow using '!pip install'

In [None]:
import tensorflow as tf
train_tf = tf.data.Dataset.from_tensor_slices((train_x, train_y))
valid_tf = tf.data.Dataset.from_tensor_slices((valid_x, valid_y))

Training will run on *batches* of the data at a time, so we need to create them.
- We first use the shuffle command to randomise the order of the training data. (The buffer-size limits the number of instances loaded into memory when shuffling and is only for efficiency -- you could remove it.)
- We then create the batches. Each batch will contain 64 examples.
- The validation data needs to have the same format as the training data, so we batch it too.

In [None]:
train_dataset = train_tf.shuffle(buffer_size=10000).batch(batch_size=64).prefetch(tf.data.AUTOTUNE)
valid_dataset = valid_tf.batch(batch_size=64).prefetch(tf.data.AUTOTUNE)

Let's have a look at the first batch in the training data. It consists of:
- an array of strings (tweets)
- an array of binary values (class labels)

In [None]:
for batch in train_dataset.take(1):
  print(batch)

Now that we have the text data in the format required, we can vectorize it. We will need to make use a specific text vectorization module from tensorflow to do this.
- We first limit the vocabulary of the vectorizer to 5000,
- then extract only the text portion of the training dataset,
- and finally fit the vectorizer to the text using the 'adapt' method:

In [None]:
from tensorflow.keras.layers import TextVectorization
vectorizer = TextVectorization(max_tokens=5000)
train_text = train_dataset.map(lambda text, label: text)
vectorizer.adapt(train_text)

Let's print out the first tokens form the vocabulary:

In [None]:
vocab = vectorizer.get_vocabulary()
vocab[:100]

Note that the first two tokens in the vocabulary are the empty token '', and the unknown token '[UNK]'. The latter is used to mask out-of-vocabulary tokens in the text

We can now use the vectorizer to encode a tweet:

In [None]:
text = 'This is my first tweet! It contains one out-of-vocabulary term. Any suggestions for extending this tweet?'
encoding = vectorizer([text]).numpy()[0]
print('Tweet:     ', text)
print('Encoded:   ', encoding) 
print('Recovered: ',' '.join([vocab[i] for i in encoding]))

Note that the vectorizer is not turning the text into a single vector, but is simply replacing the vocabulary words by their indices. If a word is not present in the dictionary it is replaced by the unknown token.

Let's have a look at some actual examples from the dataset, printing out the first 6 tweets:

In [None]:
for text in batch[0][:6].numpy():
    encoding = vectorizer([text]).numpy()[0]
    print('Tweet:     ', text.decode("utf-8"))
    print('Encoded:   ', encoding) 
    print('Recovered: ',' '.join([vocab[i] for i in encoding]))
    print()

### Defining the RNN model

Now we can define the model, which contains four layers: 
- an input embedding layer which produces word embeddings of size 64
- a bidirectional LSTM layer
- 2 dense (aka fully connected) layers that maps the 2 embedding vectors (of size 64) produced by the bidirectional LSTM down to a single neuron   

This constitutes a relatively standard basic RNN architecture. (The details of why these specific components are chosen is beyond the scope of this tutorial.)  

Once the model has been defined it is compiled in the following step: 

In [None]:
model = tf.keras.Sequential([
    vectorizer,                       
    tf.keras.layers.Embedding(input_dim=len(vectorizer.get_vocabulary()),output_dim=64,mask_zero=True), 
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),metrics=['accuracy'],optimizer=tf.keras.optimizers.Adam(1e-4))

Fit the model by running it for 10 epochs (iterations over the training data).
- Note that we provide it with both the training dataset and the validation dataset:

In [None]:
model.fit(train_dataset, epochs=10, validation_data=valid_dataset, validation_steps=20)

Once we've trained the model we can check the final accuracy on the validation data:

In [None]:
valid_loss, valid_acc = model.evaluate(valid_dataset)

print('Validation Loss: {}'.format(valid_loss))
print('Validation Accuracy: ',valid_acc)

We can have a look at the predictions from the model:

In [None]:
tweets = []
tweets.append('I can\'t believe how much fun I\'m having learning to train a text classifier with a bidirectional LSTM!')
tweets.append('I am really confused. I want my mommy.')
tweets.append('The internet connection has been pretty annoying today!')
tweets.append('They just played my favourite song on the radio.')
tweets.append("I don't like going to the dentist.")
tweets.append('I am so happy today!')
tweets.append('I am so unhappy today!')

predictions = model.predict(tweets)

for i in range(len(tweets)):
  print('tweet: ',tweets[i])
  encoding = vectorizer([tweets[i]]).numpy()[0]
  print('encoded as: ',' '.join([vocab[j] for j in encoding]))
  print('predicted value: ', predictions[i][0])
  print('predicted label: ', 'negative' if (predictions[i]<0) else 'positive')
  print()

And calculate the usual evaluation metrics:

In [None]:
pred_y = [0 if (pred < 0) else 1 for pred in model.predict(valid_x)]

from sklearn.metrics import accuracy_score
print('accuracy: '+ str(accuracy_score(pred_y, valid_y)))

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cmd = ConfusionMatrixDisplay(confusion_matrix(valid_y, pred_y,normalize='all'),display_labels=['negative','positive'])
cmd.plot()

Finally, let's print out the model summary to get an understanding of the number of parameters in the model:

In [None]:
print(model.summary())

Most of the parameters are used to define the embeddiing, then the LSTMs. 