# Learning to predict named entities

I went to <span style="border:#F87171 1px solid;border-radius:10px;padding:5px;padding-right:0px">Cologne&nbsp;<span style="color:#991B1B;border:#F87171 1px solid;background-color:#FEE2E2;border-radius:10px;padding:5px">CITY</span></span> yesterday. It was really nice!

---

Let's say we have a sentence like `I went to Cologne yesterday. It was really nice!` and we want to extract `{"CITY": "Cologne"}` from it. For tasks like this, named entity recognition (NER) is the go-to solution. In this notebook, we'll load an already labeled text corpus and build an NER classifier using [embedders](https://github.com/code-kern-ai/embedders) and [sequence-learn](https://github.com/code-kern-ai/sequence-learn).

As always, we first have to import our libraries.

In [None]:
from data.samples import get_entities_data

In [None]:
from sequencelearn.sequence_tagger import CRFTagger
from sequencelearn.point_tagger import TreeTagger

from sequencelearn.metrics import get_confusion_matrix
from embedders.extraction.contextual import TransformerTokenEmbedder

With that done, we can load the sample data. We'll just grab 200 samples for now.

In [None]:
corpus, labels = get_entities_data(num_samples=200)
print(corpus[0])
print(labels[0])

Now, for NER to work well, we need to split our data into tokens. A token is, for example, a word, which is what you would get if you split a sentence at each whitespace. Of course, there are cases in which tokenization is more complex, but for now, we can think of it like that.
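
To make the idea concrete, here is a minimal sketch that tokenizes the example sentence in two ways using only the Python standard library. The regular expression is an illustrative assumption and not the tokenizer used later in this notebook; it only shows why plain whitespace splitting is a simplification.

```python
import re

sentence = "I went to Cologne yesterday. It was really nice!"

# Naive tokenization: split at whitespace. Punctuation stays glued to words.
whitespace_tokens = sentence.split()

# Slightly more careful (illustrative only): runs of word characters and
# single punctuation marks each become their own token.
regex_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, `yesterday.` is a single token; the regular expression separates the word from the full stop.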

Further, we want to use modern, pre-trained architectures to kickstart our model's performance, so we will use transformers to calculate embeddings. The `embedders` library lets you tokenize, embed, and lastly match your documents, so we can create highly informative token-level embeddings within one line of code. `"distilbert-base-uncased"` is the configuration string of the [transformer](https://huggingface.co/) model we want to load, and `"en_core_web_sm"` is the [spaCy](https://spacy.io/) language model that we use.

In [None]:
embedder = TransformerTokenEmbedder("distilbert-base-uncased", "en_core_web_sm")
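
Transformer models usually work on subword pieces, while NER labels are attached to whole tokens, so the two have to be lined up somehow. The sketch below shows one common way to do that: averaging the vectors of all subwords that overlap a token's character span. It is an assumption for illustration and not necessarily how `embedders` matches embeddings internally; `pool_subwords`, the spans, and the vectors are all hypothetical.

```python
def pool_subwords(token_spans, subword_spans, subword_vectors):
    """Average the vectors of all subwords whose character span overlaps a token."""
    dim = len(subword_vectors[0])
    pooled = []
    for tok_start, tok_end in token_spans:
        members = [
            vec
            for (sub_start, sub_end), vec in zip(subword_spans, subword_vectors)
            if sub_start < tok_end and sub_end > tok_start  # spans overlap
        ]
        if members:
            # Element-wise mean over all overlapping subword vectors.
            pooled.append([sum(vals) / len(members) for vals in zip(*members)])
        else:
            # No subword overlaps this token: fall back to a zero vector.
            pooled.append([0.0] * dim)
    return pooled


# Hypothetical example: the token "Cologne" (characters 0-7) is split into
# two subwords covering characters 0-3 and 3-7, each with a toy 2-d vector.
token_spans = [(0, 7)]
subword_spans = [(0, 3), (3, 7)]
subword_vectors = [[1.0, 0.0], [3.0, 2.0]]

token_vectors = pool_subwords(token_spans, subword_spans, subword_vectors)
print(token_vectors)
```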

Next, we can just pour our text corpus into the embedder and create the embeddings.

In [None]:
embeddings = embedder.fit_transform(corpus) 
# for pre-trained models, you can also just go with embedder.transform(corpus)

Now that we have our embeddings, we can set aside a small number of training samples. For now, we'll go with 100 records, which leaves the remaining 100 for testing.

In [None]:
num_train_samples = 100

embeddings_train = embeddings[:num_train_samples]
embeddings_test = embeddings[num_train_samples:]

labels_train = labels[:num_train_samples]
labels_test = labels[num_train_samples:]
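
Slicing keeps the original order of the corpus. If the data happened to be sorted in some way, a shuffled split may be preferable. The following is a minimal sketch with a hypothetical helper, `shuffled_split`, that keeps features and labels aligned; the toy lists stand in for the real embeddings and labels.

```python
import random


def shuffled_split(features, targets, num_train, seed=42):
    """Shuffle indices once, then split features and targets consistently."""
    if len(features) != len(targets):
        raise ValueError("features and targets must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed for reproducibility
    train_idx, test_idx = indices[:num_train], indices[num_train:]
    return (
        [features[i] for i in train_idx],
        [features[i] for i in test_idx],
        [targets[i] for i in train_idx],
        [targets[i] for i in test_idx],
    )


# Toy stand-ins for per-document embeddings and labels.
toy_features = [f"doc-{i}" for i in range(10)]
toy_targets = [f"label-{i}" for i in range(10)]

f_train, f_test, t_train, t_test = shuffled_split(toy_features, toy_targets, num_train=6)
print(len(f_train), len(f_test))
```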

Now that the data is prepared, we can instantiate our model. In this example, we'll use a `TreeTagger`, a very simple approach that predicts the label of each token independently. There is also the option to use the `CRFTagger`, which is commonly used to predict labels for sequences when predictions depend on one another (i.e. there are different label probabilities for a given token $i$ depending on the label of token $i-1$).

In [None]:
tagger = TreeTagger()
tagger.fit(embeddings_train, labels_train)
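
To illustrate the difference between independent and dependent predictions, here is a small, self-contained sketch of Viterbi decoding, the standard way to find the best label sequence in a linear-chain CRF. It does not use `sequence-learn` and is not meant to mirror how `CRFTagger` is implemented; the scores, the labels, and the `viterbi` helper are made up for illustration.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one dict per token, mapping label -> score
    transitions: dict mapping (previous_label, label) -> score
    """
    # best[i][label] is the best score of any path ending in `label` at token i.
    best = [{label: emissions[0][label] for label in labels}]
    backpointers = []
    for emission in emissions[1:]:
        scores, pointers = {}, {}
        for label in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, label)])
            scores[label] = best[-1][prev] + transitions[(prev, label)] + emission[label]
            pointers[label] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(labels, key=lambda l: best[-1][l])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "B-CITY", "I-CITY"]

# Made-up per-token scores for the tokens "to", "New", "York".
emissions = [
    {"O": 2.0, "B-CITY": 0.0, "I-CITY": 0.0},
    {"O": 0.5, "B-CITY": 1.0, "I-CITY": 1.2},
    {"O": 0.4, "B-CITY": 0.3, "I-CITY": 1.0},
]

# Made-up transition scores: "I-CITY" directly after "O" is heavily penalized.
transitions = {(p, l): 0.0 for p in labels for l in labels}
transitions[("O", "I-CITY")] = -5.0
transitions[("B-CITY", "I-CITY")] = 1.0
transitions[("I-CITY", "I-CITY")] = 0.5

independent = [max(labels, key=lambda l: e[l]) for e in emissions]
dependent = viterbi(emissions, transitions, labels)

print(independent)
print(dependent)
```

With these toy scores, the independent prediction labels "New" as `I-CITY` right after an `O`, which is not a valid BIO sequence, while the Viterbi path starts the entity with `B-CITY`.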

With our model instantiated and trained, we can now make predictions.

In [None]:
preds_test = tagger.predict(embeddings_test)
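
To get from token-level predictions back to something like `{"CITY": "Cologne"}`, the tagged tokens have to be grouped into entities. The sketch below assumes the predictions are per-token BIO tags such as `B-CITY`, `I-CITY`, and `O`; that format is an assumption here, and `bio_to_entities` as well as the example tokens and tags are hypothetical.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (label, text) pairs."""
    entities = []
    current_label, current_tokens = None, []

    def flush():
        # Store the entity collected so far, if there is one.
        if current_label is not None:
            entities.append((current_label, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_label
        )
        if starts_new:
            # A stray "I-" tag without a matching entity is treated as a start.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-"):
            current_tokens.append(token)
        else:
            flush()
            current_label, current_tokens = None, []
    flush()
    return entities


tokens = ["I", "went", "to", "Cologne", "yesterday", "."]
tags = ["O", "O", "O", "B-CITY", "O", "O"]
entities = bio_to_entities(tokens, tags)
print(entities)
```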

Of course, we want to see how well our model predicts. We can just pass our predictions and labels to the confusion matrix calculation.

In [None]:
cm, labels_sorted_bio = get_confusion_matrix(preds_test, labels_test)

To help us analyze the results, we can make use of `ConfusionMatrixDisplay` from scikit-learn, so that we can see pairwise prediction/label combinations.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt

disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels_sorted_bio)

fig, ax = plt.subplots(figsize=(12, 12))
disp.plot(ax=ax);
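
Besides the plot, per-label precision and recall can be read off a confusion matrix directly. The sketch below assumes the scikit-learn convention that rows are true labels and columns are predicted labels; whether the matrix computed above follows that convention should be checked first. The helper `per_label_scores` and the toy matrix are hypothetical.

```python
def per_label_scores(matrix, label_names):
    """Compute precision and recall per label from a square confusion matrix.

    Assumes matrix[i][j] counts tokens with true label i predicted as label j.
    """
    scores = {}
    for i, name in enumerate(label_names):
        true_positive = matrix[i][i]
        actual = sum(matrix[i])                    # row sum: tokens that truly have this label
        predicted = sum(row[i] for row in matrix)  # column sum: tokens predicted as this label
        scores[name] = {
            "precision": true_positive / predicted if predicted else 0.0,
            "recall": true_positive / actual if actual else 0.0,
        }
    return scores


# Toy matrix for two labels, not taken from the notebook's data.
toy_labels = ["O", "B-CITY"]
toy_cm = [
    [90, 10],  # true "O": 90 predicted "O", 10 predicted "B-CITY"
    [5, 15],   # true "B-CITY": 5 predicted "O", 15 predicted "B-CITY"
]

toy_scores = per_label_scores(toy_cm, toy_labels)
for name, values in toy_scores.items():
    print(name, values)
```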

And that's it; you now have your tagger prepared and can easily use it to predict named entities within texts.

If you like our tutorial and the library, please consider giving this repository a star, or open an issue for things you'd like to see in this library.