# Named Entity Recognition (NER) with spaCy

This notebook walks through the full end-to-end process for Named Entity Recognition ([**NER**](https://nlp.stanford.edu/ner/)) with [**spaCy**](https://spacy.io/), covering both data labeling and model training. We will also explore **pattern matching** as an alternative to **NER** when there is a small, known set of fixed values.

As an example, we will create a model to detect entities related to **oil/petrol** in this [public dataset](https://www.kaggle.com/mitusha/email-dataset), which contains emails related to the oil industry. This is an oversimplification, since in practice you would want more generic entities, but it provides a good example of a case where pattern matching is a better option than NER. In short, we are going to extract oil-related entities from emails.

We will perform the following steps:
- Read the email data set, which has one email per line.
- Label the emails with the OIL entity using the **[Doccano](http://doccano.herokuapp.com/)** labeling tool. This is a manual process.
- Save the labels in a text file as **JSONL**.
- Use the **spaCy** neural network model to train a new statistical model.
- Save the model.
- Create a **spaCy** NLP pipeline and use the new model to detect oil entities it has never seen before.
- Finally, use pattern matching instead of a deep learning model and compare both methods.



## Label the Data

The first step is to label the entities using **Doccano**.

Follow the **[Doccano](https://doccano.github.io/doccano/tutorial/)** instructions to install and open Doccano.

If you use **Linux/Mac**, I recommend using the Docker image:
- `docker pull doccano/doccano`
- `docker container create --name doccano -e "ADMIN_USERNAME=admin" -e "ADMIN_EMAIL=admin@example.com" -e "ADMIN_PASSWORD=password" -p 8000:8000 doccano/doccano`
- `docker container start doccano`
    
For **Windows**, just use **pip**: 
- `pip install doccano`
- `doccano`

Go to http://127.0.0.1:8000/.

Next, label the data using Doccano. Find entities that refer to oil, petrol, petroleum, etc., and label them with the tag **OIL**.

Export the result in the **JSONL (Text label)** format.

## Model Training

First, let's read the JSONL file, where each line has the following format:

`{"id": 15, "text": "....", "meta": {}, "annotation_approver": null, "labels": [[226, 234, "OIL"]]}`
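
To make the format concrete, here is a minimal sketch that parses one record and slices the text with the label offsets. The sample record and the helper name `labeled_spans` are made up for illustration and are not taken from the dataset. The sketch assumes the offsets are character positions with an exclusive end, as in Python slicing.

```python
import json

# Hypothetical record in the exported format (not from the real dataset)
sample_line = (
    '{"id": 1, "text": "crude oil prices rose today", "meta": {}, '
    '"annotation_approver": null, "labels": [[0, 9, "OIL"]]}'
)

record = json.loads(sample_line)


def labeled_spans(record):
    """Return (substring, label) pairs for every label in a record."""
    text = record["text"]
    return [(text[start:end], label) for start, end, label in record["labels"]]


spans = labeled_spans(record)
print(spans)
```

Printing the spans like this is a quick way to confirm that each label covers the words you intended to annotate.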

In [4]:
import json

# Each line of the JSONL file is one labeled email exported from Doccano
labeled_data = []
with open("emails_labeled.jsonl", "r", encoding="utf-8") as read_file:
    for line in read_file:
        data = json.loads(line)
        labeled_data.append(data)

#### Convert the format to spaCy format

Next, let's convert the Doccano format to the spaCy format.

We will also drop the extra fields and rename `labels` to `entities`.

In [5]:
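TRAINING_DATA = []
for entry in labeled_data:
    # Doccano stores each label as [start_char, end_char, label]
    entities = [(start, end, label) for start, end, label in entry['labels']]
    # spaCy expects (text, {"entities": [(start, end, label), ...]})
    spacy_entry = (entry['text'], {"entities": entities})
    TRAINING_DATA.append(spacy_entry)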
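
Before training, it can help to sanity-check the converted data, since spans that overlap or fall outside the text may be rejected or ignored during training. The following is a minimal sketch, not part of the original notebook: the helper name `find_bad_spans` and the sample data are hypothetical, and the input is assumed to have the same shape as `TRAINING_DATA`.

```python
def find_bad_spans(training_data):
    """Return (example_index, reason, span) tuples for suspicious entity spans."""
    problems = []
    for i, (text, annotations) in enumerate(training_data):
        previous_end = 0
        # Sort by start offset so overlaps show up between neighbours
        for start, end, label in sorted(annotations["entities"]):
            if not (0 <= start < end <= len(text)):
                problems.append((i, "out of bounds", (start, end, label)))
            elif start < previous_end:
                problems.append((i, "overlaps previous span", (start, end, label)))
            previous_end = max(previous_end, end)
    return problems


# Hypothetical data in the same shape as TRAINING_DATA
sample = [
    ("crude oil prices rose", {"entities": [(0, 9, "OIL")]}),
    ("oil and petroleum", {"entities": [(0, 3, "OIL"), (2, 7, "OIL"), (8, 40, "OIL")]}),
]
problems = find_bad_spans(sample)
print(problems)
```

An empty result means no span was flagged by these two checks; it does not guarantee that the labels themselves are correct.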

### Train the model!

We use deep learning (a neural network) and set a dropout rate of 0.3 to reduce overfitting.

The idea is to use a neural network with many neurons arranged in several layers. We show it text that has already been labeled, so we know the correct answer. We run several iterations, and on each iteration we calculate the error using a *loss function*. The error drives an adjustment of the weights that determine when each neuron activates. As iterations pass, the network adjusts its weights to minimize the error, learning patterns that solve the problem.
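
The training loop described above can be illustrated with a toy example. This is not what spaCy does internally; it is a minimal sketch of gradient descent on a single weight with a squared-error loss, and the data, learning rate, and variable names are all illustrative assumptions.

```python
# Toy labeled data: the "correct answer" follows y = 2 * x
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

weight = 0.0          # the single parameter the toy "network" learns
learning_rate = 0.05  # how far the weight moves on each iteration

losses = []
for iteration in range(50):
    total_loss = 0.0
    gradient = 0.0
    for x, y in data:
        error = weight * x - y     # prediction minus the known answer
        total_loss += error ** 2   # squared-error loss
        gradient += 2 * error * x  # derivative of the loss w.r.t. the weight
    # Move the weight against the average gradient to reduce the loss
    weight -= learning_rate * gradient / len(data)
    losses.append(total_loss)

print(weight, losses[0], losses[-1])
```

The loss shrinks as the weight approaches 2, which mirrors the decreasing `ner` loss printed during training, only with one weight instead of many.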

To avoid overfitting, which means that the model "memorizes" the training data and does not perform well on new data, we randomly drop some neurons on each iteration so that the model generalizes better.
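
As a rough illustration of what "dropping neurons" means, the sketch below applies dropout to a list of activations. It uses the common "inverted dropout" variant, which rescales the surviving values so their expected sum stays the same. The function name and values are illustrative assumptions, and this is not spaCy's internal implementation.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() >= rate else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.3, rng=rng)
print(dropped)
```

With `rate=0.3`, roughly 30% of the values are zeroed on average, and a different random subset is dropped on every call.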

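Remember to install spaCy first. Note that the code in this notebook uses the spaCy v2 API (for example `nlp.create_pipe` and passing raw texts to `nlp.update`), so it may need adapting if you install spaCy v3 or later: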
```
pip install -U pip setuptools wheel
pip install -U spacy
python -m spacy download en_core_web_sm
```

In [7]:
import spacy
import random
import json

nlp = spacy.blank("en")
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
ner.add_label("OIL")

# Start the training
nlp.begin_training()

# Loop for 40 iterations
for itn in range(40):
    # Shuffle the training data
    random.shuffle(TRAINING_DATA)
    losses = {}

    # Batch the examples and iterate over them
    for batch in spacy.util.minibatch(TRAINING_DATA, size=2):
        texts = [text for text, entities in batch]
        annotations = [entities for text, entities in batch]

        # Update the model
        nlp.update(texts, annotations, losses=losses, drop=0.3)
    print(losses)

{'ner': 1984.1658156155754}
{'ner': 109.73926401266718}
{'ner': 138.12986747484507}
{'ner': 47.588300116433004}
{'ner': 2.0419946271741405}
{'ner': 3.7440482430942508}
{'ner': 2.012700471865181}
{'ner': 2.051291773717853}
{'ner': 0.0007268571710379556}
{'ner': 1.8225483010250167}
{'ner': 0.0028021886504119237}
{'ner': 0.0030751793809788803}
{'ner': 1.851928768298534e-06}
{'ner': 1.3330685327512708}
{'ner': 0.0025945839101430344}
{'ner': 0.0010146719511195867}
{'ner': 0.0001327180884835861}
{'ner': 0.01385305647296808}
{'ner': 0.00010323189421863507}
{'ner': 7.754888536541881e-09}
{'ner': 1.665324719642864e-10}
{'ner': 2.2762587873646342e-07}
{'ner': 1.0154709520117594e-08}
{'ner': 2.1325194765755028e-10}
{'ner': 2.7548444297135507e-09}
{'ner': 2.2013025419373184e-08}
{'ner': 6.319800208021556e-10}
{'ner': 6.278292340971489e-09}
{'ner': 2.2878875866347553e-11}
{'ner': 2.4229452172145967e-08}
{'ner': 1.2993179705777092e-10}
{'ner': 3.8404045014699504e-08}
{'ner': 2.593142834269671e-09}
...

You should see the loss decrease as the iterations go by. Note that it may sometimes increase from one iteration to the next, for example because of the dropout setting and the random shuffling of the training data.

#### Save the model to disk

In [9]:
nlp.to_disk("oil.model")

## Test the model

Let's test the model. For this we use displaCy, which displays the entities in the text.

In [31]:
from spacy import displacy
example = "service postings marathon petroleum co said it reduced the contract price it will pay for all grades of service oil one dlr a barrel effective today the decrease brings marathon s posted price for both west texas intermediate and west texas sour to dlrs a bbl the south louisiana sweet grade of service was reduced to dlrs a bbl the company last changed its service postings on jan reuter"

In [26]:
doc = nlp(example)
displacy.render(doc, style='ent')

### Conclusion

We have shown how to label data with **Doccano** and create a custom model with **spaCy**.

## Pattern Matching

The second approach is to use pattern matching to look for certain keywords and patterns in the text. 

**spaCy** provides matchers that make it easy to look for specific substrings, digits, etc. We can also define rules based on part-of-speech tags.

In [48]:
import spacy
# Import the Matcher
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(example)

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# Add the pattern to the matcher
matcher.add("OIL_PATTERN", None, [{"LOWER": "oil"}], [{"LOWER": "petroleum"}])

# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])

Matches: ['petroleum', 'oil']


The matcher also finds the right tokens, but in this case typos and similar words are not detected, because only the exact terms listed in the patterns are matched.
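
The limitation with typos can be reproduced without spaCy. The sketch below mimics the `LOWER` patterns with a plain lowercase token comparison, then shows one possible workaround using `difflib` from the standard library. The misspelled sentence, the helper names, and the 0.8 similarity cutoff are illustrative assumptions, and whitespace splitting is a much cruder tokenizer than spaCy's.

```python
import difflib

KEYWORDS = ["oil", "petroleum"]


def exact_matches(text):
    """Return tokens that equal a keyword after lowercasing."""
    return [tok for tok in text.split() if tok.lower() in KEYWORDS]


def fuzzy_matches(text, cutoff=0.8):
    """Return tokens whose difflib similarity to some keyword is at least `cutoff`."""
    return [
        tok
        for tok in text.split()
        if difflib.get_close_matches(tok.lower(), KEYWORDS, n=1, cutoff=cutoff)
    ]


# Hypothetical sentence with "petroleum" misspelled as "petrolium"
text = "marathon petrolium co cut the price of service oil"
exact = exact_matches(text)
fuzzy = fuzzy_matches(text)
print(exact)
print(fuzzy)
```

The exact comparison misses the misspelled token, while the fuzzy comparison recovers it at the cost of a threshold that has to be tuned to avoid false positives.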

## Conclusion

We have seen two different approaches to entity recognition.

**Pattern Matching** can be used in the following use cases:
- Low cardinality attributes
- Common patterns such as dates, quantities, numbers, etc.
- Patterns occurring in certain parts of speech
- When typos are not expected
- Structured data



**Statistical Models** are great at learning complex patterns in the data and can "guess" and categorize data never seen before. Use cases:
- High cardinality attributes
- You need to deal with typos (fuzzy matches)
- You need to categorize new, never-seen data
- Unstructured data

These models are more flexible because they can make decisions about inputs they were never trained on. They can detect new entities without any code change.
