# NER using spaCy

In [None]:
%pip install datasets
%pip install evaluate
%pip install spacy

In [None]:
%%bash
python -m spacy download en_core_web_lg
python -m spacy download en_core_web_trf

In [1]:
from utils import *

## Preprocessing

spaCy's pretrained models use slightly different entity labels than CoNLL. For example, spaCy uses `PERSON`, whereas CoNLL uses `B-PER` and `I-PER` for person entities. These labels must be normalized before spaCy's pretrained models can be evaluated on the CoNLL test data.

In addition, the data must be converted from the CoNLL format to spaCy's NER format.
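The label rewrite described above can be sketched on a few sample lines. This is only an illustration: the sample lines and the helper name `to_spacy_label` are assumptions, not part of the notebook's `utils` module, and only the person tag is mapped.

```python
import re

def to_spacy_label(line: str) -> str:
    """Rewrite CoNLL person tags (B-PER / I-PER) as B-PERSON / I-PERSON."""
    # The word boundaries keep an already converted B-PERSON from matching again.
    return re.sub(r"\b([BI])-PER\b", r"\g<1>-PERSON", line)

# Hypothetical lines in the CoNLL token-per-line layout: token, POS, chunk, NER tag.
sample = [
    "John NNP B-NP B-PER",
    "Smith NNP I-NP I-PER",
    "visited VBD B-VP O",
    "Paris NNP B-NP B-LOC",
]

converted = [to_spacy_label(line) for line in sample]
for line in converted:
    print(line)
```

Tags other than `PER` pass through unchanged, so any remaining label mismatch between the two schemes still shows up in the evaluation scores.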

In [6]:
# Preprocessing
test_data_path = os.path.join(conll_dir, "test.txt")
spacy_dir = os.path.join(conll_dir, "spacy_data")
os.makedirs(spacy_dir, exist_ok=True)

spacy_test_data_path = os.path.join(spacy_dir, "test_spacy.txt")
with open(test_data_path) as fi:
    test_data = [x.strip() for x in fi]

def process_line(line):
    """ Translate the CoNLL person tags (B-PER, I-PER) to the label used by spaCy (PERSON) """
    return re.sub(r"([BI])-(PER)", r"\g<1>-PERSON", line)

test_data = [process_line(line) for line in test_data]

with open(spacy_test_data_path, "w") as fo:
    fo.write("\n".join(test_data))


### Evaluate spaCy pretrained models on CoNLL test data

N.b. Both the large (`en_core_web_lg`) and transformer (`en_core_web_trf`) models are evaluated.

In [7]:
%%bash

cd ../data/external/archive

python -m spacy convert "spacy_data/test_spacy.txt" spacy_data -c ner

python -m spacy evaluate en_core_web_lg spacy_data/test_spacy.spacy > spacy_data/spacy_lg_results.txt
python -m spacy evaluate en_core_web_trf spacy_data/test_spacy.spacy > spacy_data/spacy_trf_results.txt


ℹ Auto-detected token-per-line NER format
ℹ Grouping every 1 sentences into a document.
⚠ To generate better training data, you may want to group sentences
into documents with `-n 10`.
✔ Generated output file (3684 documents):
spacy_data/test_spacy.spacy
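The result files contain entity-level precision, recall, and F-score. As a rough illustration of what those numbers mean, the sketch below scores predicted entity spans against gold spans by exact match on start, end, and label. It is a simplified stand-in, not spaCy's own scorer, and the spans and the function name `span_prf` are made up for the example.

```python
def span_prf(gold, pred):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical spans: one correct, one with the wrong label, one spurious.
gold_spans = {(0, 2, "PERSON"), (3, 4, "LOC")}
pred_spans = {(0, 2, "PERSON"), (3, 4, "ORG"), (5, 6, "ORG")}

precision, recall, f1 = span_prf(gold_spans, pred_spans)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

Under exact matching, a span with the right boundaries but the wrong label counts as both a false positive and a false negative, which is why unmapped label differences between the two schemes lower the scores.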


### Train a spaCy NER model

- Convert CoNLL data to spaCy NER format
- Train a spaCy NER model
- Evaluate the trained model on the test dataset
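The conversion step turns per-token BIO tags into entity spans. The sketch below shows that idea on a single hypothetical sentence. It is not the code behind `spacy convert`; the function name `bio_to_spans` and the example tags are assumptions made for illustration.

```python
def bio_to_spans(tags):
    """Collapse BIO tags into (start_token, end_token, label) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label continues the currently open entity.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open entity, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new entity; a stray I- is also treated as an opening tag.
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((start, len(tags), label))
    return spans

# Hypothetical tag sequence for a six-token sentence.
tags = ["B-PERSON", "I-PERSON", "O", "B-LOC", "B-LOC", "O"]
spans = bio_to_spans(tags)
print(spans)
```

Treating a stray `I-` tag as the start of an entity is a design choice made here so that data tagged without explicit `B-` prefixes still yields spans.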

In [None]:
%%bash
cd ../data/external/archive
python -m spacy convert "test.txt" spacy_data -c ner
python -m spacy convert "train.txt" spacy_data -c ner
python -m spacy convert "valid.txt" spacy_data -c ner

In [None]:
%%bash
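cd ../data/external/archive
python -m spacy init config config.cfg --lang en --pipeline ner  # spaCy v3 trains from a config file
python -m spacy train config.cfg --output model --paths.train spacy_data/train.spacy --paths.dev spacy_data/valid.spacy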