# NER Summary - Data Culture Group
This document summarizes the work completed from May 16th, 2022 to September 28th, 2022. It also collects a variety of resources on using spaCy, Prodigy, and the other tools that supported that work.

In [None]:
!pip install spacy
!spacy download en_core_web_lg

## I. Testing and Training Datasets
As of October 12th, 2022, we have a total of **3,790** annotations for training data, **780** annotations for testing data, and **624** "recall" annotations.

The corpus uses text sourced exclusively from Bon Appetit's articles, Instagram posts, and YouTube transcripts.

Training datasets (training.spacy) were generated using the Prodigy annotation tool over a randomized stream of documents from the corpus. More often than not, suggestions for annotated data were generated by the model in development, and then either kept or corrected. Annotated text has n-gram spans which indicate proper food entities in a sentence. Annotations are labelled "FOOD".

The evaluation dataset (eval-data.spacy) was generated using the same method as the training datasets.

The "recall" annotations dataset (recall-data.spacy) was generated in order to improve the model's ability to make guesses about rare or non-English food names. Articles featuring food entities from cuisines the model struggled to parse were hand-selected from the Bon Appetit corpus. These articles then entered the same Prodigy annotation pipeline as the datasets above.

In Prodigy, data is managed through SQL databases. The database takes as its default input and output a JSON Lines file (**.jsonl**).
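
To make the file format concrete, here is a minimal sketch of reading Prodigy-style JSON Lines with the standard library. The field names (text, spans, start, end, label) match what the conversion function later in this notebook reads; the two sample records and the helper names are made up for illustration.

```python
import io
import json

# Two made-up records in the shape of a Prodigy export: one JSON object per
# line, with character offsets marking each annotated span.
SAMPLE_JSONL = (
    '{"text": "Add the miso to the broth.", '
    '"spans": [{"start": 8, "end": 12, "label": "FOOD"}, '
    '{"start": 20, "end": 25, "label": "FOOD"}]}\n'
    '{"text": "Stir well.", "spans": []}\n'
)


def read_jsonl(handle):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


def span_texts(record):
    """Return the annotated substrings of a record, in order."""
    text = record["text"]
    return [text[s["start"]:s["end"]] for s in record.get("spans", [])]


records = list(read_jsonl(io.StringIO(SAMPLE_JSONL)))
for record in records:
    print(span_texts(record))
```

Swapping the in-memory sample for `open("training.jsonl", encoding="utf-8")` should work the same way on a real export, assuming the file follows this shape.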

## II. spaCy
### spaCy Model Configuration

In [2]:
!spacy init config -p parser,ner base_config.cfg


✘ The provided output file already exists. To force overwriting the
config file, set the --force or -F flag.



This command will create the default config file for the model. You do not need this step per se; I will attach two additional config files that follow most of the same parameters as the one above.

File one -- "base_config.cfg" -- uses a tok2vec layer along with the default spaCy pipeline (parser, ner). The tok2vec layer provides word embeddings for the named entity recognition process, and the NER component is configured to listen to the tok2vec layer when it makes decisions. This configuration does not require many resources to run, but it does not include the BERT transformers that Professor Bhargava has requested.

File two -- "trf_config.cfg" -- uses a transformer layer along with the default spaCy pipeline. We'll talk more about the use of transformers and BERT on this project later. The transformer and tok2vec pipes are *mutually exclusive*, meaning that the transformer layer replaces the traditional vector embedding approach in tok2vec. This is expected to improve the model's overall performance and help it make successful long-range guesses, but it requires a GPU in order to run properly, which Northeastern provides through Discovery. The transformer is the **RoBERTa-base model** (roberta-base), for which you can find documentation here: https://huggingface.co/roberta-base.

### spaCy Model Creation
The spaCy library has a train command in the CLI, which uses the config files from above. See the command below:

In [5]:
!spacy train base_config.cfg --paths.train ./training.spacy --paths.dev ./eval-data.spacy -o food_entity_recognizer

✔ Created output directory: food_entity_recognizer
ℹ Saving to output directory: food_entity_recognizer
ℹ Using CPU
[2022-10-14 15:37:57,924] [INFO] Set up nlp object from config
[2022-10-14 15:37:57,936] [INFO] Pipeline: ['tok2vec', 'parser', 'ner']
[2022-10-14 15:37:57,941] [INFO] Created vocabulary
[2022-10-14 15:37:57,941] [INFO] Finished initializing nlp object
[2022-10-14 15:38:08,855] [INFO] Initialized pipeline components: ['tok2vec', 'parser', 'ner']
✔ Initialized pipeline
ℹ Pipeline: ['tok2vec', 'parser', 'ner']
ℹ Initial learn rate: 0.001
E    #       LOSS TOK2VEC  LOSS PARSER  LOSS NER  DEP_UAS  DEP_LAS  SENTS_F  ENTS_F  ENTS_P  ENTS_R  SCORE 
---  ------  ------------  -----------  --------  -------  -------  -------  ------  ------  ------  ------
  0       0          0.00         0.00     40.91     0.00     0.00   100.00    0.00    0.00    0.00    0.00
  0     200         26.02       

Do note that training will save *two* models: the **best** one, which I believe is chosen according to the "score" column on the far right, and the **last** one, which is the model as of the final training step. These are saved to model-best and model-last inside the output directory (here, ./food_entity_recognizer/model-best and ./food_entity_recognizer/model-last).
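
As far as I understand it, the "score" column is a weighted combination of the other metrics, with the weights taken from the [training.score_weights] block of the config. Here is a minimal sketch of that arithmetic. The weights below are hypothetical placeholders (check your own config), and `combined_score` is an illustrative helper, not a spaCy function.

```python
def combined_score(metrics, weights):
    """Weighted sum of metrics; entries with no weight or no value are skipped."""
    total = 0.0
    for name, weight in weights.items():
        value = metrics.get(name)
        if weight and value is not None:
            total += weight * value
    return total


# Hypothetical weights for illustration only; the real ones live under
# [training.score_weights] in the config file.
weights = {"dep_uas": 0.25, "dep_las": 0.25, "sents_f": 0.0, "ents_f": 0.5}

# First row of the training table above, rescaled from percentages to 0-1.
step_zero = {"dep_uas": 0.0, "dep_las": 0.0, "sents_f": 1.0, "ents_f": 0.0}
print(combined_score(step_zero, weights))
```

Under these assumed weights, a perfect SENTS_F contributes nothing, which would be consistent with the step-zero row showing SENTS_F at 100.00 and SCORE at 0.00.
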
### spaCy Model Evaluation

spaCy offers users a basic command for evaluation. "spacy evaluate" takes the path to a trained model and the path to evaluation data in .spacy format.

In [12]:
!spacy evaluate ./food_entity_recognizer/model-last ./eval-data.spacy

ℹ Using CPU

TOK      100.00
UAS      -     
LAS      -     
NER P    61.34 
NER R    30.29 
NER F    40.56 
SENT P   -     
SENT R   -     
SENT F   -     
SPEED    17249 


           P       R       F
FOOD   61.34   30.29   40.56



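The model is clearly not performing well: it finds fewer than a third of the annotated food entities (recall 30.29), and about 61% of the entities it does predict are correct (precision 61.34). I'll talk about why this could be towards the end.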
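
For reference, the NER F value is the harmonic mean of precision and recall, so it sits much closer to the lower of the two. Here is a minimal sketch of the arithmetic using the numbers from the output above; `f_score` is an illustrative helper, not part of spaCy.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the evaluation output above (percentages).
ner_p, ner_r = 61.34, 30.29
print(round(f_score(ner_p, ner_r), 2))
```

The result lands within a few hundredths of the reported 40.56; the small gap comes from the printed P and R already being rounded.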

## III. Prodigy

### Initialization:

Prodigy is paid software, so a license key is needed to install it. The command below uses the license key that belongs to Mediacloud.
Do note that this key allows only a limited number of downloads for each released Prodigy version. I don't believe executing the command below counts as a "download" under this limit, which is useful to know.

In [None]:
!pip install prodigy -f https://E9D3-7A82-009E-D41C@download.prodi.gy

### Prodigy Model Creation

Prodigy does not necessarily require a config file in order to function. You can include a config file using -c. 
In order to train the model in Prodigy, you need to use the JSONL version of the datasets, which is also included. These JSONL datasets are to be plugged into SQL, so they need to be both downloaded and imported into the Prodigy SQL database.

In [6]:
!prodigy db-in training training.jsonl
!prodigy db-in recall-data recall-data.jsonl
!prodigy db-in eval-data eval-data.jsonl

ModuleNotFoundError: No module named 'prodigy'

Once datasets have been imported, you can use the following command to generate a model:

In [7]:
!prodigy train -n training,recall-data food_entity_recognizer

zsh:1: command not found: prodigy


Do note that the command above does *not* pass Prodigy a separate evaluation dataset, so Prodigy just partitions off 20% of the training data for evaluation. If you want, you can add the evaluation dataset to the training datasets, but there is little reason to do so.
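
To picture what a 20% hold-out means in practice, here is a minimal sketch of a shuffled split. It illustrates the idea only and is not Prodigy's internal implementation; the function name, the seed, and the stand-in data are all assumptions.

```python
import random


def holdout_split(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and split off a fraction for evaluation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    # The first n_eval shuffled items are held out; the rest are used for training.
    return shuffled[n_eval:], shuffled[:n_eval]


# Stand-in for 1,000 annotated examples.
examples = list(range(1000))
train, held_out = holdout_split(examples)
print(len(train), len(held_out))
```

Because the held-out examples are drawn from the same annotations as the training data, scores from this kind of split are not directly comparable to scores on the separate eval-data set.
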
### Annotation using Prodigy

Prodigy mainly serves as an annotation tool to supplement those using spaCy models. Here is how to use Prodigy to generate more annotations to work with.

First, a source document is needed as a stream. I have been working with the file dubbed "compiled_jsons.json", which contains shuffled, un-annotated text sourced from Bon Appetit's YouTube video transcripts, website articles, and Instagram captions. Somewhat barbarically, I updated this document by hand: once a cycle of annotations had been done, I went into the document and removed the text sources that had been annotated. They are placed in a different file -- "archived_jsons.json".

Second, for certain types of annotation methods, a pre-existing model must exist to make initial guesses at food entities to correct. This is not universally required, but can speed up data collection if so desired.

Finally, a dataset *within* Prodigy's SQL database is necessary. If it does not already exist, Prodigy will make the dataset with the requested name, but do be sure to collect these annotations from the SQL database if they are intended for use with spaCy training or annotation.
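
The manual bookkeeping for compiled_jsons.json described above could be scripted. Here is a minimal sketch, assuming both files hold a JSON array of objects with a "text" field (I have not confirmed that this is the exact layout) and that you have a list of the texts that were annotated, for example from a `prodigy db-out` export. The function name is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def archive_annotated(source_path, archive_path, annotated_texts):
    """Move every record whose text was annotated from source to archive."""
    source_path, archive_path = Path(source_path), Path(archive_path)
    pending = json.loads(source_path.read_text(encoding="utf-8"))
    archived = (
        json.loads(archive_path.read_text(encoding="utf-8"))
        if archive_path.exists()
        else []
    )
    done = set(annotated_texts)
    still_pending = [rec for rec in pending if rec["text"] not in done]
    archived.extend(rec for rec in pending if rec["text"] in done)
    source_path.write_text(json.dumps(still_pending, indent=2), encoding="utf-8")
    archive_path.write_text(json.dumps(archived, indent=2), encoding="utf-8")
    return len(pending) - len(still_pending)


# Demo on throwaway files so nothing real is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "compiled_jsons.json"
    archive = Path(tmp) / "archived_jsons.json"
    source.write_text(
        json.dumps([{"text": "a"}, {"text": "b"}, {"text": "c"}]), encoding="utf-8"
    )
    moved = archive_annotated(source, archive, ["a", "c"])
    remaining = json.loads(source.read_text(encoding="utf-8"))
    archived_now = json.loads(archive.read_text(encoding="utf-8"))
    print(moved, remaining, archived_now)
```

Matching on the exact text is the simplest choice here; if the source records carry a stable ID, matching on that would be more robust.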

#### Method One: ner.manual

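No trained model required! This annotation command will provide you **raw text** to annotate, *without* suggestions from a model. Prodigy still needs a spaCy pipeline to tokenize the text, so the example passes blank:en (a blank English pipeline) in the model position.
Example:

In [None]:
!prodigy ner.manual new_annotations blank:en ./compiled_jsons.json -l="FOOD"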

#### Method Two: ner.correct

This annotation command provides you the model's **best guess** at annotations in the document. Your job is to correct these annotations, which yields clean data. Depending on the quality of the model used, this can significantly reduce the time spent annotating. Example:

In [13]:
!prodigy ner.correct new_annotations ./food_entity_recognizer/model-best ./compiled_jsons.json -l="FOOD"

zsh:1: command not found: prodigy


#### Final note: .jsonl to .spacy conversion:

I recommend that data be kept in .jsonl format, as .spacy carries less metadata and annotation information because it only retains the annotations. Here is a function that converts .jsonl data to .spacy data:

In [1]:
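import json

import spacy
from spacy.tokens import DocBin
from spacy.training import Example


def jsonl_spacy(filename):
    """Convert <filename>.jsonl (a Prodigy export) into <filename>.spacy."""
    nlp = spacy.load('en_core_web_lg')
    doc_bin = DocBin()
    with open(filename + ".jsonl", 'r', encoding='utf-8') as fl:
        for line in fl:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            text = record['text']
            # Every span is labelled FOOD; records without spans yield no entities.
            spans = [(s['start'], s['end'], 'FOOD') for s in record.get('spans', [])]
            # Only tokenization is needed; the reference doc carries the gold entities.
            example = Example.from_dict(nlp.make_doc(text), {'entities': spans})
            doc_bin.add(example.reference)
    doc_bin.to_disk(filename + ".spacy")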
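
Before converting, it can help to screen the annotations for spans that spaCy is likely to reject: entity spans may not overlap, and the character offsets have to fall inside the text. Here is a minimal standard-library sketch. The helper name and sample record are made up, and it does not check that offsets line up with token boundaries, which needs the tokenizer.

```python
def find_bad_spans(record):
    """Return spans that are empty, out of range, or overlap an earlier span."""
    text_len = len(record["text"])
    bad = []
    last_end = 0
    ordered = sorted(record.get("spans", []), key=lambda s: (s["start"], s["end"]))
    for span in ordered:
        start, end = span["start"], span["end"]
        if start < 0 or end > text_len or start >= end:
            bad.append(span)  # empty or outside the text
        elif start < last_end:
            bad.append(span)  # overlaps a span that was already accepted
        else:
            last_end = end
    return bad


# Made-up record with two good spans and two problematic ones.
sample = {
    "text": "miso soup and rice",
    "spans": [
        {"start": 0, "end": 4},
        {"start": 0, "end": 9},
        {"start": 14, "end": 18},
        {"start": 16, "end": 30},
    ],
}
print(find_bad_spans(sample))
```

When two spans overlap, this sketch keeps whichever sorts first and flags the other; which one is actually correct still needs a human look.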

## IV. BERT and Transformers

We implemented BERT and BERT-like transformer pipes as components preceding named entity recognition. BERT (Bidirectional Encoder Representations from Transformers) is pre-trained on two objectives, masked language modeling and next-sentence prediction, so that it learns to assign a context-dependent vector representation to each token. BERT also uses its own tokenizer, which segments words into wordpieces so that rare or unseen words can still be represented. These pieces are aligned back to spaCy's tokens before the classification layer that takes the transformer output makes its NER predictions. In order to use the BERT model, you can initialize training through the spaCy config provided ("trf_config.cfg"), so long as you ensure that a GPU is available to use.
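
To illustrate the wordpiece idea, here is a toy sketch of greedy longest-match-first segmentation, plus the bookkeeping needed to map pieces back to the words they came from. The vocabulary is a hypothetical four-entry toy (real BERT vocabularies have roughly 30,000 entries), and this is not the tokenizer code the pipeline actually runs. Note also that roberta-base uses a byte-level BPE vocabulary instead of WordPiece, though the split-then-realign idea is the same.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into wordpieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a word-internal piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


# Hypothetical toy vocabulary for illustration.
VOCAB = {"go", "##chu", "##jang", "sauce"}

words = ["gochujang", "sauce"]
pieces, owner = [], []
for index, word in enumerate(words):
    for piece in wordpiece(word, VOCAB):
        pieces.append(piece)
        owner.append(index)  # index of the original word for each piece
print(pieces)
print(owner)
```

The `owner` list is the alignment: it records which word each piece belongs to, so per-piece vectors can be pooled back into one vector per word before NER.
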
#### Using spaCy

In [3]:
!spacy train -o bert_entity_recognizer -g 0 trf_config.cfg

Usage: python -m spacy train [OPTIONS] CONFIG_PATH
Try 'python -m spacy train --help' for help.

Error: Invalid value for 'CONFIG_PATH': Path './trf_config.cfg' does not exist.


#### Using Prodigy

In [4]:
!prodigy train -c trf_config.cfg -n training,recall-data -g 0 bert_entity_recognizer

zsh:1: command not found: prodigy
