# Named Entity Recognition Testing

Previously, we used heuristic methods to identify terminology, but we can do better with named entity recognition (NER). We will use Stanza, which Kühnel and Fluck (2022) identify as the best off-the-shelf tool for biomedical NER.

In [64]:
import os
from tqdm import tqdm
dir_path = os.getcwd()

In [65]:
#Open our test files first. Context managers close the files even if reading fails.
with open(os.path.join(dir_path, "../../wmt22test.txt"), "r", encoding="utf8") as f:
    en_sent = [line.strip() for line in f]
with open(os.path.join(dir_path, "../../wmt22gold.txt"), "r", encoding="utf8") as f:
    fr_sent = [line.strip() for line in f]

In [66]:
#Tokenise and PoS-tag the FR sentences first, using the model trained on the Sequoia treebank. We don't need any other processors.
import stanza

stanza.download('fr', processors='tokenize, mwt, pos', package='sequoia')
nlp_fr = stanza.Pipeline('fr', processors='tokenize, mwt, pos', package='sequoia')

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json:   0%|   …

2023-07-14 21:34:36 INFO: Downloading these customized packages for language: fr (French)...
| Processor       | Package  |
------------------------------
| tokenize        | sequoia  |
| mwt             | sequoia  |
| pos             | sequoia  |
| pretrain        | conll17  |
| backward_charlm | newswiki |
| forward_charlm  | newswiki |

2023-07-14 21:34:36 INFO: File exists: C:\Users\ethan\stanza_resources\fr\tokenize\sequoia.pt
2023-07-14 21:34:36 INFO: File exists: C:\Users\ethan\stanza_resources\fr\mwt\sequoia.pt
2023-07-14 21:34:36 INFO: File exists: C:\Users\ethan\stanza_resources\fr\pos\sequoia.pt
2023-07-14 21:34:36 INFO: File exists: C:\Users\ethan\stanza_resources\fr\pretrain\conll17.pt
2023-07-14 21:34:36 INFO: File exists: C:\Users\ethan\stanza_resources\fr\backward_charlm\newswiki.pt
2023-07-14 21:34:36 INFO: File exists: C:\Users\ethan\stanza_resources\fr\forward_charlm\newswiki.pt
2023-07-14 21:34:36 INFO: Finished downloading models and saved to C:\Users\ethan\stanza_

2023-07-14 21:34:37 INFO: Loading these models for language: fr (French):
| Processor | Package |
-----------------------
| tokenize  | sequoia |
| mwt       | sequoia |
| pos       | sequoia |

2023-07-14 21:34:37 INFO: Using device: cpu
2023-07-14 21:34:37 INFO: Loading: tokenize
2023-07-14 21:34:37 INFO: Loading: mwt
2023-07-14 21:34:37 INFO: Loading: pos
2023-07-14 21:34:37 INFO: Done loading processors!


In [67]:
#Extract only the information we need - PoS tags and text.
fr_tagged = []
for sentence in tqdm(fr_sent):
    doc = nlp_fr(sentence)
    tokens = [{"text" : word.text, "upos" : word.upos} for sent in doc.sentences for word in sent.words]
    fr_tagged.append(tokens)

100%|██████████| 588/588 [04:34<00:00,  2.14it/s]


In [78]:
#Load the NER term list (tab-separated, no header: sentence ID, then term) into a pandas dataframe. We won't remove HTML tags, because they also appear in the source language set.
import pandas as pd
term_list = pd.read_csv("wmt22_ner_terms.txt", sep = "\t", header=None, names = ["sent_ID", "term"])
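
The `read_csv` call above implies a headerless, two-column, tab-separated file. A minimal standard-library sketch of that layout follows; the two sample rows are hypothetical and are not taken from `wmt22_ner_terms.txt`. Because `sent_ID` is later used to index `fr_tagged` directly, the sketch assumes it is a 0-based integer.

```python
import csv
import io

# Hypothetical sample in the assumed layout: sent_ID <TAB> term, no header row.
sample = "0\tvirus\n3\tADN\n"

# Parse each line into the same two fields the dataframe columns are named after.
rows = [
    {"sent_ID": int(sent_id), "term": term}
    for sent_id, term in csv.reader(io.StringIO(sample), delimiter="\t")
]

for row in rows:
    print(row["sent_ID"], row["term"])
```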

In [81]:
#Count how often each term occurs in its sentence. Matching is exact and case-sensitive here; an uncased list is considered separately.
#Each term is compared against single token texts, so a term spanning several tokens will not be counted.
def find_count_in_sentence_exact_match(row):
    query = row["term"]
    return sum(1 for found in fr_tagged[row["sent_ID"]] if found["text"] == query)
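
To make the matching behaviour concrete, here is a small sketch on hand-written data. `toy_tagged` and `count_exact` are hypothetical names used only for illustration, and the tokens and tags were typed by hand rather than produced by Stanza. It shows that the comparison is case-sensitive and works token by token.

```python
# Toy stand-in for fr_tagged: one sentence as a list of {"text", "upos"} dicts.
toy_tagged = [[
    {"text": "Le", "upos": "DET"},
    {"text": "virus", "upos": "NOUN"},
    {"text": "Ebola", "upos": "PROPN"},
    {"text": "et", "upos": "CCONJ"},
    {"text": "le", "upos": "DET"},
    {"text": "virus", "upos": "NOUN"},
    {"text": "Zika", "upos": "PROPN"},
]]

def count_exact(tagged, sent_id, term):
    """Count tokens in sentence sent_id whose text equals term exactly."""
    return sum(1 for token in tagged[sent_id] if token["text"] == term)

print(count_exact(toy_tagged, 0, "virus"))        # 2
print(count_exact(toy_tagged, 0, "le"))           # 1, because "Le" differs in casing
print(count_exact(toy_tagged, 0, "virus Ebola"))  # 0, because no single token equals the phrase
```

The last call illustrates the main limitation: a term made of several tokens never equals a single token's text, so its count is zero.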

In [82]:
term_list["count"] = term_list.apply(find_count_in_sentence_exact_match, axis=1)

In [83]:
term_list = term_list.drop_duplicates().reset_index(drop=True)

In [87]:
term_list.to_csv("wmt22_ner_terms_counts.txt", sep = "\t", header = False, index = False) 

In [None]:
#Next, we will generate a separate list without casing, and aggregate counts as usual. Interestingly, the uncased list has the same number of rows as the cased one,
#which indicates that no term appears more than once within a single sentence with different casing. This means that we can stop here for now.
#term_list_uncased = term_list[["sent_ID", "term"]].copy()
#term_list_uncased["uncased_term"] = term_list_uncased["term"].str.lower()
#term_list_uncased = term_list_uncased.drop(columns = "term")
#term_list_uncased = term_list_uncased.drop_duplicates().reset_index(drop=True)
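
The row-count comparison described in the comments can be sketched in plain Python. `has_case_variants` is a hypothetical helper and the `(sent_ID, term)` pairs are made up; the idea is only that lowercasing merges rows exactly when one sentence contains the same term in more than one casing.

```python
def has_case_variants(pairs):
    """Return True if lowercasing merges any distinct (sent_ID, term) pairs."""
    cased = set(pairs)
    uncased = {(sent_id, term.lower()) for sent_id, term in cased}
    # Fewer uncased pairs means at least two cased pairs collapsed into one.
    return len(uncased) < len(cased)

# Same term with different casing, but in different sentences: nothing merges.
print(has_case_variants([(0, "ADN"), (0, "virus"), (1, "Virus")]))

# Same term with different casing inside one sentence: the rows merge.
print(has_case_variants([(0, "virus"), (0, "Virus")]))
```

When the check is false, as the notebook reports for this data, the cased and uncased lists have the same number of rows.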