In this Jupyter notebook, we explore using spaCy's PhraseMatcher together with RapidFuzz to identify terms from a term list in a given text. spaCy is a popular open-source library for advanced Natural Language Processing (NLP), while RapidFuzz is a fast fuzzy string matching library for Python.

Quickly and accurately identifying terms from a list in a text is an important step in NLP applications such as document classification, sentiment analysis, and named entity recognition. spaCy's PhraseMatcher is designed to match large term lists efficiently, and RapidFuzz adds a similarity score for each candidate, which together simplify the task and reduce the computation time needed.

Throughout this notebook, we walk through the steps to create a PhraseMatcher object, use RapidFuzz to compare the extracted phrases with a list of terms, and finally visualize the results. By the end, you will have a better understanding of how to use spaCy and RapidFuzz together to efficiently identify terms in a given text.

In [23]:
# The term list built below should look like this: [{"term": "term1", "label": "label1", "id": "id1"}, {"term": "term2", "label": "label2", "id": "id2"}, ... {"term": "term30000", "label": "label30000", "id": "id30000"}]

import os
import pandas as pd
from multiprocessing import Pool
import spacy
from spacy.matcher import PhraseMatcher
from rapidfuzz import fuzz


list_of_dictionaries = [dic for dic in os.listdir("DEBBIE_dictionaries_annotations-main/dictionaries") if not dic.startswith(".")]  # skip hidden files

In [28]:
terms = []
for dictionary in list_of_dictionaries:
    try:
        # each dictionary is read as a headerless TSV with the columns: term, "LABEL=<label>", "ID=<id>"
        data = pd.read_csv("DEBBIE_dictionaries_annotations-main/dictionaries/" + dictionary, sep="\t", usecols=[0, 1, 2], names=["term", "label", "id"], header=None)
        data = data.dropna()
        data = data.to_dict(orient="records")
        for d in data:
            d["label"] = d["label"].replace("LABEL=", "")  # remove the "LABEL=" string from the "label" key
            d["id"] = d["id"].replace("ID=", "")  # remove the "ID=" string from the "id" key
        terms += data
    except Exception as err:
        # report the dictionary that could not be parsed and continue with the next one
        print(dictionary, err)

In [30]:
terms[54]

{'term': 'CCL1', 'label': 'Cell', 'id': 'id.200906009267275401'}
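
To make the expected input concrete, here is a minimal sketch of how one dictionary row becomes a term record. The sample row is an assumption, reconstructed from the record shown above and from the parsing code; the real files may contain further columns. `parse_row` is a hypothetical helper and is not used elsewhere in the notebook.

```python
import csv
import io

# Assumed layout of one dictionary row: term <TAB> LABEL=<label> <TAB> ID=<id>
sample_tsv = "CCL1\tLABEL=Cell\tID=id.200906009267275401\n"

def parse_row(row):
    """Turn one TSV row into a term record, dropping the LABEL= and ID= prefixes."""
    term, label, term_id = row[:3]
    return {
        "term": term,
        "label": label.replace("LABEL=", "", 1),
        "id": term_id.replace("ID=", "", 1),
    }

records = [parse_row(row) for row in csv.reader(io.StringIO(sample_tsv), delimiter="\t")]
print(records[0])
```

Only the first occurrence of each prefix is removed here, so a term id that happens to contain "ID=" further along stays intact.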

In [None]:
def create_patterns(term):
    # PhraseMatcher patterns must be Doc objects, not token dictionaries;
    # make_doc only runs the tokenizer, which is enough for matching on LOWER
    try:
        return nlp.make_doc(str(term["term"]))
    except Exception as err:
        print("error with term: " + str(term), err)

In [None]:
nlp = spacy.load("en_core_web_sm") # load the small English model

# initialize the PhraseMatcher object; attr="LOWER" makes the matching case-insensitive
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
#terms = [{"term": "term1", "label": "label1", "id": "id1"}, ... ] # list of dictionaries containing terms to match (30000 terms)

# create one pattern per term and remember which terms share the same lowercased tokens,
# so that every match can be mapped back to its term ID and label
patterns = []
term_lookup = {}
for term in terms:
    pattern = create_patterns(term)  # tokenizing is cheap, so no worker pool is used here
    if pattern is None or len(pattern) == 0:
        continue
    patterns.append(pattern)
    key = " ".join(token.lower_ for token in pattern)
    term_lookup.setdefault(key, []).append(term)
matcher.add("TERMS", patterns)

# example text to match against
text = "Here is an example text that contains some of the terms we want to match."

doc = nlp(text) # create a Doc object from the example text
matches = []
for match_id, start, end in matcher(doc):
    key = " ".join(token.lower_ for token in doc[start:end])
    match_text = doc[start:end].text.lower()
    for term in term_lookup.get(key, []):
        # the PhraseMatcher only returns exact token matches, so this score is normally 100;
        # the threshold is kept as a safeguard
        if fuzz.partial_ratio(match_text, str(term["term"]).lower()) > 80:
            matches.append((match_id, start, end, {"ID": term["id"], "LABEL": term["label"]}))

# print the matches
for match_id, start, end, metadata in matches:
    print(f"Matched term: {doc[start:end].text}")
    print(f"ID: {metadata['ID']}")
    print(f"Label: {metadata['LABEL']}")
    print()
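
The PhraseMatcher only reports exact token sequences, so spelling variants of a term need a separate candidate step before a similarity score can be applied. The following is a minimal sketch of one way to do that: slide a window of the term's length over the tokens and score each window. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for RapidFuzz so that it runs without extra packages; its scores are on a 0 to 1 scale rather than 0 to 100. The helper `fuzzy_find`, the threshold and the sample sentence are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def fuzzy_find(tokens, term, threshold=0.8):
    """Return (start, end, score) for every token window that resembles `term`."""
    n_words = len(term.split())
    hits = []
    for start in range(len(tokens) - n_words + 1):
        # compare the lowercased window with the lowercased term
        window = " ".join(tokens[start:start + n_words]).lower()
        score = SequenceMatcher(None, window, term.lower()).ratio()
        if score >= threshold:
            hits.append((start, start + n_words, round(score, 2)))
    return hits

# "fibres" is a spelling variant of the dictionary term "glass fibers"
tokens = "Bioactive glass fibres bond to bone tissue".split()
hits = fuzzy_find(tokens, "glass fibers")
for start, end, score in hits:
    print(tokens[start:end], score)
```

Scanning every window for every term grows with the number of terms times the length of the text, so for a list of 30000 terms this would need a cheaper pre-filter first.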


In [None]:
terms[0]

In [None]:
patterns_2 = []

for n in terms:
    # tokenize the term with spaCy's own tokenizer so that the pattern lines up
    # with the tokens of the processed text (a plain split(" ") keeps punctuation attached)
    pattern = [{"LOWER": token.lower_} for token in nlp.make_doc(str(n["term"]))]
    patterns_2.append({"id": n["id"], "label": n["label"], "pattern": pattern})
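
A detail worth checking when building token patterns is how each term is split into tokens. The sketch below uses a plain whitespace split and shows that punctuation stays attached to the neighbouring word. spaCy's English tokenizer typically separates a slash between letters into its own token, so a pattern built this way may not line up with the tokens of a processed text. `to_ruler_pattern` is a hypothetical helper, and the record (including its label and id) is made up for illustration.

```python
def to_ruler_pattern(term):
    """Build one entity ruler entry with one {"LOWER": ...} dict per whitespace-separated word."""
    return {
        "id": term["id"],
        "label": term["label"],
        "pattern": [{"LOWER": word.lower()} for word in term["term"].split()],
    }

# made-up record; the phrase itself appears in the sample abstract used in this notebook
example = {"term": "glass fiber/polysulfone", "label": "Biomaterial", "id": "id.0"}
entry = to_ruler_pattern(example)
print(entry["pattern"])
```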



In [None]:
ruler = nlp.add_pipe("entity_ruler",validate=True)


In [None]:
txt="""1998 Jan
Bioactive glass fiber/polymeric composites bond to bone tissue. 
Bioactive glass fibers were investigated for use as a fixation vehicle between a low modulus, polymeric composite and bone tissue. In an initial pilot study, bioactive glass fiber/polysulfone composites and all polysulfone control rods were implanted into the rabbit tibia; the study was subsequently expanded with implantation into the rabbit femur. Bone tissue exhibited direct contact with the glass fibers and adjacent polymer matrix and displayed a mechanical bond between the composite and bone tissue after six weeks implantation. Interfacial bond strengths after six weeks implantation averaged 12.4 MPa, significantly higher than those of the all polymer controls. Failure sites for the composite at six weeks generally occurred in the bone tissue or composite, whereas the failure site for the polymer implants occurred exclusively at the implant/tissue interface. The bioactive glass fiber/polysulfone composite achieved fixation to bone tissue through a triple mechanism: a bond to the bioactive glass fiber, mechanical interlocking between the tissue and glass fibers, and close apposition and possible chemical bond between the portions of the polymer and bone tissue. This last mechanism resulted from an overspill of bioactivity reactions from the fibers onto the surface of the surrounding polymer which we call the "halo" effect. """

In [None]:
ruler.add_patterns(patterns_2)

In [None]:
nlp.remove_pipe("ner")  # remove the statistical NER first so that doc.ents only holds the dictionary matches

In [None]:
doc = nlp(txt)

In [None]:
for ent in doc.ents:
    print(ent.label_,ent.text)

In [None]:
for d in terms:
    # report records whose term is empty or only whitespace
    if not str(d["term"]).strip():
        print("Empty value found for 'term' key in dictionary:", d)

In [None]:
dirs = [n for n in os.listdir("/jupyter/Miguel/prodigy_brat_files") if not n.startswith(".")]

In [None]:
for direc in dirs:
    path = os.path.join("/jupyter/Miguel/prodigy_brat_files", direc)
    txt_files = [n for n in os.listdir(path) if n.endswith(".txt")]
    for txt_file in txt_files:
        name = os.path.splitext(txt_file)[0]  # keeps file names that contain extra dots intact
        with open(os.path.join(path, txt_file), "r") as infile:
            text = infile.read()
        doc2 = nlp(text)
        T_id = 1   # brat annotation line id

        # Go through the predicted entities and write one brat text-bound annotation per entity
        if doc2.ents:
            # append mode: re-running this cell adds the same lines to the file again
            with open("/jupyter/Miguel/DEBBIE/" + name + ".txt", "a") as outfile:
                for ent in doc2.ents:
                    outfile.write('T{}\t{} {} {}\t{}\n'.format(T_id, ent.label_, ent.start_char, ent.end_char, ent.text))
                    T_id += 1
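
The lines written above follow the layout of brat text-bound annotations: an id, a tab, the label with start and end character offsets, a tab, and the covered text. As a minimal sketch, the round trip below formats one such line and parses it back, which is a quick way to check that the offsets really point at the annotated text. The helper names, the sentence and the label "Tissue" are assumptions made for illustration.

```python
def format_brat_line(t_id, label, start, end, text):
    """Format one text-bound annotation: id <TAB> label start end <TAB> covered text."""
    return "T{}\t{} {} {}\t{}".format(t_id, label, start, end, text)

def parse_brat_line(line):
    """Split a text-bound annotation line back into its fields."""
    ann_id, span, text = line.rstrip("\n").split("\t")
    label, start, end = span.split(" ")
    return ann_id, label, int(start), int(end), text

source = "Bioactive glass fibers bond to bone tissue."
start = source.index("bone tissue")
end = start + len("bone tissue")

line = format_brat_line(1, "Tissue", start, end, source[start:end])
ann_id, label, parsed_start, parsed_end, covered = parse_brat_line(line)

# the offsets must reproduce the covered text when applied to the source
print(line)
print(source[parsed_start:parsed_end] == covered)
```

This sketch assumes a contiguous span on a single line; an entity text that contains a newline or a tab would need extra handling.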