# Objective

Many agronomic terms appear in natural language in multiple forms, e.g.:
* "The awns are rough", "It has rough awns", or "It is rough-awned". In all these cases, the plant part (PLAN), awn, is modified by an adjective, rough. The combination, "rough" + "awn" is a trait (TRAT).
* "early maturing", "matures early". In these cases, a trait (TRAT), 'maturing' is modified by an adjective, early. This combination "early" + "maturing" is a compound trait (TRAT).

In this notebook, we will run a small section of text against a trained NLP model, read the predictions, identify compound traits based on the above rules, and output modified named entities in JSON format that include the compound traits.
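
As a rough illustration of the JSON output described above, here is a minimal sketch. The field names (`text`, `entities`, `start`, `end`, `label`) and the helper `entities_to_json` are assumptions made for this example, not a format defined by the notebook. The input tuples mirror the `(ent.text, ent.start_char, ent.end_char, ent.label_)` values printed in later cells.

```python
import json


def entities_to_json(text, entities):
    """Serialize (text, start_char, end_char, label) tuples to a JSON string.

    Hypothetical helper: the output layout is an assumption, not a fixed schema.
    """
    records = [
        {"text": ent_text, "start": start, "end": end, "label": label}
        for ent_text, start, end, label in entities
    ]
    return json.dumps({"text": text, "entities": records}, indent=2)


sample_text = "It has rough awns."
# Hand-written entity tuple standing in for model output
sample_entities = [("rough awns", 7, 17, "TRAT")]
print(entities_to_json(sample_text, sample_entities))
```

Keeping character offsets alongside the label lets a consumer recover the original phrase from the text even if the entity text is later normalized.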

In [None]:
import spacy


# Do a quick training test

In [None]:
output_dir="agdata"
from src.trainNER import *

# Depending on the nature of our training dataset, we might get warnings that the 
# data is not well formatted. Ignore those warnings for now. 
import warnings
warnings.filterwarnings('ignore')

# If you are running the notebook for the first time, and you do not have 
# an already custom trained NER recognition model, you will need to uncomment 
# the lines below and first train the model

#n_iter = 10
#trainModel(None,output_dir,n_iter)

# NLP parse some sample text

In [None]:
import PyPDF2
from spacy import displacy


nlp = spacy.load(output_dir)

test_text = '''Kold is a six-rowed winter feed barley obtained from the cross Triumph/Victor. It was released by the Oregon AES in 1993. It has rough awns and the aleurone is white. It has low lodging, matures early and its yield is low. Crop Science 25:1123 (1985).'''

doc = nlp(test_text)

colors = {'ALAS':'BlueViolet','CROP': 'Aqua','CVAR':'Chartreuse','PATH':'red','PED':'orange','PLAN':'pink','PPTD':'brown','TRAT':'yellow'}
cust_options = {'ents': ['ALAS','CROP','CVAR','PATH','PED','PLAN','PPTD','TRAT'], 'colors':colors}

displacy.render(doc, style='ent', jupyter=True, options=cust_options)


# Identify compound traits ADJ + PLAN = TRAT
## first flag entities and POS

In [None]:
print("Entities:")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    
# Comment out the following due to lack of statistical training for NERModel
# print("\nNoun Chunks:")
# for chunk in doc.noun_chunks:
#     print(chunk.text, chunk.root.text, chunk.root.dep_,
#             chunk.root.head.text)
    
print("\nParts of Speech:")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_)

## Next, identify clauses fitting ADJ + PLAN = TRAT

In [None]:
from spacy.tokens import Span

def compound_trait_entities(doc):
    new_ents = []
    for ent in doc.ents:
        # Only check for PLAN entities with a preceding one-token adjective
        # (e.g., 'rough awns')
        if ent.label_ == "PLAN" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            print('DEBUG: ', ent.text, ent.start, ent.label_, prev_token.text, prev_token.pos_, prev_token.dep_)
            if prev_token.pos_ == 'ADJ' and prev_token.dep_ == 'amod':
                new_ent = Span(doc, ent.start - 1, ent.end, label='TRAT')
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

new_doc = compound_trait_entities(doc)
print(new_doc)
print("Entities:")
for ent in new_doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


## Add the above as a pipeline component

In [None]:
# TODO in the future: Add the component after the named entity recognizer
# nlp.add_pipe(compound_trait_entities, after='ner')


The example above is rigid. It only checks for PLAN entities with a preceding one-token adjective. This needs to be generalized to cover other dependency relationships (e.g., 'the awns are rough' in addition to 'rough awns').

In [None]:
plan_entities = [ent for ent in doc.ents if ent.label_ == "PLAN"]
for ent in plan_entities:
    # Because the entity is a Span, we need to use its root token. The head
    # is the syntactic governor of the entity, e.g. the verb
    head = ent.root.head
    if head.lemma_ == "be":
        # Check if the children contain an adjectival complement (acomp)
        acomps = [token for token in head.children if token.dep_ == "acomp"]
        for acomp in acomps:
            print(acomp, ent, "is a TRAT (trait)")

## Quick gut-check on spacy basics

Ah, now all we need to do is add this piece of logic to the earlier compound_trait_entities function we will add to the pipeline. This handles phrases where the adjective doesn't precede the noun it modifies, and the original code already handles the phrases where the adjective directly precedes the noun it modifies. But first, let's make sure we understand [Tokens](https://spacy.io/api/token), [Spans](https://spacy.io/api/span) and [Docs](https://spacy.io/api/doc).

In [None]:
nlp = spacy.load('en_core_web_sm')
from spacy.tokens import Span

# Figure out alternatives to creating a Span
test_doc = nlp('This is a simple sentence used for testing.')
# span = test_doc[1:4]
# The next one doesn't assign the label
# span = test_doc[1:4].char_span(0, 11, label="FRAG")
span = Span(test_doc, 1, 4, label="FRAG")
print("SPAN FRAG:", span.text, span.label_)

# Now work out how to find a token's index within the doc
token = test_doc[3]
print("Token:", token.text, "at position", token.i)

# What is the first token in the span?
token = test_doc[span.start]
print("Span start:", token.text, "at position", token.i)

## Combined approach: Handling ADJ PLAN = TRAT and PLAN (be) ADJ = TRAT

In [None]:
from spacy.tokens import Span

def adj_plan_entities(doc):
    # only deals with ADJ PLAN = TRAT entities (e.g., 'rough awns')
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PLAN" and ent.start != 0:
            prev_token = doc[ent.start - 1]
            # print('DEBUG: ', ent.text, ent.start, ent.label_, prev_token.text, prev_token.pos_, prev_token.dep_)
            if prev_token.pos_ == 'ADJ' and prev_token.dep_ == 'amod':
                new_ent = Span(doc, ent.start - 1, ent.end, label='TRAT')
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

def plan_adj_entities(doc):
    #only deals with PLAN (be) ADJ = TRAT entities (e.g., 'awns are rough')
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PLAN":
            # Because the entity is a Span, we need to use its root token. The head
            # is the syntactic governor of the entity, e.g. the verb
            head = ent.root.head
            # print('DEBUG: entity head lemma', head.lemma_)
            acomps = []
            if head.lemma_ == "be":
                # Check if the children contain an adjectival complement (acomp)
                acomps = [token for token in head.children if token.dep_ == "acomp"]
            # Only build a new span when an acomp exists and follows the entity;
            # otherwise acomps[0] raises IndexError or the span bounds are invalid
            if acomps and acomps[0].i >= ent.end:
                # CAVEAT 1: For now let's assume adjectives are single-word
                # Later we should figure out the lowest and highest index among the adjectives
                acomp = acomps[0]
                # print('DEBUG: ', acomp, ent, "is a new TRAT (trait)")
                # CAVEAT 2: The document remains unchanged, so the term that will get stored
                # will be the original phrase 'awns are rough' instead of the standardized
                # 'rough awns' (i.e., acomp + ent), which would be preferable
                #
                # The span runs from the first token of 'ent' through the 'acomp' token
                new_ent = Span(doc, ent.start, acomp.i + 1, label="TRAT")
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc


def compound_trait_entities(doc):
    doc = adj_plan_entities(doc)
    doc = plan_adj_entities(doc)
    return doc


new_doc = compound_trait_entities(doc)
print(new_doc)
print("Entities:")
for ent in new_doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


# Identify compound traits ADJ + TRAT = TRAT

In [None]:
from spacy.tokens import Span

def adj_ent_entities(doc):
    # only deals with ADJ TRAT (or PLAN) = TRAT entities (e.g., 'low lodging' or 'rough awns')
    new_ents = []
    for ent in doc.ents:
        if ent.label_ in ('PLAN', 'TRAT') and ent.start != 0:
            prev_token = doc[ent.start - 1]
            # print('DEBUG: ', ent.text, ent.start, ent.label_, prev_token.text, prev_token.pos_, prev_token.dep_)
            if prev_token.pos_ == 'ADJ' and prev_token.dep_ == 'amod':
                new_ent = Span(doc, ent.start - 1, ent.end, label='TRAT')
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc

def trat_adj_entities(doc):
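    # only deals with TRAT ADJ = TRAT entities (e.g., 'matures early');
    # the trailing modifier is expected to be tagged ADV / advmod
    # CAVEAT: we still need to deal with TRAT (be) ADJ (e.g., 'its protein levels are low')
    new_ents = []
    for ent in doc.ents:
        # Compare the label directly: `in ('TRAT')` is a substring test on a string.
        # The token following the entity is doc[ent.end]; skip entities that end the doc.
        if ent.label_ == 'TRAT' and ent.end < len(doc):
            next_token = doc[ent.end]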
            # print('DEBUG: ', ent.text, ent.start, ent.label_, next_token.text, next_token.pos_, next_token.dep_)
            if next_token.pos_ == 'ADV' and next_token.dep_ == 'advmod':
                new_ent = Span(doc, ent.start, ent.end + 1, label='TRAT')
                new_ents.append(new_ent)    
            else:
                new_ents.append(ent)
        else:
            new_ents.append(ent)
    doc.ents = new_ents
    return doc


def compound_trait_entities(doc):
    doc = adj_ent_entities(doc)
    doc = plan_adj_entities(doc)
    doc = trat_adj_entities(doc)
    return doc


new_doc = compound_trait_entities(doc)
print(new_doc)
print("Entities:")
for ent in new_doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


# Add our new compound trait function to the pipeline

In [None]:
nlp.add_pipe(compound_trait_entities, after='ner')
doc = nlp(test_text)
print(doc)
print("Entities:")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)


## Preparse pedigrees and journal entries

### First let's reacquaint ourselves with the results format of pre-parse

In [None]:
from src.preparse import *

regex_ents = preparse(test_text)
print(regex_ents, '\n')
print (regex_ents[test_text]['entity 1'], '\n')
print (regex_ents[test_text]['entity 1']['label'], '\n')
print (regex_ents[test_text]['entity 1']['substring'])
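
The cell above only inspects `'entity 1'`. A minimal sketch of walking every regex entity and locating it in the text might look like the following. It assumes the nested shape implied by the indexing above (text, then `'entity N'`, then `'label'` and `'substring'`) and that every value under the text key is such an entity dictionary. The sample dictionary and the helper `locate_regex_entities` are hypothetical stand-ins, not the actual output or API of `preparse`.

```python
# Hypothetical stand-in for the dictionary returned by preparse(); only the
# nesting (text -> 'entity N' -> 'label'/'substring') is taken from the notebook.
sample_text = "Obtained from the cross Triumph/Victor."
sample_regex_ents = {
    sample_text: {
        "entity 1": {"label": "PED", "substring": "Triumph/Victor"},
    }
}


def locate_regex_entities(text, regex_ents):
    """Return (start_char, end_char, label) for each regex entity found in text."""
    located = []
    for ent_data in regex_ents[text].values():
        substring = ent_data["substring"]
        start = text.find(substring)  # first occurrence only
        if start == -1:
            continue  # substring not present verbatim; skip it
        located.append((start, start + len(substring), ent_data["label"]))
    return located


print(locate_regex_entities(sample_text, sample_regex_ents))
```

Character offsets found this way would still need to be mapped onto token boundaries before building a `Span`, which is what the matcher experiments below are working toward.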

### Try to figure out matcher

In [None]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
tdoc = nlp("I do not care if you do not care")
substring = "do not care"
sdoc = nlp(substring)
pattern = [{"ORTH": token.text} for token in sdoc]
#pattern = [{"ORTH": "do"}, {"ORTH": "not"}, {"ORTH": "care"}]
matcher.add("do not care", None, pattern)
matches = matcher(tdoc)
print(matches)

### Now check for entities overlapping each new regex one

In [None]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

def get_matched_spans(doc, substring):

    sdoc = nlp(substring)
    pattern = [{"ORTH": token.text} for token in sdoc] #allows matching multi-word pattern
    matcher.add(substring, None, pattern)
    
    result = []
    matches = matcher(doc)
    print(matches)
    for match_id, start, end in matches:
#        string_id = nlp.vocab.strings[match_id]  # Get string representation
        span = doc[start:end]  # The matched span
#        print(match_id, string_id, start, end, span.text)
        print(match_id, start, end, span.text)
        result.append(span)
    return result
    
tdoc = nlp("I don't care if you don't care")
span_res = get_matched_spans(tdoc, "don't care")
[(i.text,i.start,i.end) for i in span_res]


### Back to our problem

In [None]:
ent_data = regex_ents[test_text]['entity 1']
print(doc.text)
print(ent_data['substring'])
# the following returns a list of spans, each indexed as span_list[0].text, 
#   span_list[0].start, and span_list[0].end
span_list = get_matched_spans(doc, ent_data['substring'])

# Although there might be multiple matches, just work with the first one for now
new_ent = Span(doc, span_list[0].start, span_list[0].end, label=ent_data['label'])

print("SPAN FRAG:", new_ent.text, new_ent.label_)
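
The cell above builds the new regex-based span but does not yet deal with model entities that overlap it, and spaCy rejects overlapping spans when they are assigned to `doc.ents`. One possible way to resolve this, sketched in plain Python on `(start, end, label)` token-offset tuples, is to drop every existing entity that overlaps the new one. The helper name `replace_overlapping` and the sample offsets are hypothetical, and letting the regex entity always win is an assumed policy.

```python
def replace_overlapping(entities, new_entity):
    """Return entities with anything overlapping new_entity swapped for it.

    Each entity is a (start, end, label) tuple of token offsets, end exclusive.
    """
    new_start, new_end, _ = new_entity
    kept = [
        (start, end, label)
        for start, end, label in entities
        if end <= new_start or start >= new_end  # keep only non-overlapping
    ]
    kept.append(new_entity)
    return sorted(kept)


# Hypothetical token offsets: two model entities sit inside the regex match
model_entities = [(0, 1, "CVAR"), (12, 13, "CVAR"), (14, 15, "CVAR")]
regex_entity = (12, 15, "PED")
merged = replace_overlapping(model_entities, regex_entity)
print(merged)
```

Each surviving tuple can then be turned back into a span with `Span(doc, start, end, label=label)` before assigning the list to `doc.ents`.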

## Here's how we put it all together

In [None]:
from src.agParse import *
nlp = spacy.load(output_dir)
text = 'Kold is a six-rowed winter feed barley obtained from the cross Triumph/Victor. It was released by the Oregon AES in 1993. It has rough awns and the aleurone is white. It has low lodging, matures early and its yield is low. Crop Science 25:1123 (1985).'
nlp.add_pipe(compound_trait_entities, after='ner')
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

We now have the basics working. To recap the caveats: we don't handle multi-word adjectival or adverbial modifiers such as 'mid to late maturity' and 'height is very low', and we don't handle TRAT (be) ADJ constructs (e.g., 'its yield is low'). Both still need to be added.
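
For the multi-word modifier caveat, one possible approach is to widen the span to the lowest and highest token index in the modifier's dependency subtree. The sketch below shows the idea in plain Python on a hand-written, hypothetical parse of 'height is very low'; `heads`, `subtree_bounds`, and `compound_span` are illustrative names, not part of the notebook or of spaCy. In spaCy itself, `token.left_edge.i` and `token.right_edge.i` should give the same bounds for a parsed token.

```python
def is_descendant(heads, j, index):
    """True if token j is `index` itself or lies below it in the tree."""
    while True:
        if j == index:
            return True
        if heads[j] == j:  # reached the root without meeting `index`
            return False
        j = heads[j]


def subtree_bounds(heads, index):
    """Return (lowest, highest) token index among `index` and its descendants.

    heads[i] is the index of token i's syntactic head; the root points to itself.
    """
    members = [j for j in range(len(heads)) if is_descendant(heads, j, index)]
    return min(members), max(members)


def compound_span(ent_start, ent_end, heads, modifier_index):
    """Widen the entity span (end exclusive) to cover the whole modifier subtree."""
    lo, hi = subtree_bounds(heads, modifier_index)
    return min(ent_start, lo), max(ent_end, hi + 1)


# Hypothetical parse: 'very' modifies 'low'; 'height' and 'low' attach to 'is'
tokens = ["height", "is", "very", "low"]
heads = [1, 1, 3, 1]
start, end = compound_span(0, 1, heads, modifier_index=3)
print(" ".join(tokens[start:end]))
```

The same bounds logic would apply to the TRAT (be) ADJ case, since the adjectival complement of 'be' can carry its own modifiers.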