# Generate Training Data

We want to speed up manual entity annotation: a human names the entities they recognize in a paragraph, and the code below finds each entity's character positions and writes them in the syntax expected by spaCy's NER training routine.
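As a minimal sketch of the target format: each training example pairs a sentence with `(start, end, label)` character offsets, where `end` is exclusive. The helper `find_entity` is a hypothetical name used only for illustration and is not part of this notebook; the sentence is the one quoted in the spaCy-format comment in the last code cell.

```python
def find_entity(sentence, phrase, label):
    """Return (start, end, label) for the first occurrence of phrase, or None."""
    start = sentence.find(phrase)
    if start == -1:
        return None
    return (start, start + len(phrase), label)

sentence = "Eight-Twelve is a six-rowed winter feed barley"
annotations = [("Eight-Twelve", "CVAR"), ("six-rowed", "TRAT"), ("barley", "CROP")]

# Build one training example in the (text, {'entities': [...]}) layout
entities = [find_entity(sentence, phrase, label) for phrase, label in annotations]
example = (sentence, {"entities": entities})
print(example)
```

Slicing the sentence with any offset pair, e.g. `sentence[18:27]`, should give back the annotated phrase.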

## Open PDF file, extract one page and display as sentences

In [38]:
import spacy
import PyPDF2
# import slate3k as slate
nlp = spacy.load('en_core_web_sm')

#filename = "BarCvDescLJ11.pdf"
filename = "Data/CSU/Cowboy-reprint.pdf"

#Open PDF file for reading
pdfFile = open(filename, mode="rb")
pdfReader = PyPDF2.PdfFileReader(pdfFile)

# Select a page to work on
pageNumber = 3

# Get text
OnePage = pdfReader.getPage(pageNumber-1) #0-based count
OnePageText = OnePage.extractText()

# Close PDF file
pdfFile.close()

# Replacement code to use slate which handles spacing within some PDFs better
# with open(filename, 'rb') as f:
#     OnePageText = slate.PDF(f).text()

# Remove newlines. It appears that multiple newlines together make
# spaCy think that is the end of a sentence. The PDF reader reads the text in
# an odd fashion
OnePageText = OnePageText.replace('\n','')

# create a spaCy doc object from the page and break it into sentences
doc = nlp(OnePageText)
for l, sent in enumerate(doc.sents):
    print(l, ": ", sent)


0 :  Journal of Plant Registrations 171Other evaluations in Colorado or through the USDA Regional Testing Program have shown that Cowboy is moderately susceptible to Barley yellow dwarf virus and susceptible to Wheat soilborne mosaic virus.
1 :  ˆe reaction of Cowboy to Wheat streak mosaic virus is not known, although it lacks the DNA markers associated with Wsm1 (Qi et al., 2007) and Wsm2
2 :  (Lu et al., 2012).
3 :  Cowboy is heterogeneous for resistance to a collection of endemic biotypes of the Hessiay [Mayetiola destructor (Say)]
4 :  (Chen et al., 2009), susceptible to greenbug Biotype E
5 :  [Schizaphis graminum (Rondani)], resistant to Russian wheat aphid (Diuraphis noxia Kurdjumov) Biotype 1, and susceptible to Russian wheat aphid Biotype 2.Field PerformanceIn ˙eld trials in Colorado, grain yield of Cowboy was similar (P > 0.05) to its sister selection Denali (as described in Haley et al., 2012).
6 :  ˆese trials include the CSU Elite Trial (2009Œ2011; 29 trial locations), the
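The output above shows words fused together where a line break was deleted (for example `171Other` and `2.Field PerformanceIn`). One possible alternative, sketched here with only the standard library, collapses each run of whitespace into a single space instead of deleting newlines outright. The sample string is made up for illustration, and whether this improves sentence splitting for a particular PDF is an untested assumption.

```python
import re

def collapse_whitespace(text):
    """Replace every run of whitespace (including newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample imitating text extracted from a PDF page
raw = "Journal of Plant Registrations 171\nOther evaluations\n\nin Colorado"

fused = raw.replace("\n", "")        # deleting newlines fuses neighbouring words
spaced = collapse_whitespace(raw)    # collapsing keeps them separated
print(fused)
print(spaced)
```

The trade-off is that a word hyphenated across a line break would gain a space after the hyphen, so such cases would still need manual attention.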

## Read in Per-line named entity file and match entities to sentence positions.

In [40]:
import re
import csv
import pandas as pd

#fname = "Data/DavisLJ11/barley_p"+str(pageNumber)+"_ner.txt"
#fname = "Data/UIdaho2019/barley_p"+str(pageNumber-40)+"_ner.txt" # Note p1 = 41 in the document
fname = "Data/CSU/Cowboy_p"+str(pageNumber)+"_ner.txt"

# Convert the nlp sentence generator into a list of sentences
sentences = list(doc.sents)

# Open the file of manually matched pairs (sentence # <tab> word phrase <tab> named entity)
# e.g.:
#  0      AC Metcalfe     CVAR
#  0      two-rowed       TRAT
#  0      barley          CROP
#  1      Agri-Food Canada   ORG
#  1      1997    DATE
data = list()

# Open the annotation file in a with-block so it is closed after reading
with open(fname, newline='') as file:
    reader = csv.reader(file, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        try:
            (sentIndex, phrase, label) = row
            # use .text (Span.string is deprecated)
            sent = sentences[int(sentIndex)].text.rstrip()

            # find all instances of the 'phrase' in the 'sent'; re.escape makes
            # characters such as '(' or '.' in the phrase match literally
            matches = re.finditer(r"\b"+re.escape(phrase)+r"\b", sent)
            indices = [m.start(0) for m in matches]

            # check to make sure the phrase the user said was there was indeed found
            if len(indices) == 0:
                raise ValueError

            # record all instances
            for i in indices:
                data.append([sentIndex, sent, phrase, "("+str(i)+", "+str(i+len(phrase))+", '"+label+"')"])

        # malformed row, bad sentence number, or phrase not found
        except (ValueError, IndexError):
            print("Handle manually: ", row)
        
df = pd.DataFrame(data, columns = ["Index", "Sentence", "Phrase", "MatchInfo"])
print(df)


   Index                                           Sentence  \
0     22  In these trials, Cowboy had average grain volu...   
1     22  In these trials, Cowboy had average grain volu...   
2     24  the Wyoming irrigated state variety trial, gra...   
3     28  In these trials, Cowboy had above-average grai...   
4     28  In these trials, Cowboy had above-average grai...   
5     36  ˆe three check varieties have overall good mil...   
6     36  ˆe three check varieties have overall good mil...   
7     36  ˆe three check varieties have overall good mil...   
8     36  ˆe three check varieties have overall good mil...   
9     36  ˆe three check varieties have overall good mil...   
10    37  Values for milling-related variables were gene...   
11    37  Values for milling-related variables were gene...   
12    37  Values for milling-related variables were gene...   
13    37  Values for milling-related variables were gene...   

                         Phrase           MatchInfo  
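The matching step can be tried without a PDF or an annotation file. This sketch assumes a made-up two-sentence list and an in-memory annotation file in the same `sentence # <tab> phrase <tab> label` layout; `match_rows` is a hypothetical name, not a function from this notebook.

```python
import csv
import io
import re

def match_rows(sentences, tsv_text):
    """Return (matches, unmatched) for tab-separated annotation rows."""
    matches, unmatched = [], []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        try:
            sent_index, phrase, label = row
            sent = sentences[int(sent_index)]
        except (ValueError, IndexError):
            # wrong number of fields or a bad sentence number
            unmatched.append(row)
            continue
        # Whole-word search; re.escape keeps characters like '(' literal
        pattern = r"\b" + re.escape(phrase) + r"\b"
        spans = [(m.start(), m.end(), label) for m in re.finditer(pattern, sent)]
        if spans:
            matches.extend((int(sent_index), span) for span in spans)
        else:
            unmatched.append(row)
    return matches, unmatched

# Made-up sentences and annotations for illustration
sentences = ["Cowboy is a hard red winter wheat.",
             "Grain yield of Cowboy was similar to Denali."]
tsv_text = "0\tCowboy\tCVAR\n1\tCowboy\tCVAR\n1\tDenali\tCVAR\n1\tRipper\tCVAR\n"

matches, unmatched = match_rows(sentences, tsv_text)
print(matches)
print(unmatched)
```

Rows that land in `unmatched` correspond to the "Handle manually" messages printed by the cell above.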


## Create a function to clean up overlapping intervals

In [41]:
import re
coordRegex = re.compile(r'(\d+), (\d+)')

def sortByStart(coords):
    """Sort key: return the integer start offset of an interval string like (5, 7, 'CVAR')"""
    # split out coordinates that come in as (5, 7, 'CVAR')
    mo = coordRegex.search(coords)
    return(int(mo.group(1)))

def overlaps(coord1, coord2):
    """Check if coordinates of the form 5, 7, 'CVAR' and 32, 46, 'TRAT' overlap"""
    mo1 = coordRegex.search(coord1)
    mo2 = coordRegex.search(coord2)
    coord1Low = int(mo1.group(1))
    coord1High = int(mo1.group(2))
    coord2Low = int(mo2.group(1))
    coord2High = int(mo2.group(2))
    
    if ((coord1High >= coord2Low) and (coord1Low <= coord2Low) or
        (coord2High >= coord1Low) and (coord2Low <= coord1Low)):
        return True
    else:
        return False

def keepFirst(coord1, coord2):
    """Given overlapping coordinates, return True if the first is at least as wide as the second."""
    mo1 = coordRegex.search(coord1)
    mo2 = coordRegex.search(coord2)
    coord1Low = int(mo1.group(1))
    coord1High = int(mo1.group(2))
    coord2Low = int(mo2.group(1))
    coord2High = int(mo2.group(2))
 
    if (int(coord1High) - int(coord1Low)) >= (int(coord2High) - int(coord2Low)):
        return True
    else:
        return False

# print("Should be false:", overlaps("(5, 7, 'CVAR')", "(32, 46, 'TRAT')"))
# print("Should be true:", overlaps("(26, 46, 'TRAT')", "(32, 46, 'TRAT')"))
# print("Should be true:", overlaps("(26, 46, 'TRAT')", "(26, 46, 'TRAT')"))
# print("Keeper:", keepFirst("34, 46, 'TRAT'", "34, 46, 'TRAT'"))

def cleanIntervals(inputString=""):
    """order intervals like (5, 7, 'CVAR'), (32, 46, 'TRAT'), (26, 46, 'TRAT') and remove overlapping ones."""
    inputString = inputString.lstrip("(").rstrip(")")
    intervalList = inputString.split("), (")
    intervalList.sort(key = sortByStart)
#    print("Sorted Interval List:", intervalList)

    # Pairwise compare every interval in the list to every other interval to check overlap
    keeperList = [True]*len(intervalList) # Logic array to determine if each interval should be kept
    for i, interval1 in enumerate(intervalList):
        for j, interval2 in enumerate(intervalList):
            if i == j:
                continue
            if interval1 == interval2:
                if j < i: # when both are the same we reject the later one
                    keeperList[i] = False
            elif overlaps(interval1, interval2) and not keepFirst(interval1, interval2):
                keeperList[i] = False

    # Build up the return interval list
    keptIntervals = [interval for interval, isKeeper in zip(intervalList, keeperList) if isKeeper]
    return "(" + "), (".join(keptIntervals) + ")"
        
# cleanIntervals("(5, 7, 'CVAR'), (32, 46, 'TRAT'), (5, 9, 'CVAR'), (48, 55, 'ORG'), (26, 46, 'TRAT')")
# cleanIntervals("(0, 8, 'CVAR'), (0, 5, 'CVAR'), (21, 26, 'PLAN'), (32, 37, 'CVAR')")
cleanIntervals("(0, 12, 'CVAR'), (39, 49, 'CVAR'), (39, 49, 'CVAR'), (71, 77, 'CVAR'), (71, 77, 'CVAR'), (92, 113, 'TRAT'), (140, 150, 'CVAR'), (140, 150, 'CVAR'), (181, 187, 'CVAR'), (181, 187, 'CVAR')")


"(0, 12, 'CVAR'), (39, 49, 'CVAR'), (71, 77, 'CVAR'), (92, 113, 'TRAT'), (140, 150, 'CVAR'), (181, 187, 'CVAR')"

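An alternative way to express this cleanup is sketched below, under the assumption that the interval string is first parsed into tuples with `ast.literal_eval`; `clean_interval_tuples` is a hypothetical name. It considers wider intervals first and drops any interval that overlaps one already kept, treating the end offset as exclusive. This is not identical to `cleanIntervals`: ties between equal-width overlapping intervals are resolved in favor of the one that sorts first.

```python
import ast

def clean_interval_tuples(interval_string):
    """Parse "(s, e, 'LABEL'), ..." and drop duplicate or overlapping intervals."""
    # A trailing comma makes even a single interval parse as a tuple of tuples
    intervals = ast.literal_eval("(" + interval_string + ",)")
    kept = []
    # set() removes exact duplicates; widest intervals are considered first
    for start, end, label in sorted(set(intervals), key=lambda t: (-(t[1] - t[0]), t)):
        # End offsets are exclusive, so touching intervals do not overlap
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept)

cleaned = clean_interval_tuples(
    "(5, 7, 'CVAR'), (32, 46, 'TRAT'), (5, 9, 'CVAR'), (48, 55, 'ORG'), (26, 46, 'TRAT')"
)
print(cleaned)
```

Working on tuples rather than strings avoids re-parsing each interval with a regular expression on every comparison.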
## Aggregate all matches for each sentence on a single line and output in spaCy training format

In [42]:
# use Pandas dataframes to aggregate all entity matches together for a single sentence
agg_rules = {'Sentence': 'first', 'Phrase': 'first', 'MatchInfo': lambda x: ', '.join(x)}
res = df.groupby('Index').agg(agg_rules)
#print(res)

# Now format it just like what is needed for the spaCy training module: 
# E.g.:
# ('Eight-Twelve is a six-rowed winter feed barley', {'entities': [(0, 12, 'CVAR'), (18, 27, 'TRAT'), (28, 39, 'TRAT'),(40, 46, 'CROP')]}),
records = res.to_dict('records')
print("TRAIN_DATA = [")
maxr = len(records)
for i in range(0,maxr):
    # repr() quotes the sentence safely even when it contains an apostrophe
    print("    ("+repr(records[i]['Sentence'])+", {'entities': ["+cleanIntervals(records[i]['MatchInfo'])+"]})", end='')
    if (i == maxr-1):
        print()
    else:
        print(",")

print("]")

TRAIN_DATA = [
    ('In these trials, Cowboy had average grain volume weight (764 kg m3), greater than (P < 0.05)', {'entities': [(17, 23, 'CVAR'), (36, 55, 'TRAT')]}),
    ('the Wyoming irrigated state variety trial, grain yield of Cowboy (7879 kg ha1) was the highest among the entries in a combined analysis across years (2010Œ2012, 3 locations), similar to (P > 0.05)', {'entities': [(43, 54, 'TRAT')]}),
    ('In these trials, Cowboy had above-average grain volume weight (773 kg m3), similar to (P >', {'entities': [(17, 23, 'CVAR'), (42, 61, 'TRAT')]}),
    ('ˆe three check varieties have overall good milling properties while overall baking properties for Hatcher and Ripper are good and Above is poor.', {'entities': [(43, 61, 'TRAT'), (76, 93, 'TRAT'), (98, 105, 'CVAR'), (110, 116, 'CVAR'), (130, 135, 'CVAR')]}),
    ('Values for milling-related variables were generally good for Cowboy, with kernel characteristics, grain protein concentration, and Brabendeuadrumat Senior (', {'entitie

In my current process, I write the above output to a file (e.g., `Data/DavisLJ11/barley_p5_td.py`), add any manual corrections (usually PED and JRNL entries), and then run the script `python3 py2json.py --doc 'BarCvDescLJ11.pdf' --url 'https://smallgrains.ucdavis.edu/cereal_files/BarCvDescLJ11.pdf' --chunk 5 Data/DavisLJ11/barley_p5_td.py Data/DavisLJ11/barley_p5_td.json` to create the JSON file for training.
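Before converting to JSON, it can help to confirm that every offset pair slices out the intended text. The following is a minimal sketch, assuming a hypothetical `check_offsets` helper that is not part of this notebook or of `py2json.py`, applied to the first record printed above.

```python
def check_offsets(train_data):
    """Return (text_slice, label) pairs; raise if a span is out of range or overlaps."""
    results = []
    for text, annotations in train_data:
        previous_end = 0
        for start, end, label in sorted(annotations["entities"]):
            if not (0 <= start < end <= len(text)):
                raise ValueError(f"span ({start}, {end}) is outside the sentence")
            if start < previous_end:
                raise ValueError(f"span ({start}, {end}) overlaps an earlier span")
            results.append((text[start:end], label))
            previous_end = end
    return results

TRAIN_DATA = [
    ('In these trials, Cowboy had average grain volume weight (764 kg m3), greater than (P < 0.05)',
     {'entities': [(17, 23, 'CVAR'), (36, 55, 'TRAT')]}),
]

checked = check_offsets(TRAIN_DATA)
for text_slice, label in checked:
    print(label, repr(text_slice))
```

Reading the printed slices next to their labels makes wrong offsets or mislabeled phrases easy to spot during the manual-correction pass.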