# Generate Training Data

We want to speed up manual annotation for named entity recognition (NER). A human names the entities they recognize in a paragraph, and this notebook finds each entity's character position in the text and writes the result in the format that spaCy's NER training routine accepts.
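
For reference, the target format pairs each sentence with a list of `(start, end, label)` character offsets, where `end` is exclusive as in Python slicing. The sketch below checks one of the records produced later in this notebook; the helper name `entity_spans` is an illustrative assumption, not part of spaCy.

```python
# One training example: (text, {'entities': [(start, end, label), ...]})
# Offsets are character positions; end is exclusive, as in Python slicing.
example = (
    "It was released by Agriculture and Agri-Food Canada in 1997.",
    {"entities": [(19, 51, "ORG"), (55, 59, "DATE")]},
)


def entity_spans(example):
    """Return (substring, label) for each annotated entity (hypothetical helper)."""
    text, annotations = example
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


for span, label in entity_spans(example):
    print(label, "->", span)
```

Slicing the text with the stored offsets is a quick way to confirm that an annotation points at the intended phrase.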

## Open PDF file, extract page two and display as sentences

In [24]:
import spacy
import PyPDF2
nlp = spacy.load('en_core_web_sm')

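# Open the PDF file for reading.
# Note: PdfFileReader, getPage, and extractText are the PyPDF2 1.x/2.x names;
# PyPDF2 3.x replaces them with PdfReader, reader.pages[i], and extract_text().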
pdfFile = open("BarCvDescLJ11.pdf", mode="rb")
pdfReader = PyPDF2.PdfFileReader(pdfFile)

# Select a page to work on
pageNumber = 2

# Get text
OnePage = pdfReader.getPage(pageNumber-1) #0-based count
OnePageText = OnePage.extractText()

# Close PDF file
pdfFile.close()

# Remove newlines. Multiple consecutive newlines appear to make spaCy treat
# that position as the end of a sentence, and the PDF reader returns the text
# with odd line breaks.
OnePageText = OnePageText.replace('\n','')

# Create a spaCy Doc object from the page and break it into sentences
doc = nlp(OnePageText)
for l, sent in enumerate(doc.sents):
    print(l, ": ", sent)


0 :  2   AC METCALFE  AC Metcalfe is a two-rowed spring malting barley.
1 :  It was released by Agriculture and Agri-Food Canada in 1997.
2 :  It was selected from the cross AC Oxbow/Manley.
3 :  Its experimental designations were TR 232 and WM8612-1.
4 :  It is widely adapted to western Canada and has excellent malting and brewing quality, particularly malt extract.
5 :  It is tall (averages about 40 inches in plant height) with fair straw strength and is medium late maturing (about one day later than Harrington).
6 :  At the time of release it was resistant to stripe rust, stem rust, loose smut, moderately resistant to the surface-borne smuts and the spot-form of net blotch (and had adult plant resistance to some net-form pathotypes), and susceptible to scald, leaf rust, speckled leaf blotch, common root rot, and BYD.
7 :  It was evaluated as Entry 1217 in the UC Regional Cereal Testing program from 2007-present for spring planting in the intermountain region of northern California. 
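
The cell above deletes newlines outright, which works here because the extracted text already has spaces around its line breaks. As a hedged alternative, the sketch below replaces each run of newlines with a single space so that words on adjacent lines are not joined. The function name `flatten_newlines` and the sample string `raw` are assumptions made up for illustration. Note that changing the whitespace would also shift the character offsets reported later in this notebook.

```python
import re


def flatten_newlines(text: str) -> str:
    """Replace each run of newlines (and surrounding whitespace) with one space."""
    return re.sub(r"\s*\n\s*", " ", text).strip()


# Hypothetical extracted text with a single and a double line break
raw = "AC Metcalfe is a two-rowed\nspring malting\n\nbarley."

print(raw.replace("\n", ""))   # words on adjacent lines run together
print(flatten_newlines(raw))   # words stay separated
```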

## Read in Per-line named entity file and match entities to sentence positions.

In [101]:
import re
import csv
import pandas as pd
fname = "barley_p2_ner2.txt"

# Convert the spaCy sentence generator into a list of sentences
sentences = list(doc.sents)

# Open the file of manually matched pairs (sentence # <tab> word phrase <tab> named entity)
# e.g.:
#  0      AC Metcalfe     CVAR
#  0      two-rowed       TRAT
#  0      barley          CROP
#  1      Agri-Food Canada    ORG
#  1      1997    DATE
# Read all rows up front so the file is closed before processing starts
with open(fname, newline='') as file:
    reader = list(csv.reader(file, delimiter='\t', quoting=csv.QUOTE_NONE))
data = list()

for row in reader:
    try:
        (sentIndex, phrase, label) = row
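        # .text works in both spaCy 2.x and 3.x (Span.string was removed in 3.0)
        sent = sentences[int(sentIndex)].text.rstrip()
        
        # Find all instances of 'phrase' in 'sent'. re.escape keeps characters
        # such as '(' or '+' in the phrase from being read as regex syntax.
        matches = re.finditer(r"\b" + re.escape(phrase) + r"\b", sent)
        indices = [m.start(0) for m in matches]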
                
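        # Record every instance as a "(start, end, 'LABEL')" string
        for i in indices:
            # print(sentIndex, sent, phrase, "("+str(i), i+len(phrase), "'"+label+"')")
            data.append([sentIndex, sent, phrase, "("+str(i)+", "+str(i+len(phrase))+", '"+label+"')"])
            
    except ValueError:
        # Rows without exactly three tab-separated fields (or with a
        # non-numeric sentence index) are reported for manual handling
        print("Handle manually: ", row)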
        
df = pd.DataFrame(data, columns = ["Index", "Sentence", "Phrase", "MatchInfo"])
print(df)


Handle manually:  ['([(Mentor x Minerva) x mutant of Vada] x [(Carlsberg x Union) x (Opavsky x Salle) x Ricardo]) x (Oriol x 6153 P40)', 'PED']
Handle manually:  ['WM861-5/TR118', 'PED']
    Index                                           Sentence          Phrase  \
0       0  2   AC METCALFE  AC Metcalfe is a two-rowed sp...     AC Metcalfe   
1       0  2   AC METCALFE  AC Metcalfe is a two-rowed sp...       two-rowed   
2       0  2   AC METCALFE  AC Metcalfe is a two-rowed sp...          spring   
3       0  2   AC METCALFE  AC Metcalfe is a two-rowed sp...         malting   
4       0  2   AC METCALFE  AC Metcalfe is a two-rowed sp...          barley   
..    ...                                                ...             ...   
167    64  At the time of evaluation it was moderately su...           scald   
168    64  At the time of evaluation it was moderately su...      net blotch   
169    64  At the time of evaluation it was moderately su...     stripe rust   
170    64  At
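
The matching step above can be sketched as a small standalone function. This is a variation under stated assumptions, not the notebook's exact code: the name `find_entity_offsets` is hypothetical, and it uses lookarounds instead of `\b` so that phrases beginning or ending with punctuation (for example a parenthesis) can still match as whole words.

```python
import re


def find_entity_offsets(sentence, phrase, label):
    """Return (start, end, label) for every whole-word match of phrase.

    Hypothetical helper: the phrase is escaped so it is matched literally,
    and the lookarounds require that no word character touches either end.
    """
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return [(m.start(), m.end(), label) for m in re.finditer(pattern, sentence)]


sentence = "2   AC METCALFE  AC Metcalfe is a two-rowed spring malting barley."
for phrase, label in [("AC Metcalfe", "CVAR"), ("two-rowed", "TRAT"), ("barley", "CROP")]:
    print(find_entity_offsets(sentence, phrase, label))
```

Matching is case-sensitive, so the upper-case heading `AC METCALFE` at the start of the sentence is not reported as a second match.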

## Aggregate all matches for each sentence on a single line and output in spaCy training format
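
As a reference point for the pandas cell in this section, the same grouping can be sketched in plain Python. This is an alternative under stated assumptions, not the notebook's method: the function name `build_training_data` and the tuple layout of `matches` are hypothetical. Keeping offsets as tuples rather than strings avoids quoting problems when a sentence contains an apostrophe, and sorting on integer indices keeps sentence 2 ahead of sentence 10.

```python
from collections import defaultdict


def build_training_data(matches):
    """Group (sent_index, sentence, start, end, label) rows by sentence."""
    sentences = {}
    entities = defaultdict(list)
    for sent_index, sentence, start, end, label in matches:
        sentences[sent_index] = sentence
        entities[sent_index].append((start, end, label))
    # Sort on the integer index so sentences stay in document order
    return [(sentences[i], {"entities": entities[i]}) for i in sorted(sentences)]


# Rows taken from the notebook's own output, with integer sentence indices
matches = [
    (1, "It was released by Agriculture and Agri-Food Canada in 1997.", 19, 51, "ORG"),
    (1, "It was released by Agriculture and Agri-Food Canada in 1997.", 55, 59, "DATE"),
    (2, "It was selected from the cross AC Oxbow/Manley.", 31, 46, "PED"),
]

for record in build_training_data(matches):
    print(repr(record) + ",")
```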

In [102]:
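# Use pandas to aggregate all entity matches for a single sentence.
# Note: 'Index' holds strings read from the file, so groupby orders the groups
# lexically ('0', '1', '10', ..., '2'), not numerically.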
agg_rules = {'Sentence': 'first', 'Phrase': 'first', 'MatchInfo': lambda x: ', '.join(x)}
res = df.groupby('Index').agg(agg_rules)
#print(res)

# Now format each record the way the spaCy training module expects it.
# E.g.:
# ('Eight-Twelve is a six-rowed winter feed barley', {'entities': [(0, 12, 'CVAR'), (18, 27, 'TRAT'), (28, 39, 'TRAT'), (40, 46, 'CROP')]}),
records = res.to_dict('records')
for record in records:
    # repr() quotes the sentence safely even if it contains an apostrophe
    print("(" + repr(record['Sentence']) + ", {'entities': [" + record['MatchInfo'] + "]}),")


('2   AC METCALFE  AC Metcalfe is a two-rowed spring malting barley.', {'entities': [(17, 28, 'CVAR'), (34, 43, 'TRAT'), (44, 50, 'TRAT'), (51, 58, 'TRAT'), (59, 65, 'CROP')]}),
('It was released by Agriculture and Agri-Food Canada in 1997.', {'entities': [(19, 51, 'ORG'), (55, 59, 'DATE')]}),
('It is mid-late season in maturity (similar to Klages and 3-5 days later than Steptoe).', {'entities': [(6, 21, 'TRAT'), (46, 52, 'CVAR'), (77, 84, 'CVAR')]}),
('It is medium height (2 inches shorter than Steptoe) and has moderately stiff straw.', {'entities': [(6, 19, 'TRAT'), (43, 50, 'CVAR'), (60, 82, 'TRAT')]}),
('It was selected from the cross AC Oxbow/Manley.', {'entities': [(31, 46, 'PED')]}),
('Awns are rough.', {'entities': [(0, 4, 'PLAN')]}),
('Glumes are covered with long hairs.', {'entities': [(0, 6, 'PLAN')]}),
('Hulls are adhering and wrinkled.', {'entities': [(0, 5, 'PLAN')]}),
('Aleurone is colorless.', {'entities': [(0, 8, 'PLAN')]}),
('Rachilla hairs are long.', {'entities': [(