# Objective

Many agronomic terms appear in natural language in multiple forms, e.g.:
* "The awns are rough", "It has rough awns", or "It is rough-awned". In all these cases, the plant part (PLAN), awn, is modified by an adjective, rough. The combination, "rough" + "awn" is a trait (TRAT).
* "early maturing", "matures early". In these cases, a trait (TRAT), 'maturing' is modified by an adjective, early. This combination "early" + "maturing" is a compound trait (TRAT).

In this notebook, we will read in a small excerpt from a PDF file, run it against a trained NLP model, read the predictions, identify compoud traits based on the above rules, and output modified named entities in JSON format that include the compound traits.

# Read in a PDF

In [3]:
import spacy
import PyPDF2
#import slate3k as slate
# Use en_core_web_md across the board to be consistent. en_core_web_sm does
# not contain the vectors we need for pre-training
nlp = spacy.load('en_core_web_sm')

filename = "Data/DavisLJ11/BarCvDescLJ11.pdf"

#Open PDF file for reading
pdfFile = open(filename, mode="rb")
pdfReader = PyPDF2.PdfFileReader(pdfFile)

# Select a page to work on
pageNumber = 1

# Get text
OnePage = pdfReader.getPage(pageNumber-1) #0-based count
OnePageText = OnePage.extractText()

# Close PDF file
pdfFile.close()

# Replacement code to use slate which handles spacing within some PDFs better
# with open(filename, 'rb') as f:
#     OnePageText = slate.PDF(f).text()

# Remove newlinesxmx82k-&a. It appears multiple newlines together makes
# Spacy think that is the end of a sentence. The PDF reader reads the text in
# an odd fashion
OnePageText = OnePageText.replace('\n','')

# create a spaCy doc object from the page and break it into sentences
doc = nlp(OnePageText)
l=0
for sent in doc.sents:
     print(l, ": ", sent)
     l = l+1



0 :  1  
1 :  Revised September, 2011    BARLEY CULTIVARS FOR CALIFORNIA  
2 :  Lee Jackson, Extension Small Grain Specialist
3 :  (Retired)  Department of Plant Sciences  University of California, Davis     
4 :  The following are descriptions of barley cultivars evaluated in California from 1981 to 2011.
5 :  The descriptions are based on published cultivar releases and data from the UC Regional Cereal Evaluation Tests conducted each year throughout California.
6 :  Yield performance data for most of the cultivars can be found in University of California, Davis,
7 :  Agronomy Progress Reports (
8 :  No.'s 118, 128, 144, 155, 168, 180, 201,209, 217, 223, 229, 233, 236, 244, 249, 254, 259, 262, 265, 272, 276, 279, 286, 288, 290, 293, 295, 296, 301, 303, 304 for 1980-2011, respectively).
9 :  Reports #262 through 304 also can be seen at http://smallgrains.ucdavis.edu.    
10 :  CURRENT CULTIVARS
11 :  WINTER BARLEY   
12 :  EIGHT-TWELVE  
13 :  Eight-Twelve is a six-rowed winter feed ba

# Now read in a model

In [4]:
from spacy import displacy
agdata_nlp = spacy.load('NerModel')
test_text = '''Kold is a six-rowed winter feed barley. It was released by the Oregon and Idaho AESs in 1993'''

doc = agdata_nlp(test_text)
displacy.render(doc, style='ent', jupyter=True)

# Run the model against the PDF text

In [5]:
# Customize PDF
colors = {'ALAS':'BlueViolet','CROP': 'Aqua','CVAR':'Chartreuse','PATH':'red','PED':'orange','PLAN':'pink','PPTD':'brown','TRAT':'yellow'}
cust_options = {'ents': ['ALAS','CROP','CVAR','PATH','PED','PLAN','PPTD','TRAT'], 'colors':colors}

doc = agdata_nlp(OnePageText)

if doc.ents:
        displacy.render(doc, style='ent', jupyter=True, options=cust_options)

# Identify compound traits ADJ + PLAN = TRAT

In [7]:
doc.ents

(Revised September,
 2011,
 Jackson,
 Extension Small Grain Specialist,
 Retired,
 Department,
 Plant Sciences,
 University of California,
 Davis,
 barley,
 California,
 1981,
 2011,
 UC Regional Cereal Evaluation,
 California,
 Yield performance,
 University of California, Davis, Agronomy Progress Reports,
 's,
 128,
 144,
 217,
 223,
 259,
 272,
 2011,
 respectively,
 Reports #262,
 304,
 WINTER,
 EIGHT-TWELVE,
 Eight-Twelve,
 six-rowed,
 winter,
 feed,
 barley,
 USDA-ARS,
 Idaho AES,
 1991,
 Steveland/Luther//Wintermalt,
 79Ab812,
 awns,
 maturity,
 winter hardiness,
 height,
 lodging resistance,
 Spikes,
 Kernels,
 aleurone,
 rachilla hairs,
 susceptible,
 stripe rust,
 scald,
 snow mold,
 moderately susceptible,
 BYD,
 UC Regional,
 1982,
 2006,
 fall planting,
 northern California,
 Crop Science,
 32(3):828,
 Maja,
 six-rowed,
 winter,
 feed/malt,
 barley,
 Oregon AES,
 2006,
 AgriSource,
 Burley, Idaho,
 doubled haploid,
 Strider/88Ab536,
 STAB 113,
 Maja,
 88Ab536,
 Vrn-H1 locu