# Read categorical data from NSWFFRD 2014 (v2.1)
This script shows how to read the spreadsheet from the NSW Flora Fire Response Database, extract information for several traits, translate the original values into standard values and insert records into the Fireveg response database.

## Read data from spreadsheet

We will use the _openpyxl_ library in ***python*** to read the spreadsheet document.

In [1]:
import openpyxl
from pathlib import Path
import os
import re

import copy

We need to define a path to locate the documents relative to the current repository directory:

In [2]:
repodir = Path("../..") 
inputdir = repodir / "data/"

### Open the workbook and read spreadsheets
Here we will load the workbook (_wb_):

In [3]:
wb = openpyxl.load_workbook(inputdir / "NSWFFRDv2.1.xlsx")

We will access the sheets by name. We need the sheets 'SpeciesData' and 'References', and we will also read the column descriptions in 'Notes':

In [4]:
species_data = wb['SpeciesData']
references = wb['References']
column_notes = wb['Notes'] 

### Read cell values
We can use square brackets to refer to a column and then use Python indices (starting with _0_ for the top row) to select a cell. The _value_ property returns the stored content.

In [5]:
print(species_data['X'][1].value)
print(species_data['X'][157].value)

Post-fire flowering
flowers well after fire


Descriptions of these columns are found in the _column_notes_ sheet:

In [6]:
for k in (23,24):
    print(" - *%s*" %column_notes.cell(row=k,column=2).value)
    print("\t%s" % column_notes.cell(row=k,column=3).value)

 - *Establishment*
	Seedling establishment groups of Noble & Slatyer (1980); See VA sheet for details: I=Intolerant, T=Tolerant, R=Requiring
 - *Post-fire flowering*
	exclusive or facultative post-fire flowering observed


We can use this approach to read several columns from one row. Let's start by checking the column names in the header row (index 1):

In [7]:
sp_col='A'
spcode_col='B'
target_cols={'repr2':'X', 'rect2':'W'}

target_cols.values()
print("%s (%s) / %s / %s  " %
(species_data[sp_col][1].value,
 species_data[spcode_col][1].value,
species_data[target_cols['repr2']][1].value,
species_data[target_cols['rect2']][1].value))

Current Scientific Name (Species Code) / Post-fire flowering / Establishment  


Now select one record:

In [8]:
row_index=157

print("%s (%s)  ~ %s  / %s " %
(species_data[sp_col][row_index].value,
 species_data[spcode_col][row_index].value,
 species_data[target_cols['repr2']][row_index].value,
species_data[target_cols['rect2']][row_index].value))


Acianthus caudatus (4351)  ~ flowers well after fire  / None 


#### Dealing with hyperlinks

This cell has a hyperlink:

In [9]:
type(species_data[target_cols['repr2']][row_index].hyperlink)

openpyxl.worksheet.hyperlink.Hyperlink

If the cell is a hyperlink it will have a value to "display" and will point to a "location" within the workbook: 

In [10]:
species_data[target_cols['repr2']][row_index].hyperlink.display

'References!C22'

In [11]:
# This will fail if there is no hyperlink 
print(species_data[target_cols['repr2']][row_index].hyperlink.location)

References!C22


Let's see the value of this reference:

In [12]:
hlink = species_data[target_cols['repr2']][row_index].hyperlink.location
hlink = hlink.split("!")

This gives the name of the target sheet and the corresponding cell. We need to read the cell to its right side (add one to the column number) to get the information we need.

In [13]:
ref = wb[hlink[0]]
print("Cell value is :: " + str(ref[hlink[1]].value))
nlink = ref.cell(row=ref[hlink[1]].row,column=ref[hlink[1]].col_idx + 1)

print("Reference data is :: " + nlink.value) 


Cell value is :: 21
Reference data is :: Bishop T. (1996) Field Guide to the Orchids of NSW and Victoria
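
The address handling above can also be sketched without the workbook. The following is a minimal illustration, assuming the location always has the form `Sheet!A1` with a single `!` and a single-cell address; the helper names `split_location` and `column_index` are hypothetical and not part of _openpyxl_.

```python
import re

def split_location(location):
    """Split 'Sheet!C22' into (sheet name, column letters, row number)."""
    sheet, address = location.split("!")
    match = re.fullmatch(r"\$?([A-Z]+)\$?(\d+)", address)
    if match is None:
        raise ValueError("not a single-cell address: %r" % address)
    return sheet, match.group(1), int(match.group(2))

def column_index(letters):
    """Convert column letters to a 1-based index (A=1, Z=26, AA=27)."""
    index = 0
    for char in letters:
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index

sheet, letters, row = split_location("References!C22")
print(sheet, letters, row)        # References C 22
print(column_index(letters) + 1)  # 4, i.e. column D, the cell to the right
```

Adding one to the column index is what moves from the reference code (column C) to the reference description (column D).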


If the cell has no hyperlink, the property is _None_ (type NoneType):

In [14]:
type(species_data[target_cols['repr2']][row_index-1].hyperlink)

NoneType
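
Because reading _location_ from a cell without a hyperlink fails, it can help to guard the access. This is a small sketch, assuming only that a cell exposes a _hyperlink_ attribute that is either _None_ or an object with a _location_; the helper name `hyperlink_location` and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace

def hyperlink_location(cell):
    """Return the hyperlink target of a cell, or None if it has no hyperlink."""
    link = getattr(cell, "hyperlink", None)
    return None if link is None else link.location

# Stand-in objects (hypothetical) that mimic the two cases seen above
linked = SimpleNamespace(hyperlink=SimpleNamespace(location="References!C22"))
plain = SimpleNamespace(hyperlink=None)

print(hyperlink_location(linked))  # References!C22
print(hyperlink_location(plain))   # None
```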

### Create list(s) of references 
We need to prepare a list of references from the spreadsheet 'References'.

There are three sets of references:
- the  "normal" references in columns C and D (pink)
- the  "Recovery Plan / Regional Forest Agreement Report" references in columns N, O, and P (blue)
- the  "NFRR" references in columns S and T (lilac)

Normal and NFRR references are identified by a short numeric or letter code (for example 10 or BF) and a reference description. We will use a function to create a more descriptive reference code based on the list of authors and the date.

For Recovery plans and Regional Forest Agreement Reports, we will use the species or region as reference code.


In [15]:

name_pattern = re.compile("[A-Z][a-z]+")
def create_ref_code(x):
    # Build a short code from the capitalised author names plus a year or status
    if "personal communication" in x:
        y = x[0:x.find(" personal")].replace(",","")
        year = "pers. comm."
    elif "unpublished" in x:
        y = x[0:x.find("unpublished")].replace(",","")
        year = "unpub."
    else:
        # keep the text before the first closing parenthesis, e.g. "Author, A. (1998"
        y = x[0:x.find(")")].replace(",","")
        year = ''.join(re.findall(r"\d+", y))
    z = list(filter(name_pattern.match, y.split()))
    author = ' '.join(z)
    final_code =  "%s %s" % (author, year)
    # truncate to at most 50 characters
    return final_code[0:50]

def create_ref_code_RP(x):
    # NOTE: "^RFA" is tested as a literal substring, not as a regular expression,
    # so codes that start with "RFA" also receive the "RP " prefix
    if "^RFA" in x:
        final_code = x
    else:
        final_code = "RP %s" % x
    # truncate to at most 50 characters
    return final_code[0:50]


val=references['O'][26].value.replace("(1) ","")
print(val)
create_ref_code_RP(val)

Asterolasia elegans


'RP Asterolasia elegans'

Now we check the NFRR references. Notice that we substitute the digit _1_ with a capital _I_ in _refcode_ to avoid problems with one reference (see below):

In [16]:
NFRR_refs=list()
for row in range(1,66):
    cite_text = references['T'][row].value.replace("(1) ","")
    cite_code = create_ref_code(cite_text) 
    record={"refcode": references['S'][row].value.replace("1","I"),
            "refstring": cite_code,#re.sub(r", [A-Z\.]+"," ",cite_code),
            "refinfo": cite_text
    }
    NFRR_refs.append(record)

In [17]:
NFRR_refs[56]

{'refcode': 'SA',
 'refstring': 'Carolyn Sandercoe Qld. unpub.',
 'refinfo': 'Carolyn Sandercoe, Qld. (unpublished)'}

In [18]:
NFRR_refs[6]["refcode"]

'BF'

In [19]:
qry="FOI"
for elem in filter(lambda x: x['refcode'] == qry, NFRR_refs):
    print("NFRR reference %s refers to '%s'" % (qry, elem['refinfo']))

NFRR reference FOI refers to 'Fox, J.E.D. (1985). Fire in Mulga: Studies at the margins. In: Fire ecology and management of Western Australian ecosystems. (ed: J.R. Ford). Western Australian Institute of Technology, report no. 14.'


We do the same for the "normal" references column:

In [20]:
other_refs=list()
for row in range(1,139):
    cite_text = references['D'][row].value
    cite_code = create_ref_code(cite_text) 
    if cite_code == "Benson 1985":
        cite_code = "Benson 1985b"
    record={"refcode": references['C'][row].value,
            "refstring": cite_code,
            "refinfo": cite_text
    }
    other_refs.append(record)

In [21]:
other_refs[9]

{'refcode': 10,
 'refstring': 'Wark White Robertson Marriott 1987',
 'refinfo': 'Wark, M.C., White, M.D., Robertson, D.J. and Marriott, P.F. (1987). Regeneration of heath and heath woodland in the north-eastern Otway Ranges following the wildfire of February 1983. Proc.Roy.Soc.Vic. 99, 51-88.'}

Now the recovery plan references:

In [22]:
rp_refs=list()
for row in range(1,46):
    cite_code = create_ref_code_RP(references['O'][row].value) 
    cite_text = "%s. %s" % (cite_code, references['P'][row].value)
    record={"refcode": references['N'][row].value,
            "refstring": cite_code,
            "refinfo": cite_text
    }
    rp_refs.append(record)

Check if there are duplicated references:

In [23]:
l1 = [ref["refstring"] for ref in NFRR_refs]
l2 = [ref["refstring"] for ref in other_refs]

for i in l1:
    if i in l2:
        print(i)


Benwell 1998
Molnar Fletcher Parsons 1989
Wark White Robertson Marriott 1987
Wark 1997
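
The same check can be written as a set intersection. This is a sketch only: the two short lists below are made-up subsets that imitate the structure of _NFRR_refs_ and _other_refs_, and the helper name `shared_refstrings` is hypothetical.

```python
# Toy stand-ins (assumed data) with the same dictionary structure as above
nfrr = [{"refstring": "Benwell 1998"}, {"refstring": "Wark 1997"}, {"refstring": "Fox 1985"}]
other = [{"refstring": "Wark 1997"}, {"refstring": "Benwell 1998"}, {"refstring": "Bishop 1996"}]

def shared_refstrings(first, second):
    """Return the sorted reference strings that occur in both lists."""
    return sorted({r["refstring"] for r in first} & {r["refstring"] for r in second})

print(shared_refstrings(nfrr, other))  # ['Benwell 1998', 'Wark 1997']
```

A set intersection reports each shared string once, whereas the loop above prints a string every time it appears in the first list.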


In [24]:
qry="Benwell 1998"
for elem in filter(lambda x: x['refstring'] == qry, NFRR_refs):
    print("Reference %s refers to '%s'" % (qry, elem['refinfo']))
for elem in filter(lambda x: x['refstring'] == qry, other_refs):
    print("Reference %s refers to '%s'" % (qry, elem['refinfo']))
    

Reference Benwell 1998 refers to 'Benwell A.S. (1998). Post-fire seedling recruitment in coastal heathland in relation to regeneration strategy and habitat. Aust. J. Bot. 46, 75-101.'
Reference Benwell 1998 refers to 'Benwell, A.S. (1998) Post-fire seedling recruitment in coastal heathland in relation to regeneration strategy and habitat. Aust. J. Bot. 46:75-101.  Data compiled by D.Keith (Keith, D.A., McCaw, W.L. & Whelan, R.J. (2002) pp. 199-237 in "Flammable Australia: The fire regimes and biodiversity of a continent" Ed. R.A. Bradstock, J.E. Williams & M.A. Gill. Cambridge University Press, Cambridge)'


### Matching references from hyperlinks
We will create a function to translate hyperlinks to a reference:

In [25]:
def extract_link(target):
    p=re.compile(r'[,;\s]+')
    assert (target.hyperlink is not None),"Only works when cell has a hyperlink!"
    hlink = target.hyperlink.location
    hlink = hlink.split("!")
    if (hlink[0] != "References"): #"Expecting hyperlink to 'References' sheet"
        return None
    else:
        column=hlink[1][0:1]
        cell=hlink[1]
        refcodes=references[hlink[1]].value
        refinfo=list()
        if refcodes is not None:
            if isinstance(refcodes,int):
                for elem in filter(lambda x: x['refcode'] == refcodes, other_refs):
                    refinfo.append(elem['refstring'])
            else:
                for refcode in p.split(refcodes):
                    refcode=refcode.strip(" ")
                    refcode=re.sub("[abc]$","",refcode)
                    if refcode.isnumeric():
                        for elem in filter(lambda x: x['refcode'] == int(refcode), other_refs):
                            refinfo.append(elem['refstring'])
                    else:
                        for elem in filter(lambda x: x['refcode'] == refcode, rp_refs):
                            refinfo.append(elem['refstring'])
                        for elem in filter(lambda x: x['refcode'] == refcode, NFRR_refs):
                            refinfo.append(elem['refstring'])
            return (refcodes,refinfo)
        else:
            return None

            

We can test this function for several rows:

In [26]:
for row_index in (157,162,233):
    spname=species_data[sp_col][row_index].value
    pjp=species_data[target_cols['repr2']][row_index]
 
    raw=pjp.value
    if (pjp.hyperlink is not None):
        ref=extract_link(pjp)
        if ref is not None:
            print("%s :: [%s] // %s" % (row_index,raw,ref[1]))
        else:
            print("%s :: [%s] " % (row_index,raw))            
    else:
        print("%s :: [%s] " % (row_index,raw))

157 :: [flowers well after fire] // ['Bishop 1996']
162 :: [flowering 1 year post-fire] // ['Knox Clarke 2004']
233 :: [facultative] // ['Keith David pers. comm.']


### Colored and modified fonts

Some records include additional information coded in the font color or strikethrough of values. With Python we can query the font color and strikethrough properties of a cell to verify whether information has been annotated, but not with enough detail to distinguish which part of the value is annotated and which is not. For example:

In [27]:
for row in [22,23,66,67,70,72]:
    cell = species_data['BN'][row]
    if cell.font.color is None:
        print("Cell %s has no colored font" % (row+1))
    else:
        print("Cell %s has colored font" % (row+1))
        print(cell.font.color.indexed)
    if cell.font.strike is not None:
        print("Cell %s has strikethrough" % (row+1))

Cell 23 has colored font
60
Cell 24 has no colored font
Cell 67 has colored font
60
Cell 68 has no colored font
Cell 71 has no colored font
Cell 73 has colored font
60
Cell 73 has strikethrough


### Processing strings with and without references
Cell values in the target columns might include values in mixed formats, sometimes numbers and sometimes text. Sometimes different observations are recorded for each species using delimiters and citing references in the text, e.g.: 
> value1 (ref a) / value2 (ref b)
 
In such cases we want to split the values into different records, keep the values as 'raw value' and document the references cited. If the value in the cell matches our predefined values (e.g. Exclusive, Facultative, Negligible for post-fire flowering), we will fill a 'norm_value' with the corresponding category. If no match is found, we will leave it empty for later processing.

In exceptional cases a reference is given in the text: "(12)" refers to reference 12.
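
The splitting described above can be sketched in isolation. This is a simplified illustration, not the function used later in this notebook: it assumes observations are separated by `/` and that reference codes appear in parentheses, and the names `REF_GROUP`, `SEPARATORS` and `split_observations` are hypothetical. The second half of the sample string is invented for the example.

```python
import re

REF_GROUP = re.compile(r"\(([\w, ]+)\)")   # text inside parentheses
SEPARATORS = re.compile(r"[,;\s]+")        # delimiters between reference codes

def split_observations(cell_text):
    """Split 'value1 (ref a) / value2 (ref b)' into (value, [codes]) pairs."""
    results = []
    for part in cell_text.split("/"):
        part = part.strip()
        if not part:
            continue
        codes = []
        for group in REF_GROUP.findall(part):
            codes.extend(code for code in SEPARATORS.split(group) if code)
        value = REF_GROUP.sub("", part).strip()
        results.append((value, codes))
    return results

print(split_observations("I (R35) / T (12, 14a)"))
# [('I', ['R35']), ('T', ['12', '14a'])]
```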

We will define a _switcher_ dictionary to transform raw values into normalised values:

In [28]:
switcher={
    "repr2":{
        "facultative": "Facultative",
        "yes": "Facultative",
        "yes?": "Facultative",
        "most profuse after fire": "Facultative",
        "exclusive": "Exclusive",
        "exclusive?": "Exclusive",
        "negligible": "Negligible"
    },
    "rect2":{
        "I":"Intolerant",
        "T":"Tolerant",
        "R":"Requiring",
        "T R":"Tolerant-Requiring",
        "I T":"Intolerant-Tolerant",
        "T I":"Intolerant-Tolerant"
    },
    "germ1":{
        'canopy': 'Canopy',
        
        'persistent soil': 'Soil-persistent', 
        'persistent': 'Soil-persistent', 
        'peristent': 'Soil-persistent', 
        'soil': 'Soil-persistent', 
        
        'transient': 'Transient', 
        'none':'Transient', 
        'shed at maturity': 'Transient', 
        'viviparous':'Transient', 
        'canopy / released at maturity':'Transient', 
        'canopy / regularly without fire':'Transient', 
        'canopy - transient':'Transient', 
        
        'serotinous canopy': 'Canopy',
        'non-canopy': 'Non-canopy',
        'not canopy': 'Non-canopy',
        
        'other': 'Other'
    },
     "surv4":{
        'epicormic': 'Epicormic', 
        'stem buds': 'Epicormic', 
        'apical': 'Apical', 
        'lignotuber': 'Lignotuber',
        'root stock': 'Lignotuber',
        'rootstock': 'Lignotuber',
        'basal': 'Basal',
        'basal buds': 'Basal',
        'coppice': 'Basal',
        'tuber': 'Tuber',
        'taproot': 'Tuber',
        'tap root': 'Tuber',
        'tussock': 'Tussock',
        # 'rhizome' was listed twice in this dictionary; Python keeps the last entry ('Short rhizome', further down)
        'rootucker': 'Long rhizome or root sucker',
        'rootuckers': 'Long rhizome or root sucker',
        'rootsuckers': 'Long rhizome or root sucker',
        'root buds': 'Long rhizome or root sucker',
        'root sucker': 'Long rhizome or root sucker',
        'root suckers': 'Long rhizome or root sucker',
        'rhizome': 'Short rhizome',
        'stolon': 'Stolon',
        'stolons': 'Stolon'
    }
}
isinstance(switcher["germ1"],dict)

True
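
A lookup in such a dictionary either returns the standard category or _None_. The sketch below uses a three-entry excerpt of the post-fire flowering mapping; the helper name `normalise` is hypothetical.

```python
# Excerpt of the post-fire flowering mapping shown above
post_fire_flowering = {
    "facultative": "Facultative",
    "exclusive": "Exclusive",
    "negligible": "Negligible",
}

def normalise(raw, mapping):
    """Return the standard category for a raw value, or None if it is unknown."""
    return mapping.get(raw.strip())

print(normalise("facultative", post_fire_flowering))              # Facultative
print(normalise("flowers well after fire", post_fire_flowering))  # None
```

The keys are case-sensitive, so a raw value such as "Facultative" would not match unless it is lower-cased first or added as its own key.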

And we will define a function to extract values from a target cell:

In [29]:
def extract_value(target, switcher, varname, splitstring="&|;|,| or | and "):
    assert (target.value is not None),"Only works with non-empty cells"
    assert isinstance(switcher,dict),"Switcher argument must be a dictionary"
    assert isinstance(varname,str),"Variable name argument must be a string"
    p=re.compile(r'[,;\s]+')
    val = target.value
    rslts = list()
    note = list()
    if target.font.color is not None:
        note.append('Cell color index %s' % target.font.color.indexed)
    if target.font.strike is not None:
        note.append('Cell text has strikethrough')
    if isinstance(val,int) or isinstance(val,float):
        record={"raw_value":[varname,str(val)]}
        if len(note)>0:
            record["original_notes"]=note                
        rslts.append(record)
    else:
        for w in val.split('/'):
            transvalue=None
            oref=list()
            method=None
            w=w.strip(" ")
            start=0
            end=len(w)
            if w.find("(")>0:
                for refs in re.findall(r"\(([\w\d, ]+)\)",w):
                    for ref in p.split(refs):
                        ref=ref.strip(" ")
                        ref=re.sub("[abc]$","",ref)
                        if ref.isnumeric():
                            for elem in filter(lambda x: x['refcode'] == int(ref), other_refs):
                                oref.append(elem['refstring'])
                        else:
                            for elem in filter(lambda x: x['refcode'] == ref, rp_refs):
                                oref.append(elem['refstring'])
                            for elem in filter(lambda x: x['refcode'] == ref, NFRR_refs):
                                oref.append(elem['refstring'])
                end=w.index("(")
            if w.find("a-")==0:
                method='Inferred from plant morphology'
                start=2
            if w.find("?")>0:
                note.append("uncertain")
                
                
            sw=w[start:end].strip(" ").replace("?","")
            
            for sv in re.split(splitstring,sw):
                newnote=copy.deepcopy(note)
                sv=sv.strip(" ")
                transvalue=switcher.get(sv, None)
                record={"raw_value":[varname,w],"main_source":"NSWFFRDv2.1"}
                if sw != w:
                    record["raw_value"].extend(['->',sw])
                    newnote.append("original record split into multiple entries, prob. different sources")
                if sv != sw:
                    record["raw_value"].extend(['->',sv])
                    newnote.append("original record split into multiple entries separated by and/or")
                if transvalue is not None:
                    record["norm_value"]=transvalue
                if method is not None:
                    newnote.append(method)
                    #record["method_of_estimation"]=method
                if len(oref)>0:
                    record["original_sources"]=oref                
                if len(newnote)>0:
                    record["original_notes"]=newnote                
                rslts.append(record)
    return(rslts)

In [30]:
target_col=target_cols["repr2"]

varname=species_data[target_col][1].value

for row_index in (157,162,233):
    pjp=species_data[target_col][row_index]
    if (pjp.hyperlink is not None):
        ref=extract_link(pjp)
    else:
        ref=None
    if (pjp.value is not None):
        spname=species_data[sp_col][row_index].value
        spcode=species_data[spcode_col][row_index].value
        rec=extract_value(pjp,switcher["repr2"],varname)
        for record in rec:
            record["species"]=spname
            record["species_code"]=spcode
            if 'original_sources' not in record and ref is not None:
                record['original_sources'] = ref[1]
            print("%s ::  %s" % (row_index,record))
           
    else:
        print("%s is empty " % (row_index))

157 ::  {'raw_value': ['Post-fire flowering', 'flowers well after fire'], 'main_source': 'NSWFFRDv2.1', 'original_notes': ['Cell color index 12'], 'species': 'Acianthus caudatus', 'species_code': '4351', 'original_sources': ['Bishop 1996']}
162 ::  {'raw_value': ['Post-fire flowering', 'flowering 1 year post-fire'], 'main_source': 'NSWFFRDv2.1', 'species': 'Aciphylla simplicifolia', 'species_code': '1091', 'original_sources': ['Knox Clarke 2004']}
233 ::  {'raw_value': ['Post-fire flowering', 'facultative'], 'main_source': 'NSWFFRDv2.1', 'norm_value': 'Facultative', 'original_notes': ['Cell color index 12'], 'species': 'Amperea xiphoclada var. xiphoclada', 'species_code': '9713', 'original_sources': ['Keith David pers. comm.']}


We can wrap this in one single function call:

In [31]:
def create_record(spreadsheet,target_col,row_index,switcher,**kwarg):
    # Returns a list of records for one cell, or None if the cell is empty
    target=spreadsheet[target_col][row_index]
    varname=spreadsheet[target_col][1].value
    if target.value is None:
        return None
    ref=extract_link(target) if target.hyperlink is not None else None
    # read species name and code from the same sheet that was passed in
    spname=spreadsheet[sp_col][row_index].value
    spcode=spreadsheet[spcode_col][row_index].value
    records=list()
    for record in extract_value(target,switcher,varname,**kwarg):
        record["species"]=spname
        record["species_code"]=spcode
        if 'original_sources' not in record and ref is not None:
            record['original_sources'] = ref[1]
        records.append(record)
    return records


Now we will get one or many records per cell with a simple function call:

In [32]:
target_col=target_cols["rect2"]
for row_index in (36,122,167):
    rr = create_record(species_data,target_col,row_index,switcher["rect2"])
    print(rr)

[{'raw_value': ['Establishment', 'I (R35)', '->', 'I'], 'main_source': 'NSWFFRDv2.1', 'norm_value': 'Intolerant', 'original_sources': ['RP RFA NSW - Eden'], 'original_notes': ['original record split into multiple entries, prob. different sources'], 'species': 'Acacia constablei', 'species_code': '3747'}, {'raw_value': ['Establishment', 'even aged stands indicate post fire recruitment; though some recruitment in absence of fire (R15)', '->', 'even aged stands indicate post fire recruitment; though some recruitment in absence of fire', '->', 'even aged stands indicate post fire recruitment'], 'main_source': 'NSWFFRDv2.1', 'original_sources': ['RP Threatened Flora of Rocky Outcrops in South Eas'], 'original_notes': ['original record split into multiple entries, prob. different sources', 'original record split into multiple entries separated by and/or'], 'species': 'Acacia constablei', 'species_code': '3747'}, {'raw_value': ['Establishment', 'even aged stands indicate post fire recruitment

## Format records for input in database

Using the code above it is possible to take each species (row) from the spreadsheet and add records for the trait tables in the database. 

First we need to connect to the database from python.

### Connect to database from Python

We use the library _psycopg2_ to connect to the database. We first read the database credentials from a file with restricted read access:

In [33]:
from configparser import ConfigParser
import psycopg2
from psycopg2.extensions import AsIs

filename = repodir / 'secrets' / 'database.ini'
section = 'aws-lght-sl'

parser = ConfigParser()
parser.read(filename)

dbparams = {}
if parser.has_section(section):
    params = parser.items(section)
    for param in params:
        dbparams[param[0]] = param[1]
else:
    raise Exception('Section {0} not found in the {1} file'.format(section, filename))
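
For orientation, the credentials file is expected to contain a section with the connection parameters. The sketch below parses an in-memory example; only the section name comes from this notebook, while the keys and values are assumptions chosen to resemble common PostgreSQL connection parameters.

```python
from configparser import ConfigParser

# Hypothetical file content: keys and values are placeholders, not real credentials
example_ini = """
[aws-lght-sl]
host = db.example.org
port = 5432
dbname = exampledb
user = example_user
password = not-a-real-password
"""

example_parser = ConfigParser()
example_parser.read_string(example_ini)

# items() yields (key, value) pairs, so dict() gives the same result as the loop above
example_params = dict(example_parser.items("aws-lght-sl"))
print(sorted(example_params))  # ['dbname', 'host', 'password', 'port', 'user']
```

All values are read as strings, including the port number.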

Typically we will connect to the database, run a query and then disconnect.
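
A minimal sketch of this connect, query, disconnect pattern is shown here. The helper name `run_query` is hypothetical; it takes the connect function as an argument so that it works with any DB-API style driver and always closes the cursor and the connection, even if the query fails.

```python
from contextlib import closing

def run_query(connect, params, query, args=None):
    """Connect, run one query, fetch all rows, and always disconnect."""
    with closing(connect(**params)) as conn:
        with closing(conn.cursor()) as cur:
            if args is None:
                cur.execute(query)
            else:
                cur.execute(query, args)
            rows = cur.fetchall()
        conn.commit()
    return rows

# With the objects defined in this notebook it could be called as (not run here):
# rows = run_query(psycopg2.connect, dbparams, "SELECT 1")
```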

### Add list of references

We have already added all references in previous imports.

### Inserting records from NSWFFRDv2.1

We will create one record per species, using "NSWFFRDv2.1" as _main reference_, adding the reported references in the _original sources_ column.

We will use the functions declared above to read row values and hyperlinks to create one or multiple records from each entry.

In [34]:
x=create_record(species_data,target_cols["rect2"],4,switcher["rect2"])
x is None
#x

True

Now we will read through the spreadsheet and prepare the records:

In [35]:
row_min = 2
row_max = species_data.max_row

print('Connecting to the PostgreSQL database...')
conn = psycopg2.connect(**dbparams)
cur = conn.cursor()
affected_rows=0

target_cols={'germ1':'M', 'repr2':'X', 'rect2':'W', 'surv4':'L'}

ready=('rect2', 'germ1', 'repr2','surv4')

for trait in target_cols.keys():
    if trait in ready: 
        print ("skip trait %s" % trait)
        continue

    if trait in ('surv4','germ1'):
        mysplitstring="&|;|,| or | and "
    else:
        mysplitstring="DO NOT SPLIT SENTENCE"
    
    insert_statement = 'insert into litrev.%s (%%s) values %%s ON CONFLICT DO NOTHING' % trait
    records=list()
    for row in range(row_min,row_max):
        rr = create_record(species_data,target_cols[trait],row,switcher[trait],splitstring=mysplitstring)
        if rr is not None :
            records.extend(rr)
        if (((row-row_min) % 250) == 0 and len(records)>10) or (row==(row_max-1)):
            print("total of %s records prepared" % len(records)) 
            for record in records: 
                cur.execute(insert_statement, (AsIs(','.join(record.keys())), tuple(record.values())))
                affected_rows = affected_rows+cur.rowcount
            records.clear()
            conn.commit()
            print("total number of lines updated: %s" % affected_rows)

cur.close()
if conn is not None:
    conn.close()
    print('Database connection closed.')     


Connecting to the PostgreSQL database...
total of 193 records prepared
total number of lines updated: 193
total of 132 records prepared
total number of lines updated: 325
total of 104 records prepared
total number of lines updated: 429
total of 134 records prepared
total number of lines updated: 563
total of 165 records prepared
total number of lines updated: 728
total of 132 records prepared
total number of lines updated: 860
total of 132 records prepared
total number of lines updated: 992
total of 141 records prepared
total number of lines updated: 1133
total of 124 records prepared
total number of lines updated: 1257
total of 108 records prepared
total number of lines updated: 1365
total of 109 records prepared
total number of lines updated: 1474
total of 118 records prepared
total number of lines updated: 1592
total of 43 records prepared
total number of lines updated: 1635
skip trait repr2
skip trait rect2
skip trait surv4
Database connection closed.


This is somewhat slow, but it works, and all the records are in the database.
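
To make the insert step easier to follow, the sketch below shows how the statement template and its two arguments are built, without touching the database. The record is a shortened, illustrative example using values that appear earlier in this notebook.

```python
trait = "germ1"

# First formatting step: fill in the table name and turn %%s into %s placeholders
insert_statement = 'insert into litrev.%s (%%s) values %%s ON CONFLICT DO NOTHING' % trait
print(insert_statement)
# insert into litrev.germ1 (%s) values %s ON CONFLICT DO NOTHING

# Second step happens in cur.execute(): column names and values come from the record
record = {"species": "Acianthus caudatus", "species_code": "4351", "main_source": "NSWFFRDv2.1"}
columns = ",".join(record.keys())
values = tuple(record.values())
print(columns)  # species,species_code,main_source
print(values)   # ('Acianthus caudatus', '4351', 'NSWFFRDv2.1')
```

In the loop above, the column string is wrapped in _AsIs_ so that _psycopg2_ inserts it unquoted, and the tuple of values is adapted to a parenthesised SQL list. Records can carry different sets of keys, which is presumably why each one is inserted with its own statement.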

In [36]:
record.values()

dict_values([['Seed storage', 'persistent soil'], 'NSWFFRDv2.1', 'Soil-persistent', 'Zornia dyctiocarpa var. dyctiocarpa', '8691'])