# Read files for the Mallee Woodlands

This Excel workbook was prepared by Prof. David Keith, FAA, and imported in May 2023.

We need to adapt functions defined in the `fireveg` and `firevegdb` modules to:

- Read data from spreadsheets with field-work data
- Create records for data import into the database
- Insert or update records in the database

This dataset covers several sites (S2007/1, T2001/1, etc.). Each site has several subplots with different treatments (A, K, N, R, G, X1, X2, X3), and each site/subplot combination has replicates.

This Jupyter notebook runs through each step of the data import, starting with field site and visit information, followed by fire history records and plant count data.
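
The labels in the workbook combine site, subplot and replicate in one string (for example `S2007/1_X1_3`). A minimal sketch of how such a label could be split into its parts with a regular expression follows. The `<site>_<subplot>_<replicate>` layout is inferred from the sample labels in this notebook, and `split_label` is a hypothetical helper, not part of `fireveg`.

```python
import re

# Assumed layout: <site>_<subplot>_<replicate>, e.g. 'S2007/1_X1_3'.
# Subplot and replicate are both optional in this sketch.
LABEL_PATTERN = re.compile(
    r"^(?P<site>[^_]+)"
    r"(?:_(?P<subplot>[A-Z]+[0-9]*))?"
    r"(?:_(?P<replicate>[0-9]+))?$"
)


def split_label(label):
    """Split a label into site, subplot and replicate; None if it does not fit."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        return None
    parts = match.groupdict()
    if parts["replicate"] is not None:
        parts["replicate"] = int(parts["replicate"])
    return parts


print(split_label("S2007/1_X1_3"))
print(split_label("T2017/4_NX"))
```

Labels that do not fit the assumed layout (for instance one ending in a bare underscore) return `None`, so they can be collected and reviewed by hand instead of being silently truncated.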

## Set-up
Load libraries 

In [30]:
import openpyxl
from pathlib import Path
import os
from datetime import datetime
from configparser import ConfigParser
import psycopg2
from psycopg2.extensions import AsIs
import pyprojroot
import re

Load functions from the `lib` folder. We will use one function to read the database credentials and another for batch inserts and updates:

In [31]:
from lib.parseparams import read_dbparams
from lib.firevegdb import batch_upsert
from lib.firevegdb import validate_and_update_site_records

import lib.fireveg as fv

Define path to workbooks

In [32]:
repodir = pyprojroot.find_root(pyprojroot.has_dir(".git"))
inputdir = repodir / "data" / "input-field-form"

Database credentials are stored in a database.ini file

In [33]:
dbparams = read_dbparams(repodir / 'secrets' / 'database.ini', section='aws-lght-sl')

## List of workbooks/spreadsheets in directory

Each spreadsheet has a slightly different structure, so these scripts have to be adapted for each case.

We use functions from module `fireveg` to read the data and create records, and functions from module `firevegdb` to execute the SQL insert or update query.


In [34]:
os.listdir(inputdir)

['UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton_revisedfields_Mar2022.xlsx',
 'UNSW_VegFireResponse_KNP AlpAsh_firehistupdate.xlsx',
 'SthnNSWRF_data_bionet2.xlsx',
 'UNSWFireVegResponse_UplandBasalt_AlexThomsen+DK.xlsx',
 'PlantFireTraitData_2011-2018_Import.xlsx',
 'UNSW_VegFireResponse_RMK_reformat_Sep2021a.xlsx',
 'UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton.xlsx',
 'UNSW_VegFireResponse_KNP AlpAsh.xlsx',
 'UNSW_VegFireResponse_AlpineBogs_reformat_Sep2021.xlsx',
 'RobertsonRF_data_bionet2.xlsx',
 'Fire response quadrat survey Newnes Nov2020_DK_revised IDs+AllNovData.xlsm']

In [35]:
filename =  'PlantFireTraitData_2011-2018_Import.xlsx'
valid_files = [filename]


Here we create an index of worksheets and column headers for each file

In [36]:
wbindex=dict()
for workbook_name in valid_files:
    inputfile=inputdir / workbook_name
    # using data_only=True to get the calculated cell values
    wb = openpyxl.load_workbook(inputfile,data_only=True)
    wbindex[workbook_name]=dict()
    for ws in wb.worksheets:
        # ws.title is the public name of the worksheet
        headers=list()
        # note: range(1, ws.max_column) stops one short of the last column
        for k in range(1,ws.max_column):
            headers.append(ws.cell(row=1,column=k).value)
        wbindex[workbook_name][ws.title]=headers
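
The index pairs each worksheet title with the values of its header row. Because `range(1, n)` excludes `n`, a loop bounded by the column count leaves out the last column. The sketch below shows the same bookkeeping on plain lists with an inclusive range, and uses `enumerate` to print the zero-based offsets used in the column dictionaries. `header_rows` is made-up sample data and `index_headers` is a hypothetical helper; nothing here reads the real workbook.

```python
# Made-up header rows standing in for worksheet contents.
header_rows = {
    "SiteData": ["Site", "Subplot", "Replicate"],
    "FireEvents": ["Site", "Replicate", "Date of last fire dd/mm/yyyy"],
}


def index_headers(rows_by_sheet):
    """Map each sheet title to the full list of its header values."""
    index = {}
    for title, header in rows_by_sheet.items():
        max_column = len(header)
        # Columns are 1-based, as in a spreadsheet; the + 1 keeps the last one.
        index[title] = [header[k - 1] for k in range(1, max_column + 1)]
    return index


wbindex_sketch = index_headers(header_rows)

# enumerate yields the zero-based offset together with the header name.
for offset, name in enumerate(wbindex_sketch["SiteData"]):
    print("%s :: %s" % (offset, name))
```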
        

## Database queries

Database connection

In [37]:
# connect to the PostgreSQL server
print('Connecting to the PostgreSQL database...')
conn = psycopg2.connect(**dbparams)
cur = conn.cursor()

Connecting to the PostgreSQL database...


### Create new survey

In [38]:
updated_rows = 0
qry = "INSERT INTO form.surveys(survey_name) values ('Mallee Woodlands') ON CONFLICT DO NOTHING;"
cur.execute(qry)
if cur.rowcount > 0:
    updated_rows = cur.rowcount
else:
    print(qry)
conn.commit() 
print("%s rows updated" % (updated_rows))



INSERT INTO form.surveys(survey_name) values ('Mallee Woodlands') ON CONFLICT DO NOTHING;
0 rows updated


In [39]:
cur.execute("SELECT * FROM form.surveys;")
surveys = cur.fetchall()

In [40]:
surveys

[('TO BE CLASSIFIED',
  'Placeholder for field visits not yet assigned to a survey',
  'JR Ferrer-Paris'),
 ('NEWNES', 'NEWNES', None),
 ('KNP AlpAsh', 'Alpine Ash', None),
 ('UplandBasalt', 'Upland Basalt', None),
 ('Alpine Bogs', None, None),
 ('Robertson RF', None, None),
 ('Yatteyattah', None, None),
 ('SthnNSWRF', None, None),
 ('Rainforests NSW-Qld', 'Rainforests NE NSW & SE Qld', None),
 ('Mallee Woodlands', None, None)]

### Valid vocabularies

In [41]:
cur.execute("SELECT enumlabel FROM pg_enum e LEFT JOIN pg_type t ON e.enumtypid=t.oid where typname='resprout_organ_vocabulary';")
valid_organ_list = cur.fetchall()
organ_vocab = [item for t in valid_organ_list for item in t]

cur.execute("SELECT enumlabel FROM pg_enum e LEFT JOIN pg_type t ON e.enumtypid=t.oid where typname='seedbank_vocabulary';")
valid_seedbank_list = cur.fetchall()
seedbank_vocab = [item for t in valid_seedbank_list for item in t]

### Close DB connection

In [42]:
cur.close()
if conn is not None:
    conn.close()
    print('Database connection closed.')

Database connection closed.


## Import data from each worksheet

In the following section, I iterate through the worksheets in the workbook, using functions defined in the `fireveg` and `firevegdb` modules.

Here is the list of available worksheets:

In [43]:
wbindex[filename].keys()

dict_keys(['SiteData', 'FireEvents', 'PlantCounts'])

If we select one worksheet, we can retrieve the list of column names that we will use in the column definitions for each function:

In [44]:
cols=wbindex[filename]['SiteData']
for k in range(1,len(cols)):
    print("%s :: %s" % (k-1,cols[k-1]))
               

0 :: Site_subplot_census
1 :: Site
2 :: Subplot
3 :: Replicate
4 :: Observers (comma sep if >1)
5 :: Date of samping
6 :: Survey Date Replicate 1
7 :: Survey Date Replicate 2
8 :: Survey Date Replicate 3
9 :: Survey Date Replicate 4
10 :: Survey Date Replicate 5
11 :: Survey Date Replicate 6
12 :: Location text
13 :: Zone
14 :: Easting
15 :: Northing
16 :: GPS Precision (m)
17 :: Latitude
18 :: Longitude
19 :: Layout & GPS marker position
20 :: 2nd ref point Zone
21 :: 2nd ref point Easting
22 :: 2nd ref point Northing
23 :: 2nd ref point Position of GPS
24 :: 3rd ref point Zone
25 :: 3rd ref point Easting
26 :: 3rd ref point Northing
27 :: 3rd ref point Position of GPS
28 :: 4th ref point Zone
29 :: 4th ref point Easting
30 :: 4th ref point Northing
31 :: 4th ref point Position of GPS
32 :: Total sample area (sq.m)
33 :: Subquadrat area (sq.m)
34 :: # subquadrats
35 :: Substrate
36 :: Notes
37 :: Slope
38 :: Aspect
39 :: Elevation
40 :: Disturbance notes
41 :: Cwth TEC
42 :: NSW TEC
4

### Import site visits records into database

- 56 sites/visits in the period 2011 to 2018
- But we need to fix the site label to exclude the replicate number

In [48]:
cdict = {'site_label':1,'location_description':12, 'utm_zone':13,'xs':(14,), 'ys':(15,), 
        'gps_geom_description':19, 
         'visit_date':(5,), 'replicate_nr':3,'observerlist':4, 'survey':"Mallee Woodlands"}

site_records = fv.import_records_from_workbook(filepath=inputdir,
                                               workbook='PlantFireTraitData_2011-2018_Import.xlsx',
                                                worksheet='SiteData',
                                                col_dictionary=cdict,
                                                create_record_function=fv.create_field_site_record)

In [49]:
site_records[1]

{'site_label': 'S2007/2',
 'location_description': 'Scotia Sanctuary, southwestern sector, West of Elliots Bore, edge of burnt area',
 'gps_geom_description': 'Centre point at intersection of A, K, R & N subplots with G subplot adjacent and X1-X3 separated and wrapped around G subplot',
 'geom': "ST_GeomFromText('POINT(505169 6318145)', 28354)"}

In [50]:
batch_upsert(dbparams,"form.field_site",site_records,keycol=('site_label',), idx='field_site_pkey1',execute=True)

Connecting to the PostgreSQL database...
56 rows updated
Database connection closed.
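
The implementation of `batch_upsert` lives in `lib/firevegdb` and is not shown in this notebook. As a rough illustration of what an upsert keyed on a named constraint can look like in PostgreSQL, the sketch below builds an `INSERT ... ON CONFLICT ON CONSTRAINT ... DO UPDATE` statement with named placeholders from a record dictionary. `build_upsert_sql` is a hypothetical helper and an assumption about the general approach, not the code used by `batch_upsert`.

```python
def build_upsert_sql(table, record, keycol, constraint):
    """Build an upsert statement with %(name)s placeholders for one record.

    Table, column and constraint names are interpolated as plain text,
    so this sketch is only suitable for trusted, hard-coded identifiers.
    """
    columns = list(record)
    # Key columns identify the row; all other columns are overwritten on conflict.
    update_cols = [c for c in columns if c not in keycol]
    placeholders = ", ".join("%%(%s)s" % c for c in columns)
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (table, ", ".join(columns), placeholders)
    if update_cols:
        assignments = ", ".join("%s = EXCLUDED.%s" % (c, c) for c in update_cols)
        sql += " ON CONFLICT ON CONSTRAINT %s DO UPDATE SET %s" % (constraint, assignments)
    else:
        sql += " ON CONFLICT ON CONSTRAINT %s DO NOTHING" % constraint
    return sql


example = {"site_label": "S2007/2", "location_description": "Scotia Sanctuary"}
print(build_upsert_sql("form.field_site", example, ("site_label",), "field_site_pkey1"))
```

The resulting string can be passed to `cursor.execute` together with the record dictionary, which lets the driver handle quoting of the values.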


Insert location and visit records based on the sample id. Open question: how do we transform the subplots into sample numbers?


In [51]:
visit_records = fv.import_records_from_workbook(filepath=inputdir,
                                          workbook='PlantFireTraitData_2011-2018_Import.xlsx',
                                            worksheet='SiteData',
                                            col_dictionary=cdict,
                                            create_record_function=fv.create_field_visit_record) 

In [52]:
visit_records[1]

{'visit_id': 'S2007/2',
 'visit_date': datetime.datetime(2013, 9, 24, 0, 0),
 'survey_name': 'Mallee Woodlands',
 'observerlist': ['David Keith'],
 'replicate_nr': 3}

In [53]:
batch_upsert(dbparams,"form.field_visit",visit_records,keycol=('visit_id','visit_date'), idx='field_visit_pkey2',execute=True)

Connecting to the PostgreSQL database...
53 rows updated
Database connection closed.


### Import fire history records

The fire history was provided by David in May 2023; check that the import works.

In [54]:
worksheet = 'FireEvents'
#wbindex[filename][worksheet][0][0:13]
cols=wbindex[filename][worksheet]
for k in range(1,len(cols)):
    print("%s :: %s" % (k-1,cols[k-1]))

0 :: Site
1 :: Replicate
2 :: Date of last fire dd/mm/yyyy
3 :: Date of penultimate fire
4 :: Date of earlier fire
5 :: How date inferred1
6 :: How date inferred2
7 :: How date inferred3
8 :: Ignition cause1
9 :: Ignition cause2
10 :: Ignition cause3
11 :: Scorch hgt (m) min
12 :: Scorch hgt (m) mas
13 :: Scorch hgt (m) mode
14 :: % Tree foliage scorch
15 :: % Tree foliage c'sume
16 :: % Shb foliage scorch
17 :: % Shb foliage c'sume
18 :: % Herb layer foliage scorch
19 :: % Herb layer foliage c'sume
20 :: Twig diam (mm) 1
21 :: Twig diam (mm) 2
22 :: Twig diam (mm) 3
23 :: Twig diam (mm) 4
24 :: Twig diam (mm) 5
25 :: Twig diam (mm) 6
26 :: Twig diam (mm) 7
27 :: Twig diam (mm) 8
28 :: Twig diam (mm) 9
29 :: Twig diam (mm) 10
30 :: Peat depth burnt (cm)
31 :: Peat extent burnt %quad


In [23]:
col_dicts=[{'site_label':0,'fire_date':2,'how_inferred':5,'cause_of_ignition':8},
    {'site_label':0,'fire_date':3,'how_inferred':6,'cause_of_ignition':9},
    {'site_label':0,'fire_date':4,'how_inferred':7,'cause_of_ignition':10}]
fire_records = fv.import_records_from_workbook(inputdir, filename, worksheet, col_dicts, create_record_function=fv.create_fire_history_record)
len(fire_records)

833

In [24]:
fire_records[10]

{'site_label': 'S2007/1_X1_3',
 'fire_date': '2006-11-01',
 'earliest_date': datetime.date(2006, 11, 1),
 'latest_date': datetime.date(2006, 11, 1),
 'how_inferred': 'Land manager records & pers obs'}

We need to adjust the site label (remove the trailing replicate number) and include all the missing site labels (sites with fire history but no visit recorded yet).

In [25]:
all_sites = list()
for record in fire_records:
    record['site_label']=re.sub("_[0-9]$","",record['site_label'])
    all_sites.append(record['site_label'])

In [26]:
add_site_records=list()
all_sites = set(all_sites)

for site in all_sites:
    add_site_records.append({'site_label':site})

In [27]:
len(add_site_records)

397

In [28]:
add_site_records[1:10]

[{'site_label': 'S2011/3_G'},
 {'site_label': 'T2017/4_NX'},
 {'site_label': 'S2011/3_X1'},
 {'site_label': 'T2000/4_R'},
 {'site_label': 'T2000/4_X1'},
 {'site_label': 'T2003/2_G'},
 {'site_label': 'T2000/1_G'},
 {'site_label': 'T2006/4_X1'},
 {'site_label': 'T2001/3_A'}]

In [29]:
batch_upsert(dbparams,"form.field_site",add_site_records,keycol=(), idx=None,execute=True)

Connecting to the PostgreSQL database...
0 rows updated
Database connection closed.


Now we can do the batch upsert of all the fire history records

In [30]:
batch_upsert(dbparams,"form.fire_history",fire_records,keycol=('site_label','fire_date'), idx='fire_history_pkey1',execute=True)

Connecting to the PostgreSQL database...
639 rows updated
Database connection closed.


In [15]:
for record in site_records:
    record['site_label']=re.sub("_[AKRNGX123]+_[0-9]$","",record['site_label'])

In [16]:
site_records[22]['site_label']

'T2000/3_K'

In [19]:
visit_records[0]

{'visit_id': 'S2007/1_A_3',
 'visit_date': datetime.datetime(2011, 10, 28, 0, 0),
 'survey_name': 'Mallee Woodlands',
 'observerlist': ['Mark Tozer'],
 'replicate_nr': 3}

In [20]:
for record in visit_records:
    record['visit_id']=re.sub("_[0-9]$","",record['visit_id'])

### Import plant count data

In [16]:
worksheet = 'PlantCounts'
cols=wbindex[filename][worksheet]
for k in range(1,len(cols)):
    print("%s :: %s" % (k-1,cols[k-1]))

0 :: site_subplot_cen
1 :: Species_name
2 :: Recovery organ
3 :: Seedbank
4 :: Count of unburnt adlt individuals
5 :: Count of unburnt juv individuals
6 :: Count of resprouting juv individuals.
7 :: Count of resprouting adult individuals.
8 ::  # resprouted & died post-fire
9 :: Count of fire-killed juv individuals
10 :: Count of fire-killed adult individuals
11 :: #  reproductive pre-fire plants
12 :: Count of live postfire recruits
13 :: #  reproductive post-fire recruits
14 :: # recruits died post-fire
15 :: # reproductive recruits died post-fire
16 :: # live interfire recruits (>3yr postfire emerg
17 :: # live reproductive interfire recruits (>3yr postfire emerg
18 :: # deadinterfire recruits (>3yr postfire emerg)


In [17]:
col_dict={'visit_id':0, 'species':1,   
          'resprout_organ':2, 'seedbank':3,
          'adults_unburnt':4,
          'resprouts_live':6,
          'resprouts_kill':8,
          'resprouts_reproductive':7,
          'recruits_live':12, 
          'recruits_died':14, 
          'recruits_reproductive':13,
          'notes':19,'workbook':filename,'worksheet':worksheet}

In [18]:
quadrats = fv.import_records_from_workbook(inputdir, filename, worksheet, col_dict,
                                       fv.create_field_sample_record)

In [19]:
len(quadrats)

9051

In [20]:
quadrats[555]

{'visit_id': 'S2007/6_X1_3', 'replicate_nr': None, 'sample_nr': None}

In [21]:
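for record in quadrats:
    # take the replicate number from each record's own trailing "_<digit>" suffix
    suffix = re.search(r'_([0-9])$', record['visit_id'])
    record['replicate_nr'] = suffix.group(1) if suffix else None
    record['sample_nr']=1
    record['visit_id']=re.sub("_[0-9]$","",record['visit_id'])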


In [22]:
quadrats[555]

{'visit_id': 'S2007/6_X1', 'replicate_nr': '3', 'sample_nr': 1}

Now check which ones are valid visit records (already present in the database)

In [23]:
new_conn = psycopg2.connect(**dbparams)
valid_visits = validate_and_update_site_records(quadrats,useconn=new_conn)

K250_ not found
K251_ not found
K253_ not found
K254_ not found
K258_ not found
K259_ not found
K260_ not found
K261_ not found
K265_ not found
K266_ not found
K267_ not found
K268_ not found
S2007/1_X1 not found
S2007/1_X2 not found
S2007/1_X3 not found
record for S2007/2_A is incomplete
S2007/2_G not found
S2007/2_K not found
S2007/2_N not found
S2007/2_R not found
S2007/2_X1 not found
S2007/2_X2 not found
record for S2007/3_A is incomplete
S2007/3_G not found
S2007/3_K not found
S2007/3_N not found
S2007/3_R not found
S2007/3_X1 not found
S2007/3_X2 not found
S2007/3_X3 not found
S2007/4_X1 not found
S2007/4_X2 not found
S2007/4_X3 not found
record for S2007/5_A is incomplete
S2007/5_G not found
S2007/5_K not found
S2007/5_N not found
S2007/5_R not found
S2007/5_X1 not found
S2007/5_X2 not found
S2007/5_X3 not found
record for S2007/6_A is incomplete
S2007/6_G not found
S2007/6_K not found
S2007/6_N not found
S2007/6_R not found
S2007/6_X1 not found
S2007/6_X2 not found
S2007/6_X3 n

In [24]:
new_conn.close()

In [25]:
len(valid_visits)
#len(quadrats)

46

In [27]:
def create_quadrat_sample_record(item,sw,lookup,valid_seedbank,valid_organ):
    species = item[sw['species']].value
    visit_id =  item[sw['visit_id']].value
    if 'sample_nr' in sw.keys():
        sample_nr=item[sw['sample_nr']].value
    else:
        sample_nr=1
    if species is not None:
        record={'visit_id': visit_id, 'sample_nr': sample_nr,
                'species': species}
        comms=list()
        if 'workbook' in sw.keys():
            comms.append("Imported from workbook %s using python script" % sw['workbook'])
        if 'worksheet' in sw.keys():
            comms.append("Imported from spreadsheet %s" % sw['worksheet'])
    
        if 'date' in sw.keys():
            visit_date = item[sw['date']].value
        else:
            visit_date = None
            
        if 'replicate_nr' in sw.keys():
            replicate_nr = item[sw['replicate_nr']].value
        elif 'fixed_replicate_nr' in sw.keys():
            replicate_nr = sw['fixed_replicate_nr']
        
        if isinstance(visit_date,datetime):
            record['visit_date'] = visit_date.date()
        else:    
            p=filter(lambda n: n['visit_id'] == visit_id and  n['replicate_nr'] == replicate_nr, lookup)
            found=list(p)
            if len(found)==1 and 'visit_date' in found[0].keys():
                visit_date=found[0]['visit_date']
                if isinstance(visit_date,datetime):
                    record['visit_date'] = visit_date.date()
                    comms.append("visit date not provided, matched by replicate nr %s" % replicate_nr)
                else:
                    record['visit_date'] = visit_date
                    comms.append("matched by replicate nr %s, assuming date object" % replicate_nr)
            else:
                comms.append("neither visit date nor replicate nr was matched ( replicate nr %s ), no date" % replicate_nr)

        if 'spcode' in sw.keys():
            spcode = item[sw['spcode']].value
            if (isinstance(spcode, str) and spcode.isnumeric()) or isinstance(spcode,int):
                record['species_code']=spcode
         
        for k in ('species_notes', 'resprout_organ', 'seedbank', 'adults_unburnt', 'resprouts_live', 'resprouts_died', 'resprouts_kill', 'resprouts_reproductive',
                  'recruits_live', 'recruits_reproductive', 'recruits_died','notes'):
            if k in sw.keys():
                vals=item[sw[k]].value
                if vals is not None and vals not in ('na','NA'):
                    if k == 'resprout_organ':
                        if vals in valid_organ:
                            record[k]=vals
                        elif vals.capitalize() in valid_organ:
                            record[k]=vals.capitalize()
                        else:
                            comms.append("resprout organ written as %s" % vals)
                    elif k == 'seedbank':
                        if vals in valid_seedbank:
                            record[k]=vals
                        elif vals.capitalize() in valid_seedbank:
                            record[k]=vals.capitalize()
                        else:
                            comms.append("seedbank written as %s" % vals)
                    elif k == 'notes':
                        if isinstance(vals,(int, float, complex)):
                            comms.append("Comment column included a numeric value of %s" % vals)
                        else:
                            comms.append(vals)
                    elif k in ('adults_unburnt', 'resprouts_live', 'resprouts_died', 'resprouts_kill', 'resprouts_reproductive',
                  'recruits_live', 'recruits_reproductive', 'recruits_died'):
                        if isinstance(vals,int):
                            record[k]=vals   
                        else:
                            comms.append("%s written as %s" % (k,vals))
                    else:
                        record[k]=vals        
        
        if len(comms)>0:
            record["comments"]=comms
        
        return(record)
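
The function accepts a `resprout_organ` or `seedbank` value if it matches the vocabulary exactly or after capitalisation, and otherwise records the raw value as a comment. A minimal standalone sketch of that matching rule follows. `match_vocabulary` is a hypothetical helper and the vocabulary terms are placeholders, not the actual enum labels in the database; unlike the function above, it also guards against non-string values, for which `.capitalize()` is not available.

```python
def match_vocabulary(value, valid_terms, label="value"):
    """Return (term, comment): the matched term or None, and a note if unmatched."""
    if isinstance(value, str):
        # Try the value as written first, then its capitalised form.
        for candidate in (value, value.capitalize()):
            if candidate in valid_terms:
                return candidate, None
    return None, "%s written as %s" % (label, value)


# Placeholder terms for illustration only.
placeholder_vocab = ["Epicormic", "Lignotuber"]

print(match_vocabulary("epicormic", placeholder_vocab, "resprout organ"))
print(match_vocabulary("stem?", placeholder_vocab, "resprout organ"))
```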


In [28]:
records=fv.import_records_from_workbook(inputdir, filename, worksheet, col_dict,
                                         create_quadrat_sample_record,
                                         lookup=valid_visits, 
                                        valid_seedbank=seedbank_vocab, 
                                        valid_organ=organ_vocab)

UnboundLocalError: cannot access local variable 'replicate_nr' where it is not associated with a value
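
This error is consistent with `col_dict` containing neither a `replicate_nr` nor a `fixed_replicate_nr` key: in that case neither branch of the `if`/`elif` assigns `replicate_nr`, yet the lookup and the comment strings still read it. A minimal sketch of one way to avoid this is to resolve the value in a helper that always returns something. `resolve_replicate_nr` is hypothetical, and plain lists stand in for the worksheet rows (so there is no `.value` attribute here).

```python
def resolve_replicate_nr(row, sw):
    """Return the replicate number for a row, or None when it is not defined."""
    if "replicate_nr" in sw:
        # column offset given in the column dictionary
        return row[sw["replicate_nr"]]
    if "fixed_replicate_nr" in sw:
        # one fixed value shared by every row in the worksheet
        return sw["fixed_replicate_nr"]
    # explicit default, so callers never hit an unbound name
    return None


row = ["S2007/6_X1", "Some species", 3]
print(resolve_replicate_nr(row, {"visit_id": 0, "species": 1, "replicate_nr": 2}))
print(resolve_replicate_nr(row, {"visit_id": 0, "species": 1, "fixed_replicate_nr": 1}))
print(resolve_replicate_nr(row, {"visit_id": 0, "species": 1}))
```

With a `None` default the lookup by replicate number simply finds no match, and the record keeps the comment that no date could be assigned.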

In [None]:


    
    
valid_records=list()
invalid_records=list()
for record in records:
    # replicate number from the record itself, else the fixed value in the column dictionary
    if 'replicate_nr' in record.keys():
        replicate_nr = record['replicate_nr']
    elif 'fixed_replicate_nr' in col_dict.keys():
        replicate_nr = col_dict['fixed_replicate_nr']
    else:
        replicate_nr = None

    # match each record to a valid visit by date, or by replicate number when there is no date
    if 'visit_date' in record.keys():
        p=filter(lambda n: n['visit_id'] == record['visit_id'] and  n['visit_date'] == record['visit_date'], valid_visits)
        found=list(p)
    elif replicate_nr is not None:
        p=filter(lambda n: n['visit_id'] == record['visit_id'] and  n['replicate_nr'] == replicate_nr, valid_visits)
        found=list(p)
    else:
        found=list()

    # keep only records that match exactly one visit
    if (len(found)==1):
        valid_records.append(record)
    else:
        invalid_records.append(record)

print("%s valid records and %s invalid records" % (len(valid_records), len(invalid_records)))

batch_upsert(dbparams,table='form.quadrat_samples',records=valid_records,keycol=('visit_id','visit_date','sample_nr'),
         idx=None, execute=True)