# Fireveg DB imports -- import field work forms

Author: [José R. Ferrer-Paris](https://github.com/jrfep)

Date: February 2022, updated 20 August 2024

This Jupyter Notebook includes [Python](https://www.python.org) code to:
- Read data from spreadsheets with field-work data
- Create records for data import into the database
- Insert or update records in the database

This notebook reads the files for the Mallee Woodlands survey and runs the three data import steps listed above for sites and visits, fire history and sample quadrats.

The Excel workbook was prepared by Prof. David Keith, FAA in May 2023. This dataset has several sites (S2007/1, T2001/1, etc.); each site has several subplots with different treatments (A, K, N, R, G, X1, X2, X3) and replicates for each site/subplot.

**Please note:**
<div class="alert alert-warning">
    This repository contains code that is intended for internal project management and is documented for the sake of reproducibility.<br/>
    🛂 Only users contributing directly to the project have access to the credentials for data download/upload. 
</div>

## Set-up
Load libraries 

In [1]:
import openpyxl
from pathlib import Path
import os,sys
from datetime import datetime
from configparser import ConfigParser
import psycopg2
from psycopg2.extensions import AsIs
import pyprojroot
import re

### Define paths for input and output

In [2]:
repodir = pyprojroot.find_root(pyprojroot.has_dir(".git"))
sys.path.append(str(repodir))

Define path to workbooks

In [3]:
inputdir = repodir / "data" / "input-field-form"

### Load own functions

Load functions from the `lib` folder. We will use one function to read database credentials and others for batch inserts and updates:

In [10]:
from lib.parseparams import read_dbparams
from lib.firevegdb import batch_upsert,dbquery
from lib.firevegdb import validate_and_update_site_records

import lib.fireveg as fv

### Database credentials

🤫 We use a folder named "secrets" to keep the credentials for connecting to different services (database credentials, API keys, etc.). This folder is listed in our `.gitignore` so that its contents are not tracked by git and not exposed. Future users need to copy the contents of this folder manually.

We read database credentials stored in a `database.ini` file using our own `read_dbparams` function.

In [5]:
dbparams = read_dbparams(repodir / 'secrets' / 'database.ini', 
                         section='fireveg-db-v1.1')
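
For readers without access to `lib.parseparams`, the following is a minimal sketch of what a credentials reader of this kind might look like, assuming the `.ini` file has one section per database with `host`, `database`, `user` and `password` keys. The function name `read_dbparams_sketch`, the section contents and the temporary file are illustrative assumptions, not the project's actual implementation.

```python
import tempfile
from configparser import ConfigParser
from pathlib import Path


def read_dbparams_sketch(filename, section):
    """Return the key/value pairs of one section of an .ini file as a dict.

    Hypothetical stand-in for lib.parseparams.read_dbparams.
    """
    parser = ConfigParser()
    # ConfigParser.read returns the list of files it could parse
    if not parser.read(filename):
        raise FileNotFoundError(f"Could not read {filename}")
    if not parser.has_section(section):
        raise KeyError(f"Section {section} not found in {filename}")
    return dict(parser.items(section))


# Illustrative file with placeholder values (not real credentials)
example_ini = """
[fireveg-db-v1.1]
host = localhost
database = example_db
user = example_user
password = not-a-real-password
"""

with tempfile.TemporaryDirectory() as tmpdir:
    ini_path = Path(tmpdir) / "database.ini"
    ini_path.write_text(example_ini)
    example_params = read_dbparams_sketch(ini_path, section="fireveg-db-v1.1")

print(sorted(example_params))
```

A dictionary of this shape can be unpacked into a connection call, which is presumably how `dbparams` is used by the functions in `lib.firevegdb`.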

## List of workbooks/spreadsheets in directory

Each spreadsheet has a slightly different structure, so these scripts have to be adapted for each case.

We use functions from module `fireveg` to read the data and create records, and functions from module `firevegdb` to execute the SQL insert or update query.


In [6]:
os.listdir(inputdir)

['UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton_revisedfields_Mar2022.xlsx',
 'PlantFireTraitData_2011-2018_Import_AdditionalSiteInfo.xlsx',
 'UNSW_VegFireResponse_KNP AlpAsh_firehistupdate.xlsx',
 'SthnNSWRF_data_bionet2.xlsx',
 'UNSWFireVegResponse_UplandBasalt_AlexThomsen+DK.xlsx',
 'PlantFireTraitData_2011-2018_Import.xlsx',
 '.ipynb_checkpoints',
 'UNSW_VegFireResponse_RMK_reformat_Sep2021a.xlsx',
 'UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton.xlsx',
 'UNSW_VegFireResponse_KNP AlpAsh.xlsx',
 'UNSW_VegFireResponse_AlpineBogs_reformat_Sep2021.xlsx',
 'RobertsonRF_data_bionet2.xlsx',
 'Fire response quadrat survey Newnes Nov2020_DK_revised IDs+AllNovData.xlsm']

In [7]:
valid_files =  ['PlantFireTraitData_2011-2018_Import.xlsx',
             'PlantFireTraitData_2011-2018_Import_AdditionalSiteInfo.xlsx']


Here we create an index of worksheets and column headers for each file

In [8]:
wbindex=dict()
for workbook_name in valid_files:
    inputfile=inputdir / workbook_name
    # using data_only=True to get the calculated cell values
    wb = openpyxl.load_workbook(inputfile,data_only=True)
    wbindex[workbook_name]=dict()
    for ws in wb.worksheets:
        # ws.title is the public accessor for the worksheet name
        wbindex[workbook_name][ws.title]=list()
        # NOTE: range(1, ws.max_column) stops before the last column;
        # use ws.max_column + 1 if the last header is also needed
        for k in range(1,ws.max_column):
            wbindex[workbook_name][ws.title].append(ws.cell(row=1,column=k).value)
        

## Database queries

### Survey info
Check if survey information is already present in the database:

In [11]:
dbquery("SELECT * FROM form.surveys;",dbparams)

[['TO BE CLASSIFIED',
  'Placeholder for field visits not yet assigned to a survey',
  'JR Ferrer-Paris'],
 ['UplandBasalt', 'Upland Basalt', None],
 ['Rainforests NSW-Qld', 'Rainforests NE NSW & SE Qld', None],
 ['NEWNES', 'Newnes plateau swamps', None],
 ['KNP AlpAsh', 'Kosciuszko NP Alpine Ash', None],
 ['SthnNSWRF', 'Southern NSW Rainforests', None],
 ['Alpine Bogs', 'Alpine Bogs', None],
 ['Robertson RF', 'Robertson RF', None],
 ['Yatteyattah', 'Yatteyattah', None],
 ['Mallee Woodlands', 'Mallee Woodlands', None]]

### Get updated vocabularies from database

In [12]:
qry = "SELECT enumlabel FROM pg_enum e LEFT JOIN pg_type t ON e.enumtypid=t.oid where typname='resprout_organ_vocabulary';"
valid_organ_list = dbquery(qry, dbparams)
organ_vocab = [item for t in valid_organ_list for item in t]

qry = "SELECT enumlabel FROM pg_enum e LEFT JOIN pg_type t ON e.enumtypid=t.oid where typname='seedbank_vocabulary';"
valid_seedbank_list = dbquery(qry, dbparams)
seedbank_vocab = [item for t in valid_seedbank_list for item in t]
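
The list comprehensions above flatten the query result (one row per enum label) into a plain list of strings. A minimal sketch of that step with made-up rows; the labels below are placeholders, not the actual vocabulary stored in the database:

```python
from itertools import chain

# Placeholder rows shaped like a one-column query result: one sequence per row
example_rows = [("label-a",), ("label-b",), ("label-c",)]

# Same flattening as the nested list comprehension used in the notebook
flat_comprehension = [item for row in example_rows for item in row]

# Equivalent using itertools
flat_chain = list(chain.from_iterable(example_rows))

print(flat_comprehension)
```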

## Import data from each worksheet

In the following section, we iterate through the worksheets in the workbook, using functions defined in the `fireveg` and `firevegdb` modules.

Here is the list of available worksheets:

In [13]:
filename=valid_files[0]

wbindex[filename].keys()

dict_keys(['SiteData', 'FireEvents', 'PlantCounts'])

### Import site visits records into database

- 56 sites/visits in the period 2011 to 2018
- But we need to fix the site label to exclude the replicate number

The original list was incomplete, so we need to read two workbooks. We can retrieve the list of column names that we will use in our column definitions for each function:

In [14]:
cols=wbindex[valid_files[0]]['SiteData']
for k in range(1,len(cols)):
    print("%s :: %s" % (k-1,cols[k-1]))

0 :: Site_subplot_census
1 :: Site
2 :: Subplot
3 :: Replicate
4 :: Observers (comma sep if >1)
5 :: Date of samping
6 :: Survey Date Replicate 1
7 :: Survey Date Replicate 2
8 :: Survey Date Replicate 3
9 :: Survey Date Replicate 4
10 :: Survey Date Replicate 5
11 :: Survey Date Replicate 6
12 :: Location text
13 :: Zone
14 :: Easting
15 :: Northing
16 :: GPS Precision (m)
17 :: Latitude
18 :: Longitude
19 :: Layout & GPS marker position
20 :: 2nd ref point Zone
21 :: 2nd ref point Easting
22 :: 2nd ref point Northing
23 :: 2nd ref point Position of GPS
24 :: 3rd ref point Zone
25 :: 3rd ref point Easting
26 :: 3rd ref point Northing
27 :: 3rd ref point Position of GPS
28 :: 4th ref point Zone
29 :: 4th ref point Easting
30 :: 4th ref point Northing
31 :: 4th ref point Position of GPS
32 :: Total sample area (sq.m)
33 :: Subquadrat area (sq.m)
34 :: # subquadrats
35 :: Substrate
36 :: Notes
37 :: Slope
38 :: Aspect
39 :: Elevation
40 :: Disturbance notes
41 :: Cwth TEC
42 :: NSW TEC
4

In [15]:
cols=wbindex[valid_files[1]]['Sheet1']
for k in range(1,len(cols)):
    print("%s :: %s" % (k-1,cols[k-1]))

0 :: Site_subplot_census
1 :: Site
2 :: Subplot
3 :: Replicate
4 :: Observers (comma sep if >1)
5 :: Date of samping
6 :: Survey Date Replicate 1
7 :: Survey Date Replicate 2
8 :: Survey Date Replicate 3
9 :: Survey Date Replicate 4
10 :: Survey Date Replicate 5
11 :: Survey Date Replicate 6
12 :: Location text
13 :: Zone
14 :: Easting
15 :: Northing
16 :: GPS Precision (m)
17 :: Latitude
18 :: Longitude
19 :: Layout & GPS marker position
20 :: 2nd ref point Zone
21 :: 2nd ref point Easting
22 :: 2nd ref point Northing
23 :: 2nd ref point Position of GPS
24 :: 3rd ref point Zone
25 :: 3rd ref point Easting
26 :: 3rd ref point Northing
27 :: 3rd ref point Position of GPS
28 :: 4th ref point Zone
29 :: 4th ref point Easting
30 :: 4th ref point Northing
31 :: 4th ref point Position of GPS
32 :: Total sample area (sq.m)
33 :: Subquadrat area (sq.m)
34 :: # subquadrats
35 :: Substrate
36 :: Notes
37 :: Slope
38 :: Aspect
39 :: Elevation
40 :: Disturbance notes
41 :: Cwth TEC
42 :: NSW TEC
4

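The column dictionary differs slightly between the two workbooks: the site label is read from column 1 in the first workbook (`cdict`) and from column 0 in the second (`cdict2`).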

In [16]:
cdict = {'site_label':1,'location_description':12, 'utm_zone':13,'xs':(14,), 'ys':(15,), 
        'gps_geom_description':19, 
         'visit_date':(5,), 'replicate_nr':3,'observerlist':4, 'survey':"Mallee Woodlands"}
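
A column dictionary maps each target field to a zero-based column position in the worksheet. The sketch below shows one way such a dictionary could be applied to a row of cell values, assuming that integers are single column positions, tuples list candidate columns (the first non-empty one is used) and strings are constants. `apply_column_dictionary` and the example row are hypothetical; the actual logic lives in the `fireveg` module and may differ.

```python
def apply_column_dictionary(row, col_dictionary):
    """Pick values from a row (a sequence of cell values) by column position."""
    record = {}
    for field, spec in col_dictionary.items():
        if isinstance(spec, bool):
            # Flags are passed through unchanged (bool is a subclass of int)
            record[field] = spec
        elif isinstance(spec, int):
            record[field] = row[spec]
        elif isinstance(spec, tuple):
            # Use the first candidate column that holds a value
            values = [row[k] for k in spec if row[k] is not None]
            record[field] = values[0] if values else None
        else:
            # Anything else (e.g. a string) is treated as a constant
            record[field] = spec
    return record


# Shortened, made-up row and dictionary for illustration
example_row = ["S2007/2_A_3", "S2007/2", "A", 3]
example_cdict = {"site_label": 1, "replicate_nr": 3,
                 "subplot": (2,), "survey": "Mallee Woodlands"}

print(apply_column_dictionary(example_row, example_cdict))
```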

In [17]:
site_records = fv.import_records_from_workbook(filepath=inputdir,
                                               workbook='PlantFireTraitData_2011-2018_Import.xlsx',
                                                worksheet='SiteData',
                                                col_dictionary=cdict,
                                                create_record_function=fv.create_field_site_record)

In [18]:
cdict2 = {'site_label':0,'location_description':12, 'utm_zone':13,'xs':(14,), 'ys':(15,), 
        'gps_geom_description':19, 
         'visit_date':(5,), 'replicate_nr':3,'observerlist':4, 'survey':"Mallee Woodlands"}

In [19]:
more_site_records = fv.import_records_from_workbook(filepath=inputdir,
                                               workbook=valid_files[1],
                                                worksheet='Sheet1',
                                                col_dictionary=cdict2,
                                                create_record_function=fv.create_field_site_record)

In [20]:
site_records[1]

{'site_label': 'S2007/2',
 'location_description': 'Scotia Sanctuary, southwestern sector, West of Elliots Bore, edge of burnt area',
 'gps_geom_description': 'Centre point at intersection of A, K, R & N subplots with G subplot adjacent and X1-X3 separated and wrapped around G subplot',
 'geom': "ST_GeomFromText('POINT(505169 6318145)', 28354)"}
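
The `geom` value in the record above is a PostGIS expression built from the UTM zone, easting and northing columns. A minimal sketch of how such a string could be assembled, assuming the coordinates are GDA94 / MGA (EPSG codes 28300 + zone, which gives 28354 for zone 54 as in the output above). The function `utm_point_expression` is hypothetical, and the zone number 54 is inferred from the SRID rather than read from the workbook.

```python
def utm_point_expression(easting, northing, zone):
    """Build an ST_GeomFromText expression for a GDA94 / MGA point.

    Assumption: SRID = 28300 + zone (e.g. zone 54 -> EPSG:28354).
    """
    srid = 28300 + int(zone)
    return "ST_GeomFromText('POINT(%d %d)', %d)" % (easting, northing, srid)


example_geom = utm_point_expression(505169, 6318145, 54)
print(example_geom)
```

Because the value is an SQL expression rather than a literal, it has to reach the query unquoted, which is presumably why `AsIs` is imported from `psycopg2.extensions` in the set-up.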

In [21]:
len(more_site_records)

42

In [22]:
batch_upsert(dbparams,"form.field_site",site_records,keycol=('site_label',), idx='field_site_pkey',execute=True)

Connecting to the PostgreSQL database...
56 rows updated
Database connection closed.


In [23]:
batch_upsert(dbparams,"form.field_site",more_site_records,keycol=('site_label',), idx='field_site_pkey',execute=True)

Connecting to the PostgreSQL database...
15 rows updated
Database connection closed.
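
`batch_upsert` is defined in `lib/firevegdb.py` and is not shown here. As a rough illustration of the insert-or-update pattern in PostgreSQL, the sketch below builds an `INSERT ... ON CONFLICT` statement for one record. The function `build_upsert_sql` is hypothetical, and the assumption that `idx` names a constraint used as conflict target (with no update when it is `None`) is ours; the project function may work differently.

```python
def build_upsert_sql(table, record, keycol, idx=None):
    """Return (sql, values) for a parameterised PostgreSQL upsert.

    Hypothetical sketch: table and column names are interpolated directly,
    so they must come from trusted code, never from user input.
    """
    columns = list(record.keys())
    placeholders = ", ".join(["%s"] * len(columns))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (
        table, ", ".join(columns), placeholders)
    # Columns that are not part of the key get overwritten on conflict
    updates = [c for c in columns if c not in keycol]
    if idx is None or not updates:
        sql += " ON CONFLICT DO NOTHING"
    else:
        assignments = ", ".join("%s = EXCLUDED.%s" % (c, c) for c in updates)
        sql += " ON CONFLICT ON CONSTRAINT %s DO UPDATE SET %s" % (
            idx, assignments)
    return sql, [record[c] for c in columns]


example_sql, example_values = build_upsert_sql(
    "form.field_site",
    {"site_label": "S2007/2", "location_description": "Scotia Sanctuary"},
    keycol=("site_label",),
    idx="field_site_pkey",
)
print(example_sql)
```

The values are returned separately so that they can be passed as query parameters instead of being formatted into the SQL string.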


Insert location and visit records based on the sample id. This leaves one open question: how do we transform the subplots into sample numbers?


In [24]:
visit_records = fv.import_records_from_workbook(filepath=inputdir,
                                          workbook='PlantFireTraitData_2011-2018_Import.xlsx',
                                            worksheet='SiteData',
                                            col_dictionary=cdict,
                                            create_record_function=fv.create_field_visit_record) 

In [25]:
visit_records[1]

{'visit_id': 'S2007/2',
 'visit_date': datetime.datetime(2013, 9, 24, 0, 0),
 'survey_name': 'Mallee Woodlands',
 'observerlist': ['David Keith'],
 'replicate_nr': 3}

In [26]:
obslist=list()
for record in visit_records:
    if 'observerlist' in record.keys():
        for observer in record['observerlist']:
            obslist.append(observer)
uniq_obs = set(obslist)

In [27]:
uniq_obs

{'Chris Simpson',
 'David Keith',
 'Freya Thomas',
 'Kate Giljohann',
 'Mark Tozer',
 'Recorders not recorded in Notebook?',
 'Renee Woodward'}

In [28]:
more_visit_records = fv.import_records_from_workbook(filepath=inputdir,
                                          workbook=valid_files[1],
                                            worksheet='Sheet1',
                                            col_dictionary=cdict2,
                                            create_record_function=fv.create_field_visit_record) 

In [29]:
more_visit_records[1]

{'visit_id': 'S2010/2',
 'visit_date': datetime.datetime(2011, 10, 7, 0, 0),
 'survey_name': 'Mallee Woodlands',
 'replicate_nr': 1}

In [31]:
batch_upsert(dbparams,"form.field_visit",visit_records,
             keycol=('visit_id','visit_date'), idx='field_visit_pkey',execute=True)

Connecting to the PostgreSQL database...
53 rows updated
Database connection closed.


In [32]:
batch_upsert(dbparams,"form.field_visit",more_visit_records,
             keycol=('visit_id','visit_date'), idx='field_visit_pkey',execute=True)

Connecting to the PostgreSQL database...
42 rows updated
Database connection closed.


### Import fire history records

This worksheet was provided by David in May 2023; we check here that the import works.

In [33]:
worksheet = 'FireEvents'
#wbindex[filename][worksheet][0][0:13]
cols=wbindex[filename][worksheet]
for k in range(1,len(cols)):
    print("%s :: %s" % (k-1,cols[k-1]))

0 :: Site
1 :: Replicate
2 :: Date of last fire dd/mm/yyyy
3 :: Date of penultimate fire
4 :: Date of earlier fire
5 :: How date inferred1
6 :: How date inferred2
7 :: How date inferred3
8 :: Ignition cause1
9 :: Ignition cause2
10 :: Ignition cause3
11 :: Scorch hgt (m) min
12 :: Scorch hgt (m) mas
13 :: Scorch hgt (m) mode
14 :: % Tree foliage scorch
15 :: % Tree foliage c'sume
16 :: % Shb foliage scorch
17 :: % Shb foliage c'sume
18 :: % Herb layer foliage scorch
19 :: % Herb layer foliage c'sume
20 :: Twig diam (mm) 1
21 :: Twig diam (mm) 2
22 :: Twig diam (mm) 3
23 :: Twig diam (mm) 4
24 :: Twig diam (mm) 5
25 :: Twig diam (mm) 6
26 :: Twig diam (mm) 7
27 :: Twig diam (mm) 8
28 :: Twig diam (mm) 9
29 :: Twig diam (mm) 10
30 :: Peat depth burnt (cm)
31 :: Peat extent burnt %quad


In [34]:
col_dicts=[{'site_label':0,'fire_date':2,'how_inferred':5,'cause_of_ignition':8},
    {'site_label':0,'fire_date':3,'how_inferred':6,'cause_of_ignition':9},
    {'site_label':0,'fire_date':4,'how_inferred':7,'cause_of_ignition':10}]
fire_records = fv.import_records_from_workbook(inputdir, filename, worksheet, col_dicts, create_record_function=fv.create_fire_history_record)
len(fire_records)

833
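
Each worksheet row holds up to three fires (last, penultimate and earlier), so a list of three column dictionaries is passed and each one can yield a record from the same row. A minimal sketch of that expansion, assuming that a dictionary whose `fire_date` cell is empty produces no record; `expand_row` and the example row are hypothetical.

```python
def expand_row(row, col_dicts, required="fire_date"):
    """Create one record per column dictionary, skipping empty required cells."""
    records = []
    for col_dict in col_dicts:
        record = {field: row[k] for field, k in col_dict.items()
                  if row[k] is not None}
        if required in record:
            records.append(record)
    return records


# Shortened, made-up layout: site, last fire, penultimate fire, how inferred 1 and 2
example_col_dicts = [
    {"site_label": 0, "fire_date": 1, "how_inferred": 3},
    {"site_label": 0, "fire_date": 2, "how_inferred": 4},
]
example_row = ["S2007/1_X1_3", "2006-11-01", None,
               "Land manager records & pers obs", None]

example_records = expand_row(example_row, example_col_dicts)
print(example_records)
```

Under this assumption the number of records is at most three times the number of rows, and lower wherever earlier fire dates are blank.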

In [35]:
fire_records[10]

{'site_label': 'S2007/1_X1_3',
 'fire_date': '2006-11-01',
 'earliest_date': datetime.date(2006, 11, 1),
 'latest_date': datetime.date(2006, 11, 1),
 'how_inferred': 'Land manager records & pers obs'}

We need to adjust the site label (remove the trailing subplot code and replicate number) and include all the missing site labels (sites with fire history but no visit recorded yet).

In [36]:
all_sites = list()
for record in fire_records:
    record['site_label']=re.sub("_[AKRNGX123]+_[0-9]$", "", record['site_label'])
    all_sites.append(record['site_label'])
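
To see what the regular expression does, here is a small check on labels that appear in this notebook. The pattern removes a trailing `_<subplot>_<replicate>` suffix and leaves labels without that suffix untouched:

```python
import re

# Same pattern as above, written as a raw string
SUFFIX_PATTERN = r"_[AKRNGX123]+_[0-9]$"

example_labels = ["S2007/1_X1_3", "S2007/6_X1_3", "T1998/CON1"]
cleaned_labels = [re.sub(SUFFIX_PATTERN, "", label) for label in example_labels]

print(cleaned_labels)
```

Note that `[0-9]$` matches a single digit, so a replicate number of 10 or more would not be stripped by this pattern.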

In [37]:
add_site_records=list()
all_sites = set(all_sites)

for site in all_sites:
    add_site_records.append({'site_label':site})

In [38]:
len(add_site_records)

53

In [39]:
add_site_records[1:10]

[{'site_label': 'T2017/2'},
 {'site_label': 'T2003/4'},
 {'site_label': 'T2011/4'},
 {'site_label': 'S2007/4'},
 {'site_label': 'S2007/5'},
 {'site_label': 'T2001/4'},
 {'site_label': 'T2005/2'},
 {'site_label': 'S2010/2'},
 {'site_label': 'T2005/4'}]

In [40]:
batch_upsert(dbparams,"form.field_site",add_site_records,keycol=(), 
idx=None,execute=True)

Connecting to the PostgreSQL database...
0 rows updated
Database connection closed.


Now we can do the batch upsert of all the fire history records

In [42]:
batch_upsert(dbparams,"form.fire_history",fire_records,
             keycol=('site_label','fire_date'), 
             idx='fire_history_pkey',execute=True)

Connecting to the PostgreSQL database...
639 rows updated
Database connection closed.


### Import quadrat sample data

Again, we need to adjust information that is specific to this sampling design. 

In [43]:
worksheet = 'PlantCounts'
cols=wbindex[filename][worksheet]
for k in range(1,len(cols)):
    print("%s :: %s" % (k-1,cols[k-1]))

0 :: site_subplot_cen
1 :: Species_name
2 :: Recovery organ
3 :: Seedbank
4 :: Count of unburnt adlt individuals
5 :: Count of unburnt juv individuals
6 :: Count of resprouting juv individuals.
7 :: Count of resprouting adult individuals.
8 ::  # resprouted & died post-fire
9 :: Count of fire-killed juv individuals
10 :: Count of fire-killed adult individuals
11 :: #  reproductive pre-fire plants
12 :: Count of live postfire recruits
13 :: #  reproductive post-fire recruits
14 :: # recruits died post-fire
15 :: # reproductive recruits died post-fire
16 :: # live interfire recruits (>3yr postfire emerg
17 :: # live reproductive interfire recruits (>3yr postfire emerg
18 :: # deadinterfire recruits (>3yr postfire emerg)


In [44]:
col_dict={'visit_id':0, 'species':1,   
          'resprout_organ':2, 'seedbank':3,
          'adults_unburnt':4,
          'resprouts_live':6,
          'resprouts_kill':8,
          'resprouts_reproductive':7,
          'recruits_live':12, 
          'recruits_died':14, 
          'recruits_reproductive':13,
          'split_visit_id': True,
          'notes':19,'workbook':filename,'worksheet':worksheet}

In [45]:
quadrats = fv.import_records_from_workbook(inputdir, filename, worksheet, col_dict,
                                       fv.create_field_sample_record)

In [46]:
len(quadrats)

9051

We will be using treatment codes to identify different samples.

In [47]:
quadrats[175]

{'visit_id': 'S2007/1', 'replicate_nr': 3, 'sample_nr': 'X2'}
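
The record above results from splitting the original identifier in the first column into site, subplot code and replicate number (`'split_visit_id': True`). A minimal sketch of such a split, assuming identifiers of the form `<site>_<subplot>_<replicate>`; `split_visit_id_sketch` is a hypothetical helper, not the `fireveg` function.

```python
def split_visit_id_sketch(original_id):
    """Split '<site>_<subplot>_<replicate>' into its three parts.

    Identifiers without two underscores are returned as visit_id only.
    """
    parts = original_id.rsplit("_", 2)
    if len(parts) != 3 or not parts[2].isdigit():
        return {"visit_id": original_id, "replicate_nr": None, "sample_nr": None}
    site, subplot, replicate = parts
    return {"visit_id": site, "replicate_nr": int(replicate), "sample_nr": subplot}


print(split_visit_id_sketch("S2007/1_X2_3"))
print(split_visit_id_sketch("T1998/CON1"))
```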

In [48]:
samples = {"A":1,"K":2,"R":3,"N":4,"G":5,"X1":6,"X2":7,"X3":8,
           "AX":9, "KX":10, "RX":11, "NX":12,}
for record in quadrats:
    if record['sample_nr']:
        samplenr=record['sample_nr']
        record['sample_nr']=samples[samplenr]
        record['comment']=['Original sample code was %s' % samplenr]
    else:
        record['sample_nr']=99
        record['comment']=['Original sample code/nr was missing']


In [49]:
quadrats[175]

{'visit_id': 'S2007/1',
 'replicate_nr': 3,
 'sample_nr': 7,
 'comment': ['Original sample code was X2']}
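
The loop above raises a `KeyError` if a worksheet contains a code that is not in `samples`. A minimal defensive variant, assuming that unknown codes should be flagged with the same placeholder number 99 used for missing codes; `recode_sample` is a hypothetical helper and this fallback is a suggestion, not what the notebook does.

```python
SAMPLE_CODES = {"A": 1, "K": 2, "R": 3, "N": 4, "G": 5, "X1": 6, "X2": 7, "X3": 8,
                "AX": 9, "KX": 10, "RX": 11, "NX": 12}


def recode_sample(record, codes=SAMPLE_CODES, missing=99):
    """Return a copy of the record with a numeric sample_nr and a comment."""
    original = record.get("sample_nr")
    recoded = dict(record)
    if not original:
        recoded["sample_nr"] = missing
        recoded["comment"] = ["Original sample code/nr was missing"]
    elif original in codes:
        recoded["sample_nr"] = codes[original]
        recoded["comment"] = ["Original sample code was %s" % original]
    else:
        recoded["sample_nr"] = missing
        recoded["comment"] = ["Unrecognised sample code %s" % original]
    return recoded


print(recode_sample({"visit_id": "S2007/1", "replicate_nr": 3, "sample_nr": "X2"}))
```

Returning a copy instead of modifying the record in place makes it safe to run the recoding more than once on the same input.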

Now check which ones are valid visit records (already present in the database)

Manual fix for some missing visits

In [50]:
records = [{'visit_id': 'T1998/CON1',
 'visit_date': '2013-04-14',
 'replicate_nr': 1,
 'visit_description': 'visit date unknown, please replace placeholder date',
 'survey_name': 'Mallee Woodlands'},
{'visit_id': 'S2012/2',
 'visit_date': '2014-01-10',
 'replicate_nr': 1,
 'visit_description': 'visit date unknown, please replace placeholder date',
 'survey_name': 'Mallee Woodlands'}]

In [51]:
batch_upsert(dbparams,"form.field_visit",records,
             keycol=('visit_id','visit_date'), 
             idx=None,execute=True)

Connecting to the PostgreSQL database...
1 rows updated
Database connection closed.


In [52]:
valid_visits = validate_and_update_site_records(quadrats,dbparams)

Connecting to the PostgreSQL database...
record for S2012/2 is incomplete
record for S2012/2 is incomplete
record for S2012/2 is incomplete
record for S2012/2 is incomplete
record for S2012/2 is incomplete
record for S2012/2 is incomplete
record for S2012/2 is incomplete
record for S2012/2 is incomplete
record for T1998/CON1 is incomplete
record for T1998/CON1 is incomplete
record for T1998/CON1 is incomplete
record for T1998/CON1 is incomplete
526 rows updated
Database connection closed.


In [53]:
len(valid_visits)
#len(quadrats)

91

In [54]:
valid_visits[5]


['K259', datetime.date(2018, 9, 27), 6]
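
Each entry of `valid_visits` appears to hold a visit id, a visit date and a replicate number. A minimal sketch of how such a list could be used to find the visit date for a sample by visit id and replicate number; `build_visit_lookup` is hypothetical, and the actual matching is done inside the `fireveg` functions.

```python
import datetime


def build_visit_lookup(valid_visits):
    """Index visit dates by (visit_id, replicate_nr)."""
    return {(visit_id, replicate_nr): visit_date
            for visit_id, visit_date, replicate_nr in valid_visits}


# Rows based on values that appear in the outputs of this notebook
example_visits = [
    ["K259", datetime.date(2018, 9, 27), 6],
    ["S2007/6", datetime.date(2013, 4, 11), 3],
]
visit_lookup = build_visit_lookup(example_visits)

# Visits that are not in the lookup return None and can be set aside
print(visit_lookup.get(("S2007/6", 3)))
print(visit_lookup.get(("S2012/2", 1)))
```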

In [55]:
records=fv.import_records_from_workbook(inputdir, filename, worksheet, col_dict,
                                         fv.create_quadrat_sample_record,
                                         lookup=valid_visits, 
                                        valid_seedbank=seedbank_vocab, 
                                        valid_organ=organ_vocab)

In [56]:
len(records)

9057

In [57]:
records[555]

{'visit_id': 'S2007/6',
 'sample_nr': 'X1',
 'species': 'Triodia scariosa',
 'visit_date': datetime.date(2013, 4, 11),
 'resprouts_live': 0,
 'resprouts_kill': 0,
 'resprouts_reproductive': 34,
 'comments': ['visit_id originally recorded as S2007/6_X1_3',
  'Imported from workbook PlantFireTraitData_2011-2018_Import.xlsx using python script',
  'Imported from spreadsheet PlantCounts',
  'matched by replicate nr 3, assuming date object',
  'resprout organ written as rhizome short',
  'seedbank written as soil persistent']}

In [58]:
samples = {"A":1,"K":2,"R":3,"N":4,"G":5,"X1":6,"X2":7,"X3":8,
           "AX":9, "KX":10, "RX":11, "NX":12,}
valid_records=list()
invalid_records=list()
for record in records:
    if 'visit_date' in record.keys():
        if record['sample_nr']:
            samplenr=record['sample_nr']
            record['sample_nr']=samples[samplenr]
            record['comments'].append('Original sample code was %s' % samplenr)
        else:
            record['sample_nr']=99
            record['comments'].append('Original sample code/nr was missing')
        valid_records.append(record)
    else:
        invalid_records.append(record)

print("%s valid records and %s invalid records" % (len(valid_records), len(invalid_records)))


8588 valid records and 469 invalid records


In [59]:
batch_upsert(dbparams,table='form.quadrat_samples',
             records=valid_records,
             keycol=('visit_id','visit_date','sample_nr'),
             idx=None, execute=True)

Connecting to the PostgreSQL database...
8588 rows updated
Database connection closed.


## That is it for now!

✅ Job done! 😎👌🔥

You can:
- go [back home](../Instructions-and-workflow.ipynb),
- continue navigating the repo on [GitHub](https://github.com/ces-unsw-edu-au/fireveg-db-exports),
- continue exploring the repo on [OSF](https://osf.io/h96q2/),
- visit the database at <http://fireecologyplants.net>.