# Read files summarising field work and update database
These Excel workbooks were imported in February 2022.

The scripts documented here have been created to:

- Read data from spreadsheets with field-work data
- Create records for data import into the database
- Insert or update records in the database

This Jupyter notebook covers the first step: importing field site and visit information. A second notebook imports information from `quadrats`.

## Set-up
Load libraries 

In [1]:
import openpyxl
from pathlib import Path
import os
from datetime import datetime
from configparser import ConfigParser
import psycopg2
from psycopg2.extensions import AsIs
import pyprojroot

Load functions from the `lib` folder. We will use one function to read the database credentials and another for batch inserts and updates:

In [2]:
from lib.parseparams import read_dbparams
from lib.firevegdb import batch_upsert
import lib.fireveg as fv

Define path to workbooks

In [3]:
repodir = pyprojroot.find_root(pyprojroot.has_dir(".git"))
inputdir = repodir / "data" / "input-field-form"

Database credentials are stored in a database.ini file

In [4]:
dbparams = read_dbparams(repodir / 'secrets' / 'database.ini', section='aws-lght-sl')
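
The code of `read_dbparams` is not shown in this notebook. As a rough idea of what such a helper may look like, here is a minimal sketch that reads one section of an INI file into a dictionary. The name `read_dbparams_sketch` and the default section name are assumptions for illustration, not the actual implementation in `lib.parseparams`.

```python
from configparser import ConfigParser
from pathlib import Path


def read_dbparams_sketch(filename, section="postgresql"):
    """Return the key/value pairs of one section of an INI file as a dict.

    Hypothetical helper: the real lib.parseparams.read_dbparams may differ.
    """
    # interpolation=None keeps '%' characters (e.g. in passwords) untouched
    parser = ConfigParser(interpolation=None)
    # ConfigParser.read skips unreadable files and returns the list it did read
    if not parser.read(Path(filename)):
        raise FileNotFoundError(f"Cannot read {filename}")
    if not parser.has_section(section):
        raise KeyError(f"Section {section!r} not found in {filename}")
    return dict(parser.items(section))
```

The resulting dictionary can be unpacked into the connection call, as in `psycopg2.connect(**dbparams)`.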

## Database query for observer ids
Check list of observer ids:

In [5]:
# connect to the PostgreSQL server
print('Connecting to the PostgreSQL database...')
conn = psycopg2.connect(**dbparams)
cur = conn.cursor()

cur.execute("SELECT userkey,givennames,surname FROM form.observerid;")
observerid = cur.fetchall()
cur.close()
        
if conn is not None:
    conn.close()
    print('Database connection closed.')

Connecting to the PostgreSQL database...
Database connection closed.
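
In the cell above the connection stays open if the query raises an error. A possible alternative is to wrap the connection and cursor in `contextlib.closing`, so both are closed in every case. This is only a sketch: `fetch_all` is a hypothetical helper, and an in-memory SQLite database stands in for the PostgreSQL server so that the example runs without credentials.

```python
import sqlite3
from contextlib import closing


def fetch_all(connect, query):
    """Run a read-only query and always close the cursor and the connection."""
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(query)
            return cur.fetchall()


# Stand-in for psycopg2.connect(**dbparams): an in-memory SQLite database
rows = fetch_all(lambda: sqlite3.connect(":memory:"), "SELECT 7, 'David', 'Keith'")
print(rows)
```

With psycopg2 the first argument would be `lambda: psycopg2.connect(**dbparams)` and the query would be the `SELECT` on `form.observerid` used above.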


In [6]:
observerid

[(7, 'David', 'Keith'),
 (9, 'D.', 'Benson'),
 (10, 'L.', 'Watts,'),
 (11, 'T.', 'Manson'),
 (12, 'Jackie', 'Miles'),
 (13, 'Robert', 'Kooyman'),
 (8, 'Alexandria', 'Thomsen'),
 (14, 'Jedda', 'Lemmen')]

## Functions to read records from workbooks
Each spreadsheet has a slightly different structure, so these scripts have to be adapted for each case.

We use functions from module `fireveg` to read the data and create records, and functions from module `firevegdb` to execute the SQL insert or update query.

### List of workbooks/spreadsheets in directory

In [7]:
os.listdir(inputdir)

['PlantFireTraitData_2011-2018_Import.xlsx',
 'SthnNSWRF_data_bionet2.xlsx',
 'UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton.xlsx',
 'UNSWFireVegResponse_UplandBasalt_AlexThomsen+DK.xlsx',
 'UNSW_VegFireResponse_KNP AlpAsh_firehistupdate.xlsx',
 'RobertsonRF_data_bionet2.xlsx',
 'UNSW_VegFireResponse_RMK_reformat_Sep2021a.xlsx',
 'UNSW_VegFireResponse_AlpineBogs_reformat_Sep2021.xlsx',
 'UNSW_VegFireResponse_KNP AlpAsh.xlsx',
 'Fire response quadrat survey Newnes Nov2020_DK_revised IDs+AllNovData.xlsm',
 'UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton_revisedfields_Mar2022.xlsx']

In [8]:
valid_files = ['PlantFireTraitData_2011-2018_Import.xlsx',
               'SthnNSWRF_data_bionet2.xlsx',
               'UNSWFireVegResponse_UplandBasalt_AlexThomsen+DK.xlsx',
               'UNSW_VegFireResponse_RMK_reformat_Sep2021a.xlsx',
               'UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton_revisedfields_Mar2022.xlsx',
               'UNSW_VegFireResponse_KNP AlpAsh_firehistupdate.xlsx',
               'UNSW_VegFireResponse_AlpineBogs_reformat_Sep2021.xlsx',
               'RobertsonRF_data_bionet2.xlsx',
               'Fire response quadrat survey Newnes Nov2020_DK_revised IDs+AllNovData.xlsm']

Here we create an index of worksheets and column headers for each file

In [9]:
wbindex=dict()
for workbook_name in valid_files:
    inputfile=inputdir / workbook_name
    # using data_only=True to get the calculated cell values
    wb = openpyxl.load_workbook(inputfile,data_only=True)
    wbindex[workbook_name]=dict()
    for ws in wb.worksheets:
        # use the public `title` property instead of the name-mangled private attribute
        wbindex[workbook_name][ws.title]=list()
        for k in range(1,ws.max_column):
            wbindex[workbook_name][ws.title].append(ws.cell(row=1,column=k).value)
        

## Processing data from all workbooks

The following sections process each workbook in turn, using functions defined in the `fireveg` and `firevegdb` modules.

Here is the list of available workbooks (again):

In [10]:
wbindex.keys()

dict_keys(['PlantFireTraitData_2011-2018_Import.xlsx', 'SthnNSWRF_data_bionet2.xlsx', 'UNSWFireVegResponse_UplandBasalt_AlexThomsen+DK.xlsx', 'UNSW_VegFireResponse_RMK_reformat_Sep2021a.xlsx', 'UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton_revisedfields_Mar2022.xlsx', 'UNSW_VegFireResponse_KNP AlpAsh_firehistupdate.xlsx', 'UNSW_VegFireResponse_AlpineBogs_reformat_Sep2021.xlsx', 'RobertsonRF_data_bionet2.xlsx', 'Fire response quadrat survey Newnes Nov2020_DK_revised IDs+AllNovData.xlsm'])

If we select one workbook, we can retrieve a list of column names that we will use in our column definitions for each function:

In [11]:
cols=wbindex['UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton_revisedfields_Mar2022.xlsx']['Site']
for k in range(1,len(cols)):
    print("%s :: %s" % (k-1,cols[k-1]))
               

0 :: Site
1 :: Replicate
2 :: Observers (comma sep if >1)
3 :: Date of samping
4 :: Survey Date Replicate 1
5 :: Survey Date Replicate 2
6 :: Survey Date Replicate 3
7 :: Survey Date Replicate 4
8 :: Survey Date Replicate 5
9 :: Survey Date Replicate 6
10 :: Location text
11 :: Zone
12 :: Easting
13 :: Northing
14 :: GPS Precision (m)
15 :: Latitude
16 :: Longitude
17 :: Layout & GPS marker position
18 :: 2nd ref point Zone
19 :: 2nd ref point Easting
20 :: 2nd ref point Northing
21 :: 2nd ref point Position of GPS
22 :: 3rd ref point Zone
23 :: 3rd ref point Easting
24 :: 3rd ref point Northing
25 :: 3rd ref point Position of GPS
26 :: 4th ref point Zone
27 :: 4th ref point Easting
28 :: 4th ref point Northing
29 :: 4th ref point Position of GPS
30 :: Total sample area (sq.m)
31 :: Subquadrat area (sq.m)
32 :: # subquadrats
33 :: Substrate
34 :: Notes
35 :: Slope
36 :: Aspect
37 :: Elevation
38 :: Disturbance notes
39 :: Cwth TEC
40 :: NSW TEC
41 :: variant
42 :: Vegetation formation
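
Note that the loop above runs over `range(1,len(cols))`, so it stops one entry short of the end of `cols`. A sketch of an equivalent listing with `enumerate`, which covers every header and yields the zero-based index directly, could look like this. The helper name `format_column_index` is an assumption for illustration.

```python
def format_column_index(headers):
    """Return one 'index :: header' line per column, using zero-based indices."""
    return ["%s :: %s" % (k, name) for k, name in enumerate(headers)]


# A few headers copied from the listing above
example_headers = ["Site", "Replicate", "Observers (comma sep if >1)", "Date of samping"]
for line in format_column_index(example_headers):
    print(line)
```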


### Import records to database
Here I define one function that calls functions from the modules to turn the data in a workbook into records and then import those records into the database.

This function was renamed to `import_site_and_visit_records`.

It passes its keyword arguments (`**kwargs`) on to `fv.import_records_from_workbook`. This works because both `create_record_function`s share a similar structure, so the column correspondence can be defined in a single dictionary, as the examples below show:


In [12]:
def import_site_and_visit_records(**kwargs):
    records = fv.import_records_from_workbook(**kwargs,create_record_function=fv.create_field_site_record) 
    # function to create upsert queries with plain substitution to handle geom string
    batch_upsert(dbparams,"form.field_site",records,keycol=('site_label',), idx='field_site_pkey1',execute=True)
    
    records = fv.import_records_from_workbook(**kwargs,create_record_function=fv.create_field_visit_record) 
    # this should work also without problem
    batch_upsert(dbparams,"form.field_visit",records,keycol=('visit_id','visit_date'), idx='field_visit_pkey2',execute=True)

The column listing shown earlier helps us determine which column number corresponds to each field that we want to extract from the spreadsheet.
Check the number of rows updated in each case, and compare the changes in the database.
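
To illustrate the `**kwargs` forwarding used in `import_site_and_visit_records`, here is a self-contained toy version. All functions below are hypothetical stand-ins with simplified signatures; the real functions in `lib.fireveg` are not shown in this notebook and will differ.

```python
def create_site_record(row, cols):
    """Hypothetical stand-in for fv.create_field_site_record."""
    return {"site_label": row[cols["site_label"]]}


def create_visit_record(row, cols):
    """Hypothetical stand-in for fv.create_field_visit_record."""
    return {"visit_id": row[cols["site_label"]],
            "visit_date": row[cols["visit_date"]]}


def records_from_rows(rows, col_dictionary, create_record_function):
    """Hypothetical stand-in for fv.import_records_from_workbook."""
    return [create_record_function(row, col_dictionary) for row in rows]


def import_both(**kwargs):
    # The same keyword arguments are forwarded twice; only the callback changes
    sites = records_from_rows(**kwargs, create_record_function=create_site_record)
    visits = records_from_rows(**kwargs, create_record_function=create_visit_record)
    return sites, visits


sites, visits = import_both(
    rows=[("AlpAsh_68", "2021-04-12"), ("AlpAsh_19", "2021-04-13")],
    col_dictionary={"site_label": 0, "visit_date": 1},
)
```

Because both record functions read from the same dictionary, each one simply ignores the keys it does not need.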

### Upland / Basalt

- 28 sites:
    - all with location description and coordinates, 
    - elevation data for all but three.
- 42 visits:
    - all visited by Alexandria Thomsen
    - most recent visit in 2021
    - older visits include dates from the 1990s (needs checking)

In [13]:
import_site_and_visit_records(filepath=inputdir,
            workbook='UNSWFireVegResponse_UplandBasalt_AlexThomsen+DK.xlsx',
            worksheet='Site',
            col_dictionary={'site_label':0, 'location_description':10,'utm_zone':11, 'xs':(12,), 'ys':(13,),
                 'gps_uncertainty_m':14,
                 'gps_geom_description':17,
                 'observerlist':3,'replicate_nr':1,
                 'elevation':38, 'visit_date':(2,4,5,6,7,8,9),
                 'survey':"UplandBasalt"})

Connecting to the PostgreSQL database...
28 rows updated
Database connection closed.
Connecting to the PostgreSQL database...
42 rows updated
Database connection closed.
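
The `visit_date` entry in the column dictionary lists several column positions, which explains why 28 sites can produce 42 visits. How `fireveg` handles this internally is not shown here. As a sketch under the assumption that every non-empty cell in those columns becomes one visit date, a hypothetical helper `collect_values` could gather them as follows.

```python
from datetime import datetime


def collect_values(row, positions):
    """Return the non-empty cells of `row` found at the given column positions."""
    return [row[k] for k in positions if k < len(row) and row[k] is not None]


# Made-up row: site, replicate, observer, then six date columns (mostly empty)
row = ["Site A", 1, "A. Observer",
       datetime(2021, 4, 12), None, datetime(2021, 10, 3), None, None, None]
dates = collect_values(row, range(3, 9))
```

Both a tuple such as `(2,4,5,6,7,8,9)` and a `range(3,9)` work as `positions`, since the helper only iterates over them.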


### Rainforest in NE NSW / SE Qld

- 17 sites:
    - all with location description and coordinates, 
    - all but one with elevation data
- 17 visits: 
    - each site visited once between September 2020 and August 2021
    - main observer is Robert Kooyman, except one by T. Manson


In [14]:
import_site_and_visit_records(filepath=inputdir,
            workbook='UNSW_VegFireResponse_RMK_reformat_Sep2021a.xlsx',
            worksheet='Site',
            col_dictionary={'site_label':0,'location_description':10, 
                            'utm_zone':11,'xs':(12,), 'ys':(13,), 'elevation':37, 
                            'gps_uncertainty_m':14, 'gps_geom_description':17,
                            'visit_date':range(3,9), 'replicate_nr':1,'observerlist':2,'survey':"Rainforests NSW-Qld"})

Connecting to the PostgreSQL database...
17 rows updated
Database connection closed.
Connecting to the PostgreSQL database...
17 rows updated
Database connection closed.


### Mallee data (prepared May 2023)

- 56 sites/visits in the period 2011 to 2018
- This was prepared in May 2023 and is kept in a separate notebook.


### Southern NSW rain forest

- 5 sites, all with location description and coordinates, elevation data missing for four sites
- each site visited once, November/December 2021, main observer is David Keith

In [24]:
import_site_and_visit_records(filepath=inputdir,
            workbook='SthnNSWRF_data_bionet2.xlsx',
            worksheet='Site',
            col_dictionary={'site_label':0,'location_description':10, 'visit_date':range(3,9), 
                'lons':(16,), 'lats':(15,), 'elevation':37,
                 'gps_uncertainty_m':14,
                 'gps_geom_description':17,
                 'observerlist':2,'replicate_nr':1,
                 'survey':"SthnNSWRF"})

Connecting to the PostgreSQL database...
5 rows updated
Database connection closed.
Connecting to the PostgreSQL database...
5 rows updated
Database connection closed.


### KNP Alpine Ash
- Edited Site sheet to remove two sites from 'Robertson RF' and fix one entry in date column
- 8 sites  
    - all with coordinates and elevation
- All sites visited once
    - all sites visited in April 2021
    - Main observer is Jackie Miles

In [27]:
coldict={'site_label':0,'location_description':10, 'visit_date':range(3,9), 
               'utm_zone':11, 'xs':(12,), 'ys':(13,), 'elevation':37,
                 'gps_uncertainty_m':14,
                 'gps_geom_description':17,
                 'observerlist':2,'replicate_nr':1,
                 'survey':"KNP AlpAsh"}

import_site_and_visit_records(filepath=inputdir,
            workbook='UNSW_VegFireResponse_KNP AlpAsh_firehistupdate.xlsx',
            worksheet='Site',
            col_dictionary=coldict)



Connecting to the PostgreSQL database...
8 rows updated
Database connection closed.
Connecting to the PostgreSQL database...
8 rows updated
Database connection closed.


In [28]:
wb = openpyxl.load_workbook(inputdir / 'UNSW_VegFireResponse_KNP AlpAsh_firehistupdate.xlsx', data_only=True)
ws = wb['Site']
for row in ws:
    print("%s : %s" % (row[0].value, row[3].value))

Site : Date of samping
AlpAsh_68 : 2021-04-12 00:00:00
AlpAsh_19 : 2021-04-13 00:00:00
AlpAsh_26 : 2021-04-14 00:00:00
AlpAsh_70 : 2021-04-15 00:00:00
AlpAsh_25 : 2021-04-15 00:00:00
AlpAsh_18 : 2021-04-15 00:00:00
AlpAsh_40 : 2021-04-16 00:00:00
AlpAsh_69 : 2021-04-16 00:00:00


### Alpine bogs

- Six sites, all with full information (description, elevation, coords)
- All sites visited once in 2021:
    - two sites by Jackie Miles in March
    - four sites by David Keith between October and December

In [29]:
import_site_and_visit_records(filepath=inputdir,
            workbook='UNSW_VegFireResponse_AlpineBogs_reformat_Sep2021.xlsx',
            worksheet='Site',
            col_dictionary={'site_label':0,'location_description':10, 'visit_date':range(3,9), 
               'utm_zone':11, 'xs':(12,), 'ys':(13,), 'elevation':37,
                 'gps_uncertainty_m':14,
                 'gps_geom_description':17,
                 'observerlist':2,'replicate_nr':1,
                 'survey':"Alpine Bogs"})


Connecting to the PostgreSQL database...
6 rows updated
Database connection closed.
Connecting to the PostgreSQL database...
6 rows updated
Database connection closed.


### Robertson RF 

- Two sites
    - both included in file for KNP AlpAsh? duplicated codes? same entries?
    - both with full information
- Three visits
    - both sites visited in January 2021
    - one site visited in August 2002 ?
    - three different main observers: David Keith, Robert Kooyman and T. Manson

In [30]:
import_site_and_visit_records(filepath=inputdir,
            workbook='RobertsonRF_data_bionet2.xlsx',
            worksheet='Site',
            col_dictionary={'site_label':0,'location_description':10, 'visit_date':range(3,9), 
               'utm_zone':11, 'xs':(12,), 'ys':(13,), 'elevation':37,
                 'gps_uncertainty_m':14,
                 'gps_geom_description':17,
                 'observerlist':2,'replicate_nr':1,
                 'survey':"Robertson RF"})

Connecting to the PostgreSQL database...
2 rows updated
Database connection closed.
Connecting to the PostgreSQL database...
3 rows updated
Database connection closed.


### Newnes
This file has a different format and includes many empty site records, so the functions had to be tweaked and some troubleshooting was needed.
Processing is much slower for this file.

- 20 sites
    - description missing for six sites
    - all with elevation and coordinates
- 54 visits (!)
    - each site visited two or three times 
    - visits between 2020 and 2021
    - observer information is missing or incomplete in most visits

In [31]:
filename='Fire response quadrat survey Newnes Nov2020_DK_revised IDs+AllNovData.xlsm'
col_definitions={'site_label':0, 'visit_date':(8,), 'fixed_utm_zone':56, 'xs':(1,), 'ys':(2,), 'elevation':4, 'survey':"NEWNES"}

import_site_and_visit_records(filepath=inputdir,
            workbook=filename,
            worksheet='Site',
            col_dictionary=col_definitions)


Connecting to the PostgreSQL database...
20 rows updated
Database connection closed.
Connecting to the PostgreSQL database...
54 rows updated
Database connection closed.
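
One way to deal with empty site records is to drop them before any record is created. The sketch below assumes that an empty record is one whose site label cell is missing or blank. The helper `drop_empty_rows` and the example labels are hypothetical, and this is not necessarily how the `fireveg` functions were adapted.

```python
def drop_empty_rows(rows, label_col=0):
    """Keep only rows whose site label cell is present and not blank."""
    kept = []
    for row in rows:
        label = row[label_col]
        if label is None or str(label).strip() == "":
            continue
        kept.append(row)
    return kept


# Made-up rows: two valid labels, one missing label and one blank label
example_rows = [("SITE_A", 1), (None, 2), ("   ", 3), ("SITE_B", 4)]
valid_rows = drop_empty_rows(example_rows)
```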


### Yatteyattah 
This workbook had no 'Site' worksheet. I had to reformat the data from 'Sample' and add the worksheet to make it work, and also changed the format of the "date of sampling" column.

- 7 sites, all with full information 
- 7 visits
    - two sites visited in July 2020
    - five sites visited in February 2021
    - all visits by Jackie Miles

In [33]:
filename='UNSW_VegFireResponse_DataEntry_Yatteyattah all +DK +Milton_revisedfields_Mar2022.xlsx'
col_definitions={'site_label':0,'location_description':10, 'utm_zone':11,'elevation':37, 'visit_date':range(3,9), 
                'xs':(12,), 'ys':(13,),
                 'gps_uncertainty_m':14,
                 'gps_geom_description':17,
                 'observerlist':2,'replicate_nr':1,
                 'survey':"Yatteyattah"}
import_site_and_visit_records(filepath=inputdir,
            workbook=filename,
            worksheet='Site',
            col_dictionary=col_definitions)


Connecting to the PostgreSQL database...
7 rows updated
Database connection closed.
Connecting to the PostgreSQL database...
7 rows updated
Database connection closed.


### Sites without survey

Some sites / visits in the database have missing information:

Site | Date | samples | species
---|---|---|---
BER1 | 2020-02-06 | 1 | No data added
BER2 | 2020-02-25 | 1 | No data added
BS1 | 2020-06-03 | 20 | No data added
Duffy | 2020-01-14 | 1 | No data added
Ka1 | 2020-01-07 | 1 | No data added
Ka3 | 2020-01-09 | 1 | No data added
Ka3_b | 2020-01-08 | 1 | No data added
Ka4 | 2020-02-04 | 1 | No data added
Ka5 | 2020-02-05 | 2 | No data added
LC1 | 2020-01-15 | 1 | No data added
LC2 | 2020-02-14 | 1 | No data added
Madden1 | 2020-02-26 | 1 | No data added
R0Y005 | 2019-12-12 | 2 | No data added
ROY001 | 2019-10-25 | 2 | No data added
ROY002 | 2019-10-25 | 1 | No data added
ROY003 | 2019-12-05 | 1 | No data added
ROY004 | 2019-12-11 | 1 | No data added
SCCJB13 | 2020-12-07 | 4 | No data added
SCCJB37-Near | 2020-12-07 | 4 | No data added
UppClydeRF1 | 2021-11-29 | 4 | 27
UppClydeRF1 | 2021-12-01 | 1 | 1

### Fill main observer id

We run this after the import to translate the list of observers into an integer value for the main observer.

In [34]:

print('Connecting to the PostgreSQL database...')
conn = psycopg2.connect(**dbparams)
cur = conn.cursor()

updated_rows=0
qrystr = "UPDATE form.field_visit set mainobserver=%s WHERE observerlist[1]='%s' AND mainobserver is NULL;"
for k in observerid:
    qry = qrystr % (k[0]," ".join(k[1:3]))
    print(qry)
    cur.execute(qry)
    if cur.rowcount > 0:
        updated_rows = updated_rows + cur.rowcount
        print("%s rows updated" % updated_rows)

cur.close()
conn.commit()
        
if conn is not None:
    conn.close()
    print('Database connection closed.')

Connecting to the PostgreSQL database...
UPDATE form.field_visit set mainobserver=7 WHERE observerlist[1]='David Keith' AND mainobserver is NULL;
UPDATE form.field_visit set mainobserver=9 WHERE observerlist[1]='D. Benson' AND mainobserver is NULL;
UPDATE form.field_visit set mainobserver=10 WHERE observerlist[1]='L. Watts,' AND mainobserver is NULL;
UPDATE form.field_visit set mainobserver=11 WHERE observerlist[1]='T. Manson' AND mainobserver is NULL;
UPDATE form.field_visit set mainobserver=12 WHERE observerlist[1]='Jackie Miles' AND mainobserver is NULL;
1 rows updated
UPDATE form.field_visit set mainobserver=13 WHERE observerlist[1]='Robert Kooyman' AND mainobserver is NULL;
UPDATE form.field_visit set mainobserver=8 WHERE observerlist[1]='Alexandria Thomsen' AND mainobserver is NULL;
UPDATE form.field_visit set mainobserver=14 WHERE observerlist[1]='Jedda Lemmen' AND mainobserver is NULL;
Database connection closed.
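
The cell above builds each query by string substitution, which breaks if a name contains a quote character. A safer variant keeps the `%s` placeholders in the query and passes the values separately, so that psycopg2 does the quoting. The sketch below only prepares the query/parameter pairs, so it runs without a database. The names `UPDATE_MAINOBSERVER` and `build_observer_updates` are assumptions for illustration.

```python
UPDATE_MAINOBSERVER = (
    "UPDATE form.field_visit SET mainobserver=%s "
    "WHERE observerlist[1]=%s AND mainobserver IS NULL;"
)


def build_observer_updates(observers):
    """Return (query, parameters) pairs, one per (userkey, givennames, surname)."""
    return [
        (UPDATE_MAINOBSERVER, (userkey, " ".join((given, surname))))
        for userkey, given, surname in observers
    ]


updates = build_observer_updates([(7, 'David', 'Keith'), (12, 'Jackie', 'Miles')])
# With an open psycopg2 cursor `cur`, each pair would be run as:
#     cur.execute(query, params)
```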


In [32]:
updated_rows

0