# Summary report exported as a workbook

Author: [José R. Ferrer-Paris](https://github.com/jrfep)

This notebook:
- Reads information from the database, and
- Creates a workbook with:
    - Authoring information and instructions
    - Summary table for species with links
    - Trait codes and descriptions
    - Vocabularies
    - List of references

The outputs of this notebook are available as a dataset record at:

> Ferrer-Paris, José R.; Keith, D A (2024). Fire Ecology Traits for Plants: Database exports. figshare. Dataset. https://doi.org/10.6084/m9.figshare.24125088.v1

## Setup

This section covers the basic setup for the project.

### Import modules

In [1]:
# work with paths in operating system
from pathlib import Path
import os
import sys
# datetime support
import datetime

# work with xlsx workbooks
import openpyxl
from openpyxl import Workbook
from openpyxl.worksheet.table import Table, TableStyleInfo
from openpyxl.styles import Alignment, PatternFill, Border, Font
from openpyxl.formatting import Rule
from openpyxl.styles.differential import DifferentialStyle
from openpyxl.worksheet.datavalidation import DataValidation

from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl.utils import get_column_letter

# For database connection
from configparser import ConfigParser
import psycopg2
from psycopg2.extras import DictCursor

# Pandas for calculations
import pandas as pd
# Pyprojroot for easier handling of working directory
import pyprojroot

### Define paths for input and output

In [2]:
repodir = pyprojroot.find_root(pyprojroot.has_dir(".git"))
sys.path.append(str(repodir))
inputdir = repodir / "data" / "output-report"

### Load own functions
Load functions from the `lib` folder. We will use one function to read the database credentials, one to execute database queries, and three to extract data from the reference description string.

In [3]:
from lib.parseparams import read_dbparams
from lib.firevegdb import dbquery
import lib.firevegxport as fvx

## Read information from database

### Database connection parameters
Database credentials are stored in a `database.ini` file.

In [4]:
dbparams = read_dbparams(repodir / 'secrets' / 'database.ini', section='aws-lght-sl')
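The source of `read_dbparams` is not shown in this notebook. As a rough sketch of what such a helper might do, assuming the file uses standard INI sections with connection keys, the following reads one section into a dictionary. The function name `read_params_sketch` and the example file contents are made up for illustration.

```python
from configparser import ConfigParser

# Hypothetical file contents; real credentials stay in secrets/database.ini
EXAMPLE_INI = """
[aws-lght-sl]
host = db.example.org
port = 5432
dbname = exampledb
user = reader
"""

def read_params_sketch(ini_text, section):
    """Return the key/value pairs of one INI section as a dict of strings."""
    parser = ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section(section):
        raise KeyError("Section %s not found" % section)
    return dict(parser.items(section))

params = read_params_sketch(EXAMPLE_INI, "aws-lght-sl")
print(params["host"])
```

A dictionary of this shape could then be unpacked into `psycopg2.connect(**params)`, provided its keys match the connection keywords.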

### Database queries

The table with trait information can be requested with a simple query:

In [5]:
qrystr = "SELECT code,name,description,value_type,life_stage,life_history_process,priority FROM litrev.trait_info ORDER BY code"
trait_info = dbquery(qrystr, dbparams)

To build a table with all the traits, we need to concatenate the results of several queries. Some of these need a custom aggregate function for [handling empty arrays in PostgreSQL](https://stackoverflow.com/questions/43472482/postgres-array-agg-throws-cannot-accumulate-empty-arrays-for-empty-arrays). This function, called `array_accum` in the queries below, has to be defined on the PostgreSQL server side before running them.
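The server-side statement itself is not reproduced in this notebook. As a loose Python analogy only, assuming `array_accum` concatenates the per-row arrays of each group and lets empty arrays contribute nothing (the behaviour suggested by the linked answer), the aggregation can be pictured like this. The function name and the reference labels are made up.

```python
from itertools import chain

def accumulate_arrays(arrays):
    """Concatenate a group of arrays; empty or missing arrays add nothing."""
    return list(chain.from_iterable(a for a in arrays if a))

# Made-up 'original_sources' arrays for three records of one species
group = [["Smith 2001", "Lee 2010"], [], ["Lee 2010"]]
print(accumulate_arrays(group))
```

Duplicates are kept at this stage; they are only removed later, when the reference columns are summarised.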

Now we use a general query string to retrieve selected columns. A for loop runs the query for each categorical trait, and functions from the pandas library merge the results into a single data frame. The trait codes are also the names of the tables in the `litrev` schema:

In [6]:
qry= """
SELECT "currentScientificName" as spp, "currentScientificNameCode" as sppcode,
    array_agg(species) as nspp,
    array_agg(norm_value::text) as val,array_agg(weight) as w,
    array_agg(main_source) as refs,
    array_accum(original_sources) as orefs
    
FROM litrev.{} 
LEFT JOIN species.caps
ON species_code="speciesCode_Synonym"
WHERE "currentScientificName" is not NULL AND main_source is not NULL AND weight>0
GROUP BY spp,sppcode;
"""

# test using only eucalypts
# WHERE species ilike '%euca%' and "currentScientificName" is not NULL AND weight>0

# Start from an empty placeholder so that re-running this cell rebuilds `df` from scratch
df = None
for trait in ['surv1','surv4','repr2','rect2','disp1','germ1','germ8']:
    res = dbquery(qry.format(trait), dbparams)

    df1 = pd.DataFrame(res)
    # Column names for original taxon names, values, weights, main and original references
    col1="%s.txn" % trait
    col2="%s.v" % trait
    col3="%s.w" % trait
    col4="%s.mref" % trait
    col5="%s.oref" % trait

    df1=df1.rename(columns={0:"Species",1:"Code",2:col1,3:col2,4:col3,5:col4,6:col5})
    # Collapse the weighted values into a single summary per species
    df1[trait]=df1.apply(lambda row : fvx.summarise_values(row[col2],row[col3]), axis = 1)
    if df is None:
        df = df1
    else:
        df = pd.merge(df, df1, on = ["Species","Code"], how = "outer").sort_values(by="Species",ascending=True)
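The loop above relies on `fvx.summarise_values`, whose source is not shown here. The text written to the `About` sheet later in this notebook describes the rule: values are ordered from higher to lower weight, categories below 10% of the total weight go in round brackets, and categories below 5% go in square brackets. A hypothetical sketch of that rule follows. The function name, the `/` separator without spaces, and the exclusion of `None` values from the total are all assumptions.

```python
from collections import defaultdict

def summarise_values_sketch(values, weights):
    """Summarise categorical values by their share of the total weight."""
    if not isinstance(values, (list, tuple)):
        return None  # trait missing for this species (NaN after an outer merge)
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        if value is not None:
            totals[value] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return None
    parts = []
    # Highest accumulated weight first
    for value, weight in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        share = weight / grand_total
        if share < 0.05:
            parts.append("[%s]" % value)
        elif share < 0.10:
            parts.append("(%s)" % value)
        else:
            parts.append(str(value))
    return "/".join(parts)

# Made-up categories and weights
print(summarise_values_sketch(["All", "All", "Half", "Few"], [10, 10, 1.5, 0.5]))
# All/(Half)/[Few]
```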

Our dataframe `df` has multiple columns for each trait:

In [7]:
df

| | Species | Code | surv1.txn | surv1.v | surv1.w | surv1.mref | surv1.oref | surv1 | surv4.txn | surv4.v | ... | germ1.w | germ1.mref | germ1.oref | germ1 | germ8.txn | germ8.v | germ8.w | germ8.mref | germ8.oref | germ8 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abelmoschus moschatus subsp. moschatus | 9878 | [Abelmoschus moschatus] | [All] | [1] | [austraits-3.0.2] | [Clarke Lawes Murphy Russell-Smith Nano Bradst... | All | | | ... | | | | | | | | | | |
| 1 | Abildgaardia ovata | 8856 | | | | | | | | | ... | | | | | | | | | | |
| 2 | Abildgaardia vaginata | 9186 | | | | | | | | | ... | | | | | | | | | | |
| 3 | Abrophyllum ornans | 3220 | | | | | | | | | ... | | | | | | | | | | |
| 4 | Abrotanella nivigena | 1246 | [Abrotanella nivigena] | [All] | [1] | [austraits-3.0.2] | [White Sinclair Frood 2020] | All | | | ... | | | | | | | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6679 | Zygophyllum iodocarpum | 6357 | [Roepera iodocarpa] | [None] | [1] | [austraits-3.0.2] | [White Sinclair Frood 2020] | | | | ... | | | | | | | | | | |
| 6680 | Zygophyllum ovatum | 6358 | [Roepera ovata] | [None] | [1] | [austraits-3.0.2] | [White Sinclair Frood 2020] | | | | ... | | | | | | | | | | |
| 6681 | Zygophyllum prismatothecum | 6359 | [Roepera prismatotheca] | [None] | [1] | [austraits-3.0.2] | [White Sinclair Frood 2020] | | | | ... | | | | | | | | | | |
| 6682 | Zygophyllum simile | 10095 | [Roepera similis] | [None] | [1] | [austraits-3.0.2] | [White Sinclair Frood 2020] | | | | ... | | | | | | | | | | |


The query string is different for numerical traits, so we update it and run another loop, this time with the numerical trait codes:

In [8]:
qry= """
SELECT "currentScientificName" as spp, "currentScientificNameCode" as sppcode,
    array_agg(species) as nspp,
    array_agg(best) as best,array_agg(lower) as lower,array_agg(upper) as upper,array_agg(weight) as w,
    array_agg(main_source) as refs,
    array_accum(original_sources) as orefs
FROM litrev.{} 
LEFT JOIN species.caps
ON species_code="speciesCode_Synonym"
WHERE "currentScientificName" is not NULL AND main_source is not NULL AND weight>0
GROUP BY spp,sppcode;
"""

# test using only eucalypts
# WHERE species ilike '%euca%' and "currentScientificName" is not NULL AND weight>0

for trait in ['repr3','repr3a','repr4',]:
    res = dbquery(qry.format(trait), dbparams)
    if len(res)>0:
        df1 = pd.DataFrame(res)
        col1="%s.txn" % trait
        col2="%s.best" % trait
        col3="%s.lower" % trait
        col4="%s.upper" % trait
        col5="%s.w" % trait
        col6="%s.mref" % trait
        col7="%s.oref" % trait
        df1=df1.rename(columns={0:"Species",1:"Code",2:col1,3:col2,4:col3,5:col4,6:col5,7:col6,8:col7})
        df1[trait]=df1.apply(lambda row : fvx.summarise_triplet(row[col2],row[col3],row[col4],row[col5]), axis = 1)
        df = pd.merge(df, df1, on = ["Species","Code"], how = "outer").sort_values(by="Species",ascending=True)
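Both loops grow `df` by outer-merging one trait table at a time on `Species` and `Code`. The toy sketch below shows the same pattern on two made-up trait tables, so the effect of `how="outer"` is easy to see. The species names, codes, values and the helper name `merge_traits` are invented for illustration.

```python
from functools import reduce
import pandas as pd

# Made-up per-trait summaries keyed by species name and code
surv1 = pd.DataFrame({"Species": ["Aa bb", "Cc dd"], "Code": [1, 2], "surv1": ["All", "None"]})
repr3 = pd.DataFrame({"Species": ["Cc dd", "Ee ff"], "Code": [2, 3], "repr3": ["2-4", "5"]})

def merge_traits(frames):
    """Outer-merge trait tables so species missing a trait keep an empty (NaN) cell."""
    merged = reduce(
        lambda left, right: pd.merge(left, right, on=["Species", "Code"], how="outer"),
        frames,
    )
    return merged.sort_values(by="Species").reset_index(drop=True)

combined = merge_traits([surv1, repr3])
print(combined)
```

Species found in only one table are kept, with `NaN` in the columns of the other trait. This is why many cells in the full data frame are empty.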
   


We can now apply these functions to summarise data from multiple columns into a single value:

In [9]:
df['orig_species']=df.apply(lambda row : fvx.unique_taxa(row,'txn'), axis = 1)

In [10]:
df[1494:1495]

| | Species | Code | surv1.txn | surv1.v | surv1.w | surv1.mref | surv1.oref | surv1 | surv4.txn | surv4.v | ... | repr3a | repr4.txn | repr4.best | repr4.lower | repr4.upper | repr4.w | repr4.mref | repr4.oref | repr4 | orig_species |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1494 | Chenopodium desertorum | 2091 | [Chenopodium desertorum, Chenopodium desertoru... | [None, None, None, None, None, None] | [1, 1, 10, 1, 1, 1] | [austraits-3.0.2, austraits-3.0.2, NSWFFRDv2.1... | [Vesk Leishman Westoby 2004, White Sinclair Fr... | | | | ... | | | | | | | | | | Chenopodium desertorum |


In [11]:
df['main_refs']=df.apply(lambda row : fvx.unique_taxa(row,'mref'), axis = 1)

In [12]:
df['orig_refs']=df.apply(lambda row : fvx.unique_taxa(row,'oref'), axis = 1)
df[['orig_species','main_refs','orig_refs']]

| | orig_species | main_refs | orig_refs |
|---|---|---|---|
| 0 | Abelmoschus moschatus | austraits-3.0.2 | Clarke Lawes Murphy Russell-Smith Nano Bradsto... |
| 1 | Abildgaardia ovata | austraits-3.0.2 | Metcalfe 2020 |
| 2 | Abildgaardia vaginata | austraits-3.0.2 | Metcalfe 2020 |
| 3 | Abrophyllum ornans | austraits-3.0.2 | Hughes Rice 1992; Cooper Cooper 2013 |
| 4 | Abrotanella nivigena | austraits-3.0.2 | White Sinclair Frood 2020 |
| ... | ... | ... | ... |
| 6681 | Roepera iodocarpa | austraits-3.0.2 | White Sinclair Frood 2020 |
| 6682 | Roepera ovata | austraits-3.0.2 | White Sinclair Frood 2020 |
| 6683 | Roepera prismatotheca | austraits-3.0.2 | White Sinclair Frood 2020 |
| 6684 | Roepera similis | austraits-3.0.2 | White Sinclair Frood 2020 |
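The three summary columns above come from `fvx.unique_taxa`, whose source is not shown in this notebook. Judging only from the output (distinct entries joined with `; `), a hypothetical equivalent could scan every column with a given suffix and keep each entry once. The function name, the handling of missing cells and the example row are assumptions.

```python
def unique_by_suffix(row, suffix, sep="; "):
    """Collect distinct entries from every list-valued field named '<trait>.<suffix>'."""
    seen = []
    for key, value in row.items():
        if key.endswith("." + suffix) and isinstance(value, (list, tuple)):
            for item in value:
                if item is not None and item not in seen:
                    seen.append(item)
    return sep.join(seen) if seen else None

# Illustrative row; which trait each reference belongs to is made up
row = {
    "surv1.oref": ["Hughes Rice 1992"],
    "germ1.oref": ["Cooper Cooper 2013", "Hughes Rice 1992"],
    "surv4.oref": float("nan"),  # trait not recorded for this species
    "surv1.mref": ["austraits-3.0.2"],
}
print(unique_by_suffix(row, "oref"))
# Hughes Rice 1992; Cooper Cooper 2013
```

Because a pandas row (a `Series`) also provides `.items()`, a function of this shape could be passed to `df.apply(..., axis=1)` in the same way as the calls above.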


And now we can extract a list of valid references from this data frame:

In [13]:
refs=df.apply(lambda row : fvx.extract_refs(row,'mref'), axis = 1)
valid_refs=list()
for x in refs:
    if type(x)==list:
        valid_refs=valid_refs+x
        
refs=df.apply(lambda row : fvx.extract_refs(row,'oref'), axis = 1)
for x in refs:
    if type(x)==list:
        valid_refs=valid_refs+x
   
valid_refs=tuple(set(valid_refs))

And now query the database to include only references in that list:

In [14]:
qrystr = "SELECT ref_code,ref_cite FROM litrev.ref_list WHERE ref_code IN %s ORDER BY ref_code" % (valid_refs,)
ref_info = dbquery(qrystr, dbparams)
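Formatting the tuple straight into the SQL string works when there are two or more codes, but a single code produces a trailing comma that PostgreSQL rejects. A hedged alternative is to build one placeholder per code and pass the values separately, as psycopg2's `cursor.execute(sql, params)` allows. Whether `dbquery` accepts a separate parameter tuple is not known from this notebook, so the helper below (a made-up name) only builds the pieces.

```python
def in_clause_query(codes):
    """Return (sql, params) with one %s placeholder per distinct reference code."""
    codes = sorted(set(codes))
    if not codes:
        raise ValueError("no reference codes to look up")
    placeholders = ", ".join(["%s"] * len(codes))
    sql = (
        "SELECT ref_code, ref_cite FROM litrev.ref_list "
        "WHERE ref_code IN (%s) ORDER BY ref_code" % placeholders
    )
    return sql, tuple(codes)

# The pitfall with a single code: the tuple repr keeps a trailing comma
print("IN %s" % (("austraits-3.0.2",),))
# IN ('austraits-3.0.2',)

sql, params = in_clause_query(["austraits-3.0.2", "austraits-3.0.2"])
print(sql)
print(params)
```

Passing values as parameters also leaves quoting to the database driver, which matters if a reference code ever contains an apostrophe.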

## Create workbook

In [15]:
wb = Workbook()

### Styles
Define styles to be used across the workbook

In [16]:
cent_align=Alignment(horizontal='center', vertical='center', wrap_text=False)
wrap_align=Alignment(horizontal='left', vertical='top', wrap_text=True)

fontSmall = Font(size=9)


sheet_colors = {"intro": "1072BA" , "summary": "5AFF5A", "default":"505050", "addentry": "20CA82"}

table_style={"Instructions":TableStyleInfo(name="TableStyleMedium9", showFirstColumn=True, showLastColumn=False, 
                                           showRowStripes=True, showColumnStripes=False),
             "Contributor": TableStyleInfo(name="TableStyleMedium18", showFirstColumn=True,
                       showLastColumn=False, showRowStripes=False, showColumnStripes=False),
             "Lists": TableStyleInfo(name="TableStyleMedium14", showFirstColumn=True,
                       showLastColumn=False, showRowStripes=False, showColumnStripes=False),
             "Info":  TableStyleInfo(name="TableStyleMedium14", showFirstColumn=True,
                       showLastColumn=False, showRowStripes=False, showColumnStripes=False),
             "Vocabularies": TableStyleInfo(name="TableStyleMedium14", showFirstColumn=True,
                       showLastColumn=False, showRowStripes=False, showColumnStripes=False),
             "Entry": TableStyleInfo(name="TableStyleMedium18", showFirstColumn=False,
                       showLastColumn=False, showRowStripes=False, showColumnStripes=False)

             }




### Create worksheets

In [17]:
wsheets = (
    {"title": "About", "colWidths":[("A",90),("B",40)], "tabColor":"intro","active":True},
    {"title": "Summary", "colWidths":[("A",70),("B",10),(("C","D","E","F","G","H","I","J","K"),30),(("L","M","N",),25)], "tabColor":"summary"},
    {"title": "References", "colWidths":[("A",25),("B",80)], "tabColor":"addentry"},
    {"title": "Trait description", "colWidths":[("A",12),("B",30),("C",70)], "tabColor":"default"}
    )
for item in wsheets:
    if "active" in item.keys():
        ws = wb.active
        ws.title = item['title']
    else:
        ws = wb.create_sheet(item['title'])
    for k in item['colWidths']:
        for j in k[0]:
            ws.column_dimensions[j].width = k[1]
    ws.sheet_properties.tabColor = sheet_colors[item["tabColor"]]
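The inner loop `for j in k[0]` works both for a single letter such as `"A"` and for a tuple of letters, because iterating a one-character string yields that character. It would split a two-letter column such as `"AA"` into two `"A"` entries, though. A small sketch of a more explicit normalisation, using a made-up helper name:

```python
def expand_columns(spec):
    """Return a list of column letters from one letter string or a tuple of them."""
    return [spec] if isinstance(spec, str) else list(spec)

print(list("AA"))                       # ['A', 'A'] - iterating the string splits it
print(expand_columns("AA"))             # ['AA']
print(expand_columns(("L", "M", "N")))  # ['L', 'M', 'N']
```

None of the sheets defined here go beyond column N, so the original loop is fine for this workbook.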


### `About` worksheet

In [18]:
ws = wb["About"]

info = ("Fire Ecology Traits for Plants",
        "Version 1.00 (April 2022)",
        "This data export reflects the status of the database on the %s" % datetime.date.today().strftime('%d %b %Y'),
        "Developed by José R. Ferrer-Paris and David Keith",
        "Centre for Ecosystem Science / University of New South Wales",
        "Please cite this work as:",
        "Ferrer-Paris, J. R. and Keith, D. A. (2024) Fire Ecology Traits for Plants: Database export. figshare. DOI: 10.6084/m9.figshare.24125088", 
        #"DISCLAIMER:",
        #"DATA IS NOT READY FOR FINAL USE OR CRITICAL APPLICATIONS AND YOU SHOULD NOT DISTRIBUTE THIS DATA."
        )

k = 1
for row in info:
    ws.cell(k,1,value=row)
    ws.cell(k,1).alignment=wrap_align
    k=k+1
    
ws.cell(1,1).style='Title'
ws.cell(5,1).hyperlink='https://www.unsw.edu.au/research/ecosystem'
ws.cell(5,1).style='Hyperlink'

# Disclaimer
ws.cell(8,1).font=Font(color="FF0000", bold=True,italic=False) 
ws.cell(9,1).font=Font(color="FF0000", italic=True) 


supporters = ({'institution':"University of New South Wales",'url':"https://www.unsw.edu.au/"},
              {'institution':"NSW Bushfire Research Hub",'url':"https://www.bushfirehub.org/"},
              {'institution':"NESP Threatened Species Recovery Hub",'url':"https://www.nespthreatenedspecies.edu.au/"},
              {'institution':"NSW Department of Planning & Environment",'url':"https://www.planning.nsw.gov.au/"})

k=k+2
ws.cell(k-1,1,value="This work has been supported by:")
for item in supporters:
    cell=ws.cell(k,1)
    cell.value=item['institution']
    cell.hyperlink=item['url']
    cell.style = "Hyperlink"
    k=k+1

k=k+2
description = (
              "Taxonomic nomenclature following BioNET (data export from February 2022)",
              "Data in the report is summarised based on BioNET fields 'currentScientificName' and 'currentScientificNameCode'",
              "For general description of the traits, please refer to the 'Trait description' sheet",
              "Vocabularies for categorical traits are available in the 'Vocabularies' sheet",
              "For categorical traits the values in the 'Summary' sheet show the different values reported in the literature records separated by slashes.",
              "If more than one category has been reported, the values are ordered from higher to lower 'weight', categories receiving less than 10% weight are in round brackets, categories with less than 5% in square brackets",
              "The default weight is calculated by multiplying the number of times a value is reported (nr. of records) with the weight given to each record (default to 1), and divided by the weight of all records for a given species.",
              "Default weights overridden by expert advice to the administrator will be marked, with justification given in the Notes column of the output.",
              "An asterisk (*) in a trait cell indicates a potential data entry error or uncertainty in the assignment of a trait category or value.",
              "'Import/Entry sources' refer to references that were imported directly using automated scripts or manual entry. These include: 1) Primary observations of traits from published research or reports; and 2) Compilations of data (e.g. databases, spreadsheets, published reviews) that include two or more sources of primary observations.",
              "'Indirect sources' refer to references that were cited in Import/Entry sources, where the latter are compilations of multiple primary sources (see Import/Entry sources). Information from indirect sources may have been modified when it was incorporated into those compilations. The original source of primary trait observations has not yet been verified prior to import into this database. When the primary source is reviewed and the trait values are verified, these records will be attributed to the primary source as 'Import/Entry sources'.",
              "Some sheets are protected to avoid accidental changes, but they are not password protected. If you need to filter and reorder entries in the table, please unprotect the sheet first.",
              )

for row in description:
    ws.cell(k,1,value=row)
    ws.cell(k,1).alignment=wrap_align
    k=k+1
    
ws.protection.sheet = True

### `Trait description` worksheet

In [19]:
ws = wb["Trait description"]

k=1
description = ("The following table gives a general description of the traits used in the 'Summary' sheet",
               "This sheet is protected to avoid accidental changes, but it is not password protected. If you need to filter and reorder entries in the table, please unprotect the sheet first.",
              "Vocabularies for categorical traits are available in the 'Vocabularies' sheet","","")

for row in description:
    ws.cell(k,3,value=row)
    ws.cell(k,3).alignment=wrap_align
    k=k+1
    

ws.append(["Trait Code", "Trait Name", "Description", "Type", "Life stage", "Life history process", "Data migration"])

for row in trait_info:
    ws.append(row)
    
#ws.max_row
for j in range(k,ws.max_row+1):
    ws.cell(j,3).alignment=wrap_align
    
tab = Table(displayName="TraitInformation", ref="A{}:G{}".format(k,ws.max_row))

tab.tableStyleInfo = table_style["Info"]
ws.add_table(tab)
ws.protection.sheet = True
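The table reference in this cell depends on where the header row lands after the introductory notes. A minimal in-memory sketch of the same pattern follows: notes first, then a header and rows added with `append`, then a `Table` spanning from the header row to the last row. The sheet contents and table name are made up.

```python
from openpyxl import Workbook
from openpyxl.worksheet.table import Table, TableStyleInfo

wb_demo = Workbook()
ws_demo = wb_demo.active

# Two note rows above the table, as in the sheets of this notebook
ws_demo.cell(1, 1, value="Note line 1")
ws_demo.cell(2, 1, value="Note line 2")

header_row = ws_demo.max_row + 1  # first free row will hold the header
ws_demo.append(["Trait Code", "Trait Name"])
ws_demo.append(["demo1", "Made-up trait"])
ws_demo.append(["demo2", "Another made-up trait"])

# The table spans from the header row to the last filled row
ref = "A{}:B{}".format(header_row, ws_demo.max_row)
tab_demo = Table(displayName="DemoTable", ref=ref)
tab_demo.tableStyleInfo = TableStyleInfo(name="TableStyleMedium14", showRowStripes=False)
ws_demo.add_table(tab_demo)

ws_demo.protection.sheet = True  # protected, but without a password
print(ref)
```

Computing the header row from `max_row` just before appending avoids keeping a separate counter in step with the number of note lines.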

### `Summary` worksheet

In [20]:

ws = wb["Summary"]
ws.append(['Species','Code','surv1','surv4','germ1','germ8','rect2','repr2','repr3','repr3a','disp1','Original Species name(s) used','Import/Entry sources','Indirect sources'])
rows = dataframe_to_rows(df[['Species','Code','surv1','surv4','germ1','germ8','rect2','repr2','repr3','repr3a','disp1','orig_species','main_refs','orig_refs']],index=False, header=False)


for r_idx, row in enumerate(rows, 2):
    for c_idx, value in enumerate(row, 1):
        ws.cell(row=r_idx, column=c_idx, value=value)
    
    for k in (12,13,14):
        ws.cell(r_idx,k).alignment=wrap_align
        ws.cell(r_idx,k).font = fontSmall

    
tab = Table(displayName="Summary", ref="A1:{}{}".format(get_column_letter(c_idx),r_idx))
tab.tableStyleInfo = table_style["Lists"]
ws.add_table(tab)


### `References` worksheet

In [21]:
ws = wb["References"]

k=1
description = ("The following table includes bibliographical information for the sources referenced in the 'Summary' sheet",
               "This sheet is protected to avoid accidental changes, but it is not password protected. If you need to filter and reorder entries in the table, please unprotect the sheet first.",
              "","")

for row in description:
    ws.cell(k,2,value=row)
    ws.cell(k,2).alignment=wrap_align
    k=k+1
    

ws.append(["Reference code", "Reference information"])

for row in ref_info:
    ws.append(row)
    
#ws.max_row
for j in range(k+1,ws.max_row+1):
    ws.cell(j,2).alignment=wrap_align
    ws.cell(j,2).font = fontSmall
    
tab = Table(displayName="ReferenceInformation", ref="A{}:B{}".format(k,ws.max_row))

tab.tableStyleInfo = table_style["Lists"]
ws.add_table(tab)
ws.protection.sheet = True

### Save workbook

In [22]:
wb.save(inputdir / "fireveg-trait-report-model.xlsx")
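Saving fails if the target folder does not exist, since writing the file does not create missing parent folders. A small sketch, using a made-up helper name and a temporary folder in place of the repository path, creates the folder before building the output path:

```python
from pathlib import Path
import tempfile

def prepare_output_path(directory, filename):
    """Create the output folder if needed and return the full file path."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    return directory / filename

# Temporary folder standing in for the repository root
with tempfile.TemporaryDirectory() as tmp:
    target = prepare_output_path(Path(tmp) / "data" / "output-report", "demo-report.xlsx")
    print(target.parent.is_dir())
```

The returned path could be passed to `wb.save` in place of the expression used above.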