# Fireveg DB imports -- Compare content of database between versions

Author: [José R. Ferrer-Paris](https://github.com/jrfep)

Date: 20 August 2024

This Jupyter Notebook includes [Python](https://www.python.org) code to check the status of the Fireveg database.

We will compare between two versions with different connection parameters:
- **version 1.0** 
- **version 1.1**


**Please note:**
<div class="alert alert-warning">
    This repository contains code that is intended for internal project management and is documented for the sake of reproducibility.<br/>
    🛂 Only users contributing directly to the project have access to the credentials for data download/upload. 
</div>

## Set-up
### Load modules

In [1]:
# work with paths in operating system
from pathlib import Path
import os, sys
import pandas as pd
from IPython.display import display, Markdown
# Pyprojroot for easier handling of working directory
import pyprojroot

### Define paths for input and output

Define project directory using the `pyprojroot` functions, and add this to the execution path.

In [2]:
repodir = pyprojroot.find_root(pyprojroot.has_dir(".git"))
sys.path.append(str(repodir))

### Load own functions
Load functions from `lib` folder, we will use a function to read db credentials, one for executing database queries and three functions for extracting data from the reference description string

In [3]:
from lib.parseparams import read_dbparams
from lib.firevegdb import dbquery

### Database credentials

🤫 We use a folder named "secrets" to keep the credentials for connection to different services (database credentials, API keys, etc). This checked this folder in our `.gitignore` so that its content are not tracked by git and not exposed. Future users need to copy the contents of this folder manually.

We read database credentials stored in a `database.ini` file using our own `read_dbparams` function.

Compare versions v1.0 and v1.1.

In [4]:
db_v1_0 = read_dbparams(repodir / 'secrets' / 'database.ini', 
                         section='fireveg-db-v1.0')
db_v1_1 = read_dbparams(repodir / 'secrets' / 'database.ini', 
                         section='fireveg-db-v1.1')

## What is in the database

### Field work
Here we query all visits and samples:

In [5]:
qry = """
SELECT survey_name,visit_id,visit_date,CONCAT(givennames,' ', surname) as observer,
COUNT(distinct sample_nr) as samples,
COUNT(distinct record_id) as records
FROM form.field_visit
LEFT JOIN form.observerid
 ON mainobserver=userkey
LEFT JOIN form.field_samples
 USING(visit_id,visit_date)
LEFT JOIN form.quadrat_samples
 USING(visit_id,visit_date,sample_nr)
GROUP BY survey_name,visit_id,visit_date,observer
;
"""

colnames=['survey','site_label','visit_date','observer','samples','records']


In [6]:
res = dbquery(qry,db_v1_0)
site_visits_v1_0 = pd.DataFrame(res,columns=colnames)
res = dbquery(qry,db_v1_1)
site_visits_v1_1 = pd.DataFrame(res,columns=colnames)


In [10]:
site_visits_v1_1

Unnamed: 0,survey,site_label,visit_date,observer,samples,records
0,KNP AlpAsh,AlpAsh_18,2021-04-15,Jackie Miles,4,118
1,KNP AlpAsh,AlpAsh_19,2021-04-13,Jackie Miles,4,49
2,KNP AlpAsh,AlpAsh_25,2021-04-15,Jackie Miles,4,102
3,KNP AlpAsh,AlpAsh_26,2021-04-14,Jackie Miles,4,108
4,KNP AlpAsh,AlpAsh_40,2021-04-16,Jackie Miles,4,94
...,...,...,...,...,...,...
240,Rainforests NSW-Qld,TOOL_1_UNSW,2020-12-02,Robert Kooyman,6,165
241,SthnNSWRF,UppClyde1,2021-11-29,David Keith,0,0
242,UplandBasalt,WINGE115,2016-01-08,Alexandria Thomsen,0,0
243,UplandBasalt,WINGE115,2020-11-12,Alexandria Thomsen,3,45


We now print a summary of all visits with vegetation records:

In [11]:
ss10=site_visits_v1_0[site_visits_v1_0.records>0]
ss11=site_visits_v1_1[site_visits_v1_1.records>0]

In [18]:
msg="""
**Version {}**: There are {} surveys, {} sites, and {} visits by {} main observers.
There are {} samples and {} records.
"""

In [19]:
prg1 = msg.format(
    "v1.0",
    ss10.survey.unique().size,
    ss10.site_label.unique().size,
    ss10.shape[0],
    ss10.observer.unique().size,
    ss10.samples.sum(),
    ss10.records.sum()
)
prg2 = msg.format(
    "v1.1",
    ss11.survey.unique().size,
    ss11.site_label.unique().size,
    ss11.shape[0],
    ss11.observer.unique().size,
    ss11.samples.sum(),
    ss11.records.sum()
)
display(Markdown(prg1))
display(Markdown(prg2))


**Version v1.0**: There are 10 surveys, 153 sites, and 178 visits by 11 main observers.
There are 1270 samples and 19242 records.



**Version v1.1**: There are 10 surveys, 146 sites, and 170 visits by 10 main observers.
There are 1224 samples and 18397 records.


#### Mallee Woodlands


We compare the results for one of the surveys:


In [24]:
ss10=site_visits_v1_0[site_visits_v1_0['survey'] == 'Mallee Woodlands']
ss11=site_visits_v1_1[site_visits_v1_1['survey'] == 'Mallee Woodlands']

In [25]:
prg1 = msg.format(
    "v1.0",
    ss10.survey.unique().size,
    ss10.site_label.unique().size,
    ss10.shape[0],
    ss10.observer.unique().size,
    ss10.samples.sum(),
    ss10.records.sum()
)
prg2 = msg.format(
    "v1.1",
    ss11.survey.unique().size,
    ss11.site_label.unique().size,
    ss11.shape[0],
    ss11.observer.unique().size,
    ss11.samples.sum(),
    ss11.records.sum()
)
display(Markdown(prg1))
display(Markdown(prg2))


**Version v1.0**: There are 1 surveys, 66 sites, and 93 visits by 7 main observers.
There are 526 samples and 8588 records.



**Version v1.1**: There are 1 surveys, 66 sites, and 93 visits by 7 main observers.
There are 526 samples and 8588 records.


### Species in field samples

In [26]:
qry = """
SELECT DISTINCT species,species_code
FROM form.quadrat_samples
WHERE species_code is not NULL
;
"""
colnames=['species','species_code']

In [27]:
res = dbquery(qry,db_v1_0)
quadrats_v1_0 = pd.DataFrame(res,columns=colnames)
res = dbquery(qry,db_v1_1)
quadrats_v1_1 = pd.DataFrame(res,columns=colnames)

In [28]:
quadrats_v1_1

Unnamed: 0,species,species_code
0,Pyrrosia rupestris,8163
1,Leucanthemum vulgare,1560
2,Epacris breviflora,2591
3,Goodenia heterophylla,3190
4,Baumea rubiginosa,2302
...,...,...
962,Olearia pimeleoides var. pimeleoides,7258
963,Lophostemon confertus,4242
964,Stephania japonica,3690
965,Viola hederacea,6272


In [29]:
msg="""
**Version {}**: There are {} species with {} unique codes.
"""

In [30]:
prg1 = msg.format(
    'v1.0',
    quadrats_v1_0.species.unique().size,
    quadrats_v1_0.species_code.unique().size
)
prg2 = msg.format(
    'v1.1',
    quadrats_v1_1.species.unique().size,
    quadrats_v1_1.species_code.unique().size
)
display(Markdown(prg1))
display(Markdown(prg2))


**Version v1.0**: There are 1043 species with 1047 unique codes.



**Version v1.1**: There are 966 species with 966 unique codes.


### Traits from the literature

Read trait info into a data frame.

In [56]:
qry = "select code,name,description,value_type,life_stage,life_history_process,category_vocabulary,method_vocabulary from litrev.trait_info ;"
res = dbquery(qry,db_v1_1)
data = pd.DataFrame(res)
trait_info=data.rename(columns={0:"Trait code", 1:"Trait name", 2:"Description", 4:"Life stage", 5:"Life history process",  3:"Value type",6:"category_vocabulary",7:"method_vocabulary"})


In [57]:
trait_info.shape

(38, 8)

Count how many are categorical, etc.

In [58]:
trait_info['Value type'].value_counts()

Value type
categorical    14
numerical      10
TBA             6
numeric         4
TO DO           3
text            1
Name: count, dtype: int64

In [59]:
trait_info['Life stage'].value_counts()

Life stage
Standing plant    21
Seed              13
Seedling           4
Name: count, dtype: int64

In [60]:
trait_info['Life history process'].value_counts()

Life history process
Reproduction    11
Germination     10
Survival         7
Growth           5
Recruitment      4
Dispersal        1
Name: count, dtype: int64

Overview of sources of trait data

In [61]:
qry="""
    SELECT 'repr2' AS table_name, main_source, species, species_code FROM litrev.repr2
    UNION SELECT 'germ8' AS table_name, main_source, species, species_code FROM litrev.germ8
    UNION SELECT 'rect2' AS table_name, main_source, species, species_code FROM litrev.rect2
    UNION SELECT 'germ1' AS table_name, main_source, species, species_code FROM litrev.germ1
    UNION SELECT 'grow1' AS table_name, main_source, species, species_code FROM litrev.grow1
    UNION SELECT 'repr4' AS table_name, main_source, species, species_code FROM litrev.repr4
    UNION SELECT 'surv5' AS table_name, main_source, species, species_code FROM litrev.surv5
    UNION SELECT 'surv6' AS table_name, main_source, species, species_code FROM litrev.surv6
    UNION SELECT 'surv7' AS table_name, main_source, species, species_code FROM litrev.surv7
    UNION SELECT 'disp1' AS table_name, main_source, species, species_code FROM litrev.disp1
    UNION SELECT 'repr3' AS table_name, main_source, species, species_code FROM litrev.repr3a
    UNION SELECT 'repr3a' AS table_name, main_source, species, species_code FROM litrev.repr3
    UNION SELECT 'surv4' AS table_name, main_source, species, species_code FROM litrev.surv4
    UNION SELECT 'surv1' AS table_name, main_source, species, species_code FROM litrev.surv1
;
"""
colnames=['traits', 'source', 'species', 'species_code']


In [62]:
res = dbquery(qry,db_v1_0)
lit_traits_v1_0 = pd.DataFrame(res,columns=colnames)
res = dbquery(qry,db_v1_1)
lit_traits_v1_1 = pd.DataFrame(res,columns=colnames)

In [63]:
msg = """
**Version {}**: There are {} records from {} main sources. 
They include {} traits for {} species/taxa with {} unique species codes.
"""

In [64]:
prg1=msg.format(
    'v1.0',
    lit_traits_v1_0.shape[0],
    lit_traits_v1_0.source.unique().size,
    lit_traits_v1_0.traits.unique().size,
    lit_traits_v1_0.species.unique().size,
    lit_traits_v1_0.species_code.unique().size)
prg2=msg.format(
    'v1.1',
    lit_traits_v1_1.shape[0],
    lit_traits_v1_1.source.unique().size,
    lit_traits_v1_1.traits.unique().size,
    lit_traits_v1_1.species.unique().size,
    lit_traits_v1_1.species_code.unique().size)
display(Markdown(prg1))
display(Markdown(prg2))


**Version v1.0**: There are 30414 records from 5 main sources. 
They include 14 traits for 12591 species/taxa with 7253 unique species codes.



**Version v1.1**: There are 41836 records from 4 main sources. 
They include 14 traits for 18009 species/taxa with 7677 unique species codes.


In [65]:
lit_traits_v1_0.groupby(['source','traits']).agg('count')

Unnamed: 0_level_0,Unnamed: 1_level_0,species,species_code
source,traits,Unnamed: 2_level_1,Unnamed: 3_level_1
Bell Vollmer Gellie 1993,surv1,1,1
NSWFFRDv2.1,germ1,1596,1596
NSWFFRDv2.1,grow1,32,32
NSWFFRDv2.1,rect2,1026,1026
NSWFFRDv2.1,repr2,138,138
NSWFFRDv2.1,repr3,655,655
NSWFFRDv2.1,repr3a,830,830
NSWFFRDv2.1,repr4,21,21
NSWFFRDv2.1,surv1,3044,3044
NSWFFRDv2.1,surv4,1257,1257


In [66]:
lit_traits_v1_1.groupby(['source','traits']).agg('count')

Unnamed: 0_level_0,Unnamed: 1_level_0,species,species_code
source,traits,Unnamed: 2_level_1,Unnamed: 3_level_1
Bell Vollmer Gellie 1993,surv1,1,1
NSWFFRDv2.1,germ1,1596,1596
NSWFFRDv2.1,grow1,32,32
NSWFFRDv2.1,rect2,1026,1026
NSWFFRDv2.1,repr2,138,138
NSWFFRDv2.1,repr3,655,655
NSWFFRDv2.1,repr3a,830,830
NSWFFRDv2.1,repr4,21,21
NSWFFRDv2.1,surv1,3044,3044
NSWFFRDv2.1,surv4,1257,1257


In [69]:
lit_traits_v1_1.groupby(['source']).agg({
    'species_code':[pd.Series.nunique,'count'],
})

Unnamed: 0_level_0,species_code,species_code
Unnamed: 0_level_1,nunique,count
source,Unnamed: 1_level_2,Unnamed: 2_level_2
Bell Vollmer Gellie 1993,1,1
NSWFFRDv2.1,3001,9889
austraits-6.0.0,7515,15935


Define queries we are going to use multiple times:

In [119]:
# Check comments on vocabularies
qry_vocabulary = "SELECT pg_catalog.obj_description(t.oid, 'pg_type')::json from pg_type t where typname = '%s';" 

# Number of records per source
qry_source = 'SELECT main_source,count(*) FROM litrev.%s GROUP BY main_source'
# Number of records per value of categorical variable
qry_values = ' select norm_value,count(*),count(distinct species),count(distinct species_code) from litrev.%s group by norm_value;'

# Number of records per value of numerical variable
qry_triplet = ' select best is NOT NULL as b, lower is NOT NULL as l, upper is NOT NULL as u,count(*),count(distinct species),count(distinct species_code) from litrev.%s group by b,l,u;'

# Raw values when norm value is NULL 
qry_nulls = ' select raw_value,count(*),count(distinct species),count(distinct species_code) from litrev.%s where norm_value is NULL group by raw_value;'
# Raw values when best/lower/upper are all NULL 
qry_triplet_nulls = 'select raw_value,count(*),count(distinct species),count(distinct species_code) from litrev.%s where best is NULL and lower is NULL and upper is NULL group by raw_value;'


In [35]:
qry = """
SELECT survey_name, species, species_code, 
    count(distinct record_id) as records, 
    count(distinct visit_id) as sites, 
    count(distinct visit_date) as visits, 
    count(distinct sample_nr) as samples       
FROM form.quadrat_samples
LEFT JOIN form.field_visit
    USING(visit_id, visit_date)
WHERE species_code IS NOT NULL
GROUP BY survey_name,species, species_code;
"""
colnames=['survey', 'species', 'species_code', 
          'records', 'sites', 'visits', 'samples']


In [47]:
res = dbquery(qry,db_v1_1)
field_stream_v1_1 = pd.DataFrame(res,columns=colnames)
res = dbquery(qry,db_v1_0)
field_stream_v1_0 = pd.DataFrame(res,columns=colnames)

In [52]:
field_stream_v1_0['survey type'] = field_stream_v1_0['survey'] == "Mallee Woodlands"
field_stream_v1_1['survey type'] = field_stream_v1_0['survey'] == "Mallee Woodlands"

In [53]:
field_stream_v1_1.groupby(['survey type',]).agg({
    'species':pd.Series.nunique,
    'species_code':pd.Series.nunique,
    'records':sum,
})

Unnamed: 0_level_0,species,species_code,records
survey type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,830,830,9589
True,148,148,8356


In [54]:
field_stream_v1_0.groupby(['survey type',]).agg({
    'species':pd.Series.nunique,
    'species_code':pd.Series.nunique,
    'records':sum,
})

Unnamed: 0_level_0,species,species_code,records
survey type,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
False,901,905,10484
True,148,148,8408


# References
This represents the existing sources data stream

In [6]:
qrystr = "select count(*) from litrev.ref_list;"
dbquery(qrystr,db_v1_0)

[[309]]

In [7]:
dbquery(qrystr,db_v1_1)

[[347]]

In [77]:
qry="""
    SELECT 'repr2' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.repr2
    UNION SELECT 'germ8' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.germ8
    UNION SELECT 'rect2' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.rect2
    UNION SELECT 'germ1' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.germ1
    UNION SELECT 'grow1' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.grow1
    UNION SELECT 'repr4' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.repr4
    UNION SELECT 'surv5' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.surv5
    UNION SELECT 'surv6' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.surv6
    UNION SELECT 'surv7' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.surv7
    UNION SELECT 'disp1' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.disp1
    UNION SELECT 'repr3' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.repr3a
    UNION SELECT 'repr3a' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.repr3
    UNION SELECT 'surv4' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.surv4
    UNION SELECT 'surv1' AS table_name, record_id, main_source, unnest(original_sources), 
        species, species_code FROM litrev.surv1
;
"""
colnames=['traits', 'rid', 'source', 'orig_source', 'species', 'species_code']


In [78]:
res = dbquery(qry,db_v1_0)
lit_traits_v1_0 = pd.DataFrame(res,columns=colnames)
res = dbquery(qry,db_v1_1)
lit_traits_v1_1 = pd.DataFrame(res,columns=colnames)

In [79]:
len(lit_traits_v1_1['orig_source'].unique())

410

In [106]:
ss = lit_traits_v1_1['species_code'].isnull()

In [108]:
lit_traits_v1_1.loc[ss == False].groupby(['source']).agg({
    'rid': pd.Series.nunique,
    'species_code': pd.Series.nunique,
    'orig_source':pd.Series.nunique
})

Unnamed: 0_level_0,rid,species_code,orig_source
source,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
NSWFFRDv2.1,10417,2868,208
austraits-6.0.0,44832,7515,156


In [110]:
lit_traits_v1_1.loc[ss == False]

Unnamed: 0,traits,rid,source,orig_source,species,species_code
1,disp1,68162,austraits-6.0.0,Barlow Clifford George McCusker 1981,Diploglottis campbellii,5889
2,disp1,68163,austraits-6.0.0,Barlow Clifford George McCusker 1981,Diploglottis australis,7432
8,disp1,68169,austraits-6.0.0,Barlow Clifford George McCusker 1981,Elattostachys nervosa,5914
9,disp1,68170,austraits-6.0.0,Barlow Clifford George McCusker 1981,Elattostachys xylocarpa,5915
13,disp1,68174,austraits-6.0.0,{Australian National Botanic Gardens} 2018,Acacia acinacea,3699
...,...,...,...,...,...,...
91009,surv7,168,NSWFFRDv2.1,RP Thesium australe,Thesium australe,5871
91010,surv7,169,NSWFFRDv2.1,RP Nowra heath-myrtle Triplarina nowraensis,Triplarina nowraensis,9618
91011,surv7,170,NSWFFRDv2.1,Benson McDougall Ecology Sydney Plant Species ...,Tristaniopsis laurina,4297
91012,surv7,171,NSWFFRDv2.1,RP Velleia perfoliata,Velleia perfoliata,3218
