# World Checklist of Vascular Plants Dataset Exploration

Exploration of the WCVP dataset:

* DWCA
  - Distribution
  - Taxonomy

## Setup and Loading

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

### Load Distribution File

In [2]:
dfdist = pd.read_csv('wcvp_distribution.csv', sep='|')
print('dfdist columns:')
for column_name, dtype in zip(dfdist.columns, dfdist.dtypes):
    print(f'* {column_name} - {dtype}')
dfdist[:2]

dfdist columns:
* coreid - int64
* locality - object
* establishmentmeans - object
* locationid - object
* occurrencestatus - object
* threatstatus - object


| | coreid | locality | establishmentmeans | locationid | occurrencestatus | threatstatus |
|---|---|---|---|---|---|---|
| 0 | 1 | Argentina Northeast | NaN | TDWG:AGE | NaN | NaN |
| 1 | 1 | Argentina Northwest | NaN | TDWG:AGW | NaN | NaN |


### Load Taxonomy File

In [3]:
dftaxon = pd.read_csv('wcvp_taxon.csv', sep='|')
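dftaxon = dftaxon.rename(
    columns={
        'scientfiicname': 'scientificname',
        # NOTE: this key matches no column, so rename() silently ignores it.
        # The authorship column keeps its misspelled name,
        # 'scientfiicnameauthorship', as the output below shows.
        'scientfiicauthorname': 'scientificauthorname',
    },
)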

print('dftaxon columns:')
for column_name, dtype in zip(dftaxon.columns, dftaxon.dtypes):
    print(f'* {column_name} - {dtype}')
print()

dftaxon[:2]

dftaxon columns:
* taxonid - int64
* family - object
* genus - object
* specificepithet - object
* infraspecificepithet - object
* scientificname - object
* scientfiicnameauthorship - object
* taxonrank - object
* taxonomicstatus - object
* acceptednameusageid - float64
* parentnameusageid - float64
* originalnameusageid - float64
* namepublishedin - object
* nomenclaturalstatus - object
* taxonremarks - object
* scientificnameid - object
* dynamicproperties - object
* references - object



| | taxonid | family | genus | specificepithet | infraspecificepithet | scientificname | scientfiicnameauthorship | taxonrank | taxonomicstatus | acceptednameusageid | parentnameusageid | originalnameusageid | namepublishedin | nomenclaturalstatus | taxonremarks | scientificnameid | dynamicproperties | references |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3152012 | Polypodiaceae | Elaphoglossum | discolor | NaN | Elaphoglossum discolor | (Kuhn) C.Chr. | Species | Accepted | 3152012.0 | 3189746.0 | 3179959.0 | Index Filic.: 306 (1905) | NaN | S. Trop. America | ipni:89115-2 | {"powoid":"89115-2","lifeform":"epiphyte","climate":"wet tropical","homotypicsynonym":"","hybridformula":"","reviewed":"N"} | https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:89115-2 |
| 1 | 2741565 | Crassulaceae | Crassula | capensis | NaN | Crassula capensis | (L.) Baill. | Species | Accepted | 2741565.0 | 2741460.0 | 2490224.0 | Hist. Pl. 3: 312 (1871) | NaN | Cape Prov. | ipni:273023-1 | {"powoid":"273023-1","lifeform":"succulent tuberous geophyte","climate":"subtropical","homotypicsynonym":"","hybridformula":"","reviewed":"N"} | https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:273023-1 |


## Exploration of Taxonomy Dataset

**What are the possible values of `taxonrank`?**

In [4]:
dftaxon['taxonrank'].value_counts()

taxonrank
Species        1036226
Variety         227155
Subspecies       72642
Form             42815
Genus            41869
Subvariety        3280
proles            2335
lusus              637
Subform            624
nothosubsp.        519
microgène          364
Convariety         181
nothovar.          115
monstr.             89
grex                29
stirps              18
subproles           18
provar.             16
nothof.             15
psp.                 6
modif.               6
mut.                 5
sublusus             4
subspecioid          2
nid                  1
agamosp.             1
microf.              1
positio              1
micromorphe          1
group                1
ecas.                1
Name: count, dtype: int64

We are primarily interested in the taxonomic rank `Species` and any ranks that are more specific.
We assume that plants listed at a more specific rank belong to the same species and can therefore interbreed, provided the species is fertile.
The following values of `taxonrank` are of interest:

* `Species`
* `Subspecies`
* `Variety`
* `Subvariety`
* `Form`
* `Subform`

**What are the possible values of `taxonomicstatus`?**

In [5]:
dftaxon['taxonomicstatus'].value_counts()

taxonomicstatus
Synonym                   870674
Accepted                  428941
Illegitimate               47567
Unplaced                   36712
Invalid                    35650
Artificial Hybrid           4231
Provisionally Accepted      3380
Orthographic                2232
Misapplied                  1198
Local Biotype               1092
Name: count, dtype: int64

We are interested in records that reflect the current taxonomic understanding.
The following values for `taxonomicstatus` are of interest:

* `Accepted`
* `Synonym`
* `Artificial Hybrid`
* `Local Biotype`

**Which records do synonym records point to?**

In [6]:
df = dftaxon[
    dftaxon['taxonid'].isin(
        dftaxon[dftaxon['taxonomicstatus'] == 'Synonym']['acceptednameusageid']
    )
]

print(df['taxonomicstatus'].value_counts())
print()
print(df['taxonrank'].value_counts())

taxonomicstatus
Accepted                  224825
Provisionally Accepted      2788
Artificial Hybrid            601
Local Biotype                218
Name: count, dtype: int64

taxonrank
Species        187172
Subspecies      18963
Variety         15810
Genus            5792
Form              372
nothosubsp.       278
nothovar.          27
Subvariety          8
nothof.             4
microgène           4
Name: count, dtype: int64


We can see that most synonym records point to accepted or provisionally accepted records.

**What do a synonym record and the record it points to look like?**

In [7]:
dfsyn = dftaxon[dftaxon['taxonomicstatus'] == 'Synonym'][:1]

df = dftaxon[
    dftaxon['taxonid'].isin(dfsyn['acceptednameusageid'])
]

print(dfsyn)
print()
print(df)

      taxonid       family        genus specificepithet infraspecificepithet  \
2210  3179960  Pteridaceae  Acrostichum        conforme             discolor   

                          scientificname scientfiicnameauthorship taxonrank  \
2210  Acrostichum conforme var. discolor             (Kuhn) Baker   Variety   

     taxonomicstatus  acceptednameusageid  parentnameusageid  \
2210         Synonym            3152012.0                NaN   

      originalnameusageid  \
2210            3179959.0   

                                                        namepublishedin  \
2210  C.F.P.von Martius & auct. suc. (eds.), Fl. Bras. 1(2): 568 (1870)   

     nomenclaturalstatus taxonremarks scientificnameid  \
2210                 NaN          NaN      ipni:3696-2   

                                                                                            dynamicproperties  \
2210  {"powoid":"3696-2","lifeform":"","climate":"","homotypicsynonym":"T","hybridformula":"","reviewed":"N"}  

Here, the synonym "Acrostichum conforme var. discolor", published in 1870, points to the accepted name "Elaphoglossum discolor", published in 1905. In this case both the genus and the rank changed (variety to species).
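
To see the synonym and its accepted name side by side, the taxonomy table can be joined to itself on `acceptednameusageid`. This is a minimal sketch using only the two records shown in the outputs above, with all other columns omitted; the `_accepted` suffix is an arbitrary choice.

```python
import pandas as pd

# Two rows taken from the outputs above; other columns are left out.
taxa = pd.DataFrame({
    'taxonid': [3152012, 3179960],
    'scientificname': ['Elaphoglossum discolor', 'Acrostichum conforme var. discolor'],
    'taxonomicstatus': ['Accepted', 'Synonym'],
    'acceptednameusageid': [3152012.0, 3152012.0],
})

synonyms = taxa[taxa['taxonomicstatus'] == 'Synonym']

# Self-join: each synonym row picks up the name of the record it points to.
resolved = synonyms.merge(
    taxa[['taxonid', 'scientificname']],
    left_on='acceptednameusageid',
    right_on='taxonid',
    how='left',
    suffixes=('', '_accepted'),
)

print(resolved[['scientificname', 'scientificname_accepted']])
```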

**What do a `Local Biotype` record and an `Artificial Hybrid` record look like?**

In [8]:
df = dftaxon[
    dftaxon['taxonid'].isin(
        dftaxon[dftaxon['taxonomicstatus'] == 'Synonym']['acceptednameusageid']
    )
]
dfbio = df[df['taxonomicstatus'] == 'Local Biotype'][:1]
dfahyb = df[df['taxonomicstatus'] == 'Artificial Hybrid'][:1]

print(dfbio)
print()
print(dfahyb)

       taxonid    family  genus specificepithet infraspecificepithet  \
14351  3013548  Rosaceae  Rubus   grandipetalus                  NaN   

            scientificname scientfiicnameauthorship taxonrank taxonomicstatus  \
14351  Rubus grandipetalus                    Gand.   Species   Local Biotype   

       acceptednameusageid  parentnameusageid  originalnameusageid  \
14351            3013548.0          2976315.0                  NaN   

                           namepublishedin nomenclaturalstatus taxonremarks  \
14351  Mém. Soc. Émul. Doubs 8: 220 (1884)                 NaN       France   

      scientificnameid  \
14351              NaN   

                                                                                               dynamicproperties  \
14351  {"powoid":"3013548-4","lifeform":"","climate":"","homotypicsynonym":"","hybridformula":"","reviewed":"N"}   

                                                                 references  
14351  https://powo.science.

In both of these records, note that `acceptednameusageid` equals `taxonid`, which suggests that `Local Biotype` and `Artificial Hybrid` are in effect kinds of accepted name.

**How does filtering by `acceptednameusageid`/`taxonid` compare to filtering by `taxonomicstatus`?**

In [9]:
dfeq = dftaxon[dftaxon['taxonid'] == dftaxon['acceptednameusageid']]
dfacc = dftaxon[
    (dftaxon['taxonomicstatus'] == 'Accepted')
    | (dftaxon['taxonomicstatus'] == 'Artificial Hybrid')
    | (dftaxon['taxonomicstatus'] == 'Local Biotype')
]

print(dfeq['taxonomicstatus'].value_counts())
print()
print(dfacc['taxonomicstatus'].value_counts())
print()

dfmerge = pd.merge(dfeq, dfacc)
if len(dfmerge) != len(dfeq) or len(dfmerge) != len(dfacc):
    print('These sets contain different data!!!')
    raise Exception('These sets contain different data!!!')
else:
    print('These sets contain the same data.')

taxonomicstatus
Accepted             428941
Artificial Hybrid      4231
Local Biotype          1092
Name: count, dtype: int64

taxonomicstatus
Accepted             428941
Artificial Hybrid      4231
Local Biotype          1092
Name: count, dtype: int64

These sets contain the same data.


We can see that filtering on `taxonomicstatus` (`Accepted`, `Artificial Hybrid`, or `Local Biotype`) is equivalent to filtering on `taxonid == acceptednameusageid` to get accepted records.
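
An alternative way to run the same comparison is to build both filters as boolean masks and compare them row by row, which also makes it easy to list any rows where they disagree. This is a sketch on hypothetical toy rows, not the notebook's actual check.

```python
import pandas as pd

# Hypothetical toy rows standing in for dftaxon.
taxa = pd.DataFrame({
    'taxonid': [1, 2, 3, 4],
    'taxonomicstatus': ['Accepted', 'Synonym', 'Local Biotype', 'Artificial Hybrid'],
    'acceptednameusageid': [1.0, 1.0, 3.0, 4.0],
})

# Filter 1: the record points to itself.
by_id = taxa['taxonid'] == taxa['acceptednameusageid']
# Filter 2: the record has one of the accepted-like statuses.
by_status = taxa['taxonomicstatus'].isin(['Accepted', 'Artificial Hybrid', 'Local Biotype'])

same = bool((by_id == by_status).all())
disagreeing = taxa[by_id != by_status]

print('Filters agree on every row:', same)
print(disagreeing)
```

Because a missing `acceptednameusageid` compares as unequal to any `taxonid`, rows without an accepted id fall out of the first filter automatically.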

**Do synonym records reliably point to records with the same taxonrank?**

In [10]:
df = pd.merge(
    dftaxon,
    dftaxon[dftaxon['taxonomicstatus'] == 'Synonym'][['acceptednameusageid', 'taxonrank']],
    left_on='taxonid',
    right_on='acceptednameusageid',
    how='inner',
)

# taxonrank_y is from the synonym record; taxonrank_x is from the record it refers to.
print(df[['taxonrank_y', 'taxonrank_x']].value_counts()[:20])
print('...')

taxonrank_y  taxonrank_x
Species      Species        475153
Variety      Species        137118
Species      Subspecies      52657
             Variety         38887
Variety      Subspecies      32184
Subspecies   Species         27958
Form         Species         27553
Genus        Genus           20768
Variety      Variety         19809
Subspecies   Subspecies      11567
Form         Subspecies       6952
             Variety          4997
Subspecies   Variety          2852
Subvariety   Species          2203
proles       Species          1397
Species      Form              818
proles       Subspecies        715
Subvariety   Subspecies        662
Variety      Form              567
Species      nothosubsp.       488
Name: count, dtype: int64
...


We can see that the answer is "no": a synonym at one rank may refer to a record at a different rank.
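
The rank comparison can also be reduced to a single number, the share of synonyms whose rank differs from that of the record they point to. The sketch below uses hypothetical toy rows; the suffixes `_synonym` and `_accepted` are chosen here for readability and do not appear in the notebook.

```python
import pandas as pd

# Hypothetical toy rows: two synonyms that both point to taxonid 10.
taxa = pd.DataFrame({
    'taxonid': [10, 11, 12],
    'taxonrank': ['Species', 'Variety', 'Species'],
    'taxonomicstatus': ['Accepted', 'Synonym', 'Synonym'],
    'acceptednameusageid': [10.0, 10.0, 10.0],
})

synonyms = taxa[taxa['taxonomicstatus'] == 'Synonym']

# Pair each synonym with the record it refers to.
pairs = synonyms.merge(
    taxa[['taxonid', 'taxonrank']],
    left_on='acceptednameusageid',
    right_on='taxonid',
    suffixes=('_synonym', '_accepted'),
)

rank_changed = pairs['taxonrank_synonym'] != pairs['taxonrank_accepted']
print(f'{rank_changed.mean():.0%} of synonyms point to a record at a different rank')
```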

**What do some `Provisionally Accepted` records look like?**

In [11]:
dftaxon[dftaxon['taxonomicstatus'] == 'Provisionally Accepted'][:4]

| | taxonid | family | genus | specificepithet | infraspecificepithet | scientificname | scientfiicnameauthorship | taxonrank | taxonomicstatus | acceptednameusageid | parentnameusageid | originalnameusageid | namepublishedin | nomenclaturalstatus | taxonremarks | scientificnameid | dynamicproperties | references |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 14233 | 120970 | Lamiaceae | Marsypianthes | glomerata | NaN | Marsypianthes glomerata | C.Presl | Species | Provisionally Accepted | NaN | 120964.0 | NaN | Abh. Königl. Böhm. Ges. Wiss., ser. 5, 3: 531 (1845) | NaN | NaN | ipni:449999-1 | {"powoid":"449999-1","lifeform":"","climate":"","homotypicsynonym":"","hybridformula":"","reviewed":"Y"} | https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:449999-1 |
| 14629 | 2708032 | Scrophulariaceae | Celsia | doniana | NaN | Celsia doniana | Walp. | Species | Provisionally Accepted | NaN | 2707981.0 | NaN | Repert. Bot. Syst. 3: 150 (1844) | NaN | NaN | ipni:801263-1 | {"powoid":"801263-1","lifeform":"","climate":"","homotypicsynonym":"","hybridformula":"","reviewed":"N"} | https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:801263-1 |
| 38750 | 3310953 | Scrophulariaceae | Nemesia | suttonii | NaN | Nemesia × suttonii | (Anon.) Anon. | Species | Provisionally Accepted | NaN | 2382653.0 | 3310952.0 | Sunset Mag. 69: 39 (1932) | NaN | NaN | NaN | {"powoid":"3310953-4","lifeform":"","climate":"","homotypicsynonym":"","hybridformula":"","reviewed":"N"} | https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:3310953-4 |
| 62011 | 3138415 | Lycopodiaceae | Pseudolycopodiella | paradoxa | NaN | Pseudolycopodiella paradoxa | (Mart.) Holub | Species | Provisionally Accepted | NaN | 3141666.0 | 3140883.0 | Folia Geobot. Phytotax. 18: 442 (1983) | NaN | Venezuela (Bolívar) to Paraguay | ipni:211465-2 | {"powoid":"211465-2","lifeform":"subshrub","climate":"wet tropical","homotypicsynonym":"","hybridformula":"","reviewed":"N"} | https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:211465-2 |


Notice the lack of an `acceptednameusageid` for these records.

**Do all accepted taxonomy records have a record in the distribution dataset?**

In [12]:
df1 = dftaxon[
    (dftaxon['taxonid'] == dftaxon['acceptednameusageid'])
]
df2 = dftaxon[
    (dftaxon['taxonid'] == dftaxon['acceptednameusageid'])
    & (dftaxon['taxonid'].isin(dfdist['coreid']))
]

percent = len(df2) / len(df1) * 100.0
print(f'There are {len(df1)} accepted names and {len(df2)} ({percent:.2f}%) are present in the distribution file.')

There are 434264 accepted names and 430017 (99.02%) are present in the distribution file.


No, not every accepted name is in the distribution file.
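
To inspect which accepted records are missing rather than only counting them, the `isin` mask can be inverted. This is a minimal sketch on hypothetical toy frames standing in for `dftaxon` and `dfdist`.

```python
import pandas as pd

# Hypothetical stand-ins for dftaxon and dfdist.
taxa = pd.DataFrame({
    'taxonid': [1, 2, 3],
    'scientificname': ['name a', 'name b', 'name c'],
    'acceptednameusageid': [1.0, 2.0, 1.0],
})
dist = pd.DataFrame({
    'coreid': [1, 1],
    'locality': ['Argentina Northeast', 'Argentina Northwest'],
})

# Accepted records are those that point to themselves.
accepted = taxa[taxa['taxonid'] == taxa['acceptednameusageid']]

# Accepted records with no row at all in the distribution file.
missing = accepted[~accepted['taxonid'].isin(dist['coreid'])]

print(missing)
```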

**Do any synonym taxonomy records have a record in the distribution dataset?**

In [13]:
df = dftaxon[
    (dftaxon['taxonomicstatus'] == 'Synonym')
]
dfd = dfdist[dfdist['coreid'].isin(df['taxonid'])]

print(len(dfd))

0


No.

**How can I create an effective lookup for `establishmentmeans` by `scientificname`?**

In [14]:
dft = dftaxon[
    (dftaxon['taxonid'] == dftaxon['acceptednameusageid'])
    | (dftaxon['taxonomicstatus'] == 'Synonym')
]
dfd = dfdist[
    (dfdist['locality'] == 'Washington')
]

df = pd.merge(dft, dfd[['coreid', 'locality', 'establishmentmeans']], left_on='acceptednameusageid', right_on='coreid', how='left')

# prune for the sake of slim output
df = df[['scientificname', 'taxonrank', 'taxonomicstatus', 'locality', 'establishmentmeans']]

print(
    df[
        (df['establishmentmeans'] != 'introduced')
          & (df['locality'] == 'Washington')
    ][:1]
)
print()
print(df[df['establishmentmeans'] == 'introduced'][:1])
print()
print(df[df['scientificname'] == 'Cercis canadensis'])
print()
print(df['scientificname'].value_counts()[:10])

       scientificname taxonrank taxonomicstatus    locality establishmentmeans
15  Phlox hendersonii   Species        Accepted  Washington                NaN

            scientificname taxonrank taxonomicstatus    locality  \
364  Lysimachia terrestris   Species        Accepted  Washington   

    establishmentmeans  
364         introduced  

           scientificname taxonrank taxonomicstatus locality  \
572856  Cercis canadensis   Species        Accepted      NaN   

       establishmentmeans  
572856                NaN  

scientificname
Laurus cassia                      6
Cacalia tomentosa                  6
Crepis hieracioides                5
Draba ciliaris                     5
Solanum acanthifolium              5
Laurus exaltata                    5
Medicago obscura f. sinistrorsa    5
Medicago obscura f. dextrosa       5
Eupatorium acuminatum              5
Lycopsis vesicaria                 5
Name: count, dtype: int64


By joining a filtered subset of the distribution dataset to the taxonomy dataset on the `acceptednameusageid` column, we can associate each plant with the `establishmentmeans` from its record in the distribution file. Both the `locality` and `establishmentmeans` columns must then be checked to determine whether the plant is introduced, native, or not found in the area.

* Native: `locality` is the locality of interest and `establishmentmeans` is not `introduced`.
* Introduced: `locality` is the locality of interest and `establishmentmeans` is `introduced`.
* Not found: `locality` is not the locality of interest (it is NaN after the left join).

This also requires filtering the taxonomy dataset down to accepted and synonym names. Keep in mind, however, that records for a given scientific name are still not unique in the dataset.
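
The three rules above can be sketched as a small helper that adds a status label to each name. This is an illustration under stated assumptions, not a definitive implementation: the function name `classify_establishment`, the `status` column, the placeholder plant names, and the second locality `Oregon` are all hypothetical.

```python
import pandas as pd


def classify_establishment(taxa, dist, locality):
    """Label each accepted or synonym name as native, introduced, or not found."""
    # Keep accepted records (which point to themselves) and synonyms.
    names = taxa[
        (taxa['taxonid'] == taxa['acceptednameusageid'])
        | (taxa['taxonomicstatus'] == 'Synonym')
    ]
    # Restrict the distribution rows to the locality of interest.
    local = dist[dist['locality'] == locality]
    merged = names.merge(
        local[['coreid', 'locality', 'establishmentmeans']],
        left_on='acceptednameusageid',
        right_on='coreid',
        how='left',
    )
    present = merged['locality'] == locality
    introduced = present & (merged['establishmentmeans'] == 'introduced')
    merged['status'] = 'not found'
    merged.loc[present, 'status'] = 'native'
    merged.loc[introduced, 'status'] = 'introduced'
    return merged[['scientificname', 'status']]


# Hypothetical toy data; 'plant d' is a synonym of 'plant b'.
taxa = pd.DataFrame({
    'taxonid': [1, 2, 3, 4],
    'scientificname': ['plant a', 'plant b', 'plant c', 'plant d'],
    'taxonomicstatus': ['Accepted', 'Accepted', 'Accepted', 'Synonym'],
    'acceptednameusageid': [1.0, 2.0, 3.0, 2.0],
})
dist = pd.DataFrame({
    'coreid': [1, 2, 3],
    'locality': ['Washington', 'Washington', 'Oregon'],
    'establishmentmeans': [None, 'introduced', None],
})

result = classify_establishment(taxa, dist, 'Washington')
print(result)
```

Because scientific names are not unique, a lookup keyed on `scientificname` alone can return several rows for one name, and callers need to decide how to handle that case.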

**What do multiple accepted records for a given scientific name look like?**

In [15]:
dft = dftaxon[
    (dftaxon['taxonid'] == dftaxon['acceptednameusageid'])
    | (dftaxon['taxonomicstatus'] == 'Synonym')
]
dfd = dfdist[
    (dfdist['locality'] == 'Washington')
]

df = pd.merge(dft, dfd[['coreid', 'locality', 'establishmentmeans']], left_on='acceptednameusageid', right_on='coreid', how='left')
df = df[df['locality'] == 'Washington']

print(df[['scientificname', 'taxonomicstatus']].value_counts()[:4])
print()
print(df[df['taxonomicstatus'] != 'Synonym'][['scientificname', 'taxonomicstatus']].value_counts()[:4])
print()
print('------------- Multiple Records -------------')
dfp = df[
    (df['scientificname'] == 'Crataegus pyracantha')
]
print(dfp)
print()
print('------------- Corresponding Accepted Records -------------')
print(df[df['taxonid'].isin(dfp['acceptednameusageid'])])

scientificname                  taxonomicstatus
Briza minor var. virens         Synonym            3
Crataegus pyracantha            Synonym            3
Rosa canina f. genuina          Synonym            3
Avena sativa subsp. orientalis  Synonym            3
Name: count, dtype: int64

scientificname         taxonomicstatus
× Elyleymus aristatus  Accepted           1
Abies                  Accepted           1
Abies amabilis         Accepted           1
Abies grandis          Accepted           1
Name: count, dtype: int64

------------- Multiple Records -------------
         taxonid    family      genus specificepithet infraspecificepithet  \
336290   2947016  Rosaceae  Crataegus      pyracantha                  NaN   
1287578  2947018  Rosaceae  Crataegus      pyracantha                  NaN   
1287580  2947019  Rosaceae  Crataegus      pyracantha                  NaN   

               scientificname scientfiicnameauthorship taxonrank  \
336290   Crataegus pyracantha              (L

We see that there are no duplicate names in the set of accepted records, so in the case where a scientific name leads to multiple records, those records should be synonym records.
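
A direct way to test for repeated names is `Series.duplicated(keep=False)`, which flags every member of a duplicate group. The sketch below runs it on hypothetical toy rows, once on the non-synonym records and once on all records.

```python
import pandas as pd

# Hypothetical toy rows: 'name b' appears twice, both times as a synonym.
rows = pd.DataFrame({
    'scientificname': ['name a', 'name b', 'name b', 'name c'],
    'taxonomicstatus': ['Accepted', 'Synonym', 'Synonym', 'Accepted'],
})

accepted = rows[rows['taxonomicstatus'] != 'Synonym']

# keep=False marks all rows of a duplicated name, not only the later ones.
dup_accepted = accepted[accepted['scientificname'].duplicated(keep=False)]
dup_any = rows[rows['scientificname'].duplicated(keep=False)]

print('Duplicate names among accepted records:', len(dup_accepted))
print(dup_any)
```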

**What kinds of non-synonym records have duplicate scientific names?**

In [16]:
# Drop the statuses we are not examining here in a single step.
excluded_statuses = ['Illegitimate', 'Unplaced', 'Invalid', 'Misapplied', 'Synonym']
df = dftaxon[~dftaxon['taxonomicstatus'].isin(excluded_statuses)]
df[['scientificname', 'taxonomicstatus']].value_counts()[:21]

scientificname                                      taxonomicstatus       
Rubus gerezianus                                    Local Biotype             2
× Trichopsis                                        Artificial Hybrid         2
Genista triquetra                                   Provisionally Accepted    2
× Hueylihara                                        Artificial Hybrid         2
Rosa hecleliana                                     Orthographic              2
× Jesupara                                          Artificial Hybrid         2
× Scullyara                                         Artificial Hybrid         2
× Smithara                                          Artificial Hybrid         2
Rosa virginica                                      Orthographic              2
× Nakamotoara                                       Artificial Hybrid         2
Potentilla thomasii                                 Orthographic              2
Lonicera × heckrottii                        

**What do some records with duplicate scientific names look like?**

In [17]:
df = dftaxon[dftaxon['scientificname'] == 'Potentilla thomasii']

print(df)

        taxonid    family       genus specificepithet infraspecificepithet  \
798517  2957822  Rosaceae  Potentilla        thomasii                  NaN   
950693  2957820  Rosaceae  Potentilla        thomasii                  NaN   
950694  2957821  Rosaceae  Potentilla        thomasii                  NaN   

             scientificname scientfiicnameauthorship taxonrank  \
798517  Potentilla thomasii                 C.A.Mey.   Species   
950693  Potentilla thomasii             Ten. ex Ser.   Species   
950694  Potentilla thomasii                     Ser.   Species   

       taxonomicstatus  acceptednameusageid  parentnameusageid  \
798517         Synonym            2955944.0                NaN   
950693    Orthographic            2954440.0                NaN   
950694    Orthographic            2954440.0                NaN   

        originalnameusageid                        namepublishedin  \
798517                  NaN     Verz. Pfl. Casp. Meer.: 168 (1831)   
950693           

**Do any synonym records with a duplicate scientific name refer to multiple accepted records with differing establishment means?**

In [18]:
dft = dftaxon[
    (dftaxon['taxonid'] == dftaxon['acceptednameusageid'])
    | (dftaxon['taxonomicstatus'] == 'Synonym')
]
dfd = dfdist[
    (dfdist['locality'] == 'Washington')
]

df = pd.merge(dft, dfd[['coreid', 'locality', 'establishmentmeans']], left_on='acceptednameusageid', right_on='coreid', how='left')
df = df[df['locality'] == 'Washington']

dfsyn = df[df['taxonomicstatus'] == 'Synonym']

dfg = dfsyn.groupby('scientificname')['establishmentmeans'].agg(set).reset_index()
dfg['count'] = dfg['establishmentmeans'].apply(len)

dfdem = dfg[dfg['count'] > 1]

print(len(dfdem))
print()
print(dfdem[:10])
print()
print('--------------- Synonym Records ---------------')
print(dfsyn[dfsyn['scientificname'] == 'Artemisia violacea'])
print()
print('--------------- Accepted Records ---------------')
print(df[df['taxonid'].isin(dfsyn[dfsyn['scientificname'] == 'Artemisia violacea']['acceptednameusageid'])])

24

                          scientificname establishmentmeans  count
6159                  Artemisia violacea  {nan, introduced}      2
6316                             Aspasia  {nan, introduced}      2
6997   Atriplex patula subsp. littoralis  {nan, introduced}      2
7182                             Aurelia  {nan, introduced}      2
11936  Cardamine flexuosa f. grandiflora  {nan, introduced}      2
15326                          Cicutaria  {nan, introduced}      2
16745                             Costia  {nan, introduced}      2
17455                   Crepis umbellata  {nan, introduced}      2
19822                           Dondisia  {nan, introduced}      2
21484                 Epilobium uralense  {nan, introduced}      2

--------------- Synonym Records ---------------
         taxonid      family      genus specificepithet infraspecificepithet  \
662644   3107808  Asteraceae  Artemisia        violacea                  NaN   
1005047  3108894  Asteraceae  Artemisia        vio

We can see that there are 24 synonym names with conflicting `establishmentmeans` values for our locality.
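
The same conflict check can be written with `nunique(dropna=False)`, which counts a missing value as one distinct value and avoids building Python sets that contain NaN. This is a sketch on hypothetical toy rows, offered as an alternative rather than as the notebook's method.

```python
import pandas as pd

# Hypothetical toy rows: 'name a' has both a missing and an 'introduced' value.
rows = pd.DataFrame({
    'scientificname': ['name a', 'name a', 'name b', 'name b'],
    'establishmentmeans': [float('nan'), 'introduced', 'introduced', 'introduced'],
})

# Count distinct values per name, treating NaN as a value of its own.
distinct = rows.groupby('scientificname')['establishmentmeans'].nunique(dropna=False)
conflicting = distinct[distinct > 1].index.tolist()

print(conflicting)
```

Names that land in `conflicting` cannot be labelled from the scientific name alone; something further, such as the authorship column, would be needed to disambiguate them.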

**How many non-introduced taxonomy records (accepted names and synonyms) are there for Washington?**

In [19]:
dft = dftaxon[
    (dftaxon['taxonid'] == dftaxon['acceptednameusageid'])
    | (dftaxon['taxonomicstatus'] == 'Synonym')
]
dfd = dfdist[
    (dfdist['locality'] == 'Washington')
]

df = pd.merge(dft, dfd[['coreid', 'locality', 'establishmentmeans']], left_on='acceptednameusageid', right_on='coreid', how='left')

print(len(df[
    (df['locality'] == 'Washington')
    & (df['establishmentmeans'] != 'introduced')
]))

36181
