**Notebook to create the Plants of the World Online and Wikipedia datasets.**

This is done by combining descriptions per species and source to get one combined description for a species that might contain several POWO or WIKI categories. The descriptions are then merged with information on species traits from the Global Inventory of Floras and Traits (GIFT). 

Furthermore, two smaller datasets are created that contain only species descriptions for specific categories: Morphology General Habit and Morphology Leaf.

# Libraries & Functions

In [43]:
'''Math & Data Libraries'''
import numpy as np
import pandas as pd

In [44]:
''' Miscellaneous Libraries'''
from tqdm import tqdm

# Input Description Data

## Plants of the World Online - POWO Dataset

In [45]:
df_POWO = pd.read_excel("..//Data//Preprocessed Databases//POWO_preprocessed_descriptions.xlsx")

In [46]:
df_POWO

Unnamed: 0.1,Unnamed: 0,POWO_id,description,Growth Form,source,name,authors,i,ID,fqId,...,modified,language,creator,Problematic,prep_description_1,Language,trans_prep_description_1,QA_description,BERT_description,BOW_description
0,0,morphologyLeaf,Leaves anisophyllous; lamina 12 – 45 × 6 – 15&...,,"Kelbessa, E. 2009. Three new species of Acanth...",Acanthopale aethiogermanica,Ensermu,1,77098516-1,urn:lsid:ipni.org:names:77098516-1,...,,,,,Leaves anisophyllous; lamina 12 - 45 x 6 - 15;...,en,Leaves anisophyllous; lamina 12 - 45 x 6 - 15;...,leaves anisophyllous ; lamina 12 - 45 x 6 - 15...,leaves anisophyllous lamina cm broadly ellipti...,leaves anisophyllous lamina cm broadly ellipti...
1,2,morphologyReproductiveFruit,"Capsule 14 – 16 × 5.6 – 6.5&nbsp;mm, glabrous,...",,"Kelbessa, E. 2009. Three new species of Acanth...",Acanthopale aethiogermanica,Ensermu,1,77098516-1,urn:lsid:ipni.org:names:77098516-1,...,,,,,"Capsule 14 - 16 x 5.6 - 6.5;mm, glabrous, 4-se...",en,"Capsule 14 - 16 x 5.6 - 6.5;mm, glabrous, 4-se...","capsule 14 - 16 x 5.6 - 6.5 ; mm , glabrous , ...",capsule mm glabrous seeded,capsule mm glabrous seeded
2,3,morphologyReproductiveInflorescenceSpikelet,"Spikes axillary, with (1 –) 2 – 3 (– 4) flower...",,"Kelbessa, E. 2009. Three new species of Acanth...",Acanthopale aethiogermanica,Ensermu,1,77098516-1,urn:lsid:ipni.org:names:77098516-1,...,,,,,"Spikes axillary, with (1 -) 2 - 3 (- 4) flower...",en,"Spikes axillary, with (1 -) 2 - 3 (- 4) flower...","spikes axillary , with ( 1 - ) 2 - 3 ( - 4 ) f...",spikes axillary with flowers per node and the ...,spikes axillary flowers per node spike somewha...
3,4,morphologyReproductiveFlowerGynoeciumOvary,"Ovary 3.7 – 4.5&nbsp;mm long, partially enclos...",,"Kelbessa, E. 2009. Three new species of Acanth...",Acanthopale aethiogermanica,Ensermu,1,77098516-1,urn:lsid:ipni.org:names:77098516-1,...,,,,,"Ovary 3.7 - 4.5;mm long, partially enclosed wi...",en,"Ovary 3.7 - 4.5;mm long, partially enclosed wi...","ovary 3.7 - 4.5 ; mm long , partially enclosed...",ovary mm long partially enclosed within mm lon...,ovary mm long partially enclosed within mm lon...
4,5,morphologyReproductiveInflorescenceBract,"Bracts 2, linear to linear-oblanceolate, 8 – 1...",,"Kelbessa, E. 2009. Three new species of Acanth...",Acanthopale aethiogermanica,Ensermu,1,77098516-1,urn:lsid:ipni.org:names:77098516-1,...,,,,,"Bracts 2, linear to linear-oblanceolate, 8 - 1...",en,"Bracts 2, linear to linear-oblanceolate, 8 - 1...","bracts 2 , linear to linear-oblanceolate , 8 -...",bracts linear to linearoblanceolate mm sparsel...,bracts linear linearoblanceolate mm sparsely p...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
288249,412274,morphologyGeneralHabit,Woody-based Thesium-like herb with several ere...,,"Rubiaceae, B. Verdcourt. Flora Zambesiaca 5:1....",Manostachya staelioides,(K.Schum.) Bremek.,302653,755812-1,urn:lsid:ipni.org:names:755812-1,...,,,,,Woody-based Thesium-like herb with several ere...,en,Woody-based Thesium-like herb with several ere...,woody-based thesium-like herb with several ere...,woodybased thesiumlike herb with several erect...,woodybased thesiumlike herb several erect glab...
288250,412278,morphologyGeneralHabit,"Woody-stemmed herb, 4 ft. high",,"Papilionaceae, Hutchinson and Dalziel. Flora o...",Indigofera megacephala,J.B.Gillett,143886,499640-1,urn:lsid:ipni.org:names:499640-1,...,,,,,"Woody-stemmed herb, 4 ft. high",en,"Woody-stemmed herb, 4 ft. high","woody-stemmed herb , 4 ft. high",woodystemmed herb ft high,woodystemmed herb ft high
288251,412279,morphologyReproductiveFlower,"Flowers white, very fragrant.",,"Solanaceae, H. heine. Flora of West Tropical A...",Datura candida,Saff.,433345,76934-2,urn:lsid:ipni.org:names:76934-2,...,,,,,"Flowers white, very fragrant.",en,"Flowers white, very fragrant.","flowers white , very fragrant .",flowers white very fragrant,flowers white fragrant
288252,412280,morphologyGeneralHabit,Young branches and leaves rusty; leaves slight...,,"Apocynaceae, E.A. Omino. Flora of Tropical Eas...",Beaumontia grandiflora,Wall.,19449,77539-1,urn:lsid:ipni.org:names:77539-1,...,,,,,Young branches and leaves rusty; leaves slight...,en,Young branches and leaves rusty; leaves slight...,young branches and leaves rusty ; leaves sligh...,young branches and leaves rusty leaves slightl...,young branches leaves rusty leaves slightly ob...


## Wikipedia - WIKI Dataset

In [47]:
df_WIKI = pd.read_excel("..//Data//Preprocessed Databases//WIKI_preprocessed_descriptions.xlsx")

In [48]:
df_WIKI

Unnamed: 0.1,Unnamed: 0,WIKI_id,description,source,name,Date Retrieved,Binomial Name,prep_description_1,Language,trans_prep_description_1,QA_description,BERT_description,BOW_description
0,0,Summary,Aa achalensis is a species of orchid in the ge...,"Schltr., 1920",Aa achalensis,01/07/2022,Aa achalensis,Aa achalensis is a species of orchid in the ge...,en,Aa achalensis is a species of orchid in the ge...,aa achalensis is a species of orchid in the ge...,aa achalensis is species of orchid in the genu...,aa achalensis species orchid genus aa references
1,1,Summary,Aa argyrolepis is an orchid in the genus Aa. ...,"Rchb.f., 1854",Aa argyrolepis,01/07/2022,Aa argyrolepis,Aa argyrolepis is an orchid in the genus Aa. ...,en,Aa argyrolepis is an orchid in the genus Aa. ...,aa argyrolepis is an orchid in the genus aa . ...,aa argyrolepis is an orchid in the genus aa it...,aa argyrolepis orchid genus aa grows altitudes...
2,2,References,"\nReichenbach, H.G. (1854) Xenia Orchidacea 1:...","Rchb.f., 1854",Aa argyrolepis,01/07/2022,Aa argyrolepis,"\nReichenbach, H.G. (1854) Xenia Orchidacea 1:...",en,"\nReichenbach, H.G. (1854) Xenia Orchidacea 1:...","reichenbach , h.g . ( 1854 ) xenia orchidacea ...",reichenbach hg xenia orchidacea hammel be al m...,reichenbach hg xenia orchidacea hammel al manu...
3,3,Summary,Aa aurantiaca is a species of orchid in the ge...,D. Trujillo (2011)[1],Aa aurantiaca,01/07/2022,Aa aurantiaca,Aa aurantiaca is a species of orchid in the ge...,en,Aa aurantiaca is a species of orchid in the ge...,aa aurantiaca is a species of orchid in the ge...,aa aurantiaca is species of orchid in the genu...,aa aurantiaca species orchid genus aa native p...
4,4,Summary,Aa calceata is a species of orchid in the genu...,"Schltr., 1912",Aa calceata,01/07/2022,Aa calceata,Aa calceata is a species of orchid in the genu...,en,Aa calceata is a species of orchid in the genu...,aa calceata is a species of orchid in the genu...,aa calceata is species of orchid in the genus ...,aa calceata species orchid genus aait found bo...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
194989,194989,Distribution,"Native to West Tropical Africa, found in Niger...",(Pax) Mildbr.,Zygotritonia bongensis,09/07/2022,Zygotritonia bongensis,"Native to West Tropical Africa, found in Niger...",en,"Native to West Tropical Africa, found in Niger...","native to west tropical africa , found in nige...",native to west tropical africa found in nigeri...,native west tropical africa found nigeria ghan...
194990,194990,Summary,Zyzyxia is a genus of tropical shrubs in the f...,"(H.Robinson) Strother, 1991",Zyzyxia lundellii,09/07/2022,Zyzyxia lundellii,Zyzyxia is a genus of tropical shrubs in the f...,en,Zyzyxia is a genus of tropical shrubs in the f...,zyzyxia is a genus of tropical shrubs in the f...,zyzyxia is genus of tropical shrubs in the fam...,zyzyxia genus tropical shrubs family asteracea...
194991,194991,Description and distribution,Zyzyxia is a shrub that grows to 3 meters tall...,"(H.Robinson) Strother, 1991",Zyzyxia lundellii,09/07/2022,Zyzyxia lundellii,Zyzyxia is a shrub that grows to 3 meters tall...,en,Zyzyxia is a shrub that grows to 3 meters tall...,zyzyxia is a shrub that grows to 3 meters tall...,zyzyxia is shrub that grows to meters tall its...,zyzyxia shrub grows meters tall leaves covered...
194992,194992,Naming,"Around 1990, John L. Strother was revising the...","(H.Robinson) Strother, 1991",Zyzyxia lundellii,09/07/2022,Zyzyxia lundellii,"Around 1990, John L. Strother was revising the...",en,"Around 1990, John L. Strother was revising the...","around 1990 , john l. strother was revising th...",around john strother was revising the six nort...,around john strother revising six north americ...


## Statistics

In [49]:
print("Number of Unique Species: {}".format(df_POWO["name"].nunique()))
print("Number of Unique Sources: {}".format(df_POWO["authors"].nunique()))
print("Number of Unique Species x Sources: {}".format((df_POWO["name"]+df_POWO["authors"]).nunique()))
print("Number of Unique Description Types: {}".format(df_POWO["POWO_id"].nunique()))

Number of Unique Species: 59507
Number of Unique Sources: 15023
Number of Unique Species x Sources: 59151
Number of Unique Description Types: 251


In [50]:
df_POWO["POWO_id"].value_counts().describe()

count      251.000000
mean      1148.422311
std       5280.918318
min          1.000000
25%          7.000000
50%         39.000000
75%        263.000000
max      67977.000000
Name: count, dtype: float64

In [51]:
print("Number of Unique Species: {}".format(df_WIKI["name"].nunique()))
print("Number of Unique Sources: {}".format(df_WIKI["source"].nunique()))
print("Number of Unique Species x Sources: {}".format((df_WIKI["name"]+df_WIKI["source"]).nunique()))
print("Number of Unique Description Types: {}".format(df_WIKI["WIKI_id"].nunique()))

Number of Unique Species: 55654
Number of Unique Sources: 22035
Number of Unique Species x Sources: 55631
Number of Unique Description Types: 7903


In [52]:
df_WIKI["WIKI_id"].value_counts().describe()

count     7903.000000
mean        24.673415
std        784.345896
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max      58534.000000
Name: count, dtype: float64

# Combine Descriptions on Species x Source

In [53]:
def combine_descriptions(df, df_name):
    df_new = pd.DataFrame()
    
    src_variable = "authors" if df_name=="POWO" else "source" 

    # Adding variables for the categories and their count
    df_tmp = df.groupby(["name", src_variable])[f'{df_name}_id'].apply(';'.join).reset_index()
    df_new = df_new.assign(name = df_tmp["name"])
    df_new = df_new.assign(authors = df_tmp[src_variable])
    df_new[f'{df_name}_ids'] = df_tmp[f'{df_name}_id']
    df_new[f'{df_name}_id_N'] = [len(x.split(";")) for x in df_tmp[f'{df_name}_id']]

    # Adding variables for the Language
    df_tmp = df.groupby(["name", src_variable])['Language'].apply(', '.join).reset_index()
    df_new = df_new.assign(Language=[", ".join(set(x.split(", "))) for x in df_tmp["Language"]])
    
    # Combining the original description
    description = df["description"].apply(str)
    df = df.assign(description_str=description)
    df_tmp = df.groupby(["name", src_variable])['description_str'].apply(' ; '.join).reset_index()
    df_new = df_new.assign(description=df_tmp["description_str"])

    # Combining the QA description
    QA_description = df["QA_description"].apply(str)
    df = df.assign(QA_description_str=QA_description)
    df_tmp = df.groupby(["name", src_variable])['QA_description_str'].apply(' ; '.join).reset_index()
    df_new = df_new.assign(QA_description=df_tmp["QA_description_str"])
    
    # Combining the BERT description
    BERT_description = df["BERT_description"].apply(str)
    df = df.assign(BERT_description_str=BERT_description)
    df_tmp = df.groupby(["name", src_variable])['BERT_description_str'].apply(' '.join).reset_index()
    df_new = df_new.assign(BERT_description=df_tmp["BERT_description_str"])
    
    # Combining the BOW description
    BOW_description = df["BOW_description"].apply(str)
    df = df.assign(BOW_description_str=BOW_description)
    df_tmp = df.groupby(["name", src_variable])['BOW_description_str'].apply(' '.join).reset_index()
    df_new = df_new.assign(BOW_description=df_tmp["BOW_description_str"])

    return df_new

In [13]:
df_POWO_combined = combine_descriptions(df_POWO, "POWO")

In [14]:
df_POWO_combined

Unnamed: 0,name,authors,POWO_ids,POWO_id_N,Language,description,QA_description,BERT_description,BOW_description
0,Aa argyrolepis,Rchb.f.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb
1,Aa colombiana,Schltr.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb
2,Aa denticulata,Schltr.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb
3,Aa leucantha,(Rchb.f.) Schltr.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb
4,Aa maderoi,Schltr.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb
...,...,...,...,...,...,...,...,...,...
59146,× Agropogon lutosus,(Poir.) P.Fourn.,morphologyReproductiveInflorescenceBractGlume;...,7,en,Glumes persistent; similar; exceeding apex of ...,glumes persistent ; similar ; exceeding apex o...,glumes persistent similar exceeding apex of fl...,glumes persistent similar exceeding apex flore...
59147,× Calicharis butcheri,(Traub) Meerow,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb
59148,× Chrismatopteris holttumii,Quansah & D.S.Edwards,note;morphologyGeneral;morphologyReproductiveS...,3,en,Holttum had annotated _x000D_\n<i>Faden &amp; ...,holttum had annotated faden amp ; evans 70/422...,holttum had annotated faden amp evans in with ...,holttum annotated faden amp evans new name bas...
59149,× Dupoa labradorica,(Steud.) J.Cay. & Darbysh.,morphologyReproductiveInflorescenceBractGlume;...,7,en,Glumes persistent; similar; shorter than spike...,glumes persistent ; similar ; shorter than spi...,glumes persistent similar shorter than spikele...,glumes persistent similar shorter spikelet fir...


In [15]:
df_WIKI_combined = combine_descriptions(df_WIKI, "WIKI")

In [16]:
df_WIKI_combined

Unnamed: 0,name,authors,WIKI_ids,WIKI_id_N,Language,description,QA_description,BERT_description,BOW_description
0,Aa achalensis,"Schltr., 1920",Summary;Summary,2,en,Aa achalensis is a species of orchid in the ge...,aa achalensis is a species of orchid in the ge...,aa achalensis is species of orchid in the genu...,aa achalensis species orchid genus aa referenc...
1,Aa argyrolepis,"Rchb.f., 1854",Summary;References;Summary;References,4,en,Aa argyrolepis is an orchid in the genus Aa. ...,aa argyrolepis is an orchid in the genus aa . ...,aa argyrolepis is an orchid in the genus aa it...,aa argyrolepis orchid genus aa grows altitudes...
2,Aa aurantiaca,D. Trujillo (2011)[1],Summary;Summary,2,en,Aa aurantiaca is a species of orchid in the ge...,aa aurantiaca is a species of orchid in the ge...,aa aurantiaca is species of orchid in the genu...,aa aurantiaca species orchid genus aa native p...
3,Aa calceata,"Schltr., 1912",Summary;Summary,2,en,Aa calceata is a species of orchid in the genu...,aa calceata is a species of orchid in the genu...,aa calceata is species of orchid in the genus ...,aa calceata species orchid genus aait found bo...
4,Aa colombiana,Schltr.,Summary;Summary,2,en,Aa colombiana is a species of orchid in the ge...,aa colombiana is a species of orchid in the ge...,aa colombiana is species of orchid in the genu...,aa colombiana species orchid genus aa found co...
...,...,...,...,...,...,...,...,...,...
55626,Zygosepalum labiosum,(Rich.) C.Schweinf.,Summary;Description,2,en,Zygosepalum labiosum is an epiphytic orchid fo...,zygosepalum labiosum is an epiphytic orchid fo...,zygosepalum labiosum is an epiphytic orchid fo...,zygosepalum labiosum epiphytic orchid found so...
55627,Zygostigma australe,(Cham. & Schltdl.) Griseb.,Summary,1,en,Zygostigma australe is a species of flowering ...,zygostigma australe is a species of flowering ...,zygostigma australe is species of flowering pl...,zygostigma australe species flowering plant fa...
55628,Zygotritonia bongensis,(Pax) Mildbr.,Summary;Morphology;Distribution,3,en,Zygotritonia bongensis is a perennial herb of ...,zygotritonia bongensis is a perennial herb of ...,zygotritonia bongensis is perennial herb of th...,zygotritonia bongensis perennial herb iridacea...
55629,Zyzyxia lundellii,"(H.Robinson) Strother, 1991",Summary;Description and distribution;Naming,3,en,Zyzyxia is a genus of tropical shrubs in the f...,zyzyxia is a genus of tropical shrubs in the f...,zyzyxia is genus of tropical shrubs in the fam...,zyzyxia genus tropical shrubs family asteracea...


### Description Word Count & Character Count Analysis 

In [17]:
character_count_description = df_POWO_combined["BERT_description"].apply(lambda x: len(x))
df_POWO_combined = df_POWO_combined.assign(description_character_count=character_count_description)

word_count_description = df_POWO_combined["BERT_description"].apply(lambda x: len(x.split()))
df_POWO_combined = df_POWO_combined.assign(description_word_count=word_count_description)

character_count_description = df_WIKI_combined["BERT_description"].apply(lambda x: len(x))
df_WIKI_combined = df_WIKI_combined.assign(description_character_count=character_count_description)

word_count_description = df_WIKI_combined["BERT_description"].apply(lambda x: len(x.split()))
df_WIKI_combined = df_WIKI_combined.assign(description_word_count=word_count_description)

# Create Morphology General Habit & Morphology Leaf Datasets 

In [54]:
df_POWO_MGH = df_POWO[df_POWO["POWO_id"]=="morphologyGeneralHabit"]
df_POWO_ML = df_POWO[df_POWO["POWO_id"]=="morphologyLeaf"]

In [55]:
df_POWO_MGH

Unnamed: 0.1,Unnamed: 0,POWO_id,description,Growth Form,source,name,authors,i,ID,fqId,...,modified,language,creator,Problematic,prep_description_1,Language,trans_prep_description_1,QA_description,BERT_description,BOW_description
6,8,morphologyGeneralHabit,"&nbsp;Annual or short-lived perennial, erect t...",,Flora Zambesiaca Leguminosae subfamily Papilli...,Tephrosia longipes,Meisn.,151954,520704-1,urn:lsid:ipni.org:names:520704-1,...,,,,2.0,";Annual or short-lived perennial, erect to 1.6...",en,";Annual or short-lived perennial, erect to 1.6...","; annual or short-lived perennial , erect to 1...",annual or shortlived perennial erect to from t...,annual shortlived perennial erect taproot suff...
15,21,morphologyGeneralHabit,&nbsp;Erect to climbing shrub to 2 m high; bra...,Shrub,"Asparagaceae, Sebsebe Demissew. Flora of Tropi...",Asparagus scaberulus,A.Rich.,34764,531301-1,urn:lsid:ipni.org:names:531301-1,...,,,,,;Erect to climbing shrub to 2 m high; branches...,en,;Erect to climbing shrub to 2 m high; branches...,; erect to climbing shrub to 2 m high ; branch...,erect to climbing shrub to high branches purpl...,erect climbing shrub high branches purplish br...
24,32,morphologyGeneralHabit,(Annual? or) perennial herb with a stout verti...,Herb,"Amaranthaceae, C. C. Townsend. Flora Zambesiac...",Alternanthera nodiflora,R.Br.,7634,59266-1,urn:lsid:ipni.org:names:59266-1,...,,,,,(Annual? or) perennial herb with a stout verti...,en,(Annual? or) perennial herb with a stout verti...,( annual ? or ) perennial herb with a stout ve...,annual or perennial herb with stout vertical r...,annual perennial herb stout vertical rootstock...
27,37,morphologyGeneralHabit,(Annual? or) perennial herb with a stout verti...,Herb,"Amaranthaceae, C.C. Townsend. Flora of Tropica...",Alternanthera nodiflora,R.Br.,7634,59266-1,urn:lsid:ipni.org:names:59266-1,...,,,,,(Annual? or) perennial herb with a stout verti...,en,(Annual? or) perennial herb with a stout verti...,( annual ? or ) perennial herb with a stout ve...,annual or perennial herb with stout vertical r...,annual perennial herb stout vertical rootstock...
38,53,morphologyGeneralHabit,"? Perennial herb, probably with a trailing woo...",Herb,"Polygonaceae, R. A. Graham. Flora of Tropical ...",Oxygonum stuhlmannii,Dammer,273190,694850-1,urn:lsid:ipni.org:names:694850-1,...,,,,,"? Perennial herb, probably with a trailing woo...",en,"? Perennial herb, probably with a trailing woo...","? perennial herb , probably with a trailing wo...",perennial herb probably with trailing woody ba...,perennial herb probably trailing woody base as...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
288248,412270,morphologyGeneralHabit,"Woody-based perennial to slender shrub, 0.1–1....",,"M. Thulin. Flora of Somalia, Vol. 1–4 [updated...",Thesium hararensis,A.G.Mill.,313989,903247-1,urn:lsid:ipni.org:names:903247-1,...,,,,,"Woody-based perennial to slender shrub, 0.1-1....",en,"Woody-based perennial to slender shrub, 0.1-1....","woody-based perennial to slender shrub , 0.1-1...",woodybased perennial to slender shrub tall gla...,woodybased perennial slender shrub tall glabro...
288249,412274,morphologyGeneralHabit,Woody-based Thesium-like herb with several ere...,,"Rubiaceae, B. Verdcourt. Flora Zambesiaca 5:1....",Manostachya staelioides,(K.Schum.) Bremek.,302653,755812-1,urn:lsid:ipni.org:names:755812-1,...,,,,,Woody-based Thesium-like herb with several ere...,en,Woody-based Thesium-like herb with several ere...,woody-based thesium-like herb with several ere...,woodybased thesiumlike herb with several erect...,woodybased thesiumlike herb several erect glab...
288250,412278,morphologyGeneralHabit,"Woody-stemmed herb, 4 ft. high",,"Papilionaceae, Hutchinson and Dalziel. Flora o...",Indigofera megacephala,J.B.Gillett,143886,499640-1,urn:lsid:ipni.org:names:499640-1,...,,,,,"Woody-stemmed herb, 4 ft. high",en,"Woody-stemmed herb, 4 ft. high","woody-stemmed herb , 4 ft. high",woodystemmed herb ft high,woodystemmed herb ft high
288252,412280,morphologyGeneralHabit,Young branches and leaves rusty; leaves slight...,,"Apocynaceae, E.A. Omino. Flora of Tropical Eas...",Beaumontia grandiflora,Wall.,19449,77539-1,urn:lsid:ipni.org:names:77539-1,...,,,,,Young branches and leaves rusty; leaves slight...,en,Young branches and leaves rusty; leaves slight...,young branches and leaves rusty ; leaves sligh...,young branches and leaves rusty leaves slightl...,young branches leaves rusty leaves slightly ob...


# Input Label Data 

## Global Inventory of Floras and Traits - GIFT Dataset

### GIFT Trait Data

In [18]:
df_GIFT_traits = pd.read_csv("..//Data//Initial Databases//GIFT_traits.csv")

  df_GIFT_traits = pd.read_csv("..//Data//GIFT_traits.csv")


In [19]:
df_GIFT_traits = df_GIFT_traits[df_GIFT_traits["restricted"]==1].drop(["restricted"], axis=1)

In [20]:
df_GIFT_traits

Unnamed: 0,work_ID,trait_ID,trait_value,agreement,references,bias_by_reference,bias_by_derivation
0,1,1.1.1,non-woody,1.0,10465,0,1
1,1,1.2.1,herb,1.0,10465,0,1
2,1,1.2.2,herb,1.0,10465,0,1
4,1,1.6.1,0.05,,1025410342,0,0
6,1,1.6.2,0.2,,1025410342,0,0
...,...,...,...,...,...,...,...
5793217,437924,3.3.2,anemochorous,1.0,10652,0,0
5793218,437962,3.3.1,anemochorous,1.0,10652,0,0
5793219,437962,3.3.2,anemochorous,1.0,10652,0,0
5793220,437984,3.3.1,zoochorous,1.0,10652,0,0


#### Filter Traits

In [21]:
trait_names_cat = ["Growth_form_1", "Epiphyte_1", "Climber_1", "Lifecycle_1", "Life_form_1"]
traits_cat = ["1.2.1", "1.3.1", "1.4.1", "2.1.1", "2.3.1"]

In [22]:
trait_names_num = ["Plant_height_max", "Leaf_length_max", "Leaf_width_max"]
traits_num = ["1.6.2", "4.6.2", "4.7.2"]

In [23]:
trait_dict = {trait_code: trait_name for trait_code, trait_name in zip(np.hstack((traits_cat, traits_num)), np.hstack((trait_names_cat, trait_names_num)))}

In [24]:
for trait_code, trait_name in zip(np.hstack((traits_cat, traits_num)), np.hstack((trait_names_cat, trait_names_num))):
    print("Trait: {}\nCode: {}\nCount {}".format(trait_name, trait_code, df_GIFT_traits["trait_ID"].value_counts()[trait_code]))
    print("------")

Trait: Growth_form_1
Code: 1.2.1
Count 251006
------
Trait: Epiphyte_1
Code: 1.3.1
Count 216396
------
Trait: Climber_1
Code: 1.4.1
Count 230761
------
Trait: Lifecycle_1
Code: 2.1.1
Count 202302
------
Trait: Life_form_1
Code: 2.3.1
Count 102055
------
Trait: Plant_height_max
Code: 1.6.2
Count 78968
------
Trait: Leaf_length_max
Code: 4.6.2
Count 20347
------
Trait: Leaf_width_max
Code: 4.7.2
Count 17110
------


In [25]:
# Filtering Data based on selected traits - WARNING: REWRITING DATAFRAME
df_GIFT_traits = df_GIFT_traits[df_GIFT_traits["trait_ID"].apply(lambda x: x in np.hstack((traits_cat, traits_num)))]

In [26]:
# Drop Duplicates - WARNING: REWRITING DATAFRAME
df_GIFT_traits = df_GIFT_traits.drop_duplicates(["work_ID", "trait_ID", "trait_value", "references", "bias_by_reference", "bias_by_derivation"])

In [27]:
df_GIFT_traits

Unnamed: 0,work_ID,trait_ID,trait_value,agreement,references,bias_by_reference,bias_by_derivation
1,1,1.2.1,herb,1.0,10465,0,1
6,1,1.6.2,0.2,,1025410342,0,0
9,1,2.3.1,hemicryptophyte,1.0,10465,0,0
33,3,1.2.1,tree,1.0,1032110598,1,0
35,3,2.1.1,perennial,1.0,1032110598,1,1
...,...,...,...,...,...,...,...
5793183,437906,1.6.2,0.3,,10651,0,0
5793187,437906,2.1.1,perennial,1.0,10651,0,0
5793197,437907,1.2.1,herb,1.0,10651,0,0
5793201,437907,1.6.2,0.6,,10651,0,0


In [28]:
print("Number of Unique Species: {}".format(df_GIFT_traits["work_ID"].nunique()))
print("Average Number of Traits Per Species: {} (Max: {})".format(np.round(len(df_GIFT_traits)/df_GIFT_traits["work_ID"].nunique(), 2), len(trait_dict)))

Number of Unique Species: 288159
Average Number of Traits Per Species: 3.88 (Max: 8)


### GIFT Species Data

In [29]:
df_GIFT_names = pd.read_csv("..//Data//Initial Databases//GIFT_names_matched.csv")

In [30]:
df_GIFT_names

Unnamed: 0,orig_ID,name_ID,genus,species_epithet,subtaxon,author,family,matched,epithetscore,overallscore,resolved,service,work_ID,species
0,1.0,3136,Amaranthus,interruptus,,R.Br.,Amaranthaceae,1,1.0,1.000000,1,tpl,1943,Amaranthus interruptus
1,2.0,5078,Argusia,argentea,,(L.f.) Heine,Boraginaceae,1,1.0,1.000000,0,tpl,3128,Argusia argentea
2,3.0,18061,Cordia,subcordata,,Lam.,Boraginaceae,1,1.0,1.000000,1,tpl,11046,Cordia subcordata
3,4.0,16442,Cleome,gynandra,,L.,Cleomaceae,1,1.0,1.000000,1,tpl,9985,Cleome gynandra
4,5.0,37713,Ipomoea,macrantha,,Roem. & Schult.,Convolvulaceae,1,1.0,1.000000,1,tpl,23490,Ipomoea violacea
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
992258,,877624,,,,,Leguminosae,1,1.0,0.648649,1,tpl,102406,Trifolium tumens
992259,,877626,,,,,Leguminosae,1,1.0,0.742857,1,tpl,381907,Trigonella gracilis
992260,,877633,,,,,Caprifoliaceae,1,1.0,0.697674,0,tpl,79926,Valeriana stracheyi
992261,,877635,,,,,Leguminosae,1,1.0,0.666667,1,tpl,57728,Vicia johannis


## Combine Label Information & Convert to Wide

In [31]:
df_GIFT_long = df_GIFT_traits.merge(df_GIFT_names[["work_ID", "species"]], how='left', on='work_ID').drop_duplicates()

In [32]:
df_GIFT_long

Unnamed: 0,work_ID,trait_ID,trait_value,agreement,references,bias_by_reference,bias_by_derivation,species
0,1,1.2.1,herb,1.0,10465,0,1,Aaronsohnia pubescens
11,1,1.6.2,0.2,,1025410342,0,0,Aaronsohnia pubescens
22,1,2.3.1,hemicryptophyte,1.0,10465,0,0,Aaronsohnia pubescens
33,3,1.2.1,tree,1.0,1032110598,1,0,Abarema abbottii
36,3,2.1.1,perennial,1.0,1032110598,1,1,Abarema abbottii
...,...,...,...,...,...,...,...,...
3907626,437906,1.6.2,0.3,,10651,0,0,Silene ispirensis
3907627,437906,2.1.1,perennial,1.0,10651,0,0,Silene ispirensis
3907628,437907,1.2.1,herb,1.0,10651,0,0,Antitoxicum raddeanum
3907629,437907,1.6.2,0.6,,10651,0,0,Antitoxicum raddeanum


In [33]:
def long_to_wide(df, trait_dict):
    df_new = pd.DataFrame(index=np.unique(df["work_ID"]))
    df_new["work_ID"] = np.unique(df["work_ID"])
    df_new["species"] = df.groupby("work_ID")["species"].aggregate(lambda x: ''.join(set(x)))
    
    for trait_code in trait_dict:
        df_tmp = df[df["trait_ID"]==trait_code]#.astype(str)
        df_tmp["agreement"] = np.round(df_tmp["agreement"], 2).astype(str)
        df_tmp["bias_by_reference"] = df_tmp["bias_by_reference"].astype(str)
        df_tmp["bias_by_derivation"] = df_tmp["bias_by_derivation"].astype(str)
        
        df_new[trait_code + "_count"] = df_tmp[df_tmp["trait_value"].notna()].groupby("work_ID")["trait_value"].count()
        df_new[trait_code] = df_tmp[df_tmp["trait_value"].notna()].groupby("work_ID")["trait_value"].aggregate(lambda x: "|".join(np.unique(np.hstack(x))))
        df_new[trait_code + "_agreement"] = df_tmp[df_tmp["trait_value"].notna()].groupby("work_ID")["agreement"].aggregate(lambda x: "|".join(np.hstack(x)))
        df_new[trait_code + "_bias_by_reference"] = df_tmp[df_tmp["trait_value"].notna()].groupby("work_ID")["bias_by_reference"].aggregate(lambda x: "|".join(np.hstack(x)))
        df_new[trait_code + "_bias_by_derivation"] = df_tmp[df_tmp["trait_value"].notna()].groupby("work_ID")["bias_by_derivation"].aggregate(lambda x: "|".join(np.hstack(x)))

    return df_new

In [34]:
df_GIFT_wide = long_to_wide(df_GIFT_long, trait_dict)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp["agreement"] = np.round(df_tmp["agreement"], 2).astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp["bias_by_reference"] = df_tmp["bias_by_reference"].astype(str)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_tmp["bias_by_derivation"] = df_tmp["bias_by_derivation"].astype(s

In [35]:
df_GIFT_wide

Unnamed: 0,work_ID,species,1.2.1_count,1.2.1,1.2.1_agreement,1.2.1_bias_by_reference,1.2.1_bias_by_derivation,1.3.1_count,1.3.1,1.3.1_agreement,...,4.6.2_count,4.6.2,4.6.2_agreement,4.6.2_bias_by_reference,4.6.2_bias_by_derivation,4.7.2_count,4.7.2,4.7.2_agreement,4.7.2_bias_by_reference,4.7.2_bias_by_derivation
1,1,Aaronsohnia pubescens,1.0,herb,1.0,0,1,,,,...,,,,,,,,,,
3,3,Abarema abbottii,1.0,tree,1.0,1,0,,,,...,,,,,,,,,,
4,4,Abarema alexandri,1.0,tree,1.0,0,0,,,,...,,,,,,,,,,
5,5,Abarema asplenifolia,1.0,tree,1.0,1,0,,,,...,,,,,,,,,,
6,6,Abarema glauca,1.0,tree,1.0,0,0,1.0,terrestrial,1.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
437903,437903,Razoumowskya oxycerdi,1.0,shrub,1.0,0,0,,,,...,,,,,,,,,,
437904,437904,Rosa bifera,1.0,shrub,1.0,0,0,,,,...,,,,,,,,,,
437905,437905,Saussurea pulviniformis,,,,,,,,,...,,,,,,,,,,
437906,437906,Silene ispirensis,,,,,,,,,...,,,,,,,,,,


# Combine Description & Trait Data

In [36]:
"""Combine the POWO/WIKI and GIFT databases regards to their species name and add the wanted traits"""
def combine_description_trait_data(df, df_GIFT):
    df_GIFT["key"] = df_GIFT["species"]
    df["key"] = df["name"]
    df = pd.merge(df, df_GIFT, how = "left", on = "key")
    df.drop("key", axis=1, inplace=True)
    return df

## POWO_GIFT

In [37]:
df_POWO_GIFT = combine_description_trait_data(df_POWO_combined, df_GIFT_wide)

In [38]:
df_POWO_GIFT

Unnamed: 0,name,authors,POWO_ids,POWO_id_N,Language,description,QA_description,BERT_description,BOW_description,description_character_count,...,4.6.2_count,4.6.2,4.6.2_agreement,4.6.2_bias_by_reference,4.6.2_bias_by_derivation,4.7.2_count,4.7.2,4.7.2_agreement,4.7.2_bias_by_reference,4.7.2_bias_by_derivation
0,Aa argyrolepis,Rchb.f.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb,4,...,,,,,,,,,,
1,Aa colombiana,Schltr.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb,4,...,,,,,,,,,,
2,Aa denticulata,Schltr.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb,4,...,,,,,,,,,,
3,Aa leucantha,(Rchb.f.) Schltr.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb,4,...,,,,,,,,,,
4,Aa maderoi,Schltr.,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb,4,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
59146,× Agropogon lutosus,(Poir.) P.Fourn.,morphologyReproductiveInflorescenceBractGlume;...,7,en,Glumes persistent; similar; exceeding apex of ...,glumes persistent ; similar ; exceeding apex o...,glumes persistent similar exceeding apex of fl...,glumes persistent similar exceeding apex flore...,1532,...,,,,,,,,,,
59147,× Calicharis butcheri,(Traub) Meerow,morphologyGeneralHabit,1,es,Hierba,herb,herb,herb,4,...,,,,,,,,,,
59148,× Chrismatopteris holttumii,Quansah & D.S.Edwards,note;morphologyGeneral;morphologyReproductiveS...,3,en,Holttum had annotated _x000D_\n<i>Faden &amp; ...,holttum had annotated faden amp ; evans 70/422...,holttum had annotated faden amp evans in with ...,holttum annotated faden amp evans new name bas...,1938,...,,,,,,,,,,
59149,× Dupoa labradorica,(Steud.) J.Cay. & Darbysh.,morphologyReproductiveInflorescenceBractGlume;...,7,en,Glumes persistent; similar; shorter than spike...,glumes persistent ; similar ; shorter than spi...,glumes persistent similar shorter than spikele...,glumes persistent similar shorter spikelet fir...,1422,...,,,,,,,,,,


## WIKI_GIFT

In [39]:
df_WIKI_GIFT = combine_description_trait_data(df_WIKI_combined, df_GIFT_wide)

In [40]:
df_WIKI_GIFT

Unnamed: 0,name,authors,WIKI_ids,WIKI_id_N,Language,description,QA_description,BERT_description,BOW_description,description_character_count,...,4.6.2_count,4.6.2,4.6.2_agreement,4.6.2_bias_by_reference,4.6.2_bias_by_derivation,4.7.2_count,4.7.2,4.7.2_agreement,4.7.2_bias_by_reference,4.7.2_bias_by_derivation
0,Aa achalensis,"Schltr., 1920",Summary;Summary,2,en,Aa achalensis is a species of orchid in the ge...,aa achalensis is a species of orchid in the ge...,aa achalensis is species of orchid in the genu...,aa achalensis species orchid genus aa referenc...,123,...,,,,,,,,,,
1,Aa argyrolepis,"Rchb.f., 1854",Summary;References;Summary;References,4,en,Aa argyrolepis is an orchid in the genus Aa. ...,aa argyrolepis is an orchid in the genus aa . ...,aa argyrolepis is an orchid in the genus aa it...,aa argyrolepis orchid genus aa grows altitudes...,1077,...,,,,,,,,,,
2,Aa aurantiaca,D. Trujillo (2011)[1],Summary;Summary,2,en,Aa aurantiaca is a species of orchid in the ge...,aa aurantiaca is a species of orchid in the ge...,aa aurantiaca is species of orchid in the genu...,aa aurantiaca species orchid genus aa native p...,229,...,,,,,,,,,,
3,Aa calceata,"Schltr., 1912",Summary;Summary,2,en,Aa calceata is a species of orchid in the genu...,aa calceata is a species of orchid in the genu...,aa calceata is species of orchid in the genus ...,aa calceata species orchid genus aait found bo...,181,...,,,,,,,,,,
4,Aa colombiana,Schltr.,Summary;Summary,2,en,Aa colombiana is a species of orchid in the ge...,aa colombiana is a species of orchid in the ge...,aa colombiana is species of orchid in the genu...,aa colombiana species orchid genus aa found co...,247,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55626,Zygosepalum labiosum,(Rich.) C.Schweinf.,Summary;Description,2,en,Zygosepalum labiosum is an epiphytic orchid fo...,zygosepalum labiosum is an epiphytic orchid fo...,zygosepalum labiosum is an epiphytic orchid fo...,zygosepalum labiosum epiphytic orchid found so...,443,...,,,,,,,,,,
55627,Zygostigma australe,(Cham. & Schltdl.) Griseb.,Summary,1,en,Zygostigma australe is a species of flowering ...,zygostigma australe is a species of flowering ...,zygostigma australe is species of flowering pl...,zygostigma australe species flowering plant fa...,238,...,,,,,,,,,,
55628,Zygotritonia bongensis,(Pax) Mildbr.,Summary;Morphology;Distribution,3,en,Zygotritonia bongensis is a perennial herb of ...,zygotritonia bongensis is a perennial herb of ...,zygotritonia bongensis is perennial herb of th...,zygotritonia bongensis perennial herb iridacea...,745,...,,,,,,,,,,
55629,Zyzyxia lundellii,"(H.Robinson) Strother, 1991",Summary;Description and distribution;Naming,3,en,Zyzyxia is a genus of tropical shrubs in the f...,zyzyxia is a genus of tropical shrubs in the f...,zyzyxia is genus of tropical shrubs in the fam...,zyzyxia genus tropical shrubs family asteracea...,1626,...,,,,,,,,,,


In [56]:
df_POWO_MGH_GIFT = combine_description_trait_data(df_POWO_MGH, df_GIFT_wide)
df_POWO_ML_GIFT = combine_description_trait_data(df_POWO_ML, df_GIFT_wide)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["key"] = df["name"]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["key"] = df["name"]


## Save Data

In [41]:
df_POWO_GIFT.to_excel("..//Data//Final Databases//POWO_GIFT.xlsx", index = False)

In [42]:
df_WIKI_GIFT.to_excel("..//Data//Final Databases//WIKI_GIFT.xlsx", index = False)

In [59]:
df_POWO_MGH_GIFT.to_excel("..//Data//Final Databases//POWO_MGH_GIFT.xlsx", index = False)
df_POWO_ML_GIFT.to_excel("..//Data//Final Databases//POWO_ML_GIFT.xlsx", index = False)