In this file I work with the measurements of the 16 popular traits for the 3 most popular species in iNaturalist (the exact analysis is still to be decided).

To better understand how the traits and the species were chosen, refer to the file __traits_exploring.ipynb__

For the documentation, refer to the file __TRY_6.0_Data_Release_Notes.pdf__

Funny thing! In the file __filtered_MIS_traits.txt__ the trait names are truncated to a certain number of characters. Therefore, the name of the trait 

"Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or excluded" 

appears as 

"Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): undefined if petiole is in- or exclu"

and initially wasn't found in the trait list when matching by name against "top3_species_top_traits.txt". For this reason I switched to matching by TraitID.
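
A minimal sketch of why matching on TraitID is more robust than matching on names. The toy DataFrame is made up for illustration, and 3117 is assumed to be the TraitID of this trait because it is the ID shown next to this name in the summary table at the end of this file.

```python
import pandas as pd

full_name = ("Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): "
             "undefined if petiole is in- or excluded")
truncated_name = ("Leaf area per leaf dry mass (specific leaf area, SLA or 1/LMA): "
                  "undefined if petiole is in- or exclu")

# Toy stand-in for the measurements table (hypothetical rows, no real values)
toy = pd.DataFrame({
    "TraitID": [3117, 13],
    "TraitName": [full_name, "Leaf carbon (C) content per leaf dry mass"],
})

# Exact name match misses the row because the lookup name is truncated
by_name = toy[toy["TraitName"] == truncated_name]
# Prefix match finds it, but could also match other traits sharing the prefix
by_prefix = toy[toy["TraitName"].str.startswith(truncated_name)]
# ID match is unaffected by how the name was stored
by_id = toy[toy["TraitID"] == 3117]

print(len(by_name), len(by_prefix), len(by_id))  # 0 1 1
```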

In [32]:
import pandas as pd

In [59]:
with open('trait_id_list.txt', 'r') as openfile:
    trait_ids = openfile.read().splitlines()

print(f"There are {len(trait_ids)} traits in total\n")

#converting them to int because they were read as strings
trait_ids = [int(x) for x in trait_ids]


There are 16 traits in total



In [43]:
#specifying encoding='latin' is necessary or you get an error
top3_species_top_traits = pd.read_csv('top3_species_top_traits.txt', sep='\t', encoding='latin')

print("Total:", len(top3_species_top_traits))

#remove outliers: refer to the documentation obtained with the data request to better understand
#only keep the measurements with ErrorRisk < 4, i.e. values within a range of 4 standard deviations. The rest are likely outliers or errors.
#also keep the rows with no value in ErrorRisk, because those are metadata

#500 measurements dropped.
top3_species_top_traits = top3_species_top_traits[(top3_species_top_traits['ErrorRisk'] < 4.0) | (top3_species_top_traits['ErrorRisk'].isnull())]

print("After removing outliers:", len(top3_species_top_traits))

#now remove the duplicates. Again, refer to the documentation for more information
#4032 observations removed
top3_species_top_traits = top3_species_top_traits[top3_species_top_traits['OrigObsDataID'].isnull()]

#Note that only trait measurements are marked as duplicates.
#Therefore, removing the rows marked as duplicates might leave in the dataframe rows related to the same observation (e.g. location)
#For these rows, TraitName is not present. The information is located in the "DataName" column.
#If exploration is necessary, refer to the "ObservationID" of the original dataframe

print("After removing duplicates:", len(top3_species_top_traits))

#Related to what I mentioned above, now I only keep data related to the trait measurements. No metadata.
trait_measurements = top3_species_top_traits[top3_species_top_traits['TraitID'].isin(trait_ids)]

#I remove this trait because I only want numerical traits, for now.
#I can always use it again if I need to do something with it.
#For example, should I perform a different analysis for each plant growth form? (there are 121 distinct forms...)
trait_measurements_no_plant_form = trait_measurements[trait_measurements['TraitName'] != 'Plant growth form']



Total: 116532
After removing outliers: 116032
After removing duplicates: 112000
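
The two filters above can be tried on a small table to see which rows each one drops. This is only a sketch: the rows are invented, the helper `filter_step` is hypothetical, and only the column names are taken from the cell above.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; column names follow the cell above
toy = pd.DataFrame({
    "ObservationID": [1, 1, 2, 3, 4],
    "TraitID": [13.0, np.nan, 13.0, 14.0, 14.0],
    "ErrorRisk": [1.2, np.nan, 5.7, 0.3, 2.0],
    "OrigObsDataID": [np.nan, np.nan, np.nan, 12345.0, np.nan],
})

def filter_step(df, mask, label):
    """Apply a boolean mask and report how many rows it dropped."""
    kept = df[mask]
    print(f"{label}: kept {len(kept)}, dropped {len(df) - len(kept)}")
    return kept

# Keep rows with ErrorRisk below 4, plus rows with no ErrorRisk (metadata)
step1 = filter_step(toy, (toy["ErrorRisk"] < 4.0) | toy["ErrorRisk"].isna(), "outliers")
# Keep rows that are not flagged as duplicates of another record
step2 = filter_step(step1, step1["OrigObsDataID"].isna(), "duplicates")
```

Printing the dropped count at each step makes it easy to compare against the numbers noted in the comments (500 and 4032 in the real data).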




In [35]:
dactylis_glomerata_df = trait_measurements[trait_measurements['AccSpeciesName']=='Dactylis glomerata']
achillea_millefolium_df = trait_measurements[trait_measurements['AccSpeciesName']=='Achillea millefolium']
trifolium_pratense_df = trait_measurements[trait_measurements['AccSpeciesName']=='Trifolium pratense']

#important: remember to select AccSpeciesName and not SpeciesName, because only the former refers to the consolidated name of the
#species, and you risk losing data if you don't do so (e.g. Achillea lanulosa)
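
A small sketch of the difference between the two columns. The rows are invented; the only thing taken from the note above is that Achillea lanulosa is consolidated under Achillea millefolium.

```python
import pandas as pd

# Toy rows (hypothetical): one record was entered under a synonym
toy = pd.DataFrame({
    "SpeciesName": ["Achillea millefolium", "Achillea lanulosa", "Trifolium pratense"],
    "AccSpeciesName": ["Achillea millefolium", "Achillea millefolium", "Trifolium pratense"],
})

by_original = toy[toy["SpeciesName"] == "Achillea millefolium"]
by_accepted = toy[toy["AccSpeciesName"] == "Achillea millefolium"]
print(len(by_original), len(by_accepted))  # 1 2

# Original names that were consolidated under the accepted name
synonyms = sorted(by_accepted["SpeciesName"].unique())
print(synonyms)
```

Listing the original names per accepted name is a quick way to check how much data a filter on SpeciesName would have missed.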


In [None]:
columns_of_interest = ['ObservationID', 'TraitID', 'TraitName', 'OrigValueStr', 'OrigUnitStr', 'StdValue', 'UnitName']

dac_glo_trait_df = dactylis_glomerata_df[columns_of_interest]
ach_mil_trait_df = achillea_millefolium_df[columns_of_interest]
tri_pra_trait_df = trifolium_pratense_df[columns_of_interest]

dac_glo_trait_df[:5]

#I keep the observationID in case I need to link it to its metadata
#I also have to keep both Original Value and Standardized Value, because for textual data (e.g. Plant Growth Form)
#there is no standardized value

Unnamed: 0,ObservationID,TraitID,TraitName,OrigValueStr,OrigUnitStr,StdValue,UnitName
9,19150,13.0,Leaf carbon (C) content per leaf dry mass,45.85,%,458.5,mg/g
15,19194,13.0,Leaf carbon (C) content per leaf dry mass,41.62,%,416.2,mg/g
20,19216,14.0,Leaf nitrogen (N) content per leaf dry mass,2.78,%,27.8,mg/g
28,19514,42.0,Plant growth form,Herbaceous Monocot,,,
29,19514,14.0,Leaf nitrogen (N) content per leaf dry mass,1.591335,%,15.91335,mg/g


In [41]:
#just wanted to check that they are all there
# unique = list(dac_glo_trait_df['TraitName'].unique())
# print(sorted(unique))


In [66]:
def trait_analysis(df, trait_id):
    """Return mean, std and median of StdValue for one trait as a pd.Series."""

    #trait_id instead of id, to avoid shadowing the built-in id()
    trait_specific_df = df[df['TraitID'] == trait_id]

    trait_values = {
        'TraitID': trait_id,
        'TraitName': trait_specific_df['TraitName'].iloc[0],  #they are all the same
        'Mean': trait_specific_df['StdValue'].mean(),
        'Std': trait_specific_df['StdValue'].std(),
        'Median': trait_specific_df['StdValue'].median(),
        'UnitName': trait_specific_df['UnitName'].iloc[0],
    }

    return pd.Series(trait_values)

In [75]:
dac_glo_trait_analysis = pd.DataFrame(columns=['TraitID', 'TraitName', 'Mean', 'Std', 'Median', 'UnitName'])

#all minus plant growth form (TraitID 42) as specified above; TraitID 159 is excluded as well
trait_ids_minus_plant_growth_form = [tid for tid in trait_ids if tid not in (42, 159)]

for trait_id in trait_ids_minus_plant_growth_form:
    dac_glo_trait_analysis.loc[len(dac_glo_trait_analysis)] = trait_analysis(dac_glo_trait_df, trait_id)

dac_glo_trait_analysis

#careful, the median of Seed (seedbank) longevity is 0

Unnamed: 0,TraitID,TraitName,Mean,Std,Median,UnitName
0,3117,Leaf area per leaf dry mass (specific leaf are...,24.992908,7.758924,23.7225,mm2 mg-1
1,13,Leaf carbon (C) content per leaf dry mass,453.11471,23.759855,449.932,mg/g
2,55,Leaf dry mass (single leaf),73.261125,63.818979,53.3,mg
3,47,Leaf dry mass per leaf fresh mass (leaf dry ma...,0.275087,0.057974,0.267,g/g
4,163,Leaf fresh mass,0.277901,0.22095,0.224,g
5,50,Leaf nitrogen (N) content per leaf area,0.967233,0.314514,0.919261,g m-2
6,14,Leaf nitrogen (N) content per leaf dry mass,24.257659,7.137246,24.222146,mg/g
7,403,Plant biomass and allometry: Shoot dry mass (p...,1.897759,4.998163,0.32635,g
8,3106,Plant height vegetative,0.472429,0.322961,0.4,m
9,33,Seed (seedbank) longevity,0.167742,0.491532,0.0,dimensionless
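
The same kind of per-trait summary can also be sketched with a single `groupby` instead of a loop. This is an alternative under assumptions, not the method used above: the toy values are made up, the extra `N` column is my addition, and `"first"` takes the first non-null value per group, which matches `.iloc[0]` only when that first value is not missing.

```python
import pandas as pd

# Toy data (hypothetical values); the text-only trait has no StdValue
toy = pd.DataFrame({
    "TraitID": [13, 13, 14, 14, 14, 42],
    "TraitName": ["Leaf carbon (C) content per leaf dry mass"] * 2
                 + ["Leaf nitrogen (N) content per leaf dry mass"] * 3
                 + ["Plant growth form"],
    "StdValue": [458.5, 416.2, 27.8, 15.9, 20.0, None],
    "UnitName": ["mg/g"] * 5 + [None],
})

# Drop rows without a standardized value, so only numerical traits remain
numeric = toy.dropna(subset=["StdValue"])

summary = (
    numeric.groupby("TraitID")
    .agg(
        TraitName=("TraitName", "first"),
        Mean=("StdValue", "mean"),
        Std=("StdValue", "std"),
        Median=("StdValue", "median"),
        UnitName=("UnitName", "first"),
        N=("StdValue", "count"),  # number of measurements behind each statistic
    )
    .reset_index()
)
print(summary)
```

Keeping the count next to the statistics helps when reading rows like the seedbank one, where a median of 0 may simply reflect how the values are distributed.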
