# Species data collection

This notebook is part of a technical test at FCD. The main task is to write Python code that describes the sample dataset through its categorical attributes. The second part of the task covers visualizing the proportion per category and geo-visualization.

Run the cells below in order to reproduce all results.

## GitHub Repository

Find more information about this code in the repository [SpCollection_FCD Repository](https://github.com/bryanvallejo16/SpCollection_FCD).


## Hands on coding

In [17]:
import pandas as pd
import plotly.express as px

- Load data using pandas

In [18]:
# read data
data = pd.read_csv('data/CDFCollections240508csv.csv')

In [19]:
data.head()

Unnamed: 0,Kingdom,Family,SciName,LatitudeDecimal,LongitudeDecimal,CollectionCode,Origin,StartDate
0,Plantae,Salviniaceae,Azolla filiculoides subsp. cristata,-0.6167,-90.35,CDS,Native,2003-04-20 00:00:00
1,Plantae,Boraginaceae,Tournefortia pubescens,-0.496722,-90.358638,CDS,Endemic,1963-02-10 00:00:00
2,Plantae,Sapindaceae,Cardiospermum galapageium,-0.396666,-90.290833,CDS,Endemic,1963-02-10 00:00:00
3,Plantae,Euphorbiaceae,Croton scouleri var. darwinii,1.375,-91.8194,CDS,Endemic,1963-02-16 00:00:00
4,Plantae,Boraginaceae,Tiquilia fusca,-0.582327,-90.165816,CDS,Endemic,1963-02-20 00:00:00


- For each collection (identified by CollectionCode), calculate the proportion of each taxonomic family

In [20]:
print(f'The dataset contains in total {data.CollectionCode.nunique()} unique Collection Codes\n\n{data.CollectionCode.unique()}')

The dataset contains in total 4 unique Collection Codes

['CDS' 'ICCDRS' 'VCCDRS' 'MCCDRS']


In [21]:
# get subset with only the needed columns - Family and CollectionCode
data_taxonomy = data[['Family', 'CollectionCode']]

# get a DF with proportions - normalize=True makes the proportions sum to 1 within each collection code
data_taxonomy_df = data_taxonomy.groupby('CollectionCode', as_index=False).value_counts(normalize=True)

data_taxonomy_df.head(12)

Unnamed: 0,CollectionCode,Family,proportion
0,CDS,Asteraceae,0.061828
1,CDS,Poaceae,0.052307
2,CDS,Physciaceae (= Caliciaceae),0.04548
3,CDS,Parmeliaceae,0.044223
4,CDS,Fabaceae,0.040301
5,CDS,Euphorbiaceae,0.034402
6,CDS,Roccellaceae,0.033594
7,CDS,Boraginaceae,0.033504
8,CDS,Ramalinaceae,0.032246
9,CDS,Malvaceae,0.027516
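
A quick sanity check on the table above is to confirm that the proportions add up to 1 within each collection code. The sketch below uses a small toy DataFrame (the rows are illustrative and not taken from the dataset) and the same `groupby(...).value_counts(normalize=True)` call as the notebook; it assumes a pandas version that provides `DataFrameGroupBy.value_counts` (1.4 or later).

```python
import pandas as pd

# Toy stand-in for the real dataset (illustrative rows only)
toy = pd.DataFrame({
    "CollectionCode": ["CDS", "CDS", "CDS", "CDS", "ICCDRS", "ICCDRS"],
    "Family": ["Asteraceae", "Asteraceae", "Poaceae", "Fabaceae", "Poaceae", "Poaceae"],
})

# Same call as in the notebook: proportion of each family within each collection
toy_prop = toy.groupby("CollectionCode", as_index=False).value_counts(normalize=True)

# Sum the proportions per collection; each total should be (numerically) 1
totals = toy_prop.groupby("CollectionCode")["proportion"].sum()
print(totals)
```

Running the same `groupby("CollectionCode")["proportion"].sum()` on `data_taxonomy_df` should give a value close to 1 for each of the four collection codes.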


In [22]:
# save table in OUTPUT folder
data_taxonomy_df.to_csv('output/CollectioCode_Family_proportion.csv', index=False)

- Calculate the *species origin* status proportion

In [23]:
print(f'The dataset contains in total {data.Origin.nunique()} unique Species Origins\n\n{data.Origin.unique()}')

The dataset contains in total 9 unique Species Origins

['Native' 'Endemic' 'Cryptogenic' 'Introduced - established' 'No data'
 'Introduced - eradicated' 'Introduced - intercepted' 'Migrant' 'Vagrant']


In [24]:
# get subset with only the needed column - species origin
data_sp_origin = data[['Origin']]

# get a DF with proportions - normalize=True makes the proportions sum to 1 across all origins
# name the value column explicitly so the result does not depend on the pandas version
data_sp_origin_df = data_sp_origin.value_counts(normalize=True).reset_index(name='proportion')

data_sp_origin_df


Unnamed: 0,Origin,proportion
0,Native,0.368857
1,Endemic,0.30396
2,Introduced - established,0.271644
3,Cryptogenic,0.031823
4,No data,0.02137
5,Migrant,0.000803
6,Vagrant,0.000571
7,Introduced - eradicated,0.000494
8,Introduced - intercepted,0.000479
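
Because several categories have very small proportions, it can help to report the raw counts next to them. The following is a minimal sketch on a toy `Origin` column (illustrative values, not the dataset); the `summary` name and the `count` column are assumptions introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the Origin column (illustrative values only)
toy = pd.DataFrame({
    "Origin": ["Native", "Native", "Native", "Endemic", "Endemic",
               "Introduced - established"],
})

# Absolute number of records per origin, most frequent first
counts = toy["Origin"].value_counts()

# Put counts and proportions side by side in one table
summary = (
    pd.DataFrame({"count": counts, "proportion": counts / counts.sum()})
    .rename_axis("Origin")
    .reset_index()
)
print(summary)
```

Applied to `data["Origin"]`, the same steps would show how many records stand behind each of the nine origin categories.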


In [25]:
# save table in OUTPUT folder
data_sp_origin_df.to_csv('output/SpecieOrigin_proportion.csv', index=False)

## Visualization

- Visualize these results using Plotly, with proper labels and a title

In [26]:
# define a fig
fig = px.bar(data_taxonomy_df.sort_values('proportion', ascending=False), 
             x="CollectionCode", 
             y="proportion", 
             labels = {'proportion': 'Proportion',
                        'CollectionCode': 'Collection Code'},
             hover_data = {'proportion':True,
                           'CollectionCode':False,
                           'Family':True},
             color="Family", 
             title="Proportion of Taxonomic Family, per Collection Code",
             width=None, 
             height=600             
             )

# font
fig.update_layout(
    font_family="Arial",
    title_font_family="Arial")

fig.show()
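
With many families per collection, the stacked bars and the legend can become hard to read. One possible refinement, not part of the original notebook, is to lump families below a chosen proportion into an "Other" category before plotting. The helper name `lump_small_families` and the 2% threshold are hypothetical choices for this sketch, and the toy values are illustrative.

```python
import pandas as pd

def lump_small_families(df, threshold=0.02):
    """Relabel families whose proportion is below `threshold` as 'Other'
    and re-aggregate per collection code. Hypothetical helper."""
    out = df.copy()
    # Keep the family name where the proportion is large enough, else 'Other'
    out["Family"] = out["Family"].where(out["proportion"] >= threshold, "Other")
    # Sum the proportions of the rows that now share the same label
    return out.groupby(["CollectionCode", "Family"], as_index=False)["proportion"].sum()

# Toy proportion table with the same columns as data_taxonomy_df
toy_prop = pd.DataFrame({
    "CollectionCode": ["CDS"] * 4,
    "Family": ["Asteraceae", "Poaceae", "Fabaceae", "Malvaceae"],
    "proportion": [0.6, 0.385, 0.01, 0.005],
})

lumped = lump_small_families(toy_prop, threshold=0.02)
print(lumped)
```

The result keeps the `CollectionCode`, `Family` and `proportion` columns, so it could be passed to `px.bar` in place of `data_taxonomy_df`.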

In [27]:
data_sp_origin_df

Unnamed: 0,Origin,proportion
0,Native,0.368857
1,Endemic,0.30396
2,Introduced - established,0.271644
3,Cryptogenic,0.031823
4,No data,0.02137
5,Migrant,0.000803
6,Vagrant,0.000571
7,Introduced - eradicated,0.000494
8,Introduced - intercepted,0.000479


In [28]:
fig = px.pie(data_sp_origin_df.sort_values('proportion', ascending=False), 
             values='proportion', 
             names='Origin', 
             title='Species Origin Status Proportion',
             width=None, 
             height=600             
             )

fig.update_traces(textposition='outside', textinfo='percent+label')


# font
fig.update_layout(
    font_family="Arial",
    title_font_family="Arial")

fig.show()

- Share the created code in a GitHub repo (Done!)