# README

you can use this notebook to create useful subsets of the GBIF data and write those subsets to new csv files. you can then read the new files into your group project notebooks and complete your analyses with less data in memory.

In [1]:
import pandas as pd

In [2]:

file_name = "GBIF_can_us.csv" # replace this with the file you want to clean up, if you are using this notebook to do so.

dat = pd.read_csv(file_name,sep="\t")
dat.shape

(498735, 22)
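if memory is tight even at this first step, one option is to load only the columns you plan to use. the sketch below is a minimal illustration on a tiny made-up table (not real GBIF data); `toy_tsv`, `needed` and `small` are names invented for this example, and which columns you actually need is an assumption you should adjust.

```python
import io
import pandas as pd

# made-up, tab-separated stand-in for a GBIF file (not real data)
toy_tsv = (
    "taxonKey\tscientificName\ttaxonRank\tgenus\tspecies\n"
    "1\tAcer saccharum Marshall\tSPECIES\tAcer\tAcer saccharum\n"
    "2\tAcer\tGENUS\tAcer\t\n"
)

# usecols tells read_csv to keep only the named columns
needed = ["taxonRank", "genus", "species"]
small = pd.read_csv(io.StringIO(toy_tsv), sep="\t", usecols=needed)
print(small.shape)
```

with a real file you would pass the file name instead of the `io.StringIO(...)` object.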

In [3]:
dat.head()

(index),taxonKey,scientificName,acceptedTaxonKey,acceptedScientificName,numberOfOccurrences,taxonRank,taxonomicStatus,kingdom,kingdomKey,phylum,...,classKey,order,orderKey,family,familyKey,genus,genusKey,species,speciesKey,iucnRedListCategory
0,341,Turbellaria,341.0,Turbellaria,106,CLASS,ACCEPTED,Animalia,1,Platyhelminthes,...,341.0,,,,,,,,,
1,830,Nitrospirales,830.0,Nitrospirales,1014,ORDER,ACCEPTED,Bacteria,3,Nitrospirota,...,10889728.0,Nitrospirales,830.0,,,,,,,
2,982,Neogastropoda,982.0,Neogastropoda,681,ORDER,ACCEPTED,Animalia,1,Mollusca,...,225.0,Neogastropoda,982.0,,,,,,,
3,1445,Psittaciformes,1445.0,Psittaciformes,1,ORDER,ACCEPTED,Animalia,1,Chordata,...,212.0,Psittaciformes,1445.0,,,,,,,
4,2390,Scrophulariaceae,2390.0,Scrophulariaceae,128,FAMILY,ACCEPTED,Plantae,6,Tracheophyta,...,220.0,Lamiales,408.0,Scrophulariaceae,2390.0,,,,,NE


# removing imprecise organism sightings

one peculiarity of this dataset is that it contains entries for taxonomic ranks other than species, such as 'class' and 'order'. so for example, you might see a listing for 'Primates' under acceptedScientificName, where taxonRank would be 'ORDER'. we probably just want to compare _species_, so we can omit any row whose taxonRank is not 'SPECIES'.

note: if you are interested, this occurs because someone may find an organism and only be able to say "this is a maple", without knowing what _specific_ type (species) of maple it is. so they submit "maple" to the database. we do not need such imprecise data at the moment.

In [5]:
sp_only = dat[dat['taxonRank'] == "SPECIES"]
sp_only.shape

(294771, 22)
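before filtering, it can help to see which ranks are present and how many rows each one has. this is a small sketch on made-up rows (not real GBIF data); `toy`, `rank_counts` and `toy_sp_only` are names invented for the example.

```python
import pandas as pd

# made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "acceptedScientificName": [
        "Primates",
        "Acer",
        "Acer saccharum Marshall",
        "Homo sapiens Linnaeus, 1758",
    ],
    "taxonRank": ["ORDER", "GENUS", "SPECIES", "SPECIES"],
})

# number of rows for each rank
rank_counts = toy["taxonRank"].value_counts()
print(rank_counts)

# same filter as above: keep only the species-level rows
toy_sp_only = toy[toy["taxonRank"] == "SPECIES"]
print(toy_sp_only.shape)
```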

In [17]:
sp_only.head()

(index),taxonKey,scientificName,acceptedTaxonKey,acceptedScientificName,numberOfOccurrences,taxonRank,taxonomicStatus,kingdom,kingdomKey,phylum,...,classKey,order,orderKey,family,familyKey,genus,genusKey,species,speciesKey,iucnRedListCategory
15,1002454,"Lecane ludwigii (Eckstein, 1883)",1002454.0,"Lecane ludwigii (Eckstein, 1883)",2,SPECIES,ACCEPTED,Animalia,1,Rotifera,...,307.0,Ploima,1235.0,Lecanidae,8064.0,Lecane,1002167.0,Lecane ludwigii,1002454.0,NE
16,1003652,"Cristatella mucedo Cuvier, 1798",1003652.0,"Cristatella mucedo Cuvier, 1798",26,SPECIES,ACCEPTED,Animalia,1,Bryozoa,...,139.0,Plumatellida,1070.0,Cristatellidae,6907.0,Cristatella,1003651.0,Cristatella mucedo,1003652.0,NE
18,1015506,"Psochodesmus crescentis Cook, 1896",1015506.0,"Psochodesmus crescentis Cook, 1896",20,SPECIES,ACCEPTED,Animalia,1,Arthropoda,...,361.0,Polydesmida,1247.0,Pyrgodesmidae,8241.0,Psochodesmus,1015505.0,Psochodesmus crescentis,1015506.0,NE
19,1019678,"Apheloria kleinpeteri Hoffman, 1949",1019677.0,"Rudiloria kleinpeteri (Hoffman, 1949)",1,SPECIES,SYNONYM,Animalia,1,Arthropoda,...,361.0,Polydesmida,1247.0,Xystodesmidae,4036.0,Rudiloria,1019670.0,Rudiloria kleinpeteri,1019677.0,NE
21,1033457,"Helichus striatus LeConte, 1852",1033457.0,"Helichus striatus LeConte, 1852",900,SPECIES,DOUBTFUL,Animalia,1,Arthropoda,...,216.0,Coleoptera,1470.0,Dryopidae,4732.0,Helichus,9813320.0,Helichus striatus,1033457.0,NE


# calculating the species counts within the dataset

### "how many times does species X occur in this region?"

keeping only the species-level rows cuts our dataset from 498,735 rows to 294,771, a reduction of about 40%. it should therefore reduce our memory problems by a similar amount (ish).
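if you want to check the memory savings rather than guess, pandas can report how many bytes a dataframe occupies. the sketch below uses a tiny made-up table, so the numbers it prints mean nothing for the real data; `toy`, `before` and `after` are names invented for the example.

```python
import pandas as pd

# made-up rows standing in for the real dataset
toy = pd.DataFrame({
    "species": ["Acer saccharum", None, "Homo sapiens", None],
    "taxonRank": ["SPECIES", "GENUS", "SPECIES", "ORDER"],
})
toy_sp_only = toy[toy["taxonRank"] == "SPECIES"]

# deep=True also counts the bytes held by the strings themselves
before = toy.memory_usage(deep=True).sum()
after = toy_sp_only.memory_usage(deep=True).sum()
print(before, after)
```

on the real data you would call `dat.memory_usage(deep=True).sum()` and `sp_only.memory_usage(deep=True).sum()` in the same way.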

but another thing worth noting: many of these rows have duplicate values for "acceptedScientificName". why is that? it is because each of these datasets contains a comprehensive record of all the species sightings reported for a particular geographic region. so, the north american dataset may contain many rows for _Acer saccharum_ (sugar maple), because the sugar maple is a very common species on this continent. understanding this facet of the data helps us decide what to do with it if we want to save memory while still answering our question. for example, some of the proposed questions may simply require a list of all the unique species reported within a region to perform comparisons. no problem, we can do that easily:

In [11]:
sp_list = sp_only['species'].unique()
len(sp_list)

225716
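one thing to keep in mind: `unique()` keeps a missing value as its own entry, so if the column has any blanks the length above is one more than the number of named species. a minimal sketch on made-up values (`toy_species` is invented for the example):

```python
import pandas as pd

# made-up species column with one missing value
toy_species = pd.Series(
    ["Acer saccharum", "Acer saccharum", "Homo sapiens", None],
    name="species",
)

with_missing = len(toy_species.unique())  # the missing value counts as an entry
without_missing = toy_species.nunique()   # missing values are ignored by default
print(with_missing, without_missing)
```

whether the real 'species' column has any blanks after filtering is not something this notebook checks, so treat this as a precaution.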

alternatively, we may actually _want_ those non-unique values because they allow us to construct a probability distribution for species counts ("which species occur more frequently than others?")

In [9]:
# count the rows for each species and store the result in a one-column dataframe
sp_counts = sp_only.groupby("species").size().to_frame("count")
sp_counts.head()

species,count
Aa achalensis,1
Aaptoryctes ivyi,1
Aaptos aaptos,1
Aaptos bergmanni,1
Aaptos simplex,1




this cell gives us a dataframe where the index is the species name and the "count" column is the number of times that species appears in the dataset for that region.
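note that the count above is a count of _rows_. the table also has a numberOfOccurrences column; if that column holds a per-row tally of sightings (an assumption about its meaning that you should check against the GBIF documentation), then summing it per species may be closer to "how many times does species X occur". a sketch on made-up numbers, with `toy_sp`, `row_counts` and `occ_totals` invented for the example:

```python
import pandas as pd

# made-up rows: two rows for the same species, each with its own tally
toy_sp = pd.DataFrame({
    "species": ["Acer saccharum", "Acer saccharum", "Homo sapiens"],
    "numberOfOccurrences": [900, 20, 1],
})

# rows per species (what the cell above computes)
row_counts = toy_sp.groupby("species").size().to_frame("count")

# summed tallies per species (assumes numberOfOccurrences is a per-row tally)
occ_totals = (
    toy_sp.groupby("species")["numberOfOccurrences"].sum().to_frame("occurrences")
)

print(row_counts.join(occ_totals))
```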



# write to file

so we have simplified our data, but the memory overhead is still great. how can we fix this? unfortunately, our workflow in jupyter is set up such that the easiest thing to do is probably just to write our modified data to a new file like so:


In [29]:
sp_counts.to_csv("GBIF_can_us_species_counts.csv")

we can then read this new file into a separate notebook (probably the one you are creating to construct your group project):


`new_sp_counts = pd.read_csv("GBIF_can_us_species_counts.csv")`

(you'd run this in your group project notebook, not here.)
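when the file is read back, the species names arrive as an ordinary column unless you ask pandas to use them as the index again. the sketch below shows both options with a made-up two-row table written to a temporary folder; `toy_counts`, `as_column` and `as_index` are names invented for the example.

```python
import os
import tempfile
import pandas as pd

# made-up counts with species names as the index, like sp_counts
toy_counts = pd.DataFrame(
    {"count": [2, 1]},
    index=pd.Index(["Acer saccharum", "Homo sapiens"], name="species"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_species_counts.csv")
    toy_counts.to_csv(path)

    # default: 'species' comes back as a regular column
    as_column = pd.read_csv(path)

    # index_col restores the species names as the index
    as_index = pd.read_csv(path, index_col="species")

print(list(as_column.columns))
print(as_index.loc["Acer saccharum", "count"])
```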

# subsetting only genera

alternatively, we may feel that it is appropriate to compare the distribution of _genera_ rather than species. (i.e., we would consider a row marked for humans, _Homo sapiens_, as _Homo_. this would allow us to group any human and neanderthal (_Homo neanderthalensis_) entries together if both occurred in the dataset.) this may be adequate to answer questions (such as question #2) about the distribution and diversity of organisms within each region while saving us even more memory. there is no super-compelling biological reason to do one or the other. the choice can be made by you, according to your inclinations and memory constraints.

IF YOU DO NOT WANT TO DECIDE YOURSELF: i would advise just doing the subset below. it will use much less memory and thus probably simplify your life.

In [6]:
gen_list = sp_only['genus'].unique()
len(gen_list)

51682

In [10]:
# count the rows for each genus and store the result in a one-column dataframe
gen_counts = sp_only.groupby("genus").size().to_frame("count")
gen_counts.head()

genus,count
Aa,1
Aaptoryctes,1
Aaptos,3
Aaroniella,2
Aartsenia,3


# write to file

we can then write the results to a file and move on with our lives with our more manageable dataset ready to be read into our group project notebook.

In [11]:
gen_counts.to_csv("GBIF_can_us_genus_counts.csv")
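the species and genus steps follow the same pattern, so if you expect to repeat them for other files or other columns, they could be wrapped in a small helper. `count_by` below is a hypothetical function written for this sketch (it is not part of pandas), and the table is made up.

```python
import pandas as pd


def count_by(df, column):
    """Hypothetical helper: number of rows for each value of `column`."""
    return df.groupby(column).size().to_frame("count")


# made-up species-level rows
toy_sp = pd.DataFrame({
    "genus": ["Acer", "Acer", "Homo"],
    "species": ["Acer saccharum", "Acer rubrum", "Homo sapiens"],
})

toy_gen_counts = count_by(toy_sp, "genus")
toy_sp_counts = count_by(toy_sp, "species")
print(toy_gen_counts)
```

each result can then be written out with `to_csv` exactly as in the cells above.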