This notebook reads three Catalogue of Life files into dataframes: vernacular, taxa, and locationdescription (cleaned in CatalogueLife_Distribution_cleaning and imported here as distribution). It then merges the dataframes into one on taxonID. The final dataframe is exported to poly_cataloguelife.csv, to be merged with GenBank records on scientific name where applicable.

In [1]:
import re
import pandas as pd

In [2]:
vernacular = "poly_vernacular.txt"
taxa = "poly_taxa.txt"
distribution = "poly_distribution.csv"
outfile = "poly_cataloguelife.csv"

In [3]:
common = pd.read_csv(vernacular, sep="\t")
# keep only taxonID and vernacularName
common_df = common.drop(['language', 'countryCode', 'locality', 'transliteration'], axis=1)

common_df.head()

Unnamed: 0,taxonID,vernacularName
0,45195094,Thinleaf creepingfern
1,45195512,Tongue fern
2,45196512,Ekaha
3,45197311,Iron fern
4,45197822,Trim Shield fern


In [4]:
taxa_expand = pd.read_csv(taxa, sep="\t", dtype="object")
scientific = taxa_expand[['taxonID', 'datasetName', 'scientificName', 'genericName', 'specificEpithet', 'infraspecificEpithet',
                        'scientificNameAuthorship']]
scientific.head()

Unnamed: 0,taxonID,datasetName,scientificName,genericName,specificEpithet,infraspecificEpithet,scientificNameAuthorship
0,45194626,World Ferns in Species 2000 & ITIS Catalogue o...,Bolbitis novoguineensis Hennipman,Bolbitis,novoguineensis,,Hennipman
1,45194627,World Ferns in Species 2000 & ITIS Catalogue o...,Bolbitis occidentalis R.C.Moran,Bolbitis,occidentalis,,R.C.Moran
2,45194629,World Ferns in Species 2000 & ITIS Catalogue o...,Acrostichum pandurifolium (Hook.) Hook.,Acrostichum,pandurifolium,,(Hook.) Hook.
3,45194630,World Ferns in Species 2000 & ITIS Catalogue o...,Gymnopteris pandurifolia Hook.,Gymnopteris,pandurifolia,,Hook.
4,45194631,World Ferns in Species 2000 & ITIS Catalogue o...,Leptochilus pandurifolius (Hook.) C. Chr.,Leptochilus,pandurifolius,,(Hook.) C. Chr.


In [5]:
# Build "Genus species" and append " var. <epithet>" only where an infraspecific epithet exists.
# Note: " var. " is used for every infraspecific epithet, whatever its actual rank.
base = scientific['genericName'].fillna('') + " " + scientific['specificEpithet'].fillna('')
extra = " var. " + scientific['infraspecificEpithet']  # NaN where there is no infraspecific epithet
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
scientific = scientific.assign(sci_name=base + extra.fillna(''))


In [6]:
# .copy() makes an independent frame, so the assignment below does not raise SettingWithCopyWarning
scientific_df = scientific[['taxonID', 'sci_name']].copy()
# taxa was read with dtype="object"; cast taxonID to int so it matches the other frames in the merges
scientific_df['taxonID'] = scientific_df['taxonID'].astype(int)
scientific_df.head()


Unnamed: 0,taxonID,sci_name
0,45194626,Bolbitis novoguineensis
1,45194627,Bolbitis occidentalis
2,45194629,Acrostichum pandurifolium
3,45194630,Gymnopteris pandurifolia
4,45194631,Leptochilus pandurifolius
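
The pattern used in cells 5 and 6 (select columns, then assign to the result) is what pandas versions with SettingWithCopyWarning flag as ambiguous. The following is a minimal sketch of one way to avoid it, run on a toy stand-in for taxa_expand. The second row and the shortened datasetName are made up for illustration, and the `where` guard is a variation on the notebook's `fillna` approach, not the author's code.

```python
import pandas as pd

# Toy stand-in for taxa_expand; the second row is made up for illustration.
taxa_expand = pd.DataFrame({
    "taxonID": ["45194626", "1"],
    "datasetName": ["World Ferns", "World Ferns"],
    "genericName": ["Bolbitis", "Examplea"],
    "specificEpithet": ["novoguineensis", "demoensis"],
    "infraspecificEpithet": [None, "minor"],
})

# .copy() makes `scientific` an independent frame rather than a possible view.
cols = ["taxonID", "genericName", "specificEpithet", "infraspecificEpithet"]
scientific = taxa_expand[cols].copy()

infra = scientific["infraspecificEpithet"]
base = scientific["genericName"].fillna("") + " " + scientific["specificEpithet"].fillna("")
# Append the " var. " suffix only where an infraspecific epithet exists.
extra = (" var. " + infra.fillna("")).where(infra.notna(), "")

# assign() returns a new frame instead of writing into an existing one.
scientific = scientific.assign(
    sci_name=base + extra,
    taxonID=scientific["taxonID"].astype(int),
)
print(scientific[["taxonID", "sci_name"]])
```

Either `.copy()` or `assign()` alone is enough to silence the warning; both appear here only to show the two options side by side.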


In [7]:
location_df = pd.read_csv(distribution)
location_df.drop(columns=['Unnamed: 0'], inplace=True)
location_df.head()

Unnamed: 0,taxonID,geopolitical_regions,location_distribution
0,45194626,"['Melanesia, Micronesia & Polynesia']",['New Guinea']
1,45194627,['South America'],['Ecuador']
2,45194628,['South America'],['Bolivia;Ecuador;Peru']
3,45194632,['Central America;South America;Undefined;Cari...,['Mexico;Belize;Guatemala;Honduras;El Salvador...
4,45194645,['Southern Asia'],['India']
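
The preview suggests that geopolitical_regions and location_distribution are stored as strings that look like Python list literals, with individual places separated by semicolons. Assuming that holds for the whole file, a sketch of turning such a cell back into a list of place names could look like this. The helper name `parse_places` and the new `places` column are hypothetical; the sample rows are copied from the preview above.

```python
import ast
import pandas as pd

# Two rows copied from the preview above.
location_df = pd.DataFrame({
    "taxonID": [45194626, 45194628],
    "geopolitical_regions": ["['Melanesia, Micronesia & Polynesia']", "['South America']"],
    "location_distribution": ["['New Guinea']", "['Bolivia;Ecuador;Peru']"],
})

def parse_places(cell):
    """Turn "['A;B']" into ['A', 'B']; missing values become an empty list."""
    if pd.isna(cell):
        return []
    items = ast.literal_eval(cell)  # safely evaluates the list literal
    return [part for item in items for part in item.split(";")]

location_df["places"] = location_df["location_distribution"].apply(parse_places)
print(location_df["places"].tolist())
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for values read from a file.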


Merge scientific_df and common_df on taxonID. A left join keeps every taxon, including those with no vernacular name.

In [8]:
merged1 = pd.merge(scientific_df, common_df, how = "left", on="taxonID")
# merged1[merged1['vernacularName'].notnull()]
merged1.head()

Unnamed: 0,taxonID,sci_name,vernacularName
0,45194626,Bolbitis novoguineensis,
1,45194627,Bolbitis occidentalis,
2,45194629,Acrostichum pandurifolium,
3,45194630,Gymnopteris pandurifolia,
4,45194631,Leptochilus pandurifolius,
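
One thing worth checking with a left merge like the one above: if a taxon has more than one vernacular name, its row is repeated once per name. The sketch below uses made-up IDs and names to show the effect, and one possible way to keep one row per taxon by joining the names first. Whether the real vernacular file contains such duplicates is an assumption to verify, not something the notebook states.

```python
import pandas as pd

# Toy frames (made-up IDs and names) mirroring scientific_df and common_df.
scientific_df = pd.DataFrame({
    "taxonID": [1, 2, 3],
    "sci_name": ["Genus alpha", "Genus beta", "Genus gamma"],
})
common_df = pd.DataFrame({
    "taxonID": [1, 1, 3],
    "vernacularName": ["Alpha fern", "Common alpha", "Gamma fern"],
})

# A plain left merge keeps every taxon but repeats taxon 1 once per name.
# indicator=True adds a _merge column showing which rows found a match.
merged = pd.merge(scientific_df, common_df, how="left", on="taxonID", indicator=True)
print(len(merged))  # 4 rows from 3 taxa
print(merged["_merge"].value_counts())

# Collapsing the names per taxon first keeps the merge one-to-one;
# validate raises MergeError if the keys are not unique on both sides.
names = common_df.groupby("taxonID")["vernacularName"].agg("; ".join).reset_index()
merged_unique = pd.merge(scientific_df, names, how="left", on="taxonID",
                         validate="one_to_one")
print(merged_unique)
```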


In [9]:
catlife = pd.merge(merged1, location_df,  how = "left", on="taxonID")
catlife.drop(columns=["taxonID"], inplace= True)
catlife.head()

Unnamed: 0,sci_name,vernacularName,geopolitical_regions,location_distribution
0,Bolbitis novoguineensis,,"['Melanesia, Micronesia & Polynesia']",['New Guinea']
1,Bolbitis occidentalis,,['South America'],['Ecuador']
2,Acrostichum pandurifolium,,,
3,Gymnopteris pandurifolia,,,
4,Leptochilus pandurifolius,,,


In [10]:
catlife.to_csv(outfile)
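
By default `to_csv` also writes the row index as an unnamed first column, which is why a file written this way comes back with an 'Unnamed: 0' column (cell 7 drops one from the distribution file). A small sketch of the round trip, using an in-memory buffer and a one-row toy frame instead of the real output file:

```python
import io
import pandas as pd

# One-row toy frame with two of the catlife columns.
catlife = pd.DataFrame({
    "sci_name": ["Bolbitis novoguineensis"],
    "vernacularName": [None],
})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
catlife.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)  # ['Unnamed: 0', 'sci_name', 'vernacularName']

# index=False leaves the index out of the file.
without_index = io.StringIO()
catlife.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)  # ['sci_name', 'vernacularName']
```

Passing `index=False` here would save the drop step in whichever notebook reads poly_cataloguelife.csv next, provided nothing downstream relies on the extra column.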