Data Wrangling Notebook for VertNet Deer Data
<br />
Neeka Sewnath
<br />
nsewnath@ufl.edu

In [3]:
import pandas as pd
import re
import uuid
import numpy as np

Silencing warnings that are unnecessary

In [4]:
# warnings is part of the standard library, so the import cannot fail
import warnings
warnings.filterwarnings('ignore')
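
If ignoring every warning is broader than intended, a narrower filter is possible. The sketch below is not part of the original workflow: it hides only `FutureWarning` inside a `with` block and records whatever still gets through. The helper `noisy` is hypothetical and exists only to trigger a warning.

```python
import warnings

def noisy():
    """Hypothetical helper that emits a FutureWarning and returns a value."""
    warnings.warn("this API will change", FutureWarning)
    return 42

# Record warnings so we can inspect what passes the filters
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")                            # start from "show everything"
    warnings.filterwarnings("ignore", category=FutureWarning)  # then hide one category
    result = noisy()
    warnings.warn("still visible", UserWarning)

print(result, [str(w.message) for w in caught])
```

The filters are restored when the `with` block ends, so the rest of the session is unaffected.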

Import VertNet Deer Data

In [5]:
deer = pd.read_csv("../Original_Data/ODOVIRGCLEAN.csv")

Add required GEOME columns

In [None]:
deer=deer.assign(basisOfRecord="FossilSpecimen")
deer=deer.assign(samplingProtocol="Unknown")
deer=deer.assign(measurementMethod="Unknown")
deer=deer.assign(country="Unknown")

Clean up lifestage column by separating values into ageUnit and ageValue

In [11]:
def find_unit(life_val):
    """Separates ageUnit from lifestage data"""
    parts = str(life_val).split()
    # Single-token entries ("Juvenile", missing values) carry no unit
    if len(parts) > 1:
        return "year"
    return None

def find_age(life_val):
    """Separates ageValue from lifestage data"""
    # Missing values have no age to extract
    if pd.isna(life_val):
        return None
    parts = str(life_val).split()
    if parts and parts[0] != "Juvenile":
        return parts[0]
    return None


deer["ageUnit"] = deer["lifestage"].apply(find_unit)
deer["ageValue"] = deer["lifestage"].apply(find_age)
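
The same split can be done without a row-wise `apply`. The sketch below uses `Series.str.extract` on made-up values. It assumes lifestage entries look like `"2 years"` or `"Juvenile"`, which is a guess about the data and not something the notebook confirms.

```python
import pandas as pd

# Made-up lifestage values for illustration only
toy = pd.DataFrame({"lifestage": ["2 years", "Juvenile", None, "10 years"]})

# Capture a leading number and the word after it; rows that do not match become NaN
parts = toy["lifestage"].str.extract(r"^(?P<ageValue>\d+)\s+(?P<unit>\S+)")
toy["ageValue"] = parts["ageValue"]

# The notebook records every unit as "year", so any captured unit is mapped to that
toy["ageUnit"] = parts["unit"].where(parts["unit"].isna(), "year")
print(toy)
```

Rows without a number and a unit end up with missing `ageValue` and `ageUnit`, so no index lookup can fail.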

Remove ageValue and ageUnit values from lifestage

In [None]:
juv_filter = deer["lifestage"] == "Juvenile"
# Use .loc so the assignment modifies deer itself, not a temporary copy
deer.loc[juv_filter, "lifestage"] = "juvenile"
deer.loc[~juv_filter, "lifestage"] = ""
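
Assigning through a boolean mask is safest with a single `.loc[rows, column]` call, because chained indexing such as `df[col][mask] = ...` may write to a temporary copy. A minimal sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"lifestage": ["Juvenile", "2 years", None]})

is_juvenile = toy["lifestage"] == "Juvenile"
# .loc[row_mask, column] writes to the original frame in one indexing step
toy.loc[is_juvenile, "lifestage"] = "juvenile"
toy.loc[~is_juvenile, "lifestage"] = ""
print(toy["lifestage"].tolist())
```

Missing values compare as `False` against `"Juvenile"`, so they fall into the second assignment and become empty strings.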

Parse the eventdate column, identify the year, and store it in a new yearCollected column

In [4]:
# Create yearCollected column
deer=deer.assign(yearCollected = '')

def find_year(date):
    """Finds year within eventdate cell"""
    date = str(date)

    # Assumes slash-delimited dates are month/day/year and dash-delimited dates are year-month-day
    if '/' in date:
        return date.split('/')[-1]
    elif '-' in date:
        return date.split('-')[0]
    # No separator: keep the value as is (a bare year, or "nan" for a missing date)
    return date

deer["yearCollected"] = deer["eventdate"].apply(find_year)
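
An alternative that does not depend on the separator is to search for a four-digit run. This is a sketch under the assumption that every year in the data has four digits. The name `extract_year` is hypothetical and the sample dates are invented.

```python
import re

YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")  # a run of exactly four digits

def extract_year(date):
    """Return the first four-digit run in `date` as a string, or "Unknown"."""
    match = YEAR.search(str(date))
    return match.group(1) if match else "Unknown"

samples = ["1975-06-03", "6/3/1975", "1975", float("nan"), "June 1975"]
print([extract_year(s) for s in samples])
```

Because missing dates map straight to `"Unknown"`, a separate clean-up pass for `"nan"` would not be needed with this approach.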



In [5]:
#Clean up "nan" in yearCollected
deer.loc[deer["yearCollected"] == "nan", "yearCollected"] = "Unknown"


In [6]:
#Rearrange columns so that template columns are first, followed by measurement values

#Specify desired columns, in output order
cols = ['catalognumber',
        'collectioncode',
        'country',
        'decimallatitude',
        'decimallongitude',
        'eventdate',
        'institutioncode',
        'lifestage',
        'ageValue',
        'ageUnit',
        'locality',
        'sex',
        'scientificname',
        'yearCollected',
        'basisOfRecord',
        'samplingProtocol',
        'measurementMethod',
        '1st_body_mass',
        '1st_hind_foot_length',
        '1st_tail_length',
        '1st_total_length'
        ]

#Subset dataframe
deer = deer[cols]
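
Subsetting with a list of names raises a `KeyError` if any name is absent. A small check beforehand makes the problem easier to diagnose. The frame and column list below are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"catalognumber": [1], "sex": ["F"], "extra": [0]})
wanted = ["catalognumber", "sex", "1st_body_mass"]

# Report requested columns that are absent instead of failing inside the subset
missing = [c for c in wanted if c not in toy.columns]
present = [c for c in wanted if c in toy.columns]
subset = toy[present]
print(missing, subset.columns.tolist())
```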

In [7]:
#Matching template and column terms

#Renaming columns 
deer = deer.rename(columns={'catalognumber':'catalogNumber',
                            'collectioncode':'collectionCode',
                            'decimallatitude':'decimalLatitude',
                            'decimallongitude':'decimalLongitude',
                            'eventdate':'verbatimEventDate',
                            'institutioncode':'institutionCode',
                            'lifestage':'verbatimAgeValue',
                            'locality':'verbatimLocality',
                            'scientificname':'scientificName'})


In [8]:
#Matching trait and ontology terms

#Renaming columns
deer = deer.rename(columns={'1st_body_mass':'body mass',
                            '1st_hind_foot_length':'pes length',
                            '1st_tail_length':'tail length',
                            '1st_total_length':'body length'
                           })
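
By default, `DataFrame.rename` skips mapping keys that match no column, so a typo in a column name goes unnoticed. The sketch below, on made-up data, shows the default behaviour next to `errors="raise"`.

```python
import pandas as pd

toy = pd.DataFrame({"catalognumber": [1], "sex": ["F"]})
mapping = {"catalognumber": "catalogNumber", "scientificname": "scientificName"}

# Default behaviour: keys that match no column are skipped without any message
renamed = toy.rename(columns=mapping)

# errors="raise" turns a key with no matching column into a KeyError
try:
    toy.rename(columns=mapping, errors="raise")
    raised = False
except KeyError:
    raised = True
print(renamed.columns.tolist(), raised)
```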


In [9]:
#Create new column individualID, a unique identifier built from collectionCode, institutionCode, and catalogNumber
deer=deer.assign(individualID = deer['collectionCode'] + deer['institutionCode'] + deer['catalogNumber'])
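
Concatenating with `+` only works when every part is a string. If catalog numbers were read as integers, casting first avoids a type error. The `_` separator below is an assumption added for readability and is not in the notebook. The values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "collectionCode": ["Mammals", "Mammals"],
    "institutionCode": ["UF", "UF"],
    "catalogNumber": [101, 102],   # numeric on purpose
})

# Cast each part to string, then join with "_" so the parts stay distinguishable
parts = toy[["collectionCode", "institutionCode", "catalogNumber"]].astype(str)
toy["individualID"] = (
    parts["collectionCode"] + "_" + parts["institutionCode"] + "_" + parts["catalogNumber"]
)
print(toy["individualID"].tolist())
```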

In [10]:
#Set basisOfRecord to "PreservedSpecimen" (replaces the value assigned when the GEOME columns were added)
#Create new column locality and set it to "Unknown"
deer=deer.assign(basisOfRecord = 'PreservedSpecimen')
deer=deer.assign(locality="Unknown")

In [11]:
#make a measurementUnit column
deer=deer.assign(measurementUnit = "")

In [12]:
#Create materialSampleID, a UUID for each record with hyphens replaced by underscores
#Create eventID and populate it with materialSampleID
deer['materialSampleID'] = [str(uuid.uuid4()).replace("-", "_") for _ in range(len(deer.index))]

deer=deer.assign(eventID = deer["materialSampleID"])
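
The identifier format can be checked in isolation. In this sketch, `make_sample_id` is a hypothetical helper name. It produces the same underscore-separated UUID strings as the cell above.

```python
import uuid

def make_sample_id():
    """Return a random UUID string with hyphens swapped for underscores."""
    return str(uuid.uuid4()).replace("-", "_")

ids = [make_sample_id() for _ in range(3)]
print(ids)
```

Each call draws a fresh random UUID, so the values differ between runs.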


In [13]:
#Create long version so that each trait has its own row

#First specify the identifier variables to keep, then name the variable and value columns
longVers=pd.melt(deer, 
                id_vars=['catalogNumber',
                         'individualID',
                         'collectionCode',
                         'country',
                         'decimalLatitude',
                         'decimalLongitude', 
                         'verbatimEventDate', 
                         'institutionCode',
                         'verbatimAgeValue',
                         'ageValue',
                         'ageUnit',
                         'verbatimLocality',
                         'locality',
                         'sex',
                         'scientificName',
                         'yearCollected',
                         'basisOfRecord',
                         'materialSampleID',
                         'eventID',
                         'measurementMethod',
                         'samplingProtocol',
                         'measurementUnit'], 
                          var_name = 'measurementType', 
                          value_name = 'measurementValue')
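
To see what the reshape does, here is `pd.melt` on a two-specimen frame with invented values. Every column not listed in `id_vars` becomes one row per specimen.

```python
import pandas as pd

wide = pd.DataFrame({
    "catalogNumber": ["A1", "A2"],
    "body mass": [50000.0, None],
    "tail length": [250.0, 240.0],
})

# One output row per (specimen, trait) pair
long = pd.melt(wide, id_vars=["catalogNumber"],
               var_name="measurementType", value_name="measurementValue")
print(long)
```

Two specimens with two trait columns give four rows, including one with a missing `measurementValue`.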


In [14]:
#Populate measurementUnit with the appropriate unit for each measurement type
longVers['measurementUnit'] = np.where(longVers['measurementType'] == "body mass", "g", "mm")
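
If more traits with other units were added later, a lookup table may scale better than a two-way condition. A sketch on made-up rows, keeping the notebook's rule that everything except body mass is in millimetres:

```python
import pandas as pd

long = pd.DataFrame({"measurementType": ["body mass", "tail length", "pes length"]})

units = {"body mass": "g"}   # every other trait defaults to "mm"
long["measurementUnit"] = long["measurementType"].map(units).fillna("mm")
print(long["measurementUnit"].tolist())
```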
  


In [15]:
#Create diagnosticID which is a unique number for each measurement
longVers['diagnosticID'] = np.arange(len(longVers))

In [16]:
#Delete every row whose measurementValue is missing (NaN)
longVers = longVers.dropna(subset=['measurementValue'])

#Drop the first remaining row: it holds no usable measurementValue but is not removed by dropna
longVers = longVers.drop(longVers.index[0])
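
It can help to count how many rows a filter removes. The sketch below uses invented values and resets the index afterwards, which the notebook itself does not do.

```python
import pandas as pd

long = pd.DataFrame({
    "catalogNumber": ["A1", "A2", "A3"],
    "measurementValue": [50000.0, None, 250.0],
})

before = len(long)
# Keep only rows with a measurement, then renumber the index from zero
long = long.dropna(subset=["measurementValue"]).reset_index(drop=True)
print(before - len(long), long["catalogNumber"].tolist())
```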

In [17]:
#Writing long data csv file
longVers.to_csv('../Mapped Data/FuTRES_Deer_VertNet_Global_Modern.csv', index = False)
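
`to_csv` does not create missing folders, so the write fails if the output directory does not exist yet. A sketch that creates the folder first, with a temporary directory standing in for the project folder and a placeholder file name:

```python
import tempfile
from pathlib import Path

import pandas as pd

long = pd.DataFrame({"measurementType": ["body mass"], "measurementValue": [50000.0]})

# A temporary directory stands in for the project folder
out_path = Path(tempfile.mkdtemp()) / "Mapped Data" / "example_long.csv"
out_path.parent.mkdir(parents=True, exist_ok=True)   # create the folder if needed
long.to_csv(out_path, index=False)

# Read the file back to confirm it was written
round_trip = pd.read_csv(out_path)
print(round_trip.shape)
```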