Data Wrangling Notebook for Hopkins 2008 Appendix 1 Data
<br />
Neeka Sewnath
<br />
nsewnath@ufl.edu

In [1]:
import pandas as pd
import numpy as np
import uuid

Silence unnecessary warnings

In [2]:
# warnings is part of the standard library, so the import needs no try/except guard
import warnings
warnings.filterwarnings('ignore')

Import Hopkins 2008 Appendix 1 Data

In [3]:
data = pd.read_csv("../Original_Data/Hopkins2008Appendix1.csv")

Copy the Species column into scientificName

In [4]:
data = data.assign(scientificName = data["Species"])

Adding additional required GEOME columns

In [5]:
data = data.assign(country = "Unknown", yearCollected = "Unknown", locality = "Unknown")

Set samplingProtocol and measurementMethod to the source citation

In [6]:
citation = "Hopkins, S. S. (2008). Reassessing the Mass of Exceptionally Large Rodents Using Toothrow Length and Area as Proxies for Body Mass. Journal of Mammalogy, 89(1), 232–243. https://doi.org/10.1644/06-mamm-a-306.1"

data = data.assign(samplingProtocol = citation, measurementMethod = citation)

Copy MVZ specimen # into individualID

In [7]:
data = data.assign(individualID = data["MVZ specimen #"])

Add the GEOME-required basisOfRecord column

In [8]:
data = data.assign(basisOfRecord = "FossilSpecimen")

Rearrange columns so that template columns are first, followed by measurement values

In [9]:
# Specify desired columns, in template order
cols = ['scientificName',
        'samplingProtocol',
        'measurementMethod',
        'individualID',
        'country',
        'yearCollected',
        'basisOfRecord',
        'locality',
        'mass']

# Subset dataframe
data = data[cols]
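
Subsetting with a list of names raises a `KeyError` if any column is absent, so it can help to check first. The following is a minimal sketch on a made-up toy frame; `missing_columns` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

def missing_columns(frame, required):
    """Return the required column names that are absent from the frame."""
    return [name for name in required if name not in frame.columns]

# Toy frame standing in for the real data (values are invented for illustration)
toy = pd.DataFrame({"scientificName": ["Genus species"], "mass": [100.0]})

# "locality" is deliberately left out of the toy frame
absent = missing_columns(toy, ["scientificName", "mass", "locality"])
print(absent)
```

An empty list means the subset is safe to take.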

Rename columns to match template terms

In [10]:
data = data.rename(columns = {"mass": "body mass"})
# Commented-out mappings, kept for reference:
#   "LTRL (mm)": "lower tooth row length",
#   "M1 width (mm)": "Upper secondary molar width 1"

Create a long version of the data frame

In [11]:
# Create the long version: first specify the columns to keep, then name the variable and value columns
longVers = pd.melt(data,
                   id_vars = ['scientificName',
                              'samplingProtocol',
                              'measurementMethod',
                              'individualID',
                              'country',
                              'basisOfRecord',
                              'locality',
                              'yearCollected'],
                   var_name = 'measurementType',
                   value_name = 'measurementValue')
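
To illustrate what `pd.melt` does here, the sketch below reshapes a small made-up frame. The second measurement column, `tooth row length`, and all values are assumptions chosen for illustration; the notebook itself melts only `body mass`.

```python
import pandas as pd

# Toy wide-format frame: one row per specimen, one column per measurement
toy = pd.DataFrame({
    "individualID": ["A1", "A2"],
    "body mass": [10.0, 20.0],
    "tooth row length": [1.5, 2.5],
})

# Each non-id column becomes a (measurementType, measurementValue) pair per row
toy_long = pd.melt(toy,
                   id_vars=["individualID"],
                   var_name="measurementType",
                   value_name="measurementValue")
print(toy_long)
```

Two specimens with two measurements each yield four rows in the long frame.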

Create the required materialSampleID column and populate it with UUIDs (the hex form has no dashes). Reuse the same value as eventID.

In [12]:
# One random UUID per row, in dash-free hex form
longVers['materialSampleID'] = [uuid.uuid4().hex for _ in range(len(longVers))]

longVers = longVers.assign(eventID = longVers['materialSampleID'])
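
As a quick sanity check on the identifier format, this small sketch generates a handful of IDs the same way and inspects them. The sample size of five is arbitrary.

```python
import uuid

# Generate a few identifiers with the same call used for materialSampleID
ids = [uuid.uuid4().hex for _ in range(5)]

# .hex yields 32 lowercase hexadecimal characters with no dashes
lengths = {len(value) for value in ids}
has_dash = any("-" in value for value in ids)
print(lengths, has_dash)
```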

Create diagnosticID

In [13]:
# Sequential integer ID, one per row
longVers['diagnosticID'] = np.arange(len(longVers))

Add measurementUnit column

In [14]:
longVers = longVers.assign(measurementUnit = 'g')

Drop rows whose measurement value is missing (NaN)

In [15]:
longVers = longVers.dropna(subset = ['measurementValue'])

Drop non-float measurement values

In [16]:
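# Flag values containing the mis-encoded character sequence, then drop those rows.
# astype(str) keeps .str.contains from failing if the column was read as numeric.
non_float = longVers['measurementValue'].astype(str).str.contains("ÔøΩ")
longVers = longVers[~non_float]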
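
A more general way to keep only numeric values, offered here as an alternative sketch rather than the notebook's method, is to coerce the column with `pd.to_numeric` and drop whatever fails to parse. The toy values are made up.

```python
import pandas as pd

# Toy column mixing parseable numbers with one unparseable entry
toy = pd.DataFrame({"measurementValue": ["12.5", "bad?", "7"]})

# errors="coerce" turns anything that is not a number into NaN
toy["measurementValue"] = pd.to_numeric(toy["measurementValue"], errors="coerce")

# Remove the rows that could not be parsed
toy = toy.dropna(subset=["measurementValue"])
print(toy)
```

This approach does not depend on knowing which garbled characters appear in the data, at the cost of also discarding any other non-numeric entries.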

Write the long data to a CSV file

In [17]:
longVers.to_csv('../Mapped_Data/FuTRES_Hopkins_2008_Appendix_1.csv', index = False)