Data Wrangling Notebook for Hopkins 2008 Appendix 1 Data
<br />
Neeka Sewnath
<br />
nsewnath@ufl.edu

In [None]:
import pandas as pd
import numpy as np
import uuid

Silence unnecessary warnings

In [None]:
# warnings is part of the standard library, so the import cannot fail
import warnings
warnings.filterwarnings('ignore')

Import Hopkins 2008 Appendix 1 Data

In [None]:
data = pd.read_csv("../Original_Data/Hopkins2008Appendix1.csv")
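
Later cells read several columns by name, so a quick check right after loading can catch a renamed or missing column early. The sketch below uses a small in-memory stand-in for the CSV; the row values are invented placeholders, and the list of expected columns is an assumption based on the names used elsewhere in this notebook.

```python
import io
import pandas as pd

# Hypothetical stand-in for the CSV; the real file may contain more columns.
sample_csv = io.StringIO(
    "Species,MVZ specimen #,mass (g)\n"
    "Genus species,0000,1.0\n"
)
sample = pd.read_csv(sample_csv)

# Columns that later cells in this notebook access by name (assumed list).
expected_cols = ["Species", "MVZ specimen #", "mass (g)"]

# Collect any expected column that is absent from the loaded frame.
missing_cols = [c for c in expected_cols if c not in sample.columns]
if missing_cols:
    raise KeyError(f"Missing expected columns: {missing_cols}")
```

Running the same check against `data` would only require swapping `sample` for `data`.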

Copy the Species column into scientificName

In [None]:
data = data.assign(scientificName = data["Species"])

Adding additional required GEOME columns
<br />
TODO: Check publication for these 

In [None]:
data = data.assign(country = "Unknown", yearCollected = "Unknown")

Set samplingProtocol and measurementMethod 
<br />
TODO: Check publication associated with this data 

In [None]:
citation = "Famoso, N. A., & Davis, E. B. (2014). Occlusal enamel complexity in Middle Miocene to Holocene equids (Equidae: Perissodactyla) of North America. PLoS ONE, 9(2). doi:10.1371/journal.pone.0090184"

data = data.assign(samplingProtocol = citation, measurementMethod = citation)

Change MVZ specimen # to individualID
<br />
TODO: Check if this is the correct assumption

In [None]:
data = data.assign(individualID = data["MVZ specimen #"])

Rearrange columns so that template columns are first, followed by measurement values

In [None]:
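# Template columns, in the order they should appear
template_cols = ['scientificName',
                 'samplingProtocol',
                 'measurementMethod',
                 'individualID',
                 'country',
                 'yearCollected']

# Source columns already copied into template columns above
copied_cols = ['Species', 'MVZ specimen #']

# Remaining (measurement) columns, in their original order
measurement_cols = [c for c in data.columns.tolist()
                    if c not in template_cols + copied_cols]

# Reorder dataframe: template columns first, then measurement values
data = data[template_cols + measurement_cols]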

Matching template and column terms

In [None]:
data = data.rename(columns = {"mass (g)": "body mass",
                              #"LTRL (mm)": "lower tooth row length",
                              #"M1 width (mm)": "Upper secondary molar width 1"
                              })

Create necessary materialSampleID column and populate with UUID (use hex to remove dashes). 

In [None]:
data = data.assign(materialSampleID = '')
data['materialSampleID'] = [uuid.uuid4().hex for _ in range(len(data.index))]
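
As a quick illustration of why `.hex` is used here: the string form of a UUID contains dashes, while `.hex` returns the same 128-bit value as 32 hexadecimal characters. The variable names below are for illustration only.

```python
import uuid

sample_id = uuid.uuid4()

dashed = str(sample_id)   # 36 characters, including four dashes
compact = sample_id.hex   # 32 hexadecimal characters, no dashes

# The compact form still identifies the same UUID.
round_trip = uuid.UUID(compact)
```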

Create a long version of the data frame

In [None]:
# Create long version: first specify the columns to keep as identifiers,
# then name the variable and value columns
longVers = pd.melt(data,
                   id_vars = ['scientificName',
                              'samplingProtocol',
                              'measurementMethod',
                              'individualID',
                              'country',
                              'yearCollected',
                              'materialSampleID'],
                   var_name = 'measurementType',
                   value_name = 'measurementValue')
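
To make the wide-to-long reshaping concrete, here is a minimal sketch on a toy frame. The values are invented, and the second measurement column name is borrowed from the commented-out rename above rather than from the real data.

```python
import pandas as pd

# Toy wide frame: one row per specimen, one column per measurement.
wide = pd.DataFrame({
    "individualID": ["A", "B"],
    "body mass": [10.0, 20.0],
    "lower tooth row length": [1.5, None],
})

# Every column not listed in id_vars becomes a (type, value) pair,
# so each specimen contributes one row per measurement column.
long = pd.melt(wide,
               id_vars=["individualID"],
               var_name="measurementType",
               value_name="measurementValue")
```

Any column left out of `id_vars` is treated as a measurement, which is why identifier columns need to be listed explicitly.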

Add measurementUnit column
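
One possible way to fill this column is to map each measurementType to its unit. The sketch below is an assumption, not the notebook's established method: the mapping contains only the unit implied by the source column name "mass (g)", and the toy values are invented.

```python
import pandas as pd

# Toy long-format frame; values are invented for illustration.
toy_long = pd.DataFrame({
    "measurementType": ["body mass", "body mass"],
    "measurementValue": [10.0, 20.0],
})

# Assumed mapping: "body mass" was renamed from the source column "mass (g)".
unit_by_type = {"body mass": "g"}

# Types missing from the mapping would receive NaN rather than a unit.
toy_long = toy_long.assign(
    measurementUnit=toy_long["measurementType"].map(unit_by_type)
)
```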

Drop every row whose measurementValue is missing (NaN)

In [None]:
longVers = longVers.dropna(subset = ['measurementValue'])
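
Note that `dropna` only removes values pandas already treats as missing. If the source file encodes missing measurements as a literal string that was not parsed as NaN on load (an assumption about the data, not something confirmed here), one option is to coerce the column to numeric first. A minimal sketch on invented values:

```python
import pandas as pd

# Toy frame mixing a numeric string, a literal "N/a", and a true missing value.
toy = pd.DataFrame({
    "measurementType": ["body mass", "body mass", "body mass"],
    "measurementValue": ["12.5", "N/a", None],
})

# Anything that cannot be parsed as a number becomes NaN.
toy["measurementValue"] = pd.to_numeric(toy["measurementValue"],
                                        errors="coerce")

# Now dropna removes both the unparseable string and the missing value.
toy = toy.dropna(subset=["measurementValue"])
```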

Write the long data to a CSV file

In [None]:
longVers.to_csv('../Mapped_Data/Hopkins_2008_Appendix_1.csv', index = False)