Data Wrangling Notebook for Sykes Bone Data
<br />
Neeka Sewnath
<br />
nsewnath@ufl.edu

QUESTIONS: 
- Where should bone ID be preserved?

- period_name -> cultural occupation period 
- bone_date -> capture it somewhere 

- MCI-II (c. 2,000-1,700 BC, MCI)
- record 2000 BC as the earliest date
- record 1700 BC as the latest date
- Pleistocene = earliestEpochOrLowestSeries and latestEpochOrHighestSeries
- reference system : BC

MEETING NOTES:
- not using bone ID

https://docs.google.com/spreadsheets/d/1AfmnouAnLVw0eWX3dpzm2zFw27nIfXCGUJfwf3aZFfc/edit#gid=787101794

- site_code, reference_id, period_id, bone_id (test for uniqueness) 


In [4]:
import pandas as pd
import numpy as np
import re
import uuid 

Silencing warnings that are unnecessary

In [5]:
# warnings is part of the standard library, so no try/except guard is needed
import warnings
warnings.filterwarnings('ignore')

Import Sykes Bone Data

In [6]:
data = pd.read_csv("./../Original_Data/dama_bone_measurements_full.csv")

TESTING: Checking individualID uniqueness

In [29]:
data["individID_test"] = data["site_name"].astype(str) + "," + data["period_name"].astype(str) \
                         + "," + data["bone_id"].astype(str)

# print(data["individID_test"])

# len(data)
# len(data["individID_test"].unique())
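
A minimal sketch of how the uniqueness test could be run without building a concatenated string, using `DataFrame.duplicated` on the key columns directly. The toy frame and its values are invented for illustration; only the column names come from the data above.

```python
import pandas as pd

# Toy stand-in for the real data; the values are invented for illustration
toy = pd.DataFrame({
    "site_name": ["A", "A", "B", "B"],
    "period_name": ["P1", "P1", "P1", "P2"],
    "bone_id": [1, 1, 1, 2],
})

key_cols = ["site_name", "period_name", "bone_id"]

# True only if no two rows share the same combination of key columns
is_unique = not toy.duplicated(subset=key_cols).any()

# Every row that takes part in a collision, kept for inspection
collisions = toy[toy.duplicated(subset=key_cols, keep=False)]
```

If `is_unique` is false, `collisions` lists the rows that would share an individualID under this key.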

Add individualID and populate with UUID

In [None]:
# data = data.assign(individualID = '')
# data['individualID'] = [uuid.uuid4().hex for _ in range(len(data.index))]

Set samplingProtocol and measurementMethod

In [None]:
data = data.assign(samplingProtocol = data["reference_details"], measurementMethod = data["reference_details"])

Adding additional required GEOME columns

In [None]:
data = data.assign(country = "unknown", 
                   yearCollected = "unknown", 
                   locality = data["site_name"], 
                   basisOfRecord = "PreservedSpecimen")

Updating scientificName by removing parentheses from the Latin name

In [33]:
# Adding latin name to dynamicProperties because there's no verbatimScientificName
data["dynamicProperties"] = data["latin_name"]

# Removing parenthesis elements from scientificName 
data["scientificName"] = [i.rsplit(' ', 1)[0] if ')' in i else i for i in data["latin_name"]]
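
The `rsplit` approach above drops only the last space-separated token, so a parenthetical that contains a space or sits mid-name would be only partly removed (for example, `"Dama dama (fallow deer)"` would become `"Dama dama (fallow"`). A possible alternative, sketched here with the already-imported `re` module, removes any parenthesised group wherever it appears. The helper name `strip_parenthetical` and the sample names are hypothetical, not taken from the dataset.

```python
import re

def strip_parenthetical(name):
    """Remove every '(...)' group and collapse the leftover whitespace."""
    without_parens = re.sub(r"\s*\([^)]*\)", "", name)
    return " ".join(without_parens.split())

# Invented sample names, not values from the Sykes data
names = ["Dama dama (fallow deer)", "Dama dama", "Dama (Dama) dama"]
cleaned = [strip_parenthetical(n) for n in names]
```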

Adding chronometric columns
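
No code exists for this step yet. Based on the notes at the top (period names such as `MCI-II (c. 2,000-1,700 BC, MCI)`, with 2000 BC as the earliest date and 1700 BC as the latest), one possible starting point is to pull the BC range out of `period_name` with a regular expression. This is a sketch under assumptions: the helper name `parse_bc_range` is hypothetical, and the pattern only covers the single format shown in the notes.

```python
import re

# Matches ranges such as "2,000-1,700 BC"; other formats are not handled
_BC_RANGE = re.compile(r"(\d[\d,]*)\s*-\s*(\d[\d,]*)\s*BC")

def parse_bc_range(period_name):
    """Return (earliest_bc, latest_bc) as ints, or None if no range is found."""
    match = _BC_RANGE.search(str(period_name))
    if match is None:
        return None
    first, second = (int(g.replace(",", "")) for g in match.groups())
    # In BC years the larger number is the earlier date
    return max(first, second), min(first, second)

earliest, latest = parse_bc_range("MCI-II (c. 2,000-1,700 BC, MCI)")
```

Period names without a BC range, such as a bare epoch name, return `None` and would need separate handling.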

Renaming columns

In [None]:
data = data.rename(columns = {'':''})

Renaming measurementType values

In [None]:
# Replace names of terms available in GEOME
# NOTE: Make sure mapping file is up to date before reprocessing (git pull from FuTRES Repo)

# Read mapping file 
mapping_file = pd.read_csv("./../Mapping Files/ontology_codeBook.csv")

# Create subset of those within FOVT or OBA
map_subset = mapping_file[(mapping_file["Status"] == "in FOVT") | (mapping_file["Status"] == "in OBA") ]

# Create a subset of Sykes data
sykes_subset = map_subset[map_subset["name"] == "Sykes"]

# Isolating necessary columns
sykes_subset = sykes_subset[["label", "term"]]

# Create dictionary of terms
map_dict = sykes_subset.set_index('label').to_dict()['term']

# Map the new terms onto the old terms in the dataframe 
data["measurementType"] = data["measurementType"].map(map_dict)
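
One thing to keep in mind with `Series.map` and a dictionary is that any label missing from the dictionary becomes `NaN`. A minimal sketch of how unmapped labels could be listed before writing the file follows; the labels and terms are invented placeholders, not entries from the codebook.

```python
import pandas as pd

# Invented labels and terms, standing in for the codebook mapping
map_dict = {"GL": "term_a", "Bd": "term_b"}
measurement_type = pd.Series(["GL", "Bd", "XX"])

# Labels absent from the dictionary come back as NaN
mapped = measurement_type.map(map_dict)

# Original labels that have no term in the mapping
unmapped_labels = sorted(measurement_type[mapped.isna()].unique())
```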

Create materialSampleID, a UUID for each measurement, and populate eventID with the same value

In [None]:
data['materialSampleID'] = [uuid.uuid4().hex for _ in range(len(data.index))]
data = data.assign(eventID = data["materialSampleID"])

Create diagnosticID

In [None]:
data['diagnosticID'] = np.arange(len(data))

Adding measurementUnit column

In [None]:
data['measurementUnit'] = "mm"

Write file to csv

In [None]:
data.to_csv('../Mapped_Data/FuTRES_Cervid_Sykes_Eurasia_Zooarch.csv')