# To Code:

## template mapping files are in the git repository

## original data in _CyVerse Discovery Environment_ 
### data file is: "EAP Florida Modern Deer Measurements_VforFuTRES_Dec2019.xlsx"

### _ageUnit_
- create column for _ageUnit_
- all in "year" (spelled out and singular)

### _measurementUnit_
- create column _measurementUnit_
- measurements either in "g" or "mm" (abbreviated)

### _yearCollected_
- in _Standardized Collection Date_
- create new column yearCollected
- separate out year
- include century as well (e.g., 1999)
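
The year extraction described above can be sketched on a few made-up date strings. This is only an illustration: the sample values and the month/day/year layout are assumptions, and the pattern takes a four-digit year after the last slash.

```python
import pandas as pd

# Hypothetical sample values; the real "Standardized Collection Date"
# column may contain other layouts.
dates = pd.Series(["10/15/1999", "3/2/2004", "--", None])

# Capture a four-digit year that follows the final slash.
# Values without that pattern (placeholders, missing) become NaN.
year = dates.str.extract(r"/(\d{4})$", expand=False)

print(year)
```

Rows that do not match are left missing rather than raising an error, which keeps placeholders such as "--" out of yearCollected.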

### _otherCatalogNumbers_
- concatenated list of:
    - [UF#]

### _verbatimLatitude_
- in _Verbatim Coordinates [Latitude/Longitude]_
- create new column verbatimLatitude
- separate out latitude coordinates

### _verbatimLongitude_
- in _Verbatim Coordinates [Latitude/Longitude]_
- create new column verbatimLongitude
- separate out longitude coordinates
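
As a rough sketch of the latitude/longitude split described in the two sections above, the combined column can be split on whitespace in one step. The sample coordinates are invented, and a single space between latitude and longitude is an assumption.

```python
import pandas as pd

# Hypothetical sample values; the real column may use a different layout.
coords = pd.Series(
    ["29.65 -82.32", "30.10 -81.75", None],
    name="Verbatim Coordinates [Latitude/Longitude]",
)

# Split on whitespace into at most two pieces; missing values stay missing.
parts = coords.str.split(n=1, expand=True)

verbatim = pd.DataFrame({
    "verbatimLatitude": parts[0],
    "verbatimLongitude": parts[1],
})

print(verbatim)
```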

### _unused columns_
- Common Name
- [Kingdom]
- [Phylum]
- [Class]
- [Order]
- [Family]
- [Genus]
- Specific Epithet [species]
- Infraspecific Epithet [subspecies]
- Taxon Rank
- [Collector Number]
- [# Specimens]
- Organism ID
- Identified By
- Year/Month/Day Identified
- Identification Remarks
- Country Code
- [County]
- Municipality [City]
- Land owner
- Geodetic Datum
- Coordinate Uncertainty in Meters
- Coordinate Precision
- Elevation Accuracy
- Depth of Bottom in Meters
- Minimum Distance Above Surface in Meters
- Maximum Distance Above Surface in Meters
- Georeference Protocol
- Georeference Verified?
- Location Remarks
- Accession Date
- [Collector]
- Event Entered By
- Habitat/Subhabitat/Substratum
- Establishment Means
- Permit Information
- [Specimen Location]
- [Data location]
- [on_loan]
- [time_stamp]
- Associated Taxa
- Images
- Occurrence Remarks
- Additional Notes Available?
- Skeletal Element Name


In [1]:
import pandas as pd
import numpy as np 
import uuid
import re

In [2]:
#Import EAP Florida Modern Deer Data Locally
deerData = pd.read_csv("../Original Data/EAP Florida Modern Deer Measurements_FORFUTRES_1_23_2020.csv")

#Drop unnecessary rows 
deerData = deerData.iloc[33:]

#Create new header
new_header = deerData.iloc[0] 
deerData = deerData[1:] 
deerData.columns = new_header

deerData

33,Compiled Terms [EAP database terms are in square brackets],Occurrence ID,Other References to this Occurrence,Taxon [in EA database this is taxon + Species],Common Name,[Kingdom],[Phylum],[Class],[Order],[Family],...,"Tibia Ll, (length of the lateral side, von den Driesch 1976 *note she says only in horses), mm","Tibia SD, (smallest breadth of the diaphysis, von den Driesch 1976), mm","Tibia Bp, (greatest breadth of the proximal end, von den Driesch 1976), mm","Tibia Bd, (greatest breadth of the distal end, von den Driesch 1976), mm","Tibia Dd, (greatest depth of the distal end, von den Driesch 1976), mm",Measurement Remarks,Measurements by,Measurement Date,Measurement Method,Measurement Accuracy
34,,--,--,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,--,--,--,--,skin weight in tanned,collector/EAP staff,Carcass measurements presumably done on collec...,Live and carcass weights and measures methods ...,Unknown for fresh carcass field weights and me...
35,,--,--,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,--,--,--,--,--,collector/EAP staff,Carcass measurements presumably done on collec...,Live and carcass weights and measures methods ...,Unknown for fresh carcass field weights and me...
36,,--,--,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,--,--,--,--,--,collector/EAP staff,Carcass measurements presumably done on collec...,Live and carcass weights and measures methods ...,Unknown for fresh carcass field weights and me...
37,,--,--,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,--,--,--,--,--,collector/EAP staff,Carcass measurements presumably done on collec...,Live and carcass weights and measures methods ...,Unknown for fresh carcass field weights and me...
38,,--,--,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,--,--,--,--,--,Samantha McCrane,19-Oct,Tibia GL and LI taken with generic (Marathon-s...,"For skeletal metrics, Mitutoyo 8in and 6in com..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
243,,--,--,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,19.29,--,28.37,22.77,proximal tibia unfused,Jessica King,19-Oct,Tibia GL and LI taken with generic (Marathon-s...,"For skeletal metrics, Mitutoyo 8in and 6in com..."
244,,42C49910-59B3-4A91-8AC8-87FEF4B72006,UF 13489,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,--,--,--,--,--,collector/EAP staff,Carcass measurements presumably done on collec...,Live and carcass weights and measures methods ...,Unknown for fresh carcass field weights and me...
245,,42C49910-59B3-4A91-8AC8-87FEF4B72006,UF 13489,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,--,--,--,--,--,collector/EAP staff,Carcass measurements presumably done on collec...,Live and carcass weights and measures methods ...,Unknown for fresh carcass field weights and me...
246,,42C49910-59B3-4A91-8AC8-87FEF4B72006,UF 13489,Odocoileus virginianus,white tailed deer,Animalia,Chordata,Mammalia,Artiodactyla,Cervidae,...,--,--,--,--,--,--,collector/EAP staff,Carcass measurements presumably done on collec...,Live and carcass weights and measures methods ...,Unknown for fresh carcass field weights and me...
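
The header promotion in the cell above can be tried on a tiny made-up table. The sketch below assumes the real header sits at a known row position (`header_row` is a hypothetical name) and resets the index afterwards, which the original cell does not do.

```python
import pandas as pd

# Hypothetical raw table: two rows of notes, then the real header, then data.
raw = pd.DataFrame([
    ["notes", None],
    ["more notes", None],
    ["Occurrence ID", "[Sex]"],
    ["abc", "F"],
    ["def", "M"],
])

header_row = 2  # position of the row holding the real column names (assumed)

# Keep only the rows below the header and use the header row as column names.
data = raw.iloc[header_row + 1:].copy()
data.columns = raw.iloc[header_row].tolist()
data = data.reset_index(drop=True)

print(data)
```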


In [3]:
#Preliminary data cleaning

#Add ageUnit column to deer data
deerData=deerData.assign(ageUnit = "year")

#Create yearCollected column to deer data
deerData=deerData.assign(yearCollected = "")

#Create otherCatalogNumbers column to deer data
deerData=deerData.assign(otherCatalogNumbers = "") 

#Create verbatimLatitude column to deer data
deerData=deerData.assign(verbatimLatitude = "")

#Create verbatimLongitude column to deer data
deerData=deerData.assign(verbatimLongitude = "")

for ind in deerData.index:
    #Split the verbatim coordinates on whitespace into latitude and longitude
    z=str(deerData.at[ind, 'Verbatim Coordinates [Latitude/Longitude]']).split()
    #Only fill the columns when both parts are present
    if len(z) >= 2:
        deerData.at[ind, 'verbatimLatitude']=z[0]
        deerData.at[ind, 'verbatimLongitude']=z[1]

    #Populate yearCollected with the third slash-separated part of the date
    c=str(deerData.at[ind, 'Standardized Collection Date']).split('/')
    if len(c) >= 3:
        deerData.at[ind, 'yearCollected']=c[2]
        

In [4]:
# Clean up sex column
female = deerData['[Sex]'] == "F"
male = deerData['[Sex]'] == "M"
# Assign through .loc so the change is written to deerData itself, not to a temporary copy
deerData.loc[~female & ~male, '[Sex]'] = "not collected"
deerData.loc[female, '[Sex]'] = "female"
deerData.loc[male, '[Sex]'] = "male"
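
The same recoding can also be written as a lookup table. This is a minimal sketch on invented values, assuming that anything other than "F" or "M" (including missing values) should become "not collected", as in the cell above.

```python
import pandas as pd

# Hypothetical sample of raw sex codes.
sex = pd.Series(["F", "M", "--", None, "U"])

# Codes found in the dictionary are translated; everything else becomes NaN
# and is then filled with the default label.
sex_clean = sex.map({"F": "female", "M": "male"}).fillna("not collected")

print(sex_clean.tolist())
```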

In [5]:
# Clean up Side column
right = deerData['Side'] == "R"
left = deerData['Side'] == "L"
# Anything other than R or L is blanked out
deerData.loc[~right & ~left, 'Side'] = ""
deerData.loc[right, 'Side'] = "right"
deerData.loc[left, 'Side'] = "left"

In [6]:
# Remove -- in Minimum Elevation in Meters and Maximum Elevation in Meters
dash_min = deerData['Minimum Elevation in Meters'] == "--"
dash_max = deerData['Maximum Elevation in Meters'] == "--"
deerData.loc[dash_min, 'Minimum Elevation in Meters'] = ""
deerData.loc[dash_max, 'Maximum Elevation in Meters'] = ""

In [7]:
# Remove -- in Minimum Depth in Meters and Maximum Depth in Meters
dash_min = deerData['Minimum Depth in Meters'] == "--"
dash_max = deerData['Maximum Depth in Meters'] == "--"
deerData.loc[dash_min, 'Minimum Depth in Meters'] = ""
deerData.loc[dash_max, 'Maximum Depth in Meters'] = ""
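
Since the elevation and depth cells repeat the same step, the "--" placeholder could be cleared from several columns at once. The sketch below uses an invented four-row-free table; it only replaces cells that are exactly "--".

```python
import pandas as pd

# Hypothetical sample with the four elevation/depth columns.
df = pd.DataFrame({
    "Minimum Elevation in Meters": ["--", "12"],
    "Maximum Elevation in Meters": ["15", "--"],
    "Minimum Depth in Meters": ["--", "--"],
    "Maximum Depth in Meters": ["--", "3"],
})

placeholder_cols = list(df.columns)

# Replace cells that are exactly "--" with an empty string in all listed columns.
df[placeholder_cols] = df[placeholder_cols].replace("--", "")

print(df)
```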

In [8]:
# Set up verbatimAge column
deerData["verbatimAgeValue"]=deerData['[Age] Value in Years [in EA database, all age info goes in one category = Age]']

In [9]:
# Clean up Age column: blank out any value containing a hyphen, such as "--"
age_col = '[Age] Value in Years [in EA database, all age info goes in one category = Age]'
# na=False treats missing values as non-matches so the filter stays boolean
age_filter = deerData[age_col].str.contains("-", na=False)
deerData.loc[age_filter, age_col] = ""

In [10]:
# Clean up Reproductive Condition column

reproduction_data=deerData["Reproductive Condition"]
# na=False treats missing values as non-matches so the filters stay boolean
not_pregnant_filter = reproduction_data.str.contains("not|non", na=False)
pregnant_filter = reproduction_data.str.contains("fetus|several", na=False)
space_filter = reproduction_data.str.contains("--", na=False)

# Reassigning data; for rows matching several filters, the later assignment wins
deerData.loc[not_pregnant_filter, "Reproductive Condition"]="non-reproductive"
deerData.loc[pregnant_filter, "Reproductive Condition"]="pregnant"
deerData.loc[space_filter, "Reproductive Condition"]=""

deerData["Reproductive Condition"]

34                     
35                     
36                     
37                     
38                     
             ...       
243                    
244    non-reproductive
245    non-reproductive
246    non-reproductive
247    non-reproductive
Name: Reproductive Condition, Length: 214, dtype: object
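
To check how the three filters interact, the recoding can be run on a few invented strings. This sketch chains `Series.mask` in the same order as the cell above, so a later rule overrides an earlier one; the sample values are assumptions, not taken from the data.

```python
import pandas as pd

# Hypothetical sample of free-text reproductive condition notes.
cond = pd.Series(["not pregnant", "2 fetuses", "--", "lactating", None])

is_non = cond.str.contains("not|non", na=False)
is_preg = cond.str.contains("fetus|several", na=False)
is_blank = cond.str.contains("--", na=False)

# mask() replaces values where the condition is True and leaves the rest alone.
cleaned = (
    cond.mask(is_non, "non-reproductive")
        .mask(is_preg, "pregnant")
        .mask(is_blank, "")
)

print(cleaned.tolist())
```

Values that match none of the patterns, such as "lactating" here, pass through unchanged, so they may still need manual review before upload.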

In [11]:
# Clean up life stage

# The source stores this value wrapped in literal double quotes
young_adult = deerData['[Age] Life Stage'] == "\"young adult\""
dash_age = deerData['[Age] Life Stage'] == "--"
deerData.loc[young_adult, '[Age] Life Stage'] = "adult"
deerData.loc[dash_age, '[Age] Life Stage'] = ""


In [12]:
# add otherCatalogNumbers column
deerData=deerData.assign(otherCatalogNumbers = deerData['[UF#]'].fillna(''))

In [13]:
#Rearrange columns so that template columns are first, followed by measurement values

#Specify desired columns, in output order
cols = ['Occurrence ID',
        'Other References to this Occurrence',
        'Taxon [in EA database this is taxon + Species]',
        'EA Cat number [in EA database, this is catalog+catalog extension]',
        '[Sex]',
        '[Age] Value in Years [in EA database, all age info goes in one category = Age]',
        '[Age] Life Stage',
        '[Age] Age Remarks',
        'ageUnit',
        'verbatimAgeValue',
        '[Country]',
        'State [also in EA database as Country/State]',
        'Locality [in EAP database as City/Modifier]',
        'Verbatim Locality',
        'verbatimLatitude',
        'verbatimLongitude',
        'Latitude',
        'Longitude',
        'Minimum Elevation in Meters',
        'Maximum Elevation in Meters',
        '[Elevation]',
        'Minimum Depth in Meters',
        'Maximum Depth in Meters',
        'Verbatim Depth',
        'Collection Date [EAP database has only one date category, randomly used for whatever]',
        'yearCollected',
        'Sampling Protocol',
        'Reproductive Condition',
        'Event Remarks',
        'Side',
        'Total Fresh Weight (g)',
        'Height (mm) [define?]',
        'TL (mm) [Total Length]',
        'HF (mm) [Hind Foot Length]',
        'Measurement Remarks',
        'Measurements by',
        'Measurement Date',
        'Measurement Method',
        'Measurement Accuracy',
        'otherCatalogNumbers']

#Subset dataframe
deerData = deerData[cols]

In [14]:
#Matching template and column terms

#Renaming columns 
deerData = deerData.rename(columns = {'Occurrence ID':'occurrenceID',
                                      'Other References to this Occurrence': 'references',
                                      'Taxon [in EA database this is taxon + Species]': 'scientificName',
                                      'EA Cat number [in EA database, this is catalog+catalog extension]':'catalogNumber',
                                      '[Sex]':'sex',
                                      '[Age] Value in Years [in EA database, all age info goes in one category = Age]':'ageValue',
                                      '[Age] Life Stage':'lifeStage',
                                      '[Age] Age Remarks':'ageEstimationMethod',
                                      '[Country]':'country',    
                                      'State [also in EA database as Country/State]':'stateProvince',
                                      'Locality [in EAP database as City/Modifier]':'locality',
                                      'Verbatim Locality':'verbatimLocality',
                                      'Latitude':'decimalLatitude',
                                      'Longitude':'decimalLongitude',
                                      'Minimum Elevation in Meters':'minimumElevationInMeters',
                                      'Maximum Elevation in Meters': 'maximumElevationInMeters',
                                      '[Elevation]': 'verbatimElevation',
                                      'Minimum Depth in Meters': 'minimumDepthInMeters',
                                      'Maximum Depth in Meters': 'maximumDepthInMeters',
                                      'Verbatim Depth':'verbatimDepth',
                                      'Collection Date [EAP database has only one date category, randomly used for whatever]':'verbatimEventDate',
                                      'Sampling Protocol':'samplingProtocol',
                                      'Reproductive Condition':'reproductiveCondition',
                                      'Event Remarks':'eventRemarks',
                                      'Side': 'measurementSide',
                                      'Measurement Remarks': 'measurementRemarks',
                                      'Measurements by': 'measurementDeterminedBy',
                                      'Measurement Date': 'measurementDeterminedDate',
                                      'Measurement Method': 'measurementMethod',
                                      'Measurement Accuracy': 'measurementAccuracy'})

In [15]:
#Matching trait and ontology terms

#Renaming columns
deerData = deerData.rename(columns={'Total Fresh Weight (g)': 'body mass',
                                    'Height (mm) [define?]': 'body height',
                                    'TL (mm) [Total Length]': 'body length',
                                    'HF (mm) [Hind Foot Length]': 'pes length'})


In [16]:
#Create materialSampleID which is a UUID for each row (assigned before reshaping to long format)
#Create eventID and populate it with materialSampleID

#Generate one UUID string per row, with hyphens replaced by underscores
deerData['materialSampleID'] = [str(uuid.uuid4()).replace("-", "_") for _ in range(len(deerData.index))]

deerData=deerData.assign(eventID = deerData["materialSampleID"])
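
A small self-contained check of the identifier step, using an invented three-row table. It assumes one identifier per row is wanted and that hyphens should be replaced by underscores, as in the cell above.

```python
import uuid

import pandas as pd

# Hypothetical three-row table standing in for the deer data.
df = pd.DataFrame({"catalogNumber": ["A1", "A2", "A3"]})

# One random UUID per row, stored as a string with underscores instead of hyphens.
df["materialSampleID"] = [
    str(uuid.uuid4()).replace("-", "_") for _ in range(len(df))
]

# eventID reuses the same identifier.
df["eventID"] = df["materialSampleID"]

print(df)
```

Because `uuid.uuid4()` is random, rerunning the notebook produces new identifiers each time.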

In [17]:
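#Add GEOME required columns
#Note: this sets measurementMethod to "Unknown" for every row, replacing the values renamed from 'Measurement Method'
deerData=deerData.assign(measurementMethod="Unknown")
deerData=deerData.assign(basisOfRecord="PreservedSpecimen")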

In [18]:
#create long version so that each trait has its own row

#creating long version, first specifiying keep variables, then naming variable and value
longVers=pd.melt(deerData, 
                id_vars=['occurrenceID',
                         'eventID',
                         'references',
                         'scientificName',
                         'catalogNumber',
                         'otherCatalogNumbers',
                         'sex',
                         'ageValue',
                         'ageUnit',
                         'verbatimAgeValue',
                         'lifeStage',
                         'country',
                         'ageEstimationMethod',
                         'stateProvince',
                         'locality',
                         'verbatimLocality',
                         'verbatimLatitude',
                         'verbatimLongitude',
                         'decimalLatitude',
                         'decimalLongitude',
                         'minimumElevationInMeters',
                         'maximumElevationInMeters',
                         'verbatimElevation',
                         'minimumDepthInMeters',
                         'maximumDepthInMeters',
                         'verbatimDepth',
                         'verbatimEventDate',
                         'yearCollected',
                         'samplingProtocol',
                         'eventRemarks',
                         'reproductiveCondition',
                         'measurementRemarks',
                         'measurementSide',
                         'measurementMethod',
                         'measurementAccuracy',
                         'measurementDeterminedDate',
                         'measurementDeterminedBy',
                         'basisOfRecord',
                         'materialSampleID'], 
                var_name = 'measurementType', 
                value_name = 'measurementValue')
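
What the reshape does is easier to see on a toy table. In this sketch the two trait columns and their values are invented, and only one identifier column is kept.

```python
import pandas as pd

# Hypothetical wide table: one row per specimen, one column per trait.
wide = pd.DataFrame({
    "materialSampleID": ["s1", "s2"],
    "body mass": ["45000", "--"],
    "body length": ["1500", "1620"],
})

# Every column not listed in id_vars becomes a (measurementType, measurementValue) pair.
long = pd.melt(
    wide,
    id_vars=["materialSampleID"],
    var_name="measurementType",
    value_name="measurementValue",
)

print(long)
```

Two specimens with two traits give four rows, and each row repeats the identifier of the specimen it came from.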

In [19]:
#Populating measurementUnit column with appropriate measurement units in long version
#body mass is in grams; every other trait is in millimeters
longVers['measurementUnit'] = np.where(longVers['measurementType'] == "body mass", "g", "mm")
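
If more units are ever needed, a lookup table may be easier to extend than a single condition. This is a sketch with the four trait names used in this notebook; `unit_by_trait` is a hypothetical name, and the default of "mm" is an assumption carried over from the cell above.

```python
import pandas as pd

# The four trait names used in this notebook.
measurement_type = pd.Series(["body mass", "body length", "pes length", "body height"])

# Traits with a unit other than the default are listed here.
unit_by_trait = {"body mass": "g"}

# Traits missing from the table fall back to millimeters.
measurement_unit = measurement_type.map(unit_by_trait).fillna("mm")

print(measurement_unit.tolist())
```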

In [20]:
#Create diagnosticID which is a unique number for each measurement
longVers=longVers.assign(diagnosticID = '')
longVers['diagnosticID'] = np.arange(len(longVers))

In [25]:
#If measurementValue is missing (NaN) or "--", delete the entire row
longVers = longVers.dropna(subset=['measurementValue'])
longVers = longVers[longVers.measurementValue != "--"]

#Drop first row of data, it contains no measurementValue but is still retained
longVers = longVers.drop(longVers.index[0])
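
One possible way to avoid dropping a leftover row by position is to filter on the value itself. The sketch below assumes that empty or whitespace-only strings should be removed along with "--" and missing values; whether that matches the row dropped above has not been checked against the real data.

```python
import pandas as pd

# Hypothetical long-format sample with different kinds of empty values.
long = pd.DataFrame({
    "measurementType": ["body mass"] * 5,
    "measurementValue": ["45000", "--", None, "  ", "1500"],
})

values = long["measurementValue"]

# Keep rows whose value is present and is neither blank nor the "--" placeholder.
keep = values.notna() & ~values.astype(str).str.strip().isin(["", "--"])
filtered = long[keep].reset_index(drop=True)

print(filtered)
```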

In [26]:
#Writing long data csv file
longVers.to_csv('../Mapped Data/FuTRES_EAP_Deer_Emery.csv')