# To Code

## template mapping files are in the git repository
## original data in _CyVerse Discovery Environment_ 
### data file is: "Extant Aepyceros database_updated 11_2016.csv"

### _verbatimAgeValue_
- assign 'Age (juv, prime adult, older adult, old)' to verbatimAgeValue

### _lifeStage_
- assign lifeStage component of 'Age (juv, prime adult, older adult, old)' to lifeStage 
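
The lifeStage step can be sketched on a toy column. The sample values below are hypothetical, not taken from the data file, and the keyword patterns are assumptions about how the verbatim ages are written.

```python
import pandas as pd

# Hypothetical verbatim age values; the real column is
# 'Age (juv, prime adult, older adult, old)'
ages = pd.Series(["Prime adult", "Juvenile", "Old", None, "juvenile"])

# Boolean masks; na=False treats missing values as non-matches
is_adult = ages.str.contains("Prime|Old", na=False)
is_juvenile = ages.str.contains("juvenile", case=False, na=False)

# Start from a placeholder, then overwrite the matched rows
life_stage = pd.Series("Not Collected", index=ages.index)
life_stage.loc[is_adult] = "adult"
life_stage.loc[is_juvenile] = "juvenile"

print(life_stage.tolist())
```

Assigning through `.loc` with a boolean mask writes to the Series itself rather than to a temporary copy.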

### _sex_
- make lowercase

### _basisOfRecord_
- required in GEOME
- create basisOfRecord column and assign "FossilSpecimen"

### _locality_
- required in GEOME
- create locality column and assign "Unknown"

### _samplingProtocol_
- required in GEOME
- create samplingProtocol column and assign "Unknown"

### _yearCollected_
- required in GEOME
- create yearCollected column and assign "Unknown"

### _materialSampleID_
- create materialSampleID and assign it a UUID value for each row (before long version)

### _diagnosticID_
- create diagnosticID and assign a unique number to each measurement (after long version)

### _eventID_
- create eventID and assign it the materialSampleID values
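
A minimal sketch of the materialSampleID and eventID steps, using a hypothetical two-row table in place of the specimen data. The `individualID` values are made up for illustration.

```python
import uuid
import pandas as pd

# Hypothetical two-row table standing in for the specimen data
df = pd.DataFrame({"individualID": ["A1", "A2"]})

# One random UUID (32 hex characters) per row
df["materialSampleID"] = [uuid.uuid4().hex for _ in range(len(df))]

# eventID reuses the same identifier
df["eventID"] = df["materialSampleID"]

print(df)
```

Generating the UUIDs before reshaping to the long format means every measurement row derived from one specimen carries the same materialSampleID.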

### _measurementMethod_
- required in GEOME
- create measurementMethod column and assign "Unknown"
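
The GEOME-required placeholder columns listed in these notes could be added in a single `assign` call. This is a sketch on a hypothetical table; the sampling protocol column is spelled `samplingProtocol`, as in the notebook code.

```python
import pandas as pd

# Hypothetical rows standing in for the specimen data
df = pd.DataFrame({"individualID": ["A1", "A2"]})

# Constant placeholder values for the required columns
required = {
    "basisOfRecord": "FossilSpecimen",
    "locality": "Unknown",
    "samplingProtocol": "Unknown",
    "yearCollected": "Unknown",
    "measurementMethod": "Unknown",
}

# assign broadcasts each scalar to every row
df = df.assign(**required)

print(df.columns.tolist())
```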

### _measurementUnit_
- linear measurements are in "mm"
- separate unit from _Weight_
    - Weight is in "lb"
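
One way to separate the unit from a weight string is a regular expression with two capture groups. The sample strings below are hypothetical; the formats actually present in _Weight_ are an assumption.

```python
import pandas as pd

# Hypothetical raw weights; formats like these are assumptions
weights = pd.Series(["150lbs", "70 lb", "112.5 lbs"])

# Capture the number and the unit text in separate named groups
parts = weights.str.extract(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+)")

# Convert the numeric part to float
parts["value"] = parts["value"].astype(float)

# Normalise "lbs" to "lb"
parts["unit"] = parts["unit"].str.lower().str.replace(r"s$", "", regex=True)

print(parts)
```

Values that contain more than one number or unit would need extra handling before this pattern is applied.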
    

Data Wrangling Notebook for Aepyceros Data
<br />
Neeka Sewnath
<br />
nsewnath@ufl.edu

In [201]:
import pandas as pd
import numpy as np
import uuid

In [202]:
# Import Aepyceros Data 
aepyceros = pd.read_csv("../Original_Data/Aepyceros.csv")

In [203]:
# Assign aepyceros["Age (juv, prime adult, older adult, old)"] to verbatimAgeValue
aepyceros = aepyceros.assign(verbatimAgeValue = aepyceros["Age (juv, prime adult, older adult, old)"])

In [204]:
# lifeStage modification

age_col = "Age (juv, prime adult, older adult, old)"

# Flag adults and juveniles; na=False treats missing ages as non-matches
adult_filter = aepyceros[age_col].str.contains("Prime|Old|Young|Very|No", na=False)
juv_filter = aepyceros[age_col].str.contains("juvenile|Juvenile", na=False)

# Start from the verbatim age, then overwrite matched rows with .loc
# (avoids chained assignment and the SettingWithCopyWarning)
aepyceros["lifeStage"] = aepyceros[age_col].fillna("Not Collected")
aepyceros.loc[adult_filter, "lifeStage"] = "adult"
aepyceros.loc[juv_filter, "lifeStage"] = "juvenile"
        



In [205]:
# sex column modification
aepyceros['SEX'] = aepyceros['SEX'].str.lower()

In [206]:
# Add GEOME required columns 
aepyceros=aepyceros.assign(basisOfRecord="FossilSpecimen")
aepyceros=aepyceros.assign(locality="Unknown")
aepyceros=aepyceros.assign(samplingProtocol="Unknown")
aepyceros=aepyceros.assign(yearCollected="Unknown")
aepyceros=aepyceros.assign(measurementMethod="Unknown")

In [207]:
#Rearrange columns so that template columns are first, followed by measurement values

#Specify desired columns
cols = ['Museum',
        'Specimen #',
        'Species',
        'SEX',
        'Country/Continent',
        'State/Province',
        'lifeStage',
        'verbatimAgeValue',
        'Weight',
        'locality',
        'basisOfRecord',
        'samplingProtocol',
        'yearCollected',
        'measurementMethod',
        'Humerus Length',
        'Humerus Width Shaft  AP',
        'Humerus Width Shaft ML',
        'Femur Length',
        'Femur Width Shaft AP',
        'Femur Width ML',
        'Medapodial Length',
        'Medapodial Width AP',
        'Medapodial Width ML',
        'Astragalus Length',
        'Astragalus Width']

#Subset dataframe
aepyceros = aepyceros[cols]

In [208]:
#Matching template and column terms

#Renaming columns 
aepyceros = aepyceros.rename(columns = {'Museum':'institutionCode',
                                        'Specimen #':'individualID',
                                        'Species':'scientificName',
                                        'SEX':'sex',
                                        'Country/Continent':'country',
                                        'State/Province':'stateProvince'})

In [209]:
#Matching trait and ontology terms

#Renaming columns
aepyceros = aepyceros.rename(columns={'Weight':'body mass',
                                      #'Occlusal length M3':'M3 length',
                                      #'Occlusal width M3':'M3 width',
                                      #'Occlusal width M3_ remeasured 11_2016':'M3 width',
                                      #'Occlusal length M2':'M2 length',
                                      #'Occlusal length M2_remeasured 11_2016':'M2 length',
                                      #'Occlusal width M2':'M2 width',
                                      #'Occlusal width M2_remeasured 11_2016':'M2 width',
                                      #'Occlusal length M1':'M1 length',
                                      #'Occlusal width M1':'M1 width',
                                      #'Occlusal length P4':'P4 length',
                                      #'Occlusal Width P4':'P4 width',
                                      #'Occlusal length P2':'P2 length',
                                      #'Occlusal Width P2':'P2 width',
                                      #'Occlusal length M3':'m3 length',
                                      #'Occlusal length M3_remeasured 11_2016':'m3 length',
                                      #'Occlusal width M3':'m3 width',
                                      #'Occlusal width M3_remeasured 11_2016':'m3 width',
                                      #'Occlusal length M2':'m2 length', 
                                      #'Occlusal length M2_remeasured 11_2016': 'm2 length',
                                      #'Occlusal width M2':'m2 width',
                                      #'Occlusal width M2_remeasured 11_2016':'m2 width',
                                      #'Occlusal length M1':'m1 length',
                                      #'Occlusal length M1_remeasured 11_2016':'m1 length',
                                      #'m1 length':'m1 width',
                                      #'Occlusal width M1_remeasured 11_2016':'m1 width',
                                      #'Occlusal length P4':'p4 length',
                                      #'Occlusal width P4':'p4 width',
                                      #'Occlusal length P2':'p2 length',
                                      #'Occlusal width P2':'p2 width',
                                      'Humerus Length':'humerus length',
                                      #'Humerus Width Shaft  AP':'humerus diaphysis depth',
                                      #'Humerus Width Shaft ML':'humerus diaphysis breadth',
                                      'Femur Length':'femur length',
                                      #'Femur Width Shaft AP':'femur diaphysis depth',
                                      #'Femur Width ML':'femur diaphysis breadth',
                                      #'Medapodial Length':'metapodial length',
                                      #'Medapodial Width AP':'metapodial depth',
                                      #'Medapodial Width ML':'metapodial breadth',
                                      'Astragalus Length':'talus length'
                                      #'Astragalus Width':'talus breadth'
                                      })


In [210]:
#Create measurementUnit column
aepyceros=aepyceros.assign(measurementUnit="")

Creating verbatimScientificName and modifying scientificName

In [211]:
# Create verbatim scientificName [not accepted by GEOME yet]
#aepyceros=aepyceros.assign(verbatimScientificName = aepyceros["scientificName"]) 

def clean_name(name):
    """Converts scientific name to binomial nomenclature format (genus and species)"""
    # Keep only the first two words
    name = str(name).split()
    new_name = " ".join(name[:2])
    return new_name

# Clean scientificName
aepyceros["scientificName"]  = aepyceros["scientificName"].apply(clean_name)

In [212]:
#Create materialSampleID, a UUID for each row (assigned before the long version is created)
#Create eventID and populate it with materialSampleID

aepyceros['materialSampleID'] = [uuid.uuid4().hex for _ in range(len(aepyceros.index))]

aepyceros=aepyceros.assign(eventID = aepyceros["materialSampleID"])

In [213]:
#Create long version so that each trait has its own row

#Creating long version, first specifying keep variables, then naming variable and value
longVers=pd.melt(aepyceros, 
                id_vars=['institutionCode',
                         'individualID',
                         'scientificName',
                         #'verbatimScientificName',
                         'sex',
                         'country',
                         'stateProvince',
                         'lifeStage',
                         'verbatimAgeValue',
                         'eventID',
                         'basisOfRecord',
                         'locality',
                         'samplingProtocol',
                         'yearCollected',
                         'measurementMethod',
                         'measurementUnit',
                         'materialSampleID'], 
                var_name = 'measurementType', 
                value_name = 'measurementValue')
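
The reshaping done by `pd.melt` can be seen on a toy table. The specimens and measurement values below are hypothetical.

```python
import pandas as pd

# Hypothetical wide table: one row per specimen, one column per trait
wide = pd.DataFrame({
    "individualID": ["A1", "A2"],
    "femur length": [250.0, 262.5],
    "talus length": [38.0, 40.1],
})

# Every column not listed in id_vars becomes a (type, value) pair on its own row
long_df = pd.melt(wide,
                  id_vars=["individualID"],
                  var_name="measurementType",
                  value_name="measurementValue")

print(long_df)
```

Two specimens with two traits each give four rows, and the identifier column is repeated for every trait.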

In [214]:
#Populating measurementUnit column with appropriate measurement units in long version
long_body_mass_filter = longVers['measurementType'] == "body mass"
# Use .loc to avoid chained assignment
longVers.loc[long_body_mass_filter, 'measurementUnit'] = "g"    # body mass is converted from lb to g
longVers.loc[~long_body_mass_filter, 'measurementUnit'] = "mm"  # linear measurements are in mm

Converting and cleaning measurementValue column

In [223]:
#If measurement value equals N/a, delete entire row
longVers = longVers.dropna(subset=['measurementValue'])
longVers = longVers.drop(longVers.index[0])

# Creating verbatimMeasurementUnit [currently not accepted by GEOME]
longVers=longVers.assign(verbatimMeasurementUnit = longVers["measurementValue"])

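import re

def unit_clean(value, unit):
    """Cleans measurementValue; converts body mass from pounds to grams"""
    # Only body mass (unit "g") needs conversion; return other values unchanged
    if unit != "g":
        return value
    text = str(value)
    # Prefer the number attached to "lb"/"lbs", e.g. "2400 oz/ 150lbs" -> 150
    match = re.search(r"(\d+(?:\.\d+)?)\s*lbs?", text, flags=re.IGNORECASE)
    if match is None:
        # Otherwise take the first number in the value
        match = re.search(r"(\d+(?:\.\d+)?)", text)
    if match is None:
        # No number found: leave the value as-is
        return value
    # Convert pounds to grams (1 lb is roughly 454 g)
    return float(match.group(1)) * 454

# Clean and convert measurementValue column
longVers['measurementValue'] = longVers.apply(lambda x: unit_clean(x.measurementValue, x.measurementUnit), axis=1)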




In [109]:
#Create diagnosticID which is a unique number for each measurement
longVers=longVers.assign(diagnosticID = '')
longVers['diagnosticID'] = np.arange(len(longVers))

In [111]:
#Writing long data csv file
longVers.to_csv('../Mapped_Data/FuTRES_Aepyceros_Africa_Modern.csv', index=False)