# To Code

## template mapping files are in the git repository
## original data in _CyVerse Discovery Environment_ 
### data file is: "Extant Aepyceros database_updated 11_2016.csv"

### _verbatimAgeValue_
- assign 'Age (juv, prime adult, older adult, old)' to verbatimAgeValue

### _lifeStage_
- assign lifeStage component of 'Age (juv, prime adult, older adult, old)' to lifeStage 
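
As a rough illustration of this rule, the sketch below collapses verbatim age labels into lifeStage values on a toy Series. The helper name `to_life_stage` and the sample labels are assumptions made for illustration; the keyword patterns mirror the ones used in the notebook code.

```python
import pandas as pd

def to_life_stage(verbatim):
    # Hypothetical helper, not part of the project code.
    # Missing ages become "Not Collected"; keyword matches then overwrite the verbatim text.
    stage = verbatim.fillna("Not Collected")
    adult = verbatim.str.contains("Prime|Old|Young|Very|No", na=False)
    juvenile = verbatim.str.contains("juvenile", case=False, na=False)
    return stage.mask(adult, "adult").mask(juvenile, "juvenile")

# Toy labels, invented for this example
toy = pd.Series(["Prime adult", "Juvenile", None, "Old"])
result = to_life_stage(toy)
print(result.tolist())
```

Labels that match neither pattern keep their verbatim text, so it may be worth checking the distinct values after mapping.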

### _sex_
- make lowercase

### _basisOfRecord_
- required in GEOME
- create basisOfRecord column and assign "FossilSpecimen"

### _locality_
- required in GEOME
- create locality column and assign "Unknown"

### _samplingProtocol_
- required in GEOME
- create samplingProtocol column and assign "Unknown"

### _yearCollected_
- required in GEOME
- create yearCollected column and assign "Unknown"

### _materialSampleID_
- create materialSampleID and assign a UUID value to each row (before creating the long version)

### _diagnosticID_
- create diagnosticID and assign a unique number to each measurement (after creating the long version)

### _eventID_
- create eventID and assign it the materialSampleID values
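
A minimal sketch of the identifier steps above, using only the standard library. The helper name `make_sample_id` and the toy rows are assumptions for illustration; swapping hyphens for underscores follows what the notebook code does.

```python
import uuid

def make_sample_id():
    # Hypothetical helper: random (version 4) UUID with hyphens replaced by underscores
    return str(uuid.uuid4()).replace("-", "_")

# Toy rows, not project data
rows = [{"individualID": "A"}, {"individualID": "B"}]
for row in rows:
    row["materialSampleID"] = make_sample_id()
    # eventID reuses the materialSampleID value for the same row
    row["eventID"] = row["materialSampleID"]
```

Because the IDs are random, they change on every run; rerunning the notebook produces a new set of identifiers.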

### _measurementMethod_
- required in GEOME
- create measurementMethod column and assign "Unknown"

### _unused columns_
- Location code
- Notes

### _measurementUnit_
- linear measurements are in "mm"
- separate unit from _Weight_
    - Weight is in "lb"
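
One way to express this unit rule is a small lookup table with a default. This is a sketch on toy data, assuming body mass is the only trait with its own unit and that every other trait is a linear measurement; the `units` dictionary is hypothetical.

```python
import pandas as pd

# Toy long-format table, invented for this example
long_toy = pd.DataFrame(
    {"measurementType": ["body mass", "femur length", "humerus length"]}
)

# Assumption: only body mass needs its own unit; everything else is linear ("mm")
units = {"body mass": "lb"}
long_toy["measurementUnit"] = long_toy["measurementType"].map(units).fillna("mm")
print(long_toy)
```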
    

In [49]:
import pandas as pd
import numpy as np
import uuid

In [50]:
# Import Aepyceros Data 
aepyceros = pd.read_csv("../Original Data/Aepyceros.csv")

In [51]:
# Assign aepyceros["Age (juv, prime adult, older adult, old)"] to verbatimAgeValue
aepyceros = aepyceros.assign(verbatimAgeValue = aepyceros["Age (juv, prime adult, older adult, old)"])

In [52]:
# lifeStage modification

age_col = "Age (juv, prime adult, older adult, old)"

# Flag adult and juvenile age classes; na=False treats missing ages as non-matches
adult_filter = aepyceros[age_col].str.contains("Prime|Old|Young|Very|No", na=False)
juv_filter = aepyceros[age_col].str.contains("juvenile|Juvenile", na=False)

# Start from the verbatim value, marking missing ages as "Not Collected"
aepyceros["lifeStage"] = aepyceros[age_col].fillna("Not Collected")

# Assign with .loc to avoid chained assignment (SettingWithCopyWarning)
aepyceros.loc[adult_filter, "lifeStage"] = "adult"
aepyceros.loc[juv_filter, "lifeStage"] = "juvenile"


In [53]:
aepyceros["lifeStage"]

0              adult
1              adult
2              adult
3              adult
4              adult
           ...      
101    Not Collected
102    Not Collected
103    Not Collected
104    Not Collected
105    Not Collected
Name: lifeStage, Length: 106, dtype: object

In [22]:
# sex column modification
aepyceros['SEX'] = aepyceros['SEX'].str.lower()

In [23]:
# Add GEOME required columns 
aepyceros=aepyceros.assign(basisOfRecord="FossilSpecimen")
aepyceros=aepyceros.assign(locality="Unknown")
aepyceros=aepyceros.assign(samplingProtocol="Unknown")
aepyceros=aepyceros.assign(yearCollected="Unknown")
aepyceros=aepyceros.assign(measurementMethod="Unknown")

In [25]:
#Rearrange columns so that template columns are first, followed by measurement values

#Specify desired columns
cols = ['Museum',
        'Specimen #',
        'Species',
        'SEX',
        'Country/Continent',
        'State/Province',
        'lifeStage',
        'verbatimAgeValue',
        'Weight',
        'Humerus Length',
        'locality',
        'basisOfRecord',
        'samplingProtocol',
        'yearCollected',
        'measurementMethod',
        #'Medapodial Length',#fovt name?
        #'Medapodial Width AP',#fovt name?
        #'Medapodial Width ML',#fovt name?
        #'Astragalus Length',#fovt name?
        #'Astragalus Width',#fovt name?
        'Femur Length'
       ]

#Subset dataframe
aepyceros = aepyceros[cols]

In [26]:
#Matching template and column terms

#Renaming columns 
aepyceros = aepyceros.rename(columns = {'Museum':'institutionCode',
                                        'Specimen #':'individualID',
                                        'Species':'scientificName',
                                        'SEX':'sex',
                                        'Country/Continent':'country',
                                        'State/Province':'stateProvince'})

In [27]:
#Matching trait and ontology terms

#Renaming columns
aepyceros = aepyceros.rename(columns={'Weight':'body mass',
                                      'Humerus Length':'humerus length',
                                      'Femur Length':'femur length'})


In [28]:
#Create measurementUnit column
aepyceros=aepyceros.assign(measurementUnit="")

In [29]:
#Fill in blanks for required columns 
aepyceros["country"]=aepyceros["country"].fillna("Unknown")
aepyceros["scientificName"]=aepyceros["scientificName"].fillna("Unknown")

In [30]:
#Create materialSampleID, a UUID for each row (hyphens replaced with underscores)
#Create eventID and populate it with materialSampleID

aepyceros["materialSampleID"] = [
    str(uuid.uuid4()).replace("-", "_") for _ in range(len(aepyceros))
]

aepyceros = aepyceros.assign(eventID=aepyceros["materialSampleID"])


In [31]:
#Create long version so that each trait has its own row

#Creating long version, first specifying the columns to keep, then naming the variable and value columns
longVers=pd.melt(aepyceros, 
                id_vars=['institutionCode',
                         'individualID',
                         'scientificName',
                         'sex',
                         'country',
                         'stateProvince',
                         'lifeStage',
                         'verbatimAgeValue',
                         'eventID',
                         'basisOfRecord',
                         'locality',
                         'samplingProtocol',
                         'yearCollected',
                         'measurementMethod',
                         'measurementUnit',
                         'materialSampleID'], 
                var_name = 'measurementType', 
                value_name = 'measurementValue')
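
To make the reshape concrete, here is a small sketch of `pd.melt` on a toy wide table. The column values are invented and only one identifier column is kept; the real call keeps all template columns listed in `id_vars`.

```python
import pandas as pd

# Toy wide table: one row per specimen, one column per trait
wide_toy = pd.DataFrame(
    {
        "individualID": ["A", "B"],
        "body mass": [100.0, None],
        "femur length": [7.5, 8.0],
    }
)

# Each trait column becomes its own row, labelled by measurementType
long_toy = pd.melt(
    wide_toy,
    id_vars=["individualID"],
    var_name="measurementType",
    value_name="measurementValue",
)
print(long_toy)
```

Two specimens with two traits give four rows, including one with a missing measurementValue that a later step can drop.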


In [32]:
#Populate measurementUnit with the appropriate unit for each trait in the long version
long_body_mass_filter = longVers["measurementType"] == "body mass"

#Assign with .loc to avoid chained assignment (SettingWithCopyWarning)
longVers.loc[long_body_mass_filter, "measurementUnit"] = "lb"
#NOTE: the mapping notes above list linear measurements as "mm"; confirm the unit against the source data
longVers.loc[~long_body_mass_filter, "measurementUnit"] = "in"

In [33]:
#Create diagnosticID which is a unique number for each measurement
longVers["diagnosticID"] = np.arange(len(longVers))

In [34]:
#If measurementValue is missing (NaN), delete the entire row
longVers = longVers.dropna(subset=['measurementValue'])

#Drop the first remaining row: it contains no measurementValue but is not removed by dropna
longVers = longVers.drop(longVers.index[0])
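
The sketch below, on invented toy data, shows one consequence of the ordering used here: because diagnosticID is assigned before rows are dropped, the surviving IDs stay unique but are no longer consecutive.

```python
import numpy as np
import pandas as pd

# Toy long-format table with one missing measurement
long_toy = pd.DataFrame(
    {
        "measurementType": ["body mass", "body mass", "femur length"],
        "measurementValue": [100.0, np.nan, 7.5],
    }
)

# Number every row first, then drop rows without a measurement
long_toy["diagnosticID"] = np.arange(len(long_toy))
kept = long_toy.dropna(subset=["measurementValue"])
print(kept["diagnosticID"].tolist())
```

If consecutive IDs are preferred, one option is to assign diagnosticID after the rows have been dropped.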

In [35]:
#Writing long data csv file
longVers.to_csv('../Mapped Data/FuTRES_Aepyceros_Villaseñor_Africa_Modern.csv', index=False)