# To Code

## template mapping files are in the git repository
## original data in _CyVerse Discovery Environment_ 
### data file is: "J.Biogeo.2008.AllData.Final.csv"

### _elevationInMeters_
- in _elevation.ft_
- convert to meters

### _catalogNumber_
- in Specimen.Number column (new catalogNumber)
- separate out institutionCode from Specimen.Number
- create new column titled institutionCode

### _measurementUnit_
- either in "g" or "mm"

### _otherCatalogNumbers_
- concatenated list of:
    - Proxy.Specimen.Number
    - Annual.Specimen.Number
    - YOC.Specimen.Number
    
### _basisOfRecord_
- required in GEOME
- create basisOfRecord column and assign "FossilSpecimen"

### _locality_
- required in GEOME
- create locality column and assign "Unknown"

### _samplingProtocol_
- required in GEOME
- create samplingProtocol column and assign "Unknown"

### _yearCollected_
- required in GEOME
- create yearCollected column and assign "Unknown"

### _measurementMethod_
- required in GEOME
- create measurementMethod column and assign "Unknown"
    
### _materialSampleID_
- create materialSampleID and assign it a UUID value for each row (before long version)

### _diagnosticID_
- create diagnosticID and assign a unique number to each measurement

### _eventID_
- create eventID and assign it the materialSampleID values

Data Wrangling Notebook for Spermophilus beecheyi Data
<br />
Neeka Sewnath
<br />
nsewnath@ufl.edu

In [93]:
import pandas as pd
import numpy as np
import uuid
import re

Import data

In [94]:
# Import Biogeo Data Locally
biogeo = pd.read_csv("../Original_Data/J.Biogeo.2008.AllData.Final.csv")

Convert elevation.ft values from feet to meters

In [95]:
# 1 foot is exactly 0.3048 meters
biogeo['elevation.ft']=biogeo['elevation.ft'].multiply(0.3048)
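
Because the cell overwrites elevation.ft in place, running it twice would apply the conversion twice. A minimal sketch of an alternative, assuming a few made-up elevations and a hypothetical `elevationInMeters` column name, writes the result to a new column instead:

```python
import pandas as pd

FEET_TO_METERS = 0.3048  # exact by definition

# Hypothetical sample values; the real ones come from the CSV
sample = pd.DataFrame({"elevation.ft": [0.0, 100.0, 5280.0]})

# Writing to a new column leaves the source values intact, so re-running
# this cell cannot convert the same numbers twice
sample["elevationInMeters"] = sample["elevation.ft"] * FEET_TO_METERS
```

The original column can then be dropped or left out of the final column selection.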

Create a new institutionCode column and move the institution prefix from Specimen.Number into it.

In [96]:
# Create institutionCode from the first whitespace-separated token of
# Specimen.Number and keep the remainder as the specimen number.
# Column-wise assignment avoids the chained indexing
# (biogeo[col][ind] = ...) that raises SettingWithCopyWarning, and
# splitting avoids treating the prefix as a regular expression.
spec = biogeo['Specimen.Number'].astype(str).str.strip()
tokens = spec.str.split(n=1)
biogeo['institutionCode'] = tokens.str[0]
biogeo['Specimen.Number'] = tokens.str[1].fillna('')
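
The separation of institutionCode from Specimen.Number can be illustrated on a few made-up values. This is only a sketch: the specimen numbers below are hypothetical, and it assumes the institution prefix is always the first whitespace-separated token.

```python
import pandas as pd

# Hypothetical specimen numbers in "<institution> <number>" form
sample = pd.DataFrame({"Specimen.Number": ["MVZ 12345", "MVZ 67890", "CAS 42"]})

# Split once on whitespace: token 0 is the prefix, token 1 is the rest
spec = sample["Specimen.Number"].astype(str).str.strip()
tokens = spec.str.split(n=1)

sample["institutionCode"] = tokens.str[0]
# Rows with no second token get an empty string rather than NaN
sample["specimenNumber"] = tokens.str[1].fillna("")
```

Splitting on whitespace, rather than removing the prefix with a regular expression, also keeps any special characters in the prefix from being interpreted as a pattern.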


Add required GEOME columns

In [97]:
biogeo=biogeo.assign(basisOfRecord="PreservedSpecimen")
biogeo=biogeo.assign(scientificName="Spermophilus beecheyi")
#biogeo=biogeo.assign(verbatimscientificName=biogeo["ScientificName"])
biogeo=biogeo.assign(country="Unknown")
biogeo=biogeo.assign(locality="Not Collected")
biogeo=biogeo.assign(yearCollected="Unknown")
biogeo=biogeo.assign(samplingProtocol="Not Collected")
biogeo=biogeo.assign(measurementMethod="Unknown")
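
The same constant columns could also be added in a single call. A small sketch, assuming a stand-in DataFrame and a hypothetical `required_defaults` dictionary holding the values used above:

```python
import pandas as pd

# Stand-in for the real data
sample = pd.DataFrame({"Specimen.Number": ["1", "2"]})

# Column name -> constant value, copied from the assignments above
required_defaults = {
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": "Spermophilus beecheyi",
    "country": "Unknown",
    "locality": "Not Collected",
    "yearCollected": "Unknown",
    "samplingProtocol": "Not Collected",
    "measurementMethod": "Unknown",
}

# assign() accepts keyword arguments, so the dictionary can be unpacked
sample = sample.assign(**required_defaults)
```

Keeping the defaults in one dictionary makes it easier to compare them against the GEOME template.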

Create measurementUnit column

In [98]:
biogeo=biogeo.assign(measurementUnit = "")

Add otherCatalogNumbers by combining Proxy.Specimen.Number, Annual.Specimen.Number, and YOC.Specimen.Number 

In [99]:
biogeo=biogeo.assign(otherCatalogNumbers = biogeo['Proxy.Specimen.Number'].fillna('')+biogeo['Annual.Specimen.Number'].fillna('')+biogeo['YOC.Specimen.Number'].fillna('') )
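
Adding the three columns as strings joins them with no separator, so adjacent identifiers run together. The sketch below shows one possible alternative; the " | " delimiter, the sample identifiers, and the `join_ids` helper are assumptions for illustration, not a confirmed GEOME convention.

```python
import pandas as pd

# Hypothetical identifiers; None marks a missing value
sample = pd.DataFrame({
    "Proxy.Specimen.Number": ["P-1", None, None],
    "Annual.Specimen.Number": ["A-7", "A-8", None],
    "YOC.Specimen.Number": [None, "Y-3", None],
})

id_cols = ["Proxy.Specimen.Number", "Annual.Specimen.Number", "YOC.Specimen.Number"]

def join_ids(row):
    """Join the non-missing identifiers in one row with ' | '."""
    return " | ".join(str(value) for value in row if pd.notna(value))

sample["otherCatalogNumbers"] = sample[id_cols].apply(join_ids, axis=1)
```

Converting each value with `str()` also keeps the join working if a column was read as numeric.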

Select specified columns for final dataset

In [100]:
# Specify desired columns
cols = ['Specimen.Number',
        'institutionCode',
        'otherCatalogNumbers',
        'dec.lat',
        'dec.long',  
        'max.error',
        'elevation.ft',
        'hind.foot.length.mm',
        'tail.length.mm',
        'total.length.mm',
        'body.mass.g',
        'ear.length.mm',
        'c.toothrow.1.mm',
        'c.toothrow.2.mm',
        'basisOfRecord',
        'scientificName',
        #verbatimscientificName
        'country',
        'locality',
        'yearCollected',
        'samplingProtocol',
        'measurementMethod',
        'measurementUnit']

# Subset dataframe
biogeo = biogeo[cols]

Matching template and column terms

In [101]:
# Renaming columns 
biogeo = biogeo.rename(columns = {'Specimen.Number':'catalogNumber', 
                                  'dec.lat':'decimalLatitude', 
                                  'dec.long':'decimalLongitude',  
                                  'max.error':'coordinateUncertaintyInMeters', 
                                  'elevation.ft':'pointElevationInMeters'})

In [102]:
#Renaming columns
biogeo = biogeo.rename(columns={'hind.foot.length.mm':'pes length',
                                'tail.length.mm': 'tail length',
                                'total.length.mm':'body length with tail',
                                'ear.length.mm':'ear length to notch',
                                'c.toothrow.1.mm':'tooth row length',
                                'c.toothrow.2.mm':'tooth row length',
                                'body.mass.g':'body mass'})
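
Mapping both toothrow columns to 'tooth row length' leaves the DataFrame with two columns sharing one label. A quick check for repeated labels, sketched here on made-up values, makes that visible before reshaping:

```python
import pandas as pd

# Hypothetical measurements for a single specimen
sample = pd.DataFrame({
    "c.toothrow.1.mm": [10.1],
    "c.toothrow.2.mm": [10.3],
    "body.mass.g": [500.0],
})

renamed = sample.rename(columns={
    "c.toothrow.1.mm": "tooth row length",
    "c.toothrow.2.mm": "tooth row length",
    "body.mass.g": "body mass",
})

# duplicated() flags every repeat of a label after its first occurrence
duplicate_labels = renamed.columns[renamed.columns.duplicated()].tolist()
```

Here `duplicate_labels` holds the one repeated label, 'tooth row length'.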

Create materialSampleID, a UUID for each specimen row, before reshaping to the long version (use hex to remove "-"). Create eventID and populate it with the materialSampleID values.

In [103]:
# Create materialSampleID
biogeo=biogeo.assign(materialSampleID = '')
biogeo['materialSampleID'] = [uuid.uuid4().hex for _ in range(len(biogeo.index))]

# Create eventID
biogeo=biogeo.assign(eventID = biogeo["materialSampleID"])
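
As a brief illustration of the identifier format, `uuid.uuid4().hex` returns the 32 hexadecimal characters of a random UUID without hyphens. The sketch below generates three such identifiers (the count is arbitrary):

```python
import uuid

# One random (version 4) UUID per row, rendered without hyphens
ids = [uuid.uuid4().hex for _ in range(3)]

# For comparison, str() gives the hyphenated 36-character form
hyphenated = str(uuid.uuid4())
```

Because the values are random, they change on every run; rerunning the notebook therefore produces new materialSampleID values.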

Create long version of final data

In [104]:
longVers=pd.melt(biogeo, 
                id_vars=['catalogNumber',
                         'institutionCode',
                         'otherCatalogNumbers',
                         'decimalLatitude',
                         'decimalLongitude',  
                         'coordinateUncertaintyInMeters',
                         'pointElevationInMeters',
                         'materialSampleID',
                         'eventID',
                         'basisOfRecord',
                         'scientificName',
                         #'verbatimScientificName',
                         'country',
                         'locality',
                         'yearCollected',
                         'samplingProtocol',
                         'measurementMethod',
                         'measurementUnit'], 
                          var_name = 'measurementType', 
                          value_name = 'measurementValue')
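
To show what the reshape does, here is a minimal sketch on a two-row table with made-up values. Each measurement column becomes its own row, labelled by measurementType:

```python
import pandas as pd

# Hypothetical wide table: one row per specimen
wide = pd.DataFrame({
    "catalogNumber": ["1", "2"],
    "body mass": [500.0, None],
    "tail length": [150.0, 160.0],
})

# Columns not listed in id_vars are stacked into type/value pairs
long_df = pd.melt(
    wide,
    id_vars=["catalogNumber"],
    var_name="measurementType",
    value_name="measurementValue",
)
```

The result has one row per specimen and measurement (four rows here), including a row for the missing body mass.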

Populating measurementUnit column with appropriate measurement units in long version

In [105]:
# Body mass is recorded in grams; every other measurement is in millimeters.
# A column-wise assignment avoids the chained indexing that raises
# SettingWithCopyWarning.
longVers['measurementUnit'] = np.where(longVers['measurementType'] == "body mass", "g", "mm")


Create diagnosticID which is a unique number for each measurement

In [106]:
longVers=longVers.assign(diagnosticID = '')
longVers['diagnosticID'] = np.arange(len(longVers))

Drop rows whose measurementValue is missing

In [107]:
# Drop every row where measurementValue is missing (NaN)
longVers = longVers.dropna(subset=['measurementValue'])
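
Since diagnosticID is assigned before the missing rows are dropped, the surviving IDs stay unique but are no longer consecutive. The sketch below, on made-up values, shows the gap and one way to renumber if consecutive IDs are preferred (whether GEOME needs them to be is not something this notebook establishes):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format rows, one with a missing value
long_df = pd.DataFrame({
    "measurementType": ["body mass", "body mass", "tail length"],
    "measurementValue": [500.0, None, 150.0],
})

# Number the rows first, then drop the missing one: IDs keep a gap
numbered = long_df.assign(diagnosticID=np.arange(len(long_df)))
kept = numbered.dropna(subset=["measurementValue"])

# Renumbering after the drop gives consecutive IDs
renumbered = kept.assign(diagnosticID=np.arange(len(kept)))
```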

Round coordinateUncertaintyInMeters column to integer value

In [108]:
def int_round(num):
    """Round a single coordinateUncertaintyInMeters value to the nearest integer."""
    return round(num)

# Round coordinateUncertaintyInMeters
longVers["coordinateUncertaintyInMeters"] = longVers["coordinateUncertaintyInMeters"].apply(int_round)
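
Python's built-in `round` raises an error on a missing (NaN) value. If the column can contain gaps, one option is to round the whole column and cast it to pandas' nullable integer type. A sketch on made-up values:

```python
import pandas as pd

# Hypothetical uncertainties in meters, one of them missing
uncertainty = pd.Series([12.4, None, 99.6])

# Series.round() leaves NaN untouched; "Int64" (capital I) is the
# nullable integer dtype, which stores the gap as <NA>
rounded = uncertainty.round().astype("Int64")
```

The result holds whole numbers while keeping the missing entry, so no placeholder value has to be invented.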

Create verbatimMeasurementUnit column (currently not accepted by GEOME)

In [109]:
#longVers=longVers.assign(verbatimMeasurementUnit = longVers["measurementValue"])

Write file as csv for GEOME upload

In [110]:
#Writing long data csv file
longVers.to_csv('../Mapped_Data/FuTRES_Spermophilus.beecheyi_Blois_NorthAmerica_Modern.csv', index = False);