# To Code

## template mapping files are in the git repository
## original data in _CyVerse Discovery Environment_ 
### data file is: "J.Biogeo.2008.AllData.Final.csv"

### _elevationInMeters_
- in _elevation.ft_
- convert ot meters

### _catalogNumber_
- in Specimen.Number column (new catalogNumber)
- separate out institutionCode from Specimen.Number
- create new column titled institutionCode

### _measurementUnit_
- either in "g" or "mm"

### _otherCatalogNumbers_
- concatenated list of:
    - Proxy.Specimen.Number
    - Annual.Specimen.Number
    - YOC.Specimen.Number
    
### _basisOfRecord_
- required in GEOME
- create basisOfRecord column and assign "FossilSpecimen"

### _locality_
- required in GEOME
- create locality column and assign "Unknown"

### _sampleProtocol_
- required in GEOME
- create sampleProtocol column and assign "Unknown"

### _yearCollected_
- required in GEOME
- create yearCollected column and assign "Unknown"

### _measurementMethod_
- required in GEOME
- create measurementMethod column and assign "Unknown"
    
### _materialSampleID_
- create materialSampleID and assign it a UUID value for each row (before long version)

### _diagnosticID
- create diagnosticID and name 

### _eventID
- create eventID and assign it to materialSampleID values

In [1]:
import pandas as pd
import numpy as np
import uuid
import re

In [2]:
#Import Biogeo Data Locally
biogeo = pd.read_csv("../Original Data/biogeo.csv")

In [3]:
#Preliminary data cleaning

#Convert elevation.ft values from feet to meters
#1 foot is exactly 0.3048 meters
biogeo['elevation.ft']=biogeo['elevation.ft'].multiply(0.3048)

In [4]:
#Preliminary data cleaning

#Creating a new column called institutionCode and moving from Specimen.Number to institutionCode.  
biogeo=biogeo.assign(institutionCode = "")
for ind in biogeo.index:
    x=biogeo['Specimen.Number'][ind]
    y=str(x)
    z=str(y).split()
    biogeo['institutionCode'][ind]=z[0]
    y=re.sub(z[0],'',y)
    biogeo['Specimen.Number'][ind]=y


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()


In [5]:
#Add required GEOME columms
biogeo=biogeo.assign(basisOfRecord="PreservedSpecimen")
biogeo=biogeo.assign(scientificName="Spermophilus beecheyi")
biogeo=biogeo.assign(country="Unknown")
biogeo=biogeo.assign(locality="Not Collected")
biogeo=biogeo.assign(yearCollected="Unknown")
biogeo=biogeo.assign(samplingProtocol="Not Collected")
biogeo=biogeo.assign(measurementMethod="Unknown")

In [6]:
#Add measurementUnit column 
biogeo=biogeo.assign(measurementUnit = "")

In [7]:
#Add otherCatalogNumbers
biogeo=biogeo.assign(otherCatalogNumbers = biogeo['Proxy.Specimen.Number'].fillna('')+biogeo['Annual.Specimen.Number'].fillna('')+biogeo['Annual.Specimen.Number'].fillna('') )

In [8]:
#Rearrange columns so that template columns are first, followed by measurement values

#Create column list
cols = biogeo.columns.tolist()

#Specify desired columns
cols = ['Specimen.Number',
        'institutionCode',
        'otherCatalogNumbers',
        'dec.lat',
        'dec.long',  
        'max.error',
        'elevation.ft',
        'hind.foot.length.mm',
        'tail.length.mm',
        'total.length.mm',
        'body.mass.g',
        'basisOfRecord',
        'scientificName',
        'country',
        'locality',
        'yearCollected',
        'samplingProtocol',
        'measurementMethod',
        'measurementUnit']

#Subset dataframe
biogeo = biogeo[cols]

In [9]:
#Matching template and column terms

#Renaming columns 
biogeo = biogeo.rename(columns = {'Specimen.Number':'catalogNumber', 
                                  'dec.lat':'decimalLatitude', 
                                  'dec.long':'decimalLongitude',  
                                  'max.error':'coordinateUncertaintyInMeters', 
                                  'elevation.ft':'pointElevationInMeters'})

In [10]:
#Matching trait and ontology terms

#Renaming columns
biogeo = biogeo.rename(columns={'hind.foot.length.mm':'pes length',
                                'tail.length.mm': 'tail length',
                                'total.length.mm':'body length',
                                'body.mass.g':'body mass'})

In [11]:
#create materialSampleID which is a UUID for each measurement
#create eventID and populate it with materialSampleID

biogeo=biogeo.assign(materialSampleID = '')
biogeo['materialSampleID'] = [uuid.uuid4() for _ in range(len(biogeo.index))]

for ind in biogeo.index:
    x=biogeo['materialSampleID'][ind]
    y=str(x)
    z=y.replace("-", '_')
    
    biogeo['materialSampleID'][ind] = z

biogeo=biogeo.assign(eventID = biogeo["materialSampleID"])


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':


In [12]:
#create long version so that each trait has its own row

#creating long version, first specifiying keep variables, then naming variable and value
longVers=pd.melt(biogeo, 
                id_vars=['catalogNumber',
                         'institutionCode',
                         'otherCatalogNumbers',
                         'decimalLatitude',
                         'decimalLongitude',  
                         'coordinateUncertaintyInMeters',
                         'pointElevationInMeters',
                         'materialSampleID',
                         'eventID',
                         'basisOfRecord',
                         'scientificName',
                         'country',
                         'locality',
                         'yearCollected',
                         'samplingProtocol',
                         'measurementMethod',
                         'measurementUnit'], 
                          var_name = 'measurementType', 
                          value_name = 'measurementValue')

#Populating measurementUnit column with appropriate measurement units in long version
for ind in longVers.index:
    if longVers['measurementType'][ind] == "body mass":
        longVers['measurementUnit'][ind]="g"
    else:
        longVers['measurementUnit'][ind]="mm"

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [13]:
#Create diagnosticID which is a unique number for each measurement
longVers=longVers.assign(diagnosticID = '')
longVers['diagnosticID'] = np.arange(len(longVers))

In [14]:
#If measurement value equals N/a, delete entire row
longVers = longVers.dropna(subset=['measurementValue'])

#Drop first row of data, it contains no measurementValue but is still retained
longVers = longVers.drop(longVers.index[0])

In [15]:
#round coordinateUncertaintyInMeters column to integer value

for ind in longVers.index:
    x=round(longVers["coordinateUncertaintyInMeters"][ind])
    y=int(x)
    longVers["coordinateUncertaintyInMeters"][ind]=y


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


Unnamed: 0,catalogNumber,institutionCode,otherCatalogNumbers,decimalLatitude,decimalLongitude,coordinateUncertaintyInMeters,pointElevationInMeters,materialSampleID,eventID,basisOfRecord,scientificName,country,locality,yearCollected,samplingProtocol,measurementMethod,measurementUnit,measurementType,measurementValue,diagnosticID
1,100740,MVZ,MVZ 100740,35.328304,-119.845250,4.0,822.96,47102270_2189_4f2d_9c1a_cef2d3e24c74,47102270_2189_4f2d_9c1a_cef2d3e24c74,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,mm,pes length,50,1
2,101240,MVZ,MVZ 101240MVZ 101240,37.850522,-122.536923,2.0,,a49d1668_f13b_4dbc_bd95_6847b43b2c4e,a49d1668_f13b_4dbc_bd95_6847b43b2c4e,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,mm,pes length,55,2
3,101332,MVZ,MVZ 101332MVZ 101332,38.107071,-122.841182,1.0,,63c69d7b_e154_416f_ac53_8cd8ad0dbabc,63c69d7b_e154_416f_ac53_8cd8ad0dbabc,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,mm,pes length,52,3
4,101333,MVZ,MVZ 101333MVZ 101333MVZ 101333,38.119278,-122.821322,2.0,45.72,fae1c5fe_fda0_4893_8832_2eec8bc49bfe,fae1c5fe_fda0_4893_8832_2eec8bc49bfe,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,mm,pes length,55,4
5,101334,MVZ,MVZ 101334,38.119278,-122.821322,2.0,45.72,4b6b2d94_b1c3_4229_8865_4d999f5dcb8b,4b6b2d94_b1c3_4229_8865_4d999f5dcb8b,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,mm,pes length,60,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1127,99914,MVZ,MVZ 99914,33.911517,-116.796437,0.0,548.64,146084bc_73dc_4e7e_bed6_fead4bf235d4,146084bc_73dc_4e7e_bed6_fead4bf235d4,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,g,body mass,725,1127
1128,99915,MVZ,MVZ 99915,33.911517,-116.796437,0.0,548.64,1082164e_ec06_43a5_936f_225433860888,1082164e_ec06_43a5_936f_225433860888,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,g,body mass,560,1128
1129,99916,MVZ,MVZ 99916,33.911517,-116.796437,0.0,548.64,7d605c4b_13b4_450a_a5ce_a6ffda81c708,7d605c4b_13b4_450a_a5ce_a6ffda81c708,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,g,body mass,800,1129
1130,99917,MVZ,MVZ 99917,33.911517,-116.796437,0.0,548.64,97e3b201_e145_40fe_8104_ee571c9ffad9,97e3b201_e145_40fe_8104_ee571c9ffad9,PreservedSpecimen,Not Collected,Unknown,Not Collected,Unknown,Not Collected,Unknown,g,body mass,760,1130


In [16]:
#Writing long data csv file
longVers.to_csv('../Mapped Data/FuTRES_Spermophilus.beecheyi_Blois_NorthAmerica_Modern.csv', index = False);