# Manual Changes

## template mapping files are in the git repository

## original data in _CyVerse Discovery Environment_ 
### data file is: "ODOVIRGCLEAN.csv"

### _lifeStage_ and _ageValue_
- in _lifestage_
- create new columns _ageValue_ and _ageUnit_
- separate out lifeStage (e.g., juvenile, adult) from ageValue and ageUnit
- make sure ageUnit is spelled out and singular (e.g., "year")
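
The manual split described above could also be scripted. The following is a minimal sketch, not the procedure actually used on the data file: the sample values, the `UNIT_MAP` abbreviation list, and the helper name `split_lifestage` are all assumptions made for illustration, and the real _lifestage_ column may be formatted differently.

```python
import re
import pandas as pd

# Hypothetical raw values; the real lifestage column may be formatted differently.
raw = pd.DataFrame({"lifestage": ["adult", "juvenile 2 yrs", "3.5 years", None]})

# Assumed abbreviation list: map plurals and short forms to a singular, spelled-out unit.
UNIT_MAP = {"yr": "year", "yrs": "year", "years": "year",
            "mo": "month", "mos": "month", "months": "month",
            "wk": "week", "wks": "week", "weeks": "week",
            "days": "day"}

# A number (optionally decimal) followed by a unit word, e.g. "2 yrs" or "3.5 years".
AGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)")

def split_lifestage(value):
    """Return (lifeStage, ageValue, ageUnit) parsed from one raw lifestage entry."""
    if not isinstance(value, str):
        return ("", "", "")  # missing entry: leave all three fields empty
    match = AGE_PATTERN.search(value)
    if match is None:
        return (value.strip(), "", "")  # no age present, keep the stage text
    number, unit = match.groups()
    unit = UNIT_MAP.get(unit.lower(), unit.lower())
    # Whatever remains once the age is cut out is treated as the life stage.
    stage = (value[:match.start()] + value[match.end():]).strip(" ;,.")
    return (stage, number, unit)

parsed = pd.DataFrame(raw["lifestage"].apply(split_lifestage).tolist(),
                      columns=["lifeStage", "ageValue", "ageUnit"],
                      index=raw.index)
result = raw.join(parsed)
```

Entries that the pattern does not recognise keep their full text in _lifeStage_, so they can still be reviewed by hand.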

### _yearCollected_
- in _eventDate_
- create new column _yearCollected_
- separate out year
- include century as well (e.g., 1999)

### _unused columns_
- LocationCode
- Note

## To Code
### _measurementValue_
- select only "1st_" measurement

### _measurementUnit_
- make sure units are either "g" or "mm"
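
As a rough illustration of these two rules, the sketch below keeps only the columns whose names start with "1st_" and derives a unit for each one. The toy values, the `2nd_body_mass` column, and the helper `unit_for` are hypothetical; the mass-to-"g", everything-else-to-"mm" rule mirrors the unit assignment used later in this notebook.

```python
import pandas as pd

# Toy data; "2nd_body_mass" is a hypothetical column standing in for any repeat measurement.
wide = pd.DataFrame({
    "catalognumber": ["A1", "A2"],
    "1st_body_mass": [21.5, 19.0],
    "2nd_body_mass": [22.0, None],
    "1st_total_length": [150.0, 142.0],
})

# Keep only the first measurement of each trait.
first_cols = [c for c in wide.columns if c.startswith("1st_")]
first_only = wide[["catalognumber"] + first_cols]

def unit_for(trait):
    """Masses are recorded in grams, every other trait in millimetres."""
    return "g" if "mass" in trait else "mm"

units = {c: unit_for(c) for c in first_cols}
```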

In [38]:
import pandas as pd #used
import numpy as np
import multiprocessing
import re #used
import uuid #used
from dateutil.parser import parse
from operator import eq

In [105]:
# Import Mammal Vertnet Data 
mammal = pd.read_csv("./../Original Data/mammals_no_bats_2019-03-13-2.csv")

In [106]:
# Preliminary data cleaning of yearCollected column

# Filling N/As with "Unknown"
mammal["eventdate"]=mammal["eventdate"].fillna("Unknown")

# Parse the eventdate column, identify the year, and store it in a new yearCollected column
mammal=mammal.assign(yearCollected = '')

# Creating event date variable
verbatim_date=mammal['eventdate']

# Establishing vertnet filter
# Raw string so that regex escapes such as \d and \? reach the regex engine unchanged
vertnet_date_filter = verbatim_date.str.contains(r"IV|0000|September|<|NW|latter|unknown|(MCZ)|(MSU)|present|and|;|&|mainly|between|Between|BETWEEN|OR|Unknown|UNKNOWN|#|TO|\?|\'|----|19--|No Date|\,|\d{4}-\d{4}|(/n) /d|\d{4}[s]| \d{4}\'[S]|1075-07-29|975-07-17| 2088| 9999| 0201")

# Keep only the dates that do not match the filter
verbatim_date_clean = verbatim_date[~vertnet_date_filter]

def year_search(year):
    """Find a 4-digit year at the end or start of the string and pass it to the matching helper.

    Returns None when neither pattern matches; those rows are filled with "Unknown" afterwards.
    """
    if (re.search(r'\d{4}$', year)):
        return year_cleaner_front(year)
    elif (re.search(r'^\d{4}', year)):
        return year_cleaner_back(year)

def year_cleaner_front(year):
    """Drop the front of the string, keeping the 4-digit year at the end"""
    cleaned_year = year[len(year)-4:len(year)]
    return cleaned_year

def year_cleaner_back(year):
    """Drop the back of the string, keeping the 4-digit year at the beginning"""
    cleaned_year = year[0:4]
    return cleaned_year

mammal["yearCollected"] = verbatim_date_clean.apply(year_search)
mammal["yearCollected"] = mammal["yearCollected"].fillna("Unknown")
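
For comparison, the same end-first, then start-first lookup can be written with vectorized string methods. This is only a sketch on made-up date strings; it skips the VertNet filter applied above, and the variable names are illustrative.

```python
import pandas as pd

# Made-up date strings covering the two supported layouts plus two unparseable cases.
dates = pd.Series(["1999-05-12", "12 May 1987", "Unknown", "5/12/87"])

# Look for a 4-digit run at the end first, then at the start, as year_search does.
at_end = dates.str.extract(r"(\d{4})$", expand=False)
at_start = dates.str.extract(r"^(\d{4})", expand=False)

# Prefer the year found at the end, fall back to the start, else mark as unknown.
year_collected = at_end.fillna(at_start).fillna("Unknown")
```

Two-digit years such as "5/12/87" are left as "Unknown" because the century cannot be recovered from the string alone.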


In [109]:
mammal.to_csv('../Mapped Data/Mammal_CheckPoint.csv', index=False)

In [8]:
# Start from checkpoint (for troubleshooting to avoid rerunning yearCollected code)
#mammal = pd.read_csv('../Mapped Data/Mammal_CheckPoint.csv')

In [9]:
# Clean up sex column: blank out every value that is not exactly "female" or "male"
female = mammal['sex']=="female"
male = mammal['sex']=="male"
# Use .loc so the assignment modifies mammal itself rather than a temporary copy
mammal.loc[~(female | male), 'sex'] = ""


In [10]:
# Add required GEOME columns
mammal=mammal.assign(samplingProtocol="Unknown")
mammal=mammal.assign(measurementMethod="Unknown")
mammal=mammal.assign(country="Unknown")
mammal=mammal.assign(basisOfRecord="PreservedSpecimen")

In [12]:
# Create verbatimEventDate column to preserve the original date text (including dates given as a range)
mammal=mammal.assign(verbatimEventDate = '')
mammal['verbatimEventDate']=mammal['eventdate']

In [14]:
# Rearrange columns so that template columns are first, followed by measurement values

# Specify the desired columns, in the order they should appear
cols = ['catalognumber',
        'collectioncode',
        'decimallatitude',
        'decimallongitude',
        'maximumelevationinmeters',
        'minimumelevationinmeters',
        'institutioncode',
        'verbatimEventDate',
        'locality',
        'samplingProtocol',
        'measurementMethod',
        'country',
        'sex',
        'lifestage',
        'scientificname',
        'references',
        'basisOfRecord',
        'yearCollected',
        '1st_body_mass',
        '1st_ear_length',
        '1st_hind_foot_length',
        '1st_tail_length',
        '1st_total_length']

# Subset dataframe
mammal = mammal[cols]

In [15]:
# Matching template and column terms

# Renaming columns 
mammal = mammal.rename(columns = {'catalognumber': 'catalogNumber',
                                 'collectioncode':'collectionCode',
                                 'decimallatitude':'decimalLatitude',
                                 'decimallongitude':'decimalLongitude',
                                 'maximumelevationinmeters':'maximumElevationInMeters',
                                 'minimumelevationinmeters':'minimumElevationInMeters',
                                 'institutioncode' :'institutionCode',
                                 'locality':'verbatimLocality',
                                 'lifestage':'lifeStage',
                                 'scientificname':'scientificName'})

In [16]:
# Matching trait and ontology terms

# Renaming columns
mammal = mammal.rename(columns={'1st_body_mass':'body mass',
                                '1st_ear_length': 'ear length',
                                '1st_hind_foot_length':'pes length',
                                '1st_tail_length':'tail length',
                                '1st_total_length':'body length'})

In [17]:
# Create materialSampleID, a UUID for each row (one specimen record) of the wide table
mammal=mammal.assign(materialSampleID = '')
mammal['materialSampleID'] = [uuid.uuid4().hex for _ in range(len(mammal.index))]

# Create eventID and populate it with materialSampleID
mammal=mammal.assign(eventID = mammal["materialSampleID"])

In [18]:
# Fill unknown for scientificName
mammal["scientificName"]=mammal["scientificName"].fillna("Unknown")

In [19]:
# Add required GEOME column locality after reassigning locality to verbatimLocality
mammal=mammal.assign(locality="Unknown")

In [20]:
# Create the long version: first specify the columns to keep, then name the variable and value columns

longVersMammal=pd.melt(mammal,
                      id_vars=['catalogNumber',
                      'collectionCode',
                      'decimalLatitude',
                      'decimalLongitude',
                      'maximumElevationInMeters', 
                      'minimumElevationInMeters',
                      'yearCollected',
                      'basisOfRecord',
                      'verbatimEventDate',
                      'institutionCode',
                      'lifeStage',
                      'verbatimLocality',
                      'locality',
                      'samplingProtocol',
                      'measurementMethod',
                      'country',
                      'sex',
                      'scientificName',
                      'materialSampleID',
                      'eventID',
                      'references'], 
                var_name = 'measurementType',
                value_name = 'measurementValue')
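
To make the reshaping concrete, here is a small sketch of `pd.melt` on a toy two-specimen table. The values are invented and only one identifier column is kept; the real call above keeps all template columns.

```python
import pandas as pd

# Toy wide table: one row per specimen, one column per trait.
wide = pd.DataFrame({
    "catalogNumber": ["A1", "A2"],
    "body mass": [21.5, 19.0],
    "tail length": [80.0, None],
})

# Every non-id column becomes a (measurementType, measurementValue) pair per specimen.
long_df = pd.melt(wide,
                  id_vars=["catalogNumber"],
                  var_name="measurementType",
                  value_name="measurementValue")

# Rows without a measurement carry no information in the long layout.
long_clean = long_df.dropna(subset=["measurementValue"])
```

Two specimens with two traits give four long rows, of which three remain once the missing tail length is dropped.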

In [21]:
# Populating measurementUnit column with appropriate measurement units in long version
longVersMammal=longVersMammal.assign(measurementUnit="")

long_body_mass_filter=longVersMammal['measurementType']=="body mass"
long_no_body_filter=longVersMammal['measurementType']!="body mass"

# Use .loc so the assignment modifies longVersMammal itself rather than a temporary copy
longVersMammal.loc[long_body_mass_filter, 'measurementUnit'] = "g"
longVersMammal.loc[long_no_body_filter, 'measurementUnit'] = "mm"


In [22]:
# Create diagnosticID which is a unique number for each measurement
longVersMammal=longVersMammal.assign(diagnosticID = '')
longVersMammal['diagnosticID'] = np.arange(len(longVersMammal))

In [23]:
# If measurement value equals N/A, delete entire row
longVersMammal = longVersMammal.dropna(subset=['measurementValue'])

# Drop the first row of data; it contains no measurementValue but is still retained
longVersMammal = longVersMammal.drop(longVersMammal.index[0])

In [24]:
# Writing long data csv file
longVersMammal.to_csv('../Mapped Data/Mammal_Data_Long.csv', index=False)

In [26]:
# Chunking the data into more manageable sizes for GEOME

# Create as many processes as there are CPUs on your machine
num_processes = multiprocessing.cpu_count()

# Initiate the chunk size 
chunk_size = 50000

# Create chunks
# .iloc slices rows by position (.ix is deprecated)
chunks = [longVersMammal.iloc[i:i + chunk_size] for i in range(0, longVersMammal.shape[0], chunk_size)]


In [28]:
# Write each chunk to its own numbered CSV file
for i in range(len(chunks)):
    new=i+1
    chunks[i].to_csv('../Mapped Data/FuTRES_Mammals_VertNet_Global_Modern_'+ str(new) +'.csv', index=False)
    print("mapped_data", i)

mapped_data 0
mapped_data 1
mapped_data 2
mapped_data 3
mapped_data 4
mapped_data 5
mapped_data 6
mapped_data 7
mapped_data 8
mapped_data 9
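
The chunking logic can be checked on a small frame before running it on the full data set. The sketch below is an illustration only: the helper name `split_into_chunks` and the seven-row toy table are assumptions, not part of the notebook.

```python
import pandas as pd

def split_into_chunks(frame, chunk_size):
    """Return consecutive positional row slices of at most chunk_size rows each."""
    return [frame.iloc[start:start + chunk_size]
            for start in range(0, len(frame), chunk_size)]

# Seven rows split into chunks of three leaves a final, shorter chunk.
toy = pd.DataFrame({"diagnosticID": range(7)})
toy_chunks = split_into_chunks(toy, chunk_size=3)
sizes = [len(chunk) for chunk in toy_chunks]
```

Because the slices are positional, the result does not depend on gaps left in the index by earlier `dropna` or `drop` calls.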
