# 0.1) Starting Code and Helper Functions

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime as dt

In [2]:
#importing all the data, from the 'Calaveras Cataloguing' spreadsheet. Updated as of 3/16/20. 

inverts = pd.read_csv('Calaveras_Cataloguing_Invertebrates.csv')
inv_0 = inverts.copy()

verts = pd.read_csv('Calaveras_Cataloguing_Vertebrates.csv')
vert_0 = verts.copy()

In [3]:
##null_list

#takes in a list-like (list, array, Series) and reports whether every value in it is null
#useful for cleaning up data

def null_list(array):
    #pd.isnull works element-wise; .all() is True only if every element is null
    return bool(pd.isnull(array).all())

# 0.2) Cleaning Up the Data

Data entry for this project involved adding the field number (and the date that the fossil was found), the locality number, and the taxonomy (as specific as possible). Much of the data is either blank or inconsistently formatted, so we need code that parses through the data and cleans it up.

In [4]:
##0.2.1) REMOVING ANY EMPTY ROWS

#First, there are extra rows assigned with no values in them. We will get rid of them. 

#DataFrame.append was removed in pandas 2.0, so collect the row positions to keep and select them in one step
keep = [i for i in range(len(inv_0)) if not null_list(inv_0.iloc[i, 2:22])]
inv_1 = inv_0.iloc[keep].copy() #original index labels are kept, so rows can be traced back to the OG data
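
The same filter can also be written without a Python loop. A minimal sketch on a made-up frame (`toy` and its values are illustrative, not taken from the spreadsheet), assuming the columns to inspect are known by position:

```python
import numpy as np
import pandas as pd

# Toy stand-in for inv_0; the values are illustrative only.
toy = pd.DataFrame({
    'spec_no': [1, 2, 3],
    'loc_ID_num': ['V1', np.nan, 'V3'],
    'phylum': ['Mollusca', np.nan, np.nan],
    'class': ['Bivalvia', np.nan, np.nan],
})

# Columns to inspect (positions 1 onward here; the notebook uses 2:22).
data_cols = list(toy.columns[1:])

# Drop rows where every inspected column is null; index labels are preserved.
toy_clean = toy.dropna(how='all', subset=data_cols)
print(toy_clean.index.tolist())
```

Row 1 is dropped because all of its inspected columns are null; row 2 survives because it still has a locality.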

In [5]:
##0.2.2) TAXONOMY DIVERSITY + IDENTIFYING MISTAKES

#A lot of the data entries have the taxonomies listed at lower levels
#Our first step will be to identify the diversity of the group at each taxonomic level (from genus to phylum)

#checks the rows where the column condition is met
#(inv_1.loc[inv_1['ordr'] == 'Pectinidae'])
#inv_2['phylum']

temp2 = []
for col in inv_1:
    temp2.append(inv_1[col].unique())

print('The Diversity of the Calaveras Inverts Across Taxonomic Levels (v1)')   
print()
    
for i in range(2,8):
    print(inv_1.columns[i])
    print(temp2[i])
    print()

The Diversity of the Calaveras Inverts Across Taxonomic Levels (v1)

phylum
[nan 'Plantae' 'V17011' 'V17017' 'Ichnofossil' 'Mollusca' 'Arthropoda'
 'Echinodermata ' 'Echinodermata' 'Annelida' 'ichnofossil' ' '
 'Foraminifera' 'Ichnotaxon' 'Teredolites']

class
[nan 'Bivalvia' 'Cirripedia' 'Mollusca' 'Gastropoda' 'Bivalvia '
 'Echinodermata' 'Pectinidae' 'Scaphopoda' 'Gastropoda ' 'Echinoidea '
 'Cirrepedia' 'Bivalvia?' 'Scaphopoda (?)' 'Gastropoda?' ' ' 'Crustacea '
 'Maxillopoda']

ordr
[nan 'Pectinidae' 'Ostreioida' 'Venerida' 'Lucinidae' 'Cirripedia'
 'Decapoda']

family
[nan 'Pectinidae' 'Ostreidae' 'Arcidae' 'Pectinidae ' 'Mactridae'
 'Lucinidae' 'Tellinidae' 'Lucinidae ' 'Yoldiidae' 'Trochidae' 'Mytilidae'
 'Lucinidea' 'pectinidae' 'Pectinidae?' 'Spondylidae' 'Cardiidae'
 'Cardiidae?' 'arcidae?']

originalGenus
[nan]

genus
['Patinopecten' nan 'Mytilus' 'Teredolites' ' ' 'patinopecten' 'Spisula'
 'Diplodanta' 'Astrodapsis' 'astrodapsis' '???' 'terodolites'
 'teredolites' 'Mytilid

This is where we are at. We have removed the empty rows and are now left with just the data (logged over 2 years by different people). The next step will be to use the lower taxonomic level IDs to fill out the upper levels. This will flesh out the data and make it more useful later. Unfortunately, we will have to hard code some of this to account for variations in taxonomic levels. It will be lengthy and arduous.

Before that, we isolated the unique values in each column of our data (shown above), which tells us exactly what we have to 'hard code'. Because different people did the data entry, the same kinds of errors recur. These (according to taxonomic level) are:

-ERROR LIST-

1. Phylum: 
    1. locality numbers in the phylum,
    2. plantae is a kingdom, but we can ignore this
    3. the 2 'ichnofossil', 'ichnotaxon', and 'teredolites' are all the same thing, 
    4. foraminifera is not a phylum, it is a ???? ASK ASHLEY, HOW DO WE CLASSIFY FORAMS??
2. Class: 
    1. 3 bivalvia, 
    2. 3 gastropoda, 
    3. mollusca is a phylum, 
    4. pectinidae is a family, 
    5. 2 scaphopoda, 
    6. 2 cirripedia, 
    7. crustacea is a subphylum of Arthropoda, not a class, 
    8. echinodermata is a phylum, has multiple spellings
3. Order:
    1. pectinidae is a family,
    2. lucinidae is a family,
    3. cirripedia is a subclass, we are grouping it in 'class'    
4. Family:
    1. 4 pectinidae (one has a space after)
    2. 2 arcidae
    3. 2 lucinidae
    4. 2 cardiidae    
5. Genus:
    1. 2 patinopecten
    2. mytilidae is a family
    3. 3 teredolites
    4. 'diplodanta' should be spelled 'diplodonta'
    5. 2 astrodapsis
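
Many of the items above are variants of the same three problems: stray whitespace, inconsistent capitalisation, and trailing question marks. A rough sketch of how those could be normalised in one pass, using values copied from the 'family' output above. The `spelling` dictionary is a hypothetical helper, and this does not handle entries sitting in the wrong column, which still need case-by-case fixes:

```python
import numpy as np
import pandas as pd

# Values copied from the 'family' output above, plus a blank and a missing entry.
family = pd.Series(['Pectinidae ', 'pectinidae', 'Pectinidae?', 'Lucinidea',
                    'arcidae?', ' ', np.nan])

# Hypothetical table of spelling fixes, applied after the generic cleanup.
spelling = {'Lucinidea': 'Lucinidae'}

normalized = (
    family.str.replace('?', '', regex=False)  # drop uncertainty markers
          .str.strip()                        # drop stray whitespace
          .str.capitalize()                   # 'pectinidae' -> 'Pectinidae'
          .replace('', np.nan)                # blank cells become missing
          .replace(spelling)                  # known misspellings
)
print(normalized.tolist())
```

Note that `str.capitalize` lowercases everything after the first letter, which is fine for single-word taxon names but would need care for anything else.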

In [6]:
##0.2.3) TAXONOMIC ERROR CORRECTIONS

inv_2 = inv_1.copy()


#We will do a row by row sweep, because that is easiest for me
#According to the error list above

##########################################################################################################################

##1. phylum
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'phylum'] == ' '):
            inv_2.loc[i,'phylum'] = np.nan

#1A
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'phylum'] == 'V17011') or (inv_2.loc[i,'phylum'] == 'V17017'):
        inv_2.loc[i,'phylum'] = np.nan

#1C
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'phylum'] == 'Ichnofossil') or (inv_2.loc[i,'phylum'] == 'ichnofossil') or (inv_2.loc[i,'phylum'] == 'Ichnotaxon') or (inv_2.loc[i,'phylum'] == 'Teredolites'):
        inv_2.loc[i,'phylum'] = np.nan
        inv_2.loc[i,'genus'] = 'Teredolites'
        
##########################################################################################################################
        
##2. class
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'class'] == ' '):
            inv_2.loc[i,'class'] = np.nan

#2A
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'class'] == 'Bivalvia ') or (inv_2.loc[i,'class'] == 'Bivalvia?'):
        inv_2.loc[i,'class'] = 'Bivalvia'
#2B
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'class'] == 'Gastropoda ') or (inv_2.loc[i,'class'] == 'Gastropoda?'):
        inv_2.loc[i,'class'] = 'Gastropoda'
#2C
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'class'] == 'Mollusca'):
        inv_2.loc[i,'class'] = np.nan
        inv_2.loc[i,'phylum'] = 'Mollusca'        
#2D
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'class'] == 'Pectinidae'):
        inv_2.loc[i,'class'] = np.nan
        inv_2.loc[i,'family'] = 'Pectinidae'
#2E
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'class'] == 'Scaphopoda (?)'):
        inv_2.loc[i,'class'] = 'Scaphopoda'
#2F
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'class'] == 'Cirrepedia'):
        inv_2.loc[i,'class'] = 'Cirripedia'
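#2G
#NOTE: the raw value is 'Crustacea ' (with a trailing space), so this exact match does not fire
#and 'Crustacea ' is still listed under class in the v2 output below
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'class'] == 'Crustacea'):
        inv_2.loc[i,'class'] = np.nan
        inv_2.loc[i,'phylum'] = 'Arthropoda'
#2H
for i, row in inv_2.iterrows():
    #class holds a phylum name: clear it and record the phylum instead
    if  (inv_2.loc[i,'class'] == 'Echinodermata'):
        inv_2.loc[i,'class'] = np.nan
        inv_2.loc[i,'phylum'] = 'Echinodermata'
    #phylum has a trailing space: fix the spelling only, leaving class untouched
    if  (inv_2.loc[i,'phylum'] == 'Echinodermata '):
        inv_2.loc[i,'phylum'] = 'Echinodermata'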

##########################################################################################################################

##3. order
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'ordr'] == ' '):
            inv_2.loc[i,'ordr'] = np.nan            
#3A
for i, row in inv_2.iterrows():
    if (inv_2.loc[i,'ordr'] == 'Pectinidae'):
        inv_2.loc[i,'ordr'] = np.nan
        inv_2.loc[i,'family'] = 'Pectinidae'
#3B
for i, row in inv_2.iterrows():
    if (inv_2.loc[i,'ordr'] == 'Lucinidae'):
        inv_2.loc[i,'ordr'] = np.nan
        inv_2.loc[i,'family'] = 'Lucinidae'        
#3C
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'ordr'] == 'Cirripedia'):
        inv_2.loc[i,'ordr'] = np.nan
        inv_2.loc[i,'class'] = 'Cirripedia'        

##########################################################################################################################

##4. family 
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'family'] == ' '):
            inv_2.loc[i,'family'] = np.nan            
#4A
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'family'] == 'Pectinidae ') or (inv_2.loc[i,'family'] == 'pectinidae') or (inv_2.loc[i,'family'] == 'Pectinidae?'):
        inv_2.loc[i,'family'] = 'Pectinidae'
#4B
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'family'] == 'arcidae?'):
        inv_2.loc[i,'family'] = 'Arcidae'
#4C
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'family'] == 'Lucinidea') or (inv_2.loc[i,'family'] == 'Lucinidae '):
        inv_2.loc[i,'family'] = 'Lucinidae'
#4D
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'family'] == 'Cardiidae?'):
        inv_2.loc[i,'family'] = 'Cardiidae'
        
##########################################################################################################################

##5. genus
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'genus'] == ' '):
            inv_2.loc[i,'genus'] = np.nan            
#5A
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'genus'] == 'patinopecten'):
        inv_2.loc[i,'genus'] = 'Patinopecten'   
#5B
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'genus'] == 'Mytilidae'):
        inv_2.loc[i,'genus'] = np.nan
        inv_2.loc[i,'family'] = 'Mytilidae'   
#5C
for i, row in inv_2.iterrows():
    if (inv_2.loc[i,'genus'] == 'teredolites') or (inv_2.loc[i,'genus'] == 'terodolites'):
        inv_2.loc[i,'genus'] = 'Teredolites'
#5D
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'genus'] == 'Diplodanta'):
        inv_2.loc[i,'genus'] = 'Diplodonta' 
#5E
for i, row in inv_2.iterrows():
    if  (inv_2.loc[i,'genus'] == 'astrodapsis'):
        inv_2.loc[i,'genus'] = 'Astrodapsis' 
        
temp3 = []
for col in inv_2:
    temp3.append(inv_2[col].unique())

print('The Diversity of the Calaveras Inverts Across Taxonomic Levels (v2)')   
print()
    
for i in range(2,8):
    print(inv_2.columns[i])
    print(temp3[i])
    print()

The Diversity of the Calaveras Inverts Across Taxonomic Levels (v2)

phylum
[nan 'Plantae' 'Mollusca' 'Echinodermata' 'Arthropoda' 'Annelida'
 'Foraminifera']

class
[nan 'Bivalvia' 'Cirripedia' 'Gastropoda' 'Scaphopoda' 'Echinoidea '
 'Crustacea ' 'Maxillopoda']

ordr
[nan 'Ostreioida' 'Venerida' 'Decapoda']

family
[nan 'Pectinidae' 'Ostreidae' 'Arcidae' 'Mactridae' 'Lucinidae'
 'Tellinidae' 'Yoldiidae' 'Trochidae' 'Mytilidae' 'Spondylidae'
 'Cardiidae']

originalGenus
[nan]

genus
['Patinopecten' nan 'Mytilus' 'Teredolites' 'Spisula' 'Diplodonta'
 'Astrodapsis' '???']



Great. This looks like some diversity data we can work with, although 'Echinoidea ' and 'Crustacea ' still carry a trailing space in the class column. Now we have to work backwards, using lower level taxonomic information to fill out the higher level information. This will be difficult to accomplish.
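
One way to shorten the hard-coding is a lookup table from a lower rank to its parents. A minimal sketch for the family level, assuming a hypothetical `FAMILY_TO_HIGHER` dictionary filled with three of the ITIS assignments noted in this notebook and a made-up `toy` frame:

```python
import numpy as np
import pandas as pd

# Family -> (order, class, phylum); entries follow the ITIS notes in this notebook.
FAMILY_TO_HIGHER = {
    'Arcidae': ('Arcoida', 'Bivalvia', 'Mollusca'),
    'Mytilidae': ('Mytiloida', 'Bivalvia', 'Mollusca'),
    'Yoldiidae': ('Nuculoida', 'Bivalvia', 'Mollusca'),
}

# Toy records; only the family column is filled in for the first two rows.
toy = pd.DataFrame({
    'phylum': [np.nan, 'Mollusca', np.nan],
    'class': [np.nan, np.nan, np.nan],
    'ordr': [np.nan, np.nan, np.nan],
    'family': ['Arcidae', 'Mytilidae', np.nan],
})

missing = (np.nan, np.nan, np.nan)
for pos, rank in enumerate(['ordr', 'class', 'phylum']):
    # Parent taxon at this rank for each row; NaN when the family is unknown.
    looked_up = toy['family'].map(lambda fam: FAMILY_TO_HIGHER.get(fam, missing)[pos])
    # Overwrite only where the lookup found something; keep existing values otherwise.
    toy[rank] = looked_up.where(looked_up.notna(), toy[rank])

print(toy)
```

Adding a family then means adding one dictionary entry instead of one more loop.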

In [7]:
##0.2.4) FLESHING OUT THE DATA

#We have our dataset...we know which lower taxa we have and can create functions based off them to fill out the higher levels.
#If I was smarter, I would have mapped out the taxonomic info onto an existing database to save time. 
#Alas, I have to hard code the rest. 

#Higher taxa info taken from ITIS, World Echinoidea Database (WED)

In [8]:
##Class (Phylum)
inv_3_1 = inv_2.copy()

#Bivalvia (Mollusca) ITIS
#Gastropoda (Mollusca) ITIS
#Scaphopoda (Mollusca) ITIS
for i, row in inv_3_1.iterrows():
    if  (inv_3_1.loc[i,'class'] == 'Bivalvia') or (inv_3_1.loc[i,'class'] == 'Gastropoda') or (inv_3_1.loc[i,'class'] == 'Scaphopoda'):
            inv_3_1.loc[i,'phylum'] = 'Mollusca'

#Cirripedia (Arthropoda) ITIS
#Crustacea (Arthropoda) ITIS
#Maxillopoda (Arthropoda) ITIS
#the class value is stripped before comparing because 'Crustacea ' still carries a trailing space after 0.2.3
for i, row in inv_3_1.iterrows():
    cls = str(inv_3_1.loc[i,'class']).strip()
    if  (cls == 'Cirripedia') or (cls == 'Crustacea') or (cls == 'Maxillopoda'):
            inv_3_1.loc[i,'phylum'] = 'Arthropoda'

#Echinoidea (Echinodermata) ITIS
#same here: the class column holds 'Echinoidea ' with a trailing space
for i, row in inv_3_1.iterrows():
    if (str(inv_3_1.loc[i,'class']).strip() == 'Echinoidea'):
            inv_3_1.loc[i,'phylum'] = 'Echinodermata'

In [9]:
##Order (Class, Phylum)
inv_3_2 = inv_3_1.copy()

#Ostreioida (Bivalvia, Mollusca) ITIS
#Venerida (Bivalvia, Mollusca) ITIS
for i, row in inv_3_2.iterrows():
    if (inv_3_2.loc[i,'ordr'] == 'Ostreioida') or (inv_3_2.loc[i,'ordr'] == 'Venerida'):
        inv_3_2.loc[i,'class'] = 'Bivalvia'
        inv_3_2.loc[i,'phylum'] = 'Mollusca'

#Decapoda (Malacostraca, Arthropoda) ITIS
for i, row in inv_3_2.iterrows():
    if (inv_3_2.loc[i,'ordr'] == 'Decapoda'):
        inv_3_2.loc[i,'class'] = 'Malacostraca'
        inv_3_2.loc[i,'phylum'] = 'Arthropoda'

In [10]:
##Family (Order, Class, Phylum)
inv_3_3 = inv_3_2.copy()

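#Pectinidae (Ostreoida, Bivalvia, Mollusca) ITIS
#Ostreidae (Ostreoida, Bivalvia, Mollusca) ITIS
#Spondylidae (Ostreoida, Bivalvia, Mollusca) ITIS
#NOTE: the value written below is 'Ostreioida', the spelling used in the spreadsheet; the genus pass writes 'Ostreoida',
#so both spellings end up in the ordr column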
for i, row in inv_3_3.iterrows():
    if (inv_3_3.loc[i,'family'] == 'Pectinidae') or (inv_3_3.loc[i,'family'] == 'Ostreidae') or (inv_3_3.loc[i,'family'] == 'Spondylidae'):
        inv_3_3.loc[i,'ordr'] = 'Ostreioida'
        inv_3_3.loc[i,'class'] = 'Bivalvia'
        inv_3_3.loc[i,'phylum'] = 'Mollusca'

#Arcidae (Arcoida, Bivalvia, Mollusca) ITIS
for i, row in inv_3_3.iterrows():
    if (inv_3_3.loc[i,'family'] == 'Arcidae'):
        inv_3_3.loc[i,'ordr'] = 'Arcoida'
        inv_3_3.loc[i,'class'] = 'Bivalvia'
        inv_3_3.loc[i,'phylum'] = 'Mollusca'

#Mactridae (Veneroida, Bivalvia, Mollusca) ITIS
#Lucinidae (Veneroida, Bivalvia, Mollusca) ITIS
#Tellinidae (Veneroida, Bivalvia, Mollusca) ITIS
#Cardiidae (Veneroida, Bivalvia, Mollusca) ITIS
for i, row in inv_3_3.iterrows():
    if (inv_3_3.loc[i,'family'] == 'Mactridae') or (inv_3_3.loc[i,'family'] == 'Lucinidae') or (inv_3_3.loc[i,'family'] == 'Tellinidae') or (inv_3_3.loc[i,'family'] == 'Cardiidae'):
        inv_3_3.loc[i,'ordr'] = 'Veneroida'
        inv_3_3.loc[i,'class'] = 'Bivalvia'
        inv_3_3.loc[i,'phylum'] = 'Mollusca'

#Yoldiidae (Nuculoida, Bivalvia, Mollusca) ITIS
for i, row in inv_3_3.iterrows():
    if (inv_3_3.loc[i,'family'] == 'Yoldiidae'):
        inv_3_3.loc[i,'ordr'] = 'Nuculoida'
        inv_3_3.loc[i,'class'] = 'Bivalvia'
        inv_3_3.loc[i,'phylum'] = 'Mollusca'

#Trochidae (Archaeogastropoda, Gastropoda, Mollusca) ITIS
for i, row in inv_3_3.iterrows():
    if (inv_3_3.loc[i,'family'] == 'Trochidae'):
        inv_3_3.loc[i,'ordr'] = 'Archaeogastropoda'
        inv_3_3.loc[i,'class'] = 'Gastropoda'
        inv_3_3.loc[i,'phylum'] = 'Mollusca'

#Mytilidae (Mytiloida, Bivalvia, Mollusca) ITIS
for i, row in inv_3_3.iterrows():
    if (inv_3_3.loc[i,'family'] == 'Mytilidae'):
        inv_3_3.loc[i,'ordr'] = 'Mytiloida'
        inv_3_3.loc[i,'class'] = 'Bivalvia'
        inv_3_3.loc[i,'phylum'] = 'Mollusca'

In [11]:
##Genus (Family, Order, Class, Phylum)
inv_3_4 = inv_3_3.copy()

#Patinopecten (Pectinidae, Ostreoida, Bivalvia, Mollusca) ITIS
for i, row in inv_3_4.iterrows():
    if (inv_3_4.loc[i,'genus'] == 'Patinopecten'):
        inv_3_4.loc[i,'family'] = 'Pectinidae'
        inv_3_4.loc[i,'ordr'] = 'Ostreoida'
        inv_3_4.loc[i,'class'] = 'Bivalvia'
        inv_3_4.loc[i,'phylum'] = 'Mollusca'

#Mytilus (Mytilidae, Mytiloida, Bivalvia, Mollusca) ITIS
for i, row in inv_3_4.iterrows():
    if (inv_3_4.loc[i,'genus'] == 'Mytilus'):
        inv_3_4.loc[i,'family'] = 'Mytilidae'
        inv_3_4.loc[i,'ordr'] = 'Mytiloida'
        inv_3_4.loc[i,'class'] = 'Bivalvia'
        inv_3_4.loc[i,'phylum'] = 'Mollusca'

#Teredolites (ASK ASHLEY, WHAT DO WE DO ABOUT ICHNOFOSSILS???)

#Spisula (Mactridae, Veneroida, Bivalvia, Mollusca) ITIS
for i, row in inv_3_4.iterrows():
    if (inv_3_4.loc[i,'genus'] == 'Spisula'):
        inv_3_4.loc[i,'family'] = 'Mactridae'
        inv_3_4.loc[i,'ordr'] = 'Veneroida'
        inv_3_4.loc[i,'class'] = 'Bivalvia'
        inv_3_4.loc[i,'phylum'] = 'Mollusca'

#Diplodonta (Ungulinidae, Veneroida, Bivalvia, Mollusca) ITIS
for i, row in inv_3_4.iterrows():
    if (inv_3_4.loc[i,'genus'] == 'Diplodonta'):
        inv_3_4.loc[i,'family'] = 'Ungulinidae'
        inv_3_4.loc[i,'ordr'] = 'Veneroida'
        inv_3_4.loc[i,'class'] = 'Bivalvia'
        inv_3_4.loc[i,'phylum'] = 'Mollusca'

#Astrodapsis (Echinarachniidae, Clypeasteroida, Echinoidea, Echinodermata) WED
for i, row in inv_3_4.iterrows():
    if (inv_3_4.loc[i,'genus'] == 'Astrodapsis'):
        inv_3_4.loc[i,'family'] = 'Echinarachniidae'
        inv_3_4.loc[i,'ordr'] = 'Clypeasteroida'
        inv_3_4.loc[i,'class'] = 'Echinoidea'
        inv_3_4.loc[i,'phylum'] = 'Echinodermata'

In [12]:
inv_3_4

Unnamed: 0,spec_no,loc_ID_num,phylum,class,ordr,family,originalGenus,genus,subgenus,originalSubgenus,...,subspecies,modifiers,num_individuals,element,field_num,Associated With,ID by,ID date,Catalogued by,Notes
0,300100,V17009,Mollusca,Bivalvia,Ostreoida,Pectinidae,,Patinopecten,,,...,,,1.0,shell,JPW 12-03-2013/1,,"Robins, Cristina",10-16-2017,"Dangsangtong, Peter",
1,300101,V17018,Mollusca,Bivalvia,Ostreoida,Pectinidae,,Patinopecten,,,...,,,1.0,shell,JPW 04-10-2012/3,Associated with 300102,"Robins, Cristina",10-16-2017,"Dangsangtong, Peter",
2,300102,V17018,Mollusca,Bivalvia,Ostreoida,Pectinidae,,Patinopecten,,,...,,,1.0,shell,JPW 04-10-2012/3,Associated with 300101,"Robins, Cristina",10-16-2017,"Dangsangtong, Peter",
3,300103,V17011,Mollusca,Bivalvia,Ostreoida,Pectinidae,,Patinopecten,,,...,,,1.0,shell,JPW 03-21-2012/5,,"Robins, Cristina",10-16-2017,"Dangsangtong, Peter",
4,300104,V17011,Mollusca,Bivalvia,Ostreoida,Pectinidae,,Patinopecten,,,...,,,1.0,external mold,JPW 04-22-2014/3,,"Robins, Cristina",10-16-2017,"Dangsangtong, Peter",
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3130,303230,V17021,Mollusca,Bivalvia,,,,,,,...,,,,external mold,JPW 01-15-2015,assumed JPW/locality,,08-01-2019,"Mackenzie, Laura",
3131,303231,V17021,,,,,,Teredolites,,,...,,,,burrows,JPW 01-15-2015,assumed JPW/locality; part and counterpart,,08-01-2019,"Mackenzie, Laura",
3132,303232,V17021,Mollusca,Bivalvia,,,,,,,...,,,,shell fragment,JPW 01-15-2015,assumed JPW/locality,,08-01-2019,"Mackenzie, Laura",
3133,303233,V17021,Mollusca,Bivalvia,,,,,,,...,,,,shell fragment,JPW 01-15-2015,assumed JPW/locality,,08-01-2019,"Mackenzie, Laura",


That was painful, but useful in consolidating the data into a uniform structure, ready for analysis. The final task for my data cleaning involved making all the field numbers uniform, and converting the dates in them from strings to datetime objects, so that I can begin looking at things quantitatively.

First, I made a function called format_test that checks whether each fossil's field number starts with JPW or CDP and whether the date that follows can be converted into a datetime. This is shown in the cell below. If that condition failed, the data was either fixed for formatting or excluded.

I put inv_3_4 into an Excel spreadsheet so I could fix formatting errors and correct any anomalies, just do a general "cleaning" of the data.
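
For comparison, the same check and conversion can be sketched in a vectorised way. This is an alternative under stated assumptions, not the function used below: the field numbers in `field_nums` are made up in the notebook's format, and the regular expression insists on two-digit month and day, which is slightly stricter than `strptime`.

```python
import pandas as pd

# Illustrative field numbers; the last two are deliberately malformed.
field_nums = pd.Series(['JPW 12-03-2013/1', 'CDP 07-25-2019/24',
                        'JPW 13-45-2015', 'unknown'])

# Capture MM-DD-YYYY only when it follows a JPW or CDP prefix.
date_text = field_nums.str.extract(r'^(?:JPW|CDP) (\d{2}-\d{2}-\d{4})', expand=False)

# errors='coerce' turns unparseable or missing text into NaT instead of raising.
dates = pd.to_datetime(date_text, format='%m-%d-%Y', errors='coerce')
print(dates)
```

Rows that come back as `NaT` are the ones that would need manual fixing or exclusion.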

In [37]:
def format_test(df,index,column):
    #True if the field number starts with 'JPW' or 'CDP' and the text after the prefix parses as MM-DD-YYYY
    field = str(df.iloc[index, column]) #index and column are both positions
    if field[0:3] in ('JPW', 'CDP'):
        try:
            dt.strptime(field[4:14], '%m-%d-%Y')
            return True
        except ValueError:
            return False
        #print(str(i) + " " + str(extracted.iloc[i][1])[0:3] + "          .") This was just to test the function.
    else:
        return False

In [38]:
#one True/False per row: does the field number (column 16) pass format_test?
temp4 = [format_test(inv_3_4, i, 16) for i in range(len(inv_3_4))]

inv_3_4.insert(17, 'filtered', temp4)

In [41]:
#We can use the generated list below to find which dates can be corrected, and 
#which data needs to be removed from analysis due to incomplete records. 

#for i in range(len(inv_3_4)):
    #print(inv_3_4.iloc[i][0])
    #print(inv_3_4.iloc[i][16:18])
    #print()

In [42]:
#downloading the current rendition of the data to an excel file
#I will manually fix minor formatting issues and return the file as 'cleaned.xlsx'

with pd.ExcelWriter('calaveras_inverts_for_cleaning.xlsx') as writer:
    inv_3_4.to_excel(writer)

In [43]:
cleaned = pd.read_excel('calaveras_inverts_cleaned.xlsx')
cleaned = cleaned.drop(columns=['Unnamed: 0'])

#printing for testing
#for i in range(len(cleaned)):
    #print(cleaned.iloc[i][1])
    #print(cleaned.iloc[i][16:18])
    #print()

In [44]:
#re-run the format check on the manually cleaned file
temp5 = [format_test(cleaned, i, 16) for i in range(len(cleaned))]

cleaned.insert(18, 'filtered2', temp5)

In [56]:
#keep only the rows whose field number passed format_test, with their original index labels
#(DataFrame.append was removed in pandas 2.0, so a boolean mask is used instead)
inv_4 = cleaned[cleaned['filtered2'] == True].copy()

3069

In [76]:
dates = []

for index in range(len(inv_4)):
    dates.append(dt.strptime(str(inv_4.iloc[index][16])[4:14], '%m-%d-%Y'))

inv_4.insert(17, 'date', dates)

In [77]:
#we extract the relevant data

extracted = inv_4[['spec_no', 'field_num', 'date', 'phylum', 'class', 'ordr', 'family', 'genus']].copy()
extracted

Unnamed: 0,spec_no,field_num,date,phylum,class,ordr,family,genus
0,300100,JPW 12-03-2013/1,2013-12-03,Mollusca,Bivalvia,Ostreoida,Pectinidae,Patinopecten
1,300101,JPW 04-10-2012/3,2012-04-10,Mollusca,Bivalvia,Ostreoida,Pectinidae,Patinopecten
2,300102,JPW 04-10-2012/3,2012-04-10,Mollusca,Bivalvia,Ostreoida,Pectinidae,Patinopecten
3,300103,JPW 03-21-2012/5,2012-03-21,Mollusca,Bivalvia,Ostreoida,Pectinidae,Patinopecten
4,300104,JPW 04-22-2014/3,2014-04-22,Mollusca,Bivalvia,Ostreoida,Pectinidae,Patinopecten
...,...,...,...,...,...,...,...,...
3076,303230,JPW 01-15-2015,2015-01-15,Mollusca,Bivalvia,,,
3077,303231,JPW 01-15-2015,2015-01-15,,,,,Teredolites
3078,303232,JPW 01-15-2015,2015-01-15,Mollusca,Bivalvia,,,
3079,303233,JPW 01-15-2015,2015-01-15,Mollusca,Bivalvia,,,


In [80]:
extracted.loc[0,'date'].day_name()

'Tuesday'

In [81]:
extracted['date'].min()

#this is a problem

Timestamp('2003-12-02 00:00:00')

In [85]:
time_series = extracted.sort_values(by="date")
time_series

Unnamed: 0,spec_no,field_num,date,phylum,class,ordr,family,genus
942,301042,JPW 12-02-2003/3I,2003-12-02,Mollusca,Bivalvia,,,
957,301057,JPW 12-02-2003/3I,2003-12-02,Mollusca,Bivalvia,,,
969,301069,JPW 12-02-2003/3I,2003-12-02,Mollusca,Bivalvia,,,
970,301070,JPW 12-02-2003/3I,2003-12-02,Mollusca,Bivalvia,,,
951,301051,JPW 12-02-2003/3I,2003-12-02,Mollusca,Bivalvia,,,
...,...,...,...,...,...,...,...,...
3016,303125,CDP 07-25-2019/24,2019-07-25,Mollusca,Bivalvia,,,
3017,303126,CDP 07-25-2019/25,2019-07-25,Mollusca,Bivalvia,,,
3018,303127,CDP 07-25-2019/26,2019-07-25,Mollusca,Bivalvia,Arcoida,Arcidae,
3003,303112,CDP 07-25-2019/12,2019-07-25,Mollusca,Bivalvia,Ostreioida,Pectinidae,


In [98]:
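#print the specimen number and class of every record, in date order
for i in range(len(time_series)):
    print(time_series.iloc[i, 0])
    print(time_series.iloc[i, 4])
    print()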

301042
Bivalvia

301057
Bivalvia

301069
Bivalvia

301070
Bivalvia

301051
Bivalvia

301071
Bivalvia

301072
Bivalvia

301043
Bivalvia

301058
Bivalvia

301035
Bivalvia

301037
Bivalvia

301056
Bivalvia

301044
Bivalvia

301045
Bivalvia

301038
Bivalvia

301073
Bivalvia

301055
Bivalvia

301036
Bivalvia

301052
Bivalvia

301059
Bivalvia

301060
Bivalvia

301050
Bivalvia

301049
Bivalvia

301048
Bivalvia

301067
Bivalvia

301066
Bivalvia

301065
Bivalvia

301064
Bivalvia

301063
Bivalvia

301054
Bivalvia

301053
Bivalvia

301040
Bivalvia

301039
Bivalvia

301062
Bivalvia

301068
Bivalvia

301061
Bivalvia

301041
Bivalvia

301047
Bivalvia

301046
Bivalvia

300861
nan

300483
nan

301342
Bivalvia

300481
Bivalvia

300162
Bivalvia

300480
Bivalvia

300117
Bivalvia

300887
nan

300161
nan

300160
Bivalvia

300885
Bivalvia

300886
Gastropoda

300883
nan

300577
nan

300578
Gastropoda

300882
Bivalvia

301361
Bivalvia

300154
Bivalvia

300153
Bivalvia

300152
Bivalvia

300151
Bivalvia

300150

Cirripedia

303205
nan

300404
Bivalvia

300406
Bivalvia

300405
Bivalvia

300412
Bivalvia

300503
Bivalvia

300502
Bivalvia

300501
Bivalvia

301387
Bivalvia

301386
Bivalvia

300940
Bivalvia

300941
Bivalvia

300171
Bivalvia

300170
Bivalvia

300937
Bivalvia

300168
Bivalvia

300169
Bivalvia

300939
Bivalvia

300938
Bivalvia

300257
Bivalvia

300255
Bivalvia

300254
Bivalvia

300234
Bivalvia

300235
Bivalvia

300477
Bivalvia

300245
Bivalvia

300246
Bivalvia

300247
Bivalvia

300248
Bivalvia

300256
Bivalvia

301802
Bivalvia

301808
Bivalvia

301805
Bivalvia

301806
Bivalvia

301801
Bivalvia

301809
Bivalvia

300148
Bivalvia

300146
Bivalvia

300147
Bivalvia

301804
Bivalvia

301803
Bivalvia

301810
Bivalvia

301807
Bivalvia

300250
nan

300651
nan

300482
Cirripedia

300493
nan

300260
Bivalvia

300422
Bivalvia

300421
Cirripedia

300646
Gastropoda

301154
Bivalvia

300614
Bivalvia

300249
Bivalvia

300259
Bivalvia

301474
Bivalvia

301473
Bivalvia

300695
nan

300390
Bivalvia

3001

302404
Bivalvia

302403
Bivalvia

302402
Bivalvia

302401
Bivalvia

302400
Bivalvia

301899
Bivalvia

301898
Bivalvia

302570
Bivalvia

302399
Bivalvia

302398
Bivalvia

302397
Bivalvia

302396
Bivalvia

302395
Bivalvia

302394
Bivalvia

302393
Bivalvia

302392
Bivalvia

302390
Bivalvia

302569
Bivalvia

302566
Bivalvia

302567
Bivalvia

302467
Bivalvia

302468
Bivalvia

302469
Bivalvia

302470
Bivalvia

302471
Bivalvia

302472
Bivalvia

302473
Bivalvia

302474
Bivalvia

302475
Bivalvia

302476
Bivalvia

302477
Bivalvia

302478
Bivalvia

302479
Bivalvia

302480
Bivalvia

302466
Bivalvia

302481
Bivalvia

302483
Bivalvia

302484
Bivalvia

302485
Bivalvia

302486
Bivalvia

302487
Bivalvia

302488
Bivalvia

302489
Bivalvia

302490
Bivalvia

302491
Bivalvia

302492
Bivalvia

302493
Bivalvia

302494
Bivalvia

302495
Bivalvia

302496
Bivalvia

302482
Bivalvia

302465
Bivalvia

302464
Bivalvia

302463
Bivalvia

302432
Bivalvia

302433
Bivalvia

302434
Bivalvia

302435
Bivalvia

302436
Bivalvi

303219
Bivalvia

303216
Bivalvia

301275
Bivalvia

303202
nan

301247
Bivalvia

301248
Bivalvia

301249
Bivalvia

301250
Bivalvia

301251
Bivalvia

301252
Bivalvia

301253
Gastropoda

301254
Bivalvia

301255
Bivalvia

301270
Bivalvia

301269
Bivalvia

301268
Bivalvia

301267
Bivalvia

301256
Bivalvia

301257
Bivalvia

301258
Bivalvia

301259
Bivalvia

301260
Bivalvia

301261
Bivalvia

301262
Bivalvia

301266
Bivalvia

301246
Bivalvia

303203
Bivalvia

301245
Bivalvia

301265
Bivalvia

303201
Bivalvia

303200
Bivalvia

303199
Bivalvia

303198
Bivalvia

303197
Bivalvia

301278
Bivalvia

303196
Bivalvia

303195
Bivalvia

301277
Bivalvia

303194
Bivalvia

303092
Bivalvia

301276
Bivalvia

303091
Bivalvia

301274
Bivalvia

301242
Bivalvia

301243
Bivalvia

301273
Bivalvia

301244
Bivalvia

301272
Bivalvia

301263
Bivalvia

301264
Bivalvia

301271
Bivalvia

301236
Bivalvia

301799
Bivalvia

301234
Bivalvia

301769
Bivalvia

301770
Bivalvia

301771
Bivalvia

301772
Bivalvia

301773
Bivalvia



Bivalvia

300650
Bivalvia

300649
Gastropoda

300648
Bivalvia

301480
Gastropoda

301483
Gastropoda

301482
Bivalvia

301484
Bivalvia

301287
Gastropoda

301481
Cirripedia

300222
Bivalvia

301536
Bivalvia

301130
Bivalvia

301921
Gastropoda

303026
Gastropoda

303023
Bivalvia

303024
Bivalvia

303025
Bivalvia

303027
Bivalvia

303030
Gastropoda

303031
Bivalvia

301735
Bivalvia

301092
Gastropoda

301022
Bivalvia

301148
Bivalvia

301091
Gastropoda

301094
Gastropoda

301093
Bivalvia

301023
Bivalvia

301086
Gastropoda

301024
Bivalvia

301088
Bivalvia

301090
Bivalvia

301089
Gastropoda

301087
Bivalvia

301815
Bivalvia

301813
Bivalvia

301814
Bivalvia

301812
Bivalvia

301811
Bivalvia

300105
Bivalvia

300139
Bivalvia

301345
Bivalvia

301346
Bivalvia

300509
Bivalvia

300367
Bivalvia

300368
Bivalvia

300366
Bivalvia

300364
Bivalvia

300370
Bivalvia

300365
Bivalvia

300371
Bivalvia

300372
Bivalvia

300373
Bivalvia

300374
Bivalvia

300375
Bivalvia

300369
Bivalvia

300363
Bival

In [114]:
len(time_series['class'].unique())-1

8
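
Subtracting 1 only gives the right count when the column actually contains a missing value. A small sketch of the difference, on a made-up Series, using `nunique`, which skips NaN by default:

```python
import numpy as np
import pandas as pd

# Illustrative class column with one missing entry.
classes = pd.Series(['Bivalvia', np.nan, 'Gastropoda', 'Bivalvia'])

print(len(classes.unique()) - 1)  # correct only because NaN is present
print(classes.nunique())          # counts distinct non-null values directly
```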

In [126]:
#making a counter for the total number of classes

counter = []
temp6 = []

tax_ts = pd.DataFrame(columns=('phyla', 'classes', 'orders', 'families'))
tax_ts

Unnamed: 0,phyla,classes,orders,families


In [127]:
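#NOTE: DataFrame.append requires pandas < 2.0
its = pd.DataFrame(columns = time_series.columns) #interim time series

for i in range(len(time_series)):
    
    #grow the interim frame one record at a time (slow: the whole frame is copied on every pass)
    its = its.append(time_series.iloc[i])
    
    phyla = its['phylum'].unique()
    classes = its['class'].unique()
    orders = its['ordr'].unique()
    families = its['family'].unique()
    taxon = [phyla, classes, orders, families]
    
    #NOTE: append() turns this list of four arrays into four rows (one per rank), not one row of counts,
    #so tax_ts gains columns 0, 1, 2, ... while 'phyla', 'classes', 'orders', 'families' stay NaN
    tax_ts= tax_ts.append(taxon)
    print(tax_ts.iloc[i])
    print()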

0           Mollusca
classes          NaN
families         NaN
orders           NaN
phyla            NaN
Name: 0, dtype: object

0           Bivalvia
classes          NaN
families         NaN
orders           NaN
phyla            NaN
Name: 1, dtype: object

0           NaN
classes     NaN
families    NaN
orders      NaN
phyla       NaN
Name: 2, dtype: object

0           NaN
classes     NaN
families    NaN
orders      NaN
phyla       NaN
Name: 3, dtype: object

0           Mollusca
classes          NaN
families         NaN
orders           NaN
phyla            NaN
Name: 0, dtype: object

0           Bivalvia
classes          NaN
families         NaN
orders           NaN
phyla            NaN
Name: 1, dtype: object

0           NaN
classes     NaN
families    NaN
orders      NaN
phyla       NaN
Name: 2, dtype: object

0           NaN
classes     NaN
families    NaN
orders      NaN
phyla       NaN
Name: 3, dtype: object

0           Mollusca
classes          NaN
families         NaN
order

0           Mollusca
1                NaN
2                NaN
classes          NaN
families         NaN
orders           NaN
phyla            NaN
Name: 0, dtype: object

0           Bivalvia
1                NaN
2                NaN
classes          NaN
families         NaN
orders           NaN
phyla            NaN
Name: 1, dtype: object

0           NaN
1           NaN
2           NaN
classes     NaN
families    NaN
orders      NaN
phyla       NaN
Name: 2, dtype: object

0           NaN
1           NaN
2           NaN
classes     NaN
families    NaN
orders      NaN
phyla       NaN
Name: 3, dtype: object

0           Mollusca
1                NaN
2                NaN
classes          NaN
families         NaN
orders           NaN
phyla            NaN
Name: 0, dtype: object

0           Bivalvia
1                NaN
2                NaN
classes          NaN
families         NaN
orders           NaN
phyla            NaN
Name: 1, dtype: object

0           NaN
1           NaN
2           

KeyboardInterrupt: 

# 1) Collection Curve of Invertebrate Fossils

Let us relate the number of taxa collected to their date of discovery. This needs to be done in a couple of steps. We can use the date in the field number as the discovery date (x-axis), and the running count of distinct taxa from the spreadsheet (y-axis).
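
A minimal sketch of that idea at the class level, on made-up records (the dates and classes in `toy` are illustrative only). It marks the first appearance of each class and takes a running total, which avoids rebuilding the list of unique values for every row:

```python
import numpy as np
import pandas as pd

# Toy records; dates and classes are illustrative only.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2012-03-21', '2012-04-10', '2012-04-10',
                            '2013-12-03', '2014-04-22']),
    'class': ['Bivalvia', 'Bivalvia', 'Gastropoda', np.nan, 'Cirripedia'],
})

# Work in date order and ignore records with no class recorded.
ordered = toy.sort_values('date').dropna(subset=['class'])

# True the first time each class appears, False afterwards.
first_seen = ~ordered['class'].duplicated()

# Number of new classes per date, then the running total.
curve = first_seen.groupby(ordered['date']).sum().cumsum()
print(curve)
```

Calling `curve.plot()` would then draw the curve; swapping `'class'` for another rank gives the same curve at that level.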

In [None]:
#we want to find out how the number of taxa changes over time, using the field numbers as a metric for time. 
#let us start at the class level

# Collection Curve of Vertebrate Fossils - Shark Teeth