# Seasonality

By **Franklin Oliveira**

-----

This notebook contains Python code to generate seasonality visualizations for the `repteis` database. It includes some basic data treatment and the adjustments that proved necessary to prepare the data for the charts. Source file: <font color='blue'>'Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls'</font>.

In [1]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# packages for quick visualization
import seaborn as sns
import matplotlib.pyplot as plt

# package for the main visualizations
import altair as alt

# enabling the notebook renderer
# alt.renderers.enable('notebook')
alt.renderers.enable('default')


# disabling the row limit
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Importing data...

In [2]:
excel = pd.ExcelFile('Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls')
sheet_name = excel.sheet_names

print('The excel file contains the following sheets:', sheet_name)
print('\nDatabase is in sheet:', sheet_name[0])

The excel file contains the following sheets: ['Repteis-2020-02-11-csv']

Database is in sheet: Repteis-2020-02-11-csv


In [3]:
db = excel.parse(sheet_name[0])

# p.s.: I'm parsing a pre-treated file provided by Asla
#db = pd.read_csv('db.csv', sep=',', encoding='utf-8-sig', low_memory=False)
print(f'The database has {db.shape[0]} rows and {db.shape[1]} columns.')

The database has 23119 rows and 114 columns.


In [4]:
# copying database to another variable to make a few changes keeping the original intact
repteis = db.copy()

<br>

<font color='red' size='5'>**p.s.:** We were told by the Curator that some names in this database may need to remain confidential. So, I'm going to skip this step for now, until we know how to treat this data appropriately. </font>

### Name columns

`DeterminatorFirstName1` and `DeterminatorLastName1`

#### treating determiner's name 

In this step, we're going to combine the determinator's first and last names.

`DeterminatorFirstName1` + `DeterminatorLastName1`

Simply concatenating those two columns is not satisfactory because they contain messy values (for example, inconsistent capitalization and composite names). So, let's begin by treating them...

In [5]:
# repteis['DeterminatorFirstName1'].value_counts()
# repteis['DeterminatorLastName1'].value_counts()

In [6]:
def treat_names(name, pos='first'):
    '''
    Treat names keeping NaN as such.
    
    Arguments: 
        - name: name to be treated. 
        - pos (str): name position. One of ['first', 'last']
    '''
    if type(name) == str and pos == 'first':     # first name
        
        if len(name.split(' ')) > 1:             # treats composite names (+ 1 name)
            return str(name).strip().split(' ')[0].capitalize()
        else:
            return str(name).strip().capitalize()
    
    elif type(name) == str and pos == 'last':    # last name
        
        if len(name.split(' ')) > 1:             # treats composite last name (+ 1 surname)
            return str(name).strip().split(' ')[-1].capitalize()
        else:
            return str(name).strip().capitalize()   
    else:
        return name

Applying the function to the two name columns

In [7]:
#repteis['DeterminatorFirstName1'] = repteis['DeterminatorFirstName1'].apply(treat_names)
# repteis['DeterminatorFirstName1'].value_counts()

In [8]:
#repteis['DeterminatorLastName1'] = repteis['DeterminatorLastName1'].apply(lambda x:
#                                                                treat_names(x, pos='last'))
# repteis['DeterminatorLastName1'].value_counts()

In [9]:
# creating column with First and Last name for identification
#repteis['Determiner First_and_Last Name'] = repteis['DeterminatorFirstName1'] + ' ' + repteis['DeterminatorLastName1']

<br>

## Adjusting columns names

### removing '\n'

In [10]:
repteis.columns = [str(col).replace(r'\n','') for col in repteis.columns]
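
One detail worth checking here: `r'\n'` is a raw string, so it matches a literal backslash followed by `n`, not an actual line break. The sketch below uses made-up column names (they are assumptions, not taken from the database) to show how the two patterns differ and how both could be removed if needed.

```python
# Hypothetical column names: the first contains a real line break,
# the second contains a literal backslash followed by the letter n.
columns = ["Type\nStatus 1", "Min\\nAltitude"]

# Raw-string pattern: removes only the literal backslash + n
raw_pattern = [col.replace(r'\n', '') for col in columns]

# Plain-string pattern: removes only real line breaks
newline_pattern = [col.replace('\n', '') for col in columns]

# Removing both kinds covers either case
both = [col.replace('\n', '').replace(r'\n', '') for col in columns]

print(raw_pattern)
print(newline_pattern)
print(both)
```

Which pattern is appropriate depends on how the column names actually look in the Excel file, so it may be worth printing `repr(col)` for a few columns first.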

#### a closer look at some columns...

In [11]:
# for col in repteis.columns:
#     print('-', col)

In [12]:
repteis['Especie_atual'].value_counts()

torquatus       1403
ocellifera       779
jararaca         635
hispidus         607
ameiva           558
                ... 
boulengeri         1
arenarius          1
baturitensis       1
percarinatum       1
squamosus          1
Name: Especie_atual, Length: 803, dtype: int64

<font size='5'>**Column equivalence:** </font>

**Different name:** <br>
- Species1: Especie_ent or Especie_atual
- Species Author1: ?
- Type Status1: Type Status 1
- Qualifier1: Qualificador_ent or Qualificador_atual
- Determiner First Name1: DeterminatorFirstName1
- Determiner Middle1: DeterminatorMiddleInitial1
- Determiner Last Name1: DeterminatorLastName1
- Determined Date1: DataDaDeterminacao

In [13]:
repteis[['Especie_atual', 'Type Status 1', 'Qualificador_atual', 'DeterminatorFirstName1',
       'DeterminatorMiddleInitial1', 'DeterminatorLastName1', 'DataDaDeterminacao']]

Unnamed: 0,Especie_atual,Type Status 1,Qualificador_atual,DeterminatorFirstName1,DeterminatorMiddleInitial1,DeterminatorLastName1,DataDaDeterminacao
0,geoffroanus,,-,L.,M.F.,Cunha,00/11/2018
1,suspectum,,,,,,
2,alba,,,,,,
3,corais,,,,,,
4,bottae,,,,,,
...,...,...,...,...,...,...,...
23114,ibiboboca,,,P.,,Passos,00/10/2016
23115,ibiboboca,,,P.,,Passos,00/10/2016
23116,quadricarinatus,,,P.,,Passos,00/10/2016
23117,bicarinatus,,,P.,,Pinna,00/10/2016


<br>

## preparing data for charts...

### Column: `Type Status 1`

Contains information on the species type.

In [14]:
repteis['Type Status 1'].value_counts().head()  # is this column empty!?

Series([], Name: Type Status 1, dtype: int64)

#### Let's begin by cleaning this data and normalizing the letter case.

In [15]:
#repteis['Type Status 1'] = repteis['Type Status 1'].str.strip().str.lower().str.capitalize()

In [16]:
repteis['Type Status 1'].value_counts().head()

Series([], Name: Type Status 1, dtype: int64)

### preparing taxonomy columns

`Kingdom` - `Phylum` - `Class` - `Ordem` - `Familia` - `Genero_ent`- `Genero_atual` - `Especie_ent` - `Especie_atual`

**Missing columns:**
- `Subphylum1`
- `Subclass1`
- `Infraclass1`
- `Superorder1`
- `Suborder1` 
- `Infraorder1` 
- `Superfamily1`
- `Subfamily1` 
- `Tribe1`

In [17]:
taxon_columns = ['Kingdom', 'Phylum', 'Class', 'Ordem', 'Familia', 'Genero_ent',
                 'Genero_atual', 'Especie_ent', 'Especie_atual', 'Subespecie_ent',
                 'Subespecie_atual']  # selecting taxonomy columns

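# defining function
def treat_str(x):
    # strip first, so leading whitespace cannot block the capitalization;
    # capitalize() already lowercases the remaining characters
    return str(x).strip().capitalize()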

# applying treatment
for col in taxon_columns:
    print(f'Adjusting column {col}')
    repteis[col] = repteis[col].apply(treat_str)

Adjusting column Kingdom
Adjusting column Phylum
Adjusting column Class
Adjusting column Ordem
Adjusting column Familia
Adjusting column Genero_ent
Adjusting column Genero_atual
Adjusting column Especie_ent
Adjusting column Especie_atual
Adjusting column Subespecie_ent
Adjusting column Subespecie_atual
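
Because `treat_str` wraps every value in `str()`, missing values come out as the string `'Nan'` instead of staying missing. If keeping them as real missing values is preferable, a variant along these lines might work. This is only a sketch: the name `treat_str_keep_missing` is hypothetical and not part of the notebook.

```python
import math

def treat_str_keep_missing(x):
    """Normalize text like treat_str, but leave None/NaN untouched."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return x
    return str(x).strip().capitalize()

examples = ['  SQUAMATA ', 'testudines', float('nan')]
treated = [treat_str_keep_missing(v) for v in examples]
print(treated)
```

With this variant, later steps can rely on `isna()` instead of comparing against the text `'Nan'`.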


### combining `Genero` and `Especie` (together they fully identify each animal's species)

In [18]:
repteis['genero_e_especie_ent'] = repteis['Genero_ent'] + ' ' + repteis['Especie_ent']
repteis['genero_e_especie_atual'] = repteis['Genero_atual'] + ' ' + repteis['Especie_atual']

repteis['genero_e_especie_ent'] = repteis['genero_e_especie_ent'].str.lower().str.capitalize()
repteis['genero_e_especie_atual'] = repteis['genero_e_especie_atual'].str.lower().str.capitalize()
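
When genus or species is missing and has already been turned into the text `'Nan'`, the concatenation above yields values such as `'Nan nan'`. As a rough illustration, assuming the columns still hold real missing values (the two-row frame below is made up for the example), pandas propagates the missing value through the concatenation:

```python
import numpy as np
import pandas as pd

# Toy data: the second row has no genus and no species
df = pd.DataFrame({
    'Genero_ent': ['Heloderma', np.nan],
    'Especie_ent': ['Suspectum', np.nan],
})

# String concatenation with a missing operand stays missing,
# and .str.capitalize() leaves missing values untouched
df['genero_e_especie_ent'] = (df['Genero_ent'] + ' ' + df['Especie_ent']).str.capitalize()

print(df)
```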

<br>

### Collecting date (year) and slicing main DB to a smaller dataset 

Columns: `DataDaDeterminacao` - `Class` - `Kingdom` and more...

In [19]:
# slicing main database (repteis)
Table = repteis[['DataDeEntrada','DataDaDeterminacao','DataColetaInicial','Class','Kingdom', 
                    'Genero_ent', 'Genero_atual', 'Especie_ent', 'Especie_atual', 'Type Status 1',
                    'DeterminatorFirstName1', 'DeterminatorLastName1', 'genero_e_especie_ent',
                    'genero_e_especie_atual','MinAltitude',
                    'Ordem', 'Familia', 'Phylum', 'Qualificador_atual', 'NotasTaxonomicas']].copy()

# OBS: Determined Date1 has many missing values... CHECK THAT
d = []
counter=0
for row in Table['DataColetaInicial']:
    if not str(row).find('/')==-1:
        dates_values = str(row).split("/")
        year = int(dates_values[-1])
        month = int(dates_values[1])
#        if (month>1) and (month<12):
            #store the year and month in a datetime datatype for later sorting
#            dateRecord = datetime.datetime(year,month,1) 
    else:
        year = Table.loc[counter, 'DataColetaInicial']
        month = Table.loc[counter, 'DataColetaInicial']
    
    # one more conditional to treat empty years ' '
    if year == ' ':
        year = np.nan
        month = np.nan
        
    d.append({'ano_coleta':year,
              'mes_coleta':month,
              'class':Table.loc[counter,'Class'],
              'kingdom':Table.loc[counter,'Kingdom'], 'genero_ent':Table.loc[counter,'Genero_ent'],
              'genero_atual':Table.loc[counter,'Genero_atual'],
              'especie_ent':Table.loc[counter,'Especie_ent'],
              'especie_atual':Table.loc[counter,'Especie_atual'],
              'genero_e_especie_ent': Table.loc[counter,'genero_e_especie_ent'],
              'genero_e_especie_atual': Table.loc[counter,'genero_e_especie_atual'],
              'type_status':Table.loc[counter,'Type Status 1'], 
              'determinator_first_name':Table.loc[counter,'DeterminatorFirstName1'],
              'determinator_last_name':Table.loc[counter,'DeterminatorLastName1'],
              'altitude':Table.loc[counter,'MinAltitude'],
              'ordem':Table.loc[counter,'Ordem'],
              'familia':Table.loc[counter,'Familia'],
              'phylum': Table.loc[counter,'Phylum'],
              'tipo': Table.loc[counter,'NotasTaxonomicas'],
              'qualificador_atual': Table.loc[counter,'Qualificador_atual']
              })
    counter = counter+1

    
NewTable = pd.DataFrame(d)


### collecting determined year (p.s.: being careful to keep NaNs as they show up)
NewTable['ano_determinacao'] = np.nan

counter=0
for row in Table['DataDaDeterminacao']:
    # reset on every row so a year parsed from a previous row never carries over
    year = np.nan
    if '/' in str(row):  # dates are strings such as '00/11/2018'
        dates_values = str(row).split("/")
        year = int(dates_values[-1])
        month = int(dates_values[1])
#        if (month>1) and (month<12):
            #store the year and month in a datetime datatype for later sorting
#            dateRecord = datetime.datetime(year,month,1)    
    
    NewTable.loc[counter, 'ano_determinacao'] = year
    counter = counter+1

    
### collecting start year (p.s.: being careful to keep NaNs as they show up)  
NewTable['ano_entrada'] = np.nan
counter=0
for row in Table['DataDeEntrada']:
    # reset on every row so a year parsed from a previous row never carries over
    year = np.nan
    if '/' in str(row):
        dates_values = str(row).split("/")
        year = int(dates_values[-1])
        month = int(dates_values[1])
#        if (month>1) and (month<12):
            #store the year and month in a datetime datatype for later sorting
#            dateRecord = datetime.datetime(year,month,1)    

    NewTable.loc[counter, 'ano_entrada'] = year
    counter = counter+1

# NewTable['determined_year'] = pd.Series(year, index=NewTable.index)
NewTable.head(2)

Unnamed: 0,ano_coleta,mes_coleta,class,kingdom,genero_ent,genero_atual,especie_ent,especie_atual,genero_e_especie_ent,genero_e_especie_atual,...,determinator_first_name,determinator_last_name,altitude,ordem,familia,phylum,tipo,qualificador_atual,ano_determinacao,ano_entrada
0,,,Reptilia,Animalia,Nan,Phrynops,Nan,Geoffroanus,Nan nan,Phrynops geoffroanus,...,L.,Cunha,,Testudines,Chelidae,Chordata,,-,2018.0,
1,,,Reptilia,Animalia,Heloderma,Heloderma,Suspectum,Suspectum,Heloderma suspectum,Heloderma suspectum,...,,,,Squamata,Helodermatidae,Chordata,,,,


In [20]:
# checks if NaNs are in the same position 
result = (NewTable['ano_entrada'].isna() == repteis['DataDeEntrada'].isna()).sum() == NewTable.shape[0]

if result:
    print('ano_entrada info is valid.')
else:
    print("There's something wrong with NewTable. Check how you're collecting Start Year info.")

ano_entrada info is valid.


In [21]:
# year in which the holotipo was "firstly" found 
#NewTable['holotipo_year'] = NewTable['species_author'].str.extract('(\d+)')

<br>

### creating `years` columns in repteis

In [22]:
def catch_year(row):
    # dates are stored as 'dd/mm/YYYY'; anything without '/' is treated as missing
    if '/' in str(row):
        dates_values = str(row).split("/")
        return int(dates_values[-1])
    else:
        return np.nan

In [23]:
repteis['ano_determinacao'] = repteis['DataDaDeterminacao'].apply(catch_year)
repteis['ano_coleta'] = repteis['DataColetaInicial'].apply(catch_year)
repteis['ano_entrada'] = repteis['DataDeEntrada'].apply(catch_year)
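
As a possible alternative to row-by-row parsing, the date parts could be extracted in one vectorized step. The sketch below assumes the `dd/mm/YYYY` layout with `00` standing for an unknown day or month; the sample values and the column names `dia`, `mes`, and `ano` are illustrative choices, not names used in the notebook.

```python
import pandas as pd

# Sample values in the same layout as the date columns, plus blank and missing entries
dates = pd.Series(['00/11/2018', '00/00/1986', '15/03/1999', ' ', None])

# One capture group per date part; rows that do not match become NaN
parts = dates.str.extract(r'^(\d{2})/(\d{2})/(\d{4})$')
parts.columns = ['dia', 'mes', 'ano']
parts = parts.apply(pd.to_numeric)

# Treat the 00 placeholders as missing instead of as day 0 or month 0
parts = parts.where(parts != 0)

print(parts)
```

Keeping the month alongside the year in the same step avoids parsing each date column twice.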

<br>

<font size=5>**Color palette per Order**</font>

Below is the image used as inspiration (https://color.adobe.com/create/image)

<img src="./paleta_cores.jpeg" width='500px'>

Colors: 

- dark_green: #284021
- light_green: #88BF11
- yellow: #D9CB0B
- orange: #D99311
- dark_orange: #BF4417
- light_brown: #BF8D7A

In [24]:
# Note: Caudata is an error in the database and should be removed.
cores_ordem = {
    'Squamata': '#BF4417',
    'Testudines': '#D9CB0B', 
    'Crocodylia': '#284021'
}

In [25]:
ordens = list(cores_ordem.keys())
cores = list(cores_ordem.values())

## Graphs

### Total counts over time

In [29]:
# rows whose collection month was recorded as 00
idx = NewTable[NewTable['mes_coleta'] == 0].index

In [31]:
# format: dd/mm/YYYY
repteis.loc[idx, 'DataColetaInicial'].value_counts()

00/00/1986    109
00/00/2009    101
00/00/2010     91
00/00/1876     74
00/00/1996     60
             ... 
00/00/1977      1
00/00/2013      1
00/00/1913      1
00/00/1928      1
00/00/1954      1
Name: DataColetaInicial, Length: 75, dtype: int64

In [104]:
counts = NewTable.dropna(subset=['ano_coleta', 'mes_coleta'], how='all')

In [105]:
counts = counts.groupby(['ano_coleta', 'mes_coleta']).count()['class'].reset_index().rename(
                                                                            columns={'class':'counts'})

In [106]:
counts['ano_coleta'] = counts['ano_coleta'].astype(int)
counts['mes_coleta'] = counts['mes_coleta'].astype(int)

In [107]:
counts.head()

Unnamed: 0,ano_coleta,mes_coleta,counts
0,0,5,1
1,0,6,1
2,0,7,1
3,0,11,2
4,1824,0,1


In [108]:
counts['mes_coleta'].value_counts()

0     75
11    72
2     71
12    70
1     70
10    69
7     68
4     67
3     67
9     66
8     66
5     66
6     61
Name: mes_coleta, dtype: int64
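
The outputs above show that unknown months and years are stored as 0, so they would show up as a "month 0" row and a "year 0" column in the heatmap. If those placeholders should be left out of a seasonality chart, one option is to filter them first. The frame below holds made-up values in the same shape as `counts`; `counts_demo` and `seasonal` are hypothetical names.

```python
import pandas as pd

# Made-up values with the same columns as `counts`
counts_demo = pd.DataFrame({
    'ano_coleta': [0, 0, 1824, 1986, 1986],
    'mes_coleta': [5, 11, 0, 0, 7],
    'counts':     [1, 2, 1, 109, 3],
})

known_year = counts_demo['ano_coleta'] > 0
known_month = counts_demo['mes_coleta'].between(1, 12)

# Keep only records where both the year and the month are known
seasonal = counts_demo[known_year & known_month]

print(seasonal)
```

Whether to drop these rows or show them as a separate "unknown" category is a design choice; dropping them keeps the month axis limited to 1 through 12.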

In [126]:
total = alt.Chart(counts, title='Total of collected animals per month/year', width=1200,
         height=200).mark_rect().encode(
    y = alt.Y('mes_coleta', type='ordinal', title='Collected Month',
              sort= alt.EncodingSortField('mes_coleta', order='descending')),
    x = alt.X('ano_coleta', type='ordinal', title='Collected Year'),
    color= alt.Color('counts', title='Counts'),
    tooltip = alt.Tooltip(['counts', 'ano_coleta', 'mes_coleta'])
)

# total.save('./graphs/seasonality/season-total.html')

total

<br>

### counts over time (per order)

In [155]:
counts = NewTable.dropna(subset=['ano_coleta', 'mes_coleta'], how='all')

counts = counts.groupby(['ano_coleta', 'mes_coleta', 'ordem']).count()['class'].reset_index().rename(
                                                                            columns={'class':'counts'})

counts['ano_coleta'] = counts['ano_coleta'].astype(int)
counts['mes_coleta'] = counts['mes_coleta'].astype(int)

In [156]:
counts['ordem'].value_counts()

Squamata      877
Testudines    161
Crocodylia     40
Nan             4
#n/d            2
Squamta         1
Caudata         1
Name: ordem, dtype: int64

<font color='red' size='5'>Adjustments to the ordem column </font>

In [157]:
# fixes the misspellings 'Squama' and 'Squamta'
def corrige_squamata(string):
    if str(string).lower() == 'squama' or str(string).lower() == 'squamta':
        return 'Squamata'
    else:
        return str(string)
    
# converts '#n/d' and 'Nan' to a real NaN
def corrige_nd(string):
    if str(string) == "#n/d":
        return np.nan
    elif str(string).lower() == 'nan':
        return np.nan
    else:
        return string

In [158]:
counts['ordem'] = counts['ordem'].apply(corrige_squamata)
counts['ordem'] = counts['ordem'].apply(corrige_nd)
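
The two corrections could also be expressed as a single lookup table, which may be easier to extend if more misspellings turn up. This is a sketch under that assumption; `ORDEM_FIXES` and `normalize_ordem` are hypothetical names, and the sample values are taken from the `value_counts()` output for `ordem`.

```python
import math

# Lowercase spelling -> corrected value; None means "treat as missing"
ORDEM_FIXES = {
    'squama': 'Squamata',
    'squamta': 'Squamata',
    '#n/d': None,
    'nan': None,
}

def normalize_ordem(value):
    """Return the corrected order name, or NaN for placeholder values."""
    key = str(value).strip().lower()
    if key in ORDEM_FIXES:
        fixed = ORDEM_FIXES[key]
        return math.nan if fixed is None else fixed
    return value

cleaned = [normalize_ordem(v) for v in ['Squamata', 'Squamta', '#n/d', 'Nan', 'Testudines']]
print(cleaned)
```

A function like this can be passed to `apply` in the same way as the two functions defined in the notebook.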

In [159]:
anos = counts['ano_coleta'].unique()

In [176]:
temp = alt.Chart(counts[(~counts['ordem'].isna()) & (counts['ordem'] != 'Caudata')], title='Total of collected animals per month/year',
             width=1200, height=200).mark_rect().encode(
        y = alt.Y('mes_coleta', type='ordinal', title='Collected Month',
                  sort= alt.EncodingSortField('mes_coleta', order='descending')),
        x = alt.X('ano_coleta', type='ordinal', title='Collected Year',
                 scale= alt.Scale(domain=anos)),
        color= alt.Color('counts', title='Counts'),
        tooltip = alt.Tooltip(['counts', 'ano_coleta', 'mes_coleta'])
)

temp.facet(row='ordem').resolve_scale(x='independent').resolve_legend(color='independent').save('./graphs/seasonality/order/season-facetado.html')

temp.facet(row='ordem').resolve_scale(x='independent').resolve_legend(color='independent')

In [181]:
# range for the color scale
min_ct = counts['counts'].min()
max_ct = counts['counts'].max()

In [183]:
# building one independent chart per order
for ordem in ordens:
    temp = alt.Chart(counts[counts['ordem'] == ordem], title=f'Total of collected {ordem} per month/year',
             width=1200, height=200).mark_rect().encode(
        y = alt.Y('mes_coleta', type='ordinal', title='Collected Month',
                  sort= alt.EncodingSortField('mes_coleta', order='descending')),
        x = alt.X('ano_coleta', type='ordinal', title='Collected Year',
                 scale= alt.Scale(domain=anos)),
        color= alt.Color('counts', title='Counts', scale=alt.Scale(domain=[int(min_ct), int(max_ct)])),
        tooltip = alt.Tooltip(['counts', 'ano_coleta', 'mes_coleta'])
    )
    
    temp.save(f'./graphs/seasonality/order/season-{ordem}.html')

<br>

### counts over time (per family)

In [195]:
counts = NewTable.dropna(subset=['ano_coleta', 'mes_coleta'], how='all')

counts = counts.groupby(['ano_coleta', 'mes_coleta', 'familia']).count()['class'].reset_index().rename(
                                                                            columns={'class':'counts'})

counts['ano_coleta'] = counts['ano_coleta'].astype(int)
counts['mes_coleta'] = counts['mes_coleta'].astype(int)

In [197]:
top_familias = list(counts['familia'].value_counts().index[:15])

# range for the color scale
min_ct = counts['counts'].min()
max_ct = counts['counts'].max()

In [186]:
for familia in top_familias:

    temp = alt.Chart(counts[counts['familia'] == familia], 
                     title=f'Total of collected {familia} per month/year').mark_rect().encode(
        y = alt.Y('mes_coleta', type='ordinal', title='Collected Month',
                  sort= alt.EncodingSortField('mes_coleta', order='descending')),
        x = alt.X('ano_coleta', type='ordinal', title='Collected Year',
                 scale= alt.Scale(domain=anos)),
        color= alt.Color('counts', title='Counts', scale=alt.Scale(domain=[int(min_ct), int(max_ct)])),
        tooltip = alt.Tooltip(['counts', 'ano_coleta', 'mes_coleta'])
    )
    
    temp.save(f'./graphs/seasonality/family/season-{familia}.html')

# .facet(row='familia').resolve_scale(x='independent')

<br>

**Faceted version**

Note: this version is not working properly.

In [212]:
# temp = alt.Chart(counts[counts['familia'].isin(top_familias)], 
#                  title='Total of collected animals per month/year').mark_rect().encode(
#     y = alt.Y('mes_coleta', type='ordinal', title='Collected Month',
#               sort= alt.EncodingSortField('mes_coleta', order='descending')),
#     x = alt.X('ano_coleta', type='ordinal', title='Collected Year',
#              scale= alt.Scale(domain=anos)),
#     color= alt.Color('counts', title='Counts', scale=alt.Scale(domain=list(range(min_ct, max_ct)))),
#     tooltip = alt.Tooltip(['counts', 'ano_coleta', 'mes_coleta'])
# )

# temp.facet(row='familia').resolve_legend('independent').resolve_scale(x='independent')

-----

**That's it!**