# Time-Spatial Exploration

By **Franklin Oliveira**

-----
This notebook contains all the code needed to build charts from the `repteis` database, with a focus on exploration over time and space. It includes basic data treatment and the code for each chart.

Database: <font color='blue'>'Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls'</font>.
    

In [1]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Waffle Charts
# from pywaffle import Waffle 
# docs: https://pywaffle.readthedocs.io/en/latest/examples/block_shape_distance_location_and_direction.html

# visualization
import altair as alt

# enabling notebook renderer
# alt.renderers.enable('notebook')
alt.renderers.enable('default')

# disabling rows limit
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Importing data...

In [2]:
excel = pd.ExcelFile('Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls')
sheet_name = excel.sheet_names

print('The excel file contains the following sheets:', sheet_name)
print('\nDatabase is in sheet:', sheet_name[0])

The excel file contains the following sheets: ['Repteis-2020-02-11-csv']

Database is in sheet: Repteis-2020-02-11-csv


In [3]:
# `sep` and `encoding` are CSV-reader options and are not needed to parse an Excel sheet
db = excel.parse(sheet_name[0])

#db = pd.read_excel('db.csv', sep=',', encoding='utf-8-sig', low_memory=False)

print(f'The database has {db.shape[0]} rows and {db.shape[1]} columns.')

The database has 23119 rows and 114 columns.


In [4]:
# copying database to another variable to make a few changes keeping the original intact
repteis = db.copy()

<br>

<font color='red' size='5'>**P.S.:** The Curator told us that some names in this database may need to remain confidential. So, I'm going to skip this step for now, until we know how to treat this data appropriately.</font>

### Name columns

**Determiner:** `DeterminatorFirstName1` and `DeterminatorLastName1`

**Collector:** `CollectorFirstName1` and `CollectorLastName1`

#### Treating the determiner's name

In this step, we combine the determiner's first and last names:

`DeterminatorFirstName1` + `DeterminatorLastName1`

Simply concatenating those two columns is not enough, because the values are inconsistent (the function below handles extra whitespace, mixed capitalization and composite names). So, let's begin by cleaning them...
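As a minimal sketch of why plain concatenation falls short, the example below uses made-up sample names (hypothetical; the real values are not shown here) and compares raw concatenation with a cleaned version built from pandas string methods. The `first` and `last` column names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, not taken from the database.
names = pd.DataFrame({
    'first': ['  maria JOSE', 'PAULO', np.nan],
    'last': ['de souza  ', 'passos', 'Silva'],
})

# Plain concatenation keeps stray spaces and inconsistent casing.
raw = names['first'] + ' ' + names['last']

# Cleanup: strip, keep the first (or last) token, then capitalize.
first = names['first'].str.strip().str.split(' ').str[0].str.capitalize()
last = names['last'].str.strip().str.split(' ').str[-1].str.capitalize()

# A missing first or last name makes the combined name missing as well.
clean = first + ' ' + last
print(clean.tolist())
```

This vectorized form is only one option; a row-wise function applied with `.apply` gives the same result on these samples.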

In [5]:
def treat_names(name, pos='first'):
    '''
    Treat names keeping NaN as such.
    
    Arguments: 
        - name: name to be treated. 
        - pos (str): name position. One of ['first', 'last']
    '''
    if type(name) == str and pos == 'first':     # first name
        
        if len(name.split(' ')) > 1:             # checks if it's a composite name (+ 1 name)
            return str(name).strip().split(' ')[0].capitalize()         # keeps only 1st name
        else:
            return str(name).strip().capitalize()
    
    elif type(name) == str and pos == 'last':    # last name
        
        if len(name.split(' ')) > 1:             # treats composite name (+ 1 surname)
            return str(name).strip().split(' ')[-1].capitalize()
        else:
            return str(name).strip().capitalize()   
    else:
        return name

Applying the function to the first- and last-name columns:

In [6]:
repteis['DeterminatorFirstName1'] = repteis['DeterminatorFirstName1'].apply(treat_names)
repteis['CollectorFirstName1'] = repteis['CollectorFirstName1'].apply(treat_names)
# repteis['DeterminatorFirstName1'].value_counts()

In [7]:
repteis['DeterminatorLastName1'] = repteis['DeterminatorLastName1'].apply(lambda x:
                                                                treat_names(x, pos='last'))

repteis['CollectorLastName1'] = repteis['CollectorLastName1'].apply(lambda x:
                                                                treat_names(x, pos='last'))

# repteis['DeterminatorLastName1'].value_counts()

In [8]:
# creating column with First and Last name for identification
repteis['DeterminatorFirst_and_LastName'] = repteis['DeterminatorFirstName1'] + ' ' + repteis['DeterminatorLastName1']

repteis['CollectorFirst_and_LastName'] = repteis['CollectorFirstName1'] + ' ' + repteis['CollectorLastName1']

<br>

## Adjusting columns names

### removing '\n'

In [9]:
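# r'\n' only matches a literal backslash followed by "n"; '\n' also removes real line breaks
repteis.columns = [str(col).replace('\n', '').replace(r'\n', '') for col in repteis.columns]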
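The difference between `r'\n'` (a backslash followed by the letter n) and `'\n'` (a real line break) is easy to miss. The sketch below uses made-up column names to show what each replacement removes; which of the two forms actually occurs in the spreadsheet headers is not known from this notebook.

```python
# Made-up column names: one holds a real line break, one a literal backslash + "n".
cols = ['Type Status\n1', r'Lat\nLong', 'Pais']

# Replacing the raw string only removes the literal two-character sequence.
only_literal = [c.replace(r'\n', '') for c in cols]

# Replacing both forms also removes real line breaks.
both = [c.replace('\n', '').replace(r'\n', '') for c in cols]

print(only_literal)
print(both)
```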

<br>

## preparing data for charts...

### Column: `Type Status 1`

In [10]:
repteis['Type Status 1'].value_counts().head()

Series([], Name: Type Status 1, dtype: int64)

#### Cleaning this column and normalizing case (commented out below because the column is empty)

In [11]:
# repteis['Type Status 1'] = repteis['Type Status 1'].str.strip().str.lower().str.capitalize()

In [12]:
# repteis['Type Status 1'].value_counts().head()

### preparing taxonomy columns

`Kingdom` - `Phylum` - `Class` - `Ordem` - `Familia` - `Genero_ent`- `Genero_atual` - `Especie_ent` - `Especie_atual`

**Columns "missing" (compared with the crustaceas database):**
- `Subphylum1`
- `Subclass1`
- `Infraclass1`
- `Superorder1`
- `Suborder1` 
- `Infraorder1` 
- `Superfamily1`
- `Subfamily1` 
- `Tribe1`

In [13]:
taxon_columns = ['Kingdom', 'Phylum', 'Class', 'Ordem', 'Familia', 'Genero_ent',
                 'Genero_atual', 'Especie_ent', 'Especie_atual', 'Subespecie_ent',
                 'Subespecie_atual']  # selecting taxonomy columns

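# defining function
# note: str(x) turns missing values (NaN) into the string 'Nan'
def treat_str(x):
    # strip first, so that capitalize() acts on the first letter rather than on leading whitespace
    return str(x).strip().lower().capitalize()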

# applying treatment
for col in taxon_columns:
    print(f'Adjusting column {col}')
    repteis[col] = repteis[col].apply(treat_str)

Adjusting column Kingdom
Adjusting column Phylum
Adjusting column Class
Adjusting column Ordem
Adjusting column Familia
Adjusting column Genero_ent
Adjusting column Genero_atual
Adjusting column Especie_ent
Adjusting column Especie_atual
Adjusting column Subespecie_ent
Adjusting column Subespecie_atual
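Two details of this kind of string treatment are worth illustrating: the order of `strip` and `capitalize` matters, and `str()` turns a missing value into the text `'Nan'`. The sketch below uses two hypothetical helper names (`capitalize_then_strip` and `strip_then_capitalize`) and made-up inputs.

```python
import numpy as np
import pandas as pd

def capitalize_then_strip(x):
    # Order: lower -> capitalize -> strip.
    return str(x).lower().capitalize().strip()

def strip_then_capitalize(x):
    # Hypothetical variant: strip first and leave missing values untouched.
    if pd.isna(x):
        return x
    return str(x).strip().lower().capitalize()

# With leading whitespace, capitalize() acts on the space, so the word stays lowercase.
print(repr(capitalize_then_strip(' SQUAMATA')))
print(repr(strip_then_capitalize(' SQUAMATA')))

# A missing value becomes the string 'Nan' in the first version only.
print(repr(capitalize_then_strip(np.nan)))
print(repr(strip_then_capitalize(np.nan)))
```

Keeping real NaN values has a cost: later steps that sort or compare these columns as strings would need to handle missing values explicitly.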


### Combining `Genero` and `Especie` (together they identify each specimen's species)

In [14]:
repteis['genero_e_especie_ent'] = repteis['Genero_ent'] + ' ' + repteis['Especie_ent']
repteis['genero_e_especie_atual'] = repteis['Genero_atual'] + ' ' + repteis['Especie_atual']

# treatment (stripping whitespace and standardizing case)
repteis['genero_e_especie_ent'] = repteis['genero_e_especie_ent'].str.strip().str.lower().str.capitalize()
repteis['genero_e_especie_atual'] = repteis['genero_e_especie_atual'].str.strip().str.lower().str.capitalize()
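The preview of `NewTable` further down shows values such as `Nan nan`, which come from combining two missing parts. One possible way to keep such rows missing is sketched below with made-up data, assuming missing values were already converted to the string `'Nan'`.

```python
import pandas as pd

# Made-up rows; 'Nan' stands for a missing value that was turned into a string.
df = pd.DataFrame({
    'Genero_atual': ['Phrynops', 'Nan'],
    'Especie_atual': ['Geoffroanus', 'Nan'],
})

combined = (df['Genero_atual'] + ' ' + df['Especie_atual']).str.strip().str.lower().str.capitalize()

# Blank out the combination whenever either part is the 'Nan' placeholder.
has_missing_part = df[['Genero_atual', 'Especie_atual']].eq('Nan').any(axis=1)
combined = combined.mask(has_missing_part)

print(combined.tolist())
```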

<br>

### Collecting date (year) and slicing the main DB into a smaller dataset 

Columns: `DataDaDeterminacao` - `Class` - `Kingdom` and more...

Columns added in this notebook: 

Lat - Long - Municipio - EstadoOuProvincia - Pais - Continente

In [16]:
# slicing main database (repteis)
Table = repteis[['DataDeEntrada','DataDaDeterminacao','DataColetaInicial','Class','Kingdom', 
                    'Genero_ent', 'Genero_atual', 'Especie_ent', 'Especie_atual', 'Type Status 1',
                    'DeterminatorFirstName1', 'DeterminatorLastName1',
                    'DeterminatorFirst_and_LastName', 'CollectorFirstName1', 
                    'CollectorLastName1', 'CollectorFirst_and_LastName', 'genero_e_especie_ent',
                    'genero_e_especie_atual','MinAltitude',
                    'Ordem', 'Familia', 'Phylum', 'Lat', 'Long', 'Municipio', 
                    'EstadoOuProvincia', 'Pais', 'Continente']].copy()

# OBS: Determined Date1 has many missing values... CHECK THAT
d = []
counter=0
for row in Table['DataColetaInicial']:
    if not str(row).find('/')==-1:
        dates_values = str(row).split("/")
        year = int(dates_values[-1])
        month = int(dates_values[1])
#        if (month>1) and (month<12):
            #store the year and month in a datetime datatype for later sorting
#            dateRecord = datetime.datetime(year,month,1) 
    else:
        year = Table.loc[counter, 'DataColetaInicial']
    
    # one more conditional to handle empty years ' '
    if year == ' ':
        year = np.nan   # np.nan is the canonical name; the np.NAN alias was removed in NumPy 2.0
        
    d.append({'ano_coleta':year,
              'class':Table.loc[counter,'Class'],
              'kingdom':Table.loc[counter,'Kingdom'], 'genero_ent':Table.loc[counter,'Genero_ent'],
              'genero_atual':Table.loc[counter,'Genero_atual'],
              'especie_ent':Table.loc[counter,'Especie_ent'],
              'especie_atual':Table.loc[counter,'Especie_atual'],
              'genero_e_especie_ent': Table.loc[counter,'genero_e_especie_ent'],
              'genero_e_especie_atual': Table.loc[counter,'genero_e_especie_atual'],
              'type_status':Table.loc[counter,'Type Status 1'], 
              'determinator_first_name':Table.loc[counter,'DeterminatorFirstName1'],
              'determinator_last_name':Table.loc[counter,'DeterminatorLastName1'],
              'determinator_first_and_last_name': Table.loc[counter,'DeterminatorFirst_and_LastName'],
              'collector_first_name':Table.loc[counter,'CollectorFirstName1'],
              'collector_last_name':Table.loc[counter,'CollectorLastName1'],
              'collector_first_and_last_name': Table.loc[counter,'CollectorFirst_and_LastName'],
              'altitude':Table.loc[counter,'MinAltitude'],
              'ordem':Table.loc[counter,'Ordem'],
              'familia':Table.loc[counter,'Familia'],
              'phylum': Table.loc[counter,'Phylum'],
              'lat': Table.loc[counter,'Lat'],
              'long': Table.loc[counter,'Long'], 
              'municipio': Table.loc[counter,'Municipio'], 
              'estado_ou_provincia': Table.loc[counter,'EstadoOuProvincia'],
              'pais': Table.loc[counter,'Pais'], 
              'continente': Table.loc[counter,'Continente']
              })
    counter = counter+1

    
NewTable = pd.DataFrame(d)


### collecting determined year (p.s.: being careful to keep NaNs as they show up)
NewTable['ano_determinacao'] = np.nan

d1 = []
counter=0
for row in Table['DataDaDeterminacao']:
    # reset on every row; otherwise a value without '/' would silently reuse the previous row's year
    year = np.nan
    if '/' in str(row):        # NaN and other non-date values stay NaN
        dates_values = str(row).split("/")
        year = int(dates_values[-1])
        month = int(dates_values[1])
#        if (month>1) and (month<12):
            #store the year and month in a datetime datatype for later sorting
#            dateRecord = datetime.datetime(year,month,1)    
    
    NewTable.loc[counter, 'ano_determinacao'] = year
    counter = counter+1

    
### collecting start year (p.s.: being careful to keep NaNs as they show up)  
NewTable['ano_entrada'] = np.nan
d1 = []
counter=0
for row in Table['DataDeEntrada']:
    # reset on every row; otherwise a value without '/' would silently reuse the previous row's year
    year = np.nan
    if '/' in str(row):        # NaN and other non-date values stay NaN
        dates_values = str(row).split("/")
        year = int(dates_values[-1])
        month = int(dates_values[1])
#        if (month>1) and (month<12):
            #store the year and month in a datetime datatype for later sorting
#            dateRecord = datetime.datetime(year,month,1)    

    NewTable.loc[counter, 'ano_entrada'] = year
    counter = counter+1

# NewTable['determined_year'] = pd.Series(year, index=NewTable.index)
NewTable.head(2)

Unnamed: 0,ano_coleta,class,kingdom,genero_ent,genero_atual,especie_ent,especie_atual,genero_e_especie_ent,genero_e_especie_atual,type_status,...,familia,phylum,lat,long,municipio,estado_ou_provincia,pais,continente,ano_determinacao,ano_entrada
0,,Reptilia,Animalia,Nan,Phrynops,Nan,Geoffroanus,Nan nan,Phrynops geoffroanus,,...,Chelidae,Chordata,,,,,,,2018.0,
1,,Reptilia,Animalia,Heloderma,Heloderma,Suspectum,Suspectum,Heloderma suspectum,Heloderma suspectum,,...,Helodermatidae,Chordata,36.0544,-112.14,Arizona,Condado de Coconino,Estados Unidos,América do Norte,,
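The three loops above share the same idea: take the last `/`-separated field as the year and keep everything else as NaN. A compact sketch of that idea follows. It uses a hypothetical helper, `extract_year`, and made-up date strings, and assumes the year is always the last field.

```python
import numpy as np
import pandas as pd

def extract_year(value):
    """Return the last '/'-separated field as an int, or NaN when there is no '/'."""
    text = str(value)
    if '/' not in text:
        return np.nan
    return int(text.split('/')[-1])

# Made-up values covering a full date, a missing value, a blank and a month/year pair.
dates = pd.Series(['05/03/2018', np.nan, ' ', '11/1999'])
years = dates.apply(extract_year)

print(years.tolist())
```

Because the helper starts from a fresh value on every call, a row without a date can never inherit the year of the previous row.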


In [17]:
# checks if NaNs are in the same position 
result = (NewTable['ano_entrada'].isna() == repteis['DataDeEntrada'].isna()).sum() == NewTable.shape[0]

if result:
    print('ano_entrada info is valid.')
else:
    print("There's something wrong with NewTable. Check how you're collecting Start Year info.")

ano_entrada info is valid.


In [18]:
# year in which the holotype was first found
# NewTable['holotipo_year'] = NewTable['species_author'].str.extract('(\d+)')

<br>

### creating `years` columns in repteis

In [19]:
def catch_year(row):
    if '/' in str(row):
        dates_values = str(row).split("/")
        # the year is the last field, consistent with the loops that build NewTable
        year = int(dates_values[-1])
        return year
    else:
        return np.nan

In [20]:
repteis['ano_determinacao'] = repteis['DataDaDeterminacao'].apply(catch_year)
repteis['ano_coleta'] = repteis['DataColetaInicial'].apply(catch_year)
repteis['ano_entrada'] = repteis['DataDeEntrada'].apply(catch_year)

#### Adjusting columns `ano_determinacao`, `ano_coleta` and `ano_entrada` to `int` format

In [21]:
def str_with_nan2int(string):
    # pd.isna also accepts strings, where np.isnan would raise a TypeError
    if not pd.isna(string):
        return int(string)
    else:
        return np.nan

In [22]:
NewTable['ano_determinacao'] = NewTable['ano_determinacao'].apply(str_with_nan2int) #has NaN
NewTable['ano_coleta'] = NewTable['ano_coleta'].apply(str_with_nan2int) #has NaN
NewTable['ano_entrada'] = NewTable['ano_entrada'].apply(str_with_nan2int) #has NaN
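A column that mixes integers and NaN is stored by pandas as `float64`, so the years still display as `2018.0` after this step. If true integers are wanted, pandas' nullable `Int64` dtype is one option. The sketch below uses made-up years and is not part of the original pipeline.

```python
import numpy as np
import pandas as pd

years = pd.Series([2018.0, np.nan, 1999.0])

# Element-wise int() conversion: the NaN forces the column back to float64.
as_float = years.apply(lambda y: int(y) if not pd.isna(y) else np.nan)

# Nullable integer dtype: whole numbers are kept, missing values become <NA>.
as_nullable_int = years.astype('Int64')

print(as_float.dtype, as_nullable_int.dtype)
```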

#### determinator_full_name = determinator_first_name + determinator_last_name

In [23]:
NewTable['determinator_full_name'] = NewTable['determinator_first_name'] + ' ' + NewTable['determinator_last_name']

NewTable['collector_full_name'] = NewTable['collector_first_name'] + ' ' + NewTable['collector_last_name']

<br>

## Graph: Exploring the time-spatial evolution of the database

In [30]:
for col in repteis.columns:
    print('-', col)

- NumeroDeCatalogo
- NumeroDeCampo
- DataDeEntrada
- DataDaDeterminacao
- Kingdom
- Phylum
- Class
- Ordem
- Familia
- Genero_ent
- Qualificador_ent
- Especie_ent
- Subespecie_ent
- Type Status 1
- Current 1
- Determined Date 2
- Class2
- Order2
- Family2
- Genero_atual
- Qualificador_atual
- Especie_atual
- Subespecie_atual
- NotasTaxonomicas
- Current 2
- DeterminatorLastName1
- DeterminatorFirstName1
- DeterminatorMiddleInitial1
- DeterminatorLastName2
- DeterminatorFirstName2
- DeterminatorMiddleInitial2
- DeterminationRemark
- AssociatedTaxa
- TypeOf
- ColecaoEspecial
- DataColetaInicial
- DataColetaFinal
- Complemento
- NomeDaLocalidade
- Municipio
- EstadoOuProvincia
- Pais
- Continente
- LocalityRemark
- UtmDatum
- UtmZone
- UtmEasting
- UtmNorthing
- VerbatimLatitude
- VerbatimLongitude
- Lat
- Long
- PrecisaoDaCoordenada
- GrupoDeColeta
- MinAltitude
- MaxAltitude
- CollectingInformationRemark
- CollectorLastName1
- CollectorFirstName1
- CollectorMiddleName1
- CollectorLastNa

In [32]:
repteis['Continente'].value_counts()

América do Sul      21924
América Central       135
América do Norte      103
Ásia                   54
#N/D                   25
África                 13
Europa                  9
Améica do Sul           1
Name: Continente, dtype: int64
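The counts above include a misspelled label (`Améica do Sul`) and a `#N/D` marker. One way to normalize them before charting is sketched below on a made-up sample. Treating `#N/D` as a missing value is an assumption, and `fixes` is a hypothetical mapping.

```python
import numpy as np
import pandas as pd

# Made-up sample reproducing the kinds of labels seen in the counts.
continente = pd.Series(['América do Sul', 'Améica do Sul', '#N/D', 'Ásia'])

# Hypothetical corrections: fix the typo and treat '#N/D' as missing (assumption).
fixes = {'Améica do Sul': 'América do Sul', '#N/D': np.nan}
clean = continente.replace(fixes)

print(clean.value_counts())
```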

### Chart 1: Evolution per time and continent

In [54]:
# grouping by year, continent and order
teste = NewTable.groupby(['ano_coleta','continente', 'ordem']).count()['class'].reset_index().rename(columns={
    'class':'counts'
})
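Note that `.count()['class']` counts the non-null `class` values in each group, while `.size()` counts rows. In both cases `groupby` drops rows whose key is NaN by default. A small sketch with made-up data shows the difference.

```python
import numpy as np
import pandas as pd

# Made-up records: one row has no `class`, another has no collection year.
sample = pd.DataFrame({
    'ano_coleta': [1990, 1990, 1990, np.nan],
    'continente': ['Ásia', 'Ásia', 'Europa', 'Ásia'],
    'ordem': ['Squamata', 'Squamata', 'Squamata', 'Squamata'],
    'class': ['Reptilia', None, 'Reptilia', 'Reptilia'],
})
keys = ['ano_coleta', 'continente', 'ordem']

# Counts non-null `class` values per group.
by_count = sample.groupby(keys).count()['class'].reset_index().rename(columns={'class': 'counts'})

# Counts rows per group, regardless of missing values in other columns.
by_size = sample.groupby(keys).size().reset_index(name='counts')

print(by_count)
print(by_size)
```

Which of the two is appropriate depends on whether records without a `class` value should count as specimens.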

In [55]:
g1 = alt.Chart(teste, title='Evolução temporal da coleção por continente').mark_circle().encode(
    x= alt.X('ano_coleta', type="ordinal", title='Ano de Coleta'),
    y= alt.Y('continente', title='Continente', 
             sort=alt.EncodingSortField('counts', op="count", order='descending')), 
    size=alt.Size('counts', scale= alt.Scale(range=[30,1200])), 
    color=alt.Color('ordem'),
    tooltip= alt.Tooltip(['continente','ano_coleta','ordem','counts'])
)


g1.save('./graphs/time-spacial/evolucao_por_continente.html')

g1

<br>

### Chart 2: Evolution per time and country

In [57]:
# grouping by year, country and order
teste1 = NewTable.groupby(['ano_coleta','pais', 'ordem']).count()['class'].reset_index().rename(columns={
    'class':'counts'
})

In [59]:
g1 = alt.Chart(teste1, title='Evolução temporal da coleção por país').mark_circle().encode(
    x= alt.X('ano_coleta', type="ordinal", title='Ano de Coleta'),
    y= alt.Y('pais', title='País', 
             sort=alt.EncodingSortField('counts', op="count", order='descending')), 
    size=alt.Size('counts', scale= alt.Scale(range=[30,1200])), 
    color=alt.Color('ordem'),
    tooltip= alt.Tooltip(['pais','ano_coleta','ordem','counts'])
)


g1.save('./graphs/time-spacial/evolucao_por_pais.html')

g1

<br>

### Chart 3: Evolution per time and Brazilian states

In [66]:
# grouping by year, state and order (Brazil only)

teste2 = NewTable[NewTable['pais'] == 'Brasil']
teste2 = teste2.groupby(['ano_coleta','estado_ou_provincia', 'ordem']).count()['class'].reset_index().rename(columns={
    'class':'counts'
})

In [70]:
teste2['estado_ou_provincia'].unique()

array(['Rio de Janeiro', 'São Paulo', 'Pernambuco', 'Santa Catarina',
       'Amazonas', 'Goiás', 'Roraima', 'Pará', 'Espírito Santo',
       'Mato Grosso', 'Acre', 'Bahia', 'Minas Gerais',
       'Mato Grosso do Sul', 'Paraná', 'Rondônia', 'Ceará', 'Maranhão',
       'Rio Grande do Sul', 'Paraíba', 'Distrito Federal', 'Alagoas',
       'Amapá', 'Piauí', 'Brasília', 'Tocantins', 'Rio Grande do Norte',
       'Sergipe', 'Minas Gerais/Goiás/Distrito Federal',
       'Santa Catarina-Rio Grande do Sul'], dtype=object)

In [99]:
regioes = {
    'Rio de Janeiro':'SE',
    'São Paulo':'SE',
    'Espírito Santo': 'SE',
    'Pernambuco':'NE',
    'Santa Catarina':'S',
    'Amazonas':'N',
    'Goiás':'CO',
    'Roraima':'N',
    'Pará':'N',
    'Mato Grosso':'CO',
    'Acre': 'N',
    'Bahia': 'NE',
    'Minas Gerais': 'SE',
    'Mato Grosso do Sul': 'CO',
    'Paraná': 'S',
    'Rondônia': 'N',
    'Ceará': 'NE',
    'Maranhão': 'NE',
    'Rio Grande do Sul': 'S',
    'Paraíba': 'NE',
    'Distrito Federal': 'CO',
    'Alagoas': 'NE',
    'Amapá':'N',
    'Piauí': 'NE',
    'Brasília': 'CO',
    'Tocantins': 'N',
    'Rio Grande do Norte': 'NE',
    'Sergipe': 'NE',
    'Minas Gerais/Goiás/Distrito Federal': 'CO',
    'Santa Catarina-Rio Grande do Sul': 'S'
}

# creating a column with the regions
teste2['regiao'] = teste2['estado_ou_provincia'].apply(lambda x: regioes[str(x)])

# sorting
teste2.sort_values('counts', inplace=True, ascending=False)
sorting = teste2.sort_values('regiao', ascending=False)['estado_ou_provincia'].unique()
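The dictionary lookup raises a `KeyError` for any state that is not a key of `regioes`. As a hedged alternative, `Series.map` with a dictionary returns NaN for unknown keys instead. The sketch below uses a two-entry excerpt of the mapping and a made-up state name.

```python
import pandas as pd

# Excerpt of the mapping, for illustration only.
regioes_excerpt = {'Rio de Janeiro': 'SE', 'Bahia': 'NE'}

# 'Guanabara' is a made-up example of a value that is not in the mapping.
estados = pd.Series(['Rio de Janeiro', 'Bahia', 'Guanabara'])

# map() looks each value up in the dict; values without a key become NaN.
regiao_col = estados.map(regioes_excerpt)

print(regiao_col.tolist())
```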

In [102]:
g1 = alt.Chart(teste2, title='Evolução temporal da coleção por Estados Brasileiros').mark_circle().encode(
    x= alt.X('ano_coleta', type="ordinal", title='Ano de Coleta'),
    y= alt.Y('estado_ou_provincia', type="nominal",title='Estado ou Província',
            sort= sorting), 
    size=alt.Size('counts', scale= alt.Scale(range=[30,1200])), 
    color=alt.Color('ordem'),
    tooltip= alt.Tooltip(['estado_ou_provincia','ano_coleta','ordem','counts'])
)


# g1.save('./graphs/time-spacial/evolucao_por_estados_brasileiros.html')

g1

<br>

## Choropleth map: Brazilian regions

In [137]:
def rem_nd(s):
    # values that cannot be parsed as numbers (e.g. '#N/D') become NaN
    try:
        return float(s)
    except (TypeError, ValueError):
        return np.nan

In [140]:
NewTable['lat'] = NewTable['lat'].apply(rem_nd)
NewTable['long'] = NewTable['long'].apply(rem_nd)
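For reference, pandas offers a built-in with similar behavior: `pd.to_numeric` with `errors='coerce'` converts unparseable entries to NaN. A minimal sketch on made-up coordinate values:

```python
import pandas as pd

# Made-up values: numeric strings, a '#N/D' marker, a float and a missing entry.
raw_lat = pd.Series(['36.0544', '#N/D', -112.14, None])

# errors='coerce' turns anything that cannot be parsed into NaN.
lat = pd.to_numeric(raw_lat, errors='coerce')

print(lat.tolist())
```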

In [185]:
def regiao(est):
    # states missing from `regioes` (including NaN) get no region
    if est not in regioes.keys():
        return np.nan
    else:
        return regioes[est]

In [186]:
NewTable['regiao'] = NewTable['estado_ou_provincia'].apply(regiao)

In [187]:
# color order
order_sort = sorted(NewTable['ordem'].unique(),reverse=True)

# colors
colors = ['#4c78a8', '#f58518', '#e45756', '#72b7b2', '#54a24b', '#eeca3b', '#b279a2', '#ff9da6']

In [192]:
# dynamic version
slider = alt.binding_range(min=1900, max=2016, step=1)
select_year = alt.selection_single(name="ano_coleta", fields=['ano_coleta'],
                                   bind=slider, init={'ano_coleta': 1935})

data = NewTable[(NewTable['long'] < 10) & (NewTable['long'] >-150)].copy()

back = alt.Chart(data, width=600, height=400).mark_circle(opacity=0.5, color='lightgray').encode(
    x = alt.X('long', type='quantitative', title='Longitude'),
    y = alt.Y('lat', type='quantitative', title='Latitude'),
)

front = alt.Chart(data, width=600, height=400).mark_circle().encode(
    x = alt.X('long', type='quantitative', title='Longitude'),
    y = alt.Y('lat', type='quantitative', title='Latitude'),
    color = alt.Color('ordem', title='Ordem', scale=
                     alt.Scale(range=colors, domain=order_sort)),
    tooltip = alt.Tooltip(['long','lat','ordem','regiao','estado_ou_provincia','ano_coleta'])
).add_selection(
    select_year
).transform_filter(
    select_year
)

g = alt.layer(back, front)

g.save('./graphs/time-spacial/lat_long.html')

g