# Type charts

By **Franklin Oliveira**

-----
This notebook contains all the code needed to build the "type" charts from the `repteis` database. It covers basic data treatment and the chart code.

Database: <font color='blue'>'Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls'</font>.

In [1]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Waffle Charts
# from pywaffle import Waffle 
# docs: https://pywaffle.readthedocs.io/en/latest/examples/block_shape_distance_location_and_direction.html

# visualization
import altair as alt

# enabling notebook renderer
# alt.renderers.enable('notebook')
alt.renderers.enable('default')

# disabling rows limit
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Importing data...

In [2]:
excel = pd.ExcelFile('Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls')
sheet_name = excel.sheet_names

print('The excel file contains the following sheets:', sheet_name)
print('\nDatabase is in sheet:', sheet_name[0])

The excel file contains the following sheets: ['Repteis-2020-02-11-csv']

Database is in sheet: Repteis-2020-02-11-csv


In [3]:
# `sep` and `encoding` are CSV options and do not apply when parsing an Excel sheet
db = excel.parse(sheet_name[0])

#db = pd.read_excel('db.csv', sep=',', encoding='utf-8-sig', low_memory=False)

print(f'The database has {db.shape[0]} rows and {db.shape[1]} columns.')

The database has 23119 rows and 114 columns.


In [4]:
# copying database to another variable to make a few changes keeping the original intact
repteis = db.copy()

<br>

<font color='red' size='5'>**p.s.:** The Curator told us that some names in this database may need to remain confidential. So I'm skipping this step for now, until we know how to treat this data appropriately.</font>

### Name columns

`DeterminatorFirstName1` and `DeterminatorLastName1`

#### treating determinator's name

In this step, we put together the determinator's first and last name.

`DeterminatorFirstName1` + `DeterminatorLastName1`

Just concatenating those two columns is not satisfactory because they contain dirty data (stray whitespace, composite names, inconsistent capitalization). So, let's begin by treating them...

In [5]:
def treat_names(name, pos='first'):
    '''
    Treat names keeping NaN as such.
    
    Arguments: 
        - name: name to be treated. 
        - pos (str): name position. One of ['first', 'last']
    '''
    # non-string values (e.g. NaN) and unknown positions are returned untouched
    if not isinstance(name, str) or pos not in ('first', 'last'):
        return name

    parts = name.strip().split(' ')
    if pos == 'first':
        return parts[0].capitalize()     # keeps only the 1st name of a composite name
    return parts[-1].capitalize()        # keeps only the last surname

Applying the function to the two name columns:

In [6]:
# crustaceas['Determiner First Name1'] = crustaceas['Determiner First Name1'].apply(treat_names)
# crustaceas['Determiner First Name1'].value_counts()

In [8]:
# crustaceas['Determiner Last Name1'] = crustaceas['Determiner Last Name1'].apply(lambda x:
#                                                                 treat_names(x, pos='last'))
# crustaceas['Determiner Last Name1'].value_counts()

In [9]:
# creating column with First and Last name for identification
# crustaceas['Determiner First_and_Last Name'] = crustaceas['Determiner First Name1'] + ' ' + crustaceas['Determiner Last Name1']
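Since the step above is skipped, here is a minimal sketch of the same treatment done with vectorized pandas string methods, on made-up names rather than the confidential data. The sample values and the `determinator_full_name` column are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real notebook would use the repteis columns instead
names = pd.DataFrame({
    'DeterminatorFirstName1': [' ana maria', 'JOAO', np.nan],
    'DeterminatorLastName1': ['de souza ', 'lima', 'Costa'],
})

# keep only the first given name and the last surname, capitalized
first = names['DeterminatorFirstName1'].str.strip().str.split(' ').str[0].str.capitalize()
last = names['DeterminatorLastName1'].str.strip().str.split(' ').str[-1].str.capitalize()

# string concatenation propagates NaN, so incomplete names stay missing
names['determinator_full_name'] = first + ' ' + last
print(names['determinator_full_name'])
```

Because NaN propagates through the concatenation, a row with only one of the two names ends up missing instead of producing a half-filled label.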

<br>

## Adjusting columns names

### removing '\n'

In [10]:
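# r'\n' only matches a literal backslash followed by "n"; also strip real newline characters
repteis.columns = [str(col).replace('\\n', '').replace('\n', '') for col in repteis.columns]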
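A small sketch of the difference between a real newline character and the two-character sequence backslash + "n" in column names. The column names below are hypothetical; they only illustrate the two cases.

```python
import pandas as pd

# Hypothetical headers: one with a real newline, one with a literal backslash + n
df = pd.DataFrame(columns=['Type Status 1\n', 'Min\\nAltitude', 'Ordem'])

# remove both variants, then trim any leftover whitespace
df.columns = [str(c).replace('\\n', '').replace('\n', '').strip() for c in df.columns]
print(list(df.columns))
```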

<br>

## preparing data for charts...

### Column: `Type Status 1`

In [11]:
repteis['Type Status 1'].value_counts().head()  # no type filled in

Series([], Name: Type Status 1, dtype: int64)

#### Let's begin by cleaning this data and standardizing letter case.

In [12]:
# repteis['Type Status 1'] = repteis['Type Status 1'].str.strip().str.lower().str.capitalize()

In [13]:
# repteis['Type Status 1'].value_counts().head()

### preparing taxonomy columns

`Kingdom` - `Phylum` - `Class` - `Ordem` - `Familia` - `Genero_ent`- `Genero_atual` - `Especie_ent` - `Especie_atual` - `Subespecie_ent` - `Subespecie_atual`

**"Missing" columns (compared to the crustaceas database):**
- `Subphylum1`
- `Subclass1`
- `Infraclass1`
- `Superorder1`
- `Suborder1` 
- `Infraorder1` 
- `Superfamily1`
- `Subfamily1` 
- `Tribe1`

In [14]:
taxon_columns = ['Kingdom', 'Phylum', 'Class', 'Ordem', 'Familia', 'Genero_ent',
                 'Genero_atual', 'Especie_ent', 'Especie_atual', 'Subespecie_ent',
                 'Subespecie_atual']  # selecting taxonomy columns

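# defining function
# strip first, so leading whitespace doesn't prevent capitalization
# (note: str() turns missing values into the string 'Nan')
def treat_str(x):
    return str(x).strip().capitalize()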

# applying treatment
for col in taxon_columns:
    print(f'Adjusting column {col}')
    repteis[col] = repteis[col].apply(treat_str)

Adjusting column Kingdom
Adjusting column Phylum
Adjusting column Class
Adjusting column Ordem
Adjusting column Familia
Adjusting column Genero_ent
Adjusting column Genero_atual
Adjusting column Especie_ent
Adjusting column Especie_atual
Adjusting column Subespecie_ent
Adjusting column Subespecie_atual
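Note that `str(x)` turns missing values into the string `'Nan'`, which is why `Nan` shows up as a category later on. A minimal sketch of a variant that keeps missing values as NaN, assuming that behavior is wanted; `treat_str_keep_nan` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def treat_str_keep_nan(x):
    # leave missing values untouched instead of turning them into 'Nan'
    if pd.isna(x):
        return np.nan
    return str(x).strip().capitalize()

# Hypothetical sample values
s = pd.Series([' squamata', 'TESTUDINES ', np.nan])
cleaned = s.apply(treat_str_keep_nan)
print(cleaned)
```

With this variant, `isna()` counts on the taxonomy columns would reflect the real number of blanks.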


In [15]:
# check: Qualificador_ent, Type Status 1, Current 1
# for col in repteis.columns:
#     print(col)

repteis['Type Status 1'].value_counts()

Series([], Name: Type Status 1, dtype: int64)

In [16]:
repteis['Ordem'].value_counts()

Squamata      21992
Testudines      844
Crocodylia      257
Nan              21
#n/d              2
Caudata           1
Squama            1
Squamta           1
Name: Ordem, dtype: int64
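The counts above show a few likely misspellings (`Squama`, `Squamta`) and placeholder values (`Nan`, `#n/d`). A minimal sketch of how they could be normalized, assuming both misspellings stand for `Squamata` and that the placeholders should become missing values; the `fixes` mapping is an assumption and should be confirmed with the Curator.

```python
import numpy as np
import pandas as pd

# Hypothetical sample reproducing the categories seen in the counts
ordem = pd.Series(['Squamata', 'Squamta', 'Squama', 'Nan', '#n/d', 'Testudines'])

# assumed corrections; not validated against the collection
fixes = {'Squama': 'Squamata', 'Squamta': 'Squamata', 'Nan': np.nan, '#n/d': np.nan}
ordem_clean = ordem.replace(fixes)

print(ordem_clean.value_counts())
```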

### combining `Genero` and `Especie` (together they fully identify each animal's species)

In [17]:
repteis['genero_e_especie_ent'] = repteis['Genero_ent'] + ' ' + repteis['Especie_ent']
repteis['genero_e_especie_atual'] = repteis['Genero_atual'] + ' ' + repteis['Especie_atual']

# treatment (stripping whitespace and standardizing letter case)
repteis['genero_e_especie_ent'] = repteis['genero_e_especie_ent'].str.strip().str.lower().str.capitalize()
repteis['genero_e_especie_atual'] = repteis['genero_e_especie_atual'].str.strip().str.lower().str.capitalize()

<br>

### Collecting date (year) and slicing main DB into a smaller dataset

Columns: `DataDeEntrada` - `DataDaDeterminacao` - `DataColetaInicial` - `Class` - `Kingdom` and more...

In [18]:
# slicing main database (repteis)
Table = repteis[['DataDeEntrada','DataDaDeterminacao','DataColetaInicial','Class','Kingdom', 
                    'Genero_ent', 'Genero_atual', 'Especie_ent', 'Especie_atual', 'Type Status 1',
                    'DeterminatorFirstName1', 'DeterminatorLastName1', 'genero_e_especie_ent',
                    'genero_e_especie_atual','MinAltitude',
                    'Ordem', 'Familia', 'Phylum']].copy()

# NOTE: DataDaDeterminacao has many missing values... CHECK THAT
d = []
counter=0
for row in Table['DataColetaInicial']:
    if '/' in str(row):
        # date string: the year is the last component
        year = int(str(row).split("/")[-1])
    else:
        # no date separator: keep the raw cell value (e.g. a bare year or NaN)
        year = Table.loc[counter, 'DataColetaInicial']
    
    # one more condition to treat empty years ' '
    if year == ' ':
        year = np.nan
        
    d.append({'ano_coleta':year,
              'class':Table.loc[counter,'Class'],
              'kingdom':Table.loc[counter,'Kingdom'], 'genero_ent':Table.loc[counter,'Genero_ent'],
              'genero_atual':Table.loc[counter,'Genero_atual'],
              'especie_ent':Table.loc[counter,'Especie_ent'],
              'especie_atual':Table.loc[counter,'Especie_atual'],
              'genero_e_especie_ent': Table.loc[counter,'genero_e_especie_ent'],
              'genero_e_especie_atual': Table.loc[counter,'genero_e_especie_atual'],
              'type_status':Table.loc[counter,'Type Status 1'], 
              'determinator_first_name':Table.loc[counter,'DeterminatorFirstName1'],
              'determinator_last_name':Table.loc[counter,'DeterminatorLastName1'],
              'altitude':Table.loc[counter,'MinAltitude'],
              'ordem':Table.loc[counter,'Ordem'],
              'familia':Table.loc[counter,'Familia'],
              'phylum': Table.loc[counter,'Phylum']
              })
    counter = counter+1

    
NewTable = pd.DataFrame(d)


### collecting determined year (p.s.: being careful to keep NaNs as they show up)
NewTable['ano_determinacao'] = np.nan

counter=0
for row in Table['DataDaDeterminacao']:
    # reset on every row so a year never leaks over from the previous row
    year = np.nan
    if '/' in str(row):
        year = int(str(row).split("/")[-1])  # the year is the last date component
    
    NewTable.loc[counter, 'ano_determinacao'] = year
    counter = counter+1

    
### collecting start year, from DataDeEntrada (p.s.: being careful to keep NaNs as they show up)  
NewTable['ano_entrada'] = np.nan
counter=0
for row in Table['DataDeEntrada']:
    # reset on every row so a year never leaks over from the previous row
    year = np.nan
    if '/' in str(row):
        year = int(str(row).split("/")[-1])  # the year is the last date component

    NewTable.loc[counter, 'ano_entrada'] = year
    counter = counter+1

# NewTable['determined_year'] = pd.Series(year, index=NewTable.index)
NewTable.head(2)

Unnamed: 0,ano_coleta,class,kingdom,genero_ent,genero_atual,especie_ent,especie_atual,genero_e_especie_ent,genero_e_especie_atual,type_status,determinator_first_name,determinator_last_name,altitude,ordem,familia,phylum,ano_determinacao,ano_entrada
0,,Reptilia,Animalia,Nan,Phrynops,Nan,Geoffroanus,Nan nan,Phrynops geoffroanus,,L.,Cunha,,Testudines,Chelidae,Chordata,2018.0,
1,,Reptilia,Animalia,Heloderma,Heloderma,Suspectum,Suspectum,Heloderma suspectum,Heloderma suspectum,,,,,Squamata,Helodermatidae,Chordata,,
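The row-by-row loops above can also be written as a single vectorized expression. This is a minimal sketch, assuming that cells hold either a date string whose last `/`-separated component is the year, a bare year, a blank `' '`, or NaN; the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mixing the kinds of values handled by the loops
dates = pd.Series(['12/05/1998', np.nan, 1876, ' '], dtype=object)

# take the last '/'-separated piece, trim it, and coerce anything non-numeric to NaN
year = pd.to_numeric(
    dates.astype(str).str.split('/').str[-1].str.strip(),
    errors='coerce',
)
print(year)
```

Each row is computed independently, so a value can never be carried over from a previous row.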


In [19]:
# checks if NaNs are in the same position 
result = (NewTable['ano_entrada'].isna() == repteis['DataDeEntrada'].isna()).sum() == NewTable.shape[0]

if result:
    print('ano_entrada info is valid.')
else:
    print("There's something wrong with NewTable. Check how you're collecting Start Year info.")

ano_entrada info is valid.


In [20]:
# year in which the holotype was first found
# NewTable['holotipo_year'] = NewTable['species_author'].str.extract(r'(\d+)')

<br>

### creating `years` columns in repteis

In [21]:
def catch_year(row):
    if '/' in str(row):
        dates_values = str(row).split("/")
        # the year is the last component, as in the loops that build NewTable
        year = int(dates_values[-1])
        return year
    else:
        return np.nan

In [22]:
repteis['ano_determinacao'] = repteis['DataDaDeterminacao'].apply(catch_year)
repteis['ano_coleta'] = repteis['DataColetaInicial'].apply(catch_year)
repteis['ano_entrada'] = repteis['DataDeEntrada'].apply(catch_year)
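An alternative sketch that lets pandas parse the dates, which also validates the day and month. It assumes the dates follow the `dd/mm/yyyy` pattern; that format is an assumption, and the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample; 'sem data' stands for any unparseable entry
dates = pd.Series(['28/04/2020', '11/02/2020', None, 'sem data'])

# errors='coerce' turns missing or malformed entries into NaT
parsed = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')
years = parsed.dt.year
print(years)
```

Entries that do not match the format become NaT, so their year is NaN rather than a wrong number.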

<br>


## Graphs

---

### Types (*per year*) per researcher

x: Species1, color: Type Status1, size: counts

#### adjusting columns `ano_determinacao`, `ano_coleta` and `ano_entrada` to `int` format

In [23]:
def str_with_nan2int(value):
    '''Cast a numeric year to int, keeping NaN as such.'''
    if not np.isnan(value):
        return int(value)
    else:
        return np.nan

In [24]:
NewTable['ano_determinacao'] = NewTable['ano_determinacao'].apply(str_with_nan2int) #has NaN
NewTable['ano_coleta'] = NewTable['ano_coleta'].apply(str_with_nan2int) #has NaN
NewTable['ano_entrada'] = NewTable['ano_entrada'].apply(str_with_nan2int) #has NaN
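Because NaN is a float, a column that mixes integers and NaN is stored as `float64`, which is why the years still display as `1824.0` further down. A minimal sketch of one way around this, using pandas' nullable `Int64` dtype on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical year column with a missing value
years = pd.Series([1998.0, np.nan, 2004.0])

# nullable integer dtype: keeps integers and represents missing values as <NA>
years_int = years.astype('Int64')
print(years_int)
```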

In [25]:
# number of blank entries in each column (ano_coleta is the column I'll use as the date)
NewTable.isna().sum()

ano_coleta                  4345
class                          0
kingdom                        0
genero_ent                     0
genero_atual                   0
especie_ent                    0
especie_atual                  0
genero_e_especie_ent           0
genero_e_especie_atual         0
type_status                23119
determinator_first_name     5307
determinator_last_name      4704
altitude                   22472
ordem                          0
familia                        0
phylum                         0
ano_determinacao            9013
ano_entrada                23093
dtype: int64
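To put the counts above in perspective, the share of missing values per column can be computed the same way. A minimal sketch on a hypothetical sample; the `n_missing` and `pct_missing` names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kind of columns as NewTable
sample = pd.DataFrame({
    'ano_coleta': [1990, np.nan, 2001, np.nan],
    'ordem': ['Squamata'] * 4,
})

# the mean of a boolean mask is the fraction of True values
missing = pd.DataFrame({
    'n_missing': sample.isna().sum(),
    'pct_missing': (sample.isna().mean() * 100).round(1),
})
print(missing)
```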

<br>

### Creating chart: counts per order per year

In [26]:
orders = NewTable.groupby(['ano_coleta','ordem']).count()['class'].reset_index().rename(columns={'class':'counts'})

orders.sort_values(['ano_coleta','ordem'], inplace=True)  # ordering

In [27]:
orders.head()

Unnamed: 0,ano_coleta,ordem,counts
0,0.0,Squamata,5
1,1824.0,Squamata,1
2,1876.0,Squamata,74
3,1901.0,Squamata,3
4,1902.0,Squamata,2
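The same aggregation can be sketched with `groupby(...).size()`, which counts rows per group without relying on a helper column such as `class`. The sample data below is hypothetical. Note that `size()` counts every row in the group, while `count()` only counts non-null values of the selected column.

```python
import pandas as pd

# Hypothetical sample with the two grouping columns
sample = pd.DataFrame({
    'ano_coleta': [1990, 1990, 1991, 1990],
    'ordem': ['Squamata', 'Squamata', 'Squamata', 'Testudines'],
})

# one row per (year, order) pair with the number of records in it
orders_sketch = sample.groupby(['ano_coleta', 'ordem']).size().reset_index(name='counts')
print(orders_sketch)
```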


In [28]:
g1 = alt.Chart(orders,
               width=800, height=400, title='Number of collected reptiles per order each year').mark_circle(
                                                                                color='green').encode(
    x= alt.X('ano_coleta', type='ordinal', title='Year'),
    y= alt.Y('ordem', type='nominal', title='Order',
            # sort by the 'counts' column (there is no 'count' field in `orders`)
            sort= alt.EncodingSortField(field='counts', op='max', order='descending')),
    size = alt.Size('counts', scale=alt.Scale(range=[1,500])),
    tooltip= alt.Tooltip(['ano_coleta', 'counts'])
)

# saving graph
g1.save('./graphs/orders_per_year.html')

g1

In [29]:
NewTable.columns

Index(['ano_coleta', 'class', 'kingdom', 'genero_ent', 'genero_atual',
       'especie_ent', 'especie_atual', 'genero_e_especie_ent',
       'genero_e_especie_atual', 'type_status', 'determinator_first_name',
       'determinator_last_name', 'altitude', 'ordem', 'familia', 'phylum',
       'ano_determinacao', 'ano_entrada'],
      dtype='object')

### number of reptiles per family per year

In [30]:
teste = NewTable.groupby(['familia','ano_coleta']).count()['class'].reset_index().rename(
                                                                                    columns={'class':'counts'})

teste['ano_coleta'] = teste['ano_coleta'].astype(int)

In [31]:
teste.head()

Unnamed: 0,familia,ano_coleta,counts
0,#n/d,1990,1
1,#n/d,2015,1
2,Agamidae,1962,2
3,Alligatoridae,1912,1
4,Alligatoridae,1948,1


<br>

**graph:** family per year

In [32]:
g1 = alt.Chart(teste,
               width=800, height=400, title='Number of collected animals of each family per year').mark_circle(
                                                                                size=60).encode(
    x= alt.X('ano_coleta', type='ordinal', title='Collection year'),
    y= alt.Y('familia', type='nominal', title='Family',
            sort= alt.EncodingSortField(field='counts', op='count', order='descending')),
    size= alt.Size('counts', title='Count'),
    tooltip = alt.Tooltip(['familia', 'ano_coleta', 'counts'])
)

g1.save('./graphs/familias_por_ano.html')

g1
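If the families should be ordered by their total number of records, one option is to compute that order in pandas first. This is a minimal sketch on hypothetical data shaped like `teste`; `family_order` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample with the same columns as `teste`
sample = pd.DataFrame({
    'familia': ['Chelidae', 'Colubridae', 'Colubridae', 'Agamidae'],
    'ano_coleta': [1990, 1990, 1991, 1962],
    'counts': [3, 10, 7, 2],
})

# total records per family, largest first
family_order = (sample.groupby('familia')['counts']
                      .sum()
                      .sort_values(ascending=False)
                      .index.tolist())
print(family_order)
```

The resulting list can then be passed as an explicit order, e.g. `sort=family_order` in `alt.Y`, which makes the ordering easy to inspect before plotting.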

In [33]:
teste = NewTable.groupby(['familia','ordem','ano_coleta']).count()['class'].reset_index().rename(
                                                                                    columns={'class':'counts'})

teste['ano_coleta'] = teste['ano_coleta'].astype(int)

In [34]:
g1 = alt.Chart(teste,
               width=800, height=400, title='Number of collected animals of each family per year').mark_circle(
                                                                                size=60).encode(
    x= alt.X('ano_coleta', type='ordinal', title='Collection year'),
    y= alt.Y('familia', type='nominal', title='Family',
            sort= alt.EncodingSortField(field='counts', op='count', order='descending')),
    size= alt.Size('counts', title='Count'),
    color= alt.Color('ordem', title='Order'),
    tooltip = alt.Tooltip(['familia', 'ano_coleta', 'counts'])
)

g1.save('./graphs/familias_por_ano_c_ordem.html')

g1

In [35]:
# dynamic version
slider = alt.binding_range(min=1900, max=2016, step=1)
select_year = alt.selection_single(name="ano_coleta", fields=['ano_coleta'],
                                   bind=slider, init={'ano_coleta': 2000})

g1 = alt.Chart(teste,
               width=800, height=400, title='Number of animals of each family and order').mark_circle(
                                                                                size=60).encode(
    x= alt.X('familia', type='nominal', title='Family'),
    y= alt.Y('ordem', type='nominal', title='Order',
            sort= alt.EncodingSortField(field='counts', op='count', order='descending')),
    size= alt.Size('counts', title='Count'),
    color= alt.Color('ordem', title='Order'),
    tooltip=alt.Tooltip(['familia','ordem','ano_coleta','counts'])
).add_selection(
    select_year
).transform_filter(
    select_year
)

# saving graph
g1.save('./graphs/familias_por_ano_c_ordem-dinamico.html')
# g1.save(f'./types/counts_per_year/other-types_per_order.png')
# g1.save(f'./types/counts_per_year/other-types_per_order.html')

g1
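The slider bounds above are hard-coded to 1900 and 2016. A minimal sketch of deriving them from the data instead, assuming that a year of 0 (seen in the counts per order) is a placeholder and should be ignored; the sample data is hypothetical.

```python
import pandas as pd

# Hypothetical sample shaped like `teste`
sample = pd.DataFrame({
    'ano_coleta': [0, 1824, 1950, 2016],
    'counts': [5, 1, 10, 3],
})

# assumption: year 0 is a placeholder, not a real collection year
valid_years = sample.loc[sample['ano_coleta'] > 0, 'ano_coleta']
year_min, year_max = int(valid_years.min()), int(valid_years.max())
print(year_min, year_max)
```

These two values could then replace the literals in `alt.binding_range(min=..., max=...)`, so the slider always covers the years present in the data.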

<br>

**The end!**

-----