# Cumulative Counts per researcher

By **Franklin Oliveira**

-----
This notebook contains all the code needed to build the cumulative-count line charts for the `repteis` database, including some basic data treatment and the chart code.

Database: <font color='blue'>'Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls'</font>.

In [1]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Waffle Charts
# from pywaffle import Waffle 
# docs: https://pywaffle.readthedocs.io/en/latest/examples/block_shape_distance_location_and_direction.html

# visualization
import altair as alt

# enabling notebook renderer
# alt.renderers.enable('notebook')
alt.renderers.enable('default')

# disabling rows limit
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Importing data...

In [2]:
NewTable = pd.read_csv('./data/treated_db.csv', sep=';', encoding='utf-8-sig')



<br>

<font size=5>**Color palette by Order**</font>

The image below was used as inspiration (https://color.adobe.com/create/image)

<img src="./src/paleta_cores.jpeg" width='500px'>

Colors: 

- verde_escuro (dark green): #284021
- verde_claro (light green): #88BF11
- amarelo (yellow): #D9CB0B
- laranja (orange): #D99311
- laranja_escuro (dark orange): #BF4417
- marrom_claro (light brown): #BF8D7A

In [3]:
# p.s.: Caudata is an error and should be removed
cores_ordem = {
    'Squamata': '#BF4417',
    'Testudines': '#D9CB0B', 
    'Crocodylia': '#284021'
}

ordens = list(cores_ordem.keys())
cores = list(cores_ordem.values())

<br>


## Graphs

---

<br>

### Creating chart: counts per determiner per year

To calculate the cumulative counts, we need to consider all determiner columns, in this case:

    - 'determinator_full_name'
    - 'determinator_full_name2': entirely empty at the time of writing

In [4]:
determiner_columns = ['determinator_full_name', 'determinator_full_name2']

# name of all determiners (first and second)
determiners = set(NewTable['determinator_full_name'].str.strip()).union(set(NewTable['determinator_full_name2']))

determiners = [name for name in determiners if 'nan' not in str(name)]  # removing NaN and parsing into a list
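The cell above strips whitespace only from the first name column and filters missing values by matching the text `'nan'`. As a hedged alternative, the sketch below gathers names from every column with `melt`, drops missing values explicitly, and strips all of them. The `toy` frame and its names are invented for illustration and merely stand in for `NewTable`.

```python
import pandas as pd

# Hypothetical stand-in for NewTable: two determiner columns, some values missing
toy = pd.DataFrame({
    'determinator_full_name': [' Ana Silva', 'Ana Silva ', 'Bruno Costa', None],
    'determinator_full_name2': [None, 'Carla Dias', None, None],
})
name_columns = ['determinator_full_name', 'determinator_full_name2']

# Stack all name columns into one Series, drop missing entries, then normalise whitespace
names = (
    toy[name_columns]
    .melt(value_name='name')['name']
    .dropna()
    .astype(str)
    .str.strip()
)

# Unique names, sorted for a reproducible order
unique_names = sorted(set(names))
print(unique_names)
```

Dropping missing values with `dropna()` avoids discarding a real name that happens to contain the letters "nan" (for example "Fernanda"), which the substring test would remove.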

In [5]:
# cumulatively counting the contribution of each determiner
result = {}
for col in determiner_columns:
    temp = NewTable.groupby([col, 'ano_determinacao']).count()['class'].reset_index().rename(columns={
        'class':'counts'
    })
    
    # sort chronologically so the cumulative sum runs from the earliest year onward
    temp = temp.sort_values('ano_determinacao')
    for det in temp[col].unique():
        if 'nan' not in str(det):  # NOTE: if more information needs to be collected, do it HERE!
            cumSum = list(temp[temp[col] == det]['counts'].cumsum())
            years = list(temp[temp[col] == det]['ano_determinacao'])

            for i in range(len(cumSum)):
                result[f'{det}-{years[i]}'] = cumSum[i]
                
counts = pd.DataFrame(dict(result), index=['cumulative_sum']).transpose().reset_index()

In [6]:
# split on the LAST hyphen only, so hyphenated names are kept whole
counts['determinator_full_name'] = counts['index'].apply(lambda x:str(x).rsplit('-', 1)[0])
counts['ano_determinacao'] = counts['index'].apply(lambda x:float(str(x).rsplit('-', 1)[-1]))
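The `'{name}-{year}'` keys are parsed back into a name and a year in the cell above. A minimal sketch of that parsing step follows, assuming years are stored as floats (such as `1987.0`) and that a name may itself contain hyphens. The helper `split_key` and the sample keys are hypothetical and not part of the notebook.

```python
def split_key(key):
    """Split a 'name-year' key on its last hyphen only."""
    name, year = str(key).rsplit('-', 1)
    return name, float(year)

# A plain name and a hyphenated name (both invented for illustration)
print(split_key('Ana Silva-1987.0'))
print(split_key('Maria Souza-Lima-1992.0'))
```

Splitting on the last hyphen keeps everything before it as the name, whereas taking the first piece of a plain `split('-')` would cut a hyphenated surname short.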

In [7]:
g1 = alt.Chart(counts, title='Número acumulado de espécimes por determinador', width=800,
              height=400).mark_line(point=True).encode(
    x= alt.X('ano_determinacao', type="ordinal", title='Ano de Determinação'),
    y= alt.Y('cumulative_sum', title=''),
    color= alt.Color('determinator_full_name:N', title='', legend=None),
    tooltip= alt.Tooltip(['determinator_full_name','ano_determinacao','cumulative_sum'])
)

g1 = g1.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

g1.save('./graphs/cumCounts/determiner/cumCount-per-year.html')

g1

<br>

### Collector's cumulative contribution per year

<font color='red' size='5'> Collectors' names are sensitive data. Do not publish them without the curator's permission. </font>

To calculate the cumulative counts, we need to consider all collector columns, in this case:

    - 'collector_full_name'
    - 'collector_full_name2' ... 'collector_full_name6'

In [8]:
collector_columns = ['collector_full_name', 'collector_full_name2', 
                     'collector_full_name3', 'collector_full_name4', 
                     'collector_full_name5', 'collector_full_name6']

# name of all collectors (first through sixth)
collectors = set(NewTable['collector_full_name'].str.strip())

for col in collector_columns[1:]:
    collectors = collectors.union(set(NewTable[col]))

collectors = [name for name in collectors if 'nan' not in str(name)]  # removing NaN and parsing into a list

In [9]:
# organising the yearly contribution of each collector
result = {}
for col in collector_columns:
    temp = NewTable.groupby([col, 'ano_coleta']).count()['class'].reset_index().rename(columns={
        'class':'counts'
    })
    
    # sort chronologically (the result of sort_values must be assigned back)
    temp = temp.sort_values('ano_coleta')
    for collector in temp[col].unique():
        if 'nan' not in str(collector):  # NOTE: if more information needs to be collected, do it HERE!
            count = list(temp[temp[col] == collector]['counts'])
            years = list(temp[temp[col] == collector]['ano_coleta'])

            for i in range(len(count)):
                result[f'{collector}-{years[i]}'] = count[i]
                
counts = pd.DataFrame(dict(result), index=['count']).transpose().reset_index()
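One thing to watch in the loop above: if the same person appears in two different name columns in the same year, the second column's count overwrites the first under the same `'{name}-{year}'` key. As a hedged alternative, the sketch below reshapes the name columns into a single long column with `melt` before counting, so every appearance is counted once. The `toy` frame is invented for illustration and uses only two of the six collector columns.

```python
import pandas as pd

# Hypothetical stand-in for NewTable with two collector columns
toy = pd.DataFrame({
    'collector_full_name':  ['Ana', 'Bruno', 'Ana'],
    'collector_full_name2': ['Bruno', 'Ana', None],
    'ano_coleta': [1990.0, 1990.0, 1991.0],
})
name_columns = ['collector_full_name', 'collector_full_name2']

# One row per (specimen, collector) pair; rows without a collector are dropped
long = (
    toy.melt(id_vars='ano_coleta', value_vars=name_columns, value_name='collector')
       .dropna(subset=['collector'])
)

# Yearly contribution of each collector, regardless of which column held the name
yearly = (
    long.groupby(['collector', 'ano_coleta'])
        .size()
        .reset_index(name='count')
)
print(yearly)
```

Here both Ana and Bruno are credited with two specimens in 1990, because each appears once as first collector and once as second collector that year.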

In [10]:
# split on the LAST hyphen only, so hyphenated names are kept whole
counts['collector_full_name'] = counts['index'].apply(lambda x:str(x).rsplit('-', 1)[0])
counts['ano_coleta'] = counts['index'].apply(lambda x:float(str(x).rsplit('-', 1)[-1]))

In [11]:
# sorting
counts.sort_values(['collector_full_name', 'ano_coleta'], inplace=True)

In [12]:
# cumulatively counting
cumSum = []
for collector in counts['collector_full_name'].unique():
    cumSum.extend(list(counts[counts['collector_full_name'] == collector]['count'].cumsum()))

counts['cumulative_sum'] = cumSum
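The explicit loop above depends on `counts` already being sorted by collector and year, so that the extended list lines up with the rows. A minimal sketch of the same running total using pandas' grouped `cumsum` is shown below. The `toy` frame is invented for illustration and reuses the notebook's column names.

```python
import pandas as pd

# Hypothetical yearly counts for two collectors, deliberately out of order
toy = pd.DataFrame({
    'collector_full_name': ['B', 'A', 'A', 'B', 'A'],
    'ano_coleta': [2001.0, 1999.0, 2000.0, 2003.0, 2002.0],
    'count': [4, 1, 2, 1, 3],
})

# Sort so each collector's rows are in chronological order
toy = toy.sort_values(['collector_full_name', 'ano_coleta']).reset_index(drop=True)

# Running total within each collector; the result is aligned to the row index
toy['cumulative_sum'] = toy.groupby('collector_full_name')['count'].cumsum()
print(toy)
```

Because the grouped `cumsum` returns a Series aligned to the frame's index, the assignment stays correct even if the grouping order and the row order differ.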

In [13]:
g1 = alt.Chart(counts, title='Número acumulado de espécimes por coletor', width=800,
              height=400).mark_line(point=True).encode(
    x= alt.X('ano_coleta', type="ordinal", title='Ano de Coleta'),
    y= alt.Y('cumulative_sum', title=''),
    color= alt.Color('collector_full_name:N', title='', legend=None),
    tooltip= alt.Tooltip(['collector_full_name','ano_coleta','cumulative_sum'])
)

g1 = g1.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

g1.save('./graphs/cumCounts/collector/cumCount-per-year.html')

g1

<br>

**The end!**

-----