# Cumulative Counts per Researcher & Family

By **Franklin Oliveira**

-----
This notebook contains all the code needed to build the cumulative-count line charts for the `poliqueta` database, including some basic data treatment and the chart definitions.

Database: <font color='blue'>'IBUFRJ27.07.2020 - visualização.xlsx'</font> and <font color='blue'>'MNRJP27.07.2020 - visualização.xls'</font>.

In [2]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Waffle Charts
# from pywaffle import Waffle 
# docs: https://pywaffle.readthedocs.io/en/latest/examples/block_shape_distance_location_and_direction.html

# visualization
import altair as alt

# enabling notebook renderer
# alt.renderers.enable('notebook')
# alt.renderers.enable('default')

# disabling rows limit
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Importing data...

In [3]:
NewTable = pd.read_csv('./data/merged_db.csv', sep=';', encoding='utf-8-sig')

In [4]:
# normalizing the spelling of the NaN string
NewTable['family'] = NewTable['family'].apply(lambda x: 'NaN' if x=='Nan' else x)

<br>

<font size=5>**Color Palette per Order**</font>

These images were used as inspiration (https://color.adobe.com/create/image)


<div class='row' style='padding-top:20px;'>
    <div class='col-md-6'>
        <img src="./src/img1.jpg" width='400px'>
    </div>
    <div class='col-md-6'>
        <img src="./src/img2.jpg" width='400px'>
    </div>
</div>

<br>

From the images above, we selected colors (centroids) to build the color palette. They were:
<ul>
    <li style='color:#3CA67F'><b> #3CA67F </b># green</li>
    <li style='color:#7A9FBF'><b> #7A9FBF </b># blue</li>
    <li style='color:#D94814'><b> #D94814 </b># orange</li>
    <li style='color:#D96236'><b> #D96236 </b># orange 2</li>
    <li style='color:#F2B999'><b> #F2B999 </b># 'skin tone'</li>
    <li style='color:#A66C4B'><b> #A66C4B </b># brown 1</li>
    <li style='color:#732C02'><b> #732C02 </b># brown 2</li>
</ul>

Starting from the "centroid" colors, we used the Color Crafter tool to select different shades and help categorize the different groups suggested by the Polychaete team at the Museu Nacional.

<ul>
    <li style='color:#3CA67F'><b> Green: </b> ['#daffef', '#bbebd3', '#9adabc', '#77c8a5', '#57b791', '#3ca67f', '#2a9670', '#238762', '#257a56']</li>
    <li style='color:#7A9FBF'><b> Blue: </b> ['#e7e5df', '#ccd2d8', '#b2c0d0', '#96afc8', '#7a9fbf', '#5d90b6', '#3c81ae', '#0673a4', '#00669a']</li>
    <li style='color:#D94814'><b> Orange: </b> ['#ffbd84', '#ffaa74', '#ff9760', '#ff814b', '#fc6b36', '#eb5824', '#d94814', '#c83b03', '#b73000']</li>
    <li style='color:#D96236'><b> Orange 2: </b> ['#ffeba9', '#ffd391', '#ffbb7b', '#fda468', '#f18e56', '#e57846', '#d96236', '#cc4d28', '#bf381b']</li>
    <li style='color:#F2B999'><b> Skin tone: </b> ['#ffe9c3', '#fbd0ad', '#f2b999', '#e8a287', '#dd8c76', '#d27666', '#c76158', '#bb4d4b', '#ae393e']</li>
    <li style='color:#A66C4B'><b> Brown 1: </b> ['#d9c6af', '#ccad96', '#c1977c', '#b48061', '#a66c4b', '#975b39', '#874c2c', '#774124', '#683720']</li>
    <li style='color:#732C02'><b> Brown 2: </b> ['#eebd93', '#dfa47a', '#d28d60', '#c37746', '#b4622f', '#a3501d', '#92420e', '#823606', '#732c02']</li>
</ul>
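
The shade lists above came from the Color Crafter tool. Purely as an illustration of the idea, and not a reproduction of Color Crafter's algorithm, the sketch below derives a light-to-dark ramp from one centroid by varying only its lightness in HLS space. The helper names (`hex_to_rgb`, `rgb_to_hex`, `lightness_ramp`) and the lightness bounds are assumptions made for this example, so the resulting hex codes will differ from the ones listed.

```python
import colorsys

def hex_to_rgb(hex_color):
    """Convert '#rrggbb' to an (r, g, b) tuple of floats in [0, 1]."""
    hex_color = hex_color.lstrip('#')
    return tuple(int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))

def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in [0, 1] to lowercase '#rrggbb'."""
    return '#' + ''.join(f'{round(c * 255):02x}' for c in rgb)

def lightness_ramp(hex_color, n=9, lightest=0.9, darkest=0.3):
    """Return n shades (n >= 2) with the hue and saturation of hex_color, light to dark."""
    h, _, s = colorsys.rgb_to_hls(*hex_to_rgb(hex_color))
    step = (lightest - darkest) / (n - 1)
    return [rgb_to_hex(colorsys.hls_to_rgb(h, lightest - i * step, s)) for i in range(n)]

# nine shades derived from the green centroid
shades = lightness_ramp('#3CA67F')
```

A ramp like this is only a starting point; the palettes used in the charts were picked by hand from the tool's output.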



**Colors (old):**

<ul>
    <li style='color:#41A681'><b> #41A681 </b># green 1</li>
    <li style='color:#3CA67F'><b> #3CA67F </b># green 2</li>
    <li style='color:#7ACAAB'><b> #7ACAAB </b># light green</li>
    <li style='color:#78a1a1'><b> #78a1a1 </b># blue</li>
    <li style='color:#8ABFB0'><b> #8ABFB0 </b># light blue</li>
    <li style='color:#FFB27C'><b> #FFB27C </b># light skin tone</li>
    <li style='color:#F29877'><b> #F29877 </b># skin tone</li>
    <li style='color:#ed845e'><b> #ed845e </b># light orange 1</li>
    <li style='color:#D96236'><b> #D96236 </b># light orange 2</li>
    <li style='color:#D95323'><b> #D95323 </b># orange 1</li>
    <li style='color:#D94B18'><b> #D94B18 </b># orange 2</li>
    <li style='color:#D9C2AD'><b> #D9C2AD </b># beige</li>
    <li style='color:#A66C4B'><b> #A66C4B </b># light brown</li>
    <li style='color:#86471B'><b> #86471B </b># brown 1</li>
    <li style='color:#732C02'><b> #732C02 </b># brown 2</li>
    <li style='color:#592202'><b> #592202 </b># dark brown 1</li>
    <li style='color:#3D1806'><b> #3D1806 </b># dark brown 2</li>
    <li style='color:#0D0D0D'><b> #0D0D0D </b># black</li>
</ul>

In [5]:
# importing customized color palettes
from src.MNViz_colors import *

**Color Palette - Family:**

- group 1: 1 color  (dark green)
<ul>
    <li style="color:#142611"><b>centroid 1</b></li>
</ul>
- group 2: 8 colors
<ul>
    <li style="color:#85D907"><b>centroid 2</b></li>
</ul>

['#d7ff81', '#bafd62', '#9feb3f', '#85d907', '#6cc700', '#52b700', '#35a600', '#0b9700', '#008800']

<font color="#d7ff81"><b>color 1</b> (left out)</font>
<font color="#bafd62"><b>color 2</b></font>
<font color="#9feb3f"><b>color 3</b></font>
<font color="#85d907"><b>color 4</b></font>
<font color="#6cc700"><b>color 5</b></font>
<font color="#52b700"><b>color 6</b></font>
<font color="#35a600"><b>color 7</b></font>
<font color="#0b9700"><b>color 8</b></font>
<font color="#008800"><b>color 9</b></font>


- group 3: 2 colors
#888C03
<ul>
    <li style="color:#22401E"><b>centroid 3 (leaning toward lighter cool tones)</b></li>
</ul>

<font color="#99b6b2"><b>color 1</b></font>
<font color="#81a58b"><b>color 2</b></font>


- group 4: 1 color  (yellow)
<ul>
    <li style="color:#F2CB07"><b>centroid 4</b></li>
</ul>

- group 5: 10 colors
<ul>
    <li style="color:#cb97d4"><b>centroid 5 (leaning toward purple)</b></li>
</ul>

['#f8dcf9', '#ebc5ed', '#ddafe2', '#ce9ad6', '#bf86cc', '#af73c2', '#a160b8', '#924fae', '#833fa4'] #803da1

<font color="#f8dcf9"><b>color 1</b></font>
<font color="#ebc5ed"><b>color 2</b></font>
<font color="#ddafe2"><b>color 3</b></font>
<font color="#ce9ad6"><b>color 4</b></font>
<font color="#bf86cc"><b>color 5</b></font>
<font color="#af73c2"><b>color 6</b></font>
<font color="#a160b8"><b>color 7</b></font>
<font color="#924fae"><b>color 8</b></font>
<font color="#833fa4"><b>color 9</b></font>
<font color="#803da1"><b>color 10</b></font>


- group 6: 12 colors
<ul>
    <li style="color:#91F2E9"><b>centroid 6</b></li>
</ul>

['#c9fff9', '#b3eff2', '#9cdcea', '#83c9e2', '#68b7da', '#4aa6d2', '#2096ca', '#0087c1', '#0079b7']

['#cee5d8', '#b3d2d1', '#9bbfc9', '#83adc2', '#6d9bba', '#568ab2', '#3e7baa', '#226ca2', '#005e98']


<font color="#c9fff9"><b>color 1</b></font>
<font color="#b3eff2"><b>color 2</b></font>
<font color="#9cdcea"><b>color 3</b></font>
<font color="#83c9e2"><b>color 4</b></font>
<font color="#68b7da"><b>color 5</b></font>
<font color="#4aa6d2"><b>color 6</b></font>
<font color="#2096ca"><b>color 7</b></font>
<font color="#0087c1"><b>color 8</b></font>
<font color="#0079b7"><b>color 9</b></font>
<font color="#3e7baa"><b>color 10</b></font>
<font color="#226ca2"><b>color 11</b></font>
<font color="#005e98"><b>color 12</b></font>


- group 7: 3 colors
<ul>
    <li style="color:#8C1A0F"><b>centroid 7 (leaning toward brown)</b></li>
</ul>

['#fde5bf', '#efd09f', '#e1bb82', '#d3a767', '#c6934d', '#b98033', '#ac6f18', '#9e5e00', '#914e00']

<font color="#ac6f18"><b>color 1</b></font>
<font color="#9e5e00"><b>color 2</b></font>
<font color="#914e00"><b>color 3</b></font>


- group 8: 13 colors
<ul>
    <li style="color:#D9430D"><b>centroid 8</b></li>
</ul>

['#ffce9f', '#ffb683', '#ff9f69', '#ff8851', '#f5723b', '#e75b25', '#d9430d', '#cb2800', '#bc0000']

['#ff8f68', '#ff7e56', '#ff6b40', '#ee5829', '#d9430d', '#c62f00', '#b41b00', '#a40300', '#930000']

<font color="#ffce9f"><b>color 1</b></font>
<font color="#ffb683"><b>color 2</b></font>
<font color="#ff9f69"><b>color 3</b></font>
<font color="#ff8851"><b>color 4</b></font>
<font color="#f5723b"><b>color 5</b></font>
<font color="#e75b25"><b>color 6</b></font>
<font color="#d9430d"><b>color 7</b></font>
<font color="#cb2800"><b>color 8</b></font>
<font color="#bc0000"><b>color 9</b></font>
<font color="#c62f00"><b>color 10</b></font>
<font color="#b41b00"><b>color 11</b></font>
<font color="#a40300"><b>color 12</b></font>
<font color="#930000"><b>color 13</b></font>

<br>

**TOTAL: 50 colors**
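
As a quick sanity check on that total, the per-group sizes listed above can be tallied in a couple of lines. The dictionary `group_sizes` is a name introduced only for this sketch; its values are copied from the group headings.

```python
# number of colors kept per group, copied from the list above
group_sizes = {1: 1, 2: 8, 3: 2, 4: 1, 5: 10, 6: 12, 7: 3, 8: 13}

# total number of colors across all eight groups
total_colors = sum(group_sizes.values())
```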

In [4]:
# cores_familia = {
#     # known errors treatment
# #     '#n/d':'#000000',
# #     'nan':'#000000',
#     # group 1: Crocodylia
#     'Alligatoridae':'#142611',
#     # group 2: Testudines - Cryptodira
#     'Cheloniidae':'#bafd62',
#     'Chelydridae':'#9feb3f',
#     'Dermochelyidae':'#85d907',
#     'Emydidae':'#6cc700',
#     'Geoemydidae':'#52b700',
#     'Kinosternidae':'#35a600',
#     'Testudinidae':'#0b9700',
#     'Trionychidae':'#008800',
#     # group 3: Testudines - Pleurodira
#     'Chelidae':'#99b6b2' ,
#     'Podocnemididae':'#81a58b',
#     # group 4: Amphisbaenia - Amphisbaenia
#     'Amphisbaenidae':'#F2CB07',
#     # group 5: Sauria - Iguania
#     'Agamidae':'#f8dcf9',
#     'Chamaeleonidae':'#ebc5ed',
#     'Dactyloidae':'#ddafe2',
#     'Iguanidae':'#ce9ad6',
#     'Hoplocercidae':'#bf86cc',
#     'Leiosauridae':'#af73c2',
#     'Liolaemidae':'#a160b8', 
#     'Phrynosomatidae':'#924fae',
#     'Polychrotidae':'#833fa4',
#     'Tropiduridae':'#803da1',
#     # group 6: Sauria - Scleroglossa
#     'Anguidae':'#c9fff9',
#     'Gekkonidae':'#b3eff2',
#     'Gymnophthalmidae':'#9cdcea',
#     'Helodermatidae':'#83c9e2',
#     'Lacertidae':'#68b7da',
#     'Mabuyinae':'#4aa6d2',
#     'Phyllodactylidae':'#2096ca',
#     'Scincidae':'#0087c1',
#     'Sphaerodactylidae':'#0079b7',
#     'Teiidae':'#3e7baa',
#     'Varanidae':'#226ca2',
#     'Xantusiidae':'#005e98',
#     # group 7: Serpentes - Scolecophidia
#     'Anomalepididae':'#ac6f18',
#     'Leptotyphlopidae':'#9e5e00',
#     'Typhlopidae':'#914e00', 
#     # group 8: Alethinophidia
#     'Aniliidae':'#ffce9f',
#     'Boidae':'#ffb683',
#     'Colubridae':'#ff9f69',
#     'Dipsadidae':'#ff8851',
#     'Elapidae':'#f5723b',
#     'Homalopsidae':'#e75b25',
#     'Lamprophiidae':'#d9430d', 
#     'Loxocemidae':'#cb2800',
#     'Natricidae':'#bc0000',
#     'Pythonidae':'#c62f00',
#     'Tropidophiidae':'#b41b00',
#     'Viperidae':'#a40300',
#     'Xenopeltidae':'#930000'
# }

In [6]:
ordens = NewTable['order'].unique()
familias = NewTable['family'].unique()

<br>


## Graphs

---

<font color='blue' size='6'>Adjust from here on!!! </font>

<br>

### Creating chart: counts per determiner per year

To calculate the cumulative counts, we need to consider ALL determiner columns, in this case:

    - 'determinator_full_name'
    - 'determinator_full_name2': entirely empty at the time of writing
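
The idea of counting over every name column and then accumulating per person can be sketched compactly with `melt` and a grouped `cumsum`. This is a minimal illustration under stated assumptions, not the notebook's own procedure: the toy frame and its values are invented, and `cumulative_counts` is a hypothetical helper that counts rows rather than non-null `class` values.

```python
import pandas as pd

# Toy stand-in for NewTable; the names and years are invented for illustration.
toy = pd.DataFrame({
    'determinator_full_name':  ['Ana', 'Ana', 'Bruno', 'Ana'],
    'determinator_full_name2': [None, 'Bruno', None, None],
    'ano_determinacao':        [2000, 2001, 2001, 2001],
})

def cumulative_counts(df, name_columns, year_column):
    """Stack all name columns, count rows per (name, year), then accumulate per name."""
    stacked = df.melt(id_vars=[year_column], value_vars=name_columns, value_name='name')
    stacked = stacked.dropna(subset=['name', year_column])
    stacked = stacked.assign(name=stacked['name'].astype(str).str.strip())
    yearly = (stacked.groupby(['name', year_column]).size()
                     .reset_index(name='counts')
                     .sort_values(['name', year_column]))
    # running total within each person, in year order
    yearly['cumulative_sum'] = yearly.groupby('name')['counts'].cumsum()
    return yearly.reset_index(drop=True)

example = cumulative_counts(
    toy, ['determinator_full_name', 'determinator_full_name2'], 'ano_determinacao'
)
```

Because the name columns are stacked before counting, a person who appears as first determiner on one record and second determiner on another in the same year contributes to a single yearly total.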

In [5]:
determiner_columns = ['determinator_full_name', 'determinator_full_name2']

# name of all determiners (first and second)
determiners = set(NewTable['determinator_full_name'].str.strip()).union(set(NewTable['determinator_full_name2']))

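# removing NaN by exact test (a substring check on 'nan' would also drop names such as "Fernando") and parsing into a list
determiners = [name for name in determiners if pd.notna(name) and str(name).strip().lower() != 'nan']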
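
Missing names need some care at this step: a substring test such as `'nan' not in str(name)` also discards real names that happen to contain those letters. The sketch below contrasts that test with an exact one on invented names; `clean_names` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Invented names for illustration; 'Fernando' contains the substring 'nan'.
names = {'Fernando Silva', ' Maria Souza ', float('nan'), 'nan'}

def clean_names(values):
    """Keep real names only: drop missing values, blanks and the literal string 'nan'."""
    cleaned = set()
    for value in values:
        if pd.isna(value):
            continue
        text = str(value).strip()
        if text and text.lower() != 'nan':
            cleaned.add(text)
    return sorted(cleaned)

# substring test: silently loses 'Fernando Silva'
substring_filter = sorted(str(n).strip() for n in names if 'nan' not in str(n))

# exact test: keeps every real name
exact_filter = clean_names(names)
```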

In [6]:
# cumulatively counting the contribution of each determiner
result = defaultdict()
for col in determiner_columns:
    temp = NewTable.groupby([col, 'ano_determinacao']).count()['class'].reset_index().rename(columns={
        'class':'counts'
    })
    
    temp = temp.sort_values('ano_determinacao')  # sort_values is not in-place, so keep the result
    for det in temp[col].unique():
        if pd.notna(det) and str(det).strip().lower() != 'nan':  # NOTE: if more information needs to be collected, do it HERE!
            cumSum = list(temp[temp[col] == det]['counts'].cumsum())
            years = list(temp[temp[col] == det]['ano_determinacao'])

            for i in range(len(cumSum)):
                result[f'{det}-{years[i]}'] = cumSum[i]
                
counts = pd.DataFrame(dict(result), index=['cumulative_sum']).transpose().reset_index()

In [7]:
# split on the last hyphen only, so hyphenated names stay intact
counts['determinator_full_name'] = counts['index'].apply(lambda x:str(x).rsplit('-', 1)[0])
counts['ano_determinacao'] = counts['index'].apply(lambda x:float(str(x).split('-')[-1]))

In [9]:
g1 = alt.Chart(counts, title='Número acumulado de espécimes por determinador', width=800,
              height=400).mark_line(point=True).encode(
    x= alt.X('ano_determinacao', type="ordinal", title='Ano de Determinação'),
    y= alt.Y('cumulative_sum', title='', 
             sort=alt.EncodingSortField('counts', op="count", order='descending')),
    color= alt.Color('determinator_full_name:N', title='', legend=None),
    tooltip= alt.Tooltip(['determinator_full_name','ano_determinacao','cumulative_sum'])
)

g1 = g1.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

g1.save('./graphs/cumCounts/determiner/cumCount-per-year.html')

# g1

<br>

### Collector's cumulative contribution per year

<font color='red' size='5'> Collector names are sensitive data. Do not publish them without the curator's permission. </font>

To calculate the cumulative counts, we need to consider ALL collector columns, in this case:

    - 'collector_full_name'
    - 'collector_full_name2' ... 'collector_full_name6'

In [10]:
collector_columns = ['collector_full_name', 'collector_full_name2', 
                     'collector_full_name3', 'collector_full_name4', 
                     'collector_full_name5', 'collector_full_name6']

# name of all collectors (first through sixth)
collectors = set(NewTable['collector_full_name'].str.strip())

for col in collector_columns[1:]:
    collectors = collectors.union(set(NewTable[col]))

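# removing NaN by exact test (a substring check on 'nan' would also drop names such as "Fernando") and parsing into a list
collectors = [name for name in collectors if pd.notna(name) and str(name).strip().lower() != 'nan']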

In [11]:
# organising the yearly contribution of each collector
result = defaultdict()
for col in collector_columns:
    temp = NewTable.groupby([col, 'ano_coleta']).count()['class'].reset_index().rename(columns={
        'class':'counts'
    })
    
    temp = temp.sort_values('ano_coleta')  # sort_values is not in-place, so keep the result
    for collector in temp[col].unique():
        if pd.notna(collector) and str(collector).strip().lower() != 'nan':  # NOTE: if more information needs to be collected, do it HERE!
            count = list(temp[temp[col] == collector]['counts'])
            years = list(temp[temp[col] == collector]['ano_coleta'])

            for i in range(len(count)):
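                # add to any count already stored for this collector and year from another collector column
                result[f'{collector}-{years[i]}'] = result.get(f'{collector}-{years[i]}', 0) + count[i]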
                
counts = pd.DataFrame(dict(result), index=['count']).transpose().reset_index()

In [12]:
# split on the last hyphen only, so hyphenated names stay intact
counts['collector_full_name'] = counts['index'].apply(lambda x:str(x).rsplit('-', 1)[0])
counts['ano_coleta'] = counts['index'].apply(lambda x:float(str(x).split('-')[-1]))

In [13]:
# sorting
counts.sort_values(['collector_full_name', 'ano_coleta'], inplace=True)

In [14]:
# cumulatively counting
cumSum = []
for collector in counts['collector_full_name'].unique():
    cumSum.extend(list(counts[counts['collector_full_name'] == collector]['count'].cumsum()))

counts['cumulative_sum'] = cumSum

In [16]:
g1 = alt.Chart(counts, title='Número acumulado de espécimes por coletor', width=800,
              height=400).mark_line(point=True).encode(
    x= alt.X('ano_coleta', type="ordinal", title='Ano de Coleta'),
    y= alt.Y('cumulative_sum', title='', 
             sort=alt.EncodingSortField('counts', op="count", order='descending')),
    color= alt.Color('collector_full_name:N', title='', legend=None),
    tooltip= alt.Tooltip(['collector_full_name','ano_coleta','cumulative_sum'])
)

g1 = g1.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

g1.save('./graphs/cumCounts/collector/cumCount-per-year.html')

# g1

<br>

## Cumulative counts per Family


In [17]:
# grouping per Year and Family
teste = NewTable.groupby(['ano_coleta','familia']).count()['class'].reset_index().rename(columns={
    'class':'counts'
})

# sorting...
teste = teste.sort_values(['familia', 'ano_coleta'])

In [18]:
# cumulatively counting
cumSum = []
for family in teste['familia'].unique():
    cumSum.extend(list(teste[teste['familia'] == family]['counts'].cumsum()))
    
teste['cumulative_sum'] = cumSum
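
The loop above depends on `teste` already being sorted by family and year, because the per-family running totals are appended in iteration order. A grouped `cumsum` gives the same result while staying aligned to the frame's index. The sketch below shows it on invented data, with column names mirroring the ones used here.

```python
import pandas as pd

# Invented yearly counts, deliberately unsorted.
toy = pd.DataFrame({
    'familia':    ['B', 'A', 'A', 'B', 'A'],
    'ano_coleta': [2001, 2002, 2000, 2000, 2001],
    'counts':     [4, 1, 2, 3, 5],
})

# sort so that each family's rows are in chronological order
toy = toy.sort_values(['familia', 'ano_coleta']).reset_index(drop=True)

# running total within each family; the result is aligned to toy's index
toy['cumulative_sum'] = toy.groupby('familia')['counts'].cumsum()
```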

### Chart: per collected year

In [26]:
# filtering out some families lost while grouping
familias = [f for f in cores_familia.keys() if f in teste['familia'].unique()]
cores_temp = [cores_familia[f] for f in familias] 


# selector
select_family = alt.selection_multi(fields=['familia'], bind='legend')

g1 = alt.Chart(teste, title='Quantidade de exemplares por família', 
               width=600, height=400).mark_line(point=True).encode(
    x= alt.X('ano_coleta', type="ordinal", title='Ano de Coleta'),
    y= alt.Y('cumulative_sum', title='', 
             sort=alt.EncodingSortField('counts', op="count", order='descending')),
    color= alt.Color('familia:N', title='Família',
                     legend= alt.Legend(columns=2, symbolLimit=50),
                     scale=alt.Scale(domain=familias, range= cores_temp)),
    tooltip= alt.Tooltip(['familia','ano_coleta','counts', 'cumulative_sum']),
    opacity= alt.condition(select_family, alt.value(1), alt.value(0))
).add_selection(select_family)

g1 = g1.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

g1.save('./graphs/cumCounts/time-spacial/evolucao_por_familia.html')

g1
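
The `domain` and `range` lists passed to `alt.Scale` are matched by position, so they need the same length and order. The filtering step at the top of the cell handles that by walking the palette keys. The sketch below isolates the step and also reports values present in the data but missing from the palette. The palette contents here are invented (the real mapping is `cores_familia`), and `aligned_scale` is a hypothetical helper.

```python
# Hypothetical palette and data values, for illustration only.
palette = {'Nereididae': '#3ca67f', 'Syllidae': '#7a9fbf', 'Eunicidae': '#d94814'}
present = ['Syllidae', 'Nereididae', 'Unknown']

def aligned_scale(palette, present_values):
    """Return (domain, colors, missing): domain and colors are position-aligned and
    follow palette order; missing lists data values that have no palette entry."""
    present_set = set(present_values)
    domain = [key for key in palette if key in present_set]
    colors = [palette[key] for key in domain]
    missing = sorted(present_set - set(palette))
    return domain, colors, missing

domain, colors, missing = aligned_scale(palette, present)
```

Values that end up in `missing` have no entry in the explicit `domain`, so checking that list before plotting can catch families that would otherwise go unnoticed.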

<br>

### Chart: per determination year

In [27]:
# grouping per Year and Family
teste = NewTable.groupby(['ano_determinacao','familia']).count()['class'].reset_index().rename(columns={
    'class':'counts'
})

# sorting...
teste = teste.sort_values(['familia', 'ano_determinacao'])

In [28]:
# cumulatively counting
cumSum = []
for family in teste['familia'].unique():
    cumSum.extend(list(teste[teste['familia'] == family]['counts'].cumsum()))
    
teste['cumulative_sum'] = cumSum

In [31]:
# filtering out some families lost while grouping
familias = [f for f in cores_familia.keys() if f in teste['familia'].unique()]
cores_temp = [cores_familia[f] for f in familias] 


# selector
select_family = alt.selection_multi(fields=['familia'], bind='legend')

g1 = alt.Chart(teste, title='Quantidade de exemplares por família', 
               width=600, height=400).mark_line(point=True).encode(
    x= alt.X('ano_determinacao', type="ordinal", title='Ano de Determinação'),
    y= alt.Y('cumulative_sum', title='', 
             sort=alt.EncodingSortField('counts', op="count", order='descending')),
    color= alt.Color('familia:N', title='Família',
                     legend= alt.Legend(columns=2, symbolLimit=50),
                     scale=alt.Scale(domain=familias, range= cores_temp)),
    tooltip= alt.Tooltip(['familia','ano_determinacao','counts', 'cumulative_sum']),
    opacity= alt.condition(select_family, alt.value(1), alt.value(0))
).add_selection(select_family)

g1 = g1.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

g1.save('./graphs/cumCounts/time-spacial/evolucao_por_familia-ANO-DETERMINACAO.html')

# g1

<br>

**The end!**

-----