# Counts per researcher

By **Franklin Oliveira**

-----
This notebook contains all the code needed to make the counts-per-researcher charts from the `repteis` database. It includes some basic data treatment and the code for each chart.

Database: <font color='blue'>'Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls'</font>.

In [1]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Waffle Charts
# from pywaffle import Waffle 
# docs: https://pywaffle.readthedocs.io/en/latest/examples/block_shape_distance_location_and_direction.html

# visualization
import altair as alt

# enabling notebook renderer
# alt.renderers.enable('notebook')
alt.renderers.enable('default')

# disabling rows limit
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Importing data...

In [2]:
NewTable = pd.read_csv('./data/treated_db.csv', sep=';', encoding='utf-8-sig')



<br>

<font size=5>**Color palette by Order**</font>

Below is the image used as inspiration (https://color.adobe.com/create/image)

<img src="./src/paleta_cores.jpeg" width='500px'>

Colors: 

- verde_escuro (dark green): #284021
- verde_claro (light green): #88BF11
- amarelo (yellow): #D9CB0B
- laranja (orange): #D99311
- laranja_escuro (dark orange): #BF4417
- marrom_claro (light brown): #BF8D7A

In [3]:
# p.s.: Caudata is an error and should be removed
cores_ordem = {
    'Squamata': '#BF4417',
    'Testudines': '#D9CB0B', 
    'Crocodylia': '#284021'
}

ordens = list(cores_ordem.keys())
cores = list(cores_ordem.values())
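
The charts in this notebook pass `ordens` and `cores` to `alt.Scale(domain=..., range=...)`, which pairs the two lists by position. A minimal sketch of a sanity check on the palette, using only the standard library; the helper name `check_palette` is hypothetical and not part of the original notebook:

```python
import re

# Palette as defined in the notebook
cores_ordem = {
    'Squamata': '#BF4417',
    'Testudines': '#D9CB0B',
    'Crocodylia': '#284021',
}

# Six-digit hex color such as '#BF4417'
HEX_COLOR = re.compile(r'^#[0-9A-Fa-f]{6}$')


def check_palette(palette):
    """Validate hex codes and return aligned (domain, range) lists.

    Hypothetical helper, shown for illustration only.
    """
    bad = {name: color for name, color in palette.items()
           if not HEX_COLOR.match(color)}
    if bad:
        raise ValueError(f'invalid hex colors: {bad}')
    # dicts preserve insertion order, so keys and values stay aligned
    return list(palette.keys()), list(palette.values())


ordens, cores = check_palette(cores_ordem)
```

Because both lists come from the same dict, each order keeps its color even if entries are added or reordered later.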

<br>


## Graphs

---

<br>

### Creating chart: counts per determiner per year

In [4]:
# count records per determiner and determination year
teste = (NewTable.groupby(['determinator_full_name', 'ano_determinacao'])['class']
         .count().reset_index().rename(columns={'class': 'counts'}))
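
One detail of the aggregation above is that `count()` only counts non-null values of `class`, so rows with a missing `class` do not contribute. A small sketch on made-up data (the names and values below are invented for illustration, not taken from the database) comparing it with `size()`, which counts rows:

```python
import pandas as pd

# Toy stand-in for NewTable; all values are made up
toy = pd.DataFrame({
    'determinator_full_name': ['A. Silva', 'A. Silva', 'A. Silva', 'B. Costa'],
    'ano_determinacao': [2001, 2001, 2002, 2001],
    'class': ['Reptilia', None, 'Reptilia', 'Reptilia'],
})

keys = ['determinator_full_name', 'ano_determinacao']

# Same pattern as the notebook: non-null 'class' values per group
by_count = (toy.groupby(keys)['class'].count()
               .reset_index()
               .rename(columns={'class': 'counts'}))

# Alternative: number of rows per group, missing 'class' included
by_size = toy.groupby(keys).size().reset_index(name='counts')

print(by_count)
print(by_size)
```

If `class` is never missing in the treated database, both approaches give the same counts.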

In [6]:
g1 = alt.Chart(teste, width=800, height=2200).mark_circle().encode(
    x= alt.X('ano_determinacao', type='ordinal', title='Determined Year'),
    y= alt.Y('determinator_full_name', type='nominal', title='Determiner Name', 
            sort=alt.EncodingSortField('counts', op="count", order='descending')),
    size= alt.Size('counts', scale=alt.Scale(range=[15, 500])),  # range sets the circle size
    tooltip= alt.Tooltip(['determinator_full_name', 'ano_determinacao', 'counts'])
)

g1 = g1.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

g1.save('./graphs/determiner/counts_per_determiner.html')
# g1

<font color='red' size='5'>same chart, now also broken down by order</font>

**Note:**

- the counts change (because we are grouping by one more factor)
- there are overlapping points (similar to what we had for the crustaceans database: for the same year, the same researcher found animals of different orders/families)
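
The effect described in the note can be reproduced on a tiny example. The sketch below uses made-up data (not from the database) to show that adding `ordem` as a grouping key splits one count into several smaller ones, which is also why several circles can land on the same (year, researcher) position:

```python
import pandas as pd

# Toy data: one researcher, one year, two orders (values are made up)
toy = pd.DataFrame({
    'determinator_full_name': ['A. Silva'] * 3,
    'ano_determinacao': [2001] * 3,
    'ordem': ['Squamata', 'Squamata', 'Testudines'],
    'class': ['Reptilia'] * 3,
})

# Grouping by researcher and year gives a single point
two_keys = toy.groupby(
    ['determinator_full_name', 'ano_determinacao'])['class'].count()

# Adding 'ordem' gives one point per order at the same x/y position
three_keys = toy.groupby(
    ['determinator_full_name', 'ano_determinacao', 'ordem'])['class'].count()

print(two_keys.tolist())    # one group
print(three_keys.tolist())  # one group per order
```

The totals are unchanged; only their partition across points differs.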

In [15]:
# count records per determiner, determination year and order
teste1 = (NewTable.groupby(['determinator_full_name', 'ano_determinacao', 'ordem'])['class']
          .count().reset_index().rename(columns={'class': 'counts'}))

In [16]:
g2 = alt.Chart(teste1, width=800, height=2200).mark_circle().encode(
    x= alt.X('ano_determinacao', type='ordinal', title='Determined Year'),
    y= alt.Y('determinator_full_name', type='nominal', title='Determiner Name', 
            sort=alt.EncodingSortField('counts', op="count", order='descending')),
    size= alt.Size('counts', type="quantitative",scale=alt.Scale(range=[15, 500])),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    color= alt.Color('ordem', type="nominal", title="Order", scale=alt.Scale(domain=ordens, range=cores)),
    tooltip= alt.Tooltip(['determinator_full_name', 'ano_determinacao', 'counts', 'ordem'])
)

g2 = g2.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

# saving chart
g2.save('./graphs/determiner/counts_per_determiner-w-order.html')
# g2

<br>

<font color='red'>**same chart, now sorted by the first year each researcher appears in the database**</font>

In [17]:
# sorting by year, then by name
teste1.sort_values(['ano_determinacao', 'determinator_full_name'], inplace=True)

# saving the order in which names first appear
sorting = list(teste1['determinator_full_name'].unique())
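
The sort-then-`unique()` step works because `unique()` keeps values in order of first appearance. As a hedged illustration on made-up data, the sketch below checks that it matches a more explicit formulation that computes each researcher's earliest year first:

```python
import pandas as pd

# Toy data; names, years and counts are made up
toy = pd.DataFrame({
    'determinator_full_name': ['C. Lima', 'A. Silva', 'B. Costa', 'A. Silva'],
    'ano_determinacao': [1990, 2001, 1985, 1995],
    'counts': [4, 2, 7, 1],
})

# Notebook approach: sort by year and name, keep first-seen order of names
by_sort = list(
    toy.sort_values(['ano_determinacao', 'determinator_full_name'])
       ['determinator_full_name'].unique())

# Explicit approach: earliest year per researcher, then sort by it
first_year = (toy.groupby('determinator_full_name')['ano_determinacao']
                 .min().reset_index())
by_min = list(
    first_year.sort_values(['ano_determinacao', 'determinator_full_name'])
              ['determinator_full_name'])

print(by_sort)
print(by_min)
```

Either list can be passed as the `sort` argument of the y encoding.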

In [18]:
g2 = alt.Chart(teste1, width=800, height=2200).mark_circle().encode(
    x= alt.X('ano_determinacao', type='ordinal', title='Determined Year'),
    y= alt.Y('determinator_full_name', type='nominal', title='Determiner Name', 
            sort=sorting),
    size= alt.Size('counts', type="quantitative",scale=alt.Scale(range=[15, 500])),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    color= alt.Color('ordem', type="nominal", title="Order", scale=alt.Scale(domain=ordens, range=cores)),
    tooltip= alt.Tooltip(['determinator_full_name', 'ano_determinacao', 'counts', 'ordem'])
)

g2 = g2.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

# saving chart
g2.save('./graphs/determiner/counts_per_determiner-c-ordem_reordenado.html')
# g2

<br>

### Chart of the most prolific determiners (top 50)

In [19]:
# summing each researcher's counts
sorting = teste1.groupby('determinator_full_name')['counts'].sum().reset_index().rename(
    columns={'counts':'sum'})

sorting = sorting.sort_values('sum', ascending=False)

# researcher names, sorted by total count
sort_list = sorting['determinator_full_name'].unique()

sorting.head()

| | determinator_full_name | sum |
|---|---|---|
| 98 | P. Passos | 3183 |
| 99 | P. Pinna | 1666 |
| 13 | A. Silveira | 1614 |
| 92 | N. Silva | 796 |
| 124 | T. Freitas | 602 |
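
The ranking above drives both the top-50 chart and the chart of less frequent names. A minimal sketch of that split on made-up data, with a hypothetical `TOP_N` constant standing in for the hard-coded 50:

```python
import pandas as pd

# Toy stand-in for teste1; names and counts are made up
toy = pd.DataFrame({
    'determinator_full_name': ['A. Silva', 'A. Silva', 'B. Costa',
                               'C. Lima', 'C. Lima'],
    'ano_determinacao': [2001, 2002, 2001, 1990, 1991],
    'counts': [2, 3, 10, 1, 1],
})

TOP_N = 2  # hypothetical; the notebook uses 50

# Total count per researcher, largest first
totals = (toy.groupby('determinator_full_name')['counts']
             .sum().sort_values(ascending=False))
sort_list = totals.index.tolist()

# Rows for the most frequent names and for everyone else
top = toy[toy['determinator_full_name'].isin(sort_list[:TOP_N])]
rest = toy[toy['determinator_full_name'].isin(sort_list[TOP_N:])]

print(sort_list)
```

Slicing the same `sort_list` for both subsets keeps the two charts complementary: every researcher appears in exactly one of them.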


In [21]:
g2 = alt.Chart(teste1[teste1['determinator_full_name'].isin(sort_list[:50])], width=800, height=600).mark_circle().encode(
    x= alt.X('ano_determinacao', type='ordinal', title='Determined Year'),
    y= alt.Y('determinator_full_name', type='nominal', title='Determiner Name', 
            sort=sort_list),
    size= alt.Size('counts', type="quantitative",scale=alt.Scale(range=[15, 500])),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    color= alt.Color('ordem', type="nominal", title="Order", scale=alt.Scale(domain=ordens, range=cores)),
    tooltip= alt.Tooltip(['determinator_full_name', 'ano_determinacao', 'counts', 'ordem'])
)

g2 = g2.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

# saving chart
g2.save('./graphs/determiner/counts_per_determiner-top_50.html')

# g2

In [23]:
g2 = alt.Chart(teste1[teste1['determinator_full_name'].isin(sort_list[50:])], width=800, height=600).mark_circle().encode(
    x= alt.X('ano_determinacao', type='ordinal', title='Determined Year'),
    y= alt.Y('determinator_full_name', type='nominal', title='Determiner Name', 
            sort=sort_list),
    size= alt.Size('counts', type="quantitative",scale=alt.Scale(range=[15, 60], zero=False)),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    color= alt.Color('ordem', type="nominal", title="Order", scale=alt.Scale(domain=ordens, range=cores)),
    tooltip= alt.Tooltip(['determinator_full_name', 'ano_determinacao', 'counts', 'ordem'])
)

g2 = g2.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

# saving chart
g2.save('./graphs/determiner/counts_per_determiner-menos_freq.html')

# g2

<br>

### Creating chart: counts per collector per year

<font color='red' size='5'> Collectors' names are sensitive data. Do not publish them without the curator's permission </font>

In [24]:
# count records per collector and collection year
teste = (NewTable.groupby(['collector_full_name', 'ano_coleta'])['class']
         .count().reset_index().rename(columns={'class': 'counts'}))

In [26]:
g1 = alt.Chart(teste, width=800, height=2200).mark_circle().encode(
    x= alt.X('ano_coleta', type='ordinal', title='Collected Year'),
    y= alt.Y('collector_full_name', type='nominal', title='Collector Name', 
            sort=alt.EncodingSortField('counts', op="count", order='descending')),
    size= alt.Size('counts', scale=alt.Scale(range=[15, 500])),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    tooltip= alt.Tooltip(['collector_full_name', 'ano_coleta', 'counts'])
)

g1 = g1.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

g1.save('./graphs/collector/counts_per_collector.html')

# g1

<font color='red' size='5'>same chart, now also broken down by order</font>

**Note:**

- the counts change (because we are grouping by one more factor)
- there are overlapping points (similar to what we had for the crustaceans database: for the same year, the same researcher found animals of different orders/families)

In [27]:
# count records per collector, collection year and order
teste1 = (NewTable.groupby(['collector_full_name', 'ano_coleta', 'ordem'])['class']
          .count().reset_index().rename(columns={'class': 'counts'}))

In [28]:
g2 = alt.Chart(teste1, width=800, height=2200).mark_circle().encode(
    x= alt.X('ano_coleta', type='ordinal', title='Collected Year'),
    y= alt.Y('collector_full_name', type='nominal', title='Collector Name', 
            sort=alt.EncodingSortField('counts', op="count", order='descending')),
    size= alt.Size('counts', type="quantitative",scale=alt.Scale(range=[15, 500])),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    color= alt.Color('ordem', type="nominal", title="Order", scale=alt.Scale(domain=ordens, range=cores)),
    tooltip= alt.Tooltip(['collector_full_name', 'ano_coleta', 'counts', 'ordem'])
)

g2 = g2.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

# saving chart
g2.save('./graphs/collector/counts_per_collector-w-order.html')

# g2

<br>

<font color='red'>**same chart, now sorted by the first year each researcher appears in the database**</font>

In [29]:
# sorting by year, then by name
teste1.sort_values(['ano_coleta', 'collector_full_name'], inplace=True)

# saving the order in which names first appear
sorting = list(teste1['collector_full_name'].unique())

In [30]:
g2 = alt.Chart(teste1, width=800, height=10000).mark_circle().encode(
    x= alt.X('ano_coleta', type='ordinal', title='Collected Year'),
    y= alt.Y('collector_full_name', type='nominal', title='Collector Name', 
            sort=alt.EncodingSortField('ano_coleta', op="min", order='ascending')),
    size= alt.Size('counts', type="quantitative",scale=alt.Scale(range=[15, 500])),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    color= alt.Color('ordem', type="nominal", title="Order", scale=alt.Scale(domain=ordens, range=cores)),
    tooltip= alt.Tooltip(['collector_full_name', 'ano_coleta', 'counts', 'ordem'])
)

g2 = g2.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

# saving chart
g2.save('./graphs/collector/counts_per_researcher-c-ordem_reordenado.html')

# g2

<br>

### Chart of the most prolific collectors (top 50)

In [31]:
# summing each researcher's counts
sorting = teste1.groupby('collector_full_name')['counts'].sum().reset_index().rename(
    columns={'counts':'sum'})

sorting = sorting.sort_values('sum', ascending=False)

# researcher names, sorted by total count
sort_list = sorting['collector_full_name'].unique()

sorting.head()

| | collector_full_name | sum |
|---|---|---|
| 155 | A. Silveira | 966 |
| 1523 | U. Caramaschi | 753 |
| 406 | D. Vrcibradic | 701 |
| 37 | A. Carvalho | 567 |
| 302 | C. Rocha | 478 |


In [32]:
g2 = alt.Chart(teste1[teste1['collector_full_name'].isin(sort_list[:50])], 
               width=800, height=600).mark_circle().encode(
    x= alt.X('ano_coleta', type='ordinal', title='Collected Year'),
    y= alt.Y('collector_full_name', type='nominal', title='Collector Name', sort= sort_list[:50]),
    size= alt.Size('counts', type="quantitative",scale=alt.Scale(range=[15, 500])),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    color= alt.Color('ordem', type="nominal", title="Order", scale=alt.Scale(domain=ordens, range=cores)),
    tooltip= alt.Tooltip(['collector_full_name', 'ano_coleta', 'counts', 'ordem'])
)

g2 = g2.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

# saving chart
g2.save('./graphs/collector/counts_per_collector-top_50.html')

g2

In [34]:
g2 = alt.Chart(teste1[teste1['collector_full_name'].isin(sort_list[50:900])], 
               width=800, height=600).mark_circle().encode(
    x= alt.X('ano_coleta', type='ordinal', title='Collected Year'),
    y= alt.Y('collector_full_name', type='nominal', title='Collector Name', sort= sort_list[50:900]),
    size= alt.Size('counts', type="quantitative",scale=alt.Scale(range=[15, 120])),  # range sets the circle size
    order= alt.Order('counts', sort='descending'),  # smaller points in front
    color= alt.Color('ordem', type="nominal", title="Order", scale=alt.Scale(domain=ordens, range=cores)),
    tooltip= alt.Tooltip(['collector_full_name', 'ano_coleta', 'counts', 'ordem'])
)

g2 = g2.configure_title(fontSize=16).configure_axis(
    labelFontSize=12,
    titleFontSize=12
).configure_legend(
    labelFontSize=12,
    titleFontSize=12
)

# saving chart
g2.save('./graphs/collector/counts_per_collector-menos_freq.html')

# g2

<br>

**The end!**

-----