# 1 - Data Treatment

This notebook applies the data treatments and adjustments needed to produce high-quality visualizations.

This notebook outputs <font color='blue'>treated_db.csv</font>, a properly treated subset of the original database.

-----

Specifically, for the `repteis` database, the file we'll read is <font color='blue'>Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls</font>

<font color='red'>**P.S.:** the plan is to turn all the code in this notebook into a `.py` script with a command-line interface that parses a file and produces a treated CSV file right away.</font>
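As a rough idea of what that command-line entry point could look like, here is a minimal sketch using `argparse` from the standard library. The script layout, the argument names (`input`, `--output`, `--sheet`) and the `main` function are all assumptions made for illustration; nothing like this exists in the project yet.

```python
import argparse


def build_parser():
    """Build the argument parser for a hypothetical treatment script."""
    parser = argparse.ArgumentParser(
        description="Treat a raw collection spreadsheet and write a treated CSV."
    )
    parser.add_argument("input", help="path to the raw .xls file")
    parser.add_argument(
        "-o", "--output", default="treated_db.csv",
        help="path of the treated CSV to write (default: treated_db.csv)",
    )
    parser.add_argument(
        "--sheet", default=None,
        help="sheet name to read (default: first sheet in the workbook)",
    )
    return parser


def main(argv=None):
    """Parse the arguments; the treatment steps of this notebook would go here."""
    args = build_parser().parse_args(argv)
    # Placeholder: read args.input, apply the treatments, write args.output.
    print(f"Would treat {args.input!r} and write {args.output!r}")
    return args
```

Calling `main(["my_file.xls", "-o", "out.csv"])` from Python mimics running the script from a shell, which makes the parser easy to try out before the treatment code is moved over.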

## Imports

In [1]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualizations for data analytics
import seaborn as sns
import matplotlib.pyplot as plt

# proprietary functions in ./src/MNViz.py
from src.MNViz import *

## Importing data

In [2]:
excel = pd.ExcelFile('./data/Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls')
sheet_name = excel.sheet_names

print('The excel file contains the following sheets:', sheet_name)
print('\nDatabase is in sheet:', sheet_name[0])

The excel file contains the following sheets: ['Repteis-2020-02-11-csv']

Database is in sheet: Repteis-2020-02-11-csv


In [3]:
# `sep` and `encoding` are CSV options and are not needed when parsing an Excel sheet
db = excel.parse(sheet_name[0])
repteis = db.copy()

print(f'The database has {db.shape[0]} rows and {db.shape[1]} columns.')

The database has 23119 rows and 114 columns.


## Adjusting column names

### removing '\n', '\t', and other special characters

In [4]:
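# strip real newline and tab characters (r'\n' would only match a literal backslash + n)
repteis.columns = [str(col).replace('\n', '').replace('\t', '') for col in repteis.columns]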
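To make the column-name cleanup concrete, here is a small self-contained sketch on made-up labels. The helper name `clean_column_name` and the sample labels are assumptions for illustration only. It also shows why the choice between `'\n'` and `r'\n'` matters: the raw string is two characters (a backslash and an `n`), not a newline.

```python
def clean_column_name(col):
    """Remove newline, tab and carriage-return characters from a label, then trim it."""
    name = str(col)
    for ch in ("\n", "\t", "\r"):
        name = name.replace(ch, "")
    return name.strip()


# Made-up labels, including a non-string one as can happen with Excel headers.
raw_labels = ["Collector\nLastName1", "Data\tColeta ", 42]
cleaned = [clean_column_name(label) for label in raw_labels]
print(cleaned)  # ['CollectorLastName1', 'DataColeta', '42']

# A raw string keeps the backslash, so it never matches an actual newline.
print(len("\n"), len(r"\n"))  # 1 2
```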

In [5]:
for col in repteis.columns:
    if 'collector' in col.lower():
        print(col)

CollectorLastName1
CollectorFirstName1
CollectorMiddleName1
CollectorLastName2
CollectorFirstName2
CollectorMiddleName2
CollectorLastName3
CollectorFirstName3
CollectorMiddleName3
CollectorLastName4
CollectorFirstName4
CollectorMiddleName4
CollectorLastName5
CollectorFirstName5
CollectorMiddleName5
CollectorLastName6
CollectorFirstName6
CollectorMiddleName6
CollectorLastName7
CollectorFirstName7
CollectorMiddleName7
CollectorLastName8
CollectorFirstName8
CollectorMiddleName8
CollectorRemark


## Adjusting Determiner and Collector Names

In [6]:
names_col = ['DeterminatorLastName1', 'DeterminatorFirstName1', 'DeterminatorLastName2',
             'DeterminatorFirstName2', 'CollectorLastName1', 'CollectorFirstName1', 
             'CollectorLastName2', 'CollectorFirstName2', 'CollectorLastName3', 'CollectorFirstName3',
             'CollectorLastName4', 'CollectorFirstName4', 'CollectorLastName5', 'CollectorFirstName5',
             'CollectorLastName6', 'CollectorFirstName6']

In [7]:
for name_col in names_col:
    if 'last' in name_col.lower():
        repteis[name_col] = repteis[name_col].apply(lambda x: treat_names(x, pos='last'))
    else:
        repteis[name_col] = repteis[name_col].apply(treat_names)

### creating a column joining first and last names

In [8]:
# TODO: do this later (look at the notebook that uses the researchers' names and bring that treatment here)
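Until that treatment is brought over, one possible way to join first and last names is sketched below. The helpers `is_missing` and `join_full_name` are hypothetical and are not part of `src/MNViz.py`; the sketch assumes missing values show up as `None` or `NaN`, as they usually do after reading a spreadsheet with pandas.

```python
import math


def is_missing(value):
    """True for None and for float NaN (how pandas marks empty cells)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def join_full_name(first, last):
    """Join first and last name, skipping missing or blank parts."""
    parts = [str(p).strip() for p in (first, last) if not is_missing(p)]
    parts = [p for p in parts if p]  # drop parts that were only whitespace
    return " ".join(parts) if parts else None


print(join_full_name("Ana", "Silva"))         # Ana Silva
print(join_full_name(float("nan"), "Silva"))  # Silva
print(join_full_name(None, None))             # None
```

Applied to the table, this could be used as `[join_full_name(f, l) for f, l in zip(repteis['CollectorFirstName1'], repteis['CollectorLastName1'])]`, with the result stored in a new column whose name is still to be decided.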

## Treating taxon columns

In [9]:
taxon_columns = ['Kingdom', 'Phylum', 'Class', 'Ordem', 'Familia', 'Genero_ent',
                 'Genero_atual', 'Especie_ent', 'Especie_atual', 'Subespecie_ent',
                 'Subespecie_atual']  # selecting taxonomy columns

treat_taxon_columns(repteis, taxon_columns)

Adjusting column Kingdom
Adjusting column Phylum
Adjusting column Class
Adjusting column Ordem
Adjusting column Familia
Adjusting column Genero_ent
Adjusting column Genero_atual
Adjusting column Especie_ent
Adjusting column Especie_atual
Adjusting column Subespecie_ent
Adjusting column Subespecie_atual


In [10]:
repteis['Kingdom'].value_counts()

Animalia    23119
Name: Kingdom, dtype: int64

## Catching Year

In [11]:
repteis['ano_determinacao'] = repteis['DataDaDeterminacao'].apply(catch_year)
repteis['ano_coleta'] = repteis['DataColetaInicial'].apply(catch_year)
repteis['ano_entrada'] = repteis['DataDeEntrada'].apply(catch_year)

In [12]:
repteis['ano_determinacao'] = repteis['ano_determinacao'].apply(str_with_nan2int)  # has NaN
repteis['ano_coleta'] = repteis['ano_coleta'].apply(str_with_nan2int)  # has NaN
repteis['ano_entrada'] = repteis['ano_entrada'].apply(str_with_nan2int)  # has NaN
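`catch_year` and `str_with_nan2int` are defined in `src/MNViz.py` and are not shown in this notebook. As a rough illustration of the idea only, the hypothetical stand-in below pulls a four-digit year out of a date-like value and returns `None` when no year is found. The accepted range (1500 to 2099) is an assumption, not something taken from the project.

```python
import datetime
import re

# A 4-digit year between 1500 and 2099 that is not part of a longer number.
YEAR_PATTERN = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def extract_year(value):
    """Return the first plausible year found in `value` as an int, or None."""
    if value is None:
        return None
    match = YEAR_PATTERN.search(str(value))
    return int(match.group(1)) if match else None


print(extract_year("28/04/1998"))                  # 1998
print(extract_year(datetime.date(2020, 4, 28)))    # 2020
print(extract_year(float("nan")))                  # None
print(extract_year("s.d."))                        # None
```

If the years should stay integers in the DataFrame despite the missing values, pandas' nullable `Int64` dtype is one option worth considering, since a plain `int64` column cannot hold `NaN`.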