# Data Treatment (Repteis Database)

This notebook applies all the data treatments and adjustments needed to produce high-quality visualizations.

Its output is <font color='blue'>treated_db.csv</font>, a properly treated subset of the original database.

-----

Specifically, for the `repteis` database, the file we'll read is <font color='blue'>Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls</font>

<font color='red'>**p.s.:** the idea is to turn all the code in this notebook into a `.py` file with a command-line interface that parses a file and produces a treated CSV file right away.</font>

Is it feasible? We're selecting many different columns, so reading the column names from a `.txt` file might be a good solution.
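
A minimal sketch of what such a command-line tool could look like, assuming the column names live in a text file with one name per line. The function names, the argument names, and the `#` comment convention are all hypothetical, and the treatment steps themselves are left as a placeholder:

```python
import argparse
from pathlib import Path

import pandas as pd


def read_column_list(path):
    """Return the column names listed in a text file, one per line.

    Blank lines and lines starting with '#' are skipped (assumed convention).
    """
    lines = Path(path).read_text(encoding='utf-8').splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith('#')]


def build_parser():
    """Build the (hypothetical) command-line interface."""
    parser = argparse.ArgumentParser(
        description='Treat a raw database and export a CSV subset.')
    parser.add_argument('input', help='path to the raw Excel file')
    parser.add_argument('--columns', required=True,
                        help='text file with one column name per line')
    parser.add_argument('--output', default='treated_db.csv',
                        help='path of the treated CSV file')
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    columns = read_column_list(args.columns)
    db = pd.read_excel(args.input)

    # placeholder: the treatment steps from this notebook would run here

    # fail early if a requested column does not exist in the input
    missing = [col for col in columns if col not in db.columns]
    if missing:
        raise SystemExit(f'Columns not found in input: {missing}')

    db[columns].to_csv(args.output, sep=';', encoding='utf-8-sig', index=False)
```

Calling `main()` under an `if __name__ == '__main__':` guard would make the file usable from a terminal. Keeping the column list in a separate file means the same script could serve databases with different column names.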

<br>

<font size='5'>**Column equivalence:** Repteis and Crustaceas</font>

**Different names:** <br>
- Species1: Especie_ent or Especie_atual
- Species Author1: ?
- Type Status1: Type Status 1
- Qualifier1: Qualificador_ent or Qualificador_atual
- Determiner First Name1: DeterminatorFirstName1
- Determiner Middle1: DeterminatorMiddleInitial1
- Determiner Last Name1: DeterminatorLastName1
- Determined Date1: DataDaDeterminacao

## Imports

In [1]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualizations for data analytics
import seaborn as sns
import matplotlib.pyplot as plt

# proprietary functions in ./src/MNViz.py
from src.MNViz import *

## Importing data

In [2]:
excel = pd.ExcelFile('./data/Compilacao Livros Repteis - 2 a 10 - 2020_04_28.xls')
sheet_name = excel.sheet_names

print('The excel file contains the following sheets:', sheet_name)
print('\nDatabase is in sheet:', sheet_name[0])

The excel file contains the following sheets: ['Repteis-2020-02-11-csv']

Database is in sheet: Repteis-2020-02-11-csv


In [3]:
db = excel.parse(sheet_name[0], sep=';', encoding='utf-8-sig')
repteis = db.copy()

print(f'The database has {db.shape[0]} rows and {db.shape[1]} columns.')

The database has 23119 rows and 114 columns.


## Adjusting column names

### removing '\n', '\t', and other special characters

In [4]:
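# remove newline and tab characters (a raw r'\n' would only match a literal backslash + 'n')
repteis.columns = [str(col).replace('\n', '').replace('\t', '') for col in repteis.columns]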
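
A small illustration of why the escape matters when stripping control characters from column names. The column names below are made up for the example:

```python
# made-up column names containing a real newline and a real trailing tab
raw_columns = ['Collector\nLastName1', 'Type Status 1\t']

# r'\n' is a backslash followed by 'n', so it does not match a real newline
unchanged = raw_columns[0].replace(r'\n', '')

# '\n' and '\t' are the actual control characters
cleaned = [col.replace('\n', '').replace('\t', '') for col in raw_columns]

print(repr(unchanged))
print(cleaned)
```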

In [5]:
# for col in repteis.columns:
#     if 'collector' in col.lower():
#         print(col)

## Adjusting Determiners and Collectors Names

In [6]:
names_col = ['DeterminatorLastName1', 'DeterminatorFirstName1', 'DeterminatorLastName2',
             'DeterminatorFirstName2', 'CollectorLastName1', 'CollectorFirstName1', 
             'CollectorLastName2', 'CollectorFirstName2', 'CollectorLastName3', 'CollectorFirstName3',
             'CollectorLastName4', 'CollectorFirstName4', 'CollectorLastName5', 'CollectorFirstName5',
             'CollectorLastName6', 'CollectorFirstName6']

In [7]:
for name_col in names_col:
    if 'last' in name_col.lower():
        repteis[name_col] = repteis[name_col].apply(lambda x: treat_names(x, pos='last'))
    else:
        repteis[name_col] = repteis[name_col].apply(treat_names)

### creating column joining First and Last names

In [8]:
# TODO: do this later (check the notebook that uses the researchers' names and add that treatment here)
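
Until that treatment is ported here, one possible way to build the joined column is sketched below. It assumes the first and last name columns have already been cleaned. `join_names` and the `collector_full_name1` column are hypothetical names, not part of `src/MNViz.py`, and the toy data is invented:

```python
import pandas as pd

# toy data standing in for the treated name columns
toy = pd.DataFrame({
    'CollectorFirstName1': ['Ana', None, 'Joao'],
    'CollectorLastName1': ['Silva', 'Souza', None],
})


def join_names(first, last):
    """Join first and last name, skipping missing or empty parts."""
    parts = [str(p).strip() for p in (first, last)
             if pd.notna(p) and str(p).strip()]
    return ' '.join(parts) if parts else None


# hypothetical output column name
toy['collector_full_name1'] = [
    join_names(first, last)
    for first, last in zip(toy['CollectorFirstName1'], toy['CollectorLastName1'])
]
print(toy['collector_full_name1'].tolist())
```

Building the name from whichever parts exist keeps records with only a last name, which a plain `first + ' ' + last` concatenation would turn into missing values.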

## Treating taxon columns

In [9]:
taxon_columns = ['Kingdom', 'Phylum', 'Class', 'Ordem', 'Familia', 'Genero_ent',
                 'Genero_atual', 'Especie_ent', 'Especie_atual', 'Subespecie_ent',
                 'Subespecie_atual']  # selecting taxonomy columns

treat_taxon_columns(repteis, taxon_columns)

Adjusting column Kingdom
Adjusting column Phylum
Adjusting column Class
Adjusting column Ordem
Adjusting column Familia
Adjusting column Genero_ent
Adjusting column Genero_atual
Adjusting column Especie_ent
Adjusting column Especie_atual
Adjusting column Subespecie_ent
Adjusting column Subespecie_atual


## Adjusting Genus and Species

In [None]:
# tip from Manoela: the specific epithet must be all lowercase (species and subspecies, in this case)
repteis['Especie_atual'] = repteis['Especie_atual'].str.lower()
repteis['Subespecie_atual'] = repteis['Subespecie_atual'].str.lower()

<br>

## Joining `Genero` and `Especie` (together they fully identify each animal's species)

In [None]:
repteis['genero_e_especie_ent'] = repteis['Genero_ent'] + ' ' + repteis['Especie_ent']
repteis['genero_e_especie_atual'] = repteis['Genero_atual'] + ' ' + repteis['Especie_atual']

repteis['genero_e_especie_ent'] = repteis['genero_e_especie_ent'].str.lower().str.capitalize()
repteis['genero_e_especie_atual'] = repteis['genero_e_especie_atual'].str.lower().str.capitalize()
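
One behavior of this concatenation worth checking on real data: if either the genus or the species is missing, the combined value is missing too. A short sketch with invented toy values:

```python
import pandas as pd

# invented values: one complete record and two with a missing part
toy = pd.DataFrame({
    'Genero_atual': ['TROPIDURUS', 'Boa', None],
    'Especie_atual': ['Torquatus', None, 'constrictor'],
})

# capitalize() upper-cases the first letter and lower-cases the rest
binomial = (toy['Genero_atual'] + ' ' + toy['Especie_atual']).str.capitalize()
print(binomial.tolist())
```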

## Catching Month and Year

Date columns: `DataDeEntrada`, `DataDaDeterminacao`, `DataColetaInicial`

In [11]:
repteis['ano_determinacao'] = repteis['DataDaDeterminacao'].apply(lambda x: getMonthAndYear(x)[-1])
repteis['ano_coleta'] = repteis['DataColetaInicial'].apply(lambda x: getMonthAndYear(x)[-1])
repteis['ano_entrada'] = repteis['DataDeEntrada'].apply(lambda x: getMonthAndYear(x)[-1])

repteis['mes_determinacao'] = repteis['DataDaDeterminacao'].apply(lambda x: getMonthAndYear(x)[0])
repteis['mes_coleta'] = repteis['DataColetaInicial'].apply(lambda x: getMonthAndYear(x)[0])
repteis['mes_entrada'] = repteis['DataDeEntrada'].apply(lambda x: getMonthAndYear(x)[0])

Converting to int:

In [12]:
repteis['ano_determinacao'] = repteis['ano_determinacao'].apply(str_with_nan2int) #has NaN
repteis['ano_coleta'] = repteis['ano_coleta'].apply(str_with_nan2int) #has NaN
repteis['ano_entrada'] = repteis['ano_entrada'].apply(str_with_nan2int) #has NaN

repteis['mes_determinacao'] = repteis['mes_determinacao'].apply(str_with_nan2int) #has NaN
repteis['mes_coleta'] = repteis['mes_coleta'].apply(str_with_nan2int) #has NaN
repteis['mes_entrada'] = repteis['mes_entrada'].apply(str_with_nan2int) #has NaN
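
For comparison, a sketch of the same month and year extraction using only pandas. This is not how `getMonthAndYear` or `str_with_nan2int` are implemented (they live in `src/MNViz.py`). The `day/month/year` date format and the toy values are assumptions:

```python
import pandas as pd

# toy data; the '%d/%m/%Y' format is an assumption about the raw dates
toy = pd.DataFrame({'DataColetaInicial': ['15/03/1987', None, 'sem data']})

# unparseable or missing values become NaT instead of raising
dates = pd.to_datetime(toy['DataColetaInicial'], format='%d/%m/%Y', errors='coerce')

# the nullable 'Int64' dtype keeps integers and missing values in one column
toy['ano_coleta'] = dates.dt.year.astype('Int64')
toy['mes_coleta'] = dates.dt.month.astype('Int64')
print(toy)
```

The nullable integer dtype avoids the usual fallback to floats when a column of years contains missing entries.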

<br>

## Adjusting `Altitude` column

<font color='red'>**p.s.:** I'm assuming all values use the same unit of measurement (meters).</font>

In [13]:
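# raw string avoids the invalid '\d' escape; expand=False returns a Series
repteis['altitude'] = repteis['MinAltitude'].str.extract(r'(\d+)', expand=False)
repteis['max_altitude'] = repteis['MaxAltitude'].str.extract(r'(\d+)', expand=False)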
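
To show what the extraction keeps and drops, here is a sketch on invented altitude strings. The conversion with `pd.to_numeric` is an extra step that the notebook itself does not perform:

```python
import pandas as pd

# invented examples of how altitudes might be written in the raw data
raw = pd.Series(['850 m', '1200', None, 'ca. 300m'])

# keep the first run of digits; rows without digits become missing
digits = raw.str.extract(r'(\d+)', expand=False)

# optional extra step: turn the digit strings into numbers
altitude = pd.to_numeric(digits)
print(altitude.tolist())
```

Because only the first run of digits is captured, a value written with a thousands separator (such as `1.200`) would be truncated, so it may be worth inspecting the raw column for such cases.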

<br>

## Adjusting Types (`NotasTaxonomicas`)

In [14]:
# to be added later
# repteis['Type Status 1'].str.strip().str.lower().str.capitalize()

<br>

## Adjusting `Ordem` column

In [15]:
repteis['Ordem'] = repteis['Ordem'].apply(correct_squamata)
repteis['Ordem'] = repteis['Ordem'].apply(correct_nd)

<br>

## Selecting Subset of Main DB

In [20]:
# 'genero_e_especie_ent', 'genero_e_especie_atual'
selected_columns = ['NumeroDeCatalogo','DataDeEntrada','DataDaDeterminacao','DataColetaInicial',
                    'ano_entrada', 'mes_entrada', 'ano_determinacao', 'mes_determinacao',
                    'ano_coleta', 'mes_coleta',
                    'Class','Kingdom', 
                    'Genero_ent', 'Genero_atual', 'Especie_ent', 'Especie_atual','Subespecie_atual','Type Status 1',
                    'DeterminatorFirstName1', 'DeterminatorLastName1','altitude','max_altitude',
                    'Ordem', 'Familia', 'Phylum', 'Qualificador_atual', 'NotasTaxonomicas']

In [21]:
NewTable = repteis[selected_columns].copy()

## Renaming columns

Setting new standardized column names to facilitate future steps.

In [22]:
renames = {
    'NumeroDeCatalogo':'numero_catalogo',
    'DataDeEntrada':'data_entrada',
    'DataDaDeterminacao':'data_determinacao',
    'DataColetaInicial':'data_coleta',
    'Class':'class',
    'Kingdom':'kingdom',
    'Genero_ent':'genero_ent',
    'Genero_atual':'genero_atual',
    'Especie_ent':'especie_ent',
    'Especie_atual':'especie_atual',
    'Subespecie_atual':'subespecie_atual',
    'Type Status 1':'type',
    'Ordem':'ordem',
    'Familia':'familia',
    'Phylum':'phylum',
    'Qualificador_atual':'qualificador_atual',
    'NotasTaxonomicas':'notas_taxonomicas'
}

In [23]:
NewTable = NewTable.rename(columns=renames)
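
Since `rename` silently skips keys that match no column, a quick check can catch misspelled keys in the mapping. A sketch on an empty toy frame, with one key misspelled on purpose:

```python
import pandas as pd

toy = pd.DataFrame(columns=['DataColetaInicial', 'Class'])

# the first key is misspelled on purpose (lowercase 'i')
toy_renames = {'DataColetainicial': 'data_coleta', 'Class': 'class'}

# keys that do not match any column are skipped by the default rename()
missing = [old for old in toy_renames if old not in toy.columns]
print(missing)

# errors='raise' turns the silent skip into a KeyError
try:
    toy.rename(columns=toy_renames, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```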

<br>

## Exporting to `CSV`

- name: <font color='blue'>./data/treated_db.csv</font>
- sep: `';'`
- encoding: `'utf-8-sig'`

In [25]:
NewTable.to_csv('./data/treated_db.csv', sep=';', encoding='utf-8-sig')
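
Downstream notebooks need the same separator and encoding to read the file back. A round-trip sketch on toy data, written to a temporary folder rather than `./data`:

```python
import os
import tempfile

import pandas as pd

# toy stand-in for the treated table
toy = pd.DataFrame({'numero_catalogo': [1, 2],
                    'ordem': ['Squamata', 'Testudines']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'treated_db.csv')
    toy.to_csv(path, sep=';', encoding='utf-8-sig')

    # the index was written as the first column, so read it back as the index
    back = pd.read_csv(path, sep=';', encoding='utf-8-sig', index_col=0)

print(back)
```

Passing `index=False` to `to_csv` would drop the extra index column if it is not needed.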