# Data Treatment (Poliqueta Database)

In this notebook I apply all the data treatments and adjustments needed to produce high-quality visualizations.

This notebook outputs <font color='blue'>treated_db.csv</font>, a properly treated subset of the original database.

-----

Specifically, for the `poliqueta` database, the files we'll read are <font color='blue'>IBUFRJ27.07.2020 - visualização.xlsx</font> and <font color='blue'>MNRJP27.07.2020 - visualização.xls</font>.

<font color='red'>**p.s.:** the idea is to turn all the code in this notebook into a `.py` file with a CLI that parses a file and produces a treated CSV file right away. </font>

Is it feasible? We select many different columns, so reading the column names from a `.txt` file might be a good solution.
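
A minimal sketch of that idea, assuming the column names live in a plain text file with one name per line. The function names, the `--columns` and `--output` flags, and the `#` comment convention are all hypothetical; nothing like this exists in `src/MNViz.py` yet.

```python
import argparse
from pathlib import Path


def read_column_list(path):
    """Return the non-empty, non-comment lines of a text file as column names."""
    lines = Path(path).read_text(encoding='utf-8').splitlines()
    return [ln.strip() for ln in lines
            if ln.strip() and not ln.lstrip().startswith('#')]


def build_parser():
    """Hypothetical CLI: input workbook, column list file, output CSV path."""
    parser = argparse.ArgumentParser(description='Treat a collection database.')
    parser.add_argument('input', help='path to the .xls/.xlsx file')
    parser.add_argument('--columns', required=True,
                        help='text file with one column name per line')
    parser.add_argument('--output', default='treated_db.csv',
                        help='path of the treated CSV file')
    return parser
```

Keeping the column list outside the code would let the same script serve other collections, each with its own `.txt` file.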

## Imports

In [1]:
import datetime
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualizations for data analytics
import seaborn as sns
import matplotlib.pyplot as plt

# proprietary functions in ./src/MNViz.py
from src.MNViz import *

## Importing data

In [2]:
excel = pd.ExcelFile('./data/IBUFRJ27.07.2020 - visualização.xlsx')
sheet_name = excel.sheet_names

print('The excel file contains the following sheets:', sheet_name)
print('\nDatabase is in sheet:', sheet_name[0])

The excel file contains the following sheets: ['Planilha1', 'Planilha2']

Database is in sheet: Planilha1


In [3]:
# sep and encoding only make sense for text files, so they are not passed when parsing an Excel sheet
db = excel.parse(sheet_name[0])
poliqueta = db.copy()

print(f'The database has {db.shape[0]} rows and {db.shape[1]} columns.')

The database has 4231 rows and 91 columns.


## Adjusting column names

### removing '\n', '\t', and other special characters

In [4]:
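# '\n' and '\t' must be regular (non-raw) strings to match real newline and tab characters
poliqueta.columns = [str(col).replace('\n', '').replace('\t', '') for col in poliqueta.columns]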
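
As a small illustration of the cleaning step, here is a sketch that uses a regular expression instead of chained `replace` calls. The helper `clean_column_name` and the sample column names are made up for the example. Note that `'\n'` in a regular string is a newline character, while `r'\n'` is a backslash followed by the letter `n`. Unlike plain removal, this variant replaces whitespace runs with a single space so that words do not get glued together.

```python
import re

import pandas as pd


def clean_column_name(col):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    return re.sub(r'\s+', ' ', str(col)).strip()


# Made-up column names with a newline, a tab and a trailing space.
df = pd.DataFrame(columns=['Catalog\nNumber', 'Start\tDate ', 'Kingdom'])
df.columns = [clean_column_name(c) for c in df.columns]
print(list(df.columns))
```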

## Adjusting Determiners and Collectors Names

<font color='red' size='5'>**p.s.:** Determiner Last Name 1 contains the FULL name of the Researcher </font>

Determiner Last Name 2 is all empty.

In [5]:
names_col = ['Collector Last Name 1', 'Collector First Name 1', 
             'Collector Last Name 2', 'Collector First Name 2', 'Collector Last Name 3', 
             'Collector First Name 3', 'Collector Last Name 4', 'Collector First Name 4']

In [6]:
for name_col in names_col:
    if 'last' in name_col.lower():
        poliqueta[name_col] = poliqueta[name_col].apply(lambda x: treat_names(x, pos='last'))
    else:
        poliqueta[name_col] = poliqueta[name_col].apply(treat_names)

In [7]:
def getFirstName(name):
    '''Return the first entry of a ';'-separated list of names (NaN if missing).'''
    name = str(name).split(';')
    if name[0] == 'nan':
        return np.nan  # np.NAN was removed in NumPy 2.0
    return name[0].strip()

In [8]:
poliqueta['Determiner Last Name 1'] = poliqueta['Determiner Last Name 1'].apply(getFirstName)

### creating column joining First and Last names

I'm doing this only for the first collector; the determiner's full name is already stored in `Determiner Last Name 1`.

In [9]:
# repteis['DeterminatorFirst_and_LastName'] = repteis['DeterminatorFirstName1'] + ' ' + repteis['DeterminatorLastName1']

poliqueta['CollectorFirst_and_LastName'] = poliqueta['Collector First Name 1'] + ' ' + poliqueta['Collector Last Name 1']
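
One thing to keep in mind with this kind of concatenation: if either part is missing, the joined value is NaN as well. The sketch below shows that behaviour on made-up names and one possible alternative that keeps whichever part is available. It is an illustration only, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up names; the second row lacks a first name, the third a last name.
names = pd.DataFrame({
    'first': ['Ana', np.nan, 'Paulo'],
    'last': ['Silva', 'Costa', np.nan],
})

# Plain concatenation: any missing part makes the whole result NaN.
plain = names['first'] + ' ' + names['last']

# Alternative: treat missing parts as empty strings, then turn fully
# empty results back into NaN.
joined = (names['first'].fillna('') + ' ' + names['last'].fillna('')).str.strip()
joined = joined.replace('', np.nan)
```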

## Treating taxon columns

In [10]:
taxon_columns = ['Kingdom', 'Phylum 1', 'Class 1', 'Order 1', 'Family 1', 'Genus 1',
                 'Species 1']  # selecting taxonomy columns

treat_taxon_columns(poliqueta, taxon_columns)

Adjusting column Kingdom
Adjusting column Phylum 1
Adjusting column Class 1
Adjusting column Order 1
Adjusting column Family 1
Adjusting column Genus 1
Adjusting column Species 1


## Adjusting Genus and Species

In [11]:
# tip from Manoela: the specific epithet must be all lowercase (species and subspecies, in this case)
poliqueta['Species 1'] = poliqueta['Species 1'].str.lower()
poliqueta['Genus 1'] = poliqueta['Genus 1'].str.lower()

<br>

## Adding `Genus` and `Species` together (they completely identify each animal's species)

In [12]:
poliqueta['genus_and_species'] = poliqueta['Genus 1'] + ' ' + poliqueta['Species 1']

poliqueta['genus_and_species'] = poliqueta['genus_and_species'].str.lower().str.capitalize()
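
A short sketch of what `str.capitalize` does to the joined name, using illustrative values rather than rows from the database. It uppercases the first character and lowercases everything else, so it already yields the "Genus species" shape on its own.

```python
import numpy as np
import pandas as pd

# Illustrative values with inconsistent casing and one missing genus.
genus = pd.Series(['NEREIS', 'eunice', np.nan])
species = pd.Series(['Succinea', 'SEBASTIANI', 'sp.'])

# First letter uppercase, everything else lowercase; NaN stays NaN.
binomial = (genus + ' ' + species).str.capitalize()
print(binomial.tolist())
```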

## Catching Month and Year

date_columns = ['Cataloged Date', 'Return Date', 'Determined date 1', 'Start Date']

format: YYYY-mm-dd

<font color='red'>Return Date is empty</font>

In [13]:
def fetchDate(string):
    '''
    Fetch the month and year from a date string via the datetime library.
    
    If parsing fails, fall back to splitting a string shaped like YYYY-mm-dd.
    Returns a (month, year) tuple; missing values come back as NaN.
    '''
    string = str(string)
    if string.lower() == 'nan':
        return (np.nan, np.nan)
    
    try:
        date = datetime.datetime.strptime(string, '%Y-%m-%d')
        return (date.month, date.year)
    
    except ValueError:
        # strptime rejects strings that are not exactly YYYY-mm-dd
        str_list = string.split('-')
        year = str_list[0]
        # guard against strings without a month part
        month = str_list[1] if len(str_list) > 1 else np.nan
        
        return (month, year)
        

In [14]:
poliqueta['determined_month'] = poliqueta['Determined date 1'].apply(lambda x: fetchDate(x)[0])
poliqueta['start_month'] = poliqueta['Start Date'].apply(lambda x: fetchDate(x)[0])
poliqueta['cataloged_month'] = poliqueta['Cataloged Date'].apply(lambda x: fetchDate(x)[0])

poliqueta['determined_year'] = poliqueta['Determined date 1'].apply(lambda x: fetchDate(x)[1])
poliqueta['start_year'] = poliqueta['Start Date'].apply(lambda x: fetchDate(x)[1])
poliqueta['cataloged_year'] = poliqueta['Cataloged Date'].apply(lambda x: fetchDate(x)[1])
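
The same extraction can also be sketched in a vectorized way, pulling year and month out of each column in a single pass. This is an alternative under stated assumptions, not the notebook's method: the helper `split_year_month` is hypothetical, and it assumes values whose string form starts with `YYYY-mm`.

```python
import numpy as np
import pandas as pd


def split_year_month(dates):
    """Return a DataFrame with numeric 'year' and 'month' columns."""
    # Capture YYYY and mm from the start of each value's string form;
    # values that do not match (such as NaN) become NaN in both columns.
    parts = dates.astype(str).str.extract(r'^(?P<year>\d{4})-(?P<month>\d{1,2})')
    return parts.apply(pd.to_numeric, errors='coerce')


# Made-up sample: a full date, a missing value and a partial date.
sample = pd.Series(['2015-03-12', np.nan, '1998-00-00'])
result = split_year_month(sample)
print(result)
```

Because the pattern only looks at the start of the string, partial dates such as `1998-00-00` keep their year instead of being discarded as unparseable.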

converting to int

In [16]:
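def str_with_nan2int(string):
    '''Convert a value to int, keeping missing values as NaN.'''
    try:
        if not np.isnan(string):
            return int(string)
        else:
            return np.nan
    except TypeError:
        # np.isnan raises TypeError for strings
        if str(string).lower() == 'nan':
            return np.nan
        else:
            return int(string)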

In [17]:
poliqueta['determined_month'] = poliqueta['determined_month'].apply(str_with_nan2int) #has NaN
poliqueta['start_month'] = poliqueta['start_month'].apply(str_with_nan2int) #has NaN
poliqueta['cataloged_month'] = poliqueta['cataloged_month'].apply(str_with_nan2int) #has NaN

poliqueta['determined_year'] = poliqueta['determined_year'].apply(str_with_nan2int) #has NaN
poliqueta['start_year'] = poliqueta['start_year'].apply(str_with_nan2int) #has NaN
poliqueta['cataloged_year'] = poliqueta['cataloged_year'].apply(str_with_nan2int) #has NaN
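
A caveat on these conversions: a column that mixes integers and NaN is stored by pandas as floats, so the values still appear as `3.0`, `7.0` and so on. If true integers are wanted, one option is the nullable `Int64` dtype. The sketch below uses made-up values and is only an alternative to consider.

```python
import numpy as np
import pandas as pd

# Made-up month values: strings, an integer and a missing entry.
months = pd.Series(['03', np.nan, 7, '12'])

# Element-wise conversion: ints mixed with NaN end up in a float column.
as_float = months.apply(lambda v: np.nan if pd.isna(v) else int(v))

# Nullable integer dtype: values stay integers, missing entries become <NA>.
as_int = pd.to_numeric(months, errors='coerce').astype('Int64')

print(as_float.dtype)
print(as_int.dtype)
```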

<br>

## Adjusting `Depth` columns

<font color='red'>**p.s.:** I'm assuming it's all on the same measure unit (in meters) </font>

In [23]:
poliqueta['min_depth'] = poliqueta['Min Depth'].astype(float)
poliqueta['max_depth'] = poliqueta['Max Depth'].astype(float)
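
Since the unit assumption is not verified anywhere, a cheap sanity check is to list the rows whose minimum depth exceeds the maximum depth. Such rows would hint at data-entry or unit problems. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up depths; the second row has min > max.
depths = pd.DataFrame({
    'min_depth': [10.0, 250.0, np.nan],
    'max_depth': [30.0, 25.0, 40.0],
})

# Comparisons involving NaN are False, so rows with missing depths are not flagged.
suspicious = depths[depths['min_depth'] > depths['max_depth']]
print(suspicious)
```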

<br>

## Adjusting Latitude and Longitude

In [26]:
poliqueta['Latitude'] = poliqueta['Latitude'].apply(convert2float)
poliqueta['Longitude'] = poliqueta['Longitude'].apply(convert2float)

<br>

## Adjusting locality columns

- Continent
- WaterBody
- Country
- State
- County
- Locality Name

In [37]:
# columns look fine. Add treatment later, if necessary

<br>

## Adjusting Types

In [16]:
# add later (not needed so far)
# poliqueta['Type Status 1'].str.strip().str.lower().str.capitalize()

<br>

## Adjusting `Order` column

In [57]:
def correct_order(order):
    order = str(order)
    
    if order.lower() == 'nan':
        return np.nan  # np.NAN was removed in NumPy 2.0
    else:
        return order.strip().capitalize()

In [60]:
poliqueta['Order 1'] = poliqueta['Order 1'].apply(correct_order)
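
For string columns, the same clean-up can be sketched with the vectorized `.str` accessor, which leaves NaN untouched. One difference from `correct_order`: non-string values (numbers, for instance) would become NaN here instead of being converted to text. The sample values are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative order names with stray spaces and inconsistent casing.
orders = pd.Series([' PHYLLODOCIDA', 'eunicida ', np.nan])

# Strip whitespace, then uppercase the first letter and lowercase the rest.
cleaned = orders.str.strip().str.capitalize()
print(cleaned.tolist())
```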

<br>

## Selecting Subset of Main DB

In [70]:
# 'genero_e_especie_ent', 'genero_e_especie_atual'
selected_columns = ['Catalog Number','Cataloged Date','Determined date 1','Start Date',
                    'start_year', 'start_month', 'determined_year', 'determined_month',
                    'cataloged_month', 'cataloged_year',
                    'Class 1','Kingdom', 
                    'Genus 1', 'Species 1', 'Family 1', 'Phylum 1','Order 1','Type Status 1',
                    'Species Author 1', 'Species Author Year 1','Determiner Last Name 1',
                    'CollectorFirst_and_LastName', 'Collector First Name 1', 'Collector Last Name 1',
                    'Qualifier 1', 'min_depth', 'max_depth',
                    'Latitude', 'Longitude', 'Continent',
                    'WaterBody', 'Country', 'State', 'County', 'Locality Name']

In [73]:
NewTable = poliqueta[selected_columns].copy()

## Renaming columns

Setting new standardized column names to facilitate future steps.

In [74]:
renames = {
    'Catalog Number':'catalog_number',
    'Cataloged Date':'cataloged_date',
    'Determined date 1':'determined_date',  # key must match the column's exact spelling (lowercase "date")
    'Class 1':'class',
    'Kingdom':'kingdom',
    'Genus 1':'genus',
    'Species 1':'species',
    'Family 1':'family',
    'Phylum 1':'phylum',
    'Subespecie_atual':'subespecie_atual',
    'Type Status 1':'type',
    'Order 1':'order',
    'Species Author 1':'author',
    'Species Author Year 1':'author_year',
    'Determiner Last Name 1':'determiner_full_name',
    'Collector First Name 1':'collector_first_name',
    'Collector Last Name 1':'collector_last_name',
    'Qualifier 1':'qualifier',
    'Latitude':'lat',
    'Longitude':'long', 
    'Continent':'continent',
    'WaterBody':'water_body',
    'Country':'country',
    'State':'state',
    'County':'county',
    'Locality Name':'locality',
    'CollectorFirst_and_LastName':'collector_full_name'
}

In [75]:
NewTable = NewTable.rename(columns=renames)
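
By default `DataFrame.rename` skips mapping keys that match no column, so a misspelled key goes unnoticed. The sketch below shows two ways to catch that on a made-up table: listing the unmatched keys, or passing `errors='raise'`.

```python
import pandas as pd

table = pd.DataFrame({'Catalog Number': [1], 'Order 1': ['Eunicida']})
# The second key is deliberately misspelled (lowercase "o").
renames = {'Catalog Number': 'catalog_number', 'order 1': 'order'}

# Default behaviour: the unmatched key is skipped without any warning.
silent = table.rename(columns=renames)

# Option 1: list the mapping keys that match no column.
unmatched = [old for old in renames if old not in table.columns]

# Option 2: ask pandas to raise a KeyError on unmatched keys.
raised = False
try:
    table.rename(columns=renames, errors='raise')
except KeyError:
    raised = True

print(list(silent.columns), unmatched, raised)
```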

<br>

## Exporting to `CSV`

name: <font color='blue'>./data/treated_db.csv</font>
sep: ';'
encoding: 'utf-8-sig'

In [76]:
NewTable.to_csv('./data/treated_db.csv', sep=';', encoding='utf-8-sig')
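
Note that `to_csv` also writes the DataFrame index as an unnamed first column unless `index=False` is passed. The round trip below, on a made-up table and a temporary file, shows the difference when the file is read back with the same separator and encoding. Whether the index column is wanted downstream is a choice this sketch does not settle.

```python
import os
import tempfile

import pandas as pd

table = pd.DataFrame({'catalog_number': [101, 102],
                      'order': ['Eunicida', 'Sabellida']})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'treated_db.csv')

    # Default: the index is written as the first column.
    table.to_csv(path, sep=';', encoding='utf-8-sig')
    with_index = pd.read_csv(path, sep=';', encoding='utf-8-sig')

    # index=False keeps only the data columns.
    table.to_csv(path, sep=';', encoding='utf-8-sig', index=False)
    without_index = pd.read_csv(path, sep=';', encoding='utf-8-sig')

print(list(with_index.columns))
print(list(without_index.columns))
```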