# Data Treatment (Poliqueta Database)

This notebook applies the data treatments and adjustments needed to produce high-quality visualizations.

Its output is <font color='blue'>treated_db.csv</font>, a properly treated subset of the original database.

-----

Specifically, for the `poliqueta` database, the files we'll read are <font color='blue'>IBUFRJ27.07.2020 - visualização.xlsx</font> and <font color='blue'>MNRJP27.07.2020 - visualização.xls</font>.


-----

<br>

# IBUFRJ27.07.2020 - visualização.xlsx

## Imports

In [71]:
import datetime
import unidecode
import numpy as np
import pandas as pd

from collections import defaultdict

# quick visualizations for data analytics
import seaborn as sns
import matplotlib.pyplot as plt

# proprietary functions in ./src/MNViz.py
from src.MNViz import *

## Importing data

In [72]:
excel = pd.ExcelFile('./data/IBUFRJ2021.10.11 - visualização.xlsx')
sheet_name = excel.sheet_names

print('The excel file contains the following sheets:', sheet_name)
print('\nDatabase is in sheet:', sheet_name[0])

The excel file contains the following sheets: ['Planilha1']

Database is in sheet: Planilha1


In [73]:
db = excel.parse(sheet_name[0])  # sep/encoding are CSV options and do not apply to Excel sheets
poliqueta = db.copy()

print(f'The database has {db.shape[0]} rows and {db.shape[1]} columns.')

The database has 4220 rows and 92 columns.


## Adjusting column names

### removing '\n', '\t', and other special characters

In [74]:
# use regular (non-raw) strings so that real line breaks and tabs are matched
poliqueta.columns = [str(col).replace('\n', '').replace('\t', '') for col in poliqueta.columns]
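
The difference between a raw and a regular string matters for this kind of cleanup: `r'\n'` is the two characters backslash and `n`, while `'\n'` is a real line break. The sketch below illustrates this on a made-up header name; `clean_header` is a hypothetical helper, not part of `src/MNViz.py`.

```python
def clean_header(name):
    """Remove line breaks and tabs from a column header (hypothetical helper)."""
    return str(name).replace('\n', '').replace('\t', '')

raw_header = 'Collector\nLast Name 1'   # made-up header containing a real line break

# A raw-string pattern looks for a literal backslash followed by 'n', so nothing changes.
unchanged = raw_header.replace(r'\n', '')

# A regular-string pattern matches the actual line-break character.
cleaned = clean_header(raw_header)

print(repr(unchanged))
print(repr(cleaned))
```

Printing with `repr` makes invisible characters such as `\n` visible, which is handy when inspecting headers read from spreadsheets.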

## Adjusting Determiners and Collectors Names

<font color='red' size='5'>**p.s.:** Determiner Last Name 1 contains the FULL name of the Researcher </font>

Determiner Last Name 2 is all empty.

In [75]:
names_col = ['Collector Last Name 1', 'Collector First Name 1', 
             'Collector Last Name 2', 'Collector First Name 2', 'Collector Last Name 3', 
             'Collector First Name 3', 'Collector Last Name 4', 'Collector First Name 4']

In [76]:
# for name_col in names_col:
#     if 'last' in name_col.lower():
#         poliqueta[name_col] = poliqueta[name_col].apply(lambda x: treat_names(x, pos='full'))
#     else:
#         poliqueta[name_col] = poliqueta[name_col].apply(treat_names)

for name_col in names_col:
    poliqueta[name_col] = poliqueta[name_col].fillna('')
    poliqueta[name_col] = poliqueta[name_col].apply(lambda x: treat_names(x, pos='full'))

In [77]:
def getFirstName(name, sep=';'):
    # returns the first name of a sep-separated list, or NaN for a missing value
    name = str(name).split(sep)
    if len(name) == 1 and name[0] == 'nan':
        return np.nan
    return name[0].strip()
    
def getName(name, pos=0, sep=';'):
    # returns the name at position pos, or NaN when that position does not exist
    list_names = str(name).split(sep)
    size = len(list_names)
    if pos == 0:
        return getFirstName(name, sep=sep)
    elif pos == -1:
        return list_names[pos].strip()
    elif size <= pos:
        return np.nan
    else:
        return list_names[pos].strip()

In [78]:
# there may be more than one determiner in that column, so the first one is stored separately from the second and third
poliqueta['determiner_full_name'] = poliqueta['Determiner Last Name 1'].apply(getFirstName)
poliqueta['determiner_full_name1'] = poliqueta['Determiner Last Name 1'].apply(lambda x: getName(x,pos=1))
poliqueta['determiner_full_name2'] = poliqueta['Determiner Last Name 1'].apply(lambda x: getName(x,pos=2))
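
To make the indexing behaviour concrete, here is a minimal standalone sketch of the same idea: split a `;`-separated cell and return the name at a given position, or a missing marker when that position does not exist. The sample string and the `nth_name` helper are illustrative assumptions, and `None` stands in for `np.nan`.

```python
def nth_name(cell, pos=0, sep=';'):
    """Return the pos-th name of a sep-separated cell, or None if it is absent."""
    if cell is None or str(cell).strip().lower() in ('', 'nan'):
        return None
    names = [part.strip() for part in str(cell).split(sep)]
    if pos >= len(names) or pos < -len(names):
        return None
    return names[pos] or None

sample = 'A. Souza; B. Lima'   # made-up cell with two determiners

first = nth_name(sample, 0)
second = nth_name(sample, 1)
third = nth_name(sample, 2)    # there is no third name in this cell
```

Stripping every part once, right after the split, keeps the position logic free of whitespace handling.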

### creating column joining First and Last names

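First and last names are joined for all four collectors. Determiners need no join in this database, because `Determiner Last Name 1` already holds the full name.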

In [79]:
# repteis['DeterminatorFirst_and_LastName'] = repteis['DeterminatorFirstName1'] + ' ' + repteis['DeterminatorLastName1']

poliqueta['CollectorFirst_and_LastName'] = poliqueta['Collector First Name 1'] + ' ' + poliqueta['Collector Last Name 1']
poliqueta['CollectorFirst_and_LastName2'] = poliqueta['Collector First Name 2'] + ' ' + poliqueta['Collector Last Name 2']
poliqueta['CollectorFirst_and_LastName3'] = poliqueta['Collector First Name 3'] + ' ' + poliqueta['Collector Last Name 3']
poliqueta['CollectorFirst_and_LastName4'] = poliqueta['Collector First Name 4'] + ' ' + poliqueta['Collector Last Name 4']

In [80]:
for col in ['CollectorFirst_and_LastName', 'CollectorFirst_and_LastName2', 
            'CollectorFirst_and_LastName3', 'CollectorFirst_and_LastName4']:
    poliqueta[col] = poliqueta[col].apply(lambda x: x if str(x) != ' ' else 'Non-identified')
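
The join-then-placeholder pattern can be sketched without pandas as follows. `full_name` is a hypothetical helper and the names are made up. Unlike the cell above, it strips each part before joining, so a record with only a last name does not end up with a leading space.

```python
def full_name(first, last, placeholder='Non-identified'):
    """Join first and last name, falling back to a placeholder when both are empty."""
    parts = [str(p).strip() for p in (first, last) if p is not None]
    joined = ' '.join(p for p in parts if p and p.lower() != 'nan')
    return joined if joined else placeholder

both = full_name('Ana', 'Silva')
only_last = full_name('', 'Silva')
neither = full_name('', '')
```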

## Treating taxon columns

In [81]:
taxon_columns = ['Kingdom', 'Phylum 1', 'Class 1', 'Order 1', 'Family 1', 'Genus 1',
                 'Species 1']  # selecting taxonomy columns

treat_taxon_columns(poliqueta, taxon_columns)

Adjusting column Kingdom
Adjusting column Phylum 1
Adjusting column Class 1
Adjusting column Order 1
Adjusting column Family 1
Adjusting column Genus 1
Adjusting column Species 1


In [82]:
for col in taxon_columns:
    poliqueta[col] = poliqueta[col].fillna('Non-identified')
    poliqueta[col] = poliqueta[col].astype(str)
    poliqueta[col] = poliqueta[col].apply(lambda x: x if str(x).lower() != 'nan' else 'Non-identified')

## Adjusting Genus and Species

In [83]:
# tip from Manoela: the specific epithet must be all lowercase (species and subspecies, in this case)
poliqueta['Species 1'] = poliqueta['Species 1'].str.lower()

# the genus is written with its first letter capitalized
poliqueta['Genus 1'] = poliqueta['Genus 1'].str.lower().str.capitalize()

<br>

## Joining `Genus 1` and `Species 1` (together they identify each animal's species)

In [84]:
poliqueta['genus_and_species'] = poliqueta['Genus 1'] + ' ' + poliqueta['Species 1']

poliqueta['genus_and_species'] = poliqueta['genus_and_species'].str.lower().str.capitalize()
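
As a standalone illustration of the naming convention (capitalized genus, lowercase specific epithet), here is a small sketch. `format_binomial` is a hypothetical helper and the input is only an example.

```python
def format_binomial(genus, species):
    """Return 'Genus species': capitalized genus, lowercase specific epithet."""
    genus = str(genus).strip().capitalize()
    species = str(species).strip().lower()
    return f'{genus} {species}'

example = format_binomial('HOMO', ' Sapiens')
```

Python's `str.capitalize()` already lowercases everything after the first character, so a preceding `lower()` call is harmless but not required.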

## Catching Month and Year

date_columns = ['Cataloged Date', 'Return Date', 'Determined date 1', 'Start Date']

format: YYYY-mm-dd

<font color='red'>Return Date is empty</font>

In [85]:
def fetchDate(string):
    '''
    This function tries to fetch a date (month and year) from a string via the datetime library. 
    
    If that fails, it tries to identify these elements from a string in the format YYYY-mm-dd.
    '''
    string = str(string)
    if string.lower() == 'nan':
        return (np.nan, np.nan)
    
    try:
        date = datetime.datetime.strptime(string, '%Y-%m-%d')
        return (date.month, date.year)
    
    except ValueError:
        # e.g. a trailing time component; month and year are returned as strings here
        str_list = string.split('-')
        year = str_list[0]
        month = str_list[1]
        
        return (month, year)
        

In [86]:
poliqueta['determined_month'] = poliqueta['Determined date 1'].apply(lambda x: fetchDate(x)[0])
poliqueta['start_month'] = poliqueta['Start Date'].apply(lambda x: fetchDate(x)[0])
poliqueta['cataloged_month'] = poliqueta['Cataloged Date'].apply(lambda x: fetchDate(x)[0])

poliqueta['determined_year'] = poliqueta['Determined date 1'].apply(lambda x: fetchDate(x)[1])
poliqueta['start_year'] = poliqueta['Start Date'].apply(lambda x: fetchDate(x)[1])
poliqueta['cataloged_year'] = poliqueta['Cataloged Date'].apply(lambda x: fetchDate(x)[1])
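
For comparison, here is a minimal sketch of a month/year extractor that returns integers directly and tolerates a trailing time component or a partial date. `month_year` is a hypothetical helper, the sample inputs are made up, and `None` stands in for `np.nan`.

```python
import datetime

def month_year(value):
    """Return (month, year) as ints, or (None, None) when the value is missing."""
    text = str(value).strip()
    if text.lower() in ('', 'nan', 'nat', 'none'):
        return (None, None)
    text = text.split(' ')[0]            # drop a trailing time such as ' 00:00:00'
    try:
        date = datetime.datetime.strptime(text, '%Y-%m-%d')
        return (date.month, date.year)
    except ValueError:
        # Fall back to positional parsing for partial dates such as '1998-05'.
        parts = text.split('-')
        year = int(parts[0]) if parts[0].isdigit() else None
        month = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else None
        return (month, year)

full = month_year('2004-07-15')
with_time = month_year('2004-07-15 00:00:00')
partial = month_year('1998-05')
```

Returning integers from both branches avoids mixing strings and numbers in the same column, at the cost of a slightly longer fallback.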

converting to int

In [87]:
def str_with_nan2int(string):
    try:
        if not np.isnan(string):
            return int(string)
        else:
            return np.nan
    except TypeError:
        # np.isnan rejects non-numeric input such as strings
        if str(string).lower() == 'nan':
            return np.nan
        else:
            return int(str(string))

In [88]:
poliqueta['determined_month'] = poliqueta['determined_month'].apply(str_with_nan2int) #has NaN
poliqueta['start_month'] = poliqueta['start_month'].apply(str_with_nan2int) #has NaN
poliqueta['cataloged_month'] = poliqueta['cataloged_month'].apply(str_with_nan2int) #has NaN

poliqueta['determined_year'] = poliqueta['determined_year'].apply(str_with_nan2int) #has NaN
poliqueta['start_year'] = poliqueta['start_year'].apply(str_with_nan2int) #has NaN
poliqueta['cataloged_year'] = poliqueta['cataloged_year'].apply(str_with_nan2int) #has NaN

<br>

## Adjusting `Depth` columns

<font color='red'>**p.s.:** I'm assuming all values use the same unit of measurement (meters) </font>

In [89]:
poliqueta['min_depth'] = poliqueta['Min Depth'].astype(float)
poliqueta['max_depth'] = poliqueta['Max Depth'].astype(float)

<br>

## Adjusting Latitude and Longitude

In [93]:
poliqueta['Longitude'].unique()

array([-40.3461111 , -40.5833611 , -41.1541667 , -40.595     ,
       -40.15      , -38.7127778 , -37.9125    , -40.1666667 ,
       -37.7727778 , -36.6372222 , -38.4225    , -38.0819444 ,
       -40.5461111 , -41.7225    , -41.0855556 , -40.8483333 ,
       -40.8138889 , -38.6666667 , -41.2541667 , -38.0666667 ,
       -35.8805556 , -38.9666667 , -39.2311111 , -38.7611111 ,
       -45.2333333 , -44.75      , -45.1666667 , -40.5       ,
       -37.485     , -36.8077778 , -36.7502222 , -39.5833333 ,
       -37.5919444 , -40.9819444 , -43.8719444 , -40.6777778 ,
       -43.2777778 , -42.1833333 , -38.4666667 , -40.0744444 ,
       -37.31675   , -37.2069444 , -40.3166667 , -39.0130667 ,
       -39.2726    , -39.04955   , -39.0516667 , -39.0598667 ,
       -38.6794833 , -38.9786667 , -39.2140167 , -39.2186333 ,
       -39.0785833 , -38.6001389 , -38.8738    , -38.63945   ,
       -39.1274667 , -39.0850667 , -33.7727778 , -38.7668333 ,
       -39.5902778 , -38.3744444 , -35.8911111 , -40.02

In [20]:
poliqueta['Latitude'] = poliqueta['Latitude'].apply(convert2float)
poliqueta['Longitude'] = poliqueta['Longitude'].apply(convert2float)

<br>

## Adjusting locality columns

Columns: `Continent`, `WaterBody`, `Country`, `State`, `County`, `Locality Name`

In [21]:
# removing unnecessary spaces
poliqueta['Country'] = poliqueta['Country'].apply(lambda x:str(x).strip())

<br>

## Adjusting Types

In [22]:
# to be added later (not needed so far)
# poliqueta['Type Status 1'].str.strip().str.lower().str.capitalize()

<br>

## Adjusting `Order` column

In [23]:
def correct_order(order):
    order = str(order)
    
    if order.lower() == 'nan':
        return np.nan
    else:
        return order.strip().capitalize()

In [24]:
poliqueta['Order 1'] = poliqueta['Order 1'].apply(correct_order)

<br>

## Selecting Subset of Main DB

In [25]:
poliqueta['Species Author Year 1'] = poliqueta['Species Author 1'].apply(
            lambda x: str(x).split(',')[-1].split(')')[0].split(' ')[-1].strip()
)
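
The chain of splits above is easier to follow on concrete inputs. The sketch below wraps it in a function and compares it with a regex-based alternative. Both helper names and the sample authority strings are illustrative assumptions.

```python
import re

def author_year_by_split(author):
    """Same chain of splits as the notebook cell, wrapped in a function."""
    return str(author).split(',')[-1].split(')')[0].split(' ')[-1].strip()

def author_year_by_regex(author):
    """Alternative: take the last four-digit number in the string, if any."""
    years = re.findall(r'\b\d{4}\b', str(author))
    return years[-1] if years else None

samples = ['(Linnaeus, 1758)', 'Grube, 1850', 'Savigny in Lamarck, 1818']

split_years = [author_year_by_split(s) for s in samples]
regex_years = [author_year_by_regex(s) for s in samples]
```

One difference worth knowing: for a missing value the split version returns the string `'nan'`, while the regex version returns `None`.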

In [26]:
# 'genero_e_especie_ent', 'genero_e_especie_atual'
selected_columns = ['Catalog Number','Cataloged Date','Determined date 1','Start Date',
                    'start_year', 'start_month', 'determined_year', 'determined_month',
                    'cataloged_month', 'cataloged_year',
                    'Class 1','Kingdom', 
                    'Genus 1', 'Species 1', 'Family 1', 'Phylum 1','Order 1','Type Status 1',
                    'Species Author 1', 'Species Author Year 1','determiner_full_name',
                    'determiner_full_name1','determiner_full_name2',
                    'CollectorFirst_and_LastName', 'CollectorFirst_and_LastName2',
                    'CollectorFirst_and_LastName3','CollectorFirst_and_LastName4',
                    'Collector First Name 1', 'Collector Last Name 1',
                    'Qualifier 1', 'min_depth', 'max_depth',
                    'Latitude', 'Longitude', 'Continent',
                    'WaterBody', 'Country', 'State', 'County', 'Locality Name']

In [27]:
NewTable = poliqueta[selected_columns].copy()

## Renaming columns

Setting new standardized column names to facilitate future steps.

In [28]:
renames = {
    'Catalog Number':'catalog_number',
    'Cataloged Date':'cataloged_date',
    'Determined Date 1':'determined_date',  # no effect: the column is spelled 'Determined date 1' (lowercase 'd')
    'Class 1':'class',
    'Kingdom':'kingdom',
    'Genus 1':'genus',
    'Species 1':'species',
    'Family 1':'family',
    'Phylum 1':'phylum',
    'Subespecie_atual':'subespecie_atual',
    'Type Status 1':'type',
    'Order 1':'order',
    'Species Author 1':'author',
    'Species Author Year 1':'author_year',
#     'Determiner Last Name 1':'determiner_full_name',
    'Collector First Name 1':'collector_first_name',
    'Collector Last Name 1':'collector_last_name',
    'Qualifier 1':'qualifier',
    'Latitude':'lat',
    'Longitude':'long', 
    'Continent':'continent',
    'WaterBody':'water_body',
    'Country':'country',
    'State':'state',
    'County':'county',
    'Locality Name':'locality',
    'CollectorFirst_and_LastName':'collector_full_name',
    'CollectorFirst_and_LastName2':'collector_full_name2',
    'CollectorFirst_and_LastName3':'collector_full_name3',
    'CollectorFirst_and_LastName4':'collector_full_name4'
}

In [29]:
NewTable = NewTable.rename(columns=renames)
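
`DataFrame.rename` ignores mapping keys that match no column (its `errors` parameter defaults to `'ignore'`), so a misspelled key goes unnoticed. A small check like the one below can list such keys before renaming. `unmatched_keys` is a hypothetical helper and the column list is a shortened sample.

```python
def unmatched_keys(rename_map, columns):
    """Return rename keys that do not match any existing column name exactly."""
    existing = set(columns)
    return [key for key in rename_map if key not in existing]

columns = ['Catalog Number', 'Determined date 1', 'Genus 1']   # shortened sample
rename_map = {
    'Catalog Number': 'catalog_number',
    'Determined Date 1': 'determined_date',   # capital 'D' does not match the column
    'Genus 1': 'genus',
}

missing = unmatched_keys(rename_map, columns)
print(missing)
```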

<br>

## Exporting to `CSV`

name: <font color='blue'>./data/treated_db.csv</font>
sep: ';'
encoding: 'utf-8'

In [30]:
NewTable.to_csv('./data/treated_db.csv', sep=';', encoding='utf-8')
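
The two encodings mentioned in this notebook, `'utf-8'` and `'utf-8-sig'`, differ only in a three-byte byte order mark (BOM) written at the start of the file. The sketch below shows this with the standard library on a made-up header line. Some spreadsheet programs use the BOM to detect UTF-8, which is a common reason to choose `'utf-8-sig'`.

```python
import codecs

text = 'catalog_number;genus\n'   # made-up first line of a CSV file

plain = text.encode('utf-8')
with_bom = text.encode('utf-8-sig')

# 'utf-8-sig' prepends the byte order mark; the rest of the bytes are identical.
has_bom = with_bom.startswith(codecs.BOM_UTF8)
same_payload = with_bom[len(codecs.BOM_UTF8):] == plain
```

When reading such a file back, decoding with `'utf-8-sig'` removes the BOM if present and works for plain UTF-8 as well.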

-----

<br>

# MNRJP27.07.2020 - visualização.xls

## Importing data

In [31]:
excel = pd.ExcelFile('./data/MNRJP2021.10.11 - visualização.xlsx')
sheet_name = excel.sheet_names

print('The excel file contains the following sheets:', sheet_name)
print('\nDatabase is in sheet:', sheet_name[0])

The excel file contains the following sheets: ['Specify']

Database is in sheet: Specify


In [32]:
db = excel.parse(sheet_name[0])  # sep/encoding are CSV options and do not apply to Excel sheets
poliqueta2 = db.copy()

print(f'The database has {db.shape[0]} rows and {db.shape[1]} columns.')

The database has 2785 rows and 92 columns.


## Adjusting column names

### removing '\n', '\t', and other special characters

In [33]:
# use regular (non-raw) strings so that real line breaks and tabs are matched
poliqueta2.columns = [str(col).replace('\n', '').replace('\t', '') for col in poliqueta2.columns]

## Adjusting Determiners and Collectors Names

<font color='red' size='5'>**p.s.:** Determiner Last Name 1 contains the FULL name of the Researcher </font>

Determiner Last Name 2 is all empty.

In [34]:
names_col = ['Collector Last Name 1', 'Collector First Name 1', 
             'Collector Last Name 2', 'Collector First Name 2', 'Collector Last Name 3', 
             'Collector First Name 3', 'Collector Last Name 4', 'Collector First Name 4']

In [35]:
# for name_col in names_col:
#     if 'last' in name_col.lower():
#         poliqueta2[name_col] = poliqueta2[name_col].apply(lambda x: treat_names(x, pos='full'))
#     else:
#         poliqueta2[name_col] = poliqueta2[name_col].apply(treat_names)

for name_col in names_col:
    poliqueta2[name_col] = poliqueta2[name_col].fillna('')
    poliqueta2[name_col] = poliqueta2[name_col].apply(lambda x: treat_names(x, pos='full'))

In [36]:
poliqueta2['Determiner First Name1'] = poliqueta2['Determiner First Name1'].str.strip().str.lower().str.capitalize()
poliqueta2['Determiner Last Name1'] = poliqueta2['Determiner Last Name1'].str.strip().str.lower().str.capitalize()

#poliqueta2['Determiner First Name 2'] = poliqueta2['Determiner First Name 2'].str.strip().str.lower().str.capitalize()
#poliqueta2['Determiner Last Name 2'] = poliqueta2['Determiner Last Name 2'].str.strip().str.lower().str.capitalize()

### creating column joining First and Last names

This is done for the first determiner and for all four collectors.

In [37]:
poliqueta2['DeterminerFirst_and_LastName'] = poliqueta2['Determiner First Name1'] + ' ' + poliqueta2['Determiner Last Name1']
#poliqueta2['DeterminerFirst_and_LastName2'] = poliqueta2['Determiner First Name 2'] + ' ' + poliqueta2['Determiner Last Name 2']

# collector columns
poliqueta2['CollectorFirst_and_LastName'] = poliqueta2['Collector First Name 1'] + ' ' + poliqueta2['Collector Last Name 1']
poliqueta2['CollectorFirst_and_LastName2'] = poliqueta2['Collector First Name 2'] + ' ' + poliqueta2['Collector Last Name 2']
poliqueta2['CollectorFirst_and_LastName3'] = poliqueta2['Collector First Name 3'] + ' ' + poliqueta2['Collector Last Name 3']
poliqueta2['CollectorFirst_and_LastName4'] = poliqueta2['Collector First Name 4'] + ' ' + poliqueta2['Collector Last Name 4']

In [38]:
for col in ['CollectorFirst_and_LastName', 'CollectorFirst_and_LastName2', 'CollectorFirst_and_LastName3', 
           'CollectorFirst_and_LastName4'] :
    poliqueta2[col] = poliqueta2[col].apply(lambda x: x if str(x) != ' ' else 'Non-identified')

### joining first, middle and last names

In [39]:
poliqueta2['DeterminerFullName'] = poliqueta2['Determiner First Name1'] + ' ' + poliqueta2['Determiner Middle Name1'] + ' ' + poliqueta2['Determiner Last Name1']
#poliqueta2['DeterminerFullName2'] = poliqueta2['Determiner First Name 2'] + ' ' + poliqueta2['DeterminerMiddle Name 2'] + ' ' + poliqueta2['Determiner Last Name 2']

# collector columns
poliqueta2['CollectorFullName'] = poliqueta2['Collector First Name 1'] + ' ' + poliqueta2['CollectorMiddle Name 1'] + ' ' + poliqueta2['Collector Last Name 1']
poliqueta2['CollectorFullName2'] = poliqueta2['Collector First Name 2'] + ' ' + poliqueta2['CollectorMiddle Name 2'] + ' ' + poliqueta2['Collector Last Name 2']
poliqueta2['CollectorFullName3'] = poliqueta2['Collector First Name 3'] + ' ' + poliqueta2['CollectorMiddle Name 3'] + ' ' + poliqueta2['Collector Last Name 3']
poliqueta2['CollectorFullName4'] = poliqueta2['Collector First Name 4'] + ' ' + poliqueta2['CollectorMiddle Name 4'] + ' ' + poliqueta2['Collector Last Name 4']

In [40]:
for col in ['CollectorFullName', 'CollectorFullName2', 'CollectorFullName3', 'CollectorFullName4']:
    poliqueta2[col] = poliqueta2[col].apply(lambda x: x if str(x) != ' ' else 'Non-identified')

## Treating taxon columns

In [41]:
taxon_columns = ['Kingdom', 'Phylum 1', 'Class 1', 'Order 1', 'Family 1', 'Genus 1',
                 'Species 1']  # selecting taxonomy columns

treat_taxon_columns(poliqueta2, taxon_columns)

Adjusting column Kingdom
Adjusting column Phylum 1
Adjusting column Class 1
Adjusting column Order 1
Adjusting column Family 1
Adjusting column Genus 1
Adjusting column Species 1


In [42]:
for col in taxon_columns:
    poliqueta2[col] = poliqueta2[col].fillna('Non-identified')
    poliqueta2[col] = poliqueta2[col].astype(str)
    poliqueta2[col] = poliqueta2[col].apply(lambda x: x if str(x).lower() != 'nan' else 'Non-identified')

## Adjusting Genus and Species

In [43]:
# tip from Manoela: the specific epithet must be all lowercase (species and subspecies, in this case)
poliqueta2['Species 1'] = poliqueta2['Species 1'].str.lower()

# the genus is written with its first letter capitalized
poliqueta2['Genus 1'] = poliqueta2['Genus 1'].str.lower().str.capitalize()

<br>

## Joining `Genus 1` and `Species 1` (together they identify each animal's species)

In [44]:
poliqueta2['genus_and_species'] = poliqueta2['Genus 1'] + ' ' + poliqueta2['Species 1']

poliqueta2['genus_and_species'] = poliqueta2['genus_and_species'].str.lower().str.capitalize()

## Catching Month and Year

date_columns = ['Cataloged Date', 'Return Date', 'Determined date 1', 'Start Date']

format: YYYY-mm-dd

<font color='red'>Return Date is empty</font>

In [45]:
def fetchDate(string):
    '''
    This function tries to fetch a date (month and year) from a string via the datetime library. 
    
    If that fails, it tries to identify these elements from a string in the format YYYY-mm-dd.
    '''
    string = str(string)
    if string.lower() == 'nan':
        return (np.nan, np.nan)
    
    try:
        date = datetime.datetime.strptime(string, '%Y-%m-%d')
        return (date.month, date.year)
    
    except ValueError:
        # e.g. a trailing time component; month and year are returned as strings here
        str_list = string.split('-')
        year = str_list[0]
        month = str_list[1]
        
        return (month, year)
        

In [46]:
# poliqueta2['determined_month'] = poliqueta2['Determined date 1'].apply(lambda x: fetchDate(x)[0])
poliqueta2['start_month'] = poliqueta2['Start Date'].apply(lambda x: fetchDate(x)[0])
poliqueta2['cataloged_month'] = poliqueta2['Cataloged Date'].apply(lambda x: fetchDate(x)[0])

# poliqueta['determined_year'] = poliqueta['Determined date 1'].apply(lambda x: fetchDate(x)[1])
poliqueta2['start_year'] = poliqueta2['Start Date'].apply(lambda x: fetchDate(x)[1])
poliqueta2['cataloged_year'] = poliqueta2['Cataloged Date'].apply(lambda x: fetchDate(x)[1])

converting to int

In [47]:
def str_with_nan2int(string):
    try:
        if not np.isnan(string):
            return int(string)
        else:
            return np.nan
    except TypeError:
        # np.isnan rejects non-numeric input such as strings
        if str(string).lower() == 'nan':
            return np.nan
        else:
            return int(str(string))

In [48]:
# poliqueta2['determined_month'] = poliqueta2['determined_month'].apply(str_with_nan2int) #has NaN
poliqueta2['start_month'] = poliqueta2['start_month'].apply(str_with_nan2int) #has NaN
poliqueta2['cataloged_month'] = poliqueta2['cataloged_month'].apply(str_with_nan2int) #has NaN

# poliqueta2['determined_year'] = poliqueta2['determined_year'].apply(str_with_nan2int) #has NaN
poliqueta2['start_year'] = poliqueta2['start_year'].apply(str_with_nan2int) #has NaN
poliqueta2['cataloged_year'] = poliqueta2['cataloged_year'].apply(str_with_nan2int) #has NaN

<br>

## Adjusting `Depth` columns

<font color='red'>**p.s.:** I'm assuming all values use the same unit of measurement (meters) </font>

In [49]:
poliqueta2['min_depth'] = poliqueta2['Min Depth'].astype(float)
poliqueta2['max_depth'] = poliqueta2['Max Depth'].astype(float)

<br>

## Adjusting Latitude and Longitude

In [50]:
poliqueta2['Latitude'] = poliqueta2['Latitude1'].apply(convert2float)
poliqueta2['Longitude'] = poliqueta2['Longitude1'].apply(convert2float)

<br>

## Adjusting locality columns

Columns: `Continent`, `WaterBody`, `Country`, `State`, `County`, `Locality Name`

In [51]:
# removing unnecessary spaces from Country column
poliqueta2['Country'] = poliqueta2['Country'].apply(lambda x:str(x).strip())

### treating continent

In [52]:
def correct_continent(string):
    string = str(string)

    continent_correction = {
        'America Central':'Central America',
        'Asia Oriental':'Asia',  # there's just one record of Asia Oriental
        'European':'Europe',
    }
    
    if string in continent_correction.keys():
        return continent_correction[string]  # returns corrected continent
    else:
        return string

In [53]:
# assign the corrected values themselves; value_counts() returns a summary, not one value per row
poliqueta2['Continent'] = poliqueta2['Continent'].str.strip().apply(correct_continent)
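
Normalizing values and counting them are two separate steps: the normalized values belong in the column, while the counts are only a summary for inspection. Below is a standard-library sketch with made-up values, where `Counter` plays the role of `value_counts()`. The `normalize_continent` name is an assumption for illustration.

```python
from collections import Counter

CONTINENT_FIXES = {
    'America Central': 'Central America',
    'Asia Oriental': 'Asia',
    'European': 'Europe',
}

def normalize_continent(value):
    """Strip whitespace and map known variants to a standard name."""
    text = str(value).strip()
    return CONTINENT_FIXES.get(text, text)

raw = [' America Central', 'European', 'Europe', 'South America ']   # made-up values

normalized = [normalize_continent(v) for v in raw]   # one value per row, same length as raw
counts = Counter(normalized)                         # summary only, not a column
```

`dict.get(key, default)` returns the value unchanged when no correction is registered, which replaces the explicit membership test.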

### treating State column

In [54]:
def remove_accents(string):
    string = str(string).strip()
    
    return unidecode.unidecode(string)

In [55]:
# assign the cleaned values themselves; value_counts() returns a summary, not one value per row
poliqueta2['State'] = poliqueta2['State'].str.strip().apply(remove_accents)
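
For readers without the `unidecode` package, accent removal for Latin-script names can be approximated with the standard library. This is a swapped-in technique, not what the notebook uses: it decomposes characters and drops the combining marks, whereas `unidecode` also transliterates non-Latin scripts. `strip_accents` is a hypothetical helper.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks using only the standard library."""
    decomposed = unicodedata.normalize('NFKD', str(text).strip())
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

state = strip_accents(' Espírito Santo ')
```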

In [56]:
# the other columns remain unchanged, since further treatment doesn't seem necessary

<br>

## Adjusting Types

In [57]:
# to be added later (not needed so far)
# poliqueta['Type Status 1'].str.strip().str.lower().str.capitalize()

<br>

## Adjusting `Order` column

In [58]:
def correct_order(order):
    order = str(order)
    
    if order.lower() == 'nan':
        return np.nan
    else:
        return order.strip().capitalize()

In [59]:
poliqueta2['Order 1'] = poliqueta2['Order 1'].apply(correct_order)

<br>

## Selecting Subset of Main DB

In [60]:
# this database has no Determined Date column
selected_columns = ['Catalog Number','Cataloged Date','Start Date',
                    'start_year', 'start_month',
                    'cataloged_month', 'cataloged_year',
                    'Class 1','Kingdom', 
                    'Genus 1', 'Species 1', 'Family 1', 'Phylum 1','Order 1','Type Status 1',
                    'Species Author 1','Determiner Last Name1','Determiner First Name1',
                    'DeterminerFirst_and_LastName', #'DeterminerFirst_and_LastName2', 'Species Author Year 1',
                    'CollectorFirst_and_LastName','CollectorFirst_and_LastName2',
                    'CollectorFirst_and_LastName3', 'CollectorFirst_and_LastName4',
                    'Collector First Name 1', 'Collector Last Name 1',
                    'Qualifier 1', 'min_depth', 'max_depth',
                    'Latitude', 'Longitude', 'Continent',
                    'WaterBody', 'Country', 'State', 'County', 'Locality Name']

In [61]:
NewTable2 = poliqueta2[selected_columns].copy()

## Renaming columns

Setting new standardized column names to facilitate future steps.

In [62]:
renames = {
    'Catalog Number':'catalog_number',
    'Cataloged Date':'cataloged_date',
    'Determined Date 1':'determined_date',
    'Class 1':'class',
    'Kingdom':'kingdom',
    'Genus 1':'genus',
    'Species 1':'species',
    'Family 1':'family',
    'Phylum 1':'phylum',
    'Subespecie_atual':'subespecie_atual',
    'Type Status 1':'type',
    'Order 1':'order',
    'Species Author 1':'author',
    'Species Author Year 1':'author_year',
    'Determiner Last Name1':'determiner_last_name',
    'Determiner First Name1':'determiner_first_name',
    'Collector First Name 1':'collector_first_name',
    'Collector Last Name 1':'collector_last_name',
    'Qualifier 1':'qualifier',
    'Latitude':'lat',
    'Longitude':'long', 
    'Continent':'continent',
    'WaterBody':'water_body',
    'Country':'country',
    'State':'state',
    'County':'county',
    'Locality Name':'locality',
    'CollectorFirst_and_LastName':'collector_full_name',  # first database only has first and last names...
    'CollectorFirst_and_LastName2':'collector_full_name2',  
    'CollectorFirst_and_LastName3':'collector_full_name3',
    'CollectorFirst_and_LastName4':'collector_full_name4',
    'DeterminerFirst_and_LastName':'determiner_full_name',  # first database only has first and last names...
    'DeterminerFirst_and_LastName2':'determiner_full_name2'  # first database only has first and last names...
}

In [63]:
NewTable2 = NewTable2.rename(columns=renames)

<br>

## Exporting to `CSV`

name: <font color='blue'>./data/treated_db2.csv</font>
sep: ';'
encoding: 'utf-8'

In [64]:
NewTable2.to_csv('./data/treated_db2.csv', sep=';', encoding='utf-8')

-----

<br>

# Merging both databases

p.s.: the merged table represents the whole `poliqueta` collection

In [65]:
# before merging: columns present in NewTable but absent from NewTable2
set(NewTable.columns).difference(set(NewTable2.columns))

{'Determined date 1',
 'author_year',
 'determined_month',
 'determined_year',
 'determiner_full_name1',
 'determiner_full_name2'}

In [66]:
merged_table = pd.concat([NewTable, NewTable2], join='outer')

In [67]:
def create_prefix(catalog):
    catalog = str(catalog)
    
    if 'IBUFRJ' in catalog.upper():
        return 'IBUFRJ'
    elif 'MNRJP' in catalog.upper():
        return 'MNRJP'
    else:
        return None

In [68]:
# creating column collection_prefix IBUFRJ and MNRJP (to identify origin of that info.)
merged_table['collection_prefix'] = merged_table['catalog_number'].apply(create_prefix)
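
The same prefix detection can be written with a single case-insensitive regular expression. The sketch below is an illustrative alternative: the catalog-number formats are made up, `collection_prefix` is a hypothetical helper, and when both prefixes appear it returns whichever comes first in the string rather than always preferring `IBUFRJ`.

```python
import re

PREFIX_PATTERN = re.compile(r'(IBUFRJ|MNRJP)', re.IGNORECASE)

def collection_prefix(catalog_number):
    """Return 'IBUFRJ' or 'MNRJP' if the catalog number contains one, else None."""
    match = PREFIX_PATTERN.search(str(catalog_number))
    return match.group(1).upper() if match else None

first_collection = collection_prefix('IBUFRJ 0123')
second_collection = collection_prefix('mnrjp-45')
unknown = collection_prefix('12345')
```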

In [69]:
# adjusting the family column
#merged_table['family'] = merged_table['family'].apply(lambda x: x if str(x).lower() != 'nan' else 'Non-identified')

In [70]:
merged_table.to_csv('./data/merged_db.csv', sep=';', encoding='utf-8')