# Preprocessing
---


In [2]:
import numpy as np
import pandas as pd
import xarray as xr

## Preprocessing db1
---
This section preprocesses the database [Total Sediment Thickness of the World's Oceans and Marginal Seas Version 3 (GlobSed)](https://www.ngdc.noaa.gov/mgg/sedthick/data/version3/GlobSed.zip).

In [113]:
db1 = pd.read_csv('/home/mofoko/Documents/Metodologias/tdsp_template/scripts/data_acquisition/IHFC_2023_GHFDB.csv',
                  sep=';', encoding="ISO-8859-1")
db1.head()


Unnamed: 0,ID,q,q_unc,name,lat,lng,elevation,Ref_1,q_acq,env,...,tc_pTfunc,tc_numb,tc_strategy,Ref_ISGN,A_dom,A_reg,A_cont,A_year,Unnamed: 63,Unnamed: 64
0,1.0,201.0,,C21-3F,-16.0733,-17.2717,-3855.9,Abbott_2008,,,...,,,,,marine,South Atlantic Ocean,South American Plate,2008.0,,
1,2.0,214.0,,C21-3G,-16.0733,-17.2717,-3855.9,Abbott_2008,,,...,,,,,marine,South Atlantic Ocean,South American Plate,2008.0,,
2,3.0,46.0,,C27-3C,-0.9917,81.3983,-4720.3,Abbott_2008,,,...,,,,,marine,Indian Ocean,Indo-Australian Plate,2008.0,,
3,4.0,57.0,,C27-3D,-0.9917,81.3983,-4720.3,Abbott_2008,,,...,,,,,marine,Indian Ocean,Indo-Australian Plate,2008.0,,
4,5.0,176.0,,EL21-10A,-39.99,-109.855,-3194.4,Abbott_2008,,,...,,,,,marine,South Pacific Ocean,Antarctic Plate,2008.0,,


In [114]:
print(f'Rows: {db1.shape[0]}, columns: {db1.shape[1]}')

Rows: 74547, columns: 65


### Variable selection
---
We select the variables of interest.

In [115]:
vars_selection = ['q','lat','lng','elevation','A_dom','A_reg','A_cont','A_year']
db1_select = db1[vars_selection]
db1_select.shape

(74547, 8)

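### Checking for null values
---
We remove rows with null values. The target variable is *heat flow (q)*, but `dropna()` is called without a `subset`, so a row is dropped when any of the selected columns is null: 74547 - 3775 = 70772 rows remain, where 3775 is the null count of `elevation`.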

In [116]:
# Count the null values in each column
db1_select.isnull().sum()

q            1514
lat          1514
lng          1514
elevation    3775
A_dom        1514
A_reg        1514
A_cont       1514
A_year       1514
dtype: int64

In [117]:
# Drop rows with a null in any selected column (dropna() has no subset here)
db1_nonull = db1_select.dropna()
# Shape after dropping null values
print(f'Rows: {db1_nonull.shape[0]}, columns: {db1_nonull.shape[1]}')

Rows: 70772, columns: 8


In [118]:
# Count the null values remaining after the drop
db1_nonull.iloc[:,0:8].isnull().sum()

q            0
lat          0
lng          0
elevation    0
A_dom        0
A_reg        0
A_cont       0
A_year       0
dtype: int64
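
If the goal is to remove only the rows where `q` is missing and keep rows that merely lack `elevation`, `dropna` accepts a `subset` argument. The sketch below contrasts both behaviours on a small toy frame; the values are hypothetical and are not taken from the database.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), not rows from the real database
toy = pd.DataFrame({
    "q": [201.0, np.nan, 46.0, 57.0],
    "lat": [-16.07, np.nan, -0.99, -0.99],
    "elevation": [-3855.9, np.nan, np.nan, -4720.3],
})

# Drops every row that has a null in any column
drop_any = toy.dropna()

# Drops only the rows where q itself is null
drop_q = toy.dropna(subset=["q"])

print(len(drop_any), len(drop_q))
```

Which variant is appropriate depends on whether `elevation` is required downstream; the notebook as run uses the first one.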

### Data types
---
We inspect the data type of each variable.

In [119]:
# Check for inconsistencies in the data types
db1_nonull.dtypes

q            float64
lat          float64
lng          float64
elevation     object
A_dom         object
A_reg         object
A_cont        object
A_year       float64
dtype: object

The `elevation` variable is inconsistent: it should be numeric, but it was read as `object` (string). We therefore search the column for non-numeric values.

In [120]:
# Regex pattern matching strings made up entirely of non-digit characters
pattern = r'^\D+$'
# Select the elevation entries that contain no digits at all
col_elev_nodigi = db1_nonull[db1_nonull['elevation'].str.contains(pattern, na=False)]['elevation']
print(f'Non-digit values in column elevation: {col_elev_nodigi.shape}')
# Unique non-digit values
nodig_vals = col_elev_nodigi.unique().tolist()
print(f'Non-digit values found in column elevation: \n{nodig_vals}')

Non-digit values in column elevation: (212,)
Non-digit values found in column elevation: 
['?', 'Not specified', 'Unspecified']
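
The pattern `^\D+$` only flags entries that contain no digits at all, so a malformed entry that mixes digits and text would not be caught. A possible alternative, sketched here on a hypothetical toy column, is to let `pd.to_numeric` with `errors="coerce"` turn anything unparseable into `NaN` and then list the entries that failed.

```python
import pandas as pd

# Toy column (hypothetical values) mixing floats, numeric strings and placeholders
elev = pd.Series([-3855.9, "-4720,3", "?", "Not specified", "120"], dtype=object)

# Normalise decimal commas in string entries only, leaving floats untouched
normalised = elev.apply(lambda x: x.replace(",", ".") if isinstance(x, str) else x)

# Anything that cannot be parsed as a number becomes NaN
as_number = pd.to_numeric(normalised, errors="coerce")

# Entries that were present but failed to parse
bad_values = elev[as_number.isna() & elev.notna()].unique().tolist()
print(bad_values)
```

This is a sketch of one option, not the approach used in the notebook, which relies on the regex filter.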


In [148]:
# Drop the rows whose elevation is a placeholder; .copy() makes the result independent
# of db1_nonull, so the assignment below does not raise SettingWithCopyWarning
db1_nodig = db1_nonull[~db1_nonull['elevation'].isin(nodig_vals)].copy()
# Replace decimal commas in string entries, then cast the column to float
db1_nodig['elevation'] = (db1_nodig['elevation']
                          .apply(lambda x: x.replace(',', '.') if isinstance(x, str) else x)
                          .astype(float))
db1_nodig.dtypes

q            float64
lat          float64
lng          float64
elevation    float64
A_dom         object
A_reg         object
A_cont        object
A_year       float64
dtype: object
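
Assigning a column on a filtered DataFrame can make pandas emit a `SettingWithCopyWarning`, because it cannot tell whether the filtered object is a view or a copy of the original. A minimal sketch of the usual remedy, taking an explicit `.copy()` after filtering, on a hypothetical toy frame whose entries are all strings:

```python
import pandas as pd

# Toy frame (hypothetical values) standing in for db1_nonull
src = pd.DataFrame({"elevation": ["-3855,9", "?", "-1542"], "q": [201.0, 46.0, 19.0]})

# Boolean filtering followed by an explicit copy gives an independent frame
clean = src[~src["elevation"].isin(["?"])].copy()

# Assigning on the copy is unambiguous and leaves src untouched
clean["elevation"] = clean["elevation"].str.replace(",", ".", regex=False).astype(float)

print(clean["elevation"].tolist())
```

The toy column holds only strings, so `.str.replace` is enough here; a column that mixes strings and floats needs a per-value check before replacing.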

In [149]:
db1_nodig

Unnamed: 0,q,lat,lng,elevation,A_dom,A_reg,A_cont,A_year
0,201.0,-16.0733,-17.2717,-3855.9,marine,South Atlantic Ocean,South American Plate,2008.0
1,214.0,-16.0733,-17.2717,-3855.9,marine,South Atlantic Ocean,South American Plate,2008.0
2,46.0,-0.9917,81.3983,-4720.3,marine,Indian Ocean,Indo-Australian Plate,2008.0
3,57.0,-0.9917,81.3983,-4720.3,marine,Indian Ocean,Indo-Australian Plate,2008.0
4,176.0,-39.9900,-109.8550,-3194.4,marine,South Pacific Ocean,Antarctic Plate,2008.0
...,...,...,...,...,...,...,...,...
73028,19.0,24.9000,122.7000,-1542.0,marine,East China Sea,Eurasian Plate - Asia,2019.0
73029,8.0,24.9200,122.7000,-1464.0,marine,East China Sea,Eurasian Plate - Asia,2019.0
73030,12.0,24.9400,122.7000,-1478.0,marine,East China Sea,Eurasian Plate - Asia,2019.0
73031,54.0,24.9600,122.7000,-1462.0,marine,East China Sea,Eurasian Plate - Asia,2019.0
