# Part I: Building a dataset

Author: Christian Torres

This notebook gathers all the data needed to train the machine learning models used in these exercises. We need ERA5 reanalysis climate data and observations from the Cordillera Blanca. The ERA5 data can be downloaded from the Copernicus [Climate Data Store](https://cds.climate.copernicus.eu/cdsapp#!/home).
First, we import the dependencies we will need.

```{note}
Before running this notebook, make sure the Jupyter kernel (top-left corner of the notebook) is set to the environment in which all the required libraries and dependencies are installed.
```


In [1]:
import xarray as xr
import glob
import pandas as pd

## 1.1. Get air temperature data

To train our machine learning models we need target data, in this case the 2 m air temperature. We take it from the surface temperature observed at the available weather stations. As a practical exercise we use just one station. You can include data from many more weather stations, or even use remotely sensed data.

In [2]:
OBS_EMA15 = pd.read_csv('../data/EMA15_Tingua.dat', delimiter='\t', index_col=['Fecha'], 
                         parse_dates=['Fecha'])

In [3]:
OBS_EMA15

Fecha,Temp,RH2,Precip,QSWin
2014-01-01 01:00:00,12.4,,0.0,
2014-01-01 02:00:00,11.6,,0.0,
2014-01-01 03:00:00,11.1,,0.0,
2014-01-01 04:00:00,10.5,,0.0,
2014-01-01 05:00:00,10.5,,0.0,
...,...,...,...,...
2018-12-12 20:00:00,,,,
2018-12-12 21:00:00,,,,
2018-12-12 22:00:00,,,,
2018-12-12 23:00:00,,,,
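
The last rows of the table above are completely empty, so it can help to check how much of each column is missing before going further. The following is a minimal sketch on a hypothetical three-row sample written in the same tab-separated layout as the station file; the sample values and the names `sample`, `obs` and `missing_fraction` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same tab-separated layout as the station file.
sample = (
    "Fecha\tTemp\tRH2\tPrecip\tQSWin\n"
    "2014-01-01 01:00:00\t12.4\t\t0.0\t\n"
    "2014-01-01 02:00:00\t11.6\t\t0.0\t\n"
    "2014-01-01 03:00:00\t\t\t\t\n"
)

# Same read_csv options as in the cell above, applied to the in-memory sample.
obs = pd.read_csv(io.StringIO(sample), delimiter="\t",
                  index_col="Fecha", parse_dates=["Fecha"])

# Fraction of missing values in each column (0.0 = complete, 1.0 = empty).
missing_fraction = obs.isna().mean()
print(missing_fraction)
```

Running the same `isna().mean()` call on `OBS_EMA15` would show which variables are usable for the period of interest.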


In [4]:
OBS_EMA15_nan = OBS_EMA15[['Temp']].resample('6h').mean().dropna()

In [5]:
OBS_EMA15_nan

Fecha,Temp
2014-01-01 00:00:00,11.220000
2014-01-01 06:00:00,15.266667
2014-01-01 12:00:00,23.083333
2014-01-01 18:00:00,15.516667
2014-01-02 00:00:00,13.033333
...,...
2018-12-11 12:00:00,24.346667
2018-12-11 18:00:00,16.898333
2018-12-12 00:00:00,11.650000
2018-12-12 06:00:00,14.950000
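
To see what `resample('6h').mean().dropna()` does, here is a small sketch on a synthetic hourly series. The values are hypothetical; only the chain of calls mirrors the cell above. Bins are labelled by their left edge (00:00, 06:00, 12:00, 18:00), and a bin whose hourly values are all missing becomes `NaN` and is then dropped.

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (hypothetical values) starting at 01:00, like the station file.
index = pd.date_range("2014-01-01 01:00", periods=23, freq="h")
temp = pd.DataFrame({"Temp": np.arange(23, dtype=float)}, index=index)

# Blank out one full 6-hour block (12:00 to 17:00) to mimic a sensor gap.
temp.loc["2014-01-01 12:00":"2014-01-01 17:00", "Temp"] = np.nan

# 6-hourly means; the 12:00 bin has no valid data and is therefore NaN.
six_hourly = temp[["Temp"]].resample("6h").mean()

# Dropping NaN rows removes the empty bin from the training target.
six_hourly_clean = six_hourly.dropna()
print(six_hourly)
print(six_hourly_clean)
```

Note that `mean()` skips missing values inside a bin, so a bin with only one valid hourly reading still yields a value.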


We save the temperature dataset for training.

In [6]:
OBS_EMA15_nan.to_csv('../data/temp_aws_entrena.csv')

## 1.2. Get ERA5 data (predictors)

### Add the surface .nc files

Here we collect the paths of all the .nc files using **`glob`**.

In [7]:
files = sorted(glob.glob('../data/*_CRS.nc'))
print(len(files))
print(files)


5
['../data/ERA5_2014_CRS.nc', '../data/ERA5_2015_CRS.nc', '../data/ERA5_2016_CRS.nc', '../data/ERA5_2017_CRS.nc', '../data/ERA5_2018_CRS.nc']


Now we read all the **.nc** files.

In [8]:
ds_sup = xr.open_mfdataset(files)
ds_sup

(Output: xarray dataset summary. The same dask array layout is reported for each data variable.)

,Array,Chunk
Bytes,5.21 MiB,1.04 MiB
Shape,"(7304, 17, 11)","(1464, 17, 11)"
Count,15 Tasks,5 Chunks
Type,float32,numpy.ndarray


### Do the same for the pressure-level .nc files

In [9]:
files = sorted(glob.glob('../data/*_CRS_pl.nc'))
print(len(files))
print(files)

5
['../data/ERA5_2014_CRS_pl.nc', '../data/ERA5_2015_CRS_pl.nc', '../data/ERA5_2016_CRS_pl.nc', '../data/ERA5_2017_CRS_pl.nc', '../data/ERA5_2018_CRS_pl.nc']


In [10]:
ds_pl = xr.open_mfdataset(files)
ds_pl

(Output: xarray dataset summary. The same dask array layout is reported for each data variable.)

,Array,Chunk
Bytes,15.63 MiB,3.13 MiB
Shape,"(7304, 3, 17, 11)","(1464, 3, 17, 11)"
Count,15 Tasks,5 Chunks
Type,float32,numpy.ndarray


### Geographic data for the weather stations

In [11]:
# [longitude (deg), latitude (deg), third value: presumably the station elevation in m]
EMA15 = [-77.688333, -9.222450, 2118]

### Select the grid point nearest to each available weather station

In [12]:
ERA5_EMA15_sup = ds_sup.sel(longitude=EMA15[0], latitude= EMA15[1], method='nearest')
ERA5_EMA15_pre = ds_pl.sel(longitude=EMA15[0], latitude= EMA15[1], method='nearest')
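
The `method='nearest'` argument makes `sel` return the grid cell whose coordinates are closest to the station, so the station does not need to sit exactly on a grid node. A minimal sketch on a synthetic grid follows. The 0.25 degree spacing, the grid extent, the random values and the names `ds` and `point` are assumptions for illustration; only the station coordinates come from the cell above.

```python
import numpy as np
import xarray as xr

# Hypothetical regular grid around the station (0.25 degree spacing is an assumption).
lats = np.linspace(-8.5, -10.0, 7)    # descending, as is common in reanalysis files
lons = np.linspace(-78.5, -77.0, 7)

# Random placeholder values standing in for a 2 m temperature field.
rng = np.random.default_rng(0)
ds = xr.Dataset(
    {"t2m": (("latitude", "longitude"), rng.normal(280.0, 2.0, (lats.size, lons.size)))},
    coords={"latitude": lats, "longitude": lons},
)

EMA15 = [-77.688333, -9.222450, 2118]  # station coordinates from the notebook

# Pick the grid cell closest to the station in longitude and latitude.
point = ds.sel(longitude=EMA15[0], latitude=EMA15[1], method="nearest")
print(float(point.latitude), float(point.longitude))
```

Printing the selected coordinates, as done here, is a quick way to check how far the chosen grid cell is from the station.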

### Convert to DataFrame

In [13]:
ERA5_EMA15_sup_df = ERA5_EMA15_sup.to_dataframe().drop(['latitude', 'longitude'], axis=1)
ERA5_EMA15_sup_df


time,t2m,z,tp
2014-01-01 00:00:00,280.645813,37063.71875,1.158565e-05
2014-01-01 06:00:00,277.901794,37063.71875,0.000000e+00
2014-01-01 12:00:00,279.369354,37063.71875,0.000000e+00
2014-01-01 18:00:00,285.019684,37063.71875,7.392447e-04
2014-01-02 00:00:00,279.970215,37063.71875,1.873057e-04
...,...,...,...
2018-12-30 18:00:00,287.092529,37063.71875,3.051260e-04
2018-12-31 00:00:00,281.842804,37063.71875,4.348811e-05
2018-12-31 06:00:00,278.876556,37063.71875,4.726462e-07
2018-12-31 12:00:00,280.022369,37063.71875,7.090624e-06


We select the different pressure levels.

In [14]:
ERA5_EMA15_pre_1000 = ERA5_EMA15_pre.sel(level=1000).to_dataframe().drop(['latitude', 'longitude', 'level'], axis=1)
ERA5_EMA15_pre_1000.columns = ['g_1000', 't_1000']
ERA5_EMA15_pre_500 = ERA5_EMA15_pre.sel(level=500).to_dataframe().drop(['latitude', 'longitude', 'level'], axis=1)
ERA5_EMA15_pre_500.columns = ['g_500', 't_500']
ERA5_EMA15_pre_200 = ERA5_EMA15_pre.sel(level=200).to_dataframe().drop(['latitude', 'longitude', 'level'], axis=1)
ERA5_EMA15_pre_200.columns = ['g_200', 't_200']
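
The three near-identical blocks above can also be written as a loop. The sketch below is one possible variant, not the notebook's method: it runs on a synthetic point dataset whose variable names `var_a` and `var_b` are placeholders, and it uses `add_suffix` so that each column keeps its original variable name plus the level instead of relying on the column order. The names `ds_point`, `frames` and `by_level` are also hypothetical.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a pressure-level dataset already reduced to one grid point.
times = pd.date_range("2014-01-01", periods=4, freq="6h")
rng = np.random.default_rng(0)
ds_point = xr.Dataset(
    {
        "var_a": (("time", "level"), rng.random((4, 3))),  # placeholder variable
        "var_b": (("time", "level"), rng.random((4, 3))),  # placeholder variable
    },
    coords={"time": times, "level": [200, 500, 1000]},
)

frames = []
for level in (1000, 500, 200):
    # One DataFrame per level, indexed by time; the scalar 'level' column is dropped.
    frame = ds_point.sel(level=level).to_dataframe().drop(columns=["level"])
    # Append the level to every column name, e.g. var_a -> var_a_1000.
    frames.append(frame.add_suffix(f"_{level}"))

# Side-by-side join on the shared time index.
by_level = pd.concat(frames, axis=1)
print(by_level.columns.tolist())
```

Because the suffix is attached to whatever name each variable already has, a change in the column order of the source files cannot silently swap the labels.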


We join the DataFrames.

In [15]:
df_ERA5_all = pd.concat([ERA5_EMA15_sup_df, ERA5_EMA15_pre_1000, ERA5_EMA15_pre_500, ERA5_EMA15_pre_200], axis=1)

In [16]:
df_ERA5_all

time,t2m,z,tp,g_1000,t_1000,g_500,t_500,g_200,t_200
2014-01-01 00:00:00,280.645813,37063.71875,1.158565e-05,0.008835,297.136017,0.003417,268.610718,0.000093,220.895813
2014-01-01 06:00:00,277.901794,37063.71875,0.000000e+00,0.007724,296.930359,0.002321,267.871521,0.000031,221.523849
2014-01-01 12:00:00,279.369354,37063.71875,0.000000e+00,0.007367,297.062378,0.002287,268.741333,0.000030,221.227905
2014-01-01 18:00:00,285.019684,37063.71875,7.392447e-04,0.008523,297.136017,0.003990,268.285583,0.000041,221.737823
2014-01-02 00:00:00,279.970215,37063.71875,1.873057e-04,0.008609,297.242981,0.003992,269.072021,0.000092,222.624283
...,...,...,...,...,...,...,...,...,...
2018-12-30 18:00:00,287.092529,37063.71875,3.051260e-04,0.009952,296.729767,0.003633,267.876465,0.000052,220.129486
2018-12-31 00:00:00,281.842804,37063.71875,4.348811e-05,0.008959,297.087402,0.002240,268.994110,0.000069,220.172791
2018-12-31 06:00:00,278.876556,37063.71875,4.726462e-07,0.008518,296.977051,0.001803,269.012268,0.000060,219.337357
2018-12-31 12:00:00,280.022369,37063.71875,7.090624e-06,0.007676,297.105560,0.001983,268.418549,0.000052,219.185074
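
`pd.concat(..., axis=1)` places the frames side by side and aligns them on their index, so rows are matched by timestamp rather than by position. The small sketch below illustrates this with hypothetical values: one frame deliberately lacks the last time step, and the missing cell is filled with `NaN` in the result.

```python
import pandas as pd

# Four 6-hourly time steps (hypothetical values).
idx_full = pd.date_range("2014-01-01", periods=4, freq="6h")
surface = pd.DataFrame({"t2m": [280.6, 277.9, 279.4, 285.0]}, index=idx_full)

# This frame deliberately lacks the last time step.
upper = pd.DataFrame({"t_500": [268.6, 267.9, 268.7]}, index=idx_full[:3])

# Column-wise concatenation aligned on the time index (outer join by default).
combined = pd.concat([surface, upper], axis=1)
print(combined)
```

Checking `combined.isna().sum()` after the join is a simple way to confirm that all inputs cover the same time steps.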


### Convert UTC to local time (LT)

In [17]:
df_ERA5_AWS_LT = df_ERA5_all.tz_localize('utc').tz_convert('America/Lima')
df_ERA5_AWS_LT

time,t2m,z,tp,g_1000,t_1000,g_500,t_500,g_200,t_200
2013-12-31 19:00:00-05:00,280.645813,37063.71875,1.158565e-05,0.008835,297.136017,0.003417,268.610718,0.000093,220.895813
2014-01-01 01:00:00-05:00,277.901794,37063.71875,0.000000e+00,0.007724,296.930359,0.002321,267.871521,0.000031,221.523849
2014-01-01 07:00:00-05:00,279.369354,37063.71875,0.000000e+00,0.007367,297.062378,0.002287,268.741333,0.000030,221.227905
2014-01-01 13:00:00-05:00,285.019684,37063.71875,7.392447e-04,0.008523,297.136017,0.003990,268.285583,0.000041,221.737823
2014-01-01 19:00:00-05:00,279.970215,37063.71875,1.873057e-04,0.008609,297.242981,0.003992,269.072021,0.000092,222.624283
...,...,...,...,...,...,...,...,...,...
2018-12-30 13:00:00-05:00,287.092529,37063.71875,3.051260e-04,0.009952,296.729767,0.003633,267.876465,0.000052,220.129486
2018-12-30 19:00:00-05:00,281.842804,37063.71875,4.348811e-05,0.008959,297.087402,0.002240,268.994110,0.000069,220.172791
2018-12-31 01:00:00-05:00,278.876556,37063.71875,4.726462e-07,0.008518,296.977051,0.001803,269.012268,0.000060,219.337357
2018-12-31 07:00:00-05:00,280.022369,37063.71875,7.090624e-06,0.007676,297.105560,0.001983,268.418549,0.000052,219.185074


In [18]:
# Drop the timezone information but keep the local wall-clock times,
# so the index has the same naive format as the station timestamps.
df_ERA5_AWS_LT.index = df_ERA5_AWS_LT.index.tz_localize(None)
df_ERA5_AWS_LT.index.name = 'Datetime'

In [19]:
df_ERA5_AWS_LT

Datetime,t2m,z,tp,g_1000,t_1000,g_500,t_500,g_200,t_200
2013-12-31 19:00:00,280.645813,37063.71875,1.158565e-05,0.008835,297.136017,0.003417,268.610718,0.000093,220.895813
2014-01-01 01:00:00,277.901794,37063.71875,0.000000e+00,0.007724,296.930359,0.002321,267.871521,0.000031,221.523849
2014-01-01 07:00:00,279.369354,37063.71875,0.000000e+00,0.007367,297.062378,0.002287,268.741333,0.000030,221.227905
2014-01-01 13:00:00,285.019684,37063.71875,7.392447e-04,0.008523,297.136017,0.003990,268.285583,0.000041,221.737823
2014-01-01 19:00:00,279.970215,37063.71875,1.873057e-04,0.008609,297.242981,0.003992,269.072021,0.000092,222.624283
...,...,...,...,...,...,...,...,...,...
2018-12-30 13:00:00,287.092529,37063.71875,3.051260e-04,0.009952,296.729767,0.003633,267.876465,0.000052,220.129486
2018-12-30 19:00:00,281.842804,37063.71875,4.348811e-05,0.008959,297.087402,0.002240,268.994110,0.000069,220.172791
2018-12-31 01:00:00,278.876556,37063.71875,4.726462e-07,0.008518,296.977051,0.001803,269.012268,0.000060,219.337357
2018-12-31 07:00:00,280.022369,37063.71875,7.090624e-06,0.007676,297.105560,0.001983,268.418549,0.000052,219.185074
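
The whole time conversion (UTC to Lima local time, then removing the timezone label) can be checked on a few synthetic time steps. The values and the names `df`, `local` and `naive` are hypothetical; the calls are standard pandas methods. Lima is five hours behind UTC, which matches the shift seen in the tables above.

```python
import pandas as pd

# Four 6-hourly UTC time steps without timezone information (hypothetical values).
idx_utc = pd.date_range("2014-01-01 00:00", periods=4, freq="6h")
df = pd.DataFrame({"t2m": [280.6, 277.9, 279.4, 285.0]}, index=idx_utc)

# Declare the index as UTC, then express it in Lima local time (UTC-5).
local = df.tz_localize("UTC").tz_convert("America/Lima")

# Remove the timezone label while keeping the local wall-clock times.
naive = local.tz_localize(None)

print(local.index[0])
print(naive.index[0])
```

The first UTC time step of 2014 therefore lands on the evening of 31 December 2013 in local time.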


We save the ERA5 dataset.

In [20]:
df_ERA5_AWS_LT.to_csv('../data/ERA5_all.csv')
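
One detail to keep in mind when the two saved files are combined later: the station means from section 1.1 are labelled 00:00, 06:00, 12:00 and 18:00, while the ERA5 rows in local time fall at 01:00, 07:00, 13:00 and 19:00, so the two indexes share no timestamps as written. The sketch below shows one possible way to line them up, which is an assumption and not part of the original workflow: the `offset` argument of `resample` shifts the 6-hour bins by one hour. All data and names here are synthetic and hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic hourly station series starting at 01:00 (hypothetical values).
hourly = pd.DataFrame(
    {"Temp": np.arange(24, dtype=float)},
    index=pd.date_range("2014-01-01 01:00", periods=24, freq="h"),
)

# ERA5 time stamps after conversion to local time: 01:00, 07:00, 13:00, 19:00.
era5_index = pd.date_range("2014-01-01 01:00", periods=4, freq="6h")

# Default bins start at midnight; shifted bins start one hour later.
default_bins = hourly.resample("6h").mean()
shifted_bins = hourly.resample("6h", offset="1h").mean()

# Number of time stamps each version shares with the ERA5 index.
print(default_bins.index.intersection(era5_index).size)
print(shifted_bins.index.intersection(era5_index).size)
```

Whether a shifted 6-hour mean or another matching strategy (for example, selecting the hourly observation at each ERA5 time step) is more appropriate depends on how the predictors are used downstream.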