# Preprocessor Tutorial

This notebook demonstrates the `hw_predictor/components/preprocessor` package. The package is
designed to run inside Kubeflow Pipelines, but this introduction shows how its functions can be
imported and used for experimentation in notebooks.

In [1]:
# reload developed modules automatically, so there's no need
# to restart the kernel after editing them
%load_ext autoreload
%autoreload 2


In [2]:
from os import chdir

# change the working directory to the project's root path, so that the
# data/ and hw_predictor/ folders can be referenced with relative paths
chdir("../..")

# Imports

In [3]:
import pandas as pd
import hw_predictor.components.preprocessor.src as pp

# Parameters

In [4]:
input_path = "data/test/input/stations"
output_path = "data/test/output/stations"
station_id = 330020

year = 2022
save = True

# Code

Before running the code, make sure the required project environment variables are set. Assuming
an `.env` file already exists in the project root directory, this can be done with the following command.

```bash
export $(cat .env | xargs)
```
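
Variables exported this way are only visible to processes started from that same shell, so the command has to run before Jupyter is launched. If the kernel is already running, the file can be loaded from inside the notebook instead. The following is a minimal sketch, assuming the `.env` file contains only simple `KEY=VALUE` lines; `parse_env_text` and `load_env_file` are hypothetical helpers and are not part of `hw_predictor`.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any, around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Hypothetical helper: read a .env file and copy it into os.environ."""
    values = parse_env_text(Path(path).read_text())
    os.environ.update(values)
    return values
```

Calling `load_env_file()` after the `chdir("../..")` cell would pick up the `.env` file in the project root.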

As of Thu 28/12/2023, the following environment variables are needed:

```
METEOCHILE_USER=
METEOCHILE_API_KEY=
CDS_API_URL=
CDS_API_KEY=
CLUSTER_HOST=
CLUSTER_USER=
CLUSTER_PASSWORD=
```
Check with Mauro Mendoza (msmendoza@uc.cl) for the values of these variables.
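
A quick way to confirm that the variables listed above are visible to the kernel is to look them up in `os.environ`. This is only a sketch: `missing_env_vars` is a hypothetical helper, and it treats an empty value the same as a missing one.

```python
import os

# Names taken from the list above.
REQUIRED_VARS = [
    "METEOCHILE_USER",
    "METEOCHILE_API_KEY",
    "CDS_API_URL",
    "CDS_API_KEY",
    "CLUSTER_HOST",
    "CLUSTER_USER",
    "CLUSTER_PASSWORD",
]


def missing_env_vars(required=REQUIRED_VARS, environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]
```

An empty list from `missing_env_vars()` means every variable is set.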

# Preprocess stations data

In [5]:
temp_history = pp.meteochile.load_daily_max_temp_history(input_path, station_id)
temp_history

date,max_temp
1967-03-01,26.4
1967-03-02,27.8
1967-03-03,27.2
1967-03-04,27.8
1967-03-05,30.3
...,...
2023-12-16,22.1
2023-12-17,26.3
2023-12-18,30.6
2023-12-19,25.0


In [6]:
data = pp.meteochile.compute_90_percentile(temp_history, year)
data

date,max_temp,90_percentile
2022-01-01,29.9,33.3
2022-01-02,31.0,32.55
2022-01-03,32.6,32.9
2022-01-04,32.4,33.5
2022-01-05,32.3,32.3
...,...,...
2022-12-27,30.5,32.6
2022-12-28,32.2,32.5
2022-12-29,34.4,32.75
2022-12-30,27.7,32.55
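
The internals of `compute_90_percentile` are not shown in this notebook. As an illustration of the idea only, the sketch below assumes the threshold for each date is the 0.9 quantile of all historical `max_temp` values recorded on the same calendar day. The package may use a different window or reference period, and `daily_90th_percentile` is a hypothetical name.

```python
import pandas as pd


def daily_90th_percentile(history: pd.DataFrame, year: int) -> pd.DataFrame:
    """Illustrative only: per-calendar-day 0.9 quantile of max_temp.

    `history` needs a DatetimeIndex and a `max_temp` column.
    """
    # Group observations by "MM-DD" so the same calendar day lines up across years.
    day_key = history.index.strftime("%m-%d")
    quantiles = history["max_temp"].groupby(day_key).quantile(0.9)

    # Keep the requested year and attach the matching threshold to each date.
    target = history.loc[history.index.year == year, ["max_temp"]].copy()
    target_key = target.index.strftime("%m-%d")
    target["90_percentile"] = quantiles.reindex(target_key).to_numpy()
    return target
```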


In [7]:
data = pp.meteochile.compute_90_percentile_adj(data)
data

2023-12-28 10:45:23.968 | INFO     | hw_predictor.components.preprocessor.src.meteochile.compute_threshold:compute_90_percentile_adj:119 - a + a1*cos(omega*x) + b1*sin(omega*x)
2023-12-28 10:45:23.969 | INFO     | hw_predictor.components.preprocessor.src.meteochile.compute_threshold:compute_90_percentile_adj:123 - a  :  27.5  CI ~ N [2.74e+01,2.76e+01]
2023-12-28 10:45:23.970 | INFO     | hw_predictor.components.preprocessor.src.meteochile.compute_threshold:compute_90_percentile_adj:123 - a1 :  5.81  CI ~ N [5.64e+00,5.98e+00]
2023-12-28 10:45:23.970 | INFO     | hw_predictor.components.preprocessor.src.meteochile.compute_threshold:compute_90_percentile_adj:123 - b1 :  0.991  CI ~ N [8.24e-01,1.16e+00]


date,max_temp,90_percentile,90_percentile_adj
2022-01-01,29.9,33.3,33.278560
2022-01-02,31.0,32.55,33.294753
2022-01-03,32.6,32.9,33.309223
2022-01-04,32.4,33.5,33.321965
2022-01-05,32.3,32.3,33.332975
...,...,...,...
2022-12-27,30.5,32.6,33.165478
2022-12-28,32.2,32.5,33.190630
2022-12-29,34.4,32.75,33.214089
2022-12-30,27.7,32.55,33.235848
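
The logged model `a + a1*cos(omega*x) + b1*sin(omega*x)` is a first-harmonic seasonal curve that smooths the raw daily percentiles. As a rough sketch of how such a curve can be fitted, assuming `x` is the day of the year and `omega` is fixed at `2*pi/365.25` (the log confirms neither), ordinary least squares with NumPy is enough. `fit_first_harmonic` is a hypothetical name, and unlike the package output above this sketch does not report confidence intervals.

```python
import numpy as np


def fit_first_harmonic(x, y, period=365.25):
    """Least-squares fit of y ~ a + a1*cos(omega*x) + b1*sin(omega*x).

    omega is assumed fixed at 2*pi/period; it is not fitted here.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    omega = 2.0 * np.pi / period

    # With omega fixed the model is linear in (a, a1, b1).
    design = np.column_stack(
        [np.ones_like(x), np.cos(omega * x), np.sin(omega * x)]
    )
    (a, a1, b1), *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ np.array([a, a1, b1])
    return (a, a1, b1), fitted
```

Under these assumptions, passing the day of the year as `x` and the `90_percentile` column as `y` would give a smoothed threshold comparable to `90_percentile_adj`.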


In [8]:
data = pp.meteochile.add_above_threshold(
    data,
    station_id,
    save=save,
    path=output_path,
)
data

date,max_temp,90_percentile,90_percentile_adj,above_threshold
2022-01-01,29.9,33.3,33.278560,0
2022-01-02,31.0,32.55,33.294753,0
2022-01-03,32.6,32.9,33.309223,0
2022-01-04,32.4,33.5,33.321965,0
2022-01-05,32.3,32.3,33.332975,0
...,...,...,...,...
2022-12-27,30.5,32.6,33.165478,0
2022-12-28,32.2,32.5,33.190630,0
2022-12-29,34.4,32.75,33.214089,1
2022-12-30,27.7,32.55,33.235848,0
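
The `above_threshold` column is a 0/1 flag. The rows shown are consistent with flagging days whose `max_temp` strictly exceeds `90_percentile_adj`, but that rule is an assumption here, since the implementation of `add_above_threshold` is not shown. A minimal sketch of that comparison, with the hypothetical name `flag_above_threshold`:

```python
import pandas as pd


def flag_above_threshold(
    frame: pd.DataFrame,
    value_col: str = "max_temp",
    threshold_col: str = "90_percentile_adj",
) -> pd.DataFrame:
    """Return a copy with a 0/1 `above_threshold` column.

    Assumes a strict comparison against the adjusted threshold.
    """
    out = frame.copy()
    out["above_threshold"] = (out[value_col] > out[threshold_col]).astype(int)
    return out
```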


In [9]:
pd.read_parquet("data/test/output/stations/330020")

date,max_temp,90_percentile,90_percentile_adj,above_threshold
2022-01-01,29.9,33.30,33.278560,0
2022-01-02,31.0,32.55,33.294753,0
2022-01-03,32.6,32.90,33.309223,0
2022-01-04,32.4,33.50,33.321965,0
2022-01-05,32.3,32.30,33.332975,0
...,...,...,...,...
2022-12-27,30.5,32.60,33.165478,0
2022-12-28,32.2,32.50,33.190630,0
2022-12-29,34.4,32.75,33.214089,1
2022-12-30,27.7,32.55,33.235848,0
