# About
This notebook contains the test definition for an OHE for the datetime data in the AQ datasets, for later use as metadata in forecasting models.

# Introduction
In the time series analysis, several plots were created to visualize the relationship between PM2.5 measurements and different timeframes. That analysis showed strong trends for specific timeframes, so feeding the forecasting models additional time metadata should benefit their forecasts. This time metadata is categorical. A common way to input categorical data into a forecasting model is One-Hot Encoding (OHE). The drawback of this approach is the high dimensionality it adds and the sparse data it produces. For the intended time metadata, it would add 12 columns for the month and 24 for the hour, a total of 36 additional columns.
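The column count above can be checked with a small sketch. It assumes pandas' `get_dummies` as the one-hot encoder and uses a hypothetical frame `toy` holding every month/hour combination once; neither is part of the original notebook.

```python
import calendar

import pandas as pd

# Hypothetical frame: every (month, hour) combination exactly once
months = list(calendar.month_name)[1:]  # 'January' ... 'December'
toy = pd.DataFrame(
    [(m, h) for m in months for h in range(24)],
    columns=["month", "hour"],
)

# One-hot encode both columns: 12 month columns + 24 hour columns
one_hot = pd.get_dummies(toy, columns=["month", "hour"])
print(one_hot.shape)
```

Each row of `one_hot` has only two non-zero entries out of 36, which is the sparsity the paragraph refers to.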

Sparse data is hard to optimize: it requires specialized sparse optimization strategies, which can be avoided if the categorical data is structured differently. Other encoders are available, such as the TargetEncoder, which replaces each category with the mean of the target variable.



The Bayesian encoders use information from the dependent variable (DV) in their encodings. They output one column and can work well with high-cardinality data.

* Target: uses the mean of the DV; steps must be taken to avoid overfitting and response leakage. Nominal, ordinal. For classification tasks.
* LeaveOneOut: similar to Target, but avoids contamination by excluding the current row's own target value from the mean. Nominal, ordinal. For classification tasks.
* WeightOfEvidence: added in v1.3. Not documented in the docs as of April 11, 2019. The method is explained in this post.
* James-Stein: forthcoming in v1.4. Described in the code here.
* M-estimator: forthcoming in v1.4. Described in the code here. Simplified target encoder.
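
The idea behind the first two encoders in the list can be sketched in plain pandas. This is a simplified illustration on hypothetical toy data (`toy`, `cat` and `y` are made-up names), not the category_encoders implementation, which offers additional options such as smoothing in `TargetEncoder`.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a numeric target
toy = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "b"],
    "y": [1.0, 3.0, 2.0, 4.0, 6.0],
})

groups = toy.groupby("cat")["y"]

# Plain target encoding: each category becomes the mean of its targets
target_mean = groups.transform("mean")

# Leave-one-out: the same mean, but excluding the row's own target value.
# A category that occurs only once would give 0/0 here (NaN).
leave_one_out = (groups.transform("sum") - toy["y"]) / (groups.transform("count") - 1)

print(pd.DataFrame({"cat": toy["cat"], "target": target_mean, "loo": leave_one_out}))
```

Because the row's own target is removed, rows of the same category get slightly different leave-one-out values, which is also visible in the encoded `month` and `hour` columns produced in this notebook.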

Category Encoders follows the same API as scikit-learn's preprocessors, with some added conveniences, such as the ability to easily add an encoder to a pipeline. Additionally, an encoder returns a pandas DataFrame if a DataFrame is passed to it. An example with the BinaryEncoder is given in the last cell of this notebook.

# Libraries

In [1]:
%run "main_global.ipynb"

Connection with MySQL database is ready!


# Parameters

In [2]:
station = "CE"
mvi_method = "MVI_MICE"
target = "pm25"

In [3]:
table_name = "sima_station_{}_{}".format(mvi_method, station)

# UDF

# Data

In [4]:
sqlq = "Select * from {} where year(datetime) in (2020, 2021, 2022)".format(table_name)


In [50]:
time_sqlq = """Select datetime
, monthname(datetime)
, hour(datetime)
, {}
from {}
{};
""".format(target, table_name, where_from_sqlq(sqlq))

time_data = DataFrame(aux_qdata(time_sqlq), columns = ['datetime', 'month', "hour", target])

time_data = time_data.drop(["datetime"], axis = 1)
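
For readers without access to the database helpers, the same `month` and `hour` features could be derived from a datetime column directly in pandas. This is a minimal sketch on hypothetical rows (the frame `raw` and its values are made up), not a replacement for the SQL query above.

```python
import pandas as pd

# Hypothetical rows standing in for the station table
raw = pd.DataFrame({
    "datetime": pd.to_datetime(
        ["2020-01-15 00:00", "2021-07-04 13:00", "2022-12-31 23:00"]
    ),
    "pm25": [10.0, 20.0, 30.0],
})

# pandas counterparts of MySQL's monthname() and hour()
features = pd.DataFrame({
    "month": raw["datetime"].dt.month_name(),
    "hour": raw["datetime"].dt.hour,
    "pm25": raw["pm25"],
})
print(features)
```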

# Main

In [53]:
X_train = time_data[["month", "hour"]]
y_train = time_data[target]

In [38]:
y_train

0        94.00
1         5.00
2         6.00
3        18.00
4        23.00
         ...  
20107    25.02
20108    15.61
20109    39.58
20110    48.14
20111    25.07
Name: pm25, Length: 20112, dtype: float64

In [55]:
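from category_encoders import WOEEncoder
from category_encoders import TargetEncoder
from category_encoders import LeaveOneOutEncoder
from category_encoders import JamesSteinEncoder

# Alternative kept for reference: target-encode the month only
# enc = TargetEncoder(cols=["month"])
# training_set = enc.fit_transform(X_train, y_train)

# Leave-one-out encoding: each row's category is replaced by the mean of the
# target over the other rows that share that category
encoded = LeaveOneOutEncoder(cols=["month", "hour"]).fit_transform(X_train, y_train)
encoded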

Unnamed: 0,month,hour
0,29.376760,23.514406
1,29.416652,20.870730
2,29.416204,21.247051
3,29.410825,21.869391
4,29.408584,22.138843
...,...,...
20107,28.613651,21.092539
20108,28.618746,22.764452
20109,28.605768,31.134592
20110,28.601133,21.064916


In [33]:
import numpy as np
import pandas as pd
import category_encoders as ce

# make some data
df = pd.DataFrame({
 'color':["a", "b", "a", "c"], 
 'outcome':[1, 2, 3, 2]})

# split into X and y
X = df.drop('outcome', axis = 1)
y = df['outcome']

# instantiate an encoder - here we use BinaryEncoder
ce_binary = ce.BinaryEncoder(cols = ['color'])

# fit and transform and presto, you've got encoded data
ce_binary.fit_transform(X, y)

Unnamed: 0,color_0,color_1
0,0,1
1,1,0
2,0,1
3,1,1
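
The output above can be reproduced by hand to see what binary encoding does. The sketch below assumes the scheme is "number the categories from 1 in order of first appearance, then write that number in binary"; the helper `binary_encode` is hypothetical and only matches the table above for this toy data, not necessarily the library's behavior in general.

```python
import pandas as pd


def binary_encode(series: pd.Series) -> pd.DataFrame:
    """Hypothetical helper: ordinal code per category, split into bit columns."""
    # Ordinal codes starting at 1, in order of first appearance: a=1, b=2, c=3
    codes = pd.factorize(series)[0] + 1
    n_bits = max(int(codes.max()).bit_length(), 1)
    # Most significant bit first, one column per bit
    bits = {
        f"{series.name}_{i}": (codes >> (n_bits - 1 - i)) & 1
        for i in range(n_bits)
    }
    return pd.DataFrame(bits, index=series.index)


color = pd.Series(["a", "b", "a", "c"], name="color")
manual = binary_encode(color)
print(manual)
```

Three categories need only two columns here, whereas one-hot encoding would need three; the gap widens as the number of categories grows.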
