<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

In [None]:
#| include: false
from nbdev.showdoc import *

## Overview: The NumerFrame

[`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) is a data structure that extends `pd.DataFrame` with functionality convenient for Numerai users. The main benefits include:
1. It automatically tracks feature, target, prediction and other (auxiliary) columns, and makes these data slices easy to retrieve.
2. Other library components automatically recognize the era column (`era`, `friday_date` or `date`).
3. It integrates with the other library components (i.e. `preprocessing`, `model`, `modelpipeline`, `postprocessing`, `evaluation` and `submission`) to build more robust and reliable inference pipelines.

In addition, all functionality of Pandas DataFrames remains available on a [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe), so you don't have to create new pipelines to process your data when you switch to it.

We adopt the convention:
 1. All feature column names should start with `'feature'`.
 2. All target column names should start with `'target'`.
 3. All prediction column names should start with `'prediction'`.
 4. Data should contain an `'era'`, `'friday_date'` or `'date'` column, as is almost always the case with Numerai datasets.

Every column whose name does not start with one of these three prefixes is classified as an `'aux'` column. This includes the era column itself.
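The prefix convention can be sketched in a few lines of plain Python. This is only an illustration of the naming rules listed above, not NumerBlox's own code; `classify_columns` is a hypothetical helper name.

```python
# Hypothetical helper, not part of NumerBlox: groups column names
# according to the prefix convention described above.
def classify_columns(columns):
    groups = {"feature": [], "target": [], "prediction": [], "aux": []}
    for col in columns:
        for prefix in ("feature", "target", "prediction"):
            if col.startswith(prefix):
                groups[prefix].append(col)
                break
        else:
            # No known prefix matched, so the column is auxiliary.
            groups["aux"].append(col)
    return groups

cols = ["id", "era", "feature_A", "feature_B", "target", "target_1", "prediction_1"]
groups = classify_columns(cols)
print(groups["feature"])  # ['feature_A', 'feature_B']
print(groups["aux"])      # ['id', 'era']
```

Note that the era column ends up in the `aux` group, since its name carries none of the three prefixes.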

In [1]:
#| echo: false
#| output: asis
show_doc(NumerFrame)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/numerframe.py#L16){target="_blank" style="float:right; font-size:smaller"}

### NumerFrame

>      NumerFrame (*args, **kwargs)

Data structure which extends Pandas DataFrames and
allows for additional Numerai specific functionality.

[`create_numerframe`](https://crowdcent.github.io/numerblox/numerframe.html#create_numerframe) automatically recognizes your data file format, loads it into a [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) and allows for column selection before loading.

Supported file formats are `.csv`, `.parquet`, `.pkl`, `.pickle`, `.xls`, `.xlsx`, `.xlsm`, `.xlsb`, `.odf`, `.ods` and `.odt`. If the file format for your use case is missing, feel free to create a GitHub issue or submit a pull request. See `README.md` for more information on contributing.
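One common way to recognize a file format is to look at the file suffix and map it to a pandas reader. The sketch below shows that idea under stated assumptions: the mapping and the `pick_reader` helper are hypothetical and cover only a few suffixes; they are not NumerBlox's actual lookup table.

```python
from pathlib import Path

# Assumed mapping from file suffix to the name of a pandas reader function.
# Illustration only; the real library may organize this differently.
SUFFIX_TO_READER = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".pkl": "read_pickle",
    ".pickle": "read_pickle",
    ".xls": "read_excel",
    ".xlsx": "read_excel",
}

def pick_reader(file_path: str) -> str:
    """Return the pandas reader name for a path, or raise for unknown suffixes."""
    suffix = Path(file_path).suffix.lower()
    try:
        return SUFFIX_TO_READER[suffix]
    except KeyError:
        raise NotImplementedError(f"Unsupported file format: '{suffix}'") from None

reader_name = pick_reader("test_assets/mini_numerai_version_1_data.csv")
print(reader_name)  # read_csv
# With pandas imported as pd, the file could then be loaded with:
# getattr(pd, reader_name)(file_path, **kwargs)
```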

In [2]:
#| echo: false
#| output: asis
show_doc(create_numerframe)

---

[source](https://github.com/crowdcent/numerblox/blob/master/numerblox/numerframe.py#L146){target="_blank" style="float:right; font-size:smaller"}

### create_numerframe

>      create_numerframe (file_path:str, columns:list=None, *args, **kwargs)

Convenience function to initialize a NumerFrame.
Supports the most used file formats for Pandas DataFrames (.csv, .parquet, .xls, .pkl, etc.).
For more details check https://pandas.pydata.org/docs/reference/io.html

:param file_path: Relative or absolute path to data file. 

:param columns: Which columns to read (All by default). 

`*args` and `**kwargs` will be passed to the Pandas loading function.

## NumerFrame Usage

A [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) object can be initialized from memory just like you would with a Pandas DataFrame.

### 1. Initialize from memory

In [None]:
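import uuid

import numpy as np
import pandas as pd
from numerblox.numerframe import NumerFrame, create_numerframe

test_features = [f"feature_{letter}" for letter in "ABCDEFGHIK"]
id_col = [uuid.uuid4().hex for _ in range(100)]

# Random DataFrame with 10 features, an id, 3 targets and a `date` era column
dataf = pd.DataFrame(np.random.uniform(size=(100, 10)), columns=test_features)
dataf["id"] = id_col
dataf[["target", "target_1", "target_2"]] = np.random.normal(size=(100, 3))
dataf["date"] = range(100)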

In [None]:
memory_dataf = NumerFrame(dataf)
assert memory_dataf.meta.era_col == "date"

In [None]:
memory_dataf.head(2)

|   | feature_A | feature_B | feature_C | feature_D | feature_E | feature_F | feature_G | feature_H | feature_I | feature_K | id | target | target_1 | target_2 | date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.347479 | 0.193681 | 0.572169 | 0.201514 | 0.309487 | 0.784491 | 0.641908 | 0.414017 | 0.667712 | 0.68248 | 4e7d7ad23be14c3587ad3f47d4191715 | 2.144772 | 0.849946 | 0.123302 | 0 |
| 1 | 0.573529 | 0.86929 | 0.356529 | 0.266067 | 0.973842 | 0.554975 | 0.884594 | 0.006587 | 0.978762 | 0.431653 | 492fb024968a4611a9ebb9dc5c054d67 | 0.221895 | 0.194775 | -0.213684 | 1 |


The `meta` attribute will store which era column is used. This is used in NumerBlox processors to group computations by era where needed.

In [None]:
memory_dataf.meta

{'era_col': 'date'}
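As a rough illustration of what tracking the era column makes possible, the sketch below picks the first era candidate present in the column names and uses it to group a computation per era. The helper `find_era_col` and the priority order of the candidates are assumptions made for this example, not NumerBlox internals.

```python
from collections import defaultdict
from statistics import mean

# Candidate era column names mentioned in this documentation.
ERA_CANDIDATES = ("era", "friday_date", "date")

def find_era_col(columns):
    """Return the first candidate era column present, else None (assumed priority order)."""
    for candidate in ERA_CANDIDATES:
        if candidate in columns:
            return candidate
    return None

rows = [
    {"date": 0, "target": 0.25},
    {"date": 0, "target": 0.75},
    {"date": 1, "target": 1.0},
]
era_col = find_era_col(rows[0].keys())
print({"era_col": era_col})  # {'era_col': 'date'}

# Group a computation by era: here, the mean target per era.
per_era = defaultdict(list)
for row in rows:
    per_era[row[era_col]].append(row["target"])
era_means = {era: mean(values) for era, values in per_era.items()}
print(era_means)  # {0: 0.5, 1: 1.0}
```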

### 2. Initialize from file path

You can also use the convenience function [`create_numerframe`](https://crowdcent.github.io/numerblox/numerframe.html#create_numerframe) to initialize a [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) directly from a file. Think of it as a dynamic `pd.read_csv`, `pd.read_parquet`, etc.

In [None]:
num_dataf = create_numerframe("test_assets/mini_numerai_version_1_data.csv")
assert num_dataf.meta.era_col == "era"

### 3. Example functionality

`.get_feature_data` will retrieve all columns where the column name starts with `feature`.

In [None]:
num_dataf.get_feature_data.head(2)

|   | feature_intelligence1 | feature_intelligence2 | feature_intelligence3 | feature_intelligence4 | feature_intelligence5 | feature_intelligence6 | feature_intelligence7 | feature_intelligence8 | feature_intelligence9 | feature_intelligence10 | ... | feature_wisdom37 | feature_wisdom38 | feature_wisdom39 | feature_wisdom40 | feature_wisdom41 | feature_wisdom42 | feature_wisdom43 | feature_wisdom44 | feature_wisdom45 | feature_wisdom46 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.5 | 0.25 | 0.0 | 0.5 | 0.25 | 0.25 | 0.25 | 0.75 | 0.75 | ... | 1.0 | 1.0 | 1.0 | 0.75 | 0.5 | 0.75 | 0.5 | 1.0 | 0.5 | 0.75 |
| 1 | 0.0 | 0.0 | 0.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.25 | 0.5 | 0.5 | ... | 0.75 | 1.0 | 1.0 | 0.0 | 0.0 | 0.75 | 0.25 | 0.0 | 0.25 | 1.0 |


`.get_target_data` retrieves all columns whose name starts with `"target"`.

In [None]:
num_dataf.get_target_data.head(2)

|   | target |
|---|---|
| 0 | 0.5 |
| 1 | 0.25 |


`.get_single_target_data` only retrieves the column `"target"`.

In [None]:
num_dataf.get_single_target_data.head(2)

|   | target |
|---|---|
| 0 | 0.5 |
| 1 | 0.25 |


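`.get_pattern_data` allows you to get columns based on a certain pattern. In this example we ask for all 20-day targets (column names matching `_20`). The sample file used here only has a single `target` column, so the result has no columns and only the index is shown.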

In [None]:
num_dataf.get_pattern_data("_20").head(2)

0
1
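To see what a pattern-based selection looks like when there are matching columns, here is a minimal pandas sketch. It assumes plain substring matching, which may differ from how `.get_pattern_data` interprets its pattern; the column names and the helper `select_by_pattern` are illustrative only.

```python
import pandas as pd

# Toy frame; the column names are illustrative.
df = pd.DataFrame({
    "feature_A": [0.0, 0.5],
    "target_nomi_20": [0.25, 0.5],
    "target_nomi_60": [0.75, 0.5],
})

def select_by_pattern(frame: pd.DataFrame, pattern: str) -> pd.DataFrame:
    """Keep the columns whose name contains `pattern` (plain substring match)."""
    return frame[[col for col in frame.columns if pattern in col]]

print(list(select_by_pattern(df, "_20").columns))  # ['target_nomi_20']
# No match: the rows are kept but no columns remain.
print(select_by_pattern(df, "_99").shape)  # (2, 0)
```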


In [None]:
num_dataf.head()

|   | id | era | data_type | feature_intelligence1 | feature_intelligence2 | feature_intelligence3 | feature_intelligence4 | feature_intelligence5 | feature_intelligence6 | feature_intelligence7 | ... | feature_wisdom38 | feature_wisdom39 | feature_wisdom40 | feature_wisdom41 | feature_wisdom42 | feature_wisdom43 | feature_wisdom44 | feature_wisdom45 | feature_wisdom46 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | n000315175b67977 | era1 | train | 0.0 | 0.5 | 0.25 | 0.0 | 0.5 | 0.25 | 0.25 | ... | 1.0 | 1.0 | 0.75 | 0.5 | 0.75 | 0.5 | 1.0 | 0.5 | 0.75 | 0.5 |
| 1 | n0014af834a96cdd | era1 | train | 0.0 | 0.0 | 0.0 | 0.25 | 0.5 | 0.0 | 0.0 | ... | 1.0 | 1.0 | 0.0 | 0.0 | 0.75 | 0.25 | 0.0 | 0.25 | 1.0 | 0.25 |
| 2 | n001c93979ac41d4 | era1 | train | 0.25 | 0.5 | 0.25 | 0.25 | 1.0 | 0.75 | 0.75 | ... | 0.25 | 0.5 | 0.0 | 0.0 | 0.5 | 1.0 | 0.0 | 0.25 | 0.75 | 0.25 |
| 3 | n0034e4143f22a13 | era1 | train | 1.0 | 0.0 | 0.0 | 0.5 | 0.5 | 0.25 | 0.25 | ... | 1.0 | 1.0 | 0.75 | 0.75 | 1.0 | 1.0 | 0.75 | 1.0 | 1.0 | 0.25 |
| 4 | n00679d1a636062f | era1 | train | 0.25 | 0.25 | 0.25 | 0.25 | 0.0 | 0.25 | 0.5 | ... | 0.75 | 0.75 | 0.25 | 0.5 | 0.75 | 0.0 | 0.5 | 0.25 | 0.75 | 0.75 |


`.get_era_batch` returns the feature data and target data for one or more eras, as `tf.Tensor` objects when `convert_to_tf=True` and as NumPy arrays otherwise. This is convenient for creating neural network DataGenerators.

In [None]:
X_era, y_era = num_dataf.get_era_batch(['era1'], convert_to_tf=True)
X_era

<tf.Tensor: shape=(10, 310), dtype=float64, numpy=
array([[0.  , 0.5 , 0.25, ..., 1.  , 0.5 , 0.75],
       [0.  , 0.  , 0.  , ..., 0.  , 0.25, 1.  ],
       [0.25, 0.5 , 0.25, ..., 0.  , 0.25, 0.75],
       ...,
       [0.25, 1.  , 1.  , ..., 0.75, 0.5 , 0.25],
       [0.5 , 0.5 , 0.5 , ..., 0.  , 0.  , 0.  ],
       [0.5 , 1.  , 1.  , ..., 1.  , 1.  , 0.75]])>
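The idea behind an era batch can be sketched with plain pandas: filter the rows of the requested eras, then split them into a feature array and a target array. The helpers `era_batch` and `era_generator` below are hypothetical stand-ins written for this example, assuming the prefix convention from the overview; they are not the library's implementation.

```python
import pandas as pd

df = pd.DataFrame({
    "era": ["era1", "era1", "era2", "era2", "era3"],
    "feature_A": [0.0, 0.25, 0.5, 0.75, 1.0],
    "feature_B": [1.0, 0.75, 0.5, 0.25, 0.0],
    "target": [0.5, 0.25, 0.5, 0.75, 0.25],
})

def era_batch(frame, eras, era_col="era"):
    """Return (X, y) NumPy arrays for the rows that belong to `eras`."""
    rows = frame[frame[era_col].isin(eras)]
    feature_cols = [c for c in rows.columns if c.startswith("feature")]
    target_cols = [c for c in rows.columns if c.startswith("target")]
    return rows[feature_cols].to_numpy(), rows[target_cols].to_numpy()

def era_generator(frame, era_col="era"):
    """Yield one (X, y) batch per era, in order of first appearance."""
    for era in frame[era_col].unique():
        yield era_batch(frame, [era], era_col)

X, y = era_batch(df, ["era1"])
print(X.shape, y.shape)  # (2, 2) (2, 1)
print([xb.shape[0] for xb, _ in era_generator(df)])  # [2, 2, 1]
```

A generator like this feeds one era at a time to a training loop, which keeps each batch aligned with era boundaries.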

If you are training an autoencoder + MLP (AEMLP), set `aemlp_batch=True`. The returned target is then a list of 3 elements: the features, the targets, and the targets once more.
More info on this setup: [AutoEncoder and multitask MLP on new dataset forum post](https://forum.numer.ai/t/autoencoder-and-multitask-mlp-on-new-dataset-from-kaggle-jane-street/4338).

In [None]:
_, y_era_aemlp = num_dataf.get_era_batch(['era1'], convert_to_tf=True, aemlp_batch=True)
y_era_aemlp

[<tf.Tensor: shape=(10, 310), dtype=float64, numpy=
 array([[0.  , 0.5 , 0.25, ..., 1.  , 0.5 , 0.75],
        [0.  , 0.  , 0.  , ..., 0.  , 0.25, 1.  ],
        [0.25, 0.5 , 0.25, ..., 0.  , 0.25, 0.75],
        ...,
        [0.25, 1.  , 1.  , ..., 0.75, 0.5 , 0.25],
        [0.5 , 0.5 , 0.5 , ..., 0.  , 0.  , 0.  ],
        [0.5 , 1.  , 1.  , ..., 1.  , 1.  , 0.75]])>,
 <tf.Tensor: shape=(10, 1), dtype=float64, numpy=
 array([[0.5 ],
        [0.25],
        [0.25],
        [0.25],
        [0.75],
        [0.5 ],
        [0.25],
        [0.25],
        [0.5 ],
        [0.75]])>,
 <tf.Tensor: shape=(10, 1), dtype=float64, numpy=
 array([[0.5 ],
        [0.25],
        [0.25],
        [0.25],
        [0.75],
        [0.5 ],
        [0.25],
        [0.25],
        [0.5 ],
        [0.75]])>]
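A plausible reading of the three elements, based on the output above and the setup in the linked forum post, is one target per output head: the features for the autoencoder's reconstruction head, and the same targets for two prediction heads. The sketch below builds that structure with plain NumPy; `make_aemlp_batch` is a hypothetical helper, not a NumerBlox function.

```python
import numpy as np

def make_aemlp_batch(X, y):
    """Return (inputs, targets) where targets is [features, targets, targets]."""
    return X, [X, y, y]

X = np.array([[0.0, 0.5, 0.25], [0.25, 1.0, 0.75]])  # 2 rows, 3 features
y = np.array([[0.5], [0.25]])                        # 2 rows, 1 target

inputs, targets = make_aemlp_batch(X, y)
print([t.shape for t in targets])  # [(2, 3), (2, 1), (2, 1)]
```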

`.aux_cols` denotes all columns that are not features, targets or prediction columns.

In [None]:
num_dataf.aux_cols

['id', 'era', 'data_type']

In [None]:
num_dataf.get_aux_data.head(2)

|   | id | era | data_type |
|---|---|---|---|
| 0 | n000315175b67977 | era1 | train |
| 1 | n0014af834a96cdd | era1 | train |


In [None]:
num_dataf['prediction_1'] = np.random.uniform(size=len(num_dataf))
num_dataf['prediction_2'] = np.random.uniform(size=len(num_dataf))

To track new columns like prediction columns, make sure to initialize a new [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe). Prediction columns can then be retrieved with `.get_prediction_data`, or with `.get_prediction_aux_data` if you also want auxiliary columns like `era` and `data_type`. This can be handy for ensembling and submission use cases.

In [None]:
num_dataf = NumerFrame(num_dataf)

In [None]:
num_dataf.get_prediction_data.head(2)

|   | prediction_1 | prediction_2 |
|---|---|---|
| 0 | 0.085195 | 0.055496 |
| 1 | 0.002619 | 0.687268 |


In [None]:
num_dataf.get_prediction_aux_data.head(2)

|   | prediction_1 | prediction_2 | id | era | data_type |
|---|---|---|---|---|---|
| 0 | 0.085195 | 0.055496 | n000315175b67977 | era1 | train |
| 1 | 0.002619 | 0.687268 | n0014af834a96cdd | era1 | train |
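As an example of the ensembling use case, the sketch below averages all prediction columns row by row while keeping the auxiliary columns next to the result. It uses plain pandas on a toy frame; the simple mean is just one possible ensembling choice and the output column name `prediction` is an assumption for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b", "c"],
    "era": ["era1", "era1", "era2"],
    "prediction_1": [0.2, 0.4, 0.6],
    "prediction_2": [0.4, 0.8, 0.2],
})

# Select prediction columns by prefix before adding the ensemble column.
pred_cols = [c for c in df.columns if c.startswith("prediction")]

ensemble = df[["id", "era"]].copy()
ensemble["prediction"] = df[pred_cols].mean(axis=1)  # row-wise average
print(ensemble["prediction"].round(2).tolist())  # [0.3, 0.6, 0.4]
```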


In [None]:
num_dataf.meta

{'era_col': 'era'}

Because [`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) inherits from `pd.DataFrame` you still have all functionality of a normal DataFrame at your disposal, like copying.

In [None]:
dataf2 = num_dataf.copy()
assert dataf2.equals(num_dataf)

[`NumerFrame`](https://crowdcent.github.io/numerblox/numerframe.html#numerframe) dynamically tracks which feature, target, aux and prediction columns there are when initialized. For example, here we add a new prediction column. Upon initialization the column will be contained in `prediction_cols`. Prediction columns are all column names that start with `prediction`.

In [None]:
num_dataf.loc[:, "prediction_test_1"] = np.random.uniform(size=len(num_dataf))
new_dataset = NumerFrame(num_dataf)
assert "prediction_test_1" in new_dataset.prediction_cols
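Why a new object is needed can be illustrated with a toy class that scans its columns once, at construction time. `TrackedColumns` is a hypothetical stand-in written for this explanation and says nothing about how NumerBlox stores its column lists internally.

```python
class TrackedColumns:
    """Toy stand-in: column groups are computed once, at construction time."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.prediction_cols = [c for c in self.columns if c.startswith("prediction")]

frame = TrackedColumns(["era", "feature_A", "target"])
frame.columns.append("prediction_test_1")  # column added after initialization
print(frame.prediction_cols)  # []

refreshed = TrackedColumns(frame.columns)  # re-initialize to scan again
print(refreshed.prediction_cols)  # ['prediction_test_1']
```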

Arbitrary columns can be retrieved with `.get_column_selection`. The input argument can be either a string or a list of column names.

In [None]:
selection1 = num_dataf.get_column_selection("era")
selection1.head(2)

|   | era |
|---|---|
| 0 | era1 |
| 1 | era1 |


In [None]:
selection2 = num_dataf.get_column_selection(["era", "prediction_test_1"])
selection2.head(2)

|   | era | prediction_test_1 |
|---|---|---|
| 0 | era1 | 0.328714 |
| 1 | era1 | 0.216408 |


In [None]:
#| include: false
for sel in [selection1, selection2]:
    assert isinstance(sel, NumerFrame)

For convenience, `.get_feature_target_pair` returns a (features, target) pair in one call. With `multi_target=True`, all columns whose name starts with `target` are retrieved. With `multi_target=False`, as in this example, only the `target` column is returned.

In [None]:
features, single_target = num_dataf.get_feature_target_pair(multi_target=False)
features.head(2)

|   | feature_intelligence1 | feature_intelligence2 | feature_intelligence3 | feature_intelligence4 | feature_intelligence5 | feature_intelligence6 | feature_intelligence7 | feature_intelligence8 | feature_intelligence9 | feature_intelligence10 | ... | feature_wisdom37 | feature_wisdom38 | feature_wisdom39 | feature_wisdom40 | feature_wisdom41 | feature_wisdom42 | feature_wisdom43 | feature_wisdom44 | feature_wisdom45 | feature_wisdom46 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.5 | 0.25 | 0.0 | 0.5 | 0.25 | 0.25 | 0.25 | 0.75 | 0.75 | ... | 1.0 | 1.0 | 1.0 | 0.75 | 0.5 | 0.75 | 0.5 | 1.0 | 0.5 | 0.75 |
| 1 | 0.0 | 0.0 | 0.0 | 0.25 | 0.5 | 0.0 | 0.0 | 0.25 | 0.5 | 0.5 | ... | 0.75 | 1.0 | 1.0 | 0.0 | 0.0 | 0.75 | 0.25 | 0.0 | 0.25 | 1.0 |


In [None]:
single_target.head(2)

|   | target |
|---|---|
| 0 | 0.5 |
| 1 | 0.25 |


-----------------------------------------------