This tutorial introduces the core functionality of `NumerFrame` and shows how to build a custom TensorFlow data generator on top of it. The generator is a good example of how `NumerFrame` makes Numerai-specific applications easier to implement.

In [1]:
from numerblox.numerframe import create_numerframe, NumerFrame
from numerblox.download import NumeraiClassicDownloader

First, we download the live data using `NumeraiClassicDownloader`.

In [2]:
downloader = NumeraiClassicDownloader("numerframe_edu")
# Path variables
live_file = "v4.2/live_int8.parquet"
live_save_path = f"{str(downloader.dir)}/{live_file}"
# Download only the live parquet file
downloader.download_single_dataset(live_file,
                                   dest_path=live_save_path)

2023-09-13 13:32:03,530 INFO numerapi.utils: target file already exists
2023-09-13 13:32:03,532 INFO numerapi.utils: download complete


Loading data and initializing a `NumerFrame` takes one line of code. `create_numerframe` automatically recognizes the file format, such as `.csv` or `.parquet`.

In [3]:
# Initialize NumerFrame from parquet file path
dataf = create_numerframe(live_save_path)
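
Format recognition of this kind usually comes down to dispatching on the file suffix. The following is a minimal sketch of that idea, not the library's actual implementation: the `READERS` mapping and the `pick_reader` helper are hypothetical names introduced here for illustration, and the values are simply the names of the matching pandas reader functions.

```python
from pathlib import Path

# Hypothetical mapping from file suffix to the name of a pandas reader.
READERS = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".json": "read_json",
}


def pick_reader(path: str) -> str:
    """Return the name of the reader that matches the file suffix."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return READERS[suffix]


print(pick_reader("numerframe_edu/v4.2/live_int8.parquet"))
```

Raising on an unknown suffix keeps unsupported files from failing later with a less readable error.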

All features of Pandas DataFrames can still be used in a `NumerFrame`.

In [4]:
dataf.head(2)

id,era,data_type,feature_honoured_observational_balaamite,feature_polaroid_vadose_quinze,feature_untidy_withdrawn_bargeman,feature_genuine_kyphotic_trehala,feature_unenthralled_sportful_schoolhouse,feature_divulsive_explanatory_ideologue,feature_ichthyotic_roofed_yeshiva,feature_waggly_outlandish_carbonisation,...,target_bravo_v4_20,target_bravo_v4_60,target_charlie_v4_20,target_charlie_v4_60,target_delta_v4_20,target_delta_v4_60,target_echo_v4_20,target_echo_v4_60,target_jeremy_v4_20,target_jeremy_v4_60
n0005f1304ae2f69,X,live,3,4,0,4,3,2,3,0,...,,,,,,,,,,
n0006343014eb9bb,X,live,2,2,3,4,4,3,4,4,...,,,,,,,,,,


NumerFrame extends the Pandas DataFrame with convenient features for working with Numerai data.

For example, the `NumerFrame` groups columns and makes use of the fact that, for Numerai data, all feature column names start with `'feature'`, target columns start with `'target'`, etc. It also keeps track of the era column and parses it automatically for other parts of this library (`'era'` for Numerai Classic and `'date'` for Numerai Signals).
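
The prefix convention described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's code: `group_columns` is a hypothetical helper, and it assumes that every column without a `feature`, `target` or `prediction` prefix counts as auxiliary.

```python
from typing import Dict, List


def group_columns(columns: List[str]) -> Dict[str, List[str]]:
    """Group column names by their Numerai prefix."""
    groups: Dict[str, List[str]] = {
        "feature": [], "target": [], "prediction": [], "aux": []
    }
    for col in columns:
        for prefix in ("feature", "target", "prediction"):
            if col.startswith(prefix):
                groups[prefix].append(col)
                break
        else:
            # No known prefix matched, so treat the column as auxiliary.
            groups["aux"].append(col)
    return groups


cols = ["era", "data_type", "feature_polaroid_vadose_quinze",
        "target", "target_jeremy_v4_60"]
groups = group_columns(cols)
print(groups["target"])
print(groups["aux"])
```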

In [5]:
dataf.target_cols[-1]

'target_jeremy_v4_60'

In [6]:
dataf.get_single_target_data.head(2)

id,target
n0005f1304ae2f69,
n0006343014eb9bb,


In [7]:
dataf.feature_cols[0]

'feature_honoured_observational_balaamite'

In [8]:
dataf.get_feature_data.head(2)

id,feature_honoured_observational_balaamite,feature_polaroid_vadose_quinze,feature_untidy_withdrawn_bargeman,feature_genuine_kyphotic_trehala,feature_unenthralled_sportful_schoolhouse,feature_divulsive_explanatory_ideologue,feature_ichthyotic_roofed_yeshiva,feature_waggly_outlandish_carbonisation,feature_floriated_amish_sprite,feature_iconoclastic_parietal_agonist,...,feature_disclosed_mnemonic_ineffaceability,feature_suspended_intracranial_fischer,feature_shimmering_coverable_congolese,feature_biserial_fulfilled_harpoon,feature_pitiable_authoritative_clangor,feature_abdominal_subtriplicate_fin,feature_centenarian_ileac_caschrom,feature_expected_beatified_coparcenary,feature_unread_isopodan_ethic,feature_china_fistular_phenylketonuria
n0005f1304ae2f69,3,4,0,4,3,2,3,0,2,4,...,1,0,1,0,0,0,4,1,2,0
n0006343014eb9bb,2,2,3,4,4,3,4,4,2,3,...,3,1,4,3,0,0,2,1,2,1


In [9]:
dataf.prediction_cols

[]

`aux_cols` are all columns that are not a feature, target or prediction column.

In [10]:
dataf.aux_cols

['era', 'data_type']

In [11]:
dataf.get_aux_data.head(2)

id,era,data_type
n0005f1304ae2f69,X,live
n0006343014eb9bb,X,live


In [12]:
dataf.get_prediction_data.head(2)

n0005f1304ae2f69
n0006343014eb9bb


In [13]:
dataf.meta.era_col

'era'

A split of features and target(s) can be retrieved in 1 line of code.

In [14]:
X, y = dataf.get_feature_target_pair(multi_target=True)

In [15]:
X.head(2)

id,feature_honoured_observational_balaamite,feature_polaroid_vadose_quinze,feature_untidy_withdrawn_bargeman,feature_genuine_kyphotic_trehala,feature_unenthralled_sportful_schoolhouse,feature_divulsive_explanatory_ideologue,feature_ichthyotic_roofed_yeshiva,feature_waggly_outlandish_carbonisation,feature_floriated_amish_sprite,feature_iconoclastic_parietal_agonist,...,feature_disclosed_mnemonic_ineffaceability,feature_suspended_intracranial_fischer,feature_shimmering_coverable_congolese,feature_biserial_fulfilled_harpoon,feature_pitiable_authoritative_clangor,feature_abdominal_subtriplicate_fin,feature_centenarian_ileac_caschrom,feature_expected_beatified_coparcenary,feature_unread_isopodan_ethic,feature_china_fistular_phenylketonuria
n0005f1304ae2f69,3,4,0,4,3,2,3,0,2,4,...,1,0,1,0,0,0,4,1,2,0
n0006343014eb9bb,2,2,3,4,4,3,4,4,2,3,...,3,1,4,3,0,0,2,1,2,1


In [16]:
y.head(2)

id,target,target_nomi_v4_20,target_nomi_v4_60,target_tyler_v4_20,target_tyler_v4_60,target_victor_v4_20,target_victor_v4_60,target_ralph_v4_20,target_ralph_v4_60,target_waldo_v4_20,...,target_bravo_v4_20,target_bravo_v4_60,target_charlie_v4_20,target_charlie_v4_60,target_delta_v4_20,target_delta_v4_60,target_echo_v4_20,target_echo_v4_60,target_jeremy_v4_20,target_jeremy_v4_60
n0005f1304ae2f69,,,,,,,,,,,...,,,,,,,,,,
n0006343014eb9bb,,,,,,,,,,,...,,,,,,,,,,
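
The outputs above suggest how the `multi_target` flag changes the column selection: the single-target view holds only `target`, while the multi-target view holds every column that starts with `target`. A minimal sketch of that selection logic follows. `feature_target_columns` is a hypothetical helper written for this tutorial, and the single-target fallback to a column literally named `target` is an assumption based on the outputs shown here.

```python
from typing import List, Tuple


def feature_target_columns(
    columns: List[str], multi_target: bool = False
) -> Tuple[List[str], List[str]]:
    """Pick feature and target column names by prefix."""
    features = [c for c in columns if c.startswith("feature")]
    if multi_target:
        # All target columns, including auxiliary targets.
        targets = [c for c in columns if c.startswith("target")]
    else:
        # Assumption: the main target column is named 'target'.
        targets = ["target"]
    return features, targets


columns = ["era", "data_type", "feature_polaroid_vadose_quinze",
           "target", "target_nomi_v4_20", "target_jeremy_v4_60"]
x_cols, y_cols = feature_target_columns(columns, multi_target=True)
print(x_cols)
print(y_cols)
```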


### Custom TensorFlow generator

As a practical example, we will build a TensorFlow data generator that returns a batch containing the features and targets for one or more eras. Feature and target columns are found automatically if not specified.

In [17]:
import numpy as np
import tensorflow as tf
from typing import Tuple

class TFGenerator(tf.keras.utils.Sequence):
    """
    TensorFlow generator for Numerai era batches.
    :param dataf: A NumerFrame
    :param features: Features to select. All by default.
    :param targets: Targets to select. All by default.
    :param n_eras_batch: How many eras per batch.
    :param shuffle: Whether to shuffle eras after each epoch.
    """
    def __init__(self, dataf: NumerFrame,
                 features: list = None,
                 targets: list = None,
                 n_eras_batch=1,
                 shuffle=True):
        self.dataf = dataf
        self.features, self.targets = features, targets
        self.n_eras_batch = n_eras_batch
        self.shuffle = shuffle
        self.eras = dataf[dataf.meta.era_col].unique()
        self.on_epoch_end()

    def __len__(self):
        return int(np.ceil(len(self.eras) / self.n_eras_batch))

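    def on_epoch_end(self):
        # Reshuffle the era order in place between epochs
        if self.shuffle:
            np.random.shuffle(self.eras)

    def __getitem__(self, idx: int) -> Tuple[tf.Tensor, tf.Tensor]:
        # Batch idx covers a non-overlapping slice of n_eras_batch eras
        start = idx * self.n_eras_batch
        eras = self.eras[start:start + self.n_eras_batch]
        # Use the NumerFrame stored on the instance, not a global variable
        return self.dataf.get_era_batch(eras=eras,
                                        targets=self.targets,
                                        features=self.features,
                                        convert_to_tf=True
                                        )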



Note that `NumerFrame` has its own method for getting batches of eras, `get_era_batch`, so the core of a new generator takes a single call. If features and targets are not specified, it automatically retrieves all features and targets in the `NumerFrame`.
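
The batch arithmetic of such a generator can be checked without TensorFlow. The sketch below assumes that batch `idx` should cover the eras from `idx * n_eras_batch` up to, but not including, `(idx + 1) * n_eras_batch`, so that batches do not overlap. `era_batches` is a hypothetical helper and the era labels are made up for illustration.

```python
import math
from typing import List, Sequence


def era_batches(eras: Sequence[str], n_eras_batch: int) -> List[List[str]]:
    """Split eras into consecutive, non-overlapping batches."""
    n_batches = math.ceil(len(eras) / n_eras_batch)
    return [
        list(eras[i * n_eras_batch:(i + 1) * n_eras_batch])
        for i in range(n_batches)
    ]


# Five eras in batches of two: the last batch holds the leftover era.
print(era_batches(["0001", "0002", "0003", "0004", "0005"], 2))
# A frame with a single era yields exactly one batch.
print(era_batches(["X"], 2))
```

Rounding the batch count up means the final batch may contain fewer eras than `n_eras_batch`, but no era is dropped.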

We can now use this generator for TensorFlow model training, evaluation and inference. This example requests batches of 2 eras each and selects only the main `target` column.

In [18]:
gen = TFGenerator(dataf=dataf, n_eras_batch=2, targets=['target'])
print(f"Total era batches: {len(gen)}")
X, y = gen[0]

Total era batches: 1




In [19]:
X

<tf.Tensor: shape=(4885, 2132), dtype=int8, numpy=
array([[3, 4, 0, ..., 1, 2, 0],
       [2, 2, 3, ..., 1, 2, 1],
       [0, 4, 3, ..., 1, 0, 3],
       ...,
       [4, 1, 3, ..., 1, 3, 3],
       [4, 3, 4, ..., 1, 1, 3],
       [2, 4, 4, ..., 0, 0, 0]], dtype=int8)>

Because we are looking at live data, the targets are all NaN in this case.

In [20]:
y

<tf.Tensor: shape=(4885, 1), dtype=float64, numpy=
array([[nan],
       [nan],
       [nan],
       ...,
       [nan],
       [nan],
       [nan]])>

In this tutorial we created a flexible TensorFlow generator that leverages convenient properties of `NumerFrame`. `NumerFrame` automatically recognizes feature, target and prediction columns, provides tested code for splitting data, implements era batching and makes sure the era column is handled correctly.

------------------------------------------------------

After we are done we can easily clean up our downloaded data with one line of code called from the downloader.

In [21]:
# Clean up environment
downloader.remove_base_directory()