# Tune Dataset

`TuneDataset` contains the search space and all related dataframes, with metadata, for a tuning task.

`TuneDataset` should not be constructed by users directly. Instead, use `TuneDatasetBuilder` or the factory method to construct a `TuneDataset`.

In [1]:
from fugue_notebook import setup

setup(is_lab=True)

import pandas as pd
from tune import TUNE_OBJECT_FACTORY, TuneDatasetBuilder, Space, Grid
from fugue import FugueWorkflow

`TUNE_OBJECT_FACTORY.make_dataset` is a wrapper around `TuneDatasetBuilder` that makes dataset construction even easier, but `TuneDatasetBuilder` remains the more flexible option. For example, it can add multiple dataframes with different join types, while `TUNE_OBJECT_FACTORY.make_dataset` can add at most two dataframes (normally the train and validation dataframes).

In [2]:
with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(Space(a=1, b=2))
    dataset = builder.build(dag)
    dataset.data.show();

with FugueWorkflow() as dag:
    dataset = TUNE_OBJECT_FACTORY.make_dataset(dag, Space(a=1, b=2))
    dataset.data.show();

Unnamed: 0,__tune_trials__
0,"[{""trial_id"": ""df8d686f-374b-509d-b8af-3a83899..."


Unnamed: 0,__tune_trials__
0,"[{""trial_id"": ""df8d686f-374b-509d-b8af-3a83899..."


Here are two equivalent ways to construct a `TuneDataset` with a space and two dataframes.

In `TuneDataset`, every dataframe is partitioned by certain keys, and each partition is saved to a temporary parquet file, so a temp path must be specified. With the factory, you can call `set_temp_path` once so that you no longer need to provide the temp path on every call. If you still provide a path, that path is used.
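
To make the partition-and-save step concrete, here is a minimal pandas-only sketch of the idea. It is not how `tune` implements it: the helper `partition_to_files` is hypothetical, and pickle files stand in for parquet so the sketch needs no extra dependency. The path column name mirrors the `__tune_df__` prefix visible in the outputs of this notebook.

```python
import os
import tempfile

import pandas as pd

pdf1 = pd.DataFrame([[0, 1], [1, 1], [0, 2]], columns=["a", "b"])


def partition_to_files(df: pd.DataFrame, keys: list, folder: str, name: str) -> pd.DataFrame:
    """Hypothetical helper: save each partition to a file, return one row per partition."""
    rows = []
    for i, (key, chunk) in enumerate(df.groupby(keys)):
        path = os.path.join(folder, f"{name}_{i}.pkl")
        chunk.to_pickle(path)  # stand-in for writing a parquet file
        # groupby yields a scalar or a tuple depending on the pandas version
        key_values = key if isinstance(key, tuple) else (key,)
        row = dict(zip(keys, key_values))
        row[f"__tune_df__{name}"] = path
        rows.append(row)
    return pd.DataFrame(rows)


with tempfile.TemporaryDirectory() as tmp:
    # The data chunks are now represented by file paths, one row per key.
    index = partition_to_files(pdf1, ["a"], tmp, "df1")
    print(index)
    # A worker can load its chunk back from the path when it needs the data.
    first_path = index.loc[index["a"] == 0, "__tune_df__df1"].iloc[0]
    restored = pd.read_pickle(first_path)
    n_restored = len(restored)
```

The resulting table is small regardless of how large the partitions are, which is why the join between dataframes can happen on paths rather than on the data itself.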

In [30]:
pdf1 = pd.DataFrame([[0,1],[1,1],[0,2]], columns = ["a", "b"])
pdf2 = pd.DataFrame([[0,0.5],[2,0.1],[0,0.1],[1,0.3]], columns = ["a", "c"])
space = Space(a=1, b=Grid(1,2,3))

with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(space, path="/tmp")
    # Here pdf1 and pdf2 must be converted to FugueWorkflowDataFrame, and
    # both must be partitioned by the same keys. Each partition is saved
    # to a temp parquet file, and the chunks of data are replaced by file
    # paths before the join.
    builder.add_df("df1", dag.df(pdf1).partition_by("a"))
    builder.add_df("df2", dag.df(pdf2).partition_by("a"), how="inner")
    dataset = builder.build(dag)
    dataset.data.show();


TUNE_OBJECT_FACTORY.set_temp_path("/tmp")

with FugueWorkflow() as dag:
    # This method is significantly simpler. As long as you don't have more
    # than 2 dataframes for a tuning task, use this.
    dataset = TUNE_OBJECT_FACTORY.make_dataset(
        dag, space,
        df_name="df1", df=pdf1,
        test_df_name="df2", test_df=pdf2,
        partition_keys=["a"],
    )
    dataset.data.show();

Unnamed: 0,a,__tune_df__df1,__tune_df__df2,__tune_trials__
0,0,/tmp/4d8e91ca-0d0c-46f5-965f-e72cc672b13e.parquet,/tmp/24d12a86-313d-472e-ac40-7fb743d6f25c.parquet,"[{""trial_id"": ""35e1bdd1-424e-532d-b788-09fbf54..."
1,0,/tmp/4d8e91ca-0d0c-46f5-965f-e72cc672b13e.parquet,/tmp/24d12a86-313d-472e-ac40-7fb743d6f25c.parquet,"[{""trial_id"": ""26eba2e7-a331-531a-8576-db1a2c6..."
2,0,/tmp/4d8e91ca-0d0c-46f5-965f-e72cc672b13e.parquet,/tmp/24d12a86-313d-472e-ac40-7fb743d6f25c.parquet,"[{""trial_id"": ""806b49f1-c1fc-5023-8b81-835dd8a..."
3,1,/tmp/8f597ad5-f96e-4d2b-a27f-428c56508cfc.parquet,/tmp/2a305b16-cc8f-47c8-a11f-79196624bc88.parquet,"[{""trial_id"": ""bb5aa50f-913b-501d-8158-9afc1f2..."
4,1,/tmp/8f597ad5-f96e-4d2b-a27f-428c56508cfc.parquet,/tmp/2a305b16-cc8f-47c8-a11f-79196624bc88.parquet,"[{""trial_id"": ""20ab107e-8c69-51d6-8b2f-6a5466d..."
5,1,/tmp/8f597ad5-f96e-4d2b-a27f-428c56508cfc.parquet,/tmp/2a305b16-cc8f-47c8-a11f-79196624bc88.parquet,"[{""trial_id"": ""66664ecc-14ad-5d02-a273-ef53d4d..."


Unnamed: 0,a,__tune_df__df1,__tune_df__df2,__tune_trials__
0,0,/tmp/502c5c14-31e3-482e-90e7-ec9db27486d1.parquet,/tmp/e74ed78a-333b-4166-afb0-84c6c021d8f2.parquet,"[{""trial_id"": ""35e1bdd1-424e-532d-b788-09fbf54..."
1,0,/tmp/502c5c14-31e3-482e-90e7-ec9db27486d1.parquet,/tmp/e74ed78a-333b-4166-afb0-84c6c021d8f2.parquet,"[{""trial_id"": ""26eba2e7-a331-531a-8576-db1a2c6..."
2,0,/tmp/502c5c14-31e3-482e-90e7-ec9db27486d1.parquet,/tmp/e74ed78a-333b-4166-afb0-84c6c021d8f2.parquet,"[{""trial_id"": ""806b49f1-c1fc-5023-8b81-835dd8a..."
3,1,/tmp/9ac2387f-b99b-4c0f-9797-397f393600f4.parquet,/tmp/e80978e6-a50b-4835-9cc1-e2863b00bd44.parquet,"[{""trial_id"": ""bb5aa50f-913b-501d-8158-9afc1f2..."
4,1,/tmp/9ac2387f-b99b-4c0f-9797-397f393600f4.parquet,/tmp/e80978e6-a50b-4835-9cc1-e2863b00bd44.parquet,"[{""trial_id"": ""20ab107e-8c69-51d6-8b2f-6a5466d..."
5,1,/tmp/9ac2387f-b99b-4c0f-9797-397f393600f4.parquet,/tmp/e80978e6-a50b-4835-9cc1-e2863b00bd44.parquet,"[{""trial_id"": ""66664ecc-14ad-5d02-a273-ef53d4d..."


We get 6 rows. The space contains 3 configurations, and because the dataframes are partitioned by `a` and inner joined, there are 2 partitions (`a=0` and `a=1`). Each configuration is paired with each partition, so the `TuneDataset` has 3 × 2 = 6 rows.
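
The row count can be checked by hand. The sketch below uses only the standard library and pandas, and does not reflect `tune`'s internals: the dictionary `grid` is an assumed stand-in for `Space(a=1, b=Grid(1,2,3))`, expanded with `itertools.product`, and the partition keys are found by inner-joining the distinct values of `a`.

```python
from itertools import product

import pandas as pd

pdf1 = pd.DataFrame([[0, 1], [1, 1], [0, 2]], columns=["a", "b"])
pdf2 = pd.DataFrame([[0, 0.5], [2, 0.1], [0, 0.1], [1, 0.3]], columns=["a", "c"])

# Stand-in for the space: constants become one-element lists.
grid = {"a": [1], "b": [1, 2, 3]}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

# One row per distinct key in each dataframe, then an inner join on the key.
keys1 = pdf1[["a"]].drop_duplicates()
keys2 = pdf2[["a"]].drop_duplicates()
keys = keys1.merge(keys2, on="a", how="inner")

n_rows = len(configs) * len(keys)
print(len(configs), len(keys), n_rows)  # 3 2 6
```

The key `a=2` exists only in `pdf2`, so the inner join drops it and two partitions remain.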

**Note that the number of rows in the `TuneDataset` determines the maximum parallelism.** In this case, if you assign 10 workers, 4 will always be idle.
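
The idle-worker figure is simple arithmetic, sketched below under the assumption that each row is handled by at most one worker at a time. The function name `idle_workers` is illustrative and not part of `tune`.

```python
def idle_workers(n_workers: int, n_rows: int) -> int:
    """Workers with nothing to do when each dataset row is one unit of work."""
    return max(0, n_workers - n_rows)


print(idle_workers(10, 6))  # 4
print(idle_workers(4, 6))   # 0
```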

A more common case is that the dataframes are not partitioned at all. With `TUNE_OBJECT_FACTORY.make_dataset`, we just need to omit `partition_keys`.

In [31]:
with FugueWorkflow() as dag:
    dataset = TUNE_OBJECT_FACTORY.make_dataset(
        dag, space,
        df_name="df1", df=pdf1,
        test_df_name="df2", test_df=pdf2,
    )
    dataset.data.show();

Unnamed: 0,__tune_df__df1,__tune_df__df2,__tune_trials__
0,/tmp/1de411ba-c485-41b7-88a2-d98b7a81d4ec.parquet,/tmp/58541ce1-115c-40b0-988c-faa5b735cb32.parquet,"[{""trial_id"": ""94bc461d-9632-5f2d-bc9c-eeacc47..."
1,/tmp/1de411ba-c485-41b7-88a2-d98b7a81d4ec.parquet,/tmp/58541ce1-115c-40b0-988c-faa5b735cb32.parquet,"[{""trial_id"": ""dcf70308-3959-5ae6-8c4d-d10bc8a..."
2,/tmp/1de411ba-c485-41b7-88a2-d98b7a81d4ec.parquet,/tmp/58541ce1-115c-40b0-988c-faa5b735cb32.parquet,"[{""trial_id"": ""df8d686f-374b-509d-b8af-3a83899..."


But what if we want to partition `df1` but not `df2`? Then, again, you can use `TuneDatasetBuilder`.

In [32]:
with FugueWorkflow() as dag:
    builder = TuneDatasetBuilder(space, path="/tmp")
    builder.add_df("df1", dag.df(pdf1).partition_by("a"))
    # use a cross join because there is no common key
    builder.add_df("df2", dag.df(pdf2), how="cross")
    dataset = builder.build(dag)
    dataset.data.show();

Unnamed: 0,a,__tune_df__df1,__tune_df__df2,__tune_trials__
0,0,/tmp/dfd4c42f-9b81-4f56-884f-e3ea00f2b977.parquet,/tmp/f33767e9-a48a-4245-b15c-c4e0ec2a367d.parquet,"[{""trial_id"": ""35e1bdd1-424e-532d-b788-09fbf54..."
1,0,/tmp/dfd4c42f-9b81-4f56-884f-e3ea00f2b977.parquet,/tmp/f33767e9-a48a-4245-b15c-c4e0ec2a367d.parquet,"[{""trial_id"": ""26eba2e7-a331-531a-8576-db1a2c6..."
2,0,/tmp/dfd4c42f-9b81-4f56-884f-e3ea00f2b977.parquet,/tmp/f33767e9-a48a-4245-b15c-c4e0ec2a367d.parquet,"[{""trial_id"": ""806b49f1-c1fc-5023-8b81-835dd8a..."
3,1,/tmp/253b8b53-1c20-4c8c-ba36-4a2a8ef926a2.parquet,/tmp/f33767e9-a48a-4245-b15c-c4e0ec2a367d.parquet,"[{""trial_id"": ""bb5aa50f-913b-501d-8158-9afc1f2..."
4,1,/tmp/253b8b53-1c20-4c8c-ba36-4a2a8ef926a2.parquet,/tmp/f33767e9-a48a-4245-b15c-c4e0ec2a367d.parquet,"[{""trial_id"": ""20ab107e-8c69-51d6-8b2f-6a5466d..."
5,1,/tmp/253b8b53-1c20-4c8c-ba36-4a2a8ef926a2.parquet,/tmp/f33767e9-a48a-4245-b15c-c4e0ec2a367d.parquet,"[{""trial_id"": ""66664ecc-14ad-5d02-a273-ef53d4d..."
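
In the output above, the two `df1` paths vary with `a` while the single `df2` path repeats on every row. A pandas-only sketch of why a cross join gives that shape is shown below. The two small tables and their file names are placeholders for the path tables that exist after partitioning, not values produced by `tune`.

```python
import pandas as pd

# df1 was partitioned by "a": one path per key (placeholder file names).
df1_index = pd.DataFrame(
    {"a": [0, 1], "__tune_df__df1": ["df1_a0.parquet", "df1_a1.parquet"]}
)
# df2 was not partitioned: a single path and no key column to join on.
df2_index = pd.DataFrame({"__tune_df__df2": ["df2_all.parquet"]})

# With no common key, every df1 partition is paired with the whole of df2.
joined = df1_index.merge(df2_index, how="cross")
print(joined)
```

The joined table has 2 rows, and pairing each of them with the 3 configurations of the space gives the 6 rows shown above.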
