# `ProteinWorkshop` Tutorial, Part 2 - Customizing an Existing Dataset
![Datasets](../docs/source/_static/box_datasets.png)

## Repurpose an existing dataset within the `ProteinWorkshop`

In [None]:
%load_ext autoreload
%autoreload 2
# %load_ext blackcellmagic

Welcome back to the `ProteinWorkshop` tutorial series!

To reuse the data components available in the `ProteinWorkshop`, follow this 5-step procedure:

1. Use an existing dataset as either a pre-training or fine-tuning corpus, with or without full-atom context
2. Load the requested dataset using the designed config
3. Either pre-train or fine-tune a model using the selected dataset
4. Reconfigure the selected dataset to use e.g., side-chain atom context
5. Verify that e.g., side-chain torsions are now available as feature inputs

### 1. Use an existing dataset as either a pre-training or fine-tuning corpus, with or without full-atom context

To use a different dataset, replace the value of `dataset` in `overrides` with any other available option:

`cfg = hydra.compose("template", overrides=["encoder=schnet", "task=inverse_folding", "dataset=cath", "features=ca_angles", "+aux_task=none"], return_hydra_config=True)`
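Because each override is just a `group=option` string, the list can also be assembled programmatically, which makes swapping a single group less error-prone. The following is a minimal sketch in plain Python; `build_overrides` is a hypothetical helper, not part of `ProteinWorkshop` or Hydra.

```python
def build_overrides(appended=(), **choices):
    """Turn config-group choices into Hydra-style override strings.

    Groups listed in `appended` receive a leading `+`, the prefix used
    in this tutorial for `aux_task`.
    NOTE: hypothetical helper for illustration only.
    """
    return [
        f"+{group}={option}" if group in appended else f"{group}={option}"
        for group, option in choices.items()
    ]


overrides = build_overrides(
    appended={"aux_task"},
    encoder="schnet",
    task="inverse_folding",
    dataset="cath",  # replace this value to switch datasets
    features="ca_angles",
    aux_task="none",
)
print(overrides)
```

The resulting list can be passed as the `overrides` argument of `hydra.compose`.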

In [None]:
# Misc. tools
import os

# Hydra tools
import hydra

from hydra.core.global_hydra import GlobalHydra
from hydra.core.hydra_config import HydraConfig

from proteinworkshop.constants import HYDRA_CONFIG_PATH
from proteinworkshop.utils.notebook import init_hydra_singleton

version_base = "1.2"  # Note: Need to update whenever Hydra is upgraded
init_hydra_singleton(reload=True, version_base=version_base)

path = HYDRA_CONFIG_PATH
rel_path = os.path.relpath(path, start=".")

GlobalHydra.instance().clear()
hydra.initialize(rel_path, version_base=version_base)

cfg = hydra.compose(
    config_name="template",
    overrides=[
        "encoder=schnet",
        "task=inverse_folding",
        "dataset=cath",
        "features=ca_angles",
        "+aux_task=none",
    ],
    return_hydra_config=True,
)

# Note: Customize as needed e.g., when running a sweep
cfg.hydra.job.num = 0
cfg.hydra.job.id = 0
cfg.hydra.hydra_help.hydra_help = False
cfg.hydra.runtime.output_dir = "outputs"

HydraConfig.instance().set_config(cfg)

### 2. Load the requested dataset using the designed config

In [None]:
from proteinworkshop.configs import config

cfg = config.validate_config(cfg)

datamodule = hydra.utils.instantiate(cfg.dataset.datamodule)
datamodule.setup("fit")
dl = datamodule.train_dataloader()

for i in dl:
    print(i)
    break
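Printing a whole batch can be hard to read, so a compact per-field summary may be more useful. The sketch below is plain Python and works on any mapping of field names to values; `summarize` is a hypothetical helper, and the stand-in field names and shapes are invented for illustration rather than taken from the dataset.

```python
from types import SimpleNamespace


def summarize(fields):
    """Map each field name to its shape (if it has one) or its type name."""
    summary = {}
    for name, value in fields.items():
        shape = getattr(value, "shape", None)
        summary[name] = tuple(shape) if shape is not None else type(value).__name__
    return summary


# Stand-in for a real batch: names and shapes are made up for this example.
fake_batch = {
    "coords": SimpleNamespace(shape=(128, 3)),
    "id": ["protein_a", "protein_b"],
}

for name, description in summarize(fake_batch).items():
    print(f"{name}: {description}")
```

With a real batch, and assuming it is a PyTorch Geometric `Data`/`Batch` object, something like `summarize(i.to_dict())` should give the same kind of overview.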

### 3. Either pre-train or fine-tune a model using the selected dataset

In [None]:
from proteinworkshop.finetune import finetune
from proteinworkshop.train import train_model

# train_model(cfg)  # Pre-train a model using the selected data
# finetune(cfg)  # Fine-tune a model using the selected data

### 4. Reconfigure the selected dataset to use e.g., side-chain atom context

In [None]:
version_base = "1.2"  # Note: Need to update whenever Hydra is upgraded
init_hydra_singleton(reload=True, version_base=version_base)

path = HYDRA_CONFIG_PATH
rel_path = os.path.relpath(path, start=".")

GlobalHydra.instance().clear()
hydra.initialize(rel_path, version_base=version_base)

cfg = hydra.compose(
    config_name="template",
    overrides=[
        "encoder=schnet",
        "task=inverse_folding",
        "dataset=cath",
        "features=ca_sc",
        "+aux_task=none",
    ],
    return_hydra_config=True,
)

# Note: Customize as needed e.g., when running a sweep
cfg.hydra.job.num = 0
cfg.hydra.job.id = 0
cfg.hydra.hydra_help.hydra_help = False
cfg.hydra.runtime.output_dir = "outputs"

HydraConfig.instance().set_config(cfg)
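The only difference between this configuration and the one composed in step 1 is the `features` override. A quick way to confirm that is to compare the two override lists; the sketch below does so in plain Python, with `parse_overrides` and `changed_groups` being hypothetical helpers written for this example.

```python
def parse_overrides(overrides):
    """Split 'group=option' strings into a dict, ignoring a leading '+'."""
    parsed = {}
    for item in overrides:
        group, option = item.lstrip("+").split("=", 1)
        parsed[group] = option
    return parsed


def changed_groups(before, after):
    """Return {group: (old_option, new_option)} for every group that differs."""
    old, new = parse_overrides(before), parse_overrides(after)
    return {
        group: (old.get(group), new.get(group))
        for group in sorted(old.keys() | new.keys())
        if old.get(group) != new.get(group)
    }


step_1 = [
    "encoder=schnet",
    "task=inverse_folding",
    "dataset=cath",
    "features=ca_angles",
    "+aux_task=none",
]
step_4 = [
    "encoder=schnet",
    "task=inverse_folding",
    "dataset=cath",
    "features=ca_sc",
    "+aux_task=none",
]

print(changed_groups(step_1, step_4))  # {'features': ('ca_angles', 'ca_sc')}
```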

### 5. Verify that e.g., side-chain torsions are now available as feature inputs

In [None]:
from proteinworkshop.configs import config

cfg = config.validate_config(cfg)

datamodule = hydra.utils.instantiate(cfg.dataset.datamodule)
datamodule.setup("fit")
dl = datamodule.train_dataloader()

for i in dl:
    print(i)
    break
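Rather than comparing two printed batches by eye, one can diff their field names to see what the new feature config added. This is a minimal sketch in plain Python; `new_fields` is a hypothetical helper, and the field names below are placeholders, not the actual attribute names produced by `ProteinWorkshop`.

```python
def new_fields(before, after):
    """Return field names present in `after` but not in `before`, sorted."""
    return sorted(set(after) - set(before))


# Placeholder field names (assumptions for illustration only).
ca_angles_fields = ["coords", "x", "edge_index"]
ca_sc_fields = ["coords", "x", "edge_index", "sidechain_torsion"]

print(new_fields(ca_angles_fields, ca_sc_fields))
```

In practice, the two lists would come from a batch drawn in step 2 and a batch drawn here, for example from the keys of each batch's dictionary representation.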

### Wrapping up

Have any additional questions about using the existing data components provided in the `ProteinWorkshop`? [Create a new issue](https://github.com/a-r-j/ProteinWorkshop/issues/new/choose) on our [GitHub repository](https://github.com/a-r-j/ProteinWorkshop). We would be happy to work with you to leverage the full power of the repository!