# `ProteinWorkshop` Tutorial, Part 2 - Customizing an Existing Dataset
![Datasets](../docs/source/_static/box_datasets.png)

## Repurpose an existing dataset within the `ProteinWorkshop`

In [1]:
%load_ext autoreload
%autoreload 2
# %load_ext blackcellmagic

Welcome back to the `ProteinWorkshop` tutorial series!

To reuse existing data components available in the `ProteinWorkshop`, you can follow the following 5-step procedure:

1. Use an existing dataset as either a pre-training or fine-tuning corpus, with or without full-atom context
2. Load the requested dataset using the designed config
3. Either pre-train or fine-tune a model using the selected dataset
4. Reconfigure the selected dataset to use e.g., side-chain atom context
5. Verify that e.g., side-chain torsions are now available as feature inputs

### 1. Use an existing dataset as either a pre-training or fine-tuning corpus, with or without full-atom context

One can switch out the dataset for any other available option by replacing the value of `dataset` in `overrides`:

`cfg = hydra.compose("template", overrides=["encoder=schnet", "task=inverse_folding", "dataset=cath", "features=ca_angles", "+aux_task=none"], return_hydra_config=True)`

In [2]:
# Misc. tools
import os

# Hydra tools
import hydra

from hydra.compose import GlobalHydra
from hydra.core.hydra_config import HydraConfig

from proteinworkshop.constants import HYDRA_CONFIG_PATH
from proteinworkshop.utils.notebook import init_hydra_singleton

version_base = "1.2"  # Note: Need to update whenever Hydra is upgraded
init_hydra_singleton(reload=True, version_base=version_base)

path = HYDRA_CONFIG_PATH
rel_path = os.path.relpath(path, start=".")

GlobalHydra.instance().clear()
hydra.initialize(rel_path, version_base=version_base)

cfg = hydra.compose(
    config_name="train",
    overrides=[
        "encoder=schnet",
        "task=inverse_folding",
        "dataset=cath",
        "features=ca_angles",
        "+aux_task=none",
    ],
    return_hydra_config=True,
)

# Note: Customize as needed e.g., when running a sweep
cfg.hydra.job.num = 0
cfg.hydra.job.id = 0
cfg.hydra.hydra_help.hydra_help = False
cfg.hydra.runtime.output_dir = "outputs"

HydraConfig.instance().set_config(cfg)

### 2. Load the requested dataset using the designed config

In [3]:
from proteinworkshop.configs import config

cfg = config.validate_config(cfg)

datamodule = hydra.utils.instantiate(cfg.dataset.datamodule)
datamodule.setup("fit")
dl = datamodule.train_dataloader()

for i in dl:
    print(i)
    break


PEP 484 type hint typing.Iterable[typing.Callable] deprecated by PEP 585. This hint is scheduled for removal in the first Python version released after October 5th, 2025. To resolve this, import this hint from "beartype.typing" rather than "typing". For further commentary and alternatives, see also:
    https://beartype.readthedocs.io/en/latest/api_roar/#pep-585-deprecations




You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.

100%|██████████| 18006/18006 [00:11<00:00, 1553.87it/s]


100%|██████████| 607/607 [00:00<00:00, 3165.19it/s]


DataProteinBatch(fill_value=1e-05, atom_list=[37], coords=[6364, 37, 3], residues=[32], id=[32], residue_id=[32], residue_type=[6364], chains=[6364], x=[6364], amino_acid_one_hot=[6364, 23], seq_pos=[6364, 1], batch=[6364], ptr=[33])


### 3. Either pre-train or fine-tune a model using the selected dataset

In [8]:
from proteinworkshop.finetune import finetune
from proteinworkshop.train import train_model

train_model(cfg)  # Pre-train a model using the selected data
# finetune(cfg)  # Fine-tune a model using the selected data


PEP 484 type hint typing.Tuple[torch.Tensor, torch.Tensor] deprecated by PEP 585. This hint is scheduled for removal in the first Python version released after October 5th, 2025. To resolve this, import this hint from "beartype.typing" rather than "typing". For further commentary and alternatives, see also:
    https://beartype.readthedocs.io/en/latest/api_roar/#pep-585-deprecations


PEP 484 type hint typing.Callable deprecated by PEP 585. This hint is scheduled for removal in the first Python version released after October 5th, 2025. To resolve this, import this hint from "beartype.typing" rather than "typing". For further commentary and alternatives, see also:
    https://beartype.readthedocs.io/en/latest/api_roar/#pep-585-deprecations


PEP 484 type hint typing.Dict[str, torch.Tensor] deprecated by PEP 585. This hint is scheduled for removal in the first Python version released after October 5th, 2025. To resolve this, import this hint from "beartype.typing" rather than "typing". 

InstantiationException: Error in call to target 'lightning.pytorch.trainer.trainer.Trainer':
MisconfigurationException('`Trainer(devices=0)` value is not a valid input using mps accelerator.')
full_key: trainer

### 4. Reconfigure the selected dataset to use e.g., side-chain atom context

In [6]:
version_base = "1.2"  # Note: Need to update whenever Hydra is upgraded
init_hydra_singleton(reload=True, version_base=version_base)

path = HYDRA_CONFIG_PATH
rel_path = os.path.relpath(path, start=".")

GlobalHydra.instance().clear()
hydra.initialize(rel_path, version_base=version_base)

cfg = hydra.compose(
    config_name="train",
    overrides=[
        "encoder=schnet",
        "task=sequence_denoising",
        "dataset=cath",
        "features=ca_sc",
        "+aux_task=none",
    ],
    return_hydra_config=True,
)

# Note: Customize as needed e.g., when running a sweep
cfg.hydra.job.num = 0
cfg.hydra.job.id = 0
cfg.hydra.hydra_help.hydra_help = False
cfg.hydra.runtime.output_dir = "outputs"

HydraConfig.instance().set_config(cfg)

### 5. Verify that e.g., side-chain torsions are now available as feature inputs

In [7]:
from proteinworkshop.configs import config

cfg = config.validate_config(cfg)

datamodule = hydra.utils.instantiate(cfg)
datamodule.setup("fit")
dl = datamodule.train_dataloader()

for i in dl:
    print(i)
    break

ReadonlyConfigError: Cannot set value of read-only config node

### 6. Wrapping up

Have any additional questions about using the existing data components provided in the `ProteinWorkshop`? [Create a new issue](https://github.com/a-r-j/ProteinWorkshop/issues/new/choose) on our [GitHub repository](https://github.com/a-r-j/ProteinWorkshop). We would be happy to work with you to leverage the full power of the repository!