### Quick start: run an RNA task easily

First, generate the necessary index files with the following command:
```
$ rnaglib_index
```

Choose the task appropriate to your model. Here, for illustration, we use _RNA-Site_ (ligand binding site detection), instantiated below through the `BindingSiteDetection` class.
Custom splitters or other arguments can be passed when instantiating the task if needed.

In [None]:
from rnaglib.tasks import BindingSiteDetection
from rnaglib.transforms import FeaturesComputer, GraphRepresentation

task = BindingSiteDetection(root='tutorial') # You can pass arguments to use a custom splitter or dataset etc. if desired.

Choose your features and your targets (here, the feature is the nucleotide code and the target is the binding site)

In [2]:
from rnaglib.encoders import BoolEncoder
from rnaglib.transforms import FeaturesComputer
task.dataset.features_computer = FeaturesComputer(nt_features=['nt_code'], nt_targets='binding_site', custom_encoders={'binding_site': BoolEncoder()})

Choose your representation (here, PyTorch Geometric graphs)

In [3]:
task.dataset.add_representation(GraphRepresentation(framework='pyg'))

>>> Adding <rnaglib.transforms.represent.graph.GraphRepresentation object at 0x177197510> representations.


Set train, validation and test loaders

In [4]:
task.set_loaders()

Define your model and training (here, a dummy model)

In [5]:
model = task.dummy_model

Evaluate the model on the defined task

In [6]:
final_result = task.evaluate(model)
print(final_result)

{'accuracy': 0.4894837476099426, 'mcc': -0.015326770874804748, 'f1': 0.47337278106508873}


### Getting familiar with RNAglib objects

#### RNADataset

An RNADataset object represents a set of RNAs, each described by its 3D structure.

Each item of an RNADataset is a dictionary that contains, under the key "rna", the networkx Graph representing the RNA.

It is also possible to add Representation and FeaturesComputer objects to an RNADataset.

To create a default RNADataset, run the code below:

In [48]:
from rnaglib.data_loading import RNADataset

dataset = RNADataset()

Database was found and not overwritten




#### Transform

The Transform class groups all the functions that map the dictionaries representing RNAs (i.e. the items of an RNADataset object) to other objects (other dictionaries or objects of a different nature).

A specific tutorial gives further details about this class: https://rnaglib.org/en/latest/rnaglib.transforms.html

Some subclasses of Transform are described below: Representation, FeaturesComputer, FilterTransform, AnnotationTransform, PartitionTransform, Compose and ComposeFilters.

##### Representation

A Representation object is a Transform that maps an RNA dictionary (as defined above) to a mathematical representation of this RNA. In the current version of RNAglib, four representations are implemented: GraphRepresentation, PointCloudRepresentation, VoxelRepresentation and RingRepresentation.

GraphRepresentation converts RNA into a Leontis-Westhof graph (2.5D) where nodes are residues and edges are either base pairs or backbone connections.

PointCloudRepresentation converts RNA into a 3D point cloud based representation.

VoxelRepresentation converts RNA into a voxel/3D grid representation.

RingRepresentation converts RNA into a ring-based representation.

##### FeaturesComputer

A FeaturesComputer is a Transform that maps an RNA dictionary to a dictionary of features and targets of this RNA (both RNA-level and node-level).

##### FilterTransform

A FilterTransform returns the RNAs of the RNADataset that pass a certain filter.
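
To make the filtering idea concrete, here is a minimal plain-Python sketch. It does not use the rnaglib API: `size_filter` and `toy_dataset` are hypothetical names, and each toy RNA stores a list of residue codes under `"rna"` where rnaglib would store a networkx graph.

```python
# Toy stand-in: each RNA is a plain dict whose "rna" entry is a list of
# residue codes (rnaglib stores a networkx graph there instead).
def size_filter(rna_dict, min_size=2, max_size=5):
    """Return True when the RNA has between min_size and max_size residues."""
    return min_size <= len(rna_dict["rna"]) <= max_size


toy_dataset = [
    {"name": "too_small", "rna": ["A"]},
    {"name": "kept", "rna": ["A", "C", "G"]},
    {"name": "too_large", "rna": ["A", "C", "G", "U", "A", "C", "G"]},
]

# Filtering keeps only the items for which the predicate is True.
kept = [rna_dict for rna_dict in toy_dataset if size_filter(rna_dict)]
print([rna_dict["name"] for rna_dict in kept])
```

The filter itself only answers yes or no for one RNA; selecting the surviving items is a separate step applied over the whole dataset.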

##### AnnotationTransform

An AnnotationTransform is a transform computing additional node features within each RNA graph.

##### PartitionTransform

A PartitionTransform is a transform which breaks up each RNA structure into substructures.
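
As an illustration of partitioning, the sketch below splits one toy RNA into one substructure per chain. It is plain Python rather than the rnaglib API: `partition_by_chain` is a hypothetical helper, and the residue identifier layout `<pdb id>.<chain>.<residue number>` is an assumption made for this example.

```python
from collections import defaultdict


def partition_by_chain(rna_dict):
    """Split one toy RNA into one sub-dictionary per chain."""
    residues_by_chain = defaultdict(list)
    for residue_id in rna_dict["rna"]:
        # Assumed identifier layout: "<pdb id>.<chain>.<residue number>"
        _, chain, _ = residue_id.split(".")
        residues_by_chain[chain].append(residue_id)
    # One output item per chain, in order of first appearance.
    return [
        {"rna": residues, "chain": chain}
        for chain, residues in residues_by_chain.items()
    ]


toy_rna = {"rna": ["1abc.A.1", "1abc.A.2", "1abc.B.1"]}
parts = partition_by_chain(toy_rna)
print(len(parts), [part["chain"] for part in parts])
```

Unlike a filter, a partition maps one input item to several output items, so the resulting dataset can be larger than the original one.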

##### Compose

A Compose object is a Transform that applies a series of transforms in sequence.

##### ComposeFilters

A ComposeFilters object is a Transform that combines a series of filters (objects of type FilterTransform).
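
The two kinds of composition can be sketched in plain Python as follows. This is not the rnaglib implementation: `compose`, `compose_filters` and the toy transforms are hypothetical names, and the sketch assumes that composed filters are combined with a logical AND.

```python
from functools import reduce


def compose(*transforms):
    """Chain transforms left to right: the output of one feeds the next."""
    def composed(rna_dict):
        return reduce(lambda current, transform: transform(current), transforms, rna_dict)
    return composed


def compose_filters(*filters):
    """Assumption: an RNA passes only if every filter accepts it."""
    def composed(rna_dict):
        return all(single_filter(rna_dict) for single_filter in filters)
    return composed


# Two toy transforms that each add one entry to the RNA dictionary.
def add_length(rna_dict):
    return {**rna_dict, "length": len(rna_dict["sequence"])}


def add_gc_content(rna_dict):
    sequence = rna_dict["sequence"]
    gc_fraction = sum(nt in "GC" for nt in sequence) / len(sequence)
    return {**rna_dict, "gc_content": gc_fraction}


annotate = compose(add_length, add_gc_content)
keep = compose_filters(
    lambda rna_dict: rna_dict["length"] >= 4,
    lambda rna_dict: rna_dict["gc_content"] >= 0.5,
)

annotated = annotate({"sequence": "GCGA"})
print(annotated, keep(annotated))
```

Order matters for `compose` (a later transform may rely on what an earlier one added), whereas under the AND assumption the order of filters only affects how early a rejection is detected.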

#### Tasks

A Task is an object representing a benchmarking task to be performed on RNAs. It is associated with a specific RNADataset. Once implemented, a Task object can be used to evaluate the performance of various models on that task. One category of tasks is already implemented as a subclass of Task: ResidueClassificationTask, which groups all the tasks that consist of classifying the residues (nucleotides) of the RNA.

#### Encoders

Encoders are objects that vectorize features with a specific encoding. The features available in the RNA NetworkX graph can have different types, including text, so they must be converted to numerical vectors before they can be used for learning.
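
The following plain-Python sketch shows what vectorizing a categorical feature and a numerical feature can look like. The classes `ToyOneHotEncoder` and `ToyFloatEncoder` are hypothetical stand-ins written for this illustration; they are not the rnaglib encoders, and their handling of unknown or missing values is an assumption.

```python
import math


class ToyOneHotEncoder:
    """Map a symbol to a one-hot vector; unknown symbols give all zeros."""

    def __init__(self, vocabulary):
        self.index = {symbol: i for i, symbol in enumerate(vocabulary)}

    def encode(self, value):
        vector = [0.0] * len(self.index)
        if value in self.index:
            vector[self.index[value]] = 1.0
        return vector


class ToyFloatEncoder:
    """Map a numeric value to a one-element vector, with a default fallback."""

    def __init__(self, default=0.0):
        self.default = default

    def encode(self, value):
        try:
            number = float(value)
        except (TypeError, ValueError):
            return [self.default]
        # NaN and infinities are replaced by the default value.
        return [number] if math.isfinite(number) else [self.default]


nt_encoder = ToyOneHotEncoder(["A", "C", "G", "U"])
angle_encoder = ToyFloatEncoder()

node = {"nt_code": "G", "phase_angle": 162.5}
features = nt_encoder.encode(node["nt_code"]) + angle_encoder.encode(node["phase_angle"])
print(features)
```

Concatenating the per-feature vectors gives one fixed-length numerical vector per node, which is the form a learning framework expects.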

### Advanced functionalities

#### Adding a representation to a RNADataset

A Representation object (or several representations grouped in a list) can be passed as an argument to RNADataset when it is created, or added to an existing RNADataset using the following code:

In [50]:
from rnaglib.transforms import GraphRepresentation, PointCloudRepresentation
representations_list = [
    GraphRepresentation(), 
    PointCloudRepresentation()
]
dataset.add_representation(representations_list)

>>> Adding [<rnaglib.transforms.represent.graph.GraphRepresentation object at 0x5917a2790>, <rnaglib.transforms.represent.point_cloud.PointCloudRepresentation object at 0x4995c7910>] representations.


#### Removing a representation

You can remove one or several Representation objects from an RNADataset by passing their names, as in the code below. The name can be "graph" for GraphRepresentation, "point_cloud" for PointCloudRepresentation, "voxel" for VoxelRepresentation or "ring" for RingRepresentation.

In [53]:
dataset.remove_representation(["point_cloud"])

#### Adding features to the features computer of a RNADataset

If you want to use a (node-level or RNA-level) feature to perform a task, you first have to add it to the features computer of the dataset.

The features can be chosen among the list of features available in the RNA graph: 'index', 'index_chain', 'chain_name', 'nt_resnum', 'nt_name', 'nt_code', 'nt_id', 'nt_type', 'dbn', 'summary', 'alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'epsilon_zeta', 'bb_type', 'chi', 'glyco_bond', 'C5prime_xyz', 'P_xyz', 'form', 'ssZp', 'Dp', 'splay_angle', 'splay_distance', 'splay_ratio', 'eta', 'theta', 'eta_prime', 'theta_prime', 'eta_base', 'theta_base', 'v0', 'v1', 'v2', 'v3', 'v4', 'amplitude', 'phase_angle', 'puckering', 'sugar_class', 'bin', 'cluster', 'suiteness', 'filter_rmsd', 'frame', 'sse', 'binding_protein', 'binding_ion', 'binding_small-molecule'.

When adding a feature to the features computer, you have to specify a dictionary named `custom_encoders` that maps each feature to the encoder chosen to encode it. Canonical encoders corresponding to each feature are available in [NODE_FEATURE_MAP](https://github.com/cgoliver/rnaglib/blob/30bded91462f655c235ef57efc07e834456615a4/src/rnaglib/config/feature_encoders.py#L7).

In the example below, we add the feature named `"phase_angle"` to the features computer of the dataset and specify that it should be encoded using the pre-implemented FloatEncoder.

In [82]:
from rnaglib.encoders import FloatEncoder

dataset.features_computer.add_feature(feature_names="phase_angle", custom_encoders={"phase_angle":FloatEncoder()})

### Customization

#### Create a custom Task

In order to create a custom task, you have to define it as a subclass of a task category (for instance `ResidueClassificationTask` or a subclass you have created yourself) and specify the following:

* a target variable: the variable to be predicted by the model
* an input variable or a list of input variables: the inputs of the model
* a method `get_task_vars` that builds the FeaturesComputer needed for the task (in general, it uses the aforementioned target and input variables)
* a method `process` that creates the dataset and, if needed, applies preprocessing to it (especially annotation and filtering transforms)

If the task belongs to a task category other than `ResidueClassificationTask` (that is, it is not a node-level classification task), you have to define a new Task subclass for this category and specify:
* a property named `dummy_model` returning a dummy model, which lets you check that the task runs without having to define a real model
* a method named `evaluate` which, given a model, outputs a dictionary containing performance metrics of this model on the task of interest.

For instance, in the cell below, we define a toy task called AnglePrediction consisting in predicting the phase angle by using the nucleotide code.

Since it is a regression task rather than a classification task, we first need to define a new subclass of `Task`, which we will call `ResidueRegression`.

In [110]:
import torch
from rnaglib.tasks import Task
from rnaglib.utils import DummyResidueModel
from sklearn.metrics import root_mean_squared_error, mean_absolute_error

class ResidueRegression(Task):
    def __init__(self, root, splitter=None, **kwargs):
        super().__init__(root=root, splitter=splitter, **kwargs)

    @property
    def dummy_model(self) -> torch.nn.Module:
        return DummyResidueModel()

    def evaluate(self, model: torch.nn.Module, device: str = "cpu") -> dict:
        model.eval()
        all_preds = []
        all_labels = []

        with torch.no_grad():
            for batch in self.test_dataloader:
                graph = batch["graph"]
                graph = graph.to(device)
                out = model(graph)

                # Regression: keep the raw outputs as predictions (no thresholding)
                all_preds.extend(out.cpu().flatten().tolist())
                all_labels.extend(graph.cpu().y.flatten().tolist())

        # Compute performance metrics over the whole test set
        RMSE = root_mean_squared_error(all_labels, all_preds)
        MAE = mean_absolute_error(all_labels, all_preds)

        return {"RMSE": RMSE, "MAE": MAE}
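
The two regression metrics returned by `evaluate` can be checked by hand on a few numbers. The sketch below is a plain-Python illustration that needs neither a model nor rnaglib; `regression_metrics` and `toy_batches` are hypothetical names. It accumulates predictions and labels over all batches before computing the metrics, so that no batch is left out.

```python
import math


def regression_metrics(batches):
    """Compute RMSE and MAE over every (predictions, labels) batch."""
    all_preds, all_labels = [], []
    for preds, labels in batches:
        all_preds.extend(preds)
        all_labels.extend(labels)

    errors = [pred - label for pred, label in zip(all_preds, all_labels)]
    rmse = math.sqrt(sum(error * error for error in errors) / len(errors))
    mae = sum(abs(error) for error in errors) / len(errors)
    return {"RMSE": rmse, "MAE": mae}


# Two toy batches of (predictions, labels).
toy_batches = [
    ([1.0, 2.0], [1.0, 4.0]),
    ([3.0], [5.0]),
]
metrics = regression_metrics(toy_batches)
print(metrics)
```

Here the errors are 0, -2 and -2, so the MAE is 4/3 and the RMSE is the square root of 8/3.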

Once the subclass `ResidueRegression` is defined, one can define the specific task `AnglePrediction`:

In [111]:
from rnaglib.data_loading import RNADataset
from rnaglib.encoders import FloatEncoder
from rnaglib.transforms import FeaturesComputer, PDBIDNameTransform

# NOTE: SPLITTING_VARS is used below but never imported in this tutorial;
# import it from your rnaglib installation before running this cell.

class AnglePrediction(ResidueRegression):
    # Target variable
    target_var = "phase_angle"
    # Input variable
    input_var = "nt_code"

    def __init__(self, root, splitter=None, **kwargs):
        super().__init__(root=root, splitter=splitter, **kwargs)

    # Creation and preprocessing of the dataset
    def process(self) -> RNADataset:
        rnas = RNADataset(debug=False, redundancy='all', rna_id_subset=SPLITTING_VARS['PDB_TO_CHAIN_TR60_TE18'].keys())
        # TODO: remove wrong chains using  SPLITTING_VARS["PDB_TO_CHAIN_TR60_TE18"]
        rnas = PDBIDNameTransform()(rnas)
        dataset = RNADataset(rnas=[r["rna"] for r in rnas])
        return dataset

    # Computation of the FeaturesComputer
    def get_task_vars(self) -> FeaturesComputer:
        return FeaturesComputer(
            nt_features=[self.input_var],
            nt_targets=self.target_var,
            # phase_angle is a continuous value, so it is encoded as a float
            custom_encoders={self.target_var: FloatEncoder()},
        )

#### Create a custom representation

In order to create a custom Representation object, you have to define a subclass of `Representation` and specify the methods `__init__`, `__call__`, `name` and `batch`.

#### Create custom features

The strategy to create custom features consists in creating a Transform object which takes as input a RNADataset and transforms it by adding the new features to the graphs representing all of the items of the RNADataset.

To do so, you have to build a subclass of `Transform` and specify:

* its `name`

* its associated `encoder`

* its `forward` method taking as input the dictionary representing one RNA and returning the updated RNA dictionary (containing its additional features)

Once the custom features have been created, you still have to add them to the FeaturesComputer of the dataset, as described in the section "Adding features to the features computer of a RNADataset" above.

Below is the structure to write such a transform:

In [None]:
from rnaglib.transforms import Transform

class CustomTransform(Transform):
    name = "custom_transform"
    encoder = ...
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def forward(self, rna_dict: dict) -> dict:

        ... # compute and add additional features

        return rna_dict

#### Create a custom model

Some models are already implemented and available in `rnaglib.learning.models`, including `Embedder`, `Classifier`, `DotPredictor` and `BasePairPredictor`. However, you might want to define a custom ML model, either to perform a custom task or to compare against the already implemented models on an existing task. Here is the structure of the code to write in order to implement a new model.

In [None]:
import torch

class CustomModel(torch.nn.Module):
    def __init__(self, num_node_features, num_classes, num_unique_edge_attrs, num_layers=2, hidden_channels=16):
        super().__init__()
        
        ... # Instantiate layers

    def forward(self, data):
        x, edge_index, edge_type, batch = data.x, data.edge_index, data.edge_attr, data.batch

        ...

        return x

#### Create a custom annotator

You might want to create a custom annotator to add new features to the nodes of the graphs, for instance to perform a new task using those new annotations. The custom annotator will typically be called in the `process` method of a `Task` object.

In [None]:
from rnaglib.transforms import AnnotationTransform
from networkx import set_node_attributes

class CustomAnnotator(AnnotationTransform):
    def forward(self, rna_dict: dict) -> dict:
        # Compute the annotation for every node of the RNA graph
        custom_annotation = {
            node: self._custom_annotation(nodedata)
            for node, nodedata in rna_dict['rna'].nodes(data=True)
        }
        # Store it as a node attribute named "custom_annotation"
        set_node_attributes(rna_dict['rna'], custom_annotation, "custom_annotation")
        return rna_dict

    @staticmethod
    def _custom_annotation(nodedata: dict):
        return ... # node-wise formula to compute the custom annotation
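
To show what a node-wise annotation formula might look like, here is a plain-Python sketch that flags purines. It is only an illustration: `annotate_purines` is a hypothetical function, the `is_purine` attribute is invented for this example, and the toy RNA stores its nodes in a plain dictionary where rnaglib would use a networkx graph.

```python
def annotate_purines(rna_dict):
    """Add an 'is_purine' attribute to every node of a toy RNA."""
    for nodedata in rna_dict["rna"].values():
        # A and G are purines; C and U are pyrimidines.
        nodedata["is_purine"] = nodedata["nt_code"].upper() in {"A", "G"}
    return rna_dict


# Toy stand-in for the RNA graph: node identifier -> node attributes.
toy_rna = {
    "rna": {
        "1abc.A.1": {"nt_code": "A"},
        "1abc.A.2": {"nt_code": "u"},
        "1abc.A.3": {"nt_code": "G"},
    }
}

annotated = annotate_purines(toy_rna)
flags = {node: data["is_purine"] for node, data in annotated["rna"].items()}
print(flags)
```

The annotation depends only on the attributes of a single node, which is what makes it easy to apply uniformly to every node of every RNA in a dataset.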

#### Create a custom filter

Several filters are already implemented and available in `rnaglib.transforms`: `SizeFilter`, which rejects RNAs that fall outside the given size bounds; `RNAAttributeFilter`, which rejects RNAs that lack a certain annotation at the whole-RNA level; `ResidueAttributeFilter`, which rejects RNAs that lack a certain annotation at the residue level; `RibosomalFilter`, which rejects ribosomal RNA; and `NameFilter`, which filters RNAs based on their names. However, you might want to create your own filter, which could for instance be called in the `process` method of a new `Task` object.

In [None]:
from rnaglib.transforms import FilterTransform

class CustomFilter(FilterTransform):

    def __init__(self, **kwargs):  # add your own filter parameters before **kwargs
        ...  # store the filter parameters as attributes
        super().__init__(**kwargs)

    def forward(self, rna_dict: dict) -> bool:

        ...

        return ... # should return a Boolean (True if the RNA described by rna_dict passes the filter, False otherwise)