In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_triazine")

# Task: Triazine
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Data Model (Relational Schema)*
> 
> The Triazine dataset consists of two tables:
> 
> - **Position**: Contains molecular position data with attributes like `branch`, `flex`, `h_acceptor`, `h_doner`, `pi_acceptor`, `pi_doner`, `polar`, `polarisable`, `sigma`, and `size`.
> - **Molecule**: Contains information about molecules, specifically the `activity` level.
> 
> *Task and Target Column*
> 
> The primary task is *regression*, with the target column being *activity* in the *molecule* table.
> 
> *Types of the Columns*
> 
> - **Numeric**: All columns are numeric, including `branch`, `flex`, `activity`, etc.
> 
> *Metadata about the Dataset*
> 
> - **Size**: 200 KB
> - **Number of Tables**: 2
> - **Number of Rows**: 1,302
> - **Number of Columns**: 14
> - **Missing Values**: No
> - **Compound Keys**: No
> - **Loops**: No
> - **Type**: Real
> - **Instance Count**: 186
> 
> This dataset is used in the medical domain to predict the inhibition of dihydrofolate reductase by pyrimidines, providing valuable insights for drug development.

### Tables
Population table: molecule

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Triazine.svg" alt="Triazine ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
molecule, peripheral = load_ctu_dataset("Triazine")

(
    position,
) = peripheral.values()

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

First, the population table (`molecule`). The `target` role is already set for the target column (`activity`). If the task is a multiclass classification,
we split the target column into multiple columns in a one-vs-all fashion. In that case, the original target is still available as `activity`.

In [3]:
# TODO: Annotate remaining columns with roles
molecule

| name | activity | molecule_id | split |
|---|---|---|---|
| role | target | unused_string | unused_string |
| 0 | 0.564 | 1 | train |
| 1 | 0.772 | 2 | train |
| 2 | 0.639 | 3 | train |
| 3 | 0.647 | 4 | train |
| 4 | 0.564 | 5 | train |
| | ... | ... | ... |
| 181 | 0.817 | 182 | val |
| 182 | 0.417 | 183 | val |
| 183 | 0.74 | 184 | val |
| 184 | 0.637 | 185 | train |
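
The one-vs-all split mentioned above does not apply to Triazine, since `activity` is a regression target. For reference, here is a minimal sketch of what such a split does in a multiclass task. The class labels and the `target=<class>` column naming are hypothetical and are not taken from `load_ctu_dataset`.

```python
def one_vs_all(labels):
    """Return one binary (0/1) column per distinct class label."""
    classes = sorted(set(labels))
    return {
        f"target={cls}": [1 if label == cls else 0 for label in labels]
        for cls in classes
    }


# Hypothetical multiclass target; Triazine itself has a numeric target.
labels = ["low", "high", "medium", "low"]
binary_targets = one_vs_all(labels)

for name, column in binary_targets.items():
    print(name, column)
```

Each row has exactly one `1` across the generated columns, so the original label can always be recovered from them.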


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
position

| name | molecule_id | position | branch | flex | h_acceptor | h_doner | pi_acceptor | pi_doner | polar | polarisable | sigma | size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| role | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| 1 | 1 | 2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| 2 | 1 | 3 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| 3 | 1 | 4 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| 4 | 1 | 5 | 0.1 | 0 | 0.1 | 0 | 0.1 | 0 | 0.1 | 0.1 | 0.1 | 0.1 |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1111 | 186 | 2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| 1112 | 186 | 3 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| 1113 | 186 | 4 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| 1114 | 186 | 5 | 0.1 | 0 | 0.1 | 0 | 0.1 | 0 | 0.1 | 0.1 | 0.1 | 0.1 |
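
Every column of `position` still carries the `unused_string` role, so the annotation TODO comes down to deciding a role per column. The sketch below shows one possible assignment, not the intended solution. It assumes `molecule_id` is the join key and that all descriptors are numerical, as the dataset description suggests. Treating `position` as numerical rather than categorical is a judgment call. `RecordingFrame` and `apply_roles` are hypothetical helpers that make the example runnable without a getML engine; the only getML call relied on is `set_role(cols, role)`.

```python
# Assumed role plan for the `position` table (role name -> columns).
POSITION_ROLES = {
    "join_key": ["molecule_id"],
    "numerical": [
        "position", "branch", "flex", "h_acceptor", "h_doner",
        "pi_acceptor", "pi_doner", "polar", "polarisable", "sigma", "size",
    ],
}


class RecordingFrame:
    """Hypothetical stand-in for a data frame; it only records set_role calls."""

    def __init__(self):
        self.roles = {}

    def set_role(self, cols, role):
        for col in cols:
            self.roles[col] = role


def apply_roles(frame, role_plan):
    """Call frame.set_role(columns, role) once per role in the plan."""
    for role, columns in role_plan.items():
        frame.set_role(columns, role)
    return frame


annotated = apply_roles(RecordingFrame(), POSITION_ROLES)
print(annotated.roles["molecule_id"], annotated.roles["flex"])
```

In the notebook, the same plan could be applied with `apply_roles(position, POSITION_ROLES)`, and `molecule_id` in `molecule` would need the `join_key` role as well.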


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Triazine](https://relational.fel.cvut.cz/dataset/Triazine)
for a description of the dataset.

In [5]:
dm = getml.data.DataModel(population=molecule.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
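
The join left as a TODO above links each molecule to its rows in `position` through the shared `molecule_id` column. This is a one-to-many relationship. The sketch below illustrates that relationship in plain Python on a few toy rows whose values are illustrative only. In getML, the corresponding call would presumably be `dm.population.join(dm.position, on="molecule_id")`. That is an assumption about how the TODO is meant to be completed, and it requires `molecule_id` to carry the `join_key` role in both tables.

```python
from collections import defaultdict

# Toy rows, illustrative only; the real tables come from load_ctu_dataset.
molecule_rows = [
    {"molecule_id": "1", "activity": 0.564},
    {"molecule_id": "2", "activity": 0.772},
]
position_rows = [
    {"molecule_id": "1", "position": "1", "flex": 0.1},
    {"molecule_id": "1", "position": "2", "flex": 0.1},
    {"molecule_id": "2", "position": "1", "flex": 0.1},
]


def group_by_key(rows, key):
    """Group rows by the value of `key`, mimicking the lookup side of a join."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row)
    return grouped


positions_by_molecule = group_by_key(position_rows, "molecule_id")

# Number of peripheral rows matched to each population row.
match_counts = {
    row["molecule_id"]: len(positions_by_molecule.get(row["molecule_id"], []))
    for row in molecule_rows
}
print(match_counts)
```

Feature learning then aggregates the matched `position` rows per molecule, which is why the join has to be declared in the data model.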

Now we can create the container and add the tables to it.

In [6]:
container = getml.data.Container(population=molecule, split=molecule.split)
container.add(**peripheral)

container

| | subset | name | rows | type |
|---|---|---|---|---|
| 0 | train | molecule | 131 | View |
| 1 | val | molecule | 55 | View |

| | name | rows | type |
|---|---|---|---|
| 0 | position | 1116 | DataFrame |
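
The `train` and `val` subsets listed above come from the `split` column passed to the container. As a rough sketch of that partitioning, the snippet below groups a handful of toy rows by their split label. The rows are illustrative, and `partition_by_split` is a hypothetical helper, not part of getML.

```python
def partition_by_split(rows, split_column="split"):
    """Group rows into subsets keyed by their split label."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[split_column], []).append(row)
    return subsets


# Toy population rows; the real split has 131 train and 55 val rows.
toy_rows = [
    {"molecule_id": "1", "split": "train"},
    {"molecule_id": "2", "split": "train"},
    {"molecule_id": "182", "split": "val"},
    {"molecule_id": "185", "split": "train"},
]

subsets = partition_by_split(toy_rows)
subset_sizes = {name: len(rows) for name, rows in subsets.items()}
print(subset_sizes)
```

The peripheral table is not split. It is added to the container whole, and the join decides which of its rows belong to each subset.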
