In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_toxicology")

# Task: Toxicology
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Data Model:*
> 
> The *Toxicology* dataset consists of four tables: `connected`, `atom`, `bond`, and `molecule`. These tables provide information about molecular structures and their toxicological properties.
> 
> - **connected**: Contains `atom_id` (varchar), `atom_id2` (varchar), and `bond_id` (varchar). This table represents the connections between atoms in a molecule.
> 
> - **atom**: Includes `atom_id` (varchar), `molecule_id` (varchar), and `element` (varchar). It provides information about individual atoms within molecules.
> 
> - **bond**: Contains `bond_id` (varchar), `molecule_id` (varchar), and `bond_type` (varchar). This table details the types of bonds between atoms in a molecule.
> 
> - **molecule**: Includes `molecule_id` (varchar) and `label` (varchar). This table classifies molecules based on their toxicological properties.
> 
> *Task and Target Column:*
> 
> The primary task is *classification*, with the target column being `label` in the `molecule` table. The goal is to classify molecules based on their toxicological properties.
> 
> *Column Types:*
> 
> - Varchar: `atom_id`, `atom_id2`, `bond_id`, `molecule_id`, `element`, `bond_type`, `label`
> 
> *Metadata:*
> 
> - **Number of Tables**: 4
> - **Target Table**: `molecule`
> - **Target Column**: `label`
> 
> This dataset is used in the field of toxicology to analyze and classify chemical compounds based on their potential toxic effects.

### Tables
Population table: molecule

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Toxicology.svg" alt="Toxicology ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
molecule, peripheral = load_ctu_dataset("Toxicology")

# Unpack in the order the dictionary yields the tables (see the container summary below).
(
    connected,
    bond,
    atom,
) = peripheral.values()

Analyzing schema:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading tables:   0%|          | 0/4 [00:00<?, ?it/s]

Building data:   0%|          | 0/4 [00:00<?, ?it/s]
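
Because the peripheral tables arrive as a dictionary, positional unpacking depends on the dictionary's insertion order. A minimal sketch of the alternative, looking tables up by name. Plain strings stand in for the real tables, and the key names `connected`, `bond`, and `atom` are an assumption based on the table names in the container summary at the end of this notebook:

```python
# Stand-in for the dictionary returned by load_ctu_dataset (assumed key names);
# the real values are getML tables, not strings.
peripheral = {
    "connected": "connected table",
    "bond": "bond table",
    "atom": "atom table",
}

# Positional unpacking silently follows the insertion order of the dictionary ...
first, second, third = peripheral.values()

# ... whereas a lookup by key does not depend on that order.
atom = peripheral["atom"]
bond = peripheral["bond"]
connected = peripheral["connected"]
```

With key lookups, each variable name is guaranteed to match the table it holds, whatever order the loader uses.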

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`molecule`). We already set the `target` role for the target (`label`). If the task is a multiclass classification,
we split the target column into multiple columns in a one-vs-all fashion. In this case, the original target is still available as `label`.

In [3]:
# TODO: Annotate remaining columns with roles
molecule

name,label,molecule_id,split
role,target,unused_string,unused_string
0.0,0,TR000,val
1.0,0,TR001,val
2.0,1,TR002,train
3.0,1,TR004,train
4.0,0,TR006,train
,...,...,...
338.0,0,TR494,train
339.0,1,TR495,train
340.0,0,TR496,train
341.0,0,TR499,val
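
The one-vs-all split mentioned above can be illustrated in plain Python. This is a sketch of the general technique, not the loader's implementation: the function name `one_vs_all` and the example classes are hypothetical, and the `label` column shown here only contains 0 and 1, so no split appears to be needed for this dataset.

```python
def one_vs_all(labels):
    """Return one 0/1 indicator list per distinct class in `labels`."""
    classes = sorted(set(labels))
    return {
        cls: [1 if value == cls else 0 for value in labels]
        for cls in classes
    }


# Hypothetical three-class target, for illustration only.
labels = ["a", "b", "c", "a"]
encoded = one_vs_all(labels)
```

Each indicator list can then serve as its own binary target, while the original column stays untouched.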


The peripheral tables:

In [4]:
# TODO: Annotate columns with roles
connected

name,atom_id,atom_id2,bond_id
role,unused_string,unused_string,unused_string
0.0,TR000_1,TR000_2,TR000_1_2
1.0,TR000_2,TR000_1,TR000_1_2
2.0,TR000_2,TR000_3,TR000_2_3
3.0,TR000_3,TR000_2,TR000_2_3
4.0,TR000_2,TR000_4,TR000_2_4
,...,...,...
24753.0,TR502_5,TR502_3,TR502_3_5
24754.0,TR502_6,TR502_9,TR502_6_9
24755.0,TR502_9,TR502_6,TR502_6_9
24756.0,TR502_10,TR502_7,TR502_7_10


In [5]:
# TODO: Annotate columns with roles
bond

name,bond_id,molecule_id,bond_type
role,unused_string,unused_string,unused_string
0.0,TR000_1_2,TR000,-
1.0,TR000_2_3,TR000,-
2.0,TR000_2_4,TR000,-
3.0,TR000_2_5,TR000,-
4.0,TR001_10_11,TR001,=
,...,...,...
12374.0,TR502_2_3,TR502,-
12375.0,TR502_3_4,TR502,-
12376.0,TR502_3_5,TR502,-
12377.0,TR502_6_9,TR502,-


In [6]:
# TODO: Annotate columns with roles
atom

name,atom_id,molecule_id,element
role,unused_string,unused_string,unused_string
0.0,TR000_1,TR000,cl
1.0,TR000_2,TR000,c
2.0,TR000_3,TR000,cl
3.0,TR000_4,TR000,cl
4.0,TR000_5,TR000,h
,...,...,...
12328.0,TR502_5,TR502,cl
12329.0,TR502_6,TR502,o
12330.0,TR502_7,TR502,o
12331.0,TR502_8,TR502,h
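
The identifiers in the rows above appear to be composite: an `atom_id` such as `TR000_1` starts with its `molecule_id`, and a `bond_id` such as `TR000_1_2` names the two atoms it links. The table with `atom_id`, `atom_id2`, and `bond_id` has no `molecule_id` column, so this pattern is one way to recover it. A minimal sketch, assuming the pattern holds for every row (only the displayed rows confirm it); the helper names are hypothetical:

```python
from typing import Tuple


def molecule_of(atom_id: str) -> str:
    """'TR000_1' -> 'TR000', assuming the '<molecule_id>_<n>' pattern."""
    return atom_id.rsplit("_", 1)[0]


def atoms_of(bond_id: str) -> Tuple[str, str]:
    """'TR000_1_2' -> ('TR000_1', 'TR000_2'), assuming '<molecule_id>_<n>_<m>'."""
    molecule_id, first, second = bond_id.rsplit("_", 2)
    return f"{molecule_id}_{first}", f"{molecule_id}_{second}"
```

For example, `molecule_of("TR502_10")` gives `"TR502"`, and `atoms_of("TR502_6_9")` gives the pair `("TR502_6", "TR502_9")`, which matches the corresponding rows shown above.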


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Toxicology](https://relational.fel.cvut.cz/dataset/Toxicology)
for a description of the dataset.

In [7]:
dm = getml.data.DataModel(population=molecule.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
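
Before filling in the joins, it can help to list which columns each pair of tables shares. The sketch below works only from the column lists in the dataset description above and does not call getML; the function name `shared_columns` is hypothetical, and a shared column is only a candidate join key, not proof of a relationship.

```python
from itertools import combinations

# Column lists copied from the dataset description.
schemas = {
    "molecule": ["molecule_id", "label"],
    "atom": ["atom_id", "molecule_id", "element"],
    "bond": ["bond_id", "molecule_id", "bond_type"],
    "connected": ["atom_id", "atom_id2", "bond_id"],
}


def shared_columns(schemas):
    """Map each pair of tables to the sorted list of columns they share."""
    result = {}
    for left, right in combinations(schemas, 2):
        common = sorted(set(schemas[left]) & set(schemas[right]))
        if common:
            result[(left, right)] = common
    return result


candidates = shared_columns(schemas)
```

Under these column lists, `molecule` shares `molecule_id` with `atom` and `bond`, while `connected` shares `atom_id` with `atom` and `bond_id` with `bond`. It has no column in common with `molecule`, so it can only reach the population table through one of the other two.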

Now we can create the container and add the tables to it.

In [8]:
container = getml.data.Container(population=molecule, split=molecule.split)
container.add(**peripheral)

container

Unnamed: 0,subset,name,rows,type
0,train,molecule,241,View
1,val,molecule,102,View

Unnamed: 0,name,rows,type
0,connected,24758,DataFrame
1,bond,12379,DataFrame
2,atom,12333,DataFrame
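
A quick sanity check on the summary above, using only the row counts it reports. It computes the share of each subset and tests whether `connected` holds exactly two rows per bond, which would fit the displayed rows where every bond is listed once in each direction:

```python
# Row counts copied from the container summary.
subset_rows = {"train": 241, "val": 102}
peripheral_rows = {"connected": 24758, "bond": 12379, "atom": 12333}

total = sum(subset_rows.values())
shares = {name: round(count / total, 3) for name, count in subset_rows.items()}

# True if each bond contributes two rows (one per direction) to `connected`.
two_rows_per_bond = peripheral_rows["connected"] == 2 * peripheral_rows["bond"]
```

The split is roughly 70/30 between `train` and `val`, and the row counts are consistent with two `connected` rows per bond.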
