In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_biodegradability")

# Task: Biodegradability
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Biodegradability Dataset Description*
> 
> - *Data Model (Relational Schema)*:
>   - **bond**: Contains columns `atom_id` (varchar), `atom_id2` (varchar), and `type` (varchar).
>   - **gmember**: Contains columns `atom_id` (varchar), `group_id` (varchar), and `type` (varchar).
>   - **atom**: Contains columns `atom_id` (varchar), `molecule_id` (varchar), and `type` (varchar).
>   - **group**: Contains columns `group_id` (varchar) and `type` (varchar).
>   - **molecule**: Contains columns `molecule_id` (varchar), `activity` (float), `logp` (float), and `mweight` (float).
> 
> - *Task*: Regression
>   - *Target Column*: `activity` in the `molecule` table.
> 
> - *Types of the Columns*:
>   - *Varchar*: Used for identifiers and types in `bond`, `gmember`, `atom`, `group`, and `molecule` tables.
>   - *Float*: Used for `activity`, `logp`, and `mweight` in the `molecule` table.
> 
> - *Metadata*:
>   - *Size*: 3.3 MB
>   - *Number of Tables*: 5
>   - *Number of Rows*: 21,875
>   - *Number of Columns*: 14
>   - *Missing Values*: No missing values
>   - *Instance Count*: 328
>   - *Target Table*: `molecule`
>   - *Target ID*: `molecule_id`
> 
> This dataset is used for predicting the half-life for aerobic aqueous biodegradation of chemical compounds.

### Tables
Population table: molecule

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Biodegradability.svg" alt="Biodegradability ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the `utils`
module. This function returns a tuple with the population table as the first
element and a dictionary of peripheral tables as the second element.
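
Binding the peripheral tables by name rather than by position avoids depending on the dictionary's iteration order. The following is a minimal sketch with a stand-in loader; `load_stub` and the string values are hypothetical placeholders, and it assumes the dictionary is keyed by table name.

```python
# Stand-in for the (population, peripheral) tuple described above.
# The string values are placeholders, not real getML data frames.
def load_stub():
    population = "molecule table"
    peripheral = {
        "group": "group table",
        "atom": "atom table",
        "gmember": "gmember table",
        "bond": "bond table",
    }
    return population, peripheral


molecule_stub, peripheral_stub = load_stub()

# Look up each table by its key, so the binding is independent of dict order.
gmember_stub = peripheral_stub["gmember"]
atom_stub = peripheral_stub["atom"]
bond_stub = peripheral_stub["bond"]
group_stub = peripheral_stub["group"]
```

With key-based lookup, a variable such as `bond_stub` always refers to the table stored under `"bond"`, whatever order the loader happens to use.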

In [2]:
molecule, peripheral = load_ctu_dataset("Biodegradability")

(
    gmember,
    atom,
    bond,
    group,
) = peripheral.values()


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).
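
As a starting point for the annotation, here is a rough sketch that derives role suggestions from the schema in the dataset description. The heuristic in `suggest_role` is an assumption made for illustration (identifier columns become `join_key`, floats become `numerical`, other strings become `categorical`); it is not something getML does automatically, and the suggestions should be reviewed by hand.

```python
# Column types as listed in the dataset description.
SCHEMA = {
    "molecule": {"molecule_id": "varchar", "activity": "float",
                 "logp": "float", "mweight": "float"},
    "atom": {"atom_id": "varchar", "molecule_id": "varchar", "type": "varchar"},
    "bond": {"atom_id": "varchar", "atom_id2": "varchar", "type": "varchar"},
    "gmember": {"atom_id": "varchar", "group_id": "varchar", "type": "varchar"},
    "group": {"group_id": "varchar", "type": "varchar"},
}


def suggest_role(column, sql_type, target="activity"):
    """Hypothetical heuristic: map a column to the name of a getML role."""
    if column == target:
        return "target"
    if "_id" in column:          # molecule_id, atom_id, atom_id2, group_id
        return "join_key"
    if sql_type == "float":
        return "numerical"
    return "categorical"         # remaining varchar columns such as `type`


suggested_roles = {
    table: {col: suggest_role(col, sql_type) for col, sql_type in cols.items()}
    for table, cols in SCHEMA.items()
}
```

The resulting names correspond to roles that can then be applied to each data frame, typically with its `set_role` method.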

First, the population table (`molecule`). The `target` role is already set for the target column (`activity`). If the task is a multiclass classification,
we split the target column into multiple columns in a one-vs-all fashion. In that case, the original target is still available as `activity`.
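
To illustrate the one-vs-all split mentioned above, here is a small sketch in plain Python. The function `one_vs_all`, the `activity=<class>` column naming, and the example labels are all hypothetical; the loader's actual naming scheme may differ. The biodegradability task itself is a regression, so no such split applies here.

```python
def one_vs_all(labels, prefix="activity"):
    """Turn one multiclass column into one binary column per class."""
    classes = sorted(set(labels))
    return {
        f"{prefix}={cls}": [1 if label == cls else 0 for label in labels]
        for cls in classes
    }


# Hypothetical class labels, used only to show the shape of the result.
example_labels = ["low", "high", "low", "medium"]
binary_targets = one_vs_all(example_labels)
```

Each row has exactly one of the binary columns set to 1, so the original label can always be recovered from them.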

In [3]:
# TODO: Annotate remaining columns with roles
molecule

name,activity,molecule_id,logp,mweight,split
role,target,unused_string,unused_string,unused_string,unused_string
0.0,4.5337,i100_02_7i,1.91,139.11,train
1.0,4.5644,i100_21_0i,1.76,166.131,train
2.0,5.0499,i100_41_4i,3.03,106.167,train
3.0,6.2226,i100_42_5i,2.89,104.151,train
4.0,6.0402,i100_44_7i,2.79,126.585,train
,...,...,...,...,...
323.0,6.0402,i98_88_4i,1.44,140.569,train
324.0,7.834,i98_95_3i,1.81,123.111,val
325.0,5.8522,i99_55_8i,2.02,152.152,train
326.0,5.8522,i99_59_2i,1.55,168.151,val


Next, the peripheral tables:

In [4]:
# TODO: Annotate columns with roles
gmember

name,group_id,type
role,unused_string,unused_string
0.0,g0,sulfo
1.0,g1,sulfo
2.0,g10,nitro
3.0,g100,methyl
4.0,g1000,c2n
,...,...
1731.0,g995,n2n
1732.0,g996,n2n
1733.0,g997,n2n
1734.0,g998,n2n


In [5]:
# TODO: Annotate columns with roles
atom

name,atom_id,molecule_id,type
role,unused_string,unused_string,unused_string
0.0,i100_02_7_10i,i100_02_7i,c
1.0,i100_02_7_10_1i,i100_02_7i,h
2.0,i100_02_7_1i,i100_02_7i,o
3.0,i100_02_7_2i,i100_02_7i,n
4.0,i100_02_7_3i,i100_02_7i,o
,...,...,...
6563.0,i99_65_0_6_1i,i99_65_0i,h
6564.0,i99_65_0_7i,i99_65_0i,c
6565.0,i99_65_0_7_1i,i99_65_0i,h
6566.0,i99_65_0_8i,i99_65_0i,c


In [6]:
# TODO: Annotate columns with roles
bond

name,atom_id,group_id
role,unused_string,unused_string
0.0,i100_02_7_10i,g1011
1.0,i100_02_7_10i,g1321
2.0,i100_02_7_1i,g7
3.0,i100_02_7_2i,g7
4.0,i100_02_7_3i,g7
,...,...
6642.0,i99_65_0_7i,g1278
6643.0,i99_65_0_7i,g1638
6644.0,i99_65_0_8i,g1278
6645.0,i99_65_0_8i,g1638


In [7]:
# TODO: Annotate columns with roles
group

name,atom_id,atom_id2,type
role,unused_string,unused_string,unused_string
0.0,i100_02_7_10i,i100_02_7_10_1i,1
1.0,i100_02_7_1i,i100_02_7_2i,2
2.0,i100_02_7_2i,i100_02_7_3i,2
3.0,i100_02_7_2i,i100_02_7_4i,1
4.0,i100_02_7_4i,i100_02_7_10i,7
,...,...,...
6611.0,i99_65_0_7i,i99_65_0_8i,7
6612.0,i99_65_0_8i,i99_65_0_12i,7
6613.0,i99_65_0_8i,i99_65_0_9i,1
6614.0,i99_65_0_9i,i99_65_0_10i,2


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Biodegradability](https://relational.fel.cvut.cz/dataset/Biodegradability)
for a description of the dataset.
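
Before wiring up the joins, it can help to check the key relationships on a few rows. The sketch below uses plain Python and a handful of rows copied from the outputs above. It assumes that `atom.molecule_id` links atoms to molecules and that `bond.atom_id` links bonds to atoms; both relationships are inferred from the column names and should be confirmed against the ER diagram.

```python
from collections import Counter

# (atom_id, molecule_id, type), first rows of the atom table shown above.
atoms = [
    ("i100_02_7_10i", "i100_02_7i", "c"),
    ("i100_02_7_10_1i", "i100_02_7i", "h"),
    ("i100_02_7_1i", "i100_02_7i", "o"),
    ("i100_02_7_2i", "i100_02_7i", "n"),
    ("i100_02_7_3i", "i100_02_7i", "o"),
]

# (atom_id, atom_id2, type), first rows of the table with bond columns.
bonds = [
    ("i100_02_7_10i", "i100_02_7_10_1i", "1"),
    ("i100_02_7_1i", "i100_02_7_2i", "2"),
    ("i100_02_7_2i", "i100_02_7_3i", "2"),
    ("i100_02_7_2i", "i100_02_7_4i", "1"),
    ("i100_02_7_4i", "i100_02_7_10i", "7"),
]

# molecule <- atom: resolve each atom to its molecule.
atom_to_molecule = {atom_id: molecule_id for atom_id, molecule_id, _ in atoms}
atoms_per_molecule = Counter(molecule_id for _, molecule_id, _ in atoms)

# atom <- bond: resolve each bond to a molecule through its first atom.
# Bonds whose atom is missing from this small sample are skipped.
bonds_per_molecule = Counter(
    atom_to_molecule[atom_id]
    for atom_id, _, _ in bonds
    if atom_id in atom_to_molecule
)
```

Per-molecule counts like these are the kind of aggregation a relational learner builds over a join path, which is why the keys passed to `dm.population.join(...)` need to match these relationships.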

In [8]:
dm = getml.data.DataModel(population=molecule.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [9]:
container = getml.data.Container(population=molecule, split=molecule.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,molecule,230,View
1,val,molecule,98,View

,name,rows,type
0,group,1736,DataFrame
1,atom,6568,DataFrame
2,gmember,6647,DataFrame
3,bond,6616,DataFrame
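
The `split` column determines which subset each population row belongs to. As a rough illustration of that partitioning, here is a sketch in plain Python over the rows visible in the `molecule` output above; it does not reproduce how the container builds its views internally.

```python
from collections import defaultdict

# (molecule_id, split) pairs copied from the visible rows of `molecule`.
rows = [
    ("i100_02_7i", "train"),
    ("i100_21_0i", "train"),
    ("i100_41_4i", "train"),
    ("i100_42_5i", "train"),
    ("i100_44_7i", "train"),
    ("i98_88_4i", "train"),
    ("i98_95_3i", "val"),
    ("i99_55_8i", "train"),
    ("i99_59_2i", "val"),
]

# Group the molecule ids by the value of their split column.
subsets = defaultdict(list)
for molecule_id, split in rows:
    subsets[split].append(molecule_id)

subset_sizes = {name: len(ids) for name, ids in subsets.items()}
```

On the full table, the same grouping yields the 230 training and 98 validation rows reported in the container summary.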
