In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("mutagenesis")

# Task: mutagenesis
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *mutagenesis* dataset comprises 230 molecules tested for mutagenicity on *Salmonella typhimurium*; the *molecule* table used here contains 188 of them (see *Instance count* below). It is used for a classification task, with the target column being *mutagenic* in the *molecule* table.
> 
> **Data Model:**
> 
> - **bond** table:
>   - *atom1_id*: char
>   - *atom2_id*: char
>   - *type*: int
> 
> - **atom** table:
>   - *atom_id*: char
>   - *molecule_id*: char
>   - *element*: char
>   - *type*: int
>   - *charge*: double
> 
> - **molecule** table:
>   - *molecule_id*: char
>   - *ind1*: int
>   - *inda*: int
>   - *logp*: float
>   - *lumo*: double
>   - *mutagenic*: char (target column)
> 
> **Metadata:**
> 
> - Size: 900 KB
> - Number of tables: 3
> - Number of rows: 10,324
> - Number of columns: 14
> - Missing values: No
> - Compound keys: No
> - Loops: Yes
> - Type: Real
> - Instance count: 188
> 
> The dataset is commonly used in medicinal research to study the mutagenic properties of chemical compounds. It has been referenced in various studies focusing on relational learning and classification in the context of chemical and biological data.

### Tables
Population table: molecule

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/mutagenesis.svg" alt="mutagenesis ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function imported above from
`challenge.utils.data`. It returns a tuple: the population table as the first
element and a dictionary of peripheral tables as the second.

In [2]:
molecule, peripheral = load_ctu_dataset("mutagenesis")

(
    bond,
    atom,
) = peripheral.values()

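The positional unpacking above relies on the order in which the dictionary was filled. As a minimal sketch, assuming the keys are the table names `"bond"` and `"atom"` (the names the container reports further down), key-based access avoids that dependency. The string values below are stand-ins for the real getML tables, so the snippet runs on its own.

```python
# Stand-in for the `peripheral` dictionary; in the notebook the values
# are getML tables, here they are placeholder strings (an assumption).
peripheral = {"bond": "<bond table>", "atom": "<atom table>"}

# Positional unpacking: correct only while the insertion order is bond, atom.
bond, atom = peripheral.values()

# Key-based access: keeps working if the order of the dictionary changes.
bond_by_key = peripheral["bond"]
atom_by_key = peripheral["atom"]

print(bond is bond_by_key, atom is atom_by_key)
```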

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`molecule`) already has the `target` role set on
`mutagenic`, the target column for a binary classification task.

In [3]:
# TODO: Annotate remaining columns with roles
molecule

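| name | mutagenic | molecule_id | ind1 | inda | logp | lumo | split |
|------|-----------|-------------|------|------|------|------|-------|
| role | target | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 0 | d1 | 1 | 0 | 4.23 | -1.246 | train |
| 1 | 0 | d10 | 1 | 0 | 4.62 | -1.387 | train |
| 2 | 1 | d100 | 0 | 0 | 2.68 | -1.034 | train |
| 3 | 0 | d101 | 1 | 0 | 6.26 | -1.598 | train |
| 4 | 0 | d102 | 1 | 0 | 2.4 | -3.172 | train |
| | ... | ... | ... | ... | ... | ... | ... |
| 183 | 0 | d95 | 0 | 0 | 2.55 | -2.434 | train |
| 184 | 0 | d96 | 1 | 0 | 4.18 | -2.871 | train |
| 185 | 0 | d97 | 1 | 0 | 3.95 | -1.361 | train |
| 186 | 1 | d98 | 0 | 0 | 1.65 | -1.598 | train |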


The peripheral tables (`bond` and `atom`):

In [4]:
# TODO: Annotate columns with roles
bond

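| name | atom1_id | atom2_id | type |
|------|----------|----------|------|
| role | unused_string | unused_string | unused_string |
| 0 | d100_1 | d100_2 | 7 |
| 1 | d100_1 | d100_7 | 1 |
| 2 | d100_11 | d100_12 | 7 |
| 3 | d100_12 | d100_13 | 7 |
| 4 | d100_12 | d100_17 | 1 |
| | ... | ... | ... |
| 5238 | d9_7 | d9_8 | 7 |
| 5239 | d9_8 | d9_16 | 1 |
| 5240 | d9_8 | d9_9 | 7 |
| 5241 | d9_9 | d9_17 | 1 |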


In [5]:
# TODO: Annotate columns with roles
atom

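| name | atom_id | molecule_id | element | type | charge |
|------|---------|-------------|---------|------|--------|
| role | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | d100_1 | d100 | c | 22 | -0.128 |
| 1 | d100_10 | d100 | h | 3 | 0.132 |
| 2 | d100_11 | d100 | c | 29 | 0.002 |
| 3 | d100_12 | d100 | c | 22 | -0.128 |
| 4 | d100_13 | d100 | c | 22 | -0.128 |
| | ... | ... | ... | ... | ... |
| 4888 | d9_5 | d9 | c | 22 | -0.102 |
| 4889 | d9_6 | d9 | c | 22 | -0.102 |
| 4890 | d9_7 | d9 | n | 34 | -0.511 |
| 4891 | d9_8 | d9 | c | 21 | 0.298 |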
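
All non-target columns are still `unused_string`, so the three TODOs come down to calling `set_role` on each table. The sketch below shows one plausible assignment, not an official solution: the role choices in `ROLE_PLAN` are assumptions (for instance, treating `ind1` and `inda` as numerical rather than categorical), and `apply_roles` and `RecordingFrame` are hypothetical helpers. `RecordingFrame` only mimics `set_role(cols, role)` so the snippet runs without a getML engine.

```python
# Hypothetical role plan: one plausible assignment, not an official solution.
ROLE_PLAN = {
    "molecule": {
        "join_key": ["molecule_id"],
        "numerical": ["ind1", "inda", "logp", "lumo"],
    },
    "atom": {
        "join_key": ["atom_id", "molecule_id"],
        "categorical": ["element", "type"],
        "numerical": ["charge"],
    },
    "bond": {
        "join_key": ["atom1_id", "atom2_id"],
        "categorical": ["type"],
    },
}


def apply_roles(frames, plan=ROLE_PLAN):
    """Call `set_role(columns, role)` on each table for every planned role."""
    for table, roles in plan.items():
        for role, columns in roles.items():
            frames[table].set_role(columns, role)


class RecordingFrame:
    """Stand-in that mimics only `set_role`, so the sketch runs without getML."""

    def __init__(self):
        self.roles = {}

    def set_role(self, cols, role):
        for col in cols:
            self.roles[col] = role


# Demonstration with stand-ins; columns not in the plan (such as `split`)
# are left untouched.
frames = {name: RecordingFrame() for name in ROLE_PLAN}
apply_roles(frames)
print(frames["atom"].roles)
```

In the notebook, the same function would be called with the real tables, for example `apply_roles({"molecule": molecule, "atom": atom, "bond": bond})`.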


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/mutagenesis](https://relational.fel.cvut.cz/dataset/mutagenesis)
for a description of the dataset.

In [6]:
dm = getml.data.DataModel(population=molecule.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
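
Before filling in the joins, it can help to trace the join path on a few rows. The sketch below uses plain Python and rows copied from the table previews above: `molecule` connects to `atom` via `molecule_id`, and `atom` connects to `bond` via `atom_id` = `atom1_id`. The `join` helper is hypothetical and only illustrates the key matching. In getML the same path would presumably be written as `dm.population.join(dm.atom, on="molecule_id")` followed by `dm.atom.join(dm.bond, on=("atom_id", "atom1_id"))`; treat those calls as an assumption to check against the getML documentation. `atom2_id` offers a second path into `atom`, which is why the metadata lists loops.

```python
# Sample rows copied from the table previews above (columns trimmed).
molecules = [{"molecule_id": "d100", "mutagenic": 1}]
atoms = [
    {"atom_id": "d100_1", "molecule_id": "d100", "element": "c"},
    {"atom_id": "d100_11", "molecule_id": "d100", "element": "c"},
    {"atom_id": "d100_12", "molecule_id": "d100", "element": "c"},
]
bonds = [
    {"atom1_id": "d100_1", "atom2_id": "d100_2", "type": 7},
    {"atom1_id": "d100_1", "atom2_id": "d100_7", "type": 1},
    {"atom1_id": "d100_11", "atom2_id": "d100_12", "type": 7},
    {"atom1_id": "d100_12", "atom2_id": "d100_13", "type": 7},
    {"atom1_id": "d100_12", "atom2_id": "d100_17", "type": 1},
]


def join(left, right, left_key, right_key):
    """Inner-join two lists of dicts on left[left_key] == right[right_key]."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


# molecule -> atom on molecule_id
molecule_atoms = join(molecules, atoms, "molecule_id", "molecule_id")
# atom -> bond on atom_id = atom1_id
atom_bonds = join(atoms, bonds, "atom_id", "atom1_id")

print(len(molecule_atoms), len(atom_bonds))
```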

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=molecule, split=molecule.split)
container.add(**peripheral)

container

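|   | subset | name | rows | type |
|---|--------|------|------|------|
| 0 | train | molecule | 132 | View |
| 1 | val | molecule | 56 | View |

|   | name | rows | type |
|---|------|------|-----------|
| 0 | bond | 5243 | DataFrame |
| 1 | atom | 4893 | DataFrame |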
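
As a quick sanity check, the row counts reported by the container can be compared with the metadata in the dataset description. The numbers below are copied from the output above; the variable names are only illustrative.

```python
# Row counts copied from the container output above.
subset_rows = {"train": 132, "val": 56}
peripheral_rows = {"bond": 5243, "atom": 4893}

# Rows in the population table across both subsets.
population_rows = sum(subset_rows.values())
# Rows across all three tables.
total_rows = population_rows + sum(peripheral_rows.values())
# Fraction of molecules assigned to the training subset.
train_share = subset_rows["train"] / population_rows

print(population_rows, total_rows, round(train_share, 3))
```

The totals agree with the instance count of 188 and the 10,324 rows listed in the metadata, and the split is roughly 70/30 between `train` and `val`.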
