In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_carcinogenesis")

# Task: Carcinogenesis
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Data Model:*
> 
> The *Carcinogenesis* dataset consists of six tables: `sbond_1`, `sbond_2`, `sbond_3`, `sbond_7`, `atom`, and `canc`. These tables are interconnected through various attributes related to chemical compounds and their carcinogenic properties.
> 
> - **sbond_1, sbond_2, sbond_3, sbond_7**: Each table contains columns `id` (int), `drug` (char), `atomid` (char), and `atomid_2` (char). These tables represent different types of bonds between atoms in a drug.
>   
> - **atom**: This table includes `atomid` (char), `drug` (char), `atomtype` (char), `charge` (char), and `name` (char). It details the properties of atoms within a drug.
> 
> - **canc**: Contains `drug_id` (char) and `class` (char). This table is used to classify drugs as carcinogenic or not.
> 
> *Task and Target Column:*
> 
> The primary task is *classification*, with the target column being `class` in the `canc` table. The goal is to predict whether a given molecule is carcinogenic.
> 
> *Column Types:*
> 
> - Integer: `id`
> - Character: `drug`, `atomid`, `atomid_2`, `atomtype`, `charge`, `name`, `drug_id`, `class`
> 
> *Metadata:*
> 
> - **Size**: 21 MB
> - **Number of Tables**: 6
> - **Number of Rows**: 27,570
> - **Number of Columns**: 23
> - **Missing Values**: No missing values
> - **Instance Count**: 329
> - **Target Table**: `canc`
> - **Target Column**: `class`
> 
> This dataset is used in the domain of medicine to evaluate the carcinogenic potential of chemical compounds.

### Tables
Population table: canc

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Carcinogenesis.svg" alt="Carcinogenesis ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
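canc, peripheral = load_ctu_dataset("Carcinogenesis")

# Look each table up by name: unpacking `peripheral.values()` positionally
# binds the wrong table to a variable whenever the dictionary order differs.
atom = peripheral["atom"]
sbond_1 = peripheral["sbond_1"]
sbond_2 = peripheral["sbond_2"]
sbond_3 = peripheral["sbond_3"]
sbond_7 = peripheral["sbond_7"]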

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table is `canc`. We have already set the `target` role for the target column (`class`). If the task is a multiclass classification,
we split the target column into multiple columns in a one-vs-all fashion. In that case, the original target is still available as `class`.
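To illustrate what a one-vs-all split means, here is a minimal sketch in plain Python. It is not the code used by `load_ctu_dataset`; the function name `one_vs_all` and the `class=<label>` column names are hypothetical. The `canc` table shown in this notebook has a single `class` target column, so no such split happens here.

```python
def one_vs_all(labels):
    """Build one binary (0/1) column per distinct class label."""
    classes = sorted(set(labels))
    # Column names of the form `class=<label>` are hypothetical.
    return {
        f"class={c}": [1 if label == c else 0 for label in labels]
        for c in classes
    }


# Toy multiclass target with three distinct labels.
columns = one_vs_all(["low", "high", "medium", "low"])
for name, values in columns.items():
    print(name, values)
```

Each row is marked with a 1 in exactly one of the resulting columns, so every column can be treated as its own binary target.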

In [3]:
# TODO: Annotate remaining columns with roles
canc

name,class,drug_id,split
role,target,unused_string,unused_string
0.0,0,d1,train
1.0,0,d10,train
2.0,0,d100,train
3.0,0,d101,train
4.0,0,d102,val
,...,...,...
324.0,0,d95,train
325.0,0,d96,val
326.0,0,d97,train
327.0,0,d98,train


The peripheral tables:
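One possible way to annotate the peripheral tables is sketched below. The choice of roles is an assumption based only on the column names in the dataset description, and the helper `annotate_roles` is hypothetical. The sketch assumes that `set_role` accepts the role names as the strings `"join_key"` and `"categorical"`, mirroring the constants in `getml.data.roles`.

```python
def annotate_roles(atom, bond_tables):
    """Assign roles to the peripheral tables (one possible choice)."""
    # Columns that link tables: the molecule (`drug`) and the atom identifier.
    atom.set_role(["drug", "atomid"], "join_key")
    # Descriptive string columns are treated as categorical features.
    atom.set_role(["atomtype", "charge", "name"], "categorical")
    for bonds in bond_tables:
        # Bond tables only carry identifiers; `id` is left unused.
        bonds.set_role(["drug", "atomid", "atomid_2"], "join_key")
```

With the tables loaded as above, this would be called as `annotate_roles(atom, [sbond_1, sbond_2, sbond_3, sbond_7])`.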

In [4]:
# TODO: Annotate columns with roles
atom

name,atomid,drug,atomtype,charge,name
role,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,d100_1,d100,22,a0=-0_1355<x<=-0_0175,c
1.0,d100_10,d100,3,a0=0_1375<x<=+inf,h
2.0,d100_11,d100,22,a0=-0_1355<x<=-0_0175,c
3.0,d100_12,d100,22,a0=-0_1355<x<=-0_0175,c
4.0,d100_13,d100,22,a0=-0_1355<x<=-0_0175,c
,...,...,...,...,...
9059.0,d9_5,d9,22,a0=0_1375<x<=+inf,c
9060.0,d9_6,d9,22,a0=-inf<x<=-0_1355,c
9061.0,d9_7,d9,3,a0=0_0615<x<=0_1375,h
9062.0,d9_8,d9,3,a0=0_0615<x<=0_1375,h


In [5]:
# TODO: Annotate columns with roles
sbond_3

name,id,drug,atomid,atomid_2
role,unused_string,unused_string,unused_string,unused_string
0.0,1,d130,d130_15,d130_17
1.0,2,d130,d130_16,d130_18
2.0,3,d130,d130_17,d130_15
3.0,4,d130,d130_18,d130_16
4.0,5,d262,d262_2,d262_3
,...,...,...,...
7.0,8,d321,d321_4,d321_3
8.0,9,d98,d98_10,d98_14
9.0,10,d98,d98_13,d98_8
10.0,11,d98,d98_14,d98_10


In [6]:
# TODO: Annotate columns with roles
sbond_2

name,id,drug,atomid,atomid_2
role,unused_string,unused_string,unused_string,unused_string
0.0,1,d1,d1_11,d1_23
1.0,2,d1,d1_14,d1_22
2.0,3,d1,d1_22,d1_14
3.0,4,d1,d1_23,d1_11
4.0,5,d10,d10_13,d10_16
,...,...,...,...
921.0,922,d95,d95_23,d95_13
922.0,923,d96,d96_1,d96_2
923.0,924,d96,d96_2,d96_1
924.0,925,d97,d97_23,d97_25


In [7]:
# TODO: Annotate columns with roles
sbond_7

name,id,drug,atomid,atomid_2
role,unused_string,unused_string,unused_string,unused_string
0.0,1,d1,d1_1,d1_2
1.0,2,d1,d1_1,d1_6
2.0,3,d1,d1_12,d1_13
3.0,4,d1,d1_12,d1_15
4.0,5,d1,d1_13,d1_12
,...,...,...,...
4129.0,4130,d99,d99_7,d99_8
4130.0,4131,d99,d99_8,d99_7
4131.0,4132,d99,d99_8,d99_9
4132.0,4133,d99,d99_9,d99_10


In [8]:
# TODO: Annotate columns with roles
sbond_1

name,id,drug,atomid,atomid_2
role,unused_string,unused_string,unused_string,unused_string
0.0,1,d1,d1_1,d1_7
1.0,2,d1,d1_10,d1_6
2.0,3,d1,d1_11,d1_12
3.0,4,d1,d1_11,d1_3
4.0,5,d1,d1_12,d1_11
,...,...,...,...
13557.0,13558,d99,d99_5,d99_14
13558.0,13559,d99,d99_6,d99_15
13559.0,13560,d99,d99_7,d99_13
13560.0,13561,d99,d99_8,d99_19


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Carcinogenesis](https://relational.fel.cvut.cz/dataset/Carcinogenesis)
for a description of the dataset.

In [9]:
dm = getml.data.DataModel(population=canc.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
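
The joins left open above could be sketched as follows. This is one plausible reading of the ER diagram, not a confirmed solution: it assumes that `canc.drug_id` matches `atom.drug`, that each bond table links to `atom` through `atomid`, and that these columns have been given the `join_key` role. The helper name `add_joins` is hypothetical; `join(..., on=...)` is the placeholder method hinted at in the TODO.

```python
def add_joins(dm, bond_tables=("sbond_1", "sbond_2", "sbond_3", "sbond_7")):
    """Attach `atom` to the population and each bond table to `atom`."""
    # The population key `drug_id` and `atom.drug` identify the same molecule.
    dm.population.join(dm.atom, on=("drug_id", "drug"))
    for name in bond_tables:
        # Each bond row references its first atom through `atomid`.
        dm.atom.join(getattr(dm, name), on="atomid")
    return dm
```

Calling `add_joins(dm)` after the cell above would produce a snowflake-shaped model: `canc` joined to `atom`, and `atom` joined to the four bond tables. Joining the bond tables directly to the population on `drug` is an equally reasonable alternative.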

Now we can create the container and add the tables to it.

In [10]:
container = getml.data.Container(population=canc, split=canc.split)
container.add(**peripheral)

container

subset,name,rows,type
train,canc,231,View
val,canc,98,View

name,rows,type
sbond_1,13562,DataFrame
sbond_2,926,DataFrame
atom,9064,DataFrame
sbond_3,12,DataFrame
sbond_7,4134,DataFrame
