In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_pub_med_diabetes")

# Task: PubMed_Diabetes
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *PubMed_Diabetes* dataset consists of 19,717 scientific publications from the PubMed database related to diabetes, classified into three classes. It includes a citation network with 44,338 links and uses TF/IDF weighted word vectors from a dictionary of 500 unique words.
> 
> - **Data Model:**
>   - *cites* table:
>     - `citing_paper_id` (int): ID of the citing paper.
>     - `cites_paper_id` (int): ID of the cited paper.
>   - *content* table:
>     - `paper_id` (int): ID of the paper.
>     - `word` (varchar): Word from the dictionary.
>     - `tf_idf` (double): TF/IDF weight of the word.
>   - *paper* table:
>     - `paper_id` (int): Unique identifier for each paper.
>     - `class_label` (int): Classification label for the paper.
> 
> - **Task:**
>   - The primary task is *classification*, with the target column being `class_label` in the *paper* table.
> 
> - **Column Types:**
>   - Integer (`int`) for paper IDs and class labels.
>   - Variable character (`varchar`) for words.
>   - Double (`double`) for TF/IDF weights.
> 
> - **Metadata:**
>   - The dataset is real and contains no missing values.
>   - It consists of 3 tables with a total of 1,051,972 rows and 7 columns.
>   - The dataset size is approximately 44.1 MB.
>   - There are 20,055 instances, with the target table being *paper*.
> 
> This dataset is useful for educational purposes, focusing on text classification and citation network analysis in the context of diabetes research.

### Tables
Population table: paper

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/PubMed_Diabetes.svg" alt="PubMed_Diabetes ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
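paper, peripheral = load_ctu_dataset("PubMed_Diabetes")

# Look the tables up by name rather than by position, so that each
# variable is bound to the table of the same name.
content = peripheral["content"]
cites = peripheral["cites"]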


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

First, the population table (`paper`). The `target` role is already set for the target (`class_label`). For multiclass classification tasks,
the target column is split into multiple columns in a one-vs-all fashion. In this case, the original target is still available as `class_label`.

In [3]:
# TODO: Annotate remaining columns with roles
paper

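| name | class_label=0 | class_label=1 | class_label=2 | class_label | paper_id | split |
|---|---|---|---|---|---|---|
| role | target | target | target | unused_float | unused_string | unused_string |
| 0 | 1 | 0 | 0 | 0 | 7145 | val |
| 1 | 1 | 0 | 0 | 0 | 29094 | train |
| 2 | 1 | 0 | 0 | 0 | 34420 | train |
| 3 | 1 | 0 | 0 | 0 | 34548 | train |
| 4 | 1 | 0 | 0 | 0 | 37920 | train |
| ... | ... | ... | ... | ... | ... | ... |
| 19712 | 0 | 0 | 1 | 2 | 19960641 | train |
| 19713 | 0 | 0 | 1 | 2 | 20003208 | train |
| 19714 | 0 | 1 | 0 | 1 | 20011163 | train |
| 19715 | 0 | 0 | 1 | 2 | 20061358 | train |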
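
The one-vs-all split visible in the `class_label=0`, `class_label=1`, and `class_label=2` columns can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not the code `load_ctu_dataset` runs; the helper name `one_vs_all` is hypothetical, and the labels are the `class_label` values of six of the rows displayed above.

```python
def one_vs_all(labels):
    """Return one 0/1 indicator column per distinct class label."""
    classes = sorted(set(labels))
    return {
        f"class_label={cls}": [int(label == cls) for label in labels]
        for cls in classes
    }


# class_label values of rows 0, 1, 19712, 19713, 19714, 19715 shown above
labels = [0, 0, 2, 2, 1, 2]
encoded = one_vs_all(labels)

for name, column in encoded.items():
    print(name, column)
```

Each indicator column holds 1 where the row belongs to that class and 0 otherwise, so every row has exactly one 1 across the three columns.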


Next, the peripheral tables:

In [4]:
# TODO: Annotate columns with roles
cites

| name | citing_paper_id | cites_paper_id |
|---|---|---|
| role | unused_string | unused_string |
| 0 | 99048 | 131313 |
| 1 | 99048 | 767184 |
| 2 | 99048 | 826063 |
| 3 | 99138 | 5704813 |
| 4 | 106781 | 339853 |
| ... | ... | ... |
| 44333 | 20061360 | 17415551 |
| 44334 | 20061360 | 18215172 |
| 44335 | 20061360 | 18539916 |
| 44336 | 20061360 | 18539917 |


In [5]:
# TODO: Annotate columns with roles
content

| name | paper_id | word | tf_idf |
|---|---|---|---|
| role | unused_string | unused_string | unused_string |
| 0 | 7145 | w-0 | 0.022914358702847797 |
| 1 | 7145 | w-001 | 0.029584017811803448 |
| 2 | 7145 | w-01 | 0.030809013942411815 |
| 3 | 7145 | w-4 | 0.017804017343450683 |
| 4 | 7145 | w-60 | 0.04075573793098892 |
| ... | ... | ... | ... |
| 988026 | 20061360 | w-trial | 0.04843448370984682 |
| 988027 | 20061360 | w-type | 0.0029572852363957142 |
| 988028 | 20061360 | w-use | 0.004835939879500289 |
| 988029 | 20061360 | w-women | 0.012812494692770574 |
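
Before resolving the TODOs, it can help to write down the intended role of every column. The mapping below is one possible plan in plain Python, not the notebook's solution: the role names mirror getML's `join_key`, `categorical`, and `numerical` roles, but which role each column receives is an assumption, and `ROLE_PLAN` and `columns_with_role` are hypothetical names.

```python
# Proposed role per table and column. Every assignment here is an assumption.
ROLE_PLAN = {
    "paper": {"paper_id": "join_key"},
    "content": {
        "paper_id": "join_key",
        "word": "categorical",
        "tf_idf": "numerical",
    },
    "cites": {
        "citing_paper_id": "join_key",
        "cites_paper_id": "join_key",
    },
}


def columns_with_role(plan, role):
    """List the (table, column) pairs that the plan assigns to the given role."""
    return [
        (table, column)
        for table, columns in plan.items()
        for column, assigned in columns.items()
        if assigned == role
    ]


join_keys = columns_with_role(ROLE_PLAN, "join_key")
print(join_keys)
```

In getML, a plan like this would typically be applied to each data frame with its `set_role` method.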


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/PubMed_Diabetes](https://relational.fel.cvut.cz/dataset/PubMed_Diabetes)
for a description of the dataset.

In [6]:
dm = getml.data.DataModel(population=paper.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
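
To make the relationships behind the joins concrete, the plain-Python sketch below links toy rows the same way the ER diagram suggests: `content` to `paper` on `paper_id`, and `cites` to `paper` on `citing_paper_id`. It does not use getML. The join keys are an assumption (a join on `cites_paper_id` would be equally plausible), the rows are a tiny hand-picked sample with IDs borrowed from the outputs above, and `group_by` is a hypothetical helper.

```python
from collections import defaultdict

# Toy sample; IDs and words are borrowed from the outputs above,
# tf_idf values are rounded for readability.
paper = [{"paper_id": 7145}, {"paper_id": 99048}]
content = [
    {"paper_id": 7145, "word": "w-0", "tf_idf": 0.0229},
    {"paper_id": 7145, "word": "w-001", "tf_idf": 0.0296},
]
cites = [
    {"citing_paper_id": 99048, "cites_paper_id": 131313},
    {"citing_paper_id": 99048, "cites_paper_id": 767184},
    {"citing_paper_id": 99048, "cites_paper_id": 826063},
]


def group_by(rows, key):
    """Group a list of row dicts by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


content_by_paper = group_by(content, "paper_id")
cites_by_paper = group_by(cites, "citing_paper_id")

# One aggregated row per population row, as a one-to-many join would produce.
features = [
    {
        "paper_id": row["paper_id"],
        "n_words": len(content_by_paper.get(row["paper_id"], [])),
        "n_citations": len(cites_by_paper.get(row["paper_id"], [])),
    }
    for row in paper
]
print(features)
```

Each population row can match many peripheral rows, which is why the peripheral data has to be aggregated (here by simple counts) before it can serve as a feature of a paper.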

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=paper, split=paper.split)
container.add(**peripheral)

container

|   | subset | name | rows | type |
|---|---|---|---|---|
| 0 | train | paper | 13802 | View |
| 1 | val | paper | 5915 | View |

|   | name | rows | type |
|---|---|---|---|
| 0 | cites | 44338 | DataFrame |
| 1 | content | 988031 | DataFrame |
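
The `split` column decides which subset each population row ends up in. The short sketch below illustrates this with the `split` values of the first five `paper` rows shown earlier, and checks that the subset sizes reported by the container add up to the 19,717 publications mentioned in the dataset description. It is plain Python for illustration only and does not reflect how the container is implemented.

```python
from collections import Counter

# split values of the first five population rows displayed earlier
splits = ["val", "train", "train", "train", "train"]
subset_sizes = Counter(splits)
print(subset_sizes)

# Row counts per subset as reported by the container
reported = {"train": 13802, "val": 5915}
total = sum(reported.values())
print(total)
```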
