In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("pub_med_diabetes")

# Task: PubMed_Diabetes
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *PubMed_Diabetes* dataset is structured to analyze and classify academic papers related to diabetes. The task is a *multiclass classification*, with the target column being `class_label` in the `paper` table, which categorizes each paper into different classes.
> 
> **Data Model:**
> - **Tables:** 3 (cites, content, paper)
> - **Columns:**
>   - **cites:**
>     - `citing_paper_id`: *int* - ID of the paper that cites another.
>     - `cites_paper_id`: *int* - ID of the cited paper.
>   - **content:**
>     - `paper_id`: *int* - ID of the paper.
>     - `word`: *varchar* - Word found in the paper.
>     - `tf_idf`: *double* - Term frequency-inverse document frequency value for the word.
>   - **paper:**
>     - `paper_id`: *int* - Unique identifier for each paper.
>     - `class_label`: *int* - Class label for the paper, used as the target for classification.
> 
> **Task and Target:**
> - **Task:** Multiclass Classification
> - **Target Column:** `class_label` (in the paper table)
> 
> **Metadata:**
> - **Size:** Not specified
> - **Number of Tables:** 3
> - **Number of Rows:** Not specified
> - **Number of Columns:** 7
> - **Missing Values:** Not specified
> - **Compound Keys:** Not specified
> - **Loops:** Not specified
> - **Type:** Real
> 
> This dataset is typically used in research for text classification and citation network analysis, particularly in the context of medical literature related to diabetes.

### Tables
Population table: paper

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/PubMed_Diabetes.svg" alt="PubMed_Diabetes ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. It returns a tuple: the population table is the
first element, and a dictionary of peripheral tables is the second.
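Unpacking the dictionary with `.values()` depends on the order in which the loader inserted the tables. As a minimal sketch, the hypothetical helper `unpack_peripheral` below (not part of `challenge.utils` or getML) looks tables up by name instead. The placeholder strings stand in for the real data frames, so the snippet runs without a getML engine.

```python
from typing import Any


def unpack_peripheral(peripheral: dict[str, Any], *names: str) -> tuple[Any, ...]:
    """Return peripheral tables in the requested order; fail loudly if one is missing."""
    missing = [name for name in names if name not in peripheral]
    if missing:
        raise KeyError(f"missing peripheral tables: {missing}")
    return tuple(peripheral[name] for name in names)


# Stand-in for the loader's second return value (strings instead of data frames).
# The insertion order is deliberately different from the order we ask for.
peripheral_demo = {"content": "<content table>", "cites": "<cites table>"}

cites_demo, content_demo = unpack_peripheral(peripheral_demo, "cites", "content")
```

With the real loader, the equivalent call would be `cites, content = unpack_peripheral(peripheral, "cites", "content")`.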

In [2]:
paper, peripheral = load_ctu_dataset("PubMed_Diabetes")

(
    cites,
    content,
) = peripheral.values()

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

We start with the population table (`paper`).

We already set the `target` role for the target (`class_label`).
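One possible way to organize the remaining annotations is a per-table role plan that is applied through each data frame's `set_role(cols, role)` method. The sketch below is an assumption, not the notebook's solution: `ROLE_PLAN` and `apply_roles` are hypothetical names, the role choices (IDs as join keys, `tf_idf` as numerical, `word` as categorical) are one reasonable reading of the schema, and it assumes that `set_role` accepts the role names as plain strings.

```python
# Hypothetical role plan: table name -> role name -> columns.
# class_label is left out on purpose; the text above treats the target separately.
ROLE_PLAN = {
    "paper": {"join_key": ["paper_id"]},
    "cites": {"join_key": ["citing_paper_id", "cites_paper_id"]},
    "content": {
        "join_key": ["paper_id"],
        "numerical": ["tf_idf"],
        "categorical": ["word"],
    },
}


def apply_roles(table, plan):
    """Call table.set_role(columns, role) once per role listed in the plan."""
    for role, columns in plan.items():
        table.set_role(columns, role)


# Intended usage inside the notebook (requires the loaded getML data frames):
# apply_roles(paper, ROLE_PLAN["paper"])
# apply_roles(cites, ROLE_PLAN["cites"])
# apply_roles(content, ROLE_PLAN["content"])
```

Keeping the plan in a dictionary makes the annotation choices easy to review and change in one place.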



In [3]:
# TODO: Annotate remaining columns with roles
paper

name,paper_id,class_label,split
role,unused_float,unused_string,unused_string
0.0,7145,0,val
1.0,29094,0,train
2.0,34420,0,train
3.0,34548,0,train
4.0,37920,0,train
,...,...,...
19712.0,19960641,2,train
19713.0,20003208,2,train
19714.0,20011163,1,train
19715.0,20061358,2,train


Next, the peripheral tables (`cites` and `content`):

In [4]:
# TODO: Annotate columns with roles
cites

name,citing_paper_id,cites_paper_id
role,unused_float,unused_float
0.0,99048,131313
1.0,99048,767184
2.0,99048,826063
3.0,99138,5704813
4.0,106781,339853
,...,...
44333.0,20061360,17415551
44334.0,20061360,18215172
44335.0,20061360,18539916
44336.0,20061360,18539917


In [5]:
# TODO: Annotate columns with roles
content

name,paper_id,tf_idf,word
role,unused_float,unused_float,unused_string
0.0,7145,0.02291,w-0
1.0,7145,0.02958,w-001
2.0,7145,0.03081,w-01
3.0,7145,0.0178,w-4
4.0,7145,0.04076,w-60
,...,...,...
988026.0,20061360,0.04843,w-trial
988027.0,20061360,0.002957,w-type
988028.0,20061360,0.004836,w-use
988029.0,20061360,0.01281,w-women


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/PubMed_Diabetes](https://relational.fel.cvut.cz/dataset/PubMed_Diabetes)
for a description of the dataset.
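As a hedged sketch of what the joins could look like, the hypothetical function `add_joins` below attaches both peripheral placeholders to the population placeholder. It assumes the placeholders are reachable as `dm.content` and `dm.cites`, and that `join` takes an `on` argument that is either a single column name or a `(left, right)` tuple. Joining `cites` on `citing_paper_id` is only one possible direction; joining on `cites_paper_id` instead (or in addition) is an equally valid modeling choice.

```python
def add_joins(dm):
    """Join the content and cites placeholders onto the population placeholder."""
    # content.paper_id refers to paper.paper_id (same column name on both sides).
    dm.population.join(dm.content, on="paper_id")
    # Assumed direction: match each paper to the rows where it is the citing paper.
    dm.population.join(dm.cites, on=("paper_id", "citing_paper_id"))
    return dm


# Intended usage inside the notebook, after the data model has been created:
# add_joins(dm)
```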

In [6]:
dm = getml.data.DataModel(population=paper.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=paper, split=paper.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,paper,13802,View
1,val,paper,5915,View

,name,rows,type
0,cites,44338,DataFrame
1,content,988031,DataFrame
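A quick sanity check on the split is to compute the share of each subset from the row counts shown in the container summary. The snippet below only uses those two numbers; the variable names are illustrative.

```python
# Row counts copied from the container summary above.
subset_rows = {"train": 13802, "val": 5915}

total_rows = sum(subset_rows.values())
fractions = {name: rows / total_rows for name, rows in subset_rows.items()}

for name, fraction in fractions.items():
    print(f"{name}: {subset_rows[name]} rows ({fraction:.1%})")
```

The counts correspond to a split of roughly 70% training and 30% validation rows.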
