In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_web_kp")

# Task: WebKP
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Data Model:*
> 
> The *WebKP* dataset consists of three tables: `cites`, `content`, and `webpage`. These tables describe webpages, the words they contain, and the citation relationships between them.
> 
> - **cites**: Contains `cited_paper_id` (varchar) and `citing_paper_id` (varchar). This table represents the citation network between webpages; despite the column names, both columns hold webpage URLs.
> 
> - **content**: Includes `webpage_id` (varchar) and `word_cited_id` (varchar). It details the presence of specific words in each webpage.
> 
> - **webpage**: Contains `webpage_id` (varchar) and `class_label` (varchar). This table classifies webpages into one of five classes.
> 
> *Task and Target Column:*
> 
> The primary task is *classification*, with the target column being `class_label` in the `webpage` table. The goal is to classify webpages based on their content and citation relationships.
> 
> *Column Types:*
> 
> - Varchar: `cited_paper_id`, `citing_paper_id`, `webpage_id`, `word_cited_id`, `class_label`
> 
> *Metadata:*
> 
> - **Size**: 12.8 MB
> - **Number of Tables**: 3
> - **Number of Rows**: 80,592
> - **Number of Columns**: 6
> - **Missing Values**: No
> - **Instance Count**: 877
> - **Target Table**: `webpage`
> - **Target Column**: `class_label`
> 
> This dataset belongs to the education domain and is used to classify webpages based on their content and citation patterns.

### Tables
Population table: webpage

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/WebKP.svg" alt="WebKP ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.
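Unpacking `peripheral.values()` positionally depends on the order in which the tables were inserted into the dictionary. A minimal sketch of a more explicit alternative that looks tables up by name is shown here. The `load_stub` function and its toy rows are stand-ins for illustration only, and the keys `content` and `cites` are assumed to match the table names.

```python
# Stand-in for load_ctu_dataset: returns (population, {name: peripheral table}).
# The rows are toy placeholders, not the real WebKP data.
def load_stub():
    population = [{"webpage_id": "a", "class_label": 0}]
    peripheral = {
        "content": [{"webpage_id": "a", "word_cited_id": "word1"}],
        "cites": [{"cited_paper_id": "a", "citing_paper_id": "a"}],
    }
    return population, peripheral


population, peripheral = load_stub()

# Lookup by key does not depend on the insertion order of the dictionary.
content = peripheral["content"]
cites = peripheral["cites"]
```

Key-based lookup fails loudly with a `KeyError` if a table is missing, whereas positional unpacking can silently bind the wrong table to a name.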

In [2]:
webpage, peripheral = load_ctu_dataset("WebKP")

(
    content,
    cites,
) = peripheral.values()


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

First, the population table (`webpage`). The `target` role is already set for the target column (`class_label`). For multiclass classification tasks,
the target column is split into multiple binary columns in a one-vs-all fashion. In this case, the original target is still available as `class_label`.
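As an illustration of the one-vs-all split, here is a plain-Python sketch. It is not getML's internal implementation, and the helper name `one_vs_all` is hypothetical. Each class gets its own binary column named `class_label=<class>`.

```python
def one_vs_all(labels, classes):
    """Return one binary column per class, keyed as 'class_label=<class>'."""
    return {
        f"class_label={c}": [1 if label == c else 0 for label in labels]
        for c in classes
    }


# The class_label values of the first five rows of the webpage table.
labels = [0, 0, 1, 2, 2]

# Five classes, numbered 0 to 4.
columns = one_vs_all(labels, classes=range(5))
```

Every row has exactly one `1` across the five columns, so the original label can always be recovered from the binary columns.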

In [3]:
# TODO: Annotate remaining columns with roles
webpage

name,class_label=0,class_label=1,class_label=2,class_label=3,class_label=4,class_label,webpage_id,split
role,target,target,target,target,target,unused_float,unused_string,unused_string
0.0,1,0,0,0,0,0,http://cam.cornell.edu/ph/index....,val
1.0,1,0,0,0,0,0,http://cam.cornell.edu/~baggett/...,train
2.0,0,1,0,0,0,1,http://cs-tr.cs.cornell.edu,train
3.0,0,0,1,0,0,2,http://cs.cornell.edu/info/cours...,train
4.0,0,0,1,0,0,2,http://cs.cornell.edu/info/cours...,train
,...,...,...,...,...,...,...,...
872.0,0,1,0,0,0,1,http://www.ma.utexas.edu/users/b...,train
873.0,0,0,1,0,0,2,http://www.tc.cornell.edu/visual...,val
874.0,0,0,1,0,0,2,http://www.tc.cornell.edu/visual...,train
875.0,0,0,0,1,0,3,http://www.tc.cornell.edu/~anne,val


Next, the peripheral tables:

In [4]:
# TODO: Annotate columns with roles
content

name,webpage_id,word_cited_id
role,unused_string,unused_string
0.0,http://cam.cornell.edu/ph/index....,word1020
1.0,http://cam.cornell.edu/ph/index....,word1042
2.0,http://cam.cornell.edu/ph/index....,word1059
3.0,http://cam.cornell.edu/ph/index....,word1089
4.0,http://cam.cornell.edu/ph/index....,word1112
,...,...
79360.0,http://www.tc.cornell.edu/~bruce,word669
79361.0,http://www.tc.cornell.edu/~bruce,word682
79362.0,http://www.tc.cornell.edu/~bruce,word72
79363.0,http://www.tc.cornell.edu/~bruce,word740


In [5]:
# TODO: Annotate columns with roles
cites

name,cited_paper_id,citing_paper_id
role,unused_string,unused_string
0.0,http://cam.cornell.edu/ph/index....,http://www.cs.cornell.edu/info/p...
1.0,http://cam.cornell.edu/~baggett/...,http://www.cs.cornell.edu/info/p...
2.0,http://cs-tr.cs.cornell.edu,http://dri.cornell.edu/pub/peopl...
3.0,http://cs-tr.cs.cornell.edu,http://www.cs.cornell.edu
4.0,http://cs-tr.cs.cornell.edu,http://www.cs.cornell.edu/info/p...
,...,...
1603.0,http://www.tc.cornell.edu/~anne,http://www.cs.cornell.edu/info/p...
1604.0,http://www.tc.cornell.edu/~anne,http://www.tc.cornell.edu/~anne
1605.0,http://www.tc.cornell.edu/~bruce,http://www.tc.cornell.edu/visual...
1606.0,http://www.tc.cornell.edu/~bruce,http://www.tc.cornell.edu/visual...


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/WebKP](https://relational.fel.cvut.cz/dataset/WebKP)
for a description of the dataset.
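Before filling in the joins, it can help to write down which columns link the tables. The sketch below uses plain Python rather than the getML API, with made-up toy rows. The join keys are assumptions read off the column names: `content` is matched on `webpage_id`, and `cites` on `cited_paper_id` (matching on `citing_paper_id` would be equally plausible). It counts the peripheral rows matching each population row, which is the simplest aggregation a join enables.

```python
from collections import defaultdict

# Toy rows mirroring the column names of the three tables (values are made up).
webpage = [{"webpage_id": "p1"}, {"webpage_id": "p2"}]
content = [
    {"webpage_id": "p1", "word_cited_id": "word1"},
    {"webpage_id": "p1", "word_cited_id": "word2"},
    {"webpage_id": "p2", "word_cited_id": "word1"},
]
cites = [{"cited_paper_id": "p1", "citing_paper_id": "p2"}]

# Assumed join keys: table name -> (population column, peripheral column).
joins = {
    "content": ("webpage_id", "webpage_id"),
    "cites": ("webpage_id", "cited_paper_id"),
}


def group_by(rows, key):
    """Group rows by the value stored under `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def count_matches(population, peripheral, left_key, right_key):
    """Count the peripheral rows matching each population row."""
    groups = group_by(peripheral, right_key)
    return {row[left_key]: len(groups.get(row[left_key], [])) for row in population}


tables = {"content": content, "cites": cites}
counts = {
    name: count_matches(webpage, tables[name], left, right)
    for name, (left, right) in joins.items()
}
```

Here `counts["content"]` gives the number of words per webpage and `counts["cites"]` the number of times each webpage is cited.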

In [6]:
dm = getml.data.DataModel(population=webpage.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=webpage, split=webpage.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,webpage,614,View
1,val,webpage,263,View

,name,rows,type
0,content,79365,DataFrame
1,cites,1608,DataFrame
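The container partitions the population table by its `split` column: 614 training and 263 validation rows, roughly a 70/30 split of the 877 instances. A plain-Python sketch of that partitioning, using made-up toy rows rather than the getML `Container`:

```python
from collections import Counter

# Toy population with a split column, mirroring the `split` column of `webpage`.
rows = [
    {"webpage_id": "p1", "split": "train"},
    {"webpage_id": "p2", "split": "val"},
    {"webpage_id": "p3", "split": "train"},
]

# Number of rows per subset.
subset_sizes = Counter(row["split"] for row in rows)

# The rows belonging to each subset.
subsets = {
    name: [row for row in rows if row["split"] == name] for name in subset_sizes
}
```

The peripheral tables carry no split column, so they are shared across subsets and appear in the container with their full row counts.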
