In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_dcg")

# Task: DCG
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *DCG* dataset is structured into two tables: *terms* and *sentences*. 
> 
> - **Data Model:**
>   - *terms* table:
>     - `id_sentence` (int): Identifier for the sentence.
>     - `id_term` (int): Identifier for the term.
>     - `term` (varchar): The term itself.
>   - *sentences* table:
>     - `id` (int): Unique identifier for each sentence.
>     - `class` (char): Classification label indicating whether the sentence is positive or negative.
> 
> - **Task:**
>   - The primary task is *classification*, with the target column being `class` in the *sentences* table.
> 
> - **Column Types:**
>   - Integer (`int`) for identifiers.
>   - Character (`char`) for classification labels.
>   - Variable character (`varchar`) for terms.
> 
> - **Metadata:**
>   - The dataset is synthetic and contains no missing values.
>   - It consists of 2 tables with a total of 8,258 rows and 5 columns.
>   - The dataset size is approximately 300 KB.
>   - There are 1,130 instances, with the target table being *sentences*.
> 
> This dataset is used for educational purposes, focusing on generating and classifying sentences based on a defined grammar.

### Tables
Population table: sentences

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/DCG.svg" alt="DCG ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. It returns a tuple: the population table is the
first element, and a dictionary of peripheral tables is the second.

In [2]:
sentences, peripheral = load_ctu_dataset("DCG")

(
    terms,
) = peripheral.values()


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

First, the population table (`sentences`). The `target` role is already set for the target column (`class`). For multiclass classification tasks,
the target column is split into multiple columns in a one-vs-all fashion; in that case, the original target remains available as `class`.
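
The one-vs-all split mentioned above can be illustrated with a minimal, library-free sketch. The labels and the `class=<label>` column naming are hypothetical and chosen only for illustration; DCG itself has a binary `class` column, and the loader may name the derived columns differently.

```python
# Hypothetical multiclass labels (not taken from DCG, whose target is binary).
labels = ["a", "b", "c", "a", "b"]


def one_vs_all(values):
    """Return one 0/1 column per distinct label, keyed as `class=<label>`.

    The `class=<label>` naming is an assumption made for this sketch.
    """
    columns = {}
    for label in sorted(set(values)):
        # 1 where the row carries this label, 0 everywhere else.
        columns[f"class={label}"] = [1 if v == label else 0 for v in values]
    return columns


targets = one_vs_all(labels)
for name, column in targets.items():
    print(name, column)
```

Each row has exactly one `1` across the derived columns, so each column can serve as a separate binary target.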

In [3]:
# TODO: Annotate remaining columns with roles
sentences

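| name | class | id | split |
|------|-------|----|-------|
| role | target | unused_string | unused_string |
| 0 | 0 | 1 | train |
| 1 | 0 | 2 | val |
| 2 | 0 | 3 | train |
| 3 | 0 | 4 | train |
| 4 | 0 | 5 | val |
| | ... | ... | ... |
| 1125 | 1 | 1126 | train |
| 1126 | 1 | 1127 | train |
| 1127 | 1 | 1128 | train |
| 1128 | 1 | 1129 | train |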


Next, the peripheral table (`terms`):

In [4]:
# TODO: Annotate columns with roles
terms

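| name | id_sentence | id_term | term |
|------|-------------|---------|------|
| role | unused_string | unused_string | unused_string |
| 0 | 1 | 1 | john |
| 1 | 1 | 2 | paints |
| 2 | 2 | 1 | annie |
| 3 | 2 | 2 | paints |
| 4 | 3 | 1 | monet |
| | ... | ... | ... |
| 7123 | 1130 | 3 | a |
| 7124 | 1130 | 4 | woman |
| 7125 | 1130 | 5 | john |
| 7126 | 1130 | 6 | admires |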


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/DCG](https://relational.fel.cvut.cz/dataset/DCG)
for a description of the dataset.
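
Before defining the join, it can help to see the relationship it should express. The following library-free sketch uses the first rows of the two previews and assumes that `terms.id_sentence` refers to `sentences.id` (one sentence, many terms) and that `id_term` gives the position of a term within its sentence. Both are assumptions read off the column names and the preview, not confirmed facts about the schema.

```python
from collections import defaultdict

# Rows copied from the `terms` preview: (id_sentence, id_term, term).
terms_rows = [
    (1, 1, "john"),
    (1, 2, "paints"),
    (2, 1, "annie"),
    (2, 2, "paints"),
    (3, 1, "monet"),
]

# Rows copied from the `sentences` preview: (id, class).
sentences_rows = [(1, 0), (2, 0), (3, 0)]

# Group terms by sentence; sorting the tuples orders terms by id_term
# within each sentence (assumed to be the term's position).
terms_by_sentence = defaultdict(list)
for id_sentence, id_term, term in sorted(terms_rows):
    terms_by_sentence[id_sentence].append(term)

# One-to-many lookup: each sentence id maps to its label and its terms.
joined = {
    sentence_id: (label, terms_by_sentence[sentence_id])
    for sentence_id, label in sentences_rows
}

for sentence_id, (label, words) in joined.items():
    print(sentence_id, label, words)
```

Sentence 3 shows only `monet` because the preview is truncated after five rows, not because the sentence has a single term.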

In [5]:
dm = getml.data.DataModel(population=sentences.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [6]:
container = getml.data.Container(population=sentences, split=sentences.split)
container.add(**peripheral)

container

| | subset | name | rows | type |
|---|--------|------|------|------|
| 0 | train | sentences | 791 | View |
| 1 | val | sentences | 339 | View |

| | name | rows | type |
|---|------|------|------|
| 0 | terms | 7128 | DataFrame |
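
The `train` and `val` subsets in the container come from the `split` column of the population table. As a rough, library-free illustration of that partitioning (a sketch of the idea, not of how getML implements it), the first five rows of the `sentences` preview can be grouped by their `split` value:

```python
# Rows copied from the `sentences` preview: (id, split).
rows = [(1, "train"), (2, "val"), (3, "train"), (4, "train"), (5, "val")]

# Partition sentence ids by the value of their split column.
subsets = {}
for sentence_id, split in rows:
    subsets.setdefault(split, []).append(sentence_id)

for name, ids in subsets.items():
    print(name, ids)
```

In the full container, the two subsets hold 791 and 339 rows, which together account for all 1,130 sentences.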
