In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_uw_std")

# Task: UW_std
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Data Model:*
> 
> The *UW_std* dataset consists of four tables: `advisedBy`, `taughtBy`, `person`, and `course`. These tables provide information about academic relationships and roles within a university setting.
> 
> - **advisedBy**: Contains `p_id` (int) and `p_id_dummy` (int). This table represents advisory relationships between individuals.
> 
> - **taughtBy**: Includes `course_id` (int) and `p_id` (int). It details which individuals teach specific courses.
> 
> - **person**: Contains `p_id` (int), `professor` (varchar), `student` (varchar), `hasPosition` (varchar), `inPhase` (varchar), and `yearsInProgram` (varchar). This table provides information about individuals, including their roles and status in the program.
> 
> - **course**: Includes `course_id` (int) and `courseLevel` (varchar). This table provides information about courses and their levels.
> 
> *Task and Target Column:*
> 
> The task is to predict `inPhase` in the `person` table, which serves as the population table. It is a multiclass target and appears in the loaded data as four one-vs-all columns (`inPhase=0` to `inPhase=3`).
> 
> *Column Types:*
> 
> - Integer: `p_id`, `p_id_dummy`, `course_id`
> - Varchar: `professor`, `student`, `hasPosition`, `inPhase`, `yearsInProgram`, `courseLevel`
> 
> *Metadata:*
> 
> - **Number of Tables**: 4
> 
> This dataset is used in the educational domain to analyze academic roles, relationships, and course information within a university.

### Tables
Population table: person

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/UW_std.svg" alt="UW_std ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the `challenge.utils.data`
module. This function returns a tuple with the population table as the first
element and a dictionary of peripheral tables as the second element.
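The shape of that return value can be sketched in plain Python. This is only an illustration: `load_stub` is a hypothetical stand-in for the loader, the strings stand in for tables, and the key names are assumed to match the peripheral names listed in the container at the end of this notebook.

```python
def load_stub():
    """Hypothetical stand-in returning (population, dict of peripherals)."""
    population = "person table"
    peripheral = {
        "taught_by": "taught_by table",
        "course": "course table",
        "advised_by": "advised_by table",
    }
    return population, peripheral


person, peripheral = load_stub()

# Positional unpacking depends on the dictionary's insertion order,
# so the variable names here are deliberately neutral.
first, second, third = peripheral.values()

# Lookup by key always binds each name to the table it describes.
course = peripheral["course"]
advised_by = peripheral["advised_by"]
taught_by = peripheral["taught_by"]
```

Looking tables up by key keeps variable names and table contents aligned even if the loader returns the peripheral tables in a different order.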

In [2]:
person, peripheral = load_ctu_dataset("UW_std")

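# Look the tables up by name so that each variable is bound to the
# table it is named after, independent of the dictionary's order.
course = peripheral["course"]
advised_by = peripheral["advised_by"]
taught_by = peripheral["taught_by"]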

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

First, the population table (`person`). The `target` role is already set for the target (`inPhase`). For a multiclass classification task,
the target column is split into multiple columns in a one-vs-all fashion. In this case, the original target is still available as `inPhase`.
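The one-vs-all split can be illustrated in plain Python. This is a minimal sketch, not getML's implementation: the helper `one_vs_all` is hypothetical, and the input is the `inPhase` value of each of the nine rows displayed for `person` below.

```python
def one_vs_all(values, classes, name="inPhase"):
    """Return one 0/1 indicator column per class, keyed as '<name>=<class>'."""
    return {
        f"{name}={c}": [1 if v == c else 0 for v in values]
        for c in classes
    }


# inPhase values of the nine displayed rows (0-4 and 273-276).
in_phase = [0, 0, 0, 1, 0, 0, 1, 3, 1]

encoded = one_vs_all(in_phase, classes=[0, 1, 2, 3])

for column, indicator in encoded.items():
    print(column, indicator)
```

Each row has exactly one indicator set to 1, so every derived column is a binary target of its own.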

In [3]:
# TODO: Annotate remaining columns with roles
person

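| name | inPhase=0 | inPhase=1 | inPhase=2 | inPhase=3 | inPhase | p_id | professor | student | hasPosition | yearsInProgram | split |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| role | target | target | target | target | unused_float | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | 0 | 0 | 0 | 0 | 3 | 0 | 1 | 0 | 0 | train |
| 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 1 | 0 | 0 | train |
| 2 | 1 | 0 | 0 | 0 | 0 | 5 | 1 | 0 | Faculty | 0 | train |
| 3 | 0 | 1 | 0 | 0 | 1 | 6 | 0 | 1 | 0 | Year_2 | train |
| 4 | 1 | 0 | 0 | 0 | 0 | 7 | 1 | 0 | Faculty_adj | 0 | train |
| | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 273 | 1 | 0 | 0 | 0 | 0 | 428 | 0 | 1 | 0 | 0 | train |
| 274 | 0 | 1 | 0 | 0 | 1 | 429 | 0 | 1 | 0 | Year_5 | train |
| 275 | 0 | 0 | 0 | 1 | 3 | 431 | 0 | 1 | 0 | Year_2 | train |
| 276 | 0 | 1 | 0 | 0 | 1 | 432 | 0 | 1 | 0 | Year_5 | val |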


Next, the peripheral tables.

In [4]:
# TODO: Annotate columns with roles
peripheral["taught_by"]

| name | course_id | p_id |
| --- | --- | --- |
| role | unused_string | unused_string |
| 0 | 0 | 40 |
| 1 | 1 | 40 |
| 2 | 2 | 180 |
| 3 | 3 | 279 |
| 4 | 4 | 107 |
| | ... | ... |
| 184 | 170 | 407 |
| 185 | 172 | 46 |
| 186 | 172 | 335 |
| 187 | 173 | 171 |


In [5]:
# TODO: Annotate columns with roles
peripheral["course"]

| name | course_id | courseLevel |
| --- | --- | --- |
| role | unused_string | unused_string |
| 0 | 5 | Level_300 |
| 1 | 11 | Level_300 |
| 2 | 18 | Level_300 |
| 3 | 104 | Level_300 |
| 4 | 124 | Level_300 |
| | ... | ... |
| 127 | 168 | Level_500 |
| 128 | 169 | Level_500 |
| 129 | 170 | Level_500 |
| 130 | 172 | Level_500 |


In [6]:
# TODO: Annotate columns with roles
peripheral["advised_by"]

| name | p_id | p_id_dummy |
| --- | --- | --- |
| role | unused_string | unused_string |
| 0 | 96 | 5 |
| 1 | 118 | 5 |
| 2 | 183 | 5 |
| 3 | 263 | 5 |
| 4 | 362 | 5 |
| | ... | ... |
| 108 | 45 | 415 |
| 109 | 63 | 415 |
| 110 | 262 | 415 |
| 111 | 314 | 415 |


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/UW_std](https://relational.fel.cvut.cz/dataset/UW_std)
for a description of the dataset.
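Before writing the joins, it can help to check which tables share a key column. The sketch below is plain Python rather than getML API: the column lists are copied from the dataset description, while `candidate_joins` and `missing_keys` are hypothetical names. The joins themselves are assumptions read off the shared column names, not a confirmed data model. In particular, whether `advised_by` should join on `p_id` or `p_id_dummy` depends on the task.

```python
# Column names per table, as listed in the dataset description.
columns = {
    "person": ["p_id", "professor", "student", "hasPosition",
               "inPhase", "yearsInProgram"],
    "advised_by": ["p_id", "p_id_dummy"],
    "taught_by": ["course_id", "p_id"],
    "course": ["course_id", "courseLevel"],
}

# Candidate joins as (left table, right table, shared key).
# These are assumptions derived from matching column names.
candidate_joins = [
    ("person", "taught_by", "p_id"),
    ("person", "advised_by", "p_id"),
    ("taught_by", "course", "course_id"),
]


def missing_keys(joins, columns):
    """Return the joins whose key is absent from either table."""
    return [
        (left, right, key)
        for left, right, key in joins
        if key not in columns[left] or key not in columns[right]
    ]


problems = missing_keys(candidate_joins, columns)
print(problems)
```

An empty result means every candidate join refers to a column present on both sides. Note that `course` shares no column with `person`, so it can only be reached through `taught_by`.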

In [7]:
dm = getml.data.DataModel(population=person.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [8]:
container = getml.data.Container(population=person, split=person.split)
container.add(**peripheral)

container

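| | subset | name | rows | type |
| --- | --- | --- | --- | --- |
| 0 | train | person | 195 | View |
| 1 | val | person | 83 | View |

| | name | rows | type |
| --- | --- | --- | --- |
| 0 | taught_by | 189 | DataFrame |
| 1 | course | 132 | DataFrame |
| 2 | advised_by | 113 | DataFrame |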
