In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_voc")

# Task: voc
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *VOC* dataset provides insights into the administrative system of the Vereenigde geoctrooieerde Oostindische Compagnie (VOC), also known as the Dutch East India Company, established in 1602.
> 
> - **Data Model:**
>   - *craftsmen, impotenten, passengers, seafarers, soldiers, total* tables:
>     - `number` (int): Identifier for the record.
>     - `number_sup` (char): Supplementary identifier.
>     - `trip` (int): Trip identifier.
>     - `trip_sup` (char): Supplementary trip identifier.
>     - Various columns for onboard, death, and left counts at different stages (int).
>   - *invoices* table:
>     - `number` (int): Invoice number.
>     - `number_sup` (char): Supplementary number.
>     - `invoice` (int): Invoice identifier.
>     - `chamber` (char): Chamber identifier.
>   - *voyages* table:
>     - `artificial_id` (char): Artificial identifier.
>     - `number` (int): Voyage number.
>     - `number_sup` (char): Supplementary number.
>     - `trip` (int): Trip identifier.
>     - `trip_sup` (char): Supplementary trip identifier.
>     - Various columns for ship details, dates, and harbors (varchar, int, date, text).
> 
> - **Task:**
>   - The primary task is *classification*, with the target column being `arrival_harbour` in the *voyages* table.
> 
> - **Column Types:**
>   - Integer (`int`) for identifiers and counts.
>   - Character (`char`) for supplementary identifiers.
>   - Variable character (`varchar`) for names and types.
>   - Date (`date`) for temporal data.
>   - Text (`text`) for particulars.
> 
> - **Metadata:**
>   - The dataset is real and contains missing values.
>   - It consists of 8 tables with a total of 29,067 rows and 89 columns.
>   - The dataset size is approximately 2.7 MB.
>   - There are 8,073 instances, with the target table being *voyages*.
> 
> This dataset is valuable for historical and retail analysis, focusing on the operations and logistics of a major historical trading company.

### Tables
Population table: voyages

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/voc.svg" alt="voc ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the `utils`
module. This function returns a tuple with the population table as the first
element and the a dictionary of peripheral tables as the second element.

In [2]:
voyages, peripheral = load_ctu_dataset("voc")

(
    soldiers,
    total,
    impotenten,
    seafarers,
    invoices,
    craftsmen,
    passengers,
) = peripheral.values()

Analyzing schema:   0%|          | 0/8 [00:00<?, ?it/s]

Downloading tables:   0%|          | 0/8 [00:00<?, ?it/s]

Building data:   0%|          | 0/8 [00:00<?, ?it/s]

You are splitting the column into more than 10 target columns. This might take a long time to fit.


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`voyages`). We already set the `target` role for the target (`arrival_harbour`). If the task is a multiclass classification,
we split the target column into multiple columns in an one-vs-all fashion. In this case, the original target is still avaiable as `arrival_harbour`.

In [3]:
# TODO: Annotate remaining columns with roles
voyages

name,arrival_harbour=-1,arrival_harbour=0,arrival_harbour=1,arrival_harbour=10,arrival_harbour=11,arrival_harbour=12,arrival_harbour=13,arrival_harbour=14,arrival_harbour=15,arrival_harbour=16,arrival_harbour=17,arrival_harbour=18,arrival_harbour=19,arrival_harbour=2,arrival_harbour=20,arrival_harbour=21,arrival_harbour=22,arrival_harbour=23,arrival_harbour=24,arrival_harbour=25,arrival_harbour=26,arrival_harbour=27,arrival_harbour=28,arrival_harbour=29,arrival_harbour=3,arrival_harbour=30,arrival_harbour=31,arrival_harbour=32,arrival_harbour=33,arrival_harbour=34,arrival_harbour=35,arrival_harbour=36,arrival_harbour=37,arrival_harbour=38,arrival_harbour=39,arrival_harbour=4,arrival_harbour=40,arrival_harbour=41,arrival_harbour=42,arrival_harbour=43,arrival_harbour=44,arrival_harbour=45,arrival_harbour=46,arrival_harbour=47,arrival_harbour=48,arrival_harbour=49,arrival_harbour=5,arrival_harbour=50,arrival_harbour=51,arrival_harbour=52,arrival_harbour=53,arrival_harbour=54,arrival_harbour=55,arrival_harbour=56,arrival_harbour=57,arrival_harbour=58,arrival_harbour=59,arrival_harbour=6,arrival_harbour=60,arrival_harbour=7,arrival_harbour=8,arrival_harbour=9,arrival_harbour,artificial_id,number,number_sup,trip,trip_sup,boatname,master,tonnage,type_of_boat,built,bought,hired,yard,chamber,departure_date,departure_harbour,cape_arrival,cape_departure,cape_call,arrival_date,next_voyage,particulars,split
role,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,,1,,AMSTERDAM,Jan Jakobsz. Schellinger,260,,1594,,,A,,1595-04-02,Texel,,,t,1596-06-06,,from 04-08 till 11-08 in the Mos...,train
1.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,,1,,DUIFJE,Simon Lambrechtsz. Mau,50,pinas,1594,,,A,,1595-04-02,Texel,,,t,1596-06-06,5001,HOLLANDIA on 26-10-1595; he was ...,val
2.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,,1,,HOLLANDIA,Jan Dignumsz. van Kwadijk+,460,,1594,,,A,,1595-04-02,Texel,,,t,1596-06-06,5002,Jan Dignumsz. died on 29-05-1595...,val
3.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,4,4,,1,,MAURITIUS,Jan Jansz. Molenaar,460,,1594,,,A,,1595-04-02,Texel,,,t,1596-06-06,5003,Jan Jansz. died on 25-12-1596 an...,train
4.0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,5,5,,1,,LANGEBARK,Hans Huibrechtsz. Tonneman,300,,,,,,,1598-03-25,Zeeland,,,t,1599-03-01,5010,other.,train
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8126.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,60,8397,8397,,1,,ZUIDPOOL,J. A. van der Putten,900,pink,1791,,,Z,Z,1795-01-01,Ceylon,1795-04-25,1795-05-18,t,1795-09-18,4772,,train
8127.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,8398,8398,,2,,MEERMIN,Gerard Ewoud Overbeek,500,fregat,1782,,,A,A,1795-01-01,Batavia,1795-05-21,,t,,4674,to return because of adverse win...,train
8128.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,8399,8399,,1,,JONGE BONIFACIUS,Jan Nikolaas Kroese,488,,,,,Z,Z,1795-01-01,,1795-06-24,,t,,4774,Almost all data concerning this ...,train
8129.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,-1,8400,8400,,1,,LOUISA ANTHONY,Kersje Hillebrandsz.,640,fluit,,,,A,A,1795-01-01,,1795-08-15,,t,,4777,Almost all data concerning this ...,val


Peripheral tables,

In [4]:
# TODO: Annotate columns with roles
soldiers

name,number,number_sup,trip,trip_sup,onboard_at_departure,death_at_cape,left_at_cape,onboard_at_cape,death_during_voyage,onboard_at_arrival
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1376,,3,,19,,,,,19
1.0,1378,,5,,22,,,,,22
2.0,1379,,5,,18,,,,2,16
3.0,1380,,3,,11,,,,1,10
4.0,1381,,5,,11,,,,,11
,...,...,...,...,...,...,...,...,...,...
2344.0,8294,,1,,11,,,,,
2345.0,8320,,2,,2,,,,,
2346.0,8326,,1,,2,,,,,
2347.0,8329,,2,,5,,,,,


In [5]:
# TODO: Annotate columns with roles
total

name,number,number_sup,trip,trip_sup,onboard_at_departure,death_at_cape,left_at_cape,onboard_at_cape,death_during_voyage,onboard_at_arrival
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,310,,2,,66,,,,2,64
1.0,311,,2,,55,,,,5,55
2.0,318,,5,,17,,,,4,13
3.0,320,,3,,23,,,,,
4.0,321,,2,,10,,,,,
,...,...,...,...,...,...,...,...,...,...
2808.0,8354,,2,,4,,,,,
2809.0,8371,,3,,1,,,,,
2810.0,8384,,1,,5,,,,,
2811.0,8390,,3,,1,1,7,,,


In [6]:
# TODO: Annotate columns with roles
impotenten

name,number,number_sup,trip,trip_sup,onboard_at_departure,death_at_cape,left_at_cape,onboard_at_cape,death_during_voyage,onboard_at_arrival
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,196,,1,,65,,,,,
1.0,197,,1,,22,,,,,
2.0,296,,1,,126,,,,,
3.0,298,,1,,147,,,,,
4.0,299,,1,,89,,,,,
,...,...,...,...,...,...,...,...,...,...
4463.0,8351,,2,,89,22,20,,,
4464.0,8354,,2,,94,13,19,,,
4465.0,8371,,3,,120,,,,,
4466.0,8384,,1,,118,,,,,


In [7]:
# TODO: Annotate columns with roles
seafarers

name,number,number_sup,trip,trip_sup,onboard_at_departure,death_at_cape,left_at_cape,onboard_at_cape,death_during_voyage,onboard_at_arrival
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,296,,1,,90,,,,,
1.0,298,,1,,90,,,,,
2.0,299,,1,,55,,,,,
3.0,300,,1,,53,,,,,
4.0,301,,1,,90,,,,,
,...,...,...,...,...,...,...,...,...,...
4172.0,8348,,2,,4,,,,,
4173.0,8350,,1,,9,,,,,
4174.0,8351,,2,,6,,,,,
4175.0,8354,,2,,4,,,,,


In [8]:
# TODO: Annotate columns with roles
invoices

name,number,number_sup,trip,trip_sup,onboard_at_departure,death_at_cape,left_at_cape,onboard_at_cape,death_during_voyage,onboard_at_arrival
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,5733,,3,,18,,,,,
1.0,5734,,2,,19,,,,,
2.0,5735,,1,,18,,,,,
3.0,5736,,1,,20,,,,,
4.0,5737,,4,,20,,,,,
,...,...,...,...,...,...,...,...,...,...
933.0,8235,,1,,4,,,,,
934.0,8236,,1,,4,,,,,
935.0,8265,,1,,1,,,,,
936.0,8371,,3,,1,,,,,


In [9]:
# TODO: Annotate columns with roles
craftsmen

name,number,number_sup,trip,trip_sup,onboard_at_departure,death_at_cape,left_at_cape,onboard_at_cape,death_during_voyage,onboard_at_arrival
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,,1,,59,,,,,
1.0,2,,1,,20,,,,,
2.0,3,,1,,85,,,,,
3.0,4,,1,,85,,,,,
4.0,5,,1,,75,,,,,
,...,...,...,...,...,...,...,...,...,...
2462.0,8392,,1,,125,1,,,,
2463.0,8393,,1,,103,1,1,,,
2464.0,8394,,1,,100,2,3,,,
2465.0,8395,,2,,107,,3,,,


In [10]:
# TODO: Annotate columns with roles
passengers

name,number,number_sup,trip,trip_sup,invoice,chamber
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,5102,,2,,268964,A
1.0,5103,,2,,164562,Z
2.0,5105,,2,,110370,A
3.0,5106,,2,,127181,D
4.0,5107,,2,,101522,A
,...,...,...,...,...,...
3777.0,8396,,3,,457491,A
3778.0,8397,,1,,365373,Z
3779.0,8399,,1,,50232,Z
3780.0,8400,,1,,85711,A


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/voc](https://relational.fel.cvut.cz/dataset/voc)
for a description of the dataset.

In [11]:
dm = getml.data.DataModel(population=voyages.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [12]:
container = getml.data.Container(population=voyages, split=voyages.split)
container.add(**peripheral)

container

Unnamed: 0,subset,name,rows,type
0,train,voyages,5692,View
1,val,voyages,2439,View

Unnamed: 0,name,rows,type
0,craftsmen,2349,DataFrame
1,passengers,2813,DataFrame
2,seafarers,4468,DataFrame
3,soldiers,4177,DataFrame
4,impotenten,938,DataFrame
5,total,2467,DataFrame
6,invoices,3782,DataFrame
