In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("same_gen")

# Task: Same_gen
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *Same_gen* dataset is used to predict whether two given people are from the same generation. It is a classification task, with the target column being *target* in the *target* table.
> 
> **Data Model:**
> 
> - **parent** table:
>   - *name1*: varchar
>   - *name2*: varchar
> 
> - **same_gen** table:
>   - *name1*: varchar
>   - *name2*: varchar
> 
> - **target** table:
>   - *name1*: varchar
>   - *name2*: varchar
>   - *target*: int (target column)
> 
> - **person** table:
>   - *name*: varchar
> 
> **Metadata:**
> 
> - Size: 300 KB
> - Number of tables: 4
> - Number of rows: 1,536
> - Number of columns: 8
> - Missing values: No
> - Compound keys: No
> - Loops: Yes
> - Type: Real
> - Instance count: 1,081
> 
> The dataset is commonly used in kinship research to analyze familial relationships and generational links. It has been referenced in studies focusing on relational learning and classification in the context of family data.
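
As a quick sanity check, the data model listed above can be written down as a plain dictionary and compared with the metadata. This is only an illustrative sketch in plain Python. The `schema` variable is a hypothetical name and is not part of getML or the dataset loader.

```python
# Column listing copied from the data model described above.
schema = {
    "parent": ["name1", "name2"],
    "same_gen": ["name1", "name2"],
    "target": ["name1", "name2", "target"],
    "person": ["name"],
}

# Count tables and columns to compare against the metadata block.
n_tables = len(schema)
n_columns = sum(len(columns) for columns in schema.values())

print(f"tables: {n_tables}, columns: {n_columns}")
```

The counts agree with the metadata: 4 tables and 8 columns.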

### Tables
Population table: target

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Same_gen.svg" alt="Same_gen ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
target, peripheral = load_ctu_dataset("Same_gen")

(
    person,
    same_gen,
    parent,
) = peripheral.values()

Analyzing schema:   0%|          | 0/4 [00:00<?, ?it/s]

Downloading tables:   0%|          | 0/4 [00:00<?, ?it/s]

Building data:   0%|          | 0/4 [00:00<?, ?it/s]
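
Positional unpacking of `peripheral.values()` depends on the insertion order of the dictionary. The container summary at the end of this notebook reports 64 rows for `parent` and 344 rows for `same_gen`, so it is worth checking that each variable really holds the table it is named after. The sketch below shows the difference between positional and keyed unpacking. It uses a stand-in dictionary of strings instead of real data frames, and both the key names and their order are assumptions taken from the container summary.

```python
# Stand-in for the dictionary returned by the loader. Strings replace the
# real data frames so that the snippet runs on its own. The key order is
# an assumption based on the container summary.
peripheral = {
    "person": "<person table>",
    "parent": "<parent table>",
    "same_gen": "<same_gen table>",
}

# Positional unpacking silently follows the dictionary order.
first, second, third = peripheral.values()

# Lookup by key does not depend on the order of the dictionary.
person = peripheral["person"]
same_gen = peripheral["same_gen"]
parent = peripheral["parent"]

print(second)    # whatever happens to be stored in second position
print(same_gen)  # always the entry stored under "same_gen"
```

Keyed lookup is slightly more verbose, but it keeps variable names and table names aligned even if the loader changes the order of its result.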

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`target`):

The `target` role is already set on the `target` column, which is the label for this binary classification task. The remaining columns still need roles.

In [3]:
# TODO: Annotate remaining columns with roles
target

name,target,name1,name2,split
role,target,unused_string,unused_string,unused_string
0.0,0,ali1,ali2,train
1.0,0,ali1,alp,train
2.0,0,ali1,anil,train
3.0,0,ali1,ayse,train
4.0,0,ali1,ayten,train
,...,...,...,...
1076.0,0,yusuf1,yusuf3,train
1077.0,0,yusuf1,zeynep,train
1078.0,0,yusuf2,yusuf3,val
1079.0,0,yusuf2,zeynep,train


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
person

name,name
role,unused_string
0.0,ali1
1.0,ali2
2.0,alp
3.0,anil
4.0,ayse
,...
42.0,yildirim
43.0,yusuf1
44.0,yusuf2
45.0,yusuf3


In [5]:
# TODO: Annotate columns with roles
same_gen

name,name1,name2
role,unused_string,unused_string
0.0,ali1,dilber
1.0,ali1,yusuf2
2.0,ayse,dilber
3.0,ayse,yusuf2
4.0,ayten,mediha2
,...,...
59.0,yusuf1,ismail
60.0,yusuf1,mehmet1
61.0,yusuf1,neriman
62.0,yusuf1,nesrin


In [6]:
# TODO: Annotate columns with roles
parent

name,name1,name2
role,unused_string,unused_string
0.0,ali1,fatma
1.0,ali1,ismail
2.0,ali1,mehmet1
3.0,ali1,neriman
4.0,ali1,nesrin
,...,...
339.0,zeynep,melis
340.0,zeynep,nida
341.0,zeynep,secil
342.0,zeynep,yavuz


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Same_gen](https://relational.fel.cvut.cz/dataset/Same_gen)
for a description of the dataset.

In [7]:
dm = getml.data.DataModel(population=target.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
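
The join itself is still an open TODO. To make the idea concrete, here is a sketch in plain Python (not the getML API) of what a one-to-many join on a name column does: each population row collects all peripheral rows that share its key. The rows are copied from the outputs of `In [3]` and `In [5]` above. The choice of `name1` as the join key is only an assumption for illustration, and the names `population_rows`, `peripheral_rows`, and `by_name1` are hypothetical.

```python
from collections import defaultdict

# First rows of the population table shown above: (name1, name2, target).
population_rows = [
    ("ali1", "ali2", 0),
    ("ali1", "alp", 0),
    ("ali1", "anil", 0),
]

# First rows of the table displayed in In [5]: (name1, name2).
peripheral_rows = [
    ("ali1", "dilber"),
    ("ali1", "yusuf2"),
    ("ayse", "dilber"),
    ("ayse", "yusuf2"),
]

# Index the peripheral rows by the assumed join key, name1.
by_name1 = defaultdict(list)
for name1, name2 in peripheral_rows:
    by_name1[name1].append(name2)

# Attach every matching peripheral row to each population row.
joined = [
    (name1, name2, target, by_name1.get(name1, []))
    for name1, name2, target in population_rows
]

for row in joined:
    print(row)
```

A feature learner would then aggregate the matched rows (for example by counting them) instead of keeping the raw lists. Because the `target` table has two name columns, more than one join per peripheral table may be needed in the actual data model.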

Now we can create the container and add the tables to it.

In [8]:
container = getml.data.Container(population=target, split=target.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,target,757,View
1,val,target,324,View

,name,rows,type
0,person,47,DataFrame
1,parent,64,DataFrame
2,same_gen,344,DataFrame
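
The row counts in the container summary can be cross-checked against the metadata in the dataset description. The following sketch uses only the numbers printed above. The observation about unordered pairs is an arithmetic coincidence worth noting, not something the source states.

```python
from math import comb

# Row counts copied from the container summary above.
n_person = 47
n_parent = 64
n_same_gen = 344
n_train, n_val = 757, 324

# The two subsets together should give the instance count from the metadata.
n_target = n_train + n_val

# All four tables together should give the total number of rows.
total_rows = n_target + n_person + n_parent + n_same_gen

# Number of unordered pairs that can be formed from the people in `person`.
n_pairs = comb(n_person, 2)

print(n_target, total_rows, n_pairs)
```

The subsets add up to 1,081 instances and the four tables to 1,536 rows, both matching the metadata. The number of unordered pairs of 47 people is also 1,081, which is consistent with the population table holding one row per pair of people.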
