In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_pima")

# Task: Pima
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Data Model:*
> 
> The *Pima* dataset consists of nine tables: `age`, `bmi`, `diastolic`, `numPreg`, `pedigree`, `plasma`, `serum`, `tricepts`, and `pima`. These tables provide information about various health metrics of adult female Pima Indians.
> 
> - **age, bmi, diastolic, numPreg, pedigree, plasma, serum, tricepts**: Each table contains `arg1` (varchar) and `arg2` (decimal). These tables represent different health metrics such as age, body mass index, blood pressure, number of pregnancies, diabetes pedigree function, plasma glucose concentration, serum insulin, and triceps skinfold thickness.
> 
> - **pima**: Contains `arg1` (varchar) and `arg2` (varchar). This table is used to classify individuals based on the presence or absence of diabetes.
> 
> *Task and Target Column:*
> 
> The primary task is *classification*, with the target column being `arg2` in the `pima` table. The goal is to classify individuals as diabetic or non-diabetic.
> 
> *Column Types:*
> 
> - Varchar: `arg1`, `arg2` (in `pima`)
> - Decimal: `arg2` (in other tables)
> 
> *Metadata:*
> 
> - **Size**: 700 KB
> - **Number of Tables**: 9
> - **Number of Rows**: 6,912
> - **Number of Columns**: 18
> - **Missing Values**: No
> - **Instance Count**: 768
> - **Target Table**: `pima`
> - **Target Column**: `arg2`
> 
> This dataset is used in the medicine domain to analyze and classify health data related to diabetes among Pima Indians.

### Tables
Population table: pima

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Pima.svg" alt="Pima ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the `utils`
module. This function returns a tuple with the population table as the first
element and the a dictionary of peripheral tables as the second element.

In [2]:
pima, peripheral = load_ctu_dataset("Pima")

(
    tricepts,
    diastolic,
    age,
    pedigree,
    plasma,
    num_preg,
    bmi,
    serum,
) = peripheral.values()

Analyzing schema:   0%|          | 0/9 [00:00<?, ?it/s]

Downloading tables:   0%|          | 0/9 [00:00<?, ?it/s]

Building data:   0%|          | 0/9 [00:00<?, ?it/s]

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`pima`). We already set the `target` role for the target (`arg2`). If the task is a multiclass classification,
we split the target column into multiple columns in an one-vs-all fashion. In this case, the original target is still avaiable as `arg2`.

In [3]:
# TODO: Annotate remaining columns with roles
pima

name,arg2,arg1,split
role,target,unused_string,unused_string
0.0,0,A1,train
1.0,0,A10,train
2.0,0,A100,val
3.0,0,A101,train
4.0,1,A102,train
,...,...,...
763.0,1,A95,train
764.0,1,A96,train
765.0,1,A97,val
766.0,1,A98,val


Peripheral tables,

In [4]:
# TODO: Annotate columns with roles
tricepts

name,arg1,arg2
role,unused_string,unused_string
0.0,A1,0.627
1.0,A10,0.232
2.0,A100,0.325
3.0,A101,1.222
4.0,A102,0.179
,...,...
763.0,A95,0.761
764.0,A96,0.255
765.0,A97,0.13
766.0,A98,0.323


In [5]:
# TODO: Annotate columns with roles
diastolic

name,arg1,arg2
role,unused_string,unused_string
0.0,A1,148
1.0,A10,125
2.0,A100,122
3.0,A101,163
4.0,A102,151
,...,...
763.0,A95,142
764.0,A96,144
765.0,A97,92
766.0,A98,71


In [6]:
# TODO: Annotate columns with roles
age

name,arg1,arg2
role,unused_string,unused_string
0.0,A1,35
1.0,A10,0
2.0,A100,51
3.0,A101,0
4.0,A102,0
,...,...
763.0,A95,18
764.0,A96,27
765.0,A97,28
766.0,A98,18


In [7]:
# TODO: Annotate columns with roles
pedigree

name,arg1,arg2
role,unused_string,unused_string
0.0,A1,33.6
1.0,A10,0
2.0,A100,49.7
3.0,A101,39
4.0,A102,26.1
,...,...
763.0,A95,24.7
764.0,A96,33.9
765.0,A97,31.6
766.0,A98,20.4


In [8]:
# TODO: Annotate columns with roles
plasma

name,arg1,arg2
role,unused_string,unused_string
0.0,A1,0
1.0,A10,0
2.0,A100,220
3.0,A101,0
4.0,A102,0
,...,...
763.0,A95,64
764.0,A96,228
765.0,A97,0
766.0,A98,76


In [9]:
# TODO: Annotate columns with roles
num_preg

name,arg1,arg2
role,unused_string,unused_string
0.0,A1,72
1.0,A10,96
2.0,A100,90
3.0,A101,72
4.0,A102,60
,...,...
763.0,A95,82
764.0,A96,72
765.0,A97,62
766.0,A98,48


In [10]:
# TODO: Annotate columns with roles
bmi

name,arg1,arg2
role,unused_string,unused_string
0.0,A1,50
1.0,A10,54
2.0,A100,31
3.0,A101,33
4.0,A102,22
,...,...
763.0,A95,21
764.0,A96,40
765.0,A97,24
766.0,A98,22


In [11]:
# TODO: Annotate columns with roles
serum

name,arg1,arg2
role,unused_string,unused_string
0.0,A1,6
1.0,A10,8
2.0,A100,1
3.0,A101,1
4.0,A102,1
,...,...
763.0,A95,2
764.0,A96,6
765.0,A97,2
766.0,A98,1


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Pima](https://relational.fel.cvut.cz/dataset/Pima)
for a description of the dataset.

In [12]:
dm = getml.data.DataModel(population=pima.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [13]:
container = getml.data.Container(population=pima, split=pima.split)
container.add(**peripheral)

container

Unnamed: 0,subset,name,rows,type
0,train,pima,538,View
1,val,pima,230,View

Unnamed: 0,name,rows,type
0,pedigree,768,DataFrame
1,plasma,768,DataFrame
2,tricepts,768,DataFrame
3,bmi,768,DataFrame
4,serum,768,DataFrame
5,diastolic,768,DataFrame
6,age,768,DataFrame
7,num_preg,768,DataFrame
