In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("pima")

# Task: Pima
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *Pima* dataset is a well-known dataset used for medical research, specifically focusing on diabetes prediction among Pima Indian women. It originates from a study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases.
> 
> **Data Model:**
> - The dataset is organized into multiple tables: `age`, `bmi`, `diastolic`, `numPreg`, `pedigree`, `plasma`, `serum`, `tricepts`, and `pima`.
> - Each table contains two columns: `arg1` (varchar) and `arg2` (decimal for most tables, varchar for `pima`).
> 
> **Task and Target Column:**
> - The primary task is *classification*, aiming to predict the presence of diabetes.
> - The target column is `arg2` in the `pima` table.
> 
> **Column Types:**
> - The dataset includes:
>   - *Varchar* (e.g., `arg1`)
>   - *Decimal* (e.g., `arg2` in most tables)
> 
> **Metadata:**
> - Size: 700 KB
> - Number of tables: 9
> - Total number of rows: 6,912
> - Total number of columns: 18
> - Missing values: No
> - Target table: `pima`
> - Target ID: `arg1`
> 
> **Research and Usage:**
> - The dataset is widely used in medical research for developing models to predict diabetes.
> - It is a popular choice for testing classification algorithms due to its real-world medical data.
> 
> This dataset provides a valuable resource for exploring medical data analysis, particularly in predicting diabetes based on various health indicators.

### Tables
Population table: pima

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Pima.svg" alt="Pima ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the `utils`
module. This function returns a tuple with the population table as the first
element and a dictionary of peripheral tables as the second element.
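
Unpacking a dictionary positionally via `.values()` relies on its insertion order. A minimal sketch of a more defensive alternative looks the tables up by name instead. The helper `pick_tables` is hypothetical, and the keys that `load_ctu_dataset` actually uses are an assumption here (the container output further down suggests names such as `num_preg` rather than `numPreg`).

```python
def pick_tables(tables, names):
    """Return the tables for `names`, in that order, failing loudly on gaps."""
    missing = [name for name in names if name not in tables]
    if missing:
        raise KeyError(f"missing tables: {missing}; available: {sorted(tables)}")
    return tuple(tables[name] for name in names)


# Stand-in dictionary for illustration; in the notebook this would be `peripheral`.
demo = {"age": "age-table", "bmi": "bmi-table"}
age_table, bmi_table = pick_tables(demo, ["age", "bmi"])
```

If a name is misspelled or the loader renames a table, the lookup raises a `KeyError` that lists the available keys instead of silently binding the wrong table to a variable.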

In [2]:
pima, peripheral = load_ctu_dataset("Pima")

(
    age,
    bmi,
    diastolic,
    numPreg,
    pedigree,
    plasma,
    serum,
    tricepts,
) = peripheral.values()

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`pima`):

The `target` role is already set on `arg2`, the label column for this binary classification task.

In [3]:
# TODO: Annotate remaining columns with roles
pima

name,arg2,arg1,split
role,target,unused_string,unused_string
0.0,0,A1,train
1.0,0,A10,train
2.0,0,A100,val
3.0,0,A101,train
4.0,1,A102,train
,...,...,...
763.0,1,A95,train
764.0,1,A96,train
765.0,1,A97,val
766.0,1,A98,val


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
age

name,arg2,arg1
role,unused_float,unused_string
0.0,50,A1
1.0,54,A10
2.0,31,A100
3.0,33,A101
4.0,22,A102
,...,...
763.0,21,A95
764.0,40,A96
765.0,24,A97
766.0,22,A98


In [5]:
# TODO: Annotate columns with roles
bmi

name,arg2,arg1
role,unused_float,unused_string
0.0,33.6,A1
1.0,0,A10
2.0,49.7,A100
3.0,39,A101
4.0,26.1,A102
,...,...
763.0,24.7,A95
764.0,33.9,A96
765.0,31.6,A97
766.0,20.4,A98


In [6]:
# TODO: Annotate columns with roles
diastolic

name,arg2,arg1
role,unused_float,unused_string
0.0,72,A1
1.0,96,A10
2.0,90,A100
3.0,72,A101
4.0,60,A102
,...,...
763.0,82,A95
764.0,72,A96
765.0,62,A97
766.0,48,A98


In [7]:
# TODO: Annotate columns with roles
numPreg

name,arg2,arg1
role,unused_float,unused_string
0.0,6,A1
1.0,8,A10
2.0,1,A100
3.0,1,A101
4.0,1,A102
,...,...
763.0,2,A95
764.0,6,A96
765.0,2,A97
766.0,1,A98


In [8]:
# TODO: Annotate columns with roles
pedigree

name,arg2,arg1
role,unused_float,unused_string
0.0,0.627,A1
1.0,0.232,A10
2.0,0.325,A100
3.0,1.222,A101
4.0,0.179,A102
,...,...
763.0,0.761,A95
764.0,0.255,A96
765.0,0.13,A97
766.0,0.323,A98


In [9]:
# TODO: Annotate columns with roles
plasma

name,arg2,arg1
role,unused_float,unused_string
0.0,148,A1
1.0,125,A10
2.0,122,A100
3.0,163,A101
4.0,151,A102
,...,...
763.0,142,A95
764.0,144,A96
765.0,92,A97
766.0,71,A98


In [10]:
# TODO: Annotate columns with roles
serum

name,arg2,arg1
role,unused_float,unused_string
0.0,0,A1
1.0,0,A10
2.0,220,A100
3.0,0,A101
4.0,0,A102
,...,...
763.0,64,A95
764.0,228,A96
765.0,0,A97
766.0,76,A98


In [11]:
# TODO: Annotate columns with roles
tricepts

name,arg2,arg1
role,unused_float,unused_string
0.0,35,A1
1.0,0,A10
2.0,51,A100
3.0,0,A101
4.0,0,A102
,...,...
763.0,18,A95
764.0,27,A96
765.0,28,A97
766.0,18,A98
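
One possible way to resolve the annotation TODOs above is sketched below. It assumes that `arg1` serves as the join key in every table and that `arg2` in the peripheral tables is a numerical measurement; both are judgment calls, not something the loader prescribes. The helper `annotate_roles` is hypothetical and only relies on the frames exposing `set_role(cols, role)`; the role names are assumed to be the strings behind `getml.data.roles.join_key` and `getml.data.roles.numerical`.

```python
def annotate_roles(frames, key="arg1", value="arg2",
                   key_role="join_key", value_role="numerical"):
    """Assign a join key role and, optionally, a value role to each frame.

    Works with any object that offers `set_role(cols, role)`; getML data
    frames are assumed to behave this way.
    """
    for frame in frames:
        frame.set_role(key, key_role)
        # Skip the value column when its role is already set (e.g. a target).
        if value_role is not None:
            frame.set_role(value, value_role)
    return frames
```

In the notebook this could be called as `annotate_roles(peripheral.values())` for the peripheral tables and as `annotate_roles([pima], value_role=None)` for the population table, which leaves the existing `target` role on `arg2` untouched.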


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Pima](https://relational.fel.cvut.cz/dataset/Pima)
for a description of the dataset.
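
Before choosing a relationship type for the joins, it can help to check whether each key occurs exactly once on both sides. The following is a minimal sketch in plain Python; `is_one_to_one` is a hypothetical helper, and the sample keys are the first five `arg1` values shown in the tables above.

```python
from collections import Counter


def is_one_to_one(population_keys, peripheral_keys):
    """True if both sides hold the same keys and no key is repeated."""
    pop = Counter(population_keys)
    per = Counter(peripheral_keys)
    unique = all(n == 1 for n in pop.values()) and all(n == 1 for n in per.values())
    return unique and pop.keys() == per.keys()


sample_keys = ["A1", "A10", "A100", "A101", "A102"]
matches = is_one_to_one(sample_keys, list(reversed(sample_keys)))
```

Applied to the full `arg1` columns of `pima` and one peripheral table, a `True` result would suggest that a one-to-one relationship is a reasonable choice for that join.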

In [12]:
dm = getml.data.DataModel(population=pima.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
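
A sketch of how the join TODO might be filled in, under the assumption that every peripheral table joins to the population on `arg1`. The helper `join_peripherals` is hypothetical. It assumes that the data model exposes each placeholder as an attribute named after its table and that `join` accepts `on` and `relationship` keyword arguments.

```python
def join_peripherals(dm, names, on="arg1", relationship=None):
    """Join each named peripheral placeholder onto the population placeholder."""
    # Only pass `relationship` when the caller chose one, so the
    # library default applies otherwise.
    kwargs = {} if relationship is None else {"relationship": relationship}
    for name in names:
        dm.population.join(getattr(dm, name), on=on, **kwargs)
    return dm
```

In the notebook, a call might look like `join_peripherals(dm, peripheral.keys(), relationship=getml.data.relationship.one_to_one)`, provided the keys of `peripheral` match the placeholder names and the keys are indeed unique in each table.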

Now we can create the container and add the tables to it.

In [13]:
container = getml.data.Container(population=pima, split=pima.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,pima,538,View
1,val,pima,230,View

,name,rows,type
0,age,768,DataFrame
1,bmi,768,DataFrame
2,diastolic,768,DataFrame
3,num_preg,768,DataFrame
4,pedigree,768,DataFrame
5,plasma,768,DataFrame
6,serum,768,DataFrame
7,tricepts,768,DataFrame
