In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("fnhk")

# Task: FNHK
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *FNHK* dataset contains anonymized data from a hospital in Hradec Kralove, Czech Republic, focusing on treatment and medication. It is used for a regression task, with the target column being *Delka_hospitalizace* in the *pripady* table.
> 
> **Data Model:**
> 
> - **vykony** table:
>   - *Identifikace_pripadu*: int
>   - *Datum_provedeni_vykonu*: date
>   - *Typ_polozky*: int
>   - *Kod_polozky*: int
>   - *Pocet*: int
>   - *Body*: int
> 
> - **zup** table:
>   - *Identifikace_pripadu*: int
>   - *Datum_provedeni_vykonu*: date
>   - *Typ_polozky*: int
>   - *Kod_polozky*: int
>   - *Pocet*: decimal
>   - *Cena*: decimal
> 
> - **pripady** table:
>   - *Identifikace_pripadu*: int
>   - *Identifikator_pacienta*: int
>   - *Kod_zdravotni_pojistovny*: int
>   - *Datum_prijeti*: date
>   - *Datum_propusteni*: date
>   - *Delka_hospitalizace*: int (target column)
>   - *Vekovy_Interval_Pacienta*: varchar
>   - *Pohlavi_pacienta*: char
>   - *Zakladni_diagnoza*: varchar
>   - *Seznam_vedlejsich_diagnoz*: varchar
>   - *DRG_skupina*: int
>   - *PSC*: char
> 
> **Metadata:**
> 
> - Size: 130.8 MB
> - Number of tables: 3
> - Number of rows: 2,108,356
> - Number of columns: 24
> - Missing values: Yes
> - Compound keys: No
> - Loops: No
> - Type: Real
> - Instance count: 41,392
> 
> The dataset is used in medical research to analyze hospital treatment patterns and predict hospitalization duration. It provides insights into healthcare management and patient care strategies.

### Tables
Population table: pripady

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/FNHK.svg" alt="FNHK ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. It returns a tuple: the population table comes
first, followed by a dictionary of peripheral tables keyed by table name.

In [2]:
pripady, peripheral = load_ctu_dataset("FNHK")

# Look up each peripheral table by name instead of relying on the order of peripheral.values().
vykony = peripheral["vykony"]
zup = peripheral["zup"]

Analyzing schema:   0%|          | 0/3 [00:00<?, ?it/s]

Downloading tables:   0%|          | 0/3 [00:00<?, ?it/s]

Building data:   0%|          | 0/3 [00:00<?, ?it/s]

Now we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`pripady`):

The `target` role is already set for `Delka_hospitalizace`, the target column of this regression task.

In [3]:
# TODO: Annotate remaining columns with roles
pripady

name,Delka_hospitalizace,Identifikace_pripadu,Identifikator_pacienta,Kod_zdravotni_pojistovny,Datum_prijeti,Datum_propusteni,Vekovy_Interval_Pacienta,Pohlavi_pacienta,Zakladni_diagnoza,Seznam_vedlejsich_diagnoz,DRG_skupina,PSC,split
role,target,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,61,1829493,104337,207,2014-11-01,2014-12-31,50-60,M,S8230,,8111,50352,train
1.0,41,1840525,324424,111,2013-11-22,2014-01-01,60-70,F,C240,K720 C767 I10 E46 E118,88873,50346,val
2.0,3,1840526,30854,111,2013-12-30,2014-01-01,80+,F,N23,N131 N201,11342,50347,val
3.0,4,1840527,1343226,111,2013-12-29,2014-01-01,0-10,M,Z380,,15751,28126,train
4.0,4,1840528,1343217,205,2013-12-29,2014-01-01,0-10,M,Z380,,15751,50332,train
,...,...,...,...,...,...,...,...,...,...,...,...,...
41514.0,2,1882040,260206,111,2014-12-30,2014-12-31,70-80,M,S0650,I958 R001 R402 G936,1443,50002,train
41515.0,5,1882041,1372657,111,2014-12-27,2014-12-31,0-10,F,Z380,,15751,50801,val
41516.0,3,1882042,129640,207,2014-12-29,2014-12-31,40-50,F,S8270,S8230S8280,8111,50324,val
41517.0,34,1958501,1363126,611,2014-08-30,2014-10-02,0-10,F,P364,G008 Z290,15743,99999,val


The peripheral tables (`vykony` and `zup`):

In [4]:
# TODO: Annotate columns with roles
vykony

name,Identifikace_pripadu,Datum_provedeni_vykonu,Typ_polozky,Kod_polozky,Pocet,Body
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1829493,2014-11-01,0,38210,1,79
1.0,1829493,2014-11-01,0,51859,1,300
2.0,1829493,2014-11-01,0,53021,1,348
3.0,1829493,2014-11-01,0,53022,1,234
4.0,1829493,2014-11-01,0,66819,1,4067
,...,...,...,...,...,...
1879332.0,1958505,2014-08-30,0,63115,1,239
1879333.0,1958505,2014-08-30,0,63117,4,1508
1879334.0,1958505,2014-08-30,0,63120,1,2987
1879335.0,1958505,2014-08-31,0,602,1,1381


In [5]:
# TODO: Annotate columns with roles
zup

name,Identifikace_pripadu,Datum_provedeni_vykonu,Typ_polozky,Kod_polozky,Pocet,Cena
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1829493,2014-11-01,3,2370,4,458.48
1.0,1829493,2014-11-01,1,58092,0.1,23
2.0,1829493,2014-11-01,3,73679,4,7395.48
3.0,1829493,2014-11-01,3,99861,6,11024.82
4.0,1829493,2014-11-01,3,99862,1,841.53
,...,...,...,...,...,...
192414.0,1958501,2014-09-21,1,83050,3,134.82
192415.0,1958501,2014-09-22,1,83050,3,134.82
192416.0,1958501,2014-09-23,1,83050,3,134.82
192417.0,1958501,2014-09-24,1,83050,1,44.94

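Before assigning roles to the date columns, it is worth checking how they relate to the target. The sketch below uses a few rows copied from the `pripady` output above (not the full table) and tests whether `Delka_hospitalizace` equals the number of days between `Datum_prijeti` and `Datum_propusteni` plus one. The names `sample_rows` and `mismatches` are illustrative.

```python
from datetime import date

# (Datum_prijeti, Datum_propusteni, Delka_hospitalizace), copied from the
# pripady rows displayed above. This is a sample, not the full table.
sample_rows = [
    (date(2014, 11, 1), date(2014, 12, 31), 61),
    (date(2013, 11, 22), date(2014, 1, 1), 41),
    (date(2013, 12, 30), date(2014, 1, 1), 3),
    (date(2014, 8, 30), date(2014, 10, 2), 34),
]

# Collect every row where the target differs from (discharge - admission) + 1 day.
mismatches = [
    row for row in sample_rows
    if (row[1] - row[0]).days + 1 != row[2]
]

print(mismatches)  # → []
```

If the same relation holds for the whole table, `Datum_propusteni` reveals the target and should not be used as a feature.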

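With all three tables inspected, the remaining role annotations could be sketched as below. This is one plausible assignment, not the notebook's solution: every role choice is an assumption, `ROLE_PLAN`, `TIME_FORMATS` and `apply_roles` are hypothetical names, and the only getML call involved is `DataFrame.set_role(cols, role, time_formats=...)`, with roles passed by their string names (assumed to match the constants in `getml.data.roles`). `Datum_propusteni` and `Identifikator_pacienta` are left unused; the discharge date in particular would leak the target, since in the rows shown above `Delka_hospitalizace` equals the day difference to `Datum_prijeti` plus one. Whether the procedure dates in the peripheral tables are admissible depends on the intended prediction time and is not settled here.

```python
# Hypothetical role plan: table name -> role name -> columns.
# Column names follow the schema in the dataset description.
ROLE_PLAN = {
    "pripady": {
        "join_key": ["Identifikace_pripadu"],
        "time_stamp": ["Datum_prijeti"],
        "categorical": [
            "Kod_zdravotni_pojistovny",
            "Vekovy_Interval_Pacienta",
            "Pohlavi_pacienta",
            "Zakladni_diagnoza",
            "DRG_skupina",
            "PSC",
        ],
        "text": ["Seznam_vedlejsich_diagnoz"],
        # Datum_propusteni is intentionally absent: it would leak the target.
    },
    "vykony": {
        "join_key": ["Identifikace_pripadu"],
        "time_stamp": ["Datum_provedeni_vykonu"],
        "categorical": ["Typ_polozky", "Kod_polozky"],
        "numerical": ["Pocet", "Body"],
    },
    "zup": {
        "join_key": ["Identifikace_pripadu"],
        "time_stamp": ["Datum_provedeni_vykonu"],
        "categorical": ["Typ_polozky", "Kod_polozky"],
        "numerical": ["Pocet", "Cena"],
    },
}

# Assumed date format, based on values such as "2014-11-01" in the output.
TIME_FORMATS = ["%Y-%m-%d"]


def apply_roles(df, plan):
    """Call df.set_role once per role listed in the plan."""
    for role, cols in plan.items():
        if role == "time_stamp":
            df.set_role(cols, role, time_formats=TIME_FORMATS)
        else:
            df.set_role(cols, role)


# Usage inside the notebook (needs a running getML engine):
# tables = {"pripady": pripady, **peripheral}
# for name, df in tables.items():
#     apply_roles(df, ROLE_PLAN[name])
```

Keeping the plan as plain data makes it easy to review the assignment column by column before any role is actually changed.
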
The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/FNHK](https://relational.fel.cvut.cz/dataset/FNHK)
for a description of the dataset.

In [6]:
dm = getml.data.DataModel(population=pripady.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
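
The TODO above could be completed along these lines. The sketch assumes that both peripheral tables relate to `pripady` through `Identifikace_pripadu` (the only column they share with it in the schema) and that peripheral placeholders are reachable as attributes of the data model, as in `dm.vykony`. `join_peripherals` is a hypothetical helper; the only getML call it relies on is `Placeholder.join(..., on=...)`.

```python
def join_peripherals(dm, names=("vykony", "zup"), key="Identifikace_pripadu"):
    """Join each named peripheral placeholder to the population placeholder.

    `dm` is expected to expose `population` and one attribute per
    peripheral table, each offering a `join(right, on=...)` method.
    """
    for name in names:
        dm.population.join(getattr(dm, name), on=key)
    return dm


# Usage inside the notebook, after the join key role has been set:
# join_peripherals(dm)
```

No time stamps are passed to the join in this sketch, so it expresses only the key relationship between a case and its procedures or materials.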

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=pripady, split=pripady.split)
container.add(**peripheral)

container

Unnamed: 0,subset,name,rows,type
0,train,pripady,29064,View
1,val,pripady,12455,View

Unnamed: 0,name,rows,type
0,zup,192419,DataFrame
1,vykony,1879337,DataFrame
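
As a quick sanity check on the split, the subset sizes reported by the container can be turned into shares. The counts are copied from the output above; the variable names are illustrative.

```python
# Row counts of the population subsets, as reported by the container.
subset_rows = {"train": 29064, "val": 12455}

total = sum(subset_rows.values())
shares = {name: rows / total for name, rows in subset_rows.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The shares come out at roughly 70% for `train` and 30% for `val`.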
