In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("financial")

# Task: financial
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *financial* dataset, also known as the loan application dataset, is part of the PKDD'99 Financial dataset. It includes information on 606 successful and 76 unsuccessful loans, along with related transactions. The primary task is multiclass classification, aiming to predict the loan outcome based on the *status* column in the *loan* table.
> 
> **Data Model:**
> 
> - **loan** table:
>   - *loan_id*: int
>   - *account_id*: int
>   - *date*: date
>   - *amount*: int
>   - *duration*: int
>   - *payments*: decimal
>   - *status*: varchar (target column)
> 
> - **order** table:
>   - *order_id*: int
>   - *account_id*: int
>   - *bank_to*: varchar
>   - *account_to*: int
>   - *amount*: decimal
>   - *k_symbol*: varchar
> 
> - **trans** table:
>   - *trans_id*: int
>   - *account_id*: int
>   - *date*: date
>   - *type*: varchar
>   - *operation*: varchar
>   - *amount*: int
>   - *balance*: int
>   - *k_symbol*: varchar
>   - *bank*: varchar
>   - *account*: int
> 
> - **disp** table:
>   - *disp_id*: int
>   - *client_id*: int
>   - *account_id*: int
>   - *type*: varchar
> 
> - **card** table:
>   - *card_id*: int
>   - *disp_id*: int
>   - *type*: varchar
>   - *issued*: date
> 
> - **account** table:
>   - *account_id*: int
>   - *district_id*: int
>   - *frequency*: varchar
>   - *date*: date
> 
> - **client** table:
>   - *client_id*: int
>   - *gender*: varchar
>   - *birth_date*: date
>   - *district_id*: int
> 
> - **district** table:
>   - *district_id*: int
>   - *A2* to *A16*: various data types
> 
> **Metadata:**
> 
> - Size: 78.8 MB
> - Number of tables: 8
> - Number of rows: 1,090,086
> - Number of columns: 55
> - Missing values: Yes
> - Compound keys: No
> - Loops: Yes
> - Type: Real
> - Instance count: 682
> 
> The dataset is widely used in financial research to analyze loan outcomes and predict financial behaviors. It has been referenced in numerous studies focusing on relational classification and financial risk assessment.

### Tables
Population table: loan

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/financial.svg" alt="financial ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
loan, peripheral = load_ctu_dataset("financial")

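# Look each table up by name so that every variable is bound to the table
# it is named after, regardless of the order of the dictionary.
card = peripheral["card"]
trans = peripheral["trans"]
account = peripheral["account"]
disp = peripheral["disp"]
client = peripheral["client"]
district = peripheral["district"]
order = peripheral["order"]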


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`loan`).

We already set the `target` role for the target (`status`).
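As a starting point for the annotation, the following is a minimal sketch of how candidate roles might be guessed from column names and a few sample values. `suggest_role` is a hypothetical helper, not part of getML, and the role strings are assumed to correspond to getML's role names. The sample values are copied from the first rows of `loan` shown in this notebook; the `split` column is left out because it only marks the train/validation subsets.

```python
from datetime import datetime


def suggest_role(column, samples, target=None):
    """Hypothetical helper: guess a role name from a column name and samples."""
    if column == target:
        return "target"
    if column.endswith("_id"):
        return "join_key"
    try:
        # Dates in this dataset are displayed as YYYY-MM-DD strings.
        for value in samples:
            datetime.strptime(value, "%Y-%m-%d")
        return "time_stamp"
    except ValueError:
        pass
    try:
        for value in samples:
            float(value)
        return "numerical"
    except ValueError:
        return "categorical"


# First rows of `loan` as displayed in this notebook (all values as strings).
loan_samples = {
    "status": ["0", "1", "0"],
    "loan_id": ["4959", "4961", "4962"],
    "account_id": ["2", "19", "25"],
    "date": ["1994-01-05", "1996-04-29", "1997-12-08"],
    "amount": ["80952", "30276", "30276"],
    "duration": ["24", "12", "12"],
    "payments": ["3373", "2523", "2523"],
}

suggested = {
    column: suggest_role(column, samples, target="status")
    for column, samples in loan_samples.items()
}
for column, role in suggested.items():
    print(f"{column:<12} {role}")
```

The suggestions are only a first pass and need review. For example, `loan_id` is flagged as a join key by the naming rule even though no other table references it. Once reviewed, the roles could presumably be applied with the data frame's `set_role` method, passing a time format for the date columns.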



In [3]:
# TODO: Annotate remaining columns with roles
loan

| name | status | loan_id | account_id | date | amount | duration | payments | split |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 0 | 4959 | 2 | 1994-01-05 | 80952 | 24 | 3373 | val |
| 1 | 1 | 4961 | 19 | 1996-04-29 | 30276 | 12 | 2523 | train |
| 2 | 0 | 4962 | 25 | 1997-12-08 | 30276 | 12 | 2523 | train |
| 3 | 2 | 4967 | 37 | 1998-10-14 | 318480 | 60 | 5308 | train |
| 4 | 3 | 4968 | 38 | 1998-04-19 | 110736 | 48 | 2307 | train |
|  | ... | ... | ... | ... | ... | ... | ... | ... |
| 677 | 3 | 7294 | 11327 | 1998-09-27 | 39168 | 24 | 1632 | train |
| 678 | 3 | 7295 | 11328 | 1998-07-18 | 280440 | 60 | 4674 | val |
| 679 | 3 | 7304 | 11349 | 1995-10-29 | 419880 | 60 | 6998 | train |
| 680 | 0 | 7305 | 11359 | 1996-08-06 | 54024 | 12 | 4502 | train |


The peripheral tables:

In [4]:
# TODO: Annotate columns with roles
card

| name | card_id | disp_id | type | issued |
| --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | 9 | gold | 1998-10-16 |
| 1 | 2 | 19 | classic | 1998-03-13 |
| 2 | 3 | 41 | gold | 1995-09-03 |
| 3 | 4 | 42 | classic | 1998-11-26 |
| 4 | 5 | 51 | junior | 1995-04-24 |
|  | ... | ... | ... | ... |
| 887 | 1230 | 13312 | classic | 1998-03-08 |
| 888 | 1233 | 13382 | classic | 1996-07-06 |
| 889 | 1234 | 13386 | classic | 1997-11-28 |
| 890 | 1239 | 13442 | junior | 1998-02-02 |


In [5]:
# TODO: Annotate columns with roles
trans

| name | trans_id | account_id | date | type | operation | amount | balance | k_symbol | bank | account |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | 1 | 1995-03-24 | PRIJEM | VKLAD | 1000 | 1000 |  |  |  |
| 1 | 5 | 1 | 1995-04-13 | PRIJEM | PREVOD Z UCTU | 3679 | 4679 |  | AB | 41403269 |
| 2 | 6 | 1 | 1995-05-13 | PRIJEM | PREVOD Z UCTU | 3679 | 20977 |  | AB | 41403269 |
| 3 | 7 | 1 | 1995-06-13 | PRIJEM | PREVOD Z UCTU | 3679 | 26835 |  | AB | 41403269 |
| 4 | 8 | 1 | 1995-07-13 | PRIJEM | PREVOD Z UCTU | 3679 | 30415 |  | AB | 41403269 |
|  | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1056315 | 3682983 | 10451 | 1998-08-31 | PRIJEM |  | 62 | 17300 | UROK |  |  |
| 1056316 | 3682984 | 10451 | 1998-09-30 | PRIJEM |  | 49 | 13442 | UROK |  |  |
| 1056317 | 3682985 | 10451 | 1998-10-31 | PRIJEM |  | 34 | 10118 | UROK |  |  |
| 1056318 | 3682986 | 10451 | 1998-11-30 | PRIJEM |  | 26 | 8398 | UROK |  |  |


In [6]:
# TODO: Annotate columns with roles
account

| name | account_id | district_id | frequency | date |
| --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | 18 | POPLATEK MESICNE | 1995-03-24 |
| 1 | 2 | 1 | POPLATEK MESICNE | 1993-02-26 |
| 2 | 3 | 5 | POPLATEK MESICNE | 1997-07-07 |
| 3 | 4 | 12 | POPLATEK MESICNE | 1996-02-21 |
| 4 | 5 | 15 | POPLATEK MESICNE | 1997-05-30 |
|  | ... | ... | ... | ... |
| 4495 | 11333 | 8 | POPLATEK MESICNE | 1994-05-26 |
| 4496 | 11349 | 1 | POPLATEK TYDNE | 1995-05-26 |
| 4497 | 11359 | 61 | POPLATEK MESICNE | 1994-10-01 |
| 4498 | 11362 | 67 | POPLATEK MESICNE | 1995-10-14 |


In [7]:
# TODO: Annotate columns with roles
disp

| name | disp_id | client_id | account_id | type |
| --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | 1 | 1 | OWNER |
| 1 | 2 | 2 | 2 | OWNER |
| 2 | 3 | 3 | 2 | DISPONENT |
| 3 | 4 | 4 | 3 | OWNER |
| 4 | 5 | 5 | 3 | DISPONENT |
|  | ... | ... | ... | ... |
| 5364 | 13647 | 13955 | 11349 | OWNER |
| 5365 | 13648 | 13956 | 11349 | DISPONENT |
| 5366 | 13660 | 13968 | 11359 | OWNER |
| 5367 | 13663 | 13971 | 11362 | OWNER |


In [8]:
# TODO: Annotate columns with roles
client

| name | client_id | gender | birth_date | district_id |
| --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | F | 1970-12-13 | 18 |
| 1 | 2 | M | 1945-02-04 | 1 |
| 2 | 3 | F | 1940-10-09 | 1 |
| 3 | 4 | M | 1956-12-01 | 5 |
| 4 | 5 | F | 1960-07-03 | 5 |
|  | ... | ... | ... | ... |
| 5364 | 13955 | F | 1945-10-30 | 1 |
| 5365 | 13956 | M | 1943-04-06 | 1 |
| 5366 | 13968 | M | 1968-04-13 | 61 |
| 5367 | 13971 | F | 1962-10-19 | 67 |


In [9]:
# TODO: Annotate columns with roles
district

| name | district_id | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | Hl.m. Praha | Prague | 1204953 | 0 | 0 | 0 | 1 | 1 | 100 | 12541 | 0.2 | 0.43 | 167 | 85677 | 99107 |
| 1 | 2 | Benesov | central Bohemia | 88884 | 80 | 26 | 6 | 2 | 5 | 46.7 | 8507 | 1.6 | 1.85 | 132 | 2159 | 2674 |
| 2 | 3 | Beroun | central Bohemia | 75232 | 55 | 26 | 4 | 1 | 5 | 41.7 | 8980 | 1.9 | 2.21 | 111 | 2824 | 2813 |
| 3 | 4 | Kladno | central Bohemia | 149893 | 63 | 29 | 6 | 2 | 6 | 67.4 | 9753 | 4.6 | 5.05 | 109 | 5244 | 5892 |
| 4 | 5 | Kolin | central Bohemia | 95616 | 65 | 30 | 4 | 1 | 6 | 51.4 | 9307 | 3.8 | 4.43 | 118 | 2616 | 3040 |
|  | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72 | 73 | Opava | north Moravia | 182027 | 17 | 49 | 12 | 2 | 7 | 56.4 | 8746 | 3.3 | 3.74 | 90 | 4355 | 4433 |
| 73 | 74 | Ostrava - mesto | north Moravia | 323870 | 0 | 0 | 0 | 1 | 1 | 100 | 10673 | 4.7 | 5.44 | 100 | 18782 | 18347 |
| 74 | 75 | Prerov | north Moravia | 138032 | 67 | 30 | 4 | 2 | 5 | 64.6 | 8819 | 5.3 | 5.66 | 99 | 4063 | 4505 |
| 75 | 76 | Sumperk | north Moravia | 127369 | 31 | 32 | 13 | 2 | 7 | 51.2 | 8369 | 4.7 | 5.88 | 107 | 3736 | 2807 |


In [10]:
# TODO: Annotate columns with roles
order

| name | order_id | account_id | bank_to | account_to | amount | k_symbol |
| --- | --- | --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 29401 | 1 | YZ | 87144583 | 2452 | SIPO |
| 1 | 29402 | 2 | ST | 89597016 | 3372.7 | UVER |
| 2 | 29403 | 2 | QR | 13943797 | 7266 | SIPO |
| 3 | 29404 | 3 | WX | 83084338 | 1135 | SIPO |
| 4 | 29405 | 3 | CD | 24485939 | 327 |  |
|  | ... | ... | ... | ... | ... | ... |
| 6466 | 46334 | 11362 | YZ | 70641225 | 4780 | SIPO |
| 6467 | 46335 | 11362 | MN | 78507822 | 56 |  |
| 6468 | 46336 | 11362 | ST | 40799850 | 330 | POJISTNE |
| 6469 | 46337 | 11362 | KL | 20009470 | 129 |  |


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/financial](https://relational.fel.cvut.cz/dataset/financial)
for a description of the dataset.

In [11]:
dm = getml.data.DataModel(population=loan.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
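Before filling in the joins, it can help to write the relationships down as plain data. The sketch below lists one possible set of foreign-key links read off the data model in the dataset description and walks them breadth-first from `loan`. This is an assumption about how the tables could be connected, not the reference solution; the `client` to `district` link is deliberately left out so that the plan stays a tree (the metadata notes that the schema contains loops), and time stamps are not covered. `JOINS` and `join_order` are hypothetical names.

```python
from collections import deque

# Foreign-key links read off the data model: (left table, right table, key).
JOINS = [
    ("loan", "account", "account_id"),
    ("account", "trans", "account_id"),
    ("account", "order", "account_id"),
    ("account", "disp", "account_id"),
    ("account", "district", "district_id"),
    ("disp", "client", "client_id"),
    ("disp", "card", "disp_id"),
]


def join_order(root, joins):
    """Walk the links breadth-first from root and return them in visit order."""
    pending = deque([root])
    seen = {root}
    ordered = []
    while pending:
        left = pending.popleft()
        for a, b, key in joins:
            if a == left and b not in seen:
                seen.add(b)
                pending.append(b)
                ordered.append((a, b, key))
    return ordered


plan = join_order("loan", JOINS)
for left, right, key in plan:
    print(f"{left} -> {right} on {key}")
```

Each entry in `plan` would then correspond to one `join` call on the matching placeholder, for example joining the `account` placeholder onto the population on `account_id`.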

Now we can create the container and add the tables to it.

In [12]:
container = getml.data.Container(population=loan, split=loan.split)
container.add(**peripheral)

container

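|  | subset | name | rows | type |
| --- | --- | --- | --- | --- |
| 0 | train | loan | 478 | View |
| 1 | val | loan | 204 | View |

|  | name | rows | type |
| --- | --- | --- | --- |
| 0 | card | 892 | DataFrame |
| 1 | trans | 1056320 | DataFrame |
| 2 | account | 4500 | DataFrame |
| 3 | disp | 5369 | DataFrame |
| 4 | client | 5369 | DataFrame |
| 5 | district | 77 | DataFrame |
| 6 | order | 6471 | DataFrame |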
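A quick sanity check on the container summary: the two subsets of `loan` should add up to the instance count given in the dataset description. The short sketch below uses only the row counts shown above and computes the share of each subset.

```python
# Row counts copied from the container summary above.
subset_rows = {"train": 478, "val": 204}

total = sum(subset_rows.values())
shares = {name: rows / total for name, rows in subset_rows.items()}

for name, share in shares.items():
    print(f"{name}: {subset_rows[name]} rows ({share:.1%})")
print(f"total: {total} rows")
```

The total of 682 rows matches the instance count in the metadata, and the split is roughly 70% training to 30% validation.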
