In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("financial")

# Task: financial
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *financial* dataset, also known as the PKDD'99 Financial dataset, is a comprehensive relational dataset used primarily for multiclass classification tasks. The dataset consists of information related to loan applications, including successful and unsuccessful loans, along with associated transactions.
> 
> **Data Model:**
> - The dataset is structured into multiple tables: `loan`, `order`, `trans`, `disp`, `account`, `client`, `card`, and `district`.
> - Each table contains various attributes, such as `loan_id`, `account_id`, `amount`, `date`, `status`, etc.
> 
> **Task and Target Column:**
> - The primary task is *multiclass classification*, aiming to predict the loan outcome.
> - The target column is `status` in the `loan` table, which indicates the outcome of the loan (e.g., successful or not).
> 
> **Column Types:**
> - The dataset includes a mix of data types:
>   - *Integer* (e.g., `loan_id`, `account_id`)
>   - *Varchar* (e.g., `status`, `type`)
>   - *Date* (e.g., `date`, `birth_date`)
>   - *Decimal* (e.g., `payments`, `amount`)
> 
> **Metadata:**
> - Size: 78.8 MB
> - Number of tables: 8
> - Total number of rows: 1,090,086
> - Total number of columns: 55
> - Missing values: Yes
> - Target table: `loan`
> - Target ID: `account_id`
> - Target timestamp: `date`
> 
> **Research and Usage:**
> - The dataset has been widely used in research for developing and testing classification algorithms.
> - It has been referenced in various studies, such as "Constrained Sequential Pattern Knowledge in Multi-Relational Learning" and "Wordification: Propositionalization by unfolding relational data into bags of words."
> - Algorithms like Aleph, CoTReC, and CrossMine have been applied to this dataset, achieving varying levels of accuracy.
> 
> This dataset is a valuable resource for exploring financial data analysis, particularly in predicting loan outcomes based on historical transaction data.

### Tables
Population table: loan

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/financial.svg" alt="financial ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the `utils`
module. This function returns a tuple with the population table as the first
element and a dictionary of peripheral tables as the second element.
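
Because the second element is a dictionary, the peripheral tables can also be looked up by name, which does not depend on the order of its values. The sketch below assumes the keys are the table names (`account`, `card`, and so on), which the notebook does not state explicitly. `unpack_by_name` and `peripheral_demo` are hypothetical names, and the demo dictionary holds placeholder strings instead of real tables.

```python
# Table names as they appear in the dataset description.
TABLE_NAMES = ("account", "card", "client", "disp", "district", "order", "trans")


def unpack_by_name(peripheral, names=TABLE_NAMES):
    """Return the tables in the order given by `names`, failing loudly on gaps."""
    missing = [name for name in names if name not in peripheral]
    if missing:
        raise KeyError(f"missing peripheral tables: {missing}")
    return tuple(peripheral[name] for name in names)


# Stand-in dictionary (assumption: keys are table names), deliberately built
# in reverse order to show that the lookup does not rely on ordering.
peripheral_demo = {name: f"<{name} table>" for name in reversed(TABLE_NAMES)}

account, card, client, disp, district, order, trans = unpack_by_name(peripheral_demo)
print(account, trans)
```

With the real data, the same helper would be called as `unpack_by_name(peripheral)`, provided the key assumption holds.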

In [2]:
loan, peripheral = load_ctu_dataset("financial")

(
    account,
    card,
    client,
    disp,
    district,
    order,
    trans,
) = peripheral.values()

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`loan`).

We already set the `target` role for the target (`status`).
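
As a rough sketch of what annotating the remaining `loan` columns could look like, the role plan below maps getML role names to the columns shown in the table. The plan itself, the `apply_roles` helper, the `RoleRecorder` stand-in, and the `%Y-%m-%d` time format are assumptions made for illustration. With a real data frame, the helper relies on `set_role(cols, role, time_formats=...)` accepting the role names as strings.

```python
# Hypothetical role plan for `loan`; `status` (target) and `split` are left untouched.
LOAN_ROLES = {
    "join_key": ["account_id"],
    "time_stamp": ["date"],
    "numerical": ["amount", "duration", "payments"],
}


def apply_roles(df, plan, time_formats=("%Y-%m-%d",)):
    """Call `df.set_role` once per role; time stamps also get a parse format."""
    for role, cols in plan.items():
        if role == "time_stamp":
            df.set_role(cols, role, time_formats=list(time_formats))
        else:
            df.set_role(cols, role)
    return df


class RoleRecorder:
    """Stand-in for a data frame (assumption): it only records set_role calls."""

    def __init__(self):
        self.calls = []

    def set_role(self, cols, role, time_formats=None):
        self.calls.append((tuple(cols), role, time_formats))


recorder = apply_roles(RoleRecorder(), LOAN_ROLES)
for call in recorder.calls:
    print(call)
```

With the real table, the call would be `apply_roles(loan, LOAN_ROLES)`. Whether `loan_id` should stay unused, and which role fits each column best, remains a modeling decision.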



In [3]:
# TODO: Annotate remaining columns with roles
loan

name,loan_id,account_id,amount,duration,payments,status,date,split
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string
0.0,4959,2,80952,24,3373,0,1994-01-05,val
1.0,4961,19,30276,12,2523,1,1996-04-29,train
2.0,4962,25,30276,12,2523,0,1997-12-08,train
3.0,4967,37,318480,60,5308,2,1998-10-14,train
4.0,4968,38,110736,48,2307,3,1998-04-19,train
,...,...,...,...,...,...,...,...
677.0,7294,11327,39168,24,1632,3,1998-09-27,train
678.0,7295,11328,280440,60,4674,3,1998-07-18,val
679.0,7304,11349,419880,60,6998,3,1995-10-29,train
680.0,7305,11359,54024,12,4502,0,1996-08-06,train


The peripheral tables:

In [4]:
# TODO: Annotate columns with roles
account

name,account_id,district_id,frequency,date
role,unused_float,unused_float,unused_string,unused_string
0.0,1,18,POPLATEK MESICNE,1995-03-24
1.0,2,1,POPLATEK MESICNE,1993-02-26
2.0,3,5,POPLATEK MESICNE,1997-07-07
3.0,4,12,POPLATEK MESICNE,1996-02-21
4.0,5,15,POPLATEK MESICNE,1997-05-30
,...,...,...,...
4495.0,11333,8,POPLATEK MESICNE,1994-05-26
4496.0,11349,1,POPLATEK TYDNE,1995-05-26
4497.0,11359,61,POPLATEK MESICNE,1994-10-01
4498.0,11362,67,POPLATEK MESICNE,1995-10-14


In [5]:
# TODO: Annotate columns with roles
card

name,card_id,disp_id,type,issued
role,unused_float,unused_float,unused_string,unused_string
0.0,1,9,gold,1998-10-16
1.0,2,19,classic,1998-03-13
2.0,3,41,gold,1995-09-03
3.0,4,42,classic,1998-11-26
4.0,5,51,junior,1995-04-24
,...,...,...,...
887.0,1230,13312,classic,1998-03-08
888.0,1233,13382,classic,1996-07-06
889.0,1234,13386,classic,1997-11-28
890.0,1239,13442,junior,1998-02-02


In [6]:
# TODO: Annotate columns with roles
client

name,client_id,district_id,gender,birth_date
role,unused_float,unused_float,unused_string,unused_string
0.0,1,18,F,1970-12-13
1.0,2,1,M,1945-02-04
2.0,3,1,F,1940-10-09
3.0,4,5,M,1956-12-01
4.0,5,5,F,1960-07-03
,...,...,...,...
5364.0,13955,1,F,1945-10-30
5365.0,13956,1,M,1943-04-06
5366.0,13968,61,M,1968-04-13
5367.0,13971,67,F,1962-10-19


In [7]:
# TODO: Annotate columns with roles
disp

name,disp_id,client_id,account_id,type
role,unused_float,unused_float,unused_float,unused_string
0.0,1,1,1,OWNER
1.0,2,2,2,OWNER
2.0,3,3,2,DISPONENT
3.0,4,4,3,OWNER
4.0,5,5,3,DISPONENT
,...,...,...,...
5364.0,13647,13955,11349,OWNER
5365.0,13648,13956,11349,DISPONENT
5366.0,13660,13968,11359,OWNER
5367.0,13663,13971,11362,OWNER


In [8]:
# TODO: Annotate columns with roles
district

name,district_id,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A2,A3
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string
0.0,1,1204953,0,0,0,1,1,100,12541,0.2,0.43,167,85677,99107,Hl.m. Praha,Prague
1.0,2,88884,80,26,6,2,5,46.7,8507,1.6,1.85,132,2159,2674,Benesov,central Bohemia
2.0,3,75232,55,26,4,1,5,41.7,8980,1.9,2.21,111,2824,2813,Beroun,central Bohemia
3.0,4,149893,63,29,6,2,6,67.4,9753,4.6,5.05,109,5244,5892,Kladno,central Bohemia
4.0,5,95616,65,30,4,1,6,51.4,9307,3.8,4.43,118,2616,3040,Kolin,central Bohemia
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72.0,73,182027,17,49,12,2,7,56.4,8746,3.3,3.74,90,4355,4433,Opava,north Moravia
73.0,74,323870,0,0,0,1,1,100,10673,4.7,5.44,100,18782,18347,Ostrava - mesto,north Moravia
74.0,75,138032,67,30,4,2,5,64.6,8819,5.3,5.66,99,4063,4505,Prerov,north Moravia
75.0,76,127369,31,32,13,2,7,51.2,8369,4.7,5.88,107,3736,2807,Sumperk,north Moravia


In [9]:
# TODO: Annotate columns with roles
order

name,order_id,account_id,account_to,amount,bank_to,k_symbol
role,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string
0.0,29401,1,87144583,2452,YZ,SIPO
1.0,29402,2,89597016,3372.7,ST,UVER
2.0,29403,2,13943797,7266,QR,SIPO
3.0,29404,3,83084338,1135,WX,SIPO
4.0,29405,3,24485939,327,CD,
,...,...,...,...,...,...
6466.0,46334,11362,70641225,4780,YZ,SIPO
6467.0,46335,11362,78507822,56,MN,
6468.0,46336,11362,40799850,330,ST,POJISTNE
6469.0,46337,11362,20009470,129,KL,


In [10]:
# TODO: Annotate columns with roles
trans

name,trans_id,account_id,amount,balance,account,date,type,operation,k_symbol,bank
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,1,1000,1000,,1995-03-24,PRIJEM,VKLAD,,
1.0,5,1,3679,4679,41403269,1995-04-13,PRIJEM,PREVOD Z UCTU,,AB
2.0,6,1,3679,20977,41403269,1995-05-13,PRIJEM,PREVOD Z UCTU,,AB
3.0,7,1,3679,26835,41403269,1995-06-13,PRIJEM,PREVOD Z UCTU,,AB
4.0,8,1,3679,30415,41403269,1995-07-13,PRIJEM,PREVOD Z UCTU,,AB
,...,...,...,...,...,...,...,...,...,...
1056315.0,3682983,10451,62,17300,,1998-08-31,PRIJEM,,UROK,
1056316.0,3682984,10451,49,13442,,1998-09-30,PRIJEM,,UROK,
1056317.0,3682985,10451,34,10118,,1998-10-31,PRIJEM,,UROK,
1056318.0,3682986,10451,26,8398,,1998-11-30,PRIJEM,,UROK,


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/financial](https://relational.fel.cvut.cz/dataset/financial)
for a description of the dataset.
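
Before writing the joins, it can help to list the candidate relationships and check that each join key exists on both sides. The sketch below does this in plain Python using the column names shown in the tables above. The `JOINS` list is one possible reading of the shared key columns, not a confirmed data model, and `check_joins` is a hypothetical helper.

```python
# Column names copied from the table previews in this notebook.
COLUMNS = {
    "loan": ["loan_id", "account_id", "amount", "duration", "payments", "status", "date", "split"],
    "account": ["account_id", "district_id", "frequency", "date"],
    "card": ["card_id", "disp_id", "type", "issued"],
    "client": ["client_id", "district_id", "gender", "birth_date"],
    "disp": ["disp_id", "client_id", "account_id", "type"],
    "district": ["district_id"] + [f"A{i}" for i in range(2, 17)],
    "order": ["order_id", "account_id", "account_to", "amount", "bank_to", "k_symbol"],
    "trans": ["trans_id", "account_id", "amount", "balance", "account", "date",
              "type", "operation", "k_symbol", "bank"],
}

# Candidate joins as (parent, child, key). This is an assumption derived from
# shared key columns and should be checked against the ER diagram.
JOINS = [
    ("loan", "account", "account_id"),
    ("loan", "order", "account_id"),
    ("loan", "trans", "account_id"),
    ("loan", "disp", "account_id"),
    ("disp", "card", "disp_id"),
    ("disp", "client", "client_id"),
    ("account", "district", "district_id"),
    ("client", "district", "district_id"),
]


def check_joins(joins, columns):
    """Return the joins whose key is missing from the parent or the child table."""
    return [
        (parent, child, key)
        for parent, child, key in joins
        if key not in columns[parent] or key not in columns[child]
    ]


invalid = check_joins(JOINS, COLUMNS)
print("invalid joins:", invalid)
```

Each valid edge would then correspond to one `join(...)` call on the parent placeholder. The choice of relationship type and time stamps for each join is not covered by this check.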

In [11]:
dm = getml.data.DataModel(population=loan.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [12]:
container = getml.data.Container(population=loan, split=loan.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,loan,478,View
1,val,loan,204,View

,name,rows,type
0,account,4500,DataFrame
1,card,892,DataFrame
2,client,5369,DataFrame
3,disp,5369,DataFrame
4,district,77,DataFrame
5,order,6471,DataFrame
6,trans,1056320,DataFrame
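
As a quick sanity check on the split, the subset sizes reported in the container output can be turned into shares of the population table. This is plain arithmetic on the row counts shown above, and the variable names are illustrative.

```python
# Row counts of the `loan` subsets, copied from the container output.
subset_rows = {"train": 478, "val": 204}

total = sum(subset_rows.values())
shares = {name: rows / total for name, rows in subset_rows.items()}

print(f"total rows: {total}")  # total rows: 682
for name, share in shares.items():
    print(f"{name}: {share:.1%}")  # train: 70.1%, then val: 29.9%
```

The split is therefore roughly 70/30 between training and validation.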
