In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_consumer_expenditures")

# Task: ConsumerExpenditures
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *ConsumerExpenditures* dataset is derived from the Consumer Expenditure Survey (CE), which collects data on expenditures, income, and demographics in the United States. It is designed to predict whether an expenditure is a gift.
> 
> - *Data Model*: The dataset consists of 3 tables: `EXPENDITURES`, `HOUSEHOLD_MEMBERS`, and `HOUSEHOLDS`. These tables capture various aspects of household expenditures and demographics.
> 
> - *Task*: The primary task associated with this dataset is *classification*, with the target column being `GIFT` in the `EXPENDITURES` table.
> 
> - *Column Types*: The dataset includes various data types:
>   - *Numeric*: `int`, `double`
>   - *String*: `varchar`
> 
> - *Metadata*:
>   - Size: 337.6 MB
>   - Number of Tables: 3
>   - Number of Rows: 2,241,548
>   - Number of Columns: 24
>   - Missing Values: Yes
>   - Compound Keys: No
>   - Loops: No
>   - Instance Count: 2,047,961
>   - Target Table: `EXPENDITURES`
>   - Target Column: `GIFT`
>   - Target ID: `EXPENDITURE_ID`
> 
> This dataset is publicly available and can be accessed through a MariaDB client. It is used for analyzing consumer spending patterns and predicting gift expenditures.

### Tables
Population table: expenditures

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/ConsumerExpenditures.svg" alt="ConsumerExpenditures ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.
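
Because the second element is a dictionary, the peripheral tables can be unpacked either by position or by name. The following is a minimal sketch with plain Python stand-ins (the string values are placeholders, not real getml objects). It assumes that the dictionary keys match the table names `households` and `household_members`, as the container summary at the end of this notebook suggests.

```python
# Stand-in for the tuple returned by load_ctu_dataset; the string values are
# placeholders for the real tables (assumption for illustration only).
population, peripheral = (
    "expenditures table",
    {
        "households": "households table",
        "household_members": "household_members table",
    },
)

# Positional unpacking depends on the insertion order of the dictionary.
first, second = peripheral.values()

# Unpacking by key binds each variable to the table of the same name,
# regardless of the order in which the dictionary was built.
households = peripheral["households"]
household_members = peripheral["household_members"]

print(first)              # households table
print(household_members)  # household_members table
```

Unpacking by key is slightly more verbose, but it keeps variable names and table contents aligned even if the order of the dictionary changes.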

In [2]:
expenditures, peripheral = load_ctu_dataset("ConsumerExpenditures")

(
    household_members,
    households,
) = peripheral.values()


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table is `expenditures`. The `target` role is already set for the target column (`GIFT`). For multiclass classification tasks,
the target column is split into multiple columns in a one-vs-all fashion. In that case, the original target is still available as `GIFT`.
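
To illustrate what a one-vs-all split means, here is a minimal sketch in plain Python. The class labels and the `target=<class>` column naming are hypothetical and chosen only for illustration; the names produced by the loader may differ.

```python
def one_vs_all(target, classes=None):
    """Split a multiclass target into one binary column per class."""
    if classes is None:
        classes = sorted(set(target))
    return {
        f"target={cls}": [1 if value == cls else 0 for value in target]
        for cls in classes
    }


# Hypothetical multiclass labels (not taken from the dataset).
labels = ["food", "clothing", "food", "housing"]
columns = one_vs_all(labels)

for name, values in columns.items():
    print(name, values)
# target=clothing [0, 1, 0, 0]
# target=food [1, 0, 1, 0]
# target=housing [0, 0, 0, 1]
```

Each resulting column is a binary target of its own, so a binary classifier can be trained per class.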

In [3]:
# TODO: Annotate remaining columns with roles
expenditures

| name | GIFT | YEAR | MONTH | COST | IS_TRAINING | EXPENDITURE_ID | HOUSEHOLD_ID | PRODUCT_CODE | split |
|---|---|---|---|---|---|---|---|---|---|
| role | target | unused_float | unused_float | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
| 0 | 0 | 2015 | 1 | 3.89 | 1 | 1 | 03111041 | 010210 | train |
| 1 | 0 | 2015 | 1 | 4.66 | 1 | 10 | 03111041 | 120310 | val |
| 2 | 0 | 2015 | 2 | 9.79 | 1 | 100 | 03111051 | 190211 | train |
| 3 | 0 | 2015 | 2 | 2.95 | 1 | 1000 | 03111402 | 040510 | train |
| 4 | 0 | 2015 | 1 | 2.12 | 1 | 10000 | 03114161 | 190321 | train |
|  | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020629 | 0 | 2017 | 6 | 1.99 | 1 | 999995 | 03708582 | 150110 | train |
| 2020630 | 0 | 2017 | 6 | 3.619 | 1 | 999996 | 03708582 | 150110 | train |
| 2020631 | 0 | 2017 | 6 | 5.2727 | 1 | 999997 | 03708582 | 150211 | val |
| 2020632 | 0 | 2017 | 6 | 4.6894 | 1 | 999998 | 03708582 | 150310 | train |
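
One way to approach the open annotation step is to write down a role plan first and apply it afterwards. The sketch below is plain Python and does not call getml. The role names follow getml's vocabulary, but every individual choice (for example, treating `PRODUCT_CODE` as categorical) is an assumption, not a recommendation from the original notebook.

```python
# Hypothetical role plan for the columns of `expenditures`.
# Every assignment here is an assumption made for illustration.
role_plan = {
    "HOUSEHOLD_ID": "join_key",
    "EXPENDITURE_ID": "unused_string",  # row identifier, not a feature
    "PRODUCT_CODE": "categorical",
    "COST": "numerical",
    "YEAR": "numerical",
    "MONTH": "numerical",
    "IS_TRAINING": "unused_float",
}


def columns_by_role(plan):
    """Group column names by role, preserving the order of the plan."""
    grouped = {}
    for column, role in plan.items():
        grouped.setdefault(role, []).append(column)
    return grouped


grouped = columns_by_role(role_plan)
for role, columns in grouped.items():
    print(f"{role}: {columns}")
    # With a running getml engine, this would presumably correspond to
    # something like: expenditures.set_role(columns, role)
```

Grouping by role keeps the plan reviewable in one place before any column is actually annotated.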


The peripheral tables:

In [4]:
# TODO: Annotate columns with roles
household_members

| name | YEAR | INCOME_RANK | INCOME_RANK_1 | INCOME_RANK_2 | INCOME_RANK_3 | INCOME_RANK_4 | INCOME_RANK_5 | INCOME_RANK_MEAN | AGE_REF | HOUSEHOLD_ID |
|---|---|---|---|---|---|---|---|---|---|---|
| role | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_float | unused_string |
| 0 | 2015 | 0.3044 | 0.1448 | 0.1427 | 0.1432 | 0.1422 | 0.1382 | 0.127 | 66 | 03111041 |
| 1 | 2015 | 0.3063 | 0.1462 | 0.1444 | 0.1446 | 0.1435 | 0.1395 | 0.1283 | 66 | 03111042 |
| 2 | 2015 | 0.6931 | 0.6222 | 0.6204 | 0.623 | 0.6131 | 0.6123 | 0.6207 | 48 | 03111051 |
| 3 | 2015 | 0.6926 | 0.6216 | 0.6198 | 0.6224 | 0.6125 | 0.6117 | 0.6201 | 48 | 03111052 |
| 4 | 2015 | 0.2817 | 0.113 | 0.1128 | 0.1098 | 0.1116 | 0.1092 | 0.0951 | 37 | 03111061 |
|  | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 56807 | 2019 | 0.4828 | 0.4106 | 0.3603 | 0.3958 | 0.377 | 0.3984 | 0.3769 | 67 | 04362582 |
| 56808 | 2019 | 0.6644 | 0.5975 | 0.6026 | 0.5949 | 0.596 | 0.6002 | 0.6 | 52 | 04362661 |
| 56809 | 2019 | 0.6639 | 0.597 | 0.6021 | 0.5944 | 0.5955 | 0.5997 | 0.5995 | 52 | 04362662 |
| 56810 | 2019 | 0.162 | 0.05217 | 0.03955 | 0.04507 | 0.04607 | 0.02436 | 0.03558 | 72 | 04362671 |


In [5]:
# TODO: Annotate columns with roles
households

| name | YEAR | AGE | HOUSEHOLD_ID | MARITAL | SEX | WORK_STATUS |
|---|---|---|---|---|---|---|
| role | unused_float | unused_float | unused_string | unused_string | unused_string | unused_string |
| 0 | 2015 | 66 | 03111041 | 1 | 1 |  |
| 1 | 2015 | 66 | 03111042 | 1 | 1 |  |
| 2 | 2015 | 56 | 03111091 | 1 | 1 |  |
| 3 | 2015 | 56 | 03111092 | 1 | 1 |  |
| 4 | 2015 | 50 | 03111111 | 1 | 1 | 1 |
|  | ... | ... | ... | ... | ... | ... |
| 137350 | 2019 | 22 | 04362422 | 5 | 2 |  |
| 137351 | 2019 | 11 | 04362431 | 5 | 2 |  |
| 137352 | 2019 | 11 | 04362432 | 5 | 2 |  |
| 137353 | 2019 | 72 | 04362671 | 5 | 2 |  |


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/ConsumerExpenditures](https://relational.fel.cvut.cz/dataset/ConsumerExpenditures)
for a description of the dataset.
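
Conceptually, a join in the data model links each population row to the peripheral rows that share a key. The sketch below shows this matching logic in plain Python on a few rows copied from the table outputs above, using `HOUSEHOLD_ID` as the join key. The choice of key is an assumption based on the column names, and this is not the getml API itself.

```python
from collections import defaultdict

# A few rows copied from the table outputs above (subset of columns).
expenditures_rows = [
    {"EXPENDITURE_ID": "1", "HOUSEHOLD_ID": "03111041", "COST": 3.89},
    {"EXPENDITURE_ID": "10", "HOUSEHOLD_ID": "03111041", "COST": 4.66},
    {"EXPENDITURE_ID": "100", "HOUSEHOLD_ID": "03111051", "COST": 9.79},
]
peripheral_rows = [
    {"HOUSEHOLD_ID": "03111041", "YEAR": 2015, "AGE": 66},
    {"HOUSEHOLD_ID": "03111042", "YEAR": 2015, "AGE": 66},
    {"HOUSEHOLD_ID": "03111091", "YEAR": 2015, "AGE": 56},
]

# Index the peripheral rows by the assumed join key.
by_household = defaultdict(list)
for row in peripheral_rows:
    by_household[row["HOUSEHOLD_ID"]].append(row)

# For each population row, collect the matching peripheral rows.
matches = {
    row["EXPENDITURE_ID"]: by_household.get(row["HOUSEHOLD_ID"], [])
    for row in expenditures_rows
}

for expenditure_id, rows in matches.items():
    print(expenditure_id, [r["AGE"] for r in rows])
# 1 [66]
# 10 [66]
# 100 []
```

The empty match for expenditure `100` only reflects the three sample rows used here; it says nothing about the full table.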

In [6]:
dm = getml.data.DataModel(population=expenditures.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=expenditures, split=expenditures.split)
container.add(**peripheral)

container

|   | subset | name | rows | type |
|---|---|---|---|---|
| 0 | train | expenditures | 1414444 | View |
| 1 | val | expenditures | 606190 | View |

|   | name | rows | type |
|---|---|---|---|
| 0 | households | 56812 | DataFrame |
| 1 | household_members | 137355 | DataFrame |
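
As a quick sanity check, the subset sizes reported by the container can be turned into split proportions. This is a small sketch in plain Python that uses only the row counts shown above.

```python
# Row counts copied from the container summary above.
rows_per_subset = {"train": 1414444, "val": 606190}

total = sum(rows_per_subset.values())
fractions = {name: rows / total for name, rows in rows_per_subset.items()}

print(total)  # 2020634
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.2%}")
# train: 70.00%
# val: 30.00%
```

The counts correspond to a 70/30 split between the training and validation subsets.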
