In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("seznam")

# Task: Seznam
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *Seznam* dataset represents online advertisement expenditures from Seznam.cz, a web portal and search engine in the Czech Republic. It is used for a regression task, with the target column being *kc_proklikano* in the *probehnuto* table.
> 
> **Data Model:**
> 
> - **dobito** table:
>   - *client_id*: int
>   - *month_year_datum_transakce*: date
>   - *sluzba*: varchar
>   - *kc_dobito*: decimal
> 
> - **probehnuto** table:
>   - *client_id*: int
>   - *month_year_datum_transakce*: date
>   - *sluzba*: varchar
>   - *kc_proklikano*: decimal (target column)
> 
> - **probehnuto_mimo_penezenku** table:
>   - *client_id*: int
>   - *Month/Year*: date
>   - *probehlá_inzerce_mimo_penezenku*: varchar
> 
> - **client** table:
>   - *client_id*: int
>   - *kraj*: varchar
>   - *obor*: varchar
> 
> **Metadata:**
> 
> - Size: 146.8 MB
> - Number of tables: 4
> - Number of rows: 2,681,983
> - Number of columns: 14
> - Missing values: Yes
> - Compound keys: No
> - Loops: No
> - Type: Real
> - Instance count: 1,458,233
> 
> The dataset is used in retail analytics to predict advertising expenditures based on client and service attributes. It provides insights into online advertising behavior and spending patterns.

### Tables
Population table: probehnuto

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/Seznam.svg" alt="Seznam ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
probehnuto, peripheral = load_ctu_dataset("Seznam")

# Look up each peripheral table by name. Unpacking peripheral.values()
# positionally depends on the dictionary's order and can mislabel tables.
client = peripheral["client"]
probehnuto_mimo_penezenku = peripheral["probehnuto_mimo_penezenku"]
dobito = peripheral["dobito"]

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`probehnuto`):

The `target` role is already set for `kc_proklikano`, the target column of the regression task.

In [3]:
# TODO: Annotate remaining columns with roles
probehnuto

name,kc_proklikano,client_id,month_year_datum_transakce,sluzba,split
role,target,unused_float,unused_string,unused_string,unused_string
0.0,-31.4,109145,2013-06-01,c,train
1.0,37.68,9804394,2015-10-01,h,train
2.0,725.34,9803353,2015-10-01,h,val
3.0,194.68,9801753,2015-10-01,h,train
4.0,1042.48,9800425,2015-10-01,h,train
,...,...,...,...,...
1462073.0,153.86,98857,2015-08-01,,train
1462074.0,153.86,95776,2015-09-01,,train
1462075.0,153.86,98857,2015-09-01,,train
1462076.0,310.86,90001,2015-10-01,,train


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
client

name,client_id,kraj,obor
role,unused_float,unused_string,unused_string
0.0,3901,Vysočina,Vilma
1.0,3904,Jihomoravský kraj,Leona
2.0,3907,Zlínský kraj,Vladan
3.0,3912,Ústecký kraj,Sonja
4.0,3916,Ústecký kraj,Bohdana
,...,...,...
73442.0,9806237,Jihočeský kraj,
73443.0,9806258,,
73444.0,9806301,,
73445.0,9806350,Jihomoravský kraj,Leona


In [5]:
# TODO: Annotate columns with roles
probehnuto_mimo_penezenku

name,client_id,Month/Year,probehla_inzerce_mimo_penezenku
role,unused_float,unused_string,unused_string
0.0,3901,2012-08-01,ANO
1.0,3901,2012-09-01,ANO
2.0,3901,2012-10-01,ANO
3.0,3901,2012-11-01,ANO
4.0,3901,2012-12-01,ANO
,...,...,...
599381.0,9804086,2015-10-01,ANO
599382.0,9804238,2015-10-01,ANO
599383.0,9804782,2015-10-01,ANO
599384.0,9804810,2015-10-01,ANO


In [6]:
# TODO: Annotate columns with roles
dobito

name,client_id,kc_dobito,month_year_datum_transakce,sluzba
role,unused_float,unused_float,unused_string,unused_string
0.0,7157857,1045.62,2012-10-01,c
1.0,109700,5187.28,2015-10-01,c
2.0,51508,408.2,2015-08-01,c
3.0,9573550,521.24,2012-10-01,c
4.0,9774621,386.22,2014-11-01,c
,...,...,...,...
554341.0,65283,7850,2012-09-01,g
554342.0,6091446,31400,2012-08-01,g
554343.0,1264806,-8220.52,2013-08-01,g
554344.0,101103,3140,2012-08-01,g
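
The `TODO` cells above leave the role annotation open. The sketch below shows one possible assignment. It is an assumption derived from the column names and values displayed in the tables, not the notebook's reference solution. It also assumes that getML data frames expose `set_role(cols, role)` and accept roles as plain strings such as `"join_key"`, `"time_stamp"`, `"categorical"`, and `"numerical"`. `ROLE_PLAN` and `apply_roles` are hypothetical names.

```python
# Hypothetical role plan: table name -> role -> columns.
# The assignment is an assumption based on the displayed columns.
ROLE_PLAN = {
    "probehnuto": {
        "join_key": ["client_id"],
        "time_stamp": ["month_year_datum_transakce"],
        "categorical": ["sluzba"],
    },
    "dobito": {
        "join_key": ["client_id"],
        "time_stamp": ["month_year_datum_transakce"],
        "categorical": ["sluzba"],
        "numerical": ["kc_dobito"],
    },
    "client": {
        "join_key": ["client_id"],
        "categorical": ["kraj", "obor"],
    },
    "probehnuto_mimo_penezenku": {
        "join_key": ["client_id"],
        "time_stamp": ["Month/Year"],
        "categorical": ["probehla_inzerce_mimo_penezenku"],
    },
}


def apply_roles(frames, plan=ROLE_PLAN):
    """Call set_role(columns, role) on every frame named in the plan."""
    for table_name, roles in plan.items():
        frame = frames[table_name]
        for role, columns in roles.items():
            frame.set_role(columns, role)
    return frames
```

In the notebook this could be invoked as `apply_roles({"probehnuto": probehnuto, **peripheral})`. The `split` column is deliberately left out of the plan, since it only drives the train/validation split.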


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/Seznam](https://relational.fel.cvut.cz/dataset/Seznam)
for a description of the dataset.

In [7]:
dm = getml.data.DataModel(population=probehnuto.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
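
The commented-out `join` call can be filled in using the fact that every table carries a `client_id` column. The sketch below is one hedged possibility, not the reference solution. It assumes that the placeholders are reachable as attributes of `dm` (for example `dm.client`), that `join` accepts the keyword arguments `on`, `time_stamps`, and `relationship`, and that `"many-to-one"` is the string behind `getml.data.relationship.many_to_one`. `JOIN_PLAN` and `add_joins` are hypothetical names.

```python
# Hypothetical join plan: peripheral table name -> keyword arguments for join().
JOIN_PLAN = {
    "client": {
        "on": "client_id",
        # Assumed: the client table holds one row per client_id.
        "relationship": "many-to-one",
    },
    "dobito": {
        "on": "client_id",
        "time_stamps": "month_year_datum_transakce",
    },
    "probehnuto_mimo_penezenku": {
        "on": "client_id",
        # (population column, peripheral column)
        "time_stamps": ("month_year_datum_transakce", "Month/Year"),
    },
}


def add_joins(dm, plan=JOIN_PLAN):
    """Join every peripheral placeholder in the plan onto dm.population."""
    for table_name, kwargs in plan.items():
        dm.population.join(getattr(dm, table_name), **kwargs)
    return dm
```

Calling `add_joins(dm)` after `dm.add(...)` would register the three joins. Passing time stamps is meant to keep peripheral rows from the future out of the generated features, which matters for the monthly `dobito` amounts.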

Now we can create the container and add the tables to it.

In [8]:
container = getml.data.Container(population=probehnuto, split=probehnuto.split)
container.add(**peripheral)

container

Unnamed: 0,subset,name,rows,type
0,train,probehnuto,1023455,View
1,val,probehnuto,438623,View

Unnamed: 0,name,rows,type
0,client,73447,DataFrame
1,probehnuto_mimo_penezenku,599386,DataFrame
2,dobito,554346,DataFrame
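
The row counts reported by the container allow a quick sanity check of the split. As a small worked example that uses only the numbers printed above, the share of each subset can be computed as follows. `split_shares` is a hypothetical helper, not part of getML.

```python
def split_shares(rows_per_subset):
    """Return each subset's share of the total row count."""
    total = sum(rows_per_subset.values())
    return {name: rows / total for name, rows in rows_per_subset.items()}


# Row counts copied from the container output.
shares = split_shares({"train": 1_023_455, "val": 438_623})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these counts the split works out to roughly 70% training rows and 30% validation rows.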
