In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_go_sales")

# Task: GOSales
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *GOSales Dataset Description*
> 
> - *Data Model (Relational Schema)*:
>   - **go_1k**: Contains columns `Retailer code` (int), `Product number` (int), `Date` (date), and `Quantity` (int).
>   - **go_products**: Contains columns `Product number` (int), `Product line` (varchar), `Product type` (varchar), `Product` (varchar), `Product brand` (varchar), `Product color` (varchar), `Unit cost` (decimal), and `Unit price` (decimal).
>   - **go_retailers**: Contains columns `Retailer code` (int), `Retailer name` (varchar), `Type` (varchar), and `Country` (varchar).
>   - **go_daily_sales**: Contains columns `Retailer code` (int), `Product number` (int), `Order method code` (int), `Date` (date), `Quantity` (int), `Unit price` (decimal), and `Unit sale price` (decimal).
>   - **go_methods**: Contains columns `Order method code` (int) and `Order method type` (varchar).
> 
> - *Task*: Regression
>   - *Target Column*: `Quantity` in the `go_1k` table.
> 
> - *Types of the Columns*:
>   - *Int*: Used for identifiers and quantities.
>   - *Varchar*: Used for names, types, and descriptions.
>   - *Decimal*: Used for costs and prices.
>   - *Date*: Used for dates.
> 
> - *Metadata*:
>   - *Size*: 22.4 MB
>   - *Number of Tables*: 5
>   - *Number of Rows*: 150,989
>   - *Number of Columns*: 25
>   - *Missing Values*: No missing values
>   - *Instance Count*: 1,000
>   - *Target Table*: `go_1k`
>   - *Target ID*: `Retailer code`, `Product number`
>   - *Target Timestamp*: `Date`
> 
> This dataset contains information about daily sales, methods, retailers, and products for a fictitious outdoor equipment retail chain, "Great Outdoors" (GO), and is used to predict sale quantities.

### Tables
Population table: go_1k

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/GOSales.svg" alt="GOSales ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. It returns a tuple: the population table as the
first element and a dictionary of peripheral tables as the second element.

In [2]:
go_1k, peripheral = load_ctu_dataset("GOSales")

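# Look up each peripheral table by name. Positional unpacking of
# peripheral.values() would depend on the dictionary's iteration order.
go_daily_sales = peripheral["go_daily_sales"]
go_methods = peripheral["go_methods"]
go_products = peripheral["go_products"]
go_retailers = peripheral["go_retailers"]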

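
Because `peripheral` is a dictionary keyed by table name, tables can be fetched by key rather than by position. The following is a minimal sketch with a stand-in dictionary: the string values are placeholders rather than getML data frames, and the key order is an assumption chosen to mirror the order in which the container output lists the peripheral tables.

```python
# Stand-in for the dictionary returned by load_ctu_dataset (assumption:
# the values here are placeholder strings, not real data frames).
peripheral = {
    "go_methods": "<go_methods table>",
    "go_retailers": "<go_retailers table>",
    "go_daily_sales": "<go_daily_sales table>",
    "go_products": "<go_products table>",
}

# Positional unpacking follows insertion order, which need not be alphabetical.
first, *_ = peripheral.values()

# Lookup by key does not depend on that order.
go_daily_sales = peripheral["go_daily_sales"]
go_methods = peripheral["go_methods"]
```

Positional unpacking of `peripheral.values()` only matches the variable names if the dictionary happens to iterate in the same order as the names on the left-hand side.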

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table is `go_1k`. The `target` role is already set for the target column (`Quantity`). For a multiclass classification task, the target column is split into multiple columns in a one-vs-all fashion, and the original target remains available under its original name (here, `Quantity`).

In [3]:
# TODO: Annotate remaining columns with roles
go_1k

name,Quantity,Retailer code,Product number,Date,split
role,target,unused_string,unused_string,unused_string,unused_string
0.0,46,1115,125110,2016-02-09,val
1.0,19,1115,144180,2016-04-21,train
2.0,11,1115,149140,2017-02-14,train
3.0,232,1132,92110,2017-06-13,train
4.0,37,1132,96110,2018-06-20,train
,...,...,...,...,...
995.0,336,1749,22110,2015-09-13,train
996.0,14,1756,144180,2015-02-13,train
997.0,12,1760,144180,2017-01-12,train
998.0,32,1762,144180,2015-02-17,val
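
GOSales is a regression task, so no target splitting happens here. To illustrate the one-vs-all split mentioned above for the multiclass case, here is a minimal sketch in plain Python; the labels and the `target=<class>` column naming are hypothetical and do not reflect how `load_ctu_dataset` names its columns.

```python
def one_vs_all(labels):
    """Return one 0/1 indicator column per distinct class label."""
    classes = sorted(set(labels))
    return {
        f"target={cls}": [1 if label == cls else 0 for label in labels]
        for cls in classes
    }


# Hypothetical multiclass target; not part of the GOSales task.
labels = ["Web", "Fax", "Web", "Mail"]
columns = one_vs_all(labels)
```

Each resulting column can then be treated as a separate binary target, while the original label column stays untouched.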


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
peripheral["go_methods"]

name,Order method code,Order method type
role,unused_string,unused_string
0.0,1,Fax
1.0,2,Telephone
2.0,3,Mail
3.0,4,E-mail
4.0,5,Web
,...,...
7.0,8,Other
8.0,9,Other
9.0,10,Other
10.0,11,Other


In [5]:
# TODO: Annotate columns with roles
peripheral["go_retailers"]

name,Retailer code,Retailer name,Type,Country
role,unused_string,unused_string,unused_string,unused_string
0.0,1101,ActiForme,Equipment Rental Store,France
1.0,1115,SportsClub,Golf Shop,France
2.0,1123,Anapurna,Direct Marketing,France
3.0,1132,Cordages Discount,Warehouse Store,France
4.0,1133,Altitudes extrêmes,Outdoors Shop,France
,...,...,...,...
557.0,1766,Palácio do esporte,Sports Store,Brazil
558.0,1767,Golf Town,Golf Shop,United Kingdom
559.0,1768,Golf Exchange,Golf Shop,United Kingdom
560.0,1769,Meister,Sports Store,Switzerland


In [6]:
# TODO: Annotate columns with roles
peripheral["go_daily_sales"]

name,Retailer code,Product number,Order method code,Date,Quantity,Unit price,Unit sale price
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1201,109110,4,2015-01-12,648,76.86,71.48
1.0,1201,112110,4,2015-01-12,799,10.64,10.21
2.0,1201,115110,4,2015-01-12,755,10.71,10.28
3.0,1205,70240,3,2015-01-12,70,122.7,114.11
4.0,1205,71110,3,2015-01-12,28,95.62,92.75
,...,...,...,...,...,...,...
149252.0,1258,52110,5,2018-07-09,1011,38,36.48
149253.0,1258,53110,5,2018-07-09,476,39.99,25.99
149254.0,1258,54110,5,2018-07-09,476,52.99,52.99
149255.0,1258,55110,5,2018-07-09,476,8,8


In [7]:
# TODO: Annotate columns with roles
peripheral["go_products"]

name,Product number,Product line,Product type,Product,Product brand,Product color,Unit cost,Unit price
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1110,Camping Equipment,Cooking Gear,TrailChef Water Bag,TrailChef,Clear,2.77,6.59
1.0,2110,Camping Equipment,Cooking Gear,TrailChef Canteen,TrailChef,Brown,6.92,12.92
2.0,3110,Camping Equipment,Cooking Gear,TrailChef Kitchen Kit,TrailChef,Unspecified,15.78,23.8
3.0,4110,Camping Equipment,Cooking Gear,TrailChef Cup,TrailChef,Silver,0.85,3.66
4.0,5110,Camping Equipment,Cooking Gear,TrailChef Cook Set,TrailChef,Silver,34.41,54.93
,...,...,...,...,...,...,...,...
269.0,154110,Personal Accessories,Watches,Kodiak,Trakker,Blue,66.67,120.3
270.0,154120,Personal Accessories,Watches,Kodiak,Trakker,Brown,67.79,122.51
271.0,154130,Personal Accessories,Watches,Kodiak,Trakker,Green,66.83,120.76
272.0,154140,Personal Accessories,Watches,Kodiak,Trakker,Silver,74.11,136.2
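
When annotating roles, the column types from the dataset description are a reasonable starting point. The sketch below turns them into a draft role plan for `go_daily_sales`. The role names (`join_key`, `time_stamp`, `numerical`, `categorical`) are getML role names, but the type-to-role heuristic and the helper `suggest_role` are assumptions made for illustration; the final assignment is a modeling decision.

```python
# Column types of go_daily_sales as listed in the dataset description.
GO_DAILY_SALES_TYPES = {
    "Retailer code": "int",
    "Product number": "int",
    "Order method code": "int",
    "Date": "date",
    "Quantity": "int",
    "Unit price": "decimal",
    "Unit sale price": "decimal",
}


def suggest_role(column, sql_type):
    """Heuristic (an assumption, not a getML rule) mapping a column to a role name."""
    if sql_type == "date":
        return "time_stamp"
    # Integer identifiers such as "... code" or "... number" look like keys.
    if sql_type == "int" and column.lower().endswith(("code", "number")):
        return "join_key"
    if sql_type in ("int", "decimal"):
        return "numerical"
    return "categorical"


role_plan = {
    column: suggest_role(column, sql_type)
    for column, sql_type in GO_DAILY_SALES_TYPES.items()
}
```

A plan like this would still need to be applied to the getML data frames, for example via their `set_role` method, and reviewed column by column.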


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/GOSales](https://relational.fel.cvut.cz/dataset/GOSales)
for a description of the dataset.

In [8]:
dm = getml.data.DataModel(population=go_1k.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
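
Before filling in the joins, it can help to list which identifier columns the tables share. The sketch below does this in plain Python from the schema in the dataset description; treating columns whose names end in `code` or `number` as key candidates is an assumption, and the helper `shared_keys` is hypothetical.

```python
from itertools import combinations

# Column names per table, taken from the dataset description.
SCHEMA = {
    "go_1k": {"Retailer code", "Product number", "Date", "Quantity"},
    "go_products": {
        "Product number", "Product line", "Product type", "Product",
        "Product brand", "Product color", "Unit cost", "Unit price",
    },
    "go_retailers": {"Retailer code", "Retailer name", "Type", "Country"},
    "go_daily_sales": {
        "Retailer code", "Product number", "Order method code", "Date",
        "Quantity", "Unit price", "Unit sale price",
    },
    "go_methods": {"Order method code", "Order method type"},
}


def shared_keys(left, right):
    """Return shared columns that look like identifiers (assumed naming rule)."""
    shared = SCHEMA[left] & SCHEMA[right]
    return sorted(c for c in shared if c.lower().endswith(("code", "number")))


# Keep only table pairs that share at least one candidate key.
candidates = {
    (left, right): shared_keys(left, right)
    for left, right in combinations(SCHEMA, 2)
    if shared_keys(left, right)
}
```

Each pair is only a candidate: which joins to declare in `dm.population.join(...)`, and whether to restrict them by `Date`, remains a decision for the data model.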

Now we can create the container and add the tables to it.

In [9]:
container = getml.data.Container(population=go_1k, split=go_1k.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,go_1k,700,View
1,val,go_1k,300,View

,name,rows,type
0,go_methods,12,DataFrame
1,go_retailers,562,DataFrame
2,go_daily_sales,149257,DataFrame
3,go_products,274,DataFrame
