In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_craft_beer")

# Task: CraftBeer
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Data Model:*
> 
> The *CraftBeer* dataset consists of two tables: `beers` and `breweries`. These tables provide detailed information about various craft beers and the breweries that produce them.
> 
> - **beers**: This table includes columns `id` (int), `brewery_id` (int), `abv` (float), `ibu` (decimal), `name` (varchar), `style` (varchar), and `ounces` (decimal). It describes the characteristics of different beers, such as alcohol content, bitterness, and style.
> 
> - **breweries**: Contains `id` (int), `name` (varchar), `city` (varchar), and `state` (varchar). This table lists breweries along with their names and locations.
> 
> *Task and Target Column:*
> 
> The primary task is *classification*, with the target column being `state` in the `breweries` table. The goal is to classify breweries by their state.
> 
> *Column Types:*
> 
> - Integer: `id`, `brewery_id`
> - Float: `abv`
> - Decimal: `ibu`, `ounces`
> - Varchar: `name`, `style`, `city`, `state`
> 
> *Metadata:*
> 
> - **Size**: 300 KB
> - **Number of Tables**: 2
> - **Number of Rows**: 2,968
> - **Number of Columns**: 11
> - **Missing Values**: No missing values
> - **Instance Count**: 558
> - **Target Table**: `breweries`
> - **Target Column**: `state`
> 
> This dataset is used in the domain of entertainment to explore and classify craft beers and their breweries based on various attributes.

### Tables
Population table: breweries

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/CraftBeer.svg" alt="CraftBeer ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.

In [2]:
breweries, peripheral = load_ctu_dataset("CraftBeer")

(
    beers,
) = peripheral.values()

You are splitting the column into more than 10 target columns. This might take a long time to fit.


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

First, the population table (`breweries`). The `target` role is already set for the target (`state`). If the task is a multiclass classification,
the target column is split into multiple columns in a one-vs-all fashion. In this case, the original target is still available as `state`.
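
The one-vs-all split can be pictured with a minimal sketch in plain Python. It is not how `load_ctu_dataset` is implemented; `one_vs_all` is a hypothetical helper, and the five labels are the encoded `state` values of the first five rows shown in the table below.

```python
def one_vs_all(labels, prefix="state"):
    """Turn one multiclass column into one binary column per class.

    Hypothetical helper for illustration only; the column naming
    (`state=<class>`) mirrors the table output in this notebook.
    """
    classes = sorted(set(labels))
    return {
        f"{prefix}={cls}": [1 if label == cls else 0 for label in labels]
        for cls in classes
    }


# Encoded `state` values of the first five breweries in the preview.
states = [0, 1, 2, 3, 3]
targets = one_vs_all(states)

for column, values in targets.items():
    print(column, values)
```

Each row has exactly one `1` across the generated columns, which is why the warning about more than 10 target columns appears for a target with this many classes.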

In [3]:
# TODO: Annotate remaining columns with roles
breweries

name,state=0,state=1,state=10,state=11,state=12,state=13,state=14,state=15,state=16,state=17,state=18,state=19,state=2,state=20,state=21,state=22,state=23,state=24,state=25,state=26,state=27,state=28,state=29,state=3,state=30,state=31,state=32,state=33,state=34,state=35,state=36,state=37,state=38,state=39,state=4,state=40,state=41,state=42,state=43,state=44,state=45,state=46,state=47,state=48,state=49,state=5,state=50,state=6,state=7,state=8,state=9,state,id,name,city,split
role,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,target,unused_float,unused_string,unused_string,unused_string,unused_string
0.0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,NorthGate Brewing,Minneapolis,train
1.0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,Against the Grain Brewery,Louisville,train
2.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,2,Jack's Abby Craft Lagers,Framingham,val
3.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,3,Mike Hess Brewing Company,San Diego,train
4.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,4,Fort Point Beer Company,San Francisco,train
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
553.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,32,553,Covington Brewhouse,Covington,train
554.0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,11,554,Dave's Brewfarm,Wilson,train
555.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,555,Ukiah Brewing Company,Ukiah,train
556.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,17,556,Butternuts Beer and Ale,Garrattsville,train


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
beers

name,id,brewery_id,abv,ibu,name,style,ounces
role,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,166,0.065,65,Dale's Pale Ale,American Pale Ale (APA),12
1.0,4,166,0.087,85,Gordon Ale (2009),American Double / Imperial IPA,12
2.0,5,166,0.08,35,Old Chub,Scottish Ale,12
3.0,6,166,0.099,100,GUBNA Imperial IPA,American Double / Imperial IPA,12
4.0,7,166,0.053,35,Mama's Little Yella Pils,Czech Pilsener,12
,...,...,...,...,...,...,...
2405.0,2688,0,0.06,25,Stronghold,American Porter,16
2406.0,2689,0,0.06,38,Pumpion,Pumpkin Ale,16
2407.0,2690,0,0.048,19,Wall's End,English Brown Ale,16
2408.0,2691,0,0.049,26,Maggie's Leap,Milk / Sweet Stout,16
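
One possible way to approach the annotation TODO is to write the role plan down as data and apply it in a loop. The sketch below rests on two assumptions: that getML's `DataFrame.set_role(cols, role)` accepts role names as plain strings matching the constants in `getml.data.roles`, and that the roles chosen here suit the columns. `BEERS_ROLES`, `annotate`, and `RoleRecorder` are hypothetical names; the recorder only stands in for a real data frame so the plan can be dry-run without a running engine.

```python
# Assumed role plan for `beers`; adjust after inspecting the data.
BEERS_ROLES = {
    "join_key": ["brewery_id"],             # links a beer to its brewery
    "numerical": ["abv", "ibu", "ounces"],  # measured quantities
    "categorical": ["style"],               # repeated labels
    "text": ["name"],                       # free-form beer names
}


def annotate(df, plan):
    """Call df.set_role(cols, role) once per role in the plan."""
    for role, cols in plan.items():
        df.set_role(cols, role)
    return df


class RoleRecorder:
    """Hypothetical stand-in for a getML DataFrame (records calls only)."""

    def __init__(self):
        self.roles = {}

    def set_role(self, cols, role):
        for col in cols:
            self.roles[col] = role


# Dry run: check which role each column would receive.
recorder = annotate(RoleRecorder(), BEERS_ROLES)
print(recorder.roles)
```

With a running engine, the same plan would presumably be applied as `annotate(beers, BEERS_ROLES)`. The `id` column of `beers` is left unannotated here because it only identifies rows.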


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/CraftBeer](https://relational.fel.cvut.cz/dataset/CraftBeer)
for a description of the dataset.

In [5]:
dm = getml.data.DataModel(population=breweries.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
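
To see what the join has to express, it helps to spell out the relationship by hand. The sketch below assumes that `beers.brewery_id` references `breweries.id` (one brewery, many beers), as the ER diagram suggests. It uses plain Python and the rows visible in the `beers` preview. It is not getML's feature learning, only the kind of per-brewery aggregation that such a join makes possible.

```python
from collections import defaultdict

# Rows copied from the `beers` preview: (id, brewery_id, abv).
beer_rows = [
    (1, 166, 0.065),
    (4, 166, 0.087),
    (5, 166, 0.08),
    (6, 166, 0.099),
    (7, 166, 0.053),
    (2688, 0, 0.06),
    (2689, 0, 0.06),
    (2690, 0, 0.048),
    (2691, 0, 0.049),
]

# Assumption: brewery_id is the foreign key into breweries.id.
abv_by_brewery = defaultdict(list)
for _beer_id, brewery_id, abv in beer_rows:
    abv_by_brewery[brewery_id].append(abv)

# Collapse the many beers of each brewery into one row of features.
brewery_features = {
    brewery_id: {"n_beers": len(abvs), "mean_abv": sum(abvs) / len(abvs)}
    for brewery_id, abvs in abv_by_brewery.items()
}

for brewery_id, features in brewery_features.items():
    print(brewery_id, features)
```

In the data model, the corresponding step would be a join from the population placeholder to the `beers` placeholder on these two key columns.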

Now we can create the container and add the tables to it.

In [6]:
container = getml.data.Container(population=breweries, split=breweries.split)
container.add(**peripheral)

container

subset,name,rows,type
train,breweries,391,View
val,breweries,167,View

name,rows,type
beers,2410,DataFrame
