In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("db_transformer_restbase")

# Task: restbase
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> *Data Model (Relational Schema)*
> 
> The *restbase* dataset consists of three tables:
> 
> - **Location**: Contains restaurant location details such as `street_num`, `street_name`, and `city`.
> - **GeneralInfo**: Provides general information about restaurants, including `label`, `food_type`, `city`, and `review`.
> - **Geographic**: Contains geographic information with `city`, `county`, and `region`.
> 
> *Task and Target Column*
> 
> The primary task is *regression*, with the target column being *review* in the *generalinfo* table.
> 
> *Types of the Columns*
> 
> - **Numeric**: Includes columns like `review`.
> - **String**: Includes `label`, `food_type`, `street_name`, `city`.
> 
> *Metadata about the Dataset*
> 
> - **Size**: 3 MB
> - **Number of Tables**: 3
> - **Number of Rows**: 19,148
> - **Number of Columns**: 12
> - **Missing Values**: Yes
> - **Compound Keys**: No
> - **Loops**: Yes
> - **Type**: Real
> - **Instance Count**: 9,524
> 
> This dataset describes restaurants in and around the San Francisco Bay Area; the goal is to predict each restaurant's review score from its attributes.

### Tables
Population table: generalinfo

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/restbase.svg" alt="restbase ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the `utils`
module. This function returns a tuple with the population table as the first
element and a dictionary of peripheral tables as the second element.
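Because the second element is a dictionary, the peripheral tables can also be unpacked by name rather than by position. The following is a minimal sketch, assuming the dictionary is keyed by table name (`"geographic"`, `"location"`); the helper `unpack_peripheral` and the stand-in values are hypothetical and not part of the challenge utilities.

```python
from typing import Any, Dict, Tuple


def unpack_peripheral(peripheral: Dict[str, Any], *names: str) -> Tuple[Any, ...]:
    """Return the peripheral tables in the order given by `names`."""
    missing = [name for name in names if name not in peripheral]
    if missing:
        raise KeyError(f"missing peripheral tables: {missing}")
    return tuple(peripheral[name] for name in names)


# Stand-in values; in the notebook these would be the loaded tables.
# The key names are an assumption based on the table names shown later.
peripheral_demo = {
    "location": "<location table>",
    "geographic": "<geographic table>",
}

# The result follows the requested order, not the dictionary's insertion order.
geographic_demo, location_demo = unpack_peripheral(
    peripheral_demo, "geographic", "location"
)
```

Unpacking by name keeps the assignment correct even if the loader returns the tables in a different order.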

In [2]:
generalinfo, peripheral = load_ctu_dataset("restbase")

(
    geographic,
    location,
) = peripheral.values()

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`generalinfo`) comes first. We already set the `target` role for the target column (`review`). If the task is a multiclass classification,
we split the target column into multiple columns in a one-vs-all fashion. In this case, the original target is still available as `review`.
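To make the one-vs-all split concrete, here is a minimal sketch in plain Python. It is not the loader's actual implementation: the function `one_vs_all`, the `prefix=class` column naming, and the example labels are all assumptions for illustration. Since *restbase* is a regression task, no such split is applied to `review` here.

```python
from typing import Dict, Hashable, List, Sequence


def one_vs_all(target: Sequence[Hashable], prefix: str = "target") -> Dict[str, List[int]]:
    """Split a multiclass target into one binary indicator column per class."""
    # Sort by string representation so the column order is deterministic.
    classes = sorted(set(target), key=str)
    return {
        f"{prefix}={cls}": [int(value == cls) for value in target]
        for cls in classes
    }


# Hypothetical class labels, not taken from the restbase data.
labels = ["low", "high", "mid", "high"]
encoded = one_vs_all(labels, prefix="review")
```

Each resulting column marks membership in exactly one class, so every row has a single `1` across all indicator columns.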

In [3]:
# TODO: Annotate remaining columns with roles
generalinfo

| name | review | id_restaurant | label | food_type | city | split |
| --- | --- | --- | --- | --- | --- | --- |
| role | target | unused_string | unused_string | unused_string | unused_string | unused_string |
| 0 | 2.3 | 1 | sparky's diner | 24 hour diner | san francisco | val |
| 1 | 3.8 | 2 | kabul afghan cuisine | afghani | san carlos | train |
| 2 | 4 | 3 | helmand restaurant | afghani | san francisco | val |
| 3 | 3.6 | 4 | afghani house | afghani | sunnyvale | val |
| 4 | 3.7 | 5 | kabul afghan cusine | afghani | sunnyvale | train |
| | ... | ... | ... | ... | ... | ... |
| 9585 | 3 | 9586 | yogurt park | yogurt | danville | train |
| 9586 | 2.3 | 9587 | tcby | yogurt | mountain view | val |
| 9587 | 2.7 | 9588 | mongolian b.b.q. | all you can eat | santa clara | train |
| 9588 | 3.6 | 9589 | bob's donuts | donuts | san francisco | train |


Peripheral tables:

In [4]:
# TODO: Annotate columns with roles
geographic

| name | city | county | region |
| --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string |
| 0 | alameda | alameda county | bay area |
| 1 | alamo | contra costa county | bay area |
| 2 | albany | alameda county | bay area |
| 3 | alviso | santa clara county | bay area |
| 4 | american canyon | unknown | bay area |
| | ... | ... | ... |
| 163 | watsonville | santa cruz county | bay area |
| 164 | west pittsburg | contra costa county | bay area |
| 165 | winters | yolo county | sacramento area |
| 166 | woodside | san mateo county | bay area |


In [5]:
# TODO: Annotate columns with roles
location

| name | id_restaurant | street_num | street_name | city |
| --- | --- | --- | --- | --- |
| role | unused_string | unused_string | unused_string | unused_string |
| 0 | 1 | 242 | church st | san francisco |
| 1 | 2 | 135 | el camino real | san carlos |
| 2 | 3 | 430 | broadway | san francisco |
| 3 | 4 | 1103 | e. el camino real | sunnyvale |
| 4 | 5 | 833 | w. el camino real | sunnyvale |
| | ... | ... | ... | ... |
| 9534 | 9586 | 684 | hartz ave | danville |
| 9535 | 9587 | | st | mountain view |
| 9536 | 9588 | 3380 | el camino real | santa clara |
| 9537 | 9589 | 1621 | polk (sacramento & clay) | san francisco |
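With all three tables inspected, one possible way to organize the remaining role annotations is a per-table plan that is validated before it is applied. The sketch below is only one plausible assignment, not the notebook's intended solution: the plan itself, the helpers `check_plan` and `apply_roles`, and the stand-in class `RecordingTable` are hypothetical. It also assumes that the tables expose `set_role(cols, role)` and accept role names as plain strings; if that does not hold in your getml version, the constants in `getml.data.roles` can be substituted.

```python
from typing import Dict, List

# Hypothetical role plan: role name -> columns, per table.
# `review` already has the target role; `split` and `street_num` stay unused.
ROLE_PLAN: Dict[str, Dict[str, List[str]]] = {
    "generalinfo": {
        "join_key": ["id_restaurant", "city"],
        "categorical": ["food_type"],
        "text": ["label"],
    },
    "geographic": {
        "join_key": ["city"],
        "categorical": ["county", "region"],
    },
    "location": {
        "join_key": ["id_restaurant"],
        "categorical": ["city"],
        "text": ["street_name"],
    },
}


def check_plan(plan: Dict[str, Dict[str, List[str]]]) -> None:
    """Raise if any column is assigned more than one role within a table."""
    for table_name, roles in plan.items():
        seen = set()
        for columns in roles.values():
            for column in columns:
                if column in seen:
                    raise ValueError(f"{table_name}.{column} has more than one role")
                seen.add(column)


def apply_roles(table, roles: Dict[str, List[str]]) -> None:
    """Call table.set_role(columns, role) once per role in the plan."""
    for role, columns in roles.items():
        table.set_role(columns, role)


class RecordingTable:
    """Stand-in for a data frame; it only records set_role calls."""

    def __init__(self) -> None:
        self.calls = []

    def set_role(self, cols, role) -> None:
        self.calls.append((role, list(cols)))


check_plan(ROLE_PLAN)
demo_table = RecordingTable()
apply_roles(demo_table, ROLE_PLAN["geographic"])
```

In the notebook, the same helper would be called on the real tables, for example `apply_roles(geographic, ROLE_PLAN["geographic"])`. Whether `city` serves better as a join key or as a categorical feature is a modeling decision left to the reader.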


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/restbase](https://relational.fel.cvut.cz/dataset/restbase)
for a description of the dataset.

In [6]:
dm = getml.data.DataModel(population=generalinfo.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
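Before filling in the joins, it can help to check how the candidate keys relate across tables. The sketch below classifies a join by key uniqueness on a handful of rows copied from the outputs above. The helper names are hypothetical, and the `geographic` sample is an assumption: it lists only the `city` key for three cities that appear in `generalinfo`, on the premise that each occurs exactly once in `geographic`. On the real data, the check should be run over the full tables.

```python
from collections import Counter
from typing import Dict, List


def key_is_unique(rows: List[Dict[str, object]], key: str) -> bool:
    """Return True if every value of `key` occurs exactly once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return all(count == 1 for count in counts.values())


def classify_relationship(left_rows, right_rows, key: str) -> str:
    """Label a join from left to right based on key uniqueness on each side."""
    left_unique = key_is_unique(left_rows, key)
    right_unique = key_is_unique(right_rows, key)
    if left_unique and right_unique:
        return "one-to-one"
    if right_unique:
        return "many-to-one"
    if left_unique:
        return "one-to-many"
    return "many-to-many"


# First five rows of generalinfo and location, as displayed above.
generalinfo_sample = [
    {"id_restaurant": 1, "city": "san francisco"},
    {"id_restaurant": 2, "city": "san carlos"},
    {"id_restaurant": 3, "city": "san francisco"},
    {"id_restaurant": 4, "city": "sunnyvale"},
    {"id_restaurant": 5, "city": "sunnyvale"},
]
location_sample = [{"id_restaurant": i} for i in range(1, 6)]
# Assumed rows: these cities are not among the displayed geographic rows.
geographic_sample = [
    {"city": "san francisco"},
    {"city": "san carlos"},
    {"city": "sunnyvale"},
]

to_location = classify_relationship(generalinfo_sample, location_sample, "id_restaurant")
to_geographic = classify_relationship(generalinfo_sample, geographic_sample, "city")
```

On this sample, `id_restaurant` is unique on both sides, while `city` repeats in `generalinfo` but not in the assumed `geographic` rows. A label obtained this way could then inform the `relationship` argument when completing `dm.population.join(...)`.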

Now we can create the container and add the tables to it.

In [7]:
container = getml.data.Container(population=generalinfo, split=generalinfo.split)
container.add(**peripheral)

container

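| | subset | name | rows | type |
| --- | --- | --- | --- | --- |
| 0 | train | generalinfo | 6713 | View |
| 1 | val | generalinfo | 2877 | View |

| | name | rows | type |
| --- | --- | --- | --- |
| 0 | geographic | 168 | DataFrame |
| 1 | location | 9539 | DataFrame |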
