In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("sakila")

# Task: sakila
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *sakila* dataset is a synthetic database designed to simulate a movie rental business. The task is a *regression* task, with the target column being `amount` in the `payment` table, representing the payment amount for rentals.
> 
> **Data Model:**
> - **Tables:** 16 (payment, film_text, rental, customer, inventory, film_actor, film_category, store, actor, film, category, staff, address, city, country, language)
> - **Columns:**
>   - **payment:** Details of payments, including the target column `amount`.
>   - **film_text:** Textual information about films.
>   - **rental:** Rental transaction details.
>   - **customer:** Customer information.
>   - **inventory:** Inventory details.
>   - **film_actor:** Relationship between films and actors.
>   - **film_category:** Relationship between films and categories.
>   - **store:** Store information.
>   - **actor:** Actor details.
>   - **film:** Film information.
>   - **category:** Film categories.
>   - **staff:** Staff details.
>   - **address:** Address information.
>   - **city:** City details.
>   - **country:** Country information.
>   - **language:** Language details.
> 
> **Task and Target:**
> - **Task:** Regression
> - **Target Column:** `amount` (in the payment table)
> 
> **Metadata:**
> - **Size:** 6.4 MB
> - **Number of Rows:** 47,010
> - **Number of Columns:** 89
> - **Missing Values:** Yes
> - **Compound Keys:** No
> - **Loops:** Yes
> - **Type:** Synthetic
> - **Instance Count:** 15,991
> 
> This dataset is commonly used for testing and demonstrating database features, providing a comprehensive view of a movie rental business's operations.

### Tables
Population table: payment

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/sakila.svg" alt="sakila ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. It returns a tuple: the population table is the
first element and a dictionary of peripheral tables is the second.
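
The loading cell in this notebook unpacks `peripheral.values()` positionally, which only works while the dictionary order matches the order of the variable names. As a hedged alternative, the sketch below selects tables by name. It assumes the dictionary is keyed by table name (suggested by the later `**peripheral` calls); `select_tables` is a hypothetical helper, and the dictionary here is a stand-in of placeholder strings rather than real getML data frames.

```python
from typing import Any, Dict, Sequence, Tuple

# Stand-in for the dictionary returned by load_ctu_dataset("sakila").
# Assumption: keys are table names. Values are placeholder strings here.
TABLE_NAMES = [
    "actor", "address", "category", "city", "country", "customer", "film",
    "film_actor", "film_category", "film_text", "inventory", "language",
    "rental", "staff", "store",
]
peripheral: Dict[str, Any] = {name: f"<{name} table>" for name in TABLE_NAMES}


def select_tables(tables: Dict[str, Any], names: Sequence[str]) -> Tuple[Any, ...]:
    """Return the requested tables in the requested order, by name."""
    missing = [name for name in names if name not in tables]
    if missing:
        # Fail loudly instead of silently binding the wrong table to a name.
        raise KeyError(f"missing tables: {missing}")
    return tuple(tables[name] for name in names)


rental, inventory, film = select_tables(peripheral, ["rental", "inventory", "film"])
```

With the real dictionary, the same call would bind `rental`, `inventory`, and `film` regardless of the order in which the tables were loaded.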

In [2]:
payment, peripheral = load_ctu_dataset("sakila")

(
    actor,
    address,
    category,
    city,
    country,
    customer,
    film,
    film_actor,
    film_category,
    film_text,
    inventory,
    language,
    rental,
    staff,
    store,
) = peripheral.values()


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table (`payment`).

The `target` role is already set for `amount`, the target column of this
regression task. All other columns still carry an `unused_*` role.
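
As a starting point for the remaining annotations, the sketch below proposes roles for the `payment` columns from simple name patterns. This is a naming heuristic, not the notebook's intended solution: the column list is copied from the `payment` output, `propose_roles` is a hypothetical helper, and the role names are written as plain strings that mirror getML's `join_key` and `time_stamp` roles.

```python
from typing import Dict, List, Sequence

# Column names copied from the `payment` table shown in this notebook.
PAYMENT_COLUMNS = [
    "amount", "payment_id", "customer_id", "staff_id", "rental_id",
    "payment_date", "last_update", "split",
]


def propose_roles(
    columns: Sequence[str],
    target: str = "amount",
    skip: Sequence[str] = ("split", "last_update"),
) -> Dict[str, List[str]]:
    """Group columns by a proposed role, based only on their names."""
    plan: Dict[str, List[str]] = {
        "target": [], "join_key": [], "time_stamp": [], "unused": [],
    }
    for col in columns:
        if col == target:
            plan["target"].append(col)
        elif col in skip:
            plan["unused"].append(col)       # bookkeeping columns stay unused
        elif col.endswith("_id"):
            plan["join_key"].append(col)     # candidate keys for joins
        elif col.endswith("_date"):
            plan["time_stamp"].append(col)   # candidate time stamps
        else:
            plan["unused"].append(col)
    return plan


plan = propose_roles(PAYMENT_COLUMNS)

# Hedged usage with getML (requires a running engine, so left as a comment):
# payment.set_role(plan["join_key"], getml.data.roles.join_key)
# payment.set_role(plan["time_stamp"], getml.data.roles.time_stamp)
```

The proposal still needs a manual check. For example, `payment_id` matches the `_id` pattern but is the table's own identifier, and the time stamp strings may need an explicit `time_formats` argument to be parsed.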

In [3]:
# TODO: Annotate remaining columns with roles
payment

name,amount,payment_id,customer_id,staff_id,rental_id,payment_date,last_update,split
role,target,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string
0.0,2.99,1,1,1,76,2005-05-25 11:30:37.000000,2006-02-15 21:12:30.000000,train
1.0,0.99,2,1,1,573,2005-05-28 10:35:23.000000,2006-02-15 21:12:30.000000,train
2.0,5.99,3,1,1,1185,2005-06-15 00:54:12.000000,2006-02-15 21:12:30.000000,train
3.0,0.99,4,1,2,1422,2005-06-15 18:02:53.000000,2006-02-15 21:12:30.000000,train
4.0,9.99,5,1,2,1476,2005-06-15 21:08:46.000000,2006-02-15 21:12:30.000000,val
,...,...,...,...,...,...,...,...
16044.0,4.99,16045,599,1,14599,2005-08-21 17:43:42.000000,2006-02-15 21:24:12.000000,train
16045.0,1.99,16046,599,1,14719,2005-08-21 21:41:57.000000,2006-02-15 21:24:12.000000,train
16046.0,8.99,16047,599,2,15590,2005-08-23 06:09:44.000000,2006-02-15 21:24:12.000000,train
16047.0,2.99,16048,599,2,15719,2005-08-23 11:08:46.000000,2006-02-15 21:24:13.000000,train


Next, the peripheral tables:

In [4]:
# TODO: Annotate columns with roles
actor

name,actor_id,first_name,last_name,last_update
role,unused_float,unused_string,unused_string,unused_string
0.0,1,PENELOPE,GUINESS,2006-02-15 03:34:33.000000
1.0,2,NICK,WAHLBERG,2006-02-15 03:34:33.000000
2.0,3,ED,CHASE,2006-02-15 03:34:33.000000
3.0,4,JENNIFER,DAVIS,2006-02-15 03:34:33.000000
4.0,5,JOHNNY,LOLLOBRIGIDA,2006-02-15 03:34:33.000000
,...,...,...,...
195.0,196,BELA,WALKEN,2006-02-15 03:34:33.000000
196.0,197,REESE,WEST,2006-02-15 03:34:33.000000
197.0,198,MARY,KEITEL,2006-02-15 03:34:33.000000
198.0,199,JULIA,FAWCETT,2006-02-15 03:34:33.000000


In [5]:
# TODO: Annotate columns with roles
address

name,address_id,city_id,address,address2,district,postal_code,phone,last_update
role,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,300,47 MySakila Drive,,Alberta,,,2006-02-15 03:45:30.000000
1.0,2,576,28 MySQL Boulevard,,QLD,,,2006-02-15 03:45:30.000000
2.0,3,300,23 Workhaven Lane,,Alberta,,14033335568,2006-02-15 03:45:30.000000
3.0,4,576,1411 Lillydale Drive,,QLD,,6172235589,2006-02-15 03:45:30.000000
4.0,5,463,1913 Hanoi Way,,Nagasaki,35200,28303384290,2006-02-15 03:45:30.000000
,...,...,...,...,...,...,...,...
598.0,601,242,844 Bucuresti Place,,Liaoning,36603,935952366111,2006-02-15 03:45:30.000000
599.0,602,401,1101 Bucuresti Boulevard,,West Greece,97661,199514580428,2006-02-15 03:45:30.000000
600.0,603,503,1103 Quilmes Boulevard,,Piura,52137,644021380889,2006-02-15 03:45:30.000000
601.0,604,296,1331 Usak Boulevard,,Vaud,61960,145308717464,2006-02-15 03:45:30.000000


In [6]:
# TODO: Annotate columns with roles
category

name,category_id,name,last_update
role,unused_float,unused_string,unused_string
0.0,1,Action,2006-02-15 03:46:27.000000
1.0,2,Animation,2006-02-15 03:46:27.000000
2.0,3,Children,2006-02-15 03:46:27.000000
3.0,4,Classics,2006-02-15 03:46:27.000000
4.0,5,Comedy,2006-02-15 03:46:27.000000
,...,...,...
11.0,12,Music,2006-02-15 03:46:27.000000
12.0,13,New,2006-02-15 03:46:27.000000
13.0,14,Sci-Fi,2006-02-15 03:46:27.000000
14.0,15,Sports,2006-02-15 03:46:27.000000


In [7]:
# TODO: Annotate columns with roles
city

name,city_id,country_id,city,last_update
role,unused_float,unused_float,unused_string,unused_string
0.0,1,87,A Corua (La Corua),2006-02-15 03:45:25.000000
1.0,2,82,Abha,2006-02-15 03:45:25.000000
2.0,3,101,Abu Dhabi,2006-02-15 03:45:25.000000
3.0,4,60,Acua,2006-02-15 03:45:25.000000
4.0,5,97,Adana,2006-02-15 03:45:25.000000
,...,...,...,...
595.0,596,69,Zaria,2006-02-15 03:45:25.000000
596.0,597,80,Zeleznogorsk,2006-02-15 03:45:25.000000
597.0,598,51,Zhezqazghan,2006-02-15 03:45:25.000000
598.0,599,23,Zhoushan,2006-02-15 03:45:25.000000


In [8]:
# TODO: Annotate columns with roles
country

name,country_id,country,last_update
role,unused_float,unused_string,unused_string
0.0,1,Afghanistan,2006-02-15 03:44:00.000000
1.0,2,Algeria,2006-02-15 03:44:00.000000
2.0,3,American Samoa,2006-02-15 03:44:00.000000
3.0,4,Angola,2006-02-15 03:44:00.000000
4.0,5,Anguilla,2006-02-15 03:44:00.000000
,...,...,...
104.0,105,Vietnam,2006-02-15 03:44:00.000000
105.0,106,"Virgin Islands, U.S.",2006-02-15 03:44:00.000000
106.0,107,Yemen,2006-02-15 03:44:00.000000
107.0,108,Yugoslavia,2006-02-15 03:44:00.000000


In [9]:
# TODO: Annotate columns with roles
customer

name,customer_id,store_id,address_id,active,first_name,last_name,email,create_date,last_update
role,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string
0.0,1,1,5,1,MARY,SMITH,MARY.SMITH@sakilacustomer.org,2006-02-14 22:04:36.000000,2006-02-15 03:57:20.000000
1.0,2,1,6,1,PATRICIA,JOHNSON,PATRICIA.JOHNSON@sakilacustomer....,2006-02-14 22:04:36.000000,2006-02-15 03:57:20.000000
2.0,3,1,7,1,LINDA,WILLIAMS,LINDA.WILLIAMS@sakilacustomer.or...,2006-02-14 22:04:36.000000,2006-02-15 03:57:20.000000
3.0,4,2,8,1,BARBARA,JONES,BARBARA.JONES@sakilacustomer.org,2006-02-14 22:04:36.000000,2006-02-15 03:57:20.000000
4.0,5,1,9,1,ELIZABETH,BROWN,ELIZABETH.BROWN@sakilacustomer.o...,2006-02-14 22:04:36.000000,2006-02-15 03:57:20.000000
,...,...,...,...,...,...,...,...,...
594.0,595,1,601,1,TERRENCE,GUNDERSON,TERRENCE.GUNDERSON@sakilacustome...,2006-02-14 22:04:37.000000,2006-02-15 03:57:20.000000
595.0,596,1,602,1,ENRIQUE,FORSYTHE,ENRIQUE.FORSYTHE@sakilacustomer....,2006-02-14 22:04:37.000000,2006-02-15 03:57:20.000000
596.0,597,1,603,1,FREDDIE,DUGGAN,FREDDIE.DUGGAN@sakilacustomer.or...,2006-02-14 22:04:37.000000,2006-02-15 03:57:20.000000
597.0,598,1,604,1,WADE,DELVALLE,WADE.DELVALLE@sakilacustomer.org,2006-02-14 22:04:37.000000,2006-02-15 03:57:20.000000


In [10]:
# TODO: Annotate columns with roles
film

name,film_id,release_year,language_id,original_language_id,rental_duration,rental_rate,length,replacement_cost,title,description,last_update
role,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string
0.0,1,2006,1,,6,0.99,86,20.99,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a...,2006-02-15 04:03:42.000000
1.0,2,2006,1,,3,4.99,48,12.99,ACE GOLDFINGER,A Astounding Epistle of a Databa...,2006-02-15 04:03:42.000000
2.0,3,2006,1,,7,2.99,50,18.99,ADAPTATION HOLES,A Astounding Reflection of a Lum...,2006-02-15 04:03:42.000000
3.0,4,2006,1,,5,2.99,117,26.99,AFFAIR PREJUDICE,A Fanciful Documentary of a Fris...,2006-02-15 04:03:42.000000
4.0,5,2006,1,,6,2.99,130,22.99,AFRICAN EGG,A Fast-Paced Documentary of a Pa...,2006-02-15 04:03:42.000000
,...,...,...,...,...,...,...,...,...,...,...
995.0,996,2006,1,,6,0.99,183,9.99,YOUNG LANGUAGE,A Unbelieveable Yarn of a Boat A...,2006-02-15 04:03:42.000000
996.0,997,2006,1,,4,0.99,179,14.99,YOUTH KICK,A Touching Drama of a Teacher An...,2006-02-15 04:03:42.000000
997.0,998,2006,1,,6,0.99,105,10.99,ZHIVAGO CORE,A Fateful Yarn of a Composer And...,2006-02-15 04:03:42.000000
998.0,999,2006,1,,5,2.99,101,28.99,ZOOLANDER FICTION,A Fateful Reflection of a Waitre...,2006-02-15 04:03:42.000000


In [11]:
# TODO: Annotate columns with roles
film_actor

name,actor_id,film_id,last_update
role,unused_float,unused_float,unused_string
0.0,1,1,2006-02-15 04:05:03.000000
1.0,1,23,2006-02-15 04:05:03.000000
2.0,1,25,2006-02-15 04:05:03.000000
3.0,1,106,2006-02-15 04:05:03.000000
4.0,1,140,2006-02-15 04:05:03.000000
,...,...,...
5457.0,200,879,2006-02-15 04:05:03.000000
5458.0,200,912,2006-02-15 04:05:03.000000
5459.0,200,945,2006-02-15 04:05:03.000000
5460.0,200,958,2006-02-15 04:05:03.000000


In [12]:
# TODO: Annotate columns with roles
film_category

name,film_id,category_id,last_update
role,unused_float,unused_float,unused_string
0.0,1,6,2006-02-15 04:07:09.000000
1.0,2,11,2006-02-15 04:07:09.000000
2.0,3,6,2006-02-15 04:07:09.000000
3.0,4,11,2006-02-15 04:07:09.000000
4.0,5,8,2006-02-15 04:07:09.000000
,...,...,...
995.0,996,6,2006-02-15 04:07:09.000000
996.0,997,12,2006-02-15 04:07:09.000000
997.0,998,11,2006-02-15 04:07:09.000000
998.0,999,3,2006-02-15 04:07:09.000000


In [13]:
# TODO: Annotate columns with roles
film_text

name,film_id,title,description
role,unused_float,unused_string,unused_string
0.0,1,ACADEMY DINOSAUR,A Epic Drama of a Feminist And a...
1.0,2,ACE GOLDFINGER,A Astounding Epistle of a Databa...
2.0,3,ADAPTATION HOLES,A Astounding Reflection of a Lum...
3.0,4,AFFAIR PREJUDICE,A Fanciful Documentary of a Fris...
4.0,5,AFRICAN EGG,A Fast-Paced Documentary of a Pa...
,...,...,...
995.0,996,YOUNG LANGUAGE,A Unbelieveable Yarn of a Boat A...
996.0,997,YOUTH KICK,A Touching Drama of a Teacher An...
997.0,998,ZHIVAGO CORE,A Fateful Yarn of a Composer And...
998.0,999,ZOOLANDER FICTION,A Fateful Reflection of a Waitre...


In [14]:
# TODO: Annotate columns with roles
inventory

name,inventory_id,film_id,store_id,last_update
role,unused_float,unused_float,unused_float,unused_string
0.0,1,1,1,2006-02-15 04:09:17.000000
1.0,2,1,1,2006-02-15 04:09:17.000000
2.0,3,1,1,2006-02-15 04:09:17.000000
3.0,4,1,1,2006-02-15 04:09:17.000000
4.0,5,1,2,2006-02-15 04:09:17.000000
,...,...,...,...
4576.0,4577,1000,1,2006-02-15 04:09:17.000000
4577.0,4578,1000,2,2006-02-15 04:09:17.000000
4578.0,4579,1000,2,2006-02-15 04:09:17.000000
4579.0,4580,1000,2,2006-02-15 04:09:17.000000


In [15]:
# TODO: Annotate columns with roles
language

name,language_id,name,last_update
role,unused_float,unused_string,unused_string
0,1,English,2006-02-15 04:02:19.000000
1,2,Italian,2006-02-15 04:02:19.000000
2,3,Japanese,2006-02-15 04:02:19.000000
3,4,Mandarin,2006-02-15 04:02:19.000000
4,4,Mandarin,2006-02-15 04:02:19.000000
5,5,French,2006-02-15 04:02:19.000000


In [16]:
# TODO: Annotate columns with roles
rental

name,rental_id,inventory_id,customer_id,staff_id,rental_date,return_date,last_update
role,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string
0.0,1,367,130,1,2005-05-24 22:53:30.000000,2005-05-26 22:04:30.000000,2006-02-15 20:30:53.000000
1.0,2,1525,459,1,2005-05-24 22:54:33.000000,2005-05-28 19:40:33.000000,2006-02-15 20:30:53.000000
2.0,3,1711,408,1,2005-05-24 23:03:39.000000,2005-06-01 22:12:39.000000,2006-02-15 20:30:53.000000
3.0,4,2452,333,2,2005-05-24 23:04:41.000000,2005-06-03 01:43:41.000000,2006-02-15 20:30:53.000000
4.0,5,2079,222,1,2005-05-24 23:05:21.000000,2005-06-02 04:33:21.000000,2006-02-15 20:30:53.000000
,...,...,...,...,...,...,...
16039.0,16045,772,14,1,2005-08-23 22:25:26.000000,2005-08-25 23:54:26.000000,2006-02-15 20:30:53.000000
16040.0,16046,4364,74,2,2005-08-23 22:26:47.000000,2005-08-27 18:02:47.000000,2006-02-15 20:30:53.000000
16041.0,16047,2088,114,2,2005-08-23 22:42:48.000000,2005-08-25 02:48:48.000000,2006-02-15 20:30:53.000000
16042.0,16048,2019,103,1,2005-08-23 22:43:07.000000,2005-08-31 21:33:07.000000,2006-02-15 20:30:53.000000


In [17]:
# TODO: Annotate columns with roles
staff

name,staff_id,address_id,store_id,active,first_name,last_name,email,username,password,last_update
role,unused_float,unused_float,unused_float,unused_float,unused_string,unused_string,unused_string,unused_string,unused_string,unused_string
0,1,3,1,1,Mike,Hillyer,Mike.Hillyer@sakilastaff.com,Mike,8cb2237d0679ca88db6464eac60da963...,2006-02-15 03:57:16.000000
1,2,4,2,1,Jon,Stephens,Jon.Stephens@sakilastaff.com,Jon,8cb2237d0679ca88db6464eac60da963...,2006-02-15 03:57:16.000000


In [18]:
# TODO: Annotate columns with roles
store

name,store_id,manager_staff_id,address_id,last_update
role,unused_float,unused_float,unused_float,unused_string
0,1,1,1,2006-02-15 03:57:12.000000
1,2,2,2,2006-02-15 03:57:12.000000


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/sakila](https://relational.fel.cvut.cz/dataset/sakila)
for a description of the dataset.
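
One possible way to organise the joins is to write them down as data first and apply them afterwards. The sketch below lists a single tree-shaped path read off the key columns shown in the table outputs above (`payment` → `rental` → `inventory` → `film` → `film_category` → `category`). It is an illustration, not the intended solution. `JOIN_PATH` and `apply_joins` are hypothetical names, and the helper assumes that the data model exposes `dm.population`, exposes each peripheral placeholder as an attribute named after its table, and that placeholders offer `join(..., on=...)`.

```python
from typing import Any, List, Tuple

# (left table, right table, shared key column), taken from the column
# listings in this notebook. Only one branch of the schema is covered.
JOIN_PATH: List[Tuple[str, str, str]] = [
    ("payment", "rental", "rental_id"),
    ("rental", "inventory", "inventory_id"),
    ("inventory", "film", "film_id"),
    ("film", "film_category", "film_id"),
    ("film_category", "category", "category_id"),
]


def apply_joins(dm: Any, joins: List[Tuple[str, str, str]],
                population: str = "payment") -> None:
    """Apply each (left, right, key) join to a DataModel-like object."""
    for left, right, key in joins:
        # The population placeholder is reached via `dm.population`;
        # peripheral placeholders are looked up by table name.
        left_ph = dm.population if left == population else getattr(dm, left)
        left_ph.join(getattr(dm, right), on=key)


# Hedged usage with the `dm` defined in the next cell (needs getML):
# apply_joins(dm, JOIN_PATH)
```

Because the dataset description reports loops in the schema, the path is kept tree-shaped on purpose. The key columns would also need the `join_key` role before they can be used in `on=`.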

In [19]:
dm = getml.data.DataModel(population=payment.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)

Now we can create the container and add the tables to it.

In [20]:
container = getml.data.Container(population=payment, split=payment.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,payment,11235,View
1,val,payment,4814,View

,name,rows,type
0,actor,200,DataFrame
1,address,603,DataFrame
2,category,16,DataFrame
3,city,600,DataFrame
4,country,109,DataFrame
,...,...,...
10,inventory,4581,DataFrame
11,language,6,DataFrame
12,rental,16044,DataFrame
13,staff,2,DataFrame
