In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("imdb_ijs")

# Task: imdb_ijs
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> It seems there is no additional information available for the *imdb_ijs* dataset. However, based on the provided data model, here is a description:
> 
> The *imdb_ijs* dataset is structured to represent information about movies, directors, actors, and their associated genres and roles. It is used for classification tasks, likely focusing on predicting categories such as movie genres or roles.
> 
> **Data Model:**
> - The dataset consists of several tables: `directors_genres`, `movies_directors`, `movies_genres`, `roles`, `directors`, `movies`, and `actors`.
> - Key attributes include `director_id`, `movie_id`, `actor_id`, `genre`, `role`, and `rank`.
> 
> **Task and Target Column:**
> - The primary task is *classification*. The specific target column isn't explicitly defined, but it could involve predicting genres or roles based on other attributes.
> 
> **Column Types:**
> - The dataset includes:
>   - *Int* (e.g., `director_id`, `movie_id`, `actor_id`)
>   - *Varchar* (e.g., `genre`, `role`, `name`)
>   - *Float* (e.g., `prob`, `rank`)
>   - *Char* (e.g., `gender`)
> 
> **Metadata:**
> - The dataset includes tables for directors, movies, actors, and their relationships, but specific metadata like size or number of rows is not provided.
> 
> This dataset is useful for exploring relationships in the film industry, such as analyzing the collaboration between directors and actors or the distribution of movie genres.

### Tables
Population table: actors

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/imdb_ijs.svg" alt="imdb_ijs ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. This function returns a tuple with the population
table as the first element and a dictionary of peripheral tables as the second element.
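
Because the second element is a dictionary, the peripheral tables can also be pulled out by name instead of by position. The sketch below assumes the dictionary is keyed by table name (the later `container.add(**peripheral)` call suggests this); `unpack_by_name` and `PERIPHERAL_NAMES` are hypothetical helpers, not part of `challenge.utils`.

```python
# Table names as they appear in the ER diagram and the container summary.
PERIPHERAL_NAMES = (
    "directors",
    "directors_genres",
    "movies",
    "movies_directors",
    "movies_genres",
    "roles",
)


def unpack_by_name(peripheral, names=PERIPHERAL_NAMES):
    """Return the tables in the order given by `names`, failing loudly on gaps."""
    missing = [name for name in names if name not in peripheral]
    if missing:
        raise KeyError(f"missing peripheral tables: {missing}")
    return tuple(peripheral[name] for name in names)
```

With the real data this would be called as `directors, directors_genres, movies, movies_directors, movies_genres, roles = unpack_by_name(peripheral)`, which does not depend on the insertion order of the dictionary.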

In [2]:
actors, peripheral = load_ctu_dataset("imdb_ijs")

(
    directors,
    directors_genres,
    movies,
    movies_directors,
    movies_genres,
    roles,
) = peripheral.values()


Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table is `actors`. The `target` role is already set on `gender`,
which is the target column for a binary classification task.

In [3]:
# TODO: Annotate remaining columns with roles
actors

name,gender,id,first_name,last_name,split
role,target,unused_float,unused_string,unused_string,unused_string
0.0,0,2,Michael,'babeepower' Viera,train
1.0,0,3,Eloy,'Chincheta',train
2.0,0,4,Dieguito,'El Cigala',val
3.0,0,5,Antonio,'El de Chipiona',train
4.0,0,6,José,'El Francés',val
,...,...,...,...,...
817713.0,1,845461,Herdís,Þorvaldsdóttir,train
817714.0,1,845462,Katla Margrét,Þorvaldsdóttir,val
817715.0,1,845463,Lilja Nótt,Þórarinsdóttir,val
817716.0,1,845464,Hólmfríður,Þórhallsdóttir,train


The peripheral tables:

In [4]:
# TODO: Annotate columns with roles
directors

name,id,first_name,last_name
role,unused_float,unused_string,unused_string
0.0,1,Todd,1
1.0,2,Les,12 Poissons
2.0,3,Lejaren,a'Hiller
3.0,4,Nian,A
4.0,5,Khairiya,A-Mansour
,...,...,...
86875.0,88797,Yusuf,Ünal
86876.0,88798,Ahmet,Ündag
86877.0,88799,Idil,Üner
86878.0,88800,Yüksel,Ünsal


In [5]:
# TODO: Annotate columns with roles
directors_genres

name,director_id,prob,genre
role,unused_float,unused_float,unused_string
0.0,2,1,Short
1.0,3,1,Drama
2.0,5,1,Documentary
3.0,6,1,Drama
4.0,6,1,Short
,...,...,...
156557.0,88797,1,Drama
156558.0,88798,1,Adventure
156559.0,88799,1,Short
156560.0,88800,1,Animation


In [6]:
# TODO: Annotate columns with roles
movies

name,id,year,rank,name
role,unused_float,unused_float,unused_float,unused_string
0.0,0,2002,,#28
1.0,1,2000,,"#7 Train: An Immigrant Journey, ..."
2.0,2,1971,6.4,$
3.0,3,1913,,"$1,000 Reward"
4.0,4,1915,,"$1,000 Reward"
,...,...,...,...
388264.0,412316,1991,,"""zem blch krlu"""
388265.0,412317,1995,,"""rgammk"""
388266.0,412318,2002,,"""zgnm Leyla"""
388267.0,412319,1983,,""" Istanbul"""


In [7]:
# TODO: Annotate columns with roles
movies_directors

name,director_id,movie_id
role,unused_float,unused_float
0.0,1,378879
1.0,2,281325
2.0,3,30621
3.0,3,304743
4.0,4,60570
,...,...
371175.0,88797,172648
371176.0,88798,350996
371177.0,88799,189713
371178.0,88800,105513


In [8]:
# TODO: Annotate columns with roles
movies_genres

name,movie_id,genre
role,unused_float,unused_string
0.0,1,Documentary
1.0,1,Short
2.0,2,Comedy
3.0,2,Crime
4.0,5,Western
,...,...
395114.0,378612,Adventure
395115.0,378612,Drama
395116.0,378613,Comedy
395117.0,378613,Drama


In [9]:
# TODO: Annotate columns with roles
roles

name,actor_id,movie_id,role
role,unused_float,unused_float,unused_string
0.0,2,280088,Stevie
1.0,2,396232,Various/lyricist
2.0,3,376687,Gitano 1
3.0,4,336265,El Cigala
4.0,5,135644,Himself
,...,...,...
3431961.0,845461,137097,Kata
3431962.0,845462,208838,Magga
3431963.0,845463,870,Gunna
3431964.0,845464,378123,Gudrun
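
The `# TODO` cells above leave almost every column with an `unused_*` role. One possible assignment is sketched below; it is a starting point, not the intended solution. The plan itself (`ROLE_PLAN`) and the helper `annotate` are hypothetical. The sketch assumes the tables are `getml.data.DataFrame` objects exposing `set_role(cols, role)` and that role names can be passed as plain strings matching the constants in `getml.data.roles`. Name columns are deliberately left unassigned.

```python
# Hypothetical role plan: table name -> {column name: role name}.
# Role names are plain strings, assumed to match getml.data.roles constants.
ROLE_PLAN = {
    "actors": {"id": "join_key"},
    "directors": {"id": "join_key"},
    "directors_genres": {
        "director_id": "join_key",
        "prob": "numerical",
        "genre": "categorical",
    },
    "movies": {"id": "join_key", "year": "numerical", "rank": "numerical"},
    "movies_directors": {"director_id": "join_key", "movie_id": "join_key"},
    "movies_genres": {"movie_id": "join_key", "genre": "categorical"},
    "roles": {"actor_id": "join_key", "movie_id": "join_key", "role": "text"},
}


def annotate(tables, plan=ROLE_PLAN):
    """Call set_role on every (table, column) pair listed in the plan."""
    for table_name, column_roles in plan.items():
        df = tables[table_name]
        for column, role in column_roles.items():
            df.set_role([column], role)
    return tables
```

A call such as `annotate({"actors": actors, **peripheral})` would apply the plan in place. Whether `role` in the `roles` table works better as `text` or `categorical` is a modeling choice worth checking against the data.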


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/imdb_ijs](https://relational.fel.cvut.cz/dataset/imdb_ijs)
for a description of the dataset.

In [10]:
dm = getml.data.DataModel(population=actors.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
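
The joins left as a TODO above could follow the foreign keys in the ER diagram. The sketch below is one reading of that diagram, not a confirmed solution. `JOIN_PLAN` and `apply_joins` are hypothetical names. The sketch assumes that peripheral placeholders are reachable as attributes of the data model (for example `dm.roles`), that `Placeholder.join` accepts a `(left_key, right_key)` tuple for `on`, and that the key columns already carry the `join_key` role.

```python
# Hypothetical join plan: (left placeholder, right placeholder, (left key, right key)).
JOIN_PLAN = [
    ("population", "roles", ("id", "actor_id")),
    ("roles", "movies", ("movie_id", "id")),
    ("movies", "movies_genres", ("id", "movie_id")),
    ("movies", "movies_directors", ("id", "movie_id")),
    ("movies_directors", "directors", ("director_id", "id")),
    ("directors", "directors_genres", ("id", "director_id")),
]


def apply_joins(dm, plan=JOIN_PLAN):
    """Join placeholders pairwise, walking outward from the population table."""
    for left, right, on in plan:
        getattr(dm, left).join(getattr(dm, right), on=on)
    return dm
```

Calling `apply_joins(dm)` after the two lines above would produce a snowflake schema rooted at `actors`. The default relationship is used for every join here; tightening it (for example to many-to-one for `roles` to `movies`) is left open.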

Now we can create the container and add the tables to it.

In [11]:
container = getml.data.Container(population=actors, split=actors.split)
container.add(**peripheral)

container

Unnamed: 0,subset,name,rows,type
0,train,actors,572403,View
1,val,actors,245315,View

Unnamed: 0,name,rows,type
0,directors,86880,DataFrame
1,directors_genres,156562,DataFrame
2,movies,388269,DataFrame
3,movies_directors,371180,DataFrame
4,movies_genres,395119,DataFrame
5,roles,3431966,DataFrame
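
As a quick sanity check, the subset sizes in the container summary can be turned into split fractions. The row counts below are copied from the output above; `split_fractions` is a hypothetical helper.

```python
# Row counts taken from the container summary above.
SUBSET_ROWS = {"train": 572403, "val": 245315}


def split_fractions(rows):
    """Return each subset's share of the total number of rows."""
    total = sum(rows.values())
    return {name: count / total for name, count in rows.items()}


fractions = split_fractions(SUBSET_ROWS)
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
```

The two subsets come out at roughly 70% and 30% of the 817,718 rows in `actors`.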
