In [1]:
import getml
from challenge.utils.data import load_ctu_dataset

getml.set_project("imdb_ijs")

# Task: imdb_ijs
### Dataset Description
> <span style="font-weight: 500; color: #3b3b3b;">ⓘ️&nbsp; Generated by `gpt-4o`</span>
>
> The *imdb_ijs* dataset is used here for a binary classification task: predicting the *gender* column of the **actors** table.
> 
> **Data Model:**
> 
> - **directors_genres** table:
>   - *director_id*: int
>   - *genre*: varchar
>   - *prob*: float
> 
> - **movies_directors** table:
>   - *director_id*: int
>   - *movie_id*: int
> 
> - **movies_genres** table:
>   - *movie_id*: int
>   - *genre*: varchar
> 
> - **roles** table:
>   - *actor_id*: int
>   - *movie_id*: int
>   - *role*: varchar
> 
> - **directors** table:
>   - *id*: int
>   - *first_name*: varchar
>   - *last_name*: varchar
> 
> - **movies** table:
>   - *id*: int
>   - *name*: varchar
>   - *year*: int
>   - *rank*: float
> 
> - **actors** table:
>   - *id*: int
>   - *first_name*: varchar
>   - *last_name*: varchar
>   - *gender*: char
> 
> **Metadata:**
> 
> - The tables link movies, directors, actors, and genres through shared ID columns (*movie_id*, *director_id*, *actor_id*).
> - In this task, the population table is **actors** and the target is its *gender* column.
> 
> Datasets of this kind lend themselves to film-industry analytics, for example predicting movie ratings or classifying films by genre.

### Tables
Population table: actors

<h4>
  <details open>
     <summary>ER Diagram</summary>
       <img src="https://relational.fel.cvut.cz/assets/img/datasets-generated/imdb_ijs.svg" alt="imdb_ijs ER Diagram">
   </details>
</h4>

To load the dataset, we use the `load_ctu_dataset` function from the
`challenge.utils.data` module. It returns a tuple: the population table as the
first element and a dictionary of peripheral tables as the second.

In [2]:
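actors, peripheral = load_ctu_dataset("imdb_ijs")

# NOTE: unpacking by position depends on the iteration order of `peripheral`.
# The previews below show that this order does not match the names used here
# (for example, `movies` ends up holding the movies_directors columns), so
# looking tables up by key, e.g. peripheral["movies"], is the safer choice.
(
    directors_genres,
    movies,
    movies_genres,
    movies_directors,
    directors,
    roles,
) = peripheral.values()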

Analyzing schema:   0%|          | 0/7 [00:00<?, ?it/s]

Downloading tables:   0%|          | 0/7 [00:00<?, ?it/s]

Building data:   0%|          | 0/7 [00:00<?, ?it/s]
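The difference between positional unpacking and key-based lookup can be sketched with a plain dictionary. This is only an illustration: the string values stand in for the real tables, and the key order is an assumption taken from the order in which the container lists the peripheral tables at the end of this notebook.

```python
# Stand-in for the `peripheral` dict returned by load_ctu_dataset. The string
# values are placeholders for the real tables; the key order is an assumption
# based on the order in which the container lists the peripheral tables.
peripheral = {
    "directors_genres": "<directors_genres>",
    "movies_directors": "<movies_directors>",
    "directors": "<directors>",
    "roles": "<roles>",
    "movies_genres": "<movies_genres>",
    "movies": "<movies>",
}

# Positional unpacking silently follows the dict's iteration order:
# the second value is whatever table happens to come second.
(_, movies_by_position, *_rest) = peripheral.values()

# Key-based lookup does not depend on the iteration order.
movies_by_key = peripheral["movies"]

print(movies_by_position)  # <movies_directors>
print(movies_by_key)       # <movies>
```

Looking tables up by name keeps each variable tied to the table it is named after, regardless of how the loader orders the dictionary.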

Now, we can inspect all tables and annotate the columns with [roles](https://getml.com/latest/user_guide/concepts/annotating_data/).

The population table is `actors`. Its `gender` column already has the `target` role: it is the target of a binary classification task.
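One way to organize the remaining annotations is a small role plan that is applied column by column. The sketch below is hypothetical: `ACTORS_ROLES` and `apply_roles` are illustrative names, the chosen roles are only one plausible option, and it assumes the table object exposes a `set_role(columns, role)` method that accepts role names as strings, as getml data frames are expected to.

```python
# Hypothetical role plan for the `actors` table. `gender` already has the
# target role and `split` is left unused, so neither appears here.
ACTORS_ROLES = {
    "join_key": ["id"],
    "categorical": ["first_name", "last_name"],
}


def apply_roles(frame, plan):
    """Call frame.set_role(columns, role) once for every role in the plan.

    `frame` is assumed to provide set_role(columns, role); the role names
    are passed as plain strings.
    """
    for role, columns in plan.items():
        frame.set_role(columns, role)
    return frame


# Usage with the table loaded above (not executed here):
# apply_roles(actors, ACTORS_ROLES)
```

Keeping the plan in a dictionary makes it easy to review the chosen roles per table before applying them.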

In [3]:
# TODO: Annotate remaining columns with roles
actors

name,gender,id,first_name,last_name,split
role,target,unused_string,unused_string,unused_string,unused_string
0.0,0,2,Michael,'babeepower' Viera,train
1.0,0,3,Eloy,'Chincheta',train
2.0,0,4,Dieguito,'El Cigala',val
3.0,0,5,Antonio,'El de Chipiona',train
4.0,0,6,José,'El Francés',val
,...,...,...,...,...
817713.0,1,845461,Herdís,Þorvaldsdóttir,train
817714.0,1,845462,Katla Margrét,Þorvaldsdóttir,val
817715.0,1,845463,Lilja Nótt,Þórarinsdóttir,val
817716.0,1,845464,Hólmfríður,Þórhallsdóttir,train


The peripheral tables:

In [4]:
# TODO: Annotate columns with roles
directors_genres

name,director_id,genre,prob
role,unused_string,unused_string,unused_string
0.0,2,Short,1
1.0,3,Drama,1
2.0,5,Documentary,1
3.0,6,Drama,1
4.0,6,Short,1
,...,...,...
156557.0,88797,Drama,1
156558.0,88798,Adventure,1
156559.0,88799,Short,1
156560.0,88800,Animation,1


In [5]:
# TODO: Annotate columns with roles
movies

name,director_id,movie_id
role,unused_string,unused_string
0.0,1,378879
1.0,2,281325
2.0,3,30621
3.0,3,304743
4.0,4,60570
,...,...
371175.0,88797,172648
371176.0,88798,350996
371177.0,88799,189713
371178.0,88800,105513


In [6]:
# TODO: Annotate columns with roles
movies_genres

name,id,first_name,last_name
role,unused_string,unused_string,unused_string
0.0,1,Todd,1
1.0,2,Les,12 Poissons
2.0,3,Lejaren,a'Hiller
3.0,4,Nian,A
4.0,5,Khairiya,A-Mansour
,...,...,...
86875.0,88797,Yusuf,Ünal
86876.0,88798,Ahmet,Ündag
86877.0,88799,Idil,Üner
86878.0,88800,Yüksel,Ünsal


In [7]:
# TODO: Annotate columns with roles
movies_directors

name,actor_id,movie_id,role
role,unused_string,unused_string,unused_string
0.0,2,280088,Stevie
1.0,2,396232,Various/lyricist
2.0,3,376687,Gitano 1
3.0,4,336265,El Cigala
4.0,5,135644,Himself
,...,...,...
3431961.0,845461,137097,Kata
3431962.0,845462,208838,Magga
3431963.0,845463,870,Gunna
3431964.0,845464,378123,Gudrun


In [8]:
# TODO: Annotate columns with roles
directors

name,movie_id,genre
role,unused_string,unused_string
0.0,1,Documentary
1.0,1,Short
2.0,2,Comedy
3.0,2,Crime
4.0,5,Western
,...,...
395114.0,378612,Adventure
395115.0,378612,Drama
395116.0,378613,Comedy
395117.0,378613,Drama


In [9]:
# TODO: Annotate columns with roles
roles

name,id,name,year,rank
role,unused_string,unused_string,unused_string,unused_string
0.0,0,#28,2002,
1.0,1,"#7 Train: An Immigrant Journey, ...",2000,
2.0,2,$,1971,6.4
3.0,3,"$1,000 Reward",1913,
4.0,4,"$1,000 Reward",1915,
,...,...,...,...
388264.0,412316,"""zem blch krlu""",1991,
388265.0,412317,"""rgammk""",1995,
388266.0,412318,"""zgnm Leyla""",2002,
388267.0,412319,""" Istanbul""",1983,


The next step is to define the data model. Refer to [https://relational.fel.cvut.cz/dataset/imdb_ijs](https://relational.fel.cvut.cz/dataset/imdb_ijs)
for a description of the dataset.

In [10]:
dm = getml.data.DataModel(population=actors.to_placeholder())
dm.add(getml.data.to_placeholder(**peripheral))

# TODO
# dm.population.join(...)
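Before writing the joins, it can help to list the key pairs implied by the data model in the dataset description and check that every table is reachable from the population table. The sketch below uses plain Python only; `JOINS` and `join_path` are hypothetical helpers, and which joins to include in the final data model remains a modeling choice.

```python
# (left table, right table, left key, right key), read off the data model
# in the dataset description. This is a candidate list, not a fixed recipe.
JOINS = [
    ("actors", "roles", "id", "actor_id"),
    ("roles", "movies", "movie_id", "id"),
    ("movies", "movies_genres", "id", "movie_id"),
    ("movies", "movies_directors", "id", "movie_id"),
    ("movies_directors", "directors", "director_id", "id"),
    ("directors", "directors_genres", "id", "director_id"),
]


def join_path(start, goal, joins=JOINS):
    """Return the chain of tables leading from `start` to `goal`, or None."""
    if start == goal:
        return [start]
    for left, right, _left_key, _right_key in joins:
        if left == start:
            tail = join_path(right, goal, joins)
            if tail is not None:
                return [start] + tail
    return None


print(join_path("actors", "directors_genres"))

# In getml, each pair would then become one join on the placeholders, e.g.
# (assuming `join` accepts a (left_key, right_key) tuple for `on`):
# dm.population.join(dm.roles, on=("id", "actor_id"))
```

Every peripheral table is reachable from `actors` through `roles`, so the first join to define is the one between `actors.id` and `roles.actor_id`.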

Now we can create the container and add the tables to it.

In [11]:
container = getml.data.Container(population=actors, split=actors.split)
container.add(**peripheral)

container

,subset,name,rows,type
0,train,actors,572403,View
1,val,actors,245315,View

,name,rows,type
0,directors_genres,156562,DataFrame
1,movies_directors,371180,DataFrame
2,directors,86880,DataFrame
3,roles,3431966,DataFrame
4,movies_genres,395119,DataFrame
5,movies,388269,DataFrame
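The subset sizes reported by the container can be turned into split proportions with a few lines of arithmetic. The row counts below are copied from the container output above; the variable names are illustrative.

```python
# Row counts of the population subsets, as reported by the container.
subset_rows = {"train": 572403, "val": 245315}

total_rows = sum(subset_rows.values())
shares = {name: rows / total_rows for name, rows in subset_rows.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# train: 70.0%
# val: 30.0%
```

The split is therefore roughly 70/30 between training and validation data.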
