# IMDb - Predicting actors' gender

In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, making it more difficult to extract useful information from them. Therefore, they are ignored in most data science projects on relational data. However, when using a relational learning tool such as getML, we can easily generate simple features from text fields and leverage the information contained therein.

The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.

As an example data set, we use the Internet Movie Database (IMDb), which has been used by previous studies in the relational learning literature. This allows us to benchmark our approach against state-of-the-art algorithms from that literature. We demonstrate that getML outperforms these state-of-the-art algorithms.

Summary:

- Prediction type: __Classification model__
- Domain: __Entertainment__
- Prediction target: __The gender of an actor__ 
- Population size: __817718__

## Background

The data set contains about 800,000 actors. The goal is to predict the gender of said actors based on other information we have about them, such as the movies they have participated in and the roles they have played in these movies.

The data set has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/IMDb) (Motl and Schulte, 2015), which now resides at [relational-data.org](https://relational-data.org/dataset/IMDb).

## Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
from pathlib import Path

from urllib import request

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline  

import getml
from pyspark.sql import SparkSession

getml.engine.launch(home_directory=Path.home(), allow_remote_ips=True, token='token')
getml.engine.set_project('imdb')

getML engine is already running.

Connected to project 'imdb'


<span id='flags'></span>
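In the following, we set some flags that affect the execution of the notebook:
- `USE_FIRST_NAMES`: We don't let the algorithms utilize the information on actors' first names (see [below](#first-names) for an explanation).
- `RUN_SPARK`: We only execute the Spark cells in the productionization section if this flag is set to `True`.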

In [2]:
USE_FIRST_NAMES = False
RUN_SPARK = False

### 1. Loading data

#### 1.1 Download from source

We begin by connecting to the MySQL database that hosts the data and then load the tables from it:

In [3]:
conn = getml.database.connect_mysql(
    host="db.relational-data.org",
    dbname="imdb_ijs",
    port=3306,
    user="guest",
    password="relational"
)

conn

Connection(dbname='imdb_ijs',
           dialect='mysql',
           host='db.relational-data.org',
           port=3306)

In [4]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if getml.data.exists(name):
        return getml.data.load_data_frame(name)
    data_frame = getml.data.DataFrame.from_db(
        name=name,
        table_name=name,
        conn=conn
    )
    data_frame.save()
    return data_frame

In [5]:
actors = load_if_needed("actors")
roles = load_if_needed("roles")
movies = load_if_needed("movies")
movies_genres = load_if_needed("movies_genres")

In [6]:
actors

name,id,first_name,last_name,gender
role,unused_float,unused_string,unused_string,unused_string
0.0,2,Michael,'babeepower' Viera,M
1.0,3,Eloy,'Chincheta',M
2.0,4,Dieguito,'El Cigala',M
3.0,5,Antonio,'El de Chipiona',M
4.0,6,José,'El Francés',M
,...,...,...,...
817713.0,845461,Herdís,Þorvaldsdóttir,F
817714.0,845462,Katla Margrét,Þorvaldsdóttir,F
817715.0,845463,Lilja Nótt,Þórarinsdóttir,F
817716.0,845464,Hólmfríður,Þórhallsdóttir,F


In [7]:
roles

name,actor_id,movie_id,role
role,unused_float,unused_float,unused_string
0.0,2,280088,Stevie
1.0,2,396232,Various/lyricist
2.0,3,376687,Gitano 1
3.0,4,336265,El Cigala
4.0,5,135644,Himself
,...,...,...
3431961.0,845461,137097,Kata
3431962.0,845462,208838,Magga
3431963.0,845463,870,Gunna
3431964.0,845464,378123,Gudrun


In [8]:
movies

name,id,year,rank,name
role,unused_float,unused_float,unused_float,unused_string
0.0,0,2002,,#28
1.0,1,2000,,"#7 Train: An Immigrant Journey, ..."
2.0,2,1971,6.4,$
3.0,3,1913,,"$1,000 Reward"
4.0,4,1915,,"$1,000 Reward"
,...,...,...,...
388264.0,412316,1991,,"""zem blch krlu"""
388265.0,412317,1995,,"""rgammk"""
388266.0,412318,2002,,"""zgnm Leyla"""
388267.0,412319,1983,,""" Istanbul"""


In [9]:
movies_genres

name,movie_id,genre
role,unused_float,unused_string
0.0,1,Documentary
1.0,1,Short
2.0,2,Comedy
3.0,2,Crime
4.0,5,Western
,...,...
395114.0,378612,Adventure
395115.0,378612,Drama
395116.0,378613,Comedy
395117.0,378613,Drama


#### 1.2 Prepare data for getML

getML requires that we define *roles* for each of the columns.

In [10]:
actors["target"] = (actors.gender == 'F')
actors.set_role("id", getml.data.roles.join_key)
actors.set_role("target", getml.data.roles.target)

<span id='first-names'></span>
The benchmark studies do not clearly state whether it is fair game to use the first names of the actors. Using the first names, we can easily increase the predictive accuracy to above 90%. However, doing so basically turns the problem into a first-name identification problem rather than a relational learning problem. This would undermine the point of this notebook, which is to showcase relational learning. Therefore, we assume that using the first names is not allowed. Feel free to set the `USE_FIRST_NAMES` flag [above](#flags) to see how well getML incorporates such straightforward information into its feature logic.
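
To illustrate why first names would turn this into a lookup problem, here is a minimal sketch of a majority-vote classifier keyed on the first name. The training rows are invented toy data and `predict` is a hypothetical helper; this is not part of the getML pipeline and says nothing about the accuracy reachable on the real data.

```python
from collections import Counter, defaultdict

# Toy training rows (invented): (first_name, target), with target 1 for 'F'.
train = [("Maria", 1), ("Maria", 1), ("John", 0), ("Kim", 1), ("Kim", 0), ("Kim", 0)]

# Count how often each target value occurs per first name.
counts = defaultdict(Counter)
for name, target in train:
    counts[name][target] += 1

def predict(first_name, default=0):
    """Return the most frequent target seen for this first name."""
    if first_name not in counts:
        return default  # unseen name: fall back to a default class
    return counts[first_name].most_common(1)[0][0]

print(predict("Maria"), predict("John"), predict("Kim"), predict("Alex"))
```

No relational information (roles, movies, genres) is needed for such a predictor, which is the reason the flag is disabled by default.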

In [11]:
if USE_FIRST_NAMES:
    actors.set_role("first_name", getml.data.roles.text)
actors

name,id,target,first_name,last_name,gender
role,join_key,target,unused_string,unused_string,unused_string
0.0,2,0,Michael,'babeepower' Viera,M
1.0,3,0,Eloy,'Chincheta',M
2.0,4,0,Dieguito,'El Cigala',M
3.0,5,0,Antonio,'El de Chipiona',M
4.0,6,0,José,'El Francés',M
,...,...,...,...,...
817713.0,845461,1,Herdís,Þorvaldsdóttir,F
817714.0,845462,1,Katla Margrét,Þorvaldsdóttir,F
817715.0,845463,1,Lilja Nótt,Þórarinsdóttir,F
817716.0,845464,1,Hólmfríður,Þórhallsdóttir,F


In [12]:
roles.set_role(["actor_id", "movie_id"], getml.data.roles.join_key)
roles.set_role("role", getml.data.roles.text)
roles

name,actor_id,movie_id,role
role,join_key,join_key,text
0.0,2,280088,Stevie
1.0,2,396232,Various/lyricist
2.0,3,376687,Gitano 1
3.0,4,336265,El Cigala
4.0,5,135644,Himself
,...,...,...
3431961.0,845461,137097,Kata
3431962.0,845462,208838,Magga
3431963.0,845463,870,Gunna
3431964.0,845464,378123,Gudrun


In [13]:
movies.set_role("id", getml.data.roles.join_key)
movies.set_role(["year", "rank"], getml.data.roles.numerical)
movies

name,id,year,rank,name
role,join_key,numerical,numerical,unused_string
0.0,0,2002,,#28
1.0,1,2000,,"#7 Train: An Immigrant Journey, ..."
2.0,2,1971,6.4,$
3.0,3,1913,,"$1,000 Reward"
4.0,4,1915,,"$1,000 Reward"
,...,...,...,...
388264.0,412316,1991,,"""zem blch krlu"""
388265.0,412317,1995,,"""rgammk"""
388266.0,412318,2002,,"""zgnm Leyla"""
388267.0,412319,1983,,""" Istanbul"""


In [14]:
movies_genres.set_role("movie_id", getml.data.roles.join_key)
movies_genres.set_role("genre", getml.data.roles.categorical)
movies_genres

name,movie_id,genre
role,join_key,categorical
0.0,1,Documentary
1.0,1,Short
2.0,2,Comedy
3.0,2,Crime
4.0,5,Western
,...,...
395114.0,378612,Adventure
395115.0,378612,Drama
395116.0,378613,Comedy
395117.0,378613,Drama


We need to split our data set into a training, a validation and a testing set:

In [15]:
split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)
split

|     |            |
|-----|------------|
| 0   | train      |
| 1   | validation |
| 2   | train      |
| 3   | validation |
| 4   | validation |
|     | ...        |
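
To make the split proportions concrete, here is a minimal standard-library sketch of a random 70/15/15 assignment. It only illustrates the idea and is not how `getml.data.split.random` is implemented; `assign_subsets` and the seed are hypothetical choices made for this example.

```python
import random
from collections import Counter

def assign_subsets(n_rows, train=0.7, validation=0.15, test=0.15, seed=0):
    """Randomly label each row as 'train', 'validation' or 'test'."""
    rng = random.Random(seed)
    return rng.choices(
        ["train", "validation", "test"],
        weights=[train, validation, test],
        k=n_rows,
    )

labels = assign_subsets(100_000)

# Realized share of each subset; close to, but not exactly, the requested share.
shares = {name: count / len(labels) for name, count in Counter(labels).items()}
print(shares)
```

If each row is drawn independently, as in this sketch, the realized subset sizes only approximate the requested shares.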


In [16]:
container = getml.data.Container(population=actors, split=split)

container.add(
    roles=roles,
    movies=movies,
    movies_genres=movies_genres,
)

container

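|     | subset     | name   | rows   | type |
|-----|------------|--------|--------|------|
| 0   | test       | actors | 122794 | View |
| 1   | train      | actors | 571807 | View |
| 2   | validation | actors | 123117 | View |

|     | name          | rows    | type      |
|-----|---------------|---------|-----------|
| 0   | roles         | 3431966 | DataFrame |
| 1   | movies        | 388269  | DataFrame |
| 2   | movies_genres | 395119  | DataFrame |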


### 2. Predictive modelling

We have loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.

#### 2.1 Define relational model

To get started with relational learning, we need to specify the data model.

In [17]:
dm = getml.data.DataModel("actors")

dm.add(getml.data.to_placeholder(
    roles=roles,
    movies=movies,
    movies_genres=movies_genres,
))

dm.population.join(
    dm.roles,
    on=("id", "actor_id"),
)

dm.roles.join(
    dm.movies,
    on=("movie_id", "id"),
    relationship=getml.data.relationship.many_to_one,
)

dm.movies.join(
    dm.movies_genres,
    on=("id", "movie_id"),
)

dm

|     | data frames   | staging table                  |
|-----|---------------|--------------------------------|
| 0   | actors        | ACTORS__STAGING_TABLE_1        |
| 1   | movies_genres | MOVIES_GENRES__STAGING_TABLE_2 |
| 2   | roles, movies | ROLES__STAGING_TABLE_3         |
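
The chain of joins defined above (actors to roles, roles to movies, movies to genres) can be pictured with a small plain-Python sketch. The rows are invented toy data that reuse the column names of the real tables, and the three aggregations are hand-picked for illustration; they are not the features that getML learns.

```python
from collections import defaultdict

# Toy rows (invented), using the column names of the real tables.
actors = [{"id": 1}, {"id": 2}]
roles = [
    {"actor_id": 1, "movie_id": 10, "role": "Herself"},
    {"actor_id": 1, "movie_id": 11, "role": "Nurse"},
    {"actor_id": 2, "movie_id": 10, "role": "Himself"},
]
movies = {10: {"year": 1999}, 11: {"year": 2003}}  # keyed by movies.id
movies_genres = [
    {"movie_id": 10, "genre": "Documentary"},
    {"movie_id": 11, "genre": "Drama"},
    {"movie_id": 11, "genre": "Comedy"},
]

# One movie can have many genres.
genres_by_movie = defaultdict(list)
for row in movies_genres:
    genres_by_movie[row["movie_id"]].append(row["genre"])

# One actor has many roles; each role points to exactly one movie (many-to-one).
features = {}
for actor in actors:
    actor_roles = [r for r in roles if r["actor_id"] == actor["id"]]
    years = [movies[r["movie_id"]]["year"] for r in actor_roles]
    genres = {g for r in actor_roles for g in genres_by_movie[r["movie_id"]]}
    features[actor["id"]] = {
        "num_roles": len(actor_roles),
        "avg_year": sum(years) / len(years),
        "num_genres": len(genres),
    }

print(features)
```

Because every role matches exactly one movie, the movie columns can simply be attached to the role rows without any aggregation, which fits the data model output listing `roles` and `movies` under a single staging table.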


#### 2.2 getML pipeline

<!-- #### 2.1.1  -->
__Set up the feature learner & predictor__

We can either use the default parameters of the feature learner or more fine-tuned parameters. Fine-tuning the parameters can increase our predictive accuracy to 85%, but the training time then increases to over 4 hours. We therefore use the default parameters.

In [18]:
text_field_splitter = getml.preprocessors.TextFieldSplitter()

mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
)

feature_selector = getml.predictors.XGBoostClassifier()

predictor = getml.predictors.XGBoostClassifier()
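
As a rough intuition for the two preprocessors, the following sketch first splits a text field into individual words and then maps each word to the average target of the rows in which it occurs. This is a simplified illustration under the assumption that the mapping is an aggregation of targets; it does not reproduce the actual behavior of `TextFieldSplitter` or `Mapping`, and the rows and the helper `split_text` are invented for this example.

```python
import re
from collections import defaultdict

# Toy data (invented): role descriptions and the target of the actor who played them.
rows = [
    ("Nurse Mary", 1),
    ("Waitress", 1),
    ("Police Officer", 0),
    ("Nurse", 1),
    ("Officer Jones", 0),
]

def split_text(text):
    """Lower-case a text field and split it into words."""
    return re.findall(r"\w+", text.lower())

# Step 1: one (row index, word) pair per word, i.e. a long table of words.
word_rows = [(i, word) for i, (text, _) in enumerate(rows) for word in split_text(text)]

# Step 2: map each word to the average target of the rows containing it.
targets_by_word = defaultdict(list)
for i, word in word_rows:
    targets_by_word[word].append(rows[i][1])
word_to_value = {w: sum(t) / len(t) for w, t in targets_by_word.items()}

print(word_to_value["nurse"], word_to_value["officer"])
```

Once words carry numerical values like these, they can be aggregated per actor in the same way as any other numerical column.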

__Build the pipeline__

In [19]:
pipe = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=dm,
    preprocessors=[text_field_splitter, mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=[predictor],
    share_selected_features=0.1,
)

#### 2.3 Model training

In [20]:
pipe.check(container.train)

Checking data model...
Staging... 100% |██████████| [elapsed: 00:01, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:22, remaining: 00:00]          
Checking... 100% |██████████| [elapsed: 00:01, remaining: 00:00]          



|     | type | label                  | message |
|-----|------|------------------------|---------|
| 0   | INFO | FOREIGN KEYS NOT FOUND | When joining ROLES__STAGING_TABLE_3 and MOVIES_GENRES__STAGING_TABLE_2 over 'id' and 'movie_id', there are no corresponding entries for 26.899421% of entries in 'id' in 'ROLES__STAGING_TABLE_3'. You might want to double-check your join keys. |
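
The message reports the share of join keys on the left-hand side that have no match on the right-hand side. A minimal sketch of how such a share can be computed is shown below; the key lists are invented toy data and `share_unmatched` is a hypothetical helper, not a getML function.

```python
def share_unmatched(left_keys, right_keys):
    """Fraction of entries in left_keys with no counterpart in right_keys."""
    right = set(right_keys)
    missing = sum(1 for key in left_keys if key not in right)
    return missing / len(left_keys)

# Toy keys (invented): movie ids referenced by roles vs. ids present in movies_genres.
role_movie_ids = [1, 1, 2, 3, 4, 4, 5, 6]
genre_movie_ids = [1, 1, 2, 5, 5]

print(f"{share_unmatched(role_movie_ids, genre_movie_ids):.1%}")
```

A message of type INFO like this one is not an error. Here it presumably reflects movies for which no genre has been recorded.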


In [21]:
pipe.fit(container.train)

Checking data model...
Staging... 100% |██████████| [elapsed: 00:01, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:07, remaining: 00:00]          

To see the issues in full, run .check() on the pipeline.

Staging... 100% |██████████| [elapsed: 00:01, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:06, remaining: 00:00]          
Indexing text fields... 100% |██████████| [elapsed: 00:05, remaining: 00:00]          
FastProp: Trying 226 features... 100% |██████████| [elapsed: 00:20, remaining: 00:00]          
FastProp: Building subfeatures... 100% |██████████| [elapsed: 00:03, remaining: 00:00]          
FastProp: Building features... 100% |██████████| [elapsed: 00:20, remaining: 00:00]          
XGBoost: Training as feature selector... 100% |██████████| [elapsed: 04:60, remaining: 00:00]          
XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:43, remaining: 00:00]          

Trained pipeline.
Time taken: 0h:6

#### 2.4 Model evaluation

In [22]:
pipe.score(container.test)

Staging... 100% |██████████| [elapsed: 00:01, remaining: 00:00]          
Preprocessing... 100% |██████████| [elapsed: 00:07, remaining: 00:00]          
FastProp: Building subfeatures... 100% |██████████| [elapsed: 00:02, remaining: 00:00]          
FastProp: Building features... 113% |███████████| [elapsed: 00:00, remaining: 00:00]          



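|     | date time           | set used | target | accuracy | auc    | cross entropy |
|-----|---------------------|----------|--------|----------|--------|---------------|
| 0   | 2024-02-21 15:07:05 | train    | target | 0.8417   | 0.9139 | 0.3213        |
| 1   | 2024-02-21 15:07:19 | test     | target | 0.842    | 0.9139 | 0.3225        |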
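
For readers unfamiliar with the three reported metrics, the following sketch computes accuracy, AUC and cross entropy from true labels and predicted probabilities using textbook definitions. The labels and probabilities are invented toy values, the 0.5 threshold is an assumption, and the functions are illustrative rather than the ones getML uses internally.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of rows whose thresholded prediction equals the label."""
    return sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob)) / len(y_true)

def cross_entropy(y_true, y_prob, eps=1e-15):
    """Average negative log-likelihood of the labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def auc(y_true, y_prob):
    """Probability that a random positive is scored above a random negative."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and predicted probabilities (invented).
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]

print(accuracy(y_true, y_prob), auc(y_true, y_prob), cross_entropy(y_true, y_prob))
```

Accuracy depends on the chosen threshold, whereas AUC and cross entropy are computed directly from the predicted probabilities.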


#### 2.5 Features

The most important feature looks as follows:

In [23]:
pipe.features.to_sql()[pipe.features.sort(by="importances")[0].name]

```sql
DROP TABLE IF EXISTS "FEATURE_1_114";

CREATE TABLE "FEATURE_1_114" AS
SELECT AVG( COALESCE( f_1_1_13."feature_1_1_13", 0.0 ) ) AS "feature_1_114",
       t1.rowid AS rownum
FROM "ACTORS__STAGING_TABLE_1" t1
INNER JOIN "ROLES__STAGING_TABLE_3" t2
ON t1."id" = t2."actor_id"
LEFT JOIN "FEATURE_1_1_13" f_1_1_13
ON t2.rowid = f_1_1_13.rownum
GROUP BY t1.rowid;
```
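
The structure of this query (an average over an actor's roles, with missing subfeature values replaced by 0.0) can be tried out on toy data with SQLite. The sketch below uses invented tables and values, and a hypothetical `subfeature` table stands in for `FEATURE_1_1_13`, whose definition is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE actors (id INTEGER);
CREATE TABLE roles (actor_id INTEGER, movie_id INTEGER);
CREATE TABLE subfeature (rownum INTEGER, value REAL);

INSERT INTO actors VALUES (1), (2);
-- The three role rows receive rowid 1, 2 and 3 in insertion order.
INSERT INTO roles VALUES (1, 10), (1, 11), (2, 10);
-- Only the first and the third role have a subfeature value.
INSERT INTO subfeature VALUES (1, 0.8), (3, 0.2);
""")

# Same shape as the generated feature: INNER JOIN to the roles,
# LEFT JOIN to the subfeature, COALESCE for missing values, AVG per actor.
rows = conn.execute("""
SELECT t1.id,
       AVG(COALESCE(f.value, 0.0)) AS feature
FROM actors t1
INNER JOIN roles t2 ON t1.id = t2.actor_id
LEFT JOIN subfeature f ON t2.rowid = f.rownum
GROUP BY t1.rowid
ORDER BY t1.id
""").fetchall()

print(rows)
```

The second role of actor 1 has no subfeature value, so it enters the average as 0.0 instead of being dropped.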

#### 2.6 Productionization

It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Here, we will demonstrate how the pipeline can be transpiled to Spark SQL and then executed on a Spark cluster.

In [24]:
pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("imdb_spark")

In [25]:
if RUN_SPARK:
    spark = SparkSession.builder.appName(
        "online_retail"
    ).config(
        "spark.driver.maxResultSize","10g"
    ).config(
        "spark.driver.memory", "10g"
    ).config(
        "spark.executor.memory", "20g"
    ).config(
        "spark.sql.execution.arrow.pyspark.enabled", "true"
    ).config(
        "spark.sql.session.timeZone", "UTC"
    ).enableHiveSupport().getOrCreate()

    spark.sparkContext.setLogLevel("ERROR")

In [26]:
if RUN_SPARK:
    population_spark = container.train.population.to_pyspark(spark, name="actors")

In [27]:
if RUN_SPARK:
    movies_genres_spark = container.movies_genres.to_pyspark(spark, name="movies_genres")
    roles_spark = container.roles.to_pyspark(spark, name="roles")
    movies_spark = container.movies.to_pyspark(spark, name="movies")

In [28]:
if RUN_SPARK:
    getml.spark.execute(spark, "imdb_spark")

In [29]:
if RUN_SPARK:
    spark.sql("SELECT * FROM `FEATURES` LIMIT 20").toPandas()

### 3. Conclusion

In this notebook, we have demonstrated how getML can be applied to text fields. We have demonstrated that our approach outperforms state-of-the-art relational learning algorithms on the IMDb data set.

## References

Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).

Neville, Jennifer, and David Jensen. "Relational dependency networks." Journal of Machine Learning Research 8.Mar (2007): 653-692.

Neville, Jennifer, and David Jensen. "Collective classification with relational dependency networks." Workshop on Multi-Relational Data Mining (MRDM-2003). 2003.

Neville, Jennifer, et al. "Learning relational probability trees." Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2003.

Perovšek, Matic, et al. "Wordification: Propositionalization by unfolding relational data into bags of words." Expert Systems with Applications 42.17-18 (2015): 6442-6456.