# CORA - Categorizing academic publications using getML

In this notebook, we compare getML against existing approaches from the relational learning literature on the CORA data set, which is often used for benchmarking. We demonstrate that getML outperforms the state of the art in the relational learning literature on this data set. Beyond the benchmarking aspects, this notebook showcases getML's excellent capabilities in dealing with categorical data.

Summary:

- Prediction type: __Classification model__
- Domain: __Academia__
- Prediction target: __The category of a paper__ 
- Population size: __2708__

_Author: Dr. Patrick Urbanke_

# Background

CORA is a well-known benchmarking dataset in the academic literature on relational learning. The dataset contains 2708 scientific publications on machine learning. The papers are divided into 7 categories. The challenge is to predict the category of a paper based on the papers it cites, the papers it is cited by and keywords contained in the paper.

It has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/CORA) (Motl and Schulte, 2015).

### A web frontend for getML

The getML monitor is a frontend built to support your work with getML. It displays information such as the imported data frames and trained pipelines, and it allows easy data and feature exploration. You can launch the getML monitor [here](http://localhost:1709).

### Where is this running?

Your getML live session is running inside a docker container on [mybinder.org](https://mybinder.org/), a service built by the Jupyter community and funded by Google Cloud, OVH, GESIS Notebooks and the Turing Institute. As it is a free service, this session will shut down after 10 minutes of inactivity.

# Analysis

Let's get started with the analysis and set up your session:

In [1]:
import copy
import datetime
import os
from urllib import request
import time

import numpy as np
import pandas as pd
from IPython.display import Image
import matplotlib.pyplot as plt
plt.style.use('seaborn')
%matplotlib inline  

import pyspark
import getml

getml.engine.set_project('cora')



Loading pipelines...

Connected to project 'cora'


## 1. Loading data

### 1.1 Download from source

We begin by downloading the data from the source file:

In [2]:
conn = getml.database.connect_mariadb(
    host="relational.fit.cvut.cz",
    dbname="CORA",
    port=3306,
    user="guest",
    password="relational"
)

conn

Connection(conn_id='default',
           dbname='CORA',
           dialect='mysql',
           host='relational.fit.cvut.cz',
           port=3306)

In [3]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(
            name=name,
            table_name=name,
            conn=conn
        )
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame

In [4]:
paper = load_if_needed("paper")
cites = load_if_needed("cites")
content = load_if_needed("content")

In [5]:
paper

name,paper_id,class_label
role,unused_float,unused_string
0.0,35,Genetic_Algorithms
1.0,40,Genetic_Algorithms
2.0,114,Reinforcement_Learning
3.0,117,Reinforcement_Learning
4.0,128,Reinforcement_Learning
,...,...
2703.0,1154500,Case_Based
2704.0,1154520,Neural_Networks
2705.0,1154524,Rule_Learning
2706.0,1154525,Rule_Learning


In [6]:
cites

name,cited_paper_id,citing_paper_id
role,unused_float,unused_float
0.0,35,887
1.0,35,1033
2.0,35,1688
3.0,35,1956
4.0,35,8865
,...,...
5424.0,853116,19621
5425.0,853116,853155
5426.0,853118,1140289
5427.0,853155,853118


In [7]:
content

name,paper_id,word_cited_id
role,unused_float,unused_string
0.0,35,word100
1.0,35,word1152
2.0,35,word1175
3.0,35,word1228
4.0,35,word1248
,...,...
49211.0,1155073,word75
49212.0,1155073,word759
49213.0,1155073,word789
49214.0,1155073,word815


### 1.2 Prepare data for getML

getML requires that we define *roles* for each of the columns.

In [8]:
paper.set_role("paper_id", getml.data.roles.join_key)
paper.set_role("class_label", getml.data.roles.categorical)
paper

name,paper_id,class_label
role,join_key,categorical
0.0,35,Genetic_Algorithms
1.0,40,Genetic_Algorithms
2.0,114,Reinforcement_Learning
3.0,117,Reinforcement_Learning
4.0,128,Reinforcement_Learning
,...,...
2703.0,1154500,Case_Based
2704.0,1154520,Neural_Networks
2705.0,1154524,Rule_Learning
2706.0,1154525,Rule_Learning


In [9]:
cites.set_role(["cited_paper_id", "citing_paper_id"], getml.data.roles.join_key)
cites

name,cited_paper_id,citing_paper_id
role,join_key,join_key
0.0,35,887
1.0,35,1033
2.0,35,1688
3.0,35,1956
4.0,35,8865
,...,...
5424.0,853116,19621
5425.0,853116,853155
5426.0,853118,1140289
5427.0,853155,853118


The `content` table needs roles as well:

In [10]:
content.set_role("paper_id", getml.data.roles.join_key)
content.set_role("word_cited_id", getml.data.roles.categorical)
content

name,paper_id,word_cited_id
role,join_key,categorical
0.0,35,word100
1.0,35,word1152
2.0,35,word1175
3.0,35,word1228
4.0,35,word1248
,...,...
49211.0,1155073,word75
49212.0,1155073,word759
49213.0,1155073,word789
49214.0,1155073,word815


The goal is to predict seven different labels. We generate a target column for each of those labels. We also have to separate the data set into a training and testing set.

In [11]:
data_full = getml.data.make_target_columns(paper, "class_label")
data_full

name,paper_id,class_label=Case_Based,class_label=Genetic_Algorithms,class_label=Neural_Networks,class_label=Probabilistic_Methods,class_label=Reinforcement_Learning,class_label=Rule_Learning,class_label=Theory
role,join_key,target,target,target,target,target,target,target
0,35,0,1,0,0,0,0,0
1,40,0,1,0,0,0,0,0
2,114,0,0,0,0,1,0,0
3,117,0,0,0,0,1,0,0
4,128,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...
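The target columns above follow a one-vs-rest scheme: one 0/1 column per distinct label. The following is a minimal sketch of that idea in plain Python, not getML's implementation; `make_one_vs_rest_targets` is a hypothetical helper and the three rows are taken from the preview of `paper`.

```python
# Toy stand-in for the `paper` table: (paper_id, class_label).
papers = [
    (35, "Genetic_Algorithms"),
    (114, "Reinforcement_Learning"),
    (1154500, "Case_Based"),
]


def make_one_vs_rest_targets(rows):
    """Return (column_names, records) with one 0/1 column per distinct label."""
    labels = sorted({label for _, label in rows})
    columns = ["paper_id"] + [f"class_label={label}" for label in labels]
    records = [
        [paper_id] + [int(label == candidate) for candidate in labels]
        for paper_id, label in rows
    ]
    return columns, records


columns, records = make_one_vs_rest_targets(papers)
for record in records:
    print(dict(zip(columns, record)))
```

Each row carries exactly one `1`, which is why the pipelines later train and score one model per target column.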


In [12]:
split = getml.data.split.random(train=0.7, test=0.3, validation=0.0)
split

Unnamed: 0,Unnamed: 1
0.0,train
1.0,test
2.0,train
3.0,test
4.0,test
,...
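Conceptually, a random split assigns each row independently to a subset with the given probabilities. The sketch below assumes that behaviour and uses only the standard library; `random_split` is a hypothetical helper and does not reproduce getML's random number stream.

```python
import random


def random_split(n_rows, train=0.7, seed=0):
    """Assign each row to 'train' with probability `train`, else to 'test'."""
    rng = random.Random(seed)  # fixed seed keeps the assignment reproducible
    return ["train" if rng.random() < train else "test" for _ in range(n_rows)]


assignment = random_split(2708)
train_share = assignment.count("train") / len(assignment)
print(f"share of rows in train: {train_share:.3f}")
```

Because the assignment is random per row, the realized shares are only approximately 70/30, which matches the 1887 and 821 rows reported for the container in this notebook.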


In [13]:
container = getml.data.Container(population=data_full, split=split)
container.add(cites=cites, content=content, paper=paper)
container.freeze()
container

Unnamed: 0,subset,name,rows,type
0,test,paper,821,View
1,train,paper,1887,View

Unnamed: 0,name,rows,type
0,cites,5429,DataFrame
1,content,49216,DataFrame
2,paper,2708,DataFrame


## 2. Predictive modeling

We loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.

### 2.1 Define relational model

To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.

That is because the class label can be predicted using three different pieces of information:

- The keywords used by the paper
- The keywords used by papers it cites and by papers that cite it
- The class labels of papers it cites and of papers that cite it

The main challenge here is that `cites` is used twice, once to connect the _cited_ papers and then to connect the _citing_ papers. To resolve this, we need two placeholders on `cites`.

In [14]:
dm = getml.data.DataModel(paper.to_placeholder("population"))

# We need two different placeholders for cites.
dm.add(getml.data.to_placeholder(cites=[cites]*2, content=content, paper=paper))

dm.population.join(
    dm.cites[0],
    on=('paper_id', 'cited_paper_id')
)

dm.cites[0].join(
    dm.content,
    on=('citing_paper_id', 'paper_id')
)

dm.cites[0].join(
    dm.paper,
    on=('citing_paper_id', 'paper_id'),
    relationship=getml.data.relationship.many_to_one
)

dm.population.join(
    dm.cites[1],
    on=('paper_id', 'citing_paper_id')
)

dm.cites[1].join(
    dm.content,
    on=('cited_paper_id', 'paper_id')
)

dm.cites[1].join(
    dm.paper,
    on=('cited_paper_id', 'paper_id'),
    relationship=getml.data.relationship.many_to_one
)

dm.population.join(
    dm.content,
    on='paper_id'
)

dm

Unnamed: 0,data frames,staging table
0,population,POPULATION__STAGING_TABLE_1
1,"cites, paper",CITES__STAGING_TABLE_2
2,"cites, paper",CITES__STAGING_TABLE_3
3,content,CONTENT__STAGING_TABLE_4
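To see why `cites` has to be joined twice, it helps to look at the two directions separately. The following is a plain-Python sketch, not getML code; the pairs are taken from the preview of `cites`, and the dictionary names are assumptions chosen for illustration.

```python
from collections import defaultdict

# (cited_paper_id, citing_paper_id) pairs from the preview of `cites`.
cites = [
    (35, 887),
    (35, 1033),
    (853116, 853155),
    (853118, 1140289),
    (853155, 853118),
]

cited_by = defaultdict(list)    # join on cited_paper_id: who cites this paper?
references = defaultdict(list)  # join on citing_paper_id: whom does this paper cite?

for cited, citing in cites:
    cited_by[cited].append(citing)
    references[citing].append(cited)

print(cited_by[853155])    # papers that cite 853155
print(references[853155])  # papers that 853155 cites
```

The same table answers two different questions depending on the join key, which is what the two placeholders `dm.cites[0]` and `dm.cites[1]` express in the data model.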


### 2.2 getML pipeline

__Set up the feature learners & predictor__

We use the FastProp and Relboost algorithms for this problem. Because of the large number of keywords, we regularize the Relboost model a bit by requiring a minimum support for the keywords (`min_num_samples`).

In [15]:
mapping = getml.preprocessors.Mapping()

fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    num_threads=1
)

relboost = getml.feature_learning.Relboost(
    num_features=10,
    num_subfeatures=10,
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    seed=4367,
    num_threads=1,
    min_num_samples=30
)

predictor = getml.predictors.XGBoostClassifier()
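The intuition behind a minimum support can be sketched as follows: a keyword is only considered if it occurs often enough. This is an analogy in plain Python, not how Relboost applies `min_num_samples` internally; `frequent_keywords` is a hypothetical helper and the paper ids are made up.

```python
from collections import Counter


def frequent_keywords(content_rows, min_num_samples):
    """Keep only keywords that occur in at least `min_num_samples` rows."""
    counts = Counter(word for _, word in content_rows)
    return {word for word, n in counts.items() if n >= min_num_samples}


# (paper_id, word_cited_id) rows; paper ids are hypothetical.
content_rows = [(1, "word100"), (2, "word100"), (3, "word100"), (3, "word75")]
kept = frequent_keywords(content_rows, min_num_samples=2)
print(kept)
```

Rare keywords are dropped, so conditions cannot be built on words that appear in only a handful of papers, which would be prone to overfitting.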

__Build the pipeline__

In [16]:
pipe1 = getml.pipeline.Pipeline(
    tags=['fast_prop'],
    data_model=dm,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    predictors=[predictor]
)

pipe1

In [17]:
pipe2 = getml.pipeline.Pipeline(
    tags=['relboost'],
    data_model=dm,
    feature_learners=[relboost],
    predictors=[predictor]
)

pipe2

### 2.3 Model training

In [18]:
pipe1.check(container.train)

Checking data model...


Staging...

Preprocessing...

Checking...

INFO [MIGHT TAKE LONG]: The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
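The share reported in the `FOREIGN KEYS NOT FOUND` messages can be reproduced conceptually with a small helper. This is a sketch under the assumption that the check counts population keys without any match in the peripheral join key; `unmatched_share` is hypothetical and the key lists are toy values.

```python
def unmatched_share(population_keys, peripheral_keys):
    """Fraction of population keys with no matching entry in the peripheral keys."""
    if not population_keys:
        return 0.0
    peripheral = set(peripheral_keys)
    missing = sum(1 for key in population_keys if key not in peripheral)
    return missing / len(population_keys)


paper_ids = [35, 40, 114, 117]   # population join key
cited_paper_ids = [35, 35, 114]  # peripheral join key
share = unmatched_share(paper_ids, cited_paper_ids)
print(f"{share:.1%} of paper_id values have no match")
```

For CORA, such unmatched keys are presumably expected rather than an error: a paper without a match on `cited_paper_id` is simply never cited within the data set, and one without a match on `citing_paper_id` cites no other paper in it.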


In [19]:
pipe1.fit(container.train)

Checking data model...


Staging...

INFO [MIGHT TAKE LONG]: The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.


Staging...

Preprocessing...

FastProp: Trying 3780 features.

In [20]:
pipe2.check(container.train)

Checking data model...


Staging...

Checking...

INFO [MIGHT TAKE LONG]: The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.


The training log can look a bit intimidating. That is because the Relboost algorithm needs to train separate models for each class label. This is due to the nature of the generated features.

In [21]:
pipe2.fit(container.train)

Checking data model...


Staging...

INFO [MIGHT TAKE LONG]: The number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
INFO [FOREIGN KEYS NOT FOUND]: When joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.


Staging...

Relboost: Training subfeatures...


Relboost: Building features...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building features...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building features...

XGBoost: Training as predictor...

XGBoost: Training as predictor...

XGBoost: Training as predictor...

XGBoost: Training as predictor...

XGBoost: Training as predictor...

XGBoost: Training as predictor...

XGBoost: Training as predictor...

Trained pipeline.
Time taken: 0h:2m:30.090179



### 2.4 Model evaluation

In [22]:
pipe1.score(container.test)



Staging...

Preprocessing...

FastProp: Building subfeatures...

FastProp: Building subfeatures...

FastProp: Building features...



Unnamed: 0,date time,set used,target,accuracy,auc,cross entropy
0.0,2021-08-20 23:35:25,train,class_label=Case_Based,0.9978802331743508,0.9999,0.02323
1.0,2021-08-20 23:35:25,train,class_label=Genetic_Algorithms,1,1.,0.004915
2.0,2021-08-20 23:35:25,train,class_label=Neural_Networks,0.9846316905140434,0.9983,0.065852
3.0,2021-08-20 23:35:25,train,class_label=Probabilistic_Methods,0.9957604663487016,0.9998,0.02765
4.0,2021-08-20 23:35:25,train,class_label=Reinforcement_Learning,0.9994700582935877,1.,0.009078
,...,...,...,...,...,...
9.0,2021-08-20 23:37:57,test,class_label=Neural_Networks,0.951278928136419,0.9787,0.163552
10.0,2021-08-20 23:37:57,test,class_label=Probabilistic_Methods,0.9732034104750305,0.9872,0.083174
11.0,2021-08-20 23:37:57,test,class_label=Reinforcement_Learning,0.9805115712545676,0.9736,0.074599
12.0,2021-08-20 23:37:57,test,class_label=Rule_Learning,0.9841656516443362,0.9937,0.052146


In [23]:
pipe2.score(container.test)



Staging...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building features...

Unnamed: 0,date time,set used,target,accuracy,auc,cross entropy
0.0,2021-08-20 23:37:56,train,class_label=Case_Based,1,1.,0.008368
1.0,2021-08-20 23:37:56,train,class_label=Genetic_Algorithms,1,1.,0.004185
2.0,2021-08-20 23:37:56,train,class_label=Neural_Networks,0.9925808161102279,0.9995,0.03748
3.0,2021-08-20 23:37:56,train,class_label=Probabilistic_Methods,0.9978802331743508,1.,0.014195
4.0,2021-08-20 23:37:56,train,class_label=Reinforcement_Learning,1,1.,0.004341
,...,...,...,...,...,...
9.0,2021-08-20 23:38:01,test,class_label=Neural_Networks,0.9390986601705238,0.9802,0.182987
10.0,2021-08-20 23:38:01,test,class_label=Probabilistic_Methods,0.9732034104750305,0.9874,0.090399
11.0,2021-08-20 23:38:01,test,class_label=Reinforcement_Learning,0.9817295980511571,0.9779,0.077956
12.0,2021-08-20 23:38:01,test,class_label=Rule_Learning,0.9817295980511571,0.9918,0.06703


To make things a bit easier, we just look at our test results.

In [24]:
pipe1.scores.filter(lambda score: score.set_used == "test")

Unnamed: 0,date time,set used,target,accuracy,auc,cross entropy
0,2021-08-20 23:37:57,test,class_label=Case_Based,0.9708,0.9861,0.08689
1,2021-08-20 23:37:57,test,class_label=Genetic_Algorithms,0.9854,0.998,0.04898
2,2021-08-20 23:37:57,test,class_label=Neural_Networks,0.9513,0.9787,0.16355
3,2021-08-20 23:37:57,test,class_label=Probabilistic_Methods,0.9732,0.9872,0.08317
4,2021-08-20 23:37:57,test,class_label=Reinforcement_Learning,0.9805,0.9736,0.0746
5,2021-08-20 23:37:57,test,class_label=Rule_Learning,0.9842,0.9937,0.05215
6,2021-08-20 23:37:57,test,class_label=Theory,0.9574,0.977,0.1286


In [25]:
pipe2.scores.filter(lambda score: score.set_used == "test")

Unnamed: 0,date time,set used,target,accuracy,auc,cross entropy
0,2021-08-20 23:38:01,test,class_label=Case_Based,0.9756,0.9801,0.10383
1,2021-08-20 23:38:01,test,class_label=Genetic_Algorithms,0.9915,0.9992,0.03394
2,2021-08-20 23:38:01,test,class_label=Neural_Networks,0.9391,0.9802,0.18299
3,2021-08-20 23:38:01,test,class_label=Probabilistic_Methods,0.9732,0.9874,0.0904
4,2021-08-20 23:38:01,test,class_label=Reinforcement_Learning,0.9817,0.9779,0.07796
5,2021-08-20 23:38:01,test,class_label=Rule_Learning,0.9817,0.9918,0.06703
6,2021-08-20 23:38:01,test,class_label=Theory,0.9501,0.966,0.16415
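The tables above report accuracy per target column, that is, for seven separate one-vs-rest problems. A single multiclass prediction can be derived by picking the label with the highest score. The following is a minimal sketch with made-up scores for three of the labels; `predict_labels` and `accuracy` are hypothetical helpers, not part of getML.

```python
def predict_labels(probabilities, labels):
    """For each row of per-class scores, return the label with the highest score."""
    return [
        labels[max(range(len(row)), key=row.__getitem__)]
        for row in probabilities
    ]


def accuracy(predicted, actual):
    """Share of rows where the predicted label equals the actual label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


labels = ["Case_Based", "Genetic_Algorithms", "Neural_Networks"]
# One row per paper, one (made-up) score per label.
probabilities = [
    [0.1, 0.8, 0.1],
    [0.2, 0.1, 0.7],
    [0.6, 0.3, 0.1],
]
actual = ["Genetic_Algorithms", "Neural_Networks", "Neural_Networks"]

predicted = predict_labels(probabilities, labels)
print(predicted, accuracy(predicted, actual))
```

Note that the multiclass accuracy obtained this way is a different quantity from the per-target accuracies in the tables and will generally be lower, because a row counts as correct only if the single chosen label is right.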


### 2.5 Productionization

It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's `sqlite3` module.

In [26]:
pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("cora1_spark_sql")

In [27]:
pipe2.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save("cora2_spark_sql")
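To give a feeling for what a SQL feature looks like, here is a hand-written aggregation executed with Python's built-in `sqlite3` module. It is only an illustration: the query is not generated by getML, the table contents are toy values, and `num_citations` is a hypothetical feature name.

```python
import sqlite3

sqlite_conn = sqlite3.connect(":memory:")
sqlite_conn.executescript(
    """
    CREATE TABLE paper (paper_id INTEGER);
    CREATE TABLE cites (cited_paper_id INTEGER, citing_paper_id INTEGER);
    INSERT INTO paper VALUES (35), (40);
    INSERT INTO cites VALUES (35, 887), (35, 1033), (35, 1688);
    """
)

# One feature value per paper: how many papers cite it?
feature_sql = """
    SELECT p.paper_id, COUNT(c.citing_paper_id) AS num_citations
    FROM paper p
    LEFT JOIN cites c ON c.cited_paper_id = p.paper_id
    GROUP BY p.paper_id
    ORDER BY p.paper_id
"""
rows = sqlite_conn.execute(feature_sql).fetchall()
sqlite_conn.close()
print(rows)
```

The `LEFT JOIN` keeps papers without any citation, so every row of the population table receives a feature value.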

In [28]:
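from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("cora").getOrCreate()  # start (or reuse) a local Spark session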

In [29]:
population_spark = container.train.population.to_pyspark(spark, name="population")
cites_spark = cites.to_pyspark(spark, name="cites") 
content_spark = content.to_pyspark(spark, name="content") 
paper_spark = paper.to_pyspark(spark, name="paper")

In [30]:
begin = time.time()
getml.spark.execute(spark, "cora1_spark_sql")
end = time.time()

spark_runtime1 = datetime.timedelta(seconds=end - begin)

2021-08-20 23:38:07,504 Executing cora1_spark_sql/0001_population__staging_table_1.sql...
2021-08-20 23:38:21,964 Executing cora1_spark_sql/0002_cites__staging_table_2.sql...
2021-08-20 23:38:30,112 Executing cora1_spark_sql/0003_cites__staging_table_3.sql...
2021-08-20 23:38:35,087 Executing cora1_spark_sql/0004_content__staging_table_4.sql...
2021-08-20 23:38:36,586 Executing cora1_spark_sql/0005_word_cited_id__mapping_1_1_target_1_avg.sql...
2021-08-20 23:38:40,629 Executing cora1_spark_sql/0006_word_cited_id__mapping_1_1_target_2_avg.sql...
2021-08-20 23:38:42,862 Executing cora1_spark_sql/0007_word_cited_id__mapping_1_1_target_3_avg.sql...
2021-08-20 23:38:44,947 Executing cora1_spark_sql/0008_word_cited_id__mapping_1_1_target_4_avg.sql...
2021-08-20 23:38:47,181 Executing cora1_spark_sql/0009_word_cited_id__mapping_1_1_target_5_avg.sql...
2021-08-20 23:38:49,374 Executing cora1_spark_sql/0010_word_cited_id__mapping_1_1_target_6_avg.sql...
2021-08-20 23:38:51,461 Executing cora1_s

2021-08-20 23:46:11,829 Executing cora1_spark_sql/0096_feature_1_1_57.sql...
2021-08-20 23:46:17,138 Executing cora1_spark_sql/0097_feature_1_1_58.sql...
2021-08-20 23:46:25,628 Executing cora1_spark_sql/0098_feature_1_1_59.sql...
2021-08-20 23:46:35,417 Executing cora1_spark_sql/0099_feature_1_1_60.sql...
2021-08-20 23:46:40,286 Executing cora1_spark_sql/0100_feature_1_1_61.sql...
2021-08-20 23:46:45,955 Executing cora1_spark_sql/0101_feature_1_1_62.sql...
2021-08-20 23:46:51,086 Executing cora1_spark_sql/0102_feature_1_1_63.sql...
2021-08-20 23:46:56,581 Executing cora1_spark_sql/0103_feature_1_1_64.sql...
2021-08-20 23:47:01,916 Executing cora1_spark_sql/0104_feature_1_1_65.sql...
2021-08-20 23:47:06,907 Executing cora1_spark_sql/0105_feature_1_1_66.sql...
2021-08-20 23:47:11,922 Executing cora1_spark_sql/0106_feature_1_1_67.sql...
2021-08-20 23:47:20,888 Executing cora1_spark_sql/0107_feature_1_1_68.sql...
2021-08-20 23:47:30,400 Executing cora1_spark_sql/0108_feature_1_1_69.sql...

2021-08-20 23:56:59,242 Executing cora1_spark_sql/0202_feature_1_1_163.sql...
2021-08-20 23:57:04,305 Executing cora1_spark_sql/0203_feature_1_1_164.sql...
2021-08-20 23:57:09,283 Executing cora1_spark_sql/0204_feature_1_1_165.sql...
2021-08-20 23:57:14,388 Executing cora1_spark_sql/0205_feature_1_1_166.sql...
2021-08-20 23:57:22,711 Executing cora1_spark_sql/0206_feature_1_1_167.sql...
2021-08-20 23:57:32,750 Executing cora1_spark_sql/0207_feature_1_1_168.sql...
2021-08-20 23:57:38,101 Executing cora1_spark_sql/0208_feature_1_1_169.sql...
2021-08-20 23:57:43,740 Executing cora1_spark_sql/0209_feature_1_1_170.sql...
2021-08-20 23:57:48,782 Executing cora1_spark_sql/0210_feature_1_1_171.sql...
2021-08-20 23:57:53,844 Executing cora1_spark_sql/0211_feature_1_1_172.sql...
2021-08-20 23:57:59,187 Executing cora1_spark_sql/0212_feature_1_1_173.sql...
2021-08-20 23:58:04,192 Executing cora1_spark_sql/0213_feature_1_1_174.sql...
2021-08-20 23:58:09,465 Executing cora1_spark_sql/0214_feature_1

2021-08-21 00:08:27,777 Executing cora1_spark_sql/0309_feature_1_2_78.sql...
2021-08-21 00:08:32,633 Executing cora1_spark_sql/0310_feature_1_2_79.sql...
2021-08-21 00:08:38,650 Executing cora1_spark_sql/0311_feature_1_2_80.sql...
2021-08-21 00:08:43,967 Executing cora1_spark_sql/0312_feature_1_2_81.sql...
2021-08-21 00:08:49,032 Executing cora1_spark_sql/0313_feature_1_2_82.sql...
2021-08-21 00:08:54,085 Executing cora1_spark_sql/0314_feature_1_2_83.sql...
2021-08-21 00:08:59,548 Executing cora1_spark_sql/0315_feature_1_2_84.sql...
2021-08-21 00:09:04,630 Executing cora1_spark_sql/0316_feature_1_2_85.sql...
2021-08-21 00:09:13,644 Executing cora1_spark_sql/0317_feature_1_2_86.sql...
2021-08-21 00:09:23,353 Executing cora1_spark_sql/0318_feature_1_2_87.sql...
2021-08-21 00:09:28,168 Executing cora1_spark_sql/0319_feature_1_2_88.sql...
2021-08-21 00:09:34,168 Executing cora1_spark_sql/0320_feature_1_2_89.sql...
2021-08-21 00:09:38,929 Executing cora1_spark_sql/0321_feature_1_2_90.sql...

2021-08-21 00:19:06,342 Executing cora1_spark_sql/0415_feature_1_2_184.sql...
2021-08-21 00:19:15,300 Executing cora1_spark_sql/0416_feature_1_2_185.sql...
2021-08-21 00:19:25,209 Executing cora1_spark_sql/0417_feature_1_2_186.sql...
2021-08-21 00:19:30,045 Executing cora1_spark_sql/0418_feature_1_2_187.sql...
2021-08-21 00:19:35,909 Executing cora1_spark_sql/0419_feature_1_2_188.sql...
2021-08-21 00:19:40,800 Executing cora1_spark_sql/0420_feature_1_2_189.sql...
2021-08-21 00:19:45,957 Executing cora1_spark_sql/0421_feature_1_2_190.sql...
2021-08-21 00:19:51,117 Executing cora1_spark_sql/0422_feature_1_2_191.sql...
2021-08-21 00:19:56,373 Executing cora1_spark_sql/0423_feature_1_2_192.sql...
2021-08-21 00:20:01,293 Executing cora1_spark_sql/0424_feature_1_1.sql...
2021-08-21 00:20:05,759 Executing cora1_spark_sql/0425_feature_1_2.sql...
2021-08-21 00:20:09,690 Executing cora1_spark_sql/0426_feature_1_3.sql...
2021-08-21 00:20:14,464 Executing cora1_spark_sql/0427_feature_1_4.sql...

2021-08-21 00:27:24,936 Executing cora1_spark_sql/0524_feature_1_101.sql...
2021-08-21 00:27:29,480 Executing cora1_spark_sql/0525_feature_1_102.sql...
2021-08-21 00:27:34,464 Executing cora1_spark_sql/0526_feature_1_103.sql...
2021-08-21 00:27:39,464 Executing cora1_spark_sql/0527_feature_1_104.sql...
2021-08-21 00:27:43,710 Executing cora1_spark_sql/0528_feature_1_105.sql...
2021-08-21 00:27:48,630 Executing cora1_spark_sql/0529_feature_1_106.sql...
2021-08-21 00:27:53,319 Executing cora1_spark_sql/0530_feature_1_107.sql...
2021-08-21 00:27:57,995 Executing cora1_spark_sql/0531_feature_1_108.sql...
2021-08-21 00:28:02,996 Executing cora1_spark_sql/0532_feature_1_109.sql...
2021-08-21 00:28:07,615 Executing cora1_spark_sql/0533_feature_1_110.sql...
2021-08-21 00:28:12,383 Executing cora1_spark_sql/0534_feature_1_111.sql...
2021-08-21 00:28:17,269 Executing cora1_spark_sql/0535_feature_1_112.sql...
2021-08-21 00:28:21,825 Executing cora1_spark_sql/0536_feature_1_113.sql...

In [31]:
begin = time.time()
getml.spark.execute(spark, "cora2_spark_sql")
end = time.time()

spark_runtime2 = datetime.timedelta(seconds=end - begin)

2021-08-21 00:38:14,200 Executing cora2_spark_sql/0001_population__staging_table_1.sql...
2021-08-21 00:38:14,802 Executing cora2_spark_sql/0002_cites__staging_table_2.sql...
2021-08-21 00:38:18,340 Executing cora2_spark_sql/0003_cites__staging_table_3.sql...
2021-08-21 00:38:22,284 Executing cora2_spark_sql/0004_content__staging_table_4.sql...
2021-08-21 00:38:22,762 Executing cora2_spark_sql/0005_feature_1_1_1.sql...
2021-08-21 00:38:29,309 Executing cora2_spark_sql/0006_feature_1_1_2.sql...
2021-08-21 00:38:35,119 Executing cora2_spark_sql/0007_feature_1_1_3.sql...
2021-08-21 00:38:40,512 Executing cora2_spark_sql/0008_feature_1_1_4.sql...
2021-08-21 00:38:45,999 Executing cora2_spark_sql/0009_feature_1_1_5.sql...
2021-08-21 00:38:51,310 Executing cora2_spark_sql/0010_feature_1_1_6.sql...
2021-08-21 00:38:56,335 Executing cora2_spark_sql/0011_feature_1_1_7.sql...
2021-08-21 00:39:01,662 Executing cora2_spark_sql/0012_feature_1_1_8.sql...
2021-08-21 00:39:06,757 Executing cora2_spark

2021-08-21 00:48:13,489 Executing cora2_spark_sql/0109_feature_3_1_1.sql...
2021-08-21 00:48:19,383 Executing cora2_spark_sql/0110_feature_3_1_2.sql...
2021-08-21 00:48:24,865 Executing cora2_spark_sql/0111_feature_3_1_3.sql...
2021-08-21 00:48:30,508 Executing cora2_spark_sql/0112_feature_3_1_4.sql...
2021-08-21 00:48:35,915 Executing cora2_spark_sql/0113_feature_3_1_5.sql...
2021-08-21 00:48:41,345 Executing cora2_spark_sql/0114_feature_3_1_6.sql...
2021-08-21 00:48:46,629 Executing cora2_spark_sql/0115_feature_3_1_7.sql...
2021-08-21 00:48:52,228 Executing cora2_spark_sql/0116_feature_3_1_8.sql...
2021-08-21 00:48:57,814 Executing cora2_spark_sql/0117_feature_3_1_9.sql...
2021-08-21 00:49:03,341 Executing cora2_spark_sql/0118_feature_3_1_10.sql...
2021-08-21 00:49:09,156 Executing cora2_spark_sql/0119_feature_3_1_11.sql...
2021-08-21 00:49:14,602 Executing cora2_spark_sql/0120_feature_3_1_12.sql...
2021-08-21 00:49:20,073 Executing cora2_spark_sql/0121_feature_3_1_13.sql...

2021-08-21 00:58:21,213 Executing cora2_spark_sql/0217_feature_5_1_5.sql...
2021-08-21 00:58:26,934 Executing cora2_spark_sql/0218_feature_5_1_6.sql...
2021-08-21 00:58:32,683 Executing cora2_spark_sql/0219_feature_5_1_7.sql...
2021-08-21 00:58:38,341 Executing cora2_spark_sql/0220_feature_5_1_8.sql...
2021-08-21 00:58:44,186 Executing cora2_spark_sql/0221_feature_5_1_9.sql...
2021-08-21 00:58:49,815 Executing cora2_spark_sql/0222_feature_5_1_10.sql...
2021-08-21 00:58:55,581 Executing cora2_spark_sql/0223_feature_5_1_11.sql...
2021-08-21 00:59:01,050 Executing cora2_spark_sql/0224_feature_5_1_12.sql...
2021-08-21 00:59:06,166 Executing cora2_spark_sql/0225_feature_5_1_13.sql...
2021-08-21 00:59:11,998 Executing cora2_spark_sql/0226_feature_5_1_14.sql...
2021-08-21 00:59:17,886 Executing cora2_spark_sql/0227_feature_5_1_15.sql...
2021-08-21 00:59:23,968 Executing cora2_spark_sql/0228_feature_5_1_16.sql...
2021-08-21 00:59:29,866 Executing cora2_spark_sql/0229_feature_5_1_17.sql...

2021-08-21 01:08:32,120 Executing cora2_spark_sql/0325_feature_7_1_9.sql...
2021-08-21 01:08:37,901 Executing cora2_spark_sql/0326_feature_7_1_10.sql...
2021-08-21 01:08:43,681 Executing cora2_spark_sql/0327_feature_7_1_11.sql...
2021-08-21 01:08:48,622 Executing cora2_spark_sql/0328_feature_7_1_12.sql...
2021-08-21 01:08:53,917 Executing cora2_spark_sql/0329_feature_7_1_13.sql...
2021-08-21 01:08:59,498 Executing cora2_spark_sql/0330_feature_7_1_14.sql...
2021-08-21 01:09:04,655 Executing cora2_spark_sql/0331_feature_7_1_15.sql...
2021-08-21 01:09:10,190 Executing cora2_spark_sql/0332_feature_7_1_16.sql...
2021-08-21 01:09:15,542 Executing cora2_spark_sql/0333_feature_7_1_17.sql...
2021-08-21 01:09:21,710 Executing cora2_spark_sql/0334_feature_7_1_18.sql...
2021-08-21 01:09:27,297 Executing cora2_spark_sql/0335_feature_7_1_19.sql...
2021-08-21 01:09:32,655 Executing cora2_spark_sql/0336_feature_7_1_20.sql...
2021-08-21 01:09:38,185 Executing cora2_spark_sql/0337_features_7_1.sql...

In [32]:
begin = time.time()
features1 = pipe1.transform(container.train)
end = time.time()

getml_runtime1 = datetime.timedelta(seconds=end - begin)



Staging...

Preprocessing...

FastProp: Building subfeatures...

FastProp: Building subfeatures...

FastProp: Building features...



In [33]:
begin = time.time()
features2 = pipe2.transform(container.train)
end = time.time()

getml_runtime2 = datetime.timedelta(seconds=end - begin)



Staging...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building subfeatures...

Relboost: Building features...

In [34]:
spark_runtime1 / getml_runtime1

5851.28686128308

In [35]:
spark_runtime2 / getml_runtime2

240.54293225006398
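The two numbers above are speed-up factors: dividing one `datetime.timedelta` by another yields a plain float. The sketch below shows the measurement pattern with a small reusable helper; `timed` is hypothetical, and the 50 minutes and 30 seconds are made-up durations, not the runtimes measured in this notebook.

```python
import datetime
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed), with elapsed as a timedelta."""
    begin = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = datetime.timedelta(seconds=time.perf_counter() - begin)
    return result, elapsed


total, elapsed = timed(sum, range(1000))
print(total, elapsed)

# Dividing two timedeltas gives a dimensionless ratio (made-up durations).
speedup = datetime.timedelta(minutes=50) / datetime.timedelta(seconds=30)
print(speedup)
```

`time.perf_counter()` is used here instead of `time.time()` because it is monotonic and intended for measuring short durations.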

## 3. Conclusion

In this notebook we have demonstrated that getML outperforms state-of-the-art relational learning algorithms on the CORA dataset.

## References

Dinh, Quang-Thang, Christel Vrain, and Matthieu Exbrayat. "A Link-Based Method for Propositionalization." ILP (Late Breaking Papers). 2012.

Motl, Jan, and Oliver Schulte. "The CTU Prague relational learning repository." arXiv preprint arXiv:1511.03086 (2015).

Perlich, Claudia, and Foster Provost. "Distribution-based aggregation for relational learning with identifier attributes." Machine Learning 62.1-2 (2006): 65-105.

Preisach, Christine, and Lars Schmidt-Thieme. "Relational ensemble classification." Sixth International Conference on Data Mining (ICDM'06). IEEE, 2006.

# Next Steps

This tutorial benchmarked getML against academic state-of-the-art algorithms from the relational learning literature and showcased getML's qualities with respect to categorical data.

If you are interested in further real-world applications of getML, head back to the [notebook overview](welcome.md) and choose one of the remaining examples.

Here is some additional material from our [documentation](https://docs.getml.com/latest/) if you want to learn more about getML:
* [Feature learning with Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel)
* [Feature learning with Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost)

# Get in contact

If you have any questions, schedule a [call with Alex](https://go.getml.com/meetings/alexander-uhlig/getml-demo), the co-founder of getML, or write us an [email](mailto:team@getml.com). Prefer a private demo of getML? Just contact us to make an appointment.