# CORA: getML performance breaks record

Graph Neural Networks (GNNs) are renowned for their strong performance on graph-structured data, excelling in tasks like node classification and link prediction. However, deploying GNNs is often complex. Graph preprocessing, architecture optimization, hyperparameter tuning, and ensuring convergence are non-trivial challenges in neural-network-based approaches and require a considerable time investment.

**getML** offers a faster and more user-friendly alternative. It builds on **getML FastProp**, the fastest open-source tool for propositionalization-based automation of feature engineering on relational data and time series. FastProp transforms relational data into a single feature table suitable for standard machine learning models by efficiently computing a wide range of statistical and temporal aggregates. When combined with models like **XGBoost**, getML delivers a straightforward yet highly performant approach to predictive modeling. This method avoids the complexity of GNN-based approaches while offering concise code, computational speed, and high model accuracy.
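
To make the idea of propositionalization concrete, here is a minimal sketch in plain pandas. It is not FastProp itself and does not use the getML API; the tables, the `word_weight` column, and the chosen aggregations are hypothetical and only illustrate how many rows per entity can be condensed into one feature row.

```python
import pandas as pd

# Toy relational data (hypothetical, not the CORA tables):
# one row per paper in the population table ...
population = pd.DataFrame({"paper_id": [1, 2, 3]})

# ... and many rows per paper in a peripheral table.
words = pd.DataFrame(
    {
        "paper_id": [1, 1, 1, 2, 2, 3],
        "word_weight": [0.5, 1.5, 1.0, 2.0, 4.0, 3.0],
    }
)

# Aggregate the peripheral rows per join key ...
aggregates = (
    words.groupby("paper_id")["word_weight"]
    .agg(["count", "mean", "max"])
    .add_prefix("word_weight_")
    .reset_index()
)

# ... and join the aggregates back, so every paper is described by one row.
feature_table = population.merge(aggregates, on="paper_id", how="left")
print(feature_table)
```

The resulting flat table has one row per paper and one column per aggregate, which is the shape a standard model such as XGBoost expects.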

This notebook demonstrates how **getML** surpasses the previous record on the CORA dataset, set by the GNN-based approach of [Izadi et al. (2020)](https://paperswithcode.com/sota/node-classification-on-cora), with minimal code and configuration.

Summary:

- Prediction type: __Classification model__
- Domain: __Academia__
- Prediction target: __The category of a paper__ 
- Source data: __Relational data set, 3 tables__
- Population size: __2,708__

First, let's run some boilerplate code.

In [1]:
%pip install -q "getml==1.5.0" "ipywidgets==8.1.5"

Note: you may need to restart the kernel to use updated packages.


In [None]:
import os

import json
import numpy as np
import pandas as pd

import getml

print(f"getML API version: {getml.__version__}\n")

getML API version: 1.5.0



In [3]:
getml.engine.launch()
getml.engine.set_project("cora_sota")

Launching ./getML --allow-push-notifications=true --allow-remote-ips=false --home-directory=/home/user/.getML --in-memory=true --install=false --launch-browser=true --log=false --project-directory=/home/user/.getML/projects in /home/user/.getML/getml-enterprise-1.5.0-amd64-linux...
Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/getml_20241119160445.log
Loading pipelines... 100% • 00:00

### 1. Loading data

#### 1.1 Download from source

We begin by downloading the data from the source database:

In [4]:
conn = getml.database.connect_mysql(
    host="relational.fel.cvut.cz",
    dbname="CORA",
    port=3306,
    user="guest",
    password="ctu-relational",
)

conn

Connection(dbname='CORA', dialect='mysql', host='relational.fel.cvut.cz', port=3306)

In [5]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(name=name, table_name=name, conn=conn)
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame

In [6]:
paper = load_if_needed("paper")
cites = load_if_needed("cites")
content = load_if_needed("content")

Here we deviate from the regular procedure by introducing the exact same train/test split as the [current top entry](https://paperswithcode.com/paper/optimization-of-graph-neural-networks-with). While we contend that testing on a single split is not sufficient to demonstrate the performance of an algorithm on a specific data set, we proceed this way to maximize comparability with the current incumbent of the leaderboard. For a more extensive investigation of getML's performance on the CORA dataset, check out [our other notebooks](https://getml.com/latest/examples/enterprise-notebooks/kaggle_notebooks/).

To achieve the identical split, we first need to match papers and their associated word matrix across data sources.

In [None]:
if not os.path.exists("assets/zuordnung.json"):
    !pip install torch
    !pip install -q git+https://github.com/pyg-team/pytorch_geometric.git
    from utils.zuordnung import run_zuordnung

    # may take 90 minutes or longer to run
    run_zuordnung(content)

In [None]:
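# "Zuordnung" is German for "mapping"; columns 0 and 1 of the loaded table are used below.
with open("assets/zuordnung.json", "r") as f:
    zuordnung = json.load(f)

paper_df = paper.to_pandas()
paper_df["paper_id"] = paper_df["paper_id"].astype(int)
zuo_df = pd.DataFrame(zuordnung)
zuo_df[0] = zuo_df[0].astype(int)
# Column 0 is matched against paper_id; sorting by column 1 fixes the row order
# that the positional split below relies on.
paper_df = paper_df.merge(zuo_df, left_on="paper_id", right_on=0).sort_values(by=1)
paper_df = paper_df[["class_label", "paper_id"]]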
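
The merge-and-sort step can be tried on a toy example. The sketch below assumes that the mapping is a list of `[paper_id, position]` pairs, which is how the cell above uses it; the actual contents of `assets/zuordnung.json` are not shown in this notebook, so the values here are hypothetical.

```python
import pandas as pd

# Hypothetical mapping: [paper_id, position in the other data source].
pairs = [["35", 2], ["40", 0], ["114", 1]]

# Hypothetical miniature of the paper table, with ids stored as strings.
papers = pd.DataFrame(
    {"paper_id": ["114", "35", "40"], "class_label": ["B", "A", "C"]}
)

# Bring both id columns to the same integer type before merging.
papers["paper_id"] = papers["paper_id"].astype(int)
mapping = pd.DataFrame(pairs)  # columns are named 0 and 1
mapping[0] = mapping[0].astype(int)

# Match on the id, then order the rows by the position column.
aligned = papers.merge(mapping, left_on="paper_id", right_on=0).sort_values(by=1)
aligned = aligned[["class_label", "paper_id"]]
print(aligned)
```

After sorting, the rows follow the order given by the position column, which is what makes a purely positional split reproducible.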

We split the sorted data set according to the instructions in the Izadi et al. paper (see IV. Experiments, A. Datasets, third split).

In [9]:
paper_train = getml.data.DataFrame.from_pandas(paper_df[:1707], name="train")
paper_val = getml.data.DataFrame.from_pandas(
    paper_df[1707 : 1707 + 500], name="validation"
)
paper_test = getml.data.DataFrame.from_pandas(paper_df[1707 + 500 :], name="test")

paper, split = getml.data.split.concat(
    "population", train=paper_train, validation=paper_val, test=paper_test
)
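
Positional slicing of a sorted frame can be illustrated on a toy table. The sketch below uses made-up sizes (10 rows split 6/2/2), not the CORA split itself. It shows that contiguous slices are disjoint and cover every row; comparing the resulting sizes with the container summary is a cheap guard against off-by-one boundaries.

```python
import pandas as pd

# Toy stand-in for the sorted paper table (hypothetical data).
toy = pd.DataFrame({"paper_id": range(100, 110)})

n_train, n_val = 6, 2  # illustrative sizes, not those used for CORA

train = toy[:n_train]
validation = toy[n_train : n_train + n_val]
test = toy[n_train + n_val :]

sizes = {"train": len(train), "validation": len(validation), "test": len(test)}
print(sizes)

# The slices are disjoint and together cover every row exactly once.
all_ids = pd.concat([train, validation, test])["paper_id"]
assert all_ids.is_unique and len(all_ids) == len(toy)
```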

#### 1.2 Prepare data for getML

getML requires that we define *roles* for each of the columns.

In [10]:
paper.set_role("paper_id", getml.data.roles.join_key)
paper.set_role("class_label", getml.data.roles.categorical)
cites.set_role(["cited_paper_id", "citing_paper_id"], getml.data.roles.join_key)
content.set_role("paper_id", getml.data.roles.join_key)
content.set_role("word_cited_id", getml.data.roles.categorical)

The goal is to predict seven different labels. We generate a target column for each of those labels.

In [11]:
data_full = getml.data.make_target_columns(paper, "class_label")
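
Conceptually, generating one target column per label is a one-vs-rest encoding. The following pandas sketch illustrates the idea on a hypothetical four-row table; it does not reproduce the getML implementation, and the column names produced by `make_target_columns` may differ from the ones shown here.

```python
import pandas as pd

# Hypothetical miniature of the population table.
papers = pd.DataFrame(
    {
        "paper_id": [10, 11, 12, 13],
        "class_label": ["Theory", "Neural_Networks", "Theory", "Rule_Learning"],
    }
)

# One 0/1 column per distinct label (one-vs-rest encoding).
targets = pd.get_dummies(papers["class_label"], prefix="class_label", dtype=int)

# Each paper belongs to exactly one class, so every row sums to 1.
papers_with_targets = pd.concat([papers, targets], axis=1)
print(papers_with_targets)
```

With seven labels, the same encoding yields seven binary targets, and a multi-class prediction is obtained by picking the target with the highest predicted probability.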

In [12]:
container = getml.data.Container(population=data_full, split=split)
container.add(cites=cites, content=content, paper=paper)
container.freeze()
container

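| | subset | name | rows | type |
| --- | --- | --- | --- | --- |
| 0 | test | population | 500 | View |
| 1 | train | population | 1708 | View |
| 2 | validation | population | 500 | View |

| | alias | name | rows | type |
| --- | --- | --- | --- | --- |
| 0 | cites | cites | 5429 | DataFrame |
| 1 | content | content | 49216 | DataFrame |
| 2 | paper | population | 2708 | DataFrame |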


### 2. Predictive modeling

We have loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.

#### 2.1 Define relational model

To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.

That is because the class label can be predicted using three different pieces of information:

- The keywords used by the paper
- The keywords used by papers it cites and by papers that cite it
- The class labels of papers it cites and of papers that cite it

The main challenge here is that `cites` is used twice, once to connect the _cited_ papers and then to connect the _citing_ papers. To resolve this, we need two placeholders on `cites`.
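
The two roles of `cites` can be illustrated outside of getML with two separate pandas joins. This is only a sketch on hypothetical toy tables that reuse the notebook's column names; the `neighbor_` prefix is an illustrative naming choice, not part of the getML data model.

```python
import pandas as pd

# Hypothetical toy tables with the same column names as in the notebook.
paper = pd.DataFrame({"paper_id": [1, 2, 3], "class_label": ["A", "B", "A"]})
cites = pd.DataFrame({"cited_paper_id": [1, 1, 2], "citing_paper_id": [2, 3, 3]})

neighbors = paper.add_prefix("neighbor_")

# Role 1: for each paper, the papers that cite it
# (paper_id matches cited_paper_id, the neighbor is the citing paper).
cited_by = paper.merge(
    cites, left_on="paper_id", right_on="cited_paper_id"
).merge(neighbors, left_on="citing_paper_id", right_on="neighbor_paper_id")

# Role 2: for each paper, the papers it cites
# (paper_id matches citing_paper_id, the neighbor is the cited paper).
references = paper.merge(
    cites, left_on="paper_id", right_on="citing_paper_id"
).merge(neighbors, left_on="cited_paper_id", right_on="neighbor_paper_id")

print(cited_by[["paper_id", "neighbor_paper_id", "neighbor_class_label"]])
print(references[["paper_id", "neighbor_paper_id", "neighbor_class_label"]])
```

The same table appears in both joins with its key columns swapped, which is why the data model needs two distinct placeholders for it.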

In [13]:
dm = getml.data.DataModel(paper.to_placeholder("population"))

# We need two different placeholders for cites.
dm.add(getml.data.to_placeholder(cites=[cites] * 2, content=content, paper=paper))

dm.population.join(dm.cites[0], on=("paper_id", "cited_paper_id"))

dm.cites[0].join(dm.content, on=("citing_paper_id", "paper_id"))

dm.cites[0].join(
    dm.paper,
    on=("citing_paper_id", "paper_id"),
    relationship=getml.data.relationship.many_to_one,
)

dm.population.join(dm.cites[1], on=("paper_id", "citing_paper_id"))

dm.cites[1].join(dm.content, on=("cited_paper_id", "paper_id"))

dm.cites[1].join(
    dm.paper,
    on=("cited_paper_id", "paper_id"),
    relationship=getml.data.relationship.many_to_one,
)

dm.population.join(dm.content, on="paper_id")

dm

| | data frames | staging table |
| --- | --- | --- |
| 0 | population | POPULATION__STAGING_TABLE_1 |
| 1 | cites, paper | CITES__STAGING_TABLE_2 |
| 2 | cites, paper | CITES__STAGING_TABLE_3 |
| 3 | content | CONTENT__STAGING_TABLE_4 |


#### 2.2 Hyperparameter search

To mimic the approach of the GNN paper, we conduct a small hyperparameter search: we train on the training data, validate on the validation data, and use the untouched test data as a holdout set to get an unbiased estimate of the true performance.
For expediency, we run a grid search along two dimensions and keep the number of levels deliberately small:
 
    num_features: 250, 300, 350
    built-in aggregation sets: minimal, default, all
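
The structure of such a grid search can be sketched independently of getML. In the example below the aggregation sets are referred to by plain strings, and `fake_evaluate` is a hypothetical stand-in scorer that exists only so the sketch runs; it does not reflect any measured accuracy.

```python
from itertools import product

# The two dimensions of the grid; strings stand in for the getML aggregation sets.
num_features_levels = [250, 300, 350]
aggregation_set_levels = ["minimal", "default", "all"]

# Three levels on each of two dimensions give 3 * 3 = 9 combinations.
grid = list(product(num_features_levels, aggregation_set_levels))
print(len(grid))


def run_grid(evaluate):
    """Score every combination and return the one with the best validation score."""
    results = [
        {"num_feat": n, "agg_set": agg, "val_acc": evaluate(n, agg)}
        for n, agg in grid
    ]
    return max(results, key=lambda r: r["val_acc"])


def fake_evaluate(num_feat, agg_set):
    """Hypothetical scorer: arbitrary values, NOT a real model evaluation."""
    return 1.0 if (num_feat, agg_set) == (250, "all") else 0.0


best = run_grid(fake_evaluate)
print(best)
```

In the notebook, the role of `evaluate` is played by fitting the pipeline on the training subset and computing the accuracy on the validation subset.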

In [14]:
mapping = getml.preprocessors.Mapping()
predictor = getml.predictors.XGBoostClassifier()

actual_labels_val = paper[split == "validation"].class_label.to_numpy()
actual_labels_test = paper[split == "test"].class_label.to_numpy()
class_label = paper.class_label.unique()

pipe1 = getml.pipeline.Pipeline(
    data_model=dm, preprocessors=[mapping], predictors=[predictor]
)

In [15]:
def prob_to_acc(prob, actual_labels, class_label) -> float:
    ix_max = np.argmax(prob, axis=1)
    predicted_labels = np.asarray([class_label[ix] for ix in ix_max])
    return (actual_labels == predicted_labels).sum() / len(actual_labels)
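
A small worked example shows what `prob_to_acc` computes. The probabilities and labels below are made up for illustration. Note that the function assumes the columns of `prob` are in the same order as the entries of `class_label`.

```python
import numpy as np


def prob_to_acc(prob, actual_labels, class_label) -> float:
    # Pick the most probable class per row and compare it with the true label.
    ix_max = np.argmax(prob, axis=1)
    predicted_labels = np.asarray([class_label[ix] for ix in ix_max])
    return (actual_labels == predicted_labels).sum() / len(actual_labels)


# Made-up example with three classes and four samples.
class_label = np.array(["A", "B", "C"])
prob = np.array(
    [
        [0.70, 0.20, 0.10],  # argmax -> "A"
        [0.10, 0.30, 0.60],  # argmax -> "C"
        [0.20, 0.50, 0.30],  # argmax -> "B"
        [0.40, 0.35, 0.25],  # argmax -> "A"
    ]
)
actual = np.array(["A", "C", "A", "A"])

acc = prob_to_acc(prob, actual, class_label)
print(acc)  # 3 of 4 predictions match -> 0.75
```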

In [16]:
%%capture
parameter_sweep = {}
i = 0
for num_feat in [250, 300, 350]:
    for aggregation_set in [
        getml.feature_learning.aggregations.FASTPROP.Minimal,
        getml.feature_learning.aggregations.FASTPROP.Default,
        getml.feature_learning.aggregations.FASTPROP.All,
    ]:
        fast_prop = getml.feature_learning.FastProp(
            loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
            aggregation=aggregation_set,
            num_features=num_feat,
        )

        pipe1.feature_learners = [fast_prop]

        pipe1.fit(container.train)

        probs_val = pipe1.predict(container.validation)
        val_acc = prob_to_acc(probs_val, actual_labels_val, class_label)

        parameter_sweep[i] = {
            "num_feat": num_feat,
            "agg_set": aggregation_set,
            "val_acc": val_acc,
        }

        i += 1

In [17]:
best_val_acc_comb = list(
    sorted(parameter_sweep.items(), key=lambda item: item[1]["val_acc"], reverse=True)
)[0][1]

In [18]:
print(f"Accuracy on validation set: {best_val_acc_comb['val_acc']}")
print(f"Number of features used: {best_val_acc_comb['num_feat']}")
print(f"Aggregation set used: {best_val_acc_comb['agg_set']}")

Accuracy on validation set: 0.876
Number of features used: 300
Aggregation set used: frozenset({'MAX', 'SUM', 'AVG', 'COUNT', 'MIN'})


Now that we have identified the parameter combination that yields the highest accuracy on the validation set, let's use the same parameters on the holdout data to obtain an unbiased estimate of the model's predictive performance.

In [19]:
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    aggregation=best_val_acc_comb["agg_set"],
    num_features=best_val_acc_comb["num_feat"],
)

pipe1.feature_learners = [fast_prop]

pipe1.fit(container.train)

probs_test = pipe1.predict(container.test)
test_acc = prob_to_acc(probs_test, actual_labels_test, class_label)

Staging... 100% • 00:00
Preprocessing... 100% • 00:00

Staging... 100% • 00:00
Preprocessing... 100% • 00:00
Retrieving features from cache... 100% • 00:00
FastProp: Building subfeatures... 100% • 00:00
FastProp: Building subfeatures... 100% • 00:00
FastProp: Building features... 100% • 00:00

Time taken: 0:00:00.518892.

Staging... 100% • 00:00
Preprocessing... 100% • 00:00
FastProp: Building subfeatures... 100% • 00:00
FastProp: Building subfeatures... 100% • 00:00
FastProp: Building features... 100% • 00:00

In [20]:
print(f"Accuracy on the test set: {test_acc}")

Accuracy on the test set: 0.906


# Conclusion

This notebook demonstrates how **getML**, powered by its **FastProp** feature engineering algorithm and **XGBoost**, surpasses the current state-of-the-art on the CORA dataset. By replicating the data split and hyperparameter optimization methods of Izadi et al., we achieve a record-breaking accuracy of **90.6%**, exceeding their previous benchmark of 90.16%.

At the core of this success is **FastProp**, which automates feature creation for relational datasets by efficiently generating statistical and temporal aggregates.

This example highlights how cutting-edge performance can be achieved without the need for manual feature engineering or complex GNN-based approaches, enabling faster iteration and greater model interpretability.

By incorporating getML into their workflows, data scientists can achieve superior results with less effort, seamlessly combining efficiency with state-of-the-art performance.

# References

Izadi, Fang, Stevenson, Lin (2020): Optimization of Graph Neural Networks with Natural Gradient Descent   
https://arxiv.org/pdf/2008.09624v1