Recent years have seen a rapid proliferation of sophisticated machine learning algorithms, and keeping up with that development has become a full-time job. Wouldn't it be nice to have one tool that fits all and still delivers cutting-edge results? Look no further: getML to the rescue!

In previous notebooks, we analysed the performance of getML on the CORA dataset and benchmarked it extensively against alternative approaches.
In this short notebook, we demonstrate how getML outperforms the state of the art with just a small tweak to its configuration.

First, we run some boilerplate code.

In [1]:
%pip install -q "ipywidgets==8.1.5"
!pip install /home/jan-meyer/Documents/gitlab/monorepo/src/python-api

Note: you may need to restart the kernel to use updated packages.
Processing /home/jan-meyer/Documents/gitlab/monorepo/src/python-api
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting jinja2 (from getml==1.5.0)
  Using cached jinja2-3.1.4-py3-none-any.whl.metadata (2.6 kB)
Collecting numpy~=1.22 (from getml==1.5.0)
  Using cached numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
Collecting pandas (from getml==1.5.0)
  Using cached pandas-2.2.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting pyarrow~=16.0 (from getml==1.5.0)
  Using cached pyarrow-16.1.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (3.0 kB)
Collecting rich~=13.0 (from getml==1.5.0)
  Using cached rich-13.8.1-py3-none-any.whl.metadata (18 kB)
Collecting markdown-it-py>=2.2.0 (from rich~=13.0->getml==1.5.0)
  Usin

In [2]:
import os

import numpy as np
import pandas as pd

import getml

print(f"getML API version: {getml.__version__}\n")

getML API version: 1.5.0



In [3]:
# getml.engine.shutdown()
getml.engine.launch(allow_remote_ips=True, token="token")
getml.engine.set_project("cora_sota")

OSError: Could not find getML executable in any of the following locations:
['/home/jan-meyer/.getML', '/usr/local/getML', '/home/jan-meyer/Documents/github/getml-demo/.venv/lib/python3.11/site-packages/getml/.getML']

Refer to the installation documentation for more information:
https://getml.com/latest/install/

### 1. Loading data

#### 1.1 Download from source

We begin by downloading the data from the MySQL server of the relational learning repository:

In [4]:
conn = getml.database.connect_mysql(
    host="db.relational-data.org",
    dbname="CORA",
    port=3306,
    user="guest",
    password="relational",
)

conn

Connection(dbname='CORA', dialect='mysql', host='db.relational-data.org', port=3306)

In [5]:
def load_if_needed(name):
    """
    Loads the data from the relational learning
    repository, if the data frame has not already
    been loaded.
    """
    if not getml.data.exists(name):
        data_frame = getml.data.DataFrame.from_db(name=name, table_name=name, conn=conn)
        data_frame.save()
    else:
        data_frame = getml.data.load_data_frame(name)
    return data_frame
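
The load-or-fetch pattern used by `load_if_needed` can be illustrated without a running getML engine. The sketch below is a stand-in and not getML's API: `store` is a hypothetical in-memory dictionary that plays the role of the project's saved data frames, and `fake_fetch` is a hypothetical loader that plays the role of the database download.

```python
from typing import Callable, Dict, List


def load_if_needed(
    name: str,
    store: Dict[str, List[int]],
    fetch: Callable[[str], List[int]],
) -> List[int]:
    """Return the cached table if present; otherwise fetch it and cache it."""
    if name not in store:
        # First request: pay the download cost once and keep the result.
        store[name] = fetch(name)
    return store[name]


calls: List[str] = []  # records every simulated download


def fake_fetch(name: str) -> List[int]:
    calls.append(name)
    return [1, 2, 3]


store: Dict[str, List[int]] = {}
first = load_if_needed("paper", store, fake_fetch)
second = load_if_needed("paper", store, fake_fetch)
print(calls)  # the simulated download ran only for the first request
```

The second call is served from `store`, which is why rerunning the notebook does not hit the remote database again once the data frames have been saved.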

In [6]:
paper = load_if_needed("paper")
cites = load_if_needed("cites")
content = load_if_needed("content")

Here we deviate from the regular procedure by introducing the exact same train-test split as the [current top entry](https://paperswithcode.com/paper/optimization-of-graph-neural-networks-with). While we maintain that testing on a single split is not sufficient to demonstrate the performance of an algorithm on a specific data set, we proceed this way to maximize comparability with the current leader of the leaderboard. For a more extensive investigation of getML's performance on the CORA dataset, check out our other notebooks.

To achieve the identical split we first need to match papers and their associated word matrix across data sources. 

In [7]:
if not os.path.exists("assets/zuordnung.txt"):
    from zuordnung import run_zuordnung

    # may take 90 minutes or longer to run
    run_zuordnung(content)

In [8]:
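import ast

# Parse the stored mapping without executing arbitrary code.
# Assumes the file holds a plain Python literal (e.g. a list of pairs).
with open("assets/zuordnung.txt", "r") as f:
    zuordnung = ast.literal_eval(f.read())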


paper_df = paper.to_pandas()
paper_df["paper_id"] = paper_df["paper_id"].astype(int)
zuo_df = pd.DataFrame(zuordnung)
zuo_df[0] = zuo_df[0].astype(int)
paper_df = paper_df.merge(zuo_df, left_on="paper_id", right_on=0).sort_values(by=1)
paper_df = paper_df[["class_label", "paper_id"]]
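
The merge-and-sort step above can be sketched in plain Python. The data here is hypothetical: the sketch assumes that the mapping consists of `(paper_id, position)` pairs, mirroring columns `0` and `1` of `zuo_df`, and the class labels `"A"`, `"B"`, `"C"` are placeholders.

```python
# Hypothetical mapping: (paper_id, position in the external data source).
zuordnung = [(35, 2), (40, 0), (114, 1)]

# Hypothetical paper table: paper_id -> class_label.
papers = {40: "A", 114: "B", 35: "C", 99: "A"}

# Inner join on paper_id, then sort by the external position.
matched = sorted(
    (
        (position, paper_id, papers[paper_id])
        for paper_id, position in zuordnung
        if paper_id in papers
    ),
    key=lambda row: row[0],
)

# Keep only the two columns that the notebook keeps.
ordered = [(label, paper_id) for _, paper_id, label in matched]
print(ordered)
```

As with an inner merge, a paper without an entry in the mapping (here `99`) drops out of the result, so the row order and row count both depend on the mapping file.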

We split the sorted data set according to the instructions in the GNN paper (see IV. Experiments, A. Datasets, third split).

In [9]:
paper_train = getml.data.DataFrame.from_pandas(paper_df[:1707], name="train")
paper_val = getml.data.DataFrame.from_pandas(
    paper_df[1707 : 1707 + 500], name="validation"
)
paper_test = getml.data.DataFrame.from_pandas(paper_df[1707 + 500 :], name="test")

paper, split = getml.data.split.concat(
    "population", train=paper_train, validation=paper_val, test=paper_test
)

Similar to the approach in the paper, we perform hyperparameter optimization and select the parameters that perform best on the validation set. The performance on the test set serves as our benchmark value. 
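
Because the split is purely positional, it depends entirely on the row order established in the previous step. A minimal sketch with toy sizes (the helper name `positional_split` and the sizes are illustrative assumptions, not part of getML):

```python
from typing import List, Sequence, Tuple


def positional_split(
    rows: Sequence[int], n_train: int, n_val: int
) -> Tuple[List[int], List[int], List[int]]:
    """Cut already-ordered rows into train, validation, and test blocks."""
    train = list(rows[:n_train])
    val = list(rows[n_train : n_train + n_val])
    test = list(rows[n_train + n_val :])
    return train, val, test


rows = list(range(10))
train, val, test = positional_split(rows, n_train=6, n_val=2)
print(len(train), len(val), len(test))
```

The three blocks are disjoint and together cover every row, so no paper can leak from the test block into training or validation.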

#### 1.2 Prepare data for getML

getML requires that we define *roles* for each of the columns.

In [10]:
paper.set_role("paper_id", getml.data.roles.join_key)
paper.set_role("class_label", getml.data.roles.categorical)
cites.set_role(["cited_paper_id", "citing_paper_id"], getml.data.roles.join_key)
content.set_role("paper_id", getml.data.roles.join_key)
content.set_role("word_cited_id", getml.data.roles.categorical)

The goal is to predict seven different labels, so we generate a target column for each of them. The split into training, validation, and test sets defined above is then passed to the container.

In [11]:
data_full = getml.data.make_target_columns(paper, "class_label")
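
Conceptually, this turns one categorical column into one binary target per class (a one-vs-rest encoding). The sketch below illustrates the idea in plain Python. It is not getML's implementation, and the helper name `one_vs_rest_targets`, the column naming scheme, and the labels are assumptions made for illustration.

```python
from typing import Dict, List


def one_vs_rest_targets(labels: List[str]) -> Dict[str, List[int]]:
    """Build one 0/1 column per distinct label."""
    classes = sorted(set(labels))
    return {
        f"class_label={cls}": [int(label == cls) for label in labels]
        for cls in classes
    }


labels = ["A", "B", "A", "C"]  # hypothetical class labels
targets = one_vs_rest_targets(labels)
for name, column in targets.items():
    print(name, column)
```

Each row has exactly one target set to 1, which is what later allows the predicted probabilities to be compared across targets with an argmax.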

In [12]:
container = getml.data.Container(population=data_full, split=split)
container.add(cites=cites, content=content, paper=paper)
container.freeze()
container

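|   | subset     | name       | rows | type |
|---|------------|------------|------|------|
| 0 | test       | population | 500  | View |
| 1 | train      | population | 1708 | View |
| 2 | validation | population | 500  | View |

|   | alias   | name       | rows  | type      |
|---|---------|------------|-------|-----------|
| 0 | cites   | cites      | 5429  | DataFrame |
| 1 | content | content    | 49216 | DataFrame |
| 2 | paper   | population | 2708  | DataFrame |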


### 2. Predictive modeling

We loaded the data and defined the roles. Next, we create a getML pipeline for relational learning.

#### 2.1 Define relational model

To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.

That is because the class label can be predicted using three different pieces of information:

- The keywords used by the paper
- The keywords used by papers it cites and by papers that cite it
- The class labels of papers it cites and of papers that cite it

The main challenge here is that `cites` is used twice, once to connect the _cited_ papers and then to connect the _citing_ papers. To resolve this, we need two placeholders on `cites`.

In [13]:
dm = getml.data.DataModel(paper.to_placeholder("population"))

# We need two different placeholders for cites.
dm.add(getml.data.to_placeholder(cites=[cites] * 2, content=content, paper=paper))

dm.population.join(dm.cites[0], on=("paper_id", "cited_paper_id"))

dm.cites[0].join(dm.content, on=("citing_paper_id", "paper_id"))

dm.cites[0].join(
    dm.paper,
    on=("citing_paper_id", "paper_id"),
    relationship=getml.data.relationship.many_to_one,
)

dm.population.join(dm.cites[1], on=("paper_id", "citing_paper_id"))

dm.cites[1].join(dm.content, on=("cited_paper_id", "paper_id"))

dm.cites[1].join(
    dm.paper,
    on=("cited_paper_id", "paper_id"),
    relationship=getml.data.relationship.many_to_one,
)

dm.population.join(dm.content, on="paper_id")

dm

|   | data frames  | staging table               |
|---|--------------|-----------------------------|
| 0 | population   | POPULATION__STAGING_TABLE_1 |
| 1 | cites, paper | CITES__STAGING_TABLE_2      |
| 2 | cites, paper | CITES__STAGING_TABLE_3      |
| 3 | content      | CONTENT__STAGING_TABLE_4    |
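
The reason `cites` needs two placeholders becomes clear when the two join directions are written out by hand. The sketch below uses a hypothetical three-edge citation table and plain Python instead of getML joins.

```python
from typing import List, Tuple

# Toy citation edges as (cited_paper_id, citing_paper_id) pairs.
cites: List[Tuple[int, int]] = [(1, 2), (1, 3), (2, 3)]


def papers_citing(paper_id: int) -> List[int]:
    """First placeholder: match paper_id against cited_paper_id."""
    return [citing for cited, citing in cites if cited == paper_id]


def papers_cited_by(paper_id: int) -> List[int]:
    """Second placeholder: match paper_id against citing_paper_id."""
    return [cited for cited, citing in cites if citing == paper_id]


print(papers_citing(2), papers_cited_by(2))
```

The same table yields two different neighbourhoods for paper `2`, depending on which key is matched. Each neighbourhood is then joined to `content` and `paper` separately, which is why the data model contains two `cites` staging tables.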


#### 2.2 Hyperparameter search

To mimic the approach of the GNN paper, we conduct a small hyperparameter search: we train on the training data, validate on the validation data, and use the untouched test data as a holdout set to get an unbiased estimate of the true performance.
For expediency, we run a grid search along two dimensions and keep the number of levels deliberately small:
 
    num_features: 250, 300, 350
    built-in aggregation sets: minimal, default, all
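
With three levels per dimension, the grid contains nine combinations. A minimal sketch of how such a grid can be enumerated, using plain strings as stand-ins for the getML aggregation sets:

```python
from itertools import product

num_features_levels = [250, 300, 350]
# Plain labels only; these strings stand in for the getML aggregation sets.
aggregation_sets = ["minimal", "default", "all"]

# The outer dimension varies slowest, like a nested for loop.
grid = list(product(num_features_levels, aggregation_sets))

for index, (num_feat, agg_set) in enumerate(grid):
    print(index, num_feat, agg_set)
```

The enumeration index follows the same order as a nested loop over `num_features` and then the aggregation sets, so index 3 corresponds to 300 features with the minimal set.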

In [14]:
mapping = getml.preprocessors.Mapping()
predictor = getml.predictors.XGBoostClassifier()

actual_labels_val = paper[split == "validation"].class_label.to_numpy()
actual_labels_test = paper[split == "test"].class_label.to_numpy()
class_label = paper.class_label.unique()

pipe1 = getml.pipeline.Pipeline(
    data_model=dm, preprocessors=[mapping], predictors=[predictor]
)

In [15]:
def prob_to_acc(prob, actual_labels, class_label) -> float:
    ix_max = np.argmax(prob, axis=1)
    predicted_labels = np.asarray([class_label[ix] for ix in ix_max])
    return (actual_labels == predicted_labels).sum() / len(actual_labels)
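
To see what `prob_to_acc` computes, here is a dependency-free restatement applied to a hypothetical four-row example. The function name `prob_to_acc_pure`, the probabilities, and the labels are made up for illustration.

```python
from typing import Sequence


def prob_to_acc_pure(
    prob: Sequence[Sequence[float]],
    actual_labels: Sequence[str],
    class_label: Sequence[str],
) -> float:
    """Pick the most probable class per row and return the share of hits."""
    predicted = [
        class_label[max(range(len(row)), key=row.__getitem__)] for row in prob
    ]
    hits = sum(p == a for p, a in zip(predicted, actual_labels))
    return hits / len(actual_labels)


# Column j of prob holds the probability of class_label[j].
prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
    [0.2, 0.2, 0.6],
]
class_label = ["A", "B", "C"]
actual = ["A", "C", "A", "C"]

acc = prob_to_acc_pure(prob, actual, class_label)
print(acc)
```

The third row is predicted as `"B"` although the actual label is `"A"`, so three of four rows are correct. The result is only meaningful if the columns of `prob` are in the same order as `class_label`.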

In [16]:
%%capture
parameter_sweep = {}
i = 0
for num_feat in [250, 300, 350]:
    for aggregation_set in [
        getml.feature_learning.aggregations.FASTPROP.Minimal,
        getml.feature_learning.aggregations.FASTPROP.Default,
        getml.feature_learning.aggregations.FASTPROP.All,
    ]:
        fast_prop = getml.feature_learning.FastProp(
            loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
            aggregation=aggregation_set,
            num_features=num_feat,
        )

        pipe1.feature_learners = [fast_prop]

        pipe1.fit(container.train)

        probs_val = pipe1.predict(container.validation)
        val_acc = prob_to_acc(probs_val, actual_labels_val, class_label)

        parameter_sweep[i] = {
            "num_feat": num_feat,
            "agg_set": aggregation_set,
            "val_acc": val_acc,
        }

        i += 1

OSError: The Mapping preprocessor is not supported in the community edition. Please upgrade to getML enterprise to use this. An overview of what is supported in the community edition can be found in the official getML documentation.

In [17]:
best_val_acc_comb = list(
    sorted(parameter_sweep.items(), key=lambda item: item[1]["val_acc"], reverse=True)
)[0][1]

IndexError: list index out of range
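
The selection logic has two properties worth spelling out: ties are resolved in favour of the combination that was tried first, and an empty `parameter_sweep` leads to the `IndexError` above, which is what happens when the sweep cell aborts before storing any result. The sketch below uses three entries from this notebook's sweep output. The helper `best_combination` and its `None` fallback are illustrative additions.

```python
# Validation accuracies taken from the sweep output in this notebook;
# the dictionary keys are the loop counter values.
parameter_sweep = {
    0: {"num_feat": 250, "val_acc": 0.872},
    3: {"num_feat": 300, "val_acc": 0.874},
    6: {"num_feat": 350, "val_acc": 0.874},
}


def best_combination(sweep):
    """Return the entry with the highest validation accuracy, or None if empty."""
    ranked = sorted(
        sweep.items(), key=lambda item: item[1]["val_acc"], reverse=True
    )
    # sorted() is stable even with reverse=True, so ties keep insertion order.
    return ranked[0][1] if ranked else None


best = best_combination(parameter_sweep)
empty = best_combination({})
print(best, empty)
```

Entries 3 and 6 share the same validation accuracy, and the stable sort keeps entry 3 in front, so the smaller feature count is selected.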

In [30]:
print(f"Accuracy on validation set: {best_val_acc_comb['val_acc']}")
print(f"Number of features used: {best_val_acc_comb['num_feat']}")
print(f"Aggregation set used: {best_val_acc_comb['agg_set']}")

Accuracy on validation set: 0.874
Number of features used: 300
Aggregation set used: frozenset({'AVG', 'MIN', 'MAX', 'SUM', 'COUNT'})


Now that we have identified the parameter combination that yields the highest accuracy on the validation set, let's use the same parameters on the holdout data to obtain an unbiased estimate of the model's predictive performance.

In [31]:
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,
    aggregation=best_val_acc_comb["agg_set"],
    num_features=best_val_acc_comb["num_feat"],
)

pipe1.feature_learners = [fast_prop]

pipe1.fit(container.train)

probs_test = pipe1.predict(container.test)
test_acc = prob_to_acc(probs_test, actual_labels_test, class_label)

Time taken: 0:00:01.034566.




In [32]:
print(f"Accuracy on the test set: {test_acc}")

Accuracy on the test set: 0.906


### 3. Conclusion

This notebook set out to attain a new record for predictive performance on the well-known CORA data set using exclusively getML's feature learning framework. To maximize comparability, we mimicked the methodology of the current record holder.

We replicated the exact data split used in their research and performed hyperparameter optimization in a similar manner. On the holdout dataset, we achieved an accuracy of 90.6%, which compares favorably to the previous record of 90.16%. Therefore, our solution, which combines FastProp for automated feature engineering with XGBoost for classification, can now be considered the new state of the art on this popular benchmark dataset.

The ease of implementation is remarkable. With only minimal parameter tuning, getML beat an advanced Graph Neural Network algorithm. Cutting-edge predictive performance is now within reach of every data scientist who incorporates getML into their prediction pipelines.

In [33]:
list(
    sorted(parameter_sweep.items(), key=lambda item: item[1]["val_acc"], reverse=True)
)

[(3,
  {'num_feat': 300,
   'agg_set': frozenset({'AVG', 'COUNT', 'MAX', 'MIN', 'SUM'}),
   'val_acc': 0.874}),
 (6,
  {'num_feat': 350,
   'agg_set': frozenset({'AVG', 'COUNT', 'MAX', 'MIN', 'SUM'}),
   'val_acc': 0.874}),
 (0,
  {'num_feat': 250,
   'agg_set': frozenset({'AVG', 'COUNT', 'MAX', 'MIN', 'SUM'}),
   'val_acc': 0.872}),
 (4,
  {'num_feat': 300,
   'agg_set': frozenset({'AVG',
              'COUNT',
              'COUNT DISTINCT',
              'COUNT MINUS COUNT DISTINCT',
              'FIRST',
              'LAST',
              'MAX',
              'MEDIAN',
              'MIN',
              'MODE',
              'STDDEV',
              'SUM',
              'TREND'}),
   'val_acc': 0.87}),
 (7,
  {'num_feat': 350,
   'agg_set': frozenset({'AVG',
              'COUNT',
              'COUNT DISTINCT',
              'COUNT MINUS COUNT DISTINCT',
              'FIRST',
              'LAST',
              'MAX',
              'MEDIAN',
              'MIN',
              'M