58 changes: 29 additions & 29 deletions README.md
@@ -8,7 +8,7 @@
</p>

<p align="center" style="text-align: center;">
<a href="https://getml.com/contact" target="_blank">
<a href="https://getml.com/latest/contact" target="_blank">
<img src="https://img.shields.io/badge/schedule-a_meeting-blueviolet.svg" /></a>
<a href="mailto:hello@getml.com" target="_blank">
<img src="https://img.shields.io/badge/contact-us_by_mail-orange.svg" /></a>
@@ -20,24 +20,24 @@

# Introduction

This repository contains different [Jupyter Notebooks](https://jupyter.org/) to demonstrate the capabilities of [getML](https://www.getml.com/) in the realm of machine learning on relational data-sets in various domains. getML and its feature engineering algorithms ([FastProp](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#fastprop), [Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel), [Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost), [RelMT](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relmt)), its [predictors](https://docs.getml.com/latest/user_guide/predicting/predicting.html#using-getml) (LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor) and its [hyperparameter optimizer](https://docs.getml.com/latest/user_guide/hyperopt/hyperopt.html#hyperparameter-optimization) (RandomSearch, LatinHypercubeSearch, GaussianHyperparameterSearch), are benchmarked against competing tools in similar categories, like [featuretools](https://www.featuretools.com/), [tsfresh](https://tsfresh.com/), [prophet](https://facebook.github.io/prophet/). While [FastProp](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#fastprop) usually outperforms the competition in terms of runtime and resource requirements, the more sophisticated algorithms ([Multirel](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#multirel), [Relboost](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relboost), [RelMT](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relmt)), which are part of the [professional and enterprise feature-sets](https://www.getml.com/pricing), can lead to higher accuracy with lower resource requirements still then the competition. 
The demonstrations are done on publicly available data-sets, which are standardly used for such comparisons.
This repository contains different [Jupyter Notebooks](https://jupyter.org) to demonstrate the capabilities of [getML](https://www.getml.com) in the realm of machine learning on relational data-sets in various domains. getML and its feature engineering algorithms ([FastProp](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-fastprop), [Multirel](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-multirel), [Relboost](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-relboost), [RelMT](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-relmt)), its [predictors](https://getml.com/latest/user_guide/concepts/predicting#using-getml) (LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor) and its [hyperparameter optimizer](https://getml.com/latest/user_guide/concepts/hyperopt#hyperparameter-optimization) (RandomSearch, LatinHypercubeSearch, GaussianHyperparameterSearch), are benchmarked against competing tools in similar categories, like [featuretools](https://www.featuretools.com/), [tsfresh](https://tsfresh.com/), and [prophet](https://facebook.github.io/prophet/). 
While [FastProp](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-fastprop) usually outperforms the competition in terms of runtime and resource requirements, the more sophisticated algorithms ([Multirel](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-multirel), [Relboost](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-relboost), [RelMT](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-relmt)), which are part of the [Enterprise edition](https://getml.com/latest/enterprise), often lead to even higher accuracy while maintaining low resource requirements. The demonstrations are done on publicly available data-sets, which are standardly used for such comparisons.

# Table of Contents

* [Introduction](#introduction)
* [Table of Contents](#table-of-contents)
* [Usage](#usage)
* [Reading Online](#reading-online)
* [Experimenting Locally](#experimenting-locally)
* [Using Docker](#using-docker)
* [On the Machine (Linux/x64 & arm64)](#on-the-machine-linuxx64--arm64)
* [Notebooks](#notebooks)
* [Overview](#overview)
* [Descriptions](#descriptions)
* [Quick access by grouping by](#quick-access-by-grouping-by)
* [Benchmarks](#benchmarks)
* [FastProp-Benchmarks](#fastprop-benchmarks)
* [Further Benchmarks in the Relational Dataset Repository](#further-benchmarks-in-the-relational-dataset-repository)
- [Introduction](#introduction)
- [Table of Contents](#table-of-contents)
- [Usage](#usage)
- [Reading Online](#reading-online)
- [Experimenting Locally](#experimenting-locally)
- [Using Docker](#using-docker)
- [On the Machine (Linux/x64 \& arm64)](#on-the-machine-linuxx64--arm64)
- [Notebooks](#notebooks)
- [Overview](#overview)
- [Descriptions](#descriptions)
- [Quick access by grouping by](#quick-access-by-grouping-by)
- [Benchmarks](#benchmarks)
- [FastProp Benchmarks](#fastprop-benchmarks)
- [Further Benchmarks in the Relational Dataset Repository](#further-benchmarks-in-the-relational-dataset-repository)

# Usage

@@ -55,40 +55,40 @@ To experiment with the notebooks, such as playing with different pipelines and p

A `docker-compose.yml` and a `Dockerfile` are provided for easy usage.

Simply clone this repository and command to start the `notebooks` service. The image, it depends on, will be built if it is not already available.
Simply clone this repository and run the docker command to start the `notebooks` service. The image it depends on will be built if it is not already available.

```
$ git clone https://github.com/getml/getml-demo.git
$ docker compose up notebooks
```

To open Jupyter Lab in the browser, look for the following lines in the output and copy-paste it in your browser:
To open Jupyter Lab in the browser, look for the following lines in the output and copy-paste them in your browser:

```
Or copy and paste one of these URLs:

http://localhost:8888/lab?token=<generated_token>
```

After the first `getml.engine.launch(...)` is executed and the engine is started, its monitor can be opened in the browser under
After the first `getml.engine.launch(...)` is executed and the Engine is started, the corresponding Monitor can be opened in the browser under

```
http://localhost:1709/#/token/token
```

> [!NOTE]
> Using alternatives to [Docker Desktop](https://www.docker.com/products/docker-desktop/) like
> * [Podman](https://podman.io/),
> * [Podman Desktop](https://podman-desktop.io/) or
> * [Rancher Desktop](https://rancherdesktop.io/) with a container engine like dockerd(moby) or containerd(nerdctl)
> Using alternatives to [Docker Desktop](https://www.docker.com/products/docker-desktop) like
> * [Podman](https://podman.io),
> * [Podman Desktop](https://podman-desktop.io) or
> * [Rancher Desktop](https://rancherdesktop.io) with a container engine like dockerd(moby) or containerd(nerdctl)
>
> allows bind-mounting the notebooks in a user-writeable way (this might need to include `userns_mode: keep-id`) instead of having to `COPY` them in. In combination with volume-binding `/home/getml/.getML/logs` and `/home/getml/.getML/projects`, runs and changes can be persisted across containers.
> allows bind-mounting the notebooks in a user-writeable way (this might need to be included: `userns_mode: keep-id`) instead of having to `COPY` them in. In combination with volume-binding `/home/user/.getML/logs` and `/home/user/.getML/projects`, runs and changes can be persisted across containers.

### On the Machine (Linux/x64 & arm64)

Alternatively, getML and the notebooks can be run natively on the local Linux machine by having certain software installed, like Python and some Python libraries, Jupyter-Lab and the getML engine. The [getML Python library](https://github.com/getml/getml-community/) provides an engine version without [enterprise features](https://www.getml.com/pricing). But as those features are shown in the demonstration notebooks, the [trail of the enterprise version](https://www.getml.com/download) can be used for those cases.
Alternatively, getML and the notebooks can be run natively on the local Linux machine by having certain software installed, like Python and some Python libraries, Jupyter-Lab and the getML Engine. The [getML Python library](https://github.com/getml/getml-community) provides an Engine version without [Enterprise features](https://getml.com/latest/enterprise). In order to replicate Enterprise functionalities in the notebooks, you may obtain an [Enterprise trial version](https://getml.com/latest/enterprise/request-trial).

The following commands will set up a Python environment with necessary Python libraries and the trail of the getML enterprise version, and Jupyter-Lab:
The following commands will set up a Python environment with necessary Python libraries and the getML Enterprise trial version, and Jupyter-Lab:

```
$ git clone https://github.com/getml/getml-demo.git
@@ -101,7 +101,7 @@ $ jupyter-lab
```

> [!TIP]
> Install the [trail of the enterprise version](https://www.getml.com/download) via the [Install getML on Linux guide](https://docs.getml.com/latest/home/installation/linux.html#install-getml-on-linux) to try the enterprise features.
> Install the [Enterprise trial version](https://getml.com/latest/enterprise/request-trial) via the [Install getML on Linux guide](https://getml.com/latest/install/packages/linux#install-getml-on-linux) to try the Enterprise features.

With the last command, Jupyter-Lab should automatically open in the browser. If not, look for the following lines in the output and copy-paste them into your browser:

@@ -111,7 +111,7 @@ Or copy and paste one of these URLs:
http://localhost:8888/lab?token=<generated_token>
```

After the first `getml.engine.launch(...)` is executed and the engine is started, its monitor can be opened in the browser under
After the first `getml.engine.launch(...)` is executed and the Engine is started, the corresponding Monitor can be opened in the browser under

```
http://localhost:1709/#/token/token
@@ -446,7 +446,7 @@ relational data scheme involving many tables.

An algorithm that generates specific different features can only use columns for conditions; it is not allowed to aggregate columns, and it doesn't need to do so. That means the computational complexity is linear instead of quadratic. For data sets with a large number of columns, this can make all the difference in the world. For instance, if you have 100 columns, the size of the search space of the second approach is only 1% of the size of the search space of the first one.
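
The 1% figure can be checked with a few lines of Python. This is only a rough sketch of the counting argument, not getML code: it assumes the first approach pairs every column to aggregate with every column used in a condition (which is what "quadratic" suggests), while the second approach only picks a column for the condition. The function name `search_space_sizes` is made up for this illustration.

```python
def search_space_sizes(n_columns: int) -> tuple[int, int]:
    """Return (quadratic, linear) candidate counts for n_columns columns."""
    # Assumed first approach: each aggregated column is combined with
    # each column that can appear in a condition.
    aggregate_and_condition = n_columns * n_columns
    # Second approach: columns are only used for conditions.
    condition_only = n_columns
    return aggregate_and_condition, condition_only


quadratic, linear = search_space_sizes(100)
ratio = linear / quadratic
print(f"{quadratic} vs {linear} candidates ({ratio:.0%})")
```

In this simplified count the ratio is `1 / n_columns`, so the gap between the two approaches widens as the number of columns grows.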

getML features an algorithm called relboost, which generates features according to this principle and is therefore very suitable for data sets with many columns.
getML features an algorithm called Relboost, which generates features according to this principle and is therefore very suitable for data sets with many columns.

To illustrate the problem, we use a data set related to robotics. When robots interact with humans, the most important thing is that they don't hurt people. In order to prevent such accidents, the force vector on the robot's arm is measured. However, measuring the force vector is expensive. Therefore, we want to consider an alternative approach, where we predict the force vector based on other sensor data that are less costly to measure. To do so, we use machine learning. However, the data set contains measurements from almost 100 different sensors and we do not know which and how many sensors are relevant for predicting the force vector.

4 changes: 2 additions & 2 deletions air_pollution.ipynb
@@ -195,7 +195,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we split our data. We introduce a [simple, time-based split](https://docs.getml.com/latest/api/split/getml.data.split.time.html) and use all data until 2013-12-31 for training and everything starting from 2014-01-01 for testing."
"First, we split our data. We introduce a [simple, time-based split](https://getml.com/latest/reference/data/split/#getml.data.split.time.time) and use all data until 2013-12-31 for training and everything starting from 2014-01-01 for testing."
]
},
{
@@ -4672,7 +4672,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a typical [RelMT](https://docs.getml.com/latest/user_guide/feature_engineering/feature_engineering.html#relmt) feature, where the aggregation (`SUM` in this case) is applied conditionally – the conditions are learned by `RelMT` – to a set of linear models, whose weights are, again, learned by `RelMT`."
"This is a typical [RelMT](https://getml.com/latest/user_guide/concepts/feature_engineering/#feature-engineering-algorithms-relmt) feature, where the aggregation (`SUM` in this case) is applied conditionally – the conditions are learned by `RelMT` – to a set of linear models, whose weights are, again, learned by `RelMT`."
]
},
{
6 changes: 3 additions & 3 deletions atherosclerosis.ipynb
@@ -207,8 +207,8 @@
"\n",
"The `getml.datasets.load_atherosclerosis` method took care of the entire data lifting:\n",
"* Downloads csv's from our servers in python\n",
"* Converts csv's to getML [DataFrames](https://docs.getml.com/latest/api/getml.data.DataFrame.html#getml.data.DataFrame)\n",
"* Sets [roles](https://docs.getml.com/latest/user_guide/annotating_data/annotating_data.html#roles) to columns inside getML DataFrames"
"* Converts csv's to getML [DataFrames](https://getml.com/latest/reference/data/data_frame#getml.data.DataFrame)\n",
"* Sets [roles](https://getml.com/latest/user_guide/concepts/annotating_data#roles) to columns inside getML DataFrames"
]
},
{
@@ -18729,7 +18729,7 @@
"source": [
"#### 1.3 Define relational model\n",
"\n",
"To start with relational learning, we need to specify an abstract data model. Here, we use the [high-level star schema API](https://docs.getml.com/latest/api/getml.data.StarSchema.html) that allows us to define the abstract data model and construct a [container](https://docs.getml.com/latest/api/getml.data.Container.html) with the concrete data in one go. While a simple `StarSchema` indeed works in many cases, it is not sufficient for more complex data models like snowflake schemas, where you would have to define the data model and construct the container in separate steps, by utilizing getML's [full-fledged data model](https://docs.getml.com/latest/api/getml.data.DataModel.html) and [container](https://docs.getml.com/latest/api/getml.data.Container.html) APIs respectively."
"To start with relational learning, we need to specify an abstract data model. Here, we use the [high-level star schema API](https://getml.com/latest/reference/data/star_schema) that allows us to define the abstract data model and construct a [container](https://getml.com/latest/reference/data/container) with the concrete data in one go. While a simple `StarSchema` indeed works in many cases, it is not sufficient for more complex data models like snowflake schemas, where you would have to define the data model and construct the container in separate steps, by utilizing getML's [full-fledged data model](https://getml.com/latest/reference/data/data_model) and [container](https://getml.com/latest/reference/data/container) APIs respectively."
]
},
{
4 changes: 2 additions & 2 deletions dodgers.ipynb
@@ -977,7 +977,7 @@
"source": [
"#### 1.3 Define relational model\n",
"\n",
"To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time series related join conditions (`horizon`, `memory` and `allow_lagged_targets`). This is done abstractly using [Placeholders](https://docs.getml.com/latest/user_guide/data_model/data_model.html#placeholders)\n",
"To start with relational learning, we need to specify the data model. We manually replicate the appropriate time series structure by setting time series related join conditions (`horizon`, `memory` and `allow_lagged_targets`). This is done abstractly using [Placeholders](https://getml.com/latest/user_guide/concepts/data_model#placeholders)\n",
"\n",
"The data model consists of two tables:\n",
"* __Population table__ `traffic_{test/train}`: holds target and the contemporarily available time-based components\n",
@@ -6484,7 +6484,7 @@
"\n",
"We have compared getML's feature learning algorithms to Prophet and tsfresh on a data set related to traffic on LA's 101 North freeway. We found that getML significantly outperforms both Prophet and tsfresh. These results are consistent with the view that relational learning is a powerful tool for time series analysis.\n",
"\n",
"You are encouraged to reproduce these results. You will need [getML](https://getml.com/product) to do so. You can download it for free."
"You are encouraged to reproduce these results. You will need [getML](https://getml.com) to do so. You can download it for free."
]
}
],