# Tutorial 

**From data to Knowledge Graph**

In [1]:
# This cell is not important for understanding the tutorial;
# it only prints the version of the biokb_ipni package and the date of execution.
import os
import biokb_ipni
from datetime import datetime

# delete existing environment variables
if "CONNECTION_STR" in os.environ:
    del os.environ["CONNECTION_STR"]

print(
    "biokb_ipni version:",
    biokb_ipni.__version__,
    "Date:",
    datetime.now().strftime("%Y-%m-%d"),
)

biokb_ipni version: 0.1.8 Date: 2026-02-03


**Tip**: This notebook is available on GitHub [here](https://github.com/biokb/biokb_ipni/blob/main/docs/notebooks/tutorial.ipynb). It is recommended to set up a virtual environment (see the Installation section below), install the `biokb_ipni` package, and run the [Jupyter](https://jupyter.org/) notebook cell by cell in that environment.

## Abstract

The [biokb_ipni](https://pypi.org/project/biokb-ipni/) library is part of the [biokb family](https://pypi.org/search/?q=biokb), which aims to integrate ontologies, terminologies, and knowledge from multiple domains (such as biochemistry, pharmacology, taxonomy, and ethnobotany) into a unified Knowledge Graph. Each library focuses on a different aspect (`biokb_ipni`, for example, covers plant names), but all of them use the same workflow to import data into the database and the knowledge graph. Because the biokb family uses [SQLAlchemy](https://www.sqlalchemy.org/) as its database layer, many relational database systems ([SQLite](https://sqlite.org/), [MySQL](https://www.mysql.com/), [MariaDB](https://mariadb.org/), [PostgreSQL](https://www.postgresql.org/), ...) can be used. `biokb_ipni` uses Neo4j as the backend for the knowledge graph, but any triple store with a SPARQL server, such as [Fuseki](https://jena.apache.org/documentation/fuseki2/), can load the Resource Description Framework (RDF) files created with the library. An essential criterion for the high connectivity of the various knowledge graphs (like IPNI) is the library-wide use of the same [Uniform Resource Identifiers](https://en.wikipedia.org/wiki/Uniform_Resource_Identifier) (URIs). Each major version of the biokb family guarantees that the same URIs are used across all libraries.

This tutorial describes the use of biokb_ipni to generate a Knowledge Graph from primary data provided by the [International Plant Names Index (IPNI)](https://www.ipni.org/). It also outlines the individual steps required to create either a relational database or RDF Turtle files.

## Overview

The workflow is simple:

[**data**]-*import_data()*->[**relational_database**]-*create_ttls()*->[**rdf_files**]-*import_ttls()*->[**knowledge_graph**] 

The main functions are:
1. `import_data()`: **Import data** into the database
2. `create_ttls()`: **Create RDF files** from the database
3. `import_ttls()`: **Import RDF files into a Knowledge Graph**

You can use the library in one of three ways:
1. Command line interface (CLI)
2. Python API
3. Podman/Docker containers

If you want to use all features of the library it is recommended to use the Podman/Docker containers, since all dependencies are already installed and configured. If you only want to use parts of the library (like only the API or only the RDF generator) you can also install the library in a [virtual environment](https://docs.python.org/3/tutorial/venv.html) and use it via CLI or Python API.

## Installation

In general, it is recommended to install Python libraries in a [virtual environment](https://docs.python.org/3/tutorial/venv.html) to avoid conflicts between libraries. To set one up, create a new directory, navigate to it, create the virtual environment, and activate it. On Linux or macOS run:

```bash
mkdir biokb_ipni_test
cd biokb_ipni_test
python3 -m venv .venv
source .venv/bin/activate
```
If you are using Windows please check [this tutorial](https://docs.python.org/3/tutorial/venv.html#windows-virtual-environments).

Once the virtual environment is activated you can install the biokb_ipni library with pip:

```bash
pip install biokb_ipni
```

**Tip**: [uv](https://docs.astral.sh/uv/) is a great and very fast alternative tool to create and manage virtual environments.

## Use Cases

Depending on your needs, follow the links:

1. [Relational database only](#relational-database-only)
2. [RDF turtle files only](#rdf-turtle-file-only)
3. [Knowledge Graph](#knowledge-graph-in-neo4j)

Each section gives step-by-step instructions for reaching that goal via the CLI, the RESTful API, the Python API, or Podman/Docker containers.



### Relational database only


**Tip**: SQLite databases can be opened with [DB Browser for SQLite](https://sqlitebrowser.org/), which provides a user-friendly interface for exploring the database structure and content, or with the [SQLite Viewer extension](https://marketplace.visualstudio.com/items?itemName=alexcvzz.vscode-sqlite) for [VS Code](https://code.visualstudio.com/).

#### Python

By default, `import_data()` creates (or updates) a SQLite database `biokb.db` in the `~/.biokb/` subfolder of your home directory.

In [None]:
from biokb_ipni import import_data

import_data()

If you want to change the logging level you can do it as follows:

```python
import logging
from biokb_ipni import import_data
logging.getLogger('biokb_ipni').setLevel(logging.WARNING)
import_data()
```

Example output of `import_data()` (INFO-level log messages):

```
2026-01-28 14:18:54,188 - biokb_ipni.db.manager - INFO - Using database engine: Engine(sqlite:////home/ceb/.biokb/biokb.db)
2026-01-28 14:18:54,231 - biokb_ipni.db.manager - INFO - Database recreated.
2026-01-28 14:18:54,231 - biokb_ipni.db.manager - INFO - Loading NCBI Taxonomy data for mapping families and names
2026-01-28 14:19:06,912 - biokb_ipni.db.manager - INFO - Importing references
2026-01-28 14:19:17,348 - biokb_ipni.db.manager - INFO - Importing families
2026-01-28 14:19:19,166 - biokb_ipni.db.manager - INFO - Importing names
2026-01-28 14:20:09,802 - biokb_ipni.db.manager - INFO - Importing type materials
2026-01-28 14:20:15,326 - biokb_ipni.db.manager - INFO - Importing name relations
```
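
Once the import has finished, a quick sanity check is to list the tables and their row counts. The following is a minimal sketch using only Python's built-in `sqlite3` module. It assumes the default database location `~/.biokb/biokb.db`; the helper name `table_row_counts` is hypothetical and not part of `biokb_ipni`.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def table_row_counts(db_path):
    """Return {table_name: row_count} for every user table in a SQLite file."""
    with closing(sqlite3.connect(db_path)) as conn:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        # Table names are read from sqlite_master, so they are only quoted here.
        return {
            name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in tables
        }


# Assumed default location; nothing is printed if the file does not exist yet.
default_db = Path.home() / ".biokb" / "biokb.db"
if default_db.exists():
    for table, count in table_row_counts(default_db).items():
        print(f"{table}: {count}")
```

The existence check matters because `sqlite3.connect` would otherwise create an empty database file at that path.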

In [4]:
print(import_data.__doc__)

Import all data in database.

Args:
    engine (Optional[Engine]): SQLAlchemy engine. Defaults to None.
    force_download (bool, optional): If True, will force download the data, even if
        files already exist. If False, it will skip the downloading part if files
        already exist locally. Defaults to False.
    delete_files (bool, optional): If True, downloaded files are deleted after import.
        Defaults to False.

Returns:
    Dict[str, int]: table=key and number of inserted=value
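
According to the docstring, `import_data` returns a dictionary that maps table names to the number of inserted rows. The sketch below shows one way to turn such a dictionary into a short report. The helper `format_import_summary` is hypothetical, and the table names and numbers in the example are made up for illustration.

```python
def format_import_summary(counts):
    """Format a {table: inserted_rows} mapping as aligned text lines."""
    if not counts:
        return "No rows were imported."
    width = max(len("total"), *(len(table) for table in counts))
    lines = [
        f"{table.ljust(width)}  {inserted:>10,}"
        for table, inserted in sorted(counts.items())
    ]
    lines.append(f"{'total'.ljust(width)}  {sum(counts.values()):>10,}")
    return "\n".join(lines)


example = {"names": 1200, "families": 30}  # made-up numbers, not real IPNI counts
print(format_import_summary(example))

# With the real import (downloads data, so it is left commented out here):
# from biokb_ipni import import_data
# print(format_import_summary(import_data(force_download=False, delete_files=True)))
```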



#### Command line interface

```bash
biokb_ipni -v import-data
```

`-v` provides you with information about the processes currently running. If you do not wish to see this information, you can simply omit `-v`.

##### Options with the CLI

```bash
  -f, --force-download          Force re-download of the source file [default: False]
  -d, --delete-files            Delete downloaded source files after import [default: False]
  -c, --connection-string TEXT  SQLAlchemy engine URL [default:sqlite:////~/.biokb/biokb.db]
  --help                        Show this message and exit.
```

Here is an example that creates a database with a custom name, forces a re-download of the source file, and deletes the downloaded source files after import:

```bash
biokb_ipni -v import-data -f -d -c sqlite:///my_own_name.db
```

#### How to write a connection string

The previous section (and the next one) uses a connection string to define the type and name of the database. Here are some example connection strings for different database management systems:

- SQLite: `sqlite:///ipni.db` (creates a file named `ipni.db` in the current directory)
- MySQL: `mysql+pymysql://username:password@localhost/ipni_db`
- PostgreSQL: `postgresql+psycopg2://username:password@localhost/ipni_db`

If you are using a different host or port, adjust the connection string accordingly. For more details on connection strings, refer to the [SQLAlchemy documentation](https://docs.sqlalchemy.org/en/20/core/engines.html#database-urls). SQLite and MySQL are supported out of the box. For other databases you need to install the corresponding driver (such as the `psycopg2` package for PostgreSQL) in your environment.
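
User names and passwords that contain characters such as `@`, `/` or `:` have to be percent-encoded before they are placed in a connection string, otherwise the URL is parsed incorrectly. A minimal sketch using only the standard library follows. The helper `build_connection_string` and the credentials are hypothetical.

```python
from urllib.parse import quote_plus


def build_connection_string(driver, user, password, host, database, port=None):
    """Assemble driver://user:password@host[:port]/database with escaped credentials."""
    netloc = f"{quote_plus(user)}:{quote_plus(password)}@{host}"
    if port is not None:
        netloc += f":{port}"
    return f"{driver}://{netloc}/{database}"


print(
    build_connection_string(
        "mysql+pymysql", "username", "p@ss/word", "localhost", "ipni_db", port=3306
    )
)
# → mysql+pymysql://username:p%40ss%2Fword@localhost:3306/ipni_db
```

The resulting string can be passed to the CLI with `-c` or to SQLAlchemy's `create_engine`.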

#### Python with other database management systems

To use a database other than the default (SQLite at `~/.biokb/biokb.db`) from Python, first create an engine with SQLAlchemy and then pass it to the `import_data` function. Here is an example with MySQL, assuming you have already created a database named `ipni` on your MySQL server (*host: localhost, port: 3306, database: ipni, user: username, password: password*):

```python
from sqlalchemy import create_engine
from biokb_ipni import import_data

engine = create_engine("mysql+pymysql://username:password@localhost:3306/ipni")
import_data(engine)
```

## RESTful API only

If you only want to use the RESTful API to access the data you can start it as follows:


```bash
biokb_ipni run-server
```

If you get an error message like:

```
biokb_ipni run-server
API server running at http://127.0.0.1:8000/docs#/
ERROR:    [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use
```

Use another port with the `--port` option:

```bash
biokb_ipni run-server --port 8080
```

Then open the address with the chosen port (here 8080), http://127.0.0.1:8080/docs#/, in your browser to access the API documentation.
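
To avoid the "address already in use" error, you can check which ports are free before choosing a value for `--port`. The sketch below uses only the standard library. The helper names `is_port_free` and `first_free_port` are hypothetical, and the default host `0.0.0.0` is taken from the error message above.

```python
import socket


def is_port_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(candidates, host="0.0.0.0"):
    """Return the first free port from `candidates`, or None if all are taken."""
    return next((port for port in candidates if is_port_free(port, host)), None)


# Look for a usable port in a small range starting at the default 8000.
print(first_free_port(range(8000, 8010)))
```

The check is only a snapshot: another process can still take the port between the check and the server start.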

By default, the username `admin` and the password `admin` give access to the protected methods of the API (marked with a lock symbol). You can change them with the options `--user` and `--password`:

```bash
biokb_ipni run-server --user my_user --password my_password
```

If the data has already been imported via `biokb_ipni import-data`, the API uses the existing database located at `~/.biokb/biokb.db`. Otherwise, call the `/import_data/` endpoint to import the data (this requires the user name and password). The `/export_data/` endpoint creates the TTL files (if they do not exist yet) and exports them. To import the data into a Neo4j Knowledge Graph, use `/import_neo4j/`.
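
The endpoints can also be called from a script instead of the browser. The sketch below only prepares a request and rests on two assumptions that the text above does not confirm: that the protected endpoints use HTTP Basic authentication, and that `/import_data/` accepts `GET`. Check the interactive documentation at `/docs` for the actual method and security scheme. The helper `build_api_request` is hypothetical.

```python
import base64
import urllib.request


def build_api_request(base_url, endpoint, user="admin", password="admin", method="GET"):
    """Create a urllib Request carrying an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + endpoint, method=method)
    # Assumption: the API expects HTTP Basic credentials.
    request.add_header("Authorization", f"Basic {token}")
    return request


request = build_api_request("http://127.0.0.1:8000", "/import_data/")
print(request.get_method(), request.full_url)

# Sending the request requires a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```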

