# Beginner quickstart to BioCypher (a practical example)

[//]: # (------------------------------------------    DO NOT MODIFY THIS    ------------------------------------------)
<style type="text/css">
.tg  {border-collapse:collapse;
      border-spacing:0;
     }
.tg td{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg th{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       font-weight:normal;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg .tg-fymr{border-color:inherit;
             font-weight:bold;
             text-align:left;
             vertical-align:top
            }
.tg .tg-0pky{border-color:inherit;
             text-align:left;
             vertical-align:top
            }
[//]: # (--------------------------------------------------------------------------------------------------------------)

[//]: # (-------------------------------------    FILL THIS OUT WITH YOUR DATA    -------------------------------------)
</style>
<table class="tg">
    <tbody>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Title:</td>
        <td class="tg-0pky">Beginner quickstart to BioCypher (a practical example)</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Authors:</td>
        <td class="tg-0pky">
            <a href="https://github.com/ecarrenolozano" target="_blank" rel="noopener noreferrer">Edwin Carreño</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Affiliations:</td>
        <td class="tg-0pky">
            <a href="https://www.ssc.uni-heidelberg.de/en" target="_blank" rel="noopener noreferrer">Scientific Software Center</a>,
            <a href="https://saezlab.org/" target="_blank" rel="noopener noreferrer">Saez-Rodriguez Group</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Date Created:</td>
        <td class="tg-0pky">19.03.2025</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Description:</td>
        <td class="tg-0pky">A hands-on tutorial to explore BioCypher as quickly as possible!</td>
      </tr>
    </tbody>
</table>

[//]: # (--------------------------------------------------------------------------------------------------------------)

## Overview

In this tutorial, we’re going to dive into **BioCypher** with a practical example. We’ll start with a simple CSV file and turn it into a Knowledge Graph step by step.

By the end of this tutorial we will have covered the following:

- Explore your data using Pandas

- Graph modeling (from columns to nodes/edges)

- BioCypher introduction:

    - Setting up the configuration

    - Writing a custom adapter to map your data to the graph model

    - Importing everything into Neo4j to see a graph in action


## Setup

Before starting our tutorial, ensure you have the following to guarantee a smooth experience!

| Pre-requisite | Version   | Check on terminal |
|---------------|-----------|-------------------|
| Git           | >2.0      | `git --version`   |
| Neo4j desktop | 2025.04.0 | `neo4j --version` |
| Poetry        | >=1.8     | `poetry about`    |

### Clone repository

In [1]:
# Remove the biocypher folder if it already exists
%rm -rf biocypher
!git clone https://github.com/biocypher/biocypher.git

Cloning into 'biocypher'...
remote: Enumerating objects: 16138, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (54/54), done.
remote: Total 16138 (delta 65), reused 138 (delta 59), pack-reused 15980 (from 1)
Receiving objects: 100% (16138/16138), 22.45 MiB | 40.90 MiB/s, done.
Resolving deltas: 100% (11863/11863), done.


In [2]:
%cd biocypher

/home/ecarreno/SSC-Projects/e-TUTORIALS/biocypher


In [3]:
!poetry lock --no-update
!poetry install --no-root --quiet

Creating virtualenv biocypher in /home/ecarreno/SSC-Projects/e-TUTORIALS/biocypher/.venv
Resolving dependencies... (0.6s)

Writing lock file


### Importing Libraries

Recommendations:

- Follow the order of the import groups, indicated by the numbers *1, 2, 3*.
- Use one import per line, so that any modified line is easy to track with git.
- Prefer absolute imports (see *3. Local application/library specific imports* below); they improve readability and give better error messages.
- Put a blank line between each group of imports.

In [4]:
# 1. Standard library imports
import os

# 2. Related third party imports
import numpy as np
import pandas as pd

# 3. Local application/library specific imports
# import <mypackage>.<MyClass>         # this is an example
# from <mypackage> import <MyClass>    # this is another example

## Section 1. Exploratory Data Analysis with Pandas

The CSV file we are going to use (tab-separated in this case, hence the `.tsv` extension) contains data about protein-protein interactions. In other words, each row describes in some way how one protein interacts with another protein. For now, ignore what a protein is and how the interactions work. Focus on how the information is structured.

You should be able to answer the following:
- How many columns does the dataset have?
- How many rows?
- Are all the columns strings, numbers, or something else?

**ONLY IF** you have experience with this kind of data, you can analyze further:
- What types of interactions exist in this dataset?
- Do the interactions have properties?
- Can a protein have properties? Which ones?

In [5]:
# Define the path to the dataset (our CSV file)
path_dataset = "../subset_interactions_edgecases.tsv"

In [6]:
print("Does this file exist? {}".format(os.path.exists(path_dataset)))

Does this file exist? True


### Load data into Pandas Dataframe (without predefined data types)

By default the option *keep_default_na* is `True`, which means that Pandas interprets empty cells and common null markers (such as `NA` or `NaN`) as NaN values.

In [7]:
dataframe = pd.read_table(path_dataset, sep="\t", keep_default_na=True)
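
To see what *keep_default_na* changes in practice, here is a minimal sketch on a tiny in-memory table. The three rows and the `PMID:1` value are made up for illustration and are not part of the tutorial dataset.

```python
import io

import pandas as pd

# Tiny tab-separated table (made-up data): one empty cell and one literal "NA"
tsv_text = "source\ttarget\treferences\nP1\tP2\t\nP2\tP3\tNA\nP3\tP4\tPMID:1\n"

# Default behaviour: empty cells and markers such as "NA" become NaN
with_na = pd.read_table(io.StringIO(tsv_text), sep="\t", keep_default_na=True)

# Default markers disabled: the same cells are kept as plain strings
without_na = pd.read_table(io.StringIO(tsv_text), sep="\t", keep_default_na=False)

print(with_na["references"].isna().sum())
print(without_na["references"].tolist())
```

With the default setting, both the empty cell and the literal `NA` are counted as missing. With `keep_default_na=False`, they survive as the strings `''` and `'NA'`, which matters if `NA` is a legitimate value in your data.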

In [8]:
!pwd

/home/ecarreno/SSC-Projects/e-TUTORIALS/biocypher


In [25]:
dataframe.head(15)

|   | source | target | source_genesymbol | target_genesymbol | is_directed | is_stimulation | is_inhibition | consensus_direction | consensus_stimulation | consensus_inhibition | ... | dorothea_coexp | dorothea_level | type | curation_effort | extra_attrs | evidences | ncbi_tax_id_source | entity_type_source | ncbi_tax_id_target | entity_type_target |
|---|--------|--------|-------------------|-------------------|-------------|----------------|---------------|---------------------|-----------------------|----------------------|-----|----------------|----------------|------|-----------------|-------------|-----------|--------------------|--------------------|--------------------|--------------------|
| 0 | P1 | P2 | SPI1 | SMARCC1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | D | transcriptional | 0 | `{"DoRothEA_curated":true,"DoRothEA_chipseq":fa...` | `{"id_a":"P17947","id_b":"Q9Y241","positive":[]...` | 9606 | protein | 9606 | protein |
| 1 | P2 | P1 | SMARCC1 | SPI1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | D | transcriptional | 0 | `{"DoRothEA_curated":true,"DoRothEA_chipseq":fa...` | `{"id_a":"P17947","id_b":"Q9Y241","positive":[]...` | 9606 | protein | 9606 | protein |
| 2 | P2 | P3 | SMARCC1 | Gtf2b | 1 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | transcriptional | 0 | `{}` | `{"id_a":"Q92922","id_b":"Q8NI51","positive":[]...` | 9606 | protein | 10116 | protein |
| 3 | P3 | P4 | Gtf2b | Egr1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | D | transcriptional | 0 | `{"DoRothEA_curated":false,"DoRothEA_chipseq":t...` | `{"id_a":"P62916","id_b":"P14882","positive":[]...` | 10116 | protein | 10090 | protein |
| 4 | P4 | P5 | Egr1 | Rheb | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | D | transcriptional | 0 | `{"DoRothEA_curated":true,"DoRothEA_chipseq":fa...` | `{"id_a":"P08046","id_b":"P17563","positive":[]...` | 10090 | protein | 10090 | protein |
| 5 | P5 | P6 | Rheb | Myc | 1 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | post_translational | 1 | `{}` | `{"id_a":"Q921J2","id_b":"Q6ZWX6","positive":[]...` | 10090 | protein | 10090 | protein |
| 6 | P6 | P4 | Myc | Egr1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | D | transcriptional | 0 | `{"DoRothEA_curated":true,"DoRothEA_chipseq":fa...` | `{"id_a":"P18146","id_b":"P09466","positive":[]...` | 10090 | protein | 10090 | protein |
| 7 | P7 | P8 | CASP7 | Cebpa | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | D | transcriptional | 0 | `{"DoRothEA_curated":true,"DoRothEA_chipseq":fa...` | `{"id_a":"P18146","id_b":"P09466","positive":[]...` | 10090 | protein | 10090 | protein |
| 8 | P9 | P9 | ZNF318 | ZNF318 | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | D | transcriptional | 0 | `{"DoRothEA_curated":true,"DoRothEA_chipseq":fa...` | `{"id_a":"P18146","id_b":"P09466","positive":[]...` | 10090 | protein | 10090 | protein |
| 9 | P1 | P2 | SPI1 | SMARCC1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | False | D | transcriptional | 0 | `{"DoRothEA_curated":true,"DoRothEA_chipseq":fa...` | `{"id_a":"P17947","id_b":"Q9Y241","positive":[]...` | 9606 | protein | 9606 | protein |


In [10]:
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 36 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   source                 11 non-null     object
 1   target                 11 non-null     object
 2   source_genesymbol      11 non-null     object
 3   target_genesymbol      11 non-null     object
 4   is_directed            11 non-null     int64 
 5   is_stimulation         11 non-null     int64 
 6   is_inhibition          11 non-null     int64 
 7   consensus_direction    11 non-null     int64 
 8   consensus_stimulation  11 non-null     int64 
 9   consensus_inhibition   11 non-null     int64 
 10  sources                11 non-null     object
 11  references             1 non-null      object
 12  omnipath               11 non-null     bool  
 13  kinaseextra            11 non-null     bool  
 14  ligrecextra            11 non-null     bool  
 15  pathwayextra           11

### Dataset Metadata

We want a table with the following information for each column:

- Column name.
- Data type.
- Whether the column contains null values.
- Number of unique values.

The next cell builds that table:

### Unique values per column

In [11]:
metadata = pd.DataFrame(
    {
        "Column Name": dataframe.columns,
        "Data Type": dataframe.dtypes.values,
        "Nullable": dataframe.isnull().any().values,
        "Unique Values": [dataframe[col].nunique() for col in dataframe.columns],
    }
)

metadata

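|   | Column Name           | Data Type | Nullable | Unique Values |
|---|-----------------------|-----------|----------|---------------|
| 0 | source                | object    | False    | 8             |
| 1 | target                | object    | False    | 8             |
| 2 | source_genesymbol     | object    | False    | 8             |
| 3 | target_genesymbol     | object    | False    | 9             |
| 4 | is_directed           | int64     | False    | 1             |
| 5 | is_stimulation        | int64     | False    | 1             |
| 6 | is_inhibition         | int64     | False    | 1             |
| 7 | consensus_direction   | int64     | False    | 1             |
| 8 | consensus_stimulation | int64     | False    | 1             |
| 9 | consensus_inhibition  | int64     | False    | 1             |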


To list all the unique values in a given column, assign the column's name to the variable *field*:

In [12]:
field = "type"

print(
    "List of unique values in field: {}\n\t{}".format(
        field, sorted(dataframe[field].unique())
    )
)

List of unique values in field: type
	['post_translational', 'transcriptional']


## Section 2. Graph Modeling

### What is a graph?

A formal definition of a graph is: *"A graph is a set V of vertices and a collection E of pairs of vertices from V, called edges."*

$$
G = (V,E)
$$

- $\lvert V \rvert$, often written $n$: number of vertices (nodes).
- $\lvert E \rvert$, often written $m$: number of edges (relationships).

For our purposes, a graph is just a collection of nodes connected to each other by relationships. What is a node? What is a relationship? Instead of giving formal definitions, let's dive into these concepts with two examples:

#### Example 1. My friend Bob

```mermaid
graph LR
    Alice((Alice)) -- is a friend of --> Bob((Bob))
    Bob((Bob)) -- is a --> Person((Person))
    Bob((Bob)) -- is interested in --> IC((Integrated Circuit))
    IC((Integrated Circuit)) -- was invented by --> JK((Jack Kilby))
    IC((Integrated Circuit)) -- was invented by --> RN((Robert Noyce))
    JK((Jack Kilby)) -- worked for --> TI((Texas Instruments))
    JK((Jack Kilby)) -- is a --> Person((Person))
    RN((Robert Noyce)) -- worked for --> FS((Fairchild Semiconductors))
    RN((Robert Noyce))-- is a --> Person((Person))
    TI((Texas Instruments)) -- is a --> Company((Company))
    FS((Fairchild Semiconductors)) -- is a --> Company((Company))
    
```

##### List of Nodes

They represent abstract things, concepts, and usually can be identified by **nouns**.

- Alice
- Bob
- Person
- Integrated circuit
- Jack Kilby
- Robert Noyce
- Texas Instruments
- Fairchild Semiconductors
- Company

##### List of Relationships (Edges)

They represent actions, interactions, and usually can be identified by **verbs**.

- *is a friend of*
- *is a*
- *is interested in*
- *was invented by*
- *worked for*
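
To connect this example back to the definition $G = (V, E)$, here is a minimal sketch in plain Python. The edge triples are copied from the diagram above. Storing them as `(source, relationship, target)` tuples is just one convenient choice for illustration, not a BioCypher requirement.

```python
# Each edge is a (source, relationship, target) triple taken from the diagram
edges = [
    ("Alice", "is a friend of", "Bob"),
    ("Bob", "is a", "Person"),
    ("Bob", "is interested in", "Integrated Circuit"),
    ("Integrated Circuit", "was invented by", "Jack Kilby"),
    ("Integrated Circuit", "was invented by", "Robert Noyce"),
    ("Jack Kilby", "worked for", "Texas Instruments"),
    ("Jack Kilby", "is a", "Person"),
    ("Robert Noyce", "worked for", "Fairchild Semiconductors"),
    ("Robert Noyce", "is a", "Person"),
    ("Texas Instruments", "is a", "Company"),
    ("Fairchild Semiconductors", "is a", "Company"),
]

# V is every name that appears as the source or the target of some edge
vertices = {source for source, _, _ in edges} | {target for _, _, target in edges}

# The distinct relationship names correspond to the list of relationships above
relationship_types = sorted({relationship for _, relationship, _ in edges})

print("|V| =", len(vertices))
print("|E| =", len(edges))
print(relationship_types)
```

The set of vertices recovered from the edges matches the list of nodes above, and the distinct relationship names match the list of relationships.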

#### Example 2. Biological Organization Underlying Human Disease

```mermaid
graph LR
    Tissue((Tissue)) -- contains --> DNA((DNA))
    DNA((DNA)) -- encodes --> Gene((Gene))
    Gene((Gene)) -- interacts_with --> Gene((Gene))
    Variant((Variant)) -- alters --> Gene((Gene))
    Variant((Variant)) -- causes_or_contributes_to --> Disease((Disease))
    Gene((Gene)) -- causes_or_contributes_to --> Disease((Disease))
    DNA((DNA)) -- transcribed_into --> RNA((RNA))
    RNA((RNA)) -- processed_into --> miRNA((miRNA))
    miRNA((miRNA)) -- decreases --> mRNA((mRNA))
    RNA((RNA)) -- processed_into --> mRNA((mRNA))
    mRNA((mRNA)) -- translated_into --> Protein((Protein))
    Protein((Protein)) -- has_function --> MolFunc((Molecular Function))
    Protein((Protein)) -- participates_in --> BioProc((Biological Process))
    Protein((Protein)) -- located_in --> CellComp((Cellular Component))
    Pathway((Pathway)) -- has_component --> CellComp((Cellular Component))
    Pathway((Pathway)) -- realizes --> MolFunc((Molecular Function))
    Pathway((Pathway)) -- has_part --> BioProc((Biological Process))
    Protein((Protein)) -- participates_in --> Pathway((Pathway))
    Chemical((Chemical)) -- participates_in --> Pathway((Pathway))
    Chemical((Chemical)) -- is_substance_that_treats --> Disease((Disease))
    Pathway((Pathway)) -- causes_or_contributes_to --> Disease((Disease))
```

##### List of Nodes

They represent abstract things, concepts, and usually can be identified by **nouns**.

- Biological Process
- Cellular Component
- Chemical
- Disease
- DNA
- Gene
- miRNA
- Molecular Function
- mRNA
- Pathway
- Protein
- RNA
- Tissue
- Variant

##### List of Relationships (Edges)

They represent actions, interactions, and usually can be identified by **verbs**.

- *alters*
- *causes_or_contributes_to*
- *contains*
- *decreases*
- *encodes*
- *has_component*
- *has_function*
- *has_part*
- *interacts_with*
- *is_substance_that_treats*
- *located_in*
- *participates_in*
- *processed_into*
- *realizes*
- *transcribed_into*
- *translated_into*

**Reference:** inspired by [An open source knowledge graph ecosystem for the life sciences (2024)](https://www.nature.com/articles/s41597-024-03171-w)

#### Key points

1. A knowledge graph is a collection of nodes, together with the edges that define the interactions between them.
2. Nodes and edges can have properties. For example, a person has a name, an ID, a date of birth, and so on.


### Modelling

Do not worry if proteins and their interactions are outside your field of study. Just treat them as nodes and edges.

#### First iteration

Our input is the dataset, which contains information about protein-protein interactions. As we saw, nodes and edges can have associated properties. The main challenge here is to create a diagram similar to the examples above. Do not worry, we already have two clues to achieve this:

1. The phrase *"Protein-protein interactions"*.
2. Our dataset has two columns called "source" and "target".

This means we have a Protein node and another Protein node, and a relationship exists between them. That is all you need to know right now.

```mermaid
graph LR
    Source((Protein)) -- interacts_with --> Target((Protein))
```

#### Second iteration

Can we improve this graph? Yes. Take another look at the data: there is a field called "type", which contains two possible values:

- *post_translational*
- *transcriptional*

We can replace the edge label with a more descriptive one, such as *is_interaction_type*.

```mermaid
graph LR
    A((Protein))
    B((Protein))
    A -- is_interaction_type --> B
```

#### Third iteration

At this point, nodes and edges do not contain properties. Can we add some? Yes, we can. Analyze the dataset again. For this tutorial I have chosen the following properties for the nodes and edges.

**Nodes**
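- Protein node (source):
    - source_genesymbol
    - source_ncbi_tax_id (column `ncbi_tax_id_source` in the dataset)
    - source_entity_type (column `entity_type_source` in the dataset)

- Protein node (target):
    - target_genesymbol
    - target_ncbi_tax_id (column `ncbi_tax_id_target` in the dataset)
    - target_entity_type (column `entity_type_target` in the dataset)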

**Edges**
- *is_interaction_type*:
    - is_directed
    - is_stimulation
    - is_inhibition
    - sources


<div>
<img src="./protein-protein-interaction-model2.png" width="300"/>
</div>
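
The model from the third iteration can be sketched in plain Python by splitting one dataset row into two nodes and one edge. This is only an illustration under some assumptions: the dictionary layout, the helper names `protein_node` and `interaction_edge`, and the shortened property names are hypothetical. The `sources` value is a placeholder because it is not visible in the rows displayed earlier.

```python
# One dataset row as a plain dict. Values are copied from the first row displayed
# earlier, except "sources", which is a placeholder (its real value is not shown).
row = {
    "source": "P1",
    "target": "P2",
    "source_genesymbol": "SPI1",
    "target_genesymbol": "SMARCC1",
    "ncbi_tax_id_source": 9606,
    "ncbi_tax_id_target": 9606,
    "entity_type_source": "protein",
    "entity_type_target": "protein",
    "is_directed": 1,
    "is_stimulation": 0,
    "is_inhibition": 0,
    "sources": "PLACEHOLDER",
}

# Columns chosen as edge properties in the third iteration
EDGE_PROPERTIES = ["is_directed", "is_stimulation", "is_inhibition", "sources"]


def protein_node(row, side):
    """Build the Protein node for one side of the row ("source" or "target")."""
    return {
        "id": row[side],
        "label": "Protein",
        "properties": {
            "genesymbol": row[f"{side}_genesymbol"],
            "ncbi_tax_id": row[f"ncbi_tax_id_{side}"],
            "entity_type": row[f"entity_type_{side}"],
        },
    }


def interaction_edge(row):
    """Build the edge that connects the source protein to the target protein."""
    return {
        "source": row["source"],
        "target": row["target"],
        "label": "is_interaction_type",
        "properties": {name: row[name] for name in EDGE_PROPERTIES},
    }


source_node = protein_node(row, "source")
target_node = protein_node(row, "target")
edge = interaction_edge(row)
print(source_node)
print(target_node)
print(edge)
```

Each row therefore contributes up to two nodes and exactly one edge. The same protein can appear in many rows, so nodes usually need to be de-duplicated afterwards.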


#### Manual checking

By looking at the data (the source and target columns), try to draw the nodes and edges on a piece of paper. At the end, your drawing should look like this graph:

<div>
<img src="./protein-protein-interaction-data.png" width="400"/>
</div>

A graph like this one is what we want. With BioCypher we can create it and display it in Neo4j.
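
The manual check can also be sketched in a few lines of plain Python. The pairs below are copied from the `source` and `target` columns of the rows displayed earlier. The variable names are illustrative only.

```python
from collections import Counter

# (source, target) pairs copied from the rows displayed earlier
pairs = [
    ("P1", "P2"),
    ("P2", "P1"),
    ("P2", "P3"),
    ("P3", "P4"),
    ("P4", "P5"),
    ("P5", "P6"),
    ("P6", "P4"),
    ("P7", "P8"),
    ("P9", "P9"),
    ("P1", "P2"),
]

# Every protein that appears on either side of a pair becomes a node
nodes = sorted({protein for pair in pairs for protein in pair})

# Count how often each directed pair occurs (insertion order is preserved)
edge_counts = Counter(pairs)
unique_edges = list(edge_counts)

# Edge cases worth noticing before importing the data into a graph database
duplicated = [pair for pair, count in edge_counts.items() if count > 1]
self_loops = [pair for pair in unique_edges if pair[0] == pair[1]]

print("Nodes:", nodes)
print("Unique edges:", unique_edges)
print("Duplicated rows:", duplicated)
print("Self-loops:", self_loops)
```

This reveals a repeated interaction and a protein that interacts with itself, two cases that are easy to miss when drawing by hand.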


## Section 3. BioCypher Adapter

To build a graph with BioCypher we need to do the following:

1. Read the data (our CSV file) and generate nodes and edges (the adapter concept).
2. Define the schema for the graph (which types of nodes and edges, and which of their properties, we expect in our graph).
3. Configure BioCypher to write a script that imports the BioCypher graph into Neo4j.

That's all. We will explain each step in this tutorial.
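
As a preview of step 1, here is a minimal sketch of what an adapter does, written with the standard library only. It assumes that nodes are produced as `(id, label, properties)` tuples and edges as `(id, source, target, label, properties)` tuples. Please check that layout against the BioCypher documentation for your version. The function names, the `protein` label, and the in-memory table (a few columns of rows shown earlier) are illustrative assumptions.

```python
import csv
import io

# Small in-memory stand-in for the dataset: a few columns of rows shown earlier
TSV_TEXT = (
    "source\ttarget\tsource_genesymbol\ttarget_genesymbol\tis_directed\n"
    "P1\tP2\tSPI1\tSMARCC1\t1\n"
    "P2\tP1\tSMARCC1\tSPI1\t1\n"
    "P5\tP6\tRheb\tMyc\t1\n"
)


def read_rows(text):
    """Parse tab-separated text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def get_nodes(rows):
    """Yield (node_id, node_label, properties) once per distinct protein."""
    seen = set()
    for row in rows:
        for side in ("source", "target"):
            node_id = row[side]
            if node_id not in seen:
                seen.add(node_id)
                yield (node_id, "protein", {"genesymbol": row[f"{side}_genesymbol"]})


def get_edges(rows):
    """Yield (edge_id, source_id, target_id, edge_label, properties) per row."""
    for row in rows:
        properties = {"is_directed": row["is_directed"] == "1"}
        # No explicit edge identifier in this sketch, hence None
        yield (None, row["source"], row["target"], "is_interaction_type", properties)


rows = read_rows(TSV_TEXT)
nodes = list(get_nodes(rows))
edges = list(get_edges(rows))
print(nodes)
print(edges)
```

Generators keep memory use low because rows are converted one at a time, which becomes useful once the input file is much larger than this example.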