# Exploratory Data Analysis: Networks Dataframe (from Omnipath database)

[//]: # (------------------------------------------    DO NOT MODIFY THIS    ------------------------------------------)
<style type="text/css">
.tg  {border-collapse:collapse;
      border-spacing:0;
     }
.tg td{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg th{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       font-weight:normal;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg .tg-fymr{border-color:inherit;
             font-weight:bold;
             text-align:left;
             vertical-align:top
            }
.tg .tg-0pky{border-color:inherit;
             text-align:left;
             vertical-align:top
            }
[//]: # (--------------------------------------------------------------------------------------------------------------)

[//]: # (-------------------------------------    FILL THIS OUT WITH YOUR DATA    -------------------------------------)
</style>
<table class="tg">
    <tbody>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Title:</td>
        <td class="tg-0pky">Exploratory Data Analysis: Networks Dataframe (from Omnipath database)</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Authors:</td>
        <td class="tg-0pky">
            <a href="https://github.com/ecarrenolozano" target="_blank" rel="noopener noreferrer">Edwin Carreño</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Affiliations:</td>
        <td class="tg-0pky">
            <a href="https://www.ssc.uni-heidelberg.de/en" target="_blank" rel="noopener noreferrer">Scientific Software Center</a>,
            <a href="https://saezlab.org/" target="_blank" rel="noopener noreferrer">Saez-Rodriguez Group</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Date Created:</td>
        <td class="tg-0pky">19.03.2025</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Description:</td>
        <td class="tg-0pky">Extraction of metadata for building database tables </td>
      </tr>
    </tbody>
</table>

[//]: # (--------------------------------------------------------------------------------------------------------------)

## Overview

This notebook explores the information contained in the "Networks" dataset from the Omnipath database.

## Setup (if required)

If your code requires dependencies to be installed before the main code runs, add the installation commands here.

### Pandas installation

In [1]:
%pip install pandas -q

Note: you may need to restart the kernel to use updated packages.


## Importing Libraries

Recommendations:

- Respect the order of the imports, indicated by the numbers *1, 2, 3*.
- Use one import per line; this makes modified lines easy to track in git.
- Prefer absolute imports (see *3. Local application/library specific imports* below); they improve readability and give better error messages.
- Put a blank line between each group of imports.

In [2]:
# 1. Standard library imports
import os

# 2. Related third party imports
import numpy as np
import pandas as pd

# 3. Local application/library specific imports
# import <mypackage>.<MyClass>         # this is an example
# from <mypackage> import <MyClass>    # this is another example

## Introduction

TO DO


## Section 1. Load "Networks" dataset

### Section 1.1. Setting dataset path

In [5]:
dataset_path_networks = os.path.join(
    "..",
    "data",
    "omnipath_networks",
    "omnipath_webservice_interactions__latest.tsv.gz",
)


In [6]:
print(f"Does this file exist? {os.path.exists(dataset_path_networks)}")

Does this file exist? True


### Section 1.2. Load dataset as Pandas DataFrame

#### Configuring Pandas view

In [7]:
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_columns", None)

#### Load data into Pandas Dataframe (without predefined data types)

By default, the option *keep_default_na* is True, which means that Pandas interprets empty fields and common null markers (such as `NA` or `NaN`) as NaN values.

In [8]:
networks_df = pd.read_table(dataset_path_networks, sep="\t", keep_default_na=True)
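
As a small illustration of what *keep_default_na* changes, the sketch below reads a made-up TSV snippet twice. The column names mirror the dataset, but the rows (`A`, `B`, `C`, `ref_1`) are toy values assumed for this example, not Omnipath records.

```python
import io

import pandas as pd

# Toy TSV: one "NA" marker, one empty field, one real value.
tsv_text = "source\treferences\nA\tNA\nB\t\nC\tref_1\n"

# Default behaviour: "NA" and the empty field both become NaN.
with_default_na = pd.read_table(io.StringIO(tsv_text), keep_default_na=True)

# With keep_default_na=False (and no na_values), no strings are parsed as NaN.
without_default_na = pd.read_table(io.StringIO(tsv_text), keep_default_na=False)

print(with_default_na["references"].isnull().sum())
print(without_default_na["references"].isnull().sum())
```

With the default setting, both the `NA` marker and the empty field count as null; with `keep_default_na=False` they stay as plain strings.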

In [10]:
networks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 36 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   source                 11 non-null     object
 1   target                 11 non-null     object
 2   source_genesymbol      11 non-null     object
 3   target_genesymbol      11 non-null     object
 4   is_directed            11 non-null     int64 
 5   is_stimulation         11 non-null     int64 
 6   is_inhibition          11 non-null     int64 
 7   consensus_direction    11 non-null     int64 
 8   consensus_stimulation  11 non-null     int64 
 9   consensus_inhibition   11 non-null     int64 
 10  sources                11 non-null     object
 11  references             1 non-null      object
 12  omnipath               11 non-null     bool  
 13  kinaseextra            11 non-null     bool  
 14  ligrecextra            11 non-null     bool  
 15  pathwayextra           11

## Section 2. Metadata

We are interested in a table with the following information for each column:


- Column name.
- Data type.
- Whether the column contains null values.
- Number of unique values.

The cell in Section 2.2 below builds this table:

### Section 2.1. Overview

### Section 2.2. Unique values per column

In [11]:
metadata = pd.DataFrame(
    {
        "Column Name": networks_df.columns,
        "Data Type": networks_df.dtypes.values,
        "Nullable": networks_df.isnull().any().values,
        "Unique Values": [networks_df[col].nunique() for col in networks_df.columns],
    }
)

metadata

|   | Column Name | Data Type | Nullable | Unique Values |
|---|---|---|---|---|
| 0 | source | object | False | 8 |
| 1 | target | object | False | 8 |
| 2 | source_genesymbol | object | False | 8 |
| 3 | target_genesymbol | object | False | 9 |
| 4 | is_directed | int64 | False | 1 |
| 5 | is_stimulation | int64 | False | 1 |
| 6 | is_inhibition | int64 | False | 1 |
| 7 | consensus_direction | int64 | False | 1 |
| 8 | consensus_stimulation | int64 | False | 1 |
| 9 | consensus_inhibition | int64 | False | 1 |


To list all the unique values in a given column, assign the column's name to the variable *field*:

In [12]:
field = "type"

print(
    "List of unique values in field: {}\n\t{}".format(
        field, sorted(networks_df[field].unique())
    )
)

List of unique values in field: type
	['post_translational', 'transcriptional']
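
One caveat: `sorted(...)` can raise a `TypeError` when a column mixes strings with missing values, because Python cannot order them against each other. A minimal sketch of a more forgiving variant, assuming it is acceptable to leave nulls out of the listing, is shown below. The helper name `sorted_unique` and the toy DataFrame are hypothetical.

```python
import pandas as pd


def sorted_unique(series: pd.Series) -> list:
    """Return the sorted unique values of a column, ignoring nulls (hypothetical helper)."""
    return sorted(series.dropna().unique())


# Toy column with one missing entry.
toy_df = pd.DataFrame(
    {"type": ["transcriptional", None, "post_translational", "transcriptional"]}
)

print(sorted_unique(toy_df["type"]))
```

Whether nulls are present can still be checked separately with `isnull().any()`, as in the metadata table.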


## Section 3. Free Exploratory Analysis

In this section you can explore the data freely: filter rows, select columns, count values, and so on. Feel free to explore as much as you want.

### Listing Null values

In [13]:
networks_df.references[networks_df["references"].isnull()]

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
Name: references, dtype: object

### Counting Null values

In [14]:
num_nulls_in_references = networks_df["references"].isnull().sum()
num_nulls_in_references

np.int64(10)

### Counting True values

In [15]:
num_True_in_dorothea_curated = (networks_df.dorothea_curated == True).sum()
num_True_in_dorothea_curated

np.int64(8)

### Counting "True" values

In [16]:
num_true_in_dorothea_curated = (networks_df.dorothea_curated == "True").sum()
num_true_in_dorothea_curated

np.int64(0)

### Counting "1" values

In [17]:
num_one_in_dorothea_curated = (networks_df.dorothea_curated == "1").sum()
num_one_in_dorothea_curated

np.int64(0)

### Counting False values

In [18]:
num_False_in_dorothea_curated = (networks_df.dorothea_curated == False).sum()
num_False_in_dorothea_curated

np.int64(1)

### Counting "False" values

In [None]:
num_false_in_dorothea_curated = (networks_df.dorothea_curated == "False").sum()
num_false_in_dorothea_curated
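
The zero counts for `"True"` and `"1"` above are what one would expect when the column holds real booleans: comparing a boolean with a string never matches. The sketch below reproduces this on a toy Series (the name `toy_flag` and its values are made up) and shows `value_counts(dropna=False)` as a way to see every distinct value in one call.

```python
import pandas as pd

# Toy boolean column, standing in for a flag such as dorothea_curated.
flags = pd.Series([True, True, False], name="toy_flag")

count_bool_true = (flags == True).sum()   # matches real booleans
count_str_true = (flags == "True").sum()  # a string never equals a boolean
count_str_one = (flags == "1").sum()      # same for the string "1"

print(count_bool_true, count_str_true, count_str_one)

# One call that lists every distinct value, including nulls if present.
print(flags.value_counts(dropna=False))
```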

### Filtering

In [None]:
filtered = networks_df[
    (networks_df["source"] == "Q16254") & (networks_df["target"] == "O43683")
]
# filtered[["source", "target", "is_stimulation", "omnipath"]]

filtered

In [None]:
omnipath_df = networks_df[networks_df["omnipath"] == True]
omnipath_df.info()

In [None]:
# Self-interactions: rows where source and target are the same entity
omnipath_df[omnipath_df["source"] == omnipath_df["target"]]

## Section 4. Load the dataset with predefined data types

In [None]:
# Data types for interactions
dtype = {
    "source": "string",
    "target": "string",
    "source_genesymbol": "string",
    "target_genesymbol": "string",
    "is_directed": "boolean",
    "is_stimulation": "boolean",
    "is_inhibition": "boolean",
    "consensus_direction": "boolean",
    "consensus_stimulation": "boolean",
    "consensus_inhibition": "boolean",
    "sources": "string",
    "references": "string",
    "omnipath": "boolean",
    "kinaseextra": "boolean",
    "ligrecextra": "boolean",
    "pathwayextra": "boolean",
    "mirnatarget": "boolean",
    "dorothea": "boolean",
    "collectri": "boolean",
    "tf_target": "boolean",
    "lncrna_mrna": "boolean",
    "tf_mirna": "boolean",
    "small_molecule": "boolean",
    "dorothea_curated": "boolean",
    "dorothea_chipseq": "boolean",
    "dorothea_tfbs": "boolean",
    "dorothea_coexp": "boolean",
    "dorothea_level": "string",
    "type": "string",
    "curation_effort": "Int64",
    "extra_attrs": "string",
    "evidences": "string",
    "ncbi_tax_id_source": "Int64",
    "entity_type_source": "string",
    "ncbi_tax_id_target": "Int64",
    "entity_type_target": "string",
}

In [37]:
networks_df = pd.read_table(dataset_path_networks, dtype=dtype)
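
To see what these dtype strings do on a small input, the sketch below reads a made-up three-column TSV using the same `"string"`, `"boolean"` and `"Int64"` types. The rows are assumptions for illustration, not data from Omnipath.

```python
import io

import pandas as pd

# Toy TSV: 0/1 flags and one missing integer.
tsv_text = "source\tis_directed\tcuration_effort\nA\t1\t3\nB\t0\t\n"

toy_dtype = {
    "source": "string",
    "is_directed": "boolean",    # nullable boolean, parsed from 1/0
    "curation_effort": "Int64",  # nullable integer, keeps missing values as <NA>
}

toy_df = pd.read_table(io.StringIO(tsv_text), dtype=toy_dtype)

print(toy_df.dtypes)
print(toy_df)
```

The nullable `"Int64"` type is what allows a column with missing entries to stay an integer column instead of being promoted to float.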

In [38]:
networks_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1217900 entries, 0 to 1217899
Data columns (total 36 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   source                 1217900 non-null  string 
 1   target                 1217900 non-null  string 
 2   source_genesymbol      1217900 non-null  string 
 3   target_genesymbol      1217900 non-null  string 
 4   is_directed            1217900 non-null  boolean
 5   is_stimulation         1217900 non-null  boolean
 6   is_inhibition          1217900 non-null  boolean
 7   consensus_direction    1217900 non-null  boolean
 8   consensus_stimulation  1217900 non-null  boolean
 9   consensus_inhibition   1217900 non-null  boolean
 10  sources                1217900 non-null  string 
 11  references             413020 non-null   string 
 12  omnipath               1217900 non-null  boolean
 13  kinaseextra            1217900 non-null  boolean
 14  ligrecextra       

Note that specifying the data types reduced the in-memory size of the dataset by **21.83%**.
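
The notebook does not show how the percentage above was computed. One way to measure such a reduction, as a sketch under the assumption that `DataFrame.memory_usage(deep=True)` is the yardstick, is to compare the totals before and after conversion. The helper `memory_reduction_percent` and the toy columns `flag_a` and `flag_b` are hypothetical.

```python
import pandas as pd


def memory_reduction_percent(before: pd.DataFrame, after: pd.DataFrame) -> float:
    """Percentage of memory saved going from `before` to `after` (hypothetical helper)."""
    bytes_before = before.memory_usage(deep=True).sum()
    bytes_after = after.memory_usage(deep=True).sum()
    return float(100 * (bytes_before - bytes_after) / bytes_before)


# Toy data: 0/1 flags stored as int64, as in the untyped load.
toy_before = pd.DataFrame({"flag_a": [1, 0] * 500, "flag_b": [0, 1] * 500})

# Same data with the nullable boolean dtype.
toy_after = toy_before.astype("boolean")

print(f"Reduced by {memory_reduction_percent(toy_before, toy_after):.2f}%")
```

On the real data, the same helper could be applied to the DataFrame loaded in Section 1.2 and the one loaded with the `dtype` mapping.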