# Exploratory Data Analysis: Complexes Dataframe (from Omnipath database)

[//]: # (------------------------------------------    DO NOT MODIFY THIS    ------------------------------------------)
<style type="text/css">
.tg  {border-collapse:collapse;
      border-spacing:0;
     }
.tg td{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg th{border-color:black;
       border-style:solid;
       border-width:1px;
       font-family:Arial, sans-serif;
       font-size:14px;
       font-weight:normal;
       overflow:hidden;
       padding:10px 5px;
       word-break:normal;
      }
.tg .tg-fymr{border-color:inherit;
             font-weight:bold;
             text-align:left;
             vertical-align:top
            }
.tg .tg-0pky{border-color:inherit;
             text-align:left;
             vertical-align:top
            }
[//]: # (--------------------------------------------------------------------------------------------------------------)

[//]: # (-------------------------------------    FILL THIS OUT WITH YOUR DATA    -------------------------------------)
</style>
<table class="tg">
    <tbody>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Title:</td>
        <td class="tg-0pky">Exploratory Data Analysis: Complexes Dataframe (from Omnipath database)</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Authors:</td>
        <td class="tg-0pky">
            <a href="https://github.com/ecarrenolozano" target="_blank" rel="noopener noreferrer">Edwin Carreño</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Affiliations:</td>
        <td class="tg-0pky">
            <a href="https://www.ssc.uni-heidelberg.de/en" target="_blank" rel="noopener noreferrer">Scientific Software Center</a>,
            <a href="https://saezlab.org/" target="_blank" rel="noopener noreferrer">Saez-Rodriguez Group</a>
        </td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Date Created:</td>
        <td class="tg-0pky">22.04.2025</td>
      </tr>
      <tr>
        <td class="tg-fymr" style="font-weight: bold">Description:</td>
        <td class="tg-0pky">Extraction of metadata for building database tables </td>
      </tr>
    </tbody>
</table>

[//]: # (--------------------------------------------------------------------------------------------------------------)

## Overview

This notebook helps to understand the information contained in the "Complexes" dataset from the Omnipath database.

## Setup (if required)

If your code requires dependencies to be installed before it runs, add the installation commands here.

### Pandas installation

In [2]:
%pip install pandas -q

Note: you may need to restart the kernel to use updated packages.


## Importing Libraries

Recommendations:

- Respect the order of the imports, indicated by the numbers *1, 2, 3*.
- Put one import per line, so that any modified line is easy to track with git.
- Prefer absolute imports (see *3. Local application/library specific imports* below); they improve readability and give better error messages.
- Put a blank line between each group of imports.

In [1]:
# 1. Standard library imports
import os

# 2. Related third party imports
import numpy as np
import pandas as pd

# 3. Local application/library specific imports
# import <mypackage>.<MyClass>         # this is an example
# from <mypackage> import <MyClass>    # this is another example 

## Introduction

TO DO


## Section 1. Load "Complexes" dataset

### Section 1.1. Setting dataset path

In [6]:
dataset_path_complexes = os.path.join("../data/omnipath_complexes/omnipath_webservice_complexes__latest.tsv.gz")

In [7]:
print("Does this file exist? {}".format(os.path.exists(dataset_path_complexes)))

Does this file exist? True


### Section 1.2. Load dataset as Pandas DataFrame

#### Configuring Pandas view

In [8]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)  

#### Load data into Pandas DataFrame (without predefined data types)

By default the option *keep_default_na* is True, which means that Pandas interprets empty or null values as NaN.

In [None]:
complexes_df = pd.read_table(dataset_path_complexes, sep="\t", keep_default_na=True)
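
The effect of *keep_default_na* can be checked on a tiny in-memory file. The following is a minimal sketch, not part of the original analysis: the two-row TSV in `sample_tsv` is a hypothetical sample whose second row has an empty `name` field.

```python
import io

import pandas as pd

# Hypothetical two-row TSV; the second row has an empty "name" field.
sample_tsv = "name\tcomponents\nNFY\tP23511_P25208_Q13952\n\tP04271_Q15109\n"

# Default behaviour: empty fields are parsed as NaN.
with_na = pd.read_table(io.StringIO(sample_tsv), keep_default_na=True)

# With keep_default_na=False (and no custom na_values), empty fields stay
# as empty strings.
without_na = pd.read_table(io.StringIO(sample_tsv), keep_default_na=False)

print(with_na["name"].isnull().sum())
print(without_na["name"].isnull().sum())
```

With the default setting the empty field counts as a null value; with `keep_default_na=False` it is kept as an empty string and `isnull()` no longer flags it.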

In [None]:
complexes_df.head(20)

Unnamed: 0,name,components,components_genesymbols,stoichiometry,sources,references,identifiers
0,NFY,P23511_P25208_Q13952,NFYA_NFYB_NFYC,1:1:1,SPIKE;Compleat;CORUM;hu.MAP2;hu.MAP;ComplexPortal;PDB;SIGNOR,9372932;14755292;15243141,SIGNOR:SIGNOR-C1;CORUM:4478;Compleat:HC1449;intact:EBI-6672597;PDB:6qms;PDB:4awl;PDB:6qmq;PDB:6qmp
1,mTORC2,P42345_P68104_P85299_Q6R327_Q8TB45_Q9BVC4,DEPTOR_EEF1A1_MLST8_MTOR_PRR5_RICTOR,0:0:0:0:0:0,SIGNOR,,SIGNOR:SIGNOR-C2
2,mTORC1,P42345_Q8N122_Q8TB45_Q96B36_Q9BVC4,AKT1S1_DEPTOR_MLST8_MTOR_RPTOR,0:0:0:0:0,SIGNOR,,SIGNOR:SIGNOR-C3
3,SCF-betaTRCP,P63208_Q13616_Q9Y297,BTRC_CUL1_SKP1,1:1:1,SPIKE;Compleat;CORUM;SIGNOR,9990852,SIGNOR:SIGNOR-C5;CORUM:227;Compleat:HC757
4,CBP/p300,Q09472_Q92793,CREBBP_EP300,0:0,SIGNOR,,SIGNOR:SIGNOR-C6
5,P300/PCAF,Q09472_Q92793_Q92831,CREBBP_EP300_KAT2B,0:0:0,SIGNOR,,SIGNOR:SIGNOR-C7
6,SMAD2/SMAD4,Q13485_Q15796,SMAD2_SMAD4,1:2,ComplexPortal;PDB;SIGNOR,12923550;4065410;8061611;16322555;14755292,SIGNOR:SIGNOR-C8;intact:EBI-9692430;reactome:R-HSA-206827;PDB:1u7v
7,SMAD3/SMAD4,P84022_Q13485,SMAD3_SMAD4,2:1,ComplexPortal;PDB;SIGNOR,12923550;4065410;8061611;16322555;14755292,SIGNOR:SIGNOR-C9;intact:EBI-9826300;PDB:1u7f;PDB:1U7F
8,SMAD4/JUN,P05412_Q13485,JUN_SMAD4,0:0,SIGNOR,,SIGNOR:SIGNOR-C10
9,SMAD2/SMURF2,Q15796_Q9HAU4,SMAD2_SMURF2,1:1,Compleat;SIGNOR,11389444,SIGNOR:SIGNOR-C11;Compleat:HC501
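
The preview suggests that several columns pack multiple values into one string: *components* appears to be joined with underscores and *sources* with semicolons. These delimiters are inferred from the rows shown here and are an assumption, not a documented format. A minimal sketch of splitting them, using three rows copied from the preview:

```python
import pandas as pd

# Three rows copied from the preview; the column subset is chosen for brevity.
sample_df = pd.DataFrame({
    "name": ["NFY", "SMAD2/SMAD4", "SMAD3/SMAD4"],
    "components": ["P23511_P25208_Q13952", "Q13485_Q15796", "P84022_Q13485"],
    "sources": [
        "SPIKE;Compleat;CORUM;hu.MAP2;hu.MAP;ComplexPortal;PDB;SIGNOR",
        "ComplexPortal;PDB;SIGNOR",
        "ComplexPortal;PDB;SIGNOR",
    ],
})

# Assumption: "_" separates component identifiers, ";" separates sources.
sample_df["component_list"] = sample_df["components"].str.split("_")
sample_df["n_components"] = sample_df["component_list"].str.len()
sample_df["source_list"] = sample_df["sources"].str.split(";")

print(sample_df[["name", "n_components"]])
```

The same pattern could be applied to `complexes_df` once the delimiters have been confirmed against the full dataset.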


In [15]:
complexes_df[complexes_df["components"]=="P04271_Q15109"]

Unnamed: 0,name,components,components_genesymbols,stoichiometry,sources,references,identifiers
21318,,P04271_Q15109,AGER_S100B,4:1,PDB,,PDB:5d7f;PDB:4xyn


In [11]:
complexes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35460 entries, 0 to 35459
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   name                    10665 non-null  object
 1   components              35459 non-null  object
 2   components_genesymbols  35459 non-null  object
 3   stoichiometry           35459 non-null  object
 4   sources                 35460 non-null  object
 5   references              5557 non-null   object
 6   identifiers             24140 non-null  object
dtypes: object(7)
memory usage: 1.9+ MB


## Section 2. Metadata

We are interested in a table with the following information for each column:

- Column name.
- Data type.
- Whether the column contains null values.
- Number of unique values.

This can be done using the next cell:

### Section 2.1. Overview

In [13]:
metadata = pd.DataFrame({
    'Column Name': complexes_df.columns,
    'Data Type': complexes_df.dtypes.values,
    'Nullable': complexes_df.isnull().any().values,
    'Unique Values': [complexes_df[col].nunique() for col in complexes_df.columns]
})

metadata

Unnamed: 0,Column Name,Data Type,Nullable,Unique Values
0,name,object,True,6061
1,components,object,True,35459
2,components_genesymbols,object,True,35282
3,stoichiometry,object,True,594
4,sources,object,False,300
5,references,object,True,3358
6,identifiers,object,True,19800
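
If the same summary is needed for several datasets, the cell can be wrapped in a function. The sketch below is one possible way to do it: `summarize_columns` is a hypothetical helper name, the extra "Null Count" column is an addition to the table built here, and `toy_df` is made-up data for illustration.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row of metadata per column of df (hypothetical helper)."""
    return pd.DataFrame({
        "Column Name": df.columns,
        "Data Type": df.dtypes.astype(str).values,
        "Nullable": df.isnull().any().values,
        # Extra column, not in the original table: number of nulls per column.
        "Null Count": df.isnull().sum().values,
        # nunique() ignores null values by default.
        "Unique Values": df.nunique().values,
    })


# Made-up data for illustration only.
toy_df = pd.DataFrame({
    "name": ["NFY", None, "mTORC1"],
    "sources": ["SIGNOR", "PDB", "SIGNOR"],
})
toy_metadata = summarize_columns(toy_df)
print(toy_metadata)
```

Calling `summarize_columns(complexes_df)` would then produce the table for the real dataset.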


### Section 2.2. Unique values per column

To list all the unique values in a column, assign the column's name to the variable *field*:

In [None]:
field = "parent"

print("List of unique values in field: {}\n\t{}".format(field, complexes_df[field].unique()))

List of unique values in field: parent
	['transmembrane' 'transmembrane_predicted' 'peripheral' 'plasma_membrane'
 'plasma_membrane_transmembrane' 'plasma_membrane_regulator'
 'plasma_membrane_peripheral' 'secreted' 'cell_surface' 'ecm' 'ligand'
 'receptor' 'secreted_enzyme' 'secreted_peptidase' 'extracellular'
 'intracellular' 'receptor_regulator' 'secreted_receptor'
 'sparc_ecm_regulator' 'ecm_regulator' 'ligand_regulator'
 'cell_surface_ligand' 'cell_adhesion' 'matrix_adhesion' 'adhesion'
 'matrix_adhesion_regulator' 'cell_surface_enzyme'
 'cell_surface_peptidase' 'secreted_enyzme' 'extracellular_peptidase'
 'secreted_peptidase_inhibitor' 'transporter' 'ion_channel'
 'ion_channel_regulator' 'gap_junction' 'tight_junction'
 'adherens_junction' 'desmosome' 'intracellular_intercellular_related']
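
`unique()` lists the distinct values but not how often each one occurs. As a possible complement, `value_counts()` gives the frequencies, and for columns that hold several values in one string the values can be split first. The sketch below uses made-up rows and assumes a semicolon delimiter, as seen in the *sources* column.

```python
import pandas as pd

# Made-up rows; ";" as delimiter is an assumption based on the "sources" column.
toy_df = pd.DataFrame({
    "sources": ["SIGNOR", "SPIKE;Compleat;CORUM;SIGNOR", "Compleat;SIGNOR", "PDB"],
})

# Frequency of each whole string.
whole_value_counts = toy_df["sources"].value_counts()

# Frequency of each individual source: split, then one row per element.
individual_sources = toy_df["sources"].str.split(";").explode()
individual_counts = individual_sources.value_counts()

print(individual_counts)
```

Counting whole strings treats every combination of sources as its own value, whereas the exploded version counts each source once per row in which it appears.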


## Section 3. Free Exploratory Analysis

In this section you can explore the data freely: filter rows, select columns, count values, and so on.

### Selecting null values

In [16]:
complexes_df.references[complexes_df['references'].isnull()]

1        NaN
2        NaN
4        NaN
5        NaN
8        NaN
        ... 
35455    NaN
35456    NaN
35457    NaN
35458    NaN
35459    NaN
Name: references, Length: 29903, dtype: object

### Counting null values

In [12]:
num_nulls_in_references = complexes_df['references'].isnull().sum()
num_nulls_in_references

np.int64(29903)

### Counting True values

In [13]:
num_True_in_references = (intercell_df.references==True).sum()
num_True_in_references

np.int64(0)

### Counting "True" values

In [14]:
num_true_in_references = (intercell_df.references=="True").sum()
num_true_in_references

np.int64(0)

### Counting 1 values

In [15]:
num_one_in_references = (intercell_df.references=="1").sum()
num_one_in_references

np.int64(0)

### Counting False values

In [16]:
num_False_in_references= (intercell_df.references==False).sum()
num_False_in_references

np.int64(0)

### Counting "False" values

In [17]:
num_false_in_references = (intercell_df.references=="False").sum()
num_false_in_references

np.int64(0)
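
The separate checks for `True`, `"True"` and `"1"` matter because, in an object column, a boolean and its string spelling are different values. A small sketch with a made-up Series (the name `flags` is hypothetical) shows that each comparison matches only its own representation:

```python
import pandas as pd

# Made-up object Series mixing booleans, strings and a null value.
flags = pd.Series([True, "True", "1", False, "False", None], dtype=object)

n_bool_true = (flags == True).sum()      # matches only the boolean True
n_str_true = (flags == "True").sum()     # matches only the string "True"
n_str_one = (flags == "1").sum()         # matches only the string "1"
n_bool_false = (flags == False).sum()    # matches only the boolean False
n_str_false = (flags == "False").sum()   # matches only the string "False"

print(n_bool_true, n_str_true, n_str_one, n_bool_false, n_str_false)
```

A count of zero for every variant, as in the cells here, therefore indicates that the column does not encode boolean flags in any of these forms.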

### Filtering

In [18]:
filtered = intercell_df[(intercell_df["substrate"]=="Q9D0N7")]
filtered

Unnamed: 0,enzyme,enzyme_genesymbol,substrate,substrate_genesymbol,isoforms,residue_type,residue_offset,modification,sources,references,curation_effort,ncbi_tax_id
65530,O88697,Stk16,Q9D0N7,Chaf1b,1,S,422,phosphorylation,PhosphoNetworks,,0,10090
65531,O88697,Stk16,Q9D0N7,Chaf1b,1,S,458,phosphorylation,PhosphoNetworks,,0,10090


In [19]:
filtered = intercell_df[(intercell_df["enzyme"]=="P02340")]
filtered

Unnamed: 0,enzyme,enzyme_genesymbol,substrate,substrate_genesymbol,isoforms,residue_type,residue_offset,modification,sources,references,curation_effort,ncbi_tax_id
69040,P02340,Tp53,Q04207,Rela,2,S,538,phosphorylation,ProtMapper;Sparser_ProtMapper,ProtMapper:18477470,1,10090
69128,P02340,Tp53,P39689,Cdkn1a,1,S,148,phosphorylation,ProtMapper;Sparser_ProtMapper,ProtMapper:12897801,1,10090
69715,P02340,Tp53,P02340,Tp53,1,S,37,phosphorylation,ProtMapper;Sparser_ProtMapper,ProtMapper:20354524,1,10090
69767,P02340,Tp53,Q9Z265,Chek2,1,T,68,phosphorylation,ProtMapper;REACH_ProtMapper,ProtMapper:24553354,1,10090
69997,P02340,Tp53,Q9DBR7,Ppp1r12a,1;2,T,698,phosphorylation,ProtMapper;REACH_ProtMapper,ProtMapper:24343302,1,10090
70021,P02340,Tp53,P38532,Hsf1,1;2,S,326,phosphorylation,ProtMapper;Sparser_ProtMapper,ProtMapper:24763051,1,10090
70187,P02340,Tp53,P27661,H2ax,1,S,140,phosphorylation,ProtMapper;REACH_ProtMapper,ProtMapper:24413150,1,10090
70303,P02340,Tp53,Q62193,Rpa2,1,S,33,phosphorylation,ProtMapper;REACH_ProtMapper,ProtMapper:29874588,1,10090
70677,P02340,Tp53,O35280,Chek1,1,S,317,phosphorylation,ProtMapper;REACH_ProtMapper,ProtMapper:29874588,1,10090
70678,P02340,Tp53,O35280,Chek1,1,S,345,phosphorylation,ProtMapper;REACH_ProtMapper,ProtMapper:20840867,1,10090


In [27]:
count_enzymes = intercell_df["enzyme"].count()
count_enzymes

np.int64(93028)

In [26]:
count_substrate = intercell_df["substrate"].count()
count_substrate

np.int64(93028)

In [28]:
93028*2

186056

In [25]:
intercell_df.shape

(93028, 12)
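
Comparing `count()` with `shape` works as a null check because `count()` only counts non-null entries, while `shape[0]` is the total number of rows. A minimal sketch with made-up data:

```python
import pandas as pd

# Made-up frame with one missing substrate.
toy_df = pd.DataFrame({
    "enzyme": ["O88697", "P02340", "P02340"],
    "substrate": ["Q9D0N7", None, "Q04207"],
})

n_rows = toy_df.shape[0]                 # total number of rows
non_null_per_column = toy_df.count()     # non-null entries per column
null_per_column = n_rows - non_null_per_column

print(null_per_column)
```

When `count()` equals the number of rows for a column, as for *enzyme* and *substrate* in the cells of this section, that column has no null values.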

## Section 4. Load the dataset with predefined data types.
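
One possible way to do this is to pass a `dtype` mapping to `pd.read_table`. The sketch below is only an illustration under stated assumptions: the column names come from the `info()` output in Section 1.2, the choice of pandas' nullable `"string"` dtype for every column is an assumption, and `sample_tsv` is a one-row in-memory stand-in for the real file.

```python
import io

import pandas as pd

# Assumed mapping: column names from the info() output; the "string" dtype
# for every column is an assumption, not a decision made in this notebook.
complexes_dtypes = {
    "name": "string",
    "components": "string",
    "components_genesymbols": "string",
    "stoichiometry": "string",
    "sources": "string",
    "references": "string",
    "identifiers": "string",
}

# One-row stand-in for the real file (row copied from the preview in Section 1.2).
sample_tsv = (
    "name\tcomponents\tcomponents_genesymbols\tstoichiometry\t"
    "sources\treferences\tidentifiers\n"
    "SMAD4/JUN\tP05412_Q13485\tJUN_SMAD4\t0:0\tSIGNOR\t\tSIGNOR:SIGNOR-C10\n"
)

typed_df = pd.read_table(io.StringIO(sample_tsv), dtype=complexes_dtypes)
print(typed_df.dtypes)
```

For the real dataset, `io.StringIO(sample_tsv)` would be replaced by `dataset_path_complexes`.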