# Interactions file preprocessor
## Idea
Starting from the interactions file, the goal is to obtain a list of nodes for each type of object in the network, which in this case are ligands and targets. The result should be a _script_ that generates the node files and also reports basic information about the contents of the interactions file.
## Input
Interactions file
## Output
Node file for the ligands  
Node file for the targets  
Basic statistics for the file  
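
The workflow described above can be sketched as a single helper that keeps one row per unique node. This is a minimal sketch under stated assumptions, not the notebook's actual code: the function name `unique_nodes` and the toy table are hypothetical, and only the column names (`Target`, `Ligand`, `L_SMILES`) are taken from the interactions file used in this notebook.

```python
import pandas as pd

def unique_nodes(interactions, key, columns):
    """Return one row per unique value of `key`, keeping only `columns`.

    Hypothetical helper: when a key appears several times, the first
    occurrence is kept and the rest are dropped.
    """
    nodes = interactions[columns].drop_duplicates(subset=key, keep="first")
    return nodes.reset_index(drop=True)

# Toy interactions table (made-up values, for illustration only)
toy = pd.DataFrame({
    "Target": ["T1", "T1", "T2"],
    "Ligand": ["L1", "L2", "L1"],
    "L_SMILES": ["CCO", "CCN", "CCO"],
})

ligand_nodes = unique_nodes(toy, "Ligand", ["Ligand", "L_SMILES"])
target_nodes = unique_nodes(toy, "Target", ["Target"])
print(ligand_nodes)
print(target_nodes)
```

Wrapping the logic in one function means the ligand and target node files are produced by the same code path, differing only in the key column and the attributes kept.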

In [1]:
# Import dependencies to notebook
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Modify to use other interaction files
path = '/Users/cvigilv/Dropbox/2018/Data/CC&D mk. 2/chembl23_GS3_v2.mphase_gt_0.txt.co'

In [3]:
# Load interactions file to dataframe
interaction_matrix = pd.read_csv(path, delimiter = '\t', index_col = False)

print('First 5 entries of interactions file:\n')
print(interaction_matrix.head())

First 5 entries of interactions file:

          Target         Ligand T_Accession  \
0  CHEMBL1075104  CHEMBL1287853      Q5S007   
1  CHEMBL1075104  CHEMBL1289926      Q5S007   
2  CHEMBL1075104  CHEMBL1721885      Q5S007   
3  CHEMBL1075104  CHEMBL1789941      Q5S007   
4  CHEMBL1075104  CHEMBL1908397      Q5S007   

                                              T_Name  \
0  Leucine-rich repeat serine/threonine-protein k...   
1  Leucine-rich repeat serine/threonine-protein k...   
2  Leucine-rich repeat serine/threonine-protein k...   
3  Leucine-rich repeat serine/threonine-protein k...   
4  Leucine-rich repeat serine/threonine-protein k...   

                                    T_Pfam  min(ACT)  avg(ACT)  max(ACT)  \
0  PF00069,PF08477,PF12799,PF13855,PF16095     870.0     935.0    1000.0   
1  PF00069,PF08477,PF12799,PF13855,PF16095     920.0     955.0     990.0   
2  PF00069,PF08477,PF12799,PF13855,PF16095      70.0      71.0      72.0   
3  PF00069,PF08477,PF12799,PF13855,PF
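
Because the later cells select columns by name, it may help to check that those columns exist right after loading. The following is a minimal sketch, assuming a tab-separated file with a header row; `load_interactions`, `REQUIRED_COLUMNS`, and the in-memory toy file are hypothetical names introduced for illustration, and the column list is only a subset of what the real file contains.

```python
import io
import pandas as pd

# Subset of column names used later in the notebook (illustrative, not exhaustive)
REQUIRED_COLUMNS = ["Target", "Ligand", "L_SMILES", "L_MaxPhase"]

def load_interactions(source, required=REQUIRED_COLUMNS):
    """Read a tab-separated interactions file and verify the expected columns."""
    df = pd.read_csv(source, sep="\t", index_col=False)
    missing = [col for col in required if col not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

# In-memory stand-in for a real file (made-up values)
toy_file = io.StringIO("Target\tLigand\tL_SMILES\tL_MaxPhase\nT1\tL1\tCCO\t4\n")
toy_df = load_interactions(toy_file)
print(toy_df.shape)
```

Failing early with a clear message is easier to diagnose than a `KeyError` raised several cells later.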

In [4]:
# Get ligands information
ligands_info = (
    interaction_matrix[['Ligand', 'L_SMILES', 'L_MaxPhase']]
    .drop_duplicates(subset='Ligand', keep='first')
    .reset_index(drop=True)
)
# Write the unique ligands next to the interactions file, with a '.ul' suffix
ligands_info.to_csv(path + '.ul', sep='\t', index=False)

print('First 5 entries of unique ligands from interactions file:\n')
print(ligands_info.head())

First 5 entries of unique ligands from interactions file:

          Ligand                                           L_SMILES  \
0  CHEMBL1287853  Cc1cnc(Nc2ccc(OCCN3CCCC3)cc2)nc1Nc4cccc(c4)S(=...   
1  CHEMBL1289926  CNC(=O)c1ccccc1Sc2ccc3c(\\C=C\\c4ccccn4)n[nH]c3c2   
2  CHEMBL1721885  Cc1[nH]c(\\C=C\\2/C(=O)Nc3ccc(F)cc23)c(C)c1C(=...   
3  CHEMBL1789941        N#CC[C@H](C1CCCC1)n2cc(cn2)c3ncnc4[nH]ccc34   
4  CHEMBL1908397     O=C(N1CCNCC1)c2ccc(\\C=C\\c3n[nH]c4ccccc34)cc2   

   L_MaxPhase  
0           3  
1           4  
2           2  
3           4  
4           1  
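
Deduplicating on `Ligand` with `keep='first'` silently discards any later row whose attributes differ from the first one. A minimal sketch of a consistency check is shown below, assuming that each ligand is expected to have a single `L_SMILES` and `L_MaxPhase`; the helper name `inconsistent_keys` and the toy data are hypothetical.

```python
import pandas as pd

def inconsistent_keys(df, key, attributes):
    """Return the `key` values whose rows disagree on any of `attributes`."""
    # Number of distinct values per key for each attribute (NaN counts as a value)
    counts = df.groupby(key)[attributes].nunique(dropna=False)
    return counts[(counts > 1).any(axis=1)].index.tolist()

# Toy data (made-up values): L2 appears with two different SMILES strings
toy = pd.DataFrame({
    "Ligand": ["L1", "L1", "L2", "L2"],
    "L_SMILES": ["CCO", "CCO", "CCN", "CCC"],
    "L_MaxPhase": [4, 4, 1, 1],
})

conflicts = inconsistent_keys(toy, "Ligand", ["L_SMILES", "L_MaxPhase"])
print(conflicts)
```

An empty list would indicate that keeping the first occurrence loses no information for the checked columns.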


In [5]:
# Get target information
targets_info = (
    interaction_matrix[['Target', 'T_Accession', 'T_Name', 'T_Pfam', 'T_Sequence']]
    .drop_duplicates(subset='Target', keep='first')
    .reset_index(drop=True)
)
# Write the unique targets next to the interactions file, with a '.ut' suffix
targets_info.to_csv(path + '.ut', sep='\t', index=False)

print('First 5 entries of unique targets from interactions file:\n')
print(targets_info.head())

First 5 entries of unique targets from interactions file:

          Target T_Accession  \
0  CHEMBL1075104      Q5S007   
1  CHEMBL1075132      Q12931   
2  CHEMBL1075133      Q8WTQ7   
3  CHEMBL1075144      Q9GZQ4   
4  CHEMBL1075155      Q15208   

                                              T_Name  \
0  Leucine-rich repeat serine/threonine-protein k...   
1           Heat shock protein 75 kDa, mitochondrial   
2                                   Rhodopsin kinase   
3                            Neuromedin-U receptor 2   
4                 Serine/threonine-protein kinase 38   

                                    T_Pfam  \
0  PF00069,PF08477,PF12799,PF13855,PF16095   
1                          PF00183,PF02518   
2                          PF00069,PF00615   
3                                  PF00001   
4                          PF00069,PF00433   

                                          T_Sequence  
0  MASGSCQGCEEDEETLKKLIVRLNNVQEGKQIETLVQILEDLLVFT...  
1  MARELRALLLWGRRLRPLLRA
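
The `T_Pfam` column shown above stores several Pfam accessions as one comma-separated string, so counting its values directly treats each combination of domains as a single category. If counts per individual domain are wanted, the strings can be split first. This is a minimal sketch on made-up data, assuming a comma with no surrounding spaces as the separator and a pandas version that provides `Series.explode` (0.25 or later); `toy_targets` and `domain_counts` are hypothetical names.

```python
import pandas as pd

# Toy targets table (made-up values, for illustration only)
toy_targets = pd.DataFrame({
    "Target": ["T1", "T2", "T3"],
    "T_Pfam": ["PF00069,PF00433", "PF00069", "PF00001"],
})

domain_counts = (
    toy_targets["T_Pfam"]
    .str.split(",")   # each cell becomes a list of accessions
    .explode()        # one row per (target, accession) pair
    .value_counts()   # number of targets carrying each accession
)
print(domain_counts)
```

Both views are useful: counting whole strings describes domain architectures, while counting after the split describes how often each domain occurs across targets.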

In [6]:
# Get additional information about the interactions file
print('Unique instances on interactions files:\n')
print(interaction_matrix.nunique())
print('\n'+'-'*32+'\n')
print('Top 15 most common ligands in interactions file:\n')
print(interaction_matrix['Ligand'].value_counts().head(n = 15))
print('\n'+'-'*32+'\n')
print('Top 15 most common targets in interactions file:\n')
print(interaction_matrix['Target'].value_counts().head(n = 15))
print('\n'+'-'*32+'\n')
print('Top 15 most common Pfam in targets:\n')
print(targets_info['T_Pfam'].value_counts().head(n = 15))

Unique instances on interactions files:

Target          897
Ligand         1232
T_Accession     897
T_Name          896
T_Pfam          343
min(ACT)       2260
avg(ACT)       3604
max(ACT)       2414
L_SMILES       1232
Species           1
L_MaxPhase        4
T_Sequence      897
dtype: int64

--------------------------------

Top 15 most common ligands in interactions file:

CHEMBL603469     336
CHEMBL1908397    269
CHEMBL1287853    268
CHEMBL475251     250
CHEMBL535        234
CHEMBL1721885    216
CHEMBL288441     192
CHEMBL1230609    189
CHEMBL574738     182
CHEMBL608533     181
CHEMBL572878     163
CHEMBL522892     156
CHEMBL601719     141
CHEMBL1789941    114
CHEMBL428690     112
Name: Ligand, dtype: int64

--------------------------------

Top 15 most common targets in interactions file:

CHEMBL1833    102
CHEMBL1867     96
CHEMBL222      96
CHEMBL224      96
CHEMBL225      93
CHEMBL234      92
CHEMBL1942     89
CHEMBL240      89
CHEMBL228      87
CHEMBL1916     78
CHEMBL216     