This notebook processes data from 33 cancer types from the **TCGA** database with interaction network data generated by **RING 2.0** (https://ring.biocomputingup.it/).

Here, **Adrenocortical carcinoma** (**ACC**) is processed. To process the other 32 cancers, just change the input file read in section 42, as the processing is the same.

#0 - Basic settings

In [None]:
#Permission to access any file from Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#Increasing the display capacity of columns and rows
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',70000)

In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Tue Jul 13 12:46:49 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    30W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('To enable a high-RAM runtime, select the Runtime > "Change runtime type"')
  print('menu, and then select High-RAM in the Runtime shape dropdown. Then, ')
  print('re-execute this cell.')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


#42 - Generating attributes from the RING nodes file

The nodes files of all PDBs submitted to **RING 2.0** were merged into a single file and processed in the **TrataRINs** notebook, which is located in the **drive/My Drive/ProcessaNovaBase/TrataArqsRING** folder. The processed result is the **nodesDB_proc** database.

The attributes from the RING nodes file are:

- PDB
- NodeId
- Chain
- Position
- Residue
- Dssp
- Degree
- Bfactor_CA
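The NodeId values shown later in this notebook (for example `A:2:_:PRO`) appear to pack the chain, position and residue into one colon-separated string. The sketch below shows one way such a string could be split into separate attributes. It uses toy rows, assumes the third field is an insertion-code placeholder, and is not necessarily how the **TrataRINs** notebook does it.

```python
import pandas as pd

# Toy rows in the NodeId layout visible in this notebook
# (assumed to be chain:position:insertion code:residue).
nodes = pd.DataFrame({"NodeId": ["A:2:_:PRO", "A:3:_:TYR", "B:242:_:ARG"]})

# Split the identifier into its four colon-separated parts.
parts = nodes["NodeId"].str.split(":", expand=True)
nodes["Chain"] = parts[0]
nodes["Position"] = parts[1].astype(int)
# Capitalize the three-letter code so it matches values such as "Pro".
nodes["Residue"] = parts[3].str.capitalize()

print(nodes)
```

Storing the residue as `Pro` rather than `PRO` matters later, because the join compares it with the `aminBefore` values of the mutation table.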

In [None]:
#Increasing the display capacity of columns and rows
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',90000)
pd.set_option('display.width', 7000)

In [None]:
#Reading the ACC mutation database (a tab-delimited file, despite the .csv extension)

import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_change_substitut.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2846 entries, 0 to 2845
Data columns (total 86 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         2846 non-null   int64  
 1   POS                           2846 non-null   int64  
 2   ID                            2846 non-null   object 
 3   REF                           2846 non-null   object 
 4   ALT                           2846 non-null   object 
 5   avsnp150                      2846 non-null   object 
 6   Interpro_domain               2846 non-null   object 
 7   dbNSFP_DEOGEN2_pred           2846 non-null   object 
 8   dbNSFP_MetaSVM_pred           2846 non-null   object 
 9   dbNSFP_fathmmMKL_coding_pred  2846 non-null   object 
 10  dbNSFP_PrimateAI_pred         2846 non-null   object 
 11  dbNSFP_PROVEAN_pred           2846 non-null   object 
 12  dbNSFP_MCAP_pred              2846 non-null   object 
 13  dbN

In [None]:
base_ACC.head(15)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution
0,1,2303896,.,C,T,rs752779978,.,D,D,D,T,D,D,D,D,8e-06,P,D,D,T,D,D,D,8e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro423Leu,c.1268C>T,SKI,NM_003036.3,4,T,2,1268,>,Pro,Leu,423,subst,.,1,2303896,rs752779978,0.0,1,16/20,16,1,1,1,NM_003036.3,P12755,SKI,5XOD,1.85,SKI_HUMAN,15.0,40.0,5XOD,15.0,40.0,P12755,B,2.0,27.0,UNP,15,PGLQKTLEQFHLSSMSSLGGPAAFSA,26.0,5XOD,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0
1,1,3816294,.,G,A,.,.,T,T,N,T,N,T,T,T,0.0,B,T,T,T,N,N,T,0.0,.,N,T,T,B,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro883Leu,c.2648C>T,CEP104,NM_014704.3,21,A,2,2648,>,Pro,Leu,883,subst,.,1,3816294,rs1197412379,0.0,0,0/20,0,0,0,0,NM_014704.3,O60308,"CEP104,KIAA0562",5LPI,1.8,CE104_HUMAN,746.0,875.0,5LPI,746.0,875.0,O60308,D,5.0,134.0,UNP,746,DEHYLDNLCIFCGERSESFTEEGLDLHYWKHCLMLTRCDHCKQVVE...,131.0,5LPI,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0
2,1,3816294,.,G,A,.,.,T,T,N,T,N,T,T,T,0.0,B,T,T,T,N,N,T,0.0,.,N,T,T,B,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro883Leu,c.2648C>T,CEP104,NM_014704.3,21,A,2,2648,>,Pro,Leu,883,subst,.,1,3816294,rs1197412379,0.0,0,0/20,0,0,0,0,NM_014704.3,O60308,"CEP104,KIAA0562",5LPH,2.25,CE104_HUMAN,392.0,676.0,5LPH,392.0,676.0,O60308,A,4.0,288.0,UNP,392,GEAVVEPEMSNADISDARRGGMLGEPEPLTEKALREASSAIDVLGE...,288.0,5LPH,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0
3,1,21844156,.,G,A,rs764778166,Immunoglobulin_I-set|Immunoglobulin_V-set_doma...,T,T,N,T,N,D,D,T,4.1e-05,P,D,T,D,N,N,T,4e-05,.,N,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgg/Tgg,p.Arg2871Trp,c.8611C>T,HSPG2,NM_001291860.1,65,A,1,8611,>,Arg,Trp,2871,subst,.,1,21844156,rs764778166,0.0,1,6/20,6,1,1,0,NM_001291860.1,P98160,HSPG2,3SH4,1.5,PGBM_HUMAN,4197.0,4391.0,3SH4,1.0,195.0,P98160,A,1.0,195.0,UNP,4197,DAPGQYGAYFHDDGFLAFPGHVFSRSLPEVPETIELEVRTSTASGL...,197.0,3SH4,-3,positivecharge,aromatic,positivechargeTOaromatic,0,1,0TO1,0
4,1,29097885,.,G,A,rs750558736,"Band_4.1,_C-terminal\x3bFERM_adjacent_(FA)|PH_...",T,D,D,T,N,T,T,T,1.6e-05,D,T,D,D,N,N,T,1.2e-05,.,D,D,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtc/Atc,p.Val755Ile,c.2263G>A,EPB41,NM_001166005.1,17,A,1,2263,>,Val,Ile,755,subst,.,1,29097885,rs750558736,0.0,1,8/20,8,1,1,0,NM_001166005.1,P11171,"EPB41,E41P",3QIJ,1.8,EPB41_HUMAN,211.0,488.0,3QIJ,211.0,488.0,P11171,A,19.0,296.0,UNP,211,HCKVSLLDDTVYECVVEKHAKGQDLLKRVCEHLNLLEEDYFGLAIW...,281.0,3QIJ,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0
5,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0
6,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0
7,1,47280813,.,C,T,rs777645890,.,T,T,D,T,N,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,D,T,T,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gaa/Aaa,p.Glu549Lys,c.1645G>A,STIL,NM_001048166.1,12,T,1,1645,>,Glu,Lys,549,subst,.,1,47280813,rs777645890,0.0,0,2/20,2,0,0,0,NM_001048166.1,Q15468,"STIL,SIL",5LHW,0.91,STIL_HUMAN,726.0,750.0,5LHW,726.0,750.0,Q15468,A,4.0,28.0,UNP,726,LTEQDRQLRLLQAQIQRLLEAQSLM,25.0,5LHW,1,negativecharge,positivecharge,negativechargeTOpositivecharge,0,1,0TO1,0
8,1,47280813,.,C,T,rs777645890,.,T,T,D,T,N,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,D,T,T,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gaa/Aaa,p.Glu549Lys,c.1645G>A,STIL,NM_001048166.1,12,T,1,1645,>,Glu,Lys,549,subst,.,1,47280813,rs777645890,0.0,0,2/20,2,0,0,0,NM_001048166.1,Q15468,"STIL,SIL",5LHZ,2.51,STIL_HUMAN,726.0,750.0,5LHZ,726.0,750.0,Q15468,D,4.0,28.0,UNP,726,LTEQDRQLRLLQAQIQRLLEAQSLM,25.0,5LHZ,1,negativecharge,positivecharge,negativechargeTOpositivecharge,0,1,0TO1,0
9,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0


##42.1 Reading the RING nodes database


The **nodesDB_proc.csv** file was generated in the **TrataRINs** notebook which is located in the **TrataArqsRING** folder of this drive.

In [None]:
import pandas as pd
df_nodes_RING = pd.read_csv("drive/My Drive/ProcessaNovaBase/TrataArqsRING/nodesDB_proc.csv",sep='\t', keep_default_na=False)

In [None]:
df_nodes_RING.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12883771 entries, 0 to 12883770
Data columns (total 8 columns):
 #   Column           Dtype  
---  ------           -----  
 0   PDB_id_RING      object 
 1   NodeId_RING      object 
 2   Chain_RING       object 
 3   Position_RING    int64  
 4   Residue_RING     object 
 5   Dssp_RING        object 
 6   Degree_RING      int64  
 7   Bfactor_CA_RING  float64
dtypes: float64(1), int64(2), object(5)
memory usage: 786.4+ MB


In [None]:
#Checking for missing values (with keep_default_na=False, blank fields are read as empty strings and are not counted here)
df_nodes_RING.isna().sum()

PDB_id_RING        0
NodeId_RING        0
Chain_RING         0
Position_RING      0
Residue_RING       0
Dssp_RING          0
Degree_RING        0
Bfactor_CA_RING    0
dtype: int64
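Because the nodes file is read with `keep_default_na=False`, blank fields (such as the empty `Dssp_RING` values visible in the rows of this table) are kept as empty strings, so `isna()` cannot report them. A minimal sketch on hypothetical toy data shows the difference and how blanks could be counted instead:

```python
import io
import pandas as pd

# Toy tab-separated data with one blank Dssp field (hypothetical rows
# that only mimic the layout of the nodes file).
raw = "PDB_id_RING\tDssp_RING\tDegree_RING\n10GS\t\t1\n10GS\tE\t12\n"

toy = pd.read_csv(io.StringIO(raw), sep="\t", keep_default_na=False)

# The blank stays an empty string, so isna() finds nothing.
n_nan = int(toy["Dssp_RING"].isna().sum())
# Counting empty strings exposes the blank field.
n_blank = int(toy["Dssp_RING"].eq("").sum())

print(n_nan, n_blank)
```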

In [None]:
df_nodes_RING.head()

Unnamed: 0,PDB_id_RING,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING
0,10GS,A:2:_:PRO,A,2,Pro,,1,31.96
1,10GS,A:3:_:TYR,A,3,Tyr,E,12,18.42
2,10GS,A:4:_:THR,A,4,Thr,E,3,19.34
3,10GS,A:5:_:VAL,A,5,Val,E,4,17.96
4,10GS,A:6:_:VAL,A,6,Val,E,5,19.18


In [None]:
df_nodes_RING.query('PDB_id_RING == "3ISU"')

Unnamed: 0,PDB_id_RING,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING
3568974,3ISU,A:1533:_:PRO,A,1533,Pro,,2,53.73
3568975,3ISU,A:1534:_:SER,A,1534,Ser,,1,48.83
3568976,3ISU,A:1535:_:LEU,A,1535,Leu,E,13,40.15
3568977,3ISU,A:1536:_:HIS,A,1536,His,E,4,32.82
3568978,3ISU,A:1537:_:TYR,A,1537,Tyr,E,14,24.32
3568979,3ISU,A:1538:_:THR,A,1538,Thr,E,4,22.35
3568980,3ISU,A:1539:_:ALA,A,1539,Ala,H,13,20.3
3568981,3ISU,A:1540:_:ALA,A,1540,Ala,H,6,20.7
3568982,3ISU,A:1541:_:GLN,A,1541,Gln,H,9,20.97
3568983,3ISU,A:1542:_:LEU,A,1542,Leu,H,15,21.15


##42.2 Joining the ACC table (on the fields PDB_id, aminBefore, pdbx_strand_id and poschangeProt) with the RING nodes table (on the fields PDB_id_RING, Residue_RING, Chain_RING and Position_RING) to keep only the mutations whose wild-type PDB is annotated in RING

In [None]:
#Attributes that will be the key in the join with the ACC database
def categories_column(df):
    for col in ['PDB_id_RING', 'Residue_RING', 'Chain_RING', 'Position_RING']:
        mydic= df[col].value_counts().to_dict()
        print(col, mydic)
        print('\n')

categories_column(df_nodes_RING)

PDB_id_RING {'5LE5': 6046, '1QO5': 6046, '5L5U': 6039, '5LF0': 6018, '5L5F': 6018, '5LF7': 6018, '5LF3': 6016, '5LF4': 6013, '5L5H': 6005, '3LK4': 6005, '5LF6': 6004, '5L5S': 6002, '5L5A': 6002, '5LEY': 6000, '5LF1': 5999, '5L5O': 5985, '6HTR': 5975, '5LEZ': 5927, '5LEX': 5914, '4R3O': 5799, '5DOU': 5521, '2Q3E': 5401, '4DVQ': 5317, '5K9Q': 5163, '4XGZ': 4786, '2F5Z': 4751, '1ZY8': 4558, '3B2U': 4478, '4DL1': 4387, '4AY1': 4175, '4ZUL': 3964, '2A3W': 3944, '4ZUK': 3943, '2J6L': 3932, '3N80': 3911, '3SZ9': 3893, '1YDE': 3887, '1O02': 3885, '5L13': 3879, '1O01': 3877, '3INJ': 3873, '5W08': 3871, '1CW3': 3860, '1O00': 3859, '1NZZ': 3856, '4KWG': 3849, '1NZX': 3845, '1O05': 3841, '2VLE': 3836, '5L2O': 3832, '4ZVW': 3826, '1N4S': 3819, '4KWF': 3814, '3PVN': 3813, '1N4Q': 3803, '6VR6': 3793, '3PNW': 3765, '4CQM': 3759, '6Z86': 3754, '5Z2C': 3690, '6I34': 3690, '6I35': 3689, '1ZMD': 3664, '1ZMC': 3663, '5LHD': 3629, '6QAK': 3614, '6X5T': 3602, '4BL5': 3598, '2QG4': 3594, '5NHG': 3589, '3SOM':

In [None]:
#Attributes that will be the key in the join with the RING
def categories_column(df):
    for col in ['PDB_id', 'aminBefore', 'pdbx_strand_id', 'poschangeProt']:
        mydic= df[col].value_counts().to_dict()
        print(col, mydic)
        print('\n')

categories_column(base_ACC)

PDB_id {'6BY8': 32, '6O5I': 32, '3Q05': 32, '2FOJ': 9, '2FOO': 9, '2F1X': 9, '3LW1': 8, '1AIE': 8, '3U84': 8, '6RWU': 8, '2XWR': 8, '1KZY': 8, '1H26': 8, '6RL6': 8, '5MCU': 8, '5MF7': 8, '6S39': 8, '6RWI': 8, '3DAB': 8, '6SIQ': 8, '6RKI': 8, '6RX2': 8, '5MHC': 8, '3KZ8': 8, '6RM7': 8, '4XR8': 8, '4HJE': 8, '1TSR': 8, '6SIO': 8, '3OQ5': 8, '6RWS': 8, '4X34': 8, '6VR5': 8, '5UN8': 8, '1XQH': 8, '1TUP': 8, '3KMD': 8, '6VR1': 8, '5BUA': 8, '4IBU': 8, '2ADY': 8, '3IGL': 8, '4BUZ': 8, '6RWH': 8, '6RKK': 8, '6SLV': 8, '6RJZ': 8, '6RL4': 8, '5MCW': 8, '4IBV': 8, '6RM5': 8, '5DDD': 8, '3TG5': 8, '2OCJ': 8, '2PCX': 8, '6S9Q': 8, '5MG7': 8, '6SIP': 8, '6R5L': 8, '2AHI': 8, '6S3C': 8, '6RK8': 8, '4RP7': 8, '6VRN': 8, '6SIN': 8, '5MCT': 8, '6RL3': 8, '4RP6': 8, '3IGK': 8, '6S40': 8, '6RKM': 8, '2YBG': 8, '3D0A': 8, '6FJ5': 8, '4QO1': 8, '5OL0': 8, '2AC0': 8, '4IBW': 8, '1C26': 8, '5MCV': 8, '6V4F': 8, '2ATA': 8, '6V4H': 8, '5MOC': 8, '1YCS': 8, '2B3G': 8, '3SLA': 6, '3FQN': 6, '6M90': 6, '1G3J': 6,

In [None]:
base_ACC.query("PDB_id == '3ISU' and aminBefore == 'Ala' and pdbx_strand_id == 'A' and poschangeProt == 1540")

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution
15,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0


In [None]:
df_nodes_RING.query("PDB_id_RING == '3ISU' and Residue_RING == 'Ala' and Chain_RING == 'A' and Position_RING == 1540")

Unnamed: 0,PDB_id_RING,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING
3568981,3ISU,A:1540:_:ALA,A,1540,Ala,H,6,20.7
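The two spot checks above only agree if the key columns hold comparable types on both sides. As a hedged sketch on toy data (the column names are taken from this notebook, the values are hypothetical), the dtypes of a key pair can be compared before merging:

```python
import pandas as pd

# Toy stand-ins for the position keys of the two tables.
left = pd.DataFrame({"poschangeProt": [1540, 423]})
right = pd.DataFrame({"Position_RING": [1540, 2]})

# Pairs of (left key, right key) that are supposed to line up in the join.
key_pairs = [("poschangeProt", "Position_RING")]
for left_key, right_key in key_pairs:
    print(left_key, left[left_key].dtype, "|", right_key, right[right_key].dtype)

# True only when every key pair shares the same dtype.
dtypes_match = all(left[l].dtype == right[r].dtype for l, r in key_pairs)

# Filtering with an integer literal matches the integer column directly.
hit = right[right["Position_RING"] == 1540]
print(dtypes_match, len(hit))
```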


Together with the removal of unmatched rows that follows, this join filters the **ACC** database, keeping only the mutations whose **wild-type PDB** is annotated in **RING**. The matched rows gain the node-level attributes of the PDB interaction networks.
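A minimal sketch of this kind of multi-key left join, on hypothetical toy rows rather than the real tables, is shown here. The `indicator=True` option is an addition for illustration: it labels each row as matched (`both`) or unmatched (`left_only`), which makes the filtering step explicit.

```python
import pandas as pd

# Toy mutation table: one row has a matching RING node, one does not.
mutations = pd.DataFrame({
    "PDB_id": ["3ISU", "5XOD"],
    "aminBefore": ["Ala", "Pro"],
    "pdbx_strand_id": ["A", "B"],
    "poschangeProt": [1540, 423],
})
# Toy nodes table with a single annotated residue.
nodes = pd.DataFrame({
    "PDB_id_RING": ["3ISU"],
    "Residue_RING": ["Ala"],
    "Chain_RING": ["A"],
    "Position_RING": [1540],
    "Degree_RING": [6],
})

left_keys = ["PDB_id", "aminBefore", "pdbx_strand_id", "poschangeProt"]
right_keys = ["PDB_id_RING", "Residue_RING", "Chain_RING", "Position_RING"]

# Left join keeps every mutation; the _merge column records whether it matched.
merged = pd.merge(mutations, nodes, left_on=left_keys, right_on=right_keys,
                  how="left", indicator=True)

# Keep only the mutations that found a RING node.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(merged["_merge"].value_counts())
print(matched)
```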

In [None]:
import pandas as pd
base_merge_node_RING = pd.merge(base_ACC, df_nodes_RING, left_on=['PDB_id','aminBefore','pdbx_strand_id','poschangeProt'], right_on=['PDB_id_RING','Residue_RING','Chain_RING','Position_RING'], how='left')


In [None]:
base_merge_node_RING.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2846 entries, 0 to 2845
Data columns (total 94 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         2846 non-null   int64  
 1   POS                           2846 non-null   int64  
 2   ID                            2846 non-null   object 
 3   REF                           2846 non-null   object 
 4   ALT                           2846 non-null   object 
 5   avsnp150                      2846 non-null   object 
 6   Interpro_domain               2846 non-null   object 
 7   dbNSFP_DEOGEN2_pred           2846 non-null   object 
 8   dbNSFP_MetaSVM_pred           2846 non-null   object 
 9   dbNSFP_fathmmMKL_coding_pred  2846 non-null   object 
 10  dbNSFP_PrimateAI_pred         2846 non-null   object 
 11  dbNSFP_PROVEAN_pred           2846 non-null   object 
 12  dbNSFP_MCAP_pred              2846 non-null   object 
 13  dbN

In [None]:
base_merge_node_RING.head(20)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING
0,1,2303896,.,C,T,rs752779978,.,D,D,D,T,D,D,D,D,8e-06,P,D,D,T,D,D,D,8e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro423Leu,c.1268C>T,SKI,NM_003036.3,4,T,2,1268,>,Pro,Leu,423,subst,.,1,2303896,rs752779978,0.0,1,16/20,16,1,1,1,NM_003036.3,P12755,SKI,5XOD,1.85,SKI_HUMAN,15.0,40.0,5XOD,15.0,40.0,P12755,B,2.0,27.0,UNP,15,PGLQKTLEQFHLSSMSSLGGPAAFSA,26.0,5XOD,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,,,,,,,,
1,1,3816294,.,G,A,.,.,T,T,N,T,N,T,T,T,0.0,B,T,T,T,N,N,T,0.0,.,N,T,T,B,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro883Leu,c.2648C>T,CEP104,NM_014704.3,21,A,2,2648,>,Pro,Leu,883,subst,.,1,3816294,rs1197412379,0.0,0,0/20,0,0,0,0,NM_014704.3,O60308,"CEP104,KIAA0562",5LPI,1.8,CE104_HUMAN,746.0,875.0,5LPI,746.0,875.0,O60308,D,5.0,134.0,UNP,746,DEHYLDNLCIFCGERSESFTEEGLDLHYWKHCLMLTRCDHCKQVVE...,131.0,5LPI,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,,,,,,,,
2,1,3816294,.,G,A,.,.,T,T,N,T,N,T,T,T,0.0,B,T,T,T,N,N,T,0.0,.,N,T,T,B,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro883Leu,c.2648C>T,CEP104,NM_014704.3,21,A,2,2648,>,Pro,Leu,883,subst,.,1,3816294,rs1197412379,0.0,0,0/20,0,0,0,0,NM_014704.3,O60308,"CEP104,KIAA0562",5LPH,2.25,CE104_HUMAN,392.0,676.0,5LPH,392.0,676.0,O60308,A,4.0,288.0,UNP,392,GEAVVEPEMSNADISDARRGGMLGEPEPLTEKALREASSAIDVLGE...,288.0,5LPH,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,,,,,,,,
3,1,21844156,.,G,A,rs764778166,Immunoglobulin_I-set|Immunoglobulin_V-set_doma...,T,T,N,T,N,D,D,T,4.1e-05,P,D,T,D,N,N,T,4e-05,.,N,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgg/Tgg,p.Arg2871Trp,c.8611C>T,HSPG2,NM_001291860.1,65,A,1,8611,>,Arg,Trp,2871,subst,.,1,21844156,rs764778166,0.0,1,6/20,6,1,1,0,NM_001291860.1,P98160,HSPG2,3SH4,1.5,PGBM_HUMAN,4197.0,4391.0,3SH4,1.0,195.0,P98160,A,1.0,195.0,UNP,4197,DAPGQYGAYFHDDGFLAFPGHVFSRSLPEVPETIELEVRTSTASGL...,197.0,3SH4,-3,positivecharge,aromatic,positivechargeTOaromatic,0,1,0TO1,0,,,,,,,,
4,1,29097885,.,G,A,rs750558736,"Band_4.1,_C-terminal\x3bFERM_adjacent_(FA)|PH_...",T,D,D,T,N,T,T,T,1.6e-05,D,T,D,D,N,N,T,1.2e-05,.,D,D,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtc/Atc,p.Val755Ile,c.2263G>A,EPB41,NM_001166005.1,17,A,1,2263,>,Val,Ile,755,subst,.,1,29097885,rs750558736,0.0,1,8/20,8,1,1,0,NM_001166005.1,P11171,"EPB41,E41P",3QIJ,1.8,EPB41_HUMAN,211.0,488.0,3QIJ,211.0,488.0,P11171,A,19.0,296.0,UNP,211,HCKVSLLDDTVYECVVEKHAKGQDLLKRVCEHLNLLEEDYFGLAIW...,281.0,3QIJ,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,,,,,,,,
5,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45
6,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19
7,1,47280813,.,C,T,rs777645890,.,T,T,D,T,N,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,D,T,T,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gaa/Aaa,p.Glu549Lys,c.1645G>A,STIL,NM_001048166.1,12,T,1,1645,>,Glu,Lys,549,subst,.,1,47280813,rs777645890,0.0,0,2/20,2,0,0,0,NM_001048166.1,Q15468,"STIL,SIL",5LHW,0.91,STIL_HUMAN,726.0,750.0,5LHW,726.0,750.0,Q15468,A,4.0,28.0,UNP,726,LTEQDRQLRLLQAQIQRLLEAQSLM,25.0,5LHW,1,negativecharge,positivecharge,negativechargeTOpositivecharge,0,1,0TO1,0,,,,,,,,
8,1,47280813,.,C,T,rs777645890,.,T,T,D,T,N,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,D,T,T,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gaa/Aaa,p.Glu549Lys,c.1645G>A,STIL,NM_001048166.1,12,T,1,1645,>,Glu,Lys,549,subst,.,1,47280813,rs777645890,0.0,0,2/20,2,0,0,0,NM_001048166.1,Q15468,"STIL,SIL",5LHZ,2.51,STIL_HUMAN,726.0,750.0,5LHZ,726.0,750.0,Q15468,D,4.0,28.0,UNP,726,LTEQDRQLRLLQAQIQRLLEAQSLM,25.0,5LHZ,1,negativecharge,positivecharge,negativechargeTOpositivecharge,0,1,0TO1,0,,,,,,,,
9,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32


In [None]:
#Removing the records with NaN values: a left outer join was applied, and not all wild-type PDBs are annotated in RING.
#Note that dropna() drops a row if any of its columns is NaN, not only the RING columns.
base_merge_node_RING.dropna(inplace=True)
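Since `dropna()` without arguments removes a row when any column is NaN, a row could be lost for reasons unrelated to the join. The sketch below, on hypothetical toy data, contrasts it with restricting the check to a RING key column via `subset`. This is an alternative shown for comparison, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy merged table: row 1 has no RING match, row 2 matched but has an
# unrelated missing value (hypothetical) in Resolution.
merged = pd.DataFrame({
    "PDB_id": ["5SYT", "5XOD", "2IQJ"],
    "Resolution": [2.0, 1.85, np.nan],
    "PDB_id_RING": ["5SYT", np.nan, "2IQJ"],
    "Degree_RING": [2.0, np.nan, 21.0],
})

# Drops every row that has a NaN anywhere.
dropped_any = merged.dropna()
# Drops only the rows without a RING match.
dropped_ring = merged.dropna(subset=["PDB_id_RING"])

print(len(dropped_any), len(dropped_ring))
```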

In [None]:
base_merge_node_RING.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 717 entries, 5 to 2828
Data columns (total 94 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         717 non-null    int64  
 1   POS                           717 non-null    int64  
 2   ID                            717 non-null    object 
 3   REF                           717 non-null    object 
 4   ALT                           717 non-null    object 
 5   avsnp150                      717 non-null    object 
 6   Interpro_domain               717 non-null    object 
 7   dbNSFP_DEOGEN2_pred           717 non-null    object 
 8   dbNSFP_MetaSVM_pred           717 non-null    object 
 9   dbNSFP_fathmmMKL_coding_pred  717 non-null    object 
 10  dbNSFP_PrimateAI_pred         717 non-null    object 
 11  dbNSFP_PROVEAN_pred           717 non-null    object 
 12  dbNSFP_MCAP_pred              717 non-null    object 
 13  dbNS

In [None]:
base_merge_node_RING.head(10)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING
5,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45
6,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19
9,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32
14,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34
15,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7
20,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,5J1A,1.86,CD1A_HUMAN,1.0,295.0,5J1A,-16.0,278.0,P06126,A,1.0,295.0,UNP,1,MLFLLLPLLAVLPGDGNADGLKEPLSFHVTWIASFYNHSWKQNLVS...,298.0,5J1A,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,5J1A,A:59:_:SER,A,59.0,Ser,,4.0,58.38
21,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,4X6F,1.91,CD1A_HUMAN,21.0,295.0,4X6F,4.0,278.0,P06126,A,1.0,275.0,UNP,21,LKEPLSFHVTWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSSTIVF...,278.0,4X6F,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,4X6F,A:59:_:SER,A,59.0,Ser,,3.0,63.97
22,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,4X6E,2.1,CD1A_HUMAN,21.0,295.0,4X6E,4.0,278.0,P06126,A,1.0,275.0,UNP,21,LKEPLSFHVTWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSSTIVF...,278.0,4X6E,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,4X6E,A:59:_:SER,A,59.0,Ser,,3.0,55.98
23,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,1ONQ,2.15,CD1A_HUMAN,18.0,294.0,1ONQ,1.0,277.0,P06126,A,1.0,277.0,UNP,18,ADGLKEPLSFHVIWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSST...,280.0,1ONQ,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,1ONQ,A:59:_:SER,A,59.0,Ser,,16.0,10.42
34,1,160880651,.,G,A,rs1022786885,"Fibrinogen,_alpha/beta/gamma_chain,_C-terminal...",T,T,D,T,D,T,D,T,0.0,P,D,T,T,D,N,T,0.0,.,D,T,T,P,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cct/Tct,p.Pro208Ser,c.622C>T,ITLN1,NM_017625.2,6,A,1,622,>,Pro,Ser,208,subst,.,1,160880651,rs1022786885,0.0,1,7/20,7,1,1,0,NM_017625.2,Q8WWA0,"ITLN1,INTL,ITLN,LFR,UNQ640/PRO1270",6USC,1.59,ITLN1_HUMAN,35.0,313.0,6USC,35.0,313.0,Q8WWA0,A,1.0,279.0,UNP,35,PSLPRSCKEIKDECPSAFDGLYFLRTENGVIYQTFCDMTSGGGGWT...,282.0,6USC,-1,nonpolar,polar,nonpolarTOpolar,0,0,0TO0,0,6USC,A:208:_:PRO,A,208.0,Pro,E,161.0,16.98


In [None]:
#Identify duplicate records in the data
dupes=base_merge_node_RING.duplicated()
sum(dupes)

0

##42.3 Generating an intermediate file with the *ACC* database with the attributes from the *RING* nodes file

In [None]:
base_merge_node_RING.to_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING.csv",sep='\t',index=False)

#43 - Generating the attributes from the RING edges file

The edges files of all **PDBs** submitted to **RING** were merged into a single file and processed in the **TrataRINs** notebook, located in the **drive/My Drive/ProcessaNovaBase/TrataArqsRING** folder. The database resulting from this processing is **edgesDB_proc**.

The attributes from the RING edges file are:

- PDB_id_RING
- Node_RING
- Node_pos_RING
- Node_chain_RING
- Inter_Lig_tot
- Inter_Res_tot
- Inter_IAC_Lig_tot
- Inter_VDW_Lig_tot
- Inter_HBOND_Lig_tot
- Inter_PIPISTACK_Lig_tot
- Inter_IONIC_Lig_tot
- Inter_SSBOND_Lig_tot
- Inter_PICATION_Lig_tot
- Inter_IAC_Res_tot
- Inter_VDW_Res_tot
- Inter_HBOND_Res_tot
- Inter_PIPISTACK_Res_tot
- Inter_IONIC_Res_tot
- Inter_SSBOND_Res_tot
- Inter_PICATION_Res_tot
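
As a rough illustration of what per-residue totals of this kind represent, the sketch below counts interactions per node and interaction type from a small, made-up edge list. The column names `NodeId1`, `NodeId2` and `Interaction`, the helper names and the toy rows are all assumptions for the example; the real aggregation is done in the **TrataRINs** notebook and may differ.

```python
import pandas as pd

# Toy edge list (hypothetical rows, not real RING output).
toy_edges = pd.DataFrame({
    "NodeId1": ["A:15:_:ALA", "A:15:_:ALA", "A:16:_:ALA", "A:15:_:ALA"],
    "NodeId2": ["A:19:_:GLU", "A:18:_:LEU", "A:20:_:SER", "A:12:_:TYR"],
    "Interaction": ["HBOND:MC_MC", "VDW:SC_SC", "HBOND:MC_MC", "HBOND:MC_SC"],
})

# Keep only the interaction type (the text before the colon).
toy_edges["Type"] = toy_edges["Interaction"].str.split(":").str[0]

# An edge involves both of its end nodes, so stack the two ends into one column.
ends = pd.concat(
    [
        toy_edges[["NodeId1", "Type"]].rename(columns={"NodeId1": "Node"}),
        toy_edges[["NodeId2", "Type"]].rename(columns={"NodeId2": "Node"}),
    ],
    ignore_index=True,
)

# Count interactions per node and type (one column per type), then add a total.
toy_counts = ends.groupby(["Node", "Type"]).size().unstack(fill_value=0)
toy_counts["Total"] = toy_counts.sum(axis=1)
print(toy_counts)
```

The split between ligand (`_Lig_`) and residue (`_Res_`) totals is not reproduced here, since it depends on how ligand nodes are identified in the processed file.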

In [None]:
#Increasing the display capacity of columns and rows
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',90000)
pd.set_option('display.width', 7000)

In [None]:
#Reading the ACC database with the RING node attributes (intermediate file generated in section 42.3)

import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717 entries, 0 to 716
Data columns (total 94 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         717 non-null    int64  
 1   POS                           717 non-null    int64  
 2   ID                            717 non-null    object 
 3   REF                           717 non-null    object 
 4   ALT                           717 non-null    object 
 5   avsnp150                      717 non-null    object 
 6   Interpro_domain               717 non-null    object 
 7   dbNSFP_DEOGEN2_pred           717 non-null    object 
 8   dbNSFP_MetaSVM_pred           717 non-null    object 
 9   dbNSFP_fathmmMKL_coding_pred  717 non-null    object 
 10  dbNSFP_PrimateAI_pred         717 non-null    object 
 11  dbNSFP_PROVEAN_pred           717 non-null    object 
 12  dbNSFP_MCAP_pred              717 non-null    object 
 13  dbNSF

In [None]:
base_ACC.head(15)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING
0,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45
1,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19
2,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32
3,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7
5,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,5J1A,1.86,CD1A_HUMAN,1.0,295.0,5J1A,-16.0,278.0,P06126,A,1.0,295.0,UNP,1,MLFLLLPLLAVLPGDGNADGLKEPLSFHVTWIASFYNHSWKQNLVS...,298.0,5J1A,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,5J1A,A:59:_:SER,A,59.0,Ser,,4.0,58.38
6,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,4X6F,1.91,CD1A_HUMAN,21.0,295.0,4X6F,4.0,278.0,P06126,A,1.0,275.0,UNP,21,LKEPLSFHVTWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSSTIVF...,278.0,4X6F,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,4X6F,A:59:_:SER,A,59.0,Ser,,3.0,63.97
7,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,4X6E,2.1,CD1A_HUMAN,21.0,295.0,4X6E,4.0,278.0,P06126,A,1.0,275.0,UNP,21,LKEPLSFHVTWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSSTIVF...,278.0,4X6E,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,4X6E,A:59:_:SER,A,59.0,Ser,,3.0,55.98
8,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,1ONQ,2.15,CD1A_HUMAN,18.0,294.0,1ONQ,1.0,277.0,P06126,A,1.0,277.0,UNP,18,ADGLKEPLSFHVIWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSST...,280.0,1ONQ,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,1ONQ,A:59:_:SER,A,59.0,Ser,,16.0,10.42
9,1,160880651,.,G,A,rs1022786885,"Fibrinogen,_alpha/beta/gamma_chain,_C-terminal...",T,T,D,T,D,T,D,T,0.0,P,D,T,T,D,N,T,0.0,.,D,T,T,P,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cct/Tct,p.Pro208Ser,c.622C>T,ITLN1,NM_017625.2,6,A,1,622,>,Pro,Ser,208,subst,.,1,160880651,rs1022786885,0.0,1,7/20,7,1,1,0,NM_017625.2,Q8WWA0,"ITLN1,INTL,ITLN,LFR,UNQ640/PRO1270",6USC,1.59,ITLN1_HUMAN,35.0,313.0,6USC,35.0,313.0,Q8WWA0,A,1.0,279.0,UNP,35,PSLPRSCKEIKDECPSAFDGLYFLRTENGVIYQTFCDMTSGGGGWT...,282.0,6USC,-1,nonpolar,polar,nonpolarTOpolar,0,0,0TO0,0,6USC,A:208:_:PRO,A,208.0,Pro,E,161.0,16.98


##43.1 Reading the RING edges database


The **edgesDB_proc.csv** file was generated in the **TrataRINs** notebook, which is located in the **TrataArqsRING** folder of this drive.

In [None]:
import pandas as pd
df_edges_RING = pd.read_csv("drive/My Drive/ProcessaNovaBase/TrataArqsRING/edgesDB_proc.csv",sep='\t')

In [None]:
df_edges_RING.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12871779 entries, 0 to 12871778
Data columns (total 20 columns):
 #   Column                   Dtype 
---  ------                   ----- 
 0   PDB_id_RING              object
 1   Node_RING                object
 2   Node_pos_RING            int64 
 3   Node_chain_RING          object
 4   Inter_Lig_tot            int64 
 5   Inter_Res_tot            int64 
 6   Inter_IAC_Lig_tot        int64 
 7   Inter_VDW_Lig_tot        int64 
 8   Inter_HBOND_Lig_tot      int64 
 9   Inter_PIPISTACK_Lig_tot  int64 
 10  Inter_IONIC_Lig_tot      int64 
 11  Inter_SSBOND_Lig_tot     int64 
 12  Inter_PICATION_Lig_tot   int64 
 13  Inter_IAC_Res_tot        int64 
 14  Inter_VDW_Res_tot        int64 
 15  Inter_HBOND_Res_tot      int64 
 16  Inter_PIPISTACK_Res_tot  int64 
 17  Inter_IONIC_Res_tot      int64 
 18  Inter_SSBOND_Res_tot     int64 
 19  Inter_PICATION_Res_tot   int64 
dtypes: int64(17), object(3)
memory usage: 1.9+ GB
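
With almost 13 million rows, this table takes close to 2 GB when every count is stored as `int64`. One possible way to reduce the footprint is to pass narrower dtypes to `pd.read_csv`. The sketch below shows the idea on a tiny in-memory stand-in for the file; `toy_csv`, `toy_dtypes` and `toy_edges_small` are illustrative names, only a few of the 20 columns are included, and `int32` is an assumption about the value ranges that should be checked against the real data.

```python
import io
import pandas as pd

# Tiny tab-separated stand-in for edgesDB_proc.csv (three rows, six columns).
toy_csv = io.StringIO(
    "PDB_id_RING\tNode_RING\tNode_pos_RING\tNode_chain_RING\tInter_Lig_tot\tInter_Res_tot\n"
    "10GS\tAla\t15\tA\t0\t5\n"
    "10GS\tAla\t15\tB\t0\t6\n"
    "10GS\tAla\t22\tA\t51\t5\n"
)

# Repeated strings as categories, counts and positions as 32-bit integers.
toy_dtypes = {
    "PDB_id_RING": "category",
    "Node_RING": "category",
    "Node_chain_RING": "category",
    "Node_pos_RING": "int32",
    "Inter_Lig_tot": "int32",
    "Inter_Res_tot": "int32",
}

toy_edges_small = pd.read_csv(toy_csv, sep="\t", dtype=toy_dtypes)
toy_edges_small.info(memory_usage="deep")
```

If categorical key columns are later merged with plain string columns, converting them back with `astype(str)` first keeps the key dtypes consistent.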


In [None]:
df_edges_RING.head()

Unnamed: 0,PDB_id_RING,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot
0,10GS,Ala,15,A,0,5,0,0,0,0,0,0,0,0,2,3,0,0,0,0
1,10GS,Ala,15,B,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0
2,10GS,Ala,16,A,0,3,0,0,0,0,0,0,0,0,1,2,0,0,0,0
3,10GS,Ala,16,B,0,3,0,0,0,0,0,0,0,0,1,2,0,0,0,0
4,10GS,Ala,22,A,51,5,51,0,0,0,0,0,0,0,3,2,0,0,0,0


##43.2 Joining the ACC table (through the fields PDB_id, aminBefore, pdbx_strand_id and poschangeProt) with the RING edges table (through the fields PDB_id_RING, Node_RING, Node_chain_RING and Node_pos_RING), to identify the mutations whose wild-type PDB is annotated in RING

In [None]:
#Attributes that will be the key in the join with RING
def categories_column(df):
    for col in ['PDB_id', 'aminBefore', 'pdbx_strand_id', 'poschangeProt']:
        mydic= df[col].value_counts().to_dict()
        print(col, mydic)
        print('\n')

categories_column(base_ACC)

PDB_id {'3Q05': 24, '2ADY': 7, '1TSR': 7, '3KZ8': 7, '2AC0': 7, '3IGK': 7, '3KMD': 7, '1KZY': 7, '2PCX': 7, '4XR8': 7, '2YBG': 7, '5MF7': 7, '6FJ5': 7, '5MCW': 7, '4QO1': 7, '2ATA': 7, '5MCT': 7, '3IGL': 7, '2XWR': 7, '5MCV': 7, '2OCJ': 7, '5MG7': 7, '4HJE': 7, '1YCS': 7, '5BUA': 7, '2AHI': 7, '3D0A': 7, '1TUP': 7, '4IBW': 6, '4IBV': 6, '4IBU': 6, '6FOF': 6, '6UD7': 4, '2R7G': 4, '4HAN': 4, '3VKL': 4, '3POM': 4, '6M92': 3, '6M90': 3, '6M91': 3, '6WNX': 3, '6M93': 3, '1GUX': 2, '5FQD': 2, '4KX8': 2, '6Y7F': 2, '1YQR': 1, '2QS9': 1, '4F9B': 1, '6J4O': 1, '2XG3': 1, '4FAL': 1, '3KUQ': 1, '6BOQ': 1, '3A1J': 1, '4FDI': 1, '3ZNO': 1, '6DJD': 1, '6RZI': 1, '3R6I': 1, '4LBN': 1, '2F38': 1, '1KJR': 1, '3AP6': 1, '5VVC': 1, '1M9R': 1, '6QLT': 1, '6DZ2': 1, '3NOS': 1, '6L4B': 1, '5J1A': 1, '4FA3': 1, '3ZNN': 1, '6DJJ': 1, '6I74': 1, '6RTW': 1, '2NOZ': 1, '6QLU': 1, '4WMY': 1, '1M9K': 1, '6QLO': 1, '6W13': 1, '6BOV': 1, '4ITS': 1, '5E8A': 1, '6RHL': 1, '6BOT': 1, '6Q17': 1, '1LWV': 1, '3AP9': 1, '

In [None]:
#Attributes that will be the key in the join with the ACC database
def categories_column(df):
    for col in ['PDB_id_RING', 'Node_RING', 'Node_chain_RING', 'Node_pos_RING']:
        mydic= df[col].value_counts().to_dict()
        print(col, mydic)
        print('\n')

categories_column(df_edges_RING)

PDB_id_RING {'1QO5': 6046, '5L5U': 6022, '5LE5': 6014, '3LK4': 6005, '5L5F': 6000, '5LF7': 5993, '5L5H': 5987, '5LF4': 5987, '5L5S': 5984, '5L5A': 5984, '5LF0': 5982, '5LF3': 5982, '5LF6': 5971, '5L5O': 5967, '5LEY': 5966, '5LF1': 5966, '6HTR': 5957, '5LEZ': 5900, '5LEX': 5886, '4R3O': 5799, '5DOU': 5521, '2Q3E': 5401, '4DVQ': 5317, '5K9Q': 5163, '4XGZ': 4786, '2F5Z': 4751, '1ZY8': 4558, '3B2U': 4478, '4DL1': 4387, '4AY1': 4175, '4ZUL': 3964, '2A3W': 3944, '4ZUK': 3943, '2J6L': 3924, '3N80': 3904, '1YDE': 3887, '3SZ9': 3885, '1O02': 3877, '5W08': 3871, '5L13': 3871, '1O01': 3869, '3INJ': 3865, '1CW3': 3860, '1O00': 3851, '1NZZ': 3848, '4KWG': 3841, '1NZX': 3837, '2VLE': 3836, '1O05': 3833, '5L2O': 3832, '4ZVW': 3826, '1N4S': 3819, '3PVN': 3813, '4KWF': 3806, '1N4Q': 3803, '6VR6': 3793, '3PNW': 3765, '4CQM': 3759, '6Z86': 3754, '5Z2C': 3690, '6I34': 3690, '6I35': 3689, '1ZMD': 3664, '1ZMC': 3663, '5LHD': 3623, '6QAK': 3614, '6X5T': 3602, '4BL5': 3598, '2QG4': 3594, '5NHG': 3589, '3SOM':
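
The full value-count dictionaries are hard to compare by eye. A more compact check, sketched below on toy frames, reports for each pair of key columns whether the dtypes agree and how many left-side values also occur on the right side. `toy_acc`, `toy_ring`, `key_pairs` and `key_report` are illustrative names, and the PDB code `XXXX` is a placeholder for an entry that is absent from RING.

```python
import pandas as pd

# Toy stand-ins for base_ACC and df_edges_RING (key columns only).
toy_acc = pd.DataFrame({
    "PDB_id": ["3ISU", "5SYT", "XXXX"],
    "aminBefore": ["Ala", "Val", "Gly"],
    "pdbx_strand_id": ["A", "A", "A"],
    "poschangeProt": [1540, 447, 10],
})
toy_ring = pd.DataFrame({
    "PDB_id_RING": ["3ISU", "5SYT"],
    "Node_RING": ["Ala", "Val"],
    "Node_chain_RING": ["A", "A"],
    "Node_pos_RING": [1540, 447],
})

# Left key column paired with its right-side counterpart.
key_pairs = [
    ("PDB_id", "PDB_id_RING"),
    ("aminBefore", "Node_RING"),
    ("pdbx_strand_id", "Node_chain_RING"),
    ("poschangeProt", "Node_pos_RING"),
]

rows = []
for left, right in key_pairs:
    rows.append({
        "left": left,
        "right": right,
        "same_dtype": toy_acc[left].dtype == toy_ring[right].dtype,
        "left_values_found": int(toy_acc[left].isin(toy_ring[right]).sum()),
    })

key_report = pd.DataFrame(rows)
print(key_report)
```

A dtype mismatch between a key pair (for example integer positions on one side and strings on the other) is worth resolving before the merge.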

In [None]:
base_ACC.query("PDB_id == '3ISU' and aminBefore == 'Ala' and pdbx_strand_id == 'A' and poschangeProt == 1540")

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7


In [None]:
df_edges_RING.query("PDB_id_RING == '3ISU' and Node_RING == 'Ala' and Node_chain_RING == 'A' and Node_pos_RING == 1540")

Unnamed: 0,PDB_id_RING,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot
3565415,3ISU,Ala,1540,A,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0


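This join attaches to each mutation in the **ACC** database the edge-level information of the interaction network of its wild-type PDB in RING. Note that the merge uses `how='left'`, so every **ACC** row is kept: a mutation without a matching record in the RING edges table is not discarded, but receives missing values in the edge columns. Selecting only the mutations annotated in RING would require an inner join or a later filter on those columns.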
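
One way to see how many mutations actually found a record in the RING edges table is the `indicator` option of `pd.merge`. The following is a minimal sketch on toy frames, not the notebook's own procedure; `toy_mut`, `toy_edge`, `merged` and `matched_only` are illustrative names, and the PDB code `XXXX` is a placeholder for an entry missing from RING.

```python
import pandas as pd

# Toy mutation table: the last row has no counterpart in the edge table.
toy_mut = pd.DataFrame({
    "PDB_id": ["3ISU", "5SYT", "XXXX"],
    "aminBefore": ["Ala", "Val", "Gly"],
    "pdbx_strand_id": ["A", "A", "A"],
    "poschangeProt": [1540, 447, 10],
})
# Toy edge table with one attribute column.
toy_edge = pd.DataFrame({
    "PDB_id_RING": ["3ISU", "5SYT"],
    "Node_RING": ["Ala", "Val"],
    "Node_chain_RING": ["A", "A"],
    "Node_pos_RING": [1540, 447],
    "Inter_Res_tot": [6, 2],
})

merged = pd.merge(
    toy_mut,
    toy_edge,
    left_on=["PDB_id", "aminBefore", "pdbx_strand_id", "poschangeProt"],
    right_on=["PDB_id_RING", "Node_RING", "Node_chain_RING", "Node_pos_RING"],
    how="left",
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
    validate="many_to_one",   # raises if a key combination repeats on the right
)

# Rows that found no edge record, and the subset that did.
n_unmatched = int((merged["_merge"] == "left_only").sum())
matched_only = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(n_unmatched, len(matched_only))
```

Keeping the left join and inspecting `_merge` makes the number of unmatched mutations visible before any row is dropped.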

In [None]:
import pandas as pd
base_merge_edge_RING = pd.merge(base_ACC, df_edges_RING, left_on=['PDB_id','aminBefore','pdbx_strand_id','poschangeProt'], right_on=['PDB_id_RING','Node_RING','Node_chain_RING','Node_pos_RING'], how='left')


In [None]:
base_merge_edge_RING.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 717 entries, 0 to 716
Data columns (total 114 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         717 non-null    int64  
 1   POS                           717 non-null    int64  
 2   ID                            717 non-null    object 
 3   REF                           717 non-null    object 
 4   ALT                           717 non-null    object 
 5   avsnp150                      717 non-null    object 
 6   Interpro_domain               717 non-null    object 
 7   dbNSFP_DEOGEN2_pred           717 non-null    object 
 8   dbNSFP_MetaSVM_pred           717 non-null    object 
 9   dbNSFP_fathmmMKL_coding_pred  717 non-null    object 
 10  dbNSFP_PrimateAI_pred         717 non-null    object 
 11  dbNSFP_PROVEAN_pred           717 non-null    object 
 12  dbNSFP_MCAP_pred              717 non-null    object 
 13  dbNS

In [None]:
base_merge_edge_RING.head(20)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING_x,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING,PDB_id_RING_y,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot
0,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45,5SYT,Val,447,A,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0
1,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19,2IQJ,Pro,36,A,16,5,16,0,0,0,0,0,0,0,5,0,0,0,0,0
2,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32,3OJY,Arg,242,B,0,4,0,0,0,0,0,0,0,0,2,2,0,0,0,0
3,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34,6FYK,Arg,138,A,0,4,0,0,0,0,0,0,0,0,2,1,0,1,0,0
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7,3ISU,Ala,1540,A,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0
5,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,5J1A,1.86,CD1A_HUMAN,1.0,295.0,5J1A,-16.0,278.0,P06126,A,1.0,295.0,UNP,1,MLFLLLPLLAVLPGDGNADGLKEPLSFHVTWIASFYNHSWKQNLVS...,298.0,5J1A,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,5J1A,A:59:_:SER,A,59.0,Ser,,4.0,58.38,5J1A,Ser,59,A,0,4,0,0,0,0,0,0,0,0,1,3,0,0,0,0
6,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,4X6F,1.91,CD1A_HUMAN,21.0,295.0,4X6F,4.0,278.0,P06126,A,1.0,275.0,UNP,21,LKEPLSFHVTWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSSTIVF...,278.0,4X6F,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,4X6F,A:59:_:SER,A,59.0,Ser,,3.0,63.97,4X6F,Ser,59,A,0,3,0,0,0,0,0,0,0,0,0,3,0,0,0,0
7,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,4X6E,2.1,CD1A_HUMAN,21.0,295.0,4X6E,4.0,278.0,P06126,A,1.0,275.0,UNP,21,LKEPLSFHVTWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSSTIVF...,278.0,4X6E,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,4X6E,A:59:_:SER,A,59.0,Ser,,3.0,55.98,4X6E,Ser,59,A,0,3,0,0,0,0,0,0,0,0,0,3,0,0,0,0
8,1,158255202,.,C,A,rs559573053,MHC_class_I-like_antigen_recognition-like|MHC_...,T,T,N,T,D,T,T,T,4.1e-05,B,T,T,T,N,N,T,2.8e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser59Arg,c.177C>A,CD1A,NM_001763.2,2,A,3,177,>,Ser,Arg,59,subst,.,1,158255202,rs559573053,0.0,0,1/20,1,0,0,0,NM_001763.2,P06126,CD1A,1ONQ,2.15,CD1A_HUMAN,18.0,294.0,1ONQ,1.0,277.0,P06126,A,1.0,277.0,UNP,18,ADGLKEPLSFHVIWIASFYNHSWKQNLVSGWLSDLQTHTWDSNSST...,280.0,1ONQ,-1,polar,positivecharge,polarTOpositivecharge,0,0,0TO0,1,1ONQ,A:59:_:SER,A,59.0,Ser,,16.0,10.42,1ONQ,Ser,59,A,12,4,12,0,0,0,0,0,0,0,1,3,0,0,0,0
9,1,160880651,.,G,A,rs1022786885,"Fibrinogen,_alpha/beta/gamma_chain,_C-terminal...",T,T,D,T,D,T,D,T,0.0,P,D,T,T,D,N,T,0.0,.,D,T,T,P,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cct/Tct,p.Pro208Ser,c.622C>T,ITLN1,NM_017625.2,6,A,1,622,>,Pro,Ser,208,subst,.,1,160880651,rs1022786885,0.0,1,7/20,7,1,1,0,NM_017625.2,Q8WWA0,"ITLN1,INTL,ITLN,LFR,UNQ640/PRO1270",6USC,1.59,ITLN1_HUMAN,35.0,313.0,6USC,35.0,313.0,Q8WWA0,A,1.0,279.0,UNP,35,PSLPRSCKEIKDECPSAFDGLYFLRTENGVIYQTFCDMTSGGGGWT...,282.0,6USC,-1,nonpolar,polar,nonpolarTOpolar,0,0,0TO0,0,6USC,A:208:_:PRO,A,208.0,Pro,E,161.0,16.98,6USC,Pro,208,A,160,1,160,0,0,0,0,0,0,0,1,0,0,0,0,0


In [None]:
#Identify duplicate records in the data
dupes=base_merge_edge_RING.duplicated()
sum(dupes)

0

##43.3 Generating an intermediate file with the **ACC** database with the attributes from the **RING** edges file

In [None]:
base_merge_edge_RING.to_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING_edges_RING.csv",sep='\t',index=False)

#44 - Handling Duplicates

Because a UniProt entry can be associated with more than one PDB, the join of the mutation database with the pdbtosp file (extended with the data from the XML) generated several records for a single record of the mutation database. These duplicates must be handled so that only one record per mutation remains.

The code below generates a new dataframe (**df_dup**) with every row that is duplicated in the fields specified in the command.
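
The sketch below illustrates the two steps on a toy table: `duplicated(..., keep=False)` flags every member of a duplicated group, and `drop_duplicates` then keeps a single row per mutation. The rule used here (keep the structure with the lowest `Resolution` value) is only an assumption for the example, not necessarily the criterion applied in this notebook; `toy`, `mut_keys`, `toy_dup` and `toy_one` are illustrative names.

```python
import pandas as pd

# Toy table: one mutation mapped to a single PDB, another mapped to four PDBs.
toy = pd.DataFrame({
    "CHROM": [1, 1, 1, 1, 1],
    "POS": [40292580, 158255202, 158255202, 158255202, 158255202],
    "REF": ["G", "C", "C", "C", "C"],
    "ALT": ["A", "A", "A", "A", "A"],
    "PDB_id": ["5SYT", "5J1A", "4X6F", "4X6E", "1ONQ"],
    "Resolution": [2.0, 1.86, 1.91, 2.1, 2.15],
})

# Fields that identify a mutation (a short stand-in for the full field list).
mut_keys = ["CHROM", "POS", "REF", "ALT"]

# keep=False marks all rows of a duplicated group, not only the repeats.
toy_dup = toy[toy.duplicated(mut_keys, keep=False)]

# Assumed rule: keep the row with the lowest Resolution value per mutation.
toy_one = (
    toy.sort_values("Resolution")
       .drop_duplicates(subset=mut_keys, keep="first")
       .sort_index()
)
print(len(toy_dup), list(toy_one["PDB_id"]))
```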

In [None]:
#Increasing the display capacity of columns and rows
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',90000)
pd.set_option('display.width', 7000)

In [None]:
#Reading the ACC database with the RING node and edge attributes (intermediate file generated in section 43.3)
import pandas as pd
base_merge_pdb = pd.read_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING_edges_RING.csv", delimiter='\t')

In [None]:
base_merge_pdb.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717 entries, 0 to 716
Data columns (total 114 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         717 non-null    int64  
 1   POS                           717 non-null    int64  
 2   ID                            717 non-null    object 
 3   REF                           717 non-null    object 
 4   ALT                           717 non-null    object 
 5   avsnp150                      717 non-null    object 
 6   Interpro_domain               717 non-null    object 
 7   dbNSFP_DEOGEN2_pred           717 non-null    object 
 8   dbNSFP_MetaSVM_pred           717 non-null    object 
 9   dbNSFP_fathmmMKL_coding_pred  717 non-null    object 
 10  dbNSFP_PrimateAI_pred         717 non-null    object 
 11  dbNSFP_PROVEAN_pred           717 non-null    object 
 12  dbNSFP_MCAP_pred              717 non-null    object 
 13  dbNS

In [None]:
df_dup = base_merge_pdb[base_merge_pdb.duplicated(['CHROM','POS','ID', 'REF', 'ALT', 'avsnp150', 'Interpro_domain',
                                                   'dbNSFP_DEOGEN2_pred', 'dbNSFP_MetaSVM_pred', 'dbNSFP_fathmmMKL_coding_pred',
                                                   'dbNSFP_PrimateAI_pred', 'dbNSFP_PROVEAN_pred', 'dbNSFP_MCAP_pred',
                                                   'dbNSFP_ClinPred_pred', 'dbNSFP_BayesDel_addAF_pred', 'dbNSFP_ExAC_AF',
                                                   'dbNSFP_Polyphen2_HVAR_pred', 'dbNSFP_SIFT_pred', 'dbNSFP_FATHMM_pred',
                                                   'dbNSFP_SIFT4G_pred', 'dbNSFP_LRT_pred', 'dbNSFP_fathmmXF_coding_pred',
                                                   'dbNSFP_BayesDel_noAF_pred', 'dbNSFP_gnomAD_exomes_AF', 'dbNSFP_Aloft_pred',
                                                   'dbNSFP_MutationTaster_pred', 'dbNSFP_MetaLR_pred', 'dbNSFP_LISTS2_pred',
                                                   'dbNSFP_Polyphen2_HDIV_pred', 'dbNSFP_MutationAssessor_pred',
                                                   'VariantEffect_EFF', 'Risco_Mut_EFF', 'Tipo_Mut_EFF', 'Point_Mutation_EFF',
                                                   'changeProt_EFF', 'changecDNA_EFF', 'Gene_EFF', 'RefSeq_EFF', 'Exon_EFF',
                                                   'ALT_EFF', 'Pos_Point_Mutation_EFF', 'poschangecDNA_EFF', 'typechangecDNA_EFF',
                                                   'aminBefore', 'aminAfter', 'poschangeProt', 'typechangeProt', 'pos_terminalchangeProt',
                                                   'Chrom', 'Pos', 'SNP_ID_COMMON', 'COMMON', 'PolyPhen2_Dam_pred', 'Ndamage',
                                                   'NdamageCalc', 'Deleteria', 'Deleteria5', 'Deleteria10', 'transcript_NCBI_id',
                                                   'Uniprot_id','Genes_Uniprot'], keep=False)]


In [None]:
df_dup.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 670 entries, 5 to 715
Data columns (total 114 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         670 non-null    int64  
 1   POS                           670 non-null    int64  
 2   ID                            670 non-null    object 
 3   REF                           670 non-null    object 
 4   ALT                           670 non-null    object 
 5   avsnp150                      670 non-null    object 
 6   Interpro_domain               670 non-null    object 
 7   dbNSFP_DEOGEN2_pred           670 non-null    object 
 8   dbNSFP_MetaSVM_pred           670 non-null    object 
 9   dbNSFP_fathmmMKL_coding_pred  670 non-null    object 
 10  dbNSFP_PrimateAI_pred         670 non-null    object 
 11  dbNSFP_PROVEAN_pred           670 non-null    object 
 12  dbNSFP_MCAP_pred              670 non-null    object 
 13  dbNS

In [None]:
consulta = df_dup.head(670)

In [None]:
consulta[['Uniprot_id','PDB_id', 'Resolution','poschangeProt', 'pdbx_auth_seq_align_beg', 'pdbx_auth_seq_align_end', 'seq_align_beg','seq_align_end']]

Unnamed: 0,Uniprot_id,PDB_id,Resolution,poschangeProt,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,seq_align_beg,seq_align_end
5,P06126,5J1A,1.86,59,-16.0,278.0,1.0,295.0
6,P06126,4X6F,1.91,59,4.0,278.0,1.0,275.0
7,P06126,4X6E,2.1,59,4.0,278.0,1.0,275.0
8,P06126,1ONQ,2.15,59,1.0,277.0,1.0,277.0
9,Q8WWA0,6USC,1.59,208,35.0,313.0,1.0,279.0
10,Q8WWA0,4WMY,1.6,208,29.0,313.0,22.0,306.0
11,Q8WWA0,4WMQ,1.8,208,29.0,313.0,22.0,306.0
12,Q86V25,6J4P,1.6,81,46.0,296.0,1.0,251.0
13,Q86V25,6QBY,2.09,81,40.0,295.0,2.0,257.0
14,Q86V25,6J4V,2.1,81,46.0,296.0,1.0,251.0


Next, this duplicated subset is removed from the **base_merge_pdb** database, generating a database (**new_base_merge_pdb**) that holds only the records whose Uniprot is associated with a single PDB.
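
The split can be sketched on toy data (hypothetical identifiers). Dropping the index labels of the duplicate subset leaves the complementary subset, so the two parts are disjoint and their sizes add up to the original, just as 670 + 47 = 717 in the ACC outputs shown in this section:

```python
import pandas as pd

# Toy data: hypothetical identifiers.
toy = pd.DataFrame({
    "Uniprot_id": ["P1", "P1", "P2", "P3", "P3", "P4"],
    "PDB_id":     ["AAA1", "AAA2", "BBB1", "CCC1", "CCC2", "DDD1"],
})

# Rows whose Uniprot appears more than once.
toy_dup = toy[toy.duplicated(["Uniprot_id"], keep=False)]
# Everything else: drop() removes rows by index label.
toy_single = toy.drop(toy_dup.index)

# Sanity checks: the two subsets partition the original frame.
assert len(toy_dup) + len(toy_single) == len(toy)
assert toy_dup.index.intersection(toy_single.index).empty
print(toy_single)
```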

In [None]:
new_base_merge_pdb = base_merge_pdb.drop(df_dup.index)

In [None]:
new_base_merge_pdb.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 47 entries, 0 to 716
Data columns (total 114 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         47 non-null     int64  
 1   POS                           47 non-null     int64  
 2   ID                            47 non-null     object 
 3   REF                           47 non-null     object 
 4   ALT                           47 non-null     object 
 5   avsnp150                      47 non-null     object 
 6   Interpro_domain               47 non-null     object 
 7   dbNSFP_DEOGEN2_pred           47 non-null     object 
 8   dbNSFP_MetaSVM_pred           47 non-null     object 
 9   dbNSFP_fathmmMKL_coding_pred  47 non-null     object 
 10  dbNSFP_PrimateAI_pred         47 non-null     object 
 11  dbNSFP_PROVEAN_pred           47 non-null     object 
 12  dbNSFP_MCAP_pred              47 non-null     object 
 13  dbNSF

In [None]:
new_base_merge_pdb

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING_x,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING,PDB_id_RING_y,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot
0,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45,5SYT,Val,447,A,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0
1,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19,2IQJ,Pro,36,A,16,5,16,0,0,0,0,0,0,0,5,0,0,0,0,0
2,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32,3OJY,Arg,242,B,0,4,0,0,0,0,0,0,0,0,2,2,0,0,0,0
3,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34,6FYK,Arg,138,A,0,4,0,0,0,0,0,0,0,0,2,1,0,1,0,0
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7,3ISU,Ala,1540,A,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0
35,2,73830885,.,C,T,rs767316172,.,T,T,D,T,D,T,T,T,3.3e-05,B,D,T,D,D,D,T,5.2e-05,.,D,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro10Leu,c.29C>T,STAMBP,NM_006463.4,3,T,2,29,>,Pro,Leu,10,subst,.,2,73830885,rs767316172,0.0,0,7/20,7,1,1,0,NM_006463.4,O95630,"STAMBP,AMSH",2XZE,1.75,STABP_HUMAN,1.0,146.0,2XZE,1.0,146.0,O95630,A,1.0,146.0,UNP,-,-,0.0,2XZE,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2XZE,A:10:_:PRO,A,10.0,Pro,,2.0,31.63,2XZE,Pro,10,A,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0
36,2,74887977,.,G,T,.,"Hexokinase,_C-terminal",D,D,D,D,D,D,D,D,0.0,D,D,D,D,D,D,D,8e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gGa/gTa,p.Gly765Val,c.2294G>T,HK2,NM_000189.4,16,T,2,2294,>,Gly,Val,765,subst,.,2,74887977,rs1264600281,0.0,1,18/20,18,1,1,1,NM_000189.4,P52789,HK2,2NZT,2.45,HXK2_HUMAN,17.0,916.0,2NZT,17.0,916.0,P52789,A,3.0,902.0,UNP,17,DQVQKVDQYLYHMRLSDETLLEISKRFRKEMEKGLGATTHPTAAVK...,911.0,2NZT,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,1,2NZT,A:765:_:GLY,A,765.0,Gly,H,1.0,54.42,2NZT,Gly,765,A,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0
37,2,96296982,.,T,C,.,DEAD/DEAH_box_helicase_domain|Helicase_superfa...,T,T,D,T,D,T,T,T,0.0,B,T,T,T,N,N,T,4e-06,.,D,T,D,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tAc/tGc,p.Tyr489Cys,c.1466A>G,SNRNP200,NM_014014.4,12,C,2,1466,>,Tyr,Cys,489,subst,.,2,96296982,rs1259944697,0.0,0,4/20,4,1,0,0,NM_014014.4,O75643,"SNRNP200,ASCC3L1,HELIC2,KIAA0788",6S8Q,2.39,U520_HUMAN,394.0,2136.0,6S8Q,394.0,2136.0,O75643,B,5.0,1747.0,UNP,394,MDLDQGGEALAPRQVLDLEDLVFTQGSHFMANKRCQLPDGSFRRQR...,1764.0,6S8Q,-2,aromatic,polar,aromaticTOpolar,0,0,0TO0,0,6S8Q,B:489:_:TYR,B,489.0,Tyr,H,20.0,35.19,6S8Q,Tyr,489,B,0,20,0,0,0,0,0,0,0,0,17,3,0,0,0,0
38,1,114713909,.,G,T,rs121913254,P-loop_containing_nucleoside_triphosphate_hydr...,D,D,D,D,D,D,D,D,0.0,P,D,D,D,U,D,D,0.0,.,D,D,D,P,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Caa/Aaa,p.Gln61Lys,c.181C>A,NRAS,NM_002524.4,3,T,1,181,>,Gln,Lys,61,subst,.,1,114713909,rs121913254,0.0,1,17/20,17,1,1,1,NM_002524.4,P01111,"NRAS,HRAS1",5UHV,1.67,RASN_HUMAN,1.0,166.0,5UHV,1.0,166.0,P01111,A,1.0,166.0,UNP,1,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...,168.0,5UHV,1,polar,positivecharge,polarTOpositivecharge,0,1,0TO1,1,5UHV,A:61:_:GLN,A,61.0,Gln,S,22.0,41.02,5UHV,Gln,61,A,20,2,20,0,0,0,0,0,0,0,1,1,0,0,0,0
39,1,211673529,.,G,A,.,Protein_kinase_domain|Protein_kinase-like_domain,T,T,D,T,D,D,D,D,0.0,D,D,T,D,D,D,D,4e-06,.,D,T,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr170Met,c.509C>T,NEK2,NM_002497.3,3,A,2,509,>,Thr,Met,170,subst,.,1,211673529,rs759517583,0.0,1,13/20,13,1,1,1,NM_002497.3,P51955,"NEK2,NEK2A,NLK1",2W5A,1.55,NEK2_HUMAN,1.0,271.0,2W5A,1.0,271.0,P51955,A,1.0,271.0,UNP,-,-,0.0,2W5A,-1,polar,nonpolar,polarTOnonpolar,1,1,1TO1,0,2W5A,A:170:_:THR,A,170.0,Thr,S,9.0,69.16,2W5A,Thr,170,A,1,8,1,0,0,0,0,0,0,0,6,2,0,0,0,0


The **Descartar** attribute will be added to the **new_base_merge_pdb** database to flag the records in which the position of the mutated amino acid is not contained in the alignment sequence. Such records have no RING data, so they will only be part of the database without RING data that is submitted to machine learning. Pattern:

- Descartar = 0, the **poschangeProt** attribute is in the range of **pdbx_auth_seq_align_beg** and **pdbx_auth_seq_align_end** or in the range of **seq_align_beg** and **seq_align_end**.

- Descartar = 1, the record has no value in the attributes: **pdbx_auth_seq_align_beg**, **pdbx_auth_seq_align_end**, **seq_align_beg** and **seq_align_end**

- Descartar = 2, the **poschangeProt** attribute is not in the range of **pdbx_auth_seq_align_beg** and **pdbx_auth_seq_align_end** nor in the range of **seq_align_beg** and **seq_align_end**
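
The pattern above can also be sketched without an explicit loop. The function below is an illustrative alternative, not the notebook's implementation: `descartar_flag` is a hypothetical name, and it assumes that missing values appear either as `'-'` or as NaN. It evaluates the conditions in the same order as the loop in the next cell (author range first, then sequence range):

```python
import numpy as np
import pandas as pd

def descartar_flag(df):
    """Return an array of 0/1/2 flags, one per row (illustrative sketch)."""
    pos = pd.to_numeric(df["poschangeProt"], errors="coerce")
    # Non-numeric placeholders such as '-' become NaN.
    a_beg = pd.to_numeric(df["pdbx_auth_seq_align_beg"], errors="coerce")
    a_end = pd.to_numeric(df["pdbx_auth_seq_align_end"], errors="coerce")
    s_beg = pd.to_numeric(df["seq_align_beg"], errors="coerce")
    s_end = pd.to_numeric(df["seq_align_end"], errors="coerce")

    auth_ok = a_beg.notna() & a_end.notna()
    seq_ok = s_beg.notna() & s_end.notna()
    in_auth = pos.between(a_beg, a_end)   # inclusive on both ends
    in_seq = pos.between(s_beg, s_end)

    # np.select takes the first condition that is True for each row.
    return np.select(
        [~auth_ok, in_auth, ~seq_ok, in_seq],
        [1, 0, 1, 0],
        default=2,
    )

# Toy rows with hypothetical values.
toy = pd.DataFrame({
    "poschangeProt":           [447, 10, 500, 50, 500],
    "pdbx_auth_seq_align_beg": [1.0, 20.0, 1.0, "-", 1.0],
    "pdbx_auth_seq_align_end": [474.0, 100.0, 100.0, "-", 100.0],
    "seq_align_beg":           [1.0, 1.0, 1.0, "-", "-"],
    "seq_align_end":           [474.0, 80.0, 100.0, "-", "-"],
})
flags = descartar_flag(toy)
print(flags)
```

Converting the range columns with `pd.to_numeric(..., errors="coerce")` makes the missing-value test independent of whether the file stores `'-'` or an empty cell.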

In [None]:
l = []
def process_reg(df):
  """Return the Descartar flag (0, 1 or 2) for every row of df."""
  l1 = []
  count = 0  # records without a usable alignment range
  for i in df.itertuples():
    flag_Descartar = 1
    if (i.pdbx_auth_seq_align_beg != '-') and (i.pdbx_auth_seq_align_end != '-'):
      if float(i.pdbx_auth_seq_align_beg) <= i.poschangeProt <= float(i.pdbx_auth_seq_align_end):
        flag_Descartar = 0
      elif (i.seq_align_beg != '-') and (i.seq_align_end != '-'):
        if float(i.seq_align_beg) <= i.poschangeProt <= float(i.seq_align_end):
          flag_Descartar = 0
        else:
          flag_Descartar = 2  # poschangeProt is not in either range
      else:  # It has no value in seq_align_beg and seq_align_end
        count = count + 1
        flag_Descartar = 1
    else:  # It has no value in pdbx_auth_seq_align_beg and pdbx_auth_seq_align_end
      count = count + 1
      flag_Descartar = 1
    l1.append(flag_Descartar)
  print("Number of records whose Uniprot is not related to the PDB in the XML file: ", count)
  return l1

l = process_reg(new_base_merge_pdb)
new_base_merge_pdb['Descartar'] = l


Number of records whose Uniprot is not related to the PDB in the XML file:  0


In [None]:
consulta = new_base_merge_pdb.head(137)

In [None]:
consulta[['Uniprot_id','PDB_id', 'poschangeProt', 'pdbx_auth_seq_align_beg', 'pdbx_auth_seq_align_end', 'seq_align_beg','seq_align_end', 'Descartar']]

Unnamed: 0,Uniprot_id,PDB_id,poschangeProt,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,seq_align_beg,seq_align_end,Descartar
0,O75844,5SYT,447,1.0,474.0,1.0,474.0,0
1,Q8WU79,2IQJ,36,1.0,132.0,3.0,134.0,0
2,P07358,3OJY,242,1.0,537.0,1.0,537.0,0
3,P49760,6FYK,138,136.0,496.0,3.0,363.0,0
4,Q86VI3,3ISU,1540,1529.0,1631.0,19.0,121.0,0
35,O95630,2XZE,10,1.0,146.0,1.0,146.0,0
36,P52789,2NZT,765,17.0,916.0,3.0,902.0,0
37,O75643,6S8Q,489,394.0,2136.0,5.0,1747.0,0
38,P01111,5UHV,61,1.0,166.0,1.0,166.0,0
39,P51955,2W5A,170,1.0,271.0,1.0,271.0,0


The **Descartar** attribute will be added to the **df_dup** database to identify which records are duplicates, as these will not be part of the database that will be submitted to machine learning. Pattern:

- Descartar = 0, the **poschangeProt** attribute is in the range of **pdbx_auth_seq_align_beg** and **pdbx_auth_seq_align_end** or in the range of **seq_align_beg** and **seq_align_end**.

- Descartar = 1, the record has no value in the attributes: **pdbx_auth_seq_align_beg**, **pdbx_auth_seq_align_end**, **seq_align_beg** and **seq_align_end**

- Descartar = 2, the **poschangeProt** attribute is not in the range of **pdbx_auth_seq_align_beg** and **pdbx_auth_seq_align_end** nor in the range of **seq_align_beg** and **seq_align_end**

- Descartar = 3, it is a duplicate record, the one with the best resolution has already been selected.
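
The loop in the next cell keeps the first valid record of each (Uniprot_id, poschangeProt) group, so it depends on the row order. The sketch below makes the "best resolution" choice explicit instead. It is a simplified illustration under stated assumptions: a lower **Resolution** value is taken to be better, the boolean column `covered` is hypothetical and stands for "poschangeProt lies in one of the two ranges", and flags 1 and 2 are not distinguished (groups with no covered record all receive 2).

```python
import pandas as pd

# Toy data: hypothetical identifiers and resolutions.
toy = pd.DataFrame({
    "Uniprot_id":    ["P1", "P1", "P1", "P2", "P2", "P3", "P3"],
    "poschangeProt": [59, 59, 59, 208, 208, 81, 81],
    "PDB_id":        ["AAA1", "AAA2", "AAA3", "BBB1", "BBB2", "CCC1", "CCC2"],
    "Resolution":    [2.10, 1.86, 1.91, 1.60, 1.59, 2.00, 2.50],
    "covered":       [True, True, True, True, False, False, False],
})

keys = ["Uniprot_id", "poschangeProt"]

# Among the covered rows, pick the best (lowest) resolution per group.
covered_rows = toy[toy["covered"]].sort_values(keys + ["Resolution"])
best_idx = covered_rows.groupby(keys).head(1).index

toy["Descartar"] = 3                      # default: duplicate, not selected
toy.loc[best_idx, "Descartar"] = 0        # the selected record of each group

# Groups in which no record covers the position have nothing to select.
has_cover = toy.groupby(keys)["covered"].transform("any")
toy.loc[~has_cover, "Descartar"] = 2

print(toy[["Uniprot_id", "PDB_id", "Resolution", "covered", "Descartar"]])
```

Note that in group P2 the record with the best resolution does not cover the position, so the next-best covered record is selected.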

In [None]:
l = []
def process_reg_dup(df):
  # Assumes the rows of each (Uniprot_id, poschangeProt) group are contiguous and
  # already ordered with the best resolution first, so the first valid row is kept.
  uniprot = 0
  pos = 0
  selecionou = False  # True once a record of the current group has been selected
  l1 = []
  count = 0
  for i in df.itertuples():
    flag_Descartar = 1
    # A new Uniprot or a new mutation position starts a new group
    if (i.Uniprot_id != uniprot) or (i.poschangeProt != pos):
      selecionou = False
    if (i.pdbx_auth_seq_align_beg != '-') and (i.pdbx_auth_seq_align_end != '-'):
      if float(i.pdbx_auth_seq_align_beg) <= i.poschangeProt <= float(i.pdbx_auth_seq_align_end):
        if not selecionou:
          flag_Descartar = 0
          selecionou = True
        else:
          flag_Descartar = 3  # Duplicate record, the one with the best resolution has already been selected
      elif (i.seq_align_beg != '-') and (i.seq_align_end != '-'):
        if float(i.seq_align_beg) <= i.poschangeProt <= float(i.seq_align_end):
          if not selecionou:
            flag_Descartar = 0
            selecionou = True
          else:
            flag_Descartar = 3  # Duplicate record, the one with the best resolution has already been selected
        elif not selecionou:
          flag_Descartar = 2  # poschangeProt is not in either range
        else:
          flag_Descartar = 3  # Duplicate record, the one with the best resolution has already been selected
      elif not selecionou:
        count = count + 1
        flag_Descartar = 1  # seq_align_beg and seq_align_end have no value
      else:
        flag_Descartar = 3  # Duplicate record, the one with the best resolution has already been selected
    elif not selecionou:
      count = count + 1
      flag_Descartar = 1  # It has no value in pdbx_auth_seq_align_beg and pdbx_auth_seq_align_end
    else:
      flag_Descartar = 3
    l1.append(flag_Descartar)
    uniprot = i.Uniprot_id
    pos = i.poschangeProt
  print("Number of records whose Uniprot is not related to the PDB in the XML file: ", count)
  return l1

# df_dup was built by boolean indexing; an explicit copy avoids the
# SettingWithCopyWarning when the new column is assigned.
df_dup = df_dup.copy()
l = process_reg_dup(df_dup)
df_dup['Descartar'] = l


Number of records whose Uniprot is not related to the PDB in the XML file:  0


In [None]:
consulta = df_dup.head(670)

In [None]:
consulta[['Uniprot_id','PDB_id', 'poschangeProt', 'pdbx_auth_seq_align_beg', 'pdbx_auth_seq_align_end', 'seq_align_beg','seq_align_end', 'Descartar']]

Unnamed: 0,Uniprot_id,PDB_id,poschangeProt,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,seq_align_beg,seq_align_end,Descartar
5,P06126,5J1A,59,-16.0,278.0,1.0,295.0,0
6,P06126,4X6F,59,4.0,278.0,1.0,275.0,3
7,P06126,4X6E,59,4.0,278.0,1.0,275.0,3
8,P06126,1ONQ,59,1.0,277.0,1.0,277.0,3
9,Q8WWA0,6USC,208,35.0,313.0,1.0,279.0,0
10,Q8WWA0,4WMY,208,29.0,313.0,22.0,306.0,3
11,Q8WWA0,4WMQ,208,29.0,313.0,22.0,306.0,3
12,Q86V25,6J4P,81,46.0,296.0,1.0,251.0,0
13,Q86V25,6QBY,81,40.0,295.0,2.0,257.0,3
14,Q86V25,6J4V,81,46.0,296.0,1.0,251.0,3


In [None]:
df_dup.query("pdbx_auth_seq_align_beg == '-' and pdbx_auth_seq_align_end == '-'")

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING_x,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING,PDB_id_RING_y,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot,Descartar


We will join the two databases:

- new_base_merge_pdb
- df_dup

Generating the database: **base_merge**

The **Descartar** field identifies the records that should be discarded, for either of the following reasons:

- the position given by the **poschangeProt** attribute does not belong to the alignment sequence between Uniprot and PDB.
- the record is a duplicate, since a Uniprot can be associated with more than one PDB.
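
A minimal sketch of this join on toy data (hypothetical values), using `pd.concat` with `ignore_index=True` so that the result receives a fresh 0..n-1 index, followed by a check that no row was lost:

```python
import pandas as pd

# Toy stand-ins for new_base_merge_pdb and df_dup (hypothetical values).
singles = pd.DataFrame(
    {"Uniprot_id": ["P2", "P4"], "Descartar": [0, 0]},
    index=[2, 5],
)
dups = pd.DataFrame(
    {"Uniprot_id": ["P1", "P1", "P3", "P3"], "Descartar": [0, 3, 0, 3]},
    index=[0, 1, 3, 4],
)

# Stack the two frames; ignore_index=True discards the old index labels.
merged = pd.concat([singles, dups], ignore_index=True)

# Every input row must be present exactly once in the result.
assert len(merged) == len(singles) + len(dups)

counts = merged["Descartar"].value_counts()
print(counts)
```

Comparing `value_counts()` of **Descartar** before and after the join is a quick way to confirm that the flags survived unchanged.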

In [None]:
#DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
base_merge = pd.concat([new_base_merge_pdb, df_dup], ignore_index=True)

In [None]:
base_merge.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717 entries, 0 to 716
Data columns (total 115 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         717 non-null    int64  
 1   POS                           717 non-null    int64  
 2   ID                            717 non-null    object 
 3   REF                           717 non-null    object 
 4   ALT                           717 non-null    object 
 5   avsnp150                      717 non-null    object 
 6   Interpro_domain               717 non-null    object 
 7   dbNSFP_DEOGEN2_pred           717 non-null    object 
 8   dbNSFP_MetaSVM_pred           717 non-null    object 
 9   dbNSFP_fathmmMKL_coding_pred  717 non-null    object 
 10  dbNSFP_PrimateAI_pred         717 non-null    object 
 11  dbNSFP_PROVEAN_pred           717 non-null    object 
 12  dbNSFP_MCAP_pred              717 non-null    object 
 13  dbNS

In [None]:
base_merge.head(30)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING_x,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING,PDB_id_RING_y,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot,Descartar
0,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45,5SYT,Val,447,A,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0
1,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19,2IQJ,Pro,36,A,16,5,16,0,0,0,0,0,0,0,5,0,0,0,0,0,0
2,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32,3OJY,Arg,242,B,0,4,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0
3,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34,6FYK,Arg,138,A,0,4,0,0,0,0,0,0,0,0,2,1,0,1,0,0,0
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7,3ISU,Ala,1540,A,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0,0
5,2,73830885,.,C,T,rs767316172,.,T,T,D,T,D,T,T,T,3.3e-05,B,D,T,D,D,D,T,5.2e-05,.,D,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro10Leu,c.29C>T,STAMBP,NM_006463.4,3,T,2,29,>,Pro,Leu,10,subst,.,2,73830885,rs767316172,0.0,0,7/20,7,1,1,0,NM_006463.4,O95630,"STAMBP,AMSH",2XZE,1.75,STABP_HUMAN,1.0,146.0,2XZE,1.0,146.0,O95630,A,1.0,146.0,UNP,-,-,0.0,2XZE,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2XZE,A:10:_:PRO,A,10.0,Pro,,2.0,31.63,2XZE,Pro,10,A,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0
6,2,74887977,.,G,T,.,"Hexokinase,_C-terminal",D,D,D,D,D,D,D,D,0.0,D,D,D,D,D,D,D,8e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gGa/gTa,p.Gly765Val,c.2294G>T,HK2,NM_000189.4,16,T,2,2294,>,Gly,Val,765,subst,.,2,74887977,rs1264600281,0.0,1,18/20,18,1,1,1,NM_000189.4,P52789,HK2,2NZT,2.45,HXK2_HUMAN,17.0,916.0,2NZT,17.0,916.0,P52789,A,3.0,902.0,UNP,17,DQVQKVDQYLYHMRLSDETLLEISKRFRKEMEKGLGATTHPTAAVK...,911.0,2NZT,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,1,2NZT,A:765:_:GLY,A,765.0,Gly,H,1.0,54.42,2NZT,Gly,765,A,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
7,2,96296982,.,T,C,.,DEAD/DEAH_box_helicase_domain|Helicase_superfa...,T,T,D,T,D,T,T,T,0.0,B,T,T,T,N,N,T,4e-06,.,D,T,D,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tAc/tGc,p.Tyr489Cys,c.1466A>G,SNRNP200,NM_014014.4,12,C,2,1466,>,Tyr,Cys,489,subst,.,2,96296982,rs1259944697,0.0,0,4/20,4,1,0,0,NM_014014.4,O75643,"SNRNP200,ASCC3L1,HELIC2,KIAA0788",6S8Q,2.39,U520_HUMAN,394.0,2136.0,6S8Q,394.0,2136.0,O75643,B,5.0,1747.0,UNP,394,MDLDQGGEALAPRQVLDLEDLVFTQGSHFMANKRCQLPDGSFRRQR...,1764.0,6S8Q,-2,aromatic,polar,aromaticTOpolar,0,0,0TO0,0,6S8Q,B:489:_:TYR,B,489.0,Tyr,H,20.0,35.19,6S8Q,Tyr,489,B,0,20,0,0,0,0,0,0,0,0,17,3,0,0,0,0,0
8,1,114713909,.,G,T,rs121913254,P-loop_containing_nucleoside_triphosphate_hydr...,D,D,D,D,D,D,D,D,0.0,P,D,D,D,U,D,D,0.0,.,D,D,D,P,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Caa/Aaa,p.Gln61Lys,c.181C>A,NRAS,NM_002524.4,3,T,1,181,>,Gln,Lys,61,subst,.,1,114713909,rs121913254,0.0,1,17/20,17,1,1,1,NM_002524.4,P01111,"NRAS,HRAS1",5UHV,1.67,RASN_HUMAN,1.0,166.0,5UHV,1.0,166.0,P01111,A,1.0,166.0,UNP,1,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...,168.0,5UHV,1,polar,positivecharge,polarTOpositivecharge,0,1,0TO1,1,5UHV,A:61:_:GLN,A,61.0,Gln,S,22.0,41.02,5UHV,Gln,61,A,20,2,20,0,0,0,0,0,0,0,1,1,0,0,0,0,0
9,1,211673529,.,G,A,.,Protein_kinase_domain|Protein_kinase-like_domain,T,T,D,T,D,D,D,D,0.0,D,D,T,D,D,D,D,4e-06,.,D,T,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr170Met,c.509C>T,NEK2,NM_002497.3,3,A,2,509,>,Thr,Met,170,subst,.,1,211673529,rs759517583,0.0,1,13/20,13,1,1,1,NM_002497.3,P51955,"NEK2,NEK2A,NLK1",2W5A,1.55,NEK2_HUMAN,1.0,271.0,2W5A,1.0,271.0,P51955,A,1.0,271.0,UNP,-,-,0.0,2W5A,-1,polar,nonpolar,polarTOnonpolar,1,1,1TO1,0,2W5A,A:170:_:THR,A,170.0,Thr,S,9.0,69.16,2W5A,Thr,170,A,1,8,1,0,0,0,0,0,0,0,6,2,0,0,0,0,0


In [None]:
#How many records have Descartar == 0
res0 = base_merge.query('Descartar == 0')

In [None]:
len(res0)

98

In [None]:
#How many records have Descartar != 0
res = base_merge.query('Descartar !=0')

In [None]:
len(res)

619

In [None]:
res1 = base_merge.query('Descartar == 1')

In [None]:
len(res1)

0

In [None]:
res2 = base_merge.query('Descartar == 2')

In [None]:
len(res2)

0

In [None]:
res3 = base_merge.query('Descartar == 3')

In [None]:
len(res3)

619

In [None]:
base_merge['Descartar'].value_counts()

3    619
0     98
Name: Descartar, dtype: int64

##44.1 Generating an intermediate file from **base_ACC** with the **Descartar** attribute, which identifies the records to be discarded (duplicates and positions outside the alignment).

In [None]:
base_merge.to_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING_edges_RING_Descartar.csv",sep='\t',index=False)

#45 - Generating an intermediate file with wild PDB_ids after introducing RING attributes

In [None]:
#Increasing the display capacity of columns and rows.
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',90000)
pd.set_option('display.width', 7000)

In [None]:
#Reading the ACC intermediate file that includes the Descartar attribute
import pandas as pd
base_merge_pdb = pd.read_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING_edges_RING_Descartar.csv", delimiter='\t')

In [None]:
base_merge_pdb.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717 entries, 0 to 716
Data columns (total 115 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         717 non-null    int64  
 1   POS                           717 non-null    int64  
 2   ID                            717 non-null    object 
 3   REF                           717 non-null    object 
 4   ALT                           717 non-null    object 
 5   avsnp150                      717 non-null    object 
 6   Interpro_domain               717 non-null    object 
 7   dbNSFP_DEOGEN2_pred           717 non-null    object 
 8   dbNSFP_MetaSVM_pred           717 non-null    object 
 9   dbNSFP_fathmmMKL_coding_pred  717 non-null    object 
 10  dbNSFP_PrimateAI_pred         717 non-null    object 
 11  dbNSFP_PROVEAN_pred           717 non-null    object 
 12  dbNSFP_MCAP_pred              717 non-null    object 
 13  dbNS

In [None]:
base_merge_validos = base_merge_pdb.query('Descartar == 0')

In [None]:
base_merge_validos.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 0 to 687
Data columns (total 115 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         98 non-null     int64  
 1   POS                           98 non-null     int64  
 2   ID                            98 non-null     object 
 3   REF                           98 non-null     object 
 4   ALT                           98 non-null     object 
 5   avsnp150                      98 non-null     object 
 6   Interpro_domain               98 non-null     object 
 7   dbNSFP_DEOGEN2_pred           98 non-null     object 
 8   dbNSFP_MetaSVM_pred           98 non-null     object 
 9   dbNSFP_fathmmMKL_coding_pred  98 non-null     object 
 10  dbNSFP_PrimateAI_pred         98 non-null     object 
 11  dbNSFP_PROVEAN_pred           98 non-null     object 
 12  dbNSFP_MCAP_pred              98 non-null     object 
 13  dbNSF

In [None]:
bd_id = base_merge_validos['PDB_wild_id'].value_counts()

In [None]:
bd_id

5MCT    7
4KX8    2
6Y7F    2
5OTF    1
3NWN    1
6S8Q    1
4YNM    1
6UPR    1
3AP9    1
3ABH    1
3KUQ    1
5OYJ    1
1XJV    1
3ZNN    1
3EI3    1
6QJU    1
3OJY    1
1JKG    1
6PXU    1
3ZSJ    1
6MKK    1
2IQJ    1
4CGV    1
6J8Y    1
6DF3    1
4QN1    1
6FYK    1
4WGK    1
6U1U    1
2A1I    1
2CBZ    1
3F70    1
4W7Z    1
5I9J    1
3ISU    1
1PIN    1
2P5S    1
2NZT    1
5MJ6    1
5UHV    1
4J37    1
6H45    1
6YA6    1
4UU5    1
6OC0    1
5A1M    1
4D1P    1
5KYC    1
2XZE    1
6RTW    1
1AIE    1
3BUV    1
5TC6    1
6I6R    1
3PS4    1
6N0D    1
3CWW    1
6FPY    1
2EYI    1
2F9L    1
6C6N    1
4QFT    1
2Q5E    1
2QS9    1
5SYT    1
4KGQ    1
6L4B    1
3HIL    1
6USC    1
2W5A    1
2GCG    1
3M03    1
2R7G    1
4ZFG    1
4FDI    1
1OZN    1
5XF7    1
4XH9    1
5J1A    1
2XHI    1
3U2P    1
1QMV    1
6NEW    1
6D5X    1
2XR5    1
6J4P    1
1S1P    1
6M90    1
6S4M    1
4ROC    1
Name: PDB_wild_id, dtype: int64

In [None]:
type(bd_id)


pandas.core.series.Series

In [None]:
#converting to a dataframe
bd_id = bd_id.to_frame()

In [None]:
type(bd_id)

pandas.core.frame.DataFrame

In [None]:
bd_id.info()

<class 'pandas.core.frame.DataFrame'>
Index: 90 entries, 5MCT to 4ROC
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PDB_wild_id  90 non-null     int64
dtypes: int64(1)
memory usage: 3.9+ KB


In [None]:
bd_id.head()

Unnamed: 0,PDB_wild_id
5MCT,7
4KX8,2
6Y7F,2
5OTF,1
3NWN,1


In [None]:
bd_id['PDB_wild_id'] = bd_id.index

In [None]:
bd_id = bd_id.loc[:,['PDB_wild_id']]

In [None]:
bd_id.head()

Unnamed: 0,PDB_wild_id
5MCT,5MCT
4KX8,4KX8
6Y7F,6Y7F
5OTF,5OTF
3NWN,3NWN


In [None]:
#Identify duplicate records in the data
dupes=bd_id.duplicated()
sum(dupes)

0

In [None]:
bd_id.to_csv("drive/My Drive/ProcessaNovaBase/Junta_PDBs_id_Pos_RING/Tecidos_PDB_wild_id/ACC_PDB_wild_id.csv",sep='\t',index=False)
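
The list of distinct wild-type PDB ids can also be obtained directly from the column, without going through `value_counts()`. A minimal sketch on toy data (the frame `toy` and its rows are hypothetical; the real table is `base_merge_validos`):

```python
import pandas as pd

# Toy stand-in for base_merge_validos (hypothetical rows)
toy = pd.DataFrame({'PDB_wild_id': ['5MCT', '4KX8', '5MCT', '6Y7F', '4KX8']})

# Keep one row per distinct PDB id, preserving first-appearance order
unique_ids = toy[['PDB_wild_id']].drop_duplicates().reset_index(drop=True)

# Same invariant the notebook checks with duplicated()
assert unique_ids.duplicated().sum() == 0
print(unique_ids['PDB_wild_id'].tolist())
```

Unlike the `value_counts()` route, this keeps the ids in order of first appearance rather than sorted by frequency.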

#46 - Generating the clustering coefficient and betweenness attributes, obtained from processing the RING edges files

The edges files of all PDBs that were submitted to RING are the input of the R script that calculates the clustering coefficient and betweenness of all nodes that make up the graph of a PDB.
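
The R script itself is not shown in this notebook. As a rough illustration only, the sketch below computes the degree, the triangle count and the local clustering coefficient of each node in a small undirected graph, using the standard definition C = T / (k(k-1)/2), where k is the degree and T the number of triangles. The edge list and the function name `local_clustering` are hypothetical, and the sketch treats the graph as simple (no repeated edges), which may differ from how the R script handles RING edges. Betweenness is not covered here.

```python
from itertools import combinations

def local_clustering(edges):
    """Return {node: (degree, triangles, clustering_coefficient)} for an undirected simple graph."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    result = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        # A triangle exists for every pair of neighbours that are themselves connected
        triangles = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in neighbors[a])
        # Nodes with fewer than two neighbours cannot form triangles
        coef = triangles / (k * (k - 1) / 2) if k > 1 else 0.0
        result[node] = (k, triangles, coef)
    return result

# Hypothetical edge list using RING-style node ids
edges = [
    ('A:1:_:VAL', 'A:2:_:PRO'),
    ('A:1:_:VAL', 'A:3:_:ARG'),
    ('A:2:_:PRO', 'A:3:_:ARG'),
    ('A:3:_:ARG', 'A:4:_:GLY'),
]
stats = local_clustering(edges)
for node in sorted(stats):
    print(node, stats[node])
```

The sample rows of **cifs_pdbs_NodesResult_proc** are consistent with this definition: a node with degree 11 and 3 triangles has a coefficient of 3/55 ≈ 0.054545.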

The **TrataArqsScriptR** notebook, located in the **drive/My Drive/ProcessaNovaBase/TrataArqsScriptDiego** folder, processes the file that contains these attributes. The resulting database is **cifs_pdbs_NodesResult_proc**.

The attributes in this database are:

- node_ScriptR: The node can be an amino acid or a ligand. It was obtained from the edge file (RING output). It has the following format:
$<chain> : <index> : <insertion_code> : <residue_3_letter_code>$
- degree_node_ScriptR: the degree of the node, obtained from the edge file (RING output)
- triangles_node: Number of triangles that this node forms with other residues it interacts with.
- clusteringCoef_node: node clustering coefficient.
- betweennessWeighted_node: node betweenness.
- filename: edge file name.
- node_id_ScriptR: 3-letter code for the amino acid or ligand.
- node_pos_ScriptR: node position.
- node_chain_ScriptR: node chain.
- PDB_id_ScriptR: PDB code.
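
The node identifier can be split back into its parts, which shows how fields such as node_chain_ScriptR and node_pos_ScriptR relate to node_ScriptR. A minimal sketch: the helper name `parse_node` is hypothetical, and the capitalisation of the residue code (`VAL` to `Val`) is inferred from the sample rows rather than taken from the processing notebook.

```python
def parse_node(node):
    """Split a RING node id 'chain:index:insertion_code:residue' into its fields."""
    chain, index, insertion_code, residue = node.split(':')
    return {
        'node_chain': chain,
        'node_pos': int(index),
        'insertion_code': insertion_code,
        # Sample rows show the 3-letter code capitalised as 'Val', 'Arg', ...
        'node_id': residue.capitalize(),
    }

parsed = parse_node('A:447:_:VAL')
print(parsed)
```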
   

In [None]:
#Increasing the display capacity of columns and rows
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',90000)
pd.set_option('display.width', 7000)

In [None]:
#Reading the ACC database
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING_edges_RING_Descartar.csv", delimiter='\t')

In [None]:
base_ACC.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 717 entries, 0 to 716
Data columns (total 115 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         717 non-null    int64  
 1   POS                           717 non-null    int64  
 2   ID                            717 non-null    object 
 3   REF                           717 non-null    object 
 4   ALT                           717 non-null    object 
 5   avsnp150                      717 non-null    object 
 6   Interpro_domain               717 non-null    object 
 7   dbNSFP_DEOGEN2_pred           717 non-null    object 
 8   dbNSFP_MetaSVM_pred           717 non-null    object 
 9   dbNSFP_fathmmMKL_coding_pred  717 non-null    object 
 10  dbNSFP_PrimateAI_pred         717 non-null    object 
 11  dbNSFP_PROVEAN_pred           717 non-null    object 
 12  dbNSFP_MCAP_pred              717 non-null    object 
 13  dbNS

In [None]:
base_ACC.head(15)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING_x,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING,PDB_id_RING_y,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot,Descartar
0,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45,5SYT,Val,447,A,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0
1,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19,2IQJ,Pro,36,A,16,5,16,0,0,0,0,0,0,0,5,0,0,0,0,0,0
2,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32,3OJY,Arg,242,B,0,4,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0
3,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34,6FYK,Arg,138,A,0,4,0,0,0,0,0,0,0,0,2,1,0,1,0,0,0
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7,3ISU,Ala,1540,A,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0,0
5,2,73830885,.,C,T,rs767316172,.,T,T,D,T,D,T,T,T,3.3e-05,B,D,T,D,D,D,T,5.2e-05,.,D,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro10Leu,c.29C>T,STAMBP,NM_006463.4,3,T,2,29,>,Pro,Leu,10,subst,.,2,73830885,rs767316172,0.0,0,7/20,7,1,1,0,NM_006463.4,O95630,"STAMBP,AMSH",2XZE,1.75,STABP_HUMAN,1.0,146.0,2XZE,1.0,146.0,O95630,A,1.0,146.0,UNP,-,-,0.0,2XZE,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2XZE,A:10:_:PRO,A,10.0,Pro,,2.0,31.63,2XZE,Pro,10,A,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0
6,2,74887977,.,G,T,.,"Hexokinase,_C-terminal",D,D,D,D,D,D,D,D,0.0,D,D,D,D,D,D,D,8e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gGa/gTa,p.Gly765Val,c.2294G>T,HK2,NM_000189.4,16,T,2,2294,>,Gly,Val,765,subst,.,2,74887977,rs1264600281,0.0,1,18/20,18,1,1,1,NM_000189.4,P52789,HK2,2NZT,2.45,HXK2_HUMAN,17.0,916.0,2NZT,17.0,916.0,P52789,A,3.0,902.0,UNP,17,DQVQKVDQYLYHMRLSDETLLEISKRFRKEMEKGLGATTHPTAAVK...,911.0,2NZT,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,1,2NZT,A:765:_:GLY,A,765.0,Gly,H,1.0,54.42,2NZT,Gly,765,A,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
7,2,96296982,.,T,C,.,DEAD/DEAH_box_helicase_domain|Helicase_superfa...,T,T,D,T,D,T,T,T,0.0,B,T,T,T,N,N,T,4e-06,.,D,T,D,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tAc/tGc,p.Tyr489Cys,c.1466A>G,SNRNP200,NM_014014.4,12,C,2,1466,>,Tyr,Cys,489,subst,.,2,96296982,rs1259944697,0.0,0,4/20,4,1,0,0,NM_014014.4,O75643,"SNRNP200,ASCC3L1,HELIC2,KIAA0788",6S8Q,2.39,U520_HUMAN,394.0,2136.0,6S8Q,394.0,2136.0,O75643,B,5.0,1747.0,UNP,394,MDLDQGGEALAPRQVLDLEDLVFTQGSHFMANKRCQLPDGSFRRQR...,1764.0,6S8Q,-2,aromatic,polar,aromaticTOpolar,0,0,0TO0,0,6S8Q,B:489:_:TYR,B,489.0,Tyr,H,20.0,35.19,6S8Q,Tyr,489,B,0,20,0,0,0,0,0,0,0,0,17,3,0,0,0,0,0
8,1,114713909,.,G,T,rs121913254,P-loop_containing_nucleoside_triphosphate_hydr...,D,D,D,D,D,D,D,D,0.0,P,D,D,D,U,D,D,0.0,.,D,D,D,P,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Caa/Aaa,p.Gln61Lys,c.181C>A,NRAS,NM_002524.4,3,T,1,181,>,Gln,Lys,61,subst,.,1,114713909,rs121913254,0.0,1,17/20,17,1,1,1,NM_002524.4,P01111,"NRAS,HRAS1",5UHV,1.67,RASN_HUMAN,1.0,166.0,5UHV,1.0,166.0,P01111,A,1.0,166.0,UNP,1,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...,168.0,5UHV,1,polar,positivecharge,polarTOpositivecharge,0,1,0TO1,1,5UHV,A:61:_:GLN,A,61.0,Gln,S,22.0,41.02,5UHV,Gln,61,A,20,2,20,0,0,0,0,0,0,0,1,1,0,0,0,0,0
9,1,211673529,.,G,A,.,Protein_kinase_domain|Protein_kinase-like_domain,T,T,D,T,D,D,D,D,0.0,D,D,T,D,D,D,D,4e-06,.,D,T,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr170Met,c.509C>T,NEK2,NM_002497.3,3,A,2,509,>,Thr,Met,170,subst,.,1,211673529,rs759517583,0.0,1,13/20,13,1,1,1,NM_002497.3,P51955,"NEK2,NEK2A,NLK1",2W5A,1.55,NEK2_HUMAN,1.0,271.0,2W5A,1.0,271.0,P51955,A,1.0,271.0,UNP,-,-,0.0,2W5A,-1,polar,nonpolar,polarTOnonpolar,1,1,1TO1,0,2W5A,A:170:_:THR,A,170.0,Thr,S,9.0,69.16,2W5A,Thr,170,A,1,8,1,0,0,0,0,0,0,0,6,2,0,0,0,0,0


In [None]:
#Checking for 'missing' values
base_ACC.isna().sum()

CHROM                           0
POS                             0
ID                              0
REF                             0
ALT                             0
avsnp150                        0
Interpro_domain                 0
dbNSFP_DEOGEN2_pred             0
dbNSFP_MetaSVM_pred             0
dbNSFP_fathmmMKL_coding_pred    0
dbNSFP_PrimateAI_pred           0
dbNSFP_PROVEAN_pred             0
dbNSFP_MCAP_pred                0
dbNSFP_ClinPred_pred            0
dbNSFP_BayesDel_addAF_pred      0
dbNSFP_ExAC_AF                  0
dbNSFP_Polyphen2_HVAR_pred      0
dbNSFP_SIFT_pred                0
dbNSFP_FATHMM_pred              0
dbNSFP_SIFT4G_pred              0
dbNSFP_LRT_pred                 0
dbNSFP_fathmmXF_coding_pred     0
dbNSFP_BayesDel_noAF_pred       0
dbNSFP_gnomAD_exomes_AF         0
dbNSFP_Aloft_pred               0
dbNSFP_MutationTaster_pred      0
dbNSFP_MetaLR_pred              0
dbNSFP_LISTS2_pred              0
dbNSFP_Polyphen2_HDIV_pred      0
dbNSFP_Mutatio

##46.1 Removing records that represent duplicates (**Descartar** != 0)

The valid records of the **ACC** database are those with **Descartar** = 0; the others are duplicates or records whose mutated position is not present in the PDB. From here on, we work only with the records where **Descartar** = 0.
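
A minimal sketch of this filter on toy data (the frame `toy` is hypothetical). It uses a boolean mask, which selects the same rows as `query('Descartar == 0')`, and takes an explicit copy with a fresh index, which avoids chained-assignment warnings if columns are modified later:

```python
import pandas as pd

# Toy stand-in for base_ACC (hypothetical rows)
toy = pd.DataFrame({
    'PDB_id': ['5SYT', '2IQJ', '2IQJ', '3OJY'],
    'Descartar': [0, 0, 3, 0],
})

# Keep only valid records; copy() detaches the result from the original frame
valid = toy[toy['Descartar'] == 0].copy().reset_index(drop=True)

print(len(valid), valid['PDB_id'].tolist())
```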

In [None]:
base_ACC_valida  = base_ACC.query('Descartar == 0')

In [None]:
base_ACC_valida.info(max_cols=120)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 0 to 687
Data columns (total 115 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         98 non-null     int64  
 1   POS                           98 non-null     int64  
 2   ID                            98 non-null     object 
 3   REF                           98 non-null     object 
 4   ALT                           98 non-null     object 
 5   avsnp150                      98 non-null     object 
 6   Interpro_domain               98 non-null     object 
 7   dbNSFP_DEOGEN2_pred           98 non-null     object 
 8   dbNSFP_MetaSVM_pred           98 non-null     object 
 9   dbNSFP_fathmmMKL_coding_pred  98 non-null     object 
 10  dbNSFP_PrimateAI_pred         98 non-null     object 
 11  dbNSFP_PROVEAN_pred           98 non-null     object 
 12  dbNSFP_MCAP_pred              98 non-null     object 
 13  dbNSF

##46.2 Reading the cifs_pdbs_NodesResult_proc database


In [None]:
#Reading cifs_pdbs_NodesResult_proc database
import pandas as pd
base_Node = pd.read_csv("drive/My Drive/ProcessaNovaBase/TrataArqsScriptDiego/cifs_pdbs_NodesResult_proc.csv", delimiter=',',keep_default_na=False)

In [None]:
base_Node.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3048870 entries, 0 to 3048869
Data columns (total 10 columns):
 #   Column                    Dtype  
---  ------                    -----  
 0   node_ScriptR              object 
 1   degree_node_ScriptR       int64  
 2   triangles_node            int64  
 3   clusteringCoef_node       float64
 4   betweennessWeighted_node  float64
 5   filename                  object 
 6   node_id_ScriptR           object 
 7   node_pos_ScriptR          int64  
 8   node_chain_ScriptR        object 
 9   PDB_id_ScriptR            object 
dtypes: float64(2), int64(3), object(5)
memory usage: 232.6+ MB


In [None]:
base_Node.head()

Unnamed: 0,node_ScriptR,degree_node_ScriptR,triangles_node,clusteringCoef_node,betweennessWeighted_node,filename,node_id_ScriptR,node_pos_ScriptR,node_chain_ScriptR,PDB_id_ScriptR
0,A:10:_:VAL,84,2,0.000574,0.022039,10GS.pdb.edges,Val,10,A,10GS
1,A:100:_:ARG,11,3,0.054545,0.032282,10GS.pdb.edges,Arg,100,A,10GS
2,A:101:_:CYS,5,0,0.0,0.014851,10GS.pdb.edges,Cys,101,A,10GS
3,A:102:_:LYS,7,2,0.095238,0.035155,10GS.pdb.edges,Lys,102,A,10GS
4,A:103:_:TYR,16,1,0.008333,0.029593,10GS.pdb.edges,Tyr,103,A,10GS


In [None]:
#Checking for 'missing' values
base_Node.isna().sum()

node_ScriptR                0
degree_node_ScriptR         0
triangles_node              0
clusteringCoef_node         0
betweennessWeighted_node    0
filename                    0
node_id_ScriptR             0
node_pos_ScriptR            0
node_chain_ScriptR          0
PDB_id_ScriptR              0
dtype: int64

##46.3 Joining the ACC_valida table (through the fields PDB_id, aminBefore, pdbx_strand_id and poschangeProt) with the cifs_pdbs_NodesResult_proc table (through the fields PDB_id_ScriptR, node_id_ScriptR, node_chain_ScriptR and node_pos_ScriptR) to add clustering coefficient and betweenness information

In [None]:
#ACC attributes that will be the key in the join with cifs_pdbs_NodesResult_proc
def categories_column(df):
    for col in ['PDB_id', 'aminBefore', 'pdbx_strand_id', 'poschangeProt']:
        mydic= df[col].value_counts().to_dict()
        print(col, mydic)
        print('\n')

categories_column(base_ACC_valida)

PDB_id {'5MCT': 7, '6Y7F': 2, '4KX8': 2, '4QN1': 1, '6S8Q': 1, '3PS4': 1, '2CBZ': 1, '6NEW': 1, '4UU5': 1, '2XHI': 1, '6PXU': 1, '3ZSJ': 1, '4D1P': 1, '1OZN': 1, '5XF7': 1, '6J8Y': 1, '4ZFG': 1, '4FDI': 1, '2A1I': 1, '3BUV': 1, '3ZNN': 1, '3EI3': 1, '2QS9': 1, '2NZT': 1, '6FYK': 1, '3ISU': 1, '6RTW': 1, '1JKG': 1, '4W7Z': 1, '6L4B': 1, '5OYJ': 1, '3F70': 1, '6OC0': 1, '6H45': 1, '2Q5E': 1, '3ABH': 1, '6FPY': 1, '6UPR': 1, '4WGK': 1, '3M03': 1, '6YA6': 1, '2XZE': 1, '2XR5': 1, '3OJY': 1, '4CGV': 1, '2W5A': 1, '5MJ6': 1, '1AIE': 1, '6D5X': 1, '6S4M': 1, '4XH9': 1, '6N0D': 1, '5SYT': 1, '5A1M': 1, '1QMV': 1, '1XJV': 1, '6DF3': 1, '3NWN': 1, '5I9J': 1, '5OTF': 1, '2EYI': 1, '1S1P': 1, '4QFT': 1, '2F9L': 1, '4YNM': 1, '3AP9': 1, '6M90': 1, '6C6N': 1, '3KUQ': 1, '4KGQ': 1, '2IQJ': 1, '4J37': 1, '6U1U': 1, '6USC': 1, '6J4P': 1, '3CWW': 1, '4ROC': 1, '5KYC': 1, '5J1A': 1, '2P5S': 1, '5TC6': 1, '3HIL': 1, '1PIN': 1, '6QJU': 1, '6I6R': 1, '3U2P': 1, '2R7G': 1, '5UHV': 1, '2GCG': 1, '6MKK': 1}




In [None]:
#Attributes that will be the key in the join with the ACC database
def categories_column(df):
    for col in ['PDB_id_ScriptR', 'node_id_ScriptR', 'node_chain_ScriptR', 'node_pos_ScriptR']:
        mydic= df[col].value_counts().to_dict()
        print(col, mydic)
        print('\n')

categories_column(base_Node)

PDB_id_ScriptR {'5LE5': 6046, '1QO5': 6046, '5LF4': 6013, '5LF6': 6004, '2Q3E': 5401, '4DVQ': 5317, '1ZY8': 4558, '4AY1': 4175, '4ZUL': 3964, '3N80': 3911, '1YDE': 3887, '1O02': 3885, '1O01': 3877, '1CW3': 3860, '1O00': 3859, '1NZZ': 3856, '1NZX': 3845, '1O05': 3841, '1N4S': 3819, '1N4Q': 3803, '5Z2C': 3690, '6I35': 3689, '1ZMD': 3664, '1ZMC': 3663, '3SOM': 3588, '6K0R': 3497, '4EJH': 3491, '2VCV': 3376, '3LPP': 3322, '6Y41': 3306, '5OKM': 3267, '3HHD': 3212, '5EOM': 3192, '5K1A': 3067, '1MX1': 3015, '6F3T': 2877, '7JNT': 2836, '1O7A': 2801, '3GJX': 2797, '7JOV': 2792, '1R9M': 2784, '1R9N': 2763, '3IWP': 2715, '2C10': 2694, '5FQD': 2686, '3P8C': 2567, '6YND': 2567, '4OKN': 2545, '3T3P': 2531, '1I10': 2529, '6I7S': 2524, '6UEL': 2515, '1HL5': 2489, '5Q0C': 2482, '4I5L': 2423, '5JYO': 2371, '3HN3': 2365, '3U1K': 2348, '1H6K': 2301, '5UZ0': 2272, '1Z6T': 2240, '1PKX': 2235, '4A63': 2214, '3HEI': 2213, '1PL0': 2203, '2VX2': 2197, '1DO8': 2165, '1IRI': 2156, '1GZ4': 2155, '1JIQ': 2149, '3UO

In [None]:
import pandas as pd
base_merge_ACC_Node = pd.merge(base_ACC_valida, base_Node, left_on=['PDB_id', 'aminBefore', 'pdbx_strand_id', 'poschangeProt'], right_on=['PDB_id_ScriptR', 'node_id_ScriptR', 'node_chain_ScriptR', 'node_pos_ScriptR'], how='left')
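
Because this is a left join, records without a matching node are kept, with empty node attributes. A minimal sketch on toy data (both frames and the id `9XXX` are hypothetical) that uses the `indicator` and `validate` options of `pd.merge` to count unmatched records and to guard against a key matching more than one node:

```python
import pandas as pd

# Toy stand-ins for base_ACC_valida and base_Node (hypothetical rows)
left = pd.DataFrame({
    'PDB_id': ['5SYT', '2IQJ', '9XXX'],
    'aminBefore': ['Val', 'Pro', 'Gly'],
    'pdbx_strand_id': ['A', 'A', 'A'],
    'poschangeProt': [447, 36, 10],
})
right = pd.DataFrame({
    'PDB_id_ScriptR': ['5SYT', '2IQJ'],
    'node_id_ScriptR': ['Val', 'Pro'],
    'node_chain_ScriptR': ['A', 'A'],
    'node_pos_ScriptR': [447, 36],
    'clusteringCoef_node': [1.0, 0.019048],
})

merged = pd.merge(
    left, right,
    left_on=['PDB_id', 'aminBefore', 'pdbx_strand_id', 'poschangeProt'],
    right_on=['PDB_id_ScriptR', 'node_id_ScriptR', 'node_chain_ScriptR', 'node_pos_ScriptR'],
    how='left',
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
    validate='many_to_one',  # raises if a key is duplicated in the node table
)

# Records that found no node in the right-hand table
unmatched = merged[merged['_merge'] == 'left_only']
print(len(merged), len(unmatched))
```

With `validate='many_to_one'`, the row count of the result is guaranteed to equal that of the left table, so a later `duplicated()` check is a confirmation rather than the only safeguard.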


In [None]:
base_merge_ACC_Node.info(max_cols=150)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 98 entries, 0 to 97
Data columns (total 125 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         98 non-null     int64  
 1   POS                           98 non-null     int64  
 2   ID                            98 non-null     object 
 3   REF                           98 non-null     object 
 4   ALT                           98 non-null     object 
 5   avsnp150                      98 non-null     object 
 6   Interpro_domain               98 non-null     object 
 7   dbNSFP_DEOGEN2_pred           98 non-null     object 
 8   dbNSFP_MetaSVM_pred           98 non-null     object 
 9   dbNSFP_fathmmMKL_coding_pred  98 non-null     object 
 10  dbNSFP_PrimateAI_pred         98 non-null     object 
 11  dbNSFP_PROVEAN_pred           98 non-null     object 
 12  dbNSFP_MCAP_pred              98 non-null     object 
 13  dbNSFP

In [None]:
base_merge_ACC_Node.head(20)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING_x,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING,PDB_id_RING_y,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot,Descartar,node_ScriptR,degree_node_ScriptR,triangles_node,clusteringCoef_node,betweennessWeighted_node,filename,node_id_ScriptR,node_pos_ScriptR,node_chain_ScriptR,PDB_id_ScriptR
0,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45,5SYT,Val,447,A,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,A:447:_:VAL,2,1,1.0,0.0,5SYT.pdb.edges,Val,447,A,5SYT
1,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19,2IQJ,Pro,36,A,16,5,16,0,0,0,0,0,0,0,5,0,0,0,0,0,0,A:36:_:PRO,21,4,0.019048,0.017387,2IQJ.pdb.edges,Pro,36,A,2IQJ
2,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32,3OJY,Arg,242,B,0,4,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,B:242:_:ARG,4,0,0.0,0.001848,3OJY.pdb.edges,Arg,242,B,3OJY
3,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34,6FYK,Arg,138,A,0,4,0,0,0,0,0,0,0,0,2,1,0,1,0,0,0,A:138:_:ARG,4,0,0.0,0.00124,6FYK.pdb.edges,Arg,138,A,6FYK
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7,3ISU,Ala,1540,A,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0,0,A:1540:_:ALA,6,0,0.0,0.029384,3ISU.pdb.edges,Ala,1540,A,3ISU
5,2,73830885,.,C,T,rs767316172,.,T,T,D,T,D,T,T,T,3.3e-05,B,D,T,D,D,D,T,5.2e-05,.,D,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro10Leu,c.29C>T,STAMBP,NM_006463.4,3,T,2,29,>,Pro,Leu,10,subst,.,2,73830885,rs767316172,0.0,0,7/20,7,1,1,0,NM_006463.4,O95630,"STAMBP,AMSH",2XZE,1.75,STABP_HUMAN,1.0,146.0,2XZE,1.0,146.0,O95630,A,1.0,146.0,UNP,-,-,0.0,2XZE,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2XZE,A:10:_:PRO,A,10.0,Pro,,2.0,31.63,2XZE,Pro,10,A,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,A:10:_:PRO,2,0,0.0,0.00029,2XZE.pdb.edges,Pro,10,A,2XZE
6,2,74887977,.,G,T,.,"Hexokinase,_C-terminal",D,D,D,D,D,D,D,D,0.0,D,D,D,D,D,D,D,8e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gGa/gTa,p.Gly765Val,c.2294G>T,HK2,NM_000189.4,16,T,2,2294,>,Gly,Val,765,subst,.,2,74887977,rs1264600281,0.0,1,18/20,18,1,1,1,NM_000189.4,P52789,HK2,2NZT,2.45,HXK2_HUMAN,17.0,916.0,2NZT,17.0,916.0,P52789,A,3.0,902.0,UNP,17,DQVQKVDQYLYHMRLSDETLLEISKRFRKEMEKGLGATTHPTAAVK...,911.0,2NZT,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,1,2NZT,A:765:_:GLY,A,765.0,Gly,H,1.0,54.42,2NZT,Gly,765,A,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,A:765:_:GLY,1,0,0.0,0.0,2NZT.pdb.edges,Gly,765,A,2NZT
7,2,96296982,.,T,C,.,DEAD/DEAH_box_helicase_domain|Helicase_superfa...,T,T,D,T,D,T,T,T,0.0,B,T,T,T,N,N,T,4e-06,.,D,T,D,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tAc/tGc,p.Tyr489Cys,c.1466A>G,SNRNP200,NM_014014.4,12,C,2,1466,>,Tyr,Cys,489,subst,.,2,96296982,rs1259944697,0.0,0,4/20,4,1,0,0,NM_014014.4,O75643,"SNRNP200,ASCC3L1,HELIC2,KIAA0788",6S8Q,2.39,U520_HUMAN,394.0,2136.0,6S8Q,394.0,2136.0,O75643,B,5.0,1747.0,UNP,394,MDLDQGGEALAPRQVLDLEDLVFTQGSHFMANKRCQLPDGSFRRQR...,1764.0,6S8Q,-2,aromatic,polar,aromaticTOpolar,0,0,0TO0,0,6S8Q,B:489:_:TYR,B,489.0,Tyr,H,20.0,35.19,6S8Q,Tyr,489,B,0,20,0,0,0,0,0,0,0,0,17,3,0,0,0,0,0,B:489:_:TYR,20,6,0.031579,0.0039,6S8Q.pdb.edges,Tyr,489,B,6S8Q
8,1,114713909,.,G,T,rs121913254,P-loop_containing_nucleoside_triphosphate_hydr...,D,D,D,D,D,D,D,D,0.0,P,D,D,D,U,D,D,0.0,.,D,D,D,P,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Caa/Aaa,p.Gln61Lys,c.181C>A,NRAS,NM_002524.4,3,T,1,181,>,Gln,Lys,61,subst,.,1,114713909,rs121913254,0.0,1,17/20,17,1,1,1,NM_002524.4,P01111,"NRAS,HRAS1",5UHV,1.67,RASN_HUMAN,1.0,166.0,5UHV,1.0,166.0,P01111,A,1.0,166.0,UNP,1,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...,168.0,5UHV,1,polar,positivecharge,polarTOpositivecharge,0,1,0TO1,1,5UHV,A:61:_:GLN,A,61.0,Gln,S,22.0,41.02,5UHV,Gln,61,A,20,2,20,0,0,0,0,0,0,0,1,1,0,0,0,0,0,A:61:_:GLN,22,0,0.0,0.0,5UHV.pdb.edges,Gln,61,A,5UHV
9,1,211673529,.,G,A,.,Protein_kinase_domain|Protein_kinase-like_domain,T,T,D,T,D,D,D,D,0.0,D,D,T,D,D,D,D,4e-06,.,D,T,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr170Met,c.509C>T,NEK2,NM_002497.3,3,A,2,509,>,Thr,Met,170,subst,.,1,211673529,rs759517583,0.0,1,13/20,13,1,1,1,NM_002497.3,P51955,"NEK2,NEK2A,NLK1",2W5A,1.55,NEK2_HUMAN,1.0,271.0,2W5A,1.0,271.0,P51955,A,1.0,271.0,UNP,-,-,0.0,2W5A,-1,polar,nonpolar,polarTOnonpolar,1,1,1TO1,0,2W5A,A:170:_:THR,A,170.0,Thr,S,9.0,69.16,2W5A,Thr,170,A,1,8,1,0,0,0,0,0,0,0,6,2,0,0,0,0,0,A:170:_:THR,9,2,0.055556,0.044739,2W5A.pdb.edges,Thr,170,A,2W5A


In [None]:
#Identify duplicate records in the data (all columns must match)
dupes = base_merge_ACC_Node.duplicated()
dupes.sum()

0
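
Called without arguments, `duplicated()` flags a row only when every column repeats an earlier row. The sketch below contrasts that with a check restricted to a key subset. Using `CHROM`, `POS`, `REF` and `ALT` as the key is an assumption made for illustration, and the three-row `demo` frame is hypothetical rather than taken from the ACC data.

```python
import pandas as pd

# Hypothetical sample: rows 0 and 1 share the same variant key
# but differ in PDB_id, so they are not full-row duplicates.
demo = pd.DataFrame({
    'CHROM': [1, 1, 2],
    'POS': [40292580, 40292580, 73830885],
    'REF': ['G', 'G', 'C'],
    'ALT': ['A', 'A', 'T'],
    'PDB_id': ['5SYT', '2IQJ', '2XZE'],
})

# Full-row check, as used in the notebook.
full_row_dupes = int(demo.duplicated().sum())

# Key-based check (assumed key columns): flags repeated variants
# even when other columns differ.
key_cols = ['CHROM', 'POS', 'REF', 'ALT']
key_dupes = int(demo.duplicated(subset=key_cols).sum())

print(full_row_dupes, key_dupes)
```

A count of zero from the full-row check therefore does not rule out the same variant appearing more than once with different structural annotations.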

In [None]:
#Side-by-side view of the RING and ScriptR node attributes for each mutated residue
base_merge_ACC_Node[['PDB_id','aminBefore','pdbx_strand_id','poschangeProt','Degree_RING','PDB_id_ScriptR','degree_node_ScriptR','node_ScriptR']]

Unnamed: 0,PDB_id,aminBefore,pdbx_strand_id,poschangeProt,Degree_RING,PDB_id_ScriptR,degree_node_ScriptR,node_ScriptR
0,5SYT,Val,A,447,2.0,5SYT,2,A:447:_:VAL
1,2IQJ,Pro,A,36,21.0,2IQJ,21,A:36:_:PRO
2,3OJY,Arg,B,242,4.0,3OJY,4,B:242:_:ARG
3,6FYK,Arg,A,138,4.0,6FYK,4,A:138:_:ARG
4,3ISU,Ala,A,1540,6.0,3ISU,6,A:1540:_:ALA
5,2XZE,Pro,A,10,2.0,2XZE,2,A:10:_:PRO
6,2NZT,Gly,A,765,1.0,2NZT,1,A:765:_:GLY
7,6S8Q,Tyr,B,489,20.0,6S8Q,20,B:489:_:TYR
8,5UHV,Gln,A,61,22.0,5UHV,22,A:61:_:GLN
9,2W5A,Thr,A,170,9.0,2W5A,9,A:170:_:THR
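
The table above places the degree reported by RING (`Degree_RING`) next to the degree computed by the R script (`degree_node_ScriptR`). Instead of comparing them by eye, the agreement could be checked programmatically. This is a minimal sketch: the helper name `degree_mismatches` and the small `demo` frame are hypothetical, and only the column names come from the notebook.

```python
import pandas as pd

def degree_mismatches(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows where the RING degree differs from the ScriptR degree."""
    # Degree_RING is displayed as float (e.g. 2.0) and degree_node_ScriptR
    # as int, so both are cast to float before comparing.
    ring = df['Degree_RING'].astype(float)
    script = df['degree_node_ScriptR'].astype(float)
    # Note: rows with NaN in either column are also returned (NaN != NaN).
    return df[ring != script]

# Three rows copied from the table above, used only as a demonstration.
demo = pd.DataFrame({
    'PDB_id': ['5SYT', '2IQJ', '3OJY'],
    'Degree_RING': [2.0, 21.0, 4.0],
    'degree_node_ScriptR': [2, 21, 4],
})

mismatches = degree_mismatches(demo)
print(len(mismatches))
```

On the real data the call would be `degree_mismatches(base_merge_ACC_Node)`. An empty result would indicate that both sources agree on every node.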


##46.4 Generating an intermediate file with the *ACC* database plus the *clustering coefficient* and *betweenness* attributes

In [None]:
base_merge_ACC_Node.to_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING_edges_RING_Descartar_ClusBet.csv",sep='\t',index=False)

#47 - Inclusion of the *Tecido* (tissue) attribute

In [None]:
#Increasing the display capacity of columns and rows
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',90000)
pd.set_option('display.width', 7000)

In [None]:
#Reading the ACC database (tab-separated, despite the .csv extension)
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt_COMMON_Pred_PolyPhen2_Dam_ExAC_AF_exomes_AF_Ndamage_Clean_Deleteria_Uniptot_PDBcomDuplicidade_PDBWild_Blosum62_Group_Change_Essential_substitution_nodes_RING_edges_RING_Descartar_ClusBet.csv", delimiter='\t')
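
The file is written with `sep='\t'` and `index=False` and read back with `delimiter='\t'`. The separator matters because some fields, such as `Genes_Uniprot`, contain commas. The round trip can be sketched in memory as follows. The two-row `demo` frame and the `io.StringIO` buffer stand in for the real table and the Drive file, and are assumptions made for illustration.

```python
import io
import pandas as pd

# Small stand-in for the ACC table; Genes_Uniprot holds comma-separated names.
demo = pd.DataFrame({
    'PDB_id': ['5SYT', '2IQJ'],
    'Genes_Uniprot': ['ZMPSTE24,FACE1,STE24', 'SMAP2,SMAP1L'],
})

# Write tab-separated text without the index, as the notebook does.
buffer = io.StringIO()
demo.to_csv(buffer, sep='\t', index=False)

# Read it back with the same separator.
buffer.seek(0)
restored = pd.read_csv(buffer, delimiter='\t')

print(restored.shape)
print(restored.loc[0, 'Genes_Uniprot'])
```

Because `index=False` is used on write, no extra index column appears on read, and the commas inside `Genes_Uniprot` stay within a single field.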

In [None]:
base_ACC.info(max_cols=150)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 125 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         98 non-null     int64  
 1   POS                           98 non-null     int64  
 2   ID                            98 non-null     object 
 3   REF                           98 non-null     object 
 4   ALT                           98 non-null     object 
 5   avsnp150                      98 non-null     object 
 6   Interpro_domain               98 non-null     object 
 7   dbNSFP_DEOGEN2_pred           98 non-null     object 
 8   dbNSFP_MetaSVM_pred           98 non-null     object 
 9   dbNSFP_fathmmMKL_coding_pred  98 non-null     object 
 10  dbNSFP_PrimateAI_pred         98 non-null     object 
 11  dbNSFP_PROVEAN_pred           98 non-null     object 
 12  dbNSFP_MCAP_pred              98 non-null     object 
 13  dbNSFP

In [None]:
base_ACC.head(15)

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING_x,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING,PDB_id_RING_y,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot,Descartar,node_ScriptR,degree_node_ScriptR,triangles_node,clusteringCoef_node,betweennessWeighted_node,filename,node_id_ScriptR,node_pos_ScriptR,node_chain_ScriptR,PDB_id_ScriptR
0,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45,5SYT,Val,447,A,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,A:447:_:VAL,2,1,1.0,0.0,5SYT.pdb.edges,Val,447,A,5SYT
1,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19,2IQJ,Pro,36,A,16,5,16,0,0,0,0,0,0,0,5,0,0,0,0,0,0,A:36:_:PRO,21,4,0.019048,0.017387,2IQJ.pdb.edges,Pro,36,A,2IQJ
2,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32,3OJY,Arg,242,B,0,4,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,B:242:_:ARG,4,0,0.0,0.001848,3OJY.pdb.edges,Arg,242,B,3OJY
3,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34,6FYK,Arg,138,A,0,4,0,0,0,0,0,0,0,0,2,1,0,1,0,0,0,A:138:_:ARG,4,0,0.0,0.00124,6FYK.pdb.edges,Arg,138,A,6FYK
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7,3ISU,Ala,1540,A,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0,0,A:1540:_:ALA,6,0,0.0,0.029384,3ISU.pdb.edges,Ala,1540,A,3ISU
5,2,73830885,.,C,T,rs767316172,.,T,T,D,T,D,T,T,T,3.3e-05,B,D,T,D,D,D,T,5.2e-05,.,D,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro10Leu,c.29C>T,STAMBP,NM_006463.4,3,T,2,29,>,Pro,Leu,10,subst,.,2,73830885,rs767316172,0.0,0,7/20,7,1,1,0,NM_006463.4,O95630,"STAMBP,AMSH",2XZE,1.75,STABP_HUMAN,1.0,146.0,2XZE,1.0,146.0,O95630,A,1.0,146.0,UNP,-,-,0.0,2XZE,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2XZE,A:10:_:PRO,A,10.0,Pro,,2.0,31.63,2XZE,Pro,10,A,0,2,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,A:10:_:PRO,2,0,0.0,0.00029,2XZE.pdb.edges,Pro,10,A,2XZE
6,2,74887977,.,G,T,.,"Hexokinase,_C-terminal",D,D,D,D,D,D,D,D,0.0,D,D,D,D,D,D,D,8e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gGa/gTa,p.Gly765Val,c.2294G>T,HK2,NM_000189.4,16,T,2,2294,>,Gly,Val,765,subst,.,2,74887977,rs1264600281,0.0,1,18/20,18,1,1,1,NM_000189.4,P52789,HK2,2NZT,2.45,HXK2_HUMAN,17.0,916.0,2NZT,17.0,916.0,P52789,A,3.0,902.0,UNP,17,DQVQKVDQYLYHMRLSDETLLEISKRFRKEMEKGLGATTHPTAAVK...,911.0,2NZT,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,1,2NZT,A:765:_:GLY,A,765.0,Gly,H,1.0,54.42,2NZT,Gly,765,A,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,A:765:_:GLY,1,0,0.0,0.0,2NZT.pdb.edges,Gly,765,A,2NZT
7,2,96296982,.,T,C,.,DEAD/DEAH_box_helicase_domain|Helicase_superfa...,T,T,D,T,D,T,T,T,0.0,B,T,T,T,N,N,T,4e-06,.,D,T,D,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tAc/tGc,p.Tyr489Cys,c.1466A>G,SNRNP200,NM_014014.4,12,C,2,1466,>,Tyr,Cys,489,subst,.,2,96296982,rs1259944697,0.0,0,4/20,4,1,0,0,NM_014014.4,O75643,"SNRNP200,ASCC3L1,HELIC2,KIAA0788",6S8Q,2.39,U520_HUMAN,394.0,2136.0,6S8Q,394.0,2136.0,O75643,B,5.0,1747.0,UNP,394,MDLDQGGEALAPRQVLDLEDLVFTQGSHFMANKRCQLPDGSFRRQR...,1764.0,6S8Q,-2,aromatic,polar,aromaticTOpolar,0,0,0TO0,0,6S8Q,B:489:_:TYR,B,489.0,Tyr,H,20.0,35.19,6S8Q,Tyr,489,B,0,20,0,0,0,0,0,0,0,0,17,3,0,0,0,0,0,B:489:_:TYR,20,6,0.031579,0.0039,6S8Q.pdb.edges,Tyr,489,B,6S8Q
8,1,114713909,.,G,T,rs121913254,P-loop_containing_nucleoside_triphosphate_hydr...,D,D,D,D,D,D,D,D,0.0,P,D,D,D,U,D,D,0.0,.,D,D,D,P,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Caa/Aaa,p.Gln61Lys,c.181C>A,NRAS,NM_002524.4,3,T,1,181,>,Gln,Lys,61,subst,.,1,114713909,rs121913254,0.0,1,17/20,17,1,1,1,NM_002524.4,P01111,"NRAS,HRAS1",5UHV,1.67,RASN_HUMAN,1.0,166.0,5UHV,1.0,166.0,P01111,A,1.0,166.0,UNP,1,MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...,168.0,5UHV,1,polar,positivecharge,polarTOpositivecharge,0,1,0TO1,1,5UHV,A:61:_:GLN,A,61.0,Gln,S,22.0,41.02,5UHV,Gln,61,A,20,2,20,0,0,0,0,0,0,0,1,1,0,0,0,0,0,A:61:_:GLN,22,0,0.0,0.0,5UHV.pdb.edges,Gln,61,A,5UHV
9,1,211673529,.,G,A,.,Protein_kinase_domain|Protein_kinase-like_domain,T,T,D,T,D,D,D,D,0.0,D,D,T,D,D,D,D,4e-06,.,D,T,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr170Met,c.509C>T,NEK2,NM_002497.3,3,A,2,509,>,Thr,Met,170,subst,.,1,211673529,rs759517583,0.0,1,13/20,13,1,1,1,NM_002497.3,P51955,"NEK2,NEK2A,NLK1",2W5A,1.55,NEK2_HUMAN,1.0,271.0,2W5A,1.0,271.0,P51955,A,1.0,271.0,UNP,-,-,0.0,2W5A,-1,polar,nonpolar,polarTOnonpolar,1,1,1TO1,0,2W5A,A:170:_:THR,A,170.0,Thr,S,9.0,69.16,2W5A,Thr,170,A,1,8,1,0,0,0,0,0,0,0,6,2,0,0,0,0,0,A:170:_:THR,9,2,0.055556,0.044739,2W5A.pdb.edges,Thr,170,A,2W5A


In [None]:
#Counting missing (NaN) values per column
base_ACC.isna().sum()

CHROM                           0
POS                             0
ID                              0
REF                             0
ALT                             0
avsnp150                        0
Interpro_domain                 0
dbNSFP_DEOGEN2_pred             0
dbNSFP_MetaSVM_pred             0
dbNSFP_fathmmMKL_coding_pred    0
dbNSFP_PrimateAI_pred           0
dbNSFP_PROVEAN_pred             0
dbNSFP_MCAP_pred                0
dbNSFP_ClinPred_pred            0
dbNSFP_BayesDel_addAF_pred      0
dbNSFP_ExAC_AF                  0
dbNSFP_Polyphen2_HVAR_pred      0
dbNSFP_SIFT_pred                0
dbNSFP_FATHMM_pred              0
dbNSFP_SIFT4G_pred              0
dbNSFP_LRT_pred                 0
dbNSFP_fathmmXF_coding_pred     0
dbNSFP_BayesDel_noAF_pred       0
dbNSFP_gnomAD_exomes_AF         0
dbNSFP_Aloft_pred               0
dbNSFP_MutationTaster_pred      0
dbNSFP_MetaLR_pred              0
dbNSFP_LISTS2_pred              0
dbNSFP_Polyphen2_HDIV_pred      0
dbNSFP_Mutatio
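
`isna()` counts only true missing values (NaN). The rows displayed in this notebook also contain `.` and `-` in several columns (for example `ID`, `avsnp150` and `Interpro_domain`), which look like placeholders for absent annotations and are not counted above. A minimal sketch for counting them follows. Treating `.` and `-` as placeholders is an assumption based on the displayed rows, and the helper `count_placeholders` and the `demo` frame are hypothetical.

```python
import pandas as pd

# Assumption: these strings mark absent annotations in the table.
PLACEHOLDERS = ['.', '-']

def count_placeholders(df: pd.DataFrame, placeholders=PLACEHOLDERS) -> pd.Series:
    """Count, per column, the cells equal to one of the placeholder strings."""
    return df.isin(placeholders).sum()

# Hypothetical sample shaped like the first columns of the ACC table.
demo = pd.DataFrame({
    'POS': [40292580, 40406739, 56949695],
    'ID': ['.', '.', '.'],
    'avsnp150': ['rs748601004', '.', 'rs150146785'],
    'Interpro_domain': ['Peptidase_M48', '.', '-'],
})

counts = count_placeholders(demo)
print(counts)
```

On the real data the call would be `count_placeholders(base_ACC)`. Columns with a non-zero count carry placeholder values even though `isna().sum()` reports zero for them.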

In [None]:
#Label every record with its cancer type (Tecido = tissue)
base_ACC['Tecido'] = 'ACC'
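
Assigning a scalar to a new column broadcasts it to every row, so each record receives the same `Tecido` label. The label presumably exists so that the tables of the different cancer types can later be stacked and still be told apart. That purpose is an assumption, and the frames, the `OTHER` label and the `1ABC` identifier below are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical per-cancer tables with the same columns.
base_ACC_demo = pd.DataFrame({'PDB_id': ['5SYT', '2IQJ'], 'poschangeProt': [447, 36]})
base_other_demo = pd.DataFrame({'PDB_id': ['1ABC'], 'poschangeProt': [10]})

# Scalar assignment fills the whole column with one value.
base_ACC_demo['Tecido'] = 'ACC'
base_other_demo['Tecido'] = 'OTHER'

# Stack the tables; the Tecido column keeps track of the origin of each row.
combined = pd.concat([base_ACC_demo, base_other_demo], ignore_index=True)

print(combined['Tecido'].value_counts().to_dict())
```

With the label in place, per-cancer counts or filters reduce to a `groupby('Tecido')` or a boolean mask on the combined table.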

In [None]:
base_ACC.info(max_cols=150)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98 entries, 0 to 97
Data columns (total 126 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         98 non-null     int64  
 1   POS                           98 non-null     int64  
 2   ID                            98 non-null     object 
 3   REF                           98 non-null     object 
 4   ALT                           98 non-null     object 
 5   avsnp150                      98 non-null     object 
 6   Interpro_domain               98 non-null     object 
 7   dbNSFP_DEOGEN2_pred           98 non-null     object 
 8   dbNSFP_MetaSVM_pred           98 non-null     object 
 9   dbNSFP_fathmmMKL_coding_pred  98 non-null     object 
 10  dbNSFP_PrimateAI_pred         98 non-null     object 
 11  dbNSFP_PROVEAN_pred           98 non-null     object 
 12  dbNSFP_MCAP_pred              98 non-null     object 
 13  dbNSFP

In [None]:
base_ACC.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID_COMMON,COMMON,PolyPhen2_Dam_pred,Ndamage,NdamageCalc,Deleteria,Deleteria5,Deleteria10,transcript_NCBI_id,Uniprot_id,Genes_Uniprot,PDB_id,Resolution,Swiss-Prot,db_align_beg,db_align_end,pdbx_PDB_id_code,pdbx_auth_seq_align_beg,pdbx_auth_seq_align_end,pdbx_db_accession,pdbx_strand_id,seq_align_beg,seq_align_end,db_name,pdbx_align_begin,pdbx_seq_one_letter_code,len_seq,PDB_wild_id,Blosum62,groupBefore,groupAfter,groupChange,aminBeforeEssential,aminAfterEssential,essencialChange,substitution,PDB_id_RING_x,NodeId_RING,Chain_RING,Position_RING,Residue_RING,Dssp_RING,Degree_RING,Bfactor_CA_RING,PDB_id_RING_y,Node_RING,Node_pos_RING,Node_chain_RING,Inter_Lig_tot,Inter_Res_tot,Inter_IAC_Lig_tot,Inter_VDW_Lig_tot,Inter_HBOND_Lig_tot,Inter_PIPISTACK_Lig_tot,Inter_IONIC_Lig_tot,Inter_SSBOND_Lig_tot,Inter_PICATION_Lig_tot,Inter_IAC_Res_tot,Inter_VDW_Res_tot,Inter_HBOND_Res_tot,Inter_PIPISTACK_Res_tot,Inter_IONIC_Res_tot,Inter_SSBOND_Res_tot,Inter_PICATION_Res_tot,Descartar,node_ScriptR,degree_node_ScriptR,triangles_node,clusteringCoef_node,betweennessWeighted_node,filename,node_id_ScriptR,node_pos_ScriptR,node_chain_ScriptR,PDB_id_ScriptR,Tecido
0,1,40292580,.,G,A,rs748601004,Peptidase_M48,T,T,D,T,N,T,D,T,2.5e-05,B,T,T,T,D,D,T,1.2e-05,.,D,T,D,P,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtt/Att,p.Val447Ile,c.1339G>A,ZMPSTE24,NM_005857.4,10,A,1,1339,>,Val,Ile,447,subst,.,1,40292580,rs748601004,0.0,1,7/20,7,1,1,0,NM_005857.4,O75844,"ZMPSTE24,FACE1,STE24",5SYT,2.0,FACE1_HUMAN,1.0,474.0,5SYT,1.0,474.0,O75844,A,1.0,474.0,UNP,1,MGMWASLDALWEMPAEKRIFGAVLLFSWTVYLWETFLAQRQRRIYK...,479.0,5SYT,3,nonpolar,nonpolar,nonpolarTOnonpolar,1,1,1TO1,0,5SYT,A:447:_:VAL,A,447.0,Val,,2.0,39.45,5SYT,Val,447,A,0,2,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,A:447:_:VAL,2,1,1.0,0.0,5SYT.pdb.edges,Val,447,A,5SYT,ACC
1,1,40406739,.,C,T,rs759545887,.,T,D,D,D,D,D,D,D,8e-06,D,D,T,D,D,D,D,4e-06,.,D,D,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro36Leu,c.107C>T,SMAP2,NM_022733.2,2,T,2,107,>,Pro,Leu,36,subst,.,1,40406739,rs759545887,0.0,1,17/20,17,1,1,1,NM_022733.2,Q8WU79,"SMAP2,SMAP1L",2IQJ,1.9,SMAP2_HUMAN,1.0,132.0,2IQJ,1.0,132.0,Q8WU79,A,3.0,134.0,UNP,1,-,0.0,2IQJ,-3,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,2IQJ,A:36:_:PRO,A,36.0,Pro,,21.0,23.19,2IQJ,Pro,36,A,16,5,16,0,0,0,0,0,0,0,5,0,0,0,0,0,0,A:36:_:PRO,21,4,0.019048,0.017387,2IQJ.pdb.edges,Pro,36,A,2IQJ,ACC
2,1,56949695,.,G,A,rs150146785,Membrane_attack_complex_component/perforin_(MA...,T,T,N,T,N,T,T,T,0.000264,P,D,T,T,N,N,T,0.000223,.,N,T,T,D,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg242Cys,c.724C>T,C8B,NM_000066.3,6,A,1,724,>,Arg,Cys,242,subst,.,1,56949695,rs150146785,1.0,1,2/20,2,0,0,0,NM_000066.3,P07358,C8B,3OJY,2.51,CO8B_HUMAN,55.0,591.0,3OJY,1.0,537.0,P07358,B,1.0,537.0,UNP,55,SVDVTLMPIDCELSSWSSWTTCDPCQKKRYRYAYLLQPSQFHGEPC...,543.0,3OJY,-3,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,3OJY,B:242:_:ARG,B,242.0,Arg,E,4.0,42.32,3OJY,Arg,242,B,0,4,0,0,0,0,0,0,0,0,2,2,0,0,0,0,0,B:242:_:ARG,4,0,0.0,0.001848,3OJY.pdb.edges,Arg,242,B,3OJY,ACC
3,1,155268782,.,C,T,.,.,T,T,D,T,N,T,D,T,0.0,B,T,T,T,D,D,T,0.0,.,D,T,.,B,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg138Gln,c.413G>A,CLK2,NM_001294338.1,4,T,2,413,>,Arg,Gln,138,subst,.,1,155268782,rs1477026654,0.0,0,5/20,5,0,0,0,NM_001294338.1,P49760,CLK2,6FYK,2.39,CLK2_HUMAN,136.0,496.0,6FYK,136.0,496.0,P49760,A,3.0,363.0,UNP,136,SSRRAKSVEDDAEGHLIYHVGDWLQERYEIVSTLGEGTFGRVVQCV...,365.0,6FYK,1,positivecharge,polar,positivechargeTOpolar,0,0,0TO0,0,6FYK,A:138:_:ARG,A,138.0,Arg,,4.0,40.34,6FYK,Arg,138,A,0,4,0,0,0,0,0,0,0,0,2,1,0,1,0,0,0,A:138:_:ARG,4,0,0.0,0.00124,6FYK.pdb.edges,Arg,138,A,6FYK,ACC
4,1,156528563,.,G,A,rs764657941,"RasGAP_protein,_C-terminal",T,T,D,T,D,T,D,T,8e-06,D,D,T,T,D,D,T,4e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCt/gTt,p.Ala1540Val,c.4619C>T,IQGAP3,NM_178229.4,36,A,2,4619,>,Ala,Val,1540,subst,.,1,156528563,rs764657941,0.0,1,9/20,9,1,1,0,NM_178229.4,Q86VI3,IQGAP3,3ISU,1.88,IQGA3_HUMAN,1529.0,1631.0,3ISU,1529.0,1631.0,Q86VI3,A,19.0,121.0,UNP,1529,GKKQPSLHYTAAQLLEKGVLVEIEDLPASHFRNVIFDITPGDEAGK...,104.0,3ISU,0,nonpolar,nonpolar,nonpolarTOnonpolar,0,1,0TO1,0,3ISU,A:1540:_:ALA,A,1540.0,Ala,H,6.0,20.7,3ISU,Ala,1540,A,0,6,0,0,0,0,0,0,0,0,3,3,0,0,0,0,0,A:1540:_:ALA,6,0,0.0,0.029384,3ISU.pdb.edges,Ala,1540,A,3ISU,ACC


In [None]:
#Identify duplicate records in the data (all columns must match)
dupes = base_ACC.duplicated()
dupes.sum()

0

##47.1 Generating a file with the final *ACC* database

In [None]:
base_ACC.to_csv("drive/My Drive/ProcessaNovaBase/MontagemdeArqscomRINGeBetwennessClust/Bases15Tecidos/ACC_Final.csv",sep='\t',index=False)