This notebook joins the **ACC** database (somatic variant calls from the MUTECT caller) with data from the COMMON database.
 Here, **Adrenocortical carcinoma** (**ACC**) is processed. To process the other 32 cancers, change only the input file (section 2.1); the processing is the same.
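
As a minimal sketch of how the input file could be switched per cancer, the snippet below builds the path from a cancer code. The helper `input_path` and the constants `BASE_DIR` and `SUFFIX` are hypothetical names, and the sketch assumes that the files for the other cancers follow the same naming pattern as the ACC file, which this notebook does not confirm.

```python
from pathlib import Path

# Hypothetical constants: folder and file suffix copied from the ACC path in section 2.1.
BASE_DIR = Path("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF")
SUFFIX = "campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv"


def input_path(cancer_code, caller="mutect"):
    """Return the assumed path of the annotated variant file for one cancer."""
    return BASE_DIR / f"{cancer_code}_{caller}_{SUFFIX}"


print(input_path("ACC"))
```

With such a helper, only the cancer code passed to `input_path` would need to change between runs.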

#1 - Basic Settings

In [None]:
#Permission to access any file on Google Drive
from google.colab import drive
drive.mount('/content/drive')
#drive.mount("/content/drive", force_remount=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
#Increased column and row display capacity
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',70000)

In [None]:
#Identifying which libraries are installed
!pip freeze

absl-py==0.10.0
alabaster==0.7.12
albumentations==0.1.12
altair==4.1.0
argon2-cffi==20.1.0
asgiref==3.3.1
astor==0.8.1
astropy==4.1
astunparse==1.6.3
async-generator==1.10
atari-py==0.2.6
atomicwrites==1.4.0
attrs==20.3.0
audioread==2.1.9
autograd==1.3
Babel==2.9.0
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==3.2.1
blis==0.4.1
bokeh==2.1.1
Bottleneck==1.3.2
branca==0.4.1
bs4==0.0.1
CacheControl==0.12.6
cachetools==4.1.1
catalogue==1.0.0
certifi==2020.12.5
cffi==1.14.4
chainer==7.4.0
chardet==3.0.4
click==7.1.2
cloudpickle==1.3.0
cmake==3.12.0
cmdstanpy==0.9.5
colorlover==0.3.0
community==1.0.0b1
contextlib2==0.5.5
convertdate==2.2.0
coverage==3.7.1
coveralls==0.5
crcmod==1.7
cufflinks==0.17.3
cvxopt==1.2.5
cvxpy==1.0.31
cycler==0.10.0
cymem==2.0.5
Cython==0.29.21
daft==0.0.4
dask==2.12.0
dataclasses==0.8
datascience==0.10.6
debugpy==1.0.0
decorator==4.4.2
defusedxml==0.6.0
descartes==1.1.0
dill==0.3.3
distributed==1.25.3
Django==3.1.4
dlib==19.18.0
dm-tree==0.1.5
docopt==0.6.2
docutil

#2 - Processing the *ACC* database with the MUTECT caller

##2.1 - Reading the *ACC* cancer database


In this section, we read the TCGA mutation database for **Adrenocortical Carcinoma** (**ACC**), produced by the **MUTECT** caller and annotated with ANNOVAR and SnpEff (file ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt). All data use the hg38 human genome as reference.




In [None]:
#Reading ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv
import pandas as pd

ACC_MUTECT = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF/ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv",delimiter='\t')


In [None]:
ACC_MUTECT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   CHROM                         6098 non-null   int64 
 1   POS                           6098 non-null   int64 
 2   ID                            6098 non-null   object
 3   REF                           6098 non-null   object
 4   ALT                           6098 non-null   object
 5   QUAL                          6098 non-null   object
 6   FILTER                        6098 non-null   object
 7   FORMAT                        6098 non-null   object
 8   avsnp150                      6098 non-null   object
 9   Interpro_domain               6098 non-null   object
 10  dbNSFP_DEOGEN2_pred           6098 non-null   object
 11  dbNSFP_MetaSVM_pred           6098 non-null   object
 12  dbNSFP_fathmmMKL_coding_pred  6098 non-null   object
 13  dbNSFP_PrimateAI_p

In [None]:
ACC_MUTECT.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt
0,1,1312143,.,G,T,.,PASS,GT:AD:DP,.,Metallo-beta-lactamase\x3bMetallo-beta-lactama...,".,.,.,T,T,T,.,.,.",T,D,T,"N,N,.,.,N,N,.,N,N",D,D,T,.,".,.,.,.,B,B,.,B,B","D,D,.,.,D,D,.,D,D","T,T,.,.,T,.,.,.,T","D,D,D,D,D,D,T,D,D",D,D,T,.,".,.,.,.,.,.,.,.,.","D,D,D,D,D,D,D",T,"D,D,D,D,D,D,D,D,D",".,.,.,.,B,P,.,P,B",".,.,.,.,M,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Ctg/Atg,p.Leu544Met,c.1630C>A,CPSF3L,NM_001256456.1,18,T,1,1630,>,Leu,Met,544,subst,.
1,1,1398653,.,G,A,.,PASS,GT:AD:DP,.,"Cyclin,_N-terminal|Cyclin-like",".,.,.",.,D,.,".,.,.",.,.,D,.,".,.,.",".,.,.",".,.,.",".,.,.",D,D,D,.,".,.,.","A,A",.,".,.,.",".,.,.",".,.,.",STOP_GAINED,HIGH,NONSENSE,Cag/Tag,p.Gln103*,c.307C>T,CCNL2,NM_030937.4,2,A,1,307,>,Gln,*,103,translation termination,.
2,1,1916767,.,A,T,.,PASS,GT:AD:DP,.,EF-hand_domain|EF-hand_domain_pair,"T,.",T,D,T,"D,D",D,D,T,.,"D,.","D,D","T,T","D,D",.,N,T,3.981906e-06,".,.","N,N",D,"T,T","D,.","M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aAc/aTc,p.Asn90Ile,c.269A>T,CALML6,NM_138705.2,4,T,2,269,>,Asn,Ile,90,subst,.
3,1,2027599,.,G,A,.,PASS,GT:AD:DP,.,Neurotransmitter-gated_ion-channel_ligand-bind...,"T,.,.,.,.,.","T,.","D,D","D,.","N,.,.,.,.,.","D,D","T,T","T,T",".,.","D,.,.,.,.,.","T,.,.,.,.,.","T,.,.,.,.,.","T,.,.,.,.,.","D,.","D,D","T,T","4.006442e-06,4.006442e-06",".,.,.,.,.,.","D,D","T,.","D,D,D,D,D,T","D,.,.,.,.,.","N,.,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gac/Aac,p.Asp165Asn,c.493G>A,GABRD,NM_000815.4,5,A,1,493,>,Asp,Asn,165,subst,.
4,1,2303896,.,C,T,.,PASS,GT:AD:DP,rs752779978,.,D,D,D,T,D,D,D,D,8.284e-06,P,D,D,T,D,D,D,8.239065e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro423Leu,c.1268C>T,SKI,NM_003036.3,4,T,2,1268,>,Pro,Leu,423,subst,.


In [None]:
#Listing the avsnp150 IDs of each ('CHROM', 'POS', 'REF', 'ALT') tuple to check whether any tuple is associated with more than one SNP ID
print(ACC_MUTECT.groupby(['CHROM', 'POS', 'REF', 'ALT'],
                         as_index=False)['avsnp150']
      .agg(lambda x: list(x)))

      CHROM        POS REF ALT                                       avsnp150
0         1    1312143   G   T                                            [.]
1         1    1398653   G   A                                            [.]
2         1    1916767   A   T                                            [.]
3         1    2027599   G   A                                            [.]
4         1    2303896   C   T                                  [rs752779978]
5         1    2385062   G   A                                  [rs772157368]
6         1    2591033   C   T                                  [rs199926063]
7         1    3755488   G   T                                            [.]
8         1    3816294   G   A                                            [.]
9         1    3891041   T   G                                            [.]
10        1    6144098   G   T                                            [.]
11        1    6428274   G   C                                  
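
The listing above has to be inspected by eye. A possible way to make the check explicit is to count the distinct IDs per tuple and keep only the tuples with more than one. The sketch below uses a made-up toy table in place of `ACC_MUTECT`; the names `toy`, `n_ids` and `conflicts` are illustrative.

```python
import pandas as pd

# Toy data standing in for ACC_MUTECT (values are made up for illustration).
toy = pd.DataFrame({
    "CHROM": [1, 1, 1],
    "POS": [100, 100, 200],
    "REF": ["G", "G", "C"],
    "ALT": ["T", "T", "A"],
    "avsnp150": ["rs1", "rs2", "."],
})

key = ["CHROM", "POS", "REF", "ALT"]

# Number of distinct SNP IDs seen for each variant tuple.
n_ids = toy.groupby(key)["avsnp150"].nunique()

# Keep only the tuples linked to more than one SNP ID.
conflicts = n_ids[n_ids > 1]
print(conflicts)
```

An empty `conflicts` result would mean that every tuple maps to a single ID.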

In [None]:
#Identifying duplicate records in the data
dupes=ACC_MUTECT.duplicated()
sum(dupes)

0

In [None]:
ACC_MUTECT.columns

Index(['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'FORMAT',
       'avsnp150', 'Interpro_domain', 'dbNSFP_DEOGEN2_pred',
       'dbNSFP_MetaSVM_pred', 'dbNSFP_fathmmMKL_coding_pred',
       'dbNSFP_PrimateAI_pred', 'dbNSFP_PROVEAN_pred', 'dbNSFP_MCAP_pred',
       'dbNSFP_ClinPred_pred', 'dbNSFP_BayesDel_addAF_pred', 'dbNSFP_ExAC_AF',
       'dbNSFP_Polyphen2_HVAR_pred', 'dbNSFP_SIFT_pred', 'dbNSFP_FATHMM_pred',
       'dbNSFP_SIFT4G_pred', 'dbNSFP_LRT_pred', 'dbNSFP_fathmmXF_coding_pred',
       'dbNSFP_BayesDel_noAF_pred', 'dbNSFP_gnomAD_exomes_AF',
       'dbNSFP_Aloft_pred', 'dbNSFP_MutationTaster_pred', 'dbNSFP_MetaLR_pred',
       'dbNSFP_LISTS2_pred', 'dbNSFP_Polyphen2_HDIV_pred',
       'dbNSFP_MutationAssessor_pred', 'VariantEffect_EFF', 'Risco_Mut_EFF',
       'Tipo_Mut_EFF', 'Point_Mutation_EFF', 'changeProt_EFF',
       'changecDNA_EFF', 'Gene_EFF', 'RefSeq_EFF', 'Exon_EFF', 'ALT_EFF',
       'Pos_Point_Mutation_EFF', 'poschangecDNA_EFF', 'typechangecDNA_EFF',
  

In [None]:
def categories_column(df):
    """Print, for every column of df, each distinct value and its number of occurrences."""
    for col in df.columns:
        #Mapping each distinct value of the column to its count
        mydic = df[col].value_counts().to_dict()
        print(col, mydic)
        print('\n')

categories_column(ACC_MUTECT)

CHROM {1: 603, 2: 419, 19: 406, 5: 374, 23: 360, 12: 353, 6: 338, 3: 321, 7: 304, 10: 285, 4: 274, 17: 265, 11: 262, 9: 239, 8: 213, 16: 208, 14: 178, 20: 174, 15: 160, 22: 114, 13: 99, 18: 94, 21: 55}


POS {141009770: 15, 36169136: 6, 99435056: 2, 76256288: 2, 32027084: 2, 123256839: 2, 30335925: 2, 41224613: 2, 86542292: 2, 102715800: 2, 159819893: 2, 56823616: 2, 86505860: 2, 117820859: 2, 132673267: 2, 30329606: 2, 112389583: 2, 22329263: 2, 53765545: 2, 10388782: 2, 36174795: 2, 110137224: 2, 49515669: 2, 99425535: 2, 71812015: 2, 134371015: 2, 41845758: 2, 10379771: 2, 71800667: 2, 13389823: 2, 47999719: 2, 102716166: 2, 74463458: 2, 66843500: 2, 23308117: 2, 89522826: 2, 124314218: 2, 102716658: 2, 10726403: 1, 8846220: 1, 116001218: 1, 10525342: 1, 102050448: 1, 55760221: 1, 10712975: 1, 64280156: 1, 99834530: 1, 49164963: 1, 29117092: 1, 158636674: 1, 81384067: 1, 8952474: 1, 105888294: 1, 111746316: 1, 768658: 1, 141351566: 1, 100123277: 1, 19192486: 1, 129905589: 1, 1208246

#3 - Generating the *COMMON* field: Reading the databases with the COMMON field (hg38 version)


From the file **00-All.vcf.gz**, available at https://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/, the fields **Chrom, Pos, SNP_ID, REF, ALT, COMMON** were extracted to generate the file **00-All.tsv**. Because this file is about 20 GB, it was split into 7 files with the Linux command `split -b3000000000 00-All.tsv`.

The COMMON field identifies whether the mutation is frequent in the population. Possible values:
- 0 (not frequent in the population)
- 1 (frequent in the population)

The database containing the COMMON field is very large (around 20 GB), so it was split into multiple files.
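
When a COMMON file does not fit comfortably in memory, one alternative to loading it whole is to stream it in chunks and join each chunk against the much smaller cancer table. The sketch below is not the procedure used in this notebook; `merge_in_chunks` is a hypothetical helper, the toy values are made up, and the column names are the ones used in the joins of section 3.

```python
import os
import tempfile

import pandas as pd


def merge_in_chunks(base, common_path, chunksize=1_000_000):
    """Inner-join `base` against a large tab-separated COMMON file, one chunk at a time."""
    matches = []
    for chunk in pd.read_csv(common_path, sep="\t", chunksize=chunksize):
        merged = base.merge(
            chunk,
            left_on=["CHROM", "POS", "REF", "ALT"],
            right_on=["Chrom", "Pos", "REF", "ALT"],
            how="inner",
        )
        # Skip chunks without any match so only useful rows are kept in memory.
        if not merged.empty:
            matches.append(merged)
    return pd.concat(matches, ignore_index=True) if matches else pd.DataFrame()


# Tiny demonstration with made-up values written to a temporary file.
common = pd.DataFrame({
    "Chrom": [1, 1, 2], "Pos": [10, 20, 30],
    "SNP_ID": ["rs1", "rs2", "rs3"],
    "REF": ["A", "C", "G"], "ALT": ["T", "G", "A"],
    "COMMON": [0.0, 1.0, 0.0],
})
base = pd.DataFrame({"CHROM": [1, 2], "POS": [20, 30],
                     "REF": ["C", "G"], "ALT": ["G", "T"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "common_toy.tsv")
    common.to_csv(path, sep="\t", index=False)
    result = merge_in_chunks(base, path, chunksize=2)

print(result[["CHROM", "POS", "SNP_ID", "COMMON"]])
```

The chunk size trades memory for the number of join calls; the value used here is tiny only so that the toy file spans more than one chunk.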

##3.1 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with the *Common_hg38_1* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)


In [None]:
#Reading Common_hg38_1
import pandas as pd
df_common1 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_1.csv", delimiter='\t')

In [None]:
df_common1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91461250 entries, 0 to 91461249
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 4.1+ GB
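
The table above occupies more than 4 GB, partly because every numeric column is stored in 64 bits. A possible way to reduce this is to request narrower types when reading. The sketch below assumes that chromosome codes fit in `int8` and hg38 positions in `int32`; the mapping `compact_dtypes` and the one-row toy file are illustrative and not part of the original workflow.

```python
import io

import pandas as pd

# Assumed compact dtypes: small chromosome codes, positions below 2**31, COMMON as 0/1.
compact_dtypes = {"Chrom": "int8", "Pos": "int32", "COMMON": "float32"}

# One made-up row in the same tab-separated layout as the COMMON files.
toy_tsv = "Chrom\tPos\tSNP_ID\tREF\tALT\tCOMMON\n1\t12345\trs1\tA\tT\t0\n"

df = pd.read_csv(io.StringIO(toy_tsv), sep="\t", dtype=compact_dtypes)
print(df.dtypes)
```

The string columns (`SNP_ID`, `REF`, `ALT`) still dominate memory use, so the saving from the numeric columns alone is partial.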


In [None]:
#Reading base_ACC
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF/ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   CHROM                         6098 non-null   int64 
 1   POS                           6098 non-null   int64 
 2   ID                            6098 non-null   object
 3   REF                           6098 non-null   object
 4   ALT                           6098 non-null   object
 5   QUAL                          6098 non-null   object
 6   FILTER                        6098 non-null   object
 7   FORMAT                        6098 non-null   object
 8   avsnp150                      6098 non-null   object
 9   Interpro_domain               6098 non-null   object
 10  dbNSFP_DEOGEN2_pred           6098 non-null   object
 11  dbNSFP_MetaSVM_pred           6098 non-null   object
 12  dbNSFP_fathmmMKL_coding_pred  6098 non-null   object
 13  dbNSFP_PrimateAI_p

In [None]:
import pandas as pd
base_merge = pd.merge(base_ACC, df_common1, left_on=['CHROM', 'POS', 'REF', 'ALT'], right_on=['Chrom','Pos', 'REF', 'ALT'], how='inner')

In [None]:
#Number of ACC variants found in Common_hg38_1
tam_merge = len(base_merge.index)
print(tam_merge)

168
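
The inner join reports only the matching rows. To also see how many variants were not found in the COMMON table, a left join with pandas' `indicator=True` option can be used, as sketched below on made-up toy tables standing in for `base_ACC` and `df_common1`.

```python
import pandas as pd

# Toy stand-ins for base_ACC and df_common1 (values are made up).
base = pd.DataFrame({"CHROM": [1, 1, 2], "POS": [10, 20, 30],
                     "REF": ["A", "C", "G"], "ALT": ["T", "G", "A"]})
common = pd.DataFrame({"Chrom": [1, 2], "Pos": [10, 30],
                       "SNP_ID": ["rs1", "rs3"],
                       "REF": ["A", "G"], "ALT": ["T", "C"],
                       "COMMON": [0.0, 1.0]})

# how="left" keeps every base row; indicator=True adds a `_merge` column
# telling whether the row was also found in the COMMON table.
checked = base.merge(common,
                     left_on=["CHROM", "POS", "REF", "ALT"],
                     right_on=["Chrom", "Pos", "REF", "ALT"],
                     how="left", indicator=True)

n_matched = int((checked["_merge"] == "both").sum())
n_unmatched = int((checked["_merge"] == "left_only").sum())
print(n_matched, n_unmatched)
```

In the toy data the third variant shares its position and REF with a COMMON entry but has a different ALT, so it counts as unmatched.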


In [None]:
base_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168 entries, 0 to 167
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         168 non-null    int64  
 1   POS                           168 non-null    int64  
 2   ID                            168 non-null    object 
 3   REF                           168 non-null    object 
 4   ALT                           168 non-null    object 
 5   QUAL                          168 non-null    object 
 6   FILTER                        168 non-null    object 
 7   FORMAT                        168 non-null    object 
 8   avsnp150                      168 non-null    object 
 9   Interpro_domain               168 non-null    object 
 10  dbNSFP_DEOGEN2_pred           168 non-null    object 
 11  dbNSFP_MetaSVM_pred           168 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  168 non-null    object 
 13  dbNSF

In [None]:
base_merge.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,1,2027599,.,G,A,.,PASS,GT:AD:DP,.,Neurotransmitter-gated_ion-channel_ligand-bind...,"T,.,.,.,.,.","T,.","D,D","D,.","N,.,.,.,.,.","D,D","T,T","T,T",".,.","D,.,.,.,.,.","T,.,.,.,.,.","T,.,.,.,.,.","T,.,.,.,.,.","D,.","D,D","T,T","4.006442e-06,4.006442e-06",".,.,.,.,.,.","D,D","T,.","D,D,D,D,D,T","D,.,.,.,.,.","N,.,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gac/Aac,p.Asp165Asn,c.493G>A,GABRD,NM_000815.4,5,A,1,493,>,Asp,Asn,165,subst,.,1,2027599,rs1477740666,0.0
1,1,2303896,.,C,T,.,PASS,GT:AD:DP,rs752779978,.,D,D,D,T,D,D,D,D,8.284e-06,P,D,D,T,D,D,D,8.239065e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro423Leu,c.1268C>T,SKI,NM_003036.3,4,T,2,1268,>,Pro,Leu,423,subst,.,1,2303896,rs752779978,0.0
2,1,2385062,.,G,A,.,PASS,GT:AD:DP,rs772157368,.,.,T,N,T,D,.,T,T,2.502e-05,B,.,.,.,.,N,T,.,.,"D,D,D",T,.,P,.,SYNONYMOUS_CODING,LOW,SILENT,aaC/aaT,p.Asn151Asn,c.453C>T,MORN1,NM_024848.2,6,A,3,453,>,Asn,Asn,151,subst,.,1,2385062,rs772157368,0.0
3,1,2591033,.,C,T,.,PASS,GT:AD:DP,rs199926063,"Metallopeptidase,_catalytic_domain|Peptidase_M...",".,T,.",T,N,T,".,N,N",D,T,T,8.264e-05,".,B,.",".,T,T","D,D,D","T,T,T",N,N,T,1.351819e-04,".,.,.","N,N,N,N,N,N,N",T,"T,T,T",".,B,.",".,N,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg766Gln,c.2297G>A,MMEL1,NM_033467.3.2,24,T,2,2297,>,Arg,Gln,766,subst,.,1,2591033,rs199926063,0.0
4,1,3816294,.,G,A,.,PASS,GT:AD:DP,.,.,T,T,N,T,N,T,T,T,.,B,T,T,T,N,N,T,.,.,N,T,T,B,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro883Leu,c.2648C>T,CEP104,NM_014704.3,21,A,2,2648,>,Pro,Leu,883,subst,.,1,3816294,rs1197412379,0.0


##3.2 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through the fields *CHROM*, *POS*, *REF*, *ALT*) with the *Common_hg38_mult_1* table (through the fields *Chrom*, *Pos*, *REF*, *ALT*)


In [None]:
#Reading Common_hg38_mult_1
import pandas as pd
df_common_mult_1 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_mult_1.csv", delimiter='\t')

In [None]:
df_common_mult_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6502181 entries, 0 to 6502180
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 297.6+ MB


In [None]:
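#Joining on CHROM, POS and REF only: ALT in Common_hg38_mult_1 holds a comma-separated list of alleles, so it is compared in a later step
import pandas as pd
base_merge_mult = pd.merge(base_ACC, df_common_mult_1, left_on=['CHROM', 'POS', 'REF'], right_on=['Chrom','Pos', 'REF'], how='inner')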

In [None]:
#Number of ACC variants whose position and REF appear in Common_hg38_mult_1
tam_merge_mult = len(base_merge_mult.index)
print(tam_merge_mult)

69


In [None]:
base_merge_mult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 69 entries, 0 to 68
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         69 non-null     int64  
 1   POS                           69 non-null     int64  
 2   ID                            69 non-null     object 
 3   REF                           69 non-null     object 
 4   ALT_x                         69 non-null     object 
 5   QUAL                          69 non-null     object 
 6   FILTER                        69 non-null     object 
 7   FORMAT                        69 non-null     object 
 8   avsnp150                      69 non-null     object 
 9   Interpro_domain               69 non-null     object 
 10  dbNSFP_DEOGEN2_pred           69 non-null     object 
 11  dbNSFP_MetaSVM_pred           69 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  69 non-null     object 
 13  dbNSFP_

In [None]:
base_merge_mult.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT_x,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,ALT_y,COMMON
0,1,1916767,.,A,T,.,PASS,GT:AD:DP,.,EF-hand_domain|EF-hand_domain_pair,"T,.",T,D,T,"D,D",D,D,T,.,"D,.","D,D","T,T","D,D",.,N,T,3.981906e-06,".,.","N,N",D,"T,T","D,.","M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aAc/aTc,p.Asn90Ile,c.269A>T,CALML6,NM_138705.2,4,T,2,269,>,Asn,Ile,90,subst,.,1,1916767,rs200415259,"C,T",0.0
1,1,6428274,.,G,C,.,PASS,GT:AD:DP,rs765867875,Ankyrin_repeat-containing_domain,"T,T,T",D,D,T,".,.,D",D,D,T,8.238e-06,"D,.,D",".,.,D",".,.,T",".,.,D",D,D,D,4.004164e-06,".,.,.",D,D,".,D,D","D,.,D","M,.,M",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Ggc/Cgc,p.Gly115Arg,c.343G>C,ESPN,NM_031475.2,2,C,1,343,>,Gly,Arg,115,subst,.,1,6428274,rs765867875,"A,C",0.0
2,1,19219308,.,G,A,.,PASS,GT:AD:DP,.,"ER_membrane_protein_complex_subunit_1,_C-terminal",".,.,.",.,D,.,".,.,.",.,D,D,.,".,.,.",".,.,.",".,.,.",".,.,.",D,N,D,7.957412e-06,".,.,.","D,D,D",.,".,.,.",".,.,.",".,.,.",STOP_GAINED,HIGH,NONSENSE,Cga/Tga,p.Arg993*,c.2977C>T,EMC1,NM_015047.2,23,A,1,2977,>,Arg,*,993,translation termination,.,1,19219308,rs758888994,"A,C",0.0
3,1,21855807,.,G,A,.,PASS,GT:AD:DP,rs151035968,Immunoglobulin_I-set|Immunoglobulin_subtype|Im...,D,T,N,T,D,T,D,T,1.647e-05,B,T,T,T,N,N,T,1.603952e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr1895Met,c.5684C>T,HSPG2,NM_001291860.1,45,A,2,5684,>,Thr,Met,1895,subst,.,1,21855807,rs151035968,"A,T",0.0
4,1,21865019,.,C,A,.,PASS,GT:AD:DP,.,Laminin_IV,T,T,D,T,N,T,D,T,.,P,D,T,D,N,D,T,5.3786e-06,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gca/Tca,p.Ala1485Ser,c.4453G>T,HSPG2,NM_001291860.1,36,A,1,4453,>,Ala,Ser,1485,subst,.,1,21865019,rs1220618419,"A,T",0.0


In [None]:
#Converting the comma-separated value of the ALT_y field into a list
base_merge_mult["ALT_y"] = base_merge_mult["ALT_y"].apply(lambda x: x.split(","))

In [None]:
print(base_merge_mult[['ALT_x','ALT_y']])

   ALT_x      ALT_y
0      T     [C, T]
1      C     [A, C]
2      A     [A, C]
3      A     [A, T]
4      A     [A, T]
5      A     [G, T]
6      G     [C, G]
7      A     [A, T]
8      G     [A, T]
9      A     [A, T]
10     A     [A, T]
11     A     [A, T]
12     A     [A, T]
13     T     [A, T]
14     A  [A, C, T]
15     A     [A, T]
16     T  [A, C, T]
17     A  [A, C, T]
18     T     [A, C]
19     A  [A, C, T]
20     A     [A, T]
21     T     [C, T]
22     G     [A, T]
23     A     [A, C]
24     T     [A, T]
25     T     [A, T]
26     T     [A, C]
27     A     [G, T]
28     A     [A, C]
29     A     [A, T]
30     C  [A, C, T]
31     T     [A, T]
32     A     [A, C]
33     A     [A, T]
34     T     [A, C]
35     T     [A, T]
36     A     [A, C]
37     A     [A, T]
38     A  [A, C, T]
39     G     [G, T]
40     A  [A, G, T]
41     A     [C, T]
42     T     [A, T]
43     T     [G, T]
44     T     [G, T]
45     C     [A, C]
46     T     [G, T]
47     A     [A, T]
48     T  [A, G, T]


In [None]:
#Generating a new dataframe that only contains the rows where the value of ALT_x (ACC database) is contained in ALT_y (COMMON database)
def find_value_column(row):
    return row.ALT_x in row.ALT_y

base_merge_mult_ok = base_merge_mult[base_merge_mult.apply(find_value_column, axis=1)]
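
An alternative to the row-wise filter, sketched here under the assumption that the multi-allelic table stores its alleles as a comma-separated string, is to give each allele its own row with `explode` and then join on all four key columns. The toy tables and their values are made up.

```python
import pandas as pd

# Toy stand-in for the multi-allelic COMMON table (values are made up).
common_mult = pd.DataFrame({
    "Chrom": [1, 1], "Pos": [10, 20],
    "SNP_ID": ["rs1", "rs2"],
    "REF": ["A", "G"], "ALT": ["C,T", "A,T"],
    "COMMON": [0.0, 1.0],
})
base = pd.DataFrame({"CHROM": [1, 1], "POS": [10, 20],
                     "REF": ["A", "G"], "ALT": ["T", "C"]})

# Split "C,T" into ["C", "T"], then emit one row per allele.
exploded = common_mult.assign(ALT=common_mult["ALT"].str.split(",")).explode("ALT")

# With one allele per row, an exact four-column join replaces the membership test.
matched = base.merge(exploded,
                     left_on=["CHROM", "POS", "REF", "ALT"],
                     right_on=["Chrom", "Pos", "REF", "ALT"],
                     how="inner")
print(matched[["CHROM", "POS", "ALT", "SNP_ID"]])
```

This form avoids the `ALT_x`/`ALT_y` suffixes, at the cost of enlarging the COMMON table before the join.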

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 68
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         60 non-null     int64  
 1   POS                           60 non-null     int64  
 2   ID                            60 non-null     object 
 3   REF                           60 non-null     object 
 4   ALT_x                         60 non-null     object 
 5   QUAL                          60 non-null     object 
 6   FILTER                        60 non-null     object 
 7   FORMAT                        60 non-null     object 
 8   avsnp150                      60 non-null     object 
 9   Interpro_domain               60 non-null     object 
 10  dbNSFP_DEOGEN2_pred           60 non-null     object 
 11  dbNSFP_MetaSVM_pred           60 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  60 non-null     object 
 13  dbNSFP_

In [None]:
base_merge_mult_ok[['ALT_x', 'ALT_y']]

Unnamed: 0,ALT_x,ALT_y
0,T,"[C, T]"
1,C,"[A, C]"
2,A,"[A, C]"
3,A,"[A, T]"
4,A,"[A, T]"
6,G,"[C, G]"
7,A,"[A, T]"
9,A,"[A, T]"
10,A,"[A, T]"
11,A,"[A, T]"


In [None]:
#Renaming the ALT_x column to ALT (reassigning instead of using inplace=True avoids the SettingWithCopyWarning raised on a filtered DataFrame)
base_merge_mult_ok = base_merge_mult_ok.rename(columns={'ALT_x': 'ALT'})


In [None]:
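#Removing the redundant ALT_y field (the columns keyword replaces the positional axis argument, which pandas 2.0 no longer accepts)
base_merge_mult_ok = base_merge_mult_ok.drop(columns='ALT_y')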

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 68
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         60 non-null     int64  
 1   POS                           60 non-null     int64  
 2   ID                            60 non-null     object 
 3   REF                           60 non-null     object 
 4   ALT                           60 non-null     object 
 5   QUAL                          60 non-null     object 
 6   FILTER                        60 non-null     object 
 7   FORMAT                        60 non-null     object 
 8   avsnp150                      60 non-null     object 
 9   Interpro_domain               60 non-null     object 
 10  dbNSFP_DEOGEN2_pred           60 non-null     object 
 11  dbNSFP_MetaSVM_pred           60 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  60 non-null     object 
 13  dbNSFP_

In [None]:
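#Stacking the matches from 3.1 and 3.2 (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
base_ACC_COMMON_MUTECT = pd.concat([base_merge, base_merge_mult_ok], ignore_index=True)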

In [None]:
base_ACC_COMMON_MUTECT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228 entries, 0 to 227
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         228 non-null    int64  
 1   POS                           228 non-null    int64  
 2   ID                            228 non-null    object 
 3   REF                           228 non-null    object 
 4   ALT                           228 non-null    object 
 5   QUAL                          228 non-null    object 
 6   FILTER                        228 non-null    object 
 7   FORMAT                        228 non-null    object 
 8   avsnp150                      228 non-null    object 
 9   Interpro_domain               228 non-null    object 
 10  dbNSFP_DEOGEN2_pred           228 non-null    object 
 11  dbNSFP_MetaSVM_pred           228 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  228 non-null    object 
 13  dbNSF

In [None]:
base_ACC_COMMON_MUTECT.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,1,2027599,.,G,A,.,PASS,GT:AD:DP,.,Neurotransmitter-gated_ion-channel_ligand-bind...,"T,.,.,.,.,.","T,.","D,D","D,.","N,.,.,.,.,.","D,D","T,T","T,T",".,.","D,.,.,.,.,.","T,.,.,.,.,.","T,.,.,.,.,.","T,.,.,.,.,.","D,.","D,D","T,T","4.006442e-06,4.006442e-06",".,.,.,.,.,.","D,D","T,.","D,D,D,D,D,T","D,.,.,.,.,.","N,.,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gac/Aac,p.Asp165Asn,c.493G>A,GABRD,NM_000815.4,5,A,1,493,>,Asp,Asn,165,subst,.,1,2027599,rs1477740666,0.0
1,1,2303896,.,C,T,.,PASS,GT:AD:DP,rs752779978,.,D,D,D,T,D,D,D,D,8.284e-06,P,D,D,T,D,D,D,8.239065e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro423Leu,c.1268C>T,SKI,NM_003036.3,4,T,2,1268,>,Pro,Leu,423,subst,.,1,2303896,rs752779978,0.0
2,1,2385062,.,G,A,.,PASS,GT:AD:DP,rs772157368,.,.,T,N,T,D,.,T,T,2.502e-05,B,.,.,.,.,N,T,.,.,"D,D,D",T,.,P,.,SYNONYMOUS_CODING,LOW,SILENT,aaC/aaT,p.Asn151Asn,c.453C>T,MORN1,NM_024848.2,6,A,3,453,>,Asn,Asn,151,subst,.,1,2385062,rs772157368,0.0
3,1,2591033,.,C,T,.,PASS,GT:AD:DP,rs199926063,"Metallopeptidase,_catalytic_domain|Peptidase_M...",".,T,.",T,N,T,".,N,N",D,T,T,8.264e-05,".,B,.",".,T,T","D,D,D","T,T,T",N,N,T,1.351819e-04,".,.,.","N,N,N,N,N,N,N",T,"T,T,T",".,B,.",".,N,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg766Gln,c.2297G>A,MMEL1,NM_033467.3.2,24,T,2,2297,>,Arg,Gln,766,subst,.,1,2591033,rs199926063,0.0
4,1,3816294,.,G,A,.,PASS,GT:AD:DP,.,.,T,T,N,T,N,T,T,T,.,B,T,T,T,N,N,T,.,.,N,T,T,B,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro883Leu,c.2648C>T,CEP104,NM_014704.3,21,A,2,2648,>,Pro,Leu,883,subst,.,1,3816294,rs1197412379,0.0


###3.2.1 Generating a file with the joined ACC MUTECT and COMMON_01 data

In [None]:
base_ACC_COMMON_MUTECT.to_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_01.csv",sep='\t',index=False)
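
Since the same steps are repeated for each COMMON file, sections 3.1 and 3.2 could be wrapped in a single function. The sketch below is one possible arrangement and not the notebook's own code: `join_with_common`, `KEY_BASE` and `KEY_COMMON` are hypothetical names, and the toy tables hold made-up values.

```python
import pandas as pd

KEY_BASE = ["CHROM", "POS", "REF", "ALT"]
KEY_COMMON = ["Chrom", "Pos", "REF", "ALT"]


def join_with_common(base, common_single, common_mult):
    """Hypothetical wrapper around the joins of sections 3.1 and 3.2."""
    # 3.1: exact match on all four key columns.
    single = base.merge(common_single, left_on=KEY_BASE,
                        right_on=KEY_COMMON, how="inner")

    # 3.2: match on position and REF, then keep rows whose ALT is one of
    # the comma-separated alleles listed in the multi-allelic table.
    mult = base.merge(common_mult, left_on=KEY_BASE[:3],
                      right_on=KEY_COMMON[:3], how="inner")
    keep = pd.Series(
        [x in y.split(",") for x, y in zip(mult["ALT_x"], mult["ALT_y"])],
        index=mult.index, dtype=bool,
    )
    mult = mult[keep].rename(columns={"ALT_x": "ALT"}).drop(columns="ALT_y")

    return pd.concat([single, mult], ignore_index=True)


# Toy tables with made-up values.
base = pd.DataFrame({"CHROM": [1, 1, 2], "POS": [10, 20, 30],
                     "REF": ["A", "G", "C"], "ALT": ["T", "C", "A"]})
common_single = pd.DataFrame({"Chrom": [1], "Pos": [10], "SNP_ID": ["rs1"],
                              "REF": ["A"], "ALT": ["T"], "COMMON": [0.0]})
common_mult = pd.DataFrame({"Chrom": [1, 2], "Pos": [20, 30],
                            "SNP_ID": ["rs2", "rs3"], "REF": ["G", "C"],
                            "ALT": ["A,C", "G,T"], "COMMON": [1.0, 0.0]})

joined = join_with_common(base, common_single, common_mult)
print(joined[["CHROM", "POS", "ALT", "SNP_ID", "COMMON"]])
```

Calling such a function once per pair of COMMON files would leave only the file names to change between sections.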

##3.3 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with the *Common_hg38_2* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)


In [None]:
#Reading Common_hg38_2
import pandas as pd
df_common2 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_2.csv", delimiter='\t')

In [None]:
df_common2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91754536 entries, 0 to 91754535
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 4.1+ GB


In [None]:
#Reading base_ACC
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF/ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   CHROM                         6098 non-null   int64 
 1   POS                           6098 non-null   int64 
 2   ID                            6098 non-null   object
 3   REF                           6098 non-null   object
 4   ALT                           6098 non-null   object
 5   QUAL                          6098 non-null   object
 6   FILTER                        6098 non-null   object
 7   FORMAT                        6098 non-null   object
 8   avsnp150                      6098 non-null   object
 9   Interpro_domain               6098 non-null   object
 10  dbNSFP_DEOGEN2_pred           6098 non-null   object
 11  dbNSFP_MetaSVM_pred           6098 non-null   object
 12  dbNSFP_fathmmMKL_coding_pred  6098 non-null   object
 13  dbNSFP_PrimateAI_p

In [None]:
import pandas as pd
base_merge = pd.merge(base_ACC, df_common2, left_on=['CHROM', 'POS', 'REF', 'ALT'], right_on=['Chrom','Pos', 'REF', 'ALT'], how='inner')

In [None]:
#Number of rows matched by the join
tam_merge = len(base_merge.index)
print(tam_merge)

127
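
Only a small fraction of the 6098 ACC variants match. To inspect which variants matched and which did not, one option is a left join with pandas' `indicator=True`, sketched here on toy data. The frames `acc` and `common` and their values are hypothetical stand-ins for `base_ACC` and `df_common2`.

```python
import pandas as pd

# Toy stand-ins for base_ACC and df_common2 (illustrative values only)
acc = pd.DataFrame({
    "CHROM": [2, 2, 2],
    "POS": [100, 200, 300],
    "REF": ["G", "G", "T"],
    "ALT": ["A", "T", "C"],
})
common = pd.DataFrame({
    "Chrom": [2, 2],
    "Pos": [100, 200],
    "SNP_ID": ["rs1", "rs2"],
    "REF": ["G", "G"],
    "ALT": ["A", "C"],  # second row differs in ALT, so it must not match
    "COMMON": [0.0, 0.0],
})

# Left join keeps every ACC variant; _merge records where each row came from
merged = pd.merge(
    acc, common,
    left_on=["CHROM", "POS", "REF", "ALT"],
    right_on=["Chrom", "Pos", "REF", "ALT"],
    how="left",
    indicator=True,
)

# Rows flagged "both" are exactly those an inner join would return
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(len(matched), "of", len(acc), "variants matched")
```

The rows flagged `left_only` are the variants absent from the COMMON table, which an inner join silently discards.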


In [None]:
base_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 127 entries, 0 to 126
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         127 non-null    int64  
 1   POS                           127 non-null    int64  
 2   ID                            127 non-null    object 
 3   REF                           127 non-null    object 
 4   ALT                           127 non-null    object 
 5   QUAL                          127 non-null    object 
 6   FILTER                        127 non-null    object 
 7   FORMAT                        127 non-null    object 
 8   avsnp150                      127 non-null    object 
 9   Interpro_domain               127 non-null    object 
 10  dbNSFP_DEOGEN2_pred           127 non-null    object 
 11  dbNSFP_MetaSVM_pred           127 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  127 non-null    object 
 13  dbNSF

In [None]:
base_merge.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,2,206305544,.,G,A,.,PASS,GT:AD:DP,rs773248063,.,"T,T,.,.,.,.,T",T,N,T,"N,.,.,.,.,.,.",T,T,T,8.280e-06,"B,B,.,.,.,.,.","T,.,.,.,.,.,.","T,.,.,.,.,.,.","T,.,.,.,.,.,T",N,N,T,4.025279e-06,".,.,.,.,.,.,.",N,T,".,.,.,.,.,.,.","B,B,.,.,.,.,.","N,N,.,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tGt/tAt,p.Cys339Tyr,c.1016G>A,ZDBF2,NM_020923.2,5,A,2,1016,>,Cys,Tyr,339,subst,.,2,206305544,rs773248063,0.0
1,2,206663135,.,G,T,.,PASS,GT:AD:DP,rs776107210,.,T,T,N,T,N,T,D,T,8.264e-06,P,D,T,D,.,N,T,4.014516e-06,.,N,T,T,P,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser467Arg,c.1401C>A,DYTN,NM_001093730.1,11,T,3,1401,>,Ser,Arg,467,subst,.,2,206663135,rs776107210,0.0
2,2,206750962,.,G,A,.,PASS,GT:AD:DP,rs780661298,"Lactate_dehydrogenase/glycoside_hydrolase,_fam...","T,.,.",T,D,T,"D,D,D",D,T,T,3.295e-05,"B,.,B","T,T,T","T,T,T","T,D,T",N,D,T,3.207441e-05,".,.,.","D,D,D,D",T,"T,T,T","P,.,P","M,.,M",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg342Cys,c.1024C>T,MDH1B,NM_001039845.2,6,A,1,1024,>,Arg,Cys,342,subst,.,2,206750962,rs780661298,0.0
3,2,207861250,.,T,C,.,PASS,GT:AD:DP,.,Putative_zinc-RING_and/or_ribbon,T,T,D,T,N,T,D,D,.,D,T,D,D,D,D,D,.,.,"D,D",T,.,D,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Aag/Gag,p.Lys655Glu,c.1963A>G,PLEKHM3,NM_001080475.2,7,C,1,1963,>,Lys,Glu,655,subst,.,2,207861250,rs1291823997,0.0
4,2,211725146,.,G,A,.,PASS,GT:AD:DP,.,Furin-like_cysteine-rich_domain|Growth_factor_...,"T,D,.",T,D,D,".,D,D",D,D,T,.,".,P,B",".,D,D",".,T,T",".,T,T",D,D,T,3.977218e-06,".,.,.","D,D,D",T,"D,D,D",".,P,P",".,M,M",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCt/cTt,p.Pro224Leu,c.671C>T,ERBB4,NM_005235.2,6,A,2,671,>,Pro,Leu,224,subst,.,2,211725146,rs1451769238,0.0


##3.4 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with *Common_hg38_mult_2* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)  


In [None]:
#Reading Common_hg38_mult_2
import pandas as pd
df_common_mult_2 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_mult_2.csv", delimiter='\t')

In [None]:
df_common_mult_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6296498 entries, 0 to 6296497
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 288.2+ MB


In [None]:
import pandas as pd
base_merge_mult = pd.merge(base_ACC, df_common_mult_2, left_on=['CHROM', 'POS', 'REF'], right_on=['Chrom','Pos', 'REF'], how='inner')

In [None]:
#Number of rows matched by the join
tam_merge_mult = len(base_merge_mult.index)
print(tam_merge_mult)

56


In [None]:
base_merge_mult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56 entries, 0 to 55
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         56 non-null     int64  
 1   POS                           56 non-null     int64  
 2   ID                            56 non-null     object 
 3   REF                           56 non-null     object 
 4   ALT_x                         56 non-null     object 
 5   QUAL                          56 non-null     object 
 6   FILTER                        56 non-null     object 
 7   FORMAT                        56 non-null     object 
 8   avsnp150                      56 non-null     object 
 9   Interpro_domain               56 non-null     object 
 10  dbNSFP_DEOGEN2_pred           56 non-null     object 
 11  dbNSFP_MetaSVM_pred           56 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  56 non-null     object 
 13  dbNSFP_

In [None]:
base_merge_mult.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT_x,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,ALT_y,COMMON
0,2,206771265,.,C,T,.,PASS,GT:AD:DP,rs748507111,.,"T,T,T",T,D,T,"D,D,D",D,D,D,.,"D,D,D","D,D,D","T,T,T","D,D,D",D,D,D,1.194334e-05,".,.,.","D,D,D",T,".,.,T","D,D,D","M,M,M",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro322Leu,c.965C>T,FASTKD2,NM_001136193.1,4,T,2,965,>,Pro,Leu,322,subst,.,2,206771265,rs748507111,"G,T",0.0
1,2,228017880,.,C,A,.,PASS,GT:AD:DP,.,.,"T,.",T,N,T,"N,N",T,T,T,.,"B,B","T,T","T,T","T,T",N,N,T,.,".,.","N,N",T,"T,T","B,B","L,L",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtg/Ttg,p.Val992Leu,c.2974G>T,SPHKAP,NM_001142644.1,7,A,1,2974,>,Val,Leu,992,subst,.,2,228017880,rs779563361,"A,T",0.0
2,2,228019250,.,G,T,.,PASS,GT:AD:DP,.,.,"T,.",T,N,T,"N,N",T,T,T,.,"B,B","T,T","T,T","T,T",N,N,T,.,".,.","N,N",T,"T,T","B,B","N,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCa/gAa,p.Ala535Glu,c.1604C>A,SPHKAP,NM_001142644.1,7,T,2,1604,>,Ala,Glu,535,subst,.,2,228019250,rs755247779,"A,T",0.0
3,2,229025969,.,C,T,.,PASS,GT:AD:DP,rs769912855,PH_domain-like|PTB/PI_domain,".,.,.,T",T,D,T,"N,N,N,N",T,D,T,4.118e-05,"D,P,P,D","T,T,T,T",".,.,.,.","T,T,T,D",D,D,D,3.190174e-05,".,.,.,.","D,D,D,D",T,"D,D,D,D","D,D,D,D",".,.,.,M",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg137Gln,c.410G>A,PID1,NM_017933.4,4,T,2,410,>,Arg,Gln,137,subst,.,2,229025969,rs769912855,"A,G,T",0.0
4,2,237372241,.,A,T,.,panel_of_normals,GT:AD:DP,.,"von_Willebrand_factor,_type_A",".,D,.,D,.,.,.",D,D,T,"D,D,D,.,D,D,D",D,D,D,.,"D,D,.,.,D,.,.","D,D,D,.,D,D,D","D,D,D,.,D,D,D","D,D,D,D,D,D,D",N,D,D,.,".,.,.,.,.,.,.","D,D,D,D,D,D,D,D",D,"D,D,D,D,.,D,D","D,D,.,.,D,.,.",".,M,.,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gTt/gAt,p.Val1259Asp,c.3776T>A,COL6A3,NM_004369.3,9,T,2,3776,>,Val,Asp,1259,subst,.,2,237372241,rs747174329,"C,G",0.0


In [None]:
#Converting the value of the ALT_y field into a list
base_merge_mult["ALT_y"] = base_merge_mult["ALT_y"].apply(lambda x: x.split(","))

In [None]:
print(base_merge_mult[['ALT_x','ALT_y']])

   ALT_x      ALT_y
0      T     [G, T]
1      A     [A, T]
2      T     [A, T]
3      T  [A, G, T]
4      T     [C, G]
5      T     [G, T]
6      T     [A, T]
7      A     [A, T]
8      C     [C, T]
9      T     [A, T]
10     T  [A, G, T]
11     C     [A, C]
12     A  [A, C, T]
13     T  [A, C, T]
14     G  [C, G, T]
15     C     [C, G]
16     T  [A, G, T]
17     T     [G, T]
18     A     [A, T]
19     A     [G, T]
20     T     [A, T]
21     A     [A, C]
22     T     [A, T]
23     T     [A, T]
24     A     [A, T]
25     T  [A, G, T]
26     T  [A, G, T]
27     A  [A, C, T]
28     A     [A, C]
29     A  [A, G, T]
30     T     [A, T]
31     G     [G, T]
32     T     [A, T]
33     A     [A, C]
34     T     [A, C]
35     C     [A, T]
36     T     [G, T]
37     T     [A, T]
38     T     [A, T]
39     T  [A, G, T]
40     T     [G, T]
41     A     [A, C]
42     A     [G, T]
43     T     [A, C]
44     A  [A, G, T]
45     C     [A, C]
46     A     [A, C]
47     T     [C, T]
48     A     [A, T]


In [None]:
#Generating a new dataframe that contains only the rows where the value of ALT_x (ACC database) is among the alleles listed in ALT_y (COMMON database)
def find_value_column(row):
    return row.ALT_x in row.ALT_y

base_merge_mult_ok = base_merge_mult[base_merge_mult.apply(find_value_column, axis=1)]
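
The same membership filter can be written without a row-wise `apply`. The sketch below assumes `ALT_y` still holds the original comma-separated string (that is, before it is converted into a list) and uses a toy frame `df` in place of `base_merge_mult`.

```python
import pandas as pd

# Toy stand-in for base_merge_mult with ALT_y as a comma-separated string
df = pd.DataFrame({
    "ALT_x": ["T", "T", "A"],
    "ALT_y": ["G,T", "C,G", "A,C,T"],
})

# Split ALT_y so whole alleles are compared, not substrings
keep = pd.Series(
    [x in y.split(",") for x, y in zip(df["ALT_x"], df["ALT_y"])],
    index=df.index,
    dtype=bool,
)

# .copy() makes the result an independent frame rather than a slice
df_ok = df[keep].copy()
print(df_ok)
```

Because `df_ok` is an explicit copy, later in-place edits on it do not trigger pandas' `SettingWithCopyWarning`.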

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 54
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         49 non-null     int64  
 1   POS                           49 non-null     int64  
 2   ID                            49 non-null     object 
 3   REF                           49 non-null     object 
 4   ALT_x                         49 non-null     object 
 5   QUAL                          49 non-null     object 
 6   FILTER                        49 non-null     object 
 7   FORMAT                        49 non-null     object 
 8   avsnp150                      49 non-null     object 
 9   Interpro_domain               49 non-null     object 
 10  dbNSFP_DEOGEN2_pred           49 non-null     object 
 11  dbNSFP_MetaSVM_pred           49 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  49 non-null     object 
 13  dbNSFP_

In [None]:
#Rename the ALT_x column to ALT (reassigning instead of using inplace=True, because base_merge_mult_ok is a filtered slice of base_merge_mult)
base_merge_mult_ok = base_merge_mult_ok.rename(columns={'ALT_x': 'ALT'})


In [None]:
#Remove the redundant ALT_y field
base_merge_mult_ok = base_merge_mult_ok.drop(columns='ALT_y')

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49 entries, 0 to 54
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         49 non-null     int64  
 1   POS                           49 non-null     int64  
 2   ID                            49 non-null     object 
 3   REF                           49 non-null     object 
 4   ALT                           49 non-null     object 
 5   QUAL                          49 non-null     object 
 6   FILTER                        49 non-null     object 
 7   FORMAT                        49 non-null     object 
 8   avsnp150                      49 non-null     object 
 9   Interpro_domain               49 non-null     object 
 10  dbNSFP_DEOGEN2_pred           49 non-null     object 
 11  dbNSFP_MetaSVM_pred           49 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  49 non-null     object 
 13  dbNSFP_

In [None]:
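#Concatenating the matches from Common_hg38_2 (base_merge) with the matches from Common_hg38_mult_2 (base_merge_mult_ok)
base_ACC_COMMON2_MUTECT = pd.concat([base_merge, base_merge_mult_ok], ignore_index=True)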

In [None]:
base_ACC_COMMON2_MUTECT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 176 entries, 0 to 175
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         176 non-null    int64  
 1   POS                           176 non-null    int64  
 2   ID                            176 non-null    object 
 3   REF                           176 non-null    object 
 4   ALT                           176 non-null    object 
 5   QUAL                          176 non-null    object 
 6   FILTER                        176 non-null    object 
 7   FORMAT                        176 non-null    object 
 8   avsnp150                      176 non-null    object 
 9   Interpro_domain               176 non-null    object 
 10  dbNSFP_DEOGEN2_pred           176 non-null    object 
 11  dbNSFP_MetaSVM_pred           176 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  176 non-null    object 
 13  dbNSF

In [None]:
base_ACC_COMMON2_MUTECT.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,2,206305544,.,G,A,.,PASS,GT:AD:DP,rs773248063,.,"T,T,.,.,.,.,T",T,N,T,"N,.,.,.,.,.,.",T,T,T,8.280e-06,"B,B,.,.,.,.,.","T,.,.,.,.,.,.","T,.,.,.,.,.,.","T,.,.,.,.,.,T",N,N,T,4.025279e-06,".,.,.,.,.,.,.",N,T,".,.,.,.,.,.,.","B,B,.,.,.,.,.","N,N,.,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tGt/tAt,p.Cys339Tyr,c.1016G>A,ZDBF2,NM_020923.2,5,A,2,1016,>,Cys,Tyr,339,subst,.,2,206305544,rs773248063,0.0
1,2,206663135,.,G,T,.,PASS,GT:AD:DP,rs776107210,.,T,T,N,T,N,T,D,T,8.264e-06,P,D,T,D,.,N,T,4.014516e-06,.,N,T,T,P,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,agC/agA,p.Ser467Arg,c.1401C>A,DYTN,NM_001093730.1,11,T,3,1401,>,Ser,Arg,467,subst,.,2,206663135,rs776107210,0.0
2,2,206750962,.,G,A,.,PASS,GT:AD:DP,rs780661298,"Lactate_dehydrogenase/glycoside_hydrolase,_fam...","T,.,.",T,D,T,"D,D,D",D,T,T,3.295e-05,"B,.,B","T,T,T","T,T,T","T,D,T",N,D,T,3.207441e-05,".,.,.","D,D,D,D",T,"T,T,T","P,.,P","M,.,M",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgc/Tgc,p.Arg342Cys,c.1024C>T,MDH1B,NM_001039845.2,6,A,1,1024,>,Arg,Cys,342,subst,.,2,206750962,rs780661298,0.0
3,2,207861250,.,T,C,.,PASS,GT:AD:DP,.,Putative_zinc-RING_and/or_ribbon,T,T,D,T,N,T,D,D,.,D,T,D,D,D,D,D,.,.,"D,D",T,.,D,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Aag/Gag,p.Lys655Glu,c.1963A>G,PLEKHM3,NM_001080475.2,7,C,1,1963,>,Lys,Glu,655,subst,.,2,207861250,rs1291823997,0.0
4,2,211725146,.,G,A,.,PASS,GT:AD:DP,.,Furin-like_cysteine-rich_domain|Growth_factor_...,"T,D,.",T,D,D,".,D,D",D,D,T,.,".,P,B",".,D,D",".,T,T",".,T,T",D,D,T,3.977218e-06,".,.,.","D,D,D",T,"D,D,D",".,P,P",".,M,M",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCt/cTt,p.Pro224Leu,c.671C>T,ERBB4,NM_005235.2,6,A,2,671,>,Pro,Leu,224,subst,.,2,211725146,rs1451769238,0.0


###3.4.1 Generating a file with the ACC Mutect and COMMON_02 database

In [None]:
base_ACC_COMMON2_MUTECT.to_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_02.csv",sep='\t',index=False)
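
The steps of sections 3.3 and 3.4 are the same for every COMMON chunk, so they could be wrapped in a single function. The following is a sketch only: `join_with_common` is a hypothetical helper, the toy frames stand in for `base_ACC`, `Common_hg38_2` and `Common_hg38_mult_2`, and `ALT` in the multi-allelic table is assumed to be a comma-separated string.

```python
import pandas as pd

KEYS_ACC = ["CHROM", "POS", "REF"]
KEYS_COMMON = ["Chrom", "Pos", "REF"]

def join_with_common(base, common, common_mult):
    """Join variants with one COMMON chunk and its multi-allelic companion."""
    # Exact match on chromosome, position, REF and ALT
    single = pd.merge(base, common,
                      left_on=KEYS_ACC + ["ALT"],
                      right_on=KEYS_COMMON + ["ALT"],
                      how="inner")
    # Multi-allelic table: match on chromosome, position and REF only
    multi = pd.merge(base, common_mult,
                     left_on=KEYS_ACC, right_on=KEYS_COMMON, how="inner")
    # Keep rows whose ACC allele (ALT_x) is one of the COMMON alleles (ALT_y)
    keep = pd.Series(
        [x in y.split(",") for x, y in zip(multi["ALT_x"], multi["ALT_y"])],
        index=multi.index, dtype=bool)
    multi = (multi[keep]
             .rename(columns={"ALT_x": "ALT"})
             .drop(columns="ALT_y"))
    return pd.concat([single, multi], ignore_index=True)

# Toy data (illustrative values only)
base = pd.DataFrame({"CHROM": [2, 2, 2], "POS": [100, 200, 300],
                     "REF": ["G", "C", "A"], "ALT": ["A", "T", "T"]})
common = pd.DataFrame({"Chrom": [2], "Pos": [100], "SNP_ID": ["rs1"],
                       "REF": ["G"], "ALT": ["A"], "COMMON": [0.0]})
common_mult = pd.DataFrame({"Chrom": [2, 2], "Pos": [200, 300],
                            "SNP_ID": ["rs2", "rs3"], "REF": ["C", "A"],
                            "ALT": ["G,T", "C,G"], "COMMON": [0.0, 0.0]})

result = join_with_common(base, common, common_mult)
print(result[["CHROM", "POS", "REF", "ALT", "SNP_ID"]])
```

With such a helper, each chunk would need only its two `read_csv` calls, one call to the function, and one `to_csv`.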

##3.5 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with *Common_hg38_3* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)  


In [None]:
#Reading Common_hg38_3
import pandas as pd
df_common3 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_3.csv", delimiter='\t')

In [None]:
df_common3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92264444 entries, 0 to 92264443
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 4.1+ GB


In [None]:
#Reading base_ACC
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF/ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   CHROM                         6098 non-null   int64 
 1   POS                           6098 non-null   int64 
 2   ID                            6098 non-null   object
 3   REF                           6098 non-null   object
 4   ALT                           6098 non-null   object
 5   QUAL                          6098 non-null   object
 6   FILTER                        6098 non-null   object
 7   FORMAT                        6098 non-null   object
 8   avsnp150                      6098 non-null   object
 9   Interpro_domain               6098 non-null   object
 10  dbNSFP_DEOGEN2_pred           6098 non-null   object
 11  dbNSFP_MetaSVM_pred           6098 non-null   object
 12  dbNSFP_fathmmMKL_coding_pred  6098 non-null   object
 13  dbNSFP_PrimateAI_p

In [None]:
import pandas as pd
base_merge = pd.merge(base_ACC, df_common3, left_on=['CHROM', 'POS', 'REF', 'ALT'], right_on=['Chrom','Pos', 'REF', 'ALT'], how='inner')

In [None]:
#Number of rows matched by the join
tam_merge = len(base_merge.index)
print(tam_merge)

151


In [None]:
base_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 151 entries, 0 to 150
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         151 non-null    int64  
 1   POS                           151 non-null    int64  
 2   ID                            151 non-null    object 
 3   REF                           151 non-null    object 
 4   ALT                           151 non-null    object 
 5   QUAL                          151 non-null    object 
 6   FILTER                        151 non-null    object 
 7   FORMAT                        151 non-null    object 
 8   avsnp150                      151 non-null    object 
 9   Interpro_domain               151 non-null    object 
 10  dbNSFP_DEOGEN2_pred           151 non-null    object 
 11  dbNSFP_MetaSVM_pred           151 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  151 non-null    object 
 13  dbNSF

In [None]:
base_merge.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,5,5146177,.,G,A,.,PASS,GT:AD:DP,.,"Peptidase_M12B,_propeptide",".,T",T,N,T,"N,N",T,T,T,.,"B,B","T,T","T,T","T,T",N,N,T,4.01062e-06,".,.","D,D",T,"T,T","P,P",".,L",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtg/Atg,p.Val75Met,c.223G>A,ADAMTS16,NM_139056.2,3,A,1,223,>,Val,Met,75,subst,.,5,5146177,rs1158314132,0.0
1,5,5460853,.,G,T,.,PASS,GT:AD:DP,.,.,T,T,N,T,N,T,D,T,.,B,D,T,T,N,N,T,4.016484e-06,.,N,T,T,P,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Ggt/Tgt,p.Gly507Cys,c.1519G>T,ICE1,NM_015325.2,13,T,1,1519,>,Gly,Cys,507,subst,.,5,5460853,rs1322544389,0.0
2,5,7802264,.,C,T,.,PASS,GT:AD:DP,.,Adenylyl_cyclase_class-3/4/guanylyl_cyclase|Nu...,D,T,D,D,D,D,D,D,.,D,D,T,D,D,D,D,3.978706e-06,.,"D,D",T,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tCc/tTc,p.Ser892Phe,c.2675C>T,ADCY2,NM_020546.2,21,T,2,2675,>,Ser,Phe,892,subst,.,5,7802264,rs1358146800,0.0
3,5,10992622,.,C,T,.,PASS,GT:AD:DP,rs367931998,.,"T,T,T,T",D,D,T,"N,N,.,N",D,D,D,.,"P,.,.,D","T,T,.,T","T,T,.,T","T,D,D,T",D,D,D,3.978168e-06,".,.,.,.","D,D,D,D,D",D,"D,D,D,D","D,.,.,D","M,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg1047Gln,c.3140G>A,CTNND2,NM_001332.3,19,T,2,3140,>,Arg,Gln,1047,subst,.,5,10992622,rs367931998,0.0
4,5,13753235,.,G,A,.,PASS,GT:AD:DP,.,"Dynein_heavy_chain,_ATP-binding_dynein_motor_r...",.,.,D,.,.,.,.,D,.,.,.,.,.,D,D,D,4.003331e-06,.,A,.,.,.,.,STOP_GAINED+SPLICE_SITE_REGION,HIGH,NONSENSE,Cag/Tag,p.Gln3624*,c.10870C>T,DNAH5,NM_001369.2,63,A,1,10870,>,Gln,*,3624,translation termination,.,5,13753235,rs1295167678,0.0


##3.6 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with *Common_hg38_mult_3* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)  


In [None]:
#Reading Common_hg38_mult_3
import pandas as pd
df_common_mult_3 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_mult_3.csv", delimiter='\t')

In [None]:
df_common_mult_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6387736 entries, 0 to 6387735
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 292.4+ MB


In [None]:
import pandas as pd
base_merge_mult = pd.merge(base_ACC, df_common_mult_3, left_on=['CHROM', 'POS', 'REF'], right_on=['Chrom','Pos', 'REF'], how='inner')

In [None]:
#Number of rows matched by the join
tam_merge_mult = len(base_merge_mult.index)
print(tam_merge_mult)

58


In [None]:
base_merge_mult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 58 entries, 0 to 57
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         58 non-null     int64  
 1   POS                           58 non-null     int64  
 2   ID                            58 non-null     object 
 3   REF                           58 non-null     object 
 4   ALT_x                         58 non-null     object 
 5   QUAL                          58 non-null     object 
 6   FILTER                        58 non-null     object 
 7   FORMAT                        58 non-null     object 
 8   avsnp150                      58 non-null     object 
 9   Interpro_domain               58 non-null     object 
 10  dbNSFP_DEOGEN2_pred           58 non-null     object 
 11  dbNSFP_MetaSVM_pred           58 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  58 non-null     object 
 13  dbNSFP_

In [None]:
base_merge_mult.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT_x,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,ALT_y,COMMON
0,5,1089068,.,C,T,.,PASS,GT:AD:DP,rs372645005,Amino_acid_permease/_SLC12A_domain,D,D,D,T,N,D,D,T,1.648e-05,D,D,D,T,D,D,D,7.985817e-06,.,D,D,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtc/Atc,p.Val135Ile,c.403G>A,SLC12A7,NM_006598.2.2,4,T,1,403,>,Val,Ile,135,subst,.,5,1089068,rs372645005,"A,T",0.0
1,5,7831939,.,G,A,.,PASS,GT:AD:DP,rs202097565,.,"T,.",T,N,.,"D,D",T,T,T,1.075e-04,"P,.","D,D","T,T","D,D",.,N,T,1.402502e-04,".,.","N,N",T,"T,T","D,.","M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgt/Tgt,p.Arg119Cys,c.355C>T,C5orf49,NM_001089584.2,3,A,1,355,>,Arg,Cys,119,subst,.,5,7831939,rs202097565,"A,T",0.0
2,5,9154671,.,C,A,.,panel_of_normals,GT:AD:DP,.,Sema_domain|WD40/YVTN_repeat-like-containing_d...,T,T,D,T,N,T,D,T,.,B,T,T,T,D,N,T,.,.,D,T,D,B,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cTg,p.Arg433Leu,c.1298G>T,SEMA5A,NM_003966.2,12,A,2,1298,>,Arg,Leu,433,subst,.,5,9154671,rs138343991,"A,T",0.0
3,5,10681140,.,C,T,.,PASS,GT:AD:DP,rs370396587,.,.,T,N,.,N,T,T,T,8.277e-05,B,D,.,D,N,N,T,6.647008e-05,.,"D,N",T,T,B,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cAc,p.Arg61His,c.182G>A,DAP,NM_001291963.1,3,T,2,182,>,Arg,His,61,subst,.,5,10681140,rs370396587,"A,T",0.0
4,5,31526869,.,G,A,.,PASS,GT:AD:DP,.,.,".,.,.,.,.",.,D,.,".,.,.,.,.",.,.,D,.,".,.,.,.,.",".,.,.,.,.",".,.,.,.,.",".,.,.,.,.",N,N,D,.,".,.,.,.,.","A,A,A,A",.,".,.,.,.,.",".,.,.,.,.",".,.,.,.,.",STOP_GAINED,HIGH,NONSENSE,Cga/Tga,p.Arg22*,c.64C>T,DROSHA,NM_013235.4,4,A,1,64,>,Arg,*,22,translation termination,.,5,31526869,rs1485879154,"A,C",0.0


In [None]:
#Converting the value of the ALT_y field into a list
base_merge_mult["ALT_y"] = base_merge_mult["ALT_y"].apply(lambda x: x.split(","))

In [None]:
print(base_merge_mult[['ALT_x','ALT_y']])

   ALT_x      ALT_y
0      T     [A, T]
1      A     [A, T]
2      A     [A, T]
3      T     [A, T]
4      A     [A, C]
5      T     [A, T]
6      C     [A, C]
7      T     [G, T]
8      A     [A, T]
9      T     [C, T]
10     T     [A, C]
11     A     [C, T]
12     G     [G, T]
13     T     [C, T]
14     A     [A, T]
15     G     [A, G]
16     A     [A, T]
17     T  [A, G, T]
18     T     [C, G]
19     T     [A, T]
20     A     [A, C]
21     A     [A, C]
22     A     [A, T]
23     T     [G, T]
24     T     [G, T]
25     A     [A, C]
26     T     [A, T]
27     T     [A, C]
28     C     [G, T]
29     G     [C, G]
30     T     [A, T]
31     A     [A, C]
32     A     [A, T]
33     A  [A, G, T]
34     G  [A, G, T]
35     G     [A, T]
36     T     [A, C]
37     G     [G, T]
38     T     [G, T]
39     A     [A, T]
40     A     [A, C]
41     A     [A, C]
42     C     [C, G]
43     A     [A, T]
44     G     [A, T]
45     T     [A, T]
46     A     [A, T]
47     C     [A, C]
48     T     [A, T]


In [None]:
#Generating a new dataframe that contains only the rows where the value of ALT_x (ACC database) is among the alleles listed in ALT_y (COMMON database)
def find_value_column(row):
    return row.ALT_x in row.ALT_y

base_merge_mult_ok = base_merge_mult[base_merge_mult.apply(find_value_column, axis=1)]

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48 entries, 0 to 57
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         48 non-null     int64  
 1   POS                           48 non-null     int64  
 2   ID                            48 non-null     object 
 3   REF                           48 non-null     object 
 4   ALT_x                         48 non-null     object 
 5   QUAL                          48 non-null     object 
 6   FILTER                        48 non-null     object 
 7   FORMAT                        48 non-null     object 
 8   avsnp150                      48 non-null     object 
 9   Interpro_domain               48 non-null     object 
 10  dbNSFP_DEOGEN2_pred           48 non-null     object 
 11  dbNSFP_MetaSVM_pred           48 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  48 non-null     object 
 13  dbNSFP_

In [None]:
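#Rename the ALT_x column to ALT (reassigning instead of using inplace=True, because base_merge_mult_ok is a filtered slice of base_merge_mult)
base_merge_mult_ok = base_merge_mult_ok.rename(columns={'ALT_x': 'ALT'})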

In [None]:
#Remove the redundant ALT_y field
base_merge_mult_ok = base_merge_mult_ok.drop(columns='ALT_y')

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48 entries, 0 to 57
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         48 non-null     int64  
 1   POS                           48 non-null     int64  
 2   ID                            48 non-null     object 
 3   REF                           48 non-null     object 
 4   ALT                           48 non-null     object 
 5   QUAL                          48 non-null     object 
 6   FILTER                        48 non-null     object 
 7   FORMAT                        48 non-null     object 
 8   avsnp150                      48 non-null     object 
 9   Interpro_domain               48 non-null     object 
 10  dbNSFP_DEOGEN2_pred           48 non-null     object 
 11  dbNSFP_MetaSVM_pred           48 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  48 non-null     object 
 13  dbNSFP_

In [None]:
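#Concatenating the matches from Common_hg38_3 (base_merge) with the matches from Common_hg38_mult_3 (base_merge_mult_ok)
base_ACC_COMMON3_MUTECT = pd.concat([base_merge, base_merge_mult_ok], ignore_index=True)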

In [None]:
base_ACC_COMMON3_MUTECT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199 entries, 0 to 198
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         199 non-null    int64  
 1   POS                           199 non-null    int64  
 2   ID                            199 non-null    object 
 3   REF                           199 non-null    object 
 4   ALT                           199 non-null    object 
 5   QUAL                          199 non-null    object 
 6   FILTER                        199 non-null    object 
 7   FORMAT                        199 non-null    object 
 8   avsnp150                      199 non-null    object 
 9   Interpro_domain               199 non-null    object 
 10  dbNSFP_DEOGEN2_pred           199 non-null    object 
 11  dbNSFP_MetaSVM_pred           199 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  199 non-null    object 
 13  dbNSF

In [None]:
base_ACC_COMMON3_MUTECT.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,5,5146177,.,G,A,.,PASS,GT:AD:DP,.,"Peptidase_M12B,_propeptide",".,T",T,N,T,"N,N",T,T,T,.,"B,B","T,T","T,T","T,T",N,N,T,4.01062e-06,".,.","D,D",T,"T,T","P,P",".,L",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtg/Atg,p.Val75Met,c.223G>A,ADAMTS16,NM_139056.2,3,A,1,223,>,Val,Met,75,subst,.,5,5146177,rs1158314132,0.0
1,5,5460853,.,G,T,.,PASS,GT:AD:DP,.,.,T,T,N,T,N,T,D,T,.,B,D,T,T,N,N,T,4.016484e-06,.,N,T,T,P,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Ggt/Tgt,p.Gly507Cys,c.1519G>T,ICE1,NM_015325.2,13,T,1,1519,>,Gly,Cys,507,subst,.,5,5460853,rs1322544389,0.0
2,5,7802264,.,C,T,.,PASS,GT:AD:DP,.,Adenylyl_cyclase_class-3/4/guanylyl_cyclase|Nu...,D,T,D,D,D,D,D,D,.,D,D,T,D,D,D,D,3.978706e-06,.,"D,D",T,D,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tCc/tTc,p.Ser892Phe,c.2675C>T,ADCY2,NM_020546.2,21,T,2,2675,>,Ser,Phe,892,subst,.,5,7802264,rs1358146800,0.0
3,5,10992622,.,C,T,.,PASS,GT:AD:DP,rs367931998,.,"T,T,T,T",D,D,T,"N,N,.,N",D,D,D,.,"P,.,.,D","T,T,.,T","T,T,.,T","T,D,D,T",D,D,D,3.978168e-06,".,.,.,.","D,D,D,D,D",D,"D,D,D,D","D,.,.,D","M,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg1047Gln,c.3140G>A,CTNND2,NM_001332.3,19,T,2,3140,>,Arg,Gln,1047,subst,.,5,10992622,rs367931998,0.0
4,5,13753235,.,G,A,.,PASS,GT:AD:DP,.,"Dynein_heavy_chain,_ATP-binding_dynein_motor_r...",.,.,D,.,.,.,.,D,.,.,.,.,.,D,D,D,4.003331e-06,.,A,.,.,.,.,STOP_GAINED+SPLICE_SITE_REGION,HIGH,NONSENSE,Cag/Tag,p.Gln3624*,c.10870C>T,DNAH5,NM_001369.2,63,A,1,10870,>,Gln,*,3624,translation termination,.,5,13753235,rs1295167678,0.0


###3.6.1 Generating a file with the ACC Mutect and COMMON_03 database

In [None]:
base_ACC_COMMON3_MUTECT.to_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_03.csv",sep='\t',index=False)

##3.7 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with *Common_hg38_4* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)  


In [None]:
#Reading Common_hg38_4
import pandas as pd
df_common4 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_4.csv", delimiter='\t')

In [None]:
df_common4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 91331295 entries, 0 to 91331294
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 4.1+ GB
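
The COMMON table above occupies more than 4 GB once loaded. A minimal sketch of one way to lower that footprint, assuming the file can be streamed in chunks and that only positions present in the cancer table are needed for the join. The toy data, the `wanted` frame, and the narrow dtypes are illustrative assumptions, not part of the original notebook.

```python
import io
import pandas as pd

# Toy stand-in for a COMMON file (tab-separated, same column names as Common_hg38_4.csv).
# The rows are illustrative only.
common_tsv = io.StringIO(
    "Chrom\tPos\tSNP_ID\tREF\tALT\tCOMMON\n"
    "10\t387755\trs935304052\tC\tT\t0.0\n"
    "10\t999999\trs_toy\tA\tG\t1.0\n"
    "5\t5146177\trs1158314132\tG\tA\t0.0\n"
)

# Assumption: chromosomes are coded as small integers and positions fit in 32 bits.
dtypes = {"Chrom": "int8", "Pos": "int32", "COMMON": "float32"}

# Positions present in the (toy) cancer table; only these can match in the join.
wanted = pd.DataFrame({"Chrom": [10, 5], "Pos": [387755, 5146177]})
wanted = wanted.astype({"Chrom": "int8", "Pos": "int32"})

kept = []
for chunk in pd.read_csv(common_tsv, sep="\t", dtype=dtypes, chunksize=2):
    # Keep only the rows whose (Chrom, Pos) also occurs in the cancer table.
    kept.append(chunk.merge(wanted, on=["Chrom", "Pos"], how="inner"))

common_small = pd.concat(kept, ignore_index=True)
print(common_small)
```

With a real file, `chunksize` would be much larger (for example a few million rows); the value 2 is only there to exercise the loop on the toy input.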


In [None]:
#Reading base_ACC
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF/ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   CHROM                         6098 non-null   int64 
 1   POS                           6098 non-null   int64 
 2   ID                            6098 non-null   object
 3   REF                           6098 non-null   object
 4   ALT                           6098 non-null   object
 5   QUAL                          6098 non-null   object
 6   FILTER                        6098 non-null   object
 7   FORMAT                        6098 non-null   object
 8   avsnp150                      6098 non-null   object
 9   Interpro_domain               6098 non-null   object
 10  dbNSFP_DEOGEN2_pred           6098 non-null   object
 11  dbNSFP_MetaSVM_pred           6098 non-null   object
 12  dbNSFP_fathmmMKL_coding_pred  6098 non-null   object
 13  dbNSFP_PrimateAI_p

In [None]:
import pandas as pd
base_merge = pd.merge(base_ACC, df_common4, left_on=['CHROM', 'POS', 'REF', 'ALT'], right_on=['Chrom','Pos', 'REF', 'ALT'], how='inner')
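
To illustrate what this exact join keeps, here is a small sketch on toy frames. The column subset and the mismatching ALT in the second COMMON row are assumptions made for the example. Because *REF* and *ALT* have the same name on both sides, they appear once in the result, while *Chrom* and *Pos* are added next to *CHROM* and *POS*; this is consistent with the 51 columns of base_ACC becoming 55 after the merge.

```python
import pandas as pd

# Toy cancer table (subset of the real columns).
acc = pd.DataFrame({
    "CHROM": [10, 10],
    "POS": [387755, 5906382],
    "REF": ["C", "A"],
    "ALT": ["T", "G"],
    "Gene_EFF": ["DIP2C", "FBXO18"],
})

# Toy COMMON table; the second row has a different ALT on purpose.
common = pd.DataFrame({
    "Chrom": [10, 10],
    "Pos": [387755, 5906382],
    "SNP_ID": ["rs935304052", "rs_toy"],
    "REF": ["C", "A"],
    "ALT": ["T", "C"],
    "COMMON": [0.0, 0.0],
})

# Inner join on all four keys: a row survives only if chromosome,
# position, reference allele, and alternative allele all agree.
merged = pd.merge(
    acc, common,
    left_on=["CHROM", "POS", "REF", "ALT"],
    right_on=["Chrom", "Pos", "REF", "ALT"],
    how="inner",
)
print(merged)
```

Only the first variant is kept, since the second one differs in *ALT*.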

In [None]:
#Number of rows matched by the exact join
tam_merge = len(base_merge.index)
print(tam_merge)

159


In [None]:
base_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159 entries, 0 to 158
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         159 non-null    int64  
 1   POS                           159 non-null    int64  
 2   ID                            159 non-null    object 
 3   REF                           159 non-null    object 
 4   ALT                           159 non-null    object 
 5   QUAL                          159 non-null    object 
 6   FILTER                        159 non-null    object 
 7   FORMAT                        159 non-null    object 
 8   avsnp150                      159 non-null    object 
 9   Interpro_domain               159 non-null    object 
 10  dbNSFP_DEOGEN2_pred           159 non-null    object 
 11  dbNSFP_MetaSVM_pred           159 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  159 non-null    object 
 13  dbNSF

In [None]:
base_merge.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,10,387755,.,C,T,.,PASS,GT:AD:DP,rs935304052,AMP-dependent_synthetase/ligase,".,T,T",T,D,D,".,D,.",D,D,D,.,".,P,.",".,D,.",".,T,.","D,D,D",D,D,T,.,".,.,.","D,D",T,"D,D,D",".,P,.",".,L,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gGc/gAc,p.Gly551Asp,c.1652G>A,DIP2C,NM_014974.2,14,T,2,1652,>,Gly,Asp,551,subst,.,10,387755,rs935304052,0.0
1,10,5097467,rs117377088,C,T,.,panel_of_normals,GT:AD:DP,rs782363753,NADP-dependent_oxidoreductase_domain,".,.,.,.",.,N,.,".,.,.,.",.,.,D,1.071e-04,".,.,.,.",".,.,.,.",".,.,.,.",".,.,.,.",N,N,D,1.153907e-04,".,.,.,.","A,D",.,".,.,.,.",".,.,.,.",".,.,.,.",STOP_GAINED,HIGH,NONSENSE,Cga/Tga,p.Arg96*,c.286C>T,AKR1C3,NM_001253908.1,3,T,1,286,>,Arg,*,96,translation termination,.,10,5097467,rs782363753,0.0
2,10,5456150,.,C,T,.,PASS,GT:AD:DP,rs139037982,Dbl_homology_(DH)_domain\x3bPH_domain-like|Ple...,"D,.",T,D,T,"D,D",D,D,D,8.236e-06,"D,.","D,D","T,T","D,D",D,D,T,3.977029e-06,".,.","D,D,D",T,"D,D","D,.","M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgg/Tgg,p.Arg421Trp,c.1261C>T,NET1,NM_001047160.2,11,T,1,1261,>,Arg,Trp,421,subst,.,10,5456150,rs139037982,0.0
3,10,5906382,.,A,G,.,PASS,GT:AD:DP,rs762146468,F-box_domain,"T,T,.",T,N,T,"N,N,N",T,T,T,8.236e-06,".,B,B","D,T,T",".,.,.","T,.,T",N,N,T,3.979973e-06,".,.,.","N,N",T,"T,T,T",".,B,B",".,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gAc/gGc,p.Asp219Gly,c.656A>G,FBXO18,NM_032807.4,4,G,2,656,>,Asp,Gly,219,subst,.,10,5906382,rs762146468,0.0
4,10,7727031,.,C,T,.,PASS,GT:AD:DP,rs545967088,"von_Willebrand_factor,_type_A",".,.",.,D,.,".,.",.,.,D,1.647e-05,".,.",".,.",".,.",".,.",D,D,D,1.591115e-05,".,.","A,A",.,".,.",".,.",".,.",STOP_GAINED,HIGH,NONSENSE,Cga/Tga,p.Arg356*,c.1066C>T,ITIH2,NM_002216.2,10,T,1,1066,>,Arg,*,356,translation termination,.,10,7727031,rs545967088,0.0


##3.8 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*) with *Common_hg38_mult_4* table (through fields *Chrom*, *Pos*, *REF*), then checking that *ALT* is among the alleles listed in the COMMON *ALT* field


In [None]:
#Reading Common_hg38_mult_4
import pandas as pd
df_common_mult_4 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_mult_4.csv", delimiter='\t')

In [None]:
df_common_mult_4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6837596 entries, 0 to 6837595
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 313.0+ MB


In [None]:
import pandas as pd
base_merge_mult = pd.merge(base_ACC, df_common_mult_4, left_on=['CHROM', 'POS', 'REF'], right_on=['Chrom','Pos', 'REF'], how='inner')
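
Because *ALT* is not a join key here, pandas keeps both *ALT* columns and renames them *ALT_x* (cancer table) and *ALT_y* (COMMON table). A possible alternative, shown as a sketch on toy frames, is to pass `suffixes` so that the cancer-side column keeps its name and no later rename is needed. The name `ALT_common` is a hypothetical choice for this example, not something the notebook uses.

```python
import pandas as pd

# Toy cancer table (subset of the real columns).
acc = pd.DataFrame({
    "CHROM": [10, 10],
    "POS": [24543576, 79512752],
    "REF": ["G", "G"],
    "ALT": ["A", "C"],
})

# Toy multi-allelic COMMON table: ALT holds a comma-separated allele list.
common_mult = pd.DataFrame({
    "Chrom": [10, 10],
    "Pos": [24543576, 79512752],
    "SNP_ID": ["rs372909784", "rs980328910"],
    "REF": ["G", "G"],
    "ALT": ["A,T", "A,T"],
    "COMMON": [0.0, 0.0],
})

# Join on three keys; the empty left suffix leaves the cancer-side ALT untouched,
# while the COMMON-side ALT becomes ALT_common (hypothetical name).
merged = pd.merge(
    acc, common_mult,
    left_on=["CHROM", "POS", "REF"],
    right_on=["Chrom", "Pos", "REF"],
    how="inner",
    suffixes=("", "_common"),
)
print(merged[["CHROM", "POS", "ALT", "ALT_common"]])
```

Both rows are returned at this stage; whether *ALT* is actually one of the listed alleles still has to be checked afterwards.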

In [None]:
#Number of rows matched on CHROM, POS and REF (ALT not yet checked)
tam_merge_mult = len(base_merge_mult.index)
print(tam_merge_mult)

60


In [None]:
base_merge_mult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 60 entries, 0 to 59
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         60 non-null     int64  
 1   POS                           60 non-null     int64  
 2   ID                            60 non-null     object 
 3   REF                           60 non-null     object 
 4   ALT_x                         60 non-null     object 
 5   QUAL                          60 non-null     object 
 6   FILTER                        60 non-null     object 
 7   FORMAT                        60 non-null     object 
 8   avsnp150                      60 non-null     object 
 9   Interpro_domain               60 non-null     object 
 10  dbNSFP_DEOGEN2_pred           60 non-null     object 
 11  dbNSFP_MetaSVM_pred           60 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  60 non-null     object 
 13  dbNSFP_

In [None]:
base_merge_mult.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT_x,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,ALT_y,COMMON
0,10,24543576,.,G,A,.,PASS,GT:AD:DP,rs372909784,.,"T,.",T,D,T,"N,N",T,D,T,3.295e-05,"B,D","D,D","T,T","D,D",N,D,T,1.989005e-05,".,.","D,D,D,D,D,D,D,D",T,"D,D","D,D","M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtg/Atg,p.Val1436Met,c.4306G>A,KIAA1217,NM_019590.4,19,A,1,4306,>,Val,Met,1436,subst,.,10,24543576,rs372909784,"A,T",0.0
1,10,47999719,.,C,A,.,PASS,GT:AD:DP,.,.,T,T,N,T,.,T,T,T,.,.,.,.,D,N,N,T,.,.,N,T,.,.,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cTc,p.Arg16Leu,c.47G>T,FAM25C,NM_001137548.2.2,1,A,2,47,>,Arg,Leu,16,subst,.,10,47999719,rs566948491,"A,T",0.0
2,10,47999719,.,C,A,.,PASS,GT:AD:DP,.,.,T,T,N,T,.,T,T,T,.,.,.,.,D,N,N,T,.,.,N,T,.,.,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cTc,p.Arg16Leu,c.47G>T,FAM25G,NM_001137549.1.2,1,A,2,47,>,Arg,Leu,16,subst,.,10,47999719,rs566948491,"A,T",0.0
3,10,49515669,.,C,A,.,PASS,GT:AD:DP,rs757845558,PiggyBac_transposable_element-derived_protein,".,.,.",T,N,.,"D,D,D",T,D,T,1.647e-05,".,.,.","D,D,D","T,T,T","D,D,D",.,N,T,1.991128e-05,".,.,.","D,D,D",T,".,T,T",".,.,.",".,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tgG/tgT,p.Trp482Cys,c.1446G>T,PGBD3,NM_170753.3,2,A,3,1446,>,Trp,Cys,482,subst,.,10,49515669,rs757845558,"A,T",0.0
4,10,49515669,.,C,A,.,PASS,GT:AD:DP,rs757845558,PiggyBac_transposable_element-derived_protein,".,.,.",T,N,.,"D,D,D",T,D,T,1.647e-05,".,.,.","D,D,D","T,T,T","D,D,D",.,N,T,1.991128e-05,".,.,.","D,D,D",T,".,T,T",".,.,.",".,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,tgG/tgT,p.Trp950Cys,c.2850G>T,ERCC6-PGBD3,NM_001277058.1,6,A,3,2850,>,Trp,Cys,950,subst,.,10,49515669,rs757845558,"A,T",0.0


In [None]:
#Converting the value of the ALT_y field (comma-separated alleles) into a list
base_merge_mult["ALT_y"] = base_merge_mult["ALT_y"].apply(lambda x: x.split(","))

In [None]:
print(base_merge_mult[['ALT_x','ALT_y']])

   ALT_x      ALT_y
0      A     [A, T]
1      A     [A, T]
2      A     [A, T]
3      A     [A, T]
4      A     [A, T]
5      T  [A, G, T]
6      A     [A, C]
7      A     [A, T]
8      T     [G, T]
9      T     [A, T]
10     C     [C, G]
11     A     [G, T]
12     A     [A, C]
13     A     [A, T]
14     G     [G, T]
15     C  [A, C, G]
16     A     [A, C]
17     A     [A, T]
18     A  [A, C, T]
19     A     [G, T]
20     T     [A, T]
21     T     [A, T]
22     T     [A, T]
23     G     [A, T]
24     T     [A, C]
25     A     [A, T]
26     A     [A, T]
27     A     [A, C]
28     A     [A, T]
29     T     [G, T]
30     A     [A, C]
31     T     [A, T]
32     T     [A, T]
33     T     [A, T]
34     G     [A, T]
35     T     [G, T]
36     T     [A, T]
37     T     [A, T]
38     T     [G, T]
39     A     [A, T]
40     A     [C, G]
41     C     [A, T]
42     A     [A, T]
43     T     [A, T]
44     A     [A, T]
45     T     [A, T]
46     A     [A, T]
47     T     [A, T]
48     A     [A, T]


In [None]:
#Generating a new dataframe that only contains the rows where the value of
#ALT_x (ACC database) is contained in ALT_y (COMMON database)
def find_value_column(row):
    return row.ALT_x in row.ALT_y

#.copy() makes the result independent of base_merge_mult, so later edits do not raise SettingWithCopyWarning
base_merge_mult_ok = base_merge_mult[base_merge_mult.apply(find_value_column, axis=1)].copy()
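
The membership test above can also be written without a row-wise `apply`. The sketch below, on a toy frame, builds the boolean mask with a list comprehension. It also shows why the allele string is split first: a plain substring test on the raw string could give false matches for multi-base alleles. The toy values are assumptions for illustration.

```python
import pandas as pd

# Toy frame with the two columns involved in the filter.
merged = pd.DataFrame({
    "ALT_x": ["A", "C", "T", "A"],
    "ALT_y": ["A,T", "A,T", "A,G,T", "AT,G"],
})

# Split the comma-separated alleles without overwriting the original column.
alt_lists = merged["ALT_y"].str.split(",")

# True where the cancer allele is one of the listed COMMON alleles.
keep = [alt in alts for alt, alts in zip(merged["ALT_x"], alt_lists)]

matched = merged.loc[keep].copy()
print(matched)

# A substring test on the raw string would wrongly accept the last row:
print("A" in "AT,G")
```

Rows 0 and 2 are kept. Row 3 is rejected by the list test even though the letter `A` occurs inside the string `AT,G`.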

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 59
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         51 non-null     int64  
 1   POS                           51 non-null     int64  
 2   ID                            51 non-null     object 
 3   REF                           51 non-null     object 
 4   ALT_x                         51 non-null     object 
 5   QUAL                          51 non-null     object 
 6   FILTER                        51 non-null     object 
 7   FORMAT                        51 non-null     object 
 8   avsnp150                      51 non-null     object 
 9   Interpro_domain               51 non-null     object 
 10  dbNSFP_DEOGEN2_pred           51 non-null     object 
 11  dbNSFP_MetaSVM_pred           51 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  51 non-null     object 
 13  dbNSFP_

In [None]:
#Renaming the ALT_x column to ALT
base_merge_mult_ok = base_merge_mult_ok.rename(columns={'ALT_x': 'ALT'})

In [None]:
#Removing the redundant ALT_y field (positional axis arguments to drop are not accepted in pandas 2.0+)
base_merge_mult_ok = base_merge_mult_ok.drop(columns='ALT_y')

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 0 to 59
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         51 non-null     int64  
 1   POS                           51 non-null     int64  
 2   ID                            51 non-null     object 
 3   REF                           51 non-null     object 
 4   ALT                           51 non-null     object 
 5   QUAL                          51 non-null     object 
 6   FILTER                        51 non-null     object 
 7   FORMAT                        51 non-null     object 
 8   avsnp150                      51 non-null     object 
 9   Interpro_domain               51 non-null     object 
 10  dbNSFP_DEOGEN2_pred           51 non-null     object 
 11  dbNSFP_MetaSVM_pred           51 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  51 non-null     object 
 13  dbNSFP_

In [None]:
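#Stacking the exact matches and the multi-allelic matches (DataFrame.append was removed in pandas 2.0)
base_ACC_COMMON4_MUTECT = pd.concat([base_merge, base_merge_mult_ok], ignore_index=True)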
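
After stacking, the row count should equal the sum of both parts; in this notebook, 159 exact matches plus 51 multi-allelic matches give the 210 rows reported below. A minimal sketch of that check on toy frames follows. The duplicate check is an added suggestion, and it compares full rows, because the same variant can legitimately appear more than once when it is annotated for several genes.

```python
import pandas as pd

# Toy stand-ins for base_merge (exact matches) and base_merge_mult_ok (multi-allelic matches).
exact = pd.DataFrame({
    "CHROM": [10, 10], "POS": [387755, 5097467],
    "REF": ["C", "C"], "ALT": ["T", "T"],
})
multi = pd.DataFrame({
    "CHROM": [10], "POS": [24543576],
    "REF": ["G"], "ALT": ["A"],
})

combined = pd.concat([exact, multi], ignore_index=True)

# Stacking must neither lose nor create rows.
assert len(combined) == len(exact) + len(multi)

# Count rows that are identical in every column.
n_full_dup = int(combined.duplicated().sum())
print(len(combined), n_full_dup)
```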

In [None]:
base_ACC_COMMON4_MUTECT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 210 entries, 0 to 209
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         210 non-null    int64  
 1   POS                           210 non-null    int64  
 2   ID                            210 non-null    object 
 3   REF                           210 non-null    object 
 4   ALT                           210 non-null    object 
 5   QUAL                          210 non-null    object 
 6   FILTER                        210 non-null    object 
 7   FORMAT                        210 non-null    object 
 8   avsnp150                      210 non-null    object 
 9   Interpro_domain               210 non-null    object 
 10  dbNSFP_DEOGEN2_pred           210 non-null    object 
 11  dbNSFP_MetaSVM_pred           210 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  210 non-null    object 
 13  dbNSF

In [None]:
base_ACC_COMMON4_MUTECT.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,10,387755,.,C,T,.,PASS,GT:AD:DP,rs935304052,AMP-dependent_synthetase/ligase,".,T,T",T,D,D,".,D,.",D,D,D,.,".,P,.",".,D,.",".,T,.","D,D,D",D,D,T,.,".,.,.","D,D",T,"D,D,D",".,P,.",".,L,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gGc/gAc,p.Gly551Asp,c.1652G>A,DIP2C,NM_014974.2,14,T,2,1652,>,Gly,Asp,551,subst,.,10,387755,rs935304052,0.0
1,10,5097467,rs117377088,C,T,.,panel_of_normals,GT:AD:DP,rs782363753,NADP-dependent_oxidoreductase_domain,".,.,.,.",.,N,.,".,.,.,.",.,.,D,1.071e-04,".,.,.,.",".,.,.,.",".,.,.,.",".,.,.,.",N,N,D,1.153907e-04,".,.,.,.","A,D",.,".,.,.,.",".,.,.,.",".,.,.,.",STOP_GAINED,HIGH,NONSENSE,Cga/Tga,p.Arg96*,c.286C>T,AKR1C3,NM_001253908.1,3,T,1,286,>,Arg,*,96,translation termination,.,10,5097467,rs782363753,0.0
2,10,5456150,.,C,T,.,PASS,GT:AD:DP,rs139037982,Dbl_homology_(DH)_domain\x3bPH_domain-like|Ple...,"D,.",T,D,T,"D,D",D,D,D,8.236e-06,"D,.","D,D","T,T","D,D",D,D,T,3.977029e-06,".,.","D,D,D",T,"D,D","D,.","M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgg/Tgg,p.Arg421Trp,c.1261C>T,NET1,NM_001047160.2,11,T,1,1261,>,Arg,Trp,421,subst,.,10,5456150,rs139037982,0.0
3,10,5906382,.,A,G,.,PASS,GT:AD:DP,rs762146468,F-box_domain,"T,T,.",T,N,T,"N,N,N",T,T,T,8.236e-06,".,B,B","D,T,T",".,.,.","T,.,T",N,N,T,3.979973e-06,".,.,.","N,N",T,"T,T,T",".,B,B",".,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gAc/gGc,p.Asp219Gly,c.656A>G,FBXO18,NM_032807.4,4,G,2,656,>,Asp,Gly,219,subst,.,10,5906382,rs762146468,0.0
4,10,7727031,.,C,T,.,PASS,GT:AD:DP,rs545967088,"von_Willebrand_factor,_type_A",".,.",.,D,.,".,.",.,.,D,1.647e-05,".,.",".,.",".,.",".,.",D,D,D,1.591115e-05,".,.","A,A",.,".,.",".,.",".,.",STOP_GAINED,HIGH,NONSENSE,Cga/Tga,p.Arg356*,c.1066C>T,ITIH2,NM_002216.2,10,T,1,1066,>,Arg,*,356,translation termination,.,10,7727031,rs545967088,0.0


###3.8.1 Generating a file with the ACC Mutect and COMMON_04 database

In [None]:
base_ACC_COMMON4_MUTECT.to_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_04.csv",sep='\t',index=False)

##3.9 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with *Common_hg38_5* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)  


In [None]:
#Reading Common_hg38_5
import pandas as pd
df_common5 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_5.csv", delimiter='\t')

In [None]:
df_common5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89378611 entries, 0 to 89378610
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 4.0+ GB


In [None]:
#Reading base_ACC
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF/ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   CHROM                         6098 non-null   int64 
 1   POS                           6098 non-null   int64 
 2   ID                            6098 non-null   object
 3   REF                           6098 non-null   object
 4   ALT                           6098 non-null   object
 5   QUAL                          6098 non-null   object
 6   FILTER                        6098 non-null   object
 7   FORMAT                        6098 non-null   object
 8   avsnp150                      6098 non-null   object
 9   Interpro_domain               6098 non-null   object
 10  dbNSFP_DEOGEN2_pred           6098 non-null   object
 11  dbNSFP_MetaSVM_pred           6098 non-null   object
 12  dbNSFP_fathmmMKL_coding_pred  6098 non-null   object
 13  dbNSFP_PrimateAI_p

In [None]:
import pandas as pd
base_merge = pd.merge(base_ACC, df_common5, left_on=['CHROM', 'POS', 'REF', 'ALT'], right_on=['Chrom','Pos', 'REF', 'ALT'], how='inner')

In [None]:
#Number of rows matched by the exact join
tam_merge = len(base_merge.index)
print(tam_merge)

166


In [None]:
base_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 166 entries, 0 to 165
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         166 non-null    int64  
 1   POS                           166 non-null    int64  
 2   ID                            166 non-null    object 
 3   REF                           166 non-null    object 
 4   ALT                           166 non-null    object 
 5   QUAL                          166 non-null    object 
 6   FILTER                        166 non-null    object 
 7   FORMAT                        166 non-null    object 
 8   avsnp150                      166 non-null    object 
 9   Interpro_domain               166 non-null    object 
 10  dbNSFP_DEOGEN2_pred           166 non-null    object 
 11  dbNSFP_MetaSVM_pred           166 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  166 non-null    object 
 13  dbNSF

In [None]:
base_merge.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,10,75401186,.,C,A,.,PASS,GT:AD:DP,rs781432010,.,T,T,D,D,D,D,D,T,8.237e-06,P,T,T,D,D,D,T,4.052554e-06,.,"D,D",T,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aaG/aaT,p.Lys78Asn,c.234G>T,ZNF503,NM_032772.5,1,A,3,234,>,Lys,Asn,78,subst,.,10,75401186,rs781432010,0.0
1,10,92479291,.,T,C,.,PASS,GT:AD:DP,.,"Metalloenzyme,_LuxS/M16_peptidase-like\x3bMeta...",".,T",T,D,T,"N,N",T,D,T,.,".,B","T,T","T,T","T,T",D,N,T,.,".,.","D,D",T,"D,D",".,B",".,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Atc/Gtc,p.Ile624Val,c.1870A>G,IDE,NM_004969.3,15,C,1,1870,>,Ile,Val,624,subst,.,10,92479291,rs1277293098,0.0
2,10,93061307,.,C,T,.,PASS,GT:AD:DP,rs761068228,.,T,T,N,T,N,T,T,T,1.670e-05,B,T,T,T,N,N,T,2.393146e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCg/gTg,p.Ala15Val,c.44C>T,CYP26C1,NM_183374.2,1,T,2,44,>,Ala,Val,15,subst,.,10,93061307,rs761068228,0.0
3,10,93343905,.,C,T,.,PASS,GT:AD:DP,rs201449564,C2_domain,".,T",T,D,T,"N,N",D,T,T,3.641e-04,"B,B","T,T","D,D","T,T",N,N,T,3.366410e-04,".,.","D,D,D,D",T,"D,D","B,B",".,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg1426Gln,c.4277G>A,MYOF,NM_013451.3,38,T,2,4277,>,Arg,Gln,1426,subst,.,10,93343905,rs201449564,0.0
4,10,97369508,.,C,T,.,PASS,GT:AD:DP,rs370226569,Armadillo-like_helical|Armadillo-type_fold,"T,.,T,.",D,D,T,".,N,N,N",D,D,T,1.649e-05,"D,D,D,.",".,D,D,D",".,T,T,T","D,D,D,D",D,N,D,2.566779e-05,".,.,.,.","D,D,D,D",T,".,D,D,D","D,D,D,.","M,.,M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtg/Atg,p.Val958Met,c.2872G>A,RRP12,NM_015179.3,25,T,1,2872,>,Val,Met,958,subst,.,10,97369508,rs370226569,0.0


##3.10 Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*) with *Common_hg38_mult_5* table (through fields *Chrom*, *Pos*, *REF*), then checking that *ALT* is among the alleles listed in the COMMON *ALT* field


In [None]:
#Reading Common_hg38_mult_5
import pandas as pd
df_common_mult_5 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_mult_5.csv", delimiter='\t')

In [None]:
df_common_mult_5.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6299706 entries, 0 to 6299705
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 288.4+ MB


In [None]:
import pandas as pd
base_merge_mult = pd.merge(base_ACC, df_common_mult_5, left_on=['CHROM', 'POS', 'REF'], right_on=['Chrom','Pos', 'REF'], how='inner')

In [None]:
#Number of rows matched on CHROM, POS and REF (ALT not yet checked)
tam_merge_mult = len(base_merge_mult.index)
print(tam_merge_mult)

69


In [None]:
base_merge_mult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 69 entries, 0 to 68
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         69 non-null     int64  
 1   POS                           69 non-null     int64  
 2   ID                            69 non-null     object 
 3   REF                           69 non-null     object 
 4   ALT_x                         69 non-null     object 
 5   QUAL                          69 non-null     object 
 6   FILTER                        69 non-null     object 
 7   FORMAT                        69 non-null     object 
 8   avsnp150                      69 non-null     object 
 9   Interpro_domain               69 non-null     object 
 10  dbNSFP_DEOGEN2_pred           69 non-null     object 
 11  dbNSFP_MetaSVM_pred           69 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  69 non-null     object 
 13  dbNSFP_

In [None]:
base_merge_mult.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT_x,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,ALT_y,COMMON
0,10,77843561,.,C,T,.,PASS,GT:AD:DP,rs772926425,.,T,T,D,T,D,T,D,T,1.647e-05,D,D,T,D,N,D,T,1.591128e-05,.,"D,D",T,.,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cAc,p.Arg337His,c.1010G>A,DLG5,NM_004747.3,6,T,2,1010,>,Arg,His,337,subst,.,10,77843561,rs772926425,"A,T",0.0
1,10,79512752,.,G,C,.,PASS,GT:AD:DP,.,Ribosomal_protein_L2_domain_2|Translation_prot...,D,T,D,D,D,T,D,T,.,P,D,T,D,.,N,T,.,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Ggc/Cgc,p.Gly35Arg,c.103G>C,EIF5AL1,NM_001099692.1,1,C,1,103,>,Gly,Arg,35,subst,.,10,79512752,rs980328910,"A,T",0.0
2,10,97616295,.,C,T,.,PASS,GT:AD:DP,rs150577682,.,T,T,D,T,N,D,T,T,4.942e-05,D,T,T,T,D,D,T,3.234780e-05,.,"D,D,D,D,D",T,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gcc/Acc,p.Ala137Thr,c.409G>A,MORN4,NM_001098831.1,5,T,1,409,>,Ala,Thr,137,subst,.,10,97616295,rs150577682,"A,T",1.0
3,10,102108173,.,C,T,.,panel_of_normals,GT:AD:DP,rs141161148,.,".,T",T,D,T,"N,N",T,T,T,2.883e-04,".,B","T,T","T,T","T,T",D,N,T,4.493685e-04,".,.","D,D",T,"D,D",".,B",".,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gcc/Acc,p.Ala386Thr,c.1156G>A,LDB1,NM_001113407.1,11,T,1,1156,>,Ala,Thr,386,subst,.,10,102108173,rs141161148,"A,T",0.0
4,10,103455687,.,G,A,.,PASS,GT:AD:DP,rs375270464,.,T,T,D,T,D,D,D,D,8.239e-06,D,D,T,D,D,D,D,.,.,D,T,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cgg/Tgg,p.Arg206Trp,c.616C>T,CALHM1,NM_001001412.3,2,A,1,616,>,Arg,Trp,206,subst,.,10,103455687,rs375270464,"A,T",0.0


In [None]:
#Converting the value of the ALT_y field (comma-separated alleles) into a list
base_merge_mult["ALT_y"] = base_merge_mult["ALT_y"].apply(lambda x: x.split(","))

In [None]:
print(base_merge_mult[['ALT_x','ALT_y']])

   ALT_x      ALT_y
0      T     [A, T]
1      C     [A, T]
2      T     [A, T]
3      T     [A, T]
4      A     [A, T]
5      A     [A, C]
6      A     [A, C]
7      G  [A, G, T]
8      T     [A, G]
9      A     [G, T]
10     T     [G, T]
11     G     [G, T]
12     T     [A, T]
13     A     [A, T]
14     T     [A, T]
15     T     [A, T]
16     T     [A, T]
17     G     [A, T]
18     G     [G, T]
19     A  [A, C, T]
20     T     [A, C]
21     T     [A, T]
22     T     [A, C]
23     T     [G, T]
24     T     [A, T]
25     A     [A, T]
26     A     [A, C]
27     A     [A, T]
28     T     [A, T]
29     A     [A, C]
30     T     [A, T]
31     T     [G, T]
32     T     [G, T]
33     T  [A, G, T]
34     A     [G, T]
35     G  [A, G, T]
36     C  [A, C, G]
37     A     [A, T]
38     T     [A, T]
39     T     [G, T]
40     T  [A, G, T]
41     A     [G, T]
42     A     [A, C]
43     A     [A, T]
44     T     [C, T]
45     G  [A, G, T]
46     T     [G, T]
47     T     [A, T]
48     T     [A, T]


In [None]:
#Generating a new dataframe that only contains the rows where the value of
#ALT_x (ACC database) is contained in ALT_y (COMMON database)
def find_value_column(row):
    return row.ALT_x in row.ALT_y

#.copy() makes the result independent of base_merge_mult, so later edits do not raise SettingWithCopyWarning
base_merge_mult_ok = base_merge_mult[base_merge_mult.apply(find_value_column, axis=1)].copy()
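
The same sequence of steps is repeated for each pair of COMMON files. As a sketch only, the steps could be wrapped in one function. `join_with_common` is a hypothetical helper name, and the toy frames and `rs_toy` identifiers are assumptions used to exercise it; the notebook itself runs the steps cell by cell.

```python
import pandas as pd

def join_with_common(base, common, common_mult):
    """Hypothetical helper: exact join plus multi-allelic join, stacked."""
    # Step 1: exact match on chromosome, position, REF and ALT.
    exact = pd.merge(
        base, common,
        left_on=["CHROM", "POS", "REF", "ALT"],
        right_on=["Chrom", "Pos", "REF", "ALT"],
        how="inner",
    )
    # Step 2: match on chromosome, position and REF only.
    multi = pd.merge(
        base, common_mult,
        left_on=["CHROM", "POS", "REF"],
        right_on=["Chrom", "Pos", "REF"],
        how="inner",
    )
    # Step 3: keep rows whose ALT is among the listed COMMON alleles.
    alt_lists = multi["ALT_y"].str.split(",")
    keep = [alt in alts for alt, alts in zip(multi["ALT_x"], alt_lists)]
    multi_ok = (
        multi.loc[keep]
        .rename(columns={"ALT_x": "ALT"})
        .drop(columns="ALT_y")
    )
    # Step 4: stack both results.
    return pd.concat([exact, multi_ok], ignore_index=True)

# Toy inputs.
base = pd.DataFrame({
    "CHROM": [10, 10, 10], "POS": [100, 200, 300],
    "REF": ["C", "G", "G"], "ALT": ["T", "A", "C"],
})
common = pd.DataFrame({
    "Chrom": [10], "Pos": [100], "SNP_ID": ["rs_toy1"],
    "REF": ["C"], "ALT": ["T"], "COMMON": [0.0],
})
common_mult = pd.DataFrame({
    "Chrom": [10, 10], "Pos": [200, 300], "SNP_ID": ["rs_toy2", "rs_toy3"],
    "REF": ["G", "G"], "ALT": ["A,T", "A,T"], "COMMON": [0.0, 1.0],
})

result = join_with_common(base, common, common_mult)
print(result)
```

The variant at position 300 is dropped because its *ALT* (`C`) is not among the listed alleles, so the result holds the rows at positions 100 and 200.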

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56 entries, 0 to 68
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         56 non-null     int64  
 1   POS                           56 non-null     int64  
 2   ID                            56 non-null     object 
 3   REF                           56 non-null     object 
 4   ALT_x                         56 non-null     object 
 5   QUAL                          56 non-null     object 
 6   FILTER                        56 non-null     object 
 7   FORMAT                        56 non-null     object 
 8   avsnp150                      56 non-null     object 
 9   Interpro_domain               56 non-null     object 
 10  dbNSFP_DEOGEN2_pred           56 non-null     object 
 11  dbNSFP_MetaSVM_pred           56 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  56 non-null     object 
 13  dbNSFP_

In [None]:
#Rename the ALT_x column to ALT (reassigning avoids a SettingWithCopyWarning on the filtered frame)
base_merge_mult_ok = base_merge_mult_ok.rename(columns={'ALT_x': 'ALT'})

In [None]:
#Remove the now-redundant ALT_y field
base_merge_mult_ok = base_merge_mult_ok.drop(columns='ALT_y')

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 56 entries, 0 to 68
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         56 non-null     int64  
 1   POS                           56 non-null     int64  
 2   ID                            56 non-null     object 
 3   REF                           56 non-null     object 
 4   ALT                           56 non-null     object 
 5   QUAL                          56 non-null     object 
 6   FILTER                        56 non-null     object 
 7   FORMAT                        56 non-null     object 
 8   avsnp150                      56 non-null     object 
 9   Interpro_domain               56 non-null     object 
 10  dbNSFP_DEOGEN2_pred           56 non-null     object 
 11  dbNSFP_MetaSVM_pred           56 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  56 non-null     object 
 13  dbNSFP_

In [None]:
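#Combining the exact matches with the validated multi-allelic matches (DataFrame.append was removed in pandas 2.0)
base_ACC_COMMON5_MUTECT = pd.concat([base_merge, base_merge_mult_ok], ignore_index=True)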
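
After stacking the two result sets, a quick sanity check can confirm that no rows were lost and that no variant appears twice. This is a minimal sketch on made-up frames: `exact`, `multi_ok`, `combined` and `n_dups` are hypothetical names, and the coordinates are taken from rows printed in this notebook.

```python
import pandas as pd

# Hypothetical stand-ins for base_merge and base_merge_mult_ok
exact = pd.DataFrame({
    "CHROM": [10, 10],
    "POS": [75401186, 92479291],
    "REF": ["C", "T"],
    "ALT": ["A", "C"],
})
multi_ok = pd.DataFrame(
    {"CHROM": [10], "POS": [102108173], "REF": ["C"], "ALT": ["T"]},
    index=[3],  # filtered frames keep their original, non-contiguous index
)

# ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([exact, multi_ok], ignore_index=True)

# Row count should be the sum of the two inputs
assert len(combined) == len(exact) + len(multi_ok)

# Count variants that appear more than once on the four key fields
n_dups = int(combined.duplicated(subset=["CHROM", "POS", "REF", "ALT"]).sum())
print(n_dups)
```

A non-zero duplicate count would suggest that the same variant is present in both the single-allele and the multi-allelic COMMON files.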

In [None]:
base_ACC_COMMON5_MUTECT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 222 entries, 0 to 221
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         222 non-null    int64  
 1   POS                           222 non-null    int64  
 2   ID                            222 non-null    object 
 3   REF                           222 non-null    object 
 4   ALT                           222 non-null    object 
 5   QUAL                          222 non-null    object 
 6   FILTER                        222 non-null    object 
 7   FORMAT                        222 non-null    object 
 8   avsnp150                      222 non-null    object 
 9   Interpro_domain               222 non-null    object 
 10  dbNSFP_DEOGEN2_pred           222 non-null    object 
 11  dbNSFP_MetaSVM_pred           222 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  222 non-null    object 
 13  dbNSF

In [None]:
base_ACC_COMMON5_MUTECT.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,10,75401186,.,C,A,.,PASS,GT:AD:DP,rs781432010,.,T,T,D,D,D,D,D,T,8.237e-06,P,T,T,D,D,D,T,4.052554e-06,.,"D,D",T,D,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aaG/aaT,p.Lys78Asn,c.234G>T,ZNF503,NM_032772.5,1,A,3,234,>,Lys,Asn,78,subst,.,10,75401186,rs781432010,0.0
1,10,92479291,.,T,C,.,PASS,GT:AD:DP,.,"Metalloenzyme,_LuxS/M16_peptidase-like\x3bMeta...",".,T",T,D,T,"N,N",T,D,T,.,".,B","T,T","T,T","T,T",D,N,T,.,".,.","D,D",T,"D,D",".,B",".,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Atc/Gtc,p.Ile624Val,c.1870A>G,IDE,NM_004969.3,15,C,1,1870,>,Ile,Val,624,subst,.,10,92479291,rs1277293098,0.0
2,10,93061307,.,C,T,.,PASS,GT:AD:DP,rs761068228,.,T,T,N,T,N,T,T,T,1.670e-05,B,T,T,T,N,N,T,2.393146e-05,.,N,T,T,B,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCg/gTg,p.Ala15Val,c.44C>T,CYP26C1,NM_183374.2,1,T,2,44,>,Ala,Val,15,subst,.,10,93061307,rs761068228,0.0
3,10,93343905,.,C,T,.,PASS,GT:AD:DP,rs201449564,C2_domain,".,T",T,D,T,"N,N",D,T,T,3.641e-04,"B,B","T,T","D,D","T,T",N,N,T,3.366410e-04,".,.","D,D,D,D",T,"D,D","B,B",".,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg1426Gln,c.4277G>A,MYOF,NM_013451.3,38,T,2,4277,>,Arg,Gln,1426,subst,.,10,93343905,rs201449564,0.0
4,10,97369508,.,C,T,.,PASS,GT:AD:DP,rs370226569,Armadillo-like_helical|Armadillo-type_fold,"T,.,T,.",D,D,T,".,N,N,N",D,D,T,1.649e-05,"D,D,D,.",".,D,D,D",".,T,T,T","D,D,D,D",D,N,D,2.566779e-05,".,.,.,.","D,D,D,D",T,".,D,D,D","D,D,D,.","M,.,M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gtg/Atg,p.Val958Met,c.2872G>A,RRP12,NM_015179.3,25,T,1,2872,>,Val,Met,958,subst,.,10,97369508,rs370226569,0.0


###3.10.1 - Generating a file with the ACC MUTECT and COMMON_05 data

In [None]:
base_ACC_COMMON5_MUTECT.to_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_05.csv",sep='\t',index=False)

##3.11 - Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with the *Common_hg38_6* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)


In [None]:
#Reading Common_hg38_6
import pandas as pd
df_common6 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_6.csv", delimiter='\t')
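
Because this file holds tens of millions of rows, passing explicit dtypes to `read_csv` may reduce memory use. The sketch below reads three rows (copied from outputs in this notebook) from an in-memory buffer instead of the real file. The narrower dtypes are an assumption: `int8` only works if `Chrom` holds small integers, and `int32` is enough for `Pos` because hg38 coordinates stay below 2^31.

```python
import io
import pandas as pd

# Three rows in the same tab-separated layout as the COMMON files
sample = io.StringIO(
    "Chrom\tPos\tSNP_ID\tREF\tALT\tCOMMON\n"
    "13\t112862472\trs778847637\tG\tA\t0.0\n"
    "13\t113637857\trs760185871\tC\tT\t0.0\n"
    "14\t19876887\trs371172454\tC\tT\t0.0\n"
)

# Assumed narrower dtypes; adjust if the real file contains other values
dtypes = {"Chrom": "int8", "Pos": "int32", "COMMON": "float32"}
df_small = pd.read_csv(sample, sep="\t", dtype=dtypes)

print(df_small.dtypes)
print(df_small.memory_usage(deep=True).sum(), "bytes")
```

With the real file, the same `dtype` mapping would be passed alongside the path used in the cell above.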

In [None]:
df_common6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89544345 entries, 0 to 89544344
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 4.0+ GB


In [None]:
#Reading base_ACC
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF/ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   CHROM                         6098 non-null   int64 
 1   POS                           6098 non-null   int64 
 2   ID                            6098 non-null   object
 3   REF                           6098 non-null   object
 4   ALT                           6098 non-null   object
 5   QUAL                          6098 non-null   object
 6   FILTER                        6098 non-null   object
 7   FORMAT                        6098 non-null   object
 8   avsnp150                      6098 non-null   object
 9   Interpro_domain               6098 non-null   object
 10  dbNSFP_DEOGEN2_pred           6098 non-null   object
 11  dbNSFP_MetaSVM_pred           6098 non-null   object
 12  dbNSFP_fathmmMKL_coding_pred  6098 non-null   object
 13  dbNSFP_PrimateAI_p

In [None]:
import pandas as pd
base_merge = pd.merge(base_ACC, df_common6, left_on=['CHROM', 'POS', 'REF', 'ALT'], right_on=['Chrom','Pos', 'REF', 'ALT'], how='inner')
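
The effect of joining on differently named keys can be seen on a small example. In this sketch (the frames `base` and `common` are hypothetical, and the second COMMON row deliberately has a different `ALT`), the inner join keeps only the variant whose four key fields all agree. `REF` and `ALT` share a name on both sides, so they appear once, while `CHROM`/`Chrom` and `POS`/`Pos` are both kept. That is consistent with the 51 + 4 = 55 columns reported by `info()`.

```python
import pandas as pd

# Hypothetical miniature of base_ACC
base = pd.DataFrame({
    "CHROM": [13, 13],
    "POS": [112862472, 113637857],
    "REF": ["G", "C"],
    "ALT": ["A", "T"],
    "Gene_EFF": ["ATP11A", "TFDP1"],
})

# Hypothetical miniature of a COMMON file; the second row has a different ALT on purpose
common = pd.DataFrame({
    "Chrom": [13, 13],
    "Pos": [112862472, 113637857],
    "SNP_ID": ["rs778847637", "rs760185871"],
    "REF": ["G", "C"],
    "ALT": ["A", "G"],
    "COMMON": [0.0, 0.0],
})

merged = pd.merge(
    base, common,
    left_on=["CHROM", "POS", "REF", "ALT"],
    right_on=["Chrom", "Pos", "REF", "ALT"],
    how="inner",
)
print(merged)
```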

In [None]:
#Number of ACC variants with an exact match in Common_hg38_6
tam_merge = len(base_merge.index)
print(tam_merge)

183


In [None]:
base_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 183 entries, 0 to 182
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         183 non-null    int64  
 1   POS                           183 non-null    int64  
 2   ID                            183 non-null    object 
 3   REF                           183 non-null    object 
 4   ALT                           183 non-null    object 
 5   QUAL                          183 non-null    object 
 6   FILTER                        183 non-null    object 
 7   FORMAT                        183 non-null    object 
 8   avsnp150                      183 non-null    object 
 9   Interpro_domain               183 non-null    object 
 10  dbNSFP_DEOGEN2_pred           183 non-null    object 
 11  dbNSFP_MetaSVM_pred           183 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  183 non-null    object 
 13  dbNSF

In [None]:
base_merge.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,13,112862472,.,G,A,.,PASS,GT:AD:DP,rs778847637,"P-type_ATPase,__transmembrane_domain|P-type_AT...","T,T,T",T,D,T,"N,N,N",D,T,T,8.236e-06,"P,P,B","T,T,T","T,T,T","T,T,T",D,N,T,2.787490e-05,".,.,.","D,D,D,D",T,".,D,D","P,P,D","L,L,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cAc,p.Arg963His,c.2888G>A,ATP11A,NM_032189.3,25,A,2,2888,>,Arg,His,963,subst,.,13,112862472,rs778847637,0.0
1,13,113637857,.,C,T,.,PASS,GT:AD:DP,rs760185871,.,T,T,D,T,N,T,T,T,8.236e-06,B,T,T,T,N,N,T,3.976301e-06,.,"N,N,N",T,T,P,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr349Met,c.1046C>T,TFDP1,NM_007111.4,11,T,2,1046,>,Thr,Met,349,subst,.,13,113637857,rs760185871,0.0
2,14,19876887,.,C,T,.,panel_of_normals,GT:AD:DP,rs371172454,"GPCR,_rhodopsin-like,_7TM","T,T",T,N,T,".,D",T,T,T,1.318e-04,"P,P",".,D",".,T",".,T",N,N,T,1.392702e-04,".,.",N,T,".,T","D,D","N,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCg/gTg,p.Ala207Val,c.620C>T,OR4K2,NM_001005501.1,1,T,2,620,>,Ala,Val,207,subst,.,14,19876887,rs371172454,0.0
3,14,20457398,.,G,C,.,PASS,GT:AD:DP,.,"AP_endonuclease_1,_conserved_site|Endonuclease...","D,D,D",D,D,T,"D,D,D",D,D,D,.,"D,D,D","D,D,D","D,D,D","D,D,D",D,D,D,.,".,.,.","D,D,D,D",D,".,.,D","D,D,D","H,H,H",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gat/Cat,p.Asp283His,c.847G>C,APEX1,NM_001244249.1,5,C,1,847,>,Asp,His,283,subst,.,14,20457398,rs1453612380,0.0
4,14,21523596,.,C,T,.,PASS,GT:AD:DP,rs764828751,Zinc_finger_C2H2-type,".,T",D,D,D,".,D",D,D,T,8.236e-06,".,D",".,D",".,T","D,D",N,D,D,1.990129e-05,".,.","D,D,D,D",D,".,D",".,D",".,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg711Gln,c.2132G>A,SALL2,NM_005407.2,2,T,2,2132,>,Arg,Gln,711,subst,.,14,21523596,rs764828751,0.0


##3.12 - Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*) with the *Common_hg38_mult_6* table (through fields *Chrom*, *Pos*, *REF*), then checking *ALT* against the multi-allelic *ALT* list


In [None]:
#Reading Common_hg38_mult_6
import pandas as pd
df_common_mult_6 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_mult_6.csv", delimiter='\t')

In [None]:
df_common_mult_6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6827825 entries, 0 to 6827824
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   int64  
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(2), object(3)
memory usage: 312.6+ MB


In [None]:
import pandas as pd
base_merge_mult = pd.merge(base_ACC, df_common_mult_6, left_on=['CHROM', 'POS', 'REF'], right_on=['Chrom','Pos', 'REF'], how='inner')
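
Here `ALT` exists in both tables but is not a join key, so pandas keeps both copies and appends the default suffixes `_x` and `_y`. The sketch below, on hypothetical one-row frames with values copied from the output in this section, shows the optional `suffixes` argument of `pd.merge`, which can make the origin of each column more explicit. The suffix names are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of base_ACC
base = pd.DataFrame({
    "CHROM": [14], "POS": [19920982], "REF": ["A"], "ALT": ["C"],
})

# Hypothetical miniature of a multi-allelic COMMON file
common_mult = pd.DataFrame({
    "Chrom": [14], "Pos": [19920982], "SNP_ID": ["rs200386813"],
    "REF": ["A"], "ALT": ["C,G"], "COMMON": [0.0],
})

merged_mult = pd.merge(
    base, common_mult,
    left_on=["CHROM", "POS", "REF"],
    right_on=["Chrom", "Pos", "REF"],
    how="inner",
    suffixes=("_ACC", "_COMMON"),  # instead of the default ("_x", "_y")
)
print(merged_mult[["ALT_ACC", "ALT_COMMON"]])
```

The rest of the notebook refers to `ALT_x` and `ALT_y`, so those names would need to be changed consistently if custom suffixes were adopted.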

In [None]:
#Number of ACC variants whose position and REF match an entry in Common_hg38_mult_6
tam_merge_mult = len(base_merge_mult.index)
print(tam_merge_mult)

102


In [None]:
base_merge_mult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 102 entries, 0 to 101
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         102 non-null    int64  
 1   POS                           102 non-null    int64  
 2   ID                            102 non-null    object 
 3   REF                           102 non-null    object 
 4   ALT_x                         102 non-null    object 
 5   QUAL                          102 non-null    object 
 6   FILTER                        102 non-null    object 
 7   FORMAT                        102 non-null    object 
 8   avsnp150                      102 non-null    object 
 9   Interpro_domain               102 non-null    object 
 10  dbNSFP_DEOGEN2_pred           102 non-null    object 
 11  dbNSFP_MetaSVM_pred           102 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  102 non-null    object 
 13  dbNSF

In [None]:
base_merge_mult.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT_x,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,ALT_y,COMMON
0,13,114265198,.,C,T,.,PASS,GT:AD:DP,.,Tetratricopeptide-like_helical_domain,"T,T,T,.,.,.,T",T,D,D,"N,.,N,N,N,N,N",T,D,D,.,"B,.,B,.,B,.,.","T,.,T,T,T,T,T",".,.,.,.,.,.,T","T,T,T,T,T,T,T",D,D,D,3.977124e-06,".,.,.,.,.,.,.","D,D,D,D,D,D,D",T,"D,D,.,.,D,D,.","B,.,B,.,B,.,.","L,.,L,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Cat/Tat,p.His521Tyr,c.1561C>T,CDC16,NM_001078645.1,17,T,1,1561,>,His,Tyr,521,subst,.,13,114265198,rs754062932,"A,G,T",0.0
1,14,19920982,.,A,C,.,PASS,GT:AD:DP,rs200386813,"GPCR,_rhodopsin-like,_7TM",T,D,D,T,N,T,D,T,.,D,D,T,D,D,N,T,.,.,D,T,T,D,H,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Ata/Cta,p.Ile126Leu,c.376A>C,OR4K5,NM_001005483.1,1,C,1,376,>,Ile,Leu,126,subst,.,14,19920982,rs200386813,"C,G",0.0
2,14,19976196,.,T,A,.,PASS,GT:AD:DP,.,"GPCR,_rhodopsin-like,_7TM",".,T",T,N,T,".,N",T,T,T,.,".,B",".,D",".,T",".,D",N,N,T,.,".,.",N,T,"T,T",".,B",".,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gaT/gaA,p.Asp226Glu,c.678T>A,OR4K15,NM_001005486.1,1,A,3,678,>,Asp,Glu,226,subst,.,14,19976196,rs760971262,"C,G",0.0
3,14,23034853,.,G,A,.,PASS,GT:AD:DP,.,.,"T,.,.",T,N,T,"N,N,D",T,T,T,.,"B,B,.","T,T,D","T,T,T","T,T,T",N,N,T,3.982541e-06,".,.,.","D,D,D,D",T,"T,T,T","B,B,.","N,N,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro10Leu,c.29C>T,PSMB5,NM_002797.4,1,A,2,29,>,Pro,Leu,10,subst,.,14,23034853,rs764303754,"A,C,T",0.0
4,14,24072936,.,C,G,.,PASS,GT:AD:DP,rs567201247,C2_domain,.,T,D,.,N,T,T,T,8.241e-06,.,T,T,T,.,N,T,4.725362e-06,.,"D,D,N",T,.,.,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gaC/gaG,p.Asp55Glu,c.165C>G,CPNE6,NM_001280558.1,3,G,3,165,>,Asp,Glu,55,subst,.,14,24072936,rs567201247,"G,T",0.0


In [None]:
#Converting the comma-separated value of the ALT_y field into a list
base_merge_mult["ALT_y"] = base_merge_mult["ALT_y"].apply(lambda x: x.split(","))

In [None]:
print(base_merge_mult[['ALT_x','ALT_y']])

    ALT_x      ALT_y
0       T  [A, G, T]
1       C     [C, G]
2       A     [C, G]
3       A  [A, C, T]
4       G     [G, T]
5       A     [A, C]
6       A  [A, C, T]
7       A     [A, T]
8       A     [A, G]
9       T     [A, T]
10      G     [A, T]
11      T     [A, T]
12      T     [A, T]
13      T     [G, T]
14      C     [G, T]
15      A     [A, C]
16      A     [A, C]
17      C  [A, C, T]
18      T     [G, T]
19      T     [G, T]
20      T     [A, G]
21      A     [A, C]
22      A     [G, T]
23      A     [A, C]
24      G     [C, G]
25      T     [A, G]
26      A     [A, T]
27      A     [A, T]
28      T     [A, T]
29      G     [G, T]
30      A     [A, T]
31      T  [A, G, T]
32      T  [A, G, T]
33      A     [G, T]
34      T     [C, T]
35      A     [A, C]
36      A     [A, T]
37      A     [A, T]
38      A     [C, G]
39      C     [A, T]
40      T     [G, T]
41      T     [G, T]
42      A     [A, T]
43      C     [G, T]
44      T     [A, T]
45      T     [G, T]
46      A    

In [None]:
#Generating a new dataframe that only contains the rows where the value of ALT_x (ACC database) is contained in ALT_y (COMMON database)
def find_value_column(row):
    return row.ALT_x in row.ALT_y

base_merge_mult_ok = base_merge_mult[base_merge_mult.apply(find_value_column, axis=1)]
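
As an alternative to the row-wise `apply`, the allele lists can be expanded with `DataFrame.explode` and compared column to column. This is a sketch on a hypothetical frame `toy`, using the first five `ALT_x`/`ALT_y` pairs printed above and starting from the comma-separated strings rather than from lists.

```python
import pandas as pd

# Hypothetical toy frame with the first five ALT_x / ALT_y pairs from the output above
toy = pd.DataFrame({
    "ALT_x": ["T", "C", "A", "A", "G"],
    "ALT_y": ["A,G,T", "C,G", "C,G", "A,C,T", "G,T"],
})

# One row per (variant, COMMON allele); the original row label is repeated
exploded = toy.assign(ALT_y=toy["ALT_y"].str.split(",")).explode("ALT_y")

# Row labels whose called allele equals one of the COMMON alleles
matched_index = exploded.index[exploded["ALT_x"] == exploded["ALT_y"]]
toy_ok = toy.loc[matched_index]
print(toy_ok)
```

Each allele appears at most once per list here, so every original row is selected at most once. If a list could repeat an allele, the matched labels would need de-duplicating.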

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74 entries, 0 to 98
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         74 non-null     int64  
 1   POS                           74 non-null     int64  
 2   ID                            74 non-null     object 
 3   REF                           74 non-null     object 
 4   ALT_x                         74 non-null     object 
 5   QUAL                          74 non-null     object 
 6   FILTER                        74 non-null     object 
 7   FORMAT                        74 non-null     object 
 8   avsnp150                      74 non-null     object 
 9   Interpro_domain               74 non-null     object 
 10  dbNSFP_DEOGEN2_pred           74 non-null     object 
 11  dbNSFP_MetaSVM_pred           74 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  74 non-null     object 
 13  dbNSFP_

In [None]:
#Rename the ALT_x column to ALT (reassigning avoids a SettingWithCopyWarning on the filtered frame)
base_merge_mult_ok = base_merge_mult_ok.rename(columns={'ALT_x': 'ALT'})

In [None]:
#Remove the now-redundant ALT_y field
base_merge_mult_ok = base_merge_mult_ok.drop(columns='ALT_y')

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 74 entries, 0 to 98
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         74 non-null     int64  
 1   POS                           74 non-null     int64  
 2   ID                            74 non-null     object 
 3   REF                           74 non-null     object 
 4   ALT                           74 non-null     object 
 5   QUAL                          74 non-null     object 
 6   FILTER                        74 non-null     object 
 7   FORMAT                        74 non-null     object 
 8   avsnp150                      74 non-null     object 
 9   Interpro_domain               74 non-null     object 
 10  dbNSFP_DEOGEN2_pred           74 non-null     object 
 11  dbNSFP_MetaSVM_pred           74 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  74 non-null     object 
 13  dbNSFP_

In [None]:
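#Combining the exact matches with the validated multi-allelic matches (DataFrame.append was removed in pandas 2.0)
base_ACC_COMMON6_MUTECT = pd.concat([base_merge, base_merge_mult_ok], ignore_index=True)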

In [None]:
base_ACC_COMMON6_MUTECT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 257 entries, 0 to 256
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         257 non-null    int64  
 1   POS                           257 non-null    int64  
 2   ID                            257 non-null    object 
 3   REF                           257 non-null    object 
 4   ALT                           257 non-null    object 
 5   QUAL                          257 non-null    object 
 6   FILTER                        257 non-null    object 
 7   FORMAT                        257 non-null    object 
 8   avsnp150                      257 non-null    object 
 9   Interpro_domain               257 non-null    object 
 10  dbNSFP_DEOGEN2_pred           257 non-null    object 
 11  dbNSFP_MetaSVM_pred           257 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  257 non-null    object 
 13  dbNSF

In [None]:
base_ACC_COMMON6_MUTECT.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,13,112862472,.,G,A,.,PASS,GT:AD:DP,rs778847637,"P-type_ATPase,__transmembrane_domain|P-type_AT...","T,T,T",T,D,T,"N,N,N",D,T,T,8.236e-06,"P,P,B","T,T,T","T,T,T","T,T,T",D,N,T,2.787490e-05,".,.,.","D,D,D,D",T,".,D,D","P,P,D","L,L,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cAc,p.Arg963His,c.2888G>A,ATP11A,NM_032189.3,25,A,2,2888,>,Arg,His,963,subst,.,13,112862472,rs778847637,0.0
1,13,113637857,.,C,T,.,PASS,GT:AD:DP,rs760185871,.,T,T,D,T,N,T,T,T,8.236e-06,B,T,T,T,N,N,T,3.976301e-06,.,"N,N,N",T,T,P,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr349Met,c.1046C>T,TFDP1,NM_007111.4,11,T,2,1046,>,Thr,Met,349,subst,.,13,113637857,rs760185871,0.0
2,14,19876887,.,C,T,.,panel_of_normals,GT:AD:DP,rs371172454,"GPCR,_rhodopsin-like,_7TM","T,T",T,N,T,".,D",T,T,T,1.318e-04,"P,P",".,D",".,T",".,T",N,N,T,1.392702e-04,".,.",N,T,".,T","D,D","N,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCg/gTg,p.Ala207Val,c.620C>T,OR4K2,NM_001005501.1,1,T,2,620,>,Ala,Val,207,subst,.,14,19876887,rs371172454,0.0
3,14,20457398,.,G,C,.,PASS,GT:AD:DP,.,"AP_endonuclease_1,_conserved_site|Endonuclease...","D,D,D",D,D,T,"D,D,D",D,D,D,.,"D,D,D","D,D,D","D,D,D","D,D,D",D,D,D,.,".,.,.","D,D,D,D",D,".,.,D","D,D,D","H,H,H",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gat/Cat,p.Asp283His,c.847G>C,APEX1,NM_001244249.1,5,C,1,847,>,Asp,His,283,subst,.,14,20457398,rs1453612380,0.0
4,14,21523596,.,C,T,.,PASS,GT:AD:DP,rs764828751,Zinc_finger_C2H2-type,".,T",D,D,D,".,D",D,D,T,8.236e-06,".,D",".,D",".,T","D,D",N,D,D,1.990129e-05,".,.","D,D,D,D",D,".,D",".,D",".,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg711Gln,c.2132G>A,SALL2,NM_005407.2,2,T,2,2132,>,Arg,Gln,711,subst,.,14,21523596,rs764828751,0.0


###3.12.1 - Generating a file with the ACC MUTECT and COMMON_06 data

In [None]:
base_ACC_COMMON6_MUTECT.to_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_06.csv",sep='\t',index=False)
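
Since the same sequence of steps is repeated for every COMMON file, it could be wrapped in one function. The helper `join_with_common` below is hypothetical (it is not part of the notebook) and is a minimal sketch of the steps shown above: exact join on four fields, join on three fields against the multi-allelic file, membership check on `ALT`, then concatenation. The toy rows reuse values printed in sections 3.11 and 3.12.

```python
import pandas as pd

ACC_KEYS = ["CHROM", "POS", "REF"]
COMMON_KEYS = ["Chrom", "Pos", "REF"]


def join_with_common(base, common, common_mult):
    """Hypothetical helper: exact-allele join plus validated multi-allelic join."""
    # Exact match on chromosome, position, reference and alternate allele
    exact = pd.merge(base, common,
                     left_on=ACC_KEYS + ["ALT"], right_on=COMMON_KEYS + ["ALT"],
                     how="inner")
    # Match on position and reference only; ALT becomes ALT_x / ALT_y
    multi = pd.merge(base, common_mult,
                     left_on=ACC_KEYS, right_on=COMMON_KEYS, how="inner")
    # Keep rows whose called allele is among the comma-separated COMMON alleles
    keep = pd.Series(
        [alt in alleles.split(",")
         for alt, alleles in zip(multi["ALT_x"], multi["ALT_y"])],
        index=multi.index, dtype=bool,
    )
    multi_ok = multi[keep].rename(columns={"ALT_x": "ALT"}).drop(columns="ALT_y")
    return pd.concat([exact, multi_ok], ignore_index=True)


# Toy inputs built from rows shown in the outputs above
base = pd.DataFrame({
    "CHROM": [13, 14, 14],
    "POS": [112862472, 19920982, 19976196],
    "REF": ["G", "A", "T"],
    "ALT": ["A", "C", "A"],
})
common = pd.DataFrame({
    "Chrom": [13], "Pos": [112862472], "SNP_ID": ["rs778847637"],
    "REF": ["G"], "ALT": ["A"], "COMMON": [0.0],
})
common_mult = pd.DataFrame({
    "Chrom": [14, 14], "Pos": [19920982, 19976196],
    "SNP_ID": ["rs200386813", "rs760971262"],
    "REF": ["A", "T"], "ALT": ["C,G", "C,G"], "COMMON": [0.0, 0.0],
})

result = join_with_common(base, common, common_mult)
print(result)
```

With a function like this, each COMMON file pair would need only one call followed by `to_csv`, which may reduce copy-and-paste errors between sections.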

##3.13 - Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*, *ALT*) with the *Common_hg38_7* table (through fields *Chrom*, *Pos*, *REF*, *ALT*)


In [None]:
#Reading Common_hg38_7
import pandas as pd
df_common7 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_7.csv", delimiter='\t')


In [None]:
df_common7.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70289434 entries, 0 to 70289433
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   object 
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(1), object(4)
memory usage: 3.1+ GB
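
In this file `Chrom` is read as `object` rather than `int64`, which suggests that the column mixes numeric and non-numeric labels. One way to make the join keys comparable is to read `Chrom` as text and convert `CHROM` in the ACC table to text as well. The sketch below assumes this; the `X` row is a made-up example of a non-numeric label, while the two chromosome 19 rows are copied from the output in this section.

```python
import io
import pandas as pd

# In-memory stand-in for a COMMON file with a non-numeric chromosome label
sample = io.StringIO(
    "Chrom\tPos\tSNP_ID\tREF\tALT\tCOMMON\n"
    "19\t1220641\trs1131690940\tC\tT\t0.0\n"
    "19\t2934177\trs151130031\tG\tA\t1.0\n"
    "X\t100\trs0\tA\tG\t0.0\n"  # hypothetical row, for illustration only
)
common = pd.read_csv(sample, sep="\t", dtype={"Chrom": str})

# Hypothetical miniature of base_ACC, where CHROM is stored as integers
base = pd.DataFrame({
    "CHROM": [19, 19],
    "POS": [1220641, 2934177],
    "REF": ["C", "G"],
    "ALT": ["T", "A"],
})
base["CHROM"] = base["CHROM"].astype(str)  # same type as Chrom before joining

merged = pd.merge(
    base, common,
    left_on=["CHROM", "POS", "REF", "ALT"],
    right_on=["Chrom", "Pos", "REF", "ALT"],
    how="inner",
)
print(merged[["CHROM", "POS", "SNP_ID", "COMMON"]])
```

Converting both sides explicitly avoids relying on how pandas coerces mixed-type key columns, which has varied between versions.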


In [None]:
#Reading base_ACC
import pandas as pd
base_ACC = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/BasescomANNOVAR_SnpEFF/ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt.csv", delimiter='\t')

In [None]:
base_ACC.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6098 entries, 0 to 6097
Data columns (total 51 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   CHROM                         6098 non-null   int64 
 1   POS                           6098 non-null   int64 
 2   ID                            6098 non-null   object
 3   REF                           6098 non-null   object
 4   ALT                           6098 non-null   object
 5   QUAL                          6098 non-null   object
 6   FILTER                        6098 non-null   object
 7   FORMAT                        6098 non-null   object
 8   avsnp150                      6098 non-null   object
 9   Interpro_domain               6098 non-null   object
 10  dbNSFP_DEOGEN2_pred           6098 non-null   object
 11  dbNSFP_MetaSVM_pred           6098 non-null   object
 12  dbNSFP_fathmmMKL_coding_pred  6098 non-null   object
 13  dbNSFP_PrimateAI_p

In [None]:
import pandas as pd
base_merge = pd.merge(base_ACC, df_common7, left_on=['CHROM', 'POS', 'REF', 'ALT'], right_on=['Chrom','Pos', 'REF', 'ALT'], how='inner')

In [None]:
#Number of ACC variants with an exact match in Common_hg38_7
tam_merge = len(base_merge.index)
print(tam_merge)

189


In [None]:
base_merge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 189 entries, 0 to 188
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         189 non-null    object 
 1   POS                           189 non-null    int64  
 2   ID                            189 non-null    object 
 3   REF                           189 non-null    object 
 4   ALT                           189 non-null    object 
 5   QUAL                          189 non-null    object 
 6   FILTER                        189 non-null    object 
 7   FORMAT                        189 non-null    object 
 8   avsnp150                      189 non-null    object 
 9   Interpro_domain               189 non-null    object 
 10  dbNSFP_DEOGEN2_pred           189 non-null    object 
 11  dbNSFP_MetaSVM_pred           189 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  189 non-null    object 
 13  dbNSF

In [None]:
base_merge.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,19,1220641,.,C,T,.,PASS,GT:AD:DP,.,Protein_kinase_domain|Protein_kinase-like_domain,".,.",.,D,.,".,.",.,.,D,.,".,.",".,.",".,.",".,.",D,N,D,.,".,.",A,.,".,.",".,.",".,.",STOP_GAINED,HIGH,NONSENSE,Cag/Tag,p.Gln220*,c.658C>T,STK11,NM_000455.4,5,T,1,658,>,Gln,*,220,translation termination,.,19,1220641,rs1131690940,0.0
1,19,1277287,.,G,A,.,PASS,GT:AD:DP,.,.,T,T,D,.,N,D,D,T,.,D,T,T,T,.,N,T,1.913168e-05,.,.,T,T,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg129Gln,c.386G>A,C19orf24,NM_017914.3,2,A,2,386,>,Arg,Gln,129,subst,.,19,1277287,rs1421078623,0.0
2,19,1796767,.,C,T,.,PASS,GT:AD:DP,rs773757936,"P-type_ATPase,_cytoplasmic_domain_N","D,.",T,N,T,"D,D",D,D,T,8.302e-06,"D,.","D,D","T,T","D,D",D,N,T,8.205465e-06,".,.","N,N,N",T,"D,T","D,.","M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cAc,p.Arg566His,c.1697G>A,ATP8B3,NM_138813.3,16,T,2,1697,>,Arg,His,566,subst,.,19,1796767,rs773757936,0.0
3,19,2763726,.,C,T,.,PASS,GT:AD:DP,rs374479958,Tetratricopeptide_repeat-containing_domain|Tet...,"T,T",T,D,T,"N,.",D,T,T,3.295e-05,"B,.","T,.","T,T","T,.",N,N,T,1.199530e-05,".,.",D,T,"T,T","B,.","N,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gca/Aca,p.Ala142Thr,c.424G>A,SGTA,NM_003021.3,6,T,1,424,>,Ala,Thr,142,subst,.,19,2763726,rs374479958,0.0
4,19,2934177,.,G,A,.,panel_of_normals,GT:AD:DP,rs151130031,Zinc_finger_C2H2-type,T,T,N,T,D,T,T,T,8.895e-04,B,D,T,T,.,N,T,8.393066e-04,.,N,T,T,P,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr317Met,c.950C>T,ZNF77,NM_021217.2,4,A,2,950,>,Thr,Met,317,subst,.,19,2934177,rs151130031,1.0


##3.14 - Joining the **ACC_mutect_campos_selecionados_INFO_EFF_PointMut_changecDNA_changeProt** table (through fields *CHROM*, *POS*, *REF*) with the *Common_hg38_mult_7* table (through fields *Chrom*, *Pos*, *REF*), then checking *ALT* against the multi-allelic *ALT* list


In [None]:
#Reading Common_hg38_mult_7
import pandas as pd
df_common_mult_7 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Common_hg38_mult_7.csv", delimiter='\t')


In [None]:
df_common_mult_7.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4970710 entries, 0 to 4970709
Data columns (total 6 columns):
 #   Column  Dtype  
---  ------  -----  
 0   Chrom   object 
 1   Pos     int64  
 2   SNP_ID  object 
 3   REF     object 
 4   ALT     object 
 5   COMMON  float64
dtypes: float64(1), int64(1), object(4)
memory usage: 227.5+ MB


In [None]:
import pandas as pd
base_merge_mult = pd.merge(base_ACC, df_common_mult_7, left_on=['CHROM', 'POS', 'REF'], right_on=['Chrom','Pos', 'REF'], how='inner')

In [None]:
#Number of ACC variants whose position and REF match an entry in Common_hg38_mult_7
tam_merge_mult = len(base_merge_mult.index)
print(tam_merge_mult)

71


In [None]:
base_merge_mult.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 71 entries, 0 to 70
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         71 non-null     object 
 1   POS                           71 non-null     int64  
 2   ID                            71 non-null     object 
 3   REF                           71 non-null     object 
 4   ALT_x                         71 non-null     object 
 5   QUAL                          71 non-null     object 
 6   FILTER                        71 non-null     object 
 7   FORMAT                        71 non-null     object 
 8   avsnp150                      71 non-null     object 
 9   Interpro_domain               71 non-null     object 
 10  dbNSFP_DEOGEN2_pred           71 non-null     object 
 11  dbNSFP_MetaSVM_pred           71 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  71 non-null     object 
 13  dbNSFP_

In [None]:
base_merge_mult.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT_x,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,ALT_y,COMMON
0,18,72750078,.,C,T,.,PASS,GT:AD:DP,.,.,"T,T",T,D,T,".,N",T,D,T,.,"B,B",".,D","T,T","D,D",N,D,T,1.281328e-05,".,.","D,D,D",T,".,D","P,P","N,N",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gat/Aat,p.Asp509Asn,c.1525G>A,NETO1,NM_001201465.1,9,T,1,1525,>,Asp,Asn,509,subst,.,18,72750078,rs770123667,"G,T",0.0
1,19,2434858,.,C,T,.,PASS,GT:AD:DP,rs567796688,.,D,T,D,T,.,D,D,T,1.661e-05,.,.,D,T,D,N,T,1.699929e-05,.,"D,D",T,D,.,.,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cAc,p.Arg304His,c.911G>A,LMNB2,NM_032737.3,6,T,2,911,>,Arg,His,304,subst,.,19,2434858,rs567796688,"A,T",0.0
2,19,3734402,.,G,A,.,PASS,GT:AD:DP,rs200603772,.,"T,.,.,.",T,N,T,"N,N,.,.",T,T,T,4.942e-05,".,.,.,B","T,T,.,.","T,T,T,T","T,T,T,T",N,N,T,4.430624e-05,".,.,.,.","N,N,N,N,N,N",T,"T,T,T,T",".,.,.,B","L,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg327Gln,c.980G>A,TJP3,NM_001267561.1,8,A,2,980,>,Arg,Gln,327,subst,.,19,3734402,rs200603772,"A,C,T",0.0
3,19,6047462,rs199616707,G,A,.,PASS,GT:AD:DP,rs948583093,RFX1_transcription_activation_region,"T,T,.,T,T,.,T",T,D,T,"N,N,.,.,.,.,.",T,D,T,.,"B,B,B,.,.,.,.","T,T,.,.,.,.,.","T,T,T,T,T,T,T","T,T,T,T,.,.,.",D,D,T,4.241242e-06,".,.,.,.,.,.,.","N,N,N",T,".,D,D,D,D,D,D","P,P,P,.,.,.,.","L,L,L,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,gCg/gTg,p.Ala12Val,c.35C>T,RFX2,NM_000635.3,2,A,2,35,>,Ala,Val,12,subst,.,19,6047462,rs948583093,"A,C",0.0
4,19,8305147,.,G,C,.,PASS,GT:AD:DP,.,.,.,.,N,.,.,.,.,D,.,.,.,.,.,N,N,D,.,.,"A,N",.,.,.,.,STOP_GAINED,HIGH,NONSENSE,tCa/tGa,p.Ser51*,c.152C>G,CD320,NM_016579.3,2,C,2,152,>,Ser,*,51,translation termination,.,19,8305147,rs1205348342,"A,T",0.0


In [None]:
#Converting the comma-separated value of the ALT_y field into a list
base_merge_mult["ALT_y"] = base_merge_mult["ALT_y"].apply(lambda x: x.split(","))

In [None]:
print(base_merge_mult[['ALT_x','ALT_y']])

   ALT_x      ALT_y
0      T     [G, T]
1      T     [A, T]
2      A  [A, C, T]
3      A     [A, C]
4      C     [A, T]
5      C     [A, C]
6      C     [A, C]
7      A     [A, T]
8      T     [A, T]
9      T     [A, T]
10     T     [G, T]
11     A     [A, C]
12     A     [A, T]
13     T     [A, T]
14     A  [A, C, T]
15     A     [A, C]
16     T     [A, T]
17     T  [A, G, T]
18     A     [A, T]
19     A     [A, C]
20     T     [G, T]
21     T  [A, G, T]
22     A     [A, T]
23     A     [A, T]
24     A     [A, C]
25     T     [A, T]
26     G     [A, T]
27     A     [A, C]
28     A     [A, C]
29     T     [G, T]
30     T     [G, T]
31     A     [A, C]
32     A     [A, T]
33     T     [A, T]
34     A     [A, T]
35     A     [A, T]
36     T     [A, T]
37     T     [A, T]
38     A     [A, T]
39     A     [A, T]
40     T     [G, T]
41     T     [A, T]
42     A     [A, G]
43     A     [G, T]
44     A     [A, C]
45     T     [A, C]
46     T  [A, G, T]
47     T     [G, T]
48     T     [A, T]


In [None]:
#Generating a new dataframe that contains only the rows where the value of ALT_x (ACC database) is contained in ALT_y (COMMON database)
def find_value_column(row):
    return row.ALT_x in row.ALT_y

#.copy() makes the result an independent dataframe, so later edits do not act on a slice of base_merge_mult
base_merge_mult_ok = base_merge_mult[base_merge_mult.apply(find_value_column, axis=1)].copy()
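
As a minimal sketch of the filter above, the toy example below splits a comma-separated `ALT_y` and keeps only the rows whose `ALT_x` is among the listed alleles. The data frame `toy` and its values are invented for illustration; the second row mirrors the case in the printed output where `ALT_x` is `G` and `ALT_y` is `[A, T]`, which is discarded. A list comprehension is used here as one possible alternative to the row-wise `apply`, not as the notebook's method.

```python
import pandas as pd

# Invented example rows: ALT_x from the tumor calls, ALT_y from the COMMON table
toy = pd.DataFrame({
    "ALT_x": ["T", "G", "A"],
    "ALT_y": ["G,T", "A,T", "A,C,T"],
})

# Split the comma-separated alleles into lists
alt_lists = toy["ALT_y"].str.split(",")

# True where the called allele appears among the known alleles
keep = [alt in alleles for alt, alleles in zip(toy["ALT_x"], alt_lists)]

# .copy() gives an independent dataframe that is safe to modify later
toy_ok = toy.loc[keep].copy()
print(toy_ok)
```

The surviving rows keep their original index labels, which is why the filtered notebook table reports 62 entries with labels running from 0 to 68.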

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 68
Data columns (total 56 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         62 non-null     object 
 1   POS                           62 non-null     int64  
 2   ID                            62 non-null     object 
 3   REF                           62 non-null     object 
 4   ALT_x                         62 non-null     object 
 5   QUAL                          62 non-null     object 
 6   FILTER                        62 non-null     object 
 7   FORMAT                        62 non-null     object 
 8   avsnp150                      62 non-null     object 
 9   Interpro_domain               62 non-null     object 
 10  dbNSFP_DEOGEN2_pred           62 non-null     object 
 11  dbNSFP_MetaSVM_pred           62 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  62 non-null     object 
 13  dbNSFP_

In [None]:
#Rename the ALT_x column to ALT (assigning the result avoids modifying a filtered slice in place)
base_merge_mult_ok = base_merge_mult_ok.rename(columns={'ALT_x': 'ALT'})

In [None]:
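#Remove the redundant ALT_y field (the columns keyword replaces the positional axis argument, which recent pandas versions no longer accept)
base_merge_mult_ok = base_merge_mult_ok.drop(columns='ALT_y')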

In [None]:
base_merge_mult_ok.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 62 entries, 0 to 68
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         62 non-null     object 
 1   POS                           62 non-null     int64  
 2   ID                            62 non-null     object 
 3   REF                           62 non-null     object 
 4   ALT                           62 non-null     object 
 5   QUAL                          62 non-null     object 
 6   FILTER                        62 non-null     object 
 7   FORMAT                        62 non-null     object 
 8   avsnp150                      62 non-null     object 
 9   Interpro_domain               62 non-null     object 
 10  dbNSFP_DEOGEN2_pred           62 non-null     object 
 11  dbNSFP_MetaSVM_pred           62 non-null     object 
 12  dbNSFP_fathmmMKL_coding_pred  62 non-null     object 
 13  dbNSFP_

In [None]:
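#Stack the single-allele matches (base_merge) and the multi-allele matches (base_merge_mult_ok); pd.concat replaces DataFrame.append, which was removed in pandas 2.0
base_ACC_COMMON7_MUTECT = pd.concat([base_merge, base_merge_mult_ok], ignore_index=True)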
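
Before stacking two tables it can help to confirm that they expose the same columns, since any mismatch is silently filled with NaN. The sketch below illustrates that check with `pd.concat`, which stacks rows the same way; the frames `single` and `multi` are hypothetical stand-ins for `base_merge` and `base_merge_mult_ok` and contain only a few of the real columns.

```python
import pandas as pd

# Hypothetical stand-ins for base_merge and base_merge_mult_ok
single = pd.DataFrame({"CHROM": ["19"], "POS": [1220641], "ALT": ["T"], "COMMON": [0.0]})
multi = pd.DataFrame({"CHROM": ["18"], "POS": [72750078], "ALT": ["T"], "COMMON": [0.0]})

# Both inputs should have identical columns, otherwise concat pads with NaN
assert list(single.columns) == list(multi.columns)

combined = pd.concat([single, multi], ignore_index=True)

# The stacked table must contain exactly the rows of both inputs
assert len(combined) == len(single) + len(multi)
print(combined.shape)
```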

In [None]:
base_ACC_COMMON7_MUTECT.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 251 entries, 0 to 250
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         251 non-null    object 
 1   POS                           251 non-null    int64  
 2   ID                            251 non-null    object 
 3   REF                           251 non-null    object 
 4   ALT                           251 non-null    object 
 5   QUAL                          251 non-null    object 
 6   FILTER                        251 non-null    object 
 7   FORMAT                        251 non-null    object 
 8   avsnp150                      251 non-null    object 
 9   Interpro_domain               251 non-null    object 
 10  dbNSFP_DEOGEN2_pred           251 non-null    object 
 11  dbNSFP_MetaSVM_pred           251 non-null    object 
 12  dbNSFP_fathmmMKL_coding_pred  251 non-null    object 
 13  dbNSF

In [None]:
base_ACC_COMMON7_MUTECT.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,19,1220641,.,C,T,.,PASS,GT:AD:DP,.,Protein_kinase_domain|Protein_kinase-like_domain,".,.",.,D,.,".,.",.,.,D,.,".,.",".,.",".,.",".,.",D,N,D,.,".,.",A,.,".,.",".,.",".,.",STOP_GAINED,HIGH,NONSENSE,Cag/Tag,p.Gln220*,c.658C>T,STK11,NM_000455.4,5,T,1,658,>,Gln,*,220,translation termination,.,19,1220641,rs1131690940,0.0
1,19,1277287,.,G,A,.,PASS,GT:AD:DP,.,.,T,T,D,.,N,D,D,T,.,D,T,T,T,.,N,T,1.913168e-05,.,.,T,T,D,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg129Gln,c.386G>A,C19orf24,NM_017914.3,2,A,2,386,>,Arg,Gln,129,subst,.,19,1277287,rs1421078623,0.0
2,19,1796767,.,C,T,.,PASS,GT:AD:DP,rs773757936,"P-type_ATPase,_cytoplasmic_domain_N","D,.",T,N,T,"D,D",D,D,T,8.302e-06,"D,.","D,D","T,T","D,D",D,N,T,8.205465e-06,".,.","N,N,N",T,"D,T","D,.","M,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGc/cAc,p.Arg566His,c.1697G>A,ATP8B3,NM_138813.3,16,T,2,1697,>,Arg,His,566,subst,.,19,1796767,rs773757936,0.0
3,19,2763726,.,C,T,.,PASS,GT:AD:DP,rs374479958,Tetratricopeptide_repeat-containing_domain|Tet...,"T,T",T,D,T,"N,.",D,T,T,3.295e-05,"B,.","T,.","T,T","T,.",N,N,T,1.199530e-05,".,.",D,T,"T,T","B,.","N,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gca/Aca,p.Ala142Thr,c.424G>A,SGTA,NM_003021.3,6,T,1,424,>,Ala,Thr,142,subst,.,19,2763726,rs374479958,0.0
4,19,2934177,.,G,A,.,panel_of_normals,GT:AD:DP,rs151130031,Zinc_finger_C2H2-type,T,T,N,T,D,T,T,T,8.895e-04,B,D,T,T,.,N,T,8.393066e-04,.,N,T,T,P,L,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,aCg/aTg,p.Thr317Met,c.950C>T,ZNF77,NM_021217.2,4,A,2,950,>,Thr,Met,317,subst,.,19,2934177,rs151130031,1.0


###3.14.1 - Generating a file with the ACC MUTECT and COMMON_07 database

In [None]:
base_ACC_COMMON7_MUTECT.to_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_07.csv",sep='\t',index=False)

##3.15 - Integration of the 7 ACC databases with the COMMON field into a single database

In [None]:
import pandas as pd

ACC_01 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_01.csv",delimiter='\t')
ACC_02 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_02.csv",delimiter='\t')
ACC_03 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_03.csv",delimiter='\t')
ACC_04 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_04.csv",delimiter='\t')
ACC_05 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_05.csv",delimiter='\t')
ACC_06 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_06.csv",delimiter='\t')
ACC_07 = pd.read_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_Common_07.csv",delimiter='\t')
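
The seven reads above follow one naming pattern, so they could also be written as a loop. The sketch below is one way to do it under stated assumptions: `read_common_parts` is a hypothetical helper that is not part of the notebook, and the demonstration writes throwaway files to a temporary folder instead of touching Google Drive. Passing `dtype={"CHROM": str}` is an optional choice the notebook does not make; without it, pandas infers an integer `CHROM` when every label is numeric, which matches the `int64` dtype reported after these files are read back.

```python
import tempfile
from pathlib import Path

import pandas as pd


def read_common_parts(folder, n_parts=7):
    """Read base_ACC_MUTECT_Common_01..n_parts (tab-separated) into a list."""
    frames = []
    for i in range(1, n_parts + 1):
        path = Path(folder) / f"base_ACC_MUTECT_Common_{i:02d}.csv"
        # Assumption: CHROM is kept as text so that non-numeric labels would survive
        frames.append(pd.read_csv(path, sep="\t", dtype={"CHROM": str}))
    return frames


# Demonstration on throwaway files with invented content
with tempfile.TemporaryDirectory() as tmp:
    for i in range(1, 8):
        pd.DataFrame({"CHROM": [str(i)], "POS": [i * 10]}).to_csv(
            Path(tmp) / f"base_ACC_MUTECT_Common_{i:02d}.csv", sep="\t", index=False
        )
    parts = read_common_parts(tmp)
    merged = pd.concat(parts, ignore_index=True)

print(len(parts), merged.shape)
```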


In [None]:
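#Stack the seven partial databases into one; pd.concat replaces DataFrame.append, which was removed in pandas 2.0
base_ACC_MUTECT_COMMON = pd.concat([ACC_01, ACC_02, ACC_03, ACC_04, ACC_05, ACC_06, ACC_07], ignore_index=True)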

In [None]:
base_ACC_MUTECT_COMMON.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1543 entries, 0 to 1542
Data columns (total 55 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CHROM                         1543 non-null   int64  
 1   POS                           1543 non-null   int64  
 2   ID                            1543 non-null   object 
 3   REF                           1543 non-null   object 
 4   ALT                           1543 non-null   object 
 5   QUAL                          1543 non-null   object 
 6   FILTER                        1543 non-null   object 
 7   FORMAT                        1543 non-null   object 
 8   avsnp150                      1543 non-null   object 
 9   Interpro_domain               1543 non-null   object 
 10  dbNSFP_DEOGEN2_pred           1543 non-null   object 
 11  dbNSFP_MetaSVM_pred           1543 non-null   object 
 12  dbNSFP_fathmmMKL_coding_pred  1543 non-null   object 
 13  dbN

In [None]:
base_ACC_MUTECT_COMMON.head()

Unnamed: 0,CHROM,POS,ID,REF,ALT,QUAL,FILTER,FORMAT,avsnp150,Interpro_domain,dbNSFP_DEOGEN2_pred,dbNSFP_MetaSVM_pred,dbNSFP_fathmmMKL_coding_pred,dbNSFP_PrimateAI_pred,dbNSFP_PROVEAN_pred,dbNSFP_MCAP_pred,dbNSFP_ClinPred_pred,dbNSFP_BayesDel_addAF_pred,dbNSFP_ExAC_AF,dbNSFP_Polyphen2_HVAR_pred,dbNSFP_SIFT_pred,dbNSFP_FATHMM_pred,dbNSFP_SIFT4G_pred,dbNSFP_LRT_pred,dbNSFP_fathmmXF_coding_pred,dbNSFP_BayesDel_noAF_pred,dbNSFP_gnomAD_exomes_AF,dbNSFP_Aloft_pred,dbNSFP_MutationTaster_pred,dbNSFP_MetaLR_pred,dbNSFP_LISTS2_pred,dbNSFP_Polyphen2_HDIV_pred,dbNSFP_MutationAssessor_pred,VariantEffect_EFF,Risco_Mut_EFF,Tipo_Mut_EFF,Point_Mutation_EFF,changeProt_EFF,changecDNA_EFF,Gene_EFF,RefSeq_EFF,Exon_EFF,ALT_EFF,Pos_Point_Mutation_EFF,poschangecDNA_EFF,typechangecDNA_EFF,aminBefore,aminAfter,poschangeProt,typechangeProt,pos_terminalchangeProt,Chrom,Pos,SNP_ID,COMMON
0,1,2027599,.,G,A,.,PASS,GT:AD:DP,.,Neurotransmitter-gated_ion-channel_ligand-bind...,"T,.,.,.,.,.","T,.","D,D","D,.","N,.,.,.,.,.","D,D","T,T","T,T",".,.","D,.,.,.,.,.","T,.,.,.,.,.","T,.,.,.,.,.","T,.,.,.,.,.","D,.","D,D","T,T","4.006442e-06,4.006442e-06",".,.,.,.,.,.","D,D","T,.","D,D,D,D,D,T","D,.,.,.,.,.","N,.,.,.,.,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,Gac/Aac,p.Asp165Asn,c.493G>A,GABRD,NM_000815.4,5,A,1,493,>,Asp,Asn,165,subst,.,1,2027599,rs1477740666,0.0
1,1,2303896,.,C,T,.,PASS,GT:AD:DP,rs752779978,.,D,D,D,T,D,D,D,D,8.284e-06,P,D,D,T,D,D,D,8.239065e-06,.,D,D,D,D,M,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro423Leu,c.1268C>T,SKI,NM_003036.3,4,T,2,1268,>,Pro,Leu,423,subst,.,1,2303896,rs752779978,0.0
2,1,2385062,.,G,A,.,PASS,GT:AD:DP,rs772157368,.,.,T,N,T,D,.,T,T,2.502e-05,B,.,.,.,.,N,T,.,.,"D,D,D",T,.,P,.,SYNONYMOUS_CODING,LOW,SILENT,aaC/aaT,p.Asn151Asn,c.453C>T,MORN1,NM_024848.2,6,A,3,453,>,Asn,Asn,151,subst,.,1,2385062,rs772157368,0.0
3,1,2591033,.,C,T,.,PASS,GT:AD:DP,rs199926063,"Metallopeptidase,_catalytic_domain|Peptidase_M...",".,T,.",T,N,T,".,N,N",D,T,T,8.264e-05,".,B,.",".,T,T","D,D,D","T,T,T",N,N,T,1.351819e-04,".,.,.","N,N,N,N,N,N,N",T,"T,T,T",".,B,.",".,N,.",NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cGg/cAg,p.Arg766Gln,c.2297G>A,MMEL1,NM_033467.3.2,24,T,2,2297,>,Arg,Gln,766,subst,.,1,2591033,rs199926063,0.0
4,1,3816294,.,G,A,.,PASS,GT:AD:DP,.,.,T,T,N,T,N,T,T,T,.,B,T,T,T,N,N,T,.,.,N,T,T,B,N,NON_SYNONYMOUS_CODING,MODERATE,MISSENSE,cCg/cTg,p.Pro883Leu,c.2648C>T,CEP104,NM_014704.3,21,A,2,2648,>,Pro,Leu,883,subst,.,1,3816294,rs1197412379,0.0


In [None]:
#Rename the SNP_ID column to SNP_ID_COMMON
base_ACC_MUTECT_COMMON.rename(columns={'SNP_ID': 'SNP_ID_COMMON'}, inplace=True)

In [None]:
print(base_ACC_MUTECT_COMMON.columns)

Index(['CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'FORMAT',
       'avsnp150', 'Interpro_domain', 'dbNSFP_DEOGEN2_pred',
       'dbNSFP_MetaSVM_pred', 'dbNSFP_fathmmMKL_coding_pred',
       'dbNSFP_PrimateAI_pred', 'dbNSFP_PROVEAN_pred', 'dbNSFP_MCAP_pred',
       'dbNSFP_ClinPred_pred', 'dbNSFP_BayesDel_addAF_pred', 'dbNSFP_ExAC_AF',
       'dbNSFP_Polyphen2_HVAR_pred', 'dbNSFP_SIFT_pred', 'dbNSFP_FATHMM_pred',
       'dbNSFP_SIFT4G_pred', 'dbNSFP_LRT_pred', 'dbNSFP_fathmmXF_coding_pred',
       'dbNSFP_BayesDel_noAF_pred', 'dbNSFP_gnomAD_exomes_AF',
       'dbNSFP_Aloft_pred', 'dbNSFP_MutationTaster_pred', 'dbNSFP_MetaLR_pred',
       'dbNSFP_LISTS2_pred', 'dbNSFP_Polyphen2_HDIV_pred',
       'dbNSFP_MutationAssessor_pred', 'VariantEffect_EFF', 'Risco_Mut_EFF',
       'Tipo_Mut_EFF', 'Point_Mutation_EFF', 'changeProt_EFF',
       'changecDNA_EFF', 'Gene_EFF', 'RefSeq_EFF', 'Exon_EFF', 'ALT_EFF',
       'Pos_Point_Mutation_EFF', 'poschangecDNA_EFF', 'typechangecDNA_EFF',
  

In [None]:
def categories_column(df):
    #Print the frequency of each distinct value, column by column
    for col in df.columns:
        mydic = df[col].value_counts().to_dict()
        print(col, mydic)
        print('\n')

categories_column(base_ACC_MUTECT_COMMON)

CHROM {1: 153, 19: 137, 12: 108, 2: 98, 5: 90, 3: 89, 6: 82, 17: 80, 7: 78, 10: 66, 4: 64, 11: 63, 16: 63, 9: 62, 8: 61, 20: 57, 14: 54, 15: 36, 22: 35, 13: 24, 18: 22, 21: 21}


POS {41224613: 2, 47999719: 2, 49515669: 2, 32027084: 2, 86505860: 2, 71800667: 2, 30329606: 2, 117820859: 2, 112389583: 2, 41845758: 2, 8952474: 1, 122279878: 1, 75882838: 1, 1796767: 1, 59194014: 1, 16749271: 1, 100123277: 1, 13076132: 1, 10726403: 1, 113297763: 1, 96944784: 1, 132901521: 1, 768658: 1, 109279882: 1, 45480518: 1, 160430743: 1, 15103498: 1, 15240503: 1, 136466431: 1, 98687625: 1, 228321909: 1, 24076906: 1, 124658283: 1, 127033964: 1, 103465581: 1, 43719279: 1, 248593008: 1, 9751153: 1, 30571123: 1, 237628020: 1, 92443413: 1, 39359694: 1, 107577529: 1, 128584312: 1, 72475257: 1, 39682682: 1, 59677102: 1, 42893597: 1, 51260844: 1, 123144834: 1, 60830496: 1, 125307527: 1, 56386497: 1, 82426529: 1, 30481584: 1, 48200379: 1, 101728956: 1, 37388989: 1, 161106622: 1, 6047462: 1, 62701708: 1, 41142978

In [None]:
#Identify duplicate records in the data (rows that are identical in every column)
dupes = base_ACC_MUTECT_COMMON.duplicated()
dupes.sum()

0
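
The check above counts only rows that are identical in every column. The value counts printed earlier show a few `POS` values occurring twice, so a complementary check restricted to the variant key may be informative. The sketch below illustrates the difference on invented data (`toy` and the `rsA`/`rsB`/`rsC` identifiers are placeholders); repeated positions in the real table may simply lie on different chromosomes or carry different alleles.

```python
import pandas as pd

# Invented rows: the first two share the same variant key but differ in SNP_ID_COMMON
toy = pd.DataFrame({
    "CHROM": [1, 1, 2],
    "POS": [100, 100, 100],
    "REF": ["C", "C", "C"],
    "ALT": ["T", "T", "G"],
    "SNP_ID_COMMON": ["rsA", "rsB", "rsC"],
})

# Full-row duplicates: none, because SNP_ID_COMMON differs
full_dupes = toy.duplicated().sum()

# Key-level duplicates: keep=False flags every row of a repeated key
key_cols = ["CHROM", "POS", "REF", "ALT"]
key_dupes = toy[toy.duplicated(subset=key_cols, keep=False)]

print(full_dupes, len(key_dupes))
```

Applied to `base_ACC_MUTECT_COMMON`, an empty `key_dupes` would indicate that each variant appears once; a non-empty one would list the rows worth inspecting by hand.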

###3.15.1 - Generating an intermediate file with the *base_ACC_MUTECT_COMMON* database

In [None]:
base_ACC_MUTECT_COMMON.to_csv("drive/My Drive/BaseNovaDaniRaul/Bases_com_COMMON/base_ACC_MUTECT_COMMON.csv",sep='\t',index=False)