This notebook processes the Transcript_NCBI database, which maps each transcript_NCBI_id found in ACC to its Uniprot_id.

Here, **Adenoid cystic carcinoma (ACC)** is processed. To process the other 32 cancers, just change the input file (in section 1); the processing is the same.

#0 - Basic Settings

In [None]:
#Permission to access any file on Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Increase column and row display capacity
import pandas as pd

pd.set_option('display.max_columns', 7000)
pd.set_option('display.max_rows',70000)

#1 - Reading the Transcript_NCBI_id database and generating the Uniprot id


The database was downloaded from the URL: https://www.uniprot.org/uploadlists,
submitting as input the RefSeq Nucleotide ids (NCBI transcript ids of the **ACC** tissue). The options selected to generate the Uniprot ids were:

From: RefSeq Nucleotide

To: UniProtKB

And the filter was applied: Reviewed (Swiss-Prot) - Manually annotated

Saved in the format: Tab-separated

Download performed on 03/05/2021

In [None]:
#Reading idTranscrito_Uniprot_ACC database
import pandas as pd
base_uniprot = pd.read_csv("drive/My Drive/MontagemNovaBase/ProcessamentoUniprot_Revisado/idTranscrito_Uniprot_ACC.tab", delimiter='\t')

In [None]:
tam_base_uniprot = len(base_uniprot.index)
print(tam_base_uniprot)

1211


In [None]:
base_uniprot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1211 entries, 0 to 1210
Data columns (total 9 columns):
 #   Column                                                     Non-Null Count  Dtype 
---  ------                                                     --------------  ----- 
 0   yourlist:M2021050363E7E78CFC6242B71761763234FF46DC049A5EI  1211 non-null   object
 1   isomap:M2021050363E7E78CFC6242B71761763234FF46DC049A5EI    745 non-null    object
 2   Entry                                                      1211 non-null   object
 3   Entry name                                                 1211 non-null   object
 4   Status                                                     1211 non-null   object
 5   Protein names                                              1211 non-null   object
 6   Gene names                                                 1211 non-null   object
 7   Organism                                                   1211 non-null   object
 8   Length             

In [None]:
#Selecting the necessary attributes
base_uniprot = base_uniprot.loc[:,['yourlist:M2021050363E7E78CFC6242B71761763234FF46DC049A5EI','Entry','Gene names']]

In [None]:
base_uniprot.rename(columns={'yourlist:M2021050363E7E78CFC6242B71761763234FF46DC049A5EI': 'transcript_NCBI_id',
                             'Entry': 'Uniprot_id',
                             'Gene names': 'Genes_Uniprot',
                       }, inplace=True)
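
The first column name above embeds the job identifier of this particular submission (`yourlist:M2021...`), so it will presumably differ in the files generated for the other cancers. A minimal sketch of one way to avoid hard-coding it is to locate the column by its `yourlist:` prefix. The helper name `standardize_mapping_columns` and the job id `JOB123` are hypothetical and only serve the illustration; the single data row is taken from the ACC table shown in this notebook.

```python
import pandas as pd


def standardize_mapping_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Keep the three mapping columns and give them stable names.

    Hypothetical helper: assumes the export has exactly one column whose
    name starts with 'yourlist:' plus the 'Entry' and 'Gene names' columns.
    """
    # Find the submitted-id column by prefix instead of by its full name.
    query_cols = [c for c in df.columns if c.startswith("yourlist:")]
    if len(query_cols) != 1:
        raise ValueError(
            f"expected exactly one 'yourlist:' column, found {len(query_cols)}"
        )
    query_col = query_cols[0]

    # Select the needed attributes and rename them in one step.
    return df[[query_col, "Entry", "Gene names"]].rename(
        columns={
            query_col: "transcript_NCBI_id",
            "Entry": "Uniprot_id",
            "Gene names": "Genes_Uniprot",
        }
    )


# Toy frame that mimics the layout of the export (job id is made up).
example = pd.DataFrame(
    {
        "yourlist:JOB123": ["NM_000546.5"],
        "isomap:JOB123": [None],
        "Entry": ["P04637"],
        "Gene names": ["TP53 P53"],
    }
)
result = standardize_mapping_columns(example)
print(list(result.columns))
```

With a helper like this, the selection and renaming cells would not need to be edited when the input file is swapped for another cancer.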

In [None]:
base_uniprot.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1211 entries, 0 to 1210
Data columns (total 3 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   transcript_NCBI_id  1211 non-null   object
 1   Uniprot_id          1211 non-null   object
 2   Genes_Uniprot       1211 non-null   object
dtypes: object(3)
memory usage: 28.5+ KB


In [None]:
base_uniprot.head(1000)

Unnamed: 0,transcript_NCBI_id,Uniprot_id,Genes_Uniprot
0,NM_000546.5,P04637,TP53 P53
1,NM_001098209.1,P35222,CTNNB1 CTNNB OK/SW-cl.35 PRO2286
2,NM_001292009.1,Q5T1H1,EYS C6orf178 C6orf179 C6orf180 EGFL10 EGFL11 S...
3,NM_032242.3,Q9UIW2,PLXNA1 NOV PLXN1
4,NM_130773.3,Q8WYK1,CNTNAP5 CASPR5
5,NM_001297650.1,Q9NR16,CD163L1 CD163B M160 UNQ6434/PRO23202
6,NM_005559.3,P25391,LAMA1 LAMA
7,NM_004984.2,Q12840,KIF5A NKHC1
8,NM_001122965.1,Q6XPR3,RPTN
9,NM_032119.3,Q8WXG9,ADGRV1 GPR98 KIAA0686 KIAA1943 MASS1 VLGR1


In [None]:
#Checking for the existence of 'missing' values
base_uniprot.isna().sum()

transcript_NCBI_id    0
Uniprot_id            0
Genes_Uniprot         0
dtype: int64

In [None]:
#Identify duplicate records in the data
dupes=base_uniprot.duplicated()
sum(dupes)

0
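
The check above only detects rows that are identical in every column. A complementary check, sketched here under the assumption that it is useful to know whether a single transcript maps to more than one Uniprot entry, looks for repeats in `transcript_NCBI_id` alone. The identifiers in the toy frame are placeholders, not real accessions.

```python
import pandas as pd

# Toy data with placeholder ids: the first transcript maps to two entries.
toy = pd.DataFrame(
    {
        "transcript_NCBI_id": ["NM_A.1", "NM_A.1", "NM_B.2"],
        "Uniprot_id": ["P00001", "P00002", "P00003"],
    }
)

# keep=False flags every row of a repeated transcript, not just the later ones.
multi_mapped = toy[toy.duplicated(subset="transcript_NCBI_id", keep=False)]

# Number of distinct transcripts that appear in more than one row.
n_multi = multi_mapped["transcript_NCBI_id"].nunique()
print(n_multi)
```

On the real data, the same two lines could be applied to `base_uniprot` in place of `toy`.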

In [None]:
base_uniprot.head()

Unnamed: 0,transcript_NCBI_id,Uniprot_id,Genes_Uniprot
0,NM_000546.5,P04637,TP53 P53
1,NM_001098209.1,P35222,CTNNB1 CTNNB OK/SW-cl.35 PRO2286
2,NM_001292009.1,Q5T1H1,EYS C6orf178 C6orf179 C6orf180 EGFL10 EGFL11 S...
3,NM_032242.3,Q9UIW2,PLXNA1 NOV PLXN1
4,NM_130773.3,Q8WYK1,CNTNAP5 CASPR5


In [None]:
base_uniprot.tail(50)

Unnamed: 0,transcript_NCBI_id,Uniprot_id,Genes_Uniprot
1161,NM_054028.1,Q96KT7,SLC35G5 AMAC AMAC1L2
1162,NM_015052.4,Q76N89,HECW1 KIAA0322 NEDL1
1163,NM_002318.2,Q9Y4K0,LOXL2
1164,NM_003283.5,P13805,TNNT1 TNT
1165,NM_018380.3,Q9NUL7,DDX28 MDDX28
1166,NM_014112.4,Q9UHF7,TRPS1
1167,NM_003655.2,O00257,CBX4
1168,NM_052948.3,O14559,ARHGAP33 SNX26 TCGAP
1169,NM_001102594.1,Q86UW9,DTX2 KIAA1528 RNF58
1170,NM_001127372.2,P51523,ZNF84


#2 - Generating the file with the Uniprot database

In [None]:
base_uniprot.to_csv("drive/My Drive/MontagemNovaBase/ProcessamentoUniprot_Revisado/Uniprot_ACC.csv",sep='\t',index=False)
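
Note that the file written above has a `.csv` extension but is tab-separated, so it has to be read back with the same separator. The sketch below illustrates the round trip on two rows copied from the table in this notebook, using a temporary directory instead of Google Drive; the file name `Uniprot_example.csv` is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the ACC table shown earlier in this notebook.
toy = pd.DataFrame(
    {
        "transcript_NCBI_id": ["NM_000546.5", "NM_001098209.1"],
        "Uniprot_id": ["P04637", "P35222"],
        "Genes_Uniprot": ["TP53 P53", "CTNNB1 CTNNB OK/SW-cl.35 PRO2286"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "Uniprot_example.csv")
    # Same options as the export cell: tab separator, no index column.
    toy.to_csv(out_path, sep="\t", index=False)
    # Reading back requires the same separator despite the .csv extension.
    reloaded = pd.read_csv(out_path, sep="\t")

# Compare the cell values of the written and re-read tables.
roundtrip_ok = reloaded.values.tolist() == toy.values.tolist()
print(roundtrip_ok)
```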