In [1]:
import pandas as pd

## 1.1) Explore information sources and compile the seed gene list:
a) Get the list of human genes (i.e. the **seed list**) involved in the disease **Cardiomyopathy, Dilated** from the dataset *“Curated gene-disease associations”* (from *https://www.disgenet.org/downloads*)

In [5]:
path = "C:/Users/clara/Documents/Bio/curated_gene_disease_associations.tsv"
curated_gene_desease_association_DF = pd.read_csv(path, sep = '\t', compression = 'infer')

In [18]:
cardio_DF = curated_gene_desease_association_DF.loc[curated_gene_desease_association_DF['diseaseId'] == 'C0007193']
print('Number of detected genes involved in the disease "Cardiomyopathy, Dilated": ', len(cardio_DF))
cardio_DF.head(5)

Number of detected genes involved in the disease "Cardiomyopathy, Dilated":  48


Unnamed: 0,geneId,geneSymbol,DSI,DPI,diseaseId,diseaseName,diseaseType,diseaseClass,diseaseSemanticType,score,EI,YearInitial,YearFinal,NofPmids,NofSnps,source
502,58,ACTA1,0.54,0.769,C0007193,"Cardiomyopathy, Dilated",group,C14,Disease or Syndrome,0.4,1.0,2006.0,2013.0,2,0,GENOMICS_ENGLAND
610,70,ACTC1,0.61,0.538,C0007193,"Cardiomyopathy, Dilated",group,C14,Disease or Syndrome,0.65,1.0,2006.0,2019.0,0,3,CTD_human
1414,153,ADRB1,0.555,0.769,C0007193,"Cardiomyopathy, Dilated",group,C14,Disease or Syndrome,0.58,1.0,1998.0,2019.0,1,0,CTD_human
1444,154,ADRB2,0.442,0.923,C0007193,"Cardiomyopathy, Dilated",group,C14,Disease or Syndrome,0.51,1.0,2002.0,2008.0,1,0,CTD_human
3663,355,FAS,0.372,0.923,C0007193,"Cardiomyopathy, Dilated",group,C14,Disease or Syndrome,0.51,1.0,1999.0,2007.0,1,0,CTD_human


In [28]:
seed_list = list(cardio_DF['geneId'])
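`len(cardio_DF)` counts rows rather than distinct genes, so it can be worth confirming that no `geneId` is repeated before treating the row count as the size of the seed list. The following is a minimal sketch on a toy DataFrame; `toy_DF` and its values are illustrative assumptions, not the real DisGeNET table.

```python
import pandas as pd

# Toy stand-in for cardio_DF (illustrative values only, with one repeated gene)
toy_DF = pd.DataFrame({
    "geneId": [58, 70, 153, 70],
    "geneSymbol": ["ACTA1", "ACTC1", "ADRB1", "ACTC1"],
})

# All rows whose geneId occurs more than once
duplicated_rows = toy_DF[toy_DF["geneId"].duplicated(keep=False)]

# Order-preserving list of distinct gene IDs
unique_seed_list = toy_DF["geneId"].drop_duplicates().tolist()

print(len(toy_DF), "rows,", len(unique_seed_list), "distinct genes")
```

If `duplicated_rows` is empty on the real data, the row count and the number of seed genes coincide.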

b) Check if gene symbols for all genes in the seed gene list are updated and approved on the *HGNC* website

In [36]:
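# 'line_terminator' was renamed to 'lineterminator' in pandas 1.5 and the old name was removed in pandas 2.0
cardio_DF['geneSymbol'].to_csv('gene_symbols.csv', index = False, header = False, lineterminator = ',')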
gene_symbols_list = list(cardio_DF['geneSymbol'])
print(gene_symbols_list)

['ACTA1', 'ACTC1', 'ADRB1', 'ADRB2', 'FAS', 'FASLG', 'ATM', 'CD36', 'CSF3', 'NKX2-5', 'CTNNB1', 'DMD', 'EGFR', 'FASN', 'GPX1', 'ITGB1', 'LMNA', 'NR3C2', 'MYH6', 'MYH7', 'NPPA', 'NPPB', 'PKP2', 'PSEN1', 'PSEN2', 'RAC1', 'RAF1', 'RENBP', 'SCN5A', 'SDHA', 'SGCB', 'SHBG', 'SLC22A5', 'SOD2', 'TCF7L2', 'TMPO', 'TNNI3', 'TNNT2', 'TTN', 'UCP1', 'ALMS1', 'AXIN2', 'ABCC9', 'WDR12', 'CSRNP1', 'ABRA', 'SIK1', 'CAVIN4']


In [55]:
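# 'lineterminator' replaces 'line_terminator' (renamed in pandas 1.5, old name removed in pandas 2.0)
cardio_DF['geneId'].to_csv('geneId.csv', index = False, header = False, lineterminator = ',')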

The multi-symbol checker on the *HGNC* database reports that all the gene symbols in our seed list are up to date and approved. Three of them, ***FAS***, ***RAC1*** and ***RAF1***, match both an approved symbol and an alias symbol.
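The check above was done by hand on the website. A comparable check could be scripted against a locally downloaded HGNC table. The sketch below is only an illustration under stated assumptions: `hgnc_toy` is a hypothetical miniature table, its column names (`symbol`, `status`, `alias_symbol`) and the pipe-separated alias format are assumed rather than taken from the notebook, and all alias values are invented.

```python
import pandas as pd

# Hypothetical miniature of an HGNC export; column names and values are assumptions
hgnc_toy = pd.DataFrame({
    "symbol": ["FAS", "RAC1", "TTN", "FASN"],
    "status": ["Approved", "Approved", "Approved", "Approved"],
    "alias_symbol": ["ALIAS_A|ALIAS_B", "ALIAS_C", None, "FAS"],
})

symbols_to_check = ["FAS", "TTN", "NOTAGENE"]

# Set of approved symbols
approved = set(hgnc_toy.loc[hgnc_toy["status"] == "Approved", "symbol"])

# Set of all alias symbols (aliases assumed to be separated by '|')
aliases = set()
for entry in hgnc_toy["alias_symbol"].dropna():
    aliases.update(entry.split("|"))

# One row per queried symbol: is it approved, and is it also used as an alias?
report = pd.DataFrame({
    "symbol": symbols_to_check,
    "is_approved": [s in approved for s in symbols_to_check],
    "is_alias": [s in aliases for s in symbols_to_check],
})
print(report)
```

Symbols flagged as both approved and alias are the ones that need a manual look, as was the case for the three genes mentioned above.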

For each protein in our seed list we want to collect the following information from the *UniProt* website:
* official (primary) **gene symbol** --> *Gene names (primary)*
* **UniProt AC**, alphanumeric ‘accession number’ (a.k.a. ‘UniProt entry’) --> *Entry*
* **protein name** (the main one only, do not report the aliases)
* **Entrez Gene ID** (a.k.a. ‘GeneID’) --> *geneId* from DisGeNET
* very brief description of its function (keep it very short, i.e. max 20 words)
* notes related to the above information, if any and if relevant

**NOTE**: For the gene symbol **TMPO**, only the entry corresponding to the protein *Thymopoietin, isoforms alpha* (P42166) has been kept, since the information on *HGNC* refers only to this entry and not to the isoforms beta/gamma.
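A gene that maps to more than one UniProt entry, as noted for TMPO, can be spotted programmatically before the tables are combined. The sketch below is a minimal illustration on toy data: `seed_toy`, `uniprot_toy`, the gene `GENE_X` with ID 9999 and the accessions `AC_1`/`AC_2` are placeholders, and keeping the first entry per gene stands in for the manual choice described in the note.

```python
import pandas as pd

# Toy seed table using the DisGeNET column names (third gene is a placeholder)
seed_toy = pd.DataFrame({
    "geneId": [58, 70, 9999],
    "geneSymbol": ["ACTA1", "ACTC1", "GENE_X"],
})

# Toy UniProt mapping; the placeholder gene has two entries
uniprot_toy = pd.DataFrame({
    "GeneId": [58, 70, 9999, 9999],
    "UniprotAC": ["P68133", "P68032", "AC_1", "AC_2"],
})

# Genes associated with more than one accession
counts = uniprot_toy.groupby("GeneId")["UniprotAC"].nunique()
multi_entry_genes = counts[counts > 1].index.tolist()

# Keep one entry per gene (here simply the first; in practice a manual decision)
uniprot_unique = uniprot_toy.drop_duplicates(subset="GeneId", keep="first")

# Left merge so every seed gene is retained; validate raises if keys are not unique
table = seed_toy.merge(
    uniprot_unique,
    left_on="geneId",
    right_on="GeneId",
    how="left",
    validate="one_to_one",
)

# Seed genes without any UniProt match
missing = table[table["UniprotAC"].isna()]
print(multi_entry_genes, len(table), len(missing))
```

The `validate="one_to_one"` argument makes the merge fail loudly if a duplicate key slips through, which guards against silently inflating the seed list.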

In [56]:
root = "C:/Users/clara/Documents/Bio/"
path = root + "uniprot-list-with-ids.csv"
uniprot_DF = pd.read_csv(path, sep = '\t')

In [59]:
uniprot_DF[['GeneName','UniprotAC', 'ProteinName', 'GeneId']]

Unnamed: 0,GeneName,UniprotAC,ProteinName,GeneId
0,ACTA1,P68133,"Actin, alpha skeletal muscle (Alpha-actin-1) [...",58
1,ACTC1,P68032,"Actin, alpha cardiac muscle 1 (Alpha-cardiac a...",70
2,ADRB1,P08588,Beta-1 adrenergic receptor (Beta-1 adrenorecep...,153
3,ADRB2,P07550,Beta-2 adrenergic receptor (Beta-2 adrenorecep...,154
4,FAS,P25445,Tumor necrosis factor receptor superfamily mem...,355
5,FASLG,P48023,Tumor necrosis factor ligand superfamily membe...,356
6,ATM,Q13315,Serine-protein kinase ATM (EC 2.7.11.1) (Ataxi...,472
7,CD36,P16671,Platelet glycoprotein 4 (Fatty acid translocas...,948
8,CSF3,P09919,Granulocyte colony-stimulating factor (G-CSF) ...,1440
9,NKX2-5,P52952,Homeobox protein Nkx-2.5 (Cardiac-specific hom...,1482


In [57]:
uniprot_DF.head(20)

Unnamed: 0,UniprotAC,Status,Entry name,GeneName,ProteinName,GeneId
0,P68133,reviewed,ACTS_HUMAN,ACTA1,"Actin, alpha skeletal muscle (Alpha-actin-1) [...",58
1,P68032,reviewed,ACTC_HUMAN,ACTC1,"Actin, alpha cardiac muscle 1 (Alpha-cardiac a...",70
2,P08588,reviewed,ADRB1_HUMAN,ADRB1,Beta-1 adrenergic receptor (Beta-1 adrenorecep...,153
3,P07550,reviewed,ADRB2_HUMAN,ADRB2,Beta-2 adrenergic receptor (Beta-2 adrenorecep...,154
4,P25445,reviewed,TNR6_HUMAN,FAS,Tumor necrosis factor receptor superfamily mem...,355
5,P48023,reviewed,TNFL6_HUMAN,FASLG,Tumor necrosis factor ligand superfamily membe...,356
6,Q13315,reviewed,ATM_HUMAN,ATM,Serine-protein kinase ATM (EC 2.7.11.1) (Ataxi...,472
7,P16671,reviewed,CD36_HUMAN,CD36,Platelet glycoprotein 4 (Fatty acid translocas...,948
8,P09919,reviewed,CSF3_HUMAN,CSF3,Granulocyte colony-stimulating factor (G-CSF) ...,1440
9,P52952,reviewed,NKX25_HUMAN,NKX2-5,Homeobox protein Nkx-2.5 (Cardiac-specific hom...,1482
