This script parses the Reactome database TSV files that map UniProt IDs to lowest-level pathways and to pathways at all hierarchy levels. It then characterizes how many pathways exist, how they intersect with the drug targets, etc.

In [1]:
import pandas as pd

In [2]:
lowest_pathway_reactome_file_path = 'data/Reactome/UniProt2Reactome.tsv'
all_pathways_reactome_file_path = 'data/Reactome/UniProt2Reactome_All_Levels.tsv'

lowest_pathways_df = pd.read_csv(lowest_pathway_reactome_file_path, sep='\t', names=['UniProtKB_ID', 'Reactome_ID', 'Reactome_URL', 'Pathway_Name', 'Evidence_Code', 'Species'])
print("Original lowest pathways shape: " + str(lowest_pathways_df.shape))
lowest_pathways_df = lowest_pathways_df[lowest_pathways_df['Species'] == 'Homo sapiens']
print("Filtered lowest pathways shape restricting to Homo sapiens: " + str(lowest_pathways_df.shape))

all_pathways_df = pd.read_csv(all_pathways_reactome_file_path, sep='\t', names=['UniProtKB_ID', 'Reactome_ID', 'Reactome_URL', 'Pathway_Name', 'Evidence_Code', 'Species'])
print("Original all pathways shape: " + str(all_pathways_df.shape))
all_pathways_df = all_pathways_df[all_pathways_df['Species'] == 'Homo sapiens']
print("Filtered all pathways shape restricting to Homo sapiens: " + str(all_pathways_df.shape))

# Drop the Reactome URL, Evidence Code, and Species columns (unnecessary)
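# Reassign rather than using inplace=True: both frames are filtered slices, and an in-place drop on a slice can raise SettingWithCopyWarning
lowest_pathways_df = lowest_pathways_df.drop(columns=['Reactome_URL', 'Evidence_Code', 'Species'])
all_pathways_df = all_pathways_df.drop(columns=['Reactome_URL', 'Evidence_Code', 'Species'])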

# Save to CSV file
lowest_pathways_df.to_csv('data_processed/reactome_lowest_pathways_homo_sapiens.csv', index=False)
all_pathways_df.to_csv('data_processed/reactome_all_pathways_homo_sapiens.csv', index=False)


Original lowest pathways shape: (302009, 6)
Filtered lowest pathways shape restricting to Homo sapiens: (51056, 6)
Original all pathways shape: (871849, 6)
Filtered all pathways shape restricting to Homo sapiens: (149564, 6)
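
The shapes above count rows (protein-pathway pairs), not distinct pathways or proteins. A minimal sketch of how those counts could be obtained is shown here. The helper name `summarize_pathways` and the toy rows are hypothetical, used only so the example runs without the Reactome files; the same function could be applied to `lowest_pathways_df` and `all_pathways_df`.

```python
import pandas as pd

# Toy stand-in for a UniProt-to-Reactome table (hypothetical rows, not real Reactome data).
toy_df = pd.DataFrame({
    'UniProtKB_ID': ['P_TOY1', 'P_TOY1', 'P_TOY2', 'P_TOY3'],
    'Reactome_ID': ['R-TOY-1', 'R-TOY-2', 'R-TOY-1', 'R-TOY-3'],
    'Pathway_Name': ['Toy pathway A', 'Toy pathway B', 'Toy pathway A', 'Toy pathway C'],
})

def summarize_pathways(df):
    """Return basic counts for a UniProt-to-Reactome mapping table."""
    # Number of distinct pathways annotated to each protein
    pathways_per_protein = df.groupby('UniProtKB_ID')['Reactome_ID'].nunique()
    return {
        'n_rows': len(df),
        'n_proteins': int(df['UniProtKB_ID'].nunique()),
        'n_pathways': int(df['Reactome_ID'].nunique()),
        'max_pathways_per_protein': int(pathways_per_protein.max()),
    }

summary = summarize_pathways(toy_df)
print(summary)
```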


In [3]:
# Test which pathways are present for the first drug target, Prothrombin, P00734

prothrombin_lowest_pathways = lowest_pathways_df[lowest_pathways_df['UniProtKB_ID'] == 'P00734']
print("Number of lowest prothrombin pathways: " + str(prothrombin_lowest_pathways.shape[0]))

prothrombin_all_pathways = all_pathways_df[all_pathways_df['UniProtKB_ID'] == 'P00734']
print("Number of all prothrombin pathways: " + str(prothrombin_all_pathways.shape[0]))

prothrombin_all_pathways.head()

Number of lowest prothrombin pathways: 14
Number of all prothrombin pathways: 33


,UniProtKB_ID,Reactome_ID,Pathway_Name
396991,P00734,R-HSA-109582,Hemostasis
396992,P00734,R-HSA-140837,Intrinsic Pathway of Fibrin Clot Formation
396993,P00734,R-HSA-140875,Common Pathway of Fibrin Clot Formation
396994,P00734,R-HSA-140877,Formation of Fibrin Clot (Clotting Cascade)
396995,P00734,R-HSA-159740,Gamma-carboxylation of protein precursors
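
The gap between the two counts for prothrombin (14 lowest-level versus 33 all-level) suggests that the all-levels table adds ancestor pathways on top of the lowest-level ones. A rough sketch of how the extra pathways for one protein, and the overlap with a drug-target list, might be computed is given below. The toy frames, the `drug_target_ids` set, and the helper `pathways_for` are all hypothetical; the drug-target list itself is not loaded anywhere in the cells above.

```python
import pandas as pd

# Toy stand-ins for lowest_pathways_df and all_pathways_df (hypothetical rows).
toy_lowest = pd.DataFrame({
    'UniProtKB_ID': ['P_TOY1', 'P_TOY2'],
    'Reactome_ID': ['R-TOY-3', 'R-TOY-4'],
})
toy_all = pd.DataFrame({
    'UniProtKB_ID': ['P_TOY1', 'P_TOY1', 'P_TOY1', 'P_TOY2'],
    'Reactome_ID': ['R-TOY-1', 'R-TOY-2', 'R-TOY-3', 'R-TOY-4'],
})

# Hypothetical drug-target UniProt IDs; the real list would come from a separate source.
drug_target_ids = {'P_TOY1', 'P_TOY9'}

def pathways_for(df, uniprot_id):
    """Return the set of Reactome IDs annotated to one UniProt ID."""
    return set(df.loc[df['UniProtKB_ID'] == uniprot_id, 'Reactome_ID'])

# Pathways that appear only in the all-levels table for this protein
higher_level_only = pathways_for(toy_all, 'P_TOY1') - pathways_for(toy_lowest, 'P_TOY1')

# Drug targets with at least one pathway annotation, and those with none
covered_targets = drug_target_ids & set(toy_all['UniProtKB_ID'])
missing_targets = drug_target_ids - covered_targets

print(sorted(higher_level_only), sorted(covered_targets), sorted(missing_targets))
```

Using sets here keeps the comparison independent of row order and of the original DataFrame index, which is retained from the unfiltered file.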
