# Calculating Pathway Score from RPPA Data 
For this script you will need two txt files: 

1) RPPA data table that has been log normalized (columns consist of samples, index consists of protein). 

2) Pathway predictor table (columns should consist of pathway, predictor, weight, and count) 

This script will calculate the z-score of each protein and then use the pathway predictor table to determine the pathway scores for each provided sample. 
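As a rough illustration of the calculation described above, the sketch below scores two pathways from a tiny table of values that are assumed to be standardized already. The protein names, weights, and counts are hypothetical; the formula (weighted sum divided by the summed Count column) mirrors the scoring cell at the end of this notebook.

```python
import pandas as pd

# Hypothetical standardized RPPA values: rows are samples, columns are proteins.
z = pd.DataFrame(
    {"AKT": [1.0, -1.0], "MTOR": [0.5, 0.5], "P21": [2.0, 0.0]},
    index=["sample_1", "sample_2"],
)

# Hypothetical predictor table with the four expected columns.
predictors = pd.DataFrame(
    {
        "Pathway": ["PI3K_Akt", "PI3K_Akt", "G0_G1"],
        "Predictor": ["AKT", "MTOR", "P21"],
        "Weight": [1.0, -1.0, 1.0],
        "Count": [1, 1, 1],
    }
)

scores = {}
for pathway, block in predictors.groupby("Pathway"):
    weights = block.set_index("Predictor")["Weight"]
    # Weighted sum over the pathway's proteins, divided by the summed Count column.
    weighted = z[weights.index].mul(weights, axis=1).sum(axis=1)
    scores[f"{pathway}_Pathway_Score"] = weighted / block["Count"].sum()

scores = pd.DataFrame(scores)
print(scores)
```

Each column of `scores` holds one pathway score per sample, which is the same layout as the file exported in the last cell.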


In [1]:
#import libraries  

import os
import os.path
import sys
import pandas as pd
import numpy as np
import scipy.stats 


## Set Your Directory and Confirm File Naming Convention Matches the Script
Run the notebook from the directory that contains the Imputs and Outputs folders; the code reads from and writes to these folders using relative paths.
Files should be named RPPA_Log2_{SAMPLE NAME OF YOUR CHOICE}.txt and Pathway_Scores_{SAMPLE NAME OF YOUR CHOICE}.txt, using the same sample name in both.

It is important that each experiment has its own pathway score file, so that changes you make to fit one analysis do not affect the analysis of other experiments.
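Before loading anything, it can help to confirm that both files exist under the expected names. The sketch below is one way to do that; `input_paths` is a hypothetical helper, not part of the notebook, and `"Int"` is just an example sample name.

```python
from pathlib import Path


def input_paths(sample, base="Imputs"):
    """Return the two input paths expected for one sample (hypothetical helper)."""
    base = Path(base)
    return (
        base / f"RPPA_Log2_{sample}.txt",
        base / f"Pathway_Scores_{sample}.txt",
    )


rppa_path, pathway_path = input_paths("Int")
for path in (rppa_path, pathway_path):
    # Report whether each expected file is present in the working directory.
    status = "found" if path.exists() else "MISSING"
    print(f"{path}: {status}")
```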

In [6]:
#set a working directory
cwd = os.getcwd()


#define working sample within directory
sample_imput= input("Please Enter the sample you wish to work on that you have placed in the Imputs directory")
ls_sample = [sample_imput]

Please Enter the sample you wish to work on that you have placed in the Imputs directoryInt


## Import RPPA Data Frame 
Make sure that sample names are the column headings and that proteins are along the index. We will remove spaces, hyphens, and underscores from all protein names and convert them to uppercase. The dataframe will also be transposed so that proteins are the column names.
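The name clean-up can be tried on a few made-up protein labels before running it on real data. The raw names below are hypothetical; the pattern is built the same way as in the import cell, and `regex=True` is passed explicitly so that `|` is treated as alternation.

```python
import pandas as pd

# Hypothetical raw protein labels with mixed separators and case.
raw_names = pd.Index(["14-3-3_beta", "4E-BP1_pS65", "Cyclin B1"])

# Match a space, a hyphen, or an underscore.
pattern = "|".join([" ", "-", "_"])

clean = raw_names.str.replace(pattern, "", regex=True).str.upper()
print(list(clean))
```

Applying the same clean-up to both input tables is what allows their protein names to be matched later.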

CSV files that can be opened in Excel will give you counts of missing data for each protein and counts of missing proteins for each sample.

Whenever you load a new file, be sure that the delimiter matches the code. For example, in this case the txt file is delimited by whitespace, so we pass the argument delim_whitespace = True (newer pandas releases deprecate this argument in favor of sep = r'\s+'). If the file were separated by tabs, we would delete that argument and replace it with sep = '\t'. If the file is comma delimited, there is no need to specify either of these arguments.
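The three delimiter cases can be compared on small in-memory strings. This is only a sketch with made-up values; `io.StringIO` stands in for a file on disk, and `sep=r"\s+"` is used for the whitespace case.

```python
import io

import pandas as pd

# The same hypothetical table written with three different delimiters.
whitespace_text = "Protein S1 S2\nAKT 0.1 0.2\nMTOR 0.3 0.4\n"
tab_text = "Protein\tS1\tS2\nAKT\t0.1\t0.2\nMTOR\t0.3\t0.4\n"
comma_text = "Protein,S1,S2\nAKT,0.1,0.2\nMTOR,0.3,0.4\n"

df_ws = pd.read_csv(io.StringIO(whitespace_text), sep=r"\s+", index_col=0)
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t", index_col=0)
df_csv = pd.read_csv(io.StringIO(comma_text), index_col=0)  # comma is the default

# All three reads should produce the same dataframe.
print(df_ws.equals(df_tab) and df_tab.equals(df_csv))
```

If the wrong delimiter is used, pandas usually does not raise an error; it tends to return a single wide column instead, so checking `df.shape` after loading is a cheap safeguard.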

In [7]:
# Import Data frame with log 2 normalized sample data 
for sample in ls_sample:
    df = pd.read_csv(
        f'Imputs/RPPA_Log2_{sample}.txt',
        delim_whitespace=True,
        index_col = 0
        
    )

#remove spaces, hyphens, and underscores (regex=True so that '|' is treated as alternation)
pattern = '|'.join([' ', '-', '_'])
df.index = df.index.str.replace(pattern, '', regex=True)

#Change protein names to uppercase
df.index = df.index.str.upper()





In [8]:
for sample in ls_sample:
    df2 = pd.read_csv(
        f'Imputs/Pathway_Scores_{sample}.txt',
        sep = '\t' #if separated by white space use 'delim_whitespace = True' 
    
    )
#Remove inconsistent characters (spaces, hyphens, underscores) from the predictor names
df2['Predictor'] = df2['Predictor'].str.replace(pattern, '', regex=True)

#Change predictor names to uppercase
df2['Predictor'] = df2['Predictor'].str.upper()

#change index to predictors 
df2.set_index('Predictor', inplace = True)

In [9]:
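#merge the RPPA values with the pathway table on the shared 'Predictor' key
df5 = pd.merge(df, df2, on='Predictor')
#df5.reset_index(drop=True, inplace=True)
#keep one row per protein, then drop the annotation columns so only sample values remain
df5 = df5.drop_duplicates(subset=['Protein'], keep='last')
df5 = df5.drop(['Pathway','Weight','Count','Protein'], axis=1)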

In [10]:
#for col in df5.columns: 
    #print(col) 

In [11]:
#transpose to get proteins in columns 
df1 = df5.T
cols = df1.columns.tolist()

#calculate missing values for proteins and samples export to view in excel 
df_samples = df1.isna().sum(axis = 1)
df_protein = df1.isna().sum(axis = 0)
df_samples.to_csv(f'Outputs/{sample}_Missing_Sample_Totals.csv', header = False)
df_protein.to_csv(f'Outputs/{sample}_Missing_Protein_Totals.csv', header = False)
df1.to_csv(f'Outputs/{sample}_score_pass.csv')

The score pass file is dealt with in the R Markdown document Score_Calculator, which needs to be run at this point. The cells below then read Outputs/Score_pass_back.csv. We will also investigate moving this step into Python so that the R document eventually becomes unnecessary.


## Deleting Rows or Columns with NaN entries
CHOOSE ONLY ONE OF THE FOLLOWING, per the needs of your experiment.

Choose option 1 if you want to delete samples that have missing protein data.

Choose option 2 if you want to delete proteins that have missing values.

This step can be omitted if you do not have any NaN entries.

TODO: add another logic gate that asks whether you want to remove one, both, or neither. Patient dropping should be done earlier; here we want to drop at the pathway level, because if a protein is missing in a sample we cannot calculate that pathway score. In that case we need to drop either the protein or the sample, depending on the needs of the experiment.
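The difference between the two options can be seen on a small made-up table. The sketch assumes samples are rows and proteins are columns; the names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical table: rows are samples, columns are proteins, one value missing.
data = pd.DataFrame(
    {"AKT": [0.1, 0.2, 0.3], "MTOR": [0.4, np.nan, 0.6], "P21": [0.7, 0.8, 0.9]},
    index=["s1", "s2", "s3"],
)

# Option 1: drop every sample (row) that has a missing value.
option_1 = data.dropna(axis=0, how="any")

# Option 2: drop every protein (column) that has a missing value.
option_2 = data.dropna(axis=1, how="any")

print(option_1.shape, option_2.shape)
```

Option 1 keeps all proteins but loses sample `s2`; option 2 keeps all samples but loses protein `MTOR`. Which loss is acceptable depends on the experiment.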

In [141]:
 df1 = pd.read_csv("Outputs/Score_pass_back.csv")

In [142]:
df1

Unnamed: 0,Predictor,001P_1,001P_2,002_1,002P_1,003P_1,003P_2,003P_3,005_1,007_1,...,016_1,016_3,017_1,017_2,019_1,019_2,020_1,023_1,023_2,Pathway
0,1433BETA,-0.015185,0.236246,0.105427,0.000452,-0.168257,0.025241,0.541124,-0.172773,-0.031560,...,0.287232,0.210850,0.239674,0.272326,0.091779,-0.026102,-0.086248,-0.028022,-0.115955,Cell_cycle_progression
1,1433BETA,-0.015185,0.236246,0.105427,0.000452,-0.168257,0.025241,0.541124,-0.172773,-0.031560,...,0.287232,0.210850,0.239674,0.272326,0.091779,-0.026102,-0.086248,-0.028022,-0.115955,G0_G1
2,4EBP1PS65,0.167398,-0.101905,-0.285147,-0.731056,-0.522729,-0.271763,-0.170601,-0.262975,-0.151738,...,0.081102,-0.188676,-0.547832,-0.223147,-0.564906,-0.341603,-0.168607,0.101472,0.207719,TSC_mTOR
3,53BP1,0.169309,-0.013961,-0.356345,-0.917065,-0.436691,-0.874149,-0.872000,-0.243080,0.034701,...,0.183934,-0.868921,-0.460443,-0.479515,-0.227696,-0.062476,0.454319,0.236081,0.253004,G1_S
4,53BP1,0.169309,-0.013961,-0.356345,-0.917065,-0.436691,-0.874149,-0.872000,-0.243080,0.034701,...,0.183934,-0.868921,-0.460443,-0.479515,-0.227696,-0.062476,0.454319,0.236081,0.253004,G0_G1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,WEE1PS642,0.129356,0.046076,0.053714,-0.156374,-0.104349,0.166102,0.127860,-0.108470,0.399441,...,-0.037086,0.127643,-0.295247,0.218803,0.264253,0.129049,0.067285,-0.030160,0.112956,DNA_Damage_Checkpoint
107,YAP,-0.286912,-0.316883,-0.137482,-0.275967,-0.088819,0.060770,0.236665,-0.323440,-0.441867,...,-0.007401,0.134789,0.267203,-0.087897,-0.210016,-0.080364,-0.529290,-0.211046,-0.501638,Notch
108,YAPPS127,-1.075035,-1.264643,-0.033752,-0.413574,-0.247559,-0.384498,-0.795372,-0.184943,-0.247771,...,0.321630,-1.233817,0.163414,-0.321299,0.157727,0.746198,-0.392708,-0.114359,-0.087817,Notch
109,YB1PS102,0.283650,0.211095,-0.232551,-0.598143,0.196917,0.573705,0.046703,-0.241812,-0.044211,...,-0.244933,0.018478,-0.130282,0.160406,-0.072360,-0.055351,-0.115983,-0.065400,0.000670,RAS_MAPK


In [87]:
#we will store the unique pathway names in an array for looping purposes
groups = df1['Pathway'].unique()
print(groups)

['Cell_cycle_progression' 'G0_G1' 'TSC_mTOR' 'G1_S' 'PI3K_Akt'
 'Hormone_receptor' 'Epigenetic' 'DNA_Damage_Checkpoint'
 'Immune_Checkpoint' 'BH3_Balance' 'Hormone_signaling_Breast'
 'Tumor_Content' 'RAS_MAPK' 'Apoptosis' 'Immune' 'G2_M' 'RTK'
 'Histone_Alteration' 'DNA_Damage' 'Notch']


In [88]:
#Uniqueness check adapted from a Stack Overflow answer by user yatu
def unique_cols(df):
    """Return True if every row (or, for a Series, every value) equals the first one."""
    a = df.to_numpy() # df.values (pandas<0.24)
    return (a[0] == a).all()

In [89]:
#count the missing values for each protein
#TODO: build another path for the TCGA samples
#set_index returns a new dataframe here, so df1 itself is left unchanged
missing_proteins = df1.set_index('Predictor')
missing_proteins = pd.DataFrame(missing_proteins.isnull().sum(axis=1))
#write this to a file 
missing_proteins.to_csv(f'Outputs/{sample}_missing_proteins.csv')

In [138]:
df1 = pd.read_csv("Outputs/Score_pass_back.csv")
df1

Unnamed: 0,Predictor,001P_1,001P_2,002_1,002P_1,003P_1,003P_2,003P_3,005_1,007_1,...,016_1,016_3,017_1,017_2,019_1,019_2,020_1,023_1,023_2,Pathway
0,1433BETA,-0.015185,0.236246,0.105427,0.000452,-0.168257,0.025241,0.541124,-0.172773,-0.031560,...,0.287232,0.210850,0.239674,0.272326,0.091779,-0.026102,-0.086248,-0.028022,-0.115955,Cell_cycle_progression
1,1433BETA,-0.015185,0.236246,0.105427,0.000452,-0.168257,0.025241,0.541124,-0.172773,-0.031560,...,0.287232,0.210850,0.239674,0.272326,0.091779,-0.026102,-0.086248,-0.028022,-0.115955,G0_G1
2,4EBP1PS65,0.167398,-0.101905,-0.285147,-0.731056,-0.522729,-0.271763,-0.170601,-0.262975,-0.151738,...,0.081102,-0.188676,-0.547832,-0.223147,-0.564906,-0.341603,-0.168607,0.101472,0.207719,TSC_mTOR
3,53BP1,0.169309,-0.013961,-0.356345,-0.917065,-0.436691,-0.874149,-0.872000,-0.243080,0.034701,...,0.183934,-0.868921,-0.460443,-0.479515,-0.227696,-0.062476,0.454319,0.236081,0.253004,G1_S
4,53BP1,0.169309,-0.013961,-0.356345,-0.917065,-0.436691,-0.874149,-0.872000,-0.243080,0.034701,...,0.183934,-0.868921,-0.460443,-0.479515,-0.227696,-0.062476,0.454319,0.236081,0.253004,G0_G1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,WEE1PS642,0.129356,0.046076,0.053714,-0.156374,-0.104349,0.166102,0.127860,-0.108470,0.399441,...,-0.037086,0.127643,-0.295247,0.218803,0.264253,0.129049,0.067285,-0.030160,0.112956,DNA_Damage_Checkpoint
107,YAP,-0.286912,-0.316883,-0.137482,-0.275967,-0.088819,0.060770,0.236665,-0.323440,-0.441867,...,-0.007401,0.134789,0.267203,-0.087897,-0.210016,-0.080364,-0.529290,-0.211046,-0.501638,Notch
108,YAPPS127,-1.075035,-1.264643,-0.033752,-0.413574,-0.247559,-0.384498,-0.795372,-0.184943,-0.247771,...,0.321630,-1.233817,0.163414,-0.321299,0.157727,0.746198,-0.392708,-0.114359,-0.087817,Notch
109,YB1PS102,0.283650,0.211095,-0.232551,-0.598143,0.196917,0.573705,0.046703,-0.241812,-0.044211,...,-0.244933,0.018478,-0.130282,0.160406,-0.072360,-0.055351,-0.115983,-0.065400,0.000670,RAS_MAPK


In [131]:
#This loop removes, pathway by pathway, the columns that contain missing data.
#It could be modified to remove them from the entire dataset instead.
#The cleaned tables are saved as a dictionary of dataframes.
dropped_cols=[]
dataframe_collection = {}

for i in groups:
   
    df_hold=df1.loc[df1.Pathway==i]
    #print(df_hold)
    #first we check whether any columns contain missing values within this
    #pathway, and if we find any we get rid of them 
    ind=df_hold.columns[df_hold.isnull().values.any(axis=0)].tolist()
    #print(ind)
    #print(len(df_hold.columns))
    if len(ind)>0:
        print("Entered dropper")
        df_hold=df_hold.drop(ind,axis="columns")
    #now we want to save these new datasets
    #we will do that by creating a dictionary of datasets
    dataframe_collection[i] = df_hold
    
    #df_droped_proteins=pd.DataFrame({"Idx":[1,i],"dfs":[df_droped_proteins,df_hold]})
    #print("count after drop")
    #print(len(df_hold.columns))
    #record the pathway name followed by the columns dropped from it
    ind.insert(0,i)
    dropped_cols.append(ind)
    
droped_data=pd.DataFrame(dropped_cols)
droped_data
#we can look at the dictionary of datasets in the next cell


Entered dropper
Entered dropper
Entered dropper


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,Cell_cycle_progression,,,,,,,,,,,,,,,,
1,G0_G1,,,,,,,,,,,,,,,,
2,TSC_mTOR,,,,,,,,,,,,,,,,
3,G1_S,,,,,,,,,,,,,,,,
4,PI3K_Akt,,,,,,,,,,,,,,,,
5,Hormone_receptor,,,,,,,,,,,,,,,,
6,Epigenetic,,,,,,,,,,,,,,,,
7,DNA_Damage_Checkpoint,023_1,023_2,,,,,,,,,,,,,,
8,Immune_Checkpoint,,,,,,,,,,,,,,,,
9,BH3_Balance,,,,,,,,,,,,,,,,


In [92]:
for key in dataframe_collection.keys():
    print("\n" +"="*40)
    print(key)
    print("-"*40)
    print(dataframe_collection[key])
#print(dataframe_collection["Tumor_Content"])


Cell_cycle_progression
----------------------------------------
      Predictor    001P_1    001P_2     002_1    002P_1    003P_1    003P_2  \
0      1433BETA -0.015185  0.236246  0.105427  0.000452 -0.168257  0.025241   
32       CDC25C  0.373189  0.017453 -0.029126  0.333833  0.305895  0.152904   
33     CDK1PT14  0.453505  0.381661 -0.214161 -0.409900  0.749901 -0.351661   
36         CHK1  0.321988  0.210473  0.185539  0.214572  0.288979  0.188204   
45     CYCLINB1  0.444139  0.282333 -0.833408 -0.247227  0.503335 -0.774116   
49     CYCLIND1  0.237312  0.193498 -0.021292  0.365647  0.182389  0.389801   
79          P21 -0.153962 -0.139940 -0.056417  0.214950 -0.010635  0.025946   
81     P27PT198  0.133527  0.060793 -0.092377 -0.055818  0.032362  0.221887   
88         PLK1  0.488823  0.725593 -0.107016  0.319637  0.812924 -0.113435   
96  RBPS807S811  0.279722 -0.007609  0.076061  0.760967  0.374305 -0.373781   

      003P_3     005_1     007_1  ...     016_1     016_3     017

In [93]:
#this approaches it from the other direction: dropping rows instead of columns
drop_grp=[]
for i in groups:
    df_hold=df1.loc[df1.Pathway==i]
    missing=pd.DataFrame(df_hold.isnull().sum(axis=1))
    #if the missing-value counts are not the same across the rows of this pathway,
    #drop the row with the most missing values and check again
    #print(missing)

    while not unique_cols(missing[0]):
        column = missing[0]
        max_value = column.idxmax()
        missing=missing.drop(labels=max_value,axis=0)
        drop_grp.append(max_value)
        
df_droped_samples = df1.loc[drop_grp]
df1=df1.drop(drop_grp)
#TODO: this needs to be put back through the other loop to remove the remaining missing values
print("the following samples were dropped")   
df_droped_samples

the following samples were dropped


Unnamed: 0,Predictor,001P_1,001P_2,002_1,002P_1,003P_1,003P_2,003P_3,005_1,007_1,...,016_1,016_3,017_1,017_2,019_1,019_2,020_1,023_1,023_2,Pathway
9,ATMPS1981,0.386665,-0.197895,-0.14493,0.158946,0.647,0.094359,0.146556,-0.182066,-0.044635,...,0.36417,0.949531,0.238711,-0.002832,-0.078209,-0.188853,-0.219293,,,DNA_Damage_Checkpoint
55,ECADHERIN,0.921717,-0.381172,1.543803,-1.320297,-0.443245,1.415854,2.762187,2.554583,1.306047,...,,,,,,,,,,Tumor_Content
54,ECADHERIN,-0.146206,-0.171157,-0.086268,-0.57259,-1.703004,0.355126,-0.413921,-0.295845,-0.636646,...,,,,,,,,,,Tumor_Content
62,H2AXPS139,-0.098278,-0.053944,0.49167,0.738345,0.17153,-0.070629,-0.085945,-0.151164,-0.106692,...,-0.245685,0.20022,0.404753,-0.206425,-0.217722,-0.219975,0.063358,,,DNA_Damage


In [137]:
df1


Unnamed: 0,Predictor,001P_1,001P_2,002_1,002P_1,003P_1,003P_2,003P_3,005_1,007_1,...,016_1,016_3,017_1,017_2,019_1,019_2,020_1,023_1,023_2,Pathway
0,1433BETA,-0.015185,0.236246,0.105427,0.000452,-0.168257,0.025241,0.541124,-0.172773,-0.031560,...,0.287232,0.210850,0.239674,0.272326,0.091779,-0.026102,-0.086248,-0.028022,-0.115955,Cell_cycle_progression
1,1433BETA,-0.015185,0.236246,0.105427,0.000452,-0.168257,0.025241,0.541124,-0.172773,-0.031560,...,0.287232,0.210850,0.239674,0.272326,0.091779,-0.026102,-0.086248,-0.028022,-0.115955,G0_G1
2,4EBP1PS65,0.167398,-0.101905,-0.285147,-0.731056,-0.522729,-0.271763,-0.170601,-0.262975,-0.151738,...,0.081102,-0.188676,-0.547832,-0.223147,-0.564906,-0.341603,-0.168607,0.101472,0.207719,TSC_mTOR
3,53BP1,0.169309,-0.013961,-0.356345,-0.917065,-0.436691,-0.874149,-0.872000,-0.243080,0.034701,...,0.183934,-0.868921,-0.460443,-0.479515,-0.227696,-0.062476,0.454319,0.236081,0.253004,G1_S
4,53BP1,0.169309,-0.013961,-0.356345,-0.917065,-0.436691,-0.874149,-0.872000,-0.243080,0.034701,...,0.183934,-0.868921,-0.460443,-0.479515,-0.227696,-0.062476,0.454319,0.236081,0.253004,G0_G1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
106,WEE1PS642,0.129356,0.046076,0.053714,-0.156374,-0.104349,0.166102,0.127860,-0.108470,0.399441,...,-0.037086,0.127643,-0.295247,0.218803,0.264253,0.129049,0.067285,-0.030160,0.112956,DNA_Damage_Checkpoint
107,YAP,-0.286912,-0.316883,-0.137482,-0.275967,-0.088819,0.060770,0.236665,-0.323440,-0.441867,...,-0.007401,0.134789,0.267203,-0.087897,-0.210016,-0.080364,-0.529290,-0.211046,-0.501638,Notch
108,YAPPS127,-1.075035,-1.264643,-0.033752,-0.413574,-0.247559,-0.384498,-0.795372,-0.184943,-0.247771,...,0.321630,-1.233817,0.163414,-0.321299,0.157727,0.746198,-0.392708,-0.114359,-0.087817,Notch
109,YB1PS102,0.283650,0.211095,-0.232551,-0.598143,0.196917,0.573705,0.046703,-0.241812,-0.044211,...,-0.244933,0.018478,-0.130282,0.160406,-0.072360,-0.055351,-0.115983,-0.065400,0.000670,RAS_MAPK


In [145]:
#This is the goal loop: ask which cleanup to apply, then run it
while True:
    query = input("Would you Like to Delete samples that have Missing Proteins data or would you like to delete samples that have Missing Values?")
    if query not in ['Missing Proteins', "Missing Values"]:
        print('Please answer with Missing Proteins or Missing Values')
    #TODO: this branch removes the columns, but the cleaned datasets are not saved out yet
    elif query == 'Missing Proteins':
        #This loop removes, pathway by pathway, the columns that contain missing data.
        #It could be modified to remove them from the entire dataset instead.
        #The cleaned tables are stored as a dictionary of dataframes.
        dropped_cols = []
        dataframe_collection = {}

        for i in groups:
            df_hold = df1.loc[df1.Pathway == i]
            #check whether any columns contain missing values within this
            #pathway, and if we find any we get rid of them
            ind = df_hold.columns[df_hold.isnull().values.any(axis=0)].tolist()
            if len(ind) > 0:
                print("Entered dropper")
                df_hold = df_hold.drop(ind, axis="columns")
            #save the cleaned table in the dictionary of dataframes
            dataframe_collection[i] = df_hold

            #record the pathway name followed by the columns dropped from it
            ind.insert(0, i)
            dropped_cols.append(ind)

        droped_data = pd.DataFrame(dropped_cols)
        break
    #Now we need the one that removes rows instead
    elif query == "Missing Values":
        #this approaches it from the row-dropping direction
        drop_grp = []
        for i in groups:
            df_hold = df1.loc[df1.Pathway == i]
            missing = pd.DataFrame(df_hold.isnull().sum(axis=1))
            #if the missing-value counts differ across the rows of this pathway,
            #drop the row with the most missing values and check again
            while not unique_cols(missing[0]):
                column = missing[0]
                max_value = column.idxmax()
                missing = missing.drop(labels=max_value, axis=0)
                drop_grp.append(max_value)

        droped_data = df1.loc[drop_grp]
        d_clean = df1.drop(drop_grp)
        #TODO: this needs to be put back through the other branch to remove the remaining missing values
        break

print("the following samples were dropped")
#droped_grp
print(droped_data)


Would you Like to Delete samples that have Missing Proteins data or would you like to delete samples that have Missing Values?Missing Proteins
Entered dropper
Entered dropper
Entered dropper
the following samples were dropped
                          0      1      2      3      4      5      6      7   \
0     Cell_cycle_progression   None   None   None   None   None   None   None   
1                      G0_G1   None   None   None   None   None   None   None   
2                   TSC_mTOR   None   None   None   None   None   None   None   
3                       G1_S   None   None   None   None   None   None   None   
4                   PI3K_Akt   None   None   None   None   None   None   None   
5           Hormone_receptor   None   None   None   None   None   None   None   
6                 Epigenetic   None   None   None   None   None   None   None   
7      DNA_Damage_Checkpoint  023_1  023_2   None   None   None   None   None   
8          Immune_Checkpoint   None   None   

In [None]:
 #df1.dropna(axis = 0, how = 'any', inplace = True) #drop rows/samples with NaN 

In [None]:
 df1.dropna(axis = 1, how = 'any', inplace = True) #drop columns/proteins with NaN 

## Calculate the z-score for each Protein (column) in the RPPA data 
Each value is centered and then divided by the standard deviation of its column. You can either mean or median center; edit the line inside the loop to change the method, e.g. df1[col] - df1[col].mean() vs df1[col] - df1[col].median().
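The two centering choices give different results when a column contains an outlier. The sketch below uses one hypothetical protein column and the population standard deviation (`ddof=0`), matching the setting used in the notebook's loop.

```python
import pandas as pd

# Hypothetical protein column with one outlying sample.
data = pd.DataFrame({"AKT": [1.0, 2.0, 3.0, 10.0]}, index=["s1", "s2", "s3", "s4"])

# Mean centering: the classic z-score, with mean 0 and standard deviation 1.
mean_centered = (data - data.mean()) / data.std(ddof=0)

# Median centering: the same scale, but the median becomes 0 instead of the mean.
median_centered = (data - data.median()) / data.std(ddof=0)

print(mean_centered["AKT"].round(3).tolist())
print(median_centered["AKT"].round(3).tolist())
```

Median centering is less affected by the outlier when choosing the center, although the standard deviation used for scaling is still sensitive to it.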

In [None]:
#standardize each column: center on the median, then divide by the population standard deviation (ddof=0)

cols = df1.columns.tolist()
for col in cols:
    df1[col] = (df1[col] - df1[col].median())/df1[col].std(ddof=0)

## Import data frame with pathway score information and edit to match RPPA dataframe columns
Predictor names will need to be trimmed so that spaces, dashes, and underscores are removed, and converted to uppercase. In these steps we will also set the column of predictors as the data frame index. As mentioned above, be sure the sep argument passed to pd.read_csv matches how your txt file is delimited.

In [None]:
for sample in ls_sample:
    print(f'Pathway_Scores_{sample}.txt')
    df2 = pd.read_csv(
        f'Imputs/Pathway_Scores_{sample}.txt',
        sep = '\t' #if separated by white space use 'delim_whitespace = True' 
    
    )
#Remove inconsistent characters (spaces, hyphens, underscores) from the predictor names
df2['Predictor'] = df2['Predictor'].str.replace(pattern, '', regex=True)

#Change predictor names to uppercase
df2['Predictor'] = df2['Predictor'].str.upper()


#change index to predictors 
df2.set_index('Predictor', inplace = True)

## Create new pathway predictor dataframe based on dropped RPPA data
In the step where missing data was removed by column or row, we have potentially removed proteins. These proteins also need to be removed from the predictor dataframe. 

This step creates that filtered dataframe and also writes a CSV file (viewable in Excel) of the pathway predictors that were not used. The purpose of this file is to identify inconsistencies in the protein names between the two documents. 
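The filtering idea can be sketched on a small predictor table. The protein names and the list of retained proteins below are hypothetical; `Index.isin` is used because the predictor index can contain the same protein more than once (one row per pathway).

```python
import pandas as pd

# Hypothetical list of proteins that survived the missing-data step.
retained_proteins = ["AKT", "MTOR"]

# Hypothetical predictor table indexed by protein; MTOR belongs to two pathways.
predictors = pd.DataFrame(
    {
        "Pathway": ["PI3K_Akt", "TSC_mTOR", "PI3K_Akt", "G0_G1"],
        "Weight": [1, 1, 1, -1],
        "Count": [1, 1, 1, 1],
    },
    index=pd.Index(["AKT", "MTOR", "MTOR", "P21"], name="Predictor"),
)

keep = predictors.index.isin(retained_proteins)
used = predictors[keep]           # predictors with matching RPPA data
not_included = predictors[~keep]  # predictors to inspect for naming mismatches

print(not_included)
```

A predictor that lands in `not_included` is either genuinely absent from the RPPA data or spelled differently in the two files, which is what the exported file is meant to reveal.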

In [None]:
#keep only the predictors that match the RPPA dataframe columns, creating a new dataframe from these matches
df3 = df2.loc[df2.index.isin(cols)].copy()
df3.dropna(axis = 0, how = 'any', inplace = True)

In [None]:
#identify the proteins that were dropped from the Pathways analysis 
missing = df2[~df2.index.isin(df3.index)]
missing.to_csv(f'Outputs/{sample}_not_included.csv', header = True)

In [None]:
#create a dictionary of unique pathways and their corresponding predictors with weights
paths = df3.Pathway.unique()
d = {path: pd.DataFrame(data=df3.loc[df3['Pathway'] == path]) for path in paths}

In [None]:
#calculate the pathway score for each sample
for path in paths:
    weights = d[path].loc[:,'Weight']
    markers = d[path].index.tolist()
    temp_column = f'{path}_Pathway_Score'
    #weighted sum of the pathway's markers, divided by the sum of the Count column
    df1[temp_column] = ((df1[markers] * weights).sum(axis = 1))/(d[path].loc[:,'Count'].sum())
    
   

In [None]:
#create dataframe with just prediction data for export
df_score = df1.filter(like ='Pathway_Score', axis = 1)

In [None]:
#export
df_score.to_csv(f'Outputs/{sample}_Pathways_Score.csv')