#### Second attempt at the python version of the centralised part of the microarray methylation analysis workflow (Quality control upto normalisation)
Using python as a shell to string together the specialised r functions used in the Exeter workflow

Loading in the required modules/packages

In [1]:
import pandas as pd
import numpy as np
import subprocess
import csv
import glob
import os
import re
import seaborn as sns
from matplotlib import pyplot as plt

# stuff needed for some specific analysis - maybe not needed in this version of the code
#from sklearn.decomposition import PCA 
#from scipy.stats import pearsonr
#from sklearn.cluster import KMeans

In [2]:
working_path = "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Federated_Differential_Methylation_Analysis"
data_path = "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data"
output_path = "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\Practical work\\Federated_Differential_Methylation_Analysis\\Output"


Use subprocess to read the data contained in the idat files into dataframe using the readEPIC function from the wateRmelon package in R

In [29]:
load = subprocess.run(["C:\\Program Files\\R\\R-4.1.2\\bin\\Rscript.exe", '--vanilla', "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Federated_Differential_Methylation_Analysis\\Loading_idats_code_saveOutput_python_shell.R", "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GSE66351", "E:\Msc Systems Biology\MSB5000_Master_Thesis\Practical work\Data\GSE66351_RAW\GSE66351_pheno_info.txt", "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\Practical work\\Federated_Differential_Methylation_Analysis\\Output"], capture_output=True)

In [None]:
print(load.stdout)

### Creating an output file structure and loading in the idat files

The input arguments of this script are: 
1. file_path to the folder containing the .idat files 
2. file_path to the phenotype information sheet (.txt) 
3. the directory where the output should be saved 
4. OPTIONAL the data identifier to be used in the creation of the output folders - this still needs to be fixed

In [11]:
load_with_option = subprocess.run(["C:\\Program Files\\R\\R-4.1.2\\bin\\Rscript.exe", '--vanilla', "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Federated_Differential_Methylation_Analysis\\Loading_idats_code_saveOutput_python_shell_dataID_option.R", "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GSE66351", "E:\Msc Systems Biology\MSB5000_Master_Thesis\Practical work\Data\GSE66351_RAW\GSE66351_pheno_info.txt", "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\Practical work\\Federated_Differential_Methylation_Analysis\\Output", "GSE66351a"], capture_output=True)

In [None]:
print(load_with_option.stderr)

Using subprocess to perform the complete preprocessing workflow upto the normalisation  
This is the whole preprocessing chopped into three functions but run as one .r script - maybe smart to run each function as its own script with own input

In [69]:
complete_preprocessing = subprocess.run(["C:\\Program Files\\R\\R-4.1.2\\bin\\Rscript.exe", '--vanilla', "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Federated_Differential_Methylation_Analysis\\preprocessing_r_code_replication_shell_version_no_norm.r", "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GSE66351", "E:\Msc Systems Biology\MSB5000_Master_Thesis\Practical work\Data\GSE66351_RAW\GSE66351_pheno_info.txt", "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\Practical work\\Federated_Differential_Methylation_Analysis\\Output", "E:\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GSE66351\\GPL13534_HumanMethylation450_15017482_v.1.1.csv"], capture_output = True)

In [None]:
#check what happend in the subprocess
print(complete_preprocessing.stderr)
print(complete_preprocessing.stdout)

In [19]:
preprocessing_normalisation = subprocess.run(["C:\\Program Files\\R\\R-4.1.2\\bin\\Rscript.exe", '--vanilla',"E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Federated_Differential_Methylation_Analysis\\preprocessing_r_code_replication_shell_version.r", "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GSE66351", "E:\Msc Systems Biology\MSB5000_Master_Thesis\Practical work\Data\GSE66351_RAW\GSE66351_pheno_info.txt", output_path], capture_output=True)

In [None]:
print(preprocessing_normalisation.stderr)
print(preprocessing_normalisation.stdout)

Next step is to normalise the data, this step will be offered centrally and distributed/federated to be flexible to the researchers needs  
Below a implementation of the normalisation algorithm behind the dasen function in the wateRmelon package is provided

Dasen normalisation is a form of quantile normalisation that is performed for the two probe types seperately. The normalised data (betas), per probe type, are calculated using the normalised methylated and unmethylated intensities of each probe type.  
    betas (per probe) = quantile normalised methylated intensities / (quantile normalised methylated intensities + quantile normalised unmethylated intensities + 100)  
The first step is to write the quantile normalisation function

In [None]:
# quantile normalisation function
def quantile_normalise(input_data):
    """
    input_data = a dataframe that needs to be quantile normalised
    returns a quantile normalised version of the input_data
    """
    data_sorted = pd.DataFrame(np.sort(input_data.values, axis = 0), index = input_data.index, columns = input_data.columns) #sort the values of each column (sample) and keep the original row 
    # and column names
    data_sorted_means = data_sorted.mean(axis = 1) # calulate the row means of the sorted data -> these means will be used to replace the raw values in the data
    data_sorted_means.index = np.arrange(1, len(data_sorted_means)+1) # this sets the index so it will correspond to the descending ranks that will be assigned to the original 
    # data in the dataframe. This way the row means, which are sorted loweste to highest, can be used to replace the raw data in the correct order
    data_rank = input_data.rank(method = min).stack().astype(int) # get the rank of the values for each sample in the raw dataset in integer format and change the dataframe so that
    # the columns become the rows, with a multi-index indicating probe as the highest level and the samples for that probe as the second level
    QN_data = data_rank.map(data_sorted_means).unstack() # map the row mean values onto the matching ranks obtained from the original dataframe and bring it back to a row = probe
    # and column = sample format
    return (QN_data)

    

Before the dasen function can be coded, first a couple of supporting functions need to be translated from r to python, these have been defined in the wateRmelon package as:  
* dfs2
* dfsfit

In [None]:
def dfs2_python(x, probe_type):
    from sklearn.neighbors import KernelDensity
    KD_one = KernelDensity(kernel = "gaussian", ).fit(x[probe_type == "I"])
    one = KD_one.score_samples(x[probe_type == "I"])
    KD_two = KernelDensity(kernel = "gaussian", ).fit(x[probe_type == "II"])
    two = KD_two.score_samples(x[probe_type == "II"])
    out = one.amax - two.amax #not quite sure if any of this is correct

In [None]:
def dfsfit_python():

In [None]:
# dasen normalisation
def dasen_normalisation(unmethylated, methylated, probe_type, base = 100):
    """
    computes the dasen normalised beta values: quantile normalises the unmethylated and methylated intensities, per probe type,
    and uses these normalised intensities to calculate the beta values

    Input arguments:
    unmethylated = dataframe of unmethylated intensities
    methylated = dataframe of methylated intensities
    probe_type = series indicating the type of each probe (Type I or Type II)

    Returns: a dataframe of normalised beta values
    """
    unmethylated

For now, to move on to writing the EWAS code, I wrote a script around the normalisation with the dasen function and the cell type decomposition in r which will be run as a subprocess. The normalisation will be implemented in python in the final version but the cell type decomposition remains r based because there is limitted need to reimplement that in a federated fashion - THIS IS NOT WORKING, INLCUDED THE NORMALISATION AND CELL TYPE DECOMPOSITION INTO THE R-SCRIPT THAT IS RUN IN THE SUBPROCESS FOR NOW

In [71]:
normalisation = "dasen_normalisation.r"
normalisation_file = os.path.join(working_path, normalisation)
data = os.path.join(output_path, "preprocessed_MethyLumiSet.RData") 
manifest_path = "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GPL13534_HumanMethylation450_15017482_v.1.1.csv"


In [72]:
r_normalisation = subprocess.run(["C:\\Program Files\\R\\R-4.1.2\\bin\\Rscript.exe", '--vanilla', normalisation_file, data, output_dir, manifest_path], capture_output = True)

In [None]:
print(r_normalisation.stderr)
print(r_normalisation.stdout)

R script containing the RefFreeEWAS cell type decomposition which will be run in a subprocess, output saved and added to the phenotype information that will be used in the EWAS furhter down in this file

In [28]:
# specifying the paths that go into the subprocess function
file_path = os.path.join(working_path, "RefFreeEWAS_local.r")
data_path = os.path.join(output_path, "Preprocessed_Normalised_MethyLumiSet.RData")
manifest_path = "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GPL13534_HumanMethylation450_15017482_v.1.1.csv"
pheno_path = os.path.join(output_path, "post_norm_pheno_information.csv")

# RefFreeEWAS subprocess
RefFreeEWAS = subprocess.run(["C:\\Program Files\\R\\R-4.1.2\\bin\\Rscript.exe", '--vanilla', file_path, data_path, pheno_path, manifest_path], capture_output=True)

In [29]:
RefFreeEWAS.stderr



Removing unwanted probes from the dataset

In [None]:
# SNP and overlapping (?) probe removal

EWAS code, using the linear model function from sklearn

In [14]:
#EWAS
#import statsmodels.api as sm # this contains an R-like linear model function that is more intuitive than the sklearn equivalent
#from patsy import dmatrices
import numpy as np
import pandas as pd

pheno = pd.read_csv("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Federated_Differential_Methylation_Analysis\\Output\\QC_GSE66351_PythonShell\\post_norm_pheno_information.csv", index_col= "Sample_ID")
betas = pd.read_csv("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Federated_Differential_Methylation_Analysis\Output\\QC_GSE66351_PythonShell\\Preprocessed_betas.csv", index_col=0)
x = pheno.loc[:,["Sample_diagnosis", "Sample_age", "Sample_sex", "Sample_sentrix_id"]] # design matrix with the dependent/explainatory variables to be included in the model
y = betas.iloc[0:20,:] # keeping it small now to test if everything works the way it should

# The design matrix needs to consist of numeric representations of the covariates to be included in the model, i.e. binary diagnosis, binary sex, dummy sentrix etc.
x["Sample_diagnosis"] = (x["Sample_diagnosis"] == "diagnosis: AD").astype(int) #create binary diagnosis with 1 = AD and 0 = CTR
x["Sample_sex"] = (x["Sample_sex"] == "Sex: F").astype(int) #create binary sex with 1 = F and 0 = M
# create dummy variables for the unique sentrix_ids present in the dataset - this code can be reused to create center number dummies in the federated version of the code
unique_ids = x["Sample_sentrix_id"].unique()
for id in unique_ids:
    x[id] = (x["Sample_sentrix_id"] == id).astype(int)
x.drop(columns="Sample_sentrix_id", inplace = True)
# turn the age variable into a continuous numerical variable without any leftover text
x["Sample_age"].replace("^[^:]*:", "", regex=True, inplace=True)
x["Sample_age"] = pd.to_numeric(x["Sample_age"])

x_matrix = x.values
y_matrix = y.values




  


' coefficient = []\nstandard_error = []\n#t_stat = []\n#p_value = []\nfor i in range(0, n):\n   y_m = y_matrix[i, :]\n   x_t = x_matrix.T @ x_matrix\n   x_t_y = x_matrix.T @ y\n   x_t_inv = np.linalg.inv(x_t)\n   coef = x_t_inv @ x_t_y\n   coefficient.append(coef)\n   stan_er = np.diag(x_t_inv)\n   standard_error.append(stan_er) '

In [17]:
n = y_matrix.shape[0] # select the number of rows of the beta matrix - #genes that the linear model will be calculated for
m = x.shape[1] #select the number of columns from the design matrix

coefficient = []
standard_error = []
#t_stat = []
#p_value = []
for i in range(0, n):
   y_m = y_matrix[i, :]
   x_t = x_matrix.T @ x_matrix
   x_t_y = x_matrix.T @ y
   x_t_inv = np.linalg.inv(x_t)
   coef = x_t_inv @ x_t_y
   coefficient.append(coef)
   stan_er = np.diag(x_t_inv)
   standard_error.append(stan_er)

In [19]:
coefficient[1]

Unnamed: 0,GSM2808875_8918692108_R01C02,GSM2808876_8918692108_R01C01,GSM2808877_8918692108_R02C02,GSM2808878_8918692108_R02C01,GSM2808879_8918692108_R03C02,GSM2808880_8918692108_R03C01,GSM2808881_8918692108_R04C02,GSM2808882_8918692108_R04C01,GSM2808883_8918692108_R05C02,GSM2808884_8918692108_R05C01,GSM2808885_8918692108_R06C02,GSM2808886_8918692108_R06C01,GSM2808887_8918692120_R04C02,GSM2808888_8918692120_R04C01,GSM2808889_8918692120_R05C02,GSM2808890_8918692120_R05C01,GSM2808891_8918692120_R06C02,GSM2808892_8918692120_R06C01,GSM2808893_8221932039_R04C01,GSM2808894_8221932039_R03C01
0,0.847292,0.785509,0.749685,0.703853,0.816978,0.77548,0.711682,0.662213,0.835412,0.659004,0.611971,0.702149,0.715169,0.707274,0.627512,0.677326,0.752324,0.616838,0.776581,0.697137
1,-0.014892,-0.013386,-0.013578,-0.01243,-0.014507,-0.013339,-0.013274,-0.01187,-0.014387,-0.011642,-0.011648,-0.012404,-0.013492,-0.012426,-0.011994,-0.012135,-0.01376,-0.011126,-0.013959,-0.012095
2,0.272046,0.295901,0.290442,0.300071,0.280151,0.278271,0.289889,0.318216,0.261543,0.294991,0.332527,0.299931,0.321115,0.31205,0.305292,0.295426,0.270812,0.279921,0.282431,0.270201
3,0.759083,0.686398,0.740552,0.682565,0.748327,0.691155,0.752335,0.678996,0.738872,0.66719,0.704505,0.685901,0.748042,0.681661,0.712129,0.692472,0.75454,0.672221,0.7531,0.677003
4,0.615132,0.525298,0.590424,0.525253,0.608239,0.549017,0.604088,0.499893,0.59056,0.511442,0.526275,0.512634,0.590768,0.507487,0.561179,0.538825,0.614685,0.527282,0.600166,0.52464
5,1.589652,1.438197,1.469968,1.375164,1.572272,1.482745,1.408578,1.318274,1.563075,1.315851,1.285725,1.365671,1.437791,1.355071,1.334247,1.356323,1.4867,1.277852,1.488898,1.381213
