DNA methylation data preprocessing and analysis workflow, based on the pipeline provided by Ehsan and colleagues from Exeter University. 
This version mixes Python-native functions, where they are available, with R integrations for functions that do not yet have a
Python-native version. A selection of these R-integrated functions will be translated into Python-native functions if this is 
necessary for the continuation of the project.

Install the PyMethylProcess package from GitHub - https://github.com/Christensen-Lab-Dartmouth/PyMethylProcess/blob/master/pymethylprocess/MethylationDataTypes.py - because it offers similar functionality, based on the same functions, as the Exeter pipeline. Note that the command below installs scs-python and hypopt at pinned commits; it does not install PyMethylProcess itself.

In [1]:
pip install git+https://github.com/bodono/scs-python.git@bb45c69ce57b1fbb5ab23e02b30549a7e0b801e3 git+https://github.com/jlevy44/hypopt.git@af59fbed732f5377cda73fdf42f3d4981c2be3ce

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/bodono/scs-python.git@bb45c69ce57b1fbb5ab23e02b30549a7e0b801e3
  Cloning https://github.com/bodono/scs-python.git (to revision bb45c69ce57b1fbb5ab23e02b30549a7e0b801e3) to c:\users\silke\appdata\local\temp\pip-req-build-x13x6rw9
Note: you may need to restart the kernel to use updated packages.


  Running command git clone -q https://github.com/bodono/scs-python.git 'C:\Users\Silke\AppData\Local\Temp\pip-req-build-x13x6rw9'
  Running command git checkout -q bb45c69ce57b1fbb5ab23e02b30549a7e0b801e3
  Running command git submodule update --init --recursive -q


Collecting git+https://github.com/jlevy44/hypopt.git@af59fbed732f5377cda73fdf42f3d4981c2be3ce
  Cloning https://github.com/jlevy44/hypopt.git (to revision af59fbed732f5377cda73fdf42f3d4981c2be3ce) to c:\users\silke\appdata\local\temp\pip-req-build-rnn44w0l
Using legacy setup.py install for scs, since package 'wheel' is not installed.
Using legacy setup.py install for hypopt, since package 'wheel' is not installed.


  Running command git clone -q https://github.com/jlevy44/hypopt.git 'C:\Users\Silke\AppData\Local\Temp\pip-req-build-rnn44w0l'
You should consider upgrading via the 'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\python.exe -m pip install --upgrade pip' command.


Install the methylsuite package - https://pypi.org/project/methylsuite/ - because it contains functions that can read in raw data files (.idat) 
rather than beta matrices

In [2]:
pip install methylsuite

Defaulting to user installation because normal site-packages is not writeable
Collecting methylsuite
  Downloading methylsuite-1.3.0-py3-none-any.whl (2.7 kB)
Collecting methylprep
  Downloading methylprep-1.6.1-py3-none-any.whl (1.3 MB)
Collecting lxml
  Downloading lxml-4.8.0-cp37-cp37m-win_amd64.whl (3.6 MB)
Collecting tqdm
  Downloading tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Collecting bs4
  Downloading bs4-0.0.1.tar.gz (1.1 kB)
Collecting methylize
  Downloading methylize-1.0.1-py3-none-any.whl (90 kB)
Collecting statsmodels
  Downloading statsmodels-0.13.2-cp37-cp37m-win_amd64.whl (9.0 MB)
Collecting methylcheck
  Downloading methylcheck-0.8.4-py3-none-any.whl (10.8 MB)
Collecting seaborn
  Downloading seaborn-0.11.2-py3-none-any.whl (292 kB)
Collecting requests
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
Collecting matplotlib
  Downloading matplotlib-3.5.1-cp37-cp37m-win_amd64.whl (7.2 MB)
Collecting pandas
  Downloading pandas-1.3.5-cp37-cp37m-win_amd64.whl (10



Collecting toolshed
  Downloading toolshed-0.4.6.tar.gz (13 kB)
Collecting interlap
  Downloading interlap-0.2.7.tar.gz (6.1 kB)
Collecting packaging>=21.3
  Downloading packaging-21.3-py3-none-any.whl (40 kB)
Collecting patsy>=0.5.2
  Downloading patsy-0.5.2-py2.py3-none-any.whl (233 kB)
Collecting xlsxwriter
  Downloading XlsxWriter-3.0.2-py3-none-any.whl (149 kB)
Collecting openpyxl
  Downloading openpyxl-3.0.9-py2.py3-none-any.whl (242 kB)
Collecting idna<4,>=2.5; python_version >= "3"
  Downloading idna-3.3-py3-none-any.whl (61 kB)
Collecting certifi>=2017.4.17
  Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Collecting urllib3<1.27,>=1.21.1
  Downloading urllib3-1.26.8-py2.py3-none-any.whl (138 kB)
Collecting charset-normalizer~=2.0.0; python_version >= "3"
  Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Collecting fonttools>=4.22.0
  Downloading fonttools-4.29.1-py3-none-any.whl (895 kB)
Collecting pillow>=6.2.0
  Downloading Pillow-9.0.1-cp37-cp37m

You should consider upgrading via the 'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python37_64\python.exe -m pip install --upgrade pip' command.


Importing the functionalities needed for the preprocessing and the analysis

In [9]:
from methylprep import make_pipeline, get_sample_sheet

In [10]:
from methylprep.files import create_sample_sheet, find_sample_sheet

In [11]:
import csv
import pandas as pd

Load the original series matrix

In [12]:
original_GSE66351_matrix = pd.read_table("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GSE66351_series_matrix_pheno.txt", header=None)
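
The long Windows paths in this notebook are written with doubled backslashes, which is easy to get wrong. A minimal sketch of an alternative, assuming the same folder layout: build the path once with pathlib and a raw string. The names `DATA_ROOT` and `series_matrix_path` are illustrative and not part of the original notebook. `PureWindowsPath` is used so the snippet behaves the same on any operating system; on the Windows machine itself, `pathlib.Path` would be the usual choice.

```python
from pathlib import PureWindowsPath

# Base data folder copied from the notebook. The raw string (r"...") means
# backslashes do not have to be doubled.
DATA_ROOT = PureWindowsPath(
    r"E:\Msc Systems Biology\MSB5000_Master_Thesis\Practical work\Data"
)

# Join path components with "/" instead of typing one long escaped string.
series_matrix_path = DATA_ROOT / "GSE66351_RAW" / "GSE66351_series_matrix_pheno.txt"

print(series_matrix_path)
```

The resulting object can be converted with `str(series_matrix_path)` wherever a plain string path is required.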

Rename the columns of the transposed series matrix so that they match the column names required by the import function from the methylprep package.
The sample sheet has to include Sentrix_ID and Sentrix_Position; the original file (GSE66351) does not have these column names.

In [13]:
original_GSE66351_matrix_trans = original_GSE66351_matrix.transpose()
colum_names = ["Sample_Name", "GEO_Accession", "Sample_Status", "Sample_Submission_date", "Sample_last_update_data", "Sample_Type", "Channel_count", "Tissue_Source", "Organism", 
"Cell_Type", "Diagnosis", "Braak_stage", "Braak_region", "Age", "Sex", "Donor_id", "Sentrix_ID", "Sentrix_Position", "Molecule"]
original_GSE66351_matrix_trans.columns = colum_names
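
To make the transpose-and-rename step easier to follow, here is a small sketch on a toy table. The table contents and the names `toy_matrix`, `toy_trans` and `toy_columns` are made up for illustration; only the shape (one row per field, one column per sample) is assumed to match the series matrix.

```python
import pandas as pd

# Toy stand-in for a series matrix read with header=None:
# one row per field, one column per sample, field labels in the first column.
toy_matrix = pd.DataFrame(
    [
        ["!Sample_title", "sample_A", "sample_B"],
        ["!Sample_geo_accession", "GSM0000001", "GSM0000002"],
        ["!Sample_characteristics_ch1", "Sex: F", "Sex: M"],
    ]
)

# After transposing, each sample is a row and each field is a column.
toy_trans = toy_matrix.transpose()

toy_columns = ["Sample_Name", "GEO_Accession", "Sex"]

# Check the number of names up front so a mismatch gives a readable message.
if len(toy_columns) != toy_trans.shape[1]:
    raise ValueError(
        f"Expected {toy_trans.shape[1]} column names, got {len(toy_columns)}"
    )
toy_trans.columns = toy_columns

print(toy_trans)
```

Row 0 of the transposed table still holds the original field labels, which is why a later step drops it.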

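Remove the first row
Remove the label text from the data frame cells (e.g. "Sex:" in the cells of column Sex)
Add a Sample_ID column that joins the GEO accession with the barcode (Sentrix_ID and Sentrix_Position), so it matches the idat file names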

In [14]:
original_GSE66351_matrix_trans.drop([0], inplace=True)
original_GSE66351_matrix_trans.replace("^[^:]*:", "", regex=True, inplace=True)
original_GSE66351_matrix_trans["Sample_ID"] = original_GSE66351_matrix_trans[["GEO_Accession", "Sentrix_ID", "Sentrix_Position"]].agg("_".join, axis = 1)
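
The cleaning steps above can be sketched on a toy table. All values here are made up. One deliberate difference from the notebook cell: the pattern below also removes whitespace after the colon (`\s*`), on the assumption that cells look like "Sex: F"; with the notebook's pattern a leading space would remain in the cleaned value.

```python
import pandas as pd

# Toy table in the shape produced by the transpose step: row 0 holds labels.
toy = pd.DataFrame(
    {
        "GEO_Accession": ["label_row", "GSM0000001", "GSM0000002"],
        "Sex": ["label_row", "Sex: F", "Sex: M"],
        "Sentrix_ID": ["label_row", "sentrix_id: 1234567890", "sentrix_id: 1234567890"],
        "Sentrix_Position": ["label_row", "sentrix_position: R01C01", "sentrix_position: R02C01"],
    }
)

# Drop the first row, which does not describe a sample.
toy = toy.drop(index=0)

# Remove everything up to and including the first colon, plus trailing whitespace.
# Cells without a colon (such as the accession) are left unchanged.
toy = toy.replace(r"^[^:]*:\s*", "", regex=True)

# Join accession and barcode into a single identifier per sample.
toy["Sample_ID"] = toy[["GEO_Accession", "Sentrix_ID", "Sentrix_Position"]].agg(
    "_".join, axis=1
)

print(toy["Sample_ID"].tolist())
```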



Save the sample sheet as a .csv file so the methylprep functions can read it, and as a pickle so the data frame can be reloaded later

In [13]:
original_GSE66351_matrix_trans.to_pickle("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GSE66351_GPL13534_meta_data.pkl")
original_GSE66351_matrix_trans.to_csv("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\sample_sheet.csv")
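
Before writing the sample sheet it can help to confirm that the columns the text above names as required are present. The following is a sketch only: `write_sample_sheet` and `REQUIRED_COLUMNS` are hypothetical names, the toy values are made up, and the list of required columns is taken from this notebook's own description rather than from the methylprep documentation.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Columns this notebook says the sample sheet has to include.
REQUIRED_COLUMNS = ["Sentrix_ID", "Sentrix_Position"]


def write_sample_sheet(frame, csv_path):
    """Hypothetical helper: check the required columns, then write the CSV."""
    missing = [name for name in REQUIRED_COLUMNS if name not in frame.columns]
    if missing:
        raise ValueError(f"Sample sheet is missing columns: {missing}")
    # index=False keeps the DataFrame row labels out of the file.
    frame.to_csv(csv_path, index=False)
    return Path(csv_path)


# Toy sample sheet with made-up values.
toy_sheet = pd.DataFrame(
    {
        "Sample_Name": ["sample_A", "sample_B"],
        "Sentrix_ID": ["1234567890", "1234567890"],
        "Sentrix_Position": ["R01C01", "R02C01"],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    written = write_sample_sheet(toy_sheet, Path(tmp_dir) / "sample_sheet.csv")
    header_line = written.read_text().splitlines()[0]

print(header_line)
```

The notebook's own `to_csv` call uses the default `index=True`, which writes the row labels as an extra, unnamed first column. Whether that extra column matters downstream is not checked here.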

Start by loading the data, using functions from the methylsuite packages, since these accept .idat files as input. 

In [56]:
from methylprep.download import convert_miniml
convert_miniml("GSE66351", data_dir = "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW", merge=False)

INFO:methylprep.download.miniml:found 190 idat files; updated meta_data.
INFO:methylprep.download.miniml:Final samplesheet contains 190 rows and 16 columns


[WindowsPath('E:/Msc Systems Biology/MSB5000_Master_Thesis/Practical work/Data/GSE66351_RAW/GSE66351/GSE66351_family.xml')]

In [9]:
sample_sheet_GSE66351_at1 = find_sample_sheet("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW")

Exception: Too many sample sheets in this directory. Move or rename redundant ones. Or specify the path to the one to use with --sample_sheet. (candidate files: [WindowsPath('E:/Msc Systems Biology/MSB5000_Master_Thesis/Practical work/Data/GSE66351_RAW/sample_sheet.csv'), WindowsPath('E:/Msc Systems Biology/MSB5000_Master_Thesis/Practical work/Data/GSE66351_RAW/GSE66351_GPL13534_samplesheet.csv')])
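
The exception asks for redundant sample sheets to be moved or renamed. A rough way to see which CSV files in a folder look like sample sheets is sketched below. `list_sample_sheet_candidates` is a hypothetical helper, and its rule (the first row contains a Sentrix_ID column) is an assumption made for this illustration; it is not methylprep's own detection logic.

```python
import csv
import tempfile
from pathlib import Path


def list_sample_sheet_candidates(directory):
    """Hypothetical helper: names of CSV files whose first row has Sentrix_ID."""
    candidates = []
    for csv_path in sorted(Path(directory).glob("*.csv")):
        with open(csv_path, newline="") as handle:
            header = next(csv.reader(handle), [])
        if "Sentrix_ID" in header:
            candidates.append(csv_path.name)
    return candidates


# Demonstration in a temporary folder with two sample sheets and one other CSV.
with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)
    (folder / "sample_sheet.csv").write_text(
        "Sample_Name,Sentrix_ID,Sentrix_Position\n"
    )
    (folder / "GSE_toy_samplesheet.csv").write_text(
        "GSM_ID,Sentrix_ID,Sentrix_Position\n"
    )
    (folder / "notes.csv").write_text("date,comment\n")
    found = list_sample_sheet_candidates(folder)

print(found)
```

Once the duplicates are known, one of them can be moved out of the folder, or the intended file can be named explicitly, as the later `make_pipeline` call in this notebook does with `sample_sheet_filepath`.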

In [14]:
sample_sheet_GSE66351 = pd.read_pickle("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_RAW\\GSE66351_GPL13534_meta_data.pkl")

Changing .idat filenames

In [3]:
import os, glob, re

Temporarily set the working directory to the folder containing the idat files to be imported into the Python workflow. This makes the renaming process easier.

In [21]:
os.chdir("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_python_format")
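
`os.chdir` changes the working directory for the rest of the session unless it is changed back by hand. If the change really should be temporary, one option is a small context manager that restores the previous directory afterwards. This is a sketch; `working_directory` is a hypothetical helper and not part of the original notebook.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def working_directory(path):
    """Hypothetical helper: switch to `path`, then restore the old directory."""
    previous = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # Runs even if the body raises, so the session never stays in `path`.
        os.chdir(previous)


start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as tmp_dir:
    with working_directory(tmp_dir):
        # Inside the block, relative paths resolve against tmp_dir.
        entered = os.path.samefile(os.getcwd(), tmp_dir)

restored = os.getcwd() == start_dir
print(entered, restored)
```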

Create a list of the idat files in the working directory (python format folder)

In [22]:
idat_files = glob.glob("*.idat")

In [6]:
old_name = idat_files[1]
new_name = re.sub("^[^_]*_", "", old_name)
print(new_name)

8918692108_R03C01_Red.idat


Remove the leading accession prefix (everything up to and including the first underscore; GEO names these files "GSM[number]_...") from the .idat filenames so they match the expected input format

In [23]:
for old_name in idat_files:
    # Strip everything up to and including the first underscore
    new_name = re.sub("^[^_]*_", "", old_name)
    os.replace(old_name, new_name)
    print(new_name)
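
Running the renaming loop a second time would strip the next underscore-delimited part of each name (the Sentrix ID) as well. A more defensive sketch is shown below, under the assumption that the prefix to remove has the form "GSM" followed by digits; adjust the pattern if the files are named differently. `strip_geo_prefix` is a hypothetical helper, and the file names in the demonstration are made up.

```python
import re
import tempfile
from pathlib import Path

# Assumed file name layout: GSM<digits>_<rest>.idat
PREFIXED = re.compile(r"^GSM\d+_(?P<rest>.+\.idat)$")


def strip_geo_prefix(folder, dry_run=True):
    """Hypothetical helper: map old names to new names, renaming unless dry_run."""
    renames = {}
    for path in sorted(Path(folder).glob("*.idat")):
        match = PREFIXED.match(path.name)
        if match is None:
            continue  # no GSM prefix, so the file is left alone
        renames[path.name] = match.group("rest")
        if not dry_run:
            path.replace(path.with_name(match.group("rest")))
    return renames


with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)
    for colour in ("Grn", "Red"):
        (folder / f"GSM0000001_1234567890_R01C01_{colour}.idat").touch()

    planned = strip_geo_prefix(folder, dry_run=True)    # preview only
    applied = strip_geo_prefix(folder, dry_run=False)   # actually rename
    second_run = strip_geo_prefix(folder, dry_run=False)  # nothing left to do
    final_names = sorted(p.name for p in folder.glob("*.idat"))

print(final_names)
```

Working with full paths also means the working directory does not need to be changed for the renaming step.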


5854945011_R05C02_Red.idat
8918692108_R01C02_Grn.idat
8918692108_R01C02_Red.idat
8918692108_R01C01_Grn.idat
8918692108_R01C01_Red.idat
8918692108_R02C02_Grn.idat
8918692108_R02C02_Red.idat
8918692108_R02C01_Grn.idat
8918692108_R02C01_Red.idat
8918692108_R03C02_Grn.idat
8918692108_R03C02_Red.idat
8918692108_R03C01_Grn.idat
8918692108_R03C01_Red.idat
8918692108_R04C02_Grn.idat
8918692108_R04C02_Red.idat
8918692108_R04C01_Grn.idat
8918692108_R04C01_Red.idat
8918692108_R05C02_Grn.idat
8918692108_R05C02_Red.idat
8918692108_R05C01_Grn.idat
8918692108_R05C01_Red.idat
8918692108_R06C02_Grn.idat
8918692108_R06C02_Red.idat
8918692108_R06C01_Grn.idat
8918692108_R06C01_Red.idat
8918692120_R04C02_Grn.idat
8918692120_R04C02_Red.idat
8918692120_R04C01_Grn.idat
8918692120_R04C01_Red.idat
8918692120_R05C02_Grn.idat
8918692120_R05C02_Red.idat
8918692120_R05C01_Grn.idat
8918692120_R05C01_Red.idat
8918692120_R06C02_Grn.idat
8918692120_R06C02_Red.idat
8918692120_R06C01_Grn.idat
8918692120_R06C01_Red.idat
8

Reading in the data (finally)

In [24]:
data = make_pipeline("E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_python_format" , sample_sheet_filepath = "E:\\Msc Systems Biology\\MSB5000_Master_Thesis\\Practical work\\Data\\GSE66351_python_format\\sample_sheet.csv", steps= None, exports= None, estimator= "beta")

INFO:methylprep.processing.pipeline:Running pipeline in: E:\Msc Systems Biology\MSB5000_Master_Thesis\Practical work\Data\GSE66351_python_format
INFO:methylprep.processing.pipeline:Found 17 additional fields in sample_sheet:
Sample_ID | GEO_Accession | Sample_Status | Sample_Submission_date | Sample_last_update_data | Channel_count | Tissue_Source | Organism | Cell_Type | Diagnosis | Braak_stage | Braak_region | Age | Sex | Donor_id | Molecule | Sample_ID.1 --> Sample_ID1
Reading IDATs: 100%|██████████| 190/190 [08:05<00:00,  2.55s/it]
INFO:methylprep.files.manifests:Downloading manifest: HumanMethylation450k_15017482_v3.csv
INFO:methylprep.files.manifests:Reading manifest file: HumanMethylation450k_15017482_v3.csv
Processing samples: 100%|██████████| 190/190 [2:03:17<00:00, 38.93s/it]  
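
The call above asks for beta values (`estimator= "beta"`). As a reminder of what a beta value is, here is a sketch of the commonly used definition, beta = M / (M + U + offset), where M and U are the methylated and unmethylated intensities of a probe. The function name `beta_value` is hypothetical, and the offset of 100 is a conventional choice assumed here; the offset methylprep uses internally is not confirmed by this notebook.

```python
def beta_value(methylated, unmethylated, offset=100.0):
    """Sketch of the usual beta definition: M / (M + U + offset).

    Negative intensities are clipped to zero first. The offset keeps the
    ratio stable when both intensities are small.
    """
    m = max(methylated, 0.0)
    u = max(unmethylated, 0.0)
    return m / (m + u + offset)


# Made-up intensities for three probes.
mostly_methylated = beta_value(900.0, 0.0)
half_methylated = beta_value(450.0, 450.0)
unmethylated = beta_value(0.0, 900.0)

print(mostly_methylated, half_methylated, unmethylated)
```

Beta values fall between 0 and 1 and can be read as the estimated fraction of methylated copies at a site.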
