# Process Coverage Notebook (12/9/2021)

This notebook demonstrates how to load .cov/.cov.gz single-cell methylome files generated by [Bismark](https://www.bioinformatics.babraham.ac.uk/projects/bismark/) and process them into binarized methylation matrices, which serve as input to the `run_scAge` function. For alternative ways to process single-cell methylation data, please refer to the README file on the GitHub page.

Single-cell methylation profiles from the [Gravina et al., 2016](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1011-3) study are provided as raw .cov.gz files in `sc_data_raw` and as processed .tsv.gz files in `sc_data_processed`.
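To make the conversion concrete, here is a minimal sketch of reading a Bismark coverage table and binarizing it by rounding. It is not scAge's implementation: the column labels, the `binarize_coverage` helper, and the tie rule (exactly 50% counts as methylated) are assumptions made for illustration, and `max_met` is assumed to be the top of the methylation scale (100 for percentages). The example rows are made up.

```python
import io

import pandas as pd

# Bismark coverage files are tab-separated with no header:
# chromosome, start, end, methylation percentage, count methylated, count unmethylated.
# The column labels below are hypothetical names chosen for this sketch.
COV_COLUMNS = ["chr", "start", "end", "met_pct", "n_met", "n_unmet"]


def binarize_coverage(cov, max_met=100):
    """Return one row per CpG with a 0/1 methylation call (hypothetical helper)."""
    fraction = cov["met_pct"] / max_met  # rescale to the 0..1 range
    binary = cov[["chr", "start"]].copy()
    # Assumption of this sketch: values at or above one half are called methylated.
    binary["met_binary"] = (fraction >= 0.5).astype(int)
    return binary


# Made-up rows in Bismark coverage layout, used instead of a real file.
example_cov = io.StringIO(
    "chr1\t3000827\t3000827\t100\t2\t0\n"
    "chr1\t3001007\t3001007\t0\t0\t1\n"
    "chr1\t3001018\t3001018\t66.6666666666667\t2\t1\n"
    "chr1\t3001277\t3001277\t33.3333333333333\t1\t2\n"
)
cov = pd.read_csv(example_cov, sep="\t", header=None, names=COV_COLUMNS)
binary = binarize_coverage(cov)
print(binary)
```

The same `pd.read_csv` call accepts a path to a .cov.gz file, since pandas infers gzip compression from the file extension.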

## Import required packages

In [1]:
import numpy as np
import pandas as pd
import os
import sys
sys.path.append('..')
import scAge
import subprocess
import multiprocessing
num_total_cores = multiprocessing.cpu_count()

## Check coverage files

In [2]:
# designate input directory of .cov files
input_coverage_directory = "../sc_data_raw/"

# get .cov files
input_coverage_files = sorted(os.listdir(input_coverage_directory))

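# remove the ".ipynb_checkpoints" directory if present
# (shutil.rmtree also works on a non-empty directory, where os.rmdir would fail)
if ".ipynb_checkpoints" in input_coverage_files:
    import shutil
    input_coverage_files.remove(".ipynb_checkpoints")
    shutil.rmtree(os.path.join(input_coverage_directory, ".ipynb_checkpoints"))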
print("Coverage file input directory: '%s'" % input_coverage_directory)

# cycle through .cov files
for count, file in enumerate(input_coverage_files):
    print("\tRaw .cov file #%s --> '%s'" % (count + 1, file))
    
# denote output path for processed .tsv files
output_path = "../sc_data_processed/"
print("\nProcessed file output directory: '%s'" % output_path)

Coverage file input directory: '../sc_data_raw/'
	Raw .cov file #1 --> 'SRR3136624_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #2 --> 'SRR3136625_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #3 --> 'SRR3136626_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #4 --> 'SRR3136627_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #5 --> 'SRR3136628_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #6 --> 'SRR3136629_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #7 --> 'SRR3136630_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #8 --> 'SRR3136631_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #9 --> 'SRR3136634_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #10 --> 'SRR3136635_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #11 --> 'SRR3136646_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #12 --> 'SRR3136647_bismark_bt2.deduplicated.bismark.cov.gz'
	Raw .cov file #13 --> 'SRR3136651_bismark_bt2.d
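If the input directory may hold other stray entries besides `.ipynb_checkpoints`, the listing can be restricted by file suffix instead. The sketch below uses a hypothetical helper, `list_coverage_files`, and a temporary directory with empty placeholder files. It only changes which names are listed; whether `process_coverage` itself skips such entries is not covered here.

```python
import os
import tempfile


def list_coverage_files(directory, suffixes=(".cov", ".cov.gz")):
    """Return sorted names of regular files ending in a coverage suffix (hypothetical helper)."""
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(suffixes)
        and os.path.isfile(os.path.join(directory, name))
    )


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.cov.gz", "a.cov", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, ".ipynb_checkpoints"))
    found = list_coverage_files(tmp)

print(found)
```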

## Process coverage files

In [3]:
# run process_coverage
scAge.process_coverage(cov_directory = input_coverage_directory,
                       n_cores = num_total_cores,
                       max_met = 100,
                       split = "_",
                       chunksize = 1,
                       binarization = "round",
                       output_path = output_path)

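# gzip output files
# remove previously compressed outputs first so gzip does not collide with existing .gz files;
# -f keeps rm quiet when there are none
print("\nCompressing output .tsv files")
rm_previous_gzip = subprocess.run("rm -f %s*.gz" % output_path, shell = True)
gzip_out = subprocess.run("gzip -v %s*" % output_path, shell = True)
print("Binary methylation matrices compressed!")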

process_coverage function starting!

----------------------------------------------------------
Loading .cov files from '../sc_data_raw/'
Number of Bismark .cov files = 26
First .cov file name: 'SRR3136624_bismark_bt2.deduplicated.bismark.cov.gz'
----------------------------------------------------------

----------------------------------------------------------
Starting parallel loading and processing of .cov files...


Single-cell loading progress :   0%|          | 0/26 [00:00<?, ? cell methylomes/s]


Parallel loading complete!
Processed binary methylation matrices written to '../sc_data_processed/'
----------------------------------------------------------

Time elapsed to process coverage files = 30.942 seconds

process_coverage run complete!
Compressing output .tsv files
Binary methylation matrices compressed!
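The compression step above relies on the `rm` and `gzip` shell commands. Where those are unavailable, it could be done with the Python standard library, as in the sketch below. `gzip_tsv_files` is a hypothetical helper, not part of scAge. Unlike the shell call, it compresses only files ending in `.tsv` and overwrites any existing `.gz` file of the same name.

```python
import gzip
import os
import shutil
import tempfile


def gzip_tsv_files(directory):
    """Compress every .tsv file in `directory` to .tsv.gz and delete the original (hypothetical helper)."""
    compressed = []
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".tsv"):
            continue
        source = os.path.join(directory, name)
        target = source + ".gz"
        # Stream the file through gzip; an existing target is overwritten.
        with open(source, "rb") as f_in, gzip.open(target, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        os.remove(source)
        compressed.append(target)
    return compressed


# Demonstration on a throwaway directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "cell_1.tsv"), "w") as handle:
        handle.write("placeholder\n")
    written = [os.path.basename(path) for path in gzip_tsv_files(tmp)]
    remaining = sorted(os.listdir(tmp))

print(written)
```

In the notebook this would be called as `gzip_tsv_files(output_path)` after `process_coverage` finishes.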
