# WORKFLOW-CAGEscan-short-reads with James Bagnall

This document is an example of how to process a C1CAGE library with a Jupyter notebook, from raw reads to single-molecule counts. All the steps are described in the [tutorial](https://github.com/Population-Transcriptomics/C1-CAGE-preview/blob/master/tutorial.md) section of this repository. In the following sections we assume that:
- The software used in this workflow, listed in the [prerequisite](https://github.com/Population-Transcriptomics/C1-CAGE-preview/blob/master/prerequisite.md) section, is installed.
- The reference genome has already been indexed with bwa.
- The steps are introduced in the tutorial using the example file mentioned there.

In our hands, this notebook worked without trouble on a machine running Debian GNU/Linux 8. We noticed that tagdust2 behaves differently in single-end mode on Mac OS X: the order of the reads1 changes after extraction, which is a problem because syncpairs expects reads1 and reads2 to be in the same order. One way to overcome this issue is to sort reads1 and reads2 separately after the extraction; syncpairs will then work properly.
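The sorting workaround could look roughly like the sketch below. It is not part of the original workflow: the helper names (`read_fastq`, `read_name`, `sort_fastq`) are hypothetical, the whole file is loaded into memory, and the `read_name` key is an assumption that must be adapted so that reads1 and reads2 headers reduce to the same identifier.

```python
import itertools


def read_fastq(lines):
    """Yield (header, sequence, plus, quality) tuples.

    Assumes strict four-line FASTQ records.
    """
    iterator = iter(lines)
    while True:
        record = tuple(line.rstrip('\n') for line in itertools.islice(iterator, 4))
        if not record:
            break
        if len(record) != 4:
            raise ValueError('truncated FASTQ record')
        yield record


def read_name(header):
    """Sort key (assumption): first token of the header, without '@' or '/1', '/2'."""
    name = header[1:].split()[0]
    if name.endswith('/1') or name.endswith('/2'):
        name = name[:-2]
    return name


def sort_fastq(in_path, out_path):
    """Write the records of in_path to out_path, sorted by read name."""
    with open(in_path) as handle:
        records = sorted(read_fastq(handle), key=lambda record: read_name(record[0]))
    with open(out_path, 'w') as handle:
        for record in records:
            handle.write('\n'.join(record) + '\n')
```

Running `sort_fastq` once on the extracted reads1 and once on the unzipped reads2, then passing the two sorted files to syncpairs, would put both files in the same order as long as the key is identical for both mates.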

## Imports

In [1]:
import subprocess, os, csv, signal, pysam

## Custom functions

In [2]:
remove_extension = lambda x: x.split('.')[0]

In [3]:
print remove_extension

<function <lambda> at 0x7f4f5005eb90>


Declare the function that deals with inputs and outputs

In [4]:
def get_args(read1, read2, ref_genome, output_folders):
    '''Set the input and output path for a given pair of reads'''
    r1_shortname = remove_extension(os.path.basename(read1))

    args = {  
        'r1_input': read1,
        'r2_input': read2,
        'ref_genome': ref_genome,
    }
    
    #output_paths = {folder: os.path.join('output', folder, r1_shortname) for folder in output_folders}
    output_paths = {folder: os.path.join('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/', folder, r1_shortname) for folder in output_folders}
    
    return dict(args, **output_paths)
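To make the result of `get_args` concrete, here is a self-contained variant of it. The name `get_args_sketch`, the `output_root` parameter, and the example file names are illustrative assumptions; the logic otherwise mirrors the cell above.

```python
import os


def remove_extension(filename):
    # Everything before the first dot: 'a.fastq.gz' -> 'a'
    return filename.split('.')[0]


def get_args_sketch(read1, read2, ref_genome, output_folders, output_root='output'):
    """Return input paths plus one output prefix per folder (hypothetical helper)."""
    shortname = remove_extension(os.path.basename(read1))
    args = {'r1_input': read1, 'r2_input': read2, 'ref_genome': ref_genome}
    for folder in output_folders:
        args[folder] = os.path.join(output_root, folder, shortname)
    return args


example = get_args_sketch(
    'fastqs/C01_S17_R1_001.fastq.gz',   # placeholder R1 path
    'fastqs/C01_S17_R2_001.fastq.gz',   # placeholder R2 path
    'genome_index',                     # placeholder bwa index prefix
    ['tagdust_r1', 'unzip_r2'],
)
print(example['tagdust_r1'])
```

Note that every output prefix is derived from the read 1 file name, so the outputs of one pair share the same short name across all folders.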

## Parameters

If the required software is not in the PATH, you can set its location manually here.

In [5]:
tagdust2_path = 'tagdust'
bwa_path = '/home/baker/bwa/bwa'
samtools_path = 'samtools'
paired_bam_to_bed12_path = '/home/baker/pairedBamToBed12/bin/pairedBamToBed12'
umicountFP_path = 'umicountFP'
syncpairs_path = 'syncpairs'

Path to the reference genome you want to align your reads against

In [6]:
#ref_genome = '/home/baker/my-mm10-index-share/Mus_musculus.GRCm38.71.fa'
ref_genome = '/home/baker/my-mm10-index-share/bwa/bwa_mm10_random_chrM_chrUn'

In [7]:
softwares = {    
    'bwa': bwa_path,
    'tagdust': tagdust2_path,
    'syncpairs': syncpairs_path,
    'samtools': samtools_path,
    'pairedBamToBed12': paired_bam_to_bed12_path,
    'umicountFP': umicountFP_path}

The name of the output folders for each command

In [8]:
output_folders = [ 'tagdust_r1', 'unzip_r2'                    # Demultiplexed R1, unzipped R2
                 , 'extracted_r1', 'extracted_r2'              # Synced R1 and R2
                 , 'cleaned_reads', 'cleaned_r1', 'cleaned_r2' # rRNA reads removed
                 , 'r1_sai', 'r2_sai', 'sampe'                 # Intermediate files from BWA
                 , 'genome_mapped', 'properly_paired'          # Final output in BAM format
                 , 'cagescan_pairs', 'cagescan_fragments'      # Final output in BED12 format
                 ]

Create the folders

In [9]:
for folder in output_folders:
    os.makedirs(os.path.join('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output', folder))
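`os.makedirs` raises an error when the folder already exists, so the cell above fails if it is run a second time. A minimal sketch of a re-runnable alternative follows; `ensure_dir` is a hypothetical helper, not part of the original notebook.

```python
import errno
import os


def ensure_dir(path):
    """Create path (and its parents) unless it already exists as a directory."""
    try:
        os.makedirs(path)
    except OSError as error:
        # Ignore "already exists" only when the existing entry is a directory
        if error.errno != errno.EEXIST or not os.path.isdir(path):
            raise
```

Calling `ensure_dir(os.path.join(output_root, folder))` for each entry of `output_folders` would then be safe to repeat, where `output_root` stands for the output directory used above.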

The actual commands to run. See the [tutorial](https://github.com/Population-Transcriptomics/C1-CAGE-preview/blob/master/tutorial.md) section for more details about each command.

In [10]:
cmds = [
    
    '{tagdust} -t8 -o {tagdust_r1} -1 F:NNNNNNNN -2 S:TATAGGG -3 R:N {r1_input}',
    
    'gunzip -c {r2_input} > {unzip_r2}.fq',
        
    '{syncpairs} {tagdust_r1}.fq {unzip_r2}.fq {extracted_r1}.fq {extracted_r2}.fq',
    
    '{tagdust} -arch SimpleArchitecture.txt -ref /home/baker/Rna-seq_Data-Analysis/Pawel_Pascaz/ercc_and_TPA_mouse_rRNA.fa -o {cleaned_reads} {extracted_r1}.fq {extracted_r2}.fq',
    
    'cp {cleaned_reads}_READ1.fq {cleaned_r1}.fq',
    
    'cp {cleaned_reads}_READ2.fq {cleaned_r2}.fq',
    
    '{bwa} aln {ref_genome} {cleaned_r1}.fq > {r1_sai}.sai',
    
    '{bwa} aln {ref_genome} {cleaned_r2}.fq > {r2_sai}.sai',
    
    '{bwa} sampe -a 2000000 -c 0.00001 {ref_genome} {r1_sai}.sai {r2_sai}.sai {cleaned_r1}.fq {cleaned_r2}.fq > {sampe}.sam',
    
    '{samtools} view -uSo - {sampe}.sam | {samtools} sort - {genome_mapped}',
    
    '{samtools} view -f 0x0002 -F 0x0100 -uo - {genome_mapped}.bam | {samtools} sort -n - {properly_paired}',
    
    '{pairedBamToBed12} -i {properly_paired}.bam > {cagescan_pairs}.bed',
    
    '{umicountFP} -f {cagescan_pairs}.bed > {cagescan_fragments}.bed'
    
]
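Each entry of `cmds` is a template whose `{name}` placeholders are filled from a dictionary with `str.format`. The sketch below expands one of the templates above; the values in `example_args` are placeholders chosen for illustration, not the paths used in this run.

```python
# One template copied from the list above
template = '{bwa} aln {ref_genome} {cleaned_r1}.fq > {r1_sai}.sai'

# Placeholder values (assumptions for illustration only)
example_args = {
    'bwa': 'bwa',
    'ref_genome': 'genome_index',
    'cleaned_r1': 'output/cleaned_r1/C01',
    'r1_sai': 'output/r1_sai/C01',
    'unused_key': 'ignored',  # extra keys are simply ignored by str.format
}

command = template.format(**example_args)
print(command)
```

A placeholder with no matching key raises `KeyError`, which is why the software paths and the output paths are merged into a single dictionary before the commands are formatted.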

Get the reads. Here we assume that the reads are in a folder named after the MiSeq run ID; adjust the path below to match your setup.

In [11]:
root, folders, files = os.walk('/home/baker/my-James-Bagnall-share/fastqs/').next()

files = [f for f in files if not f.startswith('.')] # remove hidden files, if any
reads1 = sorted([os.path.join(root, f) for f in files if 'R1' in f])
reads2 = sorted([os.path.join(root, f) for f in files if 'R2' in f])
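Because the two lists are sorted independently and later combined with `zip`, a missing file would silently shift every following pair. A minimal sanity check is sketched below; `check_pairs` is a hypothetical helper, and it assumes that the mates differ only by `_R1_` versus `_R2_` in the file name, as in the listing printed in the next cell.

```python
import os


def check_pairs(reads1, reads2):
    """Return the (R1, R2) pairs, or raise ValueError if the lists do not match."""
    if len(reads1) != len(reads2):
        raise ValueError('%d R1 files but %d R2 files' % (len(reads1), len(reads2)))
    for r1, r2 in zip(reads1, reads2):
        # Assumed naming convention: only the _R1_/_R2_ token differs
        expected = os.path.basename(r1).replace('_R1_', '_R2_')
        if os.path.basename(r2) != expected:
            raise ValueError('%s is not the mate of %s' % (r2, r1))
    return list(zip(reads1, reads2))
```

Calling `check_pairs(reads1, reads2)` before the main loop would stop the run early instead of aligning mismatched files.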

In [12]:
print reads1, reads2

['/home/baker/my-James-Bagnall-share/fastqs/C01_S17_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C02_S9_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C03_S1_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C04_S65_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C05_S57_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C06_S49_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C07_S18_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C08_S10_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C09_S2_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C10_S66_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C11_S58_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C12_S50_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C13_S19_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/fastqs/C14_S11_R1_001.fastq.gz', '/home/baker/my-James-Bagnall-share/

Run the commands for all the pairs

In [13]:
for read1, read2 in zip(reads1, reads2):
    args = get_args(read1, read2, ref_genome, output_folders)
    args = dict(args, **softwares)
    
    for cmd in cmds:
        #print cmd.format(**args)
        subprocess.call(cmd.format(**args), preexec_fn=lambda: signal.signal(signal.SIGPIPE, signal.SIG_DFL), shell=True)

KeyboardInterrupt: 
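The loop above ignores the value returned by `subprocess.call`, so a failing step does not prevent the following steps from running on incomplete files. The sketch below stops at the first non-zero exit status. `run_pipeline` is a hypothetical helper; it omits the `preexec_fn` SIGPIPE handler used above for brevity, and it assumes a POSIX shell.

```python
import subprocess


def run_pipeline(commands, args):
    """Format and run each command in order; raise on the first failure.

    Returns the number of commands that completed successfully.
    """
    completed = 0
    for template in commands:
        command = template.format(**args)
        status = subprocess.call(command, shell=True)
        if status != 0:
            raise RuntimeError('command failed with status %d: %s' % (status, command))
        completed += 1
    return completed
```

With this helper, the inner loop would become `run_pipeline(cmds, args)`, and a failing pair could be caught with `try`/`except RuntimeError` to continue with the next pair.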

In [175]:
#for cmd in cmds:
#    print cmd.format(**args)

Generate the level1 file

In [179]:
root, folders, files = os.walk('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/').next()
files = [os.path.join(root, f) for f in files if f.endswith('bam')]
level1 = 'python /home/baker/PromoterPipeline_20150516/level1.py -o /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/mylevel1file.l1.osc.gz -f 0x0042 -F 0x0104 --fingerprint {files}'.format(files=' '.join(files))

In [180]:
print level1

python /home/baker/PromoterPipeline_20150516/level1.py -o /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/mylevel1file.l1.osc.gz -f 0x0042 -F 0x0104 --fingerprint /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/C64_S75_R1_001.bam /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/C03_S1_R1_001.bam /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/C77_S85_R1_001.bam /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/C54_S89_R1_001.bam /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/C76_S93_R1_001.bam /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/C38_S15_R1_001.bam /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/C84_S78_R1_001.bam /home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/genome_mapped/C43_S8_R1_0

In [181]:
subprocess.call(level1, shell=True)

0
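The `-f` and `-F` values used with samtools above are bit masks over the SAM FLAG field: `-f` keeps reads that have all the given bits set, and `-F` drops reads that have any of the given bits set. The sketch below decodes the masks, assuming that level1.py interprets its `-f` and `-F` options the same way as `samtools view`. The names `SAM_FLAGS`, `decode_flag`, and `passes` are illustrative.

```python
# Bit meanings from the SAM format specification
SAM_FLAGS = [
    (0x1, 'paired'), (0x2, 'proper_pair'), (0x4, 'unmapped'),
    (0x8, 'mate_unmapped'), (0x10, 'reverse'), (0x20, 'mate_reverse'),
    (0x40, 'read1'), (0x80, 'read2'), (0x100, 'secondary'),
    (0x200, 'qc_fail'), (0x400, 'duplicate'), (0x800, 'supplementary'),
]


def decode_flag(value):
    """Return the names of the bits set in a FLAG value or mask."""
    return [name for bit, name in SAM_FLAGS if value & bit]


def passes(flag, require=0, exclude=0):
    """Mimic '-f require -F exclude': all required bits set, no excluded bit set."""
    return (flag & require) == require and (flag & exclude) == 0


print(decode_flag(0x0042))  # ['proper_pair', 'read1']
print(decode_flag(0x0104))  # ['unmapped', 'secondary']
```

Under that assumption, `-f 0x0042 -F 0x0104` keeps only first-in-pair reads from proper pairs and discards unmapped reads and secondary alignments; for example, `passes(99, 0x0042, 0x0104)` is true, while `passes(147, 0x0042, 0x0104)` is false because flag 147 marks a second-in-pair read.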

## Generate logs (triplet)

Here we generate four summary files that will be used for [QC](https://github.com/Population-Transcriptomics/C1-CAGE-preview/blob/master/QC.md) and place them in the 'output' directory. 

1.  mapped.log: The number of mapped reads per cell
2.  extracted.log: The number of remaining reads after filtering for ribosomal DNA and unreadable UMIs
3.  filtered.log: The detailed number of ribosomal DNA extracted per cell
4.  transcript_count.log: The exact number of unique transcripts per cell
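Each of these logs is written below as a tab-separated file with three columns per line: a label, the cell name (with `_R1` removed), and the value. A minimal sketch for reading such a file back is shown here; `parse_triplets` is a hypothetical helper, and the sample lines are invented for illustration, not taken from this run.

```python
import csv


def parse_triplets(lines, convert=str):
    """Map cell name -> value from tab-separated (label, cell, value) lines."""
    values = {}
    for label, cell, value in csv.reader(lines, delimiter='\t'):
        values[cell] = convert(value)
    return values


# Invented sample lines in the same layout as mapped.log
sample = ['mapped\tC01_S17_001\t1200\n', 'mapped\tC02_S9_001\t950\n']
mapped_counts = parse_triplets(sample, convert=int)
print(mapped_counts['C01_S17_001'])
```

For a real file, an open file handle can be passed in place of `sample`. The `convert=int` option only suits logs whose third column is a single integer; the default keeps the values as strings.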



In [182]:
mapped_cmd = "{samtools} view -u -f 0x40 {genome_mapped}.bam | {samtools} flagstat - | grep mapped | grep % | cut -f 1 -d ' '"
extracted_cmd = "{samtools} flagstat {genome_mapped}.bam | grep read1 | cut -f 1 -d ' '"
counts_cmd = "wc -l {cagescan_fragments}.bed | cut -f 1 -d ' '"
rdna_cmd = "grep ribosomal {cleaned_reads}_logfile.txt | cut -f 2"

In [183]:
#remove _R1 from the file's name
custom_rename = lambda x: x.replace('_R1', '')

In [184]:
mapped, extracted, rdna, counts = ([], [], [], [])

for read1 in reads1:
    r1_shortname = remove_extension(os.path.basename(read1))
    
    args = {'samtools': samtools_path,
            'genome_mapped': os.path.join('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/', 'genome_mapped', r1_shortname),
            'cagescan_fragments': os.path.join('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/', 'cagescan_fragments', r1_shortname),
            'cleaned_reads': os.path.join('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/', 'cleaned_reads', r1_shortname)}
    
    output = subprocess.check_output(mapped_cmd.format(**args), shell=True).strip()
    mapped.append(['mapped', custom_rename(r1_shortname), output])

    output = subprocess.check_output(extracted_cmd.format(**args), shell=True).strip()
    extracted.append(['extracted', custom_rename(r1_shortname), output])
    
    output = subprocess.check_output(counts_cmd.format(**args), shell=True).strip()
    counts.append(['counts', custom_rename(r1_shortname), output])

    output = subprocess.check_output(rdna_cmd.format(**args), shell=True).strip()
    rdna.append(['rdna', custom_rename(r1_shortname), output])
    #print rdna_cmd.format(**args)

In [185]:
with open('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/mapped.log', 'w') as handler:
    writer = csv.writer(handler, delimiter='\t')
    writer.writerows(mapped)

with open('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/extracted.log', 'w') as handler:
    writer = csv.writer(handler, delimiter='\t')
    writer.writerows(extracted)
    
with open('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/filtered.log', 'w') as handler:
    writer = csv.writer(handler, delimiter='\t')
    writer.writerows(rdna)
    
with open('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/transcript_count.log', 'w') as handler:
    writer = csv.writer(handler, delimiter='\t')
    writer.writerows(counts)

### Generating commands for htseq-count runs

In [204]:
log_files = ""
root, folders, files = os.walk('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/properly_paired/').next()

In [231]:
# Append one qsub command per BAM file; the with-block closes the file when done
with open('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/htseq_command.txt', 'a') as htseq_command_file:
    for f in files:
        htseq_command = "qsub run_htseq-count output/properly_paired/" + f + " output/htseq_output/" + f.split('_', 1)[0] + "_htseq_count.txt\n"
        htseq_command_file.write(htseq_command)


In [276]:
root, folders, files = os.walk('/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/cleaned_reads/').next()
files = [os.path.join("", f) for f in files if f.endswith('txt')]

In [289]:
def ERCC_finder(f):
    '''Append the ERCC lines of a file in cleaned_reads to a per-cell count file'''
    full_path = '/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/cleaned_reads/' + f
    ERCC_file_path = "/home/baker/my-scratch-share/James_Bagnall_Single_cell_rna_seq/output/ERCC_count/ERCC_count_" + f.split('_', 1)[0] + ".txt"
    with open(full_path, 'rb') as file_ERCC:
        for line in file_ERCC:
            if 'ERCC' in line:
                temp = line.split()
                # Write the fourth and third whitespace-separated fields, tab-separated
                with open(ERCC_file_path, 'a') as ERCC_count_file:
                    ERCC_count_file.write(temp[3] + "\t" + temp[2] + "\n")
    


In [291]:
for f in files:    
    ERCC_val = ERCC_finder(f)