## Cell-Free Genomics: Read preparation

#### Ruby Froom, Campbell/Darst and Rock labs at Rockefeller University

This pipeline processes raw FASTQ files, maps the reads, and extracts only the R2 reads, which carry the ends of interest in both 5'- and 3'-end libraries.

## Modules and functions

In [1]:
import subprocess, time
import pandas as pd
import numpy as np
import regex
import csv
from Bio.Seq import Seq
from Bio import SeqIO
from os import listdir
from os.path import exists
from Bio import SeqRecord
import scipy

In [2]:
# written by Peter Culviner, PhD to enable command-line access through Jupyter
def quickshell(command, print_output=True, output_path=None, return_output=False):
    process_output = subprocess.run(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout = process_output.stdout.decode('utf-8')
    stderr = process_output.stderr.decode('utf-8')
    output_string = f'STDOUT:\n{stdout}\nSTDERR:\n{stderr}\n'
    if print_output:
        print('$ ' + command)
        print(output_string)
    if output_path is not None:
        with open(output_path, 'w') as f:
            f.write(output_string)
    if return_output:
        return stdout, stderr

## Initializing inputs and settings

This section sets the paths used to locate input files and to create output directories.

The starting requirements for this pipeline are:
>**main_path**: the absolute path of the working directory. Update to reflect your own configuration.

>**input_csv_dir:** a directory called `input_csv_files` containing the `i7_sample_file` and `inline_sample_file` csv files (see below for file specifications).

>**genome_dir:** a directory containing the reference genome(s) for read mapping (`genome_files_misc` in the cell below).

>**raw_fastq_dir:** a directory called `raw_fastq` containing your i7-demultiplexed fastq files.

In [3]:
# initializing locations of input .csv files, alignment genomes and raw compressed fastq files
main_path = '5enrich_CRP'
#main_path = '3enrich_NusAG'

input_csv_dir = f'{main_path}/input_csv_files'

readPrep_dir = f'{main_path}/readPrep'
genome_dir = 'genome_files_misc'
raw_fastq_dir = f'{readPrep_dir}/raw_fastq'

In [4]:
i7_table = pd.read_csv(f'{input_csv_dir}/i7_barcodes_5enrich.csv')
inline_table = pd.read_csv(f'{input_csv_dir}/inline_barcodes_5enrich.csv')

# i7_table = pd.read_csv(f'{input_csv_dir}/i7_barcodes_3enrich.csv')
# inline_table = pd.read_csv(f'{input_csv_dir}/inline_barcodes_3enrich.csv')

The additional directories below are a suggested organization for subsequent processing steps in this notebook.

The cells below will initialize these variables and create the directories inside the working directory.

In [5]:
fastqc_dir = f'{readPrep_dir}/fastqc'

# Directory for fastqs split by inline barcodes
demultiplexed_fastq_dir = f'{readPrep_dir}/demultiplexed_fastq'

# Directory for quality filtered and Illumina adaptor-trimmed fastq files
trimmed_fastq_dir = f'{readPrep_dir}/trimmed_fastq'

# Directory for containing converted fastas with UMIs removed for mapping
noUMI_dir = f'{readPrep_dir}/UMIextract_fasta'
combined_fasta_dir = f'{noUMI_dir}/combined_fastas'

# Pre-processing alignments prior to enriched end calling
initial_alignments_dir = f'{readPrep_dir}/initial_alignments'
dedup_alignments_dir = f'{readPrep_dir}/dedup_alignments'
dedup_logs_dir = f'{dedup_alignments_dir}/dedup_logs'
R2_alignments_dir = f'{readPrep_dir}/R2_alignments'
spike_R2_alignments_dir = f'{readPrep_dir}/spike_R2_alignments'

In [1]:
# -p creates missing parent directories and does not fail if a directory already exists
!mkdir -p $fastqc_dir
!mkdir -p $demultiplexed_fastq_dir
!mkdir -p $trimmed_fastq_dir
!mkdir -p $noUMI_dir
!mkdir -p $combined_fasta_dir
!mkdir -p $initial_alignments_dir
!mkdir -p $dedup_alignments_dir
!mkdir -p $dedup_logs_dir
!mkdir -p $R2_alignments_dir
!mkdir -p $spike_R2_alignments_dir
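
The same directory layout can also be created from Python rather than through shell calls. The following is a minimal sketch, not part of the original pipeline: `make_dirs` and the demo paths are hypothetical names, and the example writes into a temporary directory so it is safe to run anywhere.

```python
import os
import tempfile

def make_dirs(paths):
    """Create each directory (and any missing parents); existing directories are left alone."""
    for path in paths:
        os.makedirs(path, exist_ok=True)

# Hypothetical layout rooted in a temporary directory (illustration only)
demo_root = tempfile.mkdtemp()
demo_dirs = [
    os.path.join(demo_root, 'readPrep', 'fastqc'),
    os.path.join(demo_root, 'readPrep', 'UMIextract_fasta', 'combined_fastas'),
]
make_dirs(demo_dirs)
make_dirs(demo_dirs)  # a second call is a no-op rather than an error
```

In the notebook itself, the list would hold the directory variables defined in the previous cell.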

*Descriptions written by Peter Culviner, PhD with modifications from Ruby Froom.*

**threads:** number of threads to use in command-line calls (e.g. fastqc, cutadapt, bwa, end calling).

**inline_barcode_errors:** How many mismatches can be tolerated in the inline barcode. Analyze the edit distance between the barcodes used to assess a tolerable number of errors.

**barcode_length**: the length of the inline barcodes. Enables calculation of the error rate (# errors / barcode length), a required input for `cutadapt` when we are splitting based on inline barcode identification.

**minimum_insert_length:** minimum allowed length of insert after trimming of adapter sequences at the 3'-ends of read1 and read2.

**i7_trim**: For 3'-end trimming of the first read. The adapter sequence to look for at the 3'-end of the read. Note that the reverse complement of the inline barcode for a given sample will be appended to the front of this string since the inline barcode would appear just before this string.

**i5_trim**: For 3'-end trimming of the second read. The adapter sequence to look for at the 3'-end of the read.

In [6]:
# SETTINGS
# Threads (CPU x2 is max) for fastqc, cutadapt, bwa, and end calling
threads = 18

# inline adapter identification
inline_barcode_errors = 1
barcode_length = 8
inline_barcode_error_rate = inline_barcode_errors / barcode_length

# Quality filters
minimum_insert_length = 10
quality_threshold = 20

# 3'-adapter trimming
i7_trim = 'GATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
i5_trim = 'GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT'
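
One way to choose `inline_barcode_errors` is to check how far apart the barcodes are from one another. The sketch below computes pairwise Hamming distances (substitutions only) for a made-up barcode set; the barcode sequences and function names are illustrative assumptions, not the barcodes used in this experiment. If indels are also tolerated during matching, edit distance is the more conservative measure.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError('sequences must have the same length')
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(barcodes):
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))

# Made-up 8-nt barcodes for illustration only
demo_barcodes = ['ACGTACGT', 'TGCATGCA', 'AACCGGTT', 'GGTTAACC']
min_dist = min_pairwise_distance(demo_barcodes)

# With a minimum distance d, up to (d - 1) // 2 substitutions can be
# tolerated without a read becoming equally close to two barcodes
max_safe_errors = (min_dist - 1) // 2
demo_error_rate = max_safe_errors / len(demo_barcodes[0])
print(min_dist, max_safe_errors, demo_error_rate)
```

The resulting `max_safe_errors` is an upper bound; a stricter setting such as one mismatch remains a reasonable choice.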

*Descriptions written by Peter Culviner, PhD with modifications from Ruby Froom.*

**i7_sample_file**: Used to identify fastq files to split from. Assumes we're working with fastqs that have already been split by Illumina's i7 indexes. Minimally should have columns:
>**r1**: read 1 fastq, no inline barcode, template starts from first base.

>**r2**: read 2 fastq, this one is assumed to have the inline barcode starting from the first base.

>**i7**: number that designates this fastq pair. It is used internally by the code and records which unsplit fastq files each final sample comes from. At minimum, each fastq pair should have a unique number.

>**title**: used internally for pool names during splitting, unique but the name itself is not important.

**inline_sample_file**: Used to determine which inline barcodes should exist in each fastq file in the i7 sample file (i.e., not all i7 sample files need to have the same inline barcodes). Minimally should have columns:
>**sequence**: sequence of the barcode as it is read 5' -> 3' on the RT primer. cutadapt searches for this sequence at the beginning of each read 2 to split the reads. Its reverse complement is then used to decide whether the 3'-ends of read 1 need to be trimmed (i.e., if the insert was completely read through and sequencing continued into the adapter).

>**inline_barcode:** number that designates this inline barcode, used internally by the code. At minimum, each barcode sequence should have a unique number.

>**title**: final intended title for this sample. Split fastqs will be titled **title**.R1/R2.fastq.gz

>**in_i7**: which i7 number (see above) to search for this sample in.
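
To make the two file specifications concrete, the sketch below builds small example tables with the required columns and runs a few consistency checks. The file names, sample titles, barcode sequences, and the `check_sample_tables` helper are all hypothetical; real tables are read from the csv files in `input_csv_dir`.

```python
import pandas as pd

I7_COLUMNS = {'r1', 'r2', 'i7', 'title'}
INLINE_COLUMNS = {'sequence', 'inline_barcode', 'title', 'in_i7'}

# Hypothetical example tables (illustration only)
i7_demo = pd.DataFrame({
    'r1': ['pool1_R1.fastq.gz'],
    'r2': ['pool1_R2.fastq.gz'],
    'i7': [1],
    'title': ['pool1'],
})
inline_demo = pd.DataFrame({
    'sequence': ['ACGTACGT', 'TGCATGCA'],
    'inline_barcode': [1, 2],
    'title': ['sampleA1', 'sampleA2'],
    'in_i7': [1, 1],
})

def check_sample_tables(i7_df, inline_df):
    """Return a list of problems; an empty list means the tables look consistent."""
    problems = []
    for label, df, required in (('i7', i7_df, I7_COLUMNS),
                                ('inline', inline_df, INLINE_COLUMNS)):
        missing = required - set(df.columns)
        if missing:
            problems.append(f'{label} table is missing columns: {sorted(missing)}')
    if problems:
        return problems  # the checks below need every column to be present
    if i7_df['i7'].duplicated().any():
        problems.append('i7 numbers are not unique')
    if inline_df['title'].duplicated().any():
        problems.append('sample titles are not unique')
    unknown = set(inline_df['in_i7']) - set(i7_df['i7'])
    if unknown:
        problems.append(f'in_i7 values with no matching i7 row: {sorted(unknown)}')
    return problems

print(check_sample_tables(i7_demo, inline_demo))
```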

## Generate quality reports

In [16]:
for row in i7_table.iterrows():
    _, i7_data = row
    R1 = i7_data.r1
    R2 = i7_data.r2

    command1 = f'fastqc {raw_fastq_dir}/{R1} -o {fastqc_dir} -t {threads}'
    command2 = f'fastqc {raw_fastq_dir}/{R2} -o {fastqc_dir} -t {threads}'

    quickshell(
            command1,
            print_output=False,
            return_output=False)

    print(f'R1 fastqc: {i7_data.title} done')
    
    quickshell(
            command2,
            print_output=False,
            return_output=False)
    print(f'R2 fastqc: {i7_data.title} done')

## Split reads based on inline barcodes

The inline barcodes were optimized, and the parsing code was written, by Peter Culviner, PhD (Fortune Lab, Harvard).

Each i7-demultiplexed fastq file is further split based upon the presence of inline barcodes. Reads are then quality- and length-filtered, and adaptors (sequences assigned to `i7_trim` and `i5_trim`) are trimmed.

In [6]:
# Demultiplexing code
# Written by Peter Culviner, PhD

# list of reports for storage
dataframe_list = []
percent_with_adaptor = 0

for row in i7_table.iterrows():
    _, i7_data = row
    # prepare input and output file names for cutadapt
    cutadapt_inputs = [
        f'{raw_fastq_dir}/{i7_data.r2}',
        f'{raw_fastq_dir}/{i7_data.r1}']
    cutadapt_outputs = [
        f'{demultiplexed_fastq_dir}/{i7_data.i7}.' + '{name}.R2.fastq.gz',
        f'{demultiplexed_fastq_dir}/{i7_data.i7}.' + '{name}.R1.fastq.gz']
    # generate a lookup table for cutadapt to pull inline barcodes from
    samples_present = inline_table.loc[inline_table.in_i7 == i7_data.i7]
    barcode_fasta = 'inline_barcodes.tmp.fasta'
    with open(barcode_fasta, 'w') as f:
        for title, seq in zip(samples_present.inline_barcode, samples_present.sequence):    
            f.write(f'>{str(title)}\n^{str(seq)}\n')
    command = f'cutadapt -e {inline_barcode_error_rate} --discard-untrimmed ' + \
              f'-g file:{barcode_fasta} -o {cutadapt_outputs[0]} ' + \
              f'-p {cutadapt_outputs[1]} {cutadapt_inputs[0]} {cutadapt_inputs[1]}'
    # run cutadapt to split fastq files
    output_string, _ = quickshell(
        command,
        print_output = True,
        return_output = True,
        output_path = f'{demultiplexed_fastq_dir}/{i7_data.title}_cutadapt_demultiplex.txt')

    # record counts of each inline barcode found
    # cutadapt labels its first input "Read 1"; here the first input is R2,
    # so relabel only that text and leave the reported numbers untouched
    adapter_line = regex.search(r'Read 1 with adapter:.*', output_string).group()
    percent_with_adaptor = (f"In {cutadapt_inputs[0].split('/')[-1]} and mate:\n  " +
        ' '.join(adapter_line.split()).replace('Read 1', 'Read 2', 1))
    barcode_counts = [int(m.group().split(' ')[1]) for m in regex.finditer(r'Trimmed: \d* times', output_string)]
    # generate output dataframe and append to list
    data = np.asarray([
        samples_present.title.values,
        [i7_data.i7 for i in barcode_counts],
        samples_present.inline_barcode.values,
        barcode_counts]).T
    output_df = pd.DataFrame(
        columns=['title', 'i7', 'inline_barcode', 'counts'],
        data=data)
    # store the adapter summary with the rows of the i7 pool it describes
    output_df['percent_with_adaptor'] = percent_with_adaptor
    dataframe_list.append(output_df)
    # rename files to use filename from inline sample file
    for sample_title, barcode in zip(samples_present.title.values, samples_present.inline_barcode.values):
        temp_fastq_1 = f'{demultiplexed_fastq_dir}/{i7_data.i7}.{barcode}.R1.fastq.gz'
        out_fastq_1 = f'{demultiplexed_fastq_dir}/{sample_title}.R1.fastq.gz'
        !mv $temp_fastq_1 $out_fastq_1
        temp_fastq_2 = f'{demultiplexed_fastq_dir}/{i7_data.i7}.{barcode}.R2.fastq.gz'
        out_fastq_2 = f'{demultiplexed_fastq_dir}/{sample_title}.R2.fastq.gz'
        !mv $temp_fastq_2 $out_fastq_2

# concatenate i7 dataframes
output_info = pd.concat(dataframe_list, axis=0)

# write record csv
output_info.to_csv(f'{demultiplexed_fastq_dir}/fastq_demultiplex_record.csv')

In [None]:
# Code from Peter Culviner, PhD
# iterate through file pairs and trim adapters
trimming_log = f'{trimmed_fastq_dir}/split_trim_log.txt'

with open(trimming_log, 'w') as f:
    for row in inline_table.iterrows():
        _, sample_data = row
        # prepare input and output titles
        cutadapt_inputs = [
            f'{demultiplexed_fastq_dir}/{sample_data.title}.R1.fastq.gz',
            f'{demultiplexed_fastq_dir}/{sample_data.title}.R2.fastq.gz']
        cutadapt_outputs = [
            f'{trimmed_fastq_dir}/{sample_data.title}.R1.fastq.gz',
            f'{trimmed_fastq_dir}/{sample_data.title}.R2.fastq.gz']
        # prepare cutadapt command
        # read from read 1 (reverse transcription primer with barcode)
        i7_adapter = f'{str(Seq(sample_data.sequence).reverse_complement())}{i7_trim}'
        # read from read 2
        i5_adapter = f'{i5_trim}'
        command = f'cutadapt --overlap=1 --minimum-length={minimum_insert_length} -q {quality_threshold} ' + \
                  f'-j {threads} -a {i7_adapter} -A {i5_adapter} -o {cutadapt_outputs[0]} ' + \
                  f'-p {cutadapt_outputs[1]} {cutadapt_inputs[0]} {cutadapt_inputs[1]}'
        # run cutadapt
        output_trim, _ = quickshell(command, print_output=False, return_output=True)
        # parse output to pull number of reads below minimum_insert_length
        too_short = int(regex.search(r'too short:\s*\S*', output_trim).group().split(' ')[-1].replace(',', ''))
        # write full output to log (too_short is an int, so format it into the string)
        f.write(f'{output_trim}\ntoo short: {too_short}\n')

## Remove UMIs and convert fastq to fasta for mapping

Even if you do not make subsequent use of unique molecular indices (UMIs), this step is important to remove them from the reads and enable accurate alignment.

Update the UMI pattern below to enable UMI removal from both R1 and R2. The UMI will then be tacked onto the read ID, enabling de-duplication of .bam files with the dedup function from umi_tools after read mapping.

The read ID will have the following structure after UMI extraction:

ReadID_***NNNNN***NNNNN

>***NNNNN*** = UMI from read1

>NNNNN = UMI from read2
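
The read ID structure above can be illustrated on a single read pair. This is a minimal sketch in plain Python, separate from the chunked pandas implementation in the next cell; `extract_umi` and the example sequences are hypothetical.

```python
def extract_umi(read_id, seq_r1, seq_r2, umi_length=5):
    """Move the first `umi_length` bases of each mate into the read ID.

    Returns the new read ID and the two reads with their UMIs removed.
    """
    umi_r1 = seq_r1[:umi_length]
    umi_r2 = seq_r2[:umi_length]
    new_id = f'{read_id}_{umi_r1}{umi_r2}'
    return new_id, seq_r1[umi_length:], seq_r2[umi_length:]

# Made-up read pair for illustration
new_id, trimmed_r1, trimmed_r2 = extract_umi('@read1', 'AAAAACCCGGG', 'TTTTTGGGCCC')
print(new_id, trimmed_r1, trimmed_r2)
```

Because both mates receive the same ID, the combined 10-nt UMI stays attached to the pair through mapping.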



In [1]:
# Iterate through fastq files, split UMIs from reads and add to read_id,
# and re-write reads as fasta

# Initialize UMI length
UMI_length = 5

for row in inline_table.iterrows():
    _, sample_data = row
    R1_reads = f'{trimmed_fastq_dir}/{sample_data.title}.R1.fastq.gz'
    R2_reads = f'{trimmed_fastq_dir}/{sample_data.title}.R2.fastq.gz'

    counter = 0
    # How many lines to store in memory at once
    # Change if there are memory issues
    lines_in = 50000000
    
    for chunk_R1 in pd.read_table(R1_reads, header=None, chunksize = lines_in):

        R1 = pd.DataFrame(chunk_R1.values.reshape(-1, 4),
                          columns=['read_id_R1', 'seq_R1', '+', 'qual'])
        R1['read_id'] = R1['read_id_R1'].str[:-13]
        R1['UMI_R1'] = R1['seq_R1'].str[0:UMI_length]
        R1['read_R1'] = R1['seq_R1'].str[UMI_length:]
        R1['qual_R1'] = R1['qual'].str[UMI_length:]
        
        R2 = pd.DataFrame(pd.read_table(R2_reads, header=None, skiprows = counter * lines_in,
                                        nrows = lines_in).values.reshape(-1, 4),
                      columns=['read_id_R2', 'seq_R2', '+', 'qual'])
        R2['read_id'] = R2['read_id_R2'].str[:-13]
        R2['UMI_R2'] = R2['seq_R2'].str[0:UMI_length]
        R2['read_R2'] = R2['seq_R2'].str[UMI_length:]
        R2['qual_R2'] = R2['qual'].str[UMI_length:]

        all_reads = pd.merge(R1[['read_id','seq_R1','UMI_R1','read_R1','qual_R1','read_id_R1']],
                             R2[['read_id','seq_R2','UMI_R2','read_R2','qual_R2','read_id_R2']],
                             on = 'read_id')
    
        all_reads['read_id_UMI'] = all_reads['read_id'] + '_' + \
                                   all_reads['UMI_R1'] + all_reads['UMI_R2']

        fasta_R1 = all_reads.read_id_UMI + '\n' + all_reads.read_R1
        fasta_R2 = all_reads.read_id_UMI + '\n' + all_reads.read_R2

        output_R1 = f'{noUMI_dir}/{sample_data.title}.R1.{counter}.fasta'
        output_R2 = f'{noUMI_dir}/{sample_data.title}.R2.{counter}.fasta'

        # Save this table for UMI analysis after read mapping
        # (one file per chunk, so later chunks do not overwrite earlier ones)
        all_reads.to_csv(f'{noUMI_dir}/{sample_data.title}.UMI_readID.{counter}.csv')
                     
        fasta_R1.to_csv(output_R1, index=False, quoting=csv.QUOTE_NONE, escapechar = "(", header=None)
        fasta_R2.to_csv(output_R2, index=False, quoting=csv.QUOTE_NONE, escapechar = "(", header=None)
        print(f'Chunking UMIs: {sample_data.title} iteration {counter} done')
        counter += 1
        
    print(f'Extracting UMIs: {sample_data.title} done')

## Format FASTAs for mapping

The fasta files are written with escape character `(` to enable proper export, but this character must be removed prior to fasta mapping.

Run the following command to remove escape character `(` in the fasta files.

In [31]:
# Quote the sed expression so neither Python nor the shell has to interpret a backslash
replace_command = f"sed -i 's/(//g' {noUMI_dir}/*.fasta"

replace = quickshell(replace_command, print_output = False, return_output = False)
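
The escape character is only needed because the records are exported through the csv writer. As an alternative sketch (not the approach used in this notebook), the records can be written with plain file writes, which leaves nothing to clean up afterwards. `write_fasta`, the demo table, and the temporary output path are hypothetical.

```python
import os
import tempfile
import pandas as pd

def write_fasta(read_ids, sequences, path):
    """Write one two-line record per read: the ID line, then the sequence."""
    with open(path, 'w') as handle:
        for read_id, seq in zip(read_ids, sequences):
            handle.write(f'{read_id}\n{seq}\n')

# Made-up records for illustration; IDs are written exactly as given
demo_reads = pd.DataFrame({
    'read_id_UMI': ['@read1_AAAAATTTTT', '@read2_CCCCCGGGGG'],
    'read_R1': ['ACGT', 'TTGA'],
})
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.R1.fasta')
write_fasta(demo_reads['read_id_UMI'], demo_reads['read_R1'], demo_path)

with open(demo_path) as handle:
    demo_text = handle.read()
print(demo_text)
```

The trade-off is speed: a Python-level loop is slower than the vectorized export on very large chunks.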

## Combine FASTAs from same sample

In [2]:
for row in inline_table.iterrows():
    _, sample_data = row
    
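    # Gather this sample's per-chunk fastas, ordered by chunk index so that
    # R1 and R2 stay in the same read order for paired-end mapping
    def chunk_files(read):
        prefix = f'{sample_data.title}.{read}.'
        names = [file for file in listdir(noUMI_dir)
                 if file.startswith(prefix) and file.endswith('.fasta')]
        return sorted(names, key=lambda name: int(name.split('.')[-2]))

    # Space-separated file lists for cat (each chunk appears exactly once)
    R1_string = ' '.join(f'{noUMI_dir}/{file}' for file in chunk_files('R1'))
    R2_string = ' '.join(f'{noUMI_dir}/{file}' for file in chunk_files('R2'))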
            
    R1_combine_command = f'cat {R1_string} > {combined_fasta_dir}/{sample_data.title}.R1.fasta'
    R2_combine_command = f'cat {R2_string} > {combined_fasta_dir}/{sample_data.title}.R2.fasta'
    
    R1_combine = quickshell(R1_combine_command, print_output = False, return_output = False)
    print(f'Combining FASTAs: {sample_data.title} R1 done')
    R2_combine = quickshell(R2_combine_command, print_output = False, return_output = False)
    print(f'Combining FASTAs: {sample_data.title} R2 done')

## Mapping reads

Map the reads to a concatenated genome: a spike genome (Eco) and an experimental genome of interest (Mtb).

In [32]:
# Index the input genomes (only needs to be done once)
index_genome_command = f'bwa index {genome_dir}/Eco_Mtb_genome.fasta'

index_genome = quickshell(index_genome_command, print_output = True, return_output = False)

In [3]:
# Use BWA to align to concatenated genomes 
# for spike calculation and subsequent end enrichment analysis

# bwa algorithm. Mem is default for most applications, read documentation to decide
bwa_algorithm = 'mem'

# Set print_output = True to see command-line output
# Set return_output = True to return command-line output for assignment to a variable
print_output = False
return_output = False

for row in inline_table.iterrows():
    _, sample_data = row
    R1_reads = f'{combined_fasta_dir}/{sample_data.title}.R1.fasta'
    R2_reads = f'{combined_fasta_dir}/{sample_data.title}.R2.fasta'

    map_command = f'bwa {bwa_algorithm} -t {threads} {genome_dir}/Eco_Mtb_genome.fasta ' + \
                       f'{R1_reads} {R2_reads} > {initial_alignments_dir}/{sample_data.title}.sam'
    
    map_output = quickshell(map_command, print_output = print_output, return_output = return_output)
    print(f'Initial alignments: {sample_data.title} Done')

In [4]:
# Process alignments: 
# sort, convert to bam, and index

# Set print_output = True to see command-line output
# Set return_output = True to return command-line output for assignment to a variable
print_output = False
return_output = False

for row in inline_table.iterrows():
    _, sample_data = row
    # Sort sam file and output as bam
    sort_command = f'samtools sort -O BAM {initial_alignments_dir}/{sample_data.title}.sam > ' + \
                          f'{initial_alignments_dir}/sorted_{sample_data.title}.bam'
    
    sort_output = quickshell(sort_command, print_output = print_output, return_output=return_output)
    
    # Index sorted bam
    index_command = f'samtools index {initial_alignments_dir}/sorted_{sample_data.title}.bam'
    
    index_output = quickshell(index_command, print_output = print_output, return_output=return_output)

    print(f'Sort and index alignments before de-duplication: {sample_data.title} done')

In [5]:
# Generate mapping statistics BEFORE de-duplication

# List to generate dataframe with spike counts for normalization (needed in downstream analysis)
# and other mapping statistics (not needed)
mapping_stats = []

# Update with the name of your genome and relevant regions
Eco_region = 'Eco_Mtb:1-4641652'
Mtb_region = 'Eco_Mtb:4641653-9053361'

for row in inline_table.iterrows():
    _, sample_data = row
    # Extract mapping stats
    count_Mtb_reads_command = f'samtools view -c -F 260 ' + \
                              f'{initial_alignments_dir}/sorted_{sample_data.title}.bam "{Mtb_region}"'
    count_Mtb_reads = int(quickshell(count_Mtb_reads_command,
                                     print_output = False,
                                     return_output=True)[0].split('\n')[0])
    
    count_all_mapped_reads_command = f'samtools view -c -F 260 ' + \
                                         f'{initial_alignments_dir}/sorted_{sample_data.title}.bam'
    count_all_mapped_reads = int(quickshell(count_all_mapped_reads_command,
                                                print_output = False,
                                                return_output=True)[0].split('\n')[0])
    Mtb_percent = (count_Mtb_reads / count_all_mapped_reads) * 100
    
    count_Eco_reads_command = f'samtools view -c -F 260 ' + \
                                f'{initial_alignments_dir}/sorted_{sample_data.title}.bam "{Eco_region}"'
    count_Eco_reads = int(quickshell(count_Eco_reads_command,
                                       print_output = False,
                                       return_output=True)[0].split('\n')[0])

    Eco_percent = (count_Eco_reads / count_all_mapped_reads) * 100
    
    count_unmapped_reads_command = f'samtools view -c -f 4 ' + \
                                       f'{initial_alignments_dir}/sorted_{sample_data.title}.bam'
    count_unmapped_reads = int(quickshell(count_unmapped_reads_command,
                                              print_output = False,
                                              return_output=True)[0].split('\n')[0])
    
    
    count_all_reads_command = f'samtools view -c ' + \
                                  f'{initial_alignments_dir}/sorted_{sample_data.title}.bam'
    count_all_reads = int(quickshell(count_all_reads_command,
                                         print_output = False,
                                         return_output=True)[0].split('\n')[0])
    unmapped_percent = (count_unmapped_reads / count_all_reads) * 100
    
    mapping_stats.append([sample_data.title, count_Eco_reads, Eco_percent, count_Mtb_reads,
                          Mtb_percent, unmapped_percent])
    
    print(f'Mapping stats before de-duplication: {sample_data.title} done')

mapping_DF = pd.DataFrame(mapping_stats, columns = ['Sample_Name','Eco Counts',
                                                    'Eco % Mapped','Mtb Counts','Mtb % Mapped',
                                                    '% Unmapped'])
mapping_DF.to_csv(f'{initial_alignments_dir}/mapping_stats_preDeDup.csv')
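
The region strings used for counting depend on where each genome sits in the concatenated reference. The sketch below derives them from segment lengths, assuming the concatenated FASTA is a single record named `Eco_Mtb` with Eco first. The lengths are those implied by the coordinates in this notebook, and `concatenated_regions` is a hypothetical helper.

```python
def concatenated_regions(reference_name, segment_lengths):
    """Build 1-based, inclusive samtools-style region strings for back-to-back segments."""
    regions = {}
    start = 1
    for name, length in segment_lengths:
        end = start + length - 1
        regions[name] = f'{reference_name}:{start}-{end}'
        start = end + 1
    return regions

# Segment lengths implied by the region coordinates used in this notebook
demo_regions = concatenated_regions('Eco_Mtb', [('Eco', 4641652), ('Mtb', 4411709)])
print(demo_regions)
```

For a different pair of genomes, only the reference name and the list of lengths need to change.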

## De-duplication: remove reads with same UMIs and mapping locations

In [36]:
print_output = False
return_output = False

for row in inline_table.iterrows():
    _, sample_data = row
    dedup_command = f'umi_tools dedup -I {initial_alignments_dir}/sorted_{sample_data.title}.bam ' + \
                           f'--paired --output-stats={dedup_logs_dir}/{sample_data.title} ' + \
                           f'--no-sort-output -S {dedup_alignments_dir}/{sample_data.title}.bam'
    dedup_output = quickshell(dedup_command,
                             print_output = print_output,
                             return_output = return_output)
    print(f'De-duplicating alignments: {sample_data.title} done')

De-duplicating alignments: core1 done
De-duplicating alignments: core2 done
De-duplicating alignments: core3 done
De-duplicating alignments: RC1 done
De-duplicating alignments: RC2 done
De-duplicating alignments: RC3 done
De-duplicating alignments: WhiB1 done
De-duplicating alignments: WhiB2 done
De-duplicating alignments: WhiB3 done
De-duplicating alignments: core_CRP1 done
De-duplicating alignments: core_CRP2 done
De-duplicating alignments: core_CRP3 done
De-duplicating alignments: RC_CRP1 done
De-duplicating alignments: RC_CRP2 done
De-duplicating alignments: RC_CRP3 done
De-duplicating alignments: CRP1 done
De-duplicating alignments: CRP2 done
De-duplicating alignments: CRP3 done
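The idea behind this step can be shown with a deliberately simplified sketch: alignments that share a mapping position, strand, and UMI are treated as PCR duplicates and only the first is kept. This is an illustration of the principle only, not a reimplementation of `umi_tools dedup`, which offers network-based methods that also merge UMIs differing by sequencing errors. The record fields and `simple_dedup` are hypothetical.

```python
def simple_dedup(alignments):
    """Keep the first alignment seen for each (reference, position, strand, UMI) key."""
    seen = set()
    kept = []
    for aln in alignments:
        key = (aln['ref'], aln['pos'], aln['strand'], aln['umi'])
        if key not in seen:
            seen.add(key)
            kept.append(aln)
    return kept

# Made-up alignments for illustration
demo_alignments = [
    {'read_id': 'r1', 'ref': 'Eco_Mtb', 'pos': 100, 'strand': '+', 'umi': 'AAAAATTTTT'},
    {'read_id': 'r2', 'ref': 'Eco_Mtb', 'pos': 100, 'strand': '+', 'umi': 'AAAAATTTTT'},  # duplicate of r1
    {'read_id': 'r3', 'ref': 'Eco_Mtb', 'pos': 100, 'strand': '+', 'umi': 'AAAAATTTTA'},  # same site, new UMI
    {'read_id': 'r4', 'ref': 'Eco_Mtb', 'pos': 250, 'strand': '+', 'umi': 'AAAAATTTTT'},  # same UMI, new site
]
kept_ids = [aln['read_id'] for aln in simple_dedup(demo_alignments)]
print(kept_ids)
```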


In [37]:
# Process deduplicated alignments: 
# sort and index

# Set print_output = True to see command-line output
# Set return_output = True to return command-line output for assignment to a variable
print_output = False
return_output = False

for row in inline_table.iterrows():
    _, sample_data = row
    # Sort the de-duplicated bam file by coordinate
    sort_dedup_command = f'samtools sort -O BAM {dedup_alignments_dir}/{sample_data.title}.bam > ' + \
                          f'{dedup_alignments_dir}/sorted_{sample_data.title}.bam'

    sort_dedup_output = quickshell(sort_dedup_command, print_output = print_output, return_output=return_output)
    
    # Index sorted bam
    index_dedup_command = f'samtools index {dedup_alignments_dir}/sorted_{sample_data.title}.bam'
    
    index_dedup_output = quickshell(index_dedup_command, print_output = print_output, return_output=return_output)
    print(f'Sorting and indexing de-duplicated alignments: {sample_data.title} done')

Sorting and indexing de-duplicated alignments: core1 done
Sorting and indexing de-duplicated alignments: core2 done
Sorting and indexing de-duplicated alignments: core3 done
Sorting and indexing de-duplicated alignments: RC1 done
Sorting and indexing de-duplicated alignments: RC2 done
Sorting and indexing de-duplicated alignments: RC3 done
Sorting and indexing de-duplicated alignments: WhiB1 done
Sorting and indexing de-duplicated alignments: WhiB2 done
Sorting and indexing de-duplicated alignments: WhiB3 done
Sorting and indexing de-duplicated alignments: core_CRP1 done
Sorting and indexing de-duplicated alignments: core_CRP2 done
Sorting and indexing de-duplicated alignments: core_CRP3 done
Sorting and indexing de-duplicated alignments: RC_CRP1 done
Sorting and indexing de-duplicated alignments: RC_CRP2 done
Sorting and indexing de-duplicated alignments: RC_CRP3 done
Sorting and indexing de-duplicated alignments: CRP1 done
Sorting and indexing de-duplicated alignments: CRP2 done
Sorting and indexing de-duplicated alignments: CRP3 done

In [38]:
# Generate mapping statistics AFTER de-duplication

# List to generate dataframe with spike counts for normalization (needed in downstream analysis)
# and other mapping statistics (not needed)
mapping_stats = []

# Update with the name of your genome and relevant regions
Eco_region = 'Eco_Mtb:1-4641652'
Mtb_region = 'Eco_Mtb:4641653-9053361'

for row in inline_table.iterrows():
    _, sample_data = row
    # Extract mapping stats
    count_Mtb_reads_command = f'samtools view -c -F 260 ' + \
                              f'{dedup_alignments_dir}/sorted_{sample_data.title}.bam "{Mtb_region}"'
    count_Mtb_reads = int(quickshell(count_Mtb_reads_command,
                                     print_output = False,
                                     return_output=True)[0].split('\n')[0])
    
    count_all_mapped_reads_command = f'samtools view -c -F 260 ' + \
                                         f'{dedup_alignments_dir}/sorted_{sample_data.title}.bam'
    count_all_mapped_reads = int(quickshell(count_all_mapped_reads_command,
                                                print_output = False,
                                                return_output=True)[0].split('\n')[0])
    Mtb_percent = (count_Mtb_reads / count_all_mapped_reads) * 100
    
    count_Eco_reads_command = f'samtools view -c -F 260 ' + \
                                f'{dedup_alignments_dir}/sorted_{sample_data.title}.bam "{Eco_region}"'
    count_Eco_reads = int(quickshell(count_Eco_reads_command,
                                       print_output = False,
                                       return_output=True)[0].split('\n')[0])

    Eco_percent = (count_Eco_reads / count_all_mapped_reads) * 100
    
    count_unmapped_reads_command = f'samtools view -c -f 4 ' + \
                                       f'{dedup_alignments_dir}/sorted_{sample_data.title}.bam'
    count_unmapped_reads = int(quickshell(count_unmapped_reads_command,
                                              print_output = False,
                                              return_output=True)[0].split('\n')[0])
    
    
    count_all_reads_command = f'samtools view -c ' + \
                                  f'{dedup_alignments_dir}/sorted_{sample_data.title}.bam'
    count_all_reads = int(quickshell(count_all_reads_command,
                                         print_output = False,
                                         return_output=True)[0].split('\n')[0])
    unmapped_percent = (count_unmapped_reads / count_all_reads) * 100
    
    mapping_stats.append([sample_data.title, count_Eco_reads, Eco_percent, count_Mtb_reads,
                          Mtb_percent, unmapped_percent])
    
    print(f'Mapping stats after de-duplication: {sample_data.title} done')

mapping_DF = pd.DataFrame(mapping_stats, columns = ['Sample_Name','Eco Counts',
                                                    'Eco % Mapped','Mtb Counts','Mtb % Mapped',
                                                    '% Unmapped'])
mapping_DF.to_csv(f'{dedup_alignments_dir}/mapping_stats_postDeDup.csv')

Mapping stats after de-duplication: core1 done
Mapping stats after de-duplication: core2 done
Mapping stats after de-duplication: core3 done
Mapping stats after de-duplication: RC1 done
Mapping stats after de-duplication: RC2 done
Mapping stats after de-duplication: RC3 done
Mapping stats after de-duplication: WhiB1 done
Mapping stats after de-duplication: WhiB2 done
Mapping stats after de-duplication: WhiB3 done
Mapping stats after de-duplication: core_CRP1 done
Mapping stats after de-duplication: core_CRP2 done
Mapping stats after de-duplication: core_CRP3 done
Mapping stats after de-duplication: RC_CRP1 done
Mapping stats after de-duplication: RC_CRP2 done
Mapping stats after de-duplication: RC_CRP3 done
Mapping stats after de-duplication: CRP1 done
Mapping stats after de-duplication: CRP2 done
Mapping stats after de-duplication: CRP3 done


## Extract only Mtb R2 reads

In [39]:
# Set print_output = True to see command-line output
# Set return_output = True to return command-line output for assignment to a variable
print_output = False
return_output = False

Mtb_region = 'Eco_Mtb:4641653-9053361'

for row in inline_table.iterrows():
    _, sample_data = row
    # Output the R2 reads mapping to the genome with enriched ends to analyze  
    ref_R2_command = f'samtools view -b -f 0x0080 ' + \
                     f'{dedup_alignments_dir}/sorted_{sample_data.title}.bam ' + \
                     f'"{Mtb_region}" -o {R2_alignments_dir}/{sample_data.title}_R2.bam'
    ref_R2 = quickshell(ref_R2_command, print_output = print_output, return_output = return_output)
    
    # Index the R2 only bam
    index_R2_command = f'samtools index {R2_alignments_dir}/{sample_data.title}_R2.bam'
    index_R2 = quickshell(index_R2_command, print_output = print_output, return_output = return_output)
    print(f'Generate and index R2-only alignments: {sample_data.title} done')

Generate and index R2-only alignments: core1 done
Generate and index R2-only alignments: core2 done
Generate and index R2-only alignments: core3 done
Generate and index R2-only alignments: RC1 done
Generate and index R2-only alignments: RC2 done
Generate and index R2-only alignments: RC3 done
Generate and index R2-only alignments: WhiB1 done
Generate and index R2-only alignments: WhiB2 done
Generate and index R2-only alignments: WhiB3 done
Generate and index R2-only alignments: core_CRP1 done
Generate and index R2-only alignments: core_CRP2 done
Generate and index R2-only alignments: core_CRP3 done
Generate and index R2-only alignments: RC_CRP1 done
Generate and index R2-only alignments: RC_CRP2 done
Generate and index R2-only alignments: RC_CRP3 done
Generate and index R2-only alignments: CRP1 done
Generate and index R2-only alignments: CRP2 done
Generate and index R2-only alignments: CRP3 done
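The `-f` and `-F` values passed to `samtools view` in this notebook are bit masks over the SAM FLAG field: `-f 0x0080` keeps second-in-pair reads, `-f 4` keeps unmapped reads, and `-F 260` drops unmapped (4) and secondary (256) alignments. The sketch below mimics that filtering logic on a few example flag values; `passes_filter` is a hypothetical helper.

```python
# Bit values from the SAM FLAG field
UNMAPPED = 0x4
FIRST_IN_PAIR = 0x40
SECOND_IN_PAIR = 0x80
SECONDARY = 0x100

def passes_filter(flag, require=0, exclude=0):
    """True if every `require` bit is set and no `exclude` bit is set."""
    return (flag & require) == require and (flag & exclude) == 0

# 163 = paired + proper pair + mate reverse + second in pair (an R2 read)
# 99  = paired + proper pair + mate reverse + first in pair (an R1 read)
# 77  = paired + unmapped + mate unmapped + first in pair
for flag in (163, 99, 77):
    print(flag,
          'R2' if passes_filter(flag, require=SECOND_IN_PAIR) else 'not R2',
          'counted' if passes_filter(flag, exclude=UNMAPPED | SECONDARY) else 'skipped')
```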


In [None]:
# Count and assemble all R2 reads to decide on number of reads to downsample to
R2_reads = []

for row in inline_table.iterrows():
    _, sample_data = row
    count_R2_reads_command = f'samtools view -c ' + \
                                  f'{R2_alignments_dir}/{sample_data.title}_R2.bam'
    count_R2_reads = int(quickshell(count_R2_reads_command,
                                         print_output = False,
                                         return_output=True)[0].split('\n')[0])
    R2_reads.append([sample_data.title, count_R2_reads])
    print(f'{sample_data.title} done')
    
reads_DF = pd.DataFrame(R2_reads, columns = ['Sample_Name','R2_read_counts'])

reads_DF['Condition'] = reads_DF['Sample_Name'].str[:-1]
conditions = reads_DF['Condition'].unique()
reads_DF['Rank'] = 0

DF_list = []

for i in range(len(conditions)):
    # .copy() so the rank assignments below modify an independent table, not a view of reads_DF
    replicates = reads_DF.loc[reads_DF['Condition'] == conditions[i]].copy()
    replicates.loc[replicates['R2_read_counts'] == replicates['R2_read_counts'].max(), 'Rank'] = 1
    replicates.loc[replicates['R2_read_counts'] == replicates['R2_read_counts'].min(), 'Rank'] = 3
    replicates.loc[replicates['Rank'] == 0, 'Rank'] = 2
    DF_list.append(replicates)
    
reads_DF_ranked = pd.concat(DF_list)

# Inspect read depths to decide on a minimum read depth to downsample all samples for subsequent analysis
reads_DF_ranked.to_csv(f'{R2_alignments_dir}/R2_read_counts.csv')
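
Once the counts are tabulated, one simple option is to downsample every sample to the depth of the shallowest one. The sketch below computes that target and the fraction of reads each sample would keep. The counts are made up, and choosing the minimum as the target is an assumption; a different threshold may suit the data better.

```python
import pandas as pd

# Made-up counts for illustration; real values come from R2_read_counts.csv
demo_counts = pd.DataFrame({
    'Sample_Name': ['condA1', 'condA2', 'condA3'],
    'R2_read_counts': [1200000, 900000, 1500000],
})

# Target depth: the shallowest sample (an assumption, see above)
target_depth = int(demo_counts['R2_read_counts'].min())

# Fraction of reads to keep per sample so that all end up near target_depth
demo_counts['fraction_to_keep'] = target_depth / demo_counts['R2_read_counts']
print(demo_counts)
```

The per-sample fraction can then be passed to whichever subsampling tool is used downstream.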

## Extract only Eco (spike) R2 reads

In [1]:
# Set print_output = True to see command-line output
# Set return_output = True to return command-line output for assignment to a variable
print_output = True
return_output = False

Eco_region = 'Eco_Mtb:1-4641652'

for row in inline_table.iterrows():
    _, sample_data = row
    # Output the spike R2 reads
    spike_R2_command = f'samtools view -b -f 0x0080 ' + \
                     f'{dedup_alignments_dir}/sorted_{sample_data.title}.bam ' + \
                     f'"{Eco_region}" -o {spike_R2_alignments_dir}/{sample_data.title}_R2.bam'
    spike_R2 = quickshell(spike_R2_command, print_output = print_output, return_output = return_output)
    
    # Index the R2 only bam
    index_spike_R2_command = f'samtools index {spike_R2_alignments_dir}/{sample_data.title}_R2.bam'
    index_spike_R2 = quickshell(index_spike_R2_command, print_output = print_output, return_output = return_output)
    print(f'Generate and index R2-only spike alignments: {sample_data.title} done')

In [3]:
# Count and assemble all R2 spike reads for downstream spike analysis scaled with downsampling
R2_reads = []

for row in inline_table.iterrows():
    _, sample_data = row
    count_R2_reads_command = f'samtools view -c ' + \
                                  f'{spike_R2_alignments_dir}/{sample_data.title}_R2.bam'
    count_R2_reads = int(quickshell(count_R2_reads_command,
                                         print_output = False,
                                         return_output=True)[0].split('\n')[0])
    R2_reads.append([sample_data.title, count_R2_reads])
    print(f'{sample_data.title} spike done')
    
reads_DF = pd.DataFrame(R2_reads, columns = ['Sample_Name','R2_read_counts'])

reads_DF['Condition'] = reads_DF['Sample_Name'].str[:-1]
conditions = reads_DF['Condition'].unique()
reads_DF['Rank'] = 0

DF_list = []

for i in range(len(conditions)):
    # .copy() so the rank assignments below modify an independent table, not a view of reads_DF
    replicates = reads_DF.loc[reads_DF['Condition'] == conditions[i]].copy()
    replicates.loc[replicates['R2_read_counts'] == replicates['R2_read_counts'].max(), 'Rank'] = 1
    replicates.loc[replicates['R2_read_counts'] == replicates['R2_read_counts'].min(), 'Rank'] = 3
    replicates.loc[replicates['Rank'] == 0, 'Rank'] = 2
    DF_list.append(replicates)
    
reads_DF_ranked = pd.concat(DF_list)

# Inspect read depths to decide on a minimum read depth to downsample all samples for subsequent analysis
reads_DF_ranked.to_csv(f'{spike_R2_alignments_dir}/R2_read_counts.csv')
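
The ranking loops in this notebook assume exactly three replicates per condition and single-digit replicate numbers. As an alternative sketch, a grouped rank handles any number of replicates. The counts and sample names below are made up, and ties are broken by row order rather than sharing a rank.

```python
import pandas as pd

# Made-up counts for illustration
demo_ranked = pd.DataFrame({
    'Sample_Name': ['condA1', 'condA2', 'condA3', 'condB1', 'condB2'],
    'R2_read_counts': [1200000, 900000, 1500000, 50, 70],
})

# Strip trailing digits so multi-digit replicate numbers are handled as well
demo_ranked['Condition'] = demo_ranked['Sample_Name'].str.replace(r'\d+$', '', regex=True)

# Rank 1 = deepest replicate within each condition
demo_ranked['Rank'] = (demo_ranked.groupby('Condition')['R2_read_counts']
                       .rank(ascending=False, method='first')
                       .astype(int))
print(demo_ranked)
```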