# Cell-Free Genomics: FASTQ preparation and alignment for genomic DNA samples

#### Ruby Froom, Campbell/Darst and Rock labs at Rockefeller University

The following pipeline processes raw FASTQ files, maps reads, and extracts R2 reads for the relevant genome from genomic DNA samples, to serve as a 'blacklist' sample for TTS calling from RNA 3' end datasets.

## Modules and functions

In [2]:
import subprocess, time
import pandas as pd
import numpy as np
import regex
import pysam
import csv
from Bio.Seq import Seq
from Bio import SeqIO
from os.path import exists
from Bio import SeqRecord
import scipy

In [3]:
# written by Peter Culviner, PhD to enable command-line access through Jupyter
def quickshell(command, print_output=True, output_path=None, return_output=False):
    process_output = subprocess.run(command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout = process_output.stdout.decode('utf-8')
    stderr = process_output.stderr.decode('utf-8')
    output_string = f'STDOUT:\n{stdout}\nSTDERR:\n{stderr}\n'
    if print_output:
        print('$ ' + command)
        print(output_string)
    if output_path is not None:
        with open(output_path, 'w') as f:
            f.write(output_string)
    if return_output:
        return stdout, stderr
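
`quickshell` returns `None` unless `return_output=True`, and it never inspects the exit status, so a failed command can pass unnoticed when `print_output=False`. A minimal sketch of a stricter wrapper is shown below. The name `quickshell_checked` is hypothetical and is not part of the original notebook.

```python
import subprocess

def quickshell_checked(command):
    """Run a shell command; return (stdout, stderr) or raise if it exits non-zero.

    Hypothetical helper for illustration, not part of the original pipeline.
    """
    result = subprocess.run(command, shell=True,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout = result.stdout.decode('utf-8')
    stderr = result.stderr.decode('utf-8')
    if result.returncode != 0:
        raise RuntimeError(f'Command failed ({result.returncode}): {command}\n{stderr}')
    return stdout, stderr

stdout, stderr = quickshell_checked('echo hello')
print(stdout.strip())
```

Raising on a non-zero exit status stops a loop at the first failing sample instead of carrying an empty or truncated file into later steps.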

## Initializing inputs and settings

Inputs for directory creation and pointing to files. 

The starting requirements for this pipeline are:
>**main_path**: the absolute path of the working directory. Update to reflect your own configuration.

>**input_csv_dir:** a directory containing the `i7_sample_file` csv. Here, called `input_csv_files`.
>>**i7_sample_file**: Used to identify fastq files to split from. Assumes we're working with fastqs that have already been split by Illumina's i7 indexes. Minimally should have columns:
>>>**r1**: read 1 fastq, no inline barcode, template starts from first base.

>>>**r2**: read 2 fastq, this one is assumed to have the inline barcode starting from the first base.

>>>**i7**: number designating this fastq pair, used internally by the code to record which unsplit fastq files the final samples come from. Minimally, each fastq pair should have a unique number.

>>>**title**: used internally for pool names during splitting, unique but the name itself is not important.


>**genome_dir:** a directory containing reference genome(s) for read mapping. Here, called `genome_files_misc`.

>**raw_fastq_dir:** a sub-directory containing your i7-demultiplexed fastq files. Here, called `raw_fastq`.
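
The column requirements above can be checked before any processing starts. The sketch below uses only the standard library. The function name `check_sample_sheet` and the example file names are hypothetical, and the uniqueness checks on `i7` and `title` are based on the descriptions above rather than on anything the pipeline itself enforces.

```python
import csv
import io

REQUIRED_COLUMNS = {'r1', 'r2', 'i7', 'title'}

def check_sample_sheet(handle):
    """Return the rows of a sample sheet after basic sanity checks."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f'Missing columns: {sorted(missing)}')
    rows = list(reader)
    # i7 and title are both described as unique per fastq pair
    for column in ('i7', 'title'):
        values = [row[column] for row in rows]
        if len(values) != len(set(values)):
            raise ValueError(f'Values in column {column!r} must be unique')
    return rows

# Hypothetical two-sample sheet; file names are placeholders.
example = io.StringIO(
    'r1,r2,i7,title\n'
    'sampleA_R1.fastq.gz,sampleA_R2.fastq.gz,1,gDNA_A\n'
    'sampleB_R1.fastq.gz,sampleB_R2.fastq.gz,2,gDNA_B\n'
)
rows = check_sample_sheet(example)
print([row['title'] for row in rows])
```

For a real file, pass `open(path, newline='')` in place of the `io.StringIO` object.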

In [4]:
# initializing locations of input .csv files, alignment genomes and raw compressed fastq files
main_path = 'gDNA'
input_csv_dir = f'{main_path}/input_csv_files'
readPrep_dir = f'{main_path}/readPrep'
genome_dir = 'genome_files_misc'
raw_fastq_dir = f'{readPrep_dir}/raw_fastq'

The additional directories below are a suggested organization for subsequent processing steps in this notebook.

The cells below will initialize these variables and create the directories inside the working directory.

In [6]:
fastqc_dir = f'{readPrep_dir}/fastqc'

# Directory for fastqs quality filtered and Illumina adaptors trimmed
trimmed_fastq_dir = f'{readPrep_dir}/trimmed_fastq'

alignments_dir = f'{readPrep_dir}/alignments'
R2_alignments_dir = f'{readPrep_dir}/R2_alignments'

In [7]:
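# Make directories if needed
# (-p creates missing parent directories and does not fail if the directory already exists)
!mkdir -p $fastqc_dir

!mkdir -p $trimmed_fastq_dir

!mkdir -p $alignments_dir
!mkdir -p $R2_alignments_dir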
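
Where the `!` shell magic is unavailable (for example, when the notebook is run as a plain script), the same directories can be created from Python. This is a sketch only: the helper name `make_output_dirs` is hypothetical, and the demonstration runs in a temporary directory rather than the real working directory.

```python
import os
import tempfile

def make_output_dirs(base_dir, names):
    """Create base_dir/<name> for each name; existing directories are left alone."""
    paths = []
    for name in names:
        path = os.path.join(base_dir, name)
        os.makedirs(path, exist_ok=True)
        paths.append(path)
    return paths

# Demonstrated in a temporary directory so nothing is written to the real project.
with tempfile.TemporaryDirectory() as tmp:
    read_prep = os.path.join(tmp, 'gDNA', 'readPrep')
    created = make_output_dirs(
        read_prep, ['fastqc', 'trimmed_fastq', 'alignments', 'R2_alignments'])
    # Calling it a second time is safe.
    make_output_dirs(read_prep, ['fastqc'])
    all_exist = all(os.path.isdir(p) for p in created)
print(all_exist)
```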

*Descriptions written by Peter Culviner, PhD with modifications from Ruby Froom.*

**threads:** number of threads to use in command-line calls (e.g. fastqc, cutadapt, bwa, end calling).

**inline_barcode_errors:** How many mismatches can be tolerated in the inline barcode. Analyze the edit distance between the barcodes used to assess a tolerable number of errors.

**barcode_length**: the length of the inline barcodes. Enables calculation of the error rate (# errors / barcode length), a required input for `cutadapt` when we are splitting based on inline barcode identification.

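**minimum_insert_length:** minimum allowed length of insert after trimming of adapter sequences at the 3'-ends of read1 and read2.

**quality_threshold:** Phred quality cutoff passed to `cutadapt -q`. Bases below this quality are trimmed from the 3'-ends of reads before adapter removal.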

**i7_trim**: For 3'-end trimming of the first read. The adapter sequence to look for at the 3'-end of the read. Note that the reverse complement of the inline barcode for a given sample will be appended to the front of this string since the inline barcode would appear just before this string.

**i5_trim**: For 3'-end trimming of the second read. The adapter sequence to look for at the 3'-end of the read.
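
The relationship between barcode distance, tolerated mismatches and the `cutadapt` error rate can be sketched as follows. The barcodes are hypothetical, and the helpers `hamming` and `reverse_complement` are illustrative names. The sketch counts substitutions only (Hamming distance), which is a simplification, since `cutadapt` also permits indels by default. The last lines show how a read 1 adapter would be assembled from a reverse-complemented inline barcode and `i7_trim`, as described above.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError('Sequences must have equal length')
    return sum(x != y for x, y in zip(a, b))

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA sequence (A, C, G, T, N)."""
    return seq.translate(str.maketrans('ACGTN', 'TGCAN'))[::-1]

# Hypothetical inline barcodes, for illustration only.
barcodes = ['ACGTAC', 'TGCATG', 'GATCGA']
barcode_length = len(barcodes[0])

min_distance = min(hamming(a, b) for a, b in combinations(barcodes, 2))
# A barcode with e mismatches stays closest to its true source when 2e < min distance.
inline_barcode_errors = (min_distance - 1) // 2
# Error rate as defined above: number of errors / barcode length
error_rate = inline_barcode_errors / barcode_length

# Read 1 adapter for one sample: reverse-complemented barcode, then i7_trim.
i7_trim = 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
r1_adapter = reverse_complement(barcodes[0]) + i7_trim

print(min_distance, inline_barcode_errors, round(error_rate, 3))
```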

In [8]:
# SETTINGS

# Threads (CPU x2 is max) for fastqc, cutadapt, bwa, and end calling
threads = 22

# Quality filters
minimum_insert_length = 10
quality_threshold = 20

# 3'-adapter trimming
i7_trim = 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
i5_trim = 'GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT'

i7_table = pd.read_csv(f'{input_csv_dir}/i7_barcodes_gDNA.csv')

## Generate quality reports

In [7]:
for row in i7_table.iterrows():
    _, i7_data = row
    R1 = i7_data.r1
    R2 = i7_data.r2

    command1 = f'fastqc {raw_fastq_dir}/{R1} -o {fastqc_dir} -t {threads}'
    command2 = f'fastqc {raw_fastq_dir}/{R2} -o {fastqc_dir} -t {threads}'

    quickshell(
            command1,
            print_output=False,
            return_output=False)

    print(f'R1 fastqc: {i7_data.title} done')

    quickshell(
            command2,
            print_output=False,
            return_output=False)
    
    print(f'R2 fastqc: {i7_data.title} done')

## Quality-filter reads

In [8]:
# iterate through file pairs and trim adapters
trimming_log = f'{trimmed_fastq_dir}/gDNA_split_trim_log.txt'
with open(trimming_log, 'w') as f:
    for row in i7_table.iterrows():
        _, sample_data = row
        # prepare input and output titles
        cutadapt_inputs = [
            f'{raw_fastq_dir}/{sample_data.r1}',
            f'{raw_fastq_dir}/{sample_data.r2}']
        cutadapt_outputs = [
            f'{trimmed_fastq_dir}/{sample_data.title}.R1.fastq.gz',
            f'{trimmed_fastq_dir}/{sample_data.title}.R2.fastq.gz']
        # prepare cutadapt command
        # 3'-adapter to trim from read 1 (passed to -a)
        i7_adapter = i7_trim
        # 3'-adapter to trim from read 2 (passed to -A)
        i5_adapter = i5_trim
        command = f'cutadapt --overlap=1 --minimum-length={minimum_insert_length} -q {quality_threshold} ' + \
                  f'-j {threads} -a {i7_adapter} -A {i5_adapter} -o {cutadapt_outputs[0]} ' + \
                  f'-p {cutadapt_outputs[1]} {cutadapt_inputs[0]} {cutadapt_inputs[1]}'
        # run cutadapt
        output_trim, _ = quickshell(command, print_output=False, return_output=True)
        # write full output to log
        f.write(output_trim + '\n')
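
The `cutadapt` call above is assembled as a single string, which breaks if a path contains spaces or shell metacharacters. One possible alternative, sketched below, builds the call as an argument list and quotes it with `shlex.join` (Python 3.8+). The function name `build_cutadapt_command` and the example paths are hypothetical. The flags are the ones already used in this notebook.

```python
import shlex

def build_cutadapt_command(r1_in, r2_in, r1_out, r2_out, adapter_r1, adapter_r2,
                           min_length=10, quality=20, threads=1):
    """Return the cutadapt call as an argument list (same flags as the notebook cell)."""
    return [
        'cutadapt', '--overlap=1', f'--minimum-length={min_length}',
        '-q', str(quality), '-j', str(threads),
        '-a', adapter_r1, '-A', adapter_r2,
        '-o', r1_out, '-p', r2_out,
        r1_in, r2_in,
    ]

args = build_cutadapt_command(
    'raw fastq/sample_R1.fastq.gz', 'raw fastq/sample_R2.fastq.gz',  # hypothetical paths
    'trimmed/sample.R1.fastq.gz', 'trimmed/sample.R2.fastq.gz',
    adapter_r1='AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC',
    adapter_r2='GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT',
    threads=4)

# shlex.join quotes any argument containing spaces, so the string is safe for shell=True.
command = shlex.join(args)
print(command)
```

The resulting `command` string can be passed to `quickshell` unchanged.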

## Mapping reads

Map the reads to two concatenated genomes: a spike genome (Eco) and an experimental genome of interest (Mtb).

To enable spike-based absolute quantification across conditions, I map to a concatenated genome in which Mtb comes first, then Eco, so that ambiguously mapping reads are removed from the spike quantitation.

For end enrichment calling, the same logic applies, but the Eco genome comes first so that ambiguously mapping reads are removed from the final end enrichment quantification. The cells below use this second ordering (`Eco_Mtb_genome.fasta`).
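
Because both genomes live in one concatenated reference, each genome is addressed later in this notebook by a coordinate range on that reference. The sketch below shows how such `samtools` region strings follow from the concatenation order. It assumes the concatenated FASTA holds a single record named `Eco_Mtb`, and the genome lengths are inferred from the region strings used in this notebook. The function name `concatenated_regions` is hypothetical.

```python
def concatenated_regions(reference_name, genomes):
    """Map each genome to a 1-based, inclusive samtools region on the concatenated reference.

    genomes: ordered list of (label, length in bp), in concatenation order.
    """
    regions = {}
    start = 1
    for label, length in genomes:
        end = start + length - 1
        regions[label] = f'{reference_name}:{start}-{end}'
        start = end + 1
    return regions

# Lengths inferred from the region strings used in this notebook.
regions = concatenated_regions('Eco_Mtb', [('Eco', 4641652), ('Mtb', 4411709)])
print(regions['Eco'])
print(regions['Mtb'])
```

Swapping the order of the list gives the regions for the Mtb-first reference used for spike quantification.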

In [6]:
# Index the input genomes (only needs to be done once)
index_genome_command = f'bwa index {genome_dir}/Eco_Mtb_genome.fasta'

index_genome = quickshell(index_genome_command, print_output = True, return_output = False)

In [9]:
# Use BWA to align to concatenated genomes 
# for spike calculation and subsequent end enrichment analysis

# bwa algorithm. Mem is default for most applications, read documentation to decide
bwa_algorithm = 'mem'

# Set print_output = True to see command-line output
# Set return_output = True if assigning command-line output to a variable
print_output = True
return_output = False

for row in i7_table.iterrows():
    _, sample_data = row
    R1_reads = f'{trimmed_fastq_dir}/{sample_data.title}.R1.fastq.gz'
    R2_reads = f'{trimmed_fastq_dir}/{sample_data.title}.R2.fastq.gz'

    map_command = f'bwa {bwa_algorithm} -t {threads} {genome_dir}/Eco_Mtb_genome.fasta ' + \
                       f'{R1_reads} {R2_reads} > {alignments_dir}/{sample_data.title}.sam'
    
    map_output = quickshell(map_command, print_output = print_output, return_output = return_output)
    print(f'Initial alignments: {sample_data.title} Done')

In [1]:
# Process alignments: 
# sort, convert to bam, and index

# Set print_output = True to see command-line output
# Set return_output = True if assigning command-line output to a variable
print_output = False
return_output = False

for row in i7_table.iterrows():
    _, sample_data = row
    # Sort sam file and output as bam
    sort_command = f'samtools sort -O BAM {alignments_dir}/{sample_data.title}.sam > ' + \
                          f'{alignments_dir}/sorted_{sample_data.title}.bam'
    
    sort_output = quickshell(sort_command, print_output = print_output, return_output=return_output)
    
    # Index sorted bam
    index_command = f'samtools index {alignments_dir}/sorted_{sample_data.title}.bam'
    
    index_output = quickshell(index_command, print_output = print_output, return_output = return_output)

    print(f'Sort and index alignments before de-duplication: {sample_data.title} done')

In [5]:
# Generate mapping statistics

# List to generate dataframe with spike counts for normalization (needed in downstream analysis)
# and other mapping statistics (not needed)
mapping_stats = []

# Update with the name of your genome and relevant regions
Eco_region = 'Eco_Mtb:1-4641652'
Mtb_region = 'Eco_Mtb:4641653-9053361'

for row in i7_table.iterrows():
    _, sample_data = row
    # Extract mapping stats
    count_Mtb_reads_command = f'samtools view -c -F 260 ' + \
                              f'{alignments_dir}/sorted_{sample_data.title}.bam "{Mtb_region}"'
    count_Mtb_reads = int(quickshell(count_Mtb_reads_command,
                                     print_output = False,
                                     return_output=True)[0].split('\n')[0])
    
    count_all_mapped_reads_command = f'samtools view -c -F 260 ' + \
                                         f'{alignments_dir}/sorted_{sample_data.title}.bam'
    count_all_mapped_reads = int(quickshell(count_all_mapped_reads_command,
                                                print_output = False,
                                                return_output=True)[0].split('\n')[0])
    Mtb_percent = (count_Mtb_reads / count_all_mapped_reads) * 100
    
    count_Eco_reads_command = f'samtools view -c -F 260 ' + \
                                f'{alignments_dir}/sorted_{sample_data.title}.bam "{Eco_region}"'
    count_Eco_reads = int(quickshell(count_Eco_reads_command,
                                       print_output = False,
                                       return_output=True)[0].split('\n')[0])

    Eco_percent = (count_Eco_reads / count_all_mapped_reads) * 100
    
    count_unmapped_reads_command = f'samtools view -c -f 4 ' + \
                                       f'{alignments_dir}/sorted_{sample_data.title}.bam'
    count_unmapped_reads = int(quickshell(count_unmapped_reads_command,
                                              print_output = False,
                                              return_output=True)[0].split('\n')[0])
    
    
    count_all_reads_command = f'samtools view -c ' + \
                                  f'{alignments_dir}/sorted_{sample_data.title}.bam'
    count_all_reads = int(quickshell(count_all_reads_command,
                                         print_output = False,
                                         return_output=True)[0].split('\n')[0])
    unmapped_percent = (count_unmapped_reads / count_all_reads) * 100
    
    mapping_stats.append([sample_data.title, count_Eco_reads, Eco_percent, count_Mtb_reads,
                          Mtb_percent, unmapped_percent])
    
    print(f'Mapping stats: {sample_data.title} done')

mapping_DF = pd.DataFrame(mapping_stats, columns = ['Sample_Name','Eco Counts',
                                                    'Eco % Mapped','Mtb Counts','Mtb % Mapped',
                                                    '% Unmapped'])
mapping_DF.to_csv(f'{alignments_dir}/mapping_stats.csv')
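
The `samtools view` filters used for these counts are bit masks on the SAM flag field: `-F 260` excludes reads with the unmapped bit (4) or the secondary-alignment bit (256), and `-f 4` keeps only unmapped reads. The sketch below mimics that logic in plain Python on a few hypothetical flag values. `passes` and `percent` are illustrative names, and `percent` adds a guard against division by zero that the cell above does not have. Note that `-F 260` does not exclude supplementary alignments (bit 2048), so a split read may be counted more than once.

```python
# SAM flag bits relevant to the samtools calls in this notebook
UNMAPPED = 0x4      # read is unmapped
READ2 = 0x80        # second read in a pair (R2)
SECONDARY = 0x100   # secondary alignment

def passes(flag, require=0, exclude=0):
    """Mimic samtools view -f/-F: all `require` bits set, no `exclude` bits set."""
    return (flag & require) == require and (flag & exclude) == 0

def percent(part, total):
    """Percentage that returns 0.0 instead of raising when total is zero."""
    return 100.0 * part / total if total else 0.0

# Hypothetical flags: 99 = mapped R1, 147 = mapped R2, 141 = unmapped R2, 355 = secondary R1
flags = [99, 147, 141, 355]
mapped_primary = [f for f in flags if passes(f, exclude=UNMAPPED | SECONDARY)]  # -F 260
r2_only = [f for f in flags if passes(f, require=READ2)]                        # -f 0x80

print(mapped_primary, r2_only, percent(len(mapped_primary), len(flags)))
```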

## Extract only Mtb R2 reads

In [7]:
# Set print_output = True to see command-line output
# Set return_output = True if assigning command-line output to a variable
print_output = False
return_output = False

Mtb_region = 'Eco_Mtb:4641653-9053361'

for row in i7_table.iterrows():
    _, sample_data = row
    # Output the R2 reads (-f 0x0080: second in pair) mapping to the genome with enriched ends to analyze
    # Options are placed before the input file so the call does not depend on argument reordering
    ref_R2_command = f'samtools view -b -f 0x0080 ' + \
                     f'-o {R2_alignments_dir}/{sample_data.title}_R2.bam ' + \
                     f'{alignments_dir}/sorted_{sample_data.title}.bam "{Mtb_region}"'
    ref_R2 = quickshell(ref_R2_command, print_output = print_output, return_output = return_output)
    
    # Index the R2 only bam
    index_R2_command = f'samtools index {R2_alignments_dir}/{sample_data.title}_R2.bam'
    index_R2 = quickshell(index_R2_command, print_output = print_output, return_output = return_output)
    print(f'Generate and index R2-only alignments: {sample_data.title} done')