Purpose:
A deep mutational scanning experiment was performed on the Hrd1 ubiquitin ligase to determine residues required for wild-type function and find mutations specifically required for ERAD-L degradation.
The notebook below shows the pre-processing of sequencing reads from an Illumina 2x300 bp MiSeq run. The libraries were prepared by amplicon sequencing across 5 regions (~110 amino acids each) of Hrd1. Nextera adaptors were used, and each amplicon primer carries an 8-nucleotide “N” sequence on the 3’ side of the Nextera adaptor.
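
As a rough illustration of the read layout described above, the sketch below removes the first 8 bases of a read (the random "N" stretch) and then cuts the read at the first exact occurrence of the Nextera adaptor sequence that the notebook passes to cutadapt. The function name `trim_read` and the toy reads are hypothetical. Cutadapt itself also tolerates mismatches, matches partial adaptors at the read end, and quality-trims, none of which is modelled here.

```python
# Toy model of the trimming step; NOT a replacement for cutadapt.
ADAPTER = "CTGTCTCTTATACACATCT"  # sequence given to -a / -A in the notebook
N_PREFIX_LEN = 8                 # value given to -u / -U in the notebook


def trim_read(seq, adapter=ADAPTER, n_prefix=N_PREFIX_LEN):
    """Drop the first n_prefix bases, then cut at the first exact adapter match.

    Hypothetical helper: exact matching only, no quality trimming.
    """
    seq = seq[n_prefix:]          # remove the 8 "N" bases at the start of the read
    idx = seq.find(adapter)       # look for read-through into the adaptor
    return seq if idx == -1 else seq[:idx]


# Made-up reads: 8 random bases, a short insert, optional adaptor read-through
read_with_adapter = "ACGTACGT" + "GATTACA" + ADAPTER + "GGG"
read_without_adapter = "TTGCAAGC" + "GATTACA"
print(trim_read(read_with_adapter))
print(trim_read(read_without_adapter))
```

The prefix is removed before the adaptor search, which mirrors cutadapt's documented behaviour of applying `-u`/`-U` before adapter trimming.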

0) Input is demultiplexed fastq files provided by Azenta (formerly Genewiz).
1) Cutadapt to trim Nextera adaptors and remove the 8 “N” nucleotides on each read
2) Fastp to merge read pairs into a single contig
3) Bowtie2 to align to the reference sequence
4) Samtools to convert to BAM and sort by read name
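
The steps above can be sketched as a dry run that only assembles the four commands for one sample without executing anything. The helper name `build_commands` is hypothetical. The directory names, suffixes, reference prefix, and flags are copied from the processing cell in this notebook, with the fastp unmerged-read and report outputs left out for brevity.

```python
def build_commands(basename, num_core=12):
    """Return the pipeline commands for one sample as argument lists.

    Hypothetical helper for inspection only; nothing is executed here.
    """
    r1_in = f"0_fastq_files/{basename}_R1_001.fastq.gz"
    r2_in = f"0_fastq_files/{basename}_R2_001.fastq.gz"
    r1_cut = f"1_cutadapt/{basename}_R1_cut.fastq.gz"
    r2_cut = f"1_cutadapt/{basename}_R2_cut.fastq.gz"
    merged = f"2_fastp/{basename}_merge.fastq.gz"
    sam = f"3_bowtie/{basename}_align.sam"
    bam = f"3_bowtie/{basename}_align.bam"
    adapter = "CTGTCTCTTATACACATCT"
    cores = str(num_core)
    return [
        # 1) trim adaptor, drop the 8 N bases, quality-trim
        ["cutadapt", "-a", adapter, "-A", adapter, "-u", "8", "-U", "8",
         "-q", "30", "-o", r1_cut, "-p", r2_cut, f"--cores={cores}",
         r1_in, r2_in],
        # 2) merge overlapping read pairs
        ["fastp", "-i", r1_cut, "-I", r2_cut, "-m", "--merged_out", merged,
         "-w", cores],
        # 3) align merged reads to the Hrd1 reference
        ["bowtie2", "-x", "hrd1_ref/hrd1_90-prom_90-term", "--very-sensitive",
         "--threads", cores, "-U", merged, "-S", sam],
        # 4) name-sort and write BAM
        ["samtools", "sort", "-n", "-@", cores, "-o", bam, sam],
    ]


# Print the commands for one sample name taken from the cutadapt log
for cmd in build_commands("Hrd1_DMS_Rep1_Region1_Input"):
    print(" ".join(cmd))
```

Keeping each command as a list of arguments makes it easy to review before running; each list could then be handed to `subprocess.run(cmd, check=True)` if running outside a notebook is preferred.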

In [1]:
## Output versions of software used
print('cutadapt version:')
!cutadapt --version
print('\nfastp version:')
!fastp --version
print('\nbowtie2 version:')
!bowtie2 --version
print('\nsamtools version:')
!samtools --version

cutadapt version:
1.18

fastp version:
fastp 0.20.0

bowtie2 version:
/usr/bin/bowtie2-align-s version 2.3.5.1
64-bit

samtools version:
samtools 1.10
Using htslib 1.10.2-3
Copyright (C) 2019 Genome Research Ltd.


In [2]:
import os

##cores dedicated for multiprocessing steps
num_core= 12

##base names of the files. Used for input and for output base name; 1 name per line
basename_file_input =  'Hrd1_DMS_Input_Name.txt' 

## fastq input location
in_dir = '0_fastq_files/' 
##output directories
cutadapt_dir = '1_cutadapt/'
fastp_dir = '2_fastp/'
bowtie_dir = '3_bowtie/'

## make directories if they don't exist
for out_dir in (cutadapt_dir, fastp_dir, bowtie_dir):
    os.makedirs(out_dir, exist_ok=True)


##fastq suffix for read1 and read2 (input)
R1_suffix = '_R1_001.fastq.gz'
R2_suffix = '_R2_001.fastq.gz'

##cutadapt suffix (output)
cut_R1_suffix = '_R1_cut.fastq.gz'
cut_R2_suffix = '_R2_cut.fastq.gz'

##fastp suffix (output)
merge_suffix = '_merge.fastq.gz'
unpair_R1_suffix = '_R1_unpaired.fastq.gz'
unpair_R2_suffix = '_R2_unpaired.fastq.gz'

##bowtie2/samtools suffix (output)
bowtie_suffix = '_align.sam'
##reference is hrd1 coding sequence with 90 bp of promoter and terminator
bt_ref = 'hrd1_ref/hrd1_90-prom_90-term'
bamfile_suffix ='_align.bam'


##open and process 1 paired-end read at a time
with open(basename_file_input,'r') as file:
    header = file.readline()  ## first line is a header; skip it
    for line in file:
        basename = line.strip()
        if not basename:
            continue  ## skip blank lines
        ## cutadapt
        cut_r1_in=str(in_dir+basename+R1_suffix)
        cut_r2_in=str(in_dir+basename+R2_suffix)
        
        cut_r1_out = str(cutadapt_dir+basename+cut_R1_suffix)
        cut_r2_out = str(cutadapt_dir+basename+cut_R2_suffix)
        ## cutadapt: trim the Nextera adaptor, remove the 8 N nucleotides directly 3' of the adaptor (-u/-U 8) and quality-trim at Q30 (-q 30)
        !cutadapt -a CTGTCTCTTATACACATCT -A CTGTCTCTTATACACATCT -u 8 -U 8 -q 30 -o $cut_r1_out -p $cut_r2_out --cores=$num_core $cut_r1_in $cut_r2_in

        ##fastp
        fastp_merge=str(fastp_dir+basename+merge_suffix)
        ## use the unpaired suffixes defined above for reads that fastp could not merge
        fastp_unpair_R1= str(fastp_dir+basename+unpair_R1_suffix)
        fastp_unpair_R2= str(fastp_dir+basename+unpair_R2_suffix)
        fastp_report_html= str(fastp_dir+basename+'_report.html')
        fastp_report_json = str(fastp_dir+basename+'_report.json')
        ##fastp merge with default settings
        !fastp -i $cut_r1_out -I $cut_r2_out -m --merged_out $fastp_merge -o $fastp_unpair_R1 -O $fastp_unpair_R2 -w $num_core -h $fastp_report_html -j $fastp_report_json

        ##bowtie2 align very sensitive
        bt_out = str(bowtie_dir+basename+bowtie_suffix)
        !bowtie2 -x $bt_ref --very-sensitive --threads $num_core -U $fastp_merge -S $bt_out
        
        ##samtools sort on name and convert sam to bam file
        samtool_out = str(bowtie_dir+basename+bamfile_suffix)
        !samtools sort -n -@ $num_core -o $samtool_out $bt_out
        ##remove sam file
        !rm $bt_out

This is cutadapt 1.18 with Python 3.7.16
Command line parameters: -a CTGTCTCTTATACACATCT -A CTGTCTCTTATACACATCT -u 8 -U 8 -q 30 -o 1_cutadapt/Hrd1_DMS_Rep1_Region1_Input_R1_cut.fastq.gz -p 1_cutadapt/Hrd1_DMS_Rep1_Region1_Input_R2_cut.fastq.gz --cores=12 0_fastq_files/Hrd1_DMS_Rep1_Region1_Input_R1_001.fastq.gz 0_fastq_files/Hrd1_DMS_Rep1_Region1_Input_R2_001.fastq.gz
Processing reads on 12 cores in paired-end mode ...
Finished in 1.55 s (9 us/read; 6.78 M reads/minute).

=== Summary ===

Total read pairs processed:            174,667
  Read 1 with adapter:                   9,614 (5.5%)
  Read 2 with adapter:                  10,838 (6.2%)
Pairs written (passing filters):       174,667 (100.0%)

Total basepairs processed:   104,775,635 bp
  Read 1:    52,388,959 bp
  Read 2:    52,386,676 bp
Quality-trimmed:               5,640,445 bp (5.4%)
  Read 1:     1,706,578 bp
  Read 2:     3,933,867 bp
Total written (filtered):     95,001,727 bp (90.7%)
  Read 1:    48,565,374 bp
  Read 2:   