## Parsing FASTQ files to trim read length or reduce the number of reads

We have found two ways to improve the evSeq software, and this notebook demonstrates how to use them.

First, we found that evSeq fails if the returned reads are longer than the variable region and read into the primer regions. As a quick fix, the sequences in .fastq files can be trimmed to a desired length with the `trim_fastqs` parser.
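To make the idea of trimming concrete, here is a minimal sketch of what cutting a read to a fixed length involves. It is not the evSeq implementation; `trim_record` and `trim_fastq_lines` are hypothetical helper names, and the sketch assumes the standard 4-line FASTQ record layout (header, sequence, `+`, quality).

```python
def trim_record(record, new_length=150):
    """Trim one 4-line FASTQ record (header, sequence, '+', quality)."""
    header, seq, plus, qual = record
    # Cut sequence and quality at the same position so that every
    # remaining base keeps its quality score. Reads shorter than
    # new_length are returned unchanged by the slice.
    return (header, seq[:new_length], plus, qual[:new_length])


def trim_fastq_lines(lines, new_length=150):
    """Trim every record in an iterable of FASTQ lines."""
    stripped = [line.rstrip("\n") for line in lines]
    if len(stripped) % 4 != 0:
        raise ValueError("FASTQ input must contain 4 lines per record")
    trimmed = []
    for i in range(0, len(stripped), 4):
        trimmed.extend(trim_record(tuple(stripped[i:i + 4]), new_length))
    return trimmed


# Hypothetical toy record, trimmed from 10 bases to 4.
example = ["@read1", "ACGTACGTAC", "+", "IIIIIIIIII"]
print(trim_fastq_lines(example, new_length=4))
```

The key point is that the sequence and quality lines are always sliced together; trimming only one of them would produce an invalid record.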

Second, we found that if many reads are returned, the software becomes difficult to run on a personal laptop. The compute can always be scaled up to accommodate more reads, but an alternative is to randomly downsample the returned .fastq files. Downsampling is also a quick way to check that the software runs to completion on the new .fastq files before committing to a longer evSeq run with the full files. Downsampling can be done with `downsample_fastqs`.
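As an illustration of random downsampling for paired-end data, the sketch below draws one set of random indices and applies it to both read lists. It is not the evSeq implementation: `downsample_pairs` and its `seed` argument are hypothetical, and the sketch assumes that forward and reverse reads are stored in the same order and should stay matched after downsampling.

```python
import random


def downsample_pairs(forward, reverse, num_entries=100000, seed=None):
    """Randomly keep num_entries read pairs, preserving the pairing."""
    if len(forward) != len(reverse):
        raise ValueError("forward and reverse inputs must hold the same number of reads")
    rng = random.Random(seed)
    # Never ask for more reads than are available.
    n_keep = min(num_entries, len(forward))
    # Draw the indices once and reuse them for both files, so that
    # read i in the forward output still pairs with read i in the reverse output.
    keep = sorted(rng.sample(range(len(forward)), n_keep))
    return [forward[i] for i in keep], [reverse[i] for i in keep]


# Hypothetical toy data: ten read pairs identified by a shared number.
fwd = [f"fwd_{i}" for i in range(10)]
rev = [f"rev_{i}" for i in range(10)]
fwd_small, rev_small = downsample_pairs(fwd, rev, num_entries=3, seed=0)
print(fwd_small, rev_small)
```

Sampling the two files independently would break the pairing, which is why a single index set is shared here. Passing a fixed `seed` makes the subset reproducible between runs.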

These options are currently in beta and may be incorporated into evSeq in the future.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from evSeq.util import trim_fastqs, downsample_fastqs

In [3]:
trim_fastqs?

Signature: trim_fastqs(folder, new_length=150)
Docstring:
A function to take a folder with fastqs that harbored returned reads 
longer than desired and write new fastqs with the reads trimmed to 
the desired length.

Parameters:
-----------
folder: string, path to folder containing fastq files to trim. 
    fastq files to be trimmed must have the *_R*_*.fastq or 
    *_R*_*.fastq.gz naming convention. This will be the location of a new 
    folder called trimmed_reads harboring the trimmed fastqs, which 
    will be named trimmed_ORIGINAL_NAME.fastq.
new_length: int, default 150, 
    length to trim reads to, will be used for both forward and reverse 
    read files. Reads shorter than the new_length will remain the same 
    length.
File:      c:\users\kejoh\github\evseq\evseq\util\fastq_parsing.py
Type:      function


In [4]:
# Inputs to trim_fastqs
folder = '../data/multisite_runs/'
new_length = 150
trim_fastqs(folder, new_length)

Loading reads from ..\data\multisite_runs\multisite_S305_L001_R1_001.fastq.gz


403077it [01:04, 6209.19it/s] 


Loading reads from ..\data\multisite_runs\multisite_S305_L001_R2_001.fastq.gz


403077it [01:14, 5410.34it/s] 
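After trimming, it can be useful to confirm the read lengths in the new files. The sketch below is one way to do that with the standard library only; `open_fastq` and `read_length_counts` are hypothetical helpers, not part of evSeq, and the 4-line record layout is assumed.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a plain or gzipped FASTQ file in text mode."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, "rt")


def read_length_counts(lines):
    """Count reads by sequence length from an iterable of FASTQ lines."""
    counts = Counter()
    for line_number, line in enumerate(lines):
        # The sequence is the second line of each 4-line record.
        if line_number % 4 == 1:
            counts[len(line.strip())] += 1
    return counts


# Hypothetical toy input: two records of length 6 and 3.
example = ["@read1\n", "ACGTAC\n", "+\n", "IIIIII\n",
           "@read2\n", "ACG\n", "+\n", "III\n"]
print(max(read_length_counts(example)))  # longest read in the example: 6
```

For a real file, pass the handle returned by `open_fastq(path)` to `read_length_counts`. The path depends on where `trim_fastqs` wrote its output; according to its docstring, that is a `trimmed_reads` folder inside `folder`. If the longest read is no greater than `new_length`, the trim worked as intended.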


In [5]:
downsample_fastqs?

Signature: downsample_fastqs(folder, num_entries=100000)
Docstring:
A function to take a folder with fastqs that harbored more returned 
reads than desired and write new fastqs with fewer reads that have been
downsampled randomly. Downsampling the reads reduces the compute burden 
of evSeq, but also reduces the data available.

Parameters:
-----------
folder: string, path to folder containing fastq files to downsample. 
    fastq files to be downsampled must have the *_R*_*.fastq or 
    *_R*_*.fastq.gz naming convention. This folder will be the location 
    of a new folder called downsampled_reads harboring the downsampled 
    fastqs, which will be named downsampled_ORIGINAL_NAME.fastq.
num_entries: int, default 100000, 
    number of reads to downsample to. This will be used for both forward 
    and reverse read files, which will have been downsampled in t

In [6]:
# Inputs to downsample_fastqs
folder = '../data/multisite_runs/'
num_entries = 100000
downsample_fastqs(folder, num_entries)

Loading forward reads...


403077it [00:53, 7480.04it/s] 


Downsampling forward reads...
Loading reverse reads...


403077it [00:48, 8388.34it/s] 


Downsampling forward reads...
