# Fix reads

Jackalope produces reads with non-standard identifiers: the two reads of a pair do not share an identifier. For example:

* Read 1: `@SH08-001-NC_011083-3048632-R/1`
* Read 2: `@SH08-001-NC_011083-3048396-F/2`

To run snippy, the identifiers of the two reads in each pair must match, apart from the `/1` and `/2` suffixes.
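The matching requirement can be checked directly. The sketch below is a minimal illustration, not part of snippy or Jackalope: the helper names `strip_mate_suffix` and `mates_match` are hypothetical, and it assumes the mate suffix is always a literal `/1` or `/2` at the end of the header.

```python
def strip_mate_suffix(header):
    """Return a FASTQ header without a trailing /1 or /2 mate suffix."""
    if header.endswith(('/1', '/2')):
        return header[:-2]
    return header


def mates_match(header_r1, header_r2):
    """Return True when two headers are identical apart from the mate suffix."""
    return strip_mate_suffix(header_r1) == strip_mate_suffix(header_r2)


# Identifiers as produced by Jackalope: position and strand differ, so no match
print(mates_match('@SH08-001-NC_011083-3048632-R/1',
                  '@SH08-001-NC_011083-3048396-F/2'))  # False

# Headers that differ only in the mate suffix match
print(mates_match('@SH08-001-NC_011083-1/1',
                  '@SH08-001-NC_011083-1/2'))  # True
```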

So I replace every identifier with one that is unique within a file but identical across the two files of a pair. I do this by replacing the position (I think) and the trailing `F`/`R` field with the read number (the order in which the read appears in the file). The identifiers above become:

* Read 1: `@SH08-001-NC_011083-1/1`
* Read 2: `@SH08-001-NC_011083-1/2`
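The renaming is done by `replace-fastq-header.pl`, which is not shown in this notebook. As a rough illustration only, the Python sketch below performs the same kind of rewrite. It assumes every record spans exactly four lines and that each header ends in `-<position>-<F or R>/<1 or 2>`, as in the examples above. The names `HEADER_RE`, `rename_header`, and `rename_fastq` are hypothetical and are not taken from the Perl script.

```python
import re

# Assumed header layout: @<prefix>-<position>-<F|R>/<1|2>
HEADER_RE = re.compile(r'^@(?P<prefix>.+)-(?P<pos>\d+)-(?P<strand>[FR])/(?P<mate>[12])$')


def rename_header(header, read_number):
    """Replace the position and strand fields with the read number."""
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError(f'Unexpected header format: {header!r}')
    return f"@{match['prefix']}-{read_number}/{match['mate']}"


def rename_fastq(lines):
    """Yield FASTQ lines (without newlines), rewriting every header line.

    Assumes four lines per record, so headers sit at indices 0, 4, 8, ...
    """
    for index, line in enumerate(lines):
        line = line.rstrip('\n')
        if index % 4 == 0:
            # Read numbers start at 1 and follow the order in the file
            yield rename_header(line, index // 4 + 1)
        else:
            yield line


print(rename_header('@SH08-001-NC_011083-3048632-R/1', 1))
print(rename_header('@SH08-001-NC_011083-3048396-F/2', 1))
```

Because the read number depends only on a read's order in the file, the two files of a pair receive matching identifiers as long as they list their reads in the same order.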

In [1]:
import glob
import os

# Fix warning about locale unset
os.environ['LANG'] = 'en_US.UTF-8'

files = [os.path.basename(f) for f in glob.glob('output/reads/*.fq.gz')]
!parallel -j 24 -I% 'gzip -d --stdout output/reads/% | perl replace-fastq-header.pl | gzip > output/%' \
    ::: {' '.join(files)}

In [2]:
!pushd output; prename 's/initial_//' *.fq.gz; popd

~/workspace/thesis-data-simulation/jackalope/output ~/workspace/thesis-data-simulation/jackalope
~/workspace/thesis-data-simulation/jackalope


In [3]:
import os
import glob

reference_file = 'input/S_HeidelbergSL476.fasta.gz'

# snippy only runs with uncompressed reference
!gunzip -f -k {reference_file}

reference_file_abs = os.path.abspath('input/S_HeidelbergSL476.fasta')

snippy_out = os.path.abspath('phylogeny')

os.makedirs(snippy_out, exist_ok=True)

# Write one tab-separated line per sample: sample name, R1 path, R2 path
with open(f'{snippy_out}/snippy.fofn', 'w') as snippy_fofn:
    directory = 'output'
    suffix = '_R1.fq.gz'
    for file in glob.glob(f'{directory}/*{suffix}'):
        # Strip the R1 suffix to recover the sample name
        sample = os.path.basename(file)[:-len(suffix)]

        files = [f'{directory}/{sample}_R1.fq.gz', f'{directory}/{sample}_R2.fq.gz']
        files = [os.path.abspath(f) for f in files]
        values = [sample] + files
        snippy_fofn.write('\t'.join(values) + '\n')

In [4]:
!head -n 1 {snippy_out}/snippy.fofn

SH14-013	/home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/output/SH14-013_R1.fq.gz	/home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/output/SH14-013_R2.fq.gz


In [5]:
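# Generate the snippy commands for every isolate listed in the fofn
!conda run --name snippy snippy-multi {snippy_out}/snippy.fofn \
    --reference {reference_file_abs} --cpus 6 > {snippy_out}/snippy-commands-all.sh
# Everything except the last two lines: per-isolate variant-calling commands
!head -n-2 {snippy_out}/snippy-commands-all.sh > {snippy_out}/snippy-commands-variant.sh
# The last two lines, which include the snippy-core command
!tail -n 2 {snippy_out}/snippy-commands-all.sh > {snippy_out}/snippy-commands-core.sh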

Reading: /home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/phylogeny/snippy.fofn
Generating output commands for 60 isolates
Done.
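The `head`/`tail` split above depends on the snippy-core command sitting in the last two lines of the generated script. A possible alternative, sketched here under the assumption that every non-blank line is either a `snippy` or a `snippy-core` command, is to split on the command name instead. The function `split_snippy_commands` is a hypothetical helper, and the example lines are shortened, made-up commands.

```python
def split_snippy_commands(lines):
    """Split generated commands into per-isolate and snippy-core commands."""
    variant, core = [], []
    for line in lines:
        command = line.strip()
        if not command:
            continue  # skip blank lines
        if command.startswith('snippy-core'):
            core.append(command)
        else:
            variant.append(command)
    return variant, core


# Made-up, shortened commands for illustration only
example = [
    "snippy --outdir 'A' --R1 'A_R1.fq.gz' --R2 'A_R2.fq.gz' --reference ref.fasta --cpus 6\n",
    "snippy --outdir 'B' --R1 'B_R1.fq.gz' --R2 'B_R2.fq.gz' --reference ref.fasta --cpus 6\n",
    "snippy-core --ref 'A/ref.fa' A B\n",
    "\n",
]

variant_commands, core_commands = split_snippy_commands(example)
print(len(variant_commands), len(core_commands))
```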



In [6]:
!tail -n 2 {snippy_out}/snippy-commands-variant.sh
!echo '****'
!tail {snippy_out}/snippy-commands-core.sh

snippy --outdir 'SH14-023' --R1 '/home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/output/SH14-023_R1.fq.gz' --R2 '/home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/output/SH14-023_R2.fq.gz' --reference /home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/input/S_HeidelbergSL476.fasta --cpus 6
snippy --outdir 'SH13-005' --R1 '/home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/output/SH13-005_R1.fq.gz' --R2 '/home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/output/SH13-005_R2.fq.gz' --reference /home/CSCScience.ca/apetkau/workspace/thesis-data-simulation/jackalope/input/S_HeidelbergSL476.fasta --cpus 6
****
snippy-core --ref 'SH08-001/ref.fa' SH08-001 SH09-29 SH10-001 SH10-002 SH10-014 SH10-015 SH10-30 SH11-001 SH11-002 SH12-001 SH12-002 SH12-003 SH12-004 SH12-005 SH12-006 SH12-007 SH12-008 SH12-009 SH12-010 SH12-011 SH12-012 SH12-013 SH12-014 SH13-001 SH13-002 SH13-003 SH13-004 SH13-005

In [7]:
# Run variant calling in parallel
!(pushd {snippy_out} && conda run --name snippy \
  parallel -j 12 -a {snippy_out}/snippy-commands-variant.sh && popd) > {snippy_out}/snippy-variant.log 2>&1

In [8]:
# Run core in serial
!(pushd {snippy_out} && conda run --name snippy \
  bash {snippy_out}/snippy-commands-core.sh && popd) > {snippy_out}/snippy-core.log 2>&1

In [9]:
!column -s$'\t' -t phylogeny/core.txt

ID         LENGTH   ALIGNED  UNALIGNED  VARIANT  HET   MASKED  LOWCOV
SH08-001   4888768  4830759  47680      0        2142  0       8187
SH09-29    4888768  4830542  47678      14       2153  0       8395
SH10-001   4888768  4830089  47652      10       2130  0       8897
SH10-002   4888768  4830288  47788      12       2329  0       8363
SH10-014   4888768  4830566  47737      0        2207  0       8258
SH10-015   4888768  4830505  47669      0        2050  0       8544
SH10-30    4888768  4830636  47677      14       2091  0       8364
SH11-001   4888768  4830785  47768      10       2170  0       8045
SH11-002   4888768  4830229  47693      0        2184  0       8662
SH12-001   4888768  4831017  47632      0        2148  0       7971
SH12-002   4888768  4829811  47589      0        2165  0       9203
SH12-003   4888768  4830113  47632      0        2259  0       8764
SH12-004   4888768  4830667  47670      12       2159  0       8272
SH12-005   4888768  4830810  47607      12    