Samples sequenced on Flongle

| user | sample | i5 index | i5 sequence | i7 index | i7 sequence |
| ---- | ------ | -------- | ----------- | -------- | ----------- |
| btyeh | 197 bp barcoded oligo + 250 bp barcoded oligo | A11 | TGCTATTA | F2 | AAGCAACT |
| pbhatta2 | SPIDR RNA | E8 | TTCCAGCT | E8 | GTCTTAGT |
| pbhatta2 | C9 RAP | E3 | TCAGGCTT | E3 | ATCCGACA |
| pbhatta2 | C9 RAP | E4 | GCTGATTC | E4 | CAAGGCGA |

Reference sequences
- i5 adapter: AATGATACGGCGACCACCGAGATCTACAC<8bp i5>ACACTCTTTCCCTACACGACGCTCTTCCGATC
  - i5 rc: GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT<8bp i5 rc>GTGTAGATCTCGGTGGTCGCCGTATCATT
- i7 adapter: CAAGCAGAAGACGGCATACGAGAT<8bp i7>GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
  - i7 rc: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC<8bp i7 rc>ATCTCGTATGCCGTCTTCTGCTTG
- [Nanopore Ligation Adapters](https://community.nanoporetech.com/technical_documents/chemistry-technical-document/v/chtd_500_v1_revaj_07jul2016/ligation-sequencing-kit-family)
  - Top strand: 5'-TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT-3'
  - Bottom strand: 5'-GCAATACGTAACTGAACGAAGTACAGG-3'
    - Reverse complement (RC): CCTGTACTTCGTTCAGTTACGTATTGC
    - Complement (C): CGTTATGCATTGACTTGCTTCATGTCC
    - Reverse (R): GGACATGAAGCAAGTCAATGCATAACG
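
The strand variants listed above can be re-derived programmatically, which is a cheap guard against copy-paste errors. This is a minimal sketch that assumes uppercase A/C/G/T sequences only; the helper names (`complement`, `reverse_complement`, `i5_adapter`, `i7_adapter_rc`) are illustrative and not part of any tool used in this notebook.

```python
# Translation table for DNA complements (uppercase A/C/G/T only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq):
    """Complement each base without changing the order."""
    return seq.translate(_COMPLEMENT)

def reverse_complement(seq):
    """Complement each base and reverse the order."""
    return complement(seq)[::-1]

# Constant flanks copied from the reference sequences above.
I5_LEFT = "AATGATACGGCGACCACCGAGATCTACAC"
I5_RIGHT = "ACACTCTTTCCCTACACGACGCTCTTCCGATC"
I7_LEFT = "CAAGCAGAAGACGGCATACGAGAT"
I7_RIGHT = "GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT"

def i5_adapter(index):
    """Full i5 adapter for a given 8 bp i5 index."""
    return I5_LEFT + index + I5_RIGHT

def i7_adapter_rc(index):
    """Reverse complement of the full i7 adapter for a given 8 bp i7 index."""
    return reverse_complement(I7_LEFT + index + I7_RIGHT)

LA_BOTTOM = "GCAATACGTAACTGAACGAAGTACAGG"
print("RC:", reverse_complement(LA_BOTTOM))
print("C: ", complement(LA_BOTTOM))
print("R: ", LA_BOTTOM[::-1])
print("i5 A11:   ", i5_adapter("TGCTATTA"))
print("i7 F2 rc: ", i7_adapter_rc("AAGCAACT"))
```

The last two lines use the A11/F2 index pair from the sample table; any other row can be substituted.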

In [1]:
import gzip
import os
import re

In [2]:
DIR_PROJECT = '/central/groups/guttman/btyeh/scBarcode'
DIR_DATA = os.path.join(DIR_PROJECT, 'data', '20230831')
DIR_AUX = os.path.join(DIR_PROJECT, 'data_aux', '20230831')

In [3]:
# The basecalling model is recorded in each read header; inspect the first one.
with gzip.open(os.path.join(DIR_DATA, 'all_concat.fastq.gz')) as f:
    line = f.readline()
print('basecalling model:', re.search(r'model_version_id=\S+', line.decode().strip()).group())

basecalling model: model_version_id=dna_r10.4.1_e8.2_5khz_400bps_sup@v4.2.0


In [4]:
n_reads = !unpigz -c "{os.path.join(DIR_DATA, 'all_concat.fastq.gz')}" | wc -l | awk '{{print $$1 / 4}}'
n_reads = int(n_reads[0])
print('number of passing reads from basecaller:', n_reads)

number of passing reads from basecaller: 335717
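
The same count can be obtained without shelling out. This is a minimal sketch that assumes a well-formed FASTQ in which every record spans exactly four lines (the same assumption the `wc -l` / 4 calculation makes); `count_fastq_records` is an illustrative name.

```python
import gzip

def count_fastq_records(path):
    """Count records in a gzipped FASTQ, assuming 4 lines per record."""
    with gzip.open(path, "rb") as f:
        n_lines = sum(1 for _ in f)
    if n_lines % 4:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Calling `count_fastq_records(os.path.join(DIR_DATA, 'all_concat.fastq.gz'))` should agree with the number printed above, although single-threaded `gzip` will likely be slower than `unpigz`.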


# Understanding Nanopore Reads

Note about splitcode options
- `--assign`
  > A second reason to use `--assign` is if you want only certain reads that meet a tag condition to be outputted. This means that all reads that **don’t meet the minFinds/minFindsG criteria** (i.e. aren’t found the minimum number of times specified) or have **zero tags identified** will be considered **unassigned**. Those unassigned reads can be written to separate output files via the `--unassigned` option. If the `--assign` option is not specified, those unassigned reads will still be outputted as normal with the rest of the output. [[FAQ](https://splitcode.readthedocs.io/en/latest/FAQ.html#when-should-i-use-assign-when-running-splitcode)]
- `--no-outb`: do not output final barcode sequences, and do not prepend them to the reads of the first output FASTQ file [[FAQ](https://splitcode.readthedocs.io/en/latest/FAQ.html#how-do-i-specify-the-output-of-the-final-barcodes)]
- `--mapping`: required (if not provided, program will raise an error message)
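
Because the cells below repeat the same invocation with different file stems, the argument list can be assembled in one place. This sketch only reuses flags that already appear in this notebook; the `splitcode_cmd` helper and its parameters are hypothetical, and the file-naming scheme simply mirrors the cells below.

```python
import os

def splitcode_cmd(dir_aux, stem, path_input, names_flag="--mod-names", unassigned=False):
    """Build the splitcode argument list for one config stem (e.g. 'orient')."""
    def aux(prefix, ext):
        # e.g. <dir_aux>/mapping_orient.txt
        return os.path.join(dir_aux, f"{prefix}_{stem}.{ext}")

    cmd = [
        "splitcode", "-c", aux("splitcode-config", "txt"),
        "--nFastqs=1", "--assign", "--out-fasta", names_flag, "--no-outb",
        "--mapping", aux("mapping", "txt"),
        "--summary", aux("summary", "txt"),
    ]
    if unassigned:
        # Write reads that fail the tag conditions to a separate file.
        cmd += ["--unassigned", aux("unassigned", "fasta")]
    return cmd + ["--output", aux("out", "fasta"), path_input]
```

The resulting list can be passed to `subprocess.run(..., check=True)`; pass `names_flag="--loc-names"` for the cells that record match locations.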

## Ligation adapter sequences

Number of reads that begin with the top-strand sequence
- splitcode config: look for the last 15 nt (3' end) of the top-strand sequence within the first 40 nt of the read, allowing a total distance of 3 (at most 2 indels and at most 2 substitutions)
- note that a handful of reads contain 2 copies of the ligation adapter top-strand sequence within the first 40 nt (the `LA,LA` row in the mapping output)
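
To make the matching rule concrete, here is a rough sketch of approximate matching inside a fixed window. It is not splitcode's algorithm: it applies a single Levenshtein (edit) distance budget rather than separate substitution and indel limits, and the function names and defaults are illustrative assumptions.

```python
def best_fuzzy_match(pattern, text):
    """Return (distance, end) for the best approximate occurrence of `pattern`
    anywhere in `text`; `end` is the exclusive end position in `text`.
    Uses the edit-distance recurrence with free start positions (Sellers)."""
    prev = [0] * (len(text) + 1)  # an empty pattern matches anywhere at no cost
    for i, p in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, t in enumerate(text, start=1):
            curr[j] = min(prev[j - 1] + (p != t),  # match or substitution
                          prev[j] + 1,             # pattern base absent from text
                          curr[j - 1] + 1)         # extra base in text
        prev = curr
    end = min(range(len(text) + 1), key=prev.__getitem__)
    return prev[end], end

TOP_STRAND = "TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCT"

def starts_with_top_strand(read, n_tail=15, window=40, max_dist=3):
    """True if the last `n_tail` nt of the top strand occur, within
    `max_dist` edits, in the first `window` nt of the read."""
    dist, _ = best_fuzzy_match(TOP_STRAND[-n_tail:], read[:window])
    return dist <= max_dist

# First 40 nt of the test read shown in the Tests section of this notebook.
print(best_fuzzy_match(TOP_STRAND[-15:], "ATTTATATCCTACTTCGTTCAGTTACGTATTGCTAATGAT"))
```

The dynamic program is quadratic in the window and pattern lengths, which is fine for a 15 nt pattern in a 40 nt window but is not how a production demultiplexer would index its tags.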

In [5]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/all_concat.fastq.gz"
PATH_CONFIG="$DIR_AUX/splitcode-config_ligation-adapter-top.txt"
PATH_OUTPUT="$DIR_AUX/out_ligation-adapter-top.fasta"
PATH_MAPPING="$DIR_AUX/mapping_ligation-adapter-top.txt"
PATH_SUMMARY="$DIR_AUX/summary_ligation-adapter-top.txt"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --mod-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo ""
echo "number of reads beginning with top-strand sequence:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping"
cat "$PATH_MAPPING"

* Using a list of 1 tags (vector size: 1; map size: 632,499; num elements in map: 655,238)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/all_concat.fastq.gz
* processing the reads ...
done 
* processed 335,717 reads, 264,980 reads were assigned



number of reads beginning with top-strand sequence: 264980

Mapping
AAAAAAAAAAAAAAAA	LA	264953
AAAAAAAAAAAAAAAC	LA,LA	27


Number of reads ending with the bottom-strand sequence
- splitcode config: require at least 7 nt of the first 15 nt of the bottom-strand sequence, allowing 1 substitution

In [6]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/all_concat.fastq.gz"
PATH_CONFIG="$DIR_AUX/splitcode-config_ligation-adapter-bottom.txt"
PATH_OUTPUT="$DIR_AUX/out_ligation-adapter-bottom.fasta"
PATH_MAPPING="$DIR_AUX/mapping_ligation-adapter-bottom.txt"
PATH_SUMMARY="$DIR_AUX/summary_ligation-adapter-bottom.txt"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --mod-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo ""
echo "number of reads ending with bottom-strand sequence:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping"
cat "$PATH_MAPPING"

* Using a list of 1 tags (vector size: 2; map size: 8,805; num elements in map: 12,910)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/all_concat.fastq.gz
* processing the reads ...
done 
* processed 335,717 reads, 169,914 reads were assigned



number of reads ending with bottom-strand sequence: 169914

Mapping
AAAAAAAAAAAAAAAA	LA	169914


## 2Puni, 2Pbc sequences

Reads matching both 2Puni and 2Pbc

In [7]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/all_concat.fastq.gz"
PATH_CONFIG="$DIR_AUX/splitcode-config_2Puni-2Pbc-strict.txt"
PATH_OUTPUT="$DIR_AUX/out_2Puni-2Pbc-strict.fasta"
PATH_MAPPING="$DIR_AUX/mapping_2Puni-2Pbc-strict.txt"
PATH_SUMMARY="$DIR_AUX/summary_2Puni-2Pbc-strict.txt"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --mod-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni and 2Pbc sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8

* Using a list of 400 tags (vector size: 400; map size: 13,341,104; num elements in map: 13,752,483)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/all_concat.fastq.gz
* processing the reads ...
done 
* processed 335,717 reads, 116,236 reads were assigned


number of reads with 2Puni and 2Pbc sequences: 116236

Mapping:
2Puni_5prime_1,2Puni_5prime_2,i5_E8,2Puni_3prime_1,2Puni_3prime_2,2Pbc_3prime_2_rc,2Pbc_3prime_1_rc,i7rc_E8,2Pbc_5prime_2_rc,2Pbc_5prime_1_rc	45224
2Pbc_5prime_1,2Pbc_5prime_2,i7_E8,2Pbc_3prime_1,2Pbc_3prime_2,2Puni_3prime_2_rc,2Puni_3prime_1_rc,i5rc_E8,2Puni_5prime_2_rc,2Puni_5prime_1_rc	42189
2Puni_5prime_1,2Puni_5prime_2,i5_E3,2Puni_3prime_1,2Puni_3prime_2,2Pbc_3prime_2_rc,2Pbc_3prime_1_rc,i7rc_E3,2Pbc_5prime_2_rc,2Pbc_5prime_1_rc	9545
2Puni_5prime_1,2Puni_5prime_2,i5_E2,2Puni_3prime_1,2Puni_3prime_2,2Pbc_3prime_2_rc,2Pbc_3prime_1_rc,i7rc_E2,2Pbc_5prime_2_rc,2Pbc_5prime_1_rc	7061
2Pbc_5prime_1,2Pbc_5prime_2,i7_E3,2Pbc_3prime_1,2Pbc_3prime_2,2Puni_3prime_2_rc,2Puni_3prime_1_rc,i5rc_E3,2Puni_5prime_2_rc,2Puni_5prime_1_rc	4116
2Puni_5prime_1,2Puni_5prime_2,i5_A11,2Puni_3prime_1,2Puni_3prime_2,2Pbc_3prime_2_rc,2Pbc_3prime_1_rc,i7rc_F2,2Pbc_5prime_2_rc,2Pbc_5prime_1_rc	2990
2Pbc_5prime_1,2Pbc_5prime_2,i7_F2,2Pbc_3prime_1,2Pb

In [9]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/all_concat.fastq.gz"
PATH_CONFIG="$DIR_AUX/splitcode-config_orient.txt"
PATH_OUTPUT="$DIR_AUX/out_orient.fasta"
PATH_MAPPING="$DIR_AUX/mapping_orient.txt"
PATH_SUMMARY="$DIR_AUX/summary_orient.txt"
PATH_UNASSIGNED="$DIR_AUX/unassigned_orient.fasta"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --mod-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --unassigned "$PATH_UNASSIGNED" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni and 2Pbc sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8

* Using a list of 8 tags (vector size: 8; map size: 7,496,328; num elements in map: 7,735,466)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/all_concat.fastq.gz
* processing the reads ...
done 
* processed 335,717 reads, 203,160 reads were assigned


number of reads with 2Puni and 2Pbc sequences: 203160

Mapping:
2Puni_3prime_2,2Pbc_3prime_2_rc	110619
2Pbc_3prime_2,2Puni_3prime_2_rc	92536
2Pbc_3prime_2,2Pbc_3prime_2,2Puni_3prime_2_rc	3
2Puni_3prime_2,2Puni_3prime_2,2Pbc_3prime_2_rc	1
2Pbc_3prime_2,2Puni_3prime_2,2Pbc_3prime_2_rc	1
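
The mapping rows above can be collapsed into per-orientation totals. This is a small sketch that assumes each row is `tags<TAB>count` (the two columns printed by `cut -f 2,3`) and that the first tag in a row indicates which adapter the read starts with; `tally_orientations` is an illustrative helper, and the rows are copied from the output above.

```python
from collections import Counter

def tally_orientations(rows):
    """Sum read counts by the adapter family of the first tag in each row."""
    totals = Counter()
    for row in rows:
        tags, count = row.rsplit("\t", 1)
        first_tag = tags.split(",")[0]
        family = first_tag.split("_")[0]  # "2Puni" or "2Pbc"
        totals[family + "-first"] += int(count)
    return totals

rows = [
    "2Puni_3prime_2,2Pbc_3prime_2_rc\t110619",
    "2Pbc_3prime_2,2Puni_3prime_2_rc\t92536",
    "2Pbc_3prime_2,2Pbc_3prime_2,2Puni_3prime_2_rc\t3",
    "2Puni_3prime_2,2Puni_3prime_2,2Pbc_3prime_2_rc\t1",
    "2Pbc_3prime_2,2Puni_3prime_2,2Pbc_3prime_2_rc\t1",
]
totals = tally_orientations(rows)
n_assigned = sum(totals.values())
for family, count in totals.most_common():
    print(f"{family}: {count} ({count / n_assigned:.1%})")
```

The totals sum to the 203,160 assigned reads reported by splitcode, so every assigned read falls into one of the two orientations under this rule.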


In [12]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/all_concat.fastq.gz"
PATH_CONFIG="$DIR_AUX/splitcode-config_orient-long.txt"
PATH_OUTPUT="$DIR_AUX/out_orient-long.fasta"
PATH_MAPPING="$DIR_AUX/mapping_orient-long.txt"
PATH_SUMMARY="$DIR_AUX/summary_orient-long.txt"
PATH_UNASSIGNED="$DIR_AUX/unassigned_orient-long.fasta"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --loc-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --unassigned "$PATH_UNASSIGNED" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni and 2Pbc sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8

* Using a list of 8 tags (vector size: 8; map size: 783,040; num elements in map: 786,487)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/all_concat.fastq.gz
* processing the reads ...
done 
* processed 335,717 reads, 300,683 reads were assigned


number of reads with 2Puni and 2Pbc sequences: 300683

Mapping:
2Puni_5prime,2Puni_3prime,2Pbc_3prime_rc,2Pbc_5prime_rc	57280
2Pbc_5prime,2Pbc_3prime,2Puni_3prime_rc,2Puni_5prime_rc	44056
2Pbc_3prime_rc,2Pbc_5prime_rc	27007
2Pbc_5prime	26187
2Pbc_5prime,2Pbc_3prime,2Puni_3prime_rc	25848
2Puni_5prime	22038
2Puni_3prime_rc,2Puni_5prime_rc	19616
2Puni_5prime,2Puni_3prime	18206


# Tests

In [31]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/test_orient.fastq"
PATH_CONFIG="$DIR_AUX/splitcode-config_test.txt"
PATH_OUTPUT="$DIR_AUX/out_test.fasta"
PATH_MAPPING="$DIR_AUX/mapping_test.txt"
PATH_SUMMARY="$DIR_AUX/summary_test.txt"
PATH_UNASSIGNED="$DIR_AUX/unassigned_test.fasta"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --loc-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --unassigned "$PATH_UNASSIGNED" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni and 2Pbc sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8

cat "$PATH_OUTPUT"

* Using a list of 2 tags (vector size: 2; map size: 975,875; num elements in map: 1,010,913)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/test_orient.fastq
* processing the reads ...
done 
* processed 1 reads, 1 reads were assigned


number of reads with 2Puni and 2Pbc sequences: 1

Mapping:
2Puni_3prime_2,2Pbc_3prime_2_rc	1
>67d75a6c-fd5d-4803-86c9-f0bdf8ae8a20 LX:Z:2Puni_3prime_2:0,87-103,2Pbc_3prime_2_rc:0,163-179
ATTTATATCCTACTTCGTTCAGTTACGTATTGCTAATGATACGGCGACCACCGAGATCTACACTGCTATTAACACTCTTTCCCTACACGACGCTCTTCCGATCTAAGGTAGCTAAATTGCTCCAAGTCAAAGATACTTCTGTAGGCAGTTGTCAACGCATAGAAGATCGGAAGACATAGAAAGTCAAGCTAGATTCCACGAAGAGTTGTAGAGGTAGCAGGAGATTTCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTGCTTATCTCGTATGCCGTCTTCTGCTTGAGCAATACGTGG


In [14]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/test_orient.fastq"
PATH_CONFIG="$DIR_AUX/splitcode-config_test.txt"
PATH_OUTPUT="$DIR_AUX/out_test.fasta"
PATH_MAPPING="$DIR_AUX/mapping_test.txt"
PATH_SUMMARY="$DIR_AUX/summary_test.txt"
PATH_UNASSIGNED="$DIR_AUX/unassigned_test.fasta"

cat "$PATH_OUTPUT"

>67d75a6c-fd5d-4803-86c9-f0bdf8ae8a20 LX:Z:2Puni_3prime_1:0,71-87,2Puni_3prime_2:0,87-103,2Pbc_3prime_2_rc:0,163-179,2Pbc_3prime_1_rc:0,240-257
ATTTATATCCTACTTCGTTCAGTTACGTATTGCTAATGATACGGCGACCACCGAGATCTACACTGCTATTAACACTCTTTCCCTACACGACGCTCTTCCGATCTAAGGTAGCTAAATTGCTCCAAGTCAAAGATACTTCTGTAGGCAGTTGTCAACGCATAGAAGATCGGAAGACATAGAAAGTCAAGCTAGATTCCACGAAGAGTTGTAGAGGTAGCAGGAGATTTCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTGCTTATCTCGTATGCCGTCTTCTGCTTGAGCAATACGTGG
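
With `--loc-names`, each hit is recorded in the header as `tag:N,start-end` (the `0` appears to be the input file number), which makes it easy to pull out neighboring sequence. This sketch assumes 0-based, end-exclusive coordinates and that the 8 nt immediately upstream of `2Puni_3prime_1` is the i5 index; `parse_lx` is an illustrative helper, and only the first 103 nt of the read above are reproduced.

```python
import re

# One hit looks like "2Puni_3prime_1:0,71-87".
LX_HIT = re.compile(r"([^,:]+):(\d+),(\d+)-(\d+)")

def parse_lx(header):
    """Return [(tag, file_number, start, end), ...] from the LX:Z: field."""
    field = header.split("LX:Z:", 1)[1].split()[0]
    return [(tag, int(f), int(s), int(e)) for tag, f, s, e in LX_HIT.findall(field)]

header = (">67d75a6c-fd5d-4803-86c9-f0bdf8ae8a20 LX:Z:2Puni_3prime_1:0,71-87,"
          "2Puni_3prime_2:0,87-103,2Pbc_3prime_2_rc:0,163-179,2Pbc_3prime_1_rc:0,240-257")
# First 103 nt of the read shown above.
read_prefix = ("ATTTATATCCTACTTCGTTCAGTTACGTATTGCTAATGATACGGCGACCACCGAGATCTACAC"
               "TGCTATTAACACTCTTTCCCTACACGACGCTCTTCCGATC")

hits = parse_lx(header)
start = next(s for tag, _, s, _ in hits if tag == "2Puni_3prime_1")
print("i5 index:", read_prefix[start - 8:start])
```

The extracted index can then be compared against the i5 sequences in the sample table to assign the read to a sample.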


Reads matching 2Puni

In [69]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/test_2Puni.fastq"
PATH_CONFIG="$DIR_AUX/splitcode-config_2Puni-12.txt"
PATH_OUTPUT="$DIR_AUX/out_test-2Puni-12.fasta"
PATH_MAPPING="$DIR_AUX/mapping_test-2Puni-12.txt"
PATH_SUMMARY="$DIR_AUX/summary_test-2Puni-12.txt"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --loc-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8
cat "$PATH_OUTPUT"

* Using a list of 98 tags (vector size: 98; map size: 416,239; num elements in map: 441,712)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/test_2Puni.fastq
* processing the reads ...
done 
* processed 1 reads, 1 reads were assigned


number of reads with 2Puni sequences: 1

Mapping:
2Puni_5prime_2,i5_E3,2Puni_3prime_1	1
>eaf97d93-2a95-4a53-8764-6ff71f263314 LX:Z:2Puni_5prime_2:0,43-55,i5_E3:0,55-63,2Puni_3prime_1:0,63-71
AACCTATTGGTTCGGTTGGTCTTGCTAATGATACGGCGACCACCGAGATCTACACTCAGGCTTACACTCTTTCCCTACACGACGCTCTTCCGATCTGACGCTCTTCCGATCTGACGCTCTTCCGATCTGACGCTCTTCCGATCTGACGCTCTTCCGATCTGACGCTCTTCCCGATCTGACGCTCTTCCGATCTGACGCTCTTCCGATCTGACGGAAGAGCACACGTCTGAACTCCAGTCACTGTCGGATATCTCATTATGCCATC


In [70]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/test_2Puni.fastq"
PATH_CONFIG="$DIR_AUX/splitcode-config_2Puni-8.txt"
PATH_OUTPUT="$DIR_AUX/out_test-2Puni-8.fasta"
PATH_MAPPING="$DIR_AUX/mapping_test-2Puni-8.txt"
PATH_SUMMARY="$DIR_AUX/summary_test-2Puni-8.txt"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --loc-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8
cat "$PATH_OUTPUT"

* Using a list of 98 tags (vector size: 98; map size: 129,039; num elements in map: 145,675)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/test_2Puni.fastq
* processing the reads ...
done 
* processed 1 reads, 0 reads were assigned


number of reads with 2Puni sequences: 0

Mapping:


In [75]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/test_2Puni.fastq"
PATH_CONFIG="$DIR_AUX/splitcode-config_test.txt"
PATH_OUTPUT="$DIR_AUX/out_test-2Puni-8.fasta"
PATH_MAPPING="$DIR_AUX/mapping_test-2Puni-8.txt"
PATH_SUMMARY="$DIR_AUX/summary_test-2Puni-8.txt"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --loc-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8
cat "$PATH_OUTPUT"

* Using a list of 2 tags (vector size: 2; map size: 67,060; num elements in map: 73,663)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/test_2Puni.fastq
* processing the reads ...
done 
* processed 1 reads, 1 reads were assigned


number of reads with 2Puni sequences: 1

Mapping:
2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,i5_E3,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2,2Puni_5prime_2	1
>eaf97d93-2a95-4a53-8764-6ff71f263314 LX:Z:2Puni_5prime_2:0,27-34,2Puni_5prime_2:0,38-44,2Puni_5prime_2:0,47-55,i5_E3:0,55-63,2Puni_5prime_2:0,63-69,2Puni_5prime_2:0,70-80,2Puni_5prime_2:0,84-91,2Puni_5prime_2:0,91-101,2Puni_5prime_2:0,101-107,2Puni_5prime_2:0,107-117,2Puni_5prime_2:0,117-123,2Puni_5prime_2:0,123-133,2Puni_5prime_2:0,133-139,2Puni_5prime_2:0,139-149,2Puni_5prime_2:0,149-155,2Puni_5prime_2:0,155-165,2Puni_5prime_2:0,165-172,2Puni_5prime_2:0,172-182,2Puni_5prime_2:0,182-188,2Puni_5prime_2:0,188-198,2Puni_5prime_2:0,198-204,2Puni_5p

In [63]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/all_concat.fastq.gz"
PATH_CONFIG="$DIR_AUX/splitcode-config_2Puni-12.txt"
PATH_OUTPUT="$DIR_AUX/out_2Puni-12.fasta"
PATH_MAPPING="$DIR_AUX/mapping_2Puni-12.txt"
PATH_SUMMARY="$DIR_AUX/summary_2Puni-12.txt"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --mod-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8

* Using a list of 98 tags (vector size: 98; map size: 416,239; num elements in map: 441,712)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/all_concat.fastq.gz
* processing the reads ...
done 
* processed 335,717 reads, 132,290 reads were assigned


number of reads with 2Puni sequences: 132290

Mapping:
2Puni_5prime_2,i5_E8,2Puni_3prime_1	90165
2Puni_5prime_2,i5_E3,2Puni_3prime_1	20001
2Puni_5prime_2,i5_E2,2Puni_3prime_1	14905
2Puni_5prime_2,i5_A11,2Puni_3prime_1	6184
2Puni_5prime_2,i5_E2,2Puni_3prime_1,2Puni_5prime_2	342
2Puni_5prime_2,i5_E3,2Puni_3prime_1,2Puni_5prime_2	276
2Puni_5prime_2,i5_C11,2Puni_3prime_1	145
2Puni_5prime_2,i5_E8,2Puni_3prime_1,2Puni_5prime_2	104


In [64]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/all_concat.fastq.gz"
PATH_CONFIG="$DIR_AUX/splitcode-config_2Puni-8.txt"
PATH_OUTPUT="$DIR_AUX/out_2Puni-8.fasta"
PATH_MAPPING="$DIR_AUX/mapping_2Puni-8.txt"
PATH_SUMMARY="$DIR_AUX/summary_2Puni-8.txt"

rm -f "$PATH_OUTPUT" "$PATH_MAPPING" "$PATH_SUMMARY"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --mod-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
echo -e "\nMapping:"
sort -n -r -k 3 "$PATH_MAPPING" | cut -f 2,3 | head -n 8

* Using a list of 98 tags (vector size: 98; map size: 129,039; num elements in map: 145,675)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/all_concat.fastq.gz
* processing the reads ...
done 
* processed 335,717 reads, 1,322 reads were assigned


number of reads with 2Puni sequences: 1322

Mapping:
2Puni_5prime_2,i5_E8,2Puni_3prime_1,2Puni_5prime_2	442
2Puni_5prime_2,i5_E3,2Puni_3prime_1,2Puni_5prime_2	227
2Puni_5prime_2,i5_E2,2Puni_3prime_1,2Puni_5prime_2	184
2Puni_5prime_2,i5_A11,2Puni_3prime_1,2Puni_5prime_2	45
2Puni_5prime_2,i5_H2,2Puni_3prime_1,2Puni_5prime_2	40
2Puni_5prime_2,i5_G2,2Puni_3prime_1,2Puni_5prime_2	27
2Puni_5prime_2,i5_H12,2Puni_3prime_1,2Puni_5prime_2	21
2Puni_5prime_2,i5_H8,2Puni_3prime_1,2Puni_5prime_2	21


Reads matching both 2Puni and 2Pbc (strict config on test.fastq)

In [35]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

cd "$DIR_AUX"

PATH_INPUT="$DIR_DATA/test.fastq"
PATH_CONFIG="$DIR_AUX/splitcode-config_2Puni-2Pbc-strict.txt"
PATH_OUTPUT="$DIR_AUX/out_test.fasta"
PATH_MAPPING="$DIR_AUX/mapping_test.txt"
PATH_SUMMARY="$DIR_AUX/summary_test.txt"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --assign --out-fasta --mod-names --no-outb \
    --mapping "$PATH_MAPPING" --summary "$PATH_SUMMARY" --output "$PATH_OUTPUT" \
    "$PATH_INPUT"
echo "number of reads with 2Puni and 2Pbc sequences:" $(wc -l "$PATH_OUTPUT" | awk '{print $1 / 2}')
# rm "$PATH_OUTPUT"

* Using a list of 400 tags (vector size: 400; map size: 438,072; num elements in map: 447,631)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/test.fastq
* processing the reads ...
done 
* processed 2 reads, 1 reads were assigned


number of reads with 2Puni and 2Pbc sequences: 1


Separate sequences starting with i5 vs. i7 sequence

In [10]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

PATH_INPUT="$DIR_DATA/all_concat.fastq.gz"
PATH_CONFIG="$DIR_AUX/splitcode-config_i5-vs-i7.txt"
PATH_SELECT="$DIR_AUX/splitcode-select_i5-vs-i7.txt"

cd "$DIR_AUX"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --keep-grp="$PATH_SELECT" --out-fasta --no-output --unassigned="unmapped.fasta" "$PATH_INPUT"

* Using a list of 8 tags (vector size: 8; map size: 18,106,818; num elements in map: 18,247,768)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data/20230831/all_concat.fastq.gz
* processing the reads ...
done 
* processed 335,717 reads


Check for ligation adapter

In [12]:
%%bash -s {DIR_DATA} {DIR_AUX}
DIR_DATA="$1"
DIR_AUX="$2"
source ~/.bashrc

PATH_INPUT="$DIR_AUX/i5_0.fasta"
PATH_CONFIG="$DIR_AUX/splitcode-config_ligation-adapter.txt"

cd "$DIR_AUX"

splitcode -c "$PATH_CONFIG" --nFastqs=1 --out-fasta --no-output --summary='i5_ligation.summary' "$PATH_INPUT"

* Using a list of 1 tags (vector size: 1; map size: 4,115,826; num elements in map: 4,209,561)
* will process sample 1: /central/groups/guttman/btyeh/scBarcode/data_aux/20230831/i5_0.fasta
* processing the reads ...bash: line 10: 149949 Segmentation fault      splitcode -c "$PATH_CONFIG" --nFastqs=1 --out-fasta --no-output --summary='i5_ligation.summary' "$PATH_INPUT"


CalledProcessError: Command 'b'DIR_DATA="$1"\nDIR_AUX="$2"\nsource ~/.bashrc\n\nPATH_INPUT="$DIR_AUX/i5_0.fasta"\nPATH_CONFIG="$DIR_AUX/splitcode-config_ligation-adapter.txt"\n\ncd "$DIR_AUX"\n\nsplitcode -c "$PATH_CONFIG" --nFastqs=1 --out-fasta --no-output --summary=\'i5_ligation.summary\' "$PATH_INPUT"\n'' returned non-zero exit status 139.

Observations
- Essentially all sequences starting with i5 are prefixed with the top-strand ligation adapter sequence

1. How many ways the reads start


Check for all 4 orientations of my sequences

| start | end |


- i5, i7 r
- i5, i7 rc
- i5 