# bc_parser.py
This notebook parses barcodes from Illumina sequencing data. Input is listed in [data/samplelist.csv](./data/samplelist.csv). Output is stored in the [results/](./results/) directory. For more details, see [README](./README.md).

Imports

In [1]:
from Bio import SeqIO
import gzip
import pandas as pd
import regex as re

Visualization styles

In [2]:
# Dutch-field from https://gist.github.com/afcotroneo/ca9716f755128b5e9b2ed1fe4186f4df
palette = ["#e60049", "#0bb4ff", "#50e991", "#e6d800", "#9b19f5", "#ffa300", "#dc0ab4", "#b3d4ff", "#00bfa0"]

## Input data
### Samples

In [3]:
samplelist = './data/samplelist.csv'
samples_df = pd.read_csv(samplelist, comment='#')
display(samples_df)

| Unnamed: 0 | sample | fastq_file | bc_len | upstream_seq | downstream_seq |
|---|---|---|---|---|---|
| 0 | luketest | /Users/dbacsik/Downloads/luke_test/ZIKV_DMS_NS... | 8 | | CCATGAATTCATTAAAGAGGAGAAAGGTACC |


Check that the sample input conforms to the requirements.  
Requirements:  
* `sample` (sample name) is unique
* Exactly one `fastq.gz` file is listed in the `fastq_file` column for each sample

In [4]:
assert samples_df['sample'].is_unique, \
    "Sample names are not unique"
# str.count treats its pattern as a regex, so escape the dots to match them literally
assert (samples_df['fastq_file'].str.count(r'\.fastq\.gz') == 1).all(), \
    "One (1) fastq.gz file per sample required."

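A note on the file-name check: pandas `Series.str.count` interprets its pattern as a regular expression, so an unescaped `.` matches any character. The sketch below illustrates the difference with the standard-library `re` module. Both file names are made up for illustration and are not taken from the sample list.

```python
import re

# Hypothetical file names, for illustration only.
good_name = "sample_R1_001.fastq.gz"
bad_name = "sample_R1_001_fastq_gz"  # no literal dots at all

# Unescaped pattern: "." matches any character, so the malformed name still counts.
unescaped_hits = len(re.findall(".fastq.gz", bad_name))

# Escaped pattern: only a literal ".fastq.gz" is counted.
escaped_hits_bad = len(re.findall(r"\.fastq\.gz", bad_name))
escaped_hits_good = len(re.findall(r"\.fastq\.gz", good_name))

print(unescaped_hits, escaped_hits_bad, escaped_hits_good)
```

With the escaped pattern, a name that merely resembles `.fastq.gz` no longer passes the check.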
### Reads

In [5]:
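# iterrows() yields (index, row): `sample` is the row index and `df` is a single row (a pandas Series)
for sample, df in samples_df.iterrows():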
    print(f"Parsing barcodes for {df['sample']} from {df['fastq_file']}.")
    print("Search parameters are:")
    print(f"\tBarcode length: {df['bc_len']}")
    if pd.notnull(df['upstream_seq']):
        upstream_seq = df['upstream_seq']
        print(f"\tUpstream sequence: {upstream_seq}")
    else:
        print("\tUpstream sequence: not specified")
        upstream_seq = ''
    if pd.notnull(df['downstream_seq']):
        downstream_seq = df['downstream_seq']
        print(f"\tDownstream sequence: {downstream_seq}")
    else:
        print("\tDownstream sequence: not specified")
        downstream_seq = ''
    

Parsing barcodes for luketest from /Users/dbacsik/Downloads/luke_test/ZIKV_DMS_NS5_EvansLab/data/fastqs/Dantas_wt_AM-F0413_AM-F0314_TATCGTGAGC_GATATCACGC_S4_R1_001.fastq.gz.
Search parameters are:
	Barcode length: 8
	Upstream sequence: not specified
	Downstream sequence: CCATGAATTCATTAAAGAGGAGAAAGGTACC
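One way the search parameters printed above could be turned into a barcode search is sketched below. This is a minimal illustration under stated assumptions, not the notebook's actual parsing code: the helper names `build_barcode_pattern` and `extract_barcode` are hypothetical, the barcode alphabet is assumed to be `A`, `C`, `G`, `T`, `N`, and the read is a toy sequence. It uses the standard-library `re` module rather than the `regex` package imported earlier.

```python
import re


def build_barcode_pattern(bc_len, upstream_seq="", downstream_seq=""):
    """Compile a regex that captures `bc_len` bases between optional flanking sequences."""
    return re.compile(
        f"{re.escape(upstream_seq)}"
        f"(?P<barcode>[ACGTN]{{{bc_len}}})"
        f"{re.escape(downstream_seq)}"
    )


def extract_barcode(read_seq, pattern):
    """Return the first barcode found in `read_seq`, or None if nothing matches."""
    match = pattern.search(read_seq)
    return match.group("barcode") if match else None


# Parameters copied from the sample above; the read itself is made up.
downstream = "CCATGAATTCATTAAAGAGGAGAAAGGTACC"
pattern = build_barcode_pattern(8, "", downstream)
toy_read = "ACGTACGT" + downstream + "TTTT"

barcode = extract_barcode(toy_read, pattern)
no_barcode = extract_barcode("AAAA", pattern)
print(barcode, no_barcode)
```

Passing an empty string for an unspecified flank keeps a single code path for both cases, which matches how `upstream_seq` and `downstream_seq` default to `''` in the loop above. An exact-match pattern like this would discard reads with sequencing errors in the flanking sequence.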
