In [1]:
%reload_ext watermark
%matplotlib inline
from os.path import exists

from metapool.metapool import *
from metapool import (validate_plate_metadata, assign_emp_index, make_sample_sheet, KLSampleSheet, parse_prep, validate_and_scrub_sample_sheet, generate_qiita_prep_file)
%watermark -i -v -iv -m -h -p metapool,sample_sheet,openpyxl -u

Last updated: 2023-03-07T22:23:33.649315-08:00

Python implementation: CPython
Python version       : 3.9.16
IPython version      : 8.11.0

metapool    : 0+untagged.133.g7d6d0f4.dirty
sample_sheet: 0.13.0
openpyxl    : 3.1.1

Compiler    : Clang 14.0.6 
OS          : Darwin
Release     : 21.6.0
Machine     : x86_64
Processor   : i386
CPU cores   : 16
Architecture: 64bit

Hostname: Applejack.lan

re        : 2.2.1
json      : 2.0.9
seaborn   : 0.12.2
matplotlib: 3.7.0
numpy     : 1.24.2
pandas    : 1.5.3



# Knight Lab Amplicon Mapping File (Pre-Preparation File) Generator

### What is it?

This Jupyter notebook automatically generates mapping files for amplicon sequencing. It can also merge mapping files from additional PCR preps.


### Here's how it should work.

You'll start out with a **384-well plate map** (platemap.tsv) in a 384-well compressed list format that indicates sample name, well IDs, project plates, etc.

You can use this google sheet template to generate your plate map:

https://docs.google.com/spreadsheets/d/1JCfnGO-6RRFuhOB1yVGMSj5qRFjiiUAUzprsw6IhugY/edit#gid=0

Next you'll enter processing information (project/plate info, plating, extraction, PCR), automatically assign EMP barcodes, and then generate a **mapping file** that can be used in combination with the rest of the sequence processing pipeline.

**Please designate what kind of amplicon sequencing you want to perform:**

In [2]:
seq_type = '16S'
#options are ['16S', '18S', 'ITS']

## Step 1: Input Plate Map

**Enter the correct path to the plate map file**. All subsequent information will be related to this plate map. Plate maps should be in .tsv format.

In [3]:
plate_map_fp = './test_data/amplicon/compressed-map.tsv'
# plate_map_fp = './2023-02-20_ABTX_204_207_210_215 - Map.tsv'

if not exists(plate_map_fp):
    print("Error: %s is not a path to a valid file" % plate_map_fp)

**Read in the plate map**. It should look something like this:

```
Sample	Row	Col	Blank
GLY_01_012	A	1	False
GLY_14_034	B	1	False
GLY_11_007	C	1	False
GLY_28_018	D	1	False
GLY_25_003	E	1	False
GLY_06_106	F	1	False
GLY_07_011	G	1	False
GLY_18_043	H	1	False
GLY_28_004	I	1	False
```

**Make sure there are no duplicate IDs.** If any two samples share a name, an error will be thrown and you won't be able to generate a sample sheet.
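
If you want to check for duplicates before running the cell below, a minimal standard-library sketch follows. It assumes only that the plate map is tab-separated with a `Sample` column, as in the example above. The helper name `find_duplicate_samples` and the three-row example are illustrative and not part of metapool.

```python
import csv
import io
from collections import Counter


def find_duplicate_samples(tsv_text):
    """Return the sample names that appear more than once in a plate map.

    Hypothetical helper for illustration; not a metapool function.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    counts = Counter(row['Sample'] for row in reader)
    return sorted(name for name, n in counts.items() if n > 1)


# Illustrative plate map in which GLY_01_012 was entered twice by mistake
example = (
    "Sample\tRow\tCol\tBlank\n"
    "GLY_01_012\tA\t1\tFalse\n"
    "GLY_14_034\tB\t1\tFalse\n"
    "GLY_01_012\tC\t1\tFalse\n"
)
print(find_duplicate_samples(example))  # ['GLY_01_012']
```

To check a real file, pass its contents instead, for example `find_duplicate_samples(open(plate_map_fp).read())`.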

In [4]:
# To validate sample names against Qiita, uncomment the call below and use it
# instead of the default call. Contact Antonio or Charlie for path_to_qiita_config_file.
with open(plate_map_fp, 'r') as plate_map_file:
    # plate_df = read_plate_map_csv(plate_map_file, qiita_oauth2_conf_fp='path_to_qiita_config_file')
    plate_df = read_plate_map_csv(plate_map_file)

plate_df.head()



Unnamed: 0,Sample,Row,Col,Blank,Project Plate,Project Name,Compressed Plate Name,Well
0,X00180471,A,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,A1
1,X00180199,C,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,C1
2,X00179789,E,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,E1
3,X00180201,G,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,G1
4,X00180464,I,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,I1


## Step 2: Input Processing Information & Assign Barcodes According to Primer Plate

This portion of the notebook will assign a barcode to each sample according to the primer plate number. Additionally, you will add sample plate information and processing information that is obtained during plating, extraction, and PCR.

As inputs, it requires:
1. A plate map dataframe (from previous step)
2. The Primer Plate # for each plate (the most important input, because it determines which **EMP barcodes** are assigned to that plate)
3. Project and extraction plate information
4. Processing information, or preparation metadata, for each plate

The workflow then:
1. Joins the processing information & barcode assignments with the plate metadata
2. Assigns indices per sample
3. Generates mapping files

## Enter and validate the PCR Primers and additional processing information

- It is absolutely critical that the `Primer Plate #` and the `Plate Position` are accurate. `Primer Plate #` determines which EMP barcodes will be used for this plate. `Plate Position` determines the physical location of the plate. Make sure this input is consistent with what is recorded in the processing progress!
- If you are plating fewer than four plates, remove the metadata for each unused plate by deleting the text between its curly braces.
- For missing fields, write not applicable between the single quotes, for example `'not applicable'`.
- To enter a plate, copy and paste the contents of one of the plates below and edit the values.
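
The rules above can be sketched as a quick pre-check on the list of dictionaries. This is only an illustration and not what `validate_plate_metadata` does internally. The helper name `check_plate_entries`, the `REQUIRED_KEYS` subset, and the rule that each `Plate Position` may appear only once are assumptions made for this sketch.

```python
# Hypothetical subset of keys to check; the real validator may require more.
REQUIRED_KEYS = {'Plate Position', 'Primer Plate #', 'Sample Plate', 'Project_Name'}
VALID_POSITIONS = {'1', '2', '3', '4'}


def check_plate_entries(entries):
    """Return a list of human-readable problems found in the plate entries."""
    problems = []
    seen_positions = set()
    for i, entry in enumerate(entries, start=1):
        missing = sorted(REQUIRED_KEYS - set(entry))
        if missing:
            problems.append(f'entry {i}: missing keys {missing}')
        position = entry.get('Plate Position')
        if position not in VALID_POSITIONS:
            problems.append(f'entry {i}: Plate Position must be 1-4, got {position!r}')
        elif position in seen_positions:
            # Assumption: two plates cannot share a physical position
            problems.append(f'entry {i}: Plate Position {position} used twice')
        else:
            seen_positions.add(position)
        for key, value in entry.items():
            # 'Original Name' is intentionally left empty in this notebook
            if value == '' and key != 'Original Name':
                problems.append(f"entry {i}: '{key}' is empty; use 'not applicable'")
    return problems


good = {'Plate Position': '1', 'Primer Plate #': '1',
        'Sample Plate': 'ABTX_11052_Plate_204', 'Project_Name': 'ABTX_11052',
        'TM300 8 Tool': 'not applicable', 'Original Name': ''}
bad = dict(good, **{'TM300 8 Tool': ''})  # same position, one empty field

print(check_plate_entries([good]))       # []
print(len(check_plate_entries([good, bad])))  # 2
```

An empty result means none of the sketched rules were violated; the authoritative check is still the `validate_plate_metadata` call in the next cell.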

In [5]:
_metadata = [
    {
        # top left plate
        'Plate Position': '1',
        'Primer Plate #': '1',
        
        'Sample Plate': 'ABTX_11052_Plate_204', # PROJECTNAME_QIITA_ID_Plate_#
        'Project_Name': 'ABTX_11052', # PROJECTNAME_QIITAID
        'center_project_name': 'Rob ABTX', # what the wetlab calls the project
        'experiment_design_description': '16S sequencing of antibiotic time series', # brief but specific project description
        
        'Plating': 'SF', # initials
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM10 8 Tool': '865HS8',
        'Processing Robot': 'Echo550',
        'TM300 8 Tool': 'not applicable',
        'TM50 8 Tool': 'not applicable',
        'instrument_model': 'Illumina MiSeq',
        'run_date': '2023-03-02', # date of MiSeq run
        'Original Name': '' # leave empty
    },
    {
        # top right plate
        'Plate Position': '2',
        'Primer Plate #': '2',
    
        'Sample Plate': 'ABTX_11052_Plate_207',
        'Project_Name': 'ABTX_11052',
        'center_project_name': 'Rob ABTX',
        'experiment_design_description': '16S sequencing of antibiotic time series',

        'Plating': 'SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17',
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM10 8 Tool': '865HS8',
        'Processing Robot': 'Echo550',
        'TM300 8 Tool': 'not applicable',
        'TM50 8 Tool': 'not applicable',
        'instrument_model': 'Illumina MiSeq',
        'run_date': '2023-03-02',
        'Original Name': ''
    },
    {
        # bottom left plate
        'Plate Position': '3',
        'Primer Plate #': '3',
        
        'Sample Plate': 'ABTX_11052_Plate_210',
        'Project_Name': 'ABTX_11052',
        'center_project_name': 'Rob ABTX',
        'experiment_design_description': '16S sequencing of antibiotic time series',
        
        'Plating': 'SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17',
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM10 8 Tool': '865HS8',
        'Processing Robot': 'Echo550',
        'TM300 8 Tool': 'not applicable',
        'TM50 8 Tool': 'not applicable',
        'instrument_model': 'Illumina MiSeq',
        'run_date': '2023-03-02',
        'Original Name': ''
    },
    {
        # bottom right plate
        'Plate Position': '4',
        'Primer Plate #': '4',
        
        'Sample Plate': 'ABTX_11052_Plate_215',
        'Project_Name': 'ABTX_11052',
        'center_project_name': 'Rob ABTX',
        'experiment_design_description': '16S sequencing of antibiotic time series',
        
        'Plating': 'SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17',
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM10 8 Tool': '865HS8',
        'Processing Robot': 'Echo550',
        'TM300 8 Tool': 'not applicable',
        'TM50 8 Tool': 'not applicable',
        'instrument_model': 'Illumina MiSeq',
        'run_date': '2023-03-02',
        'Original Name': ''
    },
]

plate_metadata = validate_plate_metadata(_metadata)
plate_metadata

Unnamed: 0,Plate Position,Primer Plate #,Sample Plate,Project_Name,center_project_name,experiment_design_description,Plating,Extraction Kit Lot,Extraction Robot,TM1000 8 Tool,Primer Date,MasterMix Lot,Water Lot,TM10 8 Tool,Processing Robot,TM300 8 Tool,TM50 8 Tool,instrument_model,run_date,Original Name
0,1,1,ABTX_11052_Plate_204,ABTX_11052,Rob ABTX,16S sequencing of antibiotic time series,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,RNBJ0628,865HS8,Echo550,not applicable,not applicable,Illumina MiSeq,2023-03-02,
1,2,2,ABTX_11052_Plate_207,ABTX_11052,Rob ABTX,16S sequencing of antibiotic time series,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,RNBJ0628,865HS8,Echo550,not applicable,not applicable,Illumina MiSeq,2023-03-02,
2,3,3,ABTX_11052_Plate_210,ABTX_11052,Rob ABTX,16S sequencing of antibiotic time series,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,RNBJ0628,865HS8,Echo550,not applicable,not applicable,Illumina MiSeq,2023-03-02,
3,4,4,ABTX_11052_Plate_215,ABTX_11052,Rob ABTX,16S sequencing of antibiotic time series,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,RNBJ0628,865HS8,Echo550,not applicable,not applicable,Illumina MiSeq,2023-03-02,


The `Plate Position` and `Primer Plate #` allow us to figure out which wells are associated with each of the EMP barcodes.
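
To make the relationship concrete, here is a minimal sketch of how a well on the compressed 384-well plate could be mapped back to a plate position and a 96-well ID. It assumes the interleaved layout suggested by the outputs in this notebook (A1, C1, E1 on the 384-well plate map to A1, B1, C1 at position 1, and H24 maps to position 4), with positions 2 and 3 inferred from the "top right" and "bottom left" comments in the metadata. The helper `split_compressed_well` is hypothetical and not part of metapool.

```python
import string

ROWS_384 = string.ascii_uppercase[:16]  # rows A..P of a 384-well plate


def split_compressed_well(well):
    """Map a 384-well ID such as 'C1' to (plate position, 96-well ID).

    Assumed layout: within every 2x2 block of wells, the top-left well
    belongs to position 1, top-right to 2, bottom-left to 3 and
    bottom-right to 4.
    """
    row_idx = ROWS_384.index(well[0])  # raises ValueError for rows past P
    col_idx = int(well[1:]) - 1
    if not 0 <= col_idx < 24:
        raise ValueError(f'column out of range in well {well!r}')
    position = 1 + 2 * (row_idx % 2) + (col_idx % 2)
    well_96 = f'{ROWS_384[row_idx // 2]}{col_idx // 2 + 1}'
    return position, well_96


print(split_compressed_well('C1'))   # (1, 'B1')
print(split_compressed_well('H24'))  # (4, 'D12')
```

The `Primer Plate #` entered for that position then selects which set of EMP barcodes is applied to the resulting 96-well IDs.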

In [6]:
if plate_metadata is not None:
    plate_df = assign_emp_index(plate_df, plate_metadata, seq_type).reset_index()

    plate_df.head()
else:
    print('Error: Please fix the errors in the previous cell')

As you can see in the table below, the resulting table is now associated with the corresponding EMP barcodes (`Golay Barcode`, `Forward Primer Linker`, etc.) and the plating metadata (`Primer Plate #`, `Primer Date`, `Water Lot`, etc.).

In [7]:
plate_df.head()

Unnamed: 0,index,Sample,Row,Col,Blank,Project Plate,Project Name,Compressed Plate Name,Well,Plate Position,...,Original Name,Plate,EMP Primer Plate Well,Name,Illumina 5prime Adapter,Golay Barcode,Forward Primer Pad,Forward Primer Linker,515FB Forward Primer (Parada),Primer For PCR
0,0,X00180471,A,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,A1,1,...,,1,A1,515rcbc0,AATGATACGGCGACCACCGAGATCTACACGCT,AGCCTTCGTCGC,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTAGCCTTCGTCGCTA...
1,1,X00180199,C,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,C1,1,...,,1,B1,515rcbc12,AATGATACGGCGACCACCGAGATCTACACGCT,CGTATAAATGCG,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTCGTATAAATGCGTA...
2,2,X00179789,E,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,E1,1,...,,1,C1,515rcbc24,AATGATACGGCGACCACCGAGATCTACACGCT,TGACTAATGGCC,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTTGACTAATGGCCTA...
3,3,X00180201,G,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,G1,1,...,,1,D1,515rcbc36,AATGATACGGCGACCACCGAGATCTACACGCT,GTGGAGTCTCAT,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTGTGGAGTCTCATTA...
4,4,X00180464,I,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,I1,1,...,,1,E1,515rcbc48,AATGATACGGCGACCACCGAGATCTACACGCT,TGATGTGCTAAG,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTTGATGTGCTAAGTA...


## Step 3: Mapping File Generation for Qiita
The Mapping File is generated before the MiSeq run and sent to the KL team as soon as the MiSeq run starts. Additional run information is added to the mapping file post-sequencing in order to generate the preparation file.


The output file must have a .txt extension and a name in the following format:
**YYYYMMDD_SEQPRIMERS_PROJECT_QIITAID_Plate_#s.txt**
- SEQ Primers 16S: **IL515fBC_806**
- SEQ Primers ITS: **ILITS**
- SEQ Primers 18S: **IL18S**
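
As a sketch of the naming convention above, the filename can be assembled from its parts rather than typed by hand. The helper `build_mapping_filename` and its parameters are hypothetical; only the primer labels and the overall pattern come from the list above.

```python
from datetime import date

# Primer labels taken from the list above, keyed by seq_type
SEQ_PRIMERS = {'16S': 'IL515fBC_806', 'ITS': 'ILITS', '18S': 'IL18S'}


def build_mapping_filename(run_date, seq_type, project, plates):
    """Build YYYYMMDD_SEQPRIMERS_PROJECT_QIITAID_Plate_#s.txt.

    `project` is expected to already include the Qiita ID, e.g. 'ABTX_11052'.
    """
    if seq_type not in SEQ_PRIMERS:
        raise ValueError(f'seq_type must be one of {sorted(SEQ_PRIMERS)}')
    plate_part = '_'.join(str(p) for p in plates)
    return f'{run_date:%Y%m%d}_{SEQ_PRIMERS[seq_type]}_{project}_Plate_{plate_part}.txt'


print(build_mapping_filename(date(2023, 3, 2), '16S', 'ABTX_11052',
                             [204, 207, 210, 215]))
# 20230302_IL515fBC_806_ABTX_11052_Plate_204_207_210_215.txt
```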

Generate mapping file for current samples

In [8]:
# output file needs to have .txt extension and contain the correct format (shown above).
output_filename = 'test_output/amplicon/20230207_515f806r_ABTX_11052_1-4.txt'
# output_filename = './20230302_IL515fBC_806_ABTX_11052_Plate_204_207_210_215.txt'

qiita_df = generate_qiita_prep_file(plate_df, seq_type)

qiita_df.head()

Unnamed: 0,sample_name,barcode,primer,primer_plate,well_id,plating,extractionkit_lot,extraction_robot,tm1000_8_tool,primer_date,...,Illumina 5prime Adapter,Name,Original Name,Plate,Plate Position,Primer For PCR,Project Plate,Project_Name,Row,index
0,X00180471,AGCCTTCGTCGC,GTGYCAGCMGCCGCGGTAA,1,A1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc0,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTAGCCTTCGTCGCTA...,THDMI_10317_PUK2,ABTX_11052,A,0
1,X00180199,CGTATAAATGCG,GTGYCAGCMGCCGCGGTAA,1,C1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc12,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTCGTATAAATGCGTA...,THDMI_10317_PUK2,ABTX_11052,C,1
2,X00179789,TGACTAATGGCC,GTGYCAGCMGCCGCGGTAA,1,E1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc24,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTTGACTAATGGCCTA...,THDMI_10317_PUK2,ABTX_11052,E,2
3,X00180201,GTGGAGTCTCAT,GTGYCAGCMGCCGCGGTAA,1,G1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc36,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTGTGGAGTCTCATTA...,THDMI_10317_PUK2,ABTX_11052,G,3
4,X00180464,TGATGTGCTAAG,GTGYCAGCMGCCGCGGTAA,1,I1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc48,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTTGATGTGCTAAGTA...,THDMI_10317_PUK2,ABTX_11052,I,4


In [9]:
qiita_df.set_index('sample_name', verify_integrity=True, inplace=True)

qiita_df.to_csv(output_filename, sep='\t')

qiita_df

sample_name,barcode,primer,primer_plate,well_id,plating,extractionkit_lot,extraction_robot,tm1000_8_tool,primer_date,mastermix_lot,...,Illumina 5prime Adapter,Name,Original Name,Plate,Plate Position,Primer For PCR,Project Plate,Project_Name,Row,index
X00180471,AGCCTTCGTCGC,GTGYCAGCMGCCGCGGTAA,1,A1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc0,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTAGCCTTCGTCGCTA...,THDMI_10317_PUK2,ABTX_11052,A,0
X00180199,CGTATAAATGCG,GTGYCAGCMGCCGCGGTAA,1,C1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc12,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTCGTATAAATGCGTA...,THDMI_10317_PUK2,ABTX_11052,C,1
X00179789,TGACTAATGGCC,GTGYCAGCMGCCGCGGTAA,1,E1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc24,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTTGACTAATGGCCTA...,THDMI_10317_PUK2,ABTX_11052,E,2
X00180201,GTGGAGTCTCAT,GTGYCAGCMGCCGCGGTAA,1,G1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc36,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTGTGGAGTCTCATTA...,THDMI_10317_PUK2,ABTX_11052,G,3
X00180464,TGATGTGCTAAG,GTGYCAGCMGCCGCGGTAA,1,I1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc48,,1,1,AATGATACGGCGACCACCGAGATCTACACGCTTGATGTGCTAAGTA...,THDMI_10317_PUK2,ABTX_11052,I,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
X00179548,GTCCTCGCGACT,GTGYCAGCMGCCGCGGTAA,4,H24,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc335,,4,4,AATGATACGGCGACCACCGAGATCTACACGCTGTCCTCGCGACTTA...,THDMI_10317_PUS6,ABTX_11052,H,379
X00179326,CGTTCGCTAGCC,GTGYCAGCMGCCGCGGTAA,4,J24,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc347,,4,4,AATGATACGGCGACCACCGAGATCTACACGCTCGTTCGCTAGCCTA...,THDMI_10317_PUS6,ABTX_11052,J,380
X00179165,TGCCTGCTCGAC,GTGYCAGCMGCCGCGGTAA,4,L24,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc359,,4,4,AATGATACGGCGACCACCGAGATCTACACGCTTGCCTGCTCGACTA...,THDMI_10317_PUS6,ABTX_11052,L,381
X00179035,TCTTACCCATAA,GTGYCAGCMGCCGCGGTAA,4,N24,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,...,AATGATACGGCGACCACCGAGATCTACACGCT,515rcbc371,,4,4,AATGATACGGCGACCACCGAGATCTACACGCTTCTTACCCATAATA...,THDMI_10317_PUS6,ABTX_11052,N,382


## Step 4: Combine Plates (Optional)

If you would like to combine existing plates with these samples, enter the path to their corresponding mapping (pre-preparation) files below.
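
Before merging, it can be worth checking that no sample name appears in more than one mapping file, since this notebook already requires unique `sample_name` values (`verify_integrity=True`) when writing a single file. The sketch below uses only the standard library and assumes each mapping file is tab-separated with a `sample_name` column. The helper `shared_sample_names` is hypothetical, and the two small tables are illustrative data, not real mapping files.

```python
import csv
import io
from collections import defaultdict


def shared_sample_names(named_tables):
    """Return {sample_name: [labels]} for names found in more than one table.

    `named_tables` maps a label to the TSV text of one mapping file.
    """
    sources = defaultdict(list)
    for label, text in named_tables.items():
        for row in csv.DictReader(io.StringIO(text), delimiter='\t'):
            sources[row['sample_name']].append(label)
    return {name: labels for name, labels in sources.items() if len(labels) > 1}


# Illustrative tables: X00180471 appears in both
current = "sample_name\tbarcode\nX00180471\tAGCCTTCGTCGC\nX00180199\tCGTATAAATGCG\n"
previous = "sample_name\tbarcode\nX00180471\tTGACTAATGGCC\n"

print(shared_sample_names({'current': current, 'previous': previous}))
# {'X00180471': ['current', 'previous']}
```

An empty dictionary means the files share no sample names under this check.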

In [13]:
merged_output_filename = 'test_output/amplicon/20230203_IL515fBC_806_ABTX_11052_Plates_174_178_182_185_204_207_210_215_.txt'
# merged_output_filename = './20230203_IL515fBC_806_ABTX_11052_Plates_174_178_182_185_204_207_210_215_.txt'


In [None]:
# Paths to the existing mapping (pre-preparation) files to merge with the current samples
files = ['./20230201_IL515fBC_806r_ABTX_11052_174_178_182_185_MF_notebook.txt']
preps = []

for f in files:
    preps.append(parse_prep(f))
    
if len(files):
    print('%d file(s) loaded' % len(files))

In [12]:
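import pandas as pd

if len(preps):
    # DataFrame.append is deprecated since pandas 1.4 and removed in 2.0; use pd.concat
    prep = pd.concat([qiita_df] + preps)
    prep.to_csv(merged_output_filename, sep='\t')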

In [None]:
prep