In [None]:
%reload_ext watermark
%matplotlib inline
from os.path import exists

from metapool.metapool import *
from metapool import (validate_plate_metadata, assign_emp_index,
                      make_sample_sheet, KLSampleSheet, parse_prep,
                      validate_and_scrub_sample_sheet,
                      generate_qiita_prep_file)
%watermark -i -v -iv -m -h -p metapool,sample_sheet,openpyxl -u

# Knight Lab Amplicon Mapping File (pre-preparation file) and Sample Sheet Generator

### What is it?

This Jupyter Notebook automatically generates mapping files and sample sheets for amplicon sequencing. It can also merge multiple mapping files and sample sheets from additional PCR preps.


### Here's how it should work.

You'll start out with a **basic plate map** (platemap.tsv), which links each sample to its appropriate row and column. This will be in a 384-well compressed list format.

You can use this Google Sheets template to generate your plate map:

https://docs.google.com/spreadsheets/d/1xPjB6iR3brGeG4bm2un4ISSsTDxFw5yME09bKqz0XNk/edit?usp=sharing

Next you'll enter processing information (PCR) and automatically assign EMP barcodes, and then generate a **mapping file** and a **sample sheet** (samplesheet.csv) that can be used in combination with the rest of the sequence processing pipeline. 

**Please designate what kind of amplicon sequencing you want to perform:**

In [None]:
seq_type = '16S'
#options are ['16S', '18S', 'ITS']

## Step 1: Input plate map

**Enter the correct path to the plate map file**. This will serve as the plate map for relating all subsequent information. Plate maps should be in .tsv format.

In [None]:
plate_map_fp = './test_data/amplicon/compressed-map.tsv'

if not exists(plate_map_fp):
    print("Error: %s is not a path to a valid file" % plate_map_fp)

**Read in the plate map**. It should look something like this:

```
Sample	Row	Col	Blank
GLY_01_012	A	1	False
GLY_14_034	B	1	False
GLY_11_007	C	1	False
GLY_28_018	D	1	False
GLY_25_003	E	1	False
GLY_06_106	F	1	False
GLY_07_011	G	1	False
GLY_18_043	H	1	False
GLY_28_004	I	1	False
```

**Make sure there are no duplicate IDs.** If any two samples share a name, an error will be thrown and you won't be able to generate a sample sheet.
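
Duplicates can also be spotted before the plate map is loaded. The sketch below is a standalone check written for illustration, not part of metapool. It assumes a tab-separated file with a `Sample` column, as in the example above, and `find_duplicate_samples` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def find_duplicate_samples(handle, column='Sample'):
    """Return the sorted sample names that occur more than once.

    Hypothetical helper for illustration; metapool runs its own checks
    when the plate map is read.
    """
    reader = csv.DictReader(handle, delimiter='\t')
    counts = Counter(row[column] for row in reader)
    return sorted(name for name, n in counts.items() if n > 1)


# Toy plate map in the same layout as the example above, with one
# sample name deliberately repeated.
toy_tsv = (
    "Sample\tRow\tCol\tBlank\n"
    "GLY_01_012\tA\t1\tFalse\n"
    "GLY_14_034\tB\t1\tFalse\n"
    "GLY_01_012\tC\t1\tFalse\n"
)
duplicates = find_duplicate_samples(io.StringIO(toy_tsv))
print(duplicates)
```

To run the same check on a real file, pass an open file handle, for example `find_duplicate_samples(open(plate_map_fp))`.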

In [None]:
# To validate sample names against Qiita, uncomment the first call below and remove the second.
# Please contact Antonio or Charlie for path_to_qiita_config_file.
with open(plate_map_fp, 'r') as f:
    # plate_df = read_plate_map_csv(f, qiita_oauth2_conf_fp='path_to_qiita_config_file')
    plate_df = read_plate_map_csv(f)

plate_df.head()

# Input processing information & Assign barcodes according to primer plate

This portion of the notebook will assign a barcode to each sample according to the primer plate number. Additionally, you will add sample processing information that is obtained during the PCR step.

As inputs, it requires:
1. A plate map dataframe (from previous step)
2. The Primer Plate # for each plate, which is the most important input because it determines which **EMP barcodes** are assigned to that plate
3. Processing information, or preparation metadata, for each plate

The workflow then:
1. Joins the processing information & barcode assignments with the plate metadata
2. Assigns indices per sample
3. Generates mapping files and sample sheets

## Enter and validate the PCR Primers and additional processing information

- It is absolutely critical that the `Primer Plate #` and the `Plate Position` are accurate. `Primer Plate #` determines which EMP barcodes will be used for this plate. `Plate Position` determines the physical location of the plate. Make sure this input is consistent with what is recorded in the processing progress!
- If you are plating fewer than four plates, remove the metadata for each unused plate by deleting its entry, including the curly braces.
- For missing fields, write NA between the single quotes, for example `'NA'`.
- To enter a plate, copy and paste the contents from one of the plates below.
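
Some of these rules can be checked by hand before running the validation cell. The sketch below is an illustration only and does not reproduce what `validate_plate_metadata` does internally. The set of required fields, the helper name `check_plate_records`, and the toy records are all assumptions.

```python
from datetime import datetime

# Fields treated as required in this sketch (an assumption; the real
# validator in metapool may require a different set).
REQUIRED_FIELDS = ('Plate Position', 'Primer Plate #', 'Sample Plate',
                   'Project_Name', 'Primer Date')


def check_plate_records(records):
    """Return a list of human-readable problems found in the records."""
    problems = []
    seen_positions = set()
    for i, record in enumerate(records, start=1):
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append('plate %d: missing field %r' % (i, field))
        position = record.get('Plate Position')
        if position is not None:
            if position not in {'1', '2', '3', '4'}:
                problems.append('plate %d: Plate Position must be 1-4' % i)
            elif position in seen_positions:
                problems.append('plate %d: duplicate Plate Position %s'
                                % (i, position))
            seen_positions.add(position)
        date = record.get('Primer Date')
        if date is not None and date != 'NA':
            try:
                datetime.strptime(date, '%Y-%m-%d')
            except ValueError:
                problems.append('plate %d: Primer Date %r is not yyyy-mm-dd'
                                % (i, date))
    return problems


# Two toy records: the second reuses position 1 and has a malformed date.
example = [
    {'Plate Position': '1', 'Primer Plate #': '1', 'Sample Plate': 'Plate_A',
     'Project_Name': 'Demo', 'Primer Date': '2021-08-17'},
    {'Plate Position': '1', 'Primer Plate #': '2', 'Sample Plate': 'Plate_B',
     'Project_Name': 'Demo', 'Primer Date': '17/08/2021'},
]
for problem in check_plate_records(example):
    print(problem)
```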

In [None]:
_metadata = [
    {
        # top left plate
        'Plate Position': '1',
        'Primer Plate #': '1',
        
        'Sample Plate': 'THDMI_UK_Plate_2',
        'Project_Name': 'THDMI UK',

        'Plating': 'SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM300 8 Tool': 'NA',
        'TM50 8 Tool': 'NA',
        'Processing Robot': 'Echo550',
        'Original Name': ''
    },
    {
        # top right plate
        'Plate Position': '2',
        'Primer Plate #': '2',
        
        'Sample Plate': 'THDMI_UK_Plate_3',
        'Project_Name': 'THDMI UK',

        'Plating':'AS',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF4',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM300 8 Tool': 'NA',
        'TM50 8 Tool': 'NA',
        'Processing Robot': 'Echo550',
        'Original Name': ''
    },
    {
        # bottom left plate
        'Plate Position': '3',
        'Primer Plate #': '3',
        
        'Sample Plate': 'THDMI_UK_Plate_4',
        'Project_Name': 'THDMI UK',

        'Plating':'MB_SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM300 8 Tool': 'NA',
        'TM50 8 Tool': 'NA',
        'Processing Robot': 'Echo550',
        'Original Name': ''
    },
    {
        # bottom right plate
        'Plate Position': '4',
        'Primer Plate #': '4',
        
        'Sample Plate': 'THDMI_US_Plate_6',
        'Project_Name': 'THDMI US',

        'Plating':'AS',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF4',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM300 8 Tool': 'NA',
        'TM50 8 Tool': 'NA',
        'Processing Robot': 'Echo550', 
        'Original Name': ''
    },
]

plate_metadata = validate_plate_metadata(_metadata)
plate_metadata

The `Plate Position` and `Primer Plate #` allow us to figure out which wells are associated with each of the EMP barcodes.
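
To make the relationship concrete, the sketch below maps a well on the 384-well plate back to a plate position and a 96-well source well. It assumes the common interleaved layout, in which each 2x2 block of wells holds one well from each of the four plates (top left, top right, bottom left, bottom right, matching the comments in the metadata above). The functions `plate_position` and `source_well` are hypothetical names, and metapool's own logic may differ.

```python
def plate_position(row, col):
    """Map a 384-well coordinate to a plate position '1'-'4'.

    Assumes the interleaved layout: within every 2x2 block of wells,
    top left is plate 1, top right 2, bottom left 3, bottom right 4.
    """
    row_index = ord(row.upper()) - ord('A')   # 'A' -> 0, 'B' -> 1, ...
    col_index = int(col) - 1                  # 1 -> 0, 2 -> 1, ...
    return str(1 + 2 * (row_index % 2) + (col_index % 2))


def source_well(row, col):
    """Return the matching 96-well plate well under the same assumption."""
    row_index = ord(row.upper()) - ord('A')
    col_index = int(col) - 1
    return '%s%d' % (chr(ord('A') + row_index // 2), col_index // 2 + 1)


# The first four wells all come from well A1 of plates 1 to 4.
for well in [('A', 1), ('A', 2), ('B', 1), ('B', 2), ('P', 24)]:
    print(well, plate_position(*well), source_well(*well))
```

Under this assumption, the plate position selects the metadata entry (and therefore the `Primer Plate #`), while the source well selects the barcode within that primer plate.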

In [None]:
if plate_metadata is not None:
    plate_df = assign_emp_index(plate_df, plate_metadata, seq_type).reset_index()

    plate_df.head()
else:
    print('Error: Please fix the errors in the previous cell')

As the table below shows, the resulting table is now associated with the corresponding EMP barcodes (`Golay Barcode`, `Forward Primer Linker`, etc.) and the plating metadata (`Primer Plate #`, `Primer Date`, `Water Lot`, etc.).

In [None]:
plate_df.head()

# Mapping File Generation for Qiita
The Mapping File is generated before the MiSeq run and sent to the KL team as soon as the MiSeq run starts. Additional run information is added to the mapping file post-sequencing in order to generate the preparation file.


The output file must have a .txt extension and follow this naming format:
**YYYYMMDD_SEQPRIMERS_PROJECT_QIITA.txt**
- SEQ Primers 16S: **IL515fBC_806**
- SEQ Primers ITS: **ILITS**
- SEQ Primers 18S: **IL18S**
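
The naming format can be assembled programmatically to avoid typos. This is a minimal sketch: `build_mapping_filename` is a hypothetical helper, not a metapool function, and the primer labels are copied from the list above. The project name and Qiita ID in the usage line are only example values.

```python
import re

# Sequencing primer labels taken from the list above.
SEQ_PRIMERS = {'16S': 'IL515fBC_806', 'ITS': 'ILITS', '18S': 'IL18S'}


def build_mapping_filename(date, seq_type, project, qiita_id):
    """Assemble YYYYMMDD_SEQPRIMERS_PROJECT_QIITA.txt (hypothetical helper)."""
    if not re.fullmatch(r'\d{8}', date):
        raise ValueError('date must be YYYYMMDD, got %r' % date)
    if seq_type not in SEQ_PRIMERS:
        raise ValueError('unknown seq_type %r' % seq_type)
    return '%s_%s_%s_%s.txt' % (date, SEQ_PRIMERS[seq_type], project, qiita_id)


print(build_mapping_filename('20230207', '16S', 'ABTX', '11052'))
```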

Generate mapping file for current samples

In [None]:
# output file needs to have .txt extension and contain the correct format (shown above).
output_filename = 'test_output/amplicon/20230207_515f806r_ABTX_11052_1-4.txt'

qiita_df = generate_qiita_prep_file(plate_df, seq_type)

qiita_df.head()

In [None]:
qiita_df.set_index('sample_name', verify_integrity=True, inplace=True)

qiita_df.to_csv(output_filename, sep='\t')

qiita_df

# Combine plates (optional)

If you would like to combine existing plates with these samples, enter the path to their corresponding sample sheets and mapping (preparation) files below. Otherwise you can skip to the Mapping File Generation section.
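
Conceptually, combining preparations means stacking their tables row-wise and checking that no sample name is repeated. The sketch below shows that idea on toy data with `pandas.concat`. The column names and values are made up for illustration, and real prep files carry many more columns.

```python
import pandas as pd

# Toy stand-ins for the current prep and one previously generated prep.
current = pd.DataFrame({'sample_name': ['s1', 's2'],
                        'barcode': ['AAAA', 'CCCC']})
previous = pd.DataFrame({'sample_name': ['s3'],
                         'barcode': ['GGGG']})

# Stack the tables row-wise and renumber the index.
merged = pd.concat([current, previous], ignore_index=True)

# Refuse to continue if any sample name ends up repeated.
repeated = merged.loc[merged['sample_name'].duplicated(),
                      'sample_name'].tolist()
if repeated:
    raise ValueError('Duplicate sample names after merging: %s' % repeated)

print(merged)
```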

In [None]:
merged_output_filename = 'test_output/amplicon/20230203_IL515fBC_806_ABTX_11052_Plates_174_178_182_185_204_207_210_215_.txt'

In [None]:
files = [
    # uncomment the line below and point to the correct filepaths to combine with previous plates
    # ['test_output/amplicon/2021_08_17_THDMI-4-6_samplesheet.csv', 'test_output/amplicon/2021-08-01-515f806r_prep.tsv'],
]
sheets, preps = [], []

for sheet, prep in files:
    sheets.append(KLSampleSheet(sheet))
    preps.append(parse_prep(prep))
    
if len(files):
    print('%d pair(s) of files loaded' % len(files))

In [None]:
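import pandas as pd

if len(preps):
    # `prep` from the loop above is a file path and DataFrame.append was removed in pandas 2.0,
    # so concatenate instead (assumes parse_prep returns `sample_name` as a regular column).
    merged_prep = pd.concat([qiita_df.reset_index()] + preps, ignore_index=True)
    merged_prep.to_csv(merged_output_filename, sep='\t', index=False)
    display(merged_prep)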