In [None]:
%reload_ext watermark
%matplotlib inline
from os.path import exists

from metapool.metapool import *
from metapool import (
    validate_plate_metadata,
    assign_emp_index,
    make_sample_sheet,
    KLSampleSheet,
    parse_prep,
    validate_and_scrub_sample_sheet,
    generate_qiita_prep_file,
)
%watermark -i -v -iv -m -h -p metapool,sample_sheet,openpyxl -u

# Knight Lab Amplicon Pre-Preparation File Generator

<font color='red'><B>A VERY IMPORTANT Note on Plate Compression and Positions:</B></font>

This notebook works SPECIFICALLY with the STANDARD EpMotion compression format ONLY.
PRIMERS are tied to our standard SPECIFIC PLATE POSITIONS outlined in the code, and CANNOT be changed.

If you are not compressing in the standard compression format (position 1, position 2, position 3, position 4) that uses the plate map template below,
DO NOT USE THIS NOTEBOOK TO GENERATE THE PRE-PREPARATION FILE.
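
As a rough illustration of what "position 1, position 2, position 3, position 4" means, the sketch below maps a 384-well coordinate to a plate position. It assumes the standard format interleaves the four 96-well plates in 2x2 blocks (position 1 top left, 2 top right, 3 bottom left, 4 bottom right, matching the comments in the metadata cell further down). `plate_position` is a hypothetical helper, not part of metapool; the metapool code remains the authority on the actual layout.

```python
import string


def plate_position(row: str, col: int) -> int:
    """Return the assumed plate position (1-4) for a 384-well coordinate.

    Assumption: the four 96-well plates are interleaved in 2x2 blocks.
    """
    row_index = string.ascii_uppercase.index(row.upper())  # A -> 0, B -> 1, ...
    if row_index > 15 or not 1 <= col <= 24:
        raise ValueError(f"{row}{col} is not a valid 384-well coordinate")
    lower_half_of_block = row_index % 2 == 1  # rows B, D, F, ...
    right_half_of_block = col % 2 == 0        # columns 2, 4, 6, ...
    return 1 + 2 * int(lower_half_of_block) + int(right_half_of_block)


print(plate_position("A", 1))  # top left of a 2x2 block
print(plate_position("B", 2))  # bottom right of a 2x2 block
```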

### What is it?

This Jupyter Notebook automatically generates pre-preparation files for amplicon sequencing. It can also merge the result with pre-preparation files from additional PCR preps.


### Here's how it should work.

You'll start out with a **384-well plate map** (platemap.tsv) in a 384-well compressed list format that indicates sample name, well IDs, project plates, etc.

You can use this google sheet template to generate your plate map:

https://docs.google.com/spreadsheets/d/1JCfnGO-6RRFuhOB1yVGMSj5qRFjiiUAUzprsw6IhugY/edit#gid=0

Next you'll enter processing information (project/plate info, plating, extraction, PCR), automatically assign EMP barcodes, and then generate a **pre-preparation file** that can be used in combination with the rest of the sequence processing pipeline.

**Please designate what kind of amplicon sequencing you want to perform:**

In [None]:
seq_type = '16S'
#options are ['16S', '18S', 'ITS']
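
A typo in `seq_type` would only surface later in the notebook, so a quick check up front can help. The sketch below is an optional guard; `VALID_SEQ_TYPES` and `check_seq_type` are hypothetical names, and the list of options is taken from the comment in the cell above.

```python
VALID_SEQ_TYPES = ("16S", "18S", "ITS")


def check_seq_type(seq_type: str) -> str:
    """Return seq_type unchanged, or raise if it is not a listed option."""
    if seq_type not in VALID_SEQ_TYPES:
        raise ValueError(
            f"seq_type must be one of {VALID_SEQ_TYPES}, got {seq_type!r}"
        )
    return seq_type


check_seq_type("16S")
```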

## Step 1: Input Plate Map

**Enter the correct path to the plate map file**. This will serve as the plate map for relating all subsequent information. Plate maps should be in .tsv format.<br>
<font color='red'>&#42;&#42;</font>
**If you are working with ABTX samples and find instances of a duplicate name, please rename those samples `<duplicated_name>.A`, `<duplicated_name>.B`, `<duplicated_name>.C`, etc., to make each sample name unique.**
<font color='red'>&#42;&#42;</font>
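
The renaming scheme above can be sketched in a few lines. `suffix_duplicates` is a hypothetical helper (not part of metapool) that appends `.A`, `.B`, `.C`, ... to every name that occurs more than once, in order of appearance. It only covers up to 26 copies of a name, and the renamed samples still need to be written back to the plate map by hand.

```python
from collections import Counter
import string


def suffix_duplicates(names):
    """Append .A, .B, .C, ... to each name that appears more than once."""
    totals = Counter(names)  # how often each name occurs overall
    seen = Counter()         # how many copies of each name we have renamed
    renamed = []
    for name in names:
        if totals[name] > 1:
            letter = string.ascii_uppercase[seen[name]]
            renamed.append(f"{name}.{letter}")
            seen[name] += 1
        else:
            renamed.append(name)
    return renamed


print(suffix_duplicates(["s1", "s2", "s1", "s3", "s1"]))
# ['s1.A', 's2', 's1.B', 's3', 's1.C']
```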

In [None]:
plate_map_fp = './2023-02-20_ABTX_204_207_210_215 - Map.tsv'

if not exists(plate_map_fp):
    print("Error: %s is not a path to a valid file" % plate_map_fp)

**Read in the plate map**. It should look something like this:

```
Sample	Row	Col	Blank
GLY_01_012	A	1	False
GLY_14_034	B	1	False
GLY_11_007	C	1	False
GLY_28_018	D	1	False
GLY_25_003	E	1	False
GLY_06_106	F	1	False
GLY_07_011	G	1	False
GLY_18_043	H	1	False
GLY_28_004	I	1	False
```

**Make sure there are no duplicate IDs.** If each sample doesn't have a different name, an error will be thrown and you won't be able to generate a sample sheet.
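
To find duplicates before loading the plate map, a plain pandas check along these lines may help. This is a sketch on a small in-memory example, assuming the sample names live in a `Sample` column as in the excerpt above; it does not replace the validation that metapool performs.

```python
import io

import pandas as pd

# Small in-memory stand-in for a plate map; the third row repeats a name.
example_tsv = (
    "Sample\tRow\tCol\tBlank\n"
    "GLY_01_012\tA\t1\tFalse\n"
    "GLY_14_034\tB\t1\tFalse\n"
    "GLY_01_012\tC\t1\tFalse\n"
)
example_df = pd.read_csv(io.StringIO(example_tsv), sep="\t", dtype=str)

# keep=False marks every copy of a repeated name, not just the later ones.
duplicated = example_df[example_df["Sample"].duplicated(keep=False)]
duplicate_names = sorted(duplicated["Sample"].unique())
print(duplicate_names)  # ['GLY_01_012']
```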

In [None]:
# To validate sample_names against Qiita, uncomment the call below and use it
# in place of the plain read_plate_map_csv(f) call.
# Please contact Antonio or Charlie for path_to_qiita_config_file.
with open(plate_map_fp, 'r') as f:  # close the file once it has been read
    # plate_df = read_plate_map_csv(f, qiita_oauth2_conf_fp='path_to_qiita_config_file')
    plate_df = read_plate_map_csv(f)

plate_df.head()

## Step 2: Input Processing Information & Assign Barcodes According to Primer Plate

This portion of the notebook will assign a barcode to each sample according to the primer plate number. Additionally, you will add sample plate information and processing information that is obtained during plating, extraction, and PCR.

As inputs, it requires:
1. A plate map dataframe (from previous step)
2. Most importantly, we need the Primer Plate # so we know what **EMP barcodes** to assign to each plate
3. Project and extraction plate information
4. Processing information, or preparation metadata, for each plate

The workflow then:
1. Joins the processing information & barcode assignments with the plate metadata
2. Assigns indices per sample
3. Generates pre-preparation files
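
The join in step 1 is conceptually a many-to-one merge of wells onto per-plate metadata. The sketch below shows that idea with plain pandas on toy data; it is not how `assign_emp_index` is implemented, and the `Plate Position` column on the wells table is an assumption made for illustration.

```python
import pandas as pd

# Toy wells table: each well already knows which plate position it came from.
wells = pd.DataFrame({
    "Sample": ["GLY_01_012", "GLY_14_034"],
    "Row": ["A", "B"],
    "Col": [1, 1],
    "Plate Position": ["1", "3"],
})

# Toy per-plate processing information, one row per plate position.
processing = pd.DataFrame({
    "Plate Position": ["1", "3"],
    "Primer Plate #": ["1", "3"],
    "Sample Plate": ["ABTX_11052_Plate_204", "ABTX_11052_Plate_210"],
})

# validate="many_to_one" raises if a plate position appears twice in processing.
joined = wells.merge(
    processing, on="Plate Position", how="left", validate="many_to_one"
)
print(joined)
```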

## Enter and validate the PCR Primers and additional processing information

- It is absolutely critical that the `Primer Plate #` and the `Plate Position` are accurate. `Primer Plate #` determines which EMP barcodes will be used for this plate. `Plate Position` determines the physical location of the plate. Make sure this input is consistent with what is recorded in the processing progress!
- If you are plating fewer than four plates, remove the metadata for each unused plate by deleting the text between its curly braces.
- For missing fields, write `'not applicable'` between the single quotes.
- To enter a plate, copy and paste the contents from one of the plates below.

<font color="red"><B>REMINDER: ONLY use this notebook if you compress your 384-well plate using the standard compression format and know what plates are located in each position.</B></font>
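
Because most fields repeat across the four plates, one way to reduce copy-and-paste mistakes is to build each entry from a shared template. The sketch below is an optional pattern, not something the notebook requires; `shared_fields`, `plates`, and `example_metadata` are hypothetical names, and the template is abbreviated, so a real one would need every field shown in the cell below.

```python
# Fields that are identical for every plate in this run (abbreviated).
shared_fields = {
    "Project_Name": "ABTX_11052",
    "Plating": "SF",
    "TM300 8 Tool": "not applicable",
}

# (Plate Position, Primer Plate #, Sample Plate) for each plate.
plates = [
    ("1", "1", "ABTX_11052_Plate_204"),
    ("2", "2", "ABTX_11052_Plate_207"),
]

# Each entry is its own dict, so editing one plate never affects another.
example_metadata = [
    {
        "Plate Position": position,
        "Primer Plate #": primer_plate,
        "Sample Plate": sample_plate,
        **shared_fields,
    }
    for position, primer_plate, sample_plate in plates
]
print(example_metadata[1]["Sample Plate"])  # ABTX_11052_Plate_207
```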

In [None]:
_metadata = [
    {
        # top left plate
        'Plate Position': '1',
        'Primer Plate #': '1',
        
        'Sample Plate': 'ABTX_11052_Plate_204', # PROJECTNAME_QIITA_ID_Plate_#
        'Project_Name': 'ABTX_11052', # PROJECTNAME_QIITAID
        'center_project_name': 'Rob ABTX', # what the wetlab calls the project
        'experiment_design_description': '16S sequencing of antibiotic time series', # brief but specific project description
        
        'Plating': 'SF', # initials
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM10 8 Tool': '865HS8',
        'Processing Robot': 'Echo550',
        'TM300 8 Tool': 'not applicable',
        'TM50 8 Tool': 'not applicable',
        'instrument_model': 'Illumina MiSeq',
        'run_date': '2023-03-02', # date of MiSeq run
        'Original Name': '' # leave empty
    },
    {
        # top right plate
        'Plate Position': '2',
        'Primer Plate #': '2',
    
        'Sample Plate': 'ABTX_11052_Plate_207',
        'Project_Name': 'ABTX_11052',
        'center_project_name': 'Rob ABTX',
        'experiment_design_description': '16S sequencing of antibiotic time series',

        'Plating': 'SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17',
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM10 8 Tool': '865HS8',
        'Processing Robot': 'Echo550',
        'TM300 8 Tool': 'not applicable',
        'TM50 8 Tool': 'not applicable',
        'instrument_model': 'Illumina MiSeq',
        'run_date': '2023-03-02',
        'Original Name': ''
    },
    {
        # bottom left plate
        'Plate Position': '3',
        'Primer Plate #': '3',
        
        'Sample Plate': 'ABTX_11052_Plate_210',
        'Project_Name': 'ABTX_11052',
        'center_project_name': 'Rob ABTX',
        'experiment_design_description': '16S sequencing of antibiotic time series',
        
        'Plating': 'SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17',
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM10 8 Tool': '865HS8',
        'Processing Robot': 'Echo550',
        'TM300 8 Tool': 'not applicable',
        'TM50 8 Tool': 'not applicable',
        'instrument_model': 'Illumina MiSeq',
        'run_date': '2023-03-02',
        'Original Name': ''
    },
    {
        # bottom right plate
        'Plate Position': '4',
        'Primer Plate #': '4',

        
        'Sample Plate': 'ABTX_11052_Plate_215',
        'Project_Name': 'ABTX_11052',
        'center_project_name': 'Rob ABTX',
        'experiment_design_description': '16S sequencing of antibiotic time series',
        
        'Plating': 'SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17',
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'TM10 8 Tool': '865HS8',
        'Processing Robot': 'Echo550',
        'TM300 8 Tool': 'not applicable',
        'TM50 8 Tool': 'not applicable',
        'instrument_model': 'Illumina MiSeq',
        'run_date': '2023-03-02',
        'Original Name': ''
    },
]

plate_metadata = validate_plate_metadata(_metadata)
plate_metadata

After metadata is validated, compare the Project_Name values in plate_metadata against those in plate_df.

In [None]:
project_names_in_input_plate_map_file = set(plate_df['Project Name'])
project_names_in_metadata = set(plate_metadata['Project_Name'])

if project_names_in_input_plate_map_file == project_names_in_metadata:
    print("Project-names in input plate-map file and metadata match.")
else:
    print(f"Error: Project-names in input plate-map file ({project_names_in_input_plate_map_file}) "
          f"and metadata ({project_names_in_metadata}) do not match.\nPlease correct this.")

The `Plate Position` and `Primer Plate #` allow us to figure out which wells are associated with each of the EMP barcodes.

In [None]:
if plate_metadata is not None:
    plate_df = assign_emp_index(plate_df, plate_metadata, seq_type).reset_index()

    plate_df.head()
else:
    print('Error: Please fix the errors in the previous cell')

As the table below shows, each well is now associated with the corresponding EMP barcodes (`Golay Barcode`, `Forward Primer Linker`, etc.) and the plating metadata (`Primer Plate #`, `Primer Date`, `Water Lot`, etc.).

In [None]:
plate_df.head()

## Step 3: Pre-Preparation File Generation for Qiita
The Pre-Preparation File is generated before the MiSeq run and sent to the KL team as soon as the MiSeq run starts. Additional run information is added to the pre-preparation file post-sequencing in order to generate the preparation file.


The output file must be a .txt file named in the following format:
**YYYYMMDD_SEQPRIMERS_PROJECT_QIITAID_Plate_#s.txt**
- SEQ Primers 16S: **IL515fBC_806**
- SEQ Primers ITS: **ILITS**
- SEQ Primers 18S: **IL18S**
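
The naming convention above can be assembled programmatically. The sketch below is one possible way to do it; `SEQ_PRIMERS` and `prep_filename` are hypothetical names, and the primer strings are copied from the list above.

```python
from datetime import date

# Sequencing-primer labels from the naming convention, keyed by seq_type.
SEQ_PRIMERS = {"16S": "IL515fBC_806", "ITS": "ILITS", "18S": "IL18S"}


def prep_filename(run_date, seq_type, project, plate_numbers):
    """Build YYYYMMDD_SEQPRIMERS_PROJECT_QIITAID_Plate_#s.txt."""
    plates = "_".join(str(number) for number in plate_numbers)
    return f"{run_date:%Y%m%d}_{SEQ_PRIMERS[seq_type]}_{project}_Plate_{plates}.txt"


print(prep_filename(date(2023, 3, 2), "16S", "ABTX_11052", [204, 207, 210, 215]))
# 20230302_IL515fBC_806_ABTX_11052_Plate_204_207_210_215.txt
```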

Generate the pre-preparation file for the current samples:

In [None]:
# output file needs to have .txt extension and contain the correct format (shown above).
output_filename = './20230302_IL515fBC_806_ABTX_11052_Plate_204_207_210_215.txt'

qiita_df = generate_qiita_prep_file(plate_df, seq_type)

qiita_df.info()

#qiita_df['well_description']

In [None]:
qiita_df.set_index('sample_name', verify_integrity=True, inplace=True)

qiita_df.to_csv(output_filename, sep='\t')

qiita_df

## Step 4: Combine Plates (Optional)

If you would like to combine existing plates with these samples, enter the path to their corresponding pre-preparation files below.
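
Combining pre-preparation files amounts to stacking tables that share the same columns. The sketch below illustrates this with `pandas.concat` on toy frames, assuming each table is indexed by `sample_name`; passing `verify_integrity=True` makes pandas raise a `ValueError` if the same sample name appears in more than one file.

```python
import pandas as pd

# Toy stand-ins for the current prep and a previously generated one.
current = pd.DataFrame(
    {"sample_plate": ["ABTX_11052_Plate_204"]},
    index=pd.Index(["sample.1"], name="sample_name"),
)
previous = pd.DataFrame(
    {"sample_plate": ["ABTX_11052_Plate_174"]},
    index=pd.Index(["sample.2"], name="sample_name"),
)

# Stack the rows; overlapping sample names would raise a ValueError here.
merged = pd.concat([current, previous], verify_integrity=True)
print(merged)
```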

In [None]:
# additional pre-prep files to merge with qiita_df
files = ['./20230201_IL515fBC_806r_ABTX_11052_174_178_182_185_MF_notebook.txt']

# filename for the merged-output file:
merged_output_filename = './20230203_IL515fBC_806_ABTX_11052_Plates_174_178_182_185_204_207_210_215_.txt'

In [None]:
preps = []

for f in files:
    preps.append(parse_prep(f))
    
# if running Step 4, assume preps and files are not empty
    
'%d file(s) loaded' % len(files)

In [None]:
import pandas as pd  # DataFrame.append was removed in pandas 2.0; use concat

prep = pd.concat([qiita_df] + preps)
prep

In [None]:
# lambda function used to extract plate-number from 'sample_plate' column
get_plate_num = lambda x: int(x.split('_')[-1])

# create a temporary column to record the row's plate-number
prep['plate_number'] = prep['sample_plate'].apply(get_plate_num)

# list all available sample_plate values for selection
sorted(list(prep.sample_plate.unique()))

In [None]:
# Add the sample_plate values to retain in the final merged output, or leave empty to write all to file.
keep_these = ['ABTX_11052_Plate_204', 'ABTX_11052_Plate_207']

if keep_these:
    # keep only rows whose sample_plate is listed above
    prep = prep.loc[prep['sample_plate'].isin(keep_these)]

prep

In [None]:
# delete temporary column and write the final result to file.
prep = prep.drop('plate_number', axis=1)
prep.to_csv(merged_output_filename, sep='\t')