In [1]:
%matplotlib inline

import os

from metapool.metapool import *
from metapool import validate_plate_metadata, assign_emp_index

# Knight Lab 16S Sample Sheet and Mapping File (preparation file) Generator 

### What is it?

This Jupyter Notebook allows you to automatically generate sample sheets for amplicon sequencing. 


### Here's how it should work.

You'll start out with a **basic plate map** (platemap.tsv), which just links each sample to its appropriate row and column.

You can use this Google Sheets template to generate your plate map:

https://docs.google.com/spreadsheets/d/1xPjB6iR3brGeG4bm2un4ISSsTDxFw5yME09bKqz0XNk/edit?usp=sharing

Next you'll automatically assign EMP barcodes in order to produce a **sample sheet** (samplesheet.csv) that can be used in combination with the rest of the sequence processing pipeline. 

## Step 1: read in plate map

**Enter the correct path to the plate map file**. This will serve as the plate map for relating all subsequent information.

In [2]:
plate_map_fp = './test_data/amplicon/compressed-map.tsv'

if not os.path.isfile(plate_map_fp):
    print("Problem! %s is not a path to a valid file" % plate_map_fp)

**Read in the plate map**. It should look something like this:

```
Sample	Row	Col	Blank
GLY_01_012	A	1	False
GLY_14_034	B	1	False
GLY_11_007	C	1	False
GLY_28_018	D	1	False
GLY_25_003	E	1	False
GLY_06_106	F	1	False
GLY_07_011	G	1	False
GLY_18_043	H	1	False
GLY_28_004	I	1	False
```
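
Before loading the file with metapool, it can help to confirm that the expected columns are present. The sketch below is not metapool's `read_plate_map_csv`; it is a standalone check using only the standard library. The helper name `check_plate_map_columns` and the required-column set are assumptions based on the example above.

```python
import csv
import io

# Columns taken from the example plate map above (an assumption,
# not metapool's own schema).
REQUIRED_COLUMNS = {'Sample', 'Row', 'Col', 'Blank'}


def check_plate_map_columns(handle):
    """Return the required columns missing from a tab-separated plate map."""
    reader = csv.DictReader(handle, delimiter='\t')
    # fieldnames is None for an empty file, so fall back to an empty list
    return sorted(REQUIRED_COLUMNS - set(reader.fieldnames or []))


example = io.StringIO("Sample\tRow\tCol\tBlank\nGLY_01_012\tA\t1\tFalse\n")
missing = check_plate_map_columns(example)
print(missing or "all required columns present")
```

For a real file, pass an open handle, for example `open(plate_map_fp)`, instead of the in-memory example.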

In [3]:
# use a context manager so the file handle is closed after reading
with open(plate_map_fp, 'r') as f:
    plate_df = read_plate_map_csv(f)

plate_df.head()

Unnamed: 0,Sample,Row,Col,Blank,Project Plate,Project Name,Compressed Plate Name,Well
0,X00180471,A,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,A1
1,X00180199,C,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,C1
2,X00179789,E,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,E1
3,X00180201,G,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,G1
4,X00180464,I,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,I1


## Step 2: check for duplicate sample IDs

Duplicate sample IDs cause problems downstream. Make sure each sample has a unique name.

In [4]:
try:
    assert len(set(plate_df['Sample'])) == len(plate_df['Sample'])
except AssertionError as e:
    # walk the sorted names so that duplicates sit next to each other
    prev = ''
    for sample in sorted(plate_df['Sample']):
        if sample == prev:
            print('\nDuplicates:')
            print(plate_df.loc[plate_df['Sample'] == prev,])

        prev = sample
    print('\n\nWarning! Some sample names are duplicated! Please update the plate map to fix the duplicates')
    raise e
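
The same check can be expressed more compactly by counting occurrences. This is a sketch with a hypothetical helper, `find_duplicates`, that works on any iterable of sample names (for example `plate_df['Sample']`); it is not part of metapool.

```python
from collections import Counter


def find_duplicates(names):
    """Return the sorted names that occur more than once."""
    counts = Counter(names)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up sample names, for illustration only
samples = ['X00180471', 'X00180199', 'X00180471', 'BLANK1']
print(find_duplicates(samples))
```

An empty list means every sample name is unique.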

# Assign barcodes according to primer plate

This portion of the notebook will assign a barcode to each sample according to the primer plate number.

As inputs, it requires:
1. A plate map dataframe (from previous step)
2. Preparation metadata for the plates. Most importantly, we need the `Primer Plate #` so we know which **EMP barcodes** to assign to each plate.

The workflow then:
1. Joins the preparation metadata with the plate metadata.
2. Assigns indices per sample.

## Enter and validate the plating metadata

- In general you will want to update all the fields, but the most important ones are the `Primer Plate #` and the `Plate Position`. `Primer Plate #` determines which EMP barcodes will be used for this plate. `Plate Position` determines the physical location of the plate.
- If you are plating fewer than four plates, remove the metadata for each unused plate by deleting the text between the curly braces.
- For missing fields, write NA between the single quotes, for example `'NA'`.
- To enter a plate, copy and paste the contents from the plates below.

In [5]:
_metadata = [
    {
        # top left plate
        'Plate Position': '1',
        'Primer Plate #': '1',

        'Plating': 'SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'Processing Robot': 'Echo550',
        'Sample Plate': 'THDMI_UK_Plate_2',
        'Project_Name': 'THDMI UK',
        'Original Name': ''
    },
    {
        # top right plate
        'Plate Position': '2',
        'Primer Plate #': '2',

        'Plating':'AS',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF4',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'Processing Robot': 'Echo550',
        'Sample Plate': 'THDMI_UK_Plate_3',
        'Project_Name': 'THDMI UK',
        'Original Name': ''
    },
    {
        # bottom left plate
        'Plate Position': '3',
        'Primer Plate #': '3',

        'Plating':'MB_SF',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF3',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'Processing Robot': 'Echo550',
        'Sample Plate': 'THDMI_UK_Plate_4',
        'Project_Name': 'THDMI UK',
        'Original Name': ''
    },
    {
        # bottom right plate
        'Plate Position': '4',
        'Primer Plate #': '4',

        'Plating':'AS',
        'Extraction Kit Lot': '166032128',
        'Extraction Robot': 'Carmen_HOWE_KF4',
        'TM1000 8 Tool': '109379Z',
        'Primer Date': '2021-08-17', # yyyy-mm-dd
        'MasterMix Lot': '978215',
        'Water Lot': 'RNBJ0628',
        'Processing Robot': 'Echo550',
        'Sample Plate': 'THDMI_US_Plate_6',
        'Project_Name': 'THDMI US',
        'Original Name': ''
    },
]

plate_metadata = validate_plate_metadata(_metadata)
plate_metadata

Unnamed: 0,Plate Position,Primer Plate #,Plating,Extraction Kit Lot,Extraction Robot,TM1000 8 Tool,Primer Date,MasterMix Lot,Water Lot,Processing Robot,Sample Plate,Project_Name,Original Name
0,1,1,SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,RNBJ0628,Echo550,THDMI_UK_Plate_2,THDMI UK,
1,2,2,AS,166032128,Carmen_HOWE_KF4,109379Z,2021-08-17,978215,RNBJ0628,Echo550,THDMI_UK_Plate_3,THDMI UK,
2,3,3,MB_SF,166032128,Carmen_HOWE_KF3,109379Z,2021-08-17,978215,RNBJ0628,Echo550,THDMI_UK_Plate_4,THDMI UK,
3,4,4,AS,166032128,Carmen_HOWE_KF4,109379Z,2021-08-17,978215,RNBJ0628,Echo550,THDMI_US_Plate_6,THDMI US,


The `Plate Position` and `Primer Plate #` allow us to figure out which wells are associated with each of the EMP barcodes.
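
In the example data of this notebook, compressed-plate wells A1, C1, E1, G1, and I1 belong to plate position 1 and receive the barcodes from primer-plate wells A1 through E1. One layout consistent with that is an interleaved 384-well compression, where each 2x2 block of wells holds one well from each 96-well plate. The sketch below assumes that layout. The function name is hypothetical, and the numbering of positions 2 to 4 is an assumption taken from the "top left / top right / bottom left / bottom right" comments in the metadata cell, not from metapool's source.

```python
import string


def compressed_well_to_plate(well):
    """Map a 384-well ID such as 'C1' to (plate position, 96-well ID).

    Assumes an interleaved layout: within each 2x2 block, position 1 is
    top left, 2 is top right, 3 is bottom left, and 4 is bottom right.
    """
    row = string.ascii_uppercase.index(well[0])  # 0-based, A=0 ... P=15
    col = int(well[1:]) - 1                      # 0-based, 0 ... 23
    position = 1 + 2 * (row % 2) + (col % 2)
    well_96 = f"{string.ascii_uppercase[row // 2]}{col // 2 + 1}"
    return position, well_96


for well in ['A1', 'C1', 'I1', 'A2', 'B1', 'B2']:
    print(well, compressed_well_to_plate(well))
```

Under this assumption, the 96-well ID on the source plate is also the well on the matching primer plate, which is how each sample would pick up its barcode.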

In [6]:
if plate_metadata is not None:
    plate_df = assign_emp_index(plate_df, plate_metadata).reset_index()
else:
    print('Error: Please fix the errors in the previous cell')

# keep this as the last expression in the cell so the table is displayed
plate_df.head()

The resulting table is now associated with the corresponding EMP barcodes (`Golay Barcode`, `Forward Primer Linker`, etc.) and the plating metadata (`Primer Plate #`, `Primer Date`, `Water Lot`, etc.).

# Combine plates (optional)

If you would like to combine existing plates with these samples, enter the path to their corresponding sample sheets and preparation files below. Otherwise you can skip to the next section.

- sample sheet and preparation

# Make Sample Sheet

This workflow takes the pooled sample information and writes an Illumina sample sheet that can be given directly to the sequencing center or processing pipeline. Note that, as of writing, `bcl2fastq` does not support error correction in Golay barcodes, so the sample sheet is used to generate a Qiita preparation file but not to demultiplex sequences. Demultiplexing takes place in [Qiita](https://qiita.ucsd.edu).

As inputs, this notebook requires:
1. A plate map DataFrame (from previous step)

The workflow:
1. Formats sample names to be bcl2fastq-compatible.
2. Formats sample data.
3. Sets values for sample sheet fields and formats the sample sheet.
4. Writes the sample sheet to a file.

## Step 1: Format sample names to be bcl2fastq-compatible

bcl2fastq requires sample names to contain *only* alphanumeric characters, hyphens, and underscores. We'll replace all other characters with underscores and add the bcl2fastq-compatible names to the DataFrame.
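
A minimal sketch of this kind of substitution is shown below. It is an illustration only: `scrub_name` is a hypothetical stand-in, and metapool's `bcl_scrub_name` may differ in detail (for example, in how it treats runs of invalid characters).

```python
import re


def scrub_name(name):
    """Replace each character that is not a letter, digit, hyphen, or underscore."""
    return re.sub(r'[^0-9a-zA-Z_-]', '_', str(name))


print(scrub_name('GLY 01.012'))   # GLY_01_012
print(scrub_name('X00180471'))    # already valid, so unchanged
```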

In [7]:
plate_df['sample sheet Sample_ID'] = plate_df['Sample'].map(bcl_scrub_name)

plate_df.head()

Unnamed: 0,index,Sample,Row,Col,Blank,Project Plate,Project Name,Compressed Plate Name,Well,Plate Position,...,Plate,Name,Illumina 5' Adapter,Golay Barcode,Forward Primer Pad,Forward Primer Linker,515FB Forward Primer (Parada),Primer For PCR,EMP Primer Plate Well,sample sheet Sample_ID
0,0,X00180471,A,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,A1,1,...,1,515rcbc0,AATGATACGGCGACCACCGAGATCTACACGCT,AGCCTTCGTCGC,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTAGCCTTCGTCGCTA...,A1,X00180471
1,1,X00180199,C,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,C1,1,...,1,515rcbc12,AATGATACGGCGACCACCGAGATCTACACGCT,CGTATAAATGCG,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTCGTATAAATGCGTA...,B1,X00180199
2,2,X00179789,E,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,E1,1,...,1,515rcbc24,AATGATACGGCGACCACCGAGATCTACACGCT,TGACTAATGGCC,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTTGACTAATGGCCTA...,C1,X00179789
3,3,X00180201,G,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,G1,1,...,1,515rcbc36,AATGATACGGCGACCACCGAGATCTACACGCT,GTGGAGTCTCAT,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTGTGGAGTCTCATTA...,D1,X00180201
4,4,X00180464,I,1,False,THDMI_10317_PUK2,THDMI_10317,THDMI_10317_UK2-US6,I1,1,...,1,515rcbc48,AATGATACGGCGACCACCGAGATCTACACGCT,TGATGTGCTAAG,TATGGTAATT,GT,GTGYCAGCMGCCGCGGTAA,AATGATACGGCGACCACCGAGATCTACACGCTTGATGTGCTAAGTA...,E1,X00180464


## Step 2: Format the sample sheet data

This step formats the data columns appropriately for the sample sheet, using the values we've calculated previously.

The newly created `bcl2fastq`-compatible names will be in the `Sample_ID` and `Sample_Name` columns. The original sample names will be in the `Description` column.

Modify lanes to indicate which lanes this pool will be sequenced on.

The `Project Name` and `Project Plate` columns will be placed in the `Sample_Project` and `Sample_Plate` columns, respectively.

`sequencer` is important for making sure the i5 index is in the correct orientation for demultiplexing. `HiSeq4000`, `HiSeq3000`, `NextSeq`, and `MiniSeq` all require reverse-complemented i5 index sequences. If you enter one of these exact strings for `sequencer`, the i5 sequence will be reverse-complemented for you.

`HiSeq2500`, `MiSeq`, and `NovaSeq` will not revcomp the i5 sequence.
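
The orientation rule can be sketched as follows. The function names and the `REVCOMP_SEQUENCERS` set are hypothetical; the set simply restates the list above, and the code is not metapool's `sequencer_i5_index`.

```python
# Sequencers listed above as needing a reverse-complemented i5 index
REVCOMP_SEQUENCERS = {'HiSeq4000', 'HiSeq3000', 'NextSeq', 'MiniSeq'}

_COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient_i5(sequencer, barcodes):
    """Return barcodes in the orientation the given sequencer expects."""
    if sequencer in REVCOMP_SEQUENCERS:
        return [reverse_complement(b) for b in barcodes]
    return list(barcodes)


print(orient_i5('MiSeq', ['AGCCTTCGTCGC']))    # unchanged
print(orient_i5('NextSeq', ['AGCCTTCGTCGC']))  # reverse-complemented
```

Because the match is on the exact string, a misspelled sequencer name would silently leave the barcodes unchanged in this sketch.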

In [8]:
sequencer = 'MiSeq'

data = format_sample_data(plate_df['sample sheet Sample_ID'],
                          len(plate_df) * [''],
                          len(plate_df) * [''],
                          plate_df['Name'],
                          sequencer_i5_index(sequencer, plate_df['Golay Barcode']),
                          wells=plate_df['Well'],
                          sample_plate=plate_df['Project Plate'],
                          description=plate_df['Sample'],
                          sample_proj=plate_df['Project Name'],
                          sep=',')

MiSeq: i5 barcodes are output in standard direction


In [9]:
contacts = {'Jeff Dereus': 'jdereus@ucsd.edu',
            'Gail Ackermann': 'ackermag@ucsd.edu',
            'MacKenzie Bryant': 'mmbryant@ucsd.edu'}

PI = {'Knight': 'robknight@ucsd.edu'}

other = None

Make sure the following two parameters are also accurate:

In [10]:
# date:
date = '2021-08-17'

# Experiment name: 
experiment = 'RKL_experiment'

In [11]:
# The other fields in the sample sheet can also be edited if necessary, but for most runs should stay the same

sample_sheet_dict = {
    'comments': format_sheet_comments(PI=PI, contacts=contacts, other=other),
    'IEMFileVersion': '4',
    'Investigator Name': 'Knight',
    'Experiment Name': experiment,
    'Date': date,
    'Workflow': 'GenerateFASTQ',
    'Application': 'FASTQ Only',
    'Assay': 'Amplicon',
    'Description': '',
    'Chemistry': 'Default',
    'read1': 150,
    'read2': 150,
    'ReverseComplement': '0',
    'data': data,
}

# format sample sheet
sample_sheet = format_sample_sheet(sample_sheet_dict)

## Step 4: Write the sample sheet to file

In [12]:
# write sample sheet as .csv
sample_sheet_fp = './test_output/amplicon/2021_08_17_THDMI-4-6_samplesheet.csv'

if os.path.isfile(sample_sheet_fp):
    print("Warning! This file exists already.")



In [13]:
with open(sample_sheet_fp,'w') as f:
    f.write(sample_sheet)
    
!head -n 30 {sample_sheet_fp}

# PI,Knight,robknight@ucsd.edu
# Contact,Gail Ackermann,Greg Humphrey,Jeff Dereus,Jon Sanders
# ,ackermag@ucsd.edu,ghsmu414@gmail.com,jdereus@ucsd.edu,jonsan@gmail.com
[Header]
IEMFileVersion,4
Investigator Name,Knight
Experiment Name,RKL_experiment
Date,2021-08-17
Workflow,GenerateFASTQ
Application,FASTQ Only
Assay,Amplicon
Description,
Chemistry,Default

[Reads]
150
150

[Settings]
ReverseComplement,0

[Data]
Lane,Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
1,X00180471,X00180471,THDMI_10317_PUK2,A1,,,515rcbc0,AGCCTTCGTCGC,THDMI_10317,X00180471
1,X00180199,X00180199,THDMI_10317_PUK2,C1,,,515rcbc12,CGTATAAATGCG,THDMI_10317,X00180199
1,X00179789,X00179789,THDMI_10317_PUK2,E1,,,515rcbc24,TGACTAATGGCC,THDMI_10317,X00179789
1,X00180201,X00180201,THDMI_10317_PUK2,G1,,,515rcbc36,GTGGAGTCTCAT,THDMI_10317,X00180201
1,X00180464,X00180464,THDMI_10317_PUK2,I1,,,515rcbc48,TGATGTGCTAAG,THDMI_10317,X0018046
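
To spot-check a written sheet, the `[Data]` section can be read back with the standard library. This sketch assumes the layout shown above, where `[Data]` is the last section and its first line is the column header; `read_data_section` is a hypothetical helper, not part of metapool.

```python
import csv
import io


def read_data_section(sheet_text):
    """Return the rows of the [Data] section as a list of dicts."""
    lines = sheet_text.splitlines()
    start = lines.index('[Data]') + 1  # the header row follows the marker
    return list(csv.DictReader(io.StringIO('\n'.join(lines[start:]))))


# Shortened example sheet with a reduced set of columns
example_sheet = """[Header]
IEMFileVersion,4

[Data]
Lane,Sample_ID,Sample_Well,index2
1,X00180471,A1,AGCCTTCGTCGC
1,X00180199,C1,CGTATAAATGCG
"""

rows = read_data_section(example_sheet)
print(len(rows), rows[0]['Sample_ID'], rows[0]['index2'])
```

For the file written above, the text would come from `open(sample_sheet_fp).read()`.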

# Create a preparation file for Qiita

In [14]:
output_filename = 'test_output/amplicon/2021-08-01-515f806r_prep.tsv'

column_renamer = {
    'Sample': 'sample_name',
    'Golay Barcode': 'barcode',
    '515FB Forward Primer (Parada)': 'primer',
    'Project Plate': 'project_plate',
    'Project Name': 'project_name',
    'Well': 'well',
    'Primer Plate #': 'primer_plate_number',
    'Plating': 'plating',
    'Extraction Kit Lot': 'extractionkit_lot',
    'Extraction Robot': 'extraction_robot',
    'TM1000 8 Tool': 'tm1000_8_tool',
    'Primer Date': 'primer_date',
    'MasterMix Lot': 'mastermix_lot',
    'Water Lot': 'water_lot',
    'Processing Robot': 'processing_robot',
    'sample sheet Sample_ID': 'well_description'
}

# select the columns of interest and rename them in one step
prep = plate_df[list(column_renamer)].rename(columns=column_renamer)
prep.set_index('sample_name', verify_integrity=True).to_csv(output_filename, sep='\t')

In [15]:
!head -n 5 {output_filename}

sample_name	barcode	primer	project_plate	project_name	well	primer_plate_number	plating	extractionkit_lot	extraction_robot	tm1000_8_tool	primer_date	mastermix_lot	water_lot	processing_robot	well_description
X00180471	AGCCTTCGTCGC	GTGYCAGCMGCCGCGGTAA	THDMI_10317_PUK2	THDMI_10317	A1	1	SF	166032128	Carmen_HOWE_KF3	109379Z	2021-08-17	978215	RNBJ0628	Echo550	X00180471
X00180199	CGTATAAATGCG	GTGYCAGCMGCCGCGGTAA	THDMI_10317_PUK2	THDMI_10317	C1	1	SF	166032128	Carmen_HOWE_KF3	109379Z	2021-08-17	978215	RNBJ0628	Echo550	X00180199
X00179789	TGACTAATGGCC	GTGYCAGCMGCCGCGGTAA	THDMI_10317_PUK2	THDMI_10317	E1	1	SF	166032128	Carmen_HOWE_KF3	109379Z	2021-08-17	978215	RNBJ0628	Echo550	X00179789
X00180201	GTGGAGTCTCAT	GTGYCAGCMGCCGCGGTAA	THDMI_10317_PUK2	THDMI_10317	G1	1	SF	166032128	Carmen_HOWE_KF3	109379Z	2021-08-17	978215	RNBJ0628	Echo550	X00180201
