# GMM Operation Test Notebooks Overview

This document outlines a series of operations designed to streamline data processing and analysis for GMM (Genetically Modified Microorganisms) testing. Our goal is to create an intuitive, user-friendly testing environment that ensures consistency across all scripts and applications.


In [1]:
from pathlib import Path

import pandas as pd  # used directly in Operation 4 (pd.read_excel, pd.merge)
from operations import *

path = Path('.') / "input_files"
plate_layout_file_path = path / "op1.plate_layout_to_spreadsheet" / "plate_spreadsheet_template.xlsx"
fcs_input_file_path = path / "op2.collate_fcs_files" / "14Jun23_INX_NKC_084_LCE123.fcs"
sample_sheet_file_path = path / "op3.merge_data_into_spreadsheet" / "sample_sheet.tsv"
collated_fcs_file_path = path / "op3.merge_data_into_spreadsheet" / "fcs_data.tsv"
template_sheet_file_path = path / "op3.merge_data_into_spreadsheet" / "template_sheet.xlsx"
primer_index_file_path = path / "op4.merge_primer_index" / "primer_index_template.xlsx"

## Operations Breakdown

### Operation 1: Create Sample Sheet from Plate Layout

**Purpose:** Converts the provided "Plate Layout" sheet into a detailed sample sheet. This process translates the information in an Excel file, including the color-coded plate information, into a structured format.

**Input File Preview:**

<img src="./images/plate_spreadsheet_template.png" alt="plate_spreadsheet_template.png" width="800px" />

**Output:** A sample sheet containing the following columns:

- `plate number` (column `plate`)
- `well position` (column `well_position`)
- `sample name` (column `sample`)

**Output File Preview:** 

<img src="./images/sample_sheet.png" alt="created_sample_sheet" width="800px" />

**How It Works:** This operation takes the visual and textual information from the Plate Layout sheet and organizes it into a tabular format that is easier to use for further analysis.
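
As a rough illustration of the grid-to-table idea described above, the sketch below flattens a small plate grid into one record per well. It is not the implementation of `plate_to_samplesheet`: the function name `grid_to_sample_sheet`, the list-of-rows input, and the rule that empty cells are skipped are all assumptions made for this example.

```python
from string import ascii_uppercase


def grid_to_sample_sheet(plate, grid):
    """Flatten a plate grid (a list of rows) into one record per well.

    Hypothetical helper: row i is labelled with the i-th letter (A, B, ...),
    columns are numbered from 1, and empty cells (None or "") are skipped.
    """
    records = []
    for row_letter, row in zip(ascii_uppercase, grid):
        for col_number, sample in enumerate(row, start=1):
            if sample:
                records.append(
                    {
                        "plate": plate,
                        "well_position": f"{row_letter}{col_number}",
                        "sample": sample,
                    }
                )
    return records


# A tiny 2 x 3 layout; well B2 is left empty on purpose.
grid = [
    ["Test before sort", "Test after sort", "NKC_084"],
    ["NKC_085", None, "NKC_085"],
]
sheet = grid_to_sample_sheet("LCE123", grid)
for record in sheet:
    print(record)
```

Each record carries the same three fields as the sample sheet shown below, so a list like this could be passed to `pandas.DataFrame` if a dataframe is needed.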

In [2]:
plate_to_samplesheet(plate_layout_file_path)

Unnamed: 0,plate,well_position,sample
0,LCE123,A1,Test before sort
1,LCE123,A2,Test after sort
2,LCE123,A3,NKC_084
3,LCE123,A4,NKC_084
4,LCE123,A5,NKC_084
...,...,...,...
379,LCE123,P20,NKC_085
380,LCE123,P21,NKC_085
381,LCE123,P22,NKC_085
382,LCE123,P23,Test after sort


### Operation 2: Combine FCS Files into One Document

**Purpose:** Merges multiple FCS files into a single TSV (tab-separated values) file by concatenating them vertically, that is, by stacking their rows. This operation is essential for consolidating flow cytometry data.

**Input File Preview:** FCS file names should follow a pattern similar to `14Jun23_INX_NKC_084_LCE662.fcs`.

After parsing with fcsparser, each file is loaded into a dataframe like the following:

<img src="./images/raw_fcs_file_dataframe.png" alt='raw FCS file after load in pandas dataframe' width="800px" />


**Output:** A single TSV file containing merged data from all provided FCS files.

**Output File Preview:** 

<img src="./images/collated_fcs_file.png" alt='collated FCS file' width="800px" />

**How It Works:** By vertically merging the data, we ensure that all information from the individual FCS files is preserved and compiled in a coherent order.
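
A minimal sketch of a vertical merge with pandas, assuming each FCS file has already been parsed into a dataframe. The file names, the values, and the extra `source_file` column are hypothetical and are not part of `collate_fcs_files`.

```python
import pandas as pd

# Hypothetical stand-ins for two already-parsed FCS files (illustrative values).
parsed_files = {
    "file_one.fcs": pd.DataFrame({"FSC-A": [1.0, 2.0], "Time": [10.0, 20.0]}),
    "file_two.fcs": pd.DataFrame({"FSC-A": [3.0], "Time": [30.0]}),
}

# Tag each frame with its origin, then stack the rows on top of each other.
frames = [df.assign(source_file=name) for name, df in parsed_files.items()]
collated = pd.concat(frames, ignore_index=True)  # ignore_index renumbers rows 0..n-1

print(collated)
```

Keeping a column that records the originating file is one way to trace a row back to its source after the frames are stacked.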

In [3]:
collate_fcs_files([fcs_input_file_path], "")

Unnamed: 0,FSC-A,FSC-H,SSC-A,SSC-H,CD16 FITC,CD56 PE,DAPI,Time,plate,well_position,sample
0,115089.296875,69562.0,75108.632812,44791.0,129.710007,37108.621094,61.410000,1698.400024,LCE123,A3,NKC_084
1,127671.296875,83696.0,64815.761719,43674.0,112.270004,17709.240234,52.509998,1865.300049,LCE123,A4,NKC_084
2,102016.796875,74176.0,50605.429688,35155.0,105.730003,25967.160156,41.829998,2052.899902,LCE123,A5,NKC_084
3,123290.093750,71681.0,100057.640625,55735.0,553.720032,66.299995,203.809998,2330.300049,LCE123,A6,NKC_084
4,110112.296875,68001.0,86295.304688,55742.0,1148.859985,149.940002,97.900002,2758.300049,LCE123,A7,NKC_084
...,...,...,...,...,...,...,...,...,...,...,...
351,155246.390625,88072.0,93204.812500,52107.0,69.760002,84.659996,163.759995,123816.203125,LCE123,O20,NKC_084
352,76473.000000,60291.0,48538.792969,37203.0,620.210022,19658.458984,29.369999,123984.898438,LCE123,O21,NKC_084
353,108873.000000,73971.0,101284.984375,70075.0,700.869995,23535.480469,110.360001,124672.398438,LCE123,O22,NKC_084
354,80415.898438,63039.0,33754.031250,25382.0,707.410034,9447.240234,37.380001,125308.500000,LCE123,O23,NKC_084


### Operation 3: Merge All Data into Comprehensive File

**Purpose:** Integrates the sample sheet from Operation 1, a template sheet provided by the lab, and the FCS results from Operation 2 into a unified document. This operation facilitates comprehensive data analysis by combining all relevant data points.

**Input Files Preview:** 

Sample sheet generated from operation 1:

<img src="./images/sample_sheet.png" alt='sample sheet generated from operation 1' width="800px" />

Combined FCS files from operation 2:

<img src="./images/collated_fcs_file.png" alt='collated fcs file from operation 2' width="800px" />

Template sheet provided by genomics lab:

<img src="./images/template_sheet.png" alt='template sheet provided by the genomics lab' width="800px" />

**Output:** A single file that merges the aforementioned documents based on `plate number`, `well position`, and `sample name`.

**Output File Preview:** 

<img src="./images/merged_all_data.png" alt='merged data from operation 3' width="800px" />

**How It Works:** This operation aligns data from different sources using key identifiers, ensuring that each data point is accurately matched and consolidated.
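
The key-based alignment can be sketched with a pandas left join on the three identifiers. This is an illustration under assumptions, not the body of `merge_data_with_samplesheet`: the choice of a left join and the `indicator` check are guesses that happen to match the output below, where wells without FCS measurements keep empty values. The numeric value is illustrative.

```python
import pandas as pd

keys = ["plate", "well_position", "sample"]

# Hypothetical miniature inputs.
sample_sheet = pd.DataFrame(
    {
        "plate": ["LCE123", "LCE123", "LCE123"],
        "well_position": ["A1", "A3", "A6"],
        "sample": ["Test before sort", "NKC_084", "NKC_085"],
    }
)
fcs = pd.DataFrame(
    {
        "plate": ["LCE123"],
        "well_position": ["A3"],
        "sample": ["NKC_084"],
        "FSC-A": [115089.3],
    }
)

# Keep every well from the sample sheet; attach FCS values where all keys match.
merged = sample_sheet.merge(fcs, on=keys, how="left", indicator=True)

# Rows flagged "left_only" had no matching FCS record.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
print(f"{len(unmatched)} wells without FCS data")
```

The `_merge` column makes it easy to audit how many wells were matched before it is dropped from the final file.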


In [4]:
merged_samplesheet_fcs_and_template_sheet_df = merge_data_with_samplesheet(spreadsheet_filepath=sample_sheet_file_path.as_posix(), 
                            fcs_file=collated_fcs_file_path.as_posix(), 
                            template_sheet_filepath=template_sheet_file_path)

merged_samplesheet_fcs_and_template_sheet_df.head(10)

Unnamed: 0,plate,well_position,Sample type\n(SC or MB),Tissue type\n(if required),sample,FACs gate\n(if required),C-RT1-_Primer name,RD1 index (cell index)_index sequence \n(as in C-RT1-primer),Illumina index\nIndex number_(separate index read),Illumina index\nIndex sequence_(separate index read),RT1 index primer sequences,Indexing,FSC-A,FSC-H,SSC-A,SSC-H,CD16 FITC,CD56 PE,DAPI,Time
0,LCE123,A1,empty,,Test before sort,,removed,removed,removed,removed,HPR control,,,,,,,,,
1,LCE123,A2,empty,,Test after sort,,removed,removed,removed,removed,HPR control,,,,,,,,,
2,LCE123,A3,,,NKC_084,,99,GGTCTATG,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,115089.3,69562.0,75108.63,44791.0,129.71,37108.62,61.41,1698.4
3,LCE123,A4,,,NKC_084,,100,GTCCGAAT,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,127671.3,83696.0,64815.76,43674.0,112.270004,17709.24,52.51,1865.3
4,LCE123,A5,,,NKC_084,,101,TAGTGCGT,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,102016.8,74176.0,50605.43,35155.0,105.73,25967.16,41.829998,2052.9
5,LCE123,A6,,,NKC_085,,102,GACTGTAC,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,,,,,,,,
6,LCE123,A7,,,NKC_085,,103,TCCAGTAG,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,,,,,,,,
7,LCE123,A8,,,NKC_085,,104,AGCGTTGT,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,,,,,,,,
8,LCE123,A9,,,NKC_085,,105,GATGCGTT,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,,,,,,,,
9,LCE123,A10,,,NKC_085,,106,CCGTTAAG,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,,,,,,,,


### Operation 4 (Optional): Add Primer Index to Comprehensive File

**Purpose:** Enhances the comprehensive file from Operation 3 by adding Primer Index information provided by the lab. This step is optional but recommended for more detailed analysis.

**Input File Preview:** 

Merged data from operation 3:

<img src="./images/merged_all_data.png" alt='merged data from operation 3' width="800px" />

Given Primer Index (Sample):

<img src="./images/primer_index.png" alt='primer_index_file' width="800px" />

**Output:** The comprehensive file with added Primer Index information for each sample.

**Output File Preview:** 

<img src="./images/merged_primer_index.png" alt='merged_primer_index' width="800px" />

**How It Works:** This operation appends the Primer Index data to the existing file, enriching the dataset with additional details that can be critical for certain analyses.
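
The cell below passes `suffixes=('', '_primer')` to `pd.merge`. The following sketch, using hypothetical one-row frames and placeholder values, shows what that argument does when both inputs share a non-key column: the left column keeps its name and the right one gains the `_primer` suffix.

```python
import pandas as pd

keys = ["plate", "well_position", "sample"]

# Hypothetical one-row inputs; both carry an "RT1 index primer sequences" column.
merged_data = pd.DataFrame(
    {
        "plate": ["LCE123"],
        "well_position": ["A3"],
        "sample": ["NKC_084"],
        "RT1 index primer sequences": ["value from merged data"],
    }
)
primer_index = pd.DataFrame(
    {
        "plate": ["LCE123"],
        "well_position": ["A3"],
        "sample": ["NKC_084"],
        "RT1 index primer sequences": ["value from primer index"],
        "Strain": ["example strain"],
    }
)

with_primers = merged_data.merge(
    primer_index, on=keys, how="left", suffixes=("", "_primer")
)
print(list(with_primers.columns))
```

This explains the `RT1 index primer sequences_primer` column in the output below: it is the primer-index copy of a column that already existed in the Operation 3 result.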

In [5]:
# load primer index template
primer_index_df = pd.read_excel(primer_index_file_path, sheet_name='Sample primer & index', skiprows=3)
primer_index_df.rename({'Plate#': 'plate', 'Well position': 'well_position', 'Sample name': 'sample'}, axis=1, inplace=True)

# generate mockup test result file
merged_primer_index_df = pd.merge(merged_samplesheet_fcs_and_template_sheet_df, primer_index_df, 
                                  left_on=['plate', 'well_position', 'sample'], 
                                  right_on=['plate', 'well_position', 'sample'],
                                  suffixes=('', '_primer'), how='left')

merged_primer_index_df.head(10)

Unnamed: 0,plate,well_position,Sample type\n(SC or MB),Tissue type\n(if required),sample,FACs gate\n(if required),C-RT1-_Primer name,RD1 index (cell index)_index sequence \n(as in C-RT1-primer),Illumina index\nIndex number_(separate index read),Illumina index\nIndex sequence_(separate index read),...,DAPI,Time,Strain,Embryo / Adult,Cell Type,Primer name,index sequence \n(as in C-RT1-primer),(separate index read),(separate index read).1,RT1 index primer sequences_primer
0,LCE123,A1,empty,,Test before sort,,removed,removed,removed,removed,...,,,,,,,,,,
1,LCE123,A2,empty,,Test after sort,,removed,removed,removed,removed,...,,,,,,,,,,
2,LCE123,A3,,,NKC_084,,99,GGTCTATG,,,...,61.41,1698.4,,,,,,,,
3,LCE123,A4,,,NKC_084,,100,GTCCGAAT,,,...,52.51,1865.3,,,,,,,,
4,LCE123,A5,,,NKC_084,,101,TAGTGCGT,,,...,41.829998,2052.9,,,,,,,,
5,LCE123,A6,,,NKC_085,,102,GACTGTAC,,,...,,,,,,,,,,
6,LCE123,A7,,,NKC_085,,103,TCCAGTAG,,,...,,,,,,,,,,
7,LCE123,A8,,,NKC_085,,104,AGCGTTGT,,,...,,,,,,,,,,
8,LCE123,A9,,,NKC_085,,105,GATGCGTT,,,...,,,,,,,,,,
9,LCE123,A10,,,NKC_085,,106,CCGTTAAG,,,...,,,,,,,,,,


## Streamline operations and generate one final output

In [4]:
from pathlib import Path

import pandas as pd  # used directly in operation 4 below
from operations import *

# create a folder to store temporary results
temp_file_path = Path('.') / 'temp'
# exist_ok=True makes this a no-op when the folder already exists
temp_file_path.mkdir(parents=True, exist_ok=True)

path = Path(".") / "input_files"
plate_layout_file_path = path / "op1.plate_layout_to_spreadsheet" / "plate_spreadsheet_template.xlsx"
fcs_input_file_path = path / "op2.collate_fcs_files" / "14Jun23_INX_NKC_084_LCE123.fcs"
sample_sheet_file_path = path / "op3.merge_data_into_spreadsheet" / "sample_sheet.tsv" 
template_sheet_file_path = path / "op3.merge_data_into_spreadsheet" / "template_sheet.xlsx"
primer_index_file_path = path / "op4.merge_primer_index" / "primer_index_template.xlsx"

In [7]:
# operation 1: create sample sheet from plate layout template
sample_sheet_df = plate_to_samplesheet(plate_layout_file_path)
sample_sheet_df.to_csv('temp/op1.plate_layout_to_spreadsheet.tsv', sep='\t', index=False)

# operation 2: combine fcs files into one tsv file
collated_fcs_df = collate_fcs_files([fcs_input_file_path], "")  # provide a list of fcs files
collated_fcs_df.to_csv('temp/op2.collate_fcs_files.tsv', sep='\t', index=False)

# operation 3: merge all data into a comprehensive file
merged_samplesheet_fcs_and_template_sheet_df =  merge_data_with_samplesheet(spreadsheet_filepath=sample_sheet_file_path.as_posix(), 
                                                                            fcs_file="temp/op2.collate_fcs_files.tsv", 
                                                                            template_sheet_filepath=template_sheet_file_path)

# operation 4 (optional): add primer index to comprehensive file
primer_index_df = pd.read_excel(primer_index_file_path, sheet_name='Sample primer & index', skiprows=3)
primer_index_df.rename({'Plate#': 'plate', 'Well position': 'well_position', 'Sample name': 'sample'}, axis=1, inplace=True)

# generate mockup test result file
merged_primer_index_df = pd.merge(merged_samplesheet_fcs_and_template_sheet_df, primer_index_df, 
                                  left_on=['plate', 'well_position', 'sample'], 
                                  right_on=['plate', 'well_position', 'sample'],
                                  suffixes=('', '_primer'), how='left')

merged_primer_index_df.to_csv('temp/final.tsv', sep='\t', index=False)
merged_primer_index_df.to_excel('temp/final.xlsx', index=False)

In [6]:
# compare output file with pre-defined test result
!md5sum temp/final.tsv
!md5sum output_files/op4.merged_primer_index.tsv

f3fe361c378334c831838df7ceaae478  temp/final.tsv
f3fe361c378334c831838df7ceaae478  output_files/op4.merged_primer_index.tsv
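
The `!md5sum` shell call above depends on the `md5sum` binary being available. As a portable alternative, the same comparison can be sketched in Python with `hashlib`. The helper name `md5_of_file` is hypothetical, and the demo writes two small temporary files instead of reading the real outputs.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with two throwaway files that have identical contents.
with tempfile.TemporaryDirectory() as tmp:
    produced = Path(tmp) / "final.tsv"
    expected = Path(tmp) / "expected.tsv"
    produced.write_text("plate\twell_position\nLCE123\tA1\n")
    expected.write_text("plate\twell_position\nLCE123\tA1\n")
    files_match = md5_of_file(produced) == md5_of_file(expected)

print(files_match)
```

In the notebook, the two paths would be `temp/final.tsv` and `output_files/op4.merged_primer_index.tsv`, and the comparison could be wrapped in an `assert` so that a mismatch stops the run.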
