# GMM Operation Test Notebooks Overview

This document outlines a series of operations designed to streamline data processing and analysis for GMM (Genomics Metadata Multiplexing) testing. Our goal is to create an intuitive and user-friendly testing environment that ensures consistency across all scripts and applications.


In [21]:
%load_ext autoreload
# Mode 2 reloads every imported module; mode 1 only reloads modules registered with %aimport
%autoreload 2

from pathlib import Path

import pandas as pd  # used directly in Operation 4, so import it explicitly
from operations_bulk import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [36]:
# Define the path to the 'data' directory
path = Path('.') / 'data'

# Obtain the file path for a plate file
plate_path = next(path.glob('*plate_spreadsheet*'), None)
if plate_path is not None:
    plate_path = path / plate_path.name

# Obtain the file paths for all .fcs files
fcs_paths = list(path.glob('*.fcs'))

# Obtain the file path for a template file
# ('*template*' alone would also match the plate and primer index templates)
template_path = next(path.glob('*template_sheet*'), None)
if template_path is not None:
    template_path = path / template_path.name

# Obtain the file path for a primer file
primer_path = next(path.glob('*primer_index*'), None)
if primer_path is not None:
    primer_path = path / primer_path.name
    
    
(
    plate_path,
    fcs_paths,
    template_path,
    primer_path
)

(PosixPath('data/plate_spreadsheet_template.xlsx'),
 [PosixPath('data/14Jun23_INX_Ref_Ctrl_LCE123.fcs'),
  PosixPath('data/14Jun23_INX_NKC_084_LCE663.fcs'),
  PosixPath('data/14Jun23_INX_NKC_085_LCE123.fcs'),
  PosixPath('data/14Jun23_INX_NKC_084_LCE123.fcs')],
 PosixPath('data/template_sheet.xlsx'),
 PosixPath('data/primer_index_template.xlsx'))

## Operations Breakdown

### Operation 1: Create Sample Sheet from Plate Layout

**Purpose:** Converts the provided "Plate Layout" sheet into a detailed sample sheet. This process involves translating the information from an Excel file, which includes color-coded plate information, into a structured format.

**Input File Preview:**

<img src="./images/plate_spreadsheet_template.png" alt="plate_spreadsheet_template.png" width="800px" />

**Output:** A sample sheet containing the following columns:

- `Plate#` (plate number)
- `Well position`
- `Sample name`

**Output File Preview:** 

<img src="./images/sample_sheet.png" alt="created_sample_sheet" width="800px" />

**How It Works:** This operation takes the visual and textual information from the Plate Layout sheet and organizes it into a tabular format that is easier to use for further analysis.
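The internals of `plate_to_samplesheet` are not shown in this notebook, so the following is only a minimal sketch of the grid-to-table idea using pandas. It assumes the plate layout has already been read into a DataFrame whose index holds row letters and whose columns hold column numbers; the helper name `grid_to_sample_sheet` and the small example grid are hypothetical, and reading cell colors from the Excel file is not covered.

```python
import pandas as pd


def grid_to_sample_sheet(grid, plate_id):
    """Flatten a plate grid (rows = letters, columns = numbers) into one row per well."""
    long = (
        grid.rename_axis("row")
        .reset_index()
        .melt(id_vars="row", var_name="col", value_name="Sample name")
        .dropna(subset=["Sample name"])  # skip wells with no sample
    )
    long["col"] = long["col"].astype(int)
    # melt returns column-major order; sort back to A1, A2, ..., B1, ...
    long = long.sort_values(["row", "col"]).reset_index(drop=True)
    long["Well position"] = long["row"] + long["col"].astype(str)
    long["Plate#"] = plate_id
    return long[["Plate#", "Well position", "Sample name"]]


# Hypothetical 2 x 3 corner of a plate; well B2 is left empty
grid = pd.DataFrame(
    [
        ["Test before sort", "Test after sort", "NKC_084"],
        ["NKC_084", None, "Ref_Ctrl"],
    ],
    index=["A", "B"],
    columns=[1, 2, 3],
)
sheet = grid_to_sample_sheet(grid, "LCE123")
print(sheet)
```

The long format puts one well on each row, which is what the later merge steps rely on when they join on plate, well, and sample name.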

In [25]:
sample_sheet_df = plate_to_samplesheet(plate_path)  # parse the plate layout once, then reuse
sample_sheet_df.to_csv('input_files/op3.merge_data_into_spreadsheet/sample_sheet.tsv', index=False, sep='\t')
sample_sheet_df.to_csv('output_files/op1.plate_layout_to_spreadsheet.tsv', index=False, sep='\t')

sample_sheet_df.head()

Unnamed: 0,Plate#,Well position,Sample name
0,LCE123,A1,Test before sort
1,LCE123,A2,Test after sort
2,LCE123,A3,NKC_084
3,LCE123,A4,NKC_084
4,LCE123,A5,NKC_084


### Operation 2: Combine FCS Files into One Document

**Purpose:** Merges multiple FCS files into a single TSV (Tab Separated Values) file using a vertical merge principle. This operation is essential for consolidating flow cytometry data.

**Input File Preview:** FCS file names should follow a pattern similar to `14Jun23_INX_NKC_084_LCE662.fcs`.

After parsing with fcsparser, each file is returned as a dataframe like the following:

<img src="./images/raw_fcs_file_dataframe.png" alt='raw FCS file after load in pandas dataframe' width="800px" />


**Output:** A single TSV file containing merged data from all provided FCS files.

**Output File Preview:** 

<img src="./images/collated_fcs_file.png" alt='collated FCS file from operation 2' width="800px" />

**How It Works:** By vertically merging the data, we ensure that all information from the individual FCS files is preserved and compiled in a coherent order.
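As a rough illustration of the vertical merge, the sketch below tags each per-file dataframe with metadata parsed from its file name and stacks the results with `pd.concat`. It is not the implementation of `collate_fcs_files`: the name pattern `<date>_INX_<sample name>_<plate>.fcs` is an assumption inferred from the example file names, the helpers `parse_fcs_name` and `collate` are hypothetical, and small in-memory dataframes with placeholder values stand in for parsed FCS events.

```python
import re
from pathlib import Path

import pandas as pd

# Assumed pattern, inferred from names like 14Jun23_INX_NKC_084_LCE123.fcs
NAME_PATTERN = re.compile(r"^(?P<date>[^_]+)_INX_(?P<sample>.+)_(?P<plate>[^_]+)$")


def parse_fcs_name(path):
    """Extract plate and sample name from an FCS file name (hypothetical helper)."""
    match = NAME_PATTERN.match(Path(path).stem)
    if match is None:
        raise ValueError(f"Unexpected FCS file name: {path}")
    return {"Plate#": match["plate"], "Sample name": match["sample"]}


def collate(frames_by_path):
    """Stack per-file event tables vertically, tagging each row with its source metadata."""
    tagged = [
        events.assign(**parse_fcs_name(path))
        for path, events in frames_by_path.items()
    ]
    return pd.concat(tagged, ignore_index=True)


# Placeholder event values; real data would come from parsing each FCS file
frames = {
    "data/14Jun23_INX_Ref_Ctrl_LCE123.fcs": pd.DataFrame({"FSC-A": [1.0, 2.0]}),
    "data/14Jun23_INX_NKC_084_LCE123.fcs": pd.DataFrame({"FSC-A": [3.0]}),
}
collated = collate(frames)
print(collated)
```

Passing `ignore_index=True` gives the stacked table a fresh row index, so rows from different files do not share index labels.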

In [27]:
collated_fcs_df = collate_fcs_files(fcs_paths, "")  # pass the list of FCS paths defined above
collated_fcs_df.to_csv('input_files/op3.merge_data_into_spreadsheet/fcs_data.tsv', sep='\t', index=False)
collated_fcs_df.to_csv('output_files/op2.collate_fcs_files.tsv', sep='\t', index=False)
collated_fcs_df.head()

Unnamed: 0,FSC-A,FSC-H,SSC-A,SSC-H,CD16 FITC,CD56 PE,DAPI,Time,Plate#,Well position,Sample name
0,103059.0,75474.0,50749.3125,33419.0,197.290009,59648.578125,116.589996,1673.800049,LCE123,P3,Ref_Ctrl
1,76914.0,64132.0,20557.400391,15926.0,65.400002,17639.878906,45.389999,1864.0,LCE123,P4,Ref_Ctrl
2,72203.398438,58039.0,22769.009766,16914.0,59.950001,11540.279297,30.26,2106.300049,LCE123,P5,Ref_Ctrl
3,64366.199219,54005.0,28015.181641,24125.0,59.950001,7833.599609,-3.56,2295.199951,LCE123,P6,Ref_Ctrl
4,79505.101562,63489.0,35856.640625,26323.0,37.060001,15918.120117,-24.029999,2463.899902,LCE123,P7,Ref_Ctrl


### Operation 3: Merge All Data into Comprehensive File

**Purpose:** Integrates the sample sheet from Operation 1, a template sheet provided by the lab, and the FCS results from Operation 2 into a unified document. This operation facilitates comprehensive data analysis by combining all relevant data points.

**Input Files Preview:** 

Sample sheet generated from operation 1:

<img src="./images/sample_sheet.png" alt='sample sheet generated from operation 1' width="800px" />

Combined FCS files from operation 2:

<img src="./images/collated_fcs_file.png" alt='collated fcs file from operation 2' width="800px" />

Template sheet provided by genomics lab:

<img src="./images/template_sheet.png" alt='template sheet provide by the genomics lab' width="800px" />

**Output:** A single file that merges the aforementioned documents on the `Plate#` (plate number), `Well position`, and `Sample name` columns.

**Output File Preview:** 

<img src="./images/merged_all_data.png" alt='merged sample sheet, FCS data, and template sheet' width="800px" />

**How It Works:** This operation aligns data from different sources using key identifiers, ensuring that each data point is accurately matched and consolidated.
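The key-based alignment can be sketched with a plain pandas left join. This is not the code inside `merge_data_with_samplesheet`; it only shows how the three key columns drive the match and how wells without FCS data can be spotted. The two small tables below use placeholder values.

```python
import pandas as pd

KEYS = ["Plate#", "Well position", "Sample name"]

# Placeholder sample sheet (as produced by Operation 1)
sample_sheet = pd.DataFrame(
    {
        "Plate#": ["LCE123", "LCE123", "LCE123"],
        "Well position": ["A1", "A3", "A4"],
        "Sample name": ["Test before sort", "NKC_084", "NKC_084"],
    }
)

# Placeholder collated FCS data (as produced by Operation 2)
fcs = pd.DataFrame(
    {
        "Plate#": ["LCE123", "LCE123"],
        "Well position": ["A3", "A4"],
        "Sample name": ["NKC_084", "NKC_084"],
        "FSC-A": [1.0, 2.0],
    }
)

merged = sample_sheet.merge(
    fcs,
    on=KEYS,
    how="left",             # keep every well from the sample sheet
    validate="one_to_one",  # fail loudly if a key combination is duplicated
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Wells present in the sample sheet but absent from the FCS data
unmatched = merged.loc[merged["_merge"] == "left_only", "Well position"].tolist()
print(merged)
print("Wells without FCS data:", unmatched)
```

A left join keeps control and empty wells in the output with missing measurement values, which matches the blank FCS columns seen for wells such as `A1` in the merged table.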


In [34]:
sample_sheet_file_path = Path("input_files") / "op3.merge_data_into_spreadsheet" / "sample_sheet.tsv"
collated_fcs_file_path = Path("input_files") / "op3.merge_data_into_spreadsheet" / "fcs_data.tsv"


merged_samplesheet_fcs_and_template_sheet_df = merge_data_with_samplesheet(
    spreadsheet_filepath=sample_sheet_file_path.as_posix(),
    fcs_file=collated_fcs_file_path.as_posix(),
    template_sheet_filepath=template_path,
)

merged_samplesheet_fcs_and_template_sheet_df.to_csv('output_files/op3.merged_sample_sheet.tsv', index=False, sep='\t')
merged_samplesheet_fcs_and_template_sheet_df

Unnamed: 0,Plate#,Well position,Sample type\n(SC or MB),Tissue type\n(if required),Sample name,FACs gate\n(if required),C-RT1-_Primer name,RD1 index (cell index)_index sequence \n(as in C-RT1-primer),Illumina index\nIndex number_(separate index read),Illumina index\nIndex sequence_(separate index read),RT1 index primer sequences,Indexing,FSC-A,FSC-H,SSC-A,SSC-H,CD16 FITC,CD56 PE,DAPI,Time
0,LCE123,A1,empty,,Test before sort,,removed,removed,removed,removed,HPR control,,,,,,,,,
1,LCE123,A2,empty,,Test after sort,,removed,removed,removed,removed,HPR control,,,,,,,,,
2,LCE123,A3,,,NKC_084,,99,GGTCTATG,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,115089.3,69562.0,75108.63,44791.0,129.710000,37108.62,61.410000,1698.4
3,LCE123,A4,,,NKC_084,,100,GTCCGAAT,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,127671.3,83696.0,64815.76,43674.0,112.270004,17709.24,52.510000,1865.3
4,LCE123,A5,,,NKC_084,,101,TAGTGCGT,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,102016.8,74176.0,50605.43,35155.0,105.730000,25967.16,41.829998,2052.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
380,LCE123,P21,,,NKC_085,,457,ATGAGCTC,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,,,,,,,,
381,LCE123,P22,,,NKC_085,,478,ACGACTCA,,,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...,,,,,,,,,
382,LCE123,P23,empty,,Test after sort,,,removed,removed,removed,HPR control,,,,,,,,,
383,LCE123,P24,empty,,Test before sort,,,removed,removed,removed,HPR control,,,,,,,,,


### Operation 4 (Optional): Add Primer Index to Comprehensive File

**Purpose:** Enhances the comprehensive file from Operation 3 by adding Primer Index information provided by the lab. This step is optional but recommended for more detailed analysis.

**Input File Preview:** 

Merged data from operation 3:

<img src="./images/merged_all_data.png" alt='merged data from operation 3' width="800px" />

Given Primer Index (Sample):

<img src="./images/primer_index.png" alt='primer_index_file' width="800px" />

**Output:** The comprehensive file with added Primer Index information for each sample.

**Output File Preview:** 

<img src="./images/merged_primer_index.png" alt='merged_primer_index' width="800px" />

**How It Works:** This operation appends the Primer Index data to the existing file, enriching the dataset with additional details that can be critical for certain analyses.
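One detail worth illustrating is what happens when the primer index shares a non-key column name with the comprehensive file. The sketch below uses two small placeholder tables to show how `suffixes=('', '_primer')` keeps the original column unchanged and renames the incoming one; the sequence strings are stand-ins, not real primer sequences.

```python
import pandas as pd

KEYS = ["Plate#", "Well position", "Sample name"]

# Placeholder slice of the comprehensive file from Operation 3
comprehensive = pd.DataFrame(
    {
        "Plate#": ["LCE123", "LCE123"],
        "Well position": ["A3", "A8"],
        "Sample name": ["NKC_084", "NKC_085"],
        "RT1 index primer sequences": ["seq-a", "seq-b"],
    }
)

# Placeholder primer index that only covers well A3
primer_index = pd.DataFrame(
    {
        "Plate#": ["LCE123"],
        "Well position": ["A3"],
        "Sample name": ["NKC_084"],
        "Strain": ["WT"],
        "RT1 index primer sequences": ["seq-a"],
    }
)

enriched = comprehensive.merge(
    primer_index,
    on=KEYS,
    how="left",
    suffixes=("", "_primer"),  # left column keeps its name, right one gets the suffix
)
print(enriched.columns.tolist())
print(enriched)
```

Wells that are missing from the primer index keep their row and simply have empty primer columns, as with `A8` here.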

In [35]:
# Load the primer index template
primer_index_df = pd.read_excel(primer_path, sheet_name='Sample primer & index', skiprows=3)

# Left-join the primer index onto the merged data; overlapping column names get a '_primer' suffix
merged_primer_index_df = pd.merge(merged_samplesheet_fcs_and_template_sheet_df, primer_index_df, 
                                  on=['Plate#', 'Well position', 'Sample name'], 
                                  suffixes=('', '_primer'), how='left')
merged_primer_index_df.to_csv('output_files/op4.merged_primer_index.tsv', sep='\t', index=False)
merged_primer_index_df.head(10)

Unnamed: 0,Plate#,Well position,Sample type\n(SC or MB),Tissue type\n(if required),Sample name,FACs gate\n(if required),C-RT1-_Primer name,RD1 index (cell index)_index sequence \n(as in C-RT1-primer),Illumina index\nIndex number_(separate index read),Illumina index\nIndex sequence_(separate index read),...,DAPI,Time,Strain,Embryo / Adult,Cell Type,Primer name,index sequence \n(as in C-RT1-primer),(separate index read),(separate index read).1,RT1 index primer sequences_primer
0,LCE123,A1,empty,,Test before sort,,removed,removed,removed,removed,...,,,,,,,,,,
1,LCE123,A2,empty,,Test after sort,,removed,removed,removed,removed,...,,,,,,,,,,
2,LCE123,A3,,,NKC_084,,99,GGTCTATG,,,...,61.41,1698.4,WT,e 14 Embryo,MPP4,RT1-99,GGTCTATG,RPI9,GATCAG,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...
3,LCE123,A4,,,NKC_084,,100,GTCCGAAT,,,...,52.51,1865.3,WT,e 14 Embryo,MPP4,RT1-100,GTCCGAAT,RPI9,GATCAG,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...
4,LCE123,A5,,,NKC_084,,101,TAGTGCGT,,,...,41.829998,2052.9,WT,e 14 Embryo,MPP4,RT1-101,TAGTGCGT,RPI9,GATCAG,CGATTGAGGCCGGTAATACGACTCACTATAGGGGTTCAGAGTTCTA...
5,LCE123,A6,,,Ref_Ctrl,,102,GACTGTAC,,,...,,,,,,,,,,
6,LCE123,A7,,,Ref_Ctrl,,103,TCCAGTAG,,,...,,,,,,,,,,
7,LCE123,A8,,,NKC_085,,104,AGCGTTGT,,,...,71.2,3001.6,,,,,,,,
8,LCE123,A9,,,NKC_085,,105,GATGCGTT,,,...,105.909996,3369.7,,,,,,,,
9,LCE123,A10,,,NKC_085,,106,CCGTTAAG,,,...,187.79,4343.7,,,,,,,,


## Streamline operations and generate one final output

In [1]:
%load_ext autoreload
# Mode 2 reloads every imported module; mode 1 only reloads modules registered with %aimport
%autoreload 2

from pathlib import Path

import pandas as pd  # used directly in Operation 4, so import it explicitly
from operations_bulk import *

# Define the path to the 'data' directory
path = Path('.') / 'data'

# Obtain the file path for a plate file
plate_path = next(path.glob('*plate_spreadsheet*'), None)
if plate_path is not None:
    plate_path = path / plate_path.name

# Obtain the file paths for all .fcs files
fcs_paths = list(path.glob('*.fcs'))

# Obtain the file path for a template file
# ('*template*' alone would also match the plate and primer index templates)
template_path = next(path.glob('*template_sheet*'), None)
if template_path is not None:
    template_path = path / template_path.name

# Obtain the file path for a primer file
primer_path = next(path.glob('*primer_index*'), None)
if primer_path is not None:
    primer_path = path / primer_path.name
    
    
(
    plate_path,
    fcs_paths,
    template_path,
    primer_path
)

(PosixPath('data/plate_spreadsheet_template.xlsx'),
 [PosixPath('data/14Jun23_INX_Ref_Ctrl_LCE123.fcs'),
  PosixPath('data/14Jun23_INX_NKC_084_LCE663.fcs'),
  PosixPath('data/14Jun23_INX_NKC_085_LCE123.fcs'),
  PosixPath('data/14Jun23_INX_NKC_084_LCE123.fcs')],
 PosixPath('data/template_sheet.xlsx'),
 PosixPath('data/primer_index_template.xlsx'))

In [2]:
# create a folder to store temporary results
temp_file_path = Path('.') / 'temp'
temp_file_path.mkdir(parents=True, exist_ok=True)

In [3]:
# operation 1: create sample sheet from plate layout template
sample_sheet_df = plate_to_samplesheet(plate_path)
sample_sheet_df.to_csv('temp/op1.plate_layout_to_spreadsheet.tsv', sep='\t', index=False)

# operation 2: combine fcs files into one tsv file
collated_fcs_df = collate_fcs_files(fcs_paths, "")  # provide a list of fcs files
collated_fcs_df.to_csv('temp/op2.collate_fcs_files.tsv', sep='\t', index=False)

# operation 3: merge all data into a comprehensive file
merged_samplesheet_fcs_and_template_sheet_df = merge_data_with_samplesheet(
    spreadsheet_filepath='temp/op1.plate_layout_to_spreadsheet.tsv',
    fcs_file='temp/op2.collate_fcs_files.tsv',
    template_sheet_filepath=template_path,
)

# operation 4 (optional): add primer index to comprehensive file
primer_index_df = pd.read_excel(primer_path, sheet_name='Sample primer & index', skiprows=3)

# left-join the primer index onto the merged data
merged_primer_index_df = pd.merge(merged_samplesheet_fcs_and_template_sheet_df, primer_index_df, 
                                  on=['Plate#', 'Well position', 'Sample name'], 
                                  suffixes=('', '_primer'), how='left')

merged_primer_index_df.to_csv('temp/final.tsv', sep='\t', index=False)
merged_primer_index_df.to_excel('temp/final.xlsx', index=False)

In [3]:
# script check: run the same four operations through the packaged entry point
from fcs_converter import process_files

process_files(
    plate_layout_path=plate_path,
    fcs_files=fcs_paths,
    template_sheet_path=template_path,
    primer_index_path=primer_path,
    output_file="temp/final_bulk.tsv"
)

In [4]:
# compare the step-by-step and scripted outputs with the pre-defined test result;
# identical checksums mean the files match byte for byte
!md5sum temp/final.tsv
!md5sum temp/final_bulk.tsv
!md5sum output_files/op4.merged_primer_index.tsv

172c947caf7726aad2e756f22bce861f  temp/final.tsv
172c947caf7726aad2e756f22bce861f  temp/final_bulk.tsv
172c947caf7726aad2e756f22bce861f  output_files/op4.merged_primer_index.tsv
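The `md5sum` shell command may not be available on every platform, so the same check can be sketched in Python with the standard library's `hashlib`. The helper names `md5_of` and `all_identical` are hypothetical, and the demo runs on throwaway files rather than the real outputs.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def all_identical(paths):
    """True when every file in `paths` has the same MD5 digest."""
    return len({md5_of(p) for p in paths}) == 1


# Demo on throwaway files (stand-ins for the real TSV outputs)
with tempfile.TemporaryDirectory() as tmp:
    first, second, third = (Path(tmp) / name for name in ("a.tsv", "b.tsv", "c.tsv"))
    first.write_text("Plate#\tWell position\nLCE123\tA1\n")
    second.write_text("Plate#\tWell position\nLCE123\tA1\n")
    third.write_text("Plate#\tWell position\nLCE123\tA2\n")
    same = all_identical([first, second])
    differs = all_identical([first, second, third])

print(same, differs)
```

In this notebook the equivalent check would be to call `all_identical` on `temp/final.tsv`, `temp/final_bulk.tsv`, and `output_files/op4.merged_primer_index.tsv` and assert that it returns `True`.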
