### DATA COLLECTION AND SIMULATION FOR HPLC/GC QC ANALYSIS

#### Abstract

In regulated analytical laboratories, quality control (QC) data are typically generated directly by chromatographic instrument software (HPLC or GC). These raw outputs are often proprietary, locked within vendor systems, or subject to confidentiality constraints. As a result, direct sharing or reuse of real QC datasets is frequently impractical.

To address these limitations while preserving analytical realism, this project employs a simulated chromatographic QC dataset designed to emulate real-world laboratory behavior. The simulated data enable controlled evaluation of precision, accuracy, variability, and trend detection, while ensuring reproducibility, auditability, and data privacy.

### Dataset Design Rationale

The dataset was structured to reflect routine HPLC and GC assay workflows, incorporating expected performance characteristics for each technique.

|Variable              | Purpose                                          |
|--------------------- | ------------------------------------------------ |
|**Sample_ID**         | Unique identifier for each QC injection          |
|**Instrument_ID**     | Differentiates HPLC vs GC systems                |
|**RetentionTime_min** | Indicates chromatographic stability              |
|**Peak_Area**         | Primary quantitative detector response           |
|**PeakWidth_min**     | Reflects column efficiency and system dispersion |
|**Concentration_mgL** | Calculated analyte concentration                 |
|**TrueValue_mgL**     | Reference value for accuracy assessment          |
|**Run_Date**          | Enables time-based trend analysis                |

Each variable supports downstream QC evaluations such as precision (%RSD), accuracy (%recovery), and system suitability trending.
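As a minimal sketch of the two headline metrics, assuming the usual definitions (%RSD = 100 × sample standard deviation / mean, %recovery = 100 × measured / reference), they could be computed as follows. The function names and the replicate values are illustrative assumptions, not part of the project files.

```python
from statistics import mean, stdev

def percent_rsd(values):
    """Relative standard deviation (%), using the sample standard deviation."""
    return 100.0 * stdev(values) / mean(values)

def percent_recovery(measured, true_value):
    """Recovery (%) of a measured concentration against its reference value."""
    return 100.0 * measured / true_value

# Hypothetical replicate concentrations (mg/L), for illustration only
replicates = [1.00, 1.02, 0.98]
rsd = percent_rsd(replicates)            # precision across replicates
recovery = percent_recovery(1.02, 1.00)  # accuracy of a single injection
```

In the dataset, `Concentration_mgL` would play the role of `measured` and `TrueValue_mgL` the role of `true_value`.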

### Data Simulation Strategy (Excel-Based)
#### Why Excel Was Used:
Excel remains widely used in regulated labs, so simulating the data there mirrors the non-automated workflows
often encountered in legacy laboratory environments. Typical uses include:

- Initial data generation
- Rapid prototyping
- Manual QC review

#### Simulation Scope:
- 450 total rows representing mixed HPLC and GC runs
- Randomized assignment of instrument type
- Controlled variability applied per instrument class
- Time-distributed run dates to support trend analysis

### Excel Simulation Logic (Explained)
Each column was generated with logic that reflects expected chromatographic behavior rather than arbitrary randomness. The 450 rows of mock raw data for the HPLC and GC instruments were simulated with column-specific random variation, using the following formulas:

**Sample Identifier (Sample_ID):** 

Created by filling the initial value down through all 450 rows.
Ensures traceability of individual injections.

**Instrument Assignment (Instrument_ID):**

=IF(RAND()<0.5,"HPLC_01","GC_02")

Introduces mixed-mode datasets typical in multi-instrument QC environments.

**Retention Time (RetentionTime_min):**

=IF(B5="HPLC_01", ROUND(5+RAND()*0.5,2), ROUND(3.5+RAND()*0.4,2))

- HPLC and GC exhibit different retention regimes
- Small random ranges simulate normal system variation without instability

**Peak Area (Peak_Area):**

=IF(B5="HPLC_01", ROUND(10000+RAND()*800,0), ROUND(9500+RAND()*700,0))

- Shows natural injection-to-injection variability
- Shows instrument-dependent response levels
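For readers who prefer a scripted equivalent, the instrument, retention time, and peak area formulas above can be sketched in Python. This is an assumed translation, not the method used to build the workbook: the `RANGES` table and `simulate_injection` are hypothetical names, and the fixed seed is only there to make the sketch repeatable.

```python
import random

# (base, span) pairs copied from the Excel formulas: base + RAND() * span
RANGES = {
    "HPLC_01": {"rt": (5.0, 0.5), "area": (10000, 800)},
    "GC_02": {"rt": (3.5, 0.4), "area": (9500, 700)},
}

def simulate_injection(rng):
    """Return (instrument, retention time in min, peak area) for one injection."""
    instrument = "HPLC_01" if rng.random() < 0.5 else "GC_02"
    rt_base, rt_span = RANGES[instrument]["rt"]
    area_base, area_span = RANGES[instrument]["area"]
    # Note: Python's round() uses banker's rounding, whereas Excel's ROUND
    # rounds halves away from zero, so exact ties may differ slightly.
    retention = round(rt_base + rng.random() * rt_span, 2)
    peak_area = round(area_base + rng.random() * area_span)
    return instrument, retention, peak_area

rng = random.Random(42)  # seed chosen arbitrarily for reproducibility
rows = [simulate_injection(rng) for _ in range(450)]
```

With these ranges, HPLC retention times fall between 5.0 and 5.5 min and GC retention times between 3.5 and 3.9 min, matching the spreadsheet logic.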

**Peak Width (PeakWidth_min):**

=MAX(0.03, 0.03 + 0.015 * [@RetentionTime_min] + NORM.S.INV(RAND()) * 0.01)

- Ensures physically realistic peak widths
- Includes stochastic dispersion to reflect column and system effects
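The peak width formula combines a linear dependence on retention time, Gaussian noise (`NORM.S.INV(RAND())` yields a standard normal draw), and a floor of 0.03 min. A rough Python sketch of the same logic follows; the function name, parameter names, and seed are assumptions made for illustration.

```python
import random

def peak_width(retention_min, rng, floor=0.03, noise_sd=0.01):
    """Peak width (min): linear in retention time plus Gaussian noise, floored."""
    noise = rng.gauss(0.0, 1.0) * noise_sd  # standard normal draw scaled to 0.01 min
    return max(floor, 0.03 + 0.015 * retention_min + noise)

rng = random.Random(0)  # arbitrary seed for repeatability
widths = [peak_width(5.2, rng) for _ in range(1000)]
average_width = sum(widths) / len(widths)
```

At a retention time of 5.2 min the noise-free width is 0.03 + 0.015 × 5.2 = 0.108 min, so the simulated widths should scatter around that value.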

**Concentration (Concentration_mgL):**

=ROUND(D5/10000,2)

Simulates standard calibration logic where concentration is proportional to detector response.

**True Value (TrueValue_mgL):**

=[@Concentration_mgL] + (RAND() - 0.5) * 0.1

Offsets the reference value from the calculated concentration by a uniform random amount of up to ±0.05 mg/L, enabling downstream accuracy and recovery calculations.
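To see what this offset implies for recovery, the sketch below applies the same `(RAND() - 0.5) * 0.1` term to an assumed concentration of 1.00 mg/L and computes %recovery as 100 × measured / reference. The function name, the chosen concentration, and the seed are illustrative assumptions.

```python
import random

def true_value(concentration, rng):
    """Reference value: concentration plus a uniform offset in [-0.05, 0.05) mg/L."""
    return concentration + (rng.random() - 0.5) * 0.1

rng = random.Random(1)  # arbitrary seed for repeatability
conc = 1.00             # assumed measured concentration (mg/L)
recoveries = [100.0 * conc / true_value(conc, rng) for _ in range(1000)]
lowest, highest = min(recoveries), max(recoveries)
```

For a 1.00 mg/L result the recovery is therefore bounded by 100/1.05 ≈ 95.2% and 100/0.95 ≈ 105.3%; lower concentrations give proportionally wider bounds because the offset is absolute, not relative.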

**Run Date (Run_Date):**

=DATE(2025,1,1)+RANDBETWEEN(0,90)

- Creates temporal spacing
- Enables drift detection, control charts, and time-series QC metrics
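The run date logic can be mirrored in Python as a start date plus a random offset of 0 to 90 days, inclusive at both ends like `RANDBETWEEN`. This is a sketch under assumed names (`random_run_date`, `max_offset_days`); sorting the dates is added here only because time-based charts usually need chronological order.

```python
import random
from datetime import date, timedelta

def random_run_date(rng, start=date(2025, 1, 1), max_offset_days=90):
    """Return start plus a random whole number of days in [0, max_offset_days]."""
    return start + timedelta(days=rng.randint(0, max_offset_days))

rng = random.Random(7)  # arbitrary seed for repeatability
run_dates = sorted(random_run_date(rng) for _ in range(450))
first_run, last_run = run_dates[0], run_dates[-1]
```

All generated dates fall between 2025-01-01 and 2025-04-01.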

### Raw Dataset Structure

The resulting file was saved in the repository folder **Chrom-Data-Analytics/excel_files** as:
**hplc_gc_qc_data_raw.xlsx**

#### Column Layout:

|Sample_ID |Instrument_ID |RetentionTime_min |Peak_Area |PeakWidth_min |Concentration_mgL |Run_Date   |
|----------|--------------|------------------|----------|--------------|------------------|-----------|
|S001      |HPLC_01       |5.2               |10450     |0.106901      |1.00              |2025-01-02 |
|S002      |HPLC_01       |5.3               |10620     |0.112822      |1.02              |2025-01-03 |
|S003      |GC_02         |3.8               |9850      |0.083105      |0.98              |2025-01-03 |


### Initial Data Profiling and Cleaning (Excel)
#### Why Profiling Is Critical

Before statistical QC analysis:

- Missing values must be eliminated
- Outliers must be understood (not blindly removed)
- Date formats must be consistent for time-series analysis

#### Checks Performed

**1. Missing Values:** Missing analytical values invalidate QC statistics and control limits.

Method:
Select the header row → Data → Filter, then open the dropdown menu for each column and check for blanks.
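The same blank check can be scripted once the sheet is exported. The sketch below assumes a CSV export (the workbook itself is .xlsx) and uses a small in-memory excerpt with one deliberately blank cell; `find_blanks` is a hypothetical helper name.

```python
import csv
import io

# Hypothetical CSV excerpt of the raw sheet; the blank Peak_Area is deliberate
sample_csv = """Sample_ID,Instrument_ID,Peak_Area
S001,HPLC_01,10450
S002,HPLC_01,
S003,GC_02,9850
"""

def find_blanks(csv_text):
    """Return (Sample_ID, column) pairs for every empty cell."""
    blanks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if value is None or value.strip() == "":
                blanks.append((row["Sample_ID"], column))
    return blanks

blanks = find_blanks(sample_csv)
```

An empty result would indicate that no cells in the checked columns are blank.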

**2. Outlier Screening:** Highlights extreme values that may indicate:

- Integration errors
- Simulated system failures
- Data entry issues

Method:
Select the column → Home → Conditional Formatting → Top/Bottom Rules → More Rules → Format only top or bottom ranked values → choose a format under Format with → OK
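A scripted counterpart to the top/bottom ranking rule might look like the sketch below. It only flags the extreme values for review and removes nothing; the helper name and the peak areas are hypothetical, with 14200 standing in for an integration error.

```python
def flag_extremes(values, n=2):
    """Return (indices of the n smallest, indices of the n largest) values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[:n], order[-n:]

# Hypothetical peak areas for illustration only
peak_areas = [10450, 10620, 10100, 14200, 10380, 10710]
lowest, highest = flag_extremes(peak_areas, n=1)
```

Like the conditional formatting rule, this ranks values without judging them: the top-ranked value is always flagged, so each flagged row still needs a manual review to decide whether it is a genuine outlier.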

**3. Date Consistency:** Ensures correct parsing during downstream Python and SQL analysis.

Method:
Select the Run_Date column → Home → Number Format → Date
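Date consistency can also be verified in code before loading the data elsewhere. The sketch below assumes the dates have been exported as text in ISO `YYYY-MM-DD` form, as in the column layout example; the helper name and the mixed-format sample values are illustrative.

```python
from datetime import datetime

def non_matching_dates(values, fmt="%Y-%m-%d"):
    """Return the entries that do not parse with the expected date format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad

# Hypothetical Run_Date values; the second entry uses a different format
run_dates = ["2025-01-02", "01/03/2025", "2025-01-03"]
bad = non_matching_dates(run_dates)
```

Any entries returned here would need reformatting in Excel before time-series analysis.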

### Cleaned Dataset Output

After validation, the dataset was saved in the repository folder **Chrom-Data-Analytics/excel_files** as:

**hplc_gc_qc_data_cleaned.xlsx**

This cleaned file serves as the single source of truth for all subsequent:

- QC metric computation
- Control charting
- Trend and stability analysis

### Repository Folder Structure