# Process data from PDC assays
## Notes
DO 1-23-2026. I copied this from an older document, "PDC assay data processing.ipynb". The .KD files are processed in a separate notebook, which avoids having to frequently re-install the uv_pro library here.

## Analysis plan
* Load the "Enzyme_assay_metadata" spreadsheet and identify the assays we want to process
* Find all of the .csv files with PDC enzyme assay data
* For each csv file:
  * Add filename information
  * Measure initial pyruvate
    * Determine the expected initial pyruvate concentration (Pyruvate_mM) and Blank_time_s from the Enzyme_assay_metadata dataframe
    * Calculate the pyruvate concentration using the _calculate_blank_pyruvate() function imported from the "Compiling_spectrum_data.ipynb" notebook in the "Spectrum files from Agilent spec" folder
    * If the difference from the expected pyruvate concentration is >50%, throw a warning and use the expected pyruvate concentration instead (note: it might make sense to move this check into the _calculate_blank_pyruvate() function)
  * Measure NADH concentration
    * use the process_pdc_timecourse() function

* Combine the data into a single pandas dataframe for plotting
* Plot NADH concentration vs. offset time (i.e. where the assay start time has been shifted to zero) for all samples. This will allow us to do a rough examination of the data

* Data processing for subsequent analysis:
  * For each assay, measure the maximum slope (V), after the assay start.
  * Normalize V to the enzyme concentration (V/E)

* Determine the effect of Adh enzyme concentration
  * Select only the "Varying Adh" assay group
  * Plot V/E vs. the Adh concentration

* Create a kcat plot
  * Plot V/E vs. the substrate concentration
  * Adjust the units so that we can measure kcat directly from the plot
  * Color by filename

* Measure NADH degradation (see if we have good enough data for this)

* Convert to an EnzymeML file
* Upload the EnzymeML file, Colab notebook, and raw data to Janis Shin's GitHub folder for subsequent modeling.
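
The "maximum slope (V)" and "V/E" steps in the plan can be sketched as follows. This is only an illustration under stated assumptions: `max_slope` is a hypothetical helper (not a function from this notebook), the window size of 3 points is arbitrary, and the time course is made-up toy data. The Pdc concentration is one of the `Pdc_ug_ml` values shown in the metadata table.

```python
import numpy as np

def max_slope(time_s, conc_mM, window=3):
    """Return the steepest least-squares slope (mM/s) over any run of
    `window` consecutive points at or after time zero.
    Hypothetical helper, for illustration only."""
    t = np.asarray(time_s, dtype=float)
    c = np.asarray(conc_mM, dtype=float)
    keep = t >= 0                      # only consider points after the assay start
    t, c = t[keep], c[keep]
    best = 0.0
    for i in range(len(t) - window + 1):
        # Degree-1 fit over the current window; element 0 is the slope
        slope = np.polyfit(t[i:i + window], c[i:i + window], 1)[0]
        if abs(slope) > abs(best):
            best = slope
    return best

# Toy time course (made-up numbers): NADH is flat, then consumed, then levels off
offset_time_s = [-20, -10, 0, 10, 20, 30, 40, 50, 60, 70]
nadh_mM = [0.30, 0.30, 0.30, 0.28, 0.26, 0.24, 0.22, 0.21, 0.205, 0.205]

v = abs(max_slope(offset_time_s, nadh_mM))   # rate magnitude in mM/s
pdc_ug_ml = 0.85659                          # example Pdc_ug_ml value from the metadata
v_over_e = v / pdc_ug_ml                     # mM/s per (ug/mL) of enzyme
print(f"V = {v:.4f} mM/s, V/E = {v_over_e:.5f}")
```

Converting V/E into a kcat additionally requires the molar mass of the enzyme, which is not given in this notebook.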




In [1]:
## Start by importing python libraries for data import and analysis
import plotly.express as px # for plotting the output
import pandas as pd
import numpy as np

In [2]:
import os
from google.colab import drive
drive.mount('/content/drive')
os.getcwd() # Check starting directory

#PROJECT_ROOT = "/content/drive/MyDrive/PDC+ADH+FDH assay data Evelyn 2025"  # fill in the full folder name exactly as it appears in your Drive
PROJECT_ROOT = "/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025" # Dan's google drive
%cd "$PROJECT_ROOT"

os.getcwd() # Confirm that we have changed to the correct directory

Mounted at /content/drive
/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025


'/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025'

In [3]:
# Load data from the Enzyme_assay_metadata google doc
public_csv_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRVpwYqImFkaUigsWgrO9MRtWjYWwps82EExnomLqNr_hOUNViKF_fFyAhJfIqe3hDq0IEG76W4v_fO/pub?output=csv"
meta_df = pd.read_csv(public_csv_url)
#display(meta_df.head())

# Filter to just rows with the PDC_fwd assay and where 'Ignore' column is blank (NaN or empty string)
filtered_meta_df = meta_df[(meta_df['Assay'] == 'PDC_fwd') & (meta_df['Ignore'].isna() | (meta_df['Ignore'] == ''))]

# Define the subfolder name for CSV files (generated from KD files by another script).
# This assumes we've already moved to the PDC+ADH+FDH assay data Evelyn 2025 folder
base_path = os.path.join(os.getcwd(), "KD files from Agilent spec")

filtered_meta_df

Unnamed: 0,Experiment_ID,Ignore,Filename,Assay,Assay Group,Cuvette,Start_time_s,Mask_until_s,Blank_time_s,Blank_340,...,Tris-HCl_mM,TPP_mM,MgCl2_mM,Pyruvate_mM,Acetaldehyde_mM,Ethanol_mM,NADH_mM,NAD_mM,Adh_ug_ml,Pdc_ug_ml
15,Assay 11,,1222 PDC-9.KD,PDC_fwd,Varying Adh,CELL_1,397.4,461.4,115.8,0.0040,...,100.0,0.4,5.0,20.0,,,0.30,,3.2680,0.85659
16,Assay 11,,1222 PDC-9.KD,PDC_fwd,Varying Adh,CELL_2,397.4,416.6,115.8,0.0060,...,100.0,0.4,5.0,20.0,,,0.30,,3.2680,0.85659
17,Assay 11,,1222 PDC-9.KD,PDC_fwd,Varying Adh,CELL_3,403.8,416.6,109.4,0.0170,...,100.0,0.4,5.0,20.0,,,0.30,,3.2680,0.85659
18,Assay 11,,1222 PDC-11.KD,PDC_fwd,Varying Adh,CELL_1,519.1,544.9,186.2,0.0025,...,100.0,0.4,5.0,20.0,,,0.30,,1.6341,0.85659
19,Assay 11,,1222 PDC-11.KD,PDC_fwd,Varying Adh,CELL_2,519.1,544.9,199.2,0.0010,...,100.0,0.4,5.0,20.0,,,0.30,,1.6341,0.85659
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,,,0121 200 100 40MM PYR-2.KD,PDC_fwd,Varying pyr high NADH,CELL_3,320.7,352.7,122.3,0.0000,...,100.0,0.4,5.0,40.0,,,1.75,,15.2506,3.99740
112,,,0121 20 16 8 4MM PYR-3.KD,PDC_fwd,Varying pyr high NADH,CELL_1,340.9,390.8,100.3,0.0090,...,100.0,0.4,5.0,20.0,,,1.75,,15.2506,3.99740
113,,,0121 20 16 8 4MM PYR-3.KD,PDC_fwd,Varying pyr high NADH,CELL_2,349.2,382.5,109.3,0.0000,...,100.0,0.4,5.0,16.0,,,1.75,,15.2506,3.99740
114,,,0121 20 16 8 4MM PYR-3.KD,PDC_fwd,Varying pyr high NADH,CELL_3,340.9,423.9,116.8,0.0050,...,100.0,0.4,5.0,8.0,,,1.75,,15.2506,3.99740


* Add filename information
* Measure initial pyruvate
  * Determine the expected initial pyruvate concentration (Pyruvate_mM) and Blank_time_s from the Enzyme_assay_metadata dataframe
  * Calculate the pyruvate concentration using the _calculate_blank_pyruvate() function imported from the "Compiling_spectrum_data.ipynb" notebook in the "Spectrum files from Agilent spec" folder
  * If the difference from the expected pyruvate concentration is >50%, throw a warning and use the expected pyruvate concentration instead (note: it might make sense to move this check into the _calculate_blank_pyruvate() function)
* Measure NADH concentration
  * Use the process_pdc_timecourse() function
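
The >50% rule for the initial pyruvate concentration could look roughly like the sketch below. `choose_initial_pyruvate` and its `max_rel_diff` parameter are hypothetical names introduced for illustration; the real check might instead live inside `_calculate_blank_pyruvate()`, as noted above.

```python
import math
import warnings

def choose_initial_pyruvate(measured_mM, expected_mM, max_rel_diff=0.5):
    """Return the measured pyruvate concentration unless it is missing or
    differs from the expected value by more than max_rel_diff (a fraction);
    in that case warn and fall back to the expected value.
    Hypothetical helper, for illustration only."""
    measured_ok = measured_mM is not None and not math.isnan(measured_mM)
    if measured_ok and expected_mM > 0:
        rel_diff = abs(measured_mM - expected_mM) / expected_mM
        if rel_diff <= max_rel_diff:
            return measured_mM
        warnings.warn(
            f"Measured pyruvate {measured_mM:.2f} mM differs from expected "
            f"{expected_mM:.2f} mM by {rel_diff:.0%}; using expected value."
        )
    else:
        warnings.warn("No usable pyruvate measurement; using expected value.")
    return expected_mM

# Toy values (made up): one measurement close to the expected 20 mM, one far off
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    close = choose_initial_pyruvate(19.0, 20.0)   # within 50%, keep measurement
    far = choose_initial_pyruvate(5.0, 20.0)      # 75% off, fall back to expected
print(close, far, len(caught))
```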

In [5]:
# Process one row in the metadata file
row = 0
meta_row = filtered_meta_df.iloc[row]
filename_to_find = meta_row['Filename'].replace('.KD', '.csv')
cuvette = meta_row['Cuvette']

# Read the csv file with raw data (note, there may be several cuvettes)
print(f"Processing filename: {filename_to_find}, cuvette: {cuvette}")
file_path = os.path.join(base_path, filename_to_find)
df = pd.read_csv(file_path)
df = df.loc[df['sample'] == cuvette, :]

df

Processing filename: 1222 PDC-9.csv, cuvette: CELL_1


Unnamed: 0,sample,Time_s,190,191,192,193,194,195,196,197,...,1091,1092,1093,1094,1095,1096,1097,1098,1099,filename
0,CELL_1,1.3,-0.229889,-0.236069,-0.198827,-0.227364,-0.243619,-0.237341,-0.228086,-0.253552,...,-0.017337,-0.022972,-0.023099,-0.023687,-0.025043,-0.024079,-0.029623,-0.029672,-0.029365,1222 PDC-9.KD
1,CELL_1,7.0,-0.207426,-0.239761,-0.216961,-0.246830,-0.245172,-0.234387,-0.225255,-0.250117,...,-0.019892,-0.023104,-0.020780,-0.024611,-0.026352,-0.024914,-0.027442,-0.026089,-0.027331,1222 PDC-9.KD
2,CELL_1,13.4,-0.229995,-0.250758,-0.215909,-0.219599,-0.226750,-0.236955,-0.242416,-0.254307,...,-0.018507,-0.019392,-0.022905,-0.025990,-0.024989,-0.023331,-0.025867,-0.028555,-0.030050,1222 PDC-9.KD
3,CELL_1,20.2,-0.227447,-0.258912,-0.213358,-0.234167,-0.219644,-0.234999,-0.245830,-0.260394,...,-0.019046,-0.022886,-0.025647,-0.026373,-0.024261,-0.021654,-0.027519,-0.028392,-0.029357,1222 PDC-9.KD
4,CELL_1,26.2,-0.221700,-0.238885,-0.196493,-0.247545,-0.242846,-0.242864,-0.216627,-0.235869,...,-0.020377,-0.024396,-0.024342,-0.024142,-0.026915,-0.024112,-0.029789,-0.028462,-0.030776,1222 PDC-9.KD
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
150,CELL_1,960.6,-0.218845,-0.245361,-0.219336,-0.227642,-0.217304,-0.226779,-0.220643,-0.260612,...,-0.024676,-0.027367,-0.028564,-0.031089,-0.032236,-0.031374,-0.032363,-0.032451,-0.035073,1222 PDC-9.KD
151,CELL_1,967.0,-0.215957,-0.224817,-0.211372,-0.235142,-0.232874,-0.219087,-0.232705,-0.239658,...,-0.025971,-0.026784,-0.027905,-0.032001,-0.032183,-0.030543,-0.033770,-0.033567,-0.036741,1222 PDC-9.KD
152,CELL_1,973.4,-0.216125,-0.245321,-0.216345,-0.228510,-0.217950,-0.207681,-0.235423,-0.250070,...,-0.028459,-0.026213,-0.027192,-0.031685,-0.034378,-0.030645,-0.031229,-0.033049,-0.039123,1222 PDC-9.KD
153,CELL_1,979.9,-0.224421,-0.236732,-0.225552,-0.237053,-0.234365,-0.232685,-0.222670,-0.227389,...,-0.025811,-0.029363,-0.031357,-0.031988,-0.030756,-0.032851,-0.036735,-0.034828,-0.036248,1222 PDC-9.KD


In [6]:
!pip install import_ipynb

Collecting import_ipynb
  Downloading import_ipynb-0.2-py3-none-any.whl.metadata (2.3 kB)
Collecting jedi>=0.16 (from IPython->import_ipynb)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading import_ipynb-0.2-py3-none-any.whl (4.0 kB)
Downloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, import_ipynb
Successfully installed import_ipynb-0.2 jedi-0.19.2


In [7]:
import json
import os
import types

# Define the path to the directory containing the other notebook
spectrum_folder = os.path.join(PROJECT_ROOT, "Spectrum files from Agilent spec")
notebook_path = os.path.join(spectrum_folder, "Compiling_spectrum_data.ipynb")

# Create a mock module object to store the imported function
csd = types.ModuleType('Compiling_spectrum_data')

print(f"Reading notebook from: {notebook_path}")

try:
    with open(notebook_path, 'r', encoding='utf-8') as f:
        nb = json.load(f)

    func_name = "_calculate_blank_pyruvate"
    found_code = None

    # Iterate through cells to find the function definition
    for cell in nb['cells']:
        if cell['cell_type'] == 'code':
            source = "".join(cell['source'])
            # Simple check to find the function definition
            if f"def {func_name}" in source:
                found_code = source
                break

    if found_code:
        # Execute the function definition in the current global scope
        # This ensures it has access to global imports like pd and np
        exec(found_code, globals())

        # Bind the function to the csd module object so it mimics the import
        if func_name in globals():
            setattr(csd, func_name, globals()[func_name])
            print(f"Successfully extracted '{func_name}' and assigned it to 'csd'.")
        else:
            print(f"Error: executed code but '{func_name}' was not found in globals.")
    else:
        print(f"Error: Function '{func_name}' not found in the notebook.")

except Exception as e:
    print(f"An error occurred while extracting the function: {e}")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Google Drive mounted successfully.
Current working directory: /content/drive/My Drive/Research/PDC+ADH+FDH assay data Evelyn 2025/Spectrum files from Agilent spec
New working directory: /content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/Spectrum files from Agilent spec
Found 9 .wav files (relative paths):
0_05MM NADH SPECTRUM.WAV
0_15MM NADH SPECTRUM.WAV
0_25MM NADH SPECTRUM.WAV
1MM NADH SPECTRUM.WAV
1MM PYR SPECTRUM.WAV
10MM PYR SPECTRUM.WAV
100MM PYR SPECTRUM.WAV
50MM PYR SPECTRUM.WAV
200mM TRIS SPECTRUM.WAV


KeyError: 'In'

Unnamed: 0,Wavelength,Absorbance,Compound,File_Name,Expected_mM
0,190.0,2.548025,NADH,0_05MM NADH SPECTRUM.WAV,0.05
1,191.0,2.592308,NADH,0_05MM NADH SPECTRUM.WAV,0.05
2,192.0,2.776916,NADH,0_05MM NADH SPECTRUM.WAV,0.05
3,193.0,2.935247,NADH,0_05MM NADH SPECTRUM.WAV,0.05
4,194.0,2.87652,NADH,0_05MM NADH SPECTRUM.WAV,0.05


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8199 entries, 0 to 8198
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Wavelength   8199 non-null   float64
 1   Absorbance   8199 non-null   float64
 2   Compound     8199 non-null   object 
 3   File_Name    8199 non-null   object 
 4   Expected_mM  8199 non-null   float64
dtypes: float64(3), object(2)
memory usage: 320.4+ KB


None

Filtered NADH Plotting Data (Absorbance <= 2.5):


Unnamed: 0,Wavelength,Absorbance,Compound,File_Name,Expected_mM
150,340.0,0.269806,NADH,0_05MM NADH SPECTRUM.WAV,0.05
1061,340.0,0.833405,NADH,0_15MM NADH SPECTRUM.WAV,0.15
1972,340.0,1.387911,NADH,0_25MM NADH SPECTRUM.WAV,0.25


Theoretical Slope (Extinction Coefficient): 6.2200
Experimental Slope from previous plot: 5.5483
Scaling Factor for Concentration Adjustment: 0.8920

Adjusted NADH Plotting Data (Absorbance <= 2.5, with new Concentration_mM for plot):


Unnamed: 0,Wavelength,Absorbance,Compound,File_Name,Expected_mM,Concentration_mM
150,340.0,0.269806,NADH,0_05MM NADH SPECTRUM.WAV,0.05,0.044601
1061,340.0,0.833405,NADH,0_15MM NADH SPECTRUM.WAV,0.15,0.133802
1972,340.0,1.387911,NADH,0_25MM NADH SPECTRUM.WAV,0.25,0.223004


Wide NADH DataFrame Head:


KeyError: 'In'

Unnamed: 0,Wavelength,NADH_0.0446mM_Absorbance,NADH_0.1338mM_Absorbance,NADH_0.2230mM_Absorbance,NADH_0.8920mM_Absorbance
0,190.0,2.548025,2.903203,3.171649,4.0
1,191.0,2.592308,3.095706,3.09525,3.530418
2,192.0,2.776916,3.389744,3.623547,4.0
3,193.0,2.935247,3.487925,3.778119,3.973017
4,194.0,2.87652,3.449419,3.555676,3.722551



Wide NADH DataFrame saved to Adjusted_NADH_Concentration_Absorbance.csv
Filtered Pyruvate Plotting Data:


Unnamed: 0,Wavelength,Absorbance,Compound,File_Name,Expected_mM,Concentration_mM
3774,320.0,0.010163,PYR,1MM PYR SPECTRUM.WAV,1.0,
4685,320.0,0.194522,PYR,10MM PYR SPECTRUM.WAV,10.0,
5596,320.0,2.029487,PYR,100MM PYR SPECTRUM.WAV,100.0,
6507,320.0,0.966928,PYR,50MM PYR SPECTRUM.WAV,50.0,


Wide Pyruvate DataFrame Head:


KeyError: 'In'

Unnamed: 0,Wavelength,PYR_1.0000mM_Absorbance,PYR_10.0000mM_Absorbance,PYR_50.0000mM_Absorbance,PYR_100.0000mM_Absorbance
0,190.0,1.978241,2.898288,3.304986,3.593821
1,191.0,1.979834,2.919439,3.638203,3.240218
2,192.0,1.973983,3.276395,3.784174,3.9266
3,193.0,1.962775,3.308305,3.88068,3.991495
4,194.0,1.933148,3.31809,3.71204,3.964487



Wide Pyruvate DataFrame saved to Pyruvate_Concentration_Absorbance.csv
NADH Filtered DataFrame Head:


KeyError: 'In'

Unnamed: 0,Wavelength,Absorbance,Compound,File_Name,Expected_mM,Concentration_mM
22,212.0,2.465739,NADH,0_05MM NADH SPECTRUM.WAV,0.05,0.044601
23,213.0,2.380679,NADH,0_05MM NADH SPECTRUM.WAV,0.05,0.044601
24,214.0,2.257281,NADH,0_05MM NADH SPECTRUM.WAV,0.05,0.044601
25,215.0,2.064577,NADH,0_05MM NADH SPECTRUM.WAV,0.05,0.044601
26,216.0,1.829588,NADH,0_05MM NADH SPECTRUM.WAV,0.05,0.044601



Shape of NADH Filtered DataFrame: (3390, 6)
NADH Linearity Analysis Results (Head):


KeyError: 'In'

Unnamed: 0,Wavelength,Slope,R_squared,Num_Points
0,219.0,16.907245,0.985095,2
1,220.0,14.991107,0.989265,2
2,221.0,13.151934,0.991832,2
3,222.0,11.568007,0.993651,2
4,223.0,9.951321,0.997973,3


Slope at 340 nm: 5.5483
Scaling Factor (Target 6.22 / Slope 340nm): 1.1211

First 5 rows with Extinction Coefficient:


KeyError: 'In'

Unnamed: 0,Wavelength,Slope,R_squared,Num_Points,Extinction_Coefficient
0,219.0,16.907245,0.985095,2,18.954002
1,220.0,14.991107,0.989265,2,16.8059
2,221.0,13.151934,0.991832,2,14.74408
3,222.0,11.568007,0.993651,2,12.968407
4,223.0,9.951321,0.997973,3,11.156008


Pyruvate Filtered DataFrame Head:


KeyError: 'In'

Unnamed: 0,Wavelength,Absorbance,Compound,File_Name,Expected_mM,Concentration_mM
3644,190.0,1.978241,PYR,1MM PYR SPECTRUM.WAV,1.0,1.0
3645,191.0,1.979834,PYR,1MM PYR SPECTRUM.WAV,1.0,1.0
3646,192.0,1.973983,PYR,1MM PYR SPECTRUM.WAV,1.0,1.0
3647,193.0,1.962775,PYR,1MM PYR SPECTRUM.WAV,1.0,1.0
3648,194.0,1.933148,PYR,1MM PYR SPECTRUM.WAV,1.0,1.0



Shape of Pyruvate Filtered DataFrame: (3415, 6)
Pyruvate Linearity Analysis Results (Head):


KeyError: 'In'

Unnamed: 0,Wavelength,Slope,R_squared,Num_Points
0,243.0,0.188171,0.999991,2
1,244.0,0.17299,0.999997,2
2,245.0,0.158551,0.999999,2
3,246.0,0.145141,1.0,2
4,247.0,0.132951,0.999999,2


Standards DataFrame Head:


KeyError: 'In'

Unnamed: 0,Wavelength,NADH_Coeff,PYR_Coeff
0,243.0,8.37143,0.188171
1,244.0,8.73697,0.17299
2,245.0,9.114069,0.158551
3,246.0,9.543467,0.145141
4,247.0,10.003543,0.132951



Shape of Standards DataFrame: (858, 3)
Standards DataFrame saved to NADH_Pyruvate_Standards.csv
Kinetic Data Head (first 10 rows):


KeyError: 'In'

Unnamed: 0,sample,Time_s,190,191,192,193,194,195,196,197,...,1091,1092,1093,1094,1095,1096,1097,1098,1099,filename
0,CELL_1,1.4,-0.04019,-0.025164,-0.02046,0.002255,0.00701,-0.042197,-0.01332,0.015717,...,-0.01517,-0.015754,-0.014958,-0.017492,-0.01609,-0.014441,-0.018435,-0.019453,-0.019712,1229 PDC PYRUVATE 100MM-8.KD
1,CELL_1,7.0,-0.035162,-0.015901,-0.045796,-0.000527,0.01702,-0.041877,-0.011095,-0.005792,...,-0.015467,-0.017028,-0.018837,-0.018133,-0.019844,-0.014425,-0.020117,-0.017702,-0.018785,1229 PDC PYRUVATE 100MM-8.KD
2,CELL_1,13.5,-0.025568,-0.025858,-0.040857,-0.01777,-0.007006,-0.029105,0.00096,-0.017584,...,-0.01592,-0.016428,-0.015409,-0.017478,-0.018126,-0.013627,-0.019478,-0.017424,-0.018097,1229 PDC PYRUVATE 100MM-8.KD
3,CELL_1,19.9,-0.035588,-0.045554,-0.058393,0.002479,-0.019413,-0.049197,-0.021289,-0.010043,...,-0.014615,-0.017081,-0.017593,-0.01887,-0.019313,-0.012831,-0.020371,-0.019111,-0.019593,1229 PDC PYRUVATE 100MM-8.KD
4,CELL_1,26.2,-0.030612,-0.030499,-0.030189,0.002214,0.002298,-0.031126,-0.013102,-0.01021,...,-0.015845,-0.016278,-0.019495,-0.016925,-0.017804,-0.012013,-0.020674,-0.018943,-0.016878,1229 PDC PYRUVATE 100MM-8.KD
5,CELL_1,32.7,-0.03528,-0.025367,-0.020052,-1.9e-05,-0.022268,-0.035635,-0.027234,-0.001531,...,-0.015112,-0.020629,-0.017226,-0.017594,-0.016663,-0.012801,-0.019484,-0.017867,-0.018404,1229 PDC PYRUVATE 100MM-8.KD
6,CELL_1,39.0,-0.030412,-0.02574,-0.029966,-0.000128,-0.025808,-0.014565,-0.014451,0.003623,...,-0.013878,-0.017083,-0.016978,-0.016145,-0.017808,-0.012908,-0.019199,-0.016729,-0.015973,1229 PDC PYRUVATE 100MM-8.KD
7,CELL_1,45.5,-0.005201,-0.031645,-0.049096,0.014328,0.011323,-0.037076,-0.008773,0.01484,...,-0.013252,-0.017771,-0.015123,-0.017083,-0.017722,-0.012039,-0.018786,-0.01633,-0.018835,1229 PDC PYRUVATE 100MM-8.KD
8,CELL_1,52.1,-0.040246,-0.030464,-0.040719,-0.005971,0.017302,-0.041159,-0.000295,0.004899,...,-0.015264,-0.014437,-0.015357,-0.014946,-0.015792,-0.012037,-0.01826,-0.016474,-0.018178,1229 PDC PYRUVATE 100MM-8.KD
9,CELL_1,58.2,-0.025513,-0.015899,-0.035326,0.007373,-0.00285,-0.02562,-0.014807,-0.013025,...,-0.014754,-0.014239,-0.018074,-0.016269,-0.016409,-0.012829,-0.017312,-0.017653,-0.018355,1229 PDC PYRUVATE 100MM-8.KD



Shape of Kinetic DataFrame: (333, 913)


NADH Concentration (Single-Wavelength) calculated.


KeyError: 'In'

Unnamed: 0,Time_s,340,NADH_Conc_SingleWav
0,1.4,-0.003061,-0.270589
1,7.0,-0.002829,-0.270551
2,13.5,-0.002923,-0.270566
3,19.9,-0.002592,-0.270513
4,26.2,-0.002926,-0.270567


Target Time: 602.2 s
Actual Time Found: 602.2 s

Spectrum at ~602.2s (190-1100 nm):


KeyError: 'In'

Unnamed: 0,Wavelength,Absorbance
0,190.0,0.000395
1,191.0,0.027115
2,192.0,0.016452
3,193.0,0.032912
4,194.0,0.025004


Shape: (910, 2)
Target Time: 180 s
Actual Time Found: 179.9 s


Target Time: 333 s
Actual Time Found: 333.4 s


IndentationError: unexpected indent (<string>, line 17)
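
The extraction cell above executes the entire source of whichever cell contains the function definition. One possible alternative, sketched here under assumptions, is to parse the cell with Python's `ast` module and keep only the text of the matching function, so that no other statements in that cell run. `extract_function_source` is a hypothetical helper, and the notebook dictionary below is a toy stand-in for the JSON loaded from "Compiling_spectrum_data.ipynb".

```python
import ast

def extract_function_source(nb, func_name):
    """Return the source text of the first top-level function named
    func_name found in a notebook dict, or None if it is not found.
    Hypothetical helper, for illustration only."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell["source"]
        if isinstance(source, list):
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python
        lines = [ln for ln in source.splitlines()
                 if not ln.lstrip().startswith(("%", "!"))]
        cleaned = "\n".join(lines)
        try:
            tree = ast.parse(cleaned)
        except SyntaxError:
            continue  # skip cells that still do not parse
        for node in tree.body:
            if isinstance(node, ast.FunctionDef) and node.name == func_name:
                return ast.get_source_segment(cleaned, node)
    return None

# Toy notebook structure (made up) with a side effect next to the function
demo_nb = {"cells": [
    {"cell_type": "markdown", "source": ["# notes"]},
    {"cell_type": "code", "source": [
        "%cd /tmp\n",
        "print('side effect')\n",
        "def _double(x):\n",
        "    return 2 * x\n",
    ]},
]}

src = extract_function_source(demo_nb, "_double")
namespace = {}
exec(src, namespace)          # defines only _double; the print() is never run
print(namespace["_double"](4))
```

A function extracted this way still needs any names it relies on, such as `pd` or `np`, to be present in the namespace passed to `exec`.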

In [7]:
df_list = []

# Loop through filenames and check if the file path is valid
unique_filenames = filtered_meta_df['Filename'].unique()
print("#### Processing CSV files: ####")
for filename in unique_filenames:
    # Replace .KD extension with .csv for finding the file
    filename_to_find = filename.replace('.KD', '.csv')
    file_path = os.path.join(base_path, filename_to_find)

    if os.path.exists(file_path):
        print(f"- {filename_to_find}: EXISTS ({file_path})")
        try:
            # Read the .CSV file
            df = pd.read_csv(file_path)

            # Ensure column names match expected for later processing (e.g., 'Time_s')
            if 'Time (s)' in df.columns:
                df.rename(columns={'Time (s)': 'Time_s'}, inplace=True)
            # If 'sample' column is not already present or needs specific formatting from meta_df
            # This assumes the CSV files have a 'sample' column similar to KD file output
            # For now, if the CSV already contains 'sample' as expected, no change needed.
            # If not, further mapping might be required.

            # Add the *original* filename (with .KD extension) for merging with meta_df
            df['filename'] = filename
            df_list.append(df)
        except Exception as e:
            print(f"Error reading {file_path}: {e}")
    else:
        print(f"- {filename_to_find}: DOES NOT EXIST ({file_path})")

# Concatenate all dataframes in df_list into a single dataframe
if df_list: # Only concatenate if df_list is not empty
    assay_data_df = pd.concat(df_list, ignore_index=True)
    print("\nCombined DataFrame created successfully.")
    print("Head of the combined DataFrame:")
    display(assay_data_df.head())
    print(f"Shape of the combined DataFrame: {assay_data_df.shape}")
else:
    print("\nNo CSV files were found or processed to create the combined DataFrame.")
    assay_data_df = pd.DataFrame() # Initialize an empty DataFrame

#### Processing CSV files: ####
- 1222 PDC-9.csv: EXISTS (/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/KD files from Agilent spec/1222 PDC-9.csv)
- 1222 PDC-11.csv: EXISTS (/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/KD files from Agilent spec/1222 PDC-11.csv)
- 1222 PDC-10.csv: EXISTS (/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/KD files from Agilent spec/1222 PDC-10.csv)
- 1223 PDC-PYRUVATE-2.csv: EXISTS (/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/KD files from Agilent spec/1223 PDC-PYRUVATE-2.csv)
- 1223 PDC-PYRUVATE-3.csv: EXISTS (/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/KD files from Agilent spec/1223 PDC-PYRUVATE-3.csv)
- 1223 PDC-PYRUVATE-4.csv: EXISTS (/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/KD files from Agilent spec/1223 PDC-PYRUVATE-4.csv)
- 1224 pdc pyruvate 8mM-1.csv: EXISTS (/content/drive/MyDrive/Research/PDC+ADH+FDH assay data 

Unnamed: 0,sample,Time_s,190,191,192,193,194,195,196,197,...,1091,1092,1093,1094,1095,1096,1097,1098,1099,filename
0,CELL_1,1.3,-0.229889,-0.236069,-0.198827,-0.227364,-0.243619,-0.237341,-0.228086,-0.253552,...,-0.017337,-0.022972,-0.023099,-0.023687,-0.025043,-0.024079,-0.029623,-0.029672,-0.029365,1222 PDC-9.KD
1,CELL_1,7.0,-0.207426,-0.239761,-0.216961,-0.24683,-0.245172,-0.234387,-0.225255,-0.250117,...,-0.019892,-0.023104,-0.02078,-0.024611,-0.026352,-0.024914,-0.027442,-0.026089,-0.027331,1222 PDC-9.KD
2,CELL_1,13.4,-0.229995,-0.250758,-0.215909,-0.219599,-0.22675,-0.236955,-0.242416,-0.254307,...,-0.018507,-0.019392,-0.022905,-0.02599,-0.024989,-0.023331,-0.025867,-0.028555,-0.03005,1222 PDC-9.KD
3,CELL_1,20.2,-0.227447,-0.258912,-0.213358,-0.234167,-0.219644,-0.234999,-0.24583,-0.260394,...,-0.019046,-0.022886,-0.025647,-0.026373,-0.024261,-0.021654,-0.027519,-0.028392,-0.029357,1222 PDC-9.KD
4,CELL_1,26.2,-0.2217,-0.238885,-0.196493,-0.247545,-0.242846,-0.242864,-0.216627,-0.235869,...,-0.020377,-0.024396,-0.024342,-0.024142,-0.026915,-0.024112,-0.029789,-0.028462,-0.030776,1222 PDC-9.KD


Shape of the combined DataFrame: (13532, 913)


The function below reads a .KD file and returns its contents as a pandas DataFrame for subsequent processing.

In [None]:
import os

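# Define KD File Reading Function
# NOTE: KDFile is provided by the uv_pro library, which this notebook does not import.
# Import KDFile (and pandas as pd) before calling this function.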
def read_kd_to_dataframe(file_path):
    """
    Reads a .KD file, converts its spectra data to a pandas DataFrame,
    adds a 'filename' column, and returns the DataFrame.

    Args:
        file_path (str): The full path to the .KD file.

    Returns:
        pd.DataFrame: A DataFrame containing the spectra data, with 'sample',
                      'Time_s', and 'filename' columns.
    """
    kd_file = KDFile(file_path)
    spectra_df = kd_file.spectra.T.reset_index()
    spectra_df.rename(columns={'Time (s)': 'Time_s'}, inplace=True)
    spectra_df.insert(0, 'sample', kd_file.samples_cell)

    # Remove 'SAMPLES_' prefix from the 'sample' column to better match what is written
    # in the Enzyme_assay_metadata spreadsheet
    spectra_df['sample'] = spectra_df['sample'].str.replace('SAMPLES_', '', regex=False)

    # Add the base filename as a new column
    base_filename = os.path.basename(file_path)
    spectra_df['filename'] = base_filename
    return spectra_df


# Test the modified function
# test_file_path = '/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/KD files from Agilent spec/251211 SERIES PDC FORWARD-1.KD'
# print(f"Testing read_kd_to_dataframe with: {test_file_path}")
# cleaned_df = read_kd_to_dataframe(test_file_path)
# print("Head of the DataFrame after cleaning 'sample' column:")
# display(cleaned_df.head())

## Find all of the .KD files with PDC assay data
To read files shared on your Google Drive, you need to mount the drive first (see the `drive.mount` cell near the top of this notebook). The PROJECT_ROOT variable must be changed to match each user's Google Drive structure; uncomment the line that applies to you.



We read the Enzyme_assay_metadata spreadsheet to find out which assay data to load and the conditions for each assay. This Google Sheets document is set up to publish its data in comma-separated values (CSV) format at a publicly available URL. After the sheet has been edited, the published CSV data may take a few minutes to update.

In [None]:
# create a unique identifier for each run
assay_data_df['sample_filename'] = assay_data_df['sample'].astype(str) + '_' + assay_data_df['filename']

# plot the data
fig = px.line(assay_data_df,
              x='Time_s',
              y=340,  # absorbance at 340 nm; use '340' if the column labels were read from CSV as strings
              color='sample_filename',
              title='Absorbance at 340 nm vs. Time by Sample and Filename',
              markers=True,
              labels={'Time_s': 'Time (seconds)', 340: 'Absorbance at 340 nm'})

fig.show()

# Task
Extract, merge, and process PDC assay data to calculate adjusted time and NADH concentration (mM), then visualize `NADH_mM` vs. `Adj_time_s` for each sample and filename, and prepare the data for kinetic analysis.

## Extract and Merge Metadata for Calculations

### Subtask:
Extract relevant columns from `assay_data_df`: 'sample', 'Time_s', '340', and 'filename'. Then, merge this data with `meta_df` to bring in 'Start_time_s' and 'Blank_340' values, which are essential for per-group calculations. Ensure the merge is performed on appropriate columns like 'sample' and 'filename'.


**Reasoning**:
The subtask requires extracting specific columns from `assay_data_df`, selecting relevant columns from `meta_df`, renaming columns in the metadata for alignment, and then merging these two dataframes. This code block will perform all these data manipulation steps and display the head of the resulting merged DataFrame.



In [None]:
import pandas as pd

# 1. Create a new DataFrame, say `processed_df`, by selecting the 'sample', 'Time_s', '340', and 'filename' columns from the `assay_data_df` DataFrame.
processed_df = assay_data_df[['sample', 'Time_s', 340, 'filename']].copy()

# 2. Create another DataFrame by selecting 'Filename', 'Cuvette', 'Start_time_s', and 'Blank_340' columns from the `meta_df` DataFrame.
meta_subset_df = meta_df[['Filename', 'Cuvette', 'Start_time_s', 'Blank_340']].copy()

# 3. Rename the 'Cuvette' column in this new DataFrame to 'sample' to align with the column name in `processed_df`.
# 4. Rename the 'Filename' column in this new DataFrame to 'filename' to align with the column name in `processed_df`.
meta_subset_df.rename(columns={'Cuvette': 'sample', 'Filename': 'filename'}, inplace=True)

# 5. Merge `processed_df` with the renamed metadata DataFrame using an inner merge on the 'sample' and 'filename' columns.
# Store the result back into `processed_df`.
processed_df = pd.merge(processed_df,
                        meta_subset_df,
                        on=['sample', 'filename'],
                        how='inner')

# 6. Display the head of the `processed_df` to verify the merge and the presence of the new columns.
print("Head of the merged processed_df:")
display(processed_df.head())
print(f"Shape of processed_df: {processed_df.shape}")

Head of the merged processed_df:


Unnamed: 0,sample,Time_s,340,filename,Start_time_s,Blank_340
0,CELL_1,1.2,1.877911,251211 SERIES PDC FORWARD-1.KD,83.8,0.39
1,CELL_1,7.0,1.878656,251211 SERIES PDC FORWARD-1.KD,83.8,0.39
2,CELL_1,13.4,1.880949,251211 SERIES PDC FORWARD-1.KD,83.8,0.39
3,CELL_1,19.8,1.877815,251211 SERIES PDC FORWARD-1.KD,83.8,0.39
4,CELL_1,26.2,1.87538,251211 SERIES PDC FORWARD-1.KD,83.8,0.39


Shape of processed_df: (33226, 6)
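
An inner merge silently drops any assay rows that have no matching metadata row. One way to make such mismatches visible is sketched below, using toy data rather than the real dataframes. The sketch uses the `indicator` and `validate` options of `pd.merge`; `validate="many_to_one"` assumes each (sample, filename) pair appears only once in the metadata and raises an error otherwise.

```python
import pandas as pd

# Toy stand-ins (made up) for processed_df and meta_subset_df
assay = pd.DataFrame({
    "sample": ["CELL_1", "CELL_1", "CELL_2"],
    "filename": ["a.KD", "a.KD", "a.KD"],
    "Time_s": [1.2, 7.0, 1.2],
})
meta = pd.DataFrame({
    "sample": ["CELL_1"],
    "filename": ["a.KD"],
    "Start_time_s": [83.8],
    "Blank_340": [0.39],
})

# A left merge with an indicator column keeps unmatched rows visible
checked = pd.merge(assay, meta,
                   on=["sample", "filename"],
                   how="left",
                   validate="many_to_one",
                   indicator=True)

# Report sample/filename pairs with no metadata, then keep only matched rows
unmatched = (checked.loc[checked["_merge"] == "left_only", ["sample", "filename"]]
             .drop_duplicates())
merged = checked.loc[checked["_merge"] == "both"].drop(columns="_merge")

print(f"{len(unmatched)} sample/filename pair(s) without metadata")
print(merged)
```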


## Calculate Adjusted Time and NADH Concentration

### Subtask:
Calculate `Adj_time_s` by subtracting `Start_time_s` from `Time_s`. Calculate `Adj_Abs_340` by subtracting `Blank_340` from the `340` column. Finally, convert `Adj_Abs_340` to `NADH_mM` using the Beer-Lambert Law (A = εlc), where:
- A is the absorbance (our `Adj_Abs_340`)
- ε is the molar extinction coefficient (for NADH at 340 nm, ε = 6220 M⁻¹cm⁻¹)
- l is the path length (assumed to be 1 cm for standard cuvettes)
- c is the concentration in Molar, which we will convert to mM.
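
As a worked check of the conversion, take a blank-corrected absorbance of 1.487911 (the first row of the processed output in this notebook) with l = 1 cm:

$$
c = \frac{A}{\varepsilon l} = \frac{1.487911}{6220\ \mathrm{M^{-1}\,cm^{-1}} \times 1\ \mathrm{cm}} \approx 2.392 \times 10^{-4}\ \mathrm{M} = 0.2392\ \mathrm{mM}
$$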

### Reasoning:
These calculations standardize the time measurements and convert raw absorbance data into a biologically meaningful NADH concentration. `Adj_time_s` shifts each assay so that it starts at t=0, while `NADH_mM` gives the concentration of NADH remaining in the cuvette (NADH is consumed in this coupled assay), corrected for background absorbance.

**Reasoning**:
This code block performs the calculations outlined in the previous markdown step. It calculates the adjusted time (`Adj_time_s`) and adjusted absorbance (`Adj_Abs_340`), and then converts the adjusted absorbance to NADH concentration in millimolar (`NADH_mM`) using the Beer-Lambert law with a molar extinction coefficient of 6220 M⁻¹cm⁻¹ and a path length of 1 cm. Finally, it displays the head of the updated DataFrame.



In [None]:
# Define constants for Beer-Lambert Law
MOLAR_EXTINCTION_COEFFICIENT = 6220 # M-1 cm-1
PATH_LENGTH = 1 # cm

# 1. Calculate Adj_time_s: Subtract 'Start_time_s' from 'Time_s'
processed_df['Adj_time_s'] = processed_df['Time_s'] - processed_df['Start_time_s']

# 2. Calculate Adj_Abs_340: Subtract 'Blank_340' from the '340' column
processed_df['Adj_Abs_340'] = processed_df[340] - processed_df['Blank_340']

# 3. Convert Adj_Abs_340 to NADH_mM using Beer-Lambert Law (A = εlc)
# c (M) = A / (ε * l)
# NADH_mM = c (M) * 1000
processed_df['NADH_mM'] = (processed_df['Adj_Abs_340'] / (MOLAR_EXTINCTION_COEFFICIENT * PATH_LENGTH)) * 1000

# Display the head of the `processed_df` to verify the new columns.
print("Head of processed_df after calculations:")
display(processed_df.head())
print(f"Shape of processed_df: {processed_df.shape}")

Head of processed_df after calculations:


Unnamed: 0,sample,Time_s,340,filename,Start_time_s,Blank_340,Adj_time_s,Adj_Abs_340,NADH_mM
0,CELL_1,1.2,1.877911,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-82.6,1.487911,0.239214
1,CELL_1,7.0,1.878656,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-76.8,1.488656,0.239334
2,CELL_1,13.4,1.880949,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-70.4,1.490949,0.239702
3,CELL_1,19.8,1.877815,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-64.0,1.487815,0.239199
4,CELL_1,26.2,1.87538,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-57.6,1.48538,0.238807


Shape of processed_df: (33226, 9)


**Reasoning**:
The next step in the task is to visualize the calculated `NADH_mM` versus `Adj_time_s` for each unique combination of `sample` and `filename`. This will provide a visual representation of the assay kinetics.



In [None]:
import plotly.express as px

# Create a unique identifier for plotting purposes
processed_df['sample_filename'] = processed_df['sample'].astype(str) + '_' + processed_df['filename']

# Plot NADH_mM vs. Adj_time_s
fig = px.line(processed_df,
              x='Adj_time_s',
              y='NADH_mM',
              color='sample_filename',
              title='NADH Concentration (mM) vs. Adjusted Time (s)',
              markers=True,
              labels={'Adj_time_s': 'Adjusted Time (seconds)', 'NADH_mM': 'NADH (mM)'})

fig.show()


## Try calculating the NADH and pyruvate concentrations using all of the wavelength data, instead of just one wavelength.
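
The idea is that the absorbance at each wavelength is modeled as a linear combination of the NADH and pyruvate reference spectra, A(λ) = c_NADH · ε_NADH(λ) + c_PYR · ε_PYR(λ), optionally plus a constant baseline offset, and the two concentrations are the least-squares coefficients. The following is a minimal sketch of the no-intercept case on synthetic numbers. It uses normal equations in plain Python rather than statsmodels, and the function name and coefficient values are made up for illustration.

```python
def fit_two_components(absorbance, eps_a, eps_b):
    """Least-squares fit of absorbance = c_a * eps_a + c_b * eps_b (no intercept)."""
    s_aa = sum(a * a for a in eps_a)
    s_bb = sum(b * b for b in eps_b)
    s_ab = sum(a * b for a, b in zip(eps_a, eps_b))
    s_ay = sum(a * y for a, y in zip(eps_a, absorbance))
    s_by = sum(b * y for b, y in zip(eps_b, absorbance))
    det = s_aa * s_bb - s_ab ** 2
    if det == 0:
        raise ValueError("reference spectra are collinear; concentrations are not identifiable")
    # Solve the 2x2 normal equations by Cramer's rule
    c_a = (s_ay * s_bb - s_by * s_ab) / det
    c_b = (s_by * s_aa - s_ay * s_ab) / det
    return c_a, c_b

# Made-up reference coefficients (absorbance per mM per cm) at five wavelengths
eps_nadh = [1.0, 3.0, 6.2, 3.5, 0.5]
eps_pyr = [0.20, 0.15, 0.02, 0.01, 0.00]

# Synthetic, noise-free spectrum for 0.15 mM NADH and 20 mM pyruvate
true_nadh, true_pyr = 0.15, 20.0
spectrum = [true_nadh * a + true_pyr * b for a, b in zip(eps_nadh, eps_pyr)]

nadh_mM, pyr_mM = fit_two_components(spectrum, eps_nadh, eps_pyr)
print(f"NADH = {nadh_mM:.3f} mM, pyruvate = {pyr_mM:.3f} mM")
```

With real spectra the fit is no longer exact, which is why the notebook's function also reports R² and offers an optional intercept.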

In [None]:
import statsmodels.api as sm
import pandas as pd

def calculate_concentrations(spectrum_df, standards_df, wavelength_range=None, absorbance_max=None, fit_intercept=False):
    """
    Calculates concentrations of NADH and Pyruvate using linear regression against standards.

    Args:
    spectrum_df: DataFrame containing 'Wavelength' and 'Absorbance'.
    standards_df: DataFrame containing 'Wavelength', 'NADH_Coeff', and 'PYR_Coeff'.
    wavelength_range: Tuple (min_nm, max_nm) to filter data. Default is None.
    absorbance_max: Float value to exclude absorbance readings above this limit. Default is None.
    fit_intercept: Boolean. If True, allows a non-zero intercept (baseline offset). Default is False.

    Returns:
    dict: {'NADH_Conc': float, 'PYR_Conc': float, 'Intercept': float, 'R_squared': float}
    """
    # Merge spectrum with standards on Wavelength
    merged_data = pd.merge(spectrum_df, standards_df, on='Wavelength', how='inner')

    # Filter by Wavelength Range if provided
    if wavelength_range:
        min_wav, max_wav = wavelength_range
        merged_data = merged_data[
            (merged_data['Wavelength'] >= min_wav) &
            (merged_data['Wavelength'] <= max_wav)
        ]

    # Filter by Absorbance Max if provided
    if absorbance_max is not None:
        merged_data = merged_data[merged_data['Absorbance'] <= absorbance_max]

    if merged_data.empty or len(merged_data) < 2:
        return {'NADH_Conc': None, 'PYR_Conc': None, 'Intercept': 0, 'R_squared': None}

    # Prepare X (Extinction Coefficients) and y (Absorbance)
    X = merged_data[['NADH_Coeff', 'PYR_Coeff']]
    y = merged_data['Absorbance']

    # Add constant if intercept is requested
    if fit_intercept:
        X = sm.add_constant(X)

    # Perform linear regression
    model = sm.OLS(y, X).fit()

    return {
        'NADH_Conc': model.params.get('NADH_Coeff', 0),
        'PYR_Conc': model.params.get('PYR_Coeff', 0),
        'Intercept': model.params.get('const', 0),
        'R_squared': model.rsquared
    }

# Task
Merge `processed_df` with `meta_df` to include `Mask_until_s` and `Adh_ug_ml` columns for each sample and filename, preparing the data for initial rate calculations and plotting.

## Merge Additional Metadata

### Subtask:
Merge the `processed_df` with relevant columns from `meta_df` ('Mask_until_s', 'Adh_ug_ml') based on 'sample' and 'filename'. This will ensure all necessary information for rate calculation and final plotting is available in a single DataFrame.


**Reasoning**:
The user wants to merge additional metadata, specifically 'Mask_until_s' and 'Adh_ug_ml', from `meta_df` into `processed_df`. This code block will perform the necessary selection, renaming, and merging operations.



In [None]:
import pandas as pd

# 1. Create a new DataFrame by selecting 'Filename', 'Cuvette', 'Mask_until_s', and 'Adh_ug_ml' columns from `meta_df`.
meta_additional_df = meta_df[['Filename', 'Cuvette', 'Mask_until_s', 'Adh_ug_ml']].copy()

# 2. Rename 'Cuvette' to 'sample' and 'Filename' to 'filename' to match `processed_df`.
meta_additional_df.rename(columns={'Cuvette': 'sample', 'Filename': 'filename'}, inplace=True)

# 3. Perform an inner merge of `processed_df` with this new metadata DataFrame on 'sample' and 'filename'.
# 4. Store the result back into `processed_df`.
processed_df = pd.merge(processed_df,
                        meta_additional_df,
                        on=['sample', 'filename'],
                        how='inner')

# 5. Display the first few rows of the updated `processed_df`.
print("Head of processed_df after merging additional metadata:")
display(processed_df.head())
print(f"Shape of processed_df: {processed_df.shape}")

Head of processed_df after merging additional metadata:


Unnamed: 0,sample,Time_s,340,filename,Start_time_s,Blank_340,Adj_time_s,Adj_Abs_340,NADH_mM,sample_filename,Mask_until_s,Adh_ug_ml
0,CELL_1,1.2,1.877911,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-82.6,1.487911,0.239214,CELL_1_251211 SERIES PDC FORWARD-1.KD,96.6,3.268
1,CELL_1,7.0,1.878656,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-76.8,1.488656,0.239334,CELL_1_251211 SERIES PDC FORWARD-1.KD,96.6,3.268
2,CELL_1,13.4,1.880949,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-70.4,1.490949,0.239702,CELL_1_251211 SERIES PDC FORWARD-1.KD,96.6,3.268
3,CELL_1,19.8,1.877815,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-64.0,1.487815,0.239199,CELL_1_251211 SERIES PDC FORWARD-1.KD,96.6,3.268
4,CELL_1,26.2,1.87538,251211 SERIES PDC FORWARD-1.KD,83.8,0.39,-57.6,1.48538,0.238807,CELL_1_251211 SERIES PDC FORWARD-1.KD,96.6,3.268


Shape of processed_df: (33226, 12)
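
The row count is the same before and after the merge (33226), which suggests that every (sample, filename) pair matched exactly one metadata row. A sketch of how that assumption could be checked explicitly is shown below, using toy DataFrames rather than the real data; the column values are illustrative.

```python
import pandas as pd

# Toy stand-ins for processed_df and the metadata table (values are illustrative)
data = pd.DataFrame({
    "sample": ["CELL_1", "CELL_1", "CELL_2"],
    "filename": ["run-1.KD", "run-1.KD", "run-1.KD"],
    "Time_s": [1.2, 7.0, 1.2],
})
meta = pd.DataFrame({
    "sample": ["CELL_1", "CELL_2"],
    "filename": ["run-1.KD", "run-1.KD"],
    "Mask_until_s": [96.6, 96.6],
})

# validate="many_to_one" raises MergeError if a (sample, filename) key is duplicated in meta;
# a left merge with indicator=True shows which data rows found no metadata.
merged = pd.merge(data, meta, on=["sample", "filename"],
                  how="left", validate="many_to_one", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(f"{len(merged)} rows after merge, {len(unmatched)} without metadata")
```

Compared with an inner merge, this keeps unmatched rows visible instead of silently dropping them.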


## Calculate Initial Rates
Calculate the initial rate for each assay and plot it against Adh enzyme concentration. Because this is a coupled assay, we expect the measured PDC slope to rise with Adh concentration only until Adh is no longer rate-limiting; beyond that point, adding more Adh should not change the slope.
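
The rate calculation amounts to a straight-line fit over a short window after the masked period, followed by a unit conversion from mM/s to µM/min. A minimal sketch on a synthetic trace is given below. It uses the closed-form least-squares slope in plain Python instead of `scipy.stats.linregress`, and the helper name and trace values are assumptions for illustration.

```python
def initial_rate_uM_per_min(times_s, nadh_mM, window_start_s, window_s=50.0):
    """Slope of NADH vs. time within [window_start_s, window_start_s + window_s], in uM/min."""
    pts = [(t, c) for t, c in zip(times_s, nadh_mM)
           if window_start_s <= t <= window_start_s + window_s]
    if len(pts) < 2:
        return None  # not enough points to fit a line
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_c = sum(c for _, c in pts) / n
    s_tt = sum((t - mean_t) ** 2 for t, _ in pts)
    s_tc = sum((t - mean_t) * (c - mean_c) for t, c in pts)
    slope_mM_per_s = s_tc / s_tt
    return slope_mM_per_s * 1000 * 60  # mM/s -> uM/min

# Synthetic trace: flat before t = 0, then NADH consumed at 1e-4 mM/s
times = [float(t) for t in range(-30, 121, 6)]
nadh = [0.24 if t < 0 else 0.24 - 1e-4 * t for t in times]

rate = initial_rate_uM_per_min(times, nadh, window_start_s=12.0)
print(f"{rate:.2f} uM/min")  # -6.00 uM/min
```

A negative rate means NADH is being consumed, which is the expected direction for the coupled PDC/Adh assay.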


In [None]:
from scipy.stats import linregress
import pandas as pd

# 1. Create an empty list to store the results of the rate calculations.
initial_rates_results = []
INITIAL_RATE_WINDOW = 50  # seconds of data after the masked period used for the linear fit

# Conversion factor from mM/s to uM/min
# 1 mM = 1000 uM
# 1 s = 1/60 min
# (mM/s) * (1000 uM/mM) * (60 s/min) = uM/min
CONVERSION_FACTOR_MM_S_TO_UM_MIN = 1000 * 60

# 2. Group the processed_df DataFrame by 'sample' and 'filename'
# to iterate through each unique experimental run.
for (sample, filename), group in processed_df.groupby(['sample', 'filename']):
    # 3a. Extract the Mask_until_s and Adh_ug_ml values (they should be constant within each group).
    mask_until_s = group['Mask_until_s'].iloc[0]
    adh_ug_ml = group['Adh_ug_ml'].iloc[0]

    # Extract Start_time_s for the current group (already present in `processed_df` from prior merge)
    start_time_s_for_group = group['Start_time_s'].iloc[0]

    # Calculate adjusted_mask_until_s
    adjusted_mask_until_s = mask_until_s - start_time_s_for_group

    # 3b. Filter the group's data to include only rows where Adj_time_s is within the specified window.
    filtered_group = group[
        (group['Adj_time_s'] >= adjusted_mask_until_s) &
        (group['Adj_time_s'] <= adjusted_mask_until_s + INITIAL_RATE_WINDOW)
    ]

    initial_rate = None
    intercept = None # Initialize intercept
    # 3c. If there is sufficient data (e.g., more than one data point) in the filtered subset,
    # perform a linear regression.
    if len(filtered_group) > 1:
        slope, intercept, r_value, p_value, std_err = linregress(
            filtered_group['Adj_time_s'],
            filtered_group['NADH_mM']
        )
        # Convert slope from mM/s to uM/min
        initial_rate = slope * CONVERSION_FACTOR_MM_S_TO_UM_MIN

    # 3d. Append a dictionary containing the results to the list.
    initial_rates_results.append({
        'sample': sample,
        'filename': filename,
        'Adh_ug_ml': adh_ug_ml,
        'initial_rate_uM_per_min': initial_rate,  # slope converted to uM/min (None if too few points)
        'intercept_NADH_mM': intercept  # regression intercept in mM (None if too few points)
    })

# 4. Convert the list of results into a new Pandas DataFrame, named initial_rates_df.
initial_rates_df = pd.DataFrame(initial_rates_results)

# Add the 'Ignore' column from meta_df to initial_rates_df
meta_ignore_df = meta_df[['Filename', 'Cuvette', 'Ignore']].copy()
meta_ignore_df.rename(columns={'Cuvette': 'sample', 'Filename': 'filename'}, inplace=True)
initial_rates_df = pd.merge(initial_rates_df,
                            meta_ignore_df,
                            on=['sample', 'filename'],
                            how='left')

# Add the 'Pyruvate_mM' column from meta_df to initial_rates_df
meta_pyruvate_df = meta_df[['Filename', 'Cuvette', 'Pyruvate_mM']].copy()
meta_pyruvate_df.rename(columns={'Cuvette': 'sample', 'Filename': 'filename'}, inplace=True)
initial_rates_df = pd.merge(initial_rates_df,
                            meta_pyruvate_df,
                            on=['sample', 'filename'],
                            how='left')

# 5. Display the head of the initial_rates_df to verify the calculated rates.
print("Head of initial_rates_df with calculated initial rates, intercept, Ignore, and Pyruvate_mM column:")
display(initial_rates_df.head())
print(f"Shape of initial_rates_df: {initial_rates_df.shape}")

Head of initial_rates_df with calculated initial rates, intercept, Ignore, and Pyruvate_mM column:


Unnamed: 0,sample,filename,Adh_ug_ml,initial_rate_uM_per_min,intercept_NADH_mM,Ignore,Pyruvate_mM
0,CELL_1,0108 1600MM PYR -1.KD,19.608,-1.008135,0.402474,,1600.0
1,CELL_1,0113 1600M PYR PDC-1.KD,19.608,-9.162176,0.420833,,1600.0
2,CELL_1,0113 200M PYR PDC-4.KD,19.608,-1.379082,0.365547,,200.0
3,CELL_1,0113 400M PYR PDC-3.KD,19.608,-1.492982,0.380073,,400.0
4,CELL_1,0113 800M PYR PDC-2.KD,19.608,-0.112952,0.393196,,800.0


Shape of initial_rates_df: (115, 7)


In [None]:
import plotly.express as px

# 1. Filter initial_rates_df for Pyruvate_mM == 20
filtered_for_plot_df = initial_rates_df[initial_rates_df['Pyruvate_mM'] == 20].copy()

# 2. Create a column indicating whether the 'Ignore' column is populated (i.e., the assay is flagged to be ignored)
filtered_for_plot_df['Is_Ignored'] = filtered_for_plot_df['Ignore'].notna()

# 3. Generate a scatter plot using Plotly Express
fig = px.scatter(filtered_for_plot_df,
                 x='Adh_ug_ml',
                 y='initial_rate_uM_per_min',
                 color='Is_Ignored', # Color by whether the assay is flagged in the 'Ignore' column
                 title='Initial Rates vs. ADH Enzyme Concentration (Pyruvate = 20 mM)',
                 labels={'Adh_ug_ml': 'ADH (ug/mL)',
                         'initial_rate_uM_per_min': 'Initial Rate (µM/min)',
                         'Is_Ignored': 'Is Ignored'})

# 4. Display the plot
fig.show()
print("Scatter plot of Initial Rates vs. ADH Enzyme Concentration displayed, filtered for 20 mM Pyruvate and colored by 'Is_Ignored'.")

Scatter plot of Initial Rates vs. ADH Enzyme Concentration displayed, filtered for 20 mM Pyruvate and colored by 'Is_Ignored'.


## Generate Individual Kinetic Plots with Overlaid Regression Lines

### Subtask:
Create individual plots for each experiment showing NADH_mM vs. Adj_time_s, with the calculated linear regression line overlaid within the initial rate window.


**Reasoning**:
The subtask requires generating individual kinetic plots with overlaid regression lines for each experiment. This involves iterating through each unique experiment, extracting kinetic data and regression parameters, calculating the regression line, and then plotting both the raw data and the regression line using Plotly.
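
The plot titles in the cell below report each rate in both µM/min and milliAbs/min. The two are related through Beer-Lambert, and the conversion can be sketched on its own as follows. The helper name is illustrative, and the same ε = 6220 M⁻¹cm⁻¹ and 1 cm path length are assumed.

```python
def uM_per_min_to_mAbs_per_min(rate_uM_per_min, epsilon=6220.0, path_cm=1.0):
    """Convert an NADH rate in uM/min to an absorbance rate in milliAbs/min."""
    rate_M_per_min = rate_uM_per_min * 1e-6           # uM -> M
    abs_per_min = epsilon * path_cm * rate_M_per_min  # Beer-Lambert: A = epsilon * l * c
    return abs_per_min * 1000.0                       # Abs -> milliAbs

# 1 uM/min of NADH corresponds to roughly 6.22 milliAbs/min at 340 nm
print(f"{uM_per_min_to_mAbs_per_min(1.0):.2f} milliAbs/min")
```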



In [None]:
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
import pandas as pd  # needed for pd.notna below

# Define constants for Beer-Lambert Law (re-defined for clarity in this cell)
MOLAR_EXTINCTION_COEFFICIENT = 6220 # M-1 cm-1
PATH_LENGTH = 1 # cm

# Iterate through each unique combination of 'sample' and 'filename' in processed_df
# This loop will generate one plot per unique experiment
for (sample, filename), group_data in processed_df.groupby(['sample', 'filename']):
    # Retrieve kinetic data for the current experiment
    kinetic_data = group_data.copy()

    # Retrieve regression parameters for the current experiment from initial_rates_df
    regression_params = initial_rates_df[
        (initial_rates_df['sample'] == sample) &
        (initial_rates_df['filename'] == filename)
    ]

    if not regression_params.empty:
        initial_rate_uM_per_min = regression_params['initial_rate_uM_per_min'].iloc[0]
        intercept_NADH_mM = regression_params['intercept_NADH_mM'].iloc[0]

        # Convert the rate from uM/min to milliAbs/min via Beer-Lambert:
        #   NADH_mM = (Adj_Abs_340 / (epsilon * l)) * 1000
        #   => milliAbs = 1000 * Adj_Abs_340 = epsilon * l * NADH_mM
        #   => d(milliAbs)/dt [milliAbs/min] = epsilon * l * d(NADH_mM)/dt [mM/s] * 60 [s/min]
        #   where d(NADH_mM)/dt [mM/s] = initial_rate_uM_per_min / (1000 * 60)

        if pd.notna(initial_rate_uM_per_min):
            initial_rate_mAbs_per_min = (MOLAR_EXTINCTION_COEFFICIENT * PATH_LENGTH * (initial_rate_uM_per_min / (1000 * 60))) * 60
        else:
            initial_rate_mAbs_per_min = np.nan

        # Retrieve Mask_until_s and Start_time_s from the kinetic data (they are constant for the group)
        mask_until_s = kinetic_data['Mask_until_s'].iloc[0]
        start_time_s = kinetic_data['Start_time_s'].iloc[0]

        # Calculate adjusted_mask_until_s
        adjusted_mask_until_s = mask_until_s - start_time_s

        # Define the start and end points for the regression line's x-axis
        x_regression_start = adjusted_mask_until_s
        x_regression_end = adjusted_mask_until_s + INITIAL_RATE_WINDOW

        # Generate x values for the regression line
        # Ensure we have at least two points to draw a line
        if x_regression_end > x_regression_start:
            x_regression = np.linspace(x_regression_start, x_regression_end, 100) # 100 points for a smooth line
        else:
            x_regression = np.array([x_regression_start, x_regression_end])

        # Calculate y values for the regression line using the formula:
        # y = (initial_rate_uM_per_min / CONVERSION_FACTOR_MM_S_TO_UM_MIN) * x + intercept_NADH_mM
        # The slope from initial_rate_uM_per_min needs to be converted back to mM/s
        slope_mM_per_s = initial_rate_uM_per_min / CONVERSION_FACTOR_MM_S_TO_UM_MIN
        y_regression = slope_mM_per_s * x_regression + intercept_NADH_mM

        # Format rates for the title
        rate_uM_str = f"{initial_rate_uM_per_min:.2f}" if pd.notna(initial_rate_uM_per_min) else "N/A"
        rate_mAbs_str = f"{initial_rate_mAbs_per_min:.2f}" if pd.notna(initial_rate_mAbs_per_min) else "N/A"

        # Create Plotly Express line plot for the kinetic data
        fig = px.line(
            kinetic_data,
            x='Adj_time_s',
            y='NADH_mM',
            title=f'Kinetics for {sample} - {filename}<br>Rate: {rate_uM_str} \u00B5M/min ({rate_mAbs_str} milliAbs/min)',
            labels={'Adj_time_s': 'Adjusted Time (s)', 'NADH_mM': 'NADH (mM)'}
        )

        # Add a scatter trace for the regression line
        fig.add_trace(
            go.Scatter(
                x=x_regression,
                y=y_regression,
                mode='lines',
                name='Regression Line',
                line=dict(color='red', width=3)
            )
        )

        # Display the plot
        fig.show()
    else:
        print(f"No regression parameters found for {sample} - {filename}. Skipping plot.")

## Results
* Fixing the error in the slope calculation made the data look a lot better. The slope values are grouped more tightly within replicates, and the overall data fits the expected pattern better.
* The PDC FORWARD DOUBLE data still looks like an outlier and should be repeated.


# Task
Load the NADH and Pyruvate standards from "/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/Spectrum files from Agilent spec/NADH_Pyruvate_Standards.csv".

Focusing on the assay data corresponding to the filename "1229 PDC PYRUVATE 100MM-8.KD", calculate the NADH concentration for each time point using two methods:
1.  **Full Spectrum Regression**: Use the `calculate_concentrations` function with the loaded standards and the full spectral data.
2.  **Absorbance at 340 nm**: Use the standard method based on absorbance at 340 nm (as previously calculated or recalculated).

Generate a plot overlaying the NADH concentration vs. time curves from both methods to visualize the comparison, and provide a brief summary of the results.

## Load Standards and Select Dataset

### Subtask:
Load the calibration standards and filter the assay data for the specific 100mM pyruvate experiment.


**Reasoning**:
Load the standards CSV, normalize column names for compatibility with the calculation function, and filter the main assay dataset for the specific experiment file as requested.



In [None]:
import pandas as pd

# 1. Load the CSV file
standards_path = "/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/Spectrum files from Agilent spec/NADH_Pyruvate_Standards.csv"
standards_df = pd.read_csv(standards_path)

# 2. Inspect and rename columns
print("Original columns in standards_df:", standards_df.columns.tolist())

# Define a renaming map for common variations
rename_map = {
    'NADH': 'NADH_Coeff',
    'Pyruvate': 'PYR_Coeff',
    'Wavelength (nm)': 'Wavelength',
    'nm': 'Wavelength'
}

# Apply renaming
standards_df.rename(columns=rename_map, inplace=True)

# Verify renaming
print("Renamed columns in standards_df:", standards_df.columns.tolist())

# 3. Filter assay_data_df for the specific filename
target_filename = "1229 PDC PYRUVATE 100MM-8.KD"
target_assay_df = assay_data_df[assay_data_df['filename'] == target_filename].copy()

# 4. Display results
print("\nStandards DataFrame Head:")
display(standards_df.head())

print(f"\nShape of target_assay_df: {target_assay_df.shape}")
print(f"Unique samples in target_assay_df: {target_assay_df['sample'].unique()}")

Original columns in standards_df: ['Wavelength', 'NADH_Coeff', 'PYR_Coeff']
Renamed columns in standards_df: ['Wavelength', 'NADH_Coeff', 'PYR_Coeff']

Standards DataFrame Head:


Unnamed: 0,Wavelength,NADH_Coeff,PYR_Coeff
0,240.0,7.497727,0.23735
1,241.0,7.740465,0.221305
2,242.0,8.04099,0.204781
3,243.0,8.37143,0.188171
4,244.0,8.73697,0.17299



Shape of target_assay_df: (333, 914)
Unique samples in target_assay_df: ['CELL_1' 'CELL_2' 'CELL_3']


**Reasoning**:
I will define the `calculate_concentrations` function and then iterate through the `target_assay_df` to calculate NADH concentrations using the full spectrum regression method. I will also retrieve the previously calculated NADH concentrations (based on 340 nm absorbance) from `processed_df`. Finally, I will merge these datasets and generate a plot to compare the two methods.



In [None]:
import statsmodels.api as sm
import pandas as pd
import plotly.express as px
import numpy as np

# Define the calculation function
def calculate_concentrations(spectrum_df, standards_df, wavelength_range=None, fit_intercept=True):
    # Merge spectrum with standards on Wavelength
    merged_data = pd.merge(spectrum_df, standards_df, on='Wavelength', how='inner')

    # Filter by Wavelength Range if provided
    if wavelength_range:
        min_wav, max_wav = wavelength_range
        merged_data = merged_data[
            (merged_data['Wavelength'] >= min_wav) &
            (merged_data['Wavelength'] <= max_wav)
        ]

    if merged_data.empty or len(merged_data) < 2:
        return {'NADH_Conc': None, 'PYR_Conc': None, 'Intercept': 0, 'R_squared': None}

    # Prepare X (Extinction Coefficients) and y (Absorbance)
    X = merged_data[['NADH_Coeff', 'PYR_Coeff']]
    y = merged_data['Absorbance']

    # Add constant if intercept is requested
    if fit_intercept:
        X = sm.add_constant(X)

    # Perform linear regression
    try:
        model = sm.OLS(y, X).fit()
        return {
            'NADH_Conc': model.params.get('NADH_Coeff', 0),
            'PYR_Conc': model.params.get('PYR_Coeff', 0),
            'Intercept': model.params.get('const', 0),
            'R_squared': model.rsquared
        }
    except Exception:
        return {'NADH_Conc': None, 'PYR_Conc': None, 'Intercept': 0, 'R_squared': None}

# Identify wavelength columns (exclude non-spectral columns)
non_spectral_cols = ['sample', 'Time_s', 'filename']
wavelength_cols = [c for c in target_assay_df.columns if c not in non_spectral_cols]

# Prepare list for results
results = []

# Iterate through each row of the target assay dataframe
for index, row in target_assay_df.iterrows():
    # Extract spectrum
    spectrum_data = row[wavelength_cols].values
    wavelengths = pd.to_numeric(wavelength_cols, errors='coerce')

    # Create temporary dataframe for the spectrum
    spectrum_df = pd.DataFrame({'Wavelength': wavelengths, 'Absorbance': spectrum_data})
    spectrum_df = spectrum_df.dropna() # Remove any NaN wavelengths if conversion failed

    # Calculate concentrations
    conc_results = calculate_concentrations(
        spectrum_df,
        standards_df,
        wavelength_range=(240, 400),  # Restrict the fit to 240-400 nm, the UV range relevant to NADH, protein, and pyruvate
        fit_intercept=True
    )

    results.append({
        'sample': row['sample'],
        'Time_s': row['Time_s'],
        'filename': row['filename'],
        'NADH_mM_Regression': conc_results['NADH_Conc'], # Assuming coeffs are in mM-1 cm-1
        'R_squared': conc_results['R_squared']
    })

# Convert results to DataFrame
regression_results_df = pd.DataFrame(results)

# Get the Method 2 data (340 nm) from processed_df
method2_df = processed_df[processed_df['filename'] == target_filename][['sample', 'Time_s', 'NADH_mM', 'Adj_time_s']].copy()
method2_df.rename(columns={'NADH_mM': 'NADH_mM_340nm'}, inplace=True)

# Merge the two datasets
comparison_df = pd.merge(regression_results_df, method2_df, on=['sample', 'Time_s'], how='inner')

# Reshape for plotting (Long format)
comparison_long_df = comparison_df.melt(
    id_vars=['sample', 'Adj_time_s', 'Time_s'],
    value_vars=['NADH_mM_Regression', 'NADH_mM_340nm'],
    var_name='Method',
    value_name='NADH_Concentration_mM'
)

# Plot
fig = px.line(
    comparison_long_df,
    x='Adj_time_s',
    y='NADH_Concentration_mM',
    color='Method',
    facet_col='sample',
    title=f'Comparison of NADH Calculation Methods: {target_filename}',
    labels={'Adj_time_s': 'Adjusted Time (s)', 'NADH_Concentration_mM': 'NADH (mM)'}
)
fig.update_yaxes(matches=None) # Allow y-axes to be independent if needed, or matches='y' for same scale
fig.show()

print("Head of comparison data:")
display(comparison_df.head())

Head of comparison data:


Unnamed: 0,sample,Time_s,filename,NADH_mM_Regression,R_squared,NADH_mM_340nm,Adj_time_s
0,CELL_1,1.4,1229 PDC PYRUVATE 100MM-8.KD,,,-0.270589,-389.9
1,CELL_1,7.0,1229 PDC PYRUVATE 100MM-8.KD,,,-0.270551,-384.3
2,CELL_1,13.5,1229 PDC PYRUVATE 100MM-8.KD,,,-0.270566,-377.8
3,CELL_1,19.9,1229 PDC PYRUVATE 100MM-8.KD,,,-0.270513,-371.4
4,CELL_1,26.2,1229 PDC PYRUVATE 100MM-8.KD,,,-0.270567,-365.1


**Reasoning**:
I need to re-run the concentration calculation with robust wavelength matching (converting to integers) because the previous attempt resulted in `None` values, likely due to floating-point mismatches during merging. Then I will generate the comparison plot and print summary statistics.



In [None]:
import statsmodels.api as sm
import pandas as pd
import plotly.express as px
import numpy as np

# Ensure standards_df is ready and Wavelength is integer for merging
standards_df['Wavelength'] = standards_df['Wavelength'].round(0).astype(int)

# Define the calculation function
def calculate_concentrations(spectrum_df, standards_df, wavelength_range=None, fit_intercept=True):
    # Ensure spectrum_df Wavelength is integer
    spectrum_df['Wavelength'] = spectrum_df['Wavelength'].round(0).astype(int)

    # Merge spectrum with standards on Wavelength
    merged_data = pd.merge(spectrum_df, standards_df, on='Wavelength', how='inner')

    # Filter by Wavelength Range if provided
    if wavelength_range:
        min_wav, max_wav = wavelength_range
        merged_data = merged_data[
            (merged_data['Wavelength'] >= min_wav) &
            (merged_data['Wavelength'] <= max_wav)
        ]

    if merged_data.empty or len(merged_data) < 2:
        return {'NADH_Conc': None, 'PYR_Conc': None, 'Intercept': 0, 'R_squared': None}

    # Prepare X (Extinction Coefficients) and y (Absorbance)
    X = merged_data[['NADH_Coeff', 'PYR_Coeff']]
    y = merged_data['Absorbance']

    # Add constant if intercept is requested
    if fit_intercept:
        X = sm.add_constant(X)

    # Perform linear regression
    try:
        model = sm.OLS(y, X).fit()
        return {
            'NADH_Conc': model.params.get('NADH_Coeff', 0),
            'PYR_Conc': model.params.get('PYR_Coeff', 0),
            'Intercept': model.params.get('const', 0),
            'R_squared': model.rsquared
        }
    except Exception:
        return {'NADH_Conc': None, 'PYR_Conc': None, 'Intercept': 0, 'R_squared': None}

# Identify wavelength columns (exclude non-spectral columns)
non_spectral_cols = ['sample', 'Time_s', 'filename', 'Adj_time_s', 'Adj_Abs_340', 'NADH_mM', 'sample_filename', 'Start_time_s', 'Blank_340', 'Mask_until_s', 'Adh_ug_ml']
wavelength_cols = [c for c in target_assay_df.columns if c not in non_spectral_cols]

# Prepare list for results
results = []

# Iterate through each row of the target assay dataframe
for index, row in target_assay_df.iterrows():
    # Extract spectrum
    spectrum_data = row[wavelength_cols].values
    wavelengths = pd.to_numeric(wavelength_cols, errors='coerce')

    # Create temporary dataframe for the spectrum
    spectrum_df = pd.DataFrame({'Wavelength': wavelengths, 'Absorbance': spectrum_data})
    spectrum_df = spectrum_df.dropna()

    # Calculate concentrations
    conc_results = calculate_concentrations(
        spectrum_df,
        standards_df,
        wavelength_range=(240, 400),
        fit_intercept=True
    )

    results.append({
        'sample': row['sample'],
        'Time_s': row['Time_s'],
        'filename': row['filename'],
        'NADH_mM_Regression': conc_results['NADH_Conc'],
        'R_squared': conc_results['R_squared']
    })

# Convert results to DataFrame
regression_results_df = pd.DataFrame(results)

# Get the Method 2 data (340 nm) from processed_df
method2_df = processed_df[processed_df['filename'] == target_filename][['sample', 'Time_s', 'NADH_mM', 'Adj_time_s']].copy()
method2_df.rename(columns={'NADH_mM': 'NADH_mM_340nm'}, inplace=True)

# Merge the two datasets
comparison_df = pd.merge(regression_results_df, method2_df, on=['sample', 'Time_s'], how='inner')

# Reshape for plotting (Long format)
comparison_long_df = comparison_df.melt(
    id_vars=['sample', 'Adj_time_s', 'Time_s'],
    value_vars=['NADH_mM_Regression', 'NADH_mM_340nm'],
    var_name='Method',
    value_name='NADH_Concentration_mM'
)

# Plot
fig = px.line(
    comparison_long_df,
    x='Adj_time_s',
    y='NADH_Concentration_mM',
    color='Method',
    facet_col='sample',
    title=f'Comparison of NADH Calculation Methods: {target_filename}',
    labels={'Adj_time_s': 'Adjusted Time (s)', 'NADH_Concentration_mM': 'NADH (mM)'}
)
fig.update_yaxes(matches=None)
fig.show()

print("Comparison Data Stats:")
print(comparison_df[['NADH_mM_Regression', 'NADH_mM_340nm']].describe())

Comparison Data Stats:
       NADH_mM_340nm
count     333.000000
mean        0.021223
std         0.131715
min        -0.270840
25%        -0.000167
50%         0.030071
75%         0.133541
max         0.153903


**Reasoning**:
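The previous attempt still produced no regression values: the `describe()` output above lists only `NADH_mM_340nm`, because the `NADH_mM_Regression` column contains only `None` and is therefore dropped from the numeric summary. The cause is likely an issue with column merging or data type handling. I will reload the standards, explicitly identify wavelength columns by checking that their names are numeric, cast the absorbance values to float, perform the regression analysis again, and then generate the comparison plot.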
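
One plausible explanation, consistent with the `.astype(float)` cast introduced in the cell below, is that values pulled from a mixed-type row come back with `object` dtype, which numerical fitting routines may reject. This was not confirmed in the notebook. The toy example only demonstrates the dtype behavior, with made-up values.

```python
import pandas as pd

# Toy frame mixing string metadata with numeric wavelength columns (values are illustrative)
df = pd.DataFrame({"sample": ["CELL_1"], "Time_s": [1.4], 339: [0.51], 340: [0.52]})

row = df.iloc[0]                      # a row spanning str and float columns has object dtype
raw_vals = row[[339, 340]].values     # still object dtype after selecting the numeric columns
clean_vals = raw_vals.astype(float)   # explicit cast gives a numeric array

print(raw_vals.dtype, clean_vals.dtype)  # object float64
```

Because the earlier attempts wrap the fit in a broad `except`, a dtype error of this kind would be swallowed and returned as `None`.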



In [None]:
import pandas as pd
import statsmodels.api as sm
import plotly.express as px
import numpy as np

# 1. Reload and clean standards
standards_path = "/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/Spectrum files from Agilent spec/NADH_Pyruvate_Standards.csv"
standards_df = pd.read_csv(standards_path)

rename_map = {
    'NADH': 'NADH_Coeff',
    'Pyruvate': 'PYR_Coeff',
    'Wavelength (nm)': 'Wavelength',
    'nm': 'Wavelength'
}
standards_df.rename(columns=rename_map, inplace=True)

# Ensure Wavelength is integer
standards_df['Wavelength'] = pd.to_numeric(standards_df['Wavelength'], errors='coerce').fillna(0).astype(int)

# 2. Prepare Assay Data
target_filename = "1229 PDC PYRUVATE 100MM-8.KD"
target_assay_df = assay_data_df[assay_data_df['filename'] == target_filename].copy()

# Identify wavelength columns: keep only those that are numeric strings
wavelength_cols = [c for c in target_assay_df.columns if str(c).isdigit()]

# 3. Define Calculation Function
def calculate_concentrations(spectrum_df, standards_df, wavelength_range=None, fit_intercept=True):
    # Ensure types match for merge
    spectrum_df['Wavelength'] = spectrum_df['Wavelength'].astype(int)

    # Merge
    merged_data = pd.merge(spectrum_df, standards_df, on='Wavelength', how='inner')

    # Filter Range
    if wavelength_range:
        merged_data = merged_data[
            (merged_data['Wavelength'] >= wavelength_range[0]) &
            (merged_data['Wavelength'] <= wavelength_range[1])
        ]

    if len(merged_data) < 2:
        return {'NADH_Conc': np.nan, 'R_squared': np.nan}

    X = merged_data[['NADH_Coeff', 'PYR_Coeff']]
    y = merged_data['Absorbance']

    if fit_intercept:
        X = sm.add_constant(X)

    try:
        model = sm.OLS(y, X).fit()
        return {
            'NADH_Conc': model.params.get('NADH_Coeff', np.nan),
            'R_squared': model.rsquared
        }
    except Exception:
        return {'NADH_Conc': np.nan, 'R_squared': np.nan}

# 4. Iterate and Calculate
results = []

for index, row in target_assay_df.iterrows():
    # Extract spectrum data for this row
    spectrum_vals = row[wavelength_cols].values.astype(float)
    wavelengths = np.array([int(c) for c in wavelength_cols])

    spectrum_df = pd.DataFrame({'Wavelength': wavelengths, 'Absorbance': spectrum_vals})

    # Calculate
    res = calculate_concentrations(spectrum_df, standards_df, wavelength_range=(240, 400), fit_intercept=True)

    results.append({
        'sample': row['sample'],
        'Time_s': row['Time_s'],
        'filename': row['filename'],
        'NADH_mM_Regression': res['NADH_Conc'],
        'R_squared': res['R_squared']
    })

regression_results_df = pd.DataFrame(results)

# 5. Get Method 2 Data (340nm)
method2_df = processed_df[processed_df['filename'] == target_filename][['sample', 'Time_s', 'NADH_mM', 'Adj_time_s']].copy()
method2_df.rename(columns={'NADH_mM': 'NADH_mM_340nm'}, inplace=True)

# 6. Merge and Plot
comparison_df = pd.merge(regression_results_df, method2_df, on=['sample', 'Time_s'], how='inner')

comparison_long_df = comparison_df.melt(
    id_vars=['sample', 'Adj_time_s', 'Time_s'],
    value_vars=['NADH_mM_Regression', 'NADH_mM_340nm'],
    var_name='Method',
    value_name='NADH_Concentration_mM'
)

fig = px.line(
    comparison_long_df,
    x='Adj_time_s',
    y='NADH_Concentration_mM',
    color='Method',
    facet_col='sample',
    title=f'Comparison of NADH Calculation Methods: {target_filename}',
    labels={'Adj_time_s': 'Adjusted Time (s)', 'NADH_Concentration_mM': 'NADH (mM)'}
)
fig.update_yaxes(matches=None)
fig.show()

print("Comparison Data Stats:")
print(comparison_df[['NADH_mM_Regression', 'NADH_mM_340nm']].describe())

Comparison Data Stats:
       NADH_mM_Regression  NADH_mM_340nm
count          333.000000     333.000000
mean            -0.004497       0.021223
std              0.002839       0.131715
min             -0.009664      -0.270840
25%             -0.006264      -0.000167
50%             -0.005144       0.030071
75%             -0.004234       0.133541
max              0.009502       0.153903
