# Process data from ADH_fwd assays
DO 12-19-2025
Based on the PDC assay data processing notebook

* Find all of the .KD files with ADH_fwd enzyme assay data
* Convert them to Pandas dataframes using the uv_pro library
* Convert from absorbance data to mM NADH
* Offset time so that the assay start time is t=0
* Convert to an EnzymeML file
* Upload the EnzymeML file, Colab notebook, and raw data to Janis Shin's GitHub folder for subsequent modeling.


## First, install necessary python libraries

In [1]:
import os

# Remove existing uv_pro directory if it exists
if os.path.exists('uv_pro'):
    !rm -rf uv_pro
    print('Removed existing uv_pro directory.')

# Clone the specific branch of the repository
!git clone -b parse-multi-cuvette-data https://github.com/danolson1/uv_pro.git

# Navigate into the cloned directory
%cd uv_pro

# Install the library in editable mode
!pip install -e .

# Go back to the original content directory
%cd ..

print('Library re-installed successfully. You can now import modules from uv_pro.')

Cloning into 'uv_pro'...
remote: Enumerating objects: 1934, done.
remote: Counting objects: 100% (548/548), done.
remote: Compressing objects: 100% (240/240), done.
remote: Total 1934 (delta 439), reused 374 (delta 308), pack-reused 1386 (from 1)
Receiving objects: 100% (1934/1934), 6.09 MiB | 10.30 MiB/s, done.
Resolving deltas: 100% (1353/1353), done.
/content/uv_pro
Obtaining file:///content/uv_pro
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... done
  Preparing editable metadata (pyproject.toml) ... done
Collecting pybaselines>=1.0.0 (from uv_pro==0.8.0)
  Downloading pybaselines-1.2.1-py3-none-any.whl.metadata (9.9 kB)
Collecting questionary>=2.0.1 (from uv_pro==0.8.0)
  Downloading questionary-2.1.1-py3-none-any.whl.metadata (5.4 kB)
Collecting lmfit>=1.3.3 (from uv_pro==0.8.0)
  Downloading lmfit-1.3.4-py3-none-any.whl.met

After installing the uv_pro library, restart the runtime: Runtime --> Restart session (Ctrl + M + .).

In [1]:
## Start by importing python libraries for data import and analysis
import plotly.express as px # for plotting the output
from uv_pro.io import import_kd
from uv_pro.io.import_kd import KDFile # Import the KDFile class
import pandas as pd
import numpy as np

# See what's available in import_kd
print("Available functions/classes in import_kd:")
print([item for item in dir(import_kd) if not item.startswith('_')])

Available functions/classes in import_kd:
['KDFile', 'Path', 'pd', 'struct']


We define a function that reads a KD file and returns the result as a pandas DataFrame for subsequent processing.

In [2]:
import os

# Define KD File Reading Function
def read_kd_to_dataframe(file_path):
    """
    Reads a .KD file, converts its spectra data to a pandas DataFrame,
    adds a 'filename' column, and returns the DataFrame.

    Args:
        file_path (str): The full path to the .KD file.

    Returns:
        pd.DataFrame: A DataFrame containing the spectra data, with 'sample',
                      'Time_s', and 'filename' columns.
    """
    kd_file = KDFile(file_path)
    spectra_df = kd_file.spectra.T.reset_index()
    spectra_df.rename(columns={'Time (s)': 'Time_s'}, inplace=True)
    spectra_df.insert(0, 'sample', kd_file.samples_cell)

    # Remove 'SAMPLES_' prefix from the 'sample' column to better match what is written
    # in the Enzyme_assay_metadata spreadsheet
    spectra_df['sample'] = spectra_df['sample'].str.replace('SAMPLES_', '', regex=False)

    # Add the base filename as a new column
    base_filename = os.path.basename(file_path)
    spectra_df['filename'] = base_filename
    return spectra_df


# Test the modified function
# test_file_path = '/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025/KD files from Agilent spec/251211 SERIES PDC FORWARD-1.KD'
# print(f"Testing read_kd_to_dataframe with: {test_file_path}")
# cleaned_df = read_kd_to_dataframe(test_file_path)
# print("Head of the DataFrame after cleaning 'sample' column:")
# display(cleaned_df.head())

## Find all of the .KD files with the assay data we want
To read files shared on your Google Drive, you need to mount the drive first, which the following code does. The PROJECT_ROOT variable will need to be changed depending on the structure of the user's Google Drive. Uncomment the line that applies to you.



In [3]:
import os
from google.colab import drive
drive.mount('/content/drive')

PROJECT_ROOT = "/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025" # Project root for running on Dan's computer
#PROJECT_ROOT = "/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025" # Project root for running on Evelyn's computer

%cd "$PROJECT_ROOT"

os.getcwd() # Confirm that we have changed to the correct directory

Mounted at /content/drive
/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025


'/content/drive/MyDrive/Research/PDC+ADH+FDH assay data Evelyn 2025'

We will read the Enzyme_assay_metadata spreadsheet to determine which assay data files to read and the conditions for each assay. This Google Sheet is published in comma-separated values (CSV) format at a publicly available URL. After the sheet is edited, the published CSV may take a few minutes to update.

In [8]:
# Load data from the Enzyme_assay_metadata google doc
public_csv_url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRVpwYqImFkaUigsWgrO9MRtWjYWwps82EExnomLqNr_hOUNViKF_fFyAhJfIqe3hDq0IEG76W4v_fO/pub?output=csv"
meta_df = pd.read_csv(public_csv_url)
#display(meta_df.head())

# Filter to just rows with the ADH_fwd assay
filtered_meta_df = meta_df[meta_df['Assay'] == 'ADH_fwd']
filtered_meta_df[:5]


Unnamed: 0,Experiment_ID,Filename,Assay,Cuvette,Start_time_s,Mask_until_s,Blank_340,Volume_ul,Temperature_C,pH,Tris-HCl_mM,TPP_mM,MgCl2_mM,Pyruvate_mM,Acetaldehyde_mM,Ethanol_mM,NADH_mM,NAD_mM,Adh_ug_ml,Pdc_ug_ml
19,Assay 7,251211 SERIES ADH FORWARD-3.KD,ADH_fwd,CELL_1,51.9,64.7,0.0,1000.0,37.0,7.0,100.0,0.4,5.0,,10.0,,0.3,,0.05477,
20,Assay 7,251211 SERIES ADH FORWARD-3.KD,ADH_fwd,CELL_2,51.9,64.7,0.0,1000.0,37.0,7.0,100.0,0.4,5.0,,10.0,,0.3,,0.05477,
21,Assay 7,251211 SERIES ADH FORWARD-3.KD,ADH_fwd,CELL_3,51.9,64.7,0.0,1000.0,37.0,7.0,100.0,0.4,5.0,,10.0,,0.3,,0.05477,
22,Assay 7,251211 3-SINGLE ADH FORWARD.KD,ADH_fwd,CELL_1,80.0,95.9,0.0,1000.0,37.0,7.0,100.0,0.4,5.0,,10.0,,0.3,,0.05477,
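
Because the published CSV can lag behind the sheet, it may be worth checking that the filtered metadata contains the columns and rows that the later cells rely on. The following is a minimal sketch, not part of the original notebook: the helper name `check_assay_metadata` and the in-memory CSV are hypothetical stand-ins for `meta_df`.

```python
import io
import pandas as pd

# Columns that the later processing cells read from the metadata sheet
REQUIRED_COLUMNS = ['Filename', 'Assay', 'Cuvette', 'Start_time_s', 'Mask_until_s', 'Blank_340']

def check_assay_metadata(meta_df, assay_name):
    """Hypothetical helper: return the rows for one assay, raising if the sheet looks incomplete."""
    missing = [col for col in REQUIRED_COLUMNS if col not in meta_df.columns]
    if missing:
        raise ValueError(f"Metadata is missing columns: {missing}")
    subset = meta_df[meta_df['Assay'] == assay_name]
    if subset.empty:
        raise ValueError(f"No rows found for assay '{assay_name}'")
    return subset

# Small in-memory stand-in for the published sheet (values copied from the table above)
example_csv = io.StringIO(
    "Filename,Assay,Cuvette,Start_time_s,Mask_until_s,Blank_340\n"
    "251211 SERIES ADH FORWARD-3.KD,ADH_fwd,CELL_1,51.9,64.7,0.0\n"
    "251211 3-SINGLE ADH FORWARD.KD,ADH_fwd,CELL_1,80.0,95.9,0.0\n"
)
example_meta_df = pd.read_csv(example_csv)
adh_rows = check_assay_metadata(example_meta_df, 'ADH_fwd')
print(len(adh_rows))
```

Raising early here gives a clearer message than a `KeyError` several cells later if the sheet has not finished updating.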


In [None]:
# Define the subfolder name for KD files. This assumes we've already moved to the PDC+ADH+FDH assay data Evelyn 2025 folder
base_path = os.path.join(os.getcwd(), "KD files from Agilent spec")

df_list = []

# Loop through filenames and check if the file path is valid
unique_filenames = filtered_meta_df['Filename'].unique()
print("#### Processing KD files: ####")
for filename in unique_filenames:
    file_path = os.path.join(base_path, filename)
    if os.path.exists(file_path):
        print(f"- {filename}: EXISTS ({file_path})")

        # Read the .KD file, and add the result to df_list
        df_list.append(read_kd_to_dataframe(file_path))
    else:
        print(f"- {filename}: DOES NOT EXIST ({file_path})")

# Concatenate all dataframes in df_list into a single dataframe
assay_data_df = pd.concat(df_list, ignore_index=True)
print("\nCombined DataFrame created successfully.")
print("Head of the combined DataFrame:")
display(assay_data_df.head())
print(f"Shape of the combined DataFrame: {assay_data_df.shape}")
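
If none of the listed files are found, `pd.concat` raises an error on the empty list, so it can help to separate found and missing files before reading anything. A minimal sketch, with a hypothetical helper `split_existing_files` and a temporary directory standing in for the Drive folder:

```python
import os
import tempfile

def split_existing_files(base_path, filenames):
    """Hypothetical helper: return (found_paths, missing_names) for the given filenames."""
    found, missing = [], []
    for name in filenames:
        path = os.path.join(base_path, name)
        if os.path.exists(path):
            found.append(path)
        else:
            missing.append(name)
    return found, missing

# Demonstration with a temporary folder instead of the real KD folder
with tempfile.TemporaryDirectory() as tmp_dir:
    open(os.path.join(tmp_dir, 'present.KD'), 'wb').close()  # create an empty placeholder file
    found, missing = split_existing_files(tmp_dir, ['present.KD', 'absent.KD'])

print([os.path.basename(p) for p in found])
print(missing)
```

In the notebook, one could stop with a clear message when `found` is empty, rather than letting the concatenation step fail.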


In [5]:
# create a unique identifier for each run
assay_data_df['sample_filename'] = assay_data_df['sample'].astype(str) + '_' + assay_data_df['filename']

# plot the data
fig = px.line(assay_data_df,
              x='Time_s',
              y=340,
              color='sample_filename',
              title='Absorbance at 340 nm vs. Time by Sample and Filename',
              markers=True,
              labels={'Time_s': 'Time (seconds)', 340: 'Absorbance at 340 nm'})

fig.show()

# Task
Extract, merge, and process assay data to calculate adjusted time and NADH concentration (mM), then visualize `NADH_mM` vs. `Adj_time_s` for each sample and filename, and prepare the data for kinetic analysis.

## Extract and Merge Metadata for Calculations

### Subtask:
Extract relevant columns from `assay_data_df`: 'sample', 'Time_s', '340', and 'filename'. Then, merge this data with `meta_df` to bring in the 'Start_time_s', 'Blank_340', and 'Mask_until_s' values, which are needed for the per-group calculations. The merge is performed on the 'sample' and 'filename' columns.


**Reasoning**:
The subtask requires extracting specific columns from `assay_data_df`, selecting relevant columns from `meta_df`, renaming columns in the metadata for alignment, and then merging these two dataframes. This code block will perform all these data manipulation steps and display the head of the resulting merged DataFrame.



In [12]:
import pandas as pd

# 1. Create a new DataFrame, say `processed_df`, by selecting the 'sample', 'Time_s', '340', and 'filename' columns from the `assay_data_df` DataFrame.
processed_df = assay_data_df[['sample', 'Time_s', 340, 'filename']].copy()

# 2. Create another DataFrame by selecting 'Filename', 'Cuvette', 'Start_time_s', 'Blank_340', and 'Mask_until_s' columns from the `meta_df` DataFrame.
meta_subset_df = meta_df[['Filename', 'Cuvette', 'Start_time_s', 'Blank_340', 'Mask_until_s']].copy()

# 3. Rename the 'Cuvette' column in this new DataFrame to 'sample' to align with the column name in `processed_df`.
# 4. Rename the 'Filename' column in this new DataFrame to 'filename' to align with the column name in `processed_df`.
meta_subset_df.rename(columns={'Cuvette': 'sample', 'Filename': 'filename'}, inplace=True)

# 5. Merge `processed_df` with the renamed metadata DataFrame using an inner merge on the 'sample' and 'filename' columns.
# Store the result back into `processed_df`.
processed_df = pd.merge(processed_df,
                        meta_subset_df,
                        on=['sample', 'filename'],
                        how='inner')

# 6. Display the head of the `processed_df` to verify the merge and the presence of the new columns.
print("Head of the merged processed_df:")
display(processed_df.head())
print(f"Shape of processed_df: {processed_df.shape}")

Head of the merged processed_df:


Unnamed: 0,sample,Time_s,340,filename,Start_time_s,Blank_340,Mask_until_s
0,CELL_1,1.3,1.510813,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7
1,CELL_1,7.1,1.509446,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7
2,CELL_1,13.6,1.508778,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7
3,CELL_1,19.9,1.509935,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7
4,CELL_1,26.4,1.508879,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7


Shape of processed_df: (1418, 7)
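
An inner merge silently drops any spectra rows whose `sample` and `filename` have no match in the metadata. One way to check for that, sketched here on small made-up frames (the names `spectra_example` and `meta_example` are hypothetical), is a left merge with `indicator=True`:

```python
import pandas as pd

# Made-up spectra rows: CELL_2 has no matching metadata row
spectra_example = pd.DataFrame({
    'sample': ['CELL_1', 'CELL_1', 'CELL_2'],
    'filename': ['run.KD', 'run.KD', 'run.KD'],
    'Time_s': [1.3, 7.1, 1.3],
})
meta_example = pd.DataFrame({
    'sample': ['CELL_1'],
    'filename': ['run.KD'],
    'Start_time_s': [51.9],
})

# validate='many_to_one' raises if the metadata has duplicate (sample, filename) keys
checked = pd.merge(spectra_example, meta_example,
                   on=['sample', 'filename'],
                   how='left', indicator=True, validate='many_to_one')

# Rows marked 'left_only' would have been dropped by an inner merge
unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched[['sample', 'filename']].drop_duplicates())
```

Any rows listed here usually point to a mismatch between the cuvette names in the KD file and the 'Cuvette' column of the metadata sheet.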


## Calculate Adjusted Time and NADH Concentration

### Subtask:
Calculate `Adj_time_s` by subtracting `Start_time_s` from `Time_s`. Calculate `Adj_Abs_340` by subtracting `Blank_340` from the `340` column. Finally, convert `Adj_Abs_340` to `NADH_mM` using the Beer-Lambert Law (A = εlc), where:
- A is the absorbance (our `Adj_Abs_340`)
- ε is the molar extinction coefficient (for NADH at 340 nm, ε = 6220 M⁻¹cm⁻¹)
- l is the path length (assumed to be 1 cm for standard cuvettes)
- c is the concentration in Molar, which we will convert to mM.
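
As a worked check of these formulas, the sketch below applies them to the first data point of the merged table above (`Time_s` = 1.3, absorbance 1.510813, `Start_time_s` = 51.9, `Blank_340` = 0.0). The function name `absorbance_to_nadh_mM` is a hypothetical helper, not part of the notebook.

```python
EPSILON_NADH_340 = 6220.0  # M^-1 cm^-1, molar extinction coefficient of NADH at 340 nm
PATH_LENGTH_CM = 1.0       # assumed 1 cm cuvette

def absorbance_to_nadh_mM(abs_340, blank_340=0.0):
    """Hypothetical helper: blank-correct an absorbance and convert it to mM NADH."""
    conc_molar = (abs_340 - blank_340) / (EPSILON_NADH_340 * PATH_LENGTH_CM)
    return conc_molar * 1000.0  # M -> mM

adj_time_s = 1.3 - 51.9  # Time_s - Start_time_s; negative means before the assay start
nadh_mM = absorbance_to_nadh_mM(1.510813, blank_340=0.0)
print(round(adj_time_s, 1), round(nadh_mM, 6))
```

The result is about 0.243 mM NADH at roughly 50 s before the assay start.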

### Reasoning:
These calculations standardize the time measurements and convert raw absorbance data into a biologically meaningful NADH concentration. `Adj_time_s` ensures that each assay starts at t=0, while `NADH_mM` gives the concentration of NADH remaining in the cuvette, accounting for background absorbance. In the ADH_fwd assay NADH is consumed, so `NADH_mM` decreases over time.

**Reasoning**:
This code block performs the calculations outlined in the previous markdown step. It calculates the adjusted time (`Adj_time_s`) and adjusted absorbance (`Adj_Abs_340`), and then converts the adjusted absorbance to NADH concentration in millimolar (`NADH_mM`) using the Beer-Lambert law with a molar extinction coefficient of 6220 M⁻¹cm⁻¹ and a path length of 1 cm. Finally, it displays the head of the updated DataFrame.



In [13]:
# Define constants for Beer-Lambert Law
MOLAR_EXTINCTION_COEFFICIENT = 6220 # M-1 cm-1
PATH_LENGTH = 1 # cm

# 1. Calculate Adj_time_s: Subtract 'Start_time_s' from 'Time_s'
processed_df['Adj_time_s'] = processed_df['Time_s'] - processed_df['Start_time_s']

# 2. Calculate Adj_Abs_340: Subtract 'Blank_340' from the '340' column
processed_df['Adj_Abs_340'] = processed_df[340] - processed_df['Blank_340']

# 3. Convert Adj_Abs_340 to NADH_mM using Beer-Lambert Law (A = εlc)
# c (M) = A / (ε * l)
# NADH_mM = c (M) * 1000
processed_df['NADH_mM'] = (processed_df['Adj_Abs_340'] / (MOLAR_EXTINCTION_COEFFICIENT * PATH_LENGTH)) * 1000

# Display the head of the `processed_df` to verify the new columns.
print("Head of processed_df after calculations:")
display(processed_df.head())
print(f"Shape of processed_df: {processed_df.shape}")

Head of processed_df after calculations:


Unnamed: 0,sample,Time_s,340,filename,Start_time_s,Blank_340,Mask_until_s,Adj_time_s,Adj_Abs_340,NADH_mM
0,CELL_1,1.3,1.510813,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7,-50.6,1.510813,0.242896
1,CELL_1,7.1,1.509446,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7,-44.8,1.509446,0.242676
2,CELL_1,13.6,1.508778,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7,-38.3,1.508778,0.242569
3,CELL_1,19.9,1.509935,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7,-32.0,1.509935,0.242755
4,CELL_1,26.4,1.508879,251211 SERIES ADH FORWARD-3.KD,51.9,0.0,64.7,-25.5,1.508879,0.242585


Shape of processed_df: (1418, 10)


**Reasoning**:
The next step in the task is to visualize the calculated `NADH_mM` versus `Adj_time_s` for each unique combination of `sample` and `filename`. This will provide a visual representation of the assay kinetics.



In [14]:
import plotly.express as px

# Create a unique identifier for plotting purposes
processed_df['sample_filename'] = processed_df['sample'].astype(str) + '_' + processed_df['filename']

# Plot NADH_mM vs. Adj_time_s
fig = px.line(processed_df,
              x='Adj_time_s',
              y='NADH_mM',
              color='sample_filename',
              title='NADH Concentration (mM) vs. Adjusted Time (s)',
              markers=True,
              labels={'Adj_time_s': 'Adjusted Time (seconds)', 'NADH_mM': 'NADH (mM)'})

fig.show()


## Calculate Initial Rates

### Subtask:
For each unique combination of 'sample' and 'filename', filter the data to the initial rate window: `Adj_time_s` values from the adjusted mask time ('Mask_until_s' minus 'Start_time_s') up to 50 seconds after it. Within this window, perform a linear regression of `NADH_mM` against `Adj_time_s` to determine the slope (initial rate). Store these rates in a new DataFrame along with 'sample' and 'filename'.


**Reasoning**:
The subtask requires calculating initial rates using linear regression on a filtered time window for each unique sample and filename. This step will import the necessary function, iterate through grouped data, apply the filtering and regression, and store the results in a new DataFrame.



In [19]:
from scipy.stats import linregress
import pandas as pd

# 1. Create an empty list to store the results of the rate calculations.
initial_rates_results = []
INITIAL_RATE_WINDOW = 50

# Conversion factor from mM/s to uM/min
# 1 mM = 1000 uM
# 1 s = 1/60 min
# (mM/s) * (1000 uM/mM) * (60 s/min) = uM/min
CONVERSION_FACTOR_MM_S_TO_UM_MIN = 1000 * 60

# 2. Group the processed_df DataFrame by 'sample' and 'filename'
# to iterate through each unique experimental run.
for (sample, filename), group in processed_df.groupby(['sample', 'filename']):
    # 3a. Extract the Mask_until_s and Start_time_s (they should be constant within each group).
    # Calculate the adjusted mask until time
    adjusted_mask_until_s = group['Mask_until_s'].iloc[0] - group['Start_time_s'].iloc[0]

    # 3b. Filter the group's data to include only rows where Adj_time_s is within the specified window.
    filtered_group = group[
        (group['Adj_time_s'] >= adjusted_mask_until_s) &
        (group['Adj_time_s'] <= adjusted_mask_until_s + INITIAL_RATE_WINDOW)
    ]

    initial_rate = None
    # 3c. If there is sufficient data (e.g., more than one data point) in the filtered subset,
    # perform a linear regression.
    if len(filtered_group) > 1:
        slope, intercept, r_value, p_value, std_err = linregress(
            filtered_group['Adj_time_s'],
            filtered_group['NADH_mM']
        )
        # Convert slope from mM/s to uM/min
        initial_rate = slope * CONVERSION_FACTOR_MM_S_TO_UM_MIN

    # 3d. Append a dictionary containing the results to the list.
    initial_rates_results.append({
        'sample': sample,
        'filename': filename,
        'initial_rate_uM_per_min': initial_rate
    })

# 4. Convert the list of results into a new Pandas DataFrame, named initial_rates_df.
initial_rates_df = pd.DataFrame(initial_rates_results)

# 5. Display the initial rates.
initial_rates_df

Unnamed: 0,sample,filename,initial_rate_uM_per_min
0,CELL_1,251211 3-SINGLE ADH FORWARD.KD,-39.919944
1,CELL_1,251211 SERIES ADH FORWARD-3.KD,-51.621431
2,CELL_2,251211 SERIES ADH FORWARD-3.KD,-52.261979
3,CELL_3,251211 SERIES ADH FORWARD-3.KD,-51.469356


In [22]:
import plotly.graph_objects as go
from scipy.stats import linregress

# Group by the unique identifier for each trace
for (sample, filename), group_df in processed_df.groupby(['sample', 'filename']):
    # Get the unique identifier for the title
    sample_filename = group_df['sample_filename'].iloc[0]

    # Get mask_until_s and Start_time_s for the current group
    # Calculate the adjusted mask until time
    adjusted_mask_until_s = group_df['Mask_until_s'].iloc[0] - group_df['Start_time_s'].iloc[0]

    # Define the initial rate window using the adjusted mask until time
    x_start_regression = adjusted_mask_until_s
    x_end_regression = adjusted_mask_until_s + INITIAL_RATE_WINDOW

    # Filter data for regression calculation within the specified window
    regression_data_df = group_df[
        (group_df['Adj_time_s'] >= x_start_regression) &
        (group_df['Adj_time_s'] <= x_end_regression)
    ]

    # Perform linear regression if sufficient data points exist
    if len(regression_data_df) > 1:
        slope_mM_per_s, intercept_mM, _, _, _ = linregress(
            regression_data_df['Adj_time_s'],
            regression_data_df['NADH_mM']
        )

        # Calculate y-values for the regression line
        y_start_regression = slope_mM_per_s * x_start_regression + intercept_mM
        y_end_regression = slope_mM_per_s * x_end_regression + intercept_mM

        # Create the plot
        fig = go.Figure()

        # Add the original NADH vs. Adj_time_s trace
        fig.add_trace(go.Scatter(
            x=group_df['Adj_time_s'],
            y=group_df['NADH_mM'],
            mode='lines+markers',
            name='NADH (mM) Data',
            marker=dict(size=4)
        ))

        # Add the initial rate regression line with increased thickness and transparency
        fig.add_trace(go.Scatter(
            x=[x_start_regression, x_end_regression],
            y=[y_start_regression, y_end_regression],
            mode='lines',
            name=f'Initial Rate (Slope={slope_mM_per_s*CONVERSION_FACTOR_MM_S_TO_UM_MIN:.2f} uM/min)',
            line=dict(color='red', width=5, dash='dash'),
            opacity=0.6 # Opacity should be set directly on the Scatter trace
        ))

        # Add a marker at the end of the masked region, i.e. the first data point
        # used for the initial rate regression
        post_mask_df = group_df[group_df['Adj_time_s'] >= adjusted_mask_until_s]
        start_mask_NADH = post_mask_df['NADH_mM'].iloc[0] if not post_mask_df.empty else None
        if start_mask_NADH is not None:
            fig.add_trace(go.Scatter(
                x=[adjusted_mask_until_s],
                y=[start_mask_NADH],
                mode='markers',
                name='Mask Start',
                marker=dict(color='orange', size=8, symbol='star')
            ))

        fig.update_layout(
            title=f'NADH Concentration (mM) vs. Adjusted Time (s) for {sample_filename}',
            xaxis_title='Adjusted Time (seconds)',
            yaxis_title='NADH (mM)',
            hovermode='x unified'
        )

        fig.show()
    else:
        print(f"Skipping plot for {sample_filename}: Not enough data points for regression in the initial rate window.")

The slopes for the 3 cuvettes that were run at the same time agree well with each other (about -51.5 to -52.3 uM/min). However, the slope for the 1-cuvette assay (-39.9 uM/min) is about 23% lower in magnitude than the 3-cuvette average. This is more than I would expect for assays with the same setup. We should run some more assays to figure out what's going on.

My guesses:
* Assay temperatures were slightly different
* Pipetting error (particularly in Adh enzyme) between assays
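
To put a number on the discrepancy, the sketch below compares the single-cuvette rate with the mean of the three simultaneous cuvettes, using the values from the initial-rate table above. Variable names are illustrative only.

```python
from statistics import mean, stdev

# Initial rates (uM/min) copied from the initial_rates_df output above
series_rates = [-51.621431, -52.261979, -51.469356]  # 3-cuvette run
single_rate = -39.919944                              # 1-cuvette run

series_mean = mean(series_rates)
# Spread among the three simultaneous cuvettes, as a percentage of their mean
series_cv_percent = abs(stdev(series_rates) / series_mean) * 100
# How much slower the single-cuvette assay is, relative to the 3-cuvette mean
percent_difference = (1 - single_rate / series_mean) * 100

print(f"CV among series cuvettes: {series_cv_percent:.1f}%")
print(f"Single-cuvette rate is {percent_difference:.0f}% lower in magnitude")
```

Comparing the between-assay difference with the within-assay spread helps show whether the gap is larger than cuvette-to-cuvette variation alone.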