# Overview of the CrystFEL-based Processing Workflow

This notebook implements a complete workflow for processing crystallography data using CrystFEL tools alongside custom scripts. It is designed to:
  
1. **Run indexamajig** (`gandalf_iterator`)  
   - Execute peak finding, indexing, and integration for each HDF5 file.
   - Vary the beam center coordinates on a radial grid defined by a maximum radius and a step size.

2. **Visualize Indexing Results**  
   - Generate 3D histograms and 2D heatmaps to assess indexing performance.

3. **Evaluate Indexing Metrics** (`process_indexing_metrics`)  
   - Parse stream files to compute indexing quality metrics (IQMs) for each frame.
   - Analyze metrics such as weighted RMSD, fraction of outliers, length and angle deviations, peak ratio, and percentage unindexed.

4. **Interactive IQM Metrics Dashboard**  
   - Use interactive sliders and weight inputs to filter and combine metrics.
   - Create/update a combined metric and display histograms for filtered results.

5. **CSV-to-Stream Conversion**  
   - Convert the filtered metrics CSV into a stream file using custom conversion scripts.

6. **Merge Best Results** (`merge`)  
   - Merge the best result stream file to refine cell parameters and symmetry.

7. **Convert to SHELX-Compatible .hkl**  
   - Transform the merged output to a SHELX-compatible format for further crystallographic analysis.

8. **Convert to .mtz Format**  
   - Prepare data for downstream crystallographic software by converting .hkl to .mtz.

> **Prerequisites:**  
> Ensure that all preprocessing steps (peak finding, center refinement, etc.) have been completed and that the required tools and Python packages (CrystFEL, ipywidgets, ipyfilechooser, matplotlib, etc.) are installed and properly configured before running the notebook.




# ==============================================
# Run indexamajig Iterations on a Circular Grid
### Options for Peakfinding, Indexing and Integration
# ==============================================


In [None]:
from gandalf_radial_iterator import gandalf_iterator

geomfile_path = "/path/to/GEOM.geom"       # .geom file
cellfile_path = "/path/to/CELL.cell"          # .cell file

input_path =   "/path/to/h5/folder"      # folder containing the .h5 files; also used as the output folder

output_file_base = "basename"    # output files will be named output_file_base_xcoord_ycoord.h5

num_threads = 24             # number of CPU threads to use

"""Define the grid and maximum radius in pixels for iterations.
For example, max_radius = 1 and step = 0.2 give 81 iterations.
Iterations will start at the center and move radially outwards.
"""
max_radius = 2             # maximum radius in pixels
step = 0.5                 # grid granularity in pixels

extra_flags=[
# PEAKFINDING
"--no-revalidate",
"--no-half-pixel-shift",
"--peaks=cxi", 
"--min-peaks=15",
# INDEXING
"--indexing=xgandalf",
"--tolerance=10,10,10,5",
"--no-refine",
"--xgandalf-sampling-pitch=5",
"--xgandalf-grad-desc-iterations=1",
"--xgandalf-tolerance=0.02",
# INTEGRATION
"--integration=rings",
"--int-radius=4,5,9",
"--fix-profile-radius=70000000",
# OUTPUT
"--no-non-hits-in-stream",
]

"""Examples of extra flags(see crystfel documentation https://www.desy.de/~twhite/crystfel/manual-indexamajig.html):
    
    Peakfinding
    "--peaks=cxi",
    "--peak-radius=inner,middle,outer",
    "--min-peaks=n",
    "--median-filter=n",
    "--filter-noise",
    "--no-revalidate",
    "--no-half-pixel-shift",

    "--peaks=peakfinder9",
    "--min-snr=1",
    "--min-snr-peak-pix=6",
    "--min-snr-biggest-pix=1",
    "--min-sig=9",
    "--min-peak-over-neighbour=5",
    "--local-bg-radius=5",

    "--peaks=peakfinder8",
    "--threshold=45",
    "--min-snr=3",
    "--min-pix-count=3",
    "--max-pix-count=500",
    "--local-bg-radius=9",
    "--min-res=30",
    "--max-res=500",
    
    Indexing
    "--indexing=xgandalf",

    "--tolerance=tol",
    "--no-check-cell",
    "--no-check-peaks",
    "--multi",
    "--no-retry",
    "--no-refine",

    "--xgandalf-sampling-pitch=n",
    "--xgandalf-grad-desc-iterations=n",
    "--xgandalf-tolerance=n",
    "--xgandalf-no-deviation-from-provided-cell",
    "--xgandalf-max-lattice-vector-length=n",
    "--xgandalf-min-lattice-vector-length=n",
    "--xgandalf-max-peaks=n",

    Integration
    "--fix-profile-radius=n",
    "--integration=rings",
    "--int-radius=4,5,10",
    "--push-res=n",
    "--overpredict",

    Output
    "--no-non-hits-in-stream",
    "--no-peaks-in-stream",
    "--no-refls-in-stream",
"""

gandalf_iterator(geomfile_path, cellfile_path, input_path, output_file_base, num_threads, max_radius=max_radius, step=step, extra_flags=extra_flags)
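
The grid described in the cell above can be sketched as follows. This is a minimal illustration, not the code inside `gandalf_iterator`: the helper name `radial_grid` and the inclusion rule (keep lattice points whose distance from the origin is at most `max_radius`) are assumptions, chosen because they reproduce the 81 iterations quoted for `max_radius = 1`, `step = 0.2`.

```python
import math

def radial_grid(max_radius, step):
    """Return (dx, dy) offsets in pixels on a square lattice of spacing `step`,
    restricted to a circle of radius `max_radius` and sorted centre-outwards.

    Hypothetical helper: the real gandalf_iterator may enumerate differently.
    """
    ratio = max_radius / step
    n = int(math.floor(ratio + 1e-9))   # lattice steps per axis
    limit = ratio * ratio + 1e-9        # squared radius in lattice units
    points = [
        (i, j)
        for i in range(-n, n + 1)
        for j in range(-n, n + 1)
        if i * i + j * j <= limit
    ]
    # Sort by squared radius so the iteration starts at the centre.
    points.sort(key=lambda p: (p[0] ** 2 + p[1] ** 2, p))
    return [(i * step, j * step) for i, j in points]

offsets = radial_grid(max_radius=1, step=0.2)
print(len(offsets), offsets[0])
```

Working in integer lattice units and converting to pixels only at the end avoids accumulating floating-point error when the step is not exactly representable.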


# Visualize Indexing Results: 3D Histogram & 2D Heatmap

In [None]:
from indexing_3d_histogram import plot3d_indexing_rate
from indexing_center import indexing_heatmap

output_folder = "path/to/output/folder"
plot3d_indexing_rate(output_folder)
indexing_heatmap(output_folder)
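
The indexing rate that these plots summarise can be estimated directly from a stream file. The sketch below counts chunks, and chunks containing at least one crystal, using the `----- Begin chunk -----` and `--- Begin crystal` markers of the CrystFEL stream format. The function name `indexing_rate` is hypothetical and is not part of `indexing_3d_histogram` or `indexing_center`.

```python
def indexing_rate(stream_text):
    """Return (n_chunks, n_indexed, rate_percent) for the text of one stream file."""
    n_chunks = 0
    n_indexed = 0
    in_chunk = False
    has_crystal = False
    for line in stream_text.splitlines():
        if line.startswith("----- Begin chunk"):
            in_chunk, has_crystal = True, False
        elif line.startswith("--- Begin crystal") and in_chunk:
            has_crystal = True
        elif line.startswith("----- End chunk") and in_chunk:
            n_chunks += 1
            n_indexed += int(has_crystal)
            in_chunk = False
    rate = 100.0 * n_indexed / n_chunks if n_chunks else 0.0
    return n_chunks, n_indexed, rate

# Tiny synthetic example (not real data): two chunks, one of them indexed.
example = "\n".join([
    "----- Begin chunk -----",
    "--- Begin crystal",
    "--- End crystal",
    "----- End chunk -----",
    "----- Begin chunk -----",
    "----- End chunk -----",
])
print(indexing_rate(example))
```

Note that when `--no-non-hits-in-stream` is used, frames rejected by the hit finder are not written to the stream, so a rate computed this way is relative to the hits only.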


# ==============================================
# Process Indexing Metrics Across All Stream Files
# ==============================================

In [None]:
from process_indexing_metrics import process_indexing_metrics

# Enter folder with stream file results from indexamajig. 
# Note that ALL stream files in the folder will be processed.

stream_file_folder = "path/to/stream/folder"
wrmsd_tolerance = 2.0
indexing_tolerance = 4.0

"""
wrmsd_tolerance :
The number of standard deviations away from the mean weighted RMSD for a chunk to be considered an outlier. Default factor is 2.0.

indexing_tolerance :
The maximum deviation in pixels between observed and predicted peak positions for a peak to be considered indexed. Default is 1.0 pixel.

The following metrics will be evaluated for analysis in the next step:

- 'weighted_rmsd'
- 'fraction_outliers'
- 'length_deviation'
- 'angle_deviation'
- 'peak_ratio'
- 'percentage_unindexed'

"""

process_indexing_metrics(stream_file_folder, wrmsd_tolerance=wrmsd_tolerance, indexing_tolerance=indexing_tolerance)
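
To make the two tolerances concrete, here is a minimal sketch of the rules described in the cell above. Both helpers (`flag_wrmsd_outliers`, `fraction_unindexed`) are hypothetical and only illustrate the definitions; the actual calculations in `process_indexing_metrics` may differ, for example in how the standard deviation is computed or how peaks are paired with predictions.

```python
import math
import statistics

def flag_wrmsd_outliers(values, wrmsd_tolerance=2.0):
    """Mark values lying more than `wrmsd_tolerance` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [False] * len(values)
    return [abs(v - mean) > wrmsd_tolerance * std for v in values]

def fraction_unindexed(observed, predicted, indexing_tolerance=1.0):
    """Fraction of observed peaks with no predicted position within `indexing_tolerance` pixels."""
    if not observed:
        return 0.0
    unmatched = 0
    for ox, oy in observed:
        nearest = min(
            (math.hypot(ox - px, oy - py) for px, py in predicted),
            default=math.inf,
        )
        if nearest > indexing_tolerance:
            unmatched += 1
    return unmatched / len(observed)

# Synthetic numbers for illustration only.
flags = flag_wrmsd_outliers([1.0] * 9 + [5.0])
frac = fraction_unindexed([(10.0, 10.0), (50.0, 50.0)], [(10.5, 10.0)], indexing_tolerance=4.0)
print(flags, frac)
```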

# ==============================================
# Interactive Metrics Analysis and CSV-to-Stream Conversion
# ==============================================

In [None]:
# Define the path to the normalized metrics CSV file and run this cell to start the interactive IQM tool

import ipywidgets as widgets
from IPython.display import display
import matplotlib.pyplot as plt
import os

import csv_to_stream  

from interactive_iqm import (
    read_metric_csv,
    select_best_results_by_event,
    get_metric_ranges,
    create_combined_metric,
    filter_rows,
    write_filtered_csv
)

#################################
# 1) PATHS
#################################
CSV_PATH = "path/to/normalized_metrics.csv"
FILTERED_CSV_PATH = os.path.join(os.path.dirname(CSV_PATH), 'filtered_metrics.csv')

#################################
# 2) Read the CSV (group by event)
#################################
grouped_data = read_metric_csv(CSV_PATH, group_by_event=True)

#################################
# 3) If you have multiple rows per event, pick "best" row
#################################
best_rows = select_best_results_by_event(grouped_data, sort_metric='weighted_rmsd')

#################################
# 4) Metrics in your CSV
#################################
metrics_in_order = [
    'weighted_rmsd',
    'fraction_outliers',
    'length_deviation',
    'angle_deviation',
    'peak_ratio',
    'percentage_unindexed'
]

#################################
# 5) Create threshold sliders for each original metric
#################################
ranges_dict = get_metric_ranges(best_rows, metrics=metrics_in_order)
metric_sliders = {}

def create_slider(metric_name, min_val, max_val):
    # We'll do a ≤ filter, so default slider to max => "include all"
    default_val = max_val
    step = (max_val - min_val) / 100.0 if max_val != min_val else 0.01
    slider = widgets.FloatSlider(
        value=default_val,
        min=min_val,
        max=max_val,
        step=step,
        description=f"{metric_name} ≤"
    )
    return slider

for metric in metrics_in_order:
    mn, mx = ranges_dict[metric]
    metric_sliders[metric] = create_slider(metric, mn, mx)

#################################
# 6) Text fields for per-metric weights to create a combined metric
#################################
weight_text_fields = {}
for metric in metrics_in_order:
    weight_text_fields[metric] = widgets.FloatText(
        value=0.0,  # default to 0 => skip that metric in combined sum
        description=f"Weight for {metric}",
        style={"description_width": "initial"} 
    )

#################################
# 7) A slider to threshold the combined metric
#################################
combined_metric_slider = widgets.FloatSlider(
    value=0.0,
    min=0.0,
    max=1.0,
    step=0.01,
    description="combined_metric ≤"
)

#################################
# 8) Button & function to create/update combined metric
#################################
def create_or_update_combined_metric(_):
    """
    Read user-entered weights for each metric, compute 'combined_metric' 
    for best_rows, and update the combined_metric_slider's min/max/value.
    """
    selected_metrics = []
    weights_list = []
    for m in metrics_in_order:
        w = weight_text_fields[m].value
        selected_metrics.append(m)
        weights_list.append(w)

    create_combined_metric(
        rows=best_rows,
        metrics_to_combine=selected_metrics,
        weights=weights_list,
        new_metric_name="combined_metric"
    )

    combined_vals = [r["combined_metric"] for r in best_rows]
    cmin, cmax = min(combined_vals), max(combined_vals)

    with combined_metric_slider.hold_trait_notifications():
        current_val = combined_metric_slider.value
        if current_val < cmin or current_val > cmax:
            current_val = cmax  # or choose cmin if preferred
        combined_metric_slider.min = cmin
        combined_metric_slider.max = cmax
        combined_metric_slider.value = current_val

    print("Created/updated 'combined_metric' using user-entered weights.")

create_combined_button = widgets.Button(description="Create Combined Metric")
create_combined_button.on_click(create_or_update_combined_metric)

#################################
# 9) Create separate output widgets for each row of histograms, CSV path message, and CSV-to-stream conversion
#################################
combined_out = widgets.Output()
csv_path_out = widgets.Output()
wrmsd_frac_out = widgets.Output()
length_angle_out = widgets.Output()
peak_unindexed_out = widgets.Output()
csv_to_stream_out = widgets.Output()  # Output widget for CSV-to-stream conversion messages

#################################
# 10) Button to apply thresholds, update histograms, and write CSV
#################################
filter_button = widgets.Button(description="Apply Thresholds & Show Histograms")

def on_filter_clicked(_):
    # Clear previous outputs in all rows
    combined_out.clear_output()
    csv_path_out.clear_output()
    wrmsd_frac_out.clear_output()
    length_angle_out.clear_output()
    peak_unindexed_out.clear_output()
    csv_to_stream_out.clear_output()
    
    # Build thresholds from slider values on the original metrics
    thresholds = {m: metric_sliders[m].value for m in metrics_in_order}
    if best_rows and "combined_metric" in best_rows[0]:
        thresholds["combined_metric"] = combined_metric_slider.value

    filtered = filter_rows(best_rows, thresholds)
    
    # Print filtering summary into csv_path_out
    with csv_path_out:
        print(f"Filtering... {len(best_rows)} rows -> {len(filtered)} pass thresholds.\n")
    
    if not filtered:
        with csv_path_out:
            print("No rows passed the thresholds, skipping histograms.")
        return

    # Write CSV of the filtered rows
    write_filtered_csv(filtered, FILTERED_CSV_PATH)
    with csv_path_out:
        print(f"Filtered CSV (including 'combined_metric' if created) written to:\n  {FILTERED_CSV_PATH}\n")
    
    # Row 1: Combined metric histogram
    with combined_out:
        if "combined_metric" in filtered[0]:
            plt.figure()
            values = [r["combined_metric"] for r in filtered]
            plt.hist(values, bins=20)
            plt.title("Histogram of combined_metric")
            plt.xlabel("combined_metric")
            plt.ylabel("Count")
            plt.show()
        else:
            print("No 'combined_metric' available.")
    
    # Row 4: Histograms for weighted_rmsd and fraction_outliers
    with wrmsd_frac_out:
        fig, axes = plt.subplots(1, 2, figsize=(10, 4))
        # weighted_rmsd
        if any("weighted_rmsd" in r for r in filtered):
            values = [r["weighted_rmsd"] for r in filtered if "weighted_rmsd" in r]
            axes[0].hist(values, bins=20)
            axes[0].set_title("Histogram of weighted_rmsd")
            axes[0].set_xlabel("weighted_rmsd")
            axes[0].set_ylabel("Count")
        else:
            axes[0].text(0.5, 0.5, "No data", horizontalalignment='center', verticalalignment='center')
        # fraction_outliers
        if any("fraction_outliers" in r for r in filtered):
            values = [r["fraction_outliers"] for r in filtered if "fraction_outliers" in r]
            axes[1].hist(values, bins=20)
            axes[1].set_title("Histogram of fraction_outliers")
            axes[1].set_xlabel("fraction_outliers")
            axes[1].set_ylabel("Count")
        else:
            axes[1].text(0.5, 0.5, "No data", horizontalalignment='center', verticalalignment='center')
        plt.tight_layout()
        plt.show()
    
    # Row 5: Histograms for length_deviation and angle_deviation
    with length_angle_out:
        fig, axes = plt.subplots(1, 2, figsize=(10, 4))
        # length_deviation
        if any("length_deviation" in r for r in filtered):
            values = [r["length_deviation"] for r in filtered if "length_deviation" in r]
            axes[0].hist(values, bins=20)
            axes[0].set_title("Histogram of length_deviation")
            axes[0].set_xlabel("length_deviation")
            axes[0].set_ylabel("Count")
        else:
            axes[0].text(0.5, 0.5, "No data", horizontalalignment='center', verticalalignment='center')
        # angle_deviation
        if any("angle_deviation" in r for r in filtered):
            values = [r["angle_deviation"] for r in filtered if "angle_deviation" in r]
            axes[1].hist(values, bins=20)
            axes[1].set_title("Histogram of angle_deviation")
            axes[1].set_xlabel("angle_deviation")
            axes[1].set_ylabel("Count")
        else:
            axes[1].text(0.5, 0.5, "No data", horizontalalignment='center', verticalalignment='center')
        plt.tight_layout()
        plt.show()
    
    # Row 6: Histograms for peak_ratio and percentage_unindexed
    with peak_unindexed_out:
        fig, axes = plt.subplots(1, 2, figsize=(10, 4))
        # peak_ratio
        if any("peak_ratio" in r for r in filtered):
            values = [r["peak_ratio"] for r in filtered if "peak_ratio" in r]
            axes[0].hist(values, bins=20)
            axes[0].set_title("Histogram of peak_ratio")
            axes[0].set_xlabel("peak_ratio")
            axes[0].set_ylabel("Count")
        else:
            axes[0].text(0.5, 0.5, "No data", horizontalalignment='center', verticalalignment='center')
        # percentage_unindexed
        if any("percentage_unindexed" in r for r in filtered):
            values = [r["percentage_unindexed"] for r in filtered if "percentage_unindexed" in r]
            axes[1].hist(values, bins=20)
            axes[1].set_title("Histogram of percentage_unindexed")
            axes[1].set_xlabel("percentage_unindexed")
            axes[1].set_ylabel("Count")
        else:
            axes[1].text(0.5, 0.5, "No data", horizontalalignment='center', verticalalignment='center')
        plt.tight_layout()
        plt.show()

filter_button.on_click(on_filter_clicked)

#################################
# 11) Button & function to convert filtered CSV to a stream file
#################################
convert_button = widgets.Button(description="Convert CSV to Stream")

def on_convert_clicked(_):
    csv_to_stream_out.clear_output()
    # Calculate output stream path based on FILTERED_CSV_PATH
    OUTPUT_STREAM_PATH = os.path.join(os.path.dirname(FILTERED_CSV_PATH), 'filtered_metrics.stream')
    csv_to_stream.write_stream_from_filtered_csv(
        filtered_csv_path=FILTERED_CSV_PATH,
        output_stream_path=OUTPUT_STREAM_PATH,
        event_col="event_number",    # adjust if needed
        streamfile_col="stream_file"   # adjust if needed
    )
    with csv_to_stream_out:
        print(f"CSV has been successfully converted to:\n {OUTPUT_STREAM_PATH}.")
        
convert_button.on_click(on_convert_clicked)

#################################
# 12) Layout: Arrange widgets and outputs into rows
#################################
# Control panel for all interactive widgets
control_panel = widgets.VBox(
    list(weight_text_fields.values()) +
    [create_combined_button, combined_metric_slider] +
    list(metric_sliders.values()) +
    [filter_button]
)

# Row 1: Control panel (left) and combined metric histogram (right)
row1 = widgets.HBox([control_panel, combined_out])

# Row 2: CSV path message output
row2 = csv_path_out

# Row 3: Convert CSV to stream button and its output
row3 = widgets.HBox([convert_button, csv_to_stream_out])

# Row 4: Histograms for weighted_rmsd and fraction_outliers
row4 = wrmsd_frac_out

# Row 5: Histograms for length_deviation and angle_deviation
row5 = length_angle_out

# Row 6: Histograms for peak_ratio and percentage_unindexed
row6 = peak_unindexed_out


# Display everything as a vertical stack of rows
display(widgets.VBox([row1, row2, row3, row4, row5, row6]))
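
The selection logic behind the dashboard can be summarised without widgets. The sketch below assumes that rows are dictionaries, that the best row per event is the one with the lowest `sort_metric`, that the combined metric is a plain weighted sum (a weight of 0 drops the metric, as noted in step 6), and that a row passes when every thresholded metric is ≤ its slider value. The functions are stand-ins written for illustration; the implementations in `interactive_iqm` may differ.

```python
def best_row_per_event(grouped, sort_metric="weighted_rmsd"):
    """Pick the row with the lowest `sort_metric` from each event's candidates."""
    return [min(rows, key=lambda r: r[sort_metric]) for rows in grouped.values() if rows]

def add_combined_metric(rows, metrics, weights, name="combined_metric"):
    """Store a weighted sum of `metrics` under `name` in every row (in place)."""
    for row in rows:
        row[name] = sum(w * row[m] for m, w in zip(metrics, weights))

def apply_thresholds(rows, thresholds):
    """Keep rows whose value is <= the threshold for every listed metric."""
    return [r for r in rows if all(r[m] <= t for m, t in thresholds.items())]

# Synthetic rows for illustration only.
grouped = {
    "evt-0": [
        {"weighted_rmsd": 0.8, "fraction_outliers": 0.10},
        {"weighted_rmsd": 0.5, "fraction_outliers": 0.20},
    ],
    "evt-1": [{"weighted_rmsd": 1.5, "fraction_outliers": 0.05}],
}
best = best_row_per_event(grouped)
add_combined_metric(best, ["weighted_rmsd", "fraction_outliers"], [1.0, 0.0])
kept = apply_thresholds(best, {"combined_metric": 1.0})
print(len(best), len(kept), kept[0]["combined_metric"])
```

A weighted sum is only meaningful when the metrics share a comparable scale, which is presumably why the dashboard reads the normalized metrics CSV.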


# ==============================================
# Interactive Merging, SHELX Conversion, and MTZ Conversion
# ==============================================

In [None]:
# Run this cell to start the interactive Merging and Conversion tool
import ipywidgets as widgets
from IPython.display import display
import os

# Import file chooser widget
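try:
    from ipyfilechooser import FileChooser
except ImportError as exc:
    # Stop here: the file chooser widgets below cannot be built without this package
    raise ImportError("ipyfilechooser is required. Please install it with: pip install ipyfilechooser") from exc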

from merge import merge
from convert_hkl_crystfel_to_shelx import convert_hkl_crystfel_to_shelx 
from convert_hkl_to_mtz import convert_hkl_to_mtz

# Global variable to store merged output directory
global_output_dir = None

#################################
# Merging Section
#################################
# File chooser for selecting the stream file
stream_file_chooser = FileChooser(os.getcwd())
stream_file_chooser.title = 'Select Stream File'
stream_file_chooser.filter_pattern = '*.stream'  # Only show .stream files

# Other merge parameters
pointgroup_widget = widgets.Text(
    value="", 
    description="Pointgroup:", 
    style={"description_width": "150px"}
)
num_threads_widget = widgets.IntText(
    value=24, 
    description="Num Threads:", 
    style={"description_width": "150px"}
)
iterations_widget = widgets.IntText(
    value=5, 
    description="Iterations:", 
    style={"description_width": "150px"}
)
merge_button = widgets.Button(description="Merge")
merge_output = widgets.Output()

def on_merge_clicked(b):
    global global_output_dir
    merge_output.clear_output()
    with merge_output:
        stream_file = stream_file_chooser.selected
        pointgroup = pointgroup_widget.value
        num_threads = num_threads_widget.value
        iterations = iterations_widget.value
        
        if not stream_file:
            print("Please select a stream file first.")
            return
        
        print("Merging in progress...")
        output_dir = merge(
            stream_file,
            pointgroup=pointgroup,
            num_threads=num_threads,
            iterations=iterations,
        )
        if output_dir is not None:
            print("Merging done. Results are in:", output_dir)
            global_output_dir = output_dir
        else:
            print("Merging failed. Please check the parameters and try again.")

merge_button.on_click(on_merge_clicked)

merge_controls = widgets.VBox([
    widgets.HTML("<h3>Merging Parameters</h3>"),
    stream_file_chooser,
    pointgroup_widget,
    num_threads_widget,
    iterations_widget,
    merge_button,
    merge_output
])

#################################
# SHELX Conversion Section
#################################
shelx_button = widgets.Button(description="Convert to SHELX")
shelx_output = widgets.Output()

def on_shelx_clicked(b):
    shelx_output.clear_output()
    with shelx_output:
        if global_output_dir is None:
            print("No merged output available. Please run merge first.")
        else:
            print("Converting to SHELX...")
            convert_hkl_crystfel_to_shelx(global_output_dir)
            print("Conversion to SHELX completed.")

shelx_button.on_click(on_shelx_clicked)

shelx_controls = widgets.VBox([
    widgets.HTML("<h3>SHELX Conversion</h3>"),
    shelx_button,
    shelx_output
])

#################################
# MTZ Conversion Section
#################################
# File chooser for selecting the cell file
cell_file_chooser = FileChooser(os.getcwd())
cell_file_chooser.title = 'Select Cell File'
# Optionally set filter_pattern if your cell file has a specific extension

mtz_button = widgets.Button(description="Convert to MTZ")
mtz_output = widgets.Output()

def on_mtz_clicked(b):
    mtz_output.clear_output()
    with mtz_output:
        if global_output_dir is None:
            print("No merged output available. Please run merge first.")
        else:
            cellfile_path = cell_file_chooser.selected
            if not cellfile_path:
                print("Please select a cell file first.")
                return
            print("Converting to MTZ...")
            convert_hkl_to_mtz(global_output_dir, cellfile_path=cellfile_path)
            print("Conversion to MTZ completed.")

mtz_button.on_click(on_mtz_clicked)

mtz_controls = widgets.VBox([
    widgets.HTML("<h3>MTZ Conversion</h3>"),
    cell_file_chooser,
    mtz_button,
    mtz_output
])

#################################
# Display All Controls
#################################
display(widgets.VBox([
    merge_controls,
    shelx_controls,
    mtz_controls
]))



# Merging Filtered Stream File Using Partialator (Non-Interactive)

In [None]:
from merge import merge

stream_file = "/path/to/filtered_metrics.stream"
pointgroup = ""
num_threads = 24
iterations = 5

output_dir = merge(
    stream_file,
    pointgroup=pointgroup,
    num_threads=num_threads,
    iterations=iterations,
)

if output_dir is not None:
    print("Merging done. Results are in:", output_dir)
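
For orientation, the parameters above map naturally onto a `partialator` command line. The sketch below only assembles the argument list and does not run anything; `build_partialator_command` and the output file name `merged.hkl` are assumptions, and the `merge` helper may add further options (for example a partiality model) that are not shown here.

```python
import shlex

def build_partialator_command(stream_file, pointgroup="", num_threads=24, iterations=5,
                              output_hkl="merged.hkl"):
    """Assemble a partialator call from the notebook's merge parameters."""
    cmd = [
        "partialator",
        "-i", stream_file,              # input stream
        "-o", output_hkl,               # merged reflection list
        f"--iterations={iterations}",   # number of scaling / post-refinement cycles
        "-j", str(num_threads),         # parallel threads
    ]
    if pointgroup:
        # Assumption: with an empty point group the program default is used.
        cmd += ["-y", pointgroup]
    return cmd

cmd = build_partialator_command("filtered_metrics.stream", pointgroup="mmm")
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the command safe to pass to `subprocess.run` without shell quoting issues.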

# Convert to SHELX-Compatible .hkl (Non-Interactive)

In [None]:
from convert_hkl_crystfel_to_shelx import convert_hkl_crystfel_to_shelx 
# output_dir = ""  # Uncomment and set this only if output_dir was not defined in the merge cell above
convert_hkl_crystfel_to_shelx(output_dir)
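
SHELX reads reflections in the fixed-width HKLF 4 layout: `h`, `k`, `l` as 4-character integers followed by intensity and sigma as 8-character numbers with two decimals, with an all-zero record marking the end of the list. The sketch below formats reflections that way and rescales intensities so they fit the 8-character field; the function names and the rescaling target are assumptions for illustration, not the logic of `convert_hkl_crystfel_to_shelx`.

```python
def format_hklf4(h, k, l, intensity, sigma):
    """Format one reflection as a SHELX HKLF 4 record (3I4, 2F8.2)."""
    return f"{h:4d}{k:4d}{l:4d}{intensity:8.2f}{sigma:8.2f}"

def to_hklf4_lines(reflections, max_value=9999.0):
    """Scale (h, k, l, I, sigma) tuples to fit the fixed-width fields and format them."""
    largest = max((abs(r[3]) for r in reflections), default=0.0)
    scale = max_value / largest if largest > max_value else 1.0
    lines = [format_hklf4(h, k, l, i * scale, s * scale) for h, k, l, i, s in reflections]
    lines.append(format_hklf4(0, 0, 0, 0.0, 0.0))  # all-zero record ends the list
    return lines

# Synthetic reflections for illustration only.
for line in to_hklf4_lines([(1, 2, -3, 123.456, 7.8), (0, 0, 4, 250000.0, 1200.0)]):
    print(line)
```

The conservative `max_value` leaves room for a minus sign, since negative intensities are common after merging weak reflections.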

# Convert to MTZ (Non-Interactive)

In [None]:
from convert_hkl_to_mtz import convert_hkl_to_mtz
# output_dir = ""     # Uncomment and set this only if output_dir was not defined in the merge cell above
# cellfile_path = ""  # Uncomment and set this only if cellfile_path was not defined in the first cell
convert_hkl_to_mtz(output_dir, cellfile_path=cellfile_path)
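
CrystFEL's `get_hkl` can write MTZ files when given a unit cell, which is one way such a conversion could be driven. The sketch below only builds the command and does not execute it; whether `convert_hkl_to_mtz` actually wraps `get_hkl`, and the file names used here, are assumptions.

```python
import shlex

def build_get_hkl_mtz_command(hkl_path, cellfile_path, mtz_path):
    """Assemble a get_hkl call that converts a CrystFEL .hkl file to .mtz."""
    return [
        "get_hkl",
        "-i", hkl_path,           # merged CrystFEL reflection list
        "-o", mtz_path,           # MTZ file to write
        "-p", cellfile_path,      # unit cell file (.cell)
        "--output-format=mtz",
    ]

# Hypothetical file names for illustration only.
mtz_cmd = build_get_hkl_mtz_command("merged.hkl", "CELL.cell", "merged.mtz")
print(shlex.join(mtz_cmd))
```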