# Overview of the CrystFEL Processing and Refinement Workflow

This notebook provides a comprehensive pipeline for processing crystallography data using CrystFEL tools and custom scripts. The workflow encompasses the following key stages:

1. **Indexing on a Circular Grid:**  
   Use the `gandalf_iterator` to perform iterative peak finding, indexing, and integration over a radial grid of beam center shifts.

2. **Visualization of Indexing Performance:**  
   Generate 3D histograms and 2D heatmaps to assess how different beam center adjustments affect indexing success.

3. **Evaluation of Indexing Metrics:**  
   Process stream files to compute key quality metrics (weighted RMSD, fraction of outliers, deviation measures, etc.) that help quantify indexing accuracy.

4. **Interactive Metrics Analysis & CSV-to-Stream Conversion:**  
   Adjust thresholds and weights in the notebook cells to:
   - Filter individual metrics with separate thresholds.
   - Create a combined quality metric from weighted inputs.
   - Visualize the filtered results.
   - Convert the filtered metrics CSV into a stream file for downstream processing.

5. **Interactive Merging and Format Conversion:**  
   Merge the best indexing results and convert them into crystallographically useful formats:
   - **SHELX Conversion:** Transform the merged results into a SHELX-compatible `.hkl` file.
   - **MTZ Conversion:** Convert the `.hkl` file to the `.mtz` format for further analysis.

6. **Refinement Using REFMAC5:**  
   Refine the merged data with REFMAC5 and parse the resulting log file to extract and plot Rf_used values versus resolution, providing insight into the refinement quality.

> **Pre-requisites:**  
> Ensure that all required processing steps, tools (CrystFEL, REFMAC5, etc.), and Python packages (ipywidgets, matplotlib, etc.) are installed and properly configured.




# ==============================================
# Run Indexamajig Iterations on a Circular Grid
### Options for Peak Finding, Indexing, and Integration

This section executes the `gandalf_iterator` function to perform:
- Peak finding using options tailored for CXI data.
- Indexing with XGandalf, including setting tolerances and sampling parameters.
- Integration using a rings method.

Define your grid parameters (maximum radius and step size) to iterate over beam center shifts on a circular grid.
# ==============================================
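
As a rough illustration of what "a circular grid of beam center shifts" can mean, the sketch below enumerates shifts on a square lattice with spacing `step` and keeps only those within `max_radius` of the origin. The helper name `circular_grid` is hypothetical, and the way `gandalf_iterator` actually builds and orders its grid may differ; this is only meant to show how the two parameters determine the number of indexing runs.

```python
import math

def circular_grid(max_radius, step):
    """Return (dx, dy) shifts on a square lattice of spacing `step`
    that lie within `max_radius` of the origin (hypothetical helper)."""
    n = int(math.floor(max_radius / step))
    points = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            dx, dy = i * step, j * step
            # Small epsilon guards against floating-point round-off at the rim.
            if math.hypot(dx, dy) <= max_radius + 1e-9:
                points.append((dx, dy))
    # Try the unshifted center first, then move outwards.
    points.sort(key=lambda p: (math.hypot(*p), p))
    return points

shifts = circular_grid(max_radius=1.8, step=0.5)
print(len(shifts), "beam center shifts; first:", shifts[0])
```

Under these assumptions, `max_radius = 1.8` and `step = 0.5` give 37 grid points, so halving the step roughly quadruples the number of runs.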


In [None]:
from gandalf_interations.gandalf_radial_iterator import gandalf_iterator

# Hard-coded parameters (adjust these paths and values as needed)
geom_file = "path/to/your_file.geom"       # Path to your .geom file
cell_file = "path/to/your_file.cell"        # Path to your .cell file
input_folder = "path/to/your/input_folder"  # Path to your input folder

output_base = "Xtal"
num_threads = 24
max_radius = 1.8
step = 0.5

# Choose your peakfinder method: 'cxi', 'peakfinder9', or 'peakfinder8'
peakfinder = 'cxi'

# Define default peakfinder options.
default_peakfinder_options = {
    'cxi': "--peaks=cxi",
    'peakfinder9': """--peaks=peakfinder9
--min-snr=1
--min-snr-peak-pix=6
--min-snr-biggest-pix=1
--min-sig=9 
--min-peak-over-neighbour=5
--local-bg-radius=5""",
    'peakfinder8': """--peaks=peakfinder8
--threshold=45
--min-snr=3
--min-pix-count=3
--max-pix-count=500
--local-bg-radius=9
--min-res=30
--max-res=500"""
}

# Other extra flags (as a multiline string)
other_flags_str = """--no-revalidate
--no-half-pixel-shift
--no-refine
--no-non-hits-in-stream"""

# Advanced indexing parameters
min_peaks = 15
tolerance = "10,10,10,5"
xgandalf_sampling_pitch = 5
xgandalf_grad_desc_iterations = 1
xgandalf_tolerance = 0.02
int_radius = "2,5,10"

# Build flags
other_flags = [line.strip() for line in other_flags_str.splitlines() if line.strip()]
peakfinder_flags = [line.strip() for line in default_peakfinder_options[peakfinder].splitlines() if line.strip()]

advanced_flags = [
    f"--min-peaks={min_peaks}",
    f"--tolerance={tolerance}",
    f"--xgandalf-sampling-pitch={xgandalf_sampling_pitch}",
    f"--xgandalf-grad-desc-iterations={xgandalf_grad_desc_iterations}",
    f"--xgandalf-tolerance={xgandalf_tolerance}",
    f"--int-radius={int_radius}"
]

"""Examples of extra flags (see the CrystFEL documentation: https://www.desy.de/~twhite/crystfel/manual-indexamajig.html):
    
    Peakfinding
    "--peaks=cxi",
    "--peak-radius=inner,middle,outer",
    "--min-peaks=n",
    "--median-filter=n",
    "--filter-noise",
    "--no-revalidate",
    "--no-half-pixel-shift",

    "--peaks=peakfinder9",
    "--min-snr=1",
    "--min-snr-peak-pix=6",
    "--min-snr-biggest-pix=1",
    "--min-sig=9",
    "--min-peak-over-neighbour=5",
    "--local-bg-radius=5",

    "--peaks=peakfinder8",
    "--threshold=45",
    "--min-snr=3",
    "--min-pix-count=3",
    "--max-pix-count=500",
    "--local-bg-radius=9",
    "--min-res=30",
    "--max-res=500",
    
    Indexing
    "--indexing=xgandalf",

    "--tolerance=tol",
    "--no-check-cell",
    "--no-check-peaks",
    "--multi",
    "--no-retry",
    "--no-refine",

    "--xgandalf-sampling-pitch=n",
    "--xgandalf-grad-desc-iterations=n",
    "--xgandalf-tolerance=n",
    "--xgandalf-no-deviation-from-provided-cell",
    "--xgandalf-max-lattice-vector-length=n",
    "--xgandalf-min-lattice-vector-length=n",
    "--xgandalf-max-peaks=n",

    Integration
    "--fix-profile-radius=n",
    "--integration=rings",
    "--int-radius=4,5,10",
    "--push-res=n",
    "--overpredict",

    Output
    "--no-non-hits-in-stream",
    "--no-peaks-in-stream",
    "--no-refls-in-stream",
"""

# Fixed indexing flags (unchanged)
indexing_flags = [
    "--indexing=xgandalf",
    "--integration=rings",
]

# Combine all flags
flags_list = advanced_flags + other_flags + peakfinder_flags + indexing_flags

# Display the parameters for verification.
print("Running gandalf_iterator with the following parameters:")
print("Geom File:", geom_file)
print("Cell File:", cell_file)
print("Input Folder:", input_folder)
print("Output Base:", output_base)
print("Threads:", num_threads)
print("Max Radius:", max_radius)
print("Step:", step)
print("\nCombined Flags:", flags_list)

# Run the indexing workflow
try:
    gandalf_iterator(
        geom_file,
        cell_file,
        input_folder,
        output_base,
        num_threads,
        max_radius=max_radius,
        step=step,
        extra_flags=flags_list
    )
    print("Indexing completed successfully.")
except Exception as e:
    print("Error during indexing:", e)



# ==============================================
# Visualize Indexing Results: 3D Histogram & 2D Heatmap

After running the iterations, this section generates visualizations:
- **3D Histogram:** Provides an overview of the indexing rate across the grid.
- **2D Heatmap:** Offers a more detailed view of the beam center optimization.

Make sure that the output folder path reflects the folder where the iterative results are saved.
# ==============================================
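
The indexing rate shown in these plots can be thought of as the share of frames (chunks) in a stream file that contain at least one indexed crystal. The sketch below computes that number from stream text using the chunk and crystal markers that CrystFEL writes into stream files. The function name `indexing_rate` is hypothetical, and the plotting functions in this notebook may compute the rate differently.

```python
CHUNK_START = "----- Begin chunk -----"
CHUNK_END = "----- End chunk -----"
CRYSTAL_START = "--- Begin crystal"

def indexing_rate(stream_text):
    """Percentage of chunks containing at least one crystal (hypothetical helper)."""
    total = 0
    indexed = 0
    in_chunk = False
    has_crystal = False
    for raw in stream_text.splitlines():
        line = raw.strip()
        if line == CHUNK_START:
            total += 1
            in_chunk = True
            has_crystal = False
        elif line == CHUNK_END:
            in_chunk = False
        elif in_chunk and not has_crystal and line.startswith(CRYSTAL_START):
            # Count each chunk once, even if it holds several crystals.
            has_crystal = True
            indexed += 1
    return 100.0 * indexed / total if total else 0.0

# Tiny hand-written example: two chunks, one of them indexed.
example = "\n".join([
    CHUNK_START, "--- Begin crystal", "--- End crystal", CHUNK_END,
    CHUNK_START, CHUNK_END,
])
rate = indexing_rate(example)
print(f"Indexing rate: {rate:.1f}%")
```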

In [None]:
from visualization.indexing_3d_histogram import plot3d_indexing_rate
from visualization.indexing_center import indexing_heatmap

# Hard-coded output folder path (adjust this path as needed)
output_folder = "path/to/your/output_folder"

print("Generating visualizations for output folder:", output_folder)
try:
    # Call the visualization functions.
    plot3d_indexing_rate(output_folder)
    indexing_heatmap(output_folder)
    print("Visualization completed successfully.")
except Exception as e:
    print("Error during visualization:", e)



# ==============================================
# Process Indexing Metrics Across All Stream Files

In this section, the notebook processes all stream file outputs from Indexamajig by:
- Reading each stream file and computing key indexing quality metrics.
- Evaluating metrics such as weighted RMSD, fraction of outliers, length and angle deviations, peak ratio, and percentage unindexed.

These metrics will be used later to select the best results for further processing.
# ==============================================
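
To make two of these metrics concrete, here is a minimal sketch of a weighted RMSD and an outlier fraction computed from a list of peak-to-prediction distances. The exact definitions used inside `process_indexing_metrics` are not shown in this notebook, so the formulas, the choice of weights, and the helper names below are assumptions for illustration only.

```python
import math

def weighted_rmsd(distances, weights):
    """sqrt(sum(w * d^2) / sum(w)); an assumed definition, not the library's."""
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sq = sum(w * d * d for d, w in zip(distances, weights))
    return math.sqrt(weighted_sq / total_weight)

def fraction_outliers(distances, tolerance):
    """Share of distances above `tolerance` (assumed definition)."""
    if not distances:
        return 0.0
    return sum(d > tolerance for d in distances) / len(distances)

# Illustrative values only, not measured data.
distances = [0.5, 1.0, 3.0, 4.0]
weights = [1.0, 1.0, 1.0, 1.0]
print("weighted RMSD:", round(weighted_rmsd(distances, weights), 3))
print("fraction of outliers:", fraction_outliers(distances, tolerance=2.0))
```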

In [None]:
from calc_metrics.process_indexing_metrics import process_indexing_metrics

# Hard-coded parameter values (adjust these as needed)
stream_folder = "path/to/your/stream_folder"  # Path to your stream file folder
wrmsd_tolerance = 2.0      # WRMSD tolerance (default: 2.0)
indexing_tolerance = 4.0   # Indexing tolerance (default: 4.0)

print("Processing metrics for folder:", stream_folder)
print("WRMSD Tolerance:", wrmsd_tolerance)
print("Indexing Tolerance:", indexing_tolerance)

try:
    process_indexing_metrics(stream_folder, wrmsd_tolerance=wrmsd_tolerance, indexing_tolerance=indexing_tolerance)
    print("Metrics processed successfully.")
except Exception as e:
    print("Error processing metrics:", e)


# ==============================================
# Interactive Metrics Analysis and CSV-to-Stream Conversion

This section lets you:
- **Filter Metrics:** Apply a separate threshold to each quality metric.
- **Create a Combined Metric:** Assign a weight to each metric to generate a composite quality score.
- **Visualize the Distribution:** Display histograms of both separate and combined metrics.
- **Convert CSV to Stream:** Once the best rows are filtered, convert the CSV file into a stream file for merging.

Adjust the thresholds and weights in the cells below (the `THRESHOLDS` and `weights` dictionaries), then run the cells to filter the rows and convert the result.
# ==============================================
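
One plausible way to build a composite score is to min-max normalise each metric and take a weighted sum, then keep the lowest-scoring row per event. The sketch below does exactly that on a few made-up rows. It is not the implementation of `create_combined_metric` or `select_best_results_by_event`; the helper names and the normalisation step are assumptions.

```python
def add_combined_metric(rows, metrics, weights, name="combined_metric"):
    """Store a weighted sum of min-max normalised metrics in each row (assumed scheme)."""
    ranges = {m: (min(r[m] for r in rows), max(r[m] for r in rows)) for m in metrics}
    for r in rows:
        score = 0.0
        for m, w in zip(metrics, weights):
            low, high = ranges[m]
            span = high - low
            # A constant metric carries no ranking information, so it contributes 0.
            score += w * ((r[m] - low) / span if span else 0.0)
        r[name] = score

def best_per_event(rows, name="combined_metric"):
    """Keep the row with the smallest combined metric for each event."""
    best = {}
    for r in rows:
        event = r["event_number"]
        if event not in best or r[name] < best[event][name]:
            best[event] = r
    return list(best.values())

# Made-up rows for illustration; lower values are treated as better.
rows = [
    {"event_number": 1, "weighted_rmsd": 0.2, "fraction_outliers": 0.10},
    {"event_number": 1, "weighted_rmsd": 0.6, "fraction_outliers": 0.05},
    {"event_number": 2, "weighted_rmsd": 0.4, "fraction_outliers": 0.30},
]
add_combined_metric(rows, ["weighted_rmsd", "fraction_outliers"], [0.7, 0.3])
best = best_per_event(rows)
for r in best:
    print(r["event_number"], round(r["combined_metric"], 3))
```

Normalising first keeps a metric with a large numeric range from dominating the sum, so the weights express relative importance rather than compensating for units.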

In [None]:
# ----- Filter Separate Metrics Section -----
import matplotlib.pyplot as plt
from filter_and_combine.interactive_iqm import read_metric_csv, get_metric_ranges, filter_rows

# ----- Parameter Setup & CSV Loading -----
CSV_PATH = "path/to/your/metrics.csv"  # Adjust this path as needed

print("Loading CSV file:", CSV_PATH)
grouped_data = read_metric_csv(CSV_PATH, group_by_event=True)
all_rows = [row for rows in grouped_data.values() for row in rows]
print(f"Loaded {len(all_rows)} rows from CSV.")

# Define metrics to be analyzed.
metrics_in_order = [
    'weighted_rmsd',
    'fraction_outliers',
    'length_deviation',
    'angle_deviation',
    'peak_ratio',
    'percentage_unindexed'
]

# ----- Compute Default Ranges & Set Thresholds -----
# Compute ranges for each metric from the loaded data.
ranges_dict = get_metric_ranges(all_rows, metrics=metrics_in_order)
print("\nMetric ranges (min, max):")
for m in metrics_in_order:
    print(f"  {m}: {ranges_dict[m]}")

# Set thresholds for each metric.
# By default, thresholds are set to the maximum value from the data.
# To adjust a threshold, simply modify the corresponding value in the THRESHOLDS dictionary.
THRESHOLDS = {metric: ranges_dict[metric][1] for metric in metrics_in_order}
# Example adjustment:
# THRESHOLDS['weighted_rmsd'] = 10.0

print("\nUsing thresholds:")
for m in metrics_in_order:
    print(f"  {m}: {THRESHOLDS[m]}")

# ----- Filter Data & Plot Histograms -----
filtered_separate = filter_rows(all_rows, THRESHOLDS)
print(f"\nFiltering: {len(all_rows)} total rows -> {len(filtered_separate)} pass thresholds.")

if filtered_separate:
    # Plot histograms for each metric.
    fig, axes = plt.subplots(3, 2, figsize=(12, 12))
    axes = axes.flatten()
    for i, metric in enumerate(metrics_in_order):
        values = [r[metric] for r in filtered_separate if metric in r]
        axes[i].hist(values, bins=20)
        axes[i].set_title(f"Histogram of {metric}")
        axes[i].set_xlabel(metric)
        axes[i].set_ylabel("Count")
    plt.tight_layout()
    plt.show()
else:
    print("No rows passed the thresholds.")


In [None]:
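# ----- Combine Metrics & Filter Section -----
import os
import matplotlib.pyplot as plt
from filter_and_combine.interactive_iqm import create_combined_metric, select_best_results_by_event, write_filtered_csv

# Output path for the best-filtered rows (assumed location: next to the input CSV; adjust as needed).
FILTERED_CSV_PATH = os.path.join(os.path.dirname(CSV_PATH), "filtered_metrics.csv")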

print("\n--- Combined Metric Creation & Filtering ---")
# Define weights for each metric (adjust as needed; default is all zeros here)
weights = {metric: 0.0 for metric in metrics_in_order}
# Optionally, set some non-zero weights, for example:
# weights = {'weighted_rmsd': 0.5, 'fraction_outliers': 0.2, 'length_deviation': 0.1, 'angle_deviation': 0.1, 'peak_ratio': 0.05, 'percentage_unindexed': 0.05}

# Create the combined metric in the data rows
create_combined_metric(
    rows=all_rows,
    metrics_to_combine=metrics_in_order,
    weights=[weights[m] for m in metrics_in_order],
    new_metric_name="combined_metric"
)

# Determine range of the combined metric
combined_vals = [r["combined_metric"] for r in all_rows if "combined_metric" in r]
if combined_vals:
    cmin, cmax = min(combined_vals), max(combined_vals)
    # Set threshold to the max value by default (adjust as needed)
    combined_threshold = cmax
    print("Combined metric created successfully!")
    print(f"  * Min value: {cmin:.3f}")
    print(f"  * Max value: {cmax:.3f}")
else:
    # No row carries the combined metric, so nothing can pass the filter below.
    combined_threshold = float("-inf")
    print("Failed to create combined metric. Check your weights.")

# Filter rows by the combined metric threshold
filtered_combined = [r for r in all_rows if "combined_metric" in r and r["combined_metric"] <= combined_threshold]
print(f"Filtering rows by combined_metric ≤ {combined_threshold:.3f}")
if not filtered_combined:
    print("No rows passed the combined metric threshold.")
else:
    # Group the filtered rows by event number
    grouped_filtered = {}
    for r in filtered_combined:
        event = r.get("event_number")
        if event not in grouped_filtered:
            grouped_filtered[event] = []
        grouped_filtered[event].append(r)
    
    best_filtered = select_best_results_by_event(grouped_filtered, sort_metric="combined_metric")
    print(f"{len(filtered_combined)} rows passed threshold, {len(best_filtered)} best rows selected per event.")
    
    # Write the best filtered rows to a CSV file
    write_filtered_csv(best_filtered, FILTERED_CSV_PATH)
    print(f"Wrote {len(best_filtered)} best-filtered rows to {FILTERED_CSV_PATH}")
    
    # Plot histogram for the combined metric from the best rows
    plt.figure(figsize=(8, 6))
    values = [r["combined_metric"] for r in best_filtered]
    plt.hist(values, bins=20)
    plt.title("Histogram of Best Rows (combined_metric)")
    plt.xlabel("combined_metric")
    plt.ylabel("Count")
    plt.tight_layout()
    plt.show()


In [None]:
# ----- Write Filter & Combine Results CSV to STREAM -----
import os
import time
from filter_and_combine.csv_to_stream import write_stream_from_filtered_csv
from filter_and_combine.interactive_iqm import read_metric_csv

print("\n--- Converting to Stream ---")
# Create output directory for the stream file (subfolder 'filtered_metrics' in the CSV directory)
output_dir = os.path.join(os.path.dirname(CSV_PATH), 'filtered_metrics')
os.makedirs(output_dir, exist_ok=True)

# Define the output stream file path
OUTPUT_STREAM_PATH = os.path.join(output_dir, 'filtered_metrics.stream')

print("Starting conversion to stream file...")
time.sleep(0.2)
print("  * Step 1/5: Reading filtered CSV file...")
filtered_grouped_data = read_metric_csv(FILTERED_CSV_PATH, group_by_event=True)
time.sleep(0.2)

print("  * Step 2/5: Checking for combined metric and selecting best rows (if applicable)...")
time.sleep(0.2)

print("  * Step 3/5: (Best rows selection already performed in previous cell)")
time.sleep(0.2)

print("  * Step 4/5: Writing the .stream file...")
write_stream_from_filtered_csv(
    filtered_csv_path=FILTERED_CSV_PATH,
    output_stream_path=OUTPUT_STREAM_PATH,
    event_col="event_number",
    streamfile_col="stream_file"
)
time.sleep(0.2)

print("  * Step 5/5: Conversion complete!")
print(f"CSV has been successfully converted to:\n  {OUTPUT_STREAM_PATH}")


# ==============================================
# Merging, SHELX Conversion, and MTZ Conversion

This section provides tools to:
1. **Merge Results:** Select a `.stream` file and set parameters (point group, number of threads, iterations) to merge the best indexing results.
2. **Convert to SHELX Format:** Convert the merged results into a SHELX-compatible `.hkl` format.
3. **Convert to MTZ Format:** Using a chosen cell file, convert the `.hkl` file to `.mtz` format for downstream analysis.

Set the parameters at the top of each cell, then run the cells in order to execute merging and format conversions.
# ==============================================
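
For orientation, the SHELX `HKLF 4` reflection format stores each reflection as a fixed-width record: three 4-character integers for h, k, l followed by two 8-character fields with two decimals for the intensity and its sigma. The sketch below formats one such record. The helper name `format_hklf4` is hypothetical and this is not the code of `convert_hkl_crystfel_to_shelx`, which may also rescale intensities or handle additional columns.

```python
def format_hklf4(h, k, l, intensity, sigma):
    """Format one reflection as a SHELX HKLF 4 record (3I4, 2F8.2)."""
    return f"{h:4d}{k:4d}{l:4d}{intensity:8.2f}{sigma:8.2f}"

# Illustrative values only.
line = format_hklf4(1, -2, 3, 123.456, 7.891)
print(repr(line))
```

Because the fields are fixed-width, intensities that need more than eight characters would break the column layout, so a converter typically has to rescale large values before writing.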

In [None]:
# ----- Merging Section -----
import time
from merge_and_convert.merge import merge

# Define the parameters for merging.
stream_file = "path/to/your/file.stream"  # Path to the .stream file
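pointgroup = "P212121"  # Adjust as needed. Note: "P212121" is a space-group symbol; CrystFEL's merging tools typically expect a point group such as "222".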
num_threads = 24
iterations = 5

print("="*50)
print("MERGING SECTION")
print("="*50)
print("Merging in progress...")
time.sleep(0.2)  # Simulate progress

# Call the merge function
output_dir = merge(
    stream_file,
    pointgroup=pointgroup,
    num_threads=num_threads,
    iterations=iterations,
)
time.sleep(0.2)

if output_dir is not None:
    print("Merging done. Results are in:", output_dir)
else:
    print("Merging failed. Please check the parameters and try again.")


In [None]:
# ----- SHELX Conversion Section -----
from merge_and_convert.convert_hkl_crystfel_to_shelx import convert_hkl_crystfel_to_shelx 
print("\n" + "="*50)
print("SHELX CONVERSION")
print("="*50)

if output_dir is None:
    print("No merged output available. Please run the merge step first.")
else:
    print("Converting to SHELX...")
    convert_hkl_crystfel_to_shelx(output_dir)
    print("Conversion to SHELX completed.")

In [None]:
# ----- MTZ Conversion Section -----
import os
from merge_and_convert.convert_hkl_to_mtz import convert_hkl_to_mtz
# Define the cell file path for MTZ conversion.
cell_file = "path/to/your/cell_file.cell"  # Adjust as needed

print("\n" + "="*50)
print("MTZ CONVERSION")
print("="*50)

if output_dir is None:
    print("No merged output available. Please run the merge step first.")
else:
    if not os.path.exists(cell_file):
        print("Cell file not found. Please check the path:", cell_file)
    else:
        print("Converting to MTZ...")
        convert_hkl_to_mtz(output_dir, cellfile_path=cell_file)
        print("Conversion to MTZ completed.")