# Protein Secondary Structure Inference with BayesFlow

_Authors: Bhanu Prasanna, Simulation-Based Inference Course - Task 5_

## Introduction

This notebook demonstrates amortized Bayesian inference for protein secondary structure prediction using a two-state Hidden Markov Model (HMM) and BayesFlow. The goal is to train a neural network to predict state membership probabilities (alpha-helix vs. other) from amino acid sequences, essentially learning an amortized approximation to the Forward-Backward algorithm.

### Problem Setup

We use a two-state HMM where:
- **State 0 ("other")**: Beta-sheets and random coils  
- **State 1 ("alpha-helix")**: Alpha-helix secondary structure

The HMM has fixed emission and transition probabilities based on empirical data from protein structure analysis. Given an amino acid sequence, we want to infer the probability that each position belongs to an alpha-helix or other structure.
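
For reference, the per-position quantity being inferred is the smoothed posterior marginal of the hidden state. In standard HMM notation (the symbols $x_{1:T}$ for the observed sequence, $z_t$ for the hidden state, and $\alpha$, $\beta$, $\gamma$ are introduced here for illustration and do not appear in the task description), it can be written as:

```latex
\gamma_t(k) = P(z_t = k \mid x_{1:T})
            = \frac{\alpha_t(k)\,\beta_t(k)}{\sum_{j \in \{0,1\}} \alpha_t(j)\,\beta_t(j)},
\qquad
\alpha_t(k) = P(x_{1:t},\, z_t = k),
\quad
\beta_t(k) = P(x_{t+1:T} \mid z_t = k).
```

The alpha-helix probability at position $t$ is then $\gamma_t(1)$, and the "other" probability is $\gamma_t(0) = 1 - \gamma_t(1)$.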

### Approach

1. **Simulator**: Generate amino acid sequences using the HMM generative model
2. **Forward-Backward**: Compute the exact posterior state probabilities using hmmlearn's `predict_proba`
3. **BayesFlow**: Train a neural network to map sequences → state probabilities  
4. **Validation**: Compare predictions to known protein structures (human insulin 1A7F)

In [9]:
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from typing import Dict, Tuple, Any, List
import logging
from pathlib import Path

# Select the Keras backend for BayesFlow (must be set before importing bayesflow)
os.environ.setdefault("KERAS_BACKEND", "tensorflow")
print(f"Using '{os.environ['KERAS_BACKEND']}' backend")

# HMM library for forward-backward algorithm
from hmmlearn import hmm

# BayesFlow imports
import bayesflow as bf

# Set random seed for reproducibility
np.random.seed(42)

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

print("All libraries imported successfully!")
print(f"BayesFlow version: {bf.__version__}")
print(f"NumPy version: {np.__version__}")

Using 'tensorflow' backend
All libraries imported successfully!
BayesFlow version: 2.0.5
NumPy version: 1.26.4


In [10]:
# ============================================================================
# HMM Configuration and Data Setup
# ============================================================================

# Amino acid alphabet (20 standard amino acids)
AMINO_ACIDS = ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 
               'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']

# Create amino acid to index mapping
AA_TO_IDX = {aa: idx for idx, aa in enumerate(AMINO_ACIDS)}
IDX_TO_AA = {idx: aa for idx, aa in enumerate(AMINO_ACIDS)}

print(f"Amino acid alphabet: {AMINO_ACIDS}")
print(f"Number of amino acids: {len(AMINO_ACIDS)}")

# Emission probabilities from the task description
# Alpha-helix emissions (State 1)
alpha_helix_probs = [
    12, 6, 3, 5, 1, 9, 5, 4, 2, 7,    # A R N D C E Q G H I
    12, 6, 3, 4, 2, 5, 4, 1, 3, 6     # L K M F P S T W Y V
]

# Other emissions (State 0) 
other_probs = [
    6, 5, 5, 6, 2, 5, 3, 9, 3, 5,     # A R N D C E Q G H I
    8, 6, 2, 4, 6, 7, 6, 1, 4, 7      # L K M F P S T W Y V
]

# Convert percentages to probabilities (divide by 100)
alpha_helix_emissions = np.array(alpha_helix_probs) / 100.0
other_emissions = np.array(other_probs) / 100.0

# Verify probabilities sum to 1
print(f"Alpha-helix emissions sum: {alpha_helix_emissions.sum():.3f}")
print(f"Other emissions sum: {other_emissions.sum():.3f}")

# Emission matrix: shape (n_states, n_features)
# State 0: "other", State 1: "alpha-helix"
emission_probs = np.array([other_emissions, alpha_helix_emissions])
print(f"Emission matrix shape: {emission_probs.shape}")

# Transition probabilities from task description
# State 0 = "other", State 1 = "alpha-helix"
# Always starts in "other" state
transition_probs = np.array([
    [0.95, 0.05],  # From "other": 95% stay, 5% to alpha-helix
    [0.10, 0.90]   # From "alpha-helix": 10% to other, 90% stay
])

# Initial state probabilities (always start in "other")
start_probs = np.array([1.0, 0.0])

print(f"Transition matrix shape: {transition_probs.shape}")
print(f"Transition probabilities:\n{transition_probs}")
print(f"Start probabilities: {start_probs}")

Amino acid alphabet: ['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
Number of amino acids: 20
Alpha-helix emissions sum: 1.000
Other emissions sum: 1.000
Emission matrix shape: (2, 20)
Transition matrix shape: (2, 2)
Transition probabilities:
[[0.95 0.05]
 [0.1  0.9 ]]
Start probabilities: [1. 0.]
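
To build some intuition for the transition matrix, the short sketch below derives two quantities that follow directly from it: the expected run length of each state and the long-run (stationary) state frequencies. The variable names `expected_run_length` and `stationary` are illustrative and not part of the original notebook.

```python
import numpy as np

# Transition matrix from the configuration cell (state 0 = "other", state 1 = "alpha-helix")
transition_probs = np.array([
    [0.95, 0.05],
    [0.10, 0.90],
])

# Expected run length in each state: a geometric dwell time with mean 1 / (1 - P[stay])
expected_run_length = 1.0 / (1.0 - np.diag(transition_probs))

# Stationary distribution: the left eigenvector of the transition matrix for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(transition_probs.T)
stationary = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
stationary = stationary / stationary.sum()

print(f"Expected run length (other, helix): {expected_run_length}")
print(f"Stationary distribution (other, helix): {stationary}")
```

Under these parameters a helix lasts 10 residues on average and an "other" segment 20, so in long sequences roughly one third of the residues are helical. Because every sequence starts in "other", short sequences will contain fewer helical residues than that.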


In [11]:
# ============================================================================
# HMM Model Setup using hmmlearn
# ============================================================================

class ProteinHMM:
    """Hidden Markov Model for protein secondary structure"""
    
    def __init__(self):
        # Create CategoricalHMM with 2 states
        self.model = hmm.CategoricalHMM(n_components=2, random_state=42)
        
        # Set the fixed parameters
        self.model.startprob_ = start_probs
        self.model.transmat_ = transition_probs  
        self.model.emissionprob_ = emission_probs
        
        # Prevent parameter updates during fitting
        self.model.params = ""  # Empty string means no parameters will be updated
        self.model.init_params = ""  # Empty string means no initialization
        
        print("ProteinHMM initialized with fixed parameters")
        print(f"Start probabilities: {self.model.startprob_}")
        print(f"Transition matrix:\n{self.model.transmat_}")
        print(f"Emission matrix shape: {self.model.emissionprob_.shape}")
    
    def generate_sequence(self, length: int) -> Tuple[np.ndarray, np.ndarray]:
        """Generate amino acid sequence and corresponding hidden states
        
        Args:
            length: Length of sequence to generate
            
        Returns:
            tuple: (amino_acid_sequence, hidden_states)
        """
        # Sample from the HMM
        amino_acids, states = self.model.sample(length, random_state=np.random.randint(0, 10000))
        
        return amino_acids.flatten(), states
    
    def get_state_probabilities(self, sequence: np.ndarray) -> np.ndarray:
        """Compute state membership probabilities using Forward-Backward algorithm
        
        Args:
            sequence: Array of amino acid indices
            
        Returns:
            state_probs: Array of shape (seq_length, n_states) with probabilities
        """
        # Reshape sequence for hmmlearn (expects 2D)
        seq_2d = sequence.reshape(-1, 1)
        
        # Use predict_proba to get state probabilities (Forward-Backward)
        state_probs = self.model.predict_proba(seq_2d)
        
        return state_probs
    
    def sequence_to_string(self, sequence: np.ndarray) -> str:
        """Convert amino acid indices to string"""
        return ''.join([IDX_TO_AA[idx] for idx in sequence])
    
    def string_to_sequence(self, seq_string: str) -> np.ndarray:
        """Convert amino acid string to indices (characters outside the 20-letter alphabet are skipped)"""
        return np.array([AA_TO_IDX[aa] for aa in seq_string if aa in AA_TO_IDX])

# Initialize the HMM
protein_hmm = ProteinHMM()

# Test sequence generation
test_seq, test_states = protein_hmm.generate_sequence(20)
test_seq_str = protein_hmm.sequence_to_string(test_seq)
test_probs = protein_hmm.get_state_probabilities(test_seq)

print(f"\nTest sequence generation:")
print(f"Sequence: {test_seq_str}")
print(f"Length: {len(test_seq)}")
print(f"Hidden states: {test_states}")
print(f"State probabilities shape: {test_probs.shape}")
print(f"Alpha-helix probabilities: {test_probs[:, 1]}")
print(f"Other probabilities: {test_probs[:, 0]}")

ProteinHMM initialized with fixed parameters
Start probabilities: [1. 0.]
Transition matrix:
[[0.95 0.05]
 [0.1  0.9 ]]
Emission matrix shape: (2, 20)

Test sequence generation:
Sequence: WRYTNSAFKDALYHLDKRLL
Length: 20
Hidden states: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1]
State probabilities shape: (20, 2)
Alpha-helix probabilities: [0.         0.03629551 0.05650999 0.08060925 0.11881799 0.18938437
 0.29388266 0.33046914 0.36745596 0.4058222  0.46311606 0.46629057
 0.44130009 0.4414953  0.47661556 0.48309802 0.5077631  0.53701772
 0.55680926 0.55083721]
Other probabilities: [1.         0.96370449 0.94349001 0.91939075 0.88118201 0.81061563
 0.70611734 0.66953086 0.63254404 0.5941778  0.53688394 0.53370943
 0.55869991 0.5585047  0.52338444 0.51690198 0.4922369  0.46298228
 0.44319074 0.44916279]
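
As a cross-check on `predict_proba`, the Forward-Backward recursion can also be written out directly. The following is a minimal NumPy sketch, not hmmlearn's implementation: the function name `forward_backward` and the per-step normalization scheme are choices made here for illustration, and the parameters are copied from the configuration cell so the block runs on its own.

```python
import numpy as np

# HMM parameters copied from the configuration cell (percentages per amino acid)
other_pct = [6, 5, 5, 6, 2, 5, 3, 9, 3, 5, 8, 6, 2, 4, 6, 7, 6, 1, 4, 7]
helix_pct = [12, 6, 3, 5, 1, 9, 5, 4, 2, 7, 12, 6, 3, 4, 2, 5, 4, 1, 3, 6]
emission = np.array([other_pct, helix_pct]) / 100.0  # shape (2, 20)
transition = np.array([[0.95, 0.05], [0.10, 0.90]])
start = np.array([1.0, 0.0])


def forward_backward(obs, start, transition, emission):
    """Return posterior state marginals of shape (len(obs), n_states)."""
    T, K = len(obs), len(start)
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    # Forward pass, normalizing each step to avoid numerical underflow
    alpha[0] = start * emission[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ transition) * emission[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    # Backward pass, also normalized per step
    for t in range(T - 2, -1, -1):
        beta[t] = transition @ (emission[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    # The per-step scaling constants cancel when each row is renormalized
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)


# Hypothetical check on a random index sequence (not the notebook's test sequence)
rng = np.random.default_rng(0)
obs = rng.integers(0, 20, size=15)
posterior = forward_backward(obs, start, transition, emission)
print(posterior.shape)
```

Applied to the index array of the test sequence above, this should agree with `protein_hmm.get_state_probabilities` up to floating-point error. In particular, the first position is always assigned to "other" with probability 1, because `start_probs` puts all mass on state 0.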


In [17]:
# ============================================================================
# BayesFlow Simulator Setup
# ============================================================================

def sequence_length_prior():
    """Sample sequence length from a reasonable range"""
    # Real proteins are often hundreds of residues long; training is restricted
    # to short sequences here for computational efficiency
    length = np.random.randint(10, 51)  # 10-50 amino acids
    return dict(seq_length=length)

def protein_sequence_simulator(seq_length):
    """Generate protein sequence and compute state probabilities
    
    Args:
        seq_length: Length of sequence to generate
        
    Returns:
        dict: Contains amino acid sequence and state probabilities
    """
    # Fixed maximum length for padding (must be consistent across batch)
    MAX_LENGTH = 50
    
    # Generate sequence using our HMM
    amino_acid_seq, hidden_states = protein_hmm.generate_sequence(seq_length)
    
    # Compute state probabilities using Forward-Backward
    state_probs = protein_hmm.get_state_probabilities(amino_acid_seq)
    
    # Extract alpha-helix probabilities (what we want to predict)
    alpha_helix_probs = state_probs[:, 1]  # State 1 = alpha-helix
    
    # Pad everything to MAX_LENGTH with zeros. Note that 0 is also the index of
    # amino acid 'A' and a valid probability/state value, so padded positions can
    # only be told apart from real ones via seq_length.
    
    # Pad amino acid sequence
    padded_amino_acids = np.zeros(MAX_LENGTH, dtype=np.float32)
    padded_amino_acids[:seq_length] = amino_acid_seq.astype(np.float32)
    
    # Pad alpha-helix probabilities with 0 (reads as a confident "other", not as missing)
    padded_alpha_probs = np.zeros(MAX_LENGTH, dtype=np.float32)
    padded_alpha_probs[:seq_length] = alpha_helix_probs.astype(np.float32)
    
    # Pad hidden states with 0 (same value as the "other" state)
    padded_states = np.zeros(MAX_LENGTH, dtype=np.float32)
    padded_states[:seq_length] = hidden_states.astype(np.float32)
    
    return dict(
        amino_acids=padded_amino_acids,
        alpha_helix_probs=padded_alpha_probs,
        true_states=padded_states
    )

# Create BayesFlow simulator
simulator = bf.simulators.make_simulator(
    [sequence_length_prior, protein_sequence_simulator]
)

# Test the simulator
print("Testing BayesFlow simulator...")
sample_data = simulator.sample(3)

print(f"Sample data keys: {list(sample_data.keys())}")
print(f"Sequence lengths: {sample_data['seq_length']}")
print(f"Amino acids shape: {sample_data['amino_acids'].shape}")
print(f"Alpha-helix probabilities shape: {sample_data['alpha_helix_probs'].shape}")
print(f"True states shape: {sample_data['true_states'].shape}")

# Show example
idx = 0
print(f"\nExample sequence {idx}:")
print(f"Length: {sample_data['seq_length'][idx]}")
print(f"Amino acids: {sample_data['amino_acids'][idx]}")
print(f"Alpha-helix probs: {sample_data['alpha_helix_probs'][idx]}")
print(f"True states: {sample_data['true_states'][idx]}")

# Convert to amino acid string for visualization (only actual sequence length)
actual_length = int(sample_data['seq_length'][idx])
valid_indices = sample_data['amino_acids'][idx][:actual_length].astype(int)
seq_str = protein_hmm.sequence_to_string(valid_indices)
print(f"Sequence string: {seq_str}")

Testing BayesFlow simulator...
Sample data keys: ['seq_length', 'amino_acids', 'alpha_helix_probs', 'true_states']
Sequence lengths: [[30]
 [42]
 [21]]
Amino acids shape: (3, 50)
Alpha-helix probabilities shape: (3, 50)
True states shape: (3, 50)

Example sequence 0:
Length: [30]
Amino acids: [ 8. 14. 18. 19.  0. 13. 19.  8.  3.  9. 15. 12.  5.  7.  8. 15.  3. 18.
  3.  0.  9.  1.  0. 16. 18. 11.  1.  2. 19. 19.  0.  0.  0.  0.  0.  0.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
Alpha-helix probs: [0.         0.01453173 0.05327759 0.10025357 0.15187754 0.1555573
 0.15414822 0.15618935 0.17652062 0.20424414 0.20692506 0.22761549
 0.21901613 0.16585737 0.16108303 0.17462648 0.20498015 0.2452934
 0.30829096 0.39078447 0.41228017 0.41008446 0.39600596 0.32377043
 0.28863704 0.27391347 0.25767338 0.22657956 0.2268793  0.23421608
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.       
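
Because the padding value 0 coincides with amino acid `A` and with a helix probability of 0, any evaluation on the padded arrays should be restricted to the real positions. One way to do this, sketched below under the assumption that `seq_length` has shape `(batch, 1)` as printed above, is to build a boolean mask from the sequence lengths. The helper names `padding_mask` and `masked_mae` are hypothetical and not part of the notebook.

```python
import numpy as np

MAX_LENGTH = 50  # same padding length as in the simulator cell


def padding_mask(seq_lengths, max_length=MAX_LENGTH):
    """Boolean mask of shape (batch, max_length): True at real residues, False at padding."""
    seq_lengths = np.asarray(seq_lengths).reshape(-1, 1)
    return np.arange(max_length)[None, :] < seq_lengths


def masked_mae(pred, target, mask):
    """Mean absolute error computed over real (unpadded) positions only."""
    return np.abs(pred - target)[mask].mean()


# Hypothetical batch with the sequence lengths printed above
seq_length = np.array([[30], [42], [21]])
mask = padding_mask(seq_length)
print(mask.shape, mask.sum(axis=1))
```

Without such a mask, the zeros in the padded region count as correct "other" predictions and make any error metric look better than it is.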

In [48]:
# ============================================================================
# BayesFlow Adapter Setup
# ============================================================================

# The simulator returns one entry per variable (seq_length, amino_acids,
# alpha_helix_probs, true_states), so the adapter renames these keys directly

adapter = (
    bf.adapters.Adapter()
    # Convert any non-arrays to numpy arrays
    .to_array()
    # Convert from numpy's default float64 to deep learning friendly float32
    .convert_dtype("float64", "float32")
    # Set up the data flow for BayesFlow
    .rename("alpha_helix_probs", "inference_variables")  # Target: state probabilities  
    .rename("amino_acids", "summary_variables")          # Input: amino acid sequences for summary network
    .rename("seq_length", "inference_conditions")       # Condition: sequence length for inference network
    # Drop unnecessary variables
    .drop(["true_states"])
)

# Test the adapter
print("Testing BayesFlow adapter...")
sample_data = simulator.sample(3)
print(f"Original simulator output keys: {list(sample_data.keys())}")

for key, value in sample_data.items():
    print(f"{key}: {value.shape} (dtype: {value.dtype})")

# Apply the adapter to the sampled batch
try:
    adapted_data = adapter(sample_data)
    print(f"\nAdapter successful! Output keys: {list(adapted_data.keys())}")
    
    print(f"\nData shapes after adaptation:")
    for key, value in adapted_data.items():
        print(f"{key}: {value.shape} (dtype: {value.dtype})")

    print(f"\nExample adapted data:")
    print(f"summary_variables (amino acids): {adapted_data['summary_variables'][0][:10]}")
    print(f"inference_variables (alpha-helix probs): {adapted_data['inference_variables'][0][:10]}")
    print(f"inference_conditions (seq_length): {adapted_data['inference_conditions'][0]}")

    # Check value ranges
    print(f"\nValue ranges:")
    print(f"Amino acids: [{adapted_data['summary_variables'].min():.1f}, {adapted_data['summary_variables'].max():.1f}]")
    print(f"Alpha-helix probs: [{adapted_data['inference_variables'].min():.3f}, {adapted_data['inference_variables'].max():.3f}]")
    print(f"Sequence lengths: [{adapted_data['inference_conditions'].min():.0f}, {adapted_data['inference_conditions'].max():.0f}]")
    
except Exception as e:
    print(f"Adapter error: {e}")
    import traceback
    traceback.print_exc()

Testing BayesFlow adapter...
Original simulator output keys: ['seq_length', 'amino_acids', 'alpha_helix_probs', 'true_states']
seq_length: (3, 1) (dtype: int64)
amino_acids: (3, 50) (dtype: float32)
alpha_helix_probs: (3, 50) (dtype: float32)
true_states: (3, 50) (dtype: float32)

Adapter successful! Output keys: ['inference_variables', 'summary_variables', 'inference_conditions']

Data shapes after adaptation:
inference_variables: (3, 50) (dtype: float32)
summary_variables: (3, 50) (dtype: float32)
inference_conditions: (3, 1) (dtype: float32)

Example adapted data:
summary_variables (amino acids): [16.  7.  6.  8.  7.  3. 13. 16.  6.  5.]
inference_variables (alpha-helix probs): [0.         0.024004   0.07333158 0.08906678 0.11779314 0.2010955
 0.30366385 0.4083966  0.58317083 0.688051  ]
inference_conditions (seq_length): [31.]

Value ranges:
Amino acids: [0.0, 19.0]
Alpha-helix probs: [0.000, 0.868]
Sequence lengths: [16, 31]
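
The value ranges above show that the targets live in [0, 1] and include exact zeros (the first position is always "other"). A generative inference network typically produces unconstrained real values, so one common option is to train on logits and map samples back afterwards. The sketch below illustrates the transform in plain NumPy; the clipping constant `EPS` and the helper names `to_logit` and `from_logit` are assumptions made here, not something the notebook defines.

```python
import numpy as np

EPS = 1e-6  # hypothetical clipping constant; keeps the logit finite at exactly 0 or 1


def to_logit(p, eps=EPS):
    """Map probabilities in [0, 1] to unconstrained real values."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def from_logit(x):
    """Inverse transform: map real values back to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


probs = np.array([0.0, 0.024, 0.303, 0.868])
recovered = from_logit(to_logit(probs))
print(np.abs(recovered - probs).max())
```

The round trip is exact up to the clipping at `EPS`, so predictions mapped back through `from_logit` are guaranteed to be valid probabilities.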


In [55]:
# ============================================================================
# BayesFlow Neural Architecture Setup
# ============================================================================

print("Creating BayesFlow neural networks...")

# Create summary network using DeepSet for processing amino acid sequences
summary_net = bf.networks.DeepSet(
    # Each element in the set (amino acid) goes through this MLP
    phi_kwargs={"units": [64, 32], "activation": "relu"},
    # The pooled result goes through this MLP to get final summary
    rho_kwargs={"units": [32, 16], "activation": "relu"}
)

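# Create inference network for predicting alpha-helix probabilities
# NOTE: bf.networks.InferenceNetwork appears to be the generic base class in
# BayesFlow 2.x; training normally requires a concrete subclass such as
# bf.networks.CouplingFlow
inference_net = bf.networks.InferenceNetwork(
    mlp_kwargs={"units": [32, 32, 50], "activation": "relu"}  # Output 50 probabilities
)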

print("BayesFlow networks created successfully!")
print(f"Summary network: {type(summary_net).__name__}")
print(f"  - Element MLP (phi): [64, 32]")
print(f"  - Aggregation MLP (rho): [32, 16]")
print(f"  - Purpose: Extract features from amino acid sequences")

print(f"\nInference network: {type(inference_net).__name__}")
print(f"  - MLP units: [32, 32, 50]")
print(f"  - Purpose: Predict alpha-helix probabilities from sequence summary + length")

print(f"\nArchitecture overview:")
print(f"  Input: Amino acid sequences (batch_size, 50)")
print(f"  Summary Network: sequences -> 16D summary vector")
print(f"  Inference Network: [summary, seq_length] -> alpha-helix probabilities (50D)")
print(f"  Output: Per-position alpha-helix probabilities")

Creating BayesFlow neural networks...
BayesFlow networks created successfully!
Summary network: DeepSet
  - Element MLP (phi): [64, 32]
  - Aggregation MLP (rho): [32, 16]
  - Purpose: Extract features from amino acid sequences

Inference network: InferenceNetwork
  - MLP units: [32, 32, 50]
  - Purpose: Predict alpha-helix probabilities from sequence summary + length

Architecture overview:
  Input: Amino acid sequences (batch_size, 50)
  Summary Network: sequences -> 16D summary vector
  Inference Network: [summary, seq_length] -> alpha-helix probabilities (50D)
  Output: Per-position alpha-helix probabilities


In [56]:
# ============================================================================
# BayesFlow Training Workflow Setup
# ============================================================================

# Create the BasicWorkflow that ties everything together
workflow = bf.workflows.BasicWorkflow(
    simulator=simulator,
    adapter=adapter,
    inference_network=inference_net,
    summary_network=summary_net
)

print("BayesFlow BasicWorkflow created successfully!")
print(f"Workflow components:")
print(f"  - Simulator: {type(workflow.simulator).__name__}")
print(f"  - Inference Network: {type(inference_net).__name__}")  
print(f"  - Summary Network: {type(summary_net).__name__}")
print(f"  - Adapter: {type(workflow.adapter).__name__}")

# Configure training parameters
BATCH_SIZE = 32
EPOCHS = 5  # Start with fewer epochs for initial testing
LEARNING_RATE = 0.001

print(f"\nTraining configuration:")
print(f"  - Batch size: {BATCH_SIZE}")
print(f"  - Epochs: {EPOCHS}")
print(f"  - Learning rate: {LEARNING_RATE}")

# Test workflow with a small batch
print("\nTesting workflow with small batch...")
try:
    # Generate training data
    train_data = workflow.simulate(BATCH_SIZE)
    adapted_train_data = workflow.adapter(train_data)
    
    print(f"Training data generated successfully!")
    print(f"Batch shapes:")
    for key, value in adapted_train_data.items():
        print(f"  {key}: {value.shape}")
        
    # Test the summary network
    summary_output = summary_net(adapted_train_data['summary_variables'])
    print(f"\nSummary network output shape: {summary_output.shape}")
    print(f"Expected: (batch_size={BATCH_SIZE}, n_summary=16)")
    
except Exception as e:
    print(f"Error during workflow testing: {e}")
    import traceback
    traceback.print_exc()

BayesFlow BasicWorkflow created successfully!
Workflow components:
  - Simulator: SequentialSimulator
  - Inference Network: InferenceNetwork
  - Summary Network: DeepSet
  - Adapter: Adapter

Training configuration:
  - Batch size: 32
  - Epochs: 5
  - Learning rate: 0.001

Testing workflow with small batch...
Training data generated successfully!
Batch shapes:
  inference_variables: (32, 50)
  summary_variables: (32, 50)
  inference_conditions: (32, 1)
Error during workflow testing: Exception encountered when calling Residual.call().

Input 0 of layer "projector" is incompatible with the layer: expected min_ndim=2, found ndim=1. Full shape received: (64,)

Arguments received by Residual.call():
  • x=tf.Tensor(shape=(64,), dtype=float32)
  • training=False
  • mask=None


Traceback (most recent call last):
  File "/var/folders/1r/h80d31y92rn7dxwn7_1yfhsh0000gn/T/ipykernel_51683/2789587064.py", line 43, in <module>
    summary_output = summary_net(adapted_train_data['summary_variables'])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/ukk/lib/python3.12/site-packages/keras/src/utils/traceback_utils.py", line 122, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/opt/anaconda3/envs/ukk/lib/python3.12/site-packages/bayesflow/utils/decorators.py", line 95, in wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/ukk/lib/python3.12/site-packages/bayesflow/networks/summary_network.py", line 18, in build
    z = self.call(x)
        ^^^^^^^^^^^^
  File "/opt/anaconda3/envs/ukk/lib/python3.12/site-packages/bayesflow/networks/deep_set/deep_set.py", line 146, in call
    x = em(x, training=training)
        ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/
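
The shape error above is consistent with the summary network receiving a rank-2 array `(32, 50)` where a set-based network expects rank 3, i.e. `(batch, set_size, feature_dim)`. The simplest remedy would be a trailing feature axis (`amino_acids[..., None]`). The sketch below goes one step further and one-hot encodes the residues, reserving an extra class for padding. The names `PAD_INDEX` and `one_hot_with_padding` are hypothetical, and whether this encoding suits the rest of the pipeline is an assumption that has not been tested against BayesFlow here.

```python
import numpy as np

N_AMINO_ACIDS = 20
PAD_INDEX = N_AMINO_ACIDS  # hypothetical extra class reserved for padding


def one_hot_with_padding(amino_acids, seq_lengths):
    """Map (batch, max_length) index arrays to (batch, max_length, 21) one-hot arrays."""
    idx = np.asarray(amino_acids).astype(int).copy()
    lengths = np.asarray(seq_lengths).reshape(-1, 1)
    positions = np.arange(idx.shape[1])[None, :]
    # Mark padded positions so they no longer collide with 'A' (index 0)
    idx[positions >= lengths] = PAD_INDEX
    return np.eye(N_AMINO_ACIDS + 1, dtype=np.float32)[idx]


# Stand-in batch with the shapes printed above
amino_acids = np.zeros((32, 50), dtype=np.float32)
seq_length = np.full((32, 1), 25)
encoded = one_hot_with_padding(amino_acids, seq_length)
print(encoded.shape)
```

Note also that a DeepSet pools over the set dimension and is therefore invariant to the order of its elements, while the Forward-Backward posterior depends on residue order. An order-aware summary network (recurrent or transformer-style) may be a better match for this task.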

In [30]:
# ============================================================================
# BayesFlow Training
# ============================================================================

print("Starting BayesFlow training...")
print("This may take a few minutes depending on your hardware.\n")

# Train the workflow using online training (generates data on-the-fly)
training_history = workflow.fit_online(
    epochs=EPOCHS,
    batch_size=BATCH_SIZE,
    num_batches_per_epoch=100,  # Number of training batches per epoch
    validation_data=500        # Number of validation simulations
)

print("\nTraining completed successfully!")

# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Training loss
axes[0].plot(training_history.history['loss'])
axes[0].set_title('Training Loss')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Loss')
axes[0].grid(True)

# Validation loss (if available)
if 'val_loss' in training_history.history:
    axes[1].plot(training_history.history['val_loss'])
    axes[1].set_title('Validation Loss')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Loss')
    axes[1].grid(True)
else:
    axes[1].text(0.5, 0.5, 'No validation loss available', 
                ha='center', va='center', transform=axes[1].transAxes)
    axes[1].set_title('Validation Loss')

plt.tight_layout()
plt.show()

print(f"Final training loss: {training_history.history['loss'][-1]:.4f}")
if 'val_loss' in training_history.history:
    print(f"Final validation loss: {training_history.history['val_loss'][-1]:.4f}")

# Test the trained model
print("\nTesting trained model...")
test_data = workflow.simulate(5)

# Get predictions from the trained model. In BayesFlow 2.x, workflow.sample
# takes the raw simulator keys and applies the adapter internally, so the
# conditions are passed under their original names.
test_conditions = {
    'amino_acids': test_data['amino_acids'],
    'seq_length': test_data['seq_length']
}

predictions = workflow.sample(
    num_samples=100,  # Number of posterior samples
    conditions=test_conditions
)

# Samples are returned under the original variable name (adapter inverse pass)
print(f"Predictions shape: {predictions['alpha_helix_probs'].shape}")
print("Model training and testing completed successfully!")

Starting BayesFlow training...
This may take a few minutes depending on your hardware.



INFO:bayesflow:Fitting on dataset instance of OnlineDataset.


Epoch 1/5


ValueError: Cannot compute summaries from summary_variables without a summary network.

In [42]:
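# Let's create a simpler approach that directly extracts what we need
# NOTE: `test_sim` and its 'sim_data' key come from an earlier version of the
# simulator that is not defined in this notebook
print("Manual data extraction from sim_data:")

# Manually extract the data we need
# NOTE: in the recorded output the two channels sum to 1 at every position, which
# suggests channel 0 holds the "other" state probability rather than amino acid indices
amino_acids = test_sim['sim_data'][..., 0]  # First channel
alpha_helix_probs = test_sim['sim_data'][..., 1]  # Second channel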

print("Extracted amino_acids shape:", amino_acids.shape)
print("Extracted alpha_helix_probs shape:", alpha_helix_probs.shape)

# Create a simple data structure that BayesFlow expects
manual_data = {
    'inference_conditions': amino_acids.astype(np.float32),
    'inference_variables': alpha_helix_probs.astype(np.float32)
}

print("\nManual data structure:")
for key, value in manual_data.items():
    print(f"  {key}: {value.shape}")

print("\nSample inference_conditions (amino acids):")
print(manual_data['inference_conditions'][0][:10])
print("\nSample inference_variables (alpha helix probs):")
print(manual_data['inference_variables'][0][:10])

Manual data extraction from sim_data:
Extracted amino_acids shape: (2, 50)
Extracted alpha_helix_probs shape: (2, 50)

Manual data structure:
  inference_conditions: (2, 50)
  inference_variables: (2, 50)

Sample inference_conditions (amino acids):
[0.59666985 0.56946033 0.5465842  0.4928938  0.4201785  0.38633034
 0.36481977 0.3564731  0.34724712 0.36692783]

Sample inference_variables (alpha helix probs):
[0.40333015 0.43053967 0.4534158  0.50710624 0.57982147 0.6136697
 0.63518023 0.6435269  0.6527529  0.6330722 ]
