
Determine why number of equilibrium iterations is almost always 1 #1000

Closed
jchodera opened this issue May 8, 2022 · 3 comments


@jchodera
Member

jchodera commented May 8, 2022

@zhang-ivy has noticed that the number of equilibration iterations assessed by automatic equilibration detection is almost always 1. We should investigate why this is happening, and whether there is a better timeseries to feed into automatic equilibration detection than the super-replica energy, so that we can better partition between transient initial equilibration and production.

@zhang-ivy
Contributor

zhang-ivy commented May 12, 2022

@jchodera : I started investigating this today, and I have 1 finding and 2 questions.

Here is the function in MultiStateSamplerAnalyzer that generates the equilibration data (n_equilibration_iterations, statistical inefficiency, and number of uncorrelated iterations): https://github.com/choderalab/openmmtools/blob/main/openmmtools/multistate/multistateanalyzer.py#L1986

Four things happen in this function:

  1. The effective energy timeseries is retrieved (this is an array of shape (N, ) where N is the number of iterations)
  2. t0 is set to 1. For SAMS simulations only, it then attempts to update t0 based on the online analysis data.
  3. The effective energy timeseries is fed into get_equilibration_data_per_sample(), which computes the statistical inefficiency and n_effective_samples for a subset of iterations. By default, it will compute on 100 evenly spaced iterations, e.g. for 90000 iterations, it'll compute on iterations: 0, 900, 1800, 2700, ... 88200, 89100. The output is three arrays that contain the iteration numbers, statistical inefficiencies, and n_effective_samples for each candidate iteration.
  4. The overall n_effective_samples, statistical inefficiency, and n_equilibration_iterations are taken from the entry with the maximum of the n_effective_samples array (a minimal sketch of steps 3 and 4 follows this list).
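
To make steps 3 and 4 concrete, here is a minimal sketch of that logic, assuming pymbar 3's timeseries module and a 1-D effective energy array; the names here are illustrative and this is not the actual openmmtools implementation:

import numpy as np
from pymbar import timeseries

def equilibration_from_max_neff(u_n, max_subset=100, t0=1):
    # u_n: effective energy timeseries, shape (N,)
    time_size = u_n.size
    # Step 3: candidate start iterations, max_subset evenly spaced points
    i_t = np.floor(np.arange(max_subset) * time_size / max_subset).astype(int)
    g_i = np.empty(max_subset)
    n_effective_i = np.empty(max_subset)
    for i, t in enumerate(i_t):
        # Statistical inefficiency of the series that starts at candidate iteration t
        g_i[i] = timeseries.statisticalInefficiency(u_n[t:], fast=True)
        n_effective_i[i] = (time_size - t + 1) / g_i[i]
    # Step 4: keep the candidate with the most effective samples;
    # t0 accounts for the initially discarded frame
    i_max = n_effective_i.argmax()
    return i_t[i_max] + t0, g_i[i_max], n_effective_i[i_max]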

Finding 1: Since n_effective_samples is largest at iteration 0, and n_equilibration_iterations is computed as the iteration with the largest n_effective_samples plus t0 (which is 1), n_equilibration_iterations ends up being 0 + 1 = 1.
[screenshot of the n_effective_samples results omitted]

Question 1: Why does get_equilibration_data_per_sample() compute on only 100 samples by default? Is this really a good default? Should the default depend on the total number of iterations? If I have 1000 iterations, 100 samples may make sense, but for 90000 iterations, 100 seems like it may be too few.
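
To put numbers on that: with the default, a 90000-iteration run only tests candidate equilibration points 900 iterations apart, so n_equilibration_iterations can only land on that grid (plus t0). A quick check, reusing the index arithmetic from step 3 above:

import numpy as np

n_iterations = 90000  # run length from the example above
max_subset = 100      # current default in get_equilibration_data_per_sample()

# Candidate start iterations actually tested
i_t = np.floor(np.arange(max_subset) * n_iterations / max_subset).astype(int)
print(i_t[:4], i_t[-2:])  # [   0  900 1800 2700] [88200 89100]
print(i_t[1] - i_t[0])    # 900 iterations between consecutive candidates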

Now that I'm familiar with these functions, I'm going to experiment with swapping out the effective energy timeseries for either the state index replica timeseries or the d_u/d_lambda replica timeseries.
Question 2: How should I feed the state index or d_u/d_lambda replica timeseries into get_equilibration_data_per_sample()? Should I just swap out statisticalInefficiency for statisticalInefficiencyMultiple here?
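
For reference, here is the kind of swap I have in mind, a minimal sketch assuming pymbar 3's statisticalInefficiencyMultiple and a (n_replicas, n_iterations) array standing in for the replica timeseries (the array here is synthetic, just to keep the snippet self-contained):

import numpy as np
from pymbar import timeseries

# Illustrative stand-in for a per-replica timeseries, shape (n_replicas, n_iterations);
# in practice this would be the state index or d_u/d_lambda series for each replica
rng = np.random.default_rng(0)
A_kn = rng.normal(size=(12, 5000))

t = 0  # candidate equilibration start, as in the per-sample loop
# statisticalInefficiency() expects a single 1-D series, while
# statisticalInefficiencyMultiple() pools a set of series (a 2-D array or list of arrays)
# into one estimate, which is what a per-replica timeseries needs
g = timeseries.statisticalInefficiencyMultiple(A_kn[:, t:], fast=True)
print(g)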

@zhang-ivy
Contributor

zhang-ivy commented May 16, 2022

As a follow-up to question 2 above, I tried swapping out statisticalInefficiency for statisticalInefficiencyMultiple here and then fed in the state index and d_u/d_lambda timeseries, but both of them still result in n_equilibration = 1.

Here's the code I used for feeding in the state index replica timeseries:

# Imports assumed available in this environment
import numpy as np
from pymbar import timeseries
from openmmtools.multistate import MultiStateSamplerAnalyzer

# Retrieve state index replica timeseries (the last array returned by read_energies());
# `reporter` and `n_iterations` are defined earlier in my setup
analyzer = MultiStateSamplerAnalyzer(reporter, max_n_iterations=n_iterations)
timeseries_to_analyze = analyzer.read_energies()[-1].data

# Set parameters to be used in `get_equilibration_data_per_sample()`
fast = True
max_subset = 100

# Run code in `multistate.utils.get_equilibration_per_sample()`
# Cast to array if not already
series = np.array(timeseries_to_analyze)
time_size = n_iterations
set_size = time_size - 1  # Cannot analyze the last entry
# Set maximum
if max_subset is None or set_size < max_subset:
    max_subset = set_size
# Special trap for series of size 1
if max_subset == 0:
    max_subset = 1
# Special trap for constant or size 1 series
if series.std() == 0.0 or max_subset == 1:
    print("std dev is 0")
g_i = np.ones([max_subset], np.float32)
n_effective_i = np.ones([max_subset], np.float32)
counter = np.arange(max_subset)
i_t = np.floor(counter * time_size / max_subset).astype(int)
for i, t in enumerate(i_t):
    print(i)
    try:
        g_i[i] = timeseries.statisticalInefficiencyMultiple(series[:,t:], fast=fast)
    except:
        #g_i[i] = (time_size - t + 1)
        raise Exception("Unable to run statisticalInefficiencyMultiple")
    n_effective_i[i] = (time_size - t + 1) / g_i[i]

# Run code in MultiStateSamplerAnalyzer.get_equilibration_data()
t0 = 1
# Discard equilibration samples.
# TODO: if we include u_n[0] (the energy right after minimization) in the equilibration detection,
# TODO:         then number_equilibrated is 0. Find a better way than just discarding first frame.
n_effective_max = n_effective_i.max()
i_max = n_effective_i.argmax()
n_equilibration = i_t[i_max] + t0 # account for initially discarded frames
g_t = g_i[i_max]
print(n_equilibration, g_t, n_effective_max)

Output:

1 11697.602 7.69397

I then generated a d_u/d_lambda replica timeseries and fed it through the above code snippet. Here is the output:

1 902.0917 9.9779215

@ijpulidos
Contributor

Closing via choderalab/openmmtools#610
