
Determine why number of equilibrium iterations is almost always 1 #1000

Closed
jchodera opened this issue May 8, 2022 · 3 comments


@jchodera
Member

jchodera commented May 8, 2022

@zhang-ivy has noticed that the number of equilibration iterations assessed by automatic equilibration detection is almost always 1. We should investigate why this is happening, and whether there is a better timeseries to feed into automatic equilibration detection than the super-replica energy, so that we can better partition between transient initial equilibration and production.

@zhang-ivy
Contributor

zhang-ivy commented May 12, 2022

@jchodera : I started investigating this today, and I have 1 finding and 2 questions.

Here is the function in MultiStateSamplerAnalyzer that generates the equilibration data (n_equilibration_iterations, statistical inefficiency, and number of uncorrelated iterations): https://github.com/choderalab/openmmtools/blob/main/openmmtools/multistate/multistateanalyzer.py#L1986

Four things happen in this function:

  1. The effective energy timeseries is retrieved (this is an array of shape (N, ) where N is the number of iterations)
  2. t0 is set to 1. For SAMS simulations only, it then attempts to update t0 based on the online analysis data.
  3. The effective energy timeseries is fed into get_equilibration_data_per_sample(), which computes the statistical inefficiency and n_effective_samples for a subset of iterations. By default, it will compute on 100 evenly spaced iterations, e.g. for 90000 iterations, it'll compute on iterations: 0, 900, 1800, 2700, ... 88200, 89100. The output is three arrays that contain the iteration numbers, statistical inefficiencies, and n_effective_samples for each candidate iteration.
  4. The overall n_effective_samples, statistical inefficiency, and n_equilibration_iterations are taken from the entry with the maximum of the n_effective_samples array (a minimal sketch of steps 3 and 4 follows this list).
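
To make steps 3 and 4 concrete, here is a minimal sketch of that logic, assuming pymbar 3's timeseries module and a 1-D effective energy array; the names here are illustrative and this is not the actual openmmtools implementation:

import numpy as np
from pymbar import timeseries

def equilibration_from_max_neff(u_n, max_subset=100, t0=1):
    # u_n: effective energy timeseries, shape (N,)
    time_size = u_n.size
    # Step 3: candidate start iterations, max_subset evenly spaced points
    i_t = np.floor(np.arange(max_subset) * time_size / max_subset).astype(int)
    g_i = np.empty(max_subset)
    n_effective_i = np.empty(max_subset)
    for i, t in enumerate(i_t):
        # Statistical inefficiency of the series that starts at candidate iteration t
        g_i[i] = timeseries.statisticalInefficiency(u_n[t:], fast=True)
        n_effective_i[i] = (time_size - t + 1) / g_i[i]
    # Step 4: keep the candidate with the most effective samples;
    # t0 accounts for the initially discarded frame
    i_max = n_effective_i.argmax()
    return i_t[i_max] + t0, g_i[i_max], n_effective_i[i_max]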

Finding 1: Since n_effective_samples is largest at iteration 0, and n_equilibration_iterations is computed as the iteration with the largest n_effective_samples plus t0 (which is 1), n_equilibration_iterations ends up being 0 + 1 = 1.
[screenshot of the n_effective_samples results omitted]

Question 1: Why does get_equilibration_data_per_sample() compute on only 100 samples by default? Is this really a good default? Should the default depend on the total number of iterations? If I have 1000 iterations, 100 samples may make sense, but for 90000 iterations, 100 seems like it may be too few.
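
To put numbers on that: with the default, a 90000-iteration run only tests candidate equilibration points 900 iterations apart, so n_equilibration_iterations can only land on that grid (plus t0). A quick check, reusing the index arithmetic from step 3 above:

import numpy as np

n_iterations = 90000  # run length from the example above
max_subset = 100      # current default in get_equilibration_data_per_sample()

# Candidate start iterations actually tested
i_t = np.floor(np.arange(max_subset) * n_iterations / max_subset).astype(int)
print(i_t[:4], i_t[-2:])  # [   0  900 1800 2700] [88200 89100]
print(i_t[1] - i_t[0])    # 900 iterations between consecutive candidates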

Now that I'm familiar with these functions, I'm going to experiment with swapping out the effective energy timeseries for either the state index replica timeseries or the d_u/d_lambda replica timeseries.
Question 2: How should I feed the state index or d_u/d_lambda replica timeseries into get_equilibration_data_per_sample()? Should I just swap out statisticalInefficiency for statisticalInefficiencyMultiple here?
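
For reference, here is the kind of swap I have in mind, a minimal sketch assuming pymbar 3's statisticalInefficiencyMultiple and a (n_replicas, n_iterations) array standing in for the replica timeseries (the array here is synthetic, just to keep the snippet self-contained):

import numpy as np
from pymbar import timeseries

# Illustrative stand-in for a per-replica timeseries, shape (n_replicas, n_iterations);
# in practice this would be the state index or d_u/d_lambda series for each replica
rng = np.random.default_rng(0)
A_kn = rng.normal(size=(12, 5000))

t = 0  # candidate equilibration start, as in the per-sample loop
# statisticalInefficiency() expects a single 1-D series, while
# statisticalInefficiencyMultiple() pools a set of series (a 2-D array or list of arrays)
# into one estimate, which is what a per-replica timeseries needs
g = timeseries.statisticalInefficiencyMultiple(A_kn[:, t:], fast=True)
print(g)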

@zhang-ivy
Contributor

zhang-ivy commented May 16, 2022

As a follow-up to question 2 above, I tried swapping out statisticalInefficiency for statisticalInefficiencyMultiple here and then fed in the state index and d_u/d_lambda timeseries, but both of them still result in n_equilibration = 1.

Here's the code I used for feeding in the state index replica timeseries:

# Imports assumed available in this environment
import numpy as np
from pymbar import timeseries
from openmmtools.multistate import MultiStateSamplerAnalyzer

# Retrieve state index replica timeseries (the last array returned by read_energies());
# `reporter` and `n_iterations` are defined earlier in my setup
analyzer = MultiStateSamplerAnalyzer(reporter, max_n_iterations=n_iterations)
timeseries_to_analyze = analyzer.read_energies()[-1].data

# Set parameters to be used in `get_equilibration_data_per_sample()`
fast = True
max_subset = 100

# Run code in `multistate.utils.get_equilibration_per_sample()`
# Cast to array if not already
series = np.array(timeseries_to_analyze)
time_size = n_iterations
set_size = time_size - 1  # Cannot analyze the last entry
# Set maximum
if max_subset is None or set_size < max_subset:
    max_subset = set_size
# Special trap for series of size 1
if max_subset == 0:
    max_subset = 1
# Special trap for constant or size 1 series
if series.std() == 0.0 or max_subset == 1:
    print("std dev is 0")
g_i = np.ones([max_subset], np.float32)
n_effective_i = np.ones([max_subset], np.float32)
counter = np.arange(max_subset)
i_t = np.floor(counter * time_size / max_subset).astype(int)
for i, t in enumerate(i_t):
    print(i)
    try:
        g_i[i] = timeseries.statisticalInefficiencyMultiple(series[:,t:], fast=fast)
    except:
        #g_i[i] = (time_size - t + 1)
        raise Exception("Unable to run statisticalInefficiencyMultiple")
    n_effective_i[i] = (time_size - t + 1) / g_i[i]

# Run code in MultiStateSamplerAnalyzer.get_equilibration_data()
t0 = 1
# Discard equilibration samples.
# TODO: if we include u_n[0] (the energy right after minimization) in the equilibration detection,
# TODO:         then number_equilibrated is 0. Find a better way than just discarding first frame.
n_effective_max = n_effective_i.max()
i_max = n_effective_i.argmax()
n_equilibration = i_t[i_max] + t0 # account for initially discarded frames
g_t = g_i[i_max]
print(n_equilibration, g_t, n_effective_max)

Output:

1 11697.602 7.69397

I then generated a d_u/d_lambda replica timeseries and fed it through the above code snippet. Here is the output:

1 902.0917 9.9779215

@ijpulidos
Contributor

Closing via choderalab/openmmtools#610
