Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Request for comment: Ebisu v3 API #58

Closed
fasiha opened this issue Sep 20, 2022 · 16 comments
Closed

Request for comment: Ebisu v3 API #58

fasiha opened this issue Sep 20, 2022 · 16 comments

Comments

@fasiha
Copy link
Owner

fasiha commented Sep 20, 2022

Ebisu v3 request for comment (RFC)

Introduction

This issue contains a proposal for a future v3 release of Ebisu, with the goal of inviting feedback to improve the design before it is released.

V2 recap

A quick recap of the current version of Ebisu: v2 models each flashcard's probability of recall as a probability distribution that is valid at a certain time in the future. (Stats nerd note: when you first learn a card, its probability of recall t hours in the future is a Beta distribution with parameters a and b.) At any given time in the future (not just t), you can call ebisu.predictRecall to get the estimated recall probability for this flashcard. Doing this for each flashcard lets you pick which flashcards have the lowest recall probability, and you present one of those to the learner. Then, you call ebisu.updateRecall with the quiz's result (binary pass/fail, binomial passes and fails, noisy pass/fail), which updates the Beta distribution in light of the quiz result.

V3's goal

The main goal of v3 is to address the issue that various folks have raised over the years but that I finally understood thanks to @cyphar in #43 (which has links to the original Reddit thread): Ebisu v2 has a fundamental flaw in how it models your memory, in that it ignores the fact that, in the real world, the act of reviewing a flashcard actually changes the underlying strength of the memory. Instead, Ebisu v2 assumed that your memory of a flashcard was a fixed but unknown probability, and updated its belief about that probability's distribution after each quiz. This made it appear that it was strengthening or weakening the memory model but when we tried to infer the ideal initial halflife for real-world flashcard histories, Ebisu v2 insisted on absurdly high estimates of initial halflife.

In practice, this means that the actual recall probabilities predicted by Ebisu v2 were extremely pessimistic, and flashcard apps that attempted to schedule reviews based on a minimum acceptable recall probability had to set those thresholds unrealistically low. I didn't realize this was a problem because flashcard apps I wrote on top of Ebisu just used its predicted recall to rank flashcards, and ignored their actual values; and since the predicted recall went up or down as I passed or failed quizzes, I didn't notice that the halflife was growing very slowly. Again, thanks to @cyphar and others who patiently explained this to me repeatedly until I finally saw the problem 🙏!

Boost

Ebisu v3 is a total rewrite. It introduces the concept of a "boost" to Ebisu, which is very similar to Anki's "ease factor": each time you review a card, Ebisu will apply a Bayesian update based on the quiz's result, but then for each successful quiz it will boost the resulting probability distribution by a flashcard-specific scalar.

For example, suppose the halflife of a flashcard was a day, and you pass a quiz after two days, and suppose Ebisu v2 would simply update the halflife to four days. The Ebisu v3 model for this flashcard would have not only a probability distrubtion for the halflife (one day) but also for the boost, so if the mean boost value for this flashcard was 1.5, under Ebisu v3, updateRecall would yield a model with halflife 4 * 1.5 or six days. (I'm being very hand-wavy here for purposes of illustration.)

An important part of boosting is that it is applied only for successful quizzes after a big enough delay. This should make sense: imagine if you reviewed that flashcard with one day halflife after just ten minutes, and that Ebisu v2 would update its halflife to, say, 1.01 days. We wouldn't want Ebisu v3 to boost the halflife by 1.5 leaving us with a halflife of 1.01 * 1.5! We'd want v3 to keep the halflife around 1.01, because it's unlikely that a quiz after ten minutes would significantly alter our neural memory of this fact.

Ebisu v3 therefore has the following nominal psuedocode:

def updateRecallBoost(oldHl: float, elapsedTime: float, quizResult: bool, boost: float, LEFT: float, RIGHT: float) -> float:
  updatedHl = updateRecallSimple(oldHl, elapsedTime, quizResult)

  if quizResult:
    b = lerp([LEFT * oldHl, RIGHT * oldHl], [1, boost], elapsedTime)
    # clamp
    if b < 1.0:
      b = 1.0
    elif b > boost:
      b = boost
  else:
    b = 1.0

  boostedHl = b * updatedHl
  return boostedHl

def lerp(xs: float[], ys: float[], x: float) -> float:
  mu = (x - x[0]) / (x[1] - x[0])
  return (y[0] * (1 - mu) + y[1] * mu)

where lerp([x1, x2], [y1, y2], x) is linear interpolation of x along the line between points (x1, y1) and (x2, y2). In words:

  1. calculate the new no-boost Bayesian-updated halflife based on a quiz result and quiz elapsed time,
  2. calculate the boost factor:
    • 1.0 for quiz failures and elapsed times short relative to the old halflife,
    • the full boost factor for elapsed times that are long relative to the old halflife, and
    • between 1.0 and the full boost for intermediate elapsed times, linearly interpolating.
  3. Multiply the two values to get your final new halflife.

LEFT and RIGHT are constants that you pick to control what counts as "short" vs "long" relative to the old halflife. In practice, I've been experimenting with LEFT=0.3 and RIGHT=1.0 and these work well.

This RFC

With now two things to keep track of for each flashcard (its halflife and its boost), the update step is more complicated in v3 than v2: it is broken down into two functions that app authors will need to call, because of the computational burden. However, in v3, the predict step (predictRecall) is a lot faster than in v2 and in fact can be done in SQL. Hopefully the more complex mathematics and API are worth it for more realistic estimates.

In the remainder of this issue, I'll

  1. briefly sketch out the new probablility model (since GitHub doesn't render LaTeX, there will be only very basic equations), but more importantly,
  2. specify the functions in the API and how they work in considerable detail.

My main goal is to get feedback on the second point above from app authors who use Ebisu. Does the proposed API feel good? Are there flaws in it? Are there other nice-to-haves that I haven't thought of? Do you want things renamed?

V3 statistical model

This section can be skipped by folks who don't care about the math. For those that do, this section will only be a sketch because I can't use LaTeX math in a GitHub issue. The details will be spelled out on the main Ebisu website when v3 is released, but I'm of course happy to answer questions about this now during the RFC phase as well.

Before getting to v3's statistical model, recall that v2's statistical model is, for each flashchard:

(recall probability in t hours in the future) ~ Beta(a, b),

that is, a Beta random variable parameterized by a and b.

In v3, we instead go a level higher and place the prior probability distribution on the halflife directly:

halflife ~ Gamma(a, b),

We also place a probability distribution on boost:

boost ~ Gamma(aBoost, bBoost).

That is, we have two Gamma-distributed random variables.

Predict

The predicted recall probability at time t hours in the future is a very simple arithmetic expression:

logPredictRecall ∝ -t / E[halflife]
predictRecall = exp(-predictRecall)

The symbol means "proportional to", so there are some extra factors I'm omitting here that aren't important to the explanation here but are needed to line up the units of t and halflife, and to convert exp's base to 2 (so that predictRecall for t = halflife is 0.5 = 2**-1 instead of exp(-1)).

Note that the predict recall step is really simple. To find the flashcard with the lowest predicted recall, find the flashchard with lowest -t / halflife, which can be done in SQL and cached and etc.

In the README for Ebisu v2, I invoke Jensen's inequality to explain why this is technically inaccurate: because E[f(x)] ≠ f(E[x]), the exact expected probability of recall E[2**(-t / halflife)] will be different from the above, which is 2**(-t / E[halflife]). However, I've decided for v3 that this inexactness is a good exchange for the computational simplicity and boost-based modeling.

Update step 1: update halflife

When we have a quiz result, after time t has elapsed since the last review, we can apply a Bayesian update on halflife:

  • original prior: halflife ~ Gamma(a, b)
  • exact posterior: halflife | quiz after t hours ~ SomeComplicatedDistribution (we have moments of this, not the exact density)
  • approximated posterior: halflife | quiz after t hours ~ Gamma(a2, b2)

So far this is the exact same architecture as Ebisu v2, except with prior on halflife instead of recall probability (and therefore different integrals).

However, at this stage we apply the boost:

  • given a mean posterior halflife = E[halflife | quiz after t hours],
  • compute a deterministic value boostValue = clampLerp([LEFT * E[prior halflife], RIGHT * E[prior halflife]], [1, E[boost]], t), where clampLerp is the linear interpolation clamped so 1 <= boostValue <= E[boost], before finally
  • scaling the posterior yielding the final boosted halflife, boostValue * posterior ~ Gamma(a2, boostValue * b2)

That is, we've updated our probability distribution on the halflife in response to a quiz and applied a boost (a little bit of boost for small t, the full boost value for large t).

In this way, we can see that we're just updating our probability distribution around the initial halflife, before the first quiz, and after each successive quiz, we update our distribution about that initial halflife. The halflife after several quizzes is a random variable that's fully specified by the initial halflife random variable after it's been scaled by a sequence of these clampLerped boost values.

However, this leaves open the question: how do we estimate boost? How do we update our probability distribution from boost ~ Gamma(aBoost, bBoost)?

Update step 2: update halflife and boost

This section is the primary innovation of Ebisu v3 over v2, mathematically. After we have three or more quizzes (though you could do this after just two quizes), we can update our probability distributions about both (a) the initial halflife and (b) the boost.

To do this, we use two simple techniques in sequence.

First, curve fit. We evaluate the posterior initial halflife, boost | several quizzes for a grid of values in the initial halflife × boost plane and curve fit this two-dimensional posterior surface to two independent Gamma random variables. This is readily doable because we assume each quiz is independent, so the overall posterior is simply the product of individual likelihoods and priors after each quiz.

We curve fit because, while we have a simple expression for the posterior (up to a scalar constant), we can't analytically evaluate this bivariate distribution's moments. (In Ebisu v2, we had a univariate posterior on recall probability with thankfully tractable integrals yielding analytical moments.)

The curve fit is just weighted least-squares, i.e., a very fast solution to an overdetermined system of equations. It is essentially solving A x = b for a tall and skinny matrix A, and where the unknowns x are the four parameters of the two Gamma random variables (initial halflife and boost).

However, the resulting curve fit often fails to properly capture the behavior of the posterior in its tails, so badly that prediction performance drops if we just use these two Gammas as our updated probability distributions.

Therefore, second, we use importance sampling, a Monte Carlo method to obtain moments of a probability distribution which we can analytically evaluate, given samples of some other distribution (the latter is called the "proposal distribution").

Specifically, we use importance sampling get accurate estimates for the moments of the bivariate posterior, given samples from the curve-fit bivariate Gamma distribution we estimated. With relatively few samples (on the order of ~1000), we can get good estimates of the posterior's moments and moment-match these to two independent Gammas, on initial halflife and boost respectively.

Therefore, this update step allows us to update our estimate of the boost factor in a data-driven way, unlike Anki's ease factor which is hardcoded. Given a sequence of quiz results, we can answer the questions, "What was the halflife of this fact when I began learning it? What boost factor has been strenghtening it each time I review it?"

N.B. We need both these steps, and in this order. The curve fit alone doesn't give us accurate enough parameter estimates but it gives a great proposal distribution for use with importance sampling. Without that, importance sampling using the original priors on initial halflife and boost would need many more Monte Carlo samples and would not be computationally tractable. (I tried 🥲.)

N.B.2. While I've described this two-stage bivariate update step as relatively efficient (linear system solver, importance sampling with ~thousands of samples), it is much more expensive than the plain univariate Bayesian update described above, which used the mean boost E[boost] everywhere. Therefore, the API breaks these up into two separate phases, with two update functions in the API: you can run the simpler univariate halflife-only update after each quiz and then maybe once a day or once a week run the heavier-weight bivariate halflife-and-boost updater to refresh your estimate of boost for each flashcard.

We turn to the API secion next. Again, the above mathematics has been terse because I only have ASCII to describe it and I'm hoping to primary get feedback on the API, but I hope it is sufficiently detailed to give a flavor of the computational burdens involved.

Tutorial on importance sampling

Here's a super quick script to show you the value of importance sampling, plus how it's used.

Suppose you want to estimate the mean of the square of the unit uniform distribution, i.e., if u ~ Unif(0, 1), what is E[u**2]?, and you can't do calculus so you want to do Monte Carlo. Super-easy:

import numpy as np
from scipy.stats import norm as normrv
from scipy.stats import uniform as uniformrv

nsamples = 1_000_000
u = np.random.rand(nsamples)

direct = np.mean(u**2)
directvar = np.var(u**2)

You generate a million samples uniformly between 0 and 1, square them, and call np.mean to get your estimate of the mean. Then you can call np.var to get the variance, i.e., accuracy, of that estimate: the lower the variance, the more accurate your estimate, and the more efficient your sampling is (those million samples are doing a good job).

This is not importance sampling, it's just normal Monte Carlo sampling.

Suppose though that you for some reason cannot sample the uniform distribution. Maybe you only have a library to generate Normally-distributed (Gaussian) random samples, so you have randn but no rand. Could you estimate E[u**2] with just randn? Yes. Importance sampling lets you:

n = np.random.randn(nsamples)
indirect = np.mean(n**2 * uniformrv.pdf(n) / normrv.pdf(n))
indirectvar = np.var(n**2 * uniformrv.pdf(n) / normrv.pdf(n))

Here we generate a million samples from the unit Normal distribution, n ~ Normal(0, 1), and again square them, but before calling np.mean to get our estimate, we weight each sample by its importance. That's what the crucial uniformrv.pdf(n) / normrv.pdf(n) factor there is doing: it's giving more weight to samples that are likely to have come from our target distribution (Unif(0,1)) and less weight to samples that aren't.

This will actually work: your indirect estimate of E[u**2] will be correct but (much) higher variance than when you could sample from the Uniform distribution directly.

But we know the unit Normal distribution is an obviously bad proposal distribution for the unit Uniform distribution—over half the samples will just be thrown away, since negative samples get an importance factor of 0. We can try one more experiment: let's use Normal(0.5, σ=0.5) as our proposal. This should be a good deal more accurate than the unit Normal:

n2 = normrv.rvs(loc=0.5, scale=0.5, size=nsamples)
indirect2 = np.mean(n2**2 * uniformrv.pdf(n2) / normrv.pdf(x=n2, loc=0.5, scale=0.5))
indirect2var = np.var(n2**2 * uniformrv.pdf(n2) / normrv.pdf(x=n2, loc=0.5, scale=0.5))

Printing out the estimated mean and the variance (inaccuracy) of each estimate:

print("| method | estimated mean | estimator variance |")
print('|--------|----------------|--------------------|')
print(f'| direct | {direct:0.4f} | {directvar:0.4f} |')
print(f'| crappy proposal | {indirect:0.4f} | {indirectvar:0.4f} |')
print(f'| better proposal | {indirect2:0.4f} | {indirect2var:0.4f} |')

yields

method estimated mean estimator variance
direct 0.3328 0.0888
crappy proposal 0.3340 0.6122
better proposal 0.3332 0.2182

As expected, the middle column all report the same estimate, 1/3. However, whereas the straightforward Monte Carlo estimator had a low variance of 0.09, which is the best you're going to get given you had access to rand, the crappy importance sampling estimator using just randn (the unit Normal) as its proposal had a much higher variance, almost 7× worse. But by using Normal(0.5, σ=0.5) as the proposal, we reduce the inaccuracy of the estimator, whose variance is just ~2.5× the direct estimator.

So, of course ideally you'd be able to do the integral to find E[u**2] (zero estimator variance 🙌!), and failing that, hopefully you can use rand to draw uniform samples and calculate the estimate directly. But failing both, importance sampling gives you a way to convert samples from some rando distribution into moment estimates of the distribution you do care about.

(Obviously the proposal distribution can't be too rando, so for example you couldn't switch the role of the Uniform and Normal above: a unit Uniform proposal totally misses out on huge chunks of the target unit Normal's support. There are rules about what makes an allowable proposal distribution, and about what's the best proposal distribution (the target distribution 🙃) that textbooks bore us with, but the rules are quite sensible.)

Hopefully the above convinces you that there's nothing magical about using it to improve our posterior fits. The curve-fit posterior is like the "better proposal" above. The "crappy proposal" for importance sampling might be the original priors on halflife and boost (except after a dozen quizzes, the variance of the estimates would be enormous without billions and trillions of samples).

V3 API

Model

Ebisu v2 had a very compact data model. Each flashcard was just a 3-tuple, [float, float, float]: two numbers representing the Beta random variable's parameters and the elapsed time at which that Beta applied. It didn't keep track of past quizzes because that 3-tuple was a "sufficient statistic": as far as the math was concerned, everything about the past was encoded in that 3-tuple.

Because the Ebisu v3 algorithm described above needs to know about past quiz times and results to re-estimate boost, the Ebisu v3 data model is much bigger, and now includes past quizzes.

See Python `dataclass`es for Ebisu v3 model

While the key-value structure of the data model is probably an implementation detail and not relevant for the majority of readers of this RFC, for completeness I include it here, with the caveat that it may change.

The top-level model has three sections:

@dataclass
class Model:
  quiz: Quiz
  prob: Probability
  pred: Predict

One section for a list of quiz results and times:

@dataclass
class Quiz:
  elapseds: list[list[float]]

  # same length as `elapseds`, and each sub-list has the same length
  results: list[list[Result]]

  # 0 < x <= 1 (reinforcement). Same length/sub-lengths as `elapseds`
  startStrengths: list[list[float]]

A second section for the parameters of the probability distributions:

@dataclass
class Probability:
  # priors: fixed at model creation time
  initHlPrior: tuple[float, float]  # alpha and beta
  boostPrior: tuple[float, float]  # alpha and beta

  # posteriors: these change after quizzes
  initHl: tuple[float, float]  # alpha and beta
  boost: tuple[float, float]  # alpha and beta

And a final section for computed values that are useful for predictRecall (whether that's done in Python or in SQL or whatever):

@dataclass
class Predict:
  # just for developer ease, these can be stored in SQL, etc.
  # halflife is proportional to `logStrength - (startTime * CONSTANT) / halflife`
  # where `CONSTANT` converts `startTime` to same units as `halflife`.
  startTime: float  # unix epoch
  currentHalflife: float  # mean (so _currentHalflifePrior works). Same units as `elapseds`
  logStrength: float

For logStrength and startStrengths, see discussion below on reinforcement strength.

Initialize

def initModel(initHlPrior: Union[tuple[float, float], None] = None,
              boostPrior: Union[tuple[float, float], None] = None,
              initHlMean: Union[float, None] = None,
              initHlStd: Union[float, None] = None,
              boostMean: Union[float, None] = None,
              boostStd: Union[float, None] = None) -> Model: pass

You're expected to provide either initHlPrior (a 2-tuple [α, β] of the Gamma random variable) or both initHlMean and initHlStd representing the mean and the standard deviation of the Gamma random variable representining the initial halflife.

Similarly for boost.

Predict

There are actually two functions here:

def predictRecall(model: Model, elapsedHours=None, logDomain=True) -> float: pass
def _predictRecallBayesian(model: Model, elapsedHours=None, logDomain=True) -> float: pass

The official predictRecall returns 2**(-t / (mean halflife)), which is an approximation of the recall probability at time t in the future. It's mathematically inaccurate but very fast and convenient to find the mean of the halflife and then put it through Ebbinghaus' exponential forgetting function. I expect this function to be portable to SQL, etc., because it's just a fast algebraic expression (addition, multiplication, division).

For fun, I plan to include _predictRecallBayesian which computes the exact expected recall probability (like Ebisu v2 does). The leading _ means I don't expect this function to be in ports to other languages. This function computes an expensive Bessel function of the second kind, the analytical solution to the integral that yields the exact recall probability.

Update halflife

I've decided to reuse the updateRecall name in Ebisu v3 for the "quick" update step that assumes the boost is fixed and just applies a quiz result:

def updateRecall(
    model: Model,
    elapsed: float,
    successes: Union[float, int],
    total: int = 1,
    now: Union[None, float] = None,
    q0: Union[None, float] = None,
    reinforcement: float = 1.0,
    left=0.3,
    right=1.0,
) -> Model: pass

Note that like Ebisu v2, the above function supports noisy quizzes and binomial/binary quizzes. Also note two new v3-only parameters: left and right, which control how much boost to apply. Nominally, if elapsed < left * (current halflife), no boost is applied. If elapsed > right * (current halflife), the full boost is applied (whatever it may be, recall it has a probability distribution that you initialized or that was computed from data). For values of elapsed in between, the boost is scaled linearly.

This function needs to be called whenever the student completes a quiz. But as alluded above, it only updates the halflife, and takes the boost as fixed.

For more on what reinforcement means, see below on reinforcement strength.

Update halflife and boost

The "non-quick" update function will update its belief about both the halflife and the boost.

def updateRecallHistory(
    model: Model,
    left=0.3,
    right=1.0,
    size=10_000,
) -> Model: pass

If we had unlimited computing power, we'd run updateRecall and then immediately run updateRecallHistory so we always had the most accurate estimates. However, this updateRecallHistory is an expensive function, much moreso than the old v2 updateRecall, because it is looping over all quizzes. Specifically, as described in the math section above, it's

  1. evaluating probabilities on a two-dimensional grid of halflife and boost for each quiz, then
  2. solving a linear system of equations, before finally
  3. an importance sampling step (the number of Monte Carlo samples is governed by the size argument).

All this should be totally feasible on mobile/browser, but I do expect apps will opt to run this as a batch function, for example,

  • run this daily for flashcards that were reviewed today, or
  • run this every 2–4 quizzes.

If you initialized the boost prior reasonably, there shouldn't be a pressing need to rerun this function after each quiz.

In the future, some smart person might enlighten me how we can make this estimation step better (or how we can change the entire model to have a better estimation step), but for now, I feel that this is a reasonable balance between model accuracy, prediction speed, and update complexity.

Reset halflife

Ebisu v2 supports manually rescaling the halflife (rescaleHalflife in v2) of a model, for those situations where you just know the model has over- or underestimated your memory. We'd like to support this in v3 as well.

In Ebisu v3, since a flashcard's data model has all past quiz results, you may think that it'd be straightforward to just tweak the initial halflife's priors and re-run updateRecallHistory, but after you get several quizzes, the initial priors are often relatively powerless, with the posteriors almost entirely driven by the data (the quizzes). This is a well-known outcome in Bayesian statistics.

Therefore, the proposed function is called "reset" because it will literally start a new chapter in your history with this flashcard with a new prior on halflife:

def resetHalflife(
    model: Model,
    initHlMean: float,
    initHlStd: float,
    startTime: Union[float, None] = None,
    strength: float = 1.0,
) -> Model: pass

The model will still contain all past quizzes, but this function sets those aside and reinitializes a new list of elapsed times and quizzes, so all future quizzes get the initial halflife you ask for here.

This way, if after several quizzes, you need to rescale/reset the halflife of a flashcard, this lets you do that. No hard feelings.

Ideally someday we'll be able to analyze each flashcard's history of quizzes and automatically detect when there's a "new halflife", i.e., when your memory for this fact got much better or much worse. Sort of like the example in Bayesian methods for hackers chapter 1, "Inferring behaviour from text-message data".

That's it. That's the API.

Reinforcement strength

Throughout the discussion above, I've alluded to startStrength and logStrength and reinforcement. This is to support partial reconsolidation or reinforcement, raised in #51 by @jasonsparc. The idea is, Ebbinghaus' exponential decay assumes that, after each review, your memory of some fact jumps to 100%: probability of recall 2**(-t / halflife) = 1.0 when t=0 (right after you've reviewed this fact, probability 1 = 100% recall probability). @jasonsparc's idea is: what if this isn't true? What if you know that the quiz only partially reconsolidated the memory, i.e., that at t=0.0001 (a millisecond after your last quiz), the probability of recall is not .9999 but something much less, maybe 0.5, maybe 0.1?

Ebisu v3 tentatively supports this by allowing you to specify how much reconsolidation you think has happened every time you call updateRecall. This number is taken as deterministic and fixed, just like the quiz result or elapsed time, and it flows through the update process as if it was known. We'll want to experiment with this to make sure it's working. For users who aren't interested in playing with this feature, the API defaults to sane values.

Conclusion

The code and tests are written. You can find them (along with a ton of irrelevant code I wrote while getting to this point) at https://github.com/fasiha/ebisu-likelihood-analysis/, specifically ebisu3.py and test_ebisu3.py.

I've prepared a script that can

  1. load flashcard review history from Anki's collection.anki2 files (if you export your Anki deck to an apkg file and unzip that, this collection.anki2 file, which is a SQLite database file, will be inside), and
  2. run a couple of cards through both Ebisu v3 and a Stan model.

The Stan model will likely mostly be useful for those interested in changing the model or putting hyperpriors on left and right and other parameters, or verifying Ebisu v3's correctness (there's a difference in the output between these two that I'm working on tracking down).

But mainly I'm hoping that interested parties provide feedback on the API above. And of course any other feedback (the mathematical model for example) is also most welcome.

Many thanks for your patience with v3. This is several months late and I am grateful for the support and feedback I've received so far 🙇.

@fasiha
Copy link
Owner Author

fasiha commented Sep 21, 2022

(Placeholder for edits, to-do lists, etc.)

@fasiha fasiha mentioned this issue Nov 16, 2022
@BieniekAlexander
Copy link

Is this still being worked on? I'm putting Ebisu in my project, and I'd like to chip in.

@fasiha
Copy link
Owner Author

fasiha commented Jan 28, 2023

@BieniekAlexander thanks for the comment! Yes haha, I am doing a lot of work on Ebisu these days and should have posted an update here sooner.

The proposed model above (Gamma on half-life with boost) is good. It produces good predictions for recall probability on real data, and I've written detailed tests and a lot of the README for it (they're in the v3 branch where you can read the README), but I felt it was too complicated in a few places which made me revisit the literature from Mozer et al (their 2009 paper on MCM and their 2014 paper on DASH). Their emphasis on power law decay instead of exponential was inspiring so I designed a totally different proposal for v3 😅, based on Mozer's team's work called "leaky integrators". I've been pretty happy with it, I've mostly completed the README and the code in this branch: https://github.com/fasiha/ebisu/tree/v3-leaky-integrators#ebisu-intelligent-quiz-scheduling I'm going to resist trying to describe it here, but here's a direct link to the data model section of the README that I think is quite concise. It's simpler than the Gamma-on-halflife proposal above, it's faster, on average it seems to produce just as good predictions of quiz results as the Gamma-on-halflife, plus it doesn't suffer from #52.

Writing the details of a model's math and code like this (in Knuth's "literate programming" style) is powerful because it makes you confront the rough edges, the infelicities in your design and implementation choices 🫠. In the last couple of days I've been thinking of how to potentially simplify this leaky integrator proposal even further. I know the risk of the never-finished project so I'm weighing the value between continuing doing research on potential improvements/alternatives versus just publishing what I have.

@jyboy
Copy link

jyboy commented Feb 9, 2023

I'm excited to see a new path to memory with SRS. It's awesome to model the law of memory with math.
As an Anki user, I'm wondering if you have plan to support Anki custom scheduling in the future? As I noticed there was a comment on Ebisu by someone before which must be bias.

I'm asking this here because it is a small question and related to the roadmap (and vision) of Ebisu v3. I really hope there will be many good options of scheduler in Anki. Anki users will also provide effective feedback and even participate in the contribution.

image

@fasiha
Copy link
Owner Author

fasiha commented Feb 10, 2023

Thanks for writing @jyboy!

Hmm, for sure I can help port Ebisu to the Anki custom scheduler, once v3 is released. I see that Anki is still attached to the concept of a "scheduled review" (scheduled_secs and scheduled_days in backend.proto from your link), which Ebisu can accommodate by figuring out what future time corresponds to the recall probability dropping to 50% or 80% or whatever user-settable probability.

For the visually impaired, here's the conversation in the screenshot:

2021/03/07 19:44: I heard the new SRS algorithm (ebisu) handles underdue and overdue well. Anyone with knowledge in the matter?
2021/03/07 19:47: Theoretically it does.
Practically, it hasn't really been used much, and the author didn't really intend to have the algorithm used in classical SRS from what I can tell.
(This has been told to me by a Supermemo enthusiast who actually was in contact with the Ebisu dev) Only measurement can tell if Ebisu > Anki SM-2, but I would bet my money that SM18 is still easily better than Ebisu. On the other hand, don't care too much about the algorithm, rather focus on learning the right things the right way, is what I'd say

Honestly, I agree with this—I really doubt flashcard scheduling is the reason users quit Anki/Memrise or stay with Duolingo, it's really all about the content, presentation, and quiz interactivity. There have been some correlative studies that note students' test scores rise as a result of practicing with a certain SRS (MaiMemo example/paper, DASH example, Duolingo HLR paper, etc.) but I'd love to see third-party replications, especially with something more meaningful than semester exam scores? But until then I'm confident that SRS is just a secondary or tertiary effect on motivation and performance.

But I am confused by comparisons of probabilistic algorithms like Ebisu or half-life regression or DASH to Anki SM-2—their APIs are so different.

Specifically, consider that Anki's scheduler outputs just a single value, the future due date. "This fact needs to be reviewed in 14.1 days." When you inevitably over-review (review a flashcard sooner than it's due date) or under-review (skip reviewing on that due date), there are finely-tuned rules to decide what to happen—rules that take time to develop and test and refine.

Meanwhile, consider that Ebisu's output is a function, a mapping from time to recall probability. In crude terms, rather than a single number ("14.1 days"), Ebisu outputs a very long array of (time, recallProbability) pairs. The issue with over-reviewing and under-reviewing doesn't arise, not because of any special Ebisu magic, but just because Ebisu (like any probabilistic algorithm) deals with models, which are bigger and more capable mathematical objects than a single number output by Anki.

Having a model is why Ebisu is able to seamlessly handle more complex quiz types as inputs:

  • "the student read this paragraph and did NOT click on any word to see its definition; I [the quiz app] estimates Probability(they know the word) = 0.8, Probability(they don't know the word) = 0.3" (Language Learning with Netflix/Language Reactor uses Ebisu to handle this)
  • "The student just produced the pronunciation from the kanji, but then we showed them (1) kanji, (2) pronunciation, and (3) definition, so all three memories are strengthened, but how much do we delay each?"
  • without any rules, just by virtue of using a probability model, Ebisu is able to guarantee that if you review a card once a week, your memory for that card grows beyond a week only slooowly. This matches psychological research that shows inter-review interval tightly governs future recall probability.

The only input Anki SM-2 has is fail/hard/easy/pass, but I am sure the Anki/Supermemo communities can and likely have figured out rules to adapt their algorithms to a similarly rich array of quizzes needed by reading apps and correlated cloze deletions—those communities are energetic and brilliant and motivated. However, I don't think it's controversial to say that tuning SM-2 rules for the above types of quizzes takes time and effort. Ebisu and similar probabilistic algorithms, by virtue of leveraging statistical models, can often make it a lot easier to quickly innovate new quiz formats/content structures/learning styles, by offloading the bookkeeping of memory tracking.

Now. It's entirely plausible, like the commenter says, that carefully-tuned SM2-like rules will eventually outperform probabilistic models, but hopefully in the interval, libraries like Ebisu allow quiz app authors to explore a huge chunk of the space of all possible quiz apps. As quiz app authors ask, "how can I incorporate sound recordings for pronunciation practice in my SRS", having a statistical model in place and ready to tap the large library of applied math tools will give some of those quiz app authors a head start.

Aaaaaall that said… please don't be too impressed with math 😆, math can lead you very astray even when it's not wrong, because it's so often the case that simple ad hoc methods (like SM-2!) wind up being more effective than complicated mathematical algorithms! Conor McBride recently described this quite graphically but accurately on Mastodon:

There is a school of paper-writing which amounts to doing a very dense logical shit on your reader's doorstep and then running away. These people want to tell us they understand things. They do not want us to understand things, or if they do, they do so only as an aspiration unenacted. Of course, dense doorstep defaecation is standard mathematical practice. https://types.pl/@pigworker/109792088887875922

🤣 gross but true! I have worried that Ebisu will fall into this bucket since day one, and it continues to be a fear, but so far I've been really pleased with the kinds of apps Ebisu is helping folks build 😁.

Sorry for the long boring post, hopefully it was more helpful than confusing.

@fasiha
Copy link
Owner Author

fasiha commented Feb 10, 2023

Quick update on Ebisu v3: this is coming along nicely! It's a much simpler model than the Gammas-on-boost-and-halflife described above but just as capable! I will probably close this issue and either open a new request for comment (RFC) issue or, if I can wrap up the README and tests quickly, just publish a release candidate (RC) soon—hopefully in a week or two.

@BieniekAlexander
Copy link

@fasiha I'm only now trying to catch up on algorithms and models being used for recall training - the reason that I moved my project's implementation from sm2_anki to ebisu is because sm2 doesn't seem to have a clean answer to pull-based fact studying. Now that I'm using Ebisu for my project, the outstanding concern that I have is Ebisu v2's propensity to converge to a specific half-life (rather than discounting the influence of earlier study sessions over time) - I'm excited to see if v3 has an answer to this problem!

@fasiha
Copy link
Owner Author

fasiha commented Feb 22, 2023

Ebisu v2's propensity to converge to a specific half-life (rather than discounting the influence of earlier study sessions over time)

@BieniekAlexander do you have an example of this? It might help redirect me but here's what I tentatively think.

  1. This might be slightly ameliorated in Ebisu v3 because it has a more accurate estimate of memory decay: early failures will be expected so will be less likely to permanently drag down the half-life growth as the flashcard matures.
  2. But at the same time it's plausible that this doesn't help that much. In Ebisu's Bayesian formulation, all experiments (quizzes) are equally "valid". There's nothing in the formulation to deweight "old" data. I have thought about this—it would be nice to have some automated way of "change detection" where the algorithm automatically notices that your memory for a flashcard became much better after some time (for example, see the text message time series example in Probabilistic Programming & Bayesian Methods for Hackers), but I haven't thought of a nice way to model this. Are we looking for specific times when your memory becomes much stronger or weaker (because you discovered a powerful new mnemonic, or you learned a bunch of interfering confuser flashcards)? Should we just deweight older flashcards?

In this RFC I did propose a resetHalflife function that would just ignore all older quizzes and initialize a new prior for future quizzes. It might be acceptable in many situations, but I totally agree it's a nuclear option.

I'll think about how to maybe build a preference for newer data into the algorithm. It's not clear to me how to do this right now, so I'm happy to get advice.

I know this is probably not what you wanted to hear but I hope this makes sense at least?

@BieniekAlexander
Copy link

BieniekAlexander commented Feb 23, 2023

@fasiha

@BieniekAlexander do you have an example of this?

Actually, I may very well be wrong about that assertion that I made. Let me cite some things that led me to believe my claim. I don't have an extremely strong stats background, so feel free to question my points. To restate the claim: I have the impression that as the user studies more, the beta prior will converge on some ideal half-life, which would also mean that a fact's half-life will become harder and harder to update as it is studied more and more. Here are my references;

  • Here's a blogpost that I read which explains the beta distribution through a baseball analogy. As I undestand it, the initial model (α=81, β=219), with some degree of confidence, asserts that a given batter is expected to bat with a .27 average. I believe this initial model represents a prior of some sort of "average baseball player". For a given batter, the model is updated according to each batting attempt, and with each update, the model more and more represents the batting ability of that given player.
  • Here in this post, you say that "Ebisu v2 has a fundamental flaw in how it models your memory, in that it ignores the fact that, in the real world, the act of reviewing a flashcard actually changes the underlying strength of the memory. Instead, Ebisu v2 assumed that your memory of a flashcard was a fixed but unknown probability, and updated its belief about that probability's distribution after each quiz.".

In summary, when I talk about the model converging to some sort of half-life, I think that's a corollary to your statement here that "Ebisu v2 assumed that your memory of a flashcard was a fixed but unknown probability" (though maybe I'm mincing words).

Anyway, you mention a resetHalflife function to account for this. That makes me a bit concerned because part of the reason I initially dug into Ebisu was because of the simplicity of the interface. I initially made an implementation on top of Anki's SM2, but I was turned off by the complexity of the interface. It's not extremely complicated, but I like the prospect of a really elegant solution that relies on just a few function calls. To address this problem of convergence, instead of additions to the interface, I suppose I would hope that the model somehow discounts the older study attempts and more heavily weights the more recent attempts. Given my background in Deep Learning, I'm aware of some metrics such as Adam designed to, in a fast (maybe constant?) time evaluation, recalculate a metric with a decay of older data.

@fasiha
Copy link
Owner Author

fasiha commented Feb 25, 2023

Thanks for the detailed reply @BieniekAlexander!

The baseball analogy is useful! The classic Bayesian intuition for the Beta distribution is that it's indeed a model of a weighted coin whose underlying probability of coming up heads you don't know. As you flip the coin, α and β represent the number of times it comes up heads or tails respectively—Ebisu v2 treats "heads" as "quiz success", and adds the extra twist of exponential decay. If you initialize a Ebisu v2 model with initial half-life of one week and do quizzes every week, the (α, β) that Ebisu estimates are exactly this, and will "converge" to the odds that you get the quiz right after a one week delay.

Of course, since your memory is actually strengthening every week, the model will never converge 😅 and Ebisu v2 won't really settle down (and that's without adding the complexity of exponential decay of memories and the fact that you do these quizzes at arbitrary intervals, not just a week).

To your second point, I totally agree resetHalflife is an escape hatch, using it is basically admitting that the model was initialized wrong, and that's overpowering the data, and just needs to be "rebooted".

Any kind of temporal deweighting has to be done carefully because:

  • sometimes (maybe often?) there is no need to discard or deweight old quizzes because they are from the same probabilistic regime as recent quizzes, and using all your history yields more accurate predictions. Admittedly, I have no way of knowing how often this is the case so this is a weak point.
  • Another lazy point is, Ebisu v2 and v3 (not the boost proposal above but what I am eventually hoping to ship) have a Markov property, in that the update for the tenth quiz P(h | x_1, x_2, ..., x_10) simplifies to P(h | x_10), i.e., the model contributions from the 1st to the 9th quizzes are all rolled up into a single distribution. And the reason we have this convenient property is because each quiz is independent. If we deweight old data, I'd lose this nice theoretical and practical feature. That's not a reason to not do this! But it does give me reason to think carefully about how we do this.
  • Ideally, I'd love to be able to detect when your memory of a flashcard has suddenly strengthened or weakened. This is a "clustering" problem—there are contiguous clusters of quizzes from the same underlying probabilistic regime, and then the regime changes, and we have to estimate its parameters. Instinctively I want to use the Dirichlet (stick-breaking) process to model this (related to the Chinese restaurant process), but, I expect this is hard to do accurately with just 10-20 quizzes 😅 so I haven't spent a ton of time thinking about this.

All that said, I think v3, which tries to capture the fact that your memory of a fact improves each time you see it, will not get locked into a stale half-life. It remains to be seen whether v3's model for memory improvement is good 😆.

@BieniekAlexander
Copy link

BieniekAlexander commented Mar 29, 2023

@fasiha

Of course, since your memory is actually strengthening every week, the model will never converge 😅

Yeah, I am incorrect and it won't converge because you're hopefully repeatedly telling the model that you're right, so it'll continue pushing forward the halflife. I suppose the thing I'm trying to caputre instead, then, (and what I had in mind when I used the term "converge") is that the rate at which the recall model updates updates in a manner that equally weights your studying in the past to your studying at the current time. It's been a minute, and I don't recall why I had this intuition, so what I should do is experiment with the model a bit and compare how the halflife actually changes against how I might expect it to change.

I'll play around with it soon!

@BieniekAlexander
Copy link

@fasiha I threw together some visualizations, and I think my suspicions were correct. First, the plots:

In this first example, the fact is updated after the previous halflife has elapsed. The term is guessed correctly some N times, incorrectly some N times, correctly some N times, etc.
image

In this second example, the fact is updated 24 hours after the previous study session. The guess correctness is identical to that of the previous session.
image

I didn't put too much thought into the generation of study timing and study success, but I don't think they're important. I also don't understand how the recall updating works, but I think I'm seeing the following property of the updates:

  • When the fact is recall successfully, it appears that the halflife is updated in some linear fashion.
  • When the fact is not recalled, it appears that all future halflife changes have a smaller magnitude (all else equal).

With this behavior, my concern is: If I struggle to recall a fact at some point in my study history such that halflife updates are minor, future halflife updates of the fact will always be small. I think it makes sense that the model does not update the recall as much when they have been failing to recall the term a lot. However, if for some reason I greatly improve my memory of a fact, my fact's halflife will still only increase very slowly.

@ckoshka
Copy link

ckoshka commented Apr 11, 2023

I've been following this project for quite a while now and I'm happy to see v3 finally happen.

I will see if I can port it to rust and from there, compiling it as a wasm function so it can be used from any language implementing a wasm runtime.

I guess the first step is generating test data to ensure it has the same behaviour as the original.

@fasiha are there any non-obvious edge-cases I should be aware of that aren't checked for in the code? as in, "this function will return nonsensical results if param_a is below param_b".

@fasiha
Copy link
Owner Author

fasiha commented Apr 12, 2023

Thanks for the note @ckoshka! Just to make sure I understand your question, you are hoping to port v3 to Rust, is that right? If so, let me try and wrap up the behavior of the algorithm and generate some input/output test data to measure correctness against. (Version 2 has this, but I think you're more interested in porting v3 rather than v2?)

The current v3 algorithm is at in the v3-leaky-integrators branch (README here, also mostly complete). The thing that gives me pause is, I loved how v2 was totally devoid of magic numbers or special parameters: you set your memory model and go. This v3 performs well but has 3-ish magic knobs that one needs to tune with real quiz data to get good performance, over both strong memories and weak memories. Every time I think about releasing v3 I feel disgusted trying to write an explanation for these parameters, because it's basically like saying "🤷 we don't have a great way to mathematically handle this so here's the best we can do" 😅.

But I will persevere! Thank you for your kind note and your patience!

@ckoshka
Copy link

ckoshka commented Apr 17, 2023

v3, that's right. Though I could start with v2 since it might help me understand the changes in context. And ya, I think I know that feeling. Sometimes when I'm just experimenting and iterating on something I land on a combination of params that seem to just work for whatever reason via trial and error. No biggie, I can sit tight for the time being 😅

@fasiha
Copy link
Owner Author

fasiha commented Jun 4, 2023

This RFC is superseded by #62 😁 see you all there!

@fasiha fasiha closed this as completed Jun 4, 2023
@fasiha fasiha unpinned this issue Jun 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants