Stochastic averager inefficiencies #34

Open
nblei opened this issue Mar 23, 2024 · 3 comments

@nblei
Contributor

nblei commented Mar 23, 2024

One comment and one question about `stoch_avg.sv`:

```systemverilog
integer i;
always @(*) begin
    for (i = 0; i < NUM_POPS - 1; i = i + 1) begin
        if (i == 0) sum[i] <= a[i] + a[i + 1];
        else sum[i] <= sum[i - 1] + a[i + 1];
    end
end
```

This infers a linear summation structure, whereas a tree structure would reduce the combinational delay significantly. Perhaps routing is better with the linear structure, but I doubt it.
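
A minimal sketch of what the tree reduction could look like (illustrative only, not the repo's code; it assumes `NUM_POPS` is a power of two and that the inputs are 1-bit streams):

```systemverilog
// Balanced adder tree: combinational depth is O(log2 NUM_POPS) adders
// instead of O(NUM_POPS).
module sum_tree #(
    parameter int NUM_POPS = 8
) (
    input  logic [NUM_POPS-1:0]       a,
    output logic [$clog2(NUM_POPS):0] total   // wide enough for NUM_POPS ones
);
  localparam int LEVELS = $clog2(NUM_POPS);

  // tree[l][j] is the j-th partial sum at level l; level 0 holds the inputs.
  // Entries with j >= NUM_POPS >> l are simply left undriven.
  logic [LEVELS:0] tree [LEVELS:0][NUM_POPS-1:0];

  genvar l, j;
  generate
    for (j = 0; j < NUM_POPS; j = j + 1) begin : g_leaf
      assign tree[0][j] = a[j];
    end
    for (l = 1; l <= LEVELS; l = l + 1) begin : g_level
      for (j = 0; j < (NUM_POPS >> l); j = j + 1) begin : g_node
        assign tree[l][j] = tree[l-1][2*j] + tree[l-1][2*j+1];
      end
    end
  endgenerate

  assign total = tree[LEVELS][0];
endmodule
```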

The question I have is: why use this adding structure in the first place? Why not use an LFSR to index into a mux, thereby evenly sampling the inputs? The adder-based approach requires log(N)+1 flops for the counter and ~3N FAs (1N for the input sum reduction, 2N for adding it to the counter), while the mux approach requires log(N)+1 flops for the LFSR (assuming it requires its own unique LFSR, which it likely doesn't) and an N-to-1 mux.

An N-to-1 mux uses ~2N transistors, while 2N FAs use ~82N transistors.
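
A rough sketch of the LFSR + mux alternative I have in mind (illustrative only; the module name `mux_avg`, the port list, and the 3-bit taps are my own choices, not anything from the repo):

```systemverilog
// A small LFSR supplies the select of an N-to-1 mux, so each input is
// sampled with probability ~1/N. The taps below (x^3 + x^2 + 1) are maximal
// for the 3-bit, NUM_POPS = 8 case; other widths need their own taps.
// Caveat: a plain maximal-length LFSR never visits the all-zero state, so
// a[0] would never be sampled; a real design would remap one state.
module mux_avg #(
    parameter int NUM_POPS = 8
) (
    input  logic                clk,
    input  logic                rst,
    input  logic [NUM_POPS-1:0] a,
    output logic                y
);
  localparam int W = $clog2(NUM_POPS);

  logic [W-1:0] lfsr;

  // Fibonacci-style LFSR; the seed must be nonzero.
  always_ff @(posedge clk) begin
    if (rst) lfsr <= '1;
    else     lfsr <= {lfsr[W-2:0], lfsr[W-1] ^ lfsr[W-2]};
  end

  // N-to-1 mux indexed by the LFSR state
  assign y = a[lfsr];
endmodule
```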

@darsnack
Member

darsnack commented Mar 24, 2024

You are right, a tree structure would be much better.

While the two approaches to averaging are equivalent in expectation, I'm not sure they are the same in terms of their variance.

For the current approach, we have

$$Y = \frac{1}{N} \sum_i X_i \Rightarrow \qquad \mathbb{E}[Y] = \frac{1}{N} \sum_i \mathbb{E}[X_i], \qquad \mathrm{Var}[Y] = \frac{1}{N^2} \sum_i \mathrm{Var}[X_i]$$

For your proposal, we have

$$Y = \sum_i X_i \cdot \mathbb{1}[Z = i] \Rightarrow \qquad \mathbb{E}[Y] = \sum_i \mathbb{E}[X_i] \, \mathbb{P}(Z = i) = \frac{1}{N} \sum_i \mathbb{E}[X_i]$$

$$\mathrm{Var}[Y] = \text{some covariance term} + \sum_i \left( \mathrm{Var}[X_i] \, \mathbb{P}(Z = i)^2 + \mathrm{Var}[\mathbb{1}[Z = i]] \, \mathbb{E}[X_i]^2 + \mathrm{Var}[\mathbb{1}[Z = i]] \, \mathrm{Var}[X_i] \right)$$

$$\approx \sum_i \left( \frac{N - 1}{N^2} \left( \mathrm{Var}[X_i] + \mathbb{E}[X_i]^2 \right) + \frac{1}{N^2} \mathrm{Var}[X_i] \right)$$

It's possible the covariance term is negative, which would reduce the variance, but I think in general the scaling out front is the bigger issue. (I did this math quickly, so correct me if I made some grave error.)
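
As a quick sanity check under a simplifying assumption (all $X_i$ i.i.d. Bernoulli($p$), $Z$ uniform and independent of the inputs; this is an illustration, not the general case):

$$\mathrm{Var}\left[\frac{1}{N} \sum_i X_i\right] = \frac{p(1-p)}{N}, \qquad \mathrm{Var}\left[\sum_i X_i \, \mathbb{1}[Z = i]\right] = \mathrm{Var}[X_Z] = p(1-p)$$

So per cycle the sampled output carries roughly $N$ times the variance of the true average; the independent-product expansion above gives $p(1 - p/N)$, and the (negative) covariance term is what closes the gap to the exact $p(1-p)$.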

Intuitively, we want a computable quantity that is the empirical average of $N$ streams, but sampling only approximates that computation. At least for the BitSAD paper, one of the results related to stochastic averaging units implicitly decorrelating the output before it is fed back. I think this difference will change that result.

Of course, this is a trade-off that might be worth making. Perhaps a noisier average is worth it for large $N$ so that we don't have a gigantic adder tree. So my suggestion would be to add a separate "approximate averaging" unit to the library.

@nblei
Contributor Author

nblei commented Mar 24, 2024

I ran a small experiment to compare the behavior of a mux-based vs. counter-based averager.
All runs have N = 8, and the input streams are Bernoulli processes with each individual RV i.i.d.

When the actual average is 0.9:
[plot: averagers_0_9]

When the actual average is 0.5:
[plot: averagers_0_5]

When the actual average is 0.1:
[plot: averagers_0_1]

This suggests that when the average is large, the higher variance of the mux-based approach does not have a dramatic impact on the output behavior, but when the average is small, the counter-based approach is significantly better.
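
A behavioral sketch of this kind of comparison (illustrative only, not the code I ran; it models the counter-based unit as an accumulate-and-fire counter, which I am assuming matches the intent of `stoch_avg.sv`, and stands in `$urandom_range` for the LFSR select):

```systemverilog
module tb_avg_compare;
  localparam int N = 8;
  localparam int T = 100_000;   // stream length in cycles
  real p = 0.1;                 // Bernoulli probability of each input bit

  logic [N-1:0] a;
  int           s;              // per-cycle popcount of the inputs
  int unsigned  acc;            // accumulate-and-fire state, stays in [0, N)
  longint       ones_ctr, ones_mux;

  initial begin
    acc = 0; ones_ctr = 0; ones_mux = 0;
    for (int t = 0; t < T; t++) begin
      // draw N i.i.d. Bernoulli(p) input bits for this cycle
      for (int i = 0; i < N; i++)
        a[i] = ($urandom_range(999) < int'(p * 1000.0));
      s = $countones(a);

      // counter-based averager: emit a 1 whenever the accumulator crosses N
      acc += s;
      if (acc >= N) begin
        acc -= N;
        ones_ctr++;
      end

      // mux-based averager: output the bit of one uniformly chosen input
      ones_mux += a[$urandom_range(N-1)];
    end
    $display("true mean          : %f", p);
    $display("counter-based mean : %f", real'(ones_ctr) / T);
    $display("mux-based mean     : %f", real'(ones_mux) / T);
  end
endmodule
```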

@darsnack
Member

darsnack commented Mar 24, 2024

I think the small deviations from the true value even on the right side of the curve are the variance dampening at play. It's still true that for smaller true values, this will have a greater effect on the error.

But the gap between the curves as the true values decrease seems more like the rate of convergence in expectation. I would expect you need exponentially more samples to converge to the same relative error rate as the magnitude of the true value decreases. This is an effect that I had not considered, but the counter-based design won't suffer from this since it is deterministic w.r.t. the $N$-term averaging operation.
