ProgPoW: A programmatic (dynamic) Proof-of-Work algorithm tuned for commodity hardware #15

Open · ifdefelse opened this Issue May 10, 2018 · 67 comments
@ifdefelse

ifdefelse commented May 10, 2018

Abstract

We propose an alternate proof-of-work algorithm - “ProgPoW” - tuned for commodity hardware in order to close the efficiency gap available to specialized ASICs.

The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block.

For Zcash - a community based on widely distributed commodity hardware - specialized ASICs enable certain participants to gain a much greater chance of generating the next block and undermine the distributed security.

ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?”

This proposal presents an algorithm that is tuned for commodity GPUs where there is minimal opportunity for ASIC specialization. This prevents specialized ASICs without resorting to a game of whack-a-mole where the network changes algorithms every few months.

Motivation

Since proof-of-work will continue to underpin the security of the Zcash network, it is important that this consensus isn’t dominated by a single party in control of a large portion of the compute power.
The existence of the Z9 ASIC miner proves that the current Equihash algorithm allows significant efficiency gains from specialized hardware. The ~150 MB of state in Equihash is large but feasible on an ASIC. Furthermore, the binning, sorting, and comparing of bit strings could be implemented on an ASIC at extremely high speed.

Thus, a new hash algorithm is needed to replace the current Equihash. This algorithm must be permanently resistant to domination by specialized hardware and avoid further PoW code forks.

Design Specification

ProgPoW preserves the 256-bit nonce size for Zcash and thus does not require any changes to light clients. However, it entirely replaces the rest of the Equihash PoW algorithm.

ProgPoW is based on Dagger Hashimoto and follows the same general structure. Dagger Hashimoto was selected as a base due to its simple algorithm and memory-hard nature. Dagger Hashimoto has also withstood a significant amount of real-world “battle testing.” Finally, starting from a simple algorithm makes the changes easy to understand and thus easy to assess for strengths and weaknesses. Complexity tends to imply a larger attack surface.

The name of the algorithm comes from the fact that the inner loop between global memory accesses is a randomly generated program based on the block number. The random program is designed to both run efficiently on commodity GPUs and also cover most of the GPU's functionality. The random program sequence prevents the creation of a fixed pipeline implementation as seen in a specialized ASIC. The access size has also been tweaked to match contemporary GPUs.

ProgPoW utilizes almost all parts of a commodity GPU, excluding:

  • The graphics pipeline (displays, geometry engines, texturing, etc);
  • Floating point math.
    Making use of either of these would have significant portability issues between commodity hardware vendors, and across programming languages.

Since the GPU is almost fully utilized, there’s little opportunity for specialized ASICs to gain efficiency. Removing both the graphics pipeline and floating point math could provide up to 1.2x gains in efficiency, compared to the 2x gains possible in Dagger Hashimoto, 50x gains possible for CryptoNight, and ~100x gains possible in Equihash.

The algorithm has five main changes from Dagger Hashimoto, each tuned for commodity GPUs while minimizing the possible advantage of a specialized ASIC. In contrast to Dagger Hashimoto, the changes detailed below make ProgPoW dependent on the core compute capabilities in addition to memory bandwidth and size. (A toy sketch of the resulting loop structure follows the list.)

  • Changes keccak to blake2s.
    Zcash already uses BLAKE2 in various locations, so it makes sense to continue using BLAKE2. GPUs are natively 32-bit architectures so blake2s is used. Both blake2b and blake2s provide the same security but are tuned for 64-bit and 32-bit architectures respectively. Blake2s runs roughly twice as fast on a 32-bit architecture as blake2b.

  • Increases mix state.
    A significant part of a GPU’s area, power, and complexity is the large register file. A large mix state ensures that a specialized ASIC would need to implement similar state storage, limiting any advantage.

  • Adds a random sequence of math in the main loop.
    The random math changes every 50 blocks to amortize compilation overhead. Having a random sequence of math that reads and writes random locations within the state ensures that the ASIC executing the algorithm is fully programmable. There is no possibility to create an ASIC with a fixed pipeline that is much faster or lower power.

  • Adds reads from a small, low-latency cache that supports random addresses.
    Another significant part of a GPU’s area, power, and complexity is the memory hierarchy. Adding cached reads makes use of this hierarchy and ensures that a specialized ASIC also implements a similar hierarchy, preventing power or area savings.

  • Increases the DRAM read from 128 bytes to 256 bytes.
    The DRAM read from the DAG is the same as Dagger Hashimoto’s, but with the size increased to 256 bytes. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses.
    The DAG file is generated according to traditional Dagger Hashimoto specifications, with an additional ProgPoW_SIZE_CACHE bytes generated that will be cached in the L1.
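
To make the structure concrete, here is a toy, single-lane C sketch of the loop shape described above. This is illustrative only, not the ProgPoW specification: real ProgPoW runs ProgPoW_LANES lanes that share each DAG read, keeps ProgPoW_REGS words of mix state per lane, and generates the random math as compiled code from the block number; the cache reads and math below are stand-ins merged with FNV1a.

#include <stdint.h>
#include <stddef.h>

#define CNT_MEM    64  /* outer loop: one 256-byte DAG read per iteration */
#define CNT_CACHE   8  /* low-latency cached reads per iteration */
#define CNT_MATH    8  /* stand-in math operations per iteration */

static uint32_t toy_progpow_loop(const uint32_t *dag, size_t dag_words,
                                 const uint32_t *cache, size_t cache_words,
                                 uint32_t mix)
{
    for (int i = 0; i < CNT_MEM; i++) {
        /* one wide sequential read: 256 bytes = 64 words from the DAG */
        size_t base = ((size_t)mix % (dag_words / 64)) * 64;
        for (int w = 0; w < 64; w++)
            mix = (mix ^ dag[base + w]) * 0x01000193u;

        /* random-address reads from the small, cached region */
        for (int c = 0; c < CNT_CACHE; c++)
            mix = (mix ^ cache[mix % cache_words]) * 0x01000193u;

        /* placeholder for the randomly generated math sequence */
        for (int m = 0; m < CNT_MATH; m++)
            mix = (mix * 33) + (mix >> 13);
    }
    return mix;
}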

ProgPoW can be tuned using the following parameters (transcribed as C defines after the list). The proposed settings have been tuned for a range of existing, commodity GPUs:

  • ProgPoW_LANES: The number of parallel lanes that coordinate to calculate a single hash instance; default is 32.
  • ProgPoW_REGS: The register file usage size; default is 16.
  • ProgPoW_CACHE_BYTES: The size of the cache; default is 16 x 1024.
  • ProgPoW_CNT_MEM: The number of frame buffer accesses, defined as the outer loop of the algorithm; default is 64 (same as Dagger Hashimoto).
  • ProgPoW_CNT_CACHE: The number of cache accesses per loop; default is 8.
  • ProgPoW_CNT_MATH: The number of math operations per loop; default is 8.
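
For reference, the proposed defaults above transcribed as C defines (illustrative; the names come straight from this list):

#define ProgPoW_LANES        32
#define ProgPoW_REGS         16
#define ProgPoW_CACHE_BYTES  (16 * 1024)
#define ProgPoW_CNT_MEM      64
#define ProgPoW_CNT_CACHE    8
#define ProgPoW_CNT_MATH     8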

ProgPoW uses FNV1a for merging data. The existing Dagger Hashimoto uses FNV1 for merging, but FNV1a provides better distribution properties.
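
For concreteness, a minimal C sketch of the two word-wise variants as applied in Dagger Hashimoto-style code (the constant is the standard 32-bit FNV prime; classic FNV is byte-wise, the word-wise use here follows ethash):

#include <stdint.h>

#define FNV_PRIME 0x01000193u

/* FNV1: multiply, then XOR (what Dagger Hashimoto uses) */
static inline uint32_t fnv1(uint32_t h, uint32_t d)  { return (h * FNV_PRIME) ^ d; }

/* FNV1a: XOR, then multiply - the data word passes through the
 * multiply, which diffuses its bits better */
static inline uint32_t fnv1a(uint32_t h, uint32_t d) { return (h ^ d) * FNV_PRIME; }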

ProgPoW uses KISS99 for random number generation. This is the simplest (fewest instructions) random generator that passes the TestU01 statistical test suite. A more complex random number generator like Mersenne Twister can be efficiently implemented on a specialized ASIC, providing an opportunity for efficiency gains.
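
For reference, a C sketch of Marsaglia's KISS99 as conventionally defined (two 16-bit multiply-with-carry generators, a linear congruential generator, and a 3-shift xorshift, combined at the end):

#include <stdint.h>

typedef struct { uint32_t z, w, jsr, jcong; } kiss99_t;

static uint32_t kiss99(kiss99_t *st)
{
    st->z = 36969 * (st->z & 65535) + (st->z >> 16);  /* MWC, upper half */
    st->w = 18000 * (st->w & 65535) + (st->w >> 16);  /* MWC, lower half */
    uint32_t mwc = (st->z << 16) + st->w;
    st->jcong = 69069 * st->jcong + 1234567;          /* LCG */
    st->jsr ^= st->jsr << 17;                         /* 3-shift xorshift */
    st->jsr ^= st->jsr >> 13;
    st->jsr ^= st->jsr << 5;
    return (mwc ^ st->jcong) + st->jsr;
}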

Backwards Compatibility

This algorithm is not backward compatible with the existing Equihash implementation and will require a fork for adoption. Furthermore, the network hash rate will change dramatically as the units of compute are re-normalized to this new algorithm.

Testing

This PoW algorithm was implemented and tested on six different GPU models from two manufacturers. The selected models span two different chips and memory types from each manufacturer (Polaris20-GDDR5 and Vega10-HBM2 for AMD; GP104-GDDR5 and GP102-GDDR5X for NVIDIA). The average hashrate results are listed below. Additional tests are ongoing.

As the algorithm nearly fully utilizes GPU functions in a natural way, the results reflect relative GPU performance that is similar to other gaming and graphics applications.

Model       Hashrate (MH/s)
RX580        9.4
Vega56      16.6
Vega64      18.7
GTX1070Ti   13.1
GTX1080     14.9
GTX1080Ti   21.8

Sample Code

Sample code is available here.

Schedule and Funding Request

This proposal is submitted to the Foundation to review and adopt as a priority. We ask that any funding go toward outside development, review, testing, and implementation efforts as the Foundation deems appropriate. We humbly suggest that this proposal be integrated into the soonest planned fork (as feasible) to minimize disruption to the network.

@solardiz

solardiz commented May 11, 2018

This looks like a solid pre-proposal to me. I'm not familiar with the original Dagger Hashimoto and thus can't fully evaluate it yet; nevertheless, the proposed changes all make sense to me from the perspective of using GPUs fully and optimally.

My (limited) understanding is that Dagger Hashimoto is an asymmetric PoW just like Equihash is - that is, validating a proof is lightweight and nearly instant, whereas computing a proof is heavy and slow (but not so slow as to make the PoW non-progress-free - the hashrates on the order of 10 to 20 per second given in the proposal are OK'ish in that respect, meaning there are lots of PoW computations per block interval).

However, can't we achieve a similar effect with a simpler PoW? I think an instantiation of yescrypt/pwxform with low-level parameters to fit and maximally use a GPU can have similar properties, except only for not generating "a random sequence of math" and technically being symmetric. The latter can be offset by setting yescrypt's tunable costs to be low and using a large ROM. A major drawback is that it means only moderately fast validation (not nearly instant) - like taking a millisecond per CPU core - and that the machine doing the validation needs to have enough RAM to accommodate yescrypt's ROM.

As to "a random sequence of math", how would it be generated so that it utilizes most of a GPU's computing resources yet lacks optimization potential (or has obvious optimization potential and no more, so that the GPU miners can utilize it easily)? I think it'd be tough to make use of more than a handful of different instructions in there while also meeting our other desirable properties. So target the different execution units (perhaps integer MADs and local/shared memory lookups/writes, like yescrypt's pwxform does? anything else we can easily use?) while keeping the instruction set use rather limited? Would this be very different from a non-random sequence then, in terms of relative resource utilization on GPUs vs. more specialized ASICs? That's unclear to me.

A "random number generator" may be unnecessary as a separate component given that per this proposal we'd effectively add something similar to yescrypt pwxform's S-boxes. A very simple transformation like a pwxform round and subsequent S-box writes results in changing S-box contents that also pass randomness tests.

The stated expected limitations of specialized ASICs' "up to 1.2x gains in efficiency, compared to the 2x gains possible in Dagger Hashimoto" look overly optimistic to me, although not necessarily impossible. When we're talking very small gains like 1.2x, it becomes especially important to consider what's included in GPU prices available to miners - and it's by far not just the production costs. The inefficiencies in gaming GPUs' (components) markets when those GPUs are purchased by/for miners are probably in excess of 20% compared to a market for a company's specialized mining ASICs (let alone that company's own use of the ASICs).

It's also important to consider whether our PoW algorithm would be somewhat tolerant to hardware errors (which is another way to improve usable yield and lower production costs) or not at all. Equihash is somewhat tolerant. It sounds to me like this new PoW won't be as tolerant, which is a good thing.

Please read the above as a friendly review and challenging of the proposal to help make it better.

@solardiz

solardiz commented May 11, 2018

I am not familiar with FNV1a, but FWIW per the table at https://github.com/leo-yuriev/t1ha variations of t1ha (including those optimized for 32-bit) are the fastest non-cryptographic hashes that pass SMHasher tests, whereas FNV1a_YoshimitsuTRIAD is slightly faster yet but fails tests.

@solardiz

solardiz commented May 11, 2018

I just found that the code and more documentation are already available at https://github.com/ifdefelse/ProgPOW

This answers my question on what kind of random math is being used, but doesn't address the concerns associated with it: optimization potential that miners will compete in exploring, only few instructions being used, and this not necessarily being better than optimally-pre-weighted use of MADs - in fact, introducing relatively lightweight ops to the mix might leave a GPU's MAD units under-utilized and thus let an ASIC get away with implementing fewer MAD units.

Quoting from the author's documentation linked above:

void merge(uint32_t &a, uint32_t b, uint32_t r)
{
    switch (r % 4)
    {
    case 0: a = (a * 33) + b; break;
    case 1: a = (a ^ b) * 33; break;
    case 2: a = ROTL32(a, ((r >> 16) % 32)) ^ b; break;
    case 3: a = ROTR32(a, ((r >> 16) % 32)) ^ b; break;
    }
}
uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 11)
    {
    case 0: return a + b;
    case 1: return a * b;
    case 2: return mul_hi(a, b);
    case 3: return min(a, b);
    case 4: return ROTL32(a, b);
    case 5: return ROTR32(a, b);
    case 6: return a & b;
    case 7: return a | b;
    case 8: return a ^ b;
    case 9: return clz(a) + clz(b);
    case 10: return popcount(a) + popcount(b);
    }
}

(In the implementation, this choice is compile-time (kernel source code is produced to be compiled for the target GPU), so the maybe-inefficiency of modulo division is not an issue.)

Out of these many operations, only the two MULs are heavy (and they miss on the opportunity to be MADs, to more fully utilize that hardware).

BTW, the focus on staying 32-bit for GPU friendliness may be excessive, especially if we also have a goal to maintain some register pressure so that the large register files are used nearly-fully. Modern GPUs do have some "64-bit" instructions operating on 32-bit elements in pairs of registers for each operand, and in a sense it's wasteful not to use those. For example, at least at NVIDIA PTX asm level (and probably also at ISA level and in hardware, but I didn't confirm) there's 64-bit addition (that is, with carry between the two 32-bit halves) as part of the high 32-bit MAD instruction. From a Japanese yescrypt-0.5'ish coin miner:

__device__ __forceinline__ uint2 mad64(const uint32_t a, const uint32_t b, uint2 c)
{
#if 0
        return vectorize((uint64_t)a * (uint64_t)b) + c;
#else
        asm("{\n\t"
                "mad.lo.cc.u32 %0, %2, %3, %0; \n\t"
                "madc.hi.u32 %1, %2, %3, %1; \n\t"
                "}\n\t" : "+r"(c.x), "+r"(c.y) : "r"(a), "r"(b)
        );
        return c;
#endif
}

It feels wasteful not to use powerful instructions like this, which probably utilize the MAD units' circuitry to the max achievable with integer math.

Do we really achieve anything good by sometimes no-op'ing those heavy MADs in favor of something trivial like a bitwise-OR, which might also be redundant? I doubt it.

@solardiz

solardiz commented May 11, 2018

Oh, also the funnel shifts probably consume a non-trivial amount of die area. But then it's better to use them as such, not reduce them to the simpler bit rotates. So perhaps just use a fixed mix of 64-bit MADs, funnel shifts (of 64-bit values by variable counts - but then we could run into differences in how shift counts in excess of bit counts are processed by different GPUs/CPUs - some will require explicit masking to be compatible with others, which would be wasteful), and shared/local memory lookups and writes.

The GPUs' instruction processing probably introduces little overhead (in die area and energy consumption) since it's amortized across wide SIMD. What's avoided in a pre-defined circuit is the need to store intermediate results in the large global register file with indexed access, rather than in temporary registers placed right between the units. So what we might want is non-pre-defined routing of data between a few relatively large units (like those I mentioned, in comparison to e.g. a mere bitwise-OR). It isn't important to mix different instructions in arbitrary order (in fact, we might end up under-utilizing our execution units if we do), but rather to re-route the outputs/inputs between fixed groups of instructions (e.g., in one code revision a MAD could take one of its inputs from an immediately preceding S-box lookup, and in another it'd take that input from an S-box lookup done 5 instruction groups earlier, etc.)

Unlike CPUs, modern GPUs also tend to allow indexed access to general-purpose registers (but for tiny indices, and we're limited to 255/N registers anyway). We could also use that, and then it's even runtime variability in the routing, which another ASIC would have to duplicate.

@solardiz

solardiz commented May 11, 2018

Skimming over the code, I now see that FNV1a and KISS99 are only used in host code to generate a randomized kernel for GPU, and are not used on GPU. This wasn't clear to me from the proposal (perhaps in part because I am not familiar with the original Dagger Hashimoto). Then my earlier suggestion that we don't need an RNG because we already have enough randomness from S-box lookups doesn't apply here. OTOH, this also means that neither the hash nor the RNG need to be fast.

@ifdefelse

ifdefelse commented May 11, 2018

@solardiz Thanks for the extensive feedback! This is wonderful. We are digesting. Please reach out to ifdefelse at protonmail if you would like to chat with the team directly.

@RadixPi

RadixPi commented May 11, 2018

Thank you for the detailed review and comments. A bunch of answers below.

the hashrates on the order of 10 to 20 per second

Those should have been labeled as megahashes/sec (MH/s). Sorry for the confusion.

However, can't we achieve a similar effect with a simpler PoW?

No. Any fixed hashing sequence can be implemented on an ASIC (or FPGA) with an order of magnitude better efficiency than a CPU-based implementation. The random sequence of math is what forces this algorithm to run on a GPU instead of on a specialized, fixed-sequence ASIC.

t1ha vs fnv1a and KISS99

I wasn't familiar with t1ha, but it looks very interesting.

KISS99/FNV1a are used both in generating the random math sequence for the inner loop and outside the loop.

In the sequence generation, FNV1a is used to seed KISS99, and then KISS99 is used to produce the instruction/register sequence.

If you look at the non-loop kernel code here:
https://github.com/ifdefelse/ProgPOW/blob/master/libethash-cuda/CUDAMiner_kernel.cu

FNV1a is again used to seed KISS99. KISS99 is used to initialize the values of mix[] (the per-lane registers).

FNV1a is also used for reducing mix[] to a single per-lane value, and then for reducing the per-lane values to a single 128-bit result.

I think FNV1a is fine for seeding KISS99. The reduction could replace FNV1a with T1HA0_32le, though it probably won't make much of a difference.

64-bit ops

All GPU 64-bit ops are implemented using 32-bit ops, so doing a single 64-bit op or two 32-bit ops makes effectively no difference. For someone implementing an ALU, adding carry-out to support a 64-bit op via two 32-bit ops is essentially free, so this doesn't differentiate in any interesting way.
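
To illustrate the point (a generic C sketch, not from the ProgPoW code): the only hardware a 64-bit add needs beyond two 32-bit adds is the carry bit between the halves.

#include <stdint.h>

/* 64-bit add composed of two 32-bit adds plus a carry */
static inline void add64(uint32_t *lo, uint32_t *hi, uint32_t blo, uint32_t bhi)
{
    uint32_t old_lo = *lo;
    *lo += blo;
    *hi += bhi + (*lo < old_lo);  /* carry out of the low half */
}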

register pressure

The ProgPoW_REGS define directly modifies the register pressure, and is a better way of controlling it than occasional 64-bit ops.

MAD

I had originally supported an explicit 3-input MAD, and also some 3-input logical ops, but the added complexity didn't seem worth it. The random sequence will naturally produce some MADs and 3-input LOPs anyway, which the compiler can make use of.

funnel shift vs bit rotate

The hardware implementation of either is effectively the same, again not a significant differentiator.

indexed register file

AMD supports an indexed RF, though I think it's only accessible via assembly. Nvidia does not. I intentionally picked ops that are common between both architectures. Focusing on just CUDA, I would have also made use of shfl, prmt, and lop3.

fixed instruction order

If there were a fixed instruction order where only the input/output locations changed, an ASIC could directly build that pipeline. Only changing both the instruction order and the input/output locations guarantees a real instruction processing system, not a fixed (and much more efficient) pipeline.

@solardiz

solardiz commented May 12, 2018

Thank you for your answers.

Regarding the speeds, what are the validation speeds on CPU? And proof size?

I agree with many of your points: there isn't a lot of difference between using "64-bit" ops vs. using 32-bit ones and doubling the register pressure via your algorithm's parameters (the maybe-improvement from making use of the carry-out/carry-in is very slight), yes the hardware implementation of 32-bit rotates is probably of almost same complexity as funnel shift's(*), and you seem to be correct that NVIDIA doesn't allow runtime indexing into the register file (unlike AMD GCN).

(*) But I still think it'd take extra resources to route two separate 32-bit inputs to the circuit, instead of one such input. AMD has funnel shift, too: it's amd_bitalign() in their OpenCL.
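
For reference, a generic C model of the funnel-shift primitive under discussion (selecting a 32-bit window from the 64-bit concatenation of two inputs; amd_bitalign and NVIDIA's funnel shift behave roughly like this, modulo how out-of-range shift counts are handled):

#include <stdint.h>

/* Right funnel shift: take 32 bits out of the 64-bit value (hi:lo),
 * starting s bits up from the bottom. Note that two full 32-bit data
 * inputs must be routed to the shifter, versus one for a plain rotate. */
static inline uint32_t funnel_shift_r(uint32_t hi, uint32_t lo, unsigned s)
{
    s &= 31;  /* mask to avoid undefined behavior on shift by 32 */
    return s ? (lo >> s) | (hi << (32 - s)) : lo;
}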

Regarding bitwise ops (even 3-input ones), I continue to think they're too simple to be worth using (except where we might have to e.g. for non-linearity with other ops). On every such op, we'd waste more time and energy on register file access than on the op itself.

I also continue to see the unclear optimization potential of those instruction sequences as a concern, and to that I add that your clz() and popcount() based steps produce tiny results (most result bits at zero), so any subsequent e.g. integer multiplication could be routed to a smaller and quicker 32x6-bit or even 6x6-bit circuit that an ASIC might provide for such cases.

Regarding pipelining in an ASIC, I continue to think that it may become impractical with varying locations of inputs to each algorithm step (would-be pipeline stage) and/or too much state to carry along the pipeline, even if each step's operations are fixed and known at ASIC design time.

For example, consider implementing bcrypt in an ASIC: each Blowfish round has a fixed set of ADDs and XORs, but the input locations keep changing. Indeed, our and others' implementations of bcrypt on FPGA don't pipeline it - instead, they include multiple iterated bcrypt instances per core, which is similar to what's done in software when cracking bcrypt hashes on a CPU core.

For another example, a person on our team is now working on sha512crypt on FPGA. SHA-512 seems like a perfect fit for pipelining, and almost all of the higher-level logic of sha512crypt is pre-determined (aside from variance by password and salt length, which affect how many times SHA-512's compression function is invoked, but we can group candidate passwords by same length and crack one salt at a time to overcome this runtime variance). So he started with exploring fully-pipelined designs for SHA-512 (thus, with ~80 or ~160 in-flight computations), but after quite some trial and error realized that the size of SHA-512 inputs to carry along the pipeline and the complexity of the high-level algorithm make this prohibitively wasteful. For the current design, he switched to modules of four iterated SHA-512 cores (each computing two SHA-512 hashes at a time, and with slight overlap between previous and new sets of two-hash computations, so at times with up to four in-flight computations) controlled with specialized soft CPU cores (with 16-way SMT, thus one soft CPU core controls the four SHA-512 cores times four in-flights from its different hardware threads). The CPU cores are small in comparison to their groups of four SHA-512 cores. This may sound crazy, but it proved more efficient than a fully-pipelined design so far. Indeed, this design still uses a little bit of pipelining, but to an extent similar to what we see in CPUs and GPUs.

Sorry for this "off-topic" detail, but I think it illustrates that data dependencies can dictate optimal hardware design. I recognize that these examples I gave are for FPGAs rather than ASICs, but I expect similar considerations in ASIC design. I don't readily have an example to illustrate my other sub-point, namely that variability in instruction stream alone (without complex data dependencies on intermediate computation results from much earlier than the previous clock cycle) isn't as good at reducing potential ASIC advantage. Is use of both approaches (dependencies on older intermediate results and variable instruction stream) even better, and how much better? My current understanding is that it's potentially only slightly better, but it carries additional risks and pitfalls to avoid (redundancy, small intermediate results, other optimization potential, and lower utilization of the largest and naturally highest-latency execution units that we have).

@RadixPi

RadixPi commented May 14, 2018

Thanks again for all the comments.

Regarding the speeds, what are the validation speeds on CPU? And proof size?

Proof size is just a 64-bit nonce + 128-bit intermediate state (which can be used for DoS protection). This is a small change from ethash's 64-bit nonce + 256-bit intermediate. It's fairly easy to change.

tiny results

The results of the math get passed as the B argument to merge(), which either adds or XORs B into a modified version of A. Making a specialized 6-bit adder or XOR is probably more overhead than simply using a normal 32-bit ALU for everything. Merge is also set up so that even if the B input is 0 the result will be a version of A with high entropy.

bitwise ops power is dominated by register file

That's very true: actually executing a 32-bit logical op takes drastically less power than reading/writing the register file. This is where specialized ASICs get their efficiency gains compared to a CPU/GPU. I like including as many ops as possible to force the execution hardware to be as general purpose as possible, but keeping or removing LOPs won't make too much of a difference either way.

pipelining

Your bcrypt and sha512crypt examples are about an outer loop. Having a pipeline of 64 bcrypt cores in series or a single bcrypt core that's iterated on 64 times is going to have very similar efficiency. The gains mostly come from pipelining the inner loop. A small CPU sequencing the outer loop makes a lot of sense.

I could build an ASIC pipeline to handle a fixed sequence of math where only the inputs changed. The fixed math sequence would have a bit of control logic on the side, programmed with which registers to read/write at each step. The ASIC would still need a large register file to store all the state, like a GPU, but it could be much higher performance. Say the pipeline is 32 logical stages; you could have 32 different hash calculations in flight at once. If the register file is 32-way banked, they'd never conflict with each other. This pipeline would have 32x higher throughput than a GPU that has to process a single instruction at a time.

@solardiz

solardiz commented May 15, 2018

You seem to be right about the (non-)tiny results - I missed the fact that you merge() so often. I think it's this code (in the code generator):

ret << math("data32", mix_src(), mix_src(), rnd());
ret << merge(mix_dst(), "data32", rnd());

However, the operation based on clz() usually ignores most of its input bits, so maybe a preceding e.g. MUL + merge() could have its multiplication circuit optimized. OTOH, I can see how it'd be difficult to do when targeting fixed timings and having to fall back to the full, slower circuit once in a rare while. I think more importantly those bit-counting operations are simply too cheap in hardware, much like bitwise logical ones. So their use is more like a defense against other/weird GPU/CPU archs that might lack such instructions than a defense against ASICs.

What you say about bcrypt doesn't make sense to me. (Maybe you're unfamiliar with it? That's OK; I'm also unfamiliar with a few things you referenced previously. But these non-overlaps make it harder for us to get on the same page.) This is about the inner loop. Assuming you'd try to build a pipeline of 64 Blowfish rounds (and then build 64 bcrypt cores by iterating around that pipeline), you'd have only the 64-bit Blowfish blocks propagating along the pipeline efficiently, yet you'd have to manage 64 x 4 KiB = 256 KiB worth of S-boxes near that pipeline, with routing from each 4 KiB set to each pipeline stage, which will waste die area (on the complex addressing and routing) and introduce latency at best on par with a CPU's L2 cache's - and I don't see a way to hide that latency since each round's lookup indices are only known after computation of the previous round (previous pipeline stage) and each round's results are similarly needed by the next round (next pipeline stage). Thus, you'd have a larger circuit that runs slower (per hash computed) than a fully-iterated circuit would run (with only the control logic shared and thus amortized across several such iterated bcrypt instances). [What does make sense, though, is interleaving two bcrypt instances' uses of the same S-box RAMs (of double capacity) to partially hide the access latencies - this is similar to what's done when cracking bcrypt on CPUs and it should also fit e.g. BlockRAM in Xilinx's 7-series FPGAs.]

I think you actually try to achieve a defense similar to bcrypt's with your "cache accesses", which you currently make as numerous as your random math steps.

For sha512crypt, it's about both inner and outer loops. Neither uses variable indexing like we have for bcrypt, and the inner loop (SHA-512 compression function) is a reasonably good fit for pipelining - but the amount of state+inputs to carry along a fully-pipelined implementation is prohibitively large. As I understand, Bitcoin mining FPGA designs and ASICs do use fully-pipelined double-SHA-256 and thus do carry that state+inputs, but there must be optimizations due to similarity of inputs (only the nonce changes) and for SHA-256 the inputs are twice smaller. So we may have a borderline case with SHA-512, and with the added complexity of needing to interface a non-trivial higher-level algorithm to SHA-512 (would need to manage 80+ in-flight sha512crypt's in the specialized CPU) we may have just passed the point where fully-pipelined SHA-512 made sense.

As I understood your example, the 32-stage pipeline + 32-way register file would consume at least 32x the area of a fully-iterated design (excluding control logic, which could be amortized across several such fully-iterated instances - somewhat like we have with wide SIMD in GPUs), and will likely incur higher delays (thus, would have to run at a lower clock rate) to access registers in the larger register file. Yes, you will have 32x higher throughput per clock cycle but at >32x the area and at a lower clock rate, defeating the purpose of this design.

Pipelining is great, but it has its limitations. For example, for descrypt a 400-stage pipeline (25 iterations x 16 DES rounds) can work great (and it's been done). DES and descrypt fit pipelining perfectly. Even then, going for a mixed e.g. 25 iterations of 16 stages design is only a slight slowdown (which has also been done and made sense as part of a more complex design). As soon as there's any extra complication for pipelining, fewer stages or even a non-pipelined design could work better.

MUL instructions are good in that they're already optimally pipelined on a GPU (across 4+ clock cycles). Any instruction that runs on a CPU (yes, CPU) in one clock cycle (of latency) is not a good fit since this indicates that its latency on a GPU is mostly part of the GPU design rather than inherent to the computation.

@cryptomined

cryptomined commented May 16, 2018

I vote For implementation, For changing algo to ProgPOW.
A level playing field is ideal.
If this works, ifdefelse and Radix should be the new ZEC lead developers and zooko should be fired.

@ofpcarnage

ofpcarnage commented May 16, 2018

I vote For implementation, For changing algo to ProgPOW.

@bjames301

bjames301 commented May 16, 2018

I vote for Implementation for Changing algo to ProgPOW

@gordela

gordela commented May 16, 2018

I vote for Implementation for Changing algo to ProgPOW

@Miner356

Miner356 commented May 16, 2018

I vote for Implementation for Changing algo to ProgPOW

@thunder02dragon

thunder02dragon commented May 16, 2018

I vote for Implementation for Changing algo to ProgPOW

@mineZcash

mineZcash commented May 16, 2018

Gentlemen, I appreciate the "Votes" of confidence in this proposal. Keep in mind this is just the first step in a process.

The funding of this and other proposals will be reviewed by the Foundation's Grant Review Committee as outlined here:

https://github.com/ZcashFoundation/GrantProposals-2018Q2/blob/master/README.md

And will be subject to ratification by the Zcash Foundation Board if approved for funding.

After which the implementation of a change of PoW could be considered, as outlined by Josh Cincinnati @acityinohio in this post:

https://z.cash.foundation//blog/statement-on-asics/

@Miner356

Miner356 commented May 16, 2018

Thanks for letting us know, but is it a problem if we continue voting here?

@cryptomined

cryptomined commented May 16, 2018

@mineZcash Thanks for the clarification, Shawn.
@Miner356 OhGodAGirl said you could voice support here.

And don't forget, according to Zooko's interview with CoinDesk... if the Foundation tries to go through with this there will more than likely be a chain split and he will continue with the ASIC chain... at least until Sapling... or until he gets hung for treason.

Shawn can clarify on that 👍

@mineZcash

mineZcash commented May 16, 2018

I don't see a problem with voting here, but I will leave that call to @amiller or @acityinohio. The only issue that could occur is if the conversation about specific details and solutions regarding the actual PoW is overwhelmed by users voting.

Perhaps the Ballot that Andrew is working on would be able to provide more accurate accounting of Votes cast: ZcashFoundation/Elections#1

@cryptomined

cryptomined commented May 16, 2018

@mineZcash When will the ballot go live? Will we all be able to vote?

@Miner356

Miner356 commented May 16, 2018

@mineZcash The purpose of Zcash was ASIC resistance; what do you think we miners want?
It should not be necessary to vote; we're just wasting time.
Please save resources, audit the code, and say what you think.
Thank you

@mineZcash

mineZcash commented May 16, 2018

@cryptomined That's a good question for @amiller
@Miner356 The process that we are following has been clearly outlined by the Zcash Foundation's Executive Director: https://z.cash.foundation//blog/statement-on-asics/ . Please read through it when you get a chance.

@Miner356

Miner356 commented May 17, 2018

@mineZcash OK, I will follow you to stay informed about what is going on.
Thank you

@Miner356

Miner356 commented May 17, 2018

@mineZcash I researched general computing trends and came to the conclusion that ASICs are currently the fastest, but not for long; I think that within 3 to 4 years GPUs and CPUs will compete with ASICs.
https://www.youtube.com/watch?time_continue=31&v=55JK5bDLoAg
https://www.youtube.com/watch?v=rgbnv1UpAcE
Intel, AMD, and Nvidia will use the same technology in the gaming industry; they will be forced to because Moore's law does not allow much further advancement in hardware technology, and of course Moore's law also applies to ASICs.
This technology is cheaply manufactured and efficient.

@raodluffy

raodluffy commented May 17, 2018

I vote For implementation too, and thanks for that.

@ifdefelse

ifdefelse commented May 17, 2018

We appreciate the votes of confidence, but more peer review is needed; if you, or anyone you know, has time to review and audit the code, please feel free to reach out to the team via email: ifdefelse@protonmail.com, or through Twitter, with @OhGodAGirl, or via Gitter.

@bitcartel

bitcartel commented May 17, 2018

@ifdefelse The proposal doesn't specify if you are seeking funding to implement code, or to cover the expense of expert analysis (although @solardiz seems to be doing that already!). Perhaps funding isn't required and you are submitting the proposal for the foundation to review and adopt as a priority (at which point the foundation would implement). I think clarifying this would help the review committee.

@amiller

amiller commented May 17, 2018

@Miner356 @mineZcash I'm happy to see informal voting/polling/protesting occurring somewhere in this repo, even though I think there are better places for it.

  • If you want to suggest "Zcash Foundation should support an immediate fork to ProgPow" or some variation like that for our upcoming Foundation board election and Helios ballot, that would make a good issue or pull request in https://github.com/ZcashFoundation/Elections/ . Popular posts there should make it onto the ballot, which we're still working on.

@ifdefelse I was also gonna ask what bitcartel said... I'm excited this proposal has generated a technical discussion! What do you have in mind as the role of a grant? Maybe during the discussion phase we can get a list of suggested review or validation tasks and see if you or someone else would want to carry them out. Is there an open source GPU implementation of ProgPow yet?

@tromer

Collaborator

tromer commented May 30, 2018

@ifdefelse, the above information is crucially needed by the grant review committee in order to evaluate your proposal.

@MoneroCrusher

MoneroCrusher commented May 30, 2018

@ifdefelse ping!

@ifdefelse

ifdefelse commented May 31, 2018

@tromer @MoneroCrusher Sorry about the late response to this. This grant proposal was submitted for review priority and not for funding. While we do not need any funding for the work on our side, we are a very small team of part-time collaborators and would welcome any other contributors to the code and its review, funded or not. Would the foundation be willing to help source the following roles as sub-grants at time of completion?

  1. Project/community manager
  2. Client implementer/maintainer
  3. Miner implementer/maintainer
  4. Hardware optimization reviewer

@tromer

Collaborator

tromer commented May 31, 2018

@ifdefelse, while the proposal is technically very interesting, the Foundation itself doesn't have personnel who can fulfill these review, dev and communication roles for your PoW. One way to proceed is for you to team up with suitable people and have them funded by this grant program; but you would need to be proactive in reaching out to them.

@tromer

Collaborator

tromer commented Jun 1, 2018

The Zcash Foundation Grant Review committee has reviewed your pre-proposal, including the above discussion, to evaluate its potential and competitiveness relative to other proposals. Every pre-proposal was evaluated by at least 3 (and typically more than 4) committee members.

The committee's opinion is that your pre-proposal is a promising candidate for funding in this round, and the committee therefore invites you to submit a full proposal.
Please submit a full proposal by June 15th, following the detailed structure described in the Call for Proposals. We encourage you to submit a draft as early as possible, to allow for community feedback.

@tromer tromer added the invited-full label Jun 1, 2018

@solardiz

solardiz commented Jun 9, 2018

Is there a CPU-only (preferably, plain C) implementation of ProgPoW? I understand the OpenCL kernel can be run on a CPU if it doesn't use vendor-specific extensions, but I'd prefer to also have something in plain C or at worst C++, thus not needing an OpenCL runtime. This would be useful for review, experiments, as well as to serve as a reference implementation for correctness testing of the OpenCL and CUDA implementations against this reference. I understand that C(++) isn't a very good fit for dynamically adjusted code, but the same applies to CUDA, and this isn't literally a show-stopper - it only means that development tools need to be installed.

If this exists, it'd help my potential contributions. If it doesn't exist yet, then maybe I should include creation of such an implementation in my proposal.

(I continue to think that dynamically generated code might be overkill, and carefully tuned dynamically generated data dependencies could be sufficient to achieve similar "ASIC resistance". But I realize that current ProgPoW does use dynamically generated code.)

@ifdefelse

ifdefelse commented Jun 14, 2018

Posting the full Grant Proposal:
zgp-15.pdf

@solardiz

solardiz commented Jun 15, 2018

Thank you for posting the proposal, @ifdefelse.

This is a very aggressive schedule you propose, which is great given Zcash's immediate need for a PoW scheme change, and I think you'll deliver, but it raises concerns of the extent of review and testing and tweaking that may be done on ProgPoW in this limited time. Part of the problem we have with Equihash now is that its choice of parameters was also a rushed decision with no time for adjustment even when much more efficient implementations appeared and the parameters were already seen as non-optimal shortly before launch. I'm afraid it'll be similar in that respect.

Do you already have specific candidates in mind for the "2 more people"? Are they really available to start work on the project full-time(?) on such short notice?

(I might also post a proposal today. I think the research, review, tweaking, testing needed here is quite separate from client and miner development, so I wouldn't expect the same people to do both kinds of work.)

@tromer tromer added full-submission and removed info-needed labels Jun 15, 2018

@solardiz

solardiz commented Jun 17, 2018

FWIW, I contacted @eXtremal-ik7 about his possible help with miner development, and he's potentially interested, but obviously needs specifics to start any work. eXtremal previously contributed to Zcash miners: mbevand/silentarmy@b879b79 https://bitcointalk.org/index.php?topic=1660023.0

@T3N2aWsK

T3N2aWsK commented Jun 19, 2018

Here are my observations after going through the code provided in the proposal:

  • There are no definitions of DAG_SIZE or ProgPoW_CACHE_WORDS.

  • The code in the proposal could use some tuning of the order in which functions are presented, so that its structure is easier to understand for a first-time reader.

  • The algorithm isn't actually well defined, as the evaluation order of function parameters is unspecified, and there are e.g. 3 calls to kiss99() in the parameter list for the call to math().

  • The inner loop uses the functions merge() and kiss99() a lot, and the most complex parts of those are multiplications with constants. The Hamming weights of those constants are 2, 5, 6 and 9, allowing an ASIC to save energy by computing them as 2-, 5-, 6- and 9-way additions of less than 32 bits (see the shift-add sketch after this list).

  • The proof-of-capacity part (ref. the data64 variable) can easily be supported with a relatively low-cost RAM. This is read-only (as in very infrequently updated) and can be shared between many staggered instances of ProgPoWLoop loading 256 consecutive bytes at a time.
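
To make the Hamming-weight point concrete, a small C illustration (33 is merge()'s constant, 36969 is one of KISS99's; the decompositions follow directly from their binary representations):

#include <stdint.h>

/* 33 = 2^5 + 2^0 (Hamming weight 2): one shift, one add */
static inline uint32_t mul33(uint32_t a) { return (a << 5) + a; }

/* 36969 = 2^15 + 2^12 + 2^6 + 2^5 + 2^3 + 2^0 (Hamming weight 6):
 * a 6-way addition of shifted copies */
static inline uint32_t mul36969(uint32_t a)
{
    return (a << 15) + (a << 12) + (a << 6) + (a << 5) + (a << 3) + a;
}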

@ifdefelse

ifdefelse commented Jun 21, 2018

@T3N2aWsK , @PrometheusXDE, @solardiz, @donamelo We thank you for your feedback. We encourage folks to continue to review this proposal from a technical perspective. We hope reviewers can start with a deeper technical understanding of what is being proposed, especially from the hardware side of things.

  1. re: DAG_SIZE. We should have been clearer with this. However, we assumed that Dagger Hashimoto was sufficiently documented that it didn't need to be repeated here. This is the size of the DAG and the number of 256-byte words in it. The size is based on Dagger Hashimoto's formula for determining DAG size from block number.

  2. re: code ordering. We always welcome feedback on improving clarity. How would you like to see the code presented?

  3. re: function parameters. We believe the ordering is well defined in C. Parameters are evaluated from left to right. Would specifying that this is C code help clear up the confusion?

  4. re: ASIC on inner loop. We can see how you might be confused by the concept of programmatic workload generation. The inner loop you are referencing is part of the programmatic generation of the workload algorithm and not actually part of the PoW workload itself. The calls to kiss99 only happen during generation to determine which ops and registers to execute. For the code executed during mining, that part is compiled away.

  5. re: proof-of-capacity. The size of the DAG is currently 2-3 GB and can be easily increased. However, we didn't want to disadvantage miners with smaller memory sizes. We are willing to consider increasing the memory requirement. Note that the bandwidth (throughput) is just as important. Dagger Hashimoto requires both significant memory bandwidth and capacity to support mining. Again, Dagger Hashimoto's memory hardness is well battle-tested: it requires not just capacity, but also high throughput and predictable latency-insensitivity. In fact, we've made it even more latency-insensitive by increasing the read size. We have done an extensive cost-efficiency analysis, based on RAM pricing from the trough 1+ year ago and the recent RAM price spike, for various types of RAM and the corresponding silicon area the interface would demand. Do this analysis for any silicon and you'll see it's very hard to beat GPUs for price per throughput-capacity.

@MoneroCrusher

MoneroCrusher commented Jun 21, 2018

@PrometheusXDE
was just writing this response and wanted to say wait a couple of days and you'll see @ifdefelse refute every single point proficiently (just like they did in their reddit thread https://www.reddit.com/r/EtherMining/comments/8k4zc7/comment/dz5v2r3?st=JIO8LL6E&sh=54bb7b39), but they beat me to it!

@T3N2aWsK's account is just 2 days old (edit: not 5) and clearly he/she misunderstood the algo (or maybe even wanted to spread FUD, as there are big economic consequences for some people: those that bought ASICs, plus the manufacturers).

About the author (from my perspective):

  • best open source cryptonight miner (xmr-stak) is based on their work.
  • ETHLargementPill, which increased Nvidia GPU Ethash hashrate by over 50%. They gave it out for free.
  • various tools by OhGodACompany.
  • helping out folks in various forums, chats etc

So in my eyes there's no reason to distrust them. But ProgPOW is open source anyway, so..
I don't care either if they do that in their self-interest (i.e. running mining GPUs).

What I (and presumably every cryptocurrency dev) care about is decentralisation, and blockchain promises it. It is up to the software devs to keep that promise and ProgPOW is the best way to achieve that for now.

It makes GPUs king in town and luckily AMD and NVIDIA have a public face to lose in case of shady business while Bitmain et al don't (or they already did? Re: Antbleed).

There's never perfect decentralisation, but for now it's better to go the way of most possible alignment with decentralisation (AMD/Nvidia/Intel/Samsung) than to let ASIC manufacturers control 80% of every single algo. I promise you: they do pre-mine. They won't sell 50-day break-even miners to consumers; their investors wouldn't let them. The consumer will always be the one chasing the carrot.

So, I'm all for ProgPOW! Implement it in a testnet and see how it goes. Then mainnet.

One critique:
DAG size shouldn't be 2-3 GB (or even higher). It should be much less, so everyone can participate, even integrated graphics.
As I remember, the DAG-size increase was implemented to defeat ASICs, right? So this will be moot once ProgPOW is implemented, as ASICs would always be as fast as GPUs.
Correct me if I'm wrong :)

@PrometheusXDE

PrometheusXDE commented Jun 21, 2018

I don't agree, because the hacker who hacked my account is a dickhead

@solardiz

solardiz commented Jun 21, 2018

@MoneroCrusher @PrometheusXDE Folks, please don't treat reviews pointing out (potential) issues as attacks on the algorithm or on its authors, and please don't treat the existence of issues (even if confirmed) as indicative of incompetence or whatever. It is perfectly normal for a reviewer (especially a volunteer) to be confused at first and find non-issues. If e.g. 1 in 10 issues is for real, that's worth it. And some of the rest might point to a need for better documentation. It is also perfectly normal for there to be issues in an algorithm as initially proposed. That's why independent review is needed. BTW, thank you for linking to that Reddit thread - I wasn't aware of it - it's very good news that someone intends to try implementing this on FPGA, and again it's no problem they misunderstood things at first.

@ifdefelse No, the evaluation order of function arguments in C is in fact unspecified, and may/will vary between compilers, etc. I haven't checked whether/where in the code you have that issue, though - I'm just commenting in general.

@T3N2aWsK

T3N2aWsK commented Jun 21, 2018

@ifdefelse Thanks for your feedback.

The intention of my comments is neither to shoot down nor endorse this project, but to help with the analysis of its merits and making it better. The developers are obviously free to use my comments to improve their code and algorithm.

  1. Thanks for the clarification.

  2. As a general rule I would suggest presenting it either top-down or bottom-up. Bottom-up would be the natural choice in C, defining each function before its first use.

  3. I haven't verified against the latest standard documents, but I have never seen the parameter evaluation order specified for C or C++. For a discussion on this topic see e.g. https://stackoverflow.com/questions/9566187/function-parameter-evaluation-order (and the small example after this list).

  4. Ok, I missed the relative const-ness of prog_seed when focusing on ProgPoWLoop(). mix_dst(), mix_src() and rnd() all use kiss99() in a different way. mix_dst() has its values generated and stored in a lookup table, and then rnd() and mix_src() share state while generating the pseudorandom load and operation sequence. This is not a very clear and readable way of defining the algorithm.
    With the calls to kiss99() being part of precomputation and compilation, we are left with merge() which uses multiplication by 33. This will of course be implemented with a 27-bit adder in hardware, but any good coder or compiler will also replace it with a shift and an add in software. Hence this won't be a big win for ASIC.

  5. Ack. With the pseudorandom sequence precomputed, there is not so much left in the inner loop, making this consume significant bandwidth.
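
A minimal demonstration of the evaluation-order point in item 3 (generic C, not from the ProgPoW code; a conforming compiler may print 0 1 2 or 2 1 0):

#include <stdio.h>

static int n = 0;
static int next_val(void) { return n++; }

int main(void)
{
    /* The order in which these three calls are evaluated is
     * unspecified in C, so the output is compiler-dependent. */
    printf("%d %d %d\n", next_val(), next_val(), next_val());
    return 0;
}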

@ifdefelse

ifdefelse commented Jun 22, 2018

@T3N2aWsK on point 3, we stand corrected. Thanks for pointing this out. We will update the code.
We will also update the documentation based on your feedback on point 2 when we get a chance.

@alexcryptan

alexcryptan commented Jun 30, 2018

This was a useful read, as well as the comments:
https://medium.com/@OhGodAGirl/the-problem-with-proof-of-work-da9f0512dad9

I am wondering if it is possible to do a ProgPoW-style algorithm tailored to CPUs. That seems better than tuning for GPUs.

@MoneroCrusher

MoneroCrusher commented Jun 30, 2018

@alexcryptan
You'd probably have problems with botnets then. But I also think a mix between GPU and CPU would be best, slightly favouring GPU.

@alexcryptan

alexcryptan commented Jul 1, 2018

@MoneroCrusher
With 2-3 GB of required RAM it would be "botnet resistant" - very noticeable to the end user.

@bitcartel

bitcartel commented Jul 3, 2018

@ifdefelse
ProgPoW was discussed and mentioned a few times at the Zcon0 mining workshop: https://forum.z.cash/t/what-happened-at-the-zcon0-mining-workshop-and-more/30062/

@tromer

Collaborator

tromer commented Sep 25, 2018

@ifdefelse, we are finalizing decisions on the grants. Are there updates about the ProgPoW design, progress, plans or funding requirements that we should be aware of?

@greerso

greerso commented Sep 29, 2018

Ethereum Core developers get answers to questions on ProgPow from IfDefElse.
https://youtu.be/z2mefVnZHpw?t=48m45s

@OhGodAGirl

OhGodAGirl commented Oct 12, 2018

@ifdefelse, we are finalizing decisions on the grants. Are there updates about the ProgPoW design, progress, plans or funding requirements that we should be aware of?

Hey Eran!

There are no main changes to the ProgPoW design, and progress is going well - we're working closely with the ETH team for integration. Nothing has changed on the funding, but I will be updating the EIP to simply add more information/transparency and more clarity to a lot of areas, and that will translate to the ZIP, naturally.

So there's more on the education front coming.

@ifdefelse

ifdefelse commented Oct 13, 2018

Hello team,

We are unable to update our official ZGP proposal at this time, but I would like to make some brief statements in order to clarify some concerns of the public.

  1. We are aware that significant work needs to be done on the education level to have ProgPoW make sense to the general public. When we originally wrote this EIP and spec, we positioned it as a piece just for hardware designers and engineers, or people with experience like ourselves - not the general public. We're now learning the chaos that causes, and we're going to adjust accordingly.

  2. We seek to release open source prototype implementations of an Equihash variant of ProgPoW, as well as the Ethash variant of ProgPoW (currently being worked on for enterprise-level use by various members of the Ethereum community), in both the mining and node space. We'd hope that the engineers at the Zcash Foundation could assist us on this part, because we are not node developers ourselves. As the entire premise of our Medium article is simply 'do what you do best, and leave the others to the rest', we'd hope that node development could be left to experienced node developers, miner development to experienced miner developers, and hardware design to us.

  3. There is a public implementation of ProgPoW called Bitcoin Interest that gives real-world testing data for the power consumption and hashrate of ProgPoW. Note that it is unique code and doesn't necessarily follow the spec, but the premise is the same (utilizing all parts of the GPU; the difference is in the node and verification times).

  4. AMD and NVIDIA have independently reviewed ProgPoW, and so have multiple engineers.

  5. We are seeking to make a modified request of 10k, maximum, for the research, development and design of ProgPoW for Equihash (the open source implementation). This will simply cover the burn rate of one developer, and unused funds would be returned to the Zcash Foundation. We are not able to do the development ourselves, due to where we are now in October; back in May, things were a lot different.

  6. The justification for Dagger Hashimoto was simply the natural memory hardness of the Ethash algorithm, and ProgPoW acts as an extension of this (see the sketch after this list). There is no reason the variants of Equihash can't be adapted to be naturally tuned to all existing parts of a GPU, thus creating a ProgPoW Equihash variant.
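To illustrate the kind of memory hardness referenced in point 6, here is a toy Hashimoto-style inner loop in Python. This is a rough sketch only, not the actual Ethash or ProgPoW specification; the hash function, round count, and dataset layout are all placeholders:

```python
import hashlib

def hashimoto_mix(seed: bytes, dataset: list[bytes], rounds: int = 64) -> bytes:
    # Toy Hashimoto-style loop: each round uses the running mix to pick a
    # pseudo-random dataset index, so every round costs one access into a
    # dataset too large to cache cheaply; that is the memory hardness.
    # (Real Ethash mixes 128-byte lanes with FNV; ProgPoW additionally runs
    # a randomly generated math/cache program between these accesses.)
    mix = hashlib.blake2b(seed, digest_size=32).digest()
    for _ in range(rounds):
        index = int.from_bytes(mix[:4], "little") % len(dataset)
        mix = hashlib.blake2b(mix + dataset[index], digest_size=32).digest()
    return mix
```

Because each index depends on the previous mix value, the accesses cannot be computed in advance, which is what forces the full dataset to stay resident.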

The official ZGP proposal will be updated to match these posts accordingly.

Oh, and on our identities and backgrounds: we'll be tying our skillset more into the Medium post that is an extension of what was already posted. But please keep in mind that an algorithm should be reviewed on its own merits, not on the identities of the people who created it. That is why Satoshi Nakamoto chose to remain anonymous. The technology needs to speak for itself.

@solardiz

solardiz commented Oct 13, 2018

Thanks for the update @ifdefelse! It would also be helpful to have some pointers on a few things you referenced. I searched for Bitcoin Interest and found that it's a cryptocurrency that recently hard-forked to use ProgPoW (or something similar, as you say). But I can't similarly search for AMD's and NVIDIA's reviews of ProgPoW, for lack of sufficiently focused keywords to search for. Are their reviews public?

"Equihash variant of ProgPoW" aka "ProgPoW for Equihash" sounds exciting at first, but starts to fall apart the more I think of it. I assume ProgPoW would be used for the memory filling phase (in place of the sequential writes with BLAKE2b, which Zcash's Equihash currently uses) and would be followed by the usual Equihash processing? If so, the Equihash phase would be as (moderately) ASIC-friendly as Zcash's Equihash currently is, and we'd need to ensure that either the memory filling phase contributes to a larger portion of the total processing cost or the frequent and large data transfer from a GPU-like to a more specialized Equihash ASIC would be prohibitively expensive. With this added requirement, having Equihash at the end of the processing feels superfluous - we're simply designing a new sequential memory-hard scheme (with the usual memory usage vs. verification time trade-off), and could use its computed hash value directly. Having Equihash's collision search step at the end makes sense if the collision-producing hashes can be recomputed quickly, thereby providing fast verification, but this weakens the memory filling (is inconsistent with it being sequential memory-hard). Is this correct, or do you have something different in mind?

A potentially better "Equihash variant of ProgPoW" aka "ProgPoW for Equihash" would be to enhance Equihash's collision search phase, applying it to a ProgPoW'ish function of the memory contents rather than to the actual memory contents directly. This would need to be done in a way requiring that the computation be done dynamically on every ex-memory lookup rather than in advance (which would essentially replace the memory contents, to be followed by classic Equihash, thereby defeating our attempt at requiring GPU-like hardware), and that's tricky (perhaps achievable through inclusion of frequent memory writes, so that the collision search would be over changing data, or/and through doing the collision search over a much larger amount of expanded data, where each lookup of the virtual, say, 1 TB of data would be based on a function of multiple lookups from the real, say, 3 GB of data). Do you possibly have specific ideas on it? Overall, I think this is a research project requiring multiple iterations of (re-)design and attacks and tweaks. It's not just an engineering effort, and thus I see no way how it'd reasonably fit in the moderate budget you suggest, unless the research phase has somehow already been done?
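One way to picture the expanded-data variant: each read from a notional, much larger virtual dataset is computed on the fly from several reads of the real memory, so an attacker cannot replace the dynamic computation with a precomputed table without paying the full virtual size in storage. A rough Python sketch, with the sizes and combining function chosen arbitrarily for illustration:

```python
import hashlib

LOOKUPS_PER_READ = 4  # real reads combined into one virtual read (arbitrary)

def virtual_read(real: list[bytes], v_index: int) -> bytes:
    # Derive several pseudo-random indices into the real array from the
    # virtual index, then hash the gathered entries together. Materializing
    # all virtual entries would cost the virtual size (say ~1 TB) instead
    # of the real size (say ~3 GB), per the figures suggested above.
    h = hashlib.blake2b(v_index.to_bytes(8, "little"), digest_size=32).digest()
    gathered = b"".join(
        real[int.from_bytes(h[4 * k : 4 * k + 4], "little") % len(real)]
        for k in range(LOOKUPS_PER_READ)
    )
    return hashlib.blake2b(h + gathered, digest_size=32).digest()
```

Whether this actually forces dynamic recomputation, rather than permitting a time-memory trade-off cheaper than the full virtual size, is precisely the open research question raised above.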

@sonyamann

sonyamann commented Nov 6, 2018

I'm thrilled to inform you that the Grant Review Committee and the Zcash Foundation Board of Directors have approved your proposal, pending a final compliance review. Congratulations, and thank you for the excellent submission!

Next steps: Please email josh@z.cash.foundation from an email address that will be a suitable point of contact going forward. We plan to proceed with disbursements following a final confirmation that your grant is within the strictures of our 501(c)(3) status, and that our payment to you will comply with the relevant United States regulations.

We also wish to remind you of the requirement for monthly progress updates to the Foundation’s general mailing list, as noted in the call for proposals.

Before the end of this week, the Zcash Foundation plans to publish a blog post announcing grant winners to the public at large, including a lightly edited version of the Grant Review Committee’s comments on your project. The verbatim original text of the comments can be found below.

Congratulations again!

Grant Review Committee comments:

Proposed by the pseudonymous team "ifdefelse". This team developed ProgPoW, an alternative proof-of-work algorithm aiming to be ASIC-resistant but GPU-friendly. Under this proposal (in its revised form), the team will integrate ProgPoW with Zcash, in the sense of releasing a proof-of-concept fork of the Zcash node software that uses ProgPoW instead of Equihash, releasing compatible open-source mining software, and operating a testnet. This will facilitate empirical evaluation of the difficulty and network behavior of such an integration.

While not all committee members believe the notion of ASIC-resistance is an important one to cryptocurrencies, the committee gave submissions related to proof-of-work and ASIC-resistance the benefit of the doubt, judging these submissions under the assumption that the Zcash community is interested in ASIC-resistant developments.

Within this space, the committee sees ProgPoW as having high potential merit. We observe that ProgPoW is considered by many in the community to be a promising alternative to classic proof of work and an interesting variation on GPU-friendly ASIC resistance. We recognize the large body of work contributed by ifdefelse in developing an implementation of ProgPoW. We also observe that ProgPoW is being considered for use in Ethereum, with experimental integration underway.

We believe that the suggested prototype integration of ProgPoW with the Zcash code base will significantly further the ongoing discussion of potential PoW changes in Zcash by demonstrating technical feasibility, identifying difficulties, enabling experimentation with the dynamics of hardness adjustment under various loads, and, should ProgPoW be chosen for future Zcash network upgrades, reducing the time needed to develop that upgrade.

Thus, funding of this proposal is recommended. Moreover, we encourage the Zcash Foundation to support complementary efforts to independently evaluate the security properties of ProgPoW, within or outside the Grants program.
