
bump curve25519-dalek from 3.2.1 to 4.1.3 #1693

Merged: 5 commits, Jul 9, 2024
Conversation

@yihau (Member) commented Jun 11, 2024


@yihau changed the title from "test ci" to "bump curve25519-dalek from 3.2.1 to 4.1.2" on Jun 11, 2024
@yihau marked this pull request as ready for review June 11, 2024 19:37
@yihau (Member, Author) commented Jun 11, 2024

@samkim-crypto do you have any idea why https://github.com/solana-labs/solana-program-library/blob/646efc13449fa6e1a8f69d767283478d2a0002e4/token/cli/tests/command.rs#L2756-L2770 couldn't pass with this patch 🤔

@0x0ece commented Jun 11, 2024

> @samkim-crypto do you have any idea why https://github.com/solana-labs/solana-program-library/blob/646efc13449fa6e1a8f69d767283478d2a0002e4/token/cli/tests/command.rs#L2756-L2770 couldn't pass with this patch 🤔

This looks suspicious: https://github.com/anza-xyz/agave/pull/1693/files#diff-d660b342bff7546a265b36794627b9a07221e9356049a4848418ab7c696c30df

If the generators change, a proof generated with old code won't verify with new code.
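As an illustration of the concern (not the actual agave code), here is a minimal sketch using curve25519-dalek's RistrettoPoint::hash_from_bytes (available in both major versions, behind the digest feature in v4); the domain-separation label is made up for this example:

    use curve25519_dalek::ristretto::RistrettoPoint;
    use sha3::Sha3_512;

    // Prover and verifier must derive identical generator points. If a version
    // bump changed the hash function or the label, a proof made against the old
    // generator would no longer verify against the new one.
    fn generator(label: &[u8]) -> RistrettoPoint {
        RistrettoPoint::hash_from_bytes::<Sha3_512>(label)
    }

    fn main() {
        let g_old = generator(b"illustrative-domain-separator");
        let g_new = generator(b"illustrative-domain-separator");
        // Same hash function + same label => same generator.
        assert_eq!(g_old.compress(), g_new.compress());
    }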

@samkim-crypto commented:

> > @samkim-crypto do you have any idea why https://github.com/solana-labs/solana-program-library/blob/646efc13449fa6e1a8f69d767283478d2a0002e4/token/cli/tests/command.rs#L2756-L2770 couldn't pass with this patch 🤔
>
> This looks suspicious: https://github.com/anza-xyz/agave/pull/1693/files#diff-d660b342bff7546a265b36794627b9a07221e9356049a4848418ab7c696c30df
>
> If the generators change, a proof generated with old code won't verify with new code.

Hm, I did check this specifically. I believe it is the same Sha3 generators, just with different syntax.

The tests seem to work on my local machine for some reason 🤔, but let me dig into it a bit more.

Comment on lines 1067 to 1075:

    let payer = &context.payer;
    let recent_blockhash = context.last_blockhash;

    // verify a valid proof (without creating a context account)
    let instructions = vec![proof_instruction.encode_verify_proof(None, success_proof_data)];
    let transaction = Transaction::new_signed_with_payer(
        &instructions.with_max_compute_unit_limit(),
        Some(&payer.pubkey()),
        &[payer],
-       recent_blockhash,
+       client.get_latest_blockhash().await.unwrap(),


minor

While this is just tests, why request the blockhash over and over again?
Considering that the current version is already caching it, it seems easier to just update the line that was requesting the blockhash in the first place.

Suggested change, from:

    let payer = &context.payer;
    let recent_blockhash = context.last_blockhash;
    // verify a valid proof (without creating a context account)
    let instructions = vec![proof_instruction.encode_verify_proof(None, success_proof_data)];
    let transaction = Transaction::new_signed_with_payer(
        &instructions.with_max_compute_unit_limit(),
        Some(&payer.pubkey()),
        &[payer],
        client.get_latest_blockhash().await.unwrap(),

to:

    let payer = &context.payer;
    let recent_blockhash = client.get_latest_blockhash().await.unwrap();
    // verify a valid proof (without creating a context account)
    let instructions = vec![proof_instruction.encode_verify_proof(None, success_proof_data)];
    let transaction = Transaction::new_signed_with_payer(
        &instructions.with_max_compute_unit_limit(),
        Some(&payer.pubkey()),
        &[payer],
        recent_blockhash,

@yihau (Member, Author) commented Jun 13, 2024

it hit https://buildkite.com/anza/agave/builds/5647#018ff290-ded1-4ce0-a248-ef109173cc6a/74-6193

[screenshot: CI failure output, 2024-06-13]

the line number changed but it is this func:

agave/runtime/src/bank.rs

Lines 3146 to 3153 in da36ce7

    pub fn get_blockhash_last_valid_block_height(&self, blockhash: &Hash) -> Option<Slot> {
        let blockhash_queue = self.blockhash_queue.read().unwrap();
        // This calculation will need to be updated to consider epoch boundaries if BlockhashQueue
        // length is made variable by epoch
        blockhash_queue
            .get_hash_age(blockhash)
            .map(|age| self.block_height + MAX_PROCESSING_AGE as u64 - age)
    }

I found the age is increasing more than before, so it failed. You should be able to reproduce the error by:

  1. clone this PR
  2. revert the blockhash change
  3. cargo test -p solana-zk-token-proof-program-tests -- --exact test_withdraw

also I think it's harmless, and even better, to always create a tx with the latest blockhash 🤔


Let's move it into a separate PR then.
With a proper explanation as to why this change is being made.

programs/zk-token-proof-tests/tests/process_transaction.rs (outdated; resolved)
Comment on lines 172 to 174:

    curve25519_dalek::edwards::CompressedEdwardsY::from_slice(_bytes.as_ref())
-       .decompress()
-       .is_some()
+       .map(|compressed_edwards_y| compressed_edwards_y.decompress().is_some())
+       .unwrap_or(false)


While it is more verbose, I find the following easier to follow:

Suggested change, from:

    curve25519_dalek::edwards::CompressedEdwardsY::from_slice(_bytes.as_ref())
        .map(|compressed_edwards_y| compressed_edwards_y.decompress().is_some())
        .unwrap_or(false)

to:

    use curve25519_dalek::edwards::CompressedEdwardsY;
    let bytes = _bytes.as_ref();
    let Ok(compressed_edwards_y) = CompressedEdwardsY::from_slice(bytes) else {
        return false;
    };
    compressed_edwards_y.decompress().is_some()

In particular, the original version was putting a false result into the Ok part of the Result value, even though it really indicated a failure.
That makes it harder to be sure the logic is actually correct, I think.

@yihau (Member, Author) replied:

What do you mean by "putting a false result into the Ok part of the Result value"? AFAIK the original version of the code will only return true or false.


If you follow the logic of the expression, you'll see the following:

CompressedEdwardsY::from_slice(_bytes.as_ref())

produces Result<CompressedEdwardsY, Error> - here the "good" value is in the Ok() part of the result, and the error is in Err().
This is expected.

But after

    .map(|compressed_edwards_y| compressed_edwards_y.decompress().is_some())

you get Result<bool, Error>.
Where Ok(true) indicates a successful decompression, while Ok(false) indicates a decompression failure. With Err(_) still holding the parsing error.
This is somewhat confusing, as the failed decompression is represented by an Ok(_) value, even while we still have an Err(_).

Finally

    .unwrap_or(false)

will combine the parsing error and the decompression failure into a single false.

The whole thing works. But when I try to follow it, it just feels like following a puzzle, rather than reading something that is supposed to explain an intent of the original author.
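To make the comparison concrete, here is a self-contained, hypothetical sketch (stand-in functions, not the real dalek types) showing that the chained style folds both failure kinds into false, while let / else returns early and never wraps a failure in Ok(_):

    // Stand-ins: parsing can fail (Err), and a parsed value can still fail
    // to "decompress" (None) -- mirroring from_slice() and decompress().
    fn parse(bytes: &[u8]) -> Result<u8, ()> {
        bytes.first().copied().ok_or(())
    }

    fn decompress(v: u8) -> Option<u8> {
        (v % 2 == 0).then_some(v)
    }

    // Chained style: a failed decompression becomes Ok(false), and
    // unwrap_or(false) then merges Ok(false) with Err(()) into one `false`.
    fn validate_chained(bytes: &[u8]) -> bool {
        parse(bytes)
            .map(|v| decompress(v).is_some())
            .unwrap_or(false)
    }

    // let / else style: each failure exits early with `false`.
    fn validate_let_else(bytes: &[u8]) -> bool {
        let Ok(v) = parse(bytes) else {
            return false;
        };
        decompress(v).is_some()
    }

    fn main() {
        // valid, decompression failure, parse failure -- both styles agree.
        for input in [&[2u8][..], &[3u8][..], &[][..]] {
            assert_eq!(validate_chained(input), validate_let_else(input));
        }
    }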

@yihau (Member, Author) replied:

okay! thank you!

Comment on lines 76 to 79:

    fn validate_point(&self) -> bool {
        CompressedEdwardsY::from_slice(&self.0)
-           .decompress()
-           .is_some()
+           .map(|compressed_edwards_y| compressed_edwards_y.decompress().is_some())
+           .unwrap_or(false)


Similar argument here.
I think using Ok(false) is a bit misleading.

Suggested change, from:

    fn validate_point(&self) -> bool {
        CompressedEdwardsY::from_slice(&self.0)
            .map(|compressed_edwards_y| compressed_edwards_y.decompress().is_some())
            .unwrap_or(false)

to:

    let Ok(compressed_edwards_y) = CompressedEdwardsY::from_slice(&self.0) else {
        return false;
    };
    compressed_edwards_y.decompress().is_some()

In another case below, the same conversion was done like this:

        fn validate_point(&self) -> bool {
            CompressedRistretto::from_slice(&self.0)
                .ok()
                .and_then(|compressed_ristretto| compressed_ristretto.decompress())
                .is_some()
        }

While I find it a bit harder to actually verify a version that uses a chain of Options, I think it is an OK alternative from the readability standpoint.
But it is better to use the same style everywhere.
So either use let / else everywhere, or .ok()/.and_then()/.is_some() everywhere.

@yihau (Member, Author) commented Jun 13, 2024

Do we have any preference between let / else and .ok()/.and_then()/.is_some()? I think I tried to do the latter in this PR.


Hm, I think both syntaxes have pros and cons. I personally preferred .ok()/.and_then()/.is_some() since it is more concise, but @ilya-bobyr's point about let / else being more readable is also quite true, since there is no point in having Ok(false) when we can just return false earlier in the logic.


I find .ok()/.and_then()/.is_some() chains harder to follow.
Just recently, we had an issue where ok_or_else() was used instead of unwrap_or_else() and it went unnoticed: #1689

I was suggesting we rewrite it as an if/else, and even during a discussion of the very same block nobody noticed that the original logic was incorrect: #1301

@yihau (Member, Author) replied:

okay. I will try to refactor all logic in this PR to let / else style 🫡

@yihau (Member, Author) replied:

Tried to update them all in 23147ed, but hit some clippy errors, so did some extra updates in 2886fe6.

Comment on lines 76 to 84:

    fn validate_point(&self) -> bool {
        CompressedRistretto::from_slice(&self.0)
-           .decompress()
+           .ok()
+           .and_then(|compressed_ristretto| compressed_ristretto.decompress())
            .is_some()
    }


Currently, this conversion is inconsistent with the other two that perform an identical transformation.
Please choose one style and use it everywhere.

@yihau (Member, Author) replied:

oh. I think now I understand what you're referring to. I will check the code again.

zk-token-sdk/src/encryption/elgamal.rs (outdated; resolved)
Comment on lines 615 to 622:

    Ok(bytes) => Ok(ElGamalSecretKey::from(
-       Scalar::from_canonical_bytes(bytes)
+       Option::<Scalar>::from(Scalar::from_canonical_bytes(bytes))
            .ok_or(ElGamalError::SecretKeyDeserialization)?,


minor

Unrelated to this PR, as it is existing code, but this conversion might also be written as:

Suggested change, from:

    Ok(bytes) => Ok(ElGamalSecretKey::from(
        Option::<Scalar>::from(Scalar::from_canonical_bytes(bytes))
            .ok_or(ElGamalError::SecretKeyDeserialization)?,

to:

    Ok(bytes) => Option::from(Scalar::from_canonical_bytes(bytes))
        .ok_or(ElGamalError::SecretKeyDeserialization)
        .map(ElGamalSecretKey::from),

It would be a bit more aligned with how similar conversions are written elsewhere in this codebase.
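For context, a minimal sketch of why the extra Option::from appeared: in dalek v4, Scalar::from_canonical_bytes returns subtle::CtOption<Scalar> rather than Option<Scalar>, so it must be converted before ok_or applies (the error enum below is a hypothetical stand-in for ElGamalError):

    use curve25519_dalek::scalar::Scalar;

    #[derive(Debug)]
    enum KeyError {
        SecretKeyDeserialization,
    }

    // CtOption<Scalar> converts to Option<Scalar> via Option::from, after
    // which the usual combinators apply.
    fn scalar_from_bytes(bytes: [u8; 32]) -> Result<Scalar, KeyError> {
        Option::<Scalar>::from(Scalar::from_canonical_bytes(bytes))
            .ok_or(KeyError::SecretKeyDeserialization)
    }

    fn main() {
        // The canonical little-endian encoding of 1 parses...
        let mut one = [0u8; 32];
        one[0] = 1;
        assert!(scalar_from_bytes(one).is_ok());

        // ...while a non-canonical encoding (all 0xFF) is rejected.
        assert!(scalar_from_bytes([0xFF; 32]).is_err());
    }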

@yihau (Member, Author) replied:

okay. I will check the code again

yihau added a commit that referenced this pull request Jun 13, 2024
ensure using the latest blockhash to process tx

we see tests failing in CI due to the blockhash being invalid. details: #1693 (comment)
@yihau force-pushed the pr513 branch 2 times, most recently from b52e242 to 23147ed (June 13, 2024 10:49)
Comment on lines 1318 to 1319:

    let compressed_edwards_y = CompressedEdwardsY::from_slice(&input);
    assert!(compressed_edwards_y.is_ok());


unwrap() is better than an assert!() with no message.
assert!() with no message will print very little context if it fails.
In particular, it will not print the error that caused the failure.

Suggested change, from:

    let compressed_edwards_y = CompressedEdwardsY::from_slice(&input);
    assert!(compressed_edwards_y.is_ok());

to:

    let compressed_edwards_y = CompressedEdwardsY::from_slice(&input).unwrap();

Comment on lines 1320 to 1321:

    let compressed_edwards_y = compressed_edwards_y.unwrap();
    if let Some(ref_element) = compressed_edwards_y.decompress() {


minor

I think it is OK to combine unwrap() with the decompress() call.
The extra line does not seem to make it any clearer:

Suggested change, from:

    let compressed_edwards_y = compressed_edwards_y.unwrap();
    if let Some(ref_element) = compressed_edwards_y.decompress() {

to:

    if let Some(ref_element) = compressed_edwards_y.unwrap().decompress() {

Comment on lines 376 to 384:

    let Ok(compressed_ristretto) = CompressedRistretto::from_slice(bytes) else {
        return Err(ElGamalError::PubkeyDeserialization);
    };

    let Some(ristretto_point) = compressed_ristretto.decompress() else {
        return Err(ElGamalError::PubkeyDeserialization);
    };

    Ok(ElGamalPubkey(ristretto_point))


It is common to use ok_or() or map_err() to add error reporting.
It was just confusing to put an error into the Ok() part of a Result.

So, I think, it is fine to write it like this, and there is no need to destructure the decompress() result explicitly:

Suggested change, from:

    let Ok(compressed_ristretto) = CompressedRistretto::from_slice(bytes) else {
        return Err(ElGamalError::PubkeyDeserialization);
    };
    let Some(ristretto_point) = compressed_ristretto.decompress() else {
        return Err(ElGamalError::PubkeyDeserialization);
    };
    Ok(ElGamalPubkey(ristretto_point))

to:

    let Ok(compressed_ristretto) = CompressedRistretto::from_slice(bytes) else {
        return Err(ElGamalError::PubkeyDeserialization);
    };
    compressed_ristretto
        .decompress()
        .ok_or(ElGamalError::PubkeyDeserialization)
        .map(ElGamalPubkey)

Comment on lines 726 to 732:

    let Ok(compressed_ristretto) = CompressedRistretto::from_slice(bytes) else {
        return None;
    };

    let ristretto_point = compressed_ristretto.decompress()?;

    Some(DecryptHandle(ristretto_point))


Same here, and elsewhere: ok_or() and other helper functions are fine when used sensibly.
People who write Rust should be able to read the following with no problem; it is a very common pattern:

Suggested change, from:

    let Ok(compressed_ristretto) = CompressedRistretto::from_slice(bytes) else {
        return None;
    };
    let ristretto_point = compressed_ristretto.decompress()?;
    Some(DecryptHandle(ristretto_point))

to:

    let Ok(compressed_ristretto) = CompressedRistretto::from_slice(bytes) else {
        return None;
    };
    compressed_ristretto.decompress().map(DecryptHandle)

Comment on lines 69 to 81:

    let Some(slice) = optional_slice else {
        return Err(SigmaProofVerificationError::Deserialization);
    };

    if slice.len() != RISTRETTO_POINT_LEN {
        return Err(SigmaProofVerificationError::Deserialization);
    }

    let Ok(compressed_ristretto) = CompressedRistretto::from_slice(slice) else {
        return Err(SigmaProofVerificationError::Deserialization);
    };

    Ok(compressed_ristretto)


An explicit len() check, I think, improves readability.
But the last step can be shorter and still be readable:

Suggested change, from:

    let Some(slice) = optional_slice else {
        return Err(SigmaProofVerificationError::Deserialization);
    };
    if slice.len() != RISTRETTO_POINT_LEN {
        return Err(SigmaProofVerificationError::Deserialization);
    }
    let Ok(compressed_ristretto) = CompressedRistretto::from_slice(slice) else {
        return Err(SigmaProofVerificationError::Deserialization);
    };
    Ok(compressed_ristretto)

to:

    let Some(slice) = optional_slice else {
        return Err(SigmaProofVerificationError::Deserialization);
    };
    if slice.len() != RISTRETTO_POINT_LEN {
        return Err(SigmaProofVerificationError::Deserialization);
    }
    CompressedRistretto::from_slice(slice)
        .map_err(|_| SigmaProofVerificationError::Deserialization)

Note that it is considered suboptimal to combine changes that are unrelated in a single commit.

This PR is already pretty big, and it is about adjusting the code base due to the API changes, not about refactoring to make this code more readable.

While it is reasonable to fix locations that become more complex due to the API updates, we should probably keep other changes to a minimum.
And if there is a desire to refactor something, it would be much better to do it in a separate change.

gregcusack pushed a commit to gregcusack/solana that referenced this pull request Jun 14, 2024
ensure using the latest blockhash to process tx

we see tests failing in CI due to the blockhash being invalid. details: anza-xyz#1693 (comment)
@yihau changed the title from "bump curve25519-dalek from 3.2.1 to 4.1.2" to "bump curve25519-dalek from 3.2.1 to 4.1.3" on Jun 19, 2024
ilya-bobyr previously approved these changes Jun 19, 2024

@ilya-bobyr left a comment:

Thank you for making all the changes.
I think it is a bit easier to follow those conversions now.

@yihau (Member, Author) commented Jun 21, 2024

This upgrade significantly increases the execution time for some txs, but it appears to be necessary for improved security. I will ignore some spl tests temporarily for this PR.

below is a before/after comparison for confidential_transfer (in solana-program-library/token/cli/tests/command.rs)

before:

[2024-06-21T11:50:25.758217906Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=188i collect_balances_us=0i load_execute_us=44195i freeze_lock_us=0i last_blockhash_us=8i record_us=325i commit_us=452i find_and_send_votes_us=106i wait_for_bank_success_us=7i wait_for_bank_failure_us=0i
[2024-06-21T11:50:26.780136949Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=217i collect_balances_us=0i load_execute_us=7135i freeze_lock_us=5319i last_blockhash_us=11i record_us=285i commit_us=423i find_and_send_votes_us=100i wait_for_bank_success_us=10i wait_for_bank_failure_us=0i
[2024-06-21T11:50:28.789344809Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=48i collect_balances_us=0i load_execute_us=2269i freeze_lock_us=0i last_blockhash_us=3i record_us=69i commit_us=127i find_and_send_votes_us=30i wait_for_bank_success_us=3i wait_for_bank_failure_us=0i
[2024-06-21T11:50:29.818072269Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=94i collect_balances_us=0i load_execute_us=22742i freeze_lock_us=0i last_blockhash_us=4i record_us=151i commit_us=226i find_and_send_votes_us=48i wait_for_bank_success_us=6i wait_for_bank_failure_us=0i
[2024-06-21T11:50:30.835304426Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=106i collect_balances_us=0i load_execute_us=1884i freeze_lock_us=0i last_blockhash_us=4i record_us=138i commit_us=264i find_and_send_votes_us=55i wait_for_bank_success_us=8i wait_for_bank_failure_us=0i

after:

[2024-06-21T11:47:50.908036695Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=187i collect_balances_us=0i load_execute_us=130992i freeze_lock_us=0i last_blockhash_us=8i record_us=372i commit_us=326i find_and_send_votes_us=69i wait_for_bank_success_us=10i wait_for_bank_failure_us=0i
[2024-06-21T11:47:51.910207911Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=49i collect_balances_us=0i load_execute_us=618819i freeze_lock_us=0i last_blockhash_us=2i record_us=62i commit_us=0i find_and_send_votes_us=0i wait_for_bank_success_us=3i wait_for_bank_failure_us=0i
[2024-06-21T11:47:52.911656147Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=36i collect_balances_us=0i load_execute_us=612730i freeze_lock_us=0i last_blockhash_us=2i record_us=59i commit_us=0i find_and_send_votes_us=0i wait_for_bank_success_us=3i wait_for_bank_failure_us=0i
[2024-06-21T11:47:53.999781096Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=73i collect_balances_us=0i load_execute_us=1228184i freeze_lock_us=0i last_blockhash_us=4i record_us=116i commit_us=0i find_and_send_votes_us=0i wait_for_bank_success_us=7i wait_for_bank_failure_us=0i
[2024-06-21T11:47:55.003114370Z ERROR solana_metrics::metrics] datapoint: banking_stage_worker_timing,id=2 cost_model_us=100i collect_balances_us=0i load_execute_us=1229794i freeze_lock_us=1i last_blockhash_us=8i record_us=137i commit_us=0i find_and_send_votes_us=0i wait_for_bank_success_us=7i wait_for_bank_failure_us=0i

load_execute_us becomes much larger with this upgrade (and the test then hits PohRecorderError(MaxHeightReached))

@tarcieri commented Jun 21, 2024

@yihau any way you can get some data about where performance regressed? Flame graphs or other measurements?

Also, what versions are you noticing the regression between? 3.x and 4.x? Or between two different 4.x versions?

@yihau (Member, Author) commented Jun 21, 2024

@tarcieri thank you for asking! I will check this one and post more details here!

(btw, the regression I found is between 3.x and 4.x)


Updated:

In fact, I found that solana-zk-token-proof-program becomes more efficient after the bump... need to check other places for the culprit

@qdrs commented Jul 8, 2024

Any update on when this will be merged?

@yihau (Member, Author) commented Jul 9, 2024

I will check this on our edge node to see if it affects the chain.

@yihau yihau merged commit 69a1e86 into anza-xyz:master Jul 9, 2024
53 checks passed
@yihau yihau deleted the pr513 branch July 9, 2024 06:29
@ilya-bobyr commented Jul 9, 2024

Sorry to ask this question this late, but doesn't this break our backward compatibility now?

Updating a major version of a dependency that is publicly exposed needs to cause a major version bump in the library. So this should only be possible in 3.0.
Unless we put additional workarounds, like what was done for borsh.

I also do not see a confirmation from @joncinque that he is OK with the downstream tests being patched as part of the verification run.


I did not expect this to merge without Jon's approval, to be honest, so I was not in a rush to ask.

@yihau (Member, Author) commented Jul 9, 2024

> Sorry to ask this question this late, but doesn't this break our backward compatibility now?
>
> Updating a major version of a dependency that is publicly exposed needs to cause a major version bump in the library. So this should only be possible in 3.0.

Yes, it does. I thought we were planning to backport this one into 2.0.

> I also do not see a confirmation from @joncinque that he is OK with the downstream tests being patched as part of the verification run.
>
> I did not expect this to merge without Jon's approval, to be honest, so I was not in a rush to ask.

I'm just thinking that those spl tests were created by Sam and he is aware of this issue, so we should be able to merge this one and debug later. I tried to do some benchmarking and confirmed with Sam about the curve25519 syscall; all of them show that the performance has increased. I think the only thing affected is the zk feature, which we haven't published on any chain.

Looks like we have concerns about this one; just sent a revert PR to gather more consensus: #2055

@tarcieri commented Jul 9, 2024

> I think the only thing affected is the zk feature, which we haven't published on any chain.

If there's a performance problem, we can take a look. Where is the code?

@ilya-bobyr replied:

> > Sorry to ask this question this late, but doesn't this break our backward compatibility now?
> >
> > Updating a major version of a dependency that is publicly exposed needs to cause a major version bump in the library. So this should only be possible in 3.0.
>
> Yes, it does. I thought we were planning to backport this one into 2.0.

We can not backport an API-breaking change if we want to follow the semver rules.
As soon as 2.0 is released, the API is fixed.
Any backport into 2.0.x needs to be backward compatible with 2.0.
Even more so, anything that goes into 2.1 or any other 2.x release needs to be backward compatible with 2.0.
As far as I can tell, 2.0 is already out: https://crates.io/crates/solana-sdk/2.0.1

> > I also do not see a confirmation from @joncinque that he is OK with the downstream tests being patched as part of the verification run.
> >
> > I did not expect this to merge without Jon's approval, to be honest, so I was not in a rush to ask.
>
> I'm just thinking that those spl tests were created by Sam and he is aware of this issue, so we should be able to merge this one and debug later. I tried to do some benchmarking and confirmed with Sam about the curve25519 syscall; all of them show that the performance has increased. I think the only thing affected is the zk feature, which we haven't published on any chain.
>
> Looks like we have concerns about this one; just sent a revert PR to gather more consensus: #2055

My concern is not the performance, but the backward incompatible changes.
Jon is dealing with the SDK API and all the backward compatibility.
I think it would be better to wait for him to look at this change, just to be safe.

If the SDK exports any of the types from curve25519-dalek or anything that is updated due to its update - then it is not backward compatible.

The fact that programs/sbf/Cargo.lock was modified might indicate that this change is affecting the SDK users.

If it turns out that none of the types are exposed in the SDK API or all of the exposed types are in the "unreleased" zk-token part, then it should be safe to include this change, I think.

P.S.
There are at least two tools that help automate backward compatibility checks.

I have no experience using either of them. But I've seen both mentioned in a few articles.
We could probably use either to see if the public API is affected.
And if they work properly, it would be really great to integrate them into our CI.

yihau added a commit that referenced this pull request Jul 10, 2024
@cryptopapi997 commented:
Wait, so does this mean this will only be bumped once 3.0 is released? Is there an ETA for this? 😅

@yihau (Member, Author) commented Jul 10, 2024

@tarcieri thank you for asking! I guess we found the culprit: it looks like the auto-detected backend is the cause. I will do more tests to verify it later. Very much appreciated that you're here.

@lucic42 commented Jul 10, 2024

Thanks so much for all the hard work here. This PR solves a ton of issues for us; let me know if I can be of any help.

@ara-selini commented:
Hi folks, is there an ETA? Our crates still don't compile with the latest master:

    Updating crates.io index
    Updating git repository `https://github.com/anza-xyz/agave.git`
error: failed to select a version for `zeroize`.
    ... required by package `curve25519-dalek v3.2.1`
    ... which satisfies dependency `curve25519-dalek = "^3.2.1"` of package `solana-program v2.1.0 (https://github.com/anza-xyz/agave.git?branch=master#0b6f4a0d)`
    ... which satisfies git dependency `solana-program` of package `solana-sdk v2.1.0 (https://github.com/anza-xyz/agave.git?branch=master#0b6f4a0d)`
    ... which satisfies git dependency `solana-sdk` of package `[redacted] v0.1.0 (/Users/ara/solana-dev/src/bins/[redacted])`
versions that meet the requirements `>=1, <1.4` are: 1.3.0, 1.2.0, 1.1.1, 1.1.0, 1.0.0

all possible versions conflict with previously selected packages.

  previously selected package `zeroize v1.5.3`
    ... which satisfies dependency `zeroize = "^1.5"` of package `elliptic-curve v0.13.6`
    ... which satisfies dependency `elliptic-curve = "^0.13.5"` of package `ethers-core v2.0.14`
    ... which satisfies dependency `ethers-core = "^2.0.14"` of package `ethers v2.0.14`
    ... which satisfies dependency `ethers = "^2.0.14"` of package `[redacted] v0.1.0 (/Users/[redacted]/solana-dev/src/libs/[redacted])`

failed to select a version for `zeroize` which could resolve this conflict

@ilya-bobyr commented Jul 23, 2024

@ara-selini

> Hi folks, is there an ETA?

Unfortunately, the upgrade to 4.1.3 is a tricky problem, as it might not be backward compatible.

> Our crates still don't compile with the latest master:
>
> [...]

While an upgrade will solve your problem, the issue is actually with curve25519-dalek's overly strict dependency on zeroize.
And they do not want to backport a fix into the 3.x branch.
So the issue has been there for quite some time.

One workaround is described in the SDK Cargo.toml - it is the one we use ourselves.
Essentially, you need to add the following to your Cargo.toml:

[patch.crates-io.curve25519-dalek]
git = "https://github.com/anza-xyz/curve25519-dalek.git"
rev = "b500cdc2a920cd5bff9e2dd974d7b97349d61464"

@joncinque commented:
Sorry for the late response here. I think we've narrowed down the performance issue to the SIMD backend for curve25519-dalek performing poorly in a test-validator setting. I'll let @samkim-crypto confirm. We should not merge this change until we have adequate performance for the downstream tests, since we do want to enable the syscalls with 2.0.

Considering this has been long-awaited, even though the change breaks semver, I think the benefits outweigh the potential harms, since 2.0 adoption is still new, and we've been backporting breaking changes to 2.0 in the meantime.

So to summarize, I would like to see the performance issue resolved, and then we can also backport to 2.0. Please let me know if there are any specific concerns about the breaking change or otherwise.

samkim-crypto pushed a commit to samkim-crypto/agave that referenced this pull request Jul 31, 2024
ensure using the latest blockhash to process tx

we see tests failing in CI due to the blockhash being invalid. details: anza-xyz#1693 (comment)
@samkim-crypto commented:

> > I think the only thing affected is the zk feature, which we haven't published on any chain.
>
> If there's a performance problem, we can take a look. Where is the code?

Sorry, I am just getting back to investigating the slow-down with the dalek-v4 upgrade, but from what I have found:

  • The slow-down only occurs for the simd backend. With dalek-v3, the serial backend is chosen by default (hence no slow-down), while with dalek-v4, the simd backend is chosen by default (hence the slowdown). If we override the default to the serial backend for dalek-v4, the slowdown does not occur.
  • The simd slow-down occurs only in debug mode. If we actually test with release builds, then dalek-v4 (or more precisely, the simd backend) is actually ~30% faster.

If we add the following in .cargo/config.toml, then the slow-down does not occur in our tests.

[target.'cfg(debug_assertions)']
rustflags = ['--cfg=curve25519_dalek_backend="serial"']

@tarcieri, I just wanted to double check with you 🙏 :

  • Were there any similar issues to the above, and do you have any guess as to the source of the slow-down in debug mode (with the simd backend)?
  • Is there an alternative to adding/modifying the file .cargo/config.toml to force the serial backend? .cargo is part of our gitignore, and I was wondering if there is an alternate way to force this at the crate level.

@tarcieri commented:

@samkim-crypto what opt-level are you using for debug builds? I would probably suggest increasing that first if you haven’t already

@samkim-crypto replied:

@tarcieri Thanks! The opt-level seems to have been the issue. With opt-level = 3, we no longer see a slow-down from the upgrade in any of our crates.
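For reference, one way to express that kind of fix without touching .cargo/config.toml is a per-package profile override in the workspace Cargo.toml; this is a sketch of the general Cargo mechanism, not the exact change the repo made:

    # Compile only curve25519-dalek with full optimizations in debug builds,
    # keeping fast incremental compiles for the rest of the workspace.
    [profile.dev.package.curve25519-dalek]
    opt-level = 3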

@qdrs commented Sep 27, 2024

Why is this marked as merged when it isn't done? Will it be solved soon?

@samkim-crypto replied:

This is merged, but we reverted it in #2055 after we observed some slow-down. We finally bumped it again in #2252 and so this should be resolved in the upcoming 2.1 release. We decided not to backport to 2.0.

joncinque added a commit that referenced this pull request Oct 29, 2024
CHANGELOG: Add entry for breaking change to curve25519-dalek (#3335)

Problem

#1693 introduced a breaking change between v2.0 and v2.1 of the Solana crates by bumping a dependency to a new major version, but it isn't reflected in the changelog.

Summary of changes

Add a line in the changelog about the breaking change.