Conversation
@tomtseng in case it is of any help, here is the code I used for the StrongREJECT rubric: https://github.com/criticalml-uw/TamperBench/tree/sh/cais_checkpoint/scripts/strong_reject_rubric_backup The main two files of concern are:
But there are other files in that directory that are also relevant (e.g. converting the embedding attack to use StrongREJECT as well). It isn't an extremely elegant implementation, but the scripts are mostly stand-alone.
class StrongRejectFinetunedEvaluation(_StrongRejectBaseEvaluation[S]):
    """StrongREJECT Evaluation class using the fine-tuned classifier."""

    name: EvalName = EvalName.STRONG_REJECT_FINETUNED
Idea: maybe we can give them the same name, but different metrics?
Right now the compute_inferences step is shared between them, so we would just need to add the additional metrics.
If they shared the same name, then (untested) it would directly load the inferences once either one has been computed; for the metrics, we could check in compute_scores whether the relevant metrics already exist.
@@ -87,54 +144,65 @@ def compute_scores(self, inferences: DataFrame[InferenceSchema]) -> DataFrame[Sc
    inferences_dataframe: DataFrame[InferenceSchema] = InferenceSchema.validate(inferences)
We could, for example, check here (and vice versa for the other, if we had different metric names).
Changes
We ran the rubric-based StrongREJECT for the paper but never committed it. Here's Claude Code's attempt at an implementation of the scorer.
I should probably close this PR, though; I'm assuming Saad has an implementation he used for the StrongREJECT rubric results.
Testing