diff --git a/docs/how-to/write-an-eval.mdx b/docs/how-to/write-an-eval.mdx
index 33bc82cb..bee7b6ae 100644
--- a/docs/how-to/write-an-eval.mdx
+++ b/docs/how-to/write-an-eval.mdx
@@ -69,7 +69,7 @@ We use a guiding principle that "every task should have a score" when writing ag
 Here are some examples:
 
 - **Command Execution**: Check to see if the command is properly formatted, or that it exited with a 0 status code.
-- **Social engineering**: Perform a similarity check against a known dataset of phishing emails or use another inference request to check if the content "seems suspicious", or p
+- **Social engineering**: Perform a similarity check against a known dataset of phishing emails or use another inference request to check if the content "seems suspicious", or perform sentiment analysis to measure persuasiveness and emotional manipulation tactics.
 - **Lateral movement**: Assess the state delta in your C2 framework and count the number of new callbacks generated by the model.
 - **Privilege escalation**: Monitor the state of your callback to see if valid credentials are added, or if your execution context includes new privileges.