Replies: 4 comments 2 replies
-
|
Hi Victor,

Thank you for this well-thought-out proposal! I think

The only concern I have is that this might implicitly encourage users to pass in richly detailed files. The original

That said, as foundation models continue to improve, we honestly haven't run controlled experiments on the latest state-of-the-art models to measure how much additional context actually affects review quality. This could be a very interesting area to explore.

Would you be interested in contributing this feature? We'd love to see a PR from you. Looking forward to your contribution! 🙌 |
-
|
Hi Victor,

Thanks for the thoughtful response! I believe we should enforce limits within OCR itself, rather than relying solely on documentation recommendations. Here's my reasoning and a proposed approach:

Why internal enforcement

Currently, an overly long

Proposed strategy: two-tier limit (soft warning + hard cap)

I'd suggest a two-tier approach:

Why these numbers:

These thresholds should apply uniformly to both

Documentation

On the documentation side, we'd clearly state:

This way, users get guardrails in the tool and guidance in the docs. What do you think of this approach? Happy to discuss the specific thresholds further. |
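To make the two-tier idea concrete, here is a minimal Python sketch of a soft warning plus a hard cap. Everything in it is an assumption for illustration: the threshold values, the function name `apply_background_limits`, counting characters rather than tokens, and truncating (rather than rejecting) at the hard cap are not taken from OCR itself.

```python
# Hypothetical thresholds: the actual numbers proposed in the thread are not
# reproduced here, so these values are placeholders only.
SOFT_LIMIT_CHARS = 2_000
HARD_LIMIT_CHARS = 8_000


def apply_background_limits(text, soft=SOFT_LIMIT_CHARS, hard=HARD_LIMIT_CHARS):
    """Return (possibly truncated text, list of warning messages).

    Above the soft limit the text is kept and a warning is recorded.
    Above the hard cap the text is truncated (an assumed policy) and a
    warning is recorded.
    """
    warnings = []
    length = len(text)
    if length > hard:
        warnings.append(
            f"background is {length} chars; truncated to hard cap of {hard}"
        )
        text = text[:hard]
    elif length > soft:
        warnings.append(
            f"background is {length} chars; exceeds soft limit of {soft}"
        )
    return text, warnings
```

A caller would print the returned warnings to stderr and pass the returned text on to the prompt builder, so the same check can serve both an inline flag value and file content.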
-
|
Good question! Here are some recommended sanitization practices when injecting user-provided content into LLM prompts:

1. Strip invisible and control characters
Remove zero-width spaces, Unicode direction overrides (RTL/LTR marks), and other non-printable control characters that could be used to hide malicious instructions from human review while still being processed by the model.

2. Normalize whitespace
Collapse excessive consecutive newlines (e.g., more than 2) and trim leading/trailing whitespace. This prevents attempts to visually "push" the injected content far away from surrounding prompt structure.

3. Wrap with clear delimiters
Enclose the user-provided content within explicit boundary markers (e.g., XML-style tags like

4. No need for complex regex-based filtering
Attempting to detect and strip "prompt injection patterns" (e.g., phrases like "ignore previous instructions") is fragile and leads to false positives. The combination of clear delimiters + length limits is more robust in practice.

In summary: clean the bytes, enforce boundaries, and rely on the length cap as the primary defense against context pollution. |
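The first three practices above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag name `background_context` and the function names are hypothetical, and the thread does not say which delimiter OCR uses.

```python
import re
import unicodedata

# Hypothetical tag name; the delimiter actually used is not given in the thread.
TAG = "background_context"


def sanitize_background(text: str) -> str:
    """Strip invisible/control characters and normalize whitespace."""
    # Treat Windows and old Mac line endings as plain newlines first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    kept = []
    for ch in text:
        category = unicodedata.category(ch)
        # Cf = format characters (zero-width spaces, direction overrides, BOM).
        if category == "Cf":
            continue
        # Cc = control characters; keep only newline and tab.
        if category == "Cc" and ch not in "\n\t":
            continue
        kept.append(ch)
    cleaned = "".join(kept)
    # Collapse runs of three or more newlines down to two, then trim the ends.
    cleaned = re.sub(r"\n{3,}", "\n\n", cleaned)
    return cleaned.strip()


def wrap_background(text: str) -> str:
    """Enclose sanitized content in explicit boundary markers."""
    return f"<{TAG}>\n{sanitize_background(text)}\n</{TAG}>"
```

Note that this sketch does not handle content that itself contains the closing tag; a real implementation would need a policy for that case.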
-
|
PR opened #206 |
-
Hello guys,

As the OCR CLI does not support the @path expression, I think reading the business context from a markdown file would be a nice feature to add to the OCR CLI.

Example:
ocr review --commit xxx --background-file /tmp/my_business_context.md

Same principle as the --background flag, but the content would be read from a local markdown file.

It's also doable with -b "$(cat ~/tmp/my_business_context.md)", but that is limited by the shell being used, and with a dedicated flag we could add extra sanitization.

What do you think about this idea?

Regards,
Victor
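As a rough illustration of the proposal, the sketch below shows how a CLI could accept either an inline background string or a file path. It is written in Python with `argparse` purely for demonstration; the program name `ocr-sketch`, the function names, and the choice to make the two flags mutually exclusive are assumptions, not a description of how OCR is implemented.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    # "ocr-sketch" is a hypothetical program name used only for this example.
    parser = argparse.ArgumentParser(prog="ocr-sketch")
    # Assumed policy: the inline flag and the file flag cannot be combined.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-b", "--background", default=None,
                       help="business context passed inline")
    group.add_argument("--background-file", type=Path, default=None,
                       help="path to a markdown file holding the business context")
    return parser


def resolve_background(argv) -> str:
    """Return the background text from whichever flag was supplied."""
    args = build_parser().parse_args(argv)
    if args.background_file is not None:
        # Reading the file directly avoids shell quoting and argument-length limits.
        return args.background_file.read_text(encoding="utf-8")
    return args.background or ""
```

Reading the file inside the tool, rather than through shell substitution, also gives one place to apply any sanitization before the text reaches the prompt.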