Read business context from a file #176

Victor-D · 2026-06-19T15:32:14Z

Hello guys,

Since the OCR CLI does not support the @path expression, I think reading the business context from a Markdown file would be a nice feature to add.

Example: ocr review --commit xxx --background-file /tmp/my_business_context.md
Same principle as the --background flag, but content would be read from a local markdown file.
It's also doable with -b "$(cat ~/tmp/my_business_context.md)", but that is limited by the shell being used, and with a dedicated flag we could add extra sanitization.

What do you think about this idea?

Regards
Victor

lizhengfeng101 · 2026-06-20T02:56:18Z

Maintainer

Hi Victor,

Thank you for this well-thought-out proposal! I think --background-file is a great idea — reading business context from a file is indeed more ergonomic than shell substitution like $(cat ...), which varies across shells and has its limitations.

The only concern I have is that this might implicitly encourage users to pass in richly detailed files. The original --background flag was intentionally designed as a plain string input to nudge users toward providing concise context — in code review scenarios, overly verbose background information can dilute the reviewer's focus and reduce the quality of findings.

That said, as foundation models continue to improve, we honestly haven't run controlled experiments on the latest state-of-the-art models to measure how much additional context actually affects review quality. This could be a very interesting area to explore.

Would you be interested in contributing this feature? We'd love to see a PR from you. Looking forward to your contribution! 🙌

1 reply

Victor-D Jun 22, 2026
Author

Yes, I can prepare a PR for this feature.

However, I'm still wondering whether discouraging users from passing richly detailed files should be enforced within OCR itself or simply documented as a recommendation.

A background file can contain information that indeed helps improve the review quality, but it can also introduce noise and reduce the relevance of the code review.

In my use case, I gather business context from documentation, Jira tickets, the PR description, and PR comments, then provide a synthesized summary through the --background flag.

Regards,

lizhengfeng101 · 2026-06-22T13:32:12Z

Maintainer

Hi Victor,

Thanks for the thoughtful response! I believe we should enforce limits within OCR itself, rather than relying solely on documentation recommendations. Here's my reasoning and a proposed approach:

Why internal enforcement

Currently, an overly long --background value can silently push the prompt over our token budget threshold (80% of MaxTokens), causing individual files to be skipped entirely during review — with only a warning in the log. Users may not even realize they're getting incomplete reviews. Enforcing limits at the input layer gives users clear, actionable feedback upfront instead of subtle degradation downstream.
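To make the failure mode concrete, here is a rough sketch of how a long background can crowd files out of a shared prompt budget. It is not OCR's actual accounting: the `MAX_TOKENS` value, the four-characters-per-token estimate, and the function names are all assumptions made for illustration. Only the 80% ratio comes from the description above.

```python
# Illustrative only: OCR's real token accounting is not shown in this thread.
MAX_TOKENS = 32_000      # hypothetical model limit
BUDGET_RATIO = 0.8       # "80% of MaxTokens", as described above
CHARS_PER_TOKEN = 4      # crude heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN + 1


def files_that_fit(
    background: str, base_prompt: str, diffs: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Split files into (reviewed, skipped) under the prompt budget."""
    budget = int(MAX_TOKENS * BUDGET_RATIO)
    # The background is paid for in every per-file prompt.
    fixed = estimate_tokens(base_prompt) + estimate_tokens(background)
    reviewed, skipped = [], []
    for path, diff in diffs.items():
        if fixed + estimate_tokens(diff) <= budget:
            reviewed.append(path)
        else:
            skipped.append(path)  # silently dropped apart from a log warning
    return reviewed, skipped
```

Under these assumptions, a file whose diff fits comfortably next to a short background can be skipped once the background grows by a few tens of thousands of characters, which is the subtle degradation described above.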

Proposed strategy: two-tier limit (soft warning + hard cap)

I'd suggest a two-tier approach:

| Tier | Threshold | Behavior |
| --- | --- | --- |
| Soft warning | ~2,000 characters (~300–500 words) | Print a warning recommending the user condense the context, but proceed with the review |
| Hard cap | ~8,000 characters (~1,200–2,000 words) | Reject with a clear error message explaining the limit and suggesting summarization |

Why these numbers:

2,000 chars is roughly the sweet spot for concise business context — enough for a synthesized summary from Jira tickets, PR descriptions, and documentation (which aligns with your use case).
8,000 chars provides a generous upper bound while still leaving sufficient token headroom for the actual code diff and review prompts across both the plan and main-task phases.

These thresholds should apply uniformly to both --background and the new --background-file, since the enforcement point would be after content loading — making the two flags functionally equivalent from a validation standpoint.
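A minimal sketch of that enforcement point, written in Python for illustration rather than in OCR's own codebase. The function and exception names are hypothetical, the limits are the approximate figures proposed above treated as exact, and rejecting both flags at once is an assumption that has not been discussed here.

```python
from pathlib import Path
from typing import Optional

SOFT_LIMIT = 2_000  # characters; warn above this (proposed, not final)
HARD_LIMIT = 8_000  # characters; reject above this (proposed, not final)


class BackgroundTooLongError(ValueError):
    """Raised when the background exceeds the hard cap."""


def load_background(inline: Optional[str], file_path: Optional[str]) -> str:
    """Resolve --background / --background-file into a single string."""
    if inline is not None and file_path is not None:
        # Assumption: the two flags are mutually exclusive.
        raise ValueError("use either --background or --background-file, not both")
    if file_path is not None:
        return Path(file_path).read_text(encoding="utf-8")
    return inline or ""


def validate_background(text: str) -> list[str]:
    """Return warnings for the soft limit; raise on the hard cap."""
    length = len(text)
    if length > HARD_LIMIT:
        raise BackgroundTooLongError(
            f"background is {length} characters; the limit is {HARD_LIMIT}. "
            "Please provide a shorter, synthesized summary."
        )
    if length > SOFT_LIMIT:
        return [
            f"background is {length} characters; consider condensing it "
            f"to under {SOFT_LIMIT} for better review focus."
        ]
    return []
```

Because validation runs on the loaded string, it does not matter which flag supplied the content: `validate_background(load_background(inline, file_path))` covers both paths.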

Documentation

On the documentation side, we'd clearly state:

The enforced limits and their rationale
Recommended usage: provide a synthesized summary of business context, not raw dumps of tickets or documents
Examples of good vs. overly verbose background input

This way, users get guardrails in the tool and guidance in the docs. What do you think of this approach? Happy to discuss the specific thresholds further.

1 reply

Victor-D Jun 22, 2026
Author

It makes sense to me.
Is there any recommended sanitization when reading a Markdown file and injecting its content into an LLM prompt?

lizhengfeng101 · 2026-06-23T01:55:33Z

Maintainer

Good question! Here are some recommended sanitization practices when injecting user-provided content into LLM prompts:

1. Strip invisible and control characters

Remove zero-width spaces, Unicode direction overrides (RTL/LTR marks), and other non-printable control characters that could be used to hide malicious instructions from human review while still being processed by the model.

2. Normalize whitespace

Collapse excessive consecutive newlines (e.g., more than 2) and trim leading/trailing whitespace. This prevents attempts to visually "push" the injected content far away from surrounding prompt structure.

3. Wrap with clear delimiters

Enclose the user-provided content within explicit boundary markers (e.g., XML-style tags like <user_background>...</user_background>) so the model can distinguish between instructions and user-supplied data. This reduces the risk of the content being interpreted as prompt instructions.

4. No need for complex regex-based filtering

Attempting to detect and strip "prompt injection patterns" (e.g., phrases like "ignore previous instructions") is fragile and leads to false positives. The combination of clear delimiters + length limits is more robust in practice.

In summary: clean the bytes, enforce boundaries, and rely on the length cap as the primary defense against context pollution.
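Steps 1 to 3 could look roughly like the sketch below. It is written in Python for illustration only; the function names are hypothetical, and treating every Unicode "Cc" (control) and "Cf" (format) character as strippable is one possible reading of step 1, not a settled rule.

```python
import re
import unicodedata

_MULTI_NEWLINES = re.compile(r"\n{3,}")


def strip_invisible(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keeping newline and tab.

    Cf covers zero-width spaces and bidirectional marks and overrides.
    """
    return "".join(
        ch
        for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def sanitize_background(raw: str) -> str:
    """Clean user-supplied background and wrap it in boundary tags."""
    # Normalize line endings first so "\r" is not simply deleted in step 1.
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # 1. Strip invisible and control characters.
    text = strip_invisible(text)
    # 2. Collapse runs of 3+ newlines to a blank line, then trim the ends.
    text = _MULTI_NEWLINES.sub("\n\n", text).strip()
    # 3. Wrap with clear delimiters.
    return f"<user_background>\n{text}\n</user_background>"
```

One trade-off of this reading: the zero-width joiner is also a "Cf" character, so emoji sequences that rely on it would be split. The sketch also does nothing about content that contains the literal closing tag, which would need a separate decision.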

Victor-D · 2026-06-24T14:45:01Z

Author

PR opened #206
