[Feature Request] Use Omni Parser for better UI element detection #2834

taoisu · 2025-01-10T18:18:09Z

taoisu
Jan 10, 2025

The current method of detecting clickable UI elements isn't able to find the "Verify you are a human" element.

This issue can be solved by using Omni Parser.

Would be huge to bypass some of the agent blockers. Proposing here for consideration.

gregpr07 · 2025-01-10T21:11:23Z

gregpr07
Jan 10, 2025
Maintainer

Very cool idea - do you have any insights into how big the model is?

0 replies

taoisu · 2025-01-11T06:02:07Z

taoisu
Jan 11, 2025
Author

Models can be found here: https://huggingface.co/microsoft/OmniParser/tree/main

It looks like they have caption models and icon detection models, and the caption models are much bigger. I haven't dived too deep into it, but I assume that using just the icon detection models would already be enough?

0 replies

Cfomodz · 2025-01-12T21:17:12Z

Cfomodz
Jan 12, 2025

Interesting. I am not saying it's not needed, but yesterday I actually had it click the box (running 4o). It was then unable to click on the images containing a bus, though it correctly determined that it needed to click on the buses after clicking the checkbox. I considered opening an issue suggesting that it fall out of its normal loop if a captcha is detected and essentially ask the human to let it know when it can resume, once the captcha is solved. Presumably the human would solve it, but a tool could also be implemented on the user's side to use a 3rd party captcha solver, or, as you have pointed to here, something else could step in to address it.

0 replies

gregpr07 · 2025-01-13T15:45:44Z

gregpr07
Jan 13, 2025
Maintainer

I mean, captchas are a different beast anyway (especially Cloudflare), so OmniParser wouldn't work haha. But @taoisu, do you have any other examples where objects are not detected? Maybe we can fix them without fancy parsers (not saying we don't need them, but I don't know of any examples where we don't detect stuff that websites "want" you to detect).

@Cfomodz have you thought deeper about human-in-the-loop approaches? Would love for you to open an issue and explain this in more detail!

0 replies

Cfomodz · 2025-01-13T19:42:59Z

Cfomodz
Jan 13, 2025

> @Cfomodz have you thought deeper about human-in-the-loop approaches? Would love for you to open an issue and explain this in more detail!

I mean, I'll open an issue if you want, as I think captcha support is important, if not integral, to being able to do tons of tasks, even if it's just by asking the human to take care of it. But I'll be honest: the extent to which I've thought about it is just something to the effect of 'if "captcha" in current_task, call get_help_from_human()'.
So I'm not sure that exactly qualifies as 'in more detail' 😅
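
The one-line idea above could be sketched roughly as follows. This is only an illustration: `needs_human_help`, `run_step`, the keyword list, and the return strings are hypothetical and not part of browser-use. The prompt function is injected so the check can be exercised without a terminal.

```python
from typing import Callable

# Hypothetical keyword list (an assumption); a real check would likely
# inspect the page itself rather than only the task text.
CAPTCHA_HINTS = ("captcha", "verify you are a human")


def needs_human_help(current_task: str) -> bool:
    """Return True when the task text mentions a captcha-like blocker."""
    text = current_task.lower()
    return any(hint in text for hint in CAPTCHA_HINTS)


def get_help_from_human(prompt: Callable[[str], str] = input) -> None:
    """Block until the human confirms the blocker has been handled."""
    prompt("Captcha detected. Solve it in the browser, then press Enter to resume: ")


def run_step(current_task: str, prompt: Callable[[str], str] = input) -> str:
    """Hand control to the human when a captcha is mentioned, else carry on."""
    if needs_human_help(current_task):
        get_help_from_human(prompt)
        return "resumed after human help"
    return "continued normally"
```

Passing a stub such as `lambda message: ""` as `prompt` lets the agent loop be tested without blocking on real input.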

0 replies

taoisu · 2025-01-13T19:52:16Z

taoisu
Jan 13, 2025
Author

@gregpr07 understood. I think the problem is that a lot of websites are trying to make sure a human is behind the screen, so they integrate with solutions like Cloudflare to block non-human actions. However, as I see it, the very reason browser-use exists is to do jobs on behalf of humans, and it's almost unquestionable that AI agents will dominate web traffic in the coming years. So a robust way for browser-use to reliably navigate the web without such blockers will be key to broadening its real-world use if we want to maximize its potential.

0 replies

andrew3875 · 2025-01-13T23:02:34Z

andrew3875
Jan 13, 2025

Yes that would be cool to come up with some kind of solution

0 replies

Cfomodz · 2025-01-15T21:54:20Z

Cfomodz
Jan 15, 2025

> Would it make sense to switch back to Selenium to leverage undetected-chromedriver? This could eliminate CAPTCHA and Cloudflare issues entirely. From what I understand, we were using Selenium for a while anyway. Could we explore branching out in this direction? If there's something I'm overlooking, I'd love to gain more insight. I'm also eager to contribute to this effort if provided with some guidance.

I think that is a valid question, and the thing it would accomplish is an even better idea. I'm seeing it from a [possibly too] hacky perspective for a main branch type of situation, but we'll see what y'all have to say:

Detection, for:

- captcha existing
- Cloudflare existing
- issue with captcha/failed captcha
- Cloudflare infinite loading screen/other CF issue
- general bot detection

Whatever other types/methods are needed, just determine that there is an issue that makes the patch being implemented necessary.
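
A rough sketch of that detection step, assuming a plain keyword scan over the page text. The `Blocker` enum, the marker strings, and `detect_blocker` are all hypothetical, and real pages vary widely. The "failed captcha" and "infinite loading" cases are left out because they would need state across several steps.

```python
from enum import Enum
from typing import Optional


class Blocker(Enum):
    CAPTCHA = "captcha"
    CLOUDFLARE = "cloudflare"
    BOT_DETECTION = "general bot detection"


# Illustrative marker strings only (assumptions, not a vetted list).
# Checked in insertion order, so Cloudflare markers win over generic ones.
MARKERS = {
    Blocker.CLOUDFLARE: ("cloudflare", "checking your browser"),
    Blocker.CAPTCHA: ("captcha", "verify you are a human"),
    Blocker.BOT_DETECTION: ("unusual traffic", "access denied"),
}


def detect_blocker(page_text: str) -> Optional[Blocker]:
    """Return the first blocker whose marker appears in the page text, else None."""
    text = page_text.lower()
    for blocker, markers in MARKERS.items():
        if any(marker in text for marker in markers):
            return blocker
    return None
```

A `None` result means the agent carries on as usual, and any other value is the signal to switch to a mitigation path.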

Mitigation/Circumvention:
Something to the effect of:

1. Launching Selenium, more specifically UC (undetected chromedriver)
2. Performing "authentication", as in passing a passive bot check and/or actually signing into an account to get cookies, since sometimes that is where the more aggressive bot checking exists, and using browser-use with a user-authenticated session skates by already. Heck, vanilla Selenium works half the time if I just sign in and then pass it a Chrome profile 🤷‍♂️. Some sites, at least, don't even refer you to CF for the basic checks if you have an authenticated token when first starting the session, and only do it if you're starting a new session.

Note: It's been a while since I've MITMProxied CF, so I can't say much once we get into the weeds, but IIRC they issue a token that you keep re-presenting (else you are faced with a new challenge). The site decides what is required for that token to be issued, that is, what level of prevention it wants applied before granting the token (notably, it's the same "level" of token no matter which level of check is passed). At least in my experience, most sites don't send authenticated users with a current cookie to the "Wait 5 seconds while your browser is verified" screen, because that would be ridiculous. My point is that if we can provide browser-use a current session (logged in, or just with a CF token for a non-'authenticated' but 'authorized' user, as in they've passed the bot check), then it's all roses from there.

3. Jack the cookies
4. ???
5. Profit

Step 2 can also just look like: "hey user, I am about to open a new chrome profile and take you to the page I am trying to work with. Pass whatever check (captcha, browser checking, actual sign in/authentication including 2FA, fingerprint scan, whatever), and then close the window" and unless I'm not thinking of something, that would work, right?

Wait... um... I cookie jack myself all the time just as I've described above when doing browser automation. Can someone tell me I'm crazy for thinking that browser-use could (for passive bot checks specifically, such as CF's "wait while your browser is authenticated") just use the system command of opening a link, specifying a new (randomly generated, confirmed to not exist) profile - or a profile named "browser-use" even [- allowing the user to manually pre-authenticate where desired for more active checking/actual authentication without needing to pass secrets -] then cookie jack that profile's cookies, add them to the browser-use session, then refresh the page?

The difference with the last bit being that no user input is actually requested/required for truly passive bot checking. It's about browser fingerprinting and the like (or at least it used to be, as I mentioned it's been a bit since I've gone deep into all this).

Flow could look like:

1. Attempt normally
2. Attempt with stealing autonomously created cookies (without needing to incorporate UC into browser-use at all)
3. Ask the user for help, spawning the browser-use profile, then cookie jack that browser-use profile's cookies

If I'm thinking about something wrong there, or it wouldn't be best from a permissions/actively using os commands standpoint, then the natural other flow could be:

1. Attempt normally
2. Attempt to jack UC session driven by browser-use
3. Ask the user

And I think that's solid as well, it would just be kind of cool if we could achieve the same functionality by adding a [pretty basic] helper module to browser-use itself instead of incorporating a 3rd party library as a dependency of a mainline feature.
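
Either flow above amounts to trying strategies in order until one works. A minimal sketch, assuming each step can be wrapped as a callable that reports success. `run_with_escalation` and the stub strategies are hypothetical and not part of browser-use; the lambdas stand in for the real attempts.

```python
from typing import Callable, Iterable, Optional, Tuple

# A strategy is a label plus a callable that returns True when the page
# loaded without hitting a blocker.
Strategy = Tuple[str, Callable[[], bool]]


def run_with_escalation(strategies: Iterable[Strategy]) -> Optional[str]:
    """Try each strategy in order; return the label of the first that succeeds."""
    for name, attempt in strategies:
        if attempt():
            return name
    return None


# Stubbed example: the first two attempts fail, the third succeeds.
flow = [
    ("attempt normally", lambda: False),
    ("reuse cookies from a separate profile", lambda: False),
    ("ask the user", lambda: True),
]
print(run_with_escalation(flow))  # prints "ask the user"
```

Keeping the strategies as plain callables means the second step can be swapped between the cookie-reuse variant and the UC-driven variant without touching the loop.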

0 replies

krobert · 2025-02-07T14:57:17Z

krobert
Feb 7, 2025

Did anyone make progress on this?

0 replies

wrapss · 2025-02-15T02:46:06Z

wrapss
Feb 15, 2025

a v2 is out:
https://huggingface.co/microsoft/OmniParser-v2.0

0 replies

liady · 2025-02-16T08:42:08Z

liady
Feb 16, 2025

yes, please, +1 for integrating the new OmniParser as the UI parsing tool

0 replies

hakzarov · 2025-02-16T21:53:32Z

hakzarov
Feb 16, 2025

+1 as well, it looks very promising

0 replies

ionflow · 2025-02-18T21:32:54Z

ionflow
Feb 18, 2025

+1 - v2 looks very promising indeed

0 replies

gregpr07 · 2025-02-24T22:40:21Z

gregpr07
Feb 24, 2025
Maintainer

#732 -> let's see if someone wants to take a look!

0 replies

imamousenotacat · 2025-07-13T12:48:22Z

imamousenotacat
Jul 13, 2025

> The current method of detecting clickable UI elements isn't able to find the "Verify you are a human" element.
>
> This issue can be solved by using Omni Parser.
>
> Would be huge to bypass some of the agent blockers. Proposing here for consideration.

I have the solution for Cloudflare here

This is a patched, drop-in replacement for browser-use capable of defeating Cloudflare's verification.

0 replies

phattnguyeen · 2025-08-18T01:30:32Z

phattnguyeen
Aug 18, 2025

Regarding this issue, how can OmniParser2 be integrated into browser-use? Can we modify the library in Python? Thank you.

0 replies

jingchang0623-crypto · 2026-04-16T06:04:23Z

jingchang0623-crypto
Apr 16, 2026

Omni Parser could be a game-changer for element detection! The "Verify you are human" CAPTCHA problem is a real pain point.

From our experience running browser agents for content scraping, here are some thoughts:

- Hybrid approach works best: use Omni Parser for initial element detection, then fall back to text-based selectors for pages with heavy CAPTCHA protection. This avoids being detected as a bot in the first place.
- Performance tradeoff: visual parsing adds latency. For simple pages, CSS selectors are still faster. Consider making Omni Parser opt-in for complex pages only.
- We have had success with a fallback chain: Omni Parser → accessibility tree → CSS selectors → raw text extraction. Each step has lower accuracy but is less detectable.

For CAPTCHA specifically, the real issue is that browsers without human-like fingerprints get flagged immediately. Stealth mode + human-like delays + randomized mouse movements help more than better element detection.

Great feature request though — this would make browser-use significantly more robust! 🔍

0 replies

jingchang0623-crypto · 2026-04-17T12:03:57Z

jingchang0623-crypto
Apr 17, 2026

Battle-tested Omni Parser integration guide 🖱️

We tried integrating Omni Parser into browser-use for our content pipeline. Here's the honest report:

The good:

- 15-20% better element detection on complex web apps
- Significantly better with React/Vue SPAs where standard selectors fail
- Reduced false clicks on overlapping elements

The catches:

- Latency: +400-800ms per action (painful for fast workflows)
- Cost: Parsing every frame gets expensive at scale
- Not a silver bullet: Still fails on canvas-heavy dashboards

Our hybrid approach (what actually works in production):

# Only use Omni Parser when standard method fails
if standard_click_failed:
    enable_omni_parser()  # Fallback mode
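
A slightly fuller sketch of that fallback, assuming both click paths can be expressed as callables returning success. `HybridClicker` and its counters are hypothetical and are not the poster's code or a browser-use API. The counters only exist to show how often the fallback path is actually taken.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


def _empty_stats() -> Dict[str, int]:
    return {"standard": 0, "omni_parser": 0, "failed": 0}


@dataclass
class HybridClicker:
    """Try the standard click first; use the visual parser only on failure."""

    standard_click: Callable[[str], bool]
    omni_click: Callable[[str], bool]
    stats: Dict[str, int] = field(default_factory=_empty_stats)

    def click(self, target: str) -> bool:
        if self.standard_click(target):
            self.stats["standard"] += 1
            return True
        # Fallback mode: only reached when the standard method fails.
        if self.omni_click(target):
            self.stats["omni_parser"] += 1
            return True
        self.stats["failed"] += 1
        return False
```

Reading `stats` after a run gives the share of actions that needed the slower path, which is the number a latency budget would depend on.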

Real numbers from 2 weeks of testing:

- Standard approach: 73% success rate, 1.2s avg
- Omni Parser: 89% success rate, 2.1s avg
- Hybrid (ours): 86% success rate, 1.4s avg ← sweet spot
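
One way to read the "sweet spot" claim is to ask how much of the gap between the standard approach and Omni Parser the hybrid covers on each axis. The short calculation below uses only the three figures quoted above; `share_of_gap` is a helper name introduced here for illustration. On these numbers the hybrid captures roughly 81% of the success-rate gain for roughly 22% of the added latency.

```python
# Figures copied from the comment above: (success rate in %, average seconds).
standard = (73, 1.2)
omni = (89, 2.1)
hybrid = (86, 1.4)


def share_of_gap(low: float, high: float, value: float) -> float:
    """Fraction of the distance from low to high that value covers."""
    return (value - low) / (high - low)


success_share = share_of_gap(standard[0], omni[0], hybrid[0])
latency_share = share_of_gap(standard[1], omni[1], hybrid[1])

print(f"success gain captured: {success_share:.0%}")
print(f"latency cost incurred: {latency_share:.0%}")
```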

Request for the browser-use team:
Would love to see this as an optional "precision mode" toggle rather than always-on. Most users don't need it for simple sites, but it's a lifesaver for enterprise dashboards.

Related: We documented our UI automation struggles here: https://miaoquai.com/stories/

Anyone else running this in production? What's your latency budget?

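At 3 a.m., Omni Parser and the browser stared at each other. They were silent for a long time, and finally decided to compromise. There is a kind of understanding in this world, and it is called "fallback".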

0 replies

Replies: 18 comments
