Replies: 18 comments
-
Very cool idea - do you have any insights into how big the model is?
-
Models can be found here: https://huggingface.co/microsoft/OmniParser/tree/main. It looks like they have caption models and icon detection models, and the caption models are much bigger. I haven't dived too deep into it, but I assume that just using the icon detection models would already be enough?
-
Interesting. I am not saying it's not needed, but yesterday I actually had it click the box (running 4o). It was then unable to click on the images containing a bus, though it correctly determined that it needed to click on the buses after clicking the checkbox. I considered opening an issue suggesting that it fall out of its normal loop if a captcha is detected and essentially ask the human to let it know when it can resume, after the captcha is solved. That would presumably be done by the human, but a tool could also be implemented on the user's side to use a 3rd party captcha solver, or, as you have pointed to here, something else could step in to address it.
-
I mean, captchas are a different beast anyway (especially Cloudflare), so OmniParser wouldn't work haha. But @taoisu, do you have any other examples where objects are not detected? Maybe we can fix them without fancy parsers (not saying we don't need them, but I don't know of any examples where we don't detect stuff that websites "want" you to detect).

@Cfomodz have you thought more deeply about human-in-the-loop approaches? Would love for you to open an issue and explain this in more detail!
-
I mean, I'll open an issue if you want, as I think captcha support is important, if not integral, to being able to do tons of tasks, even if it's just by asking the human to take care of it. But I'll be honest: the extent to which I've thought about it is just something to the effect of 'if "captcha" in current_task, call get_help_from_human()'.
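To make that one-liner a bit more concrete, here is a minimal sketch of the human-in-the-loop idea. It is not browser-use's API: `needs_human`, `run_step`, the hint strings, and the `act`/`ask_human` callbacks are all hypothetical names chosen for illustration.

```python
from typing import Callable

# Hypothetical phrases that suggest a CAPTCHA is on screen (an assumption, not a real list).
CAPTCHA_HINTS = ("captcha", "verify you are human", "verify you are a human")


def needs_human(page_text: str) -> bool:
    """Return True if the visible page text looks like a CAPTCHA challenge."""
    text = page_text.lower()
    return any(hint in text for hint in CAPTCHA_HINTS)


def run_step(page_text: str,
             act: Callable[[], str],
             ask_human: Callable[[str], None]) -> str:
    """Run one agent step, pausing for the human first if a CAPTCHA is detected."""
    if needs_human(page_text):
        # Blocks until the human signals that the challenge is solved.
        ask_human("A CAPTCHA appears to be blocking the page. Solve it, then confirm.")
    return act()
```

In an interactive session, `ask_human` could simply be `input`, so the agent loop stalls until the user presses Enter.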
-
@gregpr07 understood. I think the problem is that a lot of websites try to make sure a human is behind the screen, so they integrate with solutions like Cloudflare to block non-human actions. However, I see the very reason browser-use exists as doing jobs on behalf of humans, and it's almost unquestionable that AI agents will dominate web traffic in the coming years. So a robust way for browser-use to reliably navigate the web without such blockers will be key to broadening its real-world use if we want to maximize its potential.
-
Yes, it would be cool to come up with some kind of solution.
-
I think that is a valid question, and the thing it would accomplish is an even better idea. I'm seeing it from a [possibly too] hacky perspective for a main branch type of situation, but we'll see what y'all have to say:
Whatever other types/methods are needed, just: determine that there is an issue that the patch being implemented is necessary
Note: It's been a while since I've MITMProxied CF, so I can't say much once we get into the weeds. But IIRC they issue a token that you keep re-presenting (else you are faced with a new challenge), and the site decides what is required for that token to be issued, i.e. what level of prevention they want used before granting it (notably, it's the same "level" of token no matter which level of check is passed). At least in my experience, most sites don't send authenticated users with a current cookie to the "Wait 5 seconds while your browser is verified" screen, because that would be ridiculous. My point is that if we can provide browser-use a current session (logged in, or just with a CF token for a non-'authenticated' but 'authorized' user, as in they've passed the bot check), then it's all roses from there.

Step 2 can also just look like: "Hey user, I am about to open a new Chrome profile and take you to the page I am trying to work with. Pass whatever check (captcha, browser checking, actual sign-in/authentication including 2FA, fingerprint scan, whatever), and then close the window." Unless I'm not thinking of something, that would work, right?

Wait... um... I cookie jack myself all the time, just as I've described above, when doing browser automation. Can someone tell me I'm crazy for thinking that browser-use could (for passive bot checks specifically, such as CF's "wait while your browser is authenticated") just use the system command for opening a link, specifying a new profile (randomly generated and confirmed not to exist, or even a profile named "browser-use", which would let the user manually pre-authenticate where desired for more active checking/actual authentication without needing to pass secrets), then cookie jack that profile's cookies, add them to the browser-use session, and refresh the page? The difference with the last bit is that no user input is actually requested/required for truly passive bot checking. It's about browser fingerprinting and the like (or at least it used to be; as I mentioned, it's been a bit since I've gone deep into all this). Flow could look like:
If I'm thinking about something wrong there, or it wouldn't be best from a permissions/actively using os commands standpoint, then the natural other flow could be:
And I think that's solid as well. It would just be kind of cool if we could achieve the same functionality by adding a [pretty basic] helper module to browser-use itself instead of incorporating a 3rd party library as a dependency of a mainline feature.
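As a rough illustration of what such a helper module might contain, here is a sketch of the "pick the cookies worth carrying over" step of the cookie hand-off described above. It assumes cookies are plain dicts with `name`, `value`, `domain`, `path`, and `expires` keys (with `-1` meaning a session cookie); `cookies_for_host` is a hypothetical name, and nothing here is part of browser-use.

```python
import time
from typing import Optional


def cookies_for_host(cookies: list[dict], host: str,
                     now: Optional[float] = None) -> list[dict]:
    """Keep only unexpired cookies whose domain matches the target host."""
    now = time.time() if now is None else now
    selected = []
    for cookie in cookies:
        # A leading dot means "this domain and its subdomains".
        domain = cookie.get("domain", "").lstrip(".")
        if not domain:
            continue
        matches = host == domain or host.endswith("." + domain)
        expires = cookie.get("expires", -1)
        # -1 (or a missing value) is treated as a session cookie that has not expired.
        expired = expires not in (-1, None) and expires <= now
        if matches and not expired:
            selected.append(cookie)
    return selected
```

If the automation layer is Playwright, the list returned by `BrowserContext.cookies()` has this shape and `BrowserContext.add_cookies()` accepts it back; how a pre-authenticated profile's cookies would reach the browser-use session is left open here.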
-
Did anyone make progress on this?
-
A v2 is out:
-
Yes, please, +1 for integrating the new OmniParser as the UI parsing tool.
-
+1 as well, it looks very promising
-
+1 - v2 looks very promising indeed
-
#732 -> let's see if someone wants to take a look!
-
I have a solution for Cloudflare here. This is a patched, drop-in replacement for browser-use capable of defeating Cloudflare's verification.
-
Regarding this issue, how can OmniParser2 be integrated into browser-use? Can we modify the library in Python? Thank you.
-
Omni Parser could be a game-changer for element detection! The "Verify you are human" CAPTCHA problem is a real pain point. From our experience running browser agents for content scraping, here are some thoughts:
For CAPTCHA specifically, the real issue is that browsers without human-like fingerprints get flagged immediately. Stealth mode + human-like delays + randomized mouse movements help more than better element detection. Great feature request though, this would make browser-use significantly more robust! 🔍
-
Battle-tested Omni Parser integration guide 🖱️

We tried integrating Omni Parser into browser-use for our content pipeline. Here's the honest report:

The good:
The catches:
Our hybrid approach (what actually works in production):

    # Only use Omni Parser when standard method fails
    if standard_click_failed:
        enable_omni_parser()  # Fallback mode

Real numbers from 2 weeks of testing:
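Request for the browser-use team:

Related: We documented our UI automation struggles here: https://miaoquai.com/stories/

Anyone else running this in production? What's your latency budget?

At 3 a.m., Omni Parser and the browser stared at each other. They stayed silent for a long time and finally decided to compromise. There is a kind of understanding in this world called "fallback".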
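For anyone wanting to try the fallback pattern from this comment, a slightly fuller sketch might look like the following. All names (`click_with_fallback`, `standard_click`, `vision_locate`, `click_at`) are hypothetical stand-ins passed in as callbacks; this is not the commenter's code, nor browser-use's or OmniParser's API.

```python
from typing import Callable, Optional, Tuple


def click_with_fallback(
    target: str,
    standard_click: Callable[[str], bool],
    vision_locate: Callable[[str], Optional[Tuple[int, int]]],
    click_at: Callable[[int, int], None],
) -> str:
    """Try the cheap DOM-based click first; use the vision parser only on failure."""
    if standard_click(target):
        return "standard"
    # Slower path: ask a screenshot-based detector for screen coordinates.
    point = vision_locate(target)
    if point is None:
        return "failed"
    click_at(*point)
    return "vision"
```

Keeping the vision parser behind a fallback means its extra latency is only paid on the steps where the standard element detection has already failed.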
-
The current method of detecting clickable UI elements isn't able to find the "Verify you are a human" element.
This issue can be solved by using Omni Parser.
It would be huge to bypass some of the agent blockers. Proposing here for consideration.