[BUG] Less results than real Google #1004

Closed
WAZAAAAA0 opened this issue May 16, 2023 · 4 comments
Labels
bug Something isn't working

Comments

WAZAAAAA0 commented May 16, 2023

(cross-posted in searxng/searxng#2438, hnhx/librex#225)

It's been years. Google-searching through a "privacy search engine frontend" will rarely find as many results as the real Google.

Here's a simple test to verify that: come up with a unique Google query that will find as few results as possible, preferably not in English. For example in my test I used "sfendazi" but you might need your own unique query since the results come and go. Perform the same search on every public instance, and observe how many find the same results (if the results contain garbage unrelated stuff, consider it a failure). This was the outcome yesterday as of 2023-05-15:

LibreX instances: 20 tested, 0 work

lmao

Whoogle instances: 17 tested, 3 work

https://s.tokhmi.xyz
https://whoogle.dcs0.hu
https://whoogle.privacydev.net

SearX/SearXNG instances: 92 tested, 19 work if you tweak a setting, only 1 works with defaults

(the only one that works with defaults is https://opnxng.com)
https://priv.au
https://xo.wtf
https://offtheradar.info
https://searx.oakleycord.dev
https://searx.cthd.icu
https://ooglester.com
https://search.bus-hit.me
https://myprivatesrx.us
https://coppedge.info
https://search.neet.works
https://search.zzls.xyz
https://search.us.projectsegfau.lt
https://s.frlt.one
https://searx.sev.monster
https://stalk.antelope.day
https://searx.esmailelbob.xyz
https://search.serginho.dev
https://search.cronobox.one
https://searx.mxchange.org

Those 19 instances I listed think they're "smart" and have set their Search language to [auto], which auto-selects it based on your browser headers... or they're simply set to something arbitrary, like [en-US]. Choosing [all] fixes the problem for them.
Meanwhile, the rest of the instances somehow do not find the correct results even when set to [all]. From what I've tested with a local SearXNG instance, adding the search query parameter nfpr=1 (along with the pre-existing safe=off and filter=0) to searxng/searx/engines/google.py fixed it. Here's what each parameter does:

  • nfpr=1 -> auto-correction ("Showing results for XXX. Search instead for YYY") OFF
  • safe=off -> SafeSearch OFF
  • filter=0 -> Include omitted results ON
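As a rough illustration of the three parameters together, here is a minimal standalone sketch that builds a Google search URL with them. It is not the actual code from searxng/searx/engines/google.py; the function name `build_google_url` is made up for this example.

```python
from urllib.parse import urlencode


def build_google_url(query: str) -> str:
    """Build a Google search URL with the three parameters discussed above.

    Hypothetical helper for illustration only; not taken from SearXNG or Whoogle.
    """
    params = {
        "q": query,
        "safe": "off",   # SafeSearch off
        "nfpr": "1",     # do not auto-correct the query
        "filter": "0",   # include omitted (near-duplicate) results
    }
    return "https://www.google.com/search?" + urlencode(params)


print(build_google_url('"sfendazi"'))
```

The quotes around the query are percent-encoded by `urlencode`, so the exact-phrase search survives the round trip.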

Changing the Interface language is fine. Actually, I'd argue language auto-detection should happen to the interface, not to the search results filter, which would be consistent with how major search engines work.
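A minimal sketch of that split, assuming Google's `hl` parameter sets the interface language and `lr=lang_xx` restricts results to one language. The function names and the fallback to "en" are assumptions made for this example, not code from any of the frontends, and the header parsing simply takes the first listed language without weighing q-values.

```python
def interface_language(accept_language: str, default: str = "en") -> str:
    """Pick the interface language from the first entry of an Accept-Language header."""
    first = accept_language.split(",")[0].split(";")[0].strip()
    if not first or first == "*":
        return default  # assumed fallback when the browser sends nothing usable
    return first.split("-")[0].lower()


def language_params(search_language: str, accept_language: str) -> dict:
    """Auto-detect only the interface language; filter results only on explicit request."""
    params = {"hl": interface_language(accept_language)}
    if search_language != "all":
        # Only restrict results when the user chose something other than [all].
        params["lr"] = "lang_" + search_language.split("-")[0].lower()
    return params


print(language_params("all", "it-IT,it;q=0.9,en;q=0.8"))
print(language_params("en-US", "it-IT,it;q=0.9,en;q=0.8"))
```

With the search language left at "all", no `lr` restriction is sent, so the browser's language only changes how the page is displayed.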

Honestly, just take the Search language option away, it does more harm than good. Or at least make [all] the default and lock the option behind huge warning signs with flaming skulls that searching will be seriously degraded for everyone if anything other than [all] is selected. People don't understand this is the equivalent setting they're touching (taken from Google's official advanced search page):

TL;DR

Here's a picture to sum up the problem most search frontends are facing:

Proposed fixes:

  1. remove Search language and default it to [all]
  2. give [auto] to the Interface language instead
  3. add these 3 parameters to unlock all the possible Google results: ?safe=off&nfpr=1&filter=0
@benbusby
Owner

Search language already defaults to "all", but some public instance maintainers change this default to their preferred setting instead. Same with interface language (and country). Users can still override these settings, but the actual location the instance is hosted at can also affect results. I've always recommended that anyone who doesn't get decent results from a public instance should spin up their own or run it locally. The instances I personally run all "passed" the test you outlined.

Also, "safe" is already a configurable search param using the home page config, and "nfpr" is enabled when clicking the "exact results only" prompt from the search page. The "filter" param doesn't seem to impact the results, at least from my testing.

benbusby closed this as not planned on May 17, 2023

WAZAAAAA0 commented May 17, 2023

> but some public instance maintainers change this default

That's one of the problems. Instance maintainers can't be trusted to know what the heck they're doing (as proven), so let's take their toy away.

> Users can still override these settings

Many public instances disable customizations. I suppose they formed this tendency because in the past, Whoogle settings used to apply "globally", meaning that if one user changed the settings they would apply to everyone else in the world. Again, take their toy away.

> The "filter" param doesn't seem to impact the results, at least from my testing.

The filter parameter handles the "omitted results". Currently in Whoogle they are problematic as they show like this at the bottom of a page:

> In order to show you the most relevant results, we have omitted some entries very similar to the 2 already displayed.
> If you like, you can repeat the search with the omitted results included.

but clicking on it links directly to google.com! Sample link that currently shows this behavior: https://wg.vern.cc/search?safe=off&gbv=1&q=%22sfendaki%22&nfpr=1. This is a legit bug that should be taken care of, so please re-open the issue @benbusby. The easiest fix would be to include filter=0 by default on all searches so that omitted results are always shown (which is what SearX did).
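An alternative to always sending filter=0 would be to rewrite the "omitted results" link so that it stays on the instance. The sketch below shows the idea in isolation; the function name `keep_on_instance`, the `/search` path, and the shape of the incoming Google link are assumptions for this example, not Whoogle's actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def keep_on_instance(google_url: str) -> str:
    """Turn an absolute google.com search link into an instance-relative one.

    The original query parameters are kept, and filter=0 is forced so that
    the omitted results are included. Hypothetical helper for illustration.
    """
    parts = urlsplit(google_url)
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["filter"] = "0"
    return "/search?" + urlencode(params)


print(keep_on_instance("https://www.google.com/search?q=%22sfendaki%22&nfpr=1&filter=1"))
```

Because the returned link is relative, the browser resolves it against whatever instance served the page instead of leaving for google.com.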

@entrider

There's no such thing as real Google results, because Google's results are inconsistent (they differ between browsers, IPs, locales, etc.). If you get 10 results for a query, it doesn't mean that everyone should get the same number of results for this query. And it doesn't mean that you should get the same number(s) from librex, searx, startpage or whoogle. All these projects try to hide your identity in order to give you more neutral results (compared to your filter bubble) than you would get by using Google directly, but they cannot be completely neutral and therefore they cannot be reproducible between users and/or instances.
The whole reason why Google tracks its users is to give them slightly different (personalized, as they say) results and serve more relevant ads. And oh boy, Google is the king of tracking; they probably analyze every single variable when forming the results for you.

entrider commented Aug 20, 2023

> I've always recommended that anyone who doesn't get decent results from a public instance should spin up their own or run it locally.

The fewer users your instance has, the less neutral the results you get. It kinda defeats the purpose of escaping the filter bubble... unless you make your search queries through Tor, maybe, but it's painfully slow.
