About searx #1634

Open
dalf opened this issue Jul 7, 2019 · 10 comments

@dalf
Collaborator

commented Jul 7, 2019

I feel that issue #1228 has become "what is the future of searx?"

To answer that question, I would like to list some of the important topics to show that the searx maintenance is a full time job.

Disclaimer : it is only my opinion as an individual.

Engines keep breaking

Fixing them is a recurring, endless task.
It also means that searx administrators have to update frequently.
So in the end, searx has to be easy to deploy and update.

How to install searx ?

You can pick an answer according to the day of the month:

So:

  • making searx easy to update is ... difficult to implement.
  • when someone creates an issue saying "the searx docker image is not working": how was the image built, and from what?
  • when someone wants to install searx with an existing configuration, helping means understanding that configuration first.

This doesn't help the next topic: Security.

Security & Privacy.

There is always a risk of tracking and/or malicious code getting into the results, since searx fetches content from elsewhere (and the web living standard).
There are two ways to avoid that: review the code, and rely on browser security (HTTP headers).
The first can be done here in the repository, but not the second, since there are a lot of different ways to install searx.
So in the end, security may vary depending on the instance you choose.
See https://stats.searx.xyz/ (BTW #1560).
Side effect of the HTTP security headers: some searx features don't work anymore for the user (autocomplete for example).
Basically, if I want to use searx without thinking, I want a less secure instance, because at least the UI works as intended.
Moreover, what we don't see in the stats.searx.xyz table is that some instances use CloudFlare as a front-end, which defeats the purpose of privacy.
And actually, I understand why, see the next topic.

Bots

searx is really interesting for people who want to harvest data. Sooner or later a new searx instance will be found and hammered with bot requests. Since there are a lot of different ways to install searx, it's not easy to give recommendations. Moreover, the reverse proxy filtron currently doesn't provide a good default configuration, and is not referenced in the searx documentation.
In fact, this is a huge topic to deal with. Sooner or later, the users won't be able to use some engines.

Note that dealing with the bot problem means keeping state somewhere (for example: keeping in memory the number of requests per minute and per IP address).
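
To make the "keep a state somewhere" point concrete, here is a minimal sketch of an in-memory counter of requests per minute and per IP address. It is not how filtron works; the class name `PerIpRateLimiter`, the `max_per_minute` threshold and the injectable `clock` are all hypothetical names chosen for illustration.

```python
import time
from collections import defaultdict


class PerIpRateLimiter:
    """Count requests per IP address in fixed one-minute windows (state kept in memory)."""

    def __init__(self, max_per_minute=30, clock=time.monotonic):
        self.max_per_minute = max_per_minute  # hypothetical threshold, to be tuned
        self.clock = clock  # injectable so the behaviour can be tested
        # ip -> [index of the current minute window, requests seen in that window]
        self.counters = defaultdict(lambda: [0, 0])

    def allow(self, ip):
        """Record one request from `ip`; return False once the limit is exceeded."""
        window = int(self.clock() // 60)
        counter = self.counters[ip]
        if counter[0] != window:
            # a new minute has started: reset the count for this IP
            counter[0], counter[1] = window, 0
        counter[1] += 1
        return counter[1] <= self.max_per_minute
```

This sketch never forgets an IP address, so a real deployment would also have to purge old entries, and the state is lost on restart and not shared between processes.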

Trust

Last but not least: Trust.
Some searx instances are modified:

  • Obvious and cool when it is about the theme (BTW sorry, the maintenance is not easy).
  • Less obvious and less cool when it is about modifying links to a shopping website (basically to create affiliate links).
  • Less obvious when it is about adding Matomo tracking.

What about the logs? For example, when there is an error in an engine (timeout, whatever), the requested URL may be logged.
The image proxy, morty, logs all requests on stdout.
And basically, I don't know whether some instances keep the logs to do data mining or sell the data.

[edit] Some instances store the results in a temporary cache, and may use proxies to send the requests to the actual search engines.

Sum up

  • searx may work or not randomly because of broken engines, bot hammerings, HTTP headers.
  • the UI/UX may be broken (because of the HTTP headers)
  • requests and/or results may be logged and/or tracked.

Okay. So what to do?

CHATONS

In my opinion, searx fits into the CHATONS initiative:
> CHATONS is a collective of independent, transparent, open, neutral and ethical hosters providing FLOSS-based online services.

Each host (named kitten) may have different services (etherpad, wallabag, zerobin, etc), and a different policy.
Often it means you know the person(s) hosting the kitten.
For some kittens, the searx instance is password protected.
So, there is more trust, and no bots.

I know that most people won't agree to do that.

One size won't fit all

The searx-docker project was created to find an intermediate solution:

  • easy to install and update.
  • good security by default without killing the UI/UX.
  • one drawback: it won't fit in the existing configurations.
  • one unresolved big issue : bots (I'm afraid there is no easy solution).

The second step is the embedded searx-checker, so a user can clearly see which engines are not working.

After that, the user experience will be better. Of course, if during this time most of the engines stop working, it won't be useful.

@dalf dalf added the meta label Jul 7, 2019

@veloute

commented Jul 8, 2019

this issue should be pinned.

@dalf dalf pinned this issue Jul 8, 2019

@rachmadaniHaryono

Contributor

commented Jul 9, 2019

How to install searx

i recommend putting it in the readme


when someone creates an issue saying "the searx docker image is not working", how the image was built, from what?

this would be easier with an issue template

https://help.github.com/en/articles/creating-issue-templates-for-your-repository

users would have to go through a checklist before reporting. it sure takes more time, but it will make it easier for other contributors to debug/help

it can be split into several categories. what i can think of, for example: issues with a searx engine, issues with a searx instance, and actual searx issues


related to reporting issue from searx instance

while there is this line Powered by searx - 0.15.0 - a privacy-respecting, hackable metasearch engine, it does not tell the full story.

maybe show the commit hash like danbooru does (Running Danbooru (5e5e86c3834166e5999e99f2bc79c4140d424058))

also maybe a report page, sitting between the searx instance and the searx github issue page, which contains the required info about that searx instance (simple text with a copy to clipboard button).


installation: imo better to stick to a few methods and let the maintainers do the rest.


searx instance: maybe list the searx instances that fall into the second (modified links) and third (added tracking) categories


embedded searx checker: i recommend adding a link to the github issue, and maybe to possible PRs, related to that engine

@dalf

Collaborator Author

commented Jul 14, 2019

About "better to stick to few (installation) methods"

I think that one of the reasons why searx became popular is the numerous blog posts on "how to install searx" (search for +"searx" +"install" or search for searx on hub.docker.com). Another similar reason is the easy to hack / modify / customize aspect, and Python helps because most of the time it is installed with the OS.
The drawback: all this documentation, and sometimes the code bases, are updated at a different pace (or not at all).

About the version number

I agree, the 0.15.0 is meaningless right now. But the lack of routine maintenance is also a reason.
I've implemented this in the Docker image, example: 0.15.0-95-acb0f113
Of course, as long as it is not included directly in searx/version.py, very few (or none) of the instances will have it.
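
A version string of that shape could be derived from git, for example. The following is only a sketch under assumptions: it supposes the repository has tags such as v0.15.0 and that the three parts are the tag, the number of commits since the tag and the abbreviated commit hash, as printed by `git describe --tags --long`. The function names are hypothetical and this is not necessarily how the Docker image builds its version.

```python
import re
import subprocess

# `git describe --tags --long` prints <tag>-<commits since tag>-g<abbreviated hash>
_DESCRIBE_RE = re.compile(r"^v?(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def format_version(describe_output):
    """Turn e.g. 'v0.15.0-95-gacb0f113' into '0.15.0-95-acb0f113'."""
    match = _DESCRIBE_RE.match(describe_output.strip())
    if match is None:
        raise ValueError("unexpected git describe output: %r" % describe_output)
    return "{tag}-{count}-{sha}".format(**match.groupdict())


def version_from_git(repo_dir="."):
    """Ask git for the version; needs a git checkout with at least one tag."""
    output = subprocess.check_output(
        ["git", "describe", "--tags", "--long"], cwd=repo_dir
    )
    return format_version(output.decode("utf-8"))
```

The string would have to be computed at build time, since an installed instance usually has no git checkout next to it.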

BTW right now, stats.searx.xyz doesn't recognize this version schema.

About searx instance list

The first instance list was made by @pointhi in September 2016: https://github.com/pointhi/searx_stats
Unfinished rewrite, never deployed: https://github.com/dalf/searx-stats2 (includes a first searx-checker implementation).

Having an API to get all the searx instances would be nice; it would help https://searx.neocities.org/ , https://searxes.eu.org/ and especially Android applications. I know the wiki page can be parsed, as stats.searx.xyz most probably does.

Whatever the tool, what information can [not] be automatically gathered from each instance:

  • HTTPS grade and HTTP headers grade: automatic
  • display the embedded searx checker results: automatic (as long as searx instances are updated to a new version that includes this currently unimplemented feature). It would be nice to have the response time per engine too. BTW some attempts to improve the response time measurement: #162 (comment) and #477
  • detect trackers: automatic, but with possible false alarms. Use a headless browser and inspect all sent requests: can be automated without pain (?).
  • detect affiliate links: not automatic. If the Python code modifies the results, it is not possible to know when/how. On the HTML template side, one idea is to extract all the scripts and compare them to the existing ones in the github repository; it's a lot of work for uncertain results....

Doing the last two points means that the list will also point at all the instances that have been hacked in the good way. To sum up, it requires regular manual reviews.
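
As an illustration of the "automatic" header check in the first bullet, here is a minimal sketch that lists which common security headers an instance does not send. The header selection and the function names are my own assumptions, not the criteria used by stats.searx.xyz.

```python
from urllib.request import urlopen

# Illustrative selection of headers; a real grading tool would check the values too.
SECURITY_HEADERS = (
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "X-Content-Type-Options",
    "X-Frame-Options",
    "Referrer-Policy",
)


def missing_security_headers(response_headers):
    """Return the expected headers that are absent (header names are case-insensitive)."""
    present = {name.lower() for name in response_headers}
    return [name for name in SECURITY_HEADERS if name.lower() not in present]


def fetch_headers(url, timeout=10):
    """Fetch the response headers of an instance (performs a real HTTP request)."""
    with urlopen(url, timeout=timeout) as response:
        return dict(response.headers.items())
```

Something like `missing_security_headers(fetch_headers(instance_url))` could then be run for every instance of the list; it only says that a header is present, not that its value is sensible.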

... and the sum up of the sum up: it's a balance between customization / privacy / usability:

  • From a "random" user point of view, a searx clone army is really good: whatever the instance, same experience.
  • From the (admin|users knowing the admin|small circle of people) point of view, customized configurations/themes/engines are way better.

Security & Privacy

Just to reference issue #715: should searx emit the Content-Security-Policy headers?
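
If the answer were yes, one way to do it independently of the web server would be a small WSGI middleware. This is only a sketch: the class name `CspMiddleware` and the default policy are hypothetical, and the policy is a placeholder rather than a recommendation for searx (a too strict policy is exactly what breaks features such as autocomplete).

```python
class CspMiddleware:
    """WSGI middleware that adds a Content-Security-Policy header if none is set."""

    def __init__(self, app, policy="default-src 'self'"):
        self.app = app
        self.policy = policy  # placeholder policy, must be tuned for the real UI

    def __call__(self, environ, start_response):
        def start_with_csp(status, headers, exc_info=None):
            names = {name.lower() for name, _ in headers}
            if "content-security-policy" not in names:
                # keep a header already set by the application or an admin override
                headers = list(headers) + [("Content-Security-Policy", self.policy)]
            return start_response(status, headers, exc_info)

        return self.app(environ, start_with_csp)
```

With a Flask application, such a middleware is usually installed with `app.wsgi_app = CspMiddleware(app.wsgi_app)`. The drawback is that a reverse proxy may still add or override headers, so the result can still differ between instances.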

Python and Golang are in the same boat

Disclaimer: I'm far from being a Python expert, I may be wrong

Another point about the current implementation:

  • I think asciimoo created Morty to decrease the server load on searx.me. A first implementation was written in Python but was not as fast (see the open issue #934 for possible improvements). So we end up with two image proxy implementations, in two different languages.
  • Same comment about Filtron. See for example: #1610 (comment)
  • GIL: in the current implementation, searx starts a thread per request and per engine, because requests v2 doesn't support async and there is no async in Python 2. Yes, there is grequests, but it has created issues here and there as I remember. There was an attempt to use PyCurl, but the results were not there.
  • More or less related to the previous point: it would be good to throttle the sent requests. For example, if we guess/know/think that (google|yandex|whatever) will "ban" the IP after 20 requests in 5 minutes, searx could stop the requests before reaching this point. Not easy to implement if uwsgi is used.
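
The throttling idea from the last bullet could look like the following sliding-window sketch, using the "20 requests in 5 minutes" figure from the example above. The class name `EngineThrottle` and its parameters are hypothetical, and the real limits of each engine are unknown.

```python
import threading
import time
from collections import deque


class EngineThrottle:
    """Allow at most `max_requests` outgoing requests per `period` seconds for one engine."""

    def __init__(self, max_requests=20, period=300.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.period = period
        self.clock = clock  # injectable so the behaviour can be tested
        self.sent = deque()  # timestamps of the requests still inside the window
        self.lock = threading.Lock()  # several threads may query the same engine

    def try_acquire(self):
        """Return True and record the request if the engine may be queried now."""
        now = self.clock()
        with self.lock:
            # forget the requests that have left the sliding window
            while self.sent and now - self.sent[0] >= self.period:
                self.sent.popleft()
            if len(self.sent) >= self.max_requests:
                return False
            self.sent.append(now)
            return True
```

The state lives in one process only. With several uwsgi worker processes, each worker would count separately, which is presumably why this is not easy to implement with uwsgi: the counters would have to be moved to some shared storage.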

That's one reason searx-docker exists: combine all these different tools and configurations in one central point, even if it is not deployed but at least used as a documentation reference.

(sorry for the syntax & grammar errors)

@Neustradamus

commented Jul 14, 2019

@dalf: Please check with @asciimoo about the small change requested in #1228.
It is for a real future and a nice project, ...
It is easy to do and nothing is lost, ...
I am not the only one who thinks so.

Example from @nerzhul: #1228 (comment)

cc: @asciimoo, @kvch, @Pofilo, @rachmadaniHaryono, @nerzhul

@unixfox

Contributor

commented Jul 16, 2019

Could there be some kind of user review of each searx instance, with for example the ability to rate with stars, just like on TripAdvisor or e-commerce websites? That would be instead of implementing complex algorithms that try to detect whether an instance has been badly modified and would probably introduce loads of false positives.

@rachmadaniHaryono

Contributor

commented Jul 16, 2019

the easiest thing we can do is pin an issue about this, and also add a template for user reports

@Pofilo

Collaborator

commented Jul 16, 2019

@unixfox it depends on the location of the user and of the instance. An instance in France will be faster for a French user than for an American user, so in my opinion rating it is not relevant.

@dalf

Collaborator Author

commented Jul 16, 2019

@Pofilo: I think @unixfox was talking about user rating. Most probably (?) with multiple criteria, otherwise XKCD 937 may describe the result:
XKCD #937

So the question is: what are the criteria?

  • Speed: depends on the location.
  • UI/UX: may apply when there is a specific theme.
  • Good results: ?
  • Bug free: ?
@rachmadaniHaryono

Contributor

commented Jul 16, 2019

we can start with the objective issues first, like these 2 categories

Less obvious and less cool when it is about modifying links to a shopping website (basically to create affiliate links).

Less obvious it is about adding a Matomo tracking.

@unixfox

Contributor

commented Jul 17, 2019

Should we talk about that in another issue, to avoid creating too many comments on this important issue?

I would also add to this issue that we need to clean up the github repository by closing outdated and fixed issues, or adding label(s) to issues that are still present. Moreover, we should review old pull requests that haven't been merged and deal with them, either by closing them or by asking the author to make changes if needed.
