This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

Add Arch Linux Wiki search engine #525

Merged
merged 2 commits into from Mar 24, 2016

Conversation

@ghost ghost commented Mar 23, 2016

Hi,

Somewhere in the project documentation I read that project maintainers were somewhat interested in adding a search engine for Gentoo Wiki. In my opinion, Arch Linux Wiki is a great source of information too, and is used by many Linux users I've met throughout the years. Will you be interested in adding its support to searx?

Here are a few technical details.

  1. Arch Wiki has a rather unusual architecture and is spread across several domains for different languages. The wiki for 20 languages is hosted on the main domain (http://wiki.archlinux.org/), and five more languages were moved to five separate domains. You can see the details in source code listings below.
  2. Pages served by the main domain contain the name of the language they are written in (except for English).
  3. Although Arch Wiki is built on the very popular MediaWiki engine, access to its search API is blocked to the outside world, so I could not use it and had to resort to scraping the search results page with XPath queries.
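
To illustrate the XPath approach from point 3, here is a minimal sketch, not the code from this pull request. It assumes the results page uses MediaWiki's usual search markup (a `ul` with class `mw-search-results` whose items each contain a link). To stay self-contained it uses the standard library's ElementTree on a well-formed sample; real pages are HTML and would more likely need a tolerant parser such as lxml. `SAMPLE_PAGE` and `parse_results` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "https://wiki.archlinux.org/"

# Hypothetical, well-formed stand-in for a search results page.
# The second title carries a language name, as described in point 2.
SAMPLE_PAGE = """
<html><body>
<ul class="mw-search-results">
  <li><div class="mw-search-result-heading">
    <a href="/index.php/NVIDIA" title="NVIDIA">NVIDIA</a>
  </div></li>
  <li><div class="mw-search-result-heading">
    <a href="/index.php/NVIDIA_(Русский)" title="NVIDIA (Русский)">NVIDIA (Русский)</a>
  </div></li>
</ul>
</body></html>
"""


def parse_results(page, base_url=BASE_URL):
    """Return a list of {'url', 'title'} dicts found in the results list."""
    root = ET.fromstring(page.strip())
    results = []
    # ElementTree supports a limited XPath subset, enough for this lookup.
    for item in root.findall(".//ul[@class='mw-search-results']/li"):
        link = item.find(".//a")
        if link is None:
            continue  # skip list items without a link
        title = "".join(link.itertext()).strip()
        results.append({"url": urljoin(base_url, link.get("href")), "title": title})
    return results


if __name__ == "__main__":
    for result in parse_results(SAMPLE_PAGE):
        print(result["title"], "->", result["url"])
```

Only the title and URL are extracted here; whether a content snippet is worth returning is a separate question.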

This has a few implications:

  1. There is no single base URL for all possible search requests, in fact, there are six of them. Obviously, I am not an expert in searx's internals, so I am not sure if it will cause any breakage. Currently I decided to set the base_url variable to https://wiki.archlinux.org/, as it is the most used domain.
  2. I believe I can only return one search URL from the request method, so when a user tries to make a search in all languages, they only get results for 20 of them (see the reason in the previous paragraph).
  3. Unfortunately, when you search in English only, it is not possible to filter out results in other languages (this limitation applies only to English, not to any other supported language).
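
The first two implications can be sketched as follows. This is an illustration under stated assumptions, not the engine's actual code: the separate-domain host below is a hypothetical placeholder (the real hosts are not listed in this conversation), and `build_search_url` and the `Special:Search` query parameters are assumptions based on how MediaWiki search pages are usually addressed.

```python
from urllib.parse import urlencode

MAIN_DOMAIN = "https://wiki.archlinux.org"

# Hypothetical placeholder host. French is one of the five languages
# served from a separate domain; the other four are omitted here.
SEPARATE_DOMAINS = {
    "fr": "https://wiki-fr.example",
}


def build_search_url(query, language="all", offset=0):
    """Return the single search URL an engine can use for one request."""
    # Any language without its own domain, including "all",
    # falls back to the main domain.
    base = SEPARATE_DOMAINS.get(language, MAIN_DOMAIN)
    params = urlencode({"title": "Special:Search", "search": query, "offset": offset})
    return base + "/index.php?" + params


if __name__ == "__main__":
    print(build_search_url("nvidia-xconfig"))        # main domain
    print(build_search_url("nvidia-xconfig", "fr"))  # separate domain
    print(build_search_url("nvidia-xconfig", "ru"))  # main domain again
```

Because only one URL is returned per request, an "all languages" search in this sketch never reaches the separate domains.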

Here are a few examples.

Search for nvidia-xconfig in all languages:
http://i.imgur.com/Ava4HIi.png

Results for the same search in French (Wiki hosted on different domain):
http://i.imgur.com/50CgQaS.png

Same search in Russian (Wiki hosted on the main domain):
http://i.imgur.com/LLPSfDl.png

Any comments or suggestions?

I will be happy to continue working on the project (gentoo wiki, etc.) if this pull request is of any help.

Thanks.

@asciimoo
Member

Hi,
first of all, thanks for the detailed description. Arch Wiki is definitely a resource that should be part of the IT category.

> Will you be interested in adding its support to searx?

Definitely

> 1. There is no single base URL for all possible search requests, in fact, there are six of them. Obviously, I am not an expert in searx's internals, so I am not sure if it will cause any breakage. Currently I decided to set the base_url variable to https://wiki.archlinux.org/, as it is the most used domain.

base_url is just a convention, it isn't a required engine attribute.

> 2. I believe I can only return one search URL from the request method, so when a user tries to make a search in all languages, they only get results for 20 of them (see the reason in the previous paragraph).

Yes, an engine currently can't spawn more than one request. I think it's totally acceptable if the engine supports "only" those 20 languages by default, perhaps with an option to set a custom URL. That is where the base_url concept is useful: base_url (like any engine attribute) can be overridden from the config, or the engine can be duplicated with a different base_url.
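
As a rough illustration of the override idea, the sketch below treats an engine as a plain object whose attributes are replaced by per-engine settings. It is not searx's actual loader: `make_engine`, `apply_engine_settings`, the engine names, and the `wiki-fr.example` host are all hypothetical.

```python
from types import SimpleNamespace


def make_engine():
    """Stand-in for an engine module with its default attributes."""
    return SimpleNamespace(
        name="arch linux wiki",
        base_url="https://wiki.archlinux.org/",
        categories=["it"],
    )


def apply_engine_settings(engine, settings):
    """Overwrite engine attributes with values taken from the config."""
    for attribute, value in settings.items():
        setattr(engine, attribute, value)
    return engine


# Default engine: no overrides, so it keeps the main domain.
default_engine = apply_engine_settings(make_engine(), {})

# Duplicate of the same engine, pointed at a different base_url.
french_engine = apply_engine_settings(
    make_engine(),
    {"name": "arch linux wiki (fr)", "base_url": "https://wiki-fr.example/"},
)

if __name__ == "__main__":
    print(default_engine.name, default_engine.base_url)
    print(french_engine.name, french_engine.base_url)
```

Each duplicate still issues a single request, so this stays within the one-request-per-engine limit mentioned above.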

> Any comments or suggestions?

Great job.
One little thing: the content of the results is formatted in wiki syntax and doesn't seem to contain much information, so perhaps we should leave it blank.

> I will be happy to continue working on the project (gentoo wiki, etc.) if this pull request is of any help.

Very cool, it's always good to see new engines =)

…x.py

Content field in Arch Wiki search results is of no real use, more often
than not it contains no usable information and includes too many markup
tags which make the text unreadable. It is safe to remove it.
@ghost
Author

ghost commented Mar 24, 2016

@asciimoo thank you for your reply. You are definitely right: I tested the search for a bit, and most of the time the content field is of no real use, with too little text and too many page links and markup tags. I think the results would be cleaner without it.

@asciimoo
Member

@ukwt great, thanks

@asciimoo asciimoo merged commit ba16f21 into searx:master Mar 24, 2016