This repository has been archived by the owner on Feb 27, 2021. It is now read-only.

Rework search fields #696

Open
Phyks opened this issue Apr 1, 2019 · 6 comments

@Phyks
Member

Phyks commented Apr 1, 2019

This is an issue to centralize the discussion about what should be done with the search fields.

From #683 (comment)

My feeling about these things is that people generally expect search to work "as in Google". So, one search input covering multiple data fields. Anything else confuses people (including me). But I'm not a UX expert, I might be wrong!

I agree with this and I'd be more inclined to have a single search field in the main page to search for both titles and authors. In the end, this is what arXiv is doing (https://arxiv.org) for instance and it works quite well.

From #684 (comment)

Would it work to dynamically add new author name fields? I mean, per standard one author with fields for last and first name and behind that a Plus-Button/Add-Author-Button to add another author, which generates a new set of a tuple for last and first name, via JS? So then it would be clear which first name corresponds to which last name.

That would be an option in the advanced search fields, and it would have the nice side effect of simplifying the search logic for us. I think there were two fields in the very beginning of Dissemin (if I remember correctly) and then this was changed to a single author field, wasn't it?

From #661 (comment)

Naive opinion here: advanced search operators in the search field of the main page are not easily discoverable (I didn't know about them either) and I don't expect that many people will use them.
My opinion would be that it's fine not to bother with them too much (e.g., make the simplest possible choices in terms of parsing), and encourage users that want to make complicated queries to use the advanced search page. (In particular when the user submits a query with advanced operators that cannot be parsed properly.)

@wetneb
Member

wetneb commented Apr 1, 2019

Note that the "title" field already indexes authors too. But if you type a name such as "Marie Farge" you will have some false positives (for instance because "Marie" and "Farge" would appear in different authors of the same paper):
https://dissem.in/search/?q=marie+farge
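To make the false positive concrete, here is a minimal sketch in plain Python. It is not the actual Dissemin or Haystack code: the paper records, the second paper's author names, and both matching functions are made up for illustration.

```python
def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def matches_flat(query, paper):
    """True if every query word appears somewhere in the title or in any author name."""
    bag = tokens(paper["title"])
    for author in paper["authors"]:
        bag |= tokens(author)
    return tokens(query) <= bag


def matches_single_author(query, paper):
    """True only if every query word appears within one and the same author name."""
    q = tokens(query)
    return any(q <= tokens(author) for author in paper["authors"])


# Hypothetical records; the second one has two different authors who
# together happen to cover both query words.
papers = [
    {"title": "Example paper A", "authors": ["Marie Farge"]},
    {"title": "Example paper B", "authors": ["Marie Dupont", "Jean Farge"]},
]

flat_hits = [p["title"] for p in papers if matches_flat("marie farge", p)]
author_hits = [p["title"] for p in papers if matches_single_author("marie farge", p)]
```

With a single flat field both papers match, whereas matching per author keeps only the first one.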

One thing that we could do is to maintain a search index for Researcher objects too. We might want to show a few matching researchers at the top of the list if there are good matches. This way, we would suggest them to the user, who could then get better search results. Django Haystack already uses a single index to store multiple models, so I assume it should not be too expensive to search both Paper and Researcher objects. But it will surely be a bit annoying to make the search view more complicated in order to handle these two types of objects.

@beckstefan
Member

I would suggest making DOIs searchable, since this is an easy way to find out whether a single publication is open access or not.

This shouldn't require too much magic, since DOIs have a very specific format. Maybe we could fetch the CrossRef data in the background in case the publication is not yet in the system, or simply redirect to a view where we put the DOI directly into the URL.

@Phyks
Member Author

Phyks commented Apr 8, 2019

This shouldn't require too much magic, since DOIs have a very specific format. Maybe we could fetch the CrossRef data in the background in case the publication is not yet in the system, or simply redirect to a view where we put the DOI directly into the URL.

This is actually already implemented as https://dissem.in/<DOI>. It looks up the paper by its DOI and fetches it from Crossref if required.

Huge 👍, however, for making it more easily discoverable and visible in the search forms!

@beckstefan
Member

The trend is definitely to use one universal search field. On the other hand: a researcher should be able to formulate a complex search, but experience shows that this is often not the case. Indeed, a complex search is not user friendly.

I am not sure what would happen if we put, e.g., all titles and names in one index and presented the output to the user. As I understand it, the current search logic is a Boolean search. This kind of search works well when the result set is small, and the result set is usually made small by adding more and more restrictions during the search. This works for most bibliographic resources, e.g. you can combine author names, words from the title and the year, but this is of course not possible when searching by name only.

On the other hand, more sophisticated search mechanisms try to search by relevance. This does not really change which results are returned, only their ordering. The idea is not to cut down the number of results during the search, but to put the presumably best result first. Implementing a relevance search implies giving up the timeline (date-ordered results), and the question remains: what is relevant?
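The difference between the two approaches can be sketched in a few lines of Python. This is only a toy: the documents are invented, and "relevance" is assumed here to be the number of matched query words, which is a stand-in and not what a real search engine computes.

```python
def boolean_search(query, docs):
    """Keep only documents containing every query word; original order is preserved."""
    words = set(query.lower().split())
    return [d for d in docs if words <= set(d.lower().split())]


def ranked_search(query, docs):
    """Keep every document matching at least one query word, best match first.

    The score (count of matched words) is an assumption made for this sketch.
    """
    words = set(query.lower().split())
    scored = [(len(words & set(d.lower().split())), d) for d in docs]
    scored = [(score, d) for score, d in scored if score > 0]
    # sorted() is stable, so ties keep their original (e.g. chronological) order.
    return [d for score, d in sorted(scored, key=lambda pair: -pair[0])]


# Made-up titles, assumed to be listed in date order.
docs = [
    "open access policy",
    "access control lists",
    "open source software",
    "quantum field theory",
]

strict = boolean_search("open access", docs)
ranked = ranked_search("open access", docs)
```

The Boolean variant returns a single title, while the ranked variant returns three, with the full match first and the date order only surviving among ties.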

Setting aside the possibility of searching by "author + title", we could build two indexes: one for authors, one for titles. The main search form performs a query on both and merges the results. On the results page we then need a one-click way to eliminate one of the two sets. The idea is that most names appear in few titles, so we can distinguish between a name and a title. Nevertheless, it would be interesting to measure this.
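As a rough illustration of the two-index idea, here is a small in-memory sketch in Python. It assumes simple inverted indexes over whitespace-separated words; the records, field names and function names are all hypothetical and have nothing to do with the real Haystack index.

```python
from collections import defaultdict


def build_index(records, field):
    """Map each lowercased word found in `field` to the ids of the records containing it."""
    index = defaultdict(set)
    for rec_id, rec in records.items():
        values = rec[field] if isinstance(rec[field], list) else [rec[field]]
        for value in values:
            for word in value.lower().split():
                index[word].add(rec_id)
    return index


def lookup(index, query):
    """Ids of the records that contain every query word in this index."""
    words = query.lower().split()
    if not words:
        return set()
    return set.intersection(*[index.get(w, set()) for w in words])


def search(query, title_index, author_index):
    """Query both indexes and keep the two result sets apart."""
    return {"title": lookup(title_index, query), "author": lookup(author_index, query)}


def merged(results, only=None):
    """Merged result ids, or just one of the two sets (the "one-click" filter)."""
    if only is not None:
        return sorted(results[only])
    return sorted(results["title"] | results["author"])


# Hypothetical records.
papers = {
    1: {"title": "Turbulence in fluids", "authors": ["Marie Farge"]},
    2: {"title": "The Farge method", "authors": ["Jane Doe"]},
}
title_index = build_index(papers, "title")
author_index = build_index(papers, "authors")

results = search("farge", title_index, author_index)
everything = merged(results)
authors_only = merged(results, only="author")
```

Keeping the two result sets separate until display time is what makes the one-click elimination cheap: it is a choice between sets already computed, not a new query.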

Maybe we can try to measure the overlap of author names with title words (there are some nice and fast algorithms out there) and experiment by adding separate indexes, feeding in the user queries and comparing how they do on average, without presenting anything to the user. Of course, this would put considerably more load on the server.
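One simple way to get a first number, sketched below in Python, is a plain set overlap between the two vocabularies. This is an assumed stand-in for the "nice and fast algorithms" mentioned above, not one of them specifically, and the sample titles and names are made up.

```python
def vocabulary(strings):
    """Set of lowercased whitespace-separated words across all strings."""
    words = set()
    for s in strings:
        words.update(s.lower().split())
    return words


def overlap_stats(titles, author_names):
    """Overlap between title words and author-name words.

    Returns the shared words, the Jaccard index of the two vocabularies,
    and the fraction of name words that also occur in titles.
    """
    title_words = vocabulary(titles)
    name_words = vocabulary(author_names)
    shared = title_words & name_words
    union = title_words | name_words
    return {
        "shared": shared,
        "jaccard": len(shared) / len(union) if union else 0.0,
        "ambiguous_name_fraction": len(shared) / len(name_words) if name_words else 0.0,
    }


# Made-up sample data.
stats = overlap_stats(
    titles=["Green chemistry", "Deep learning"],
    author_names=["Alice Green", "Bob Stone"],
)
```

A low `ambiguous_name_fraction` on real data would support the assumption that most names appear in few titles.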

@Phyks
Member Author

Phyks commented Apr 15, 2019

👍

So far, what I have in mind would be to have a unified search field on the front page, similar to what they have at http://arxiv.org/ for instance:

  • ORCID matches should be translated into authors=<ORCID> filters
  • DOI matches should be translated into DOI=<DOI> filters
  • All the rest should match against title and authors.
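A minimal sketch of this dispatch in Python could look like the following. The function name, the URL-prefix handling and the `q` fallback key are assumptions for illustration. The ORCID pattern follows the usual four-groups-of-four layout with an optional `X` check character, and the DOI pattern is a commonly used approximation that will not cover every DOI ever registered.

```python
import re

# 0000-0000-0000-000X: four groups of four, last character may be "X".
ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")
# Approximate pattern for modern DOIs: "10.", a registrant code, "/", a suffix.
DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes users commonly paste along with an identifier (assumed list).
PREFIXES = ("https://orcid.org/", "https://doi.org/", "doi:")


def classify_query(raw):
    """Translate a free-text query into a dict of search filters."""
    text = raw.strip()
    candidate = text
    for prefix in PREFIXES:
        if candidate.lower().startswith(prefix):
            candidate = candidate[len(prefix):]
            break
    if ORCID_RE.match(candidate.upper()):
        return {"authors": candidate.upper()}
    if DOI_RE.match(candidate):
        return {"DOI": candidate}
    # Everything else is matched against title and authors.
    return {"q": text}


orcid_filter = classify_query("0000-0002-1825-0097")
doi_filter = classify_query("https://doi.org/10.1000/xyz123")
text_filter = classify_query("marie farge")
```

Checking the ORCID pattern first is harmless since the two formats cannot collide, and anything that is neither falls through unchanged to the regular text search.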

I would keep the advanced search form with all of the dedicated fields. The remaining issue there is "how to handle the search form on the results page?". This is somewhat of a detail for now though.

Concerning the unified search:

  • I seem to remember that there was such a thing in the past (in the first versions of Dissemin), but I may be mixing things up. @a3nm and @wetneb would you have more background on this?
  • We should investigate the cost in terms of index (storage / performance), and this would probably mean rebuilding the index. This is quite a costly (and long) operation, so it is probably worth postponing until after the current import processes run by @wetneb.

@wetneb
Member

wetneb commented Apr 15, 2019

I am not sure how easy it is to separate the author words from the title words in a search query. It's an interesting problem, but it is indeed likely to be more expensive than simply searching against a single field where both are indexed, and I am not sure how much we would gain in terms of accuracy in the results.

Relevance-based ordering would definitely make sense too (especially given that the publication dates are not so reliable).

I personally know very little about search. All I know is that it takes about a week to re-generate a lightweight version of the index (without search publishers and journals). If you want to test any of your proposals at scale, I am happy to deploy other indexes in parallel (for instance by serving them on other ports or domains). We can do that prototyping in parallel with the other import processes as long as we are not touching the production index.
