Improve title-based category search #306

Open
nicolas-raoul opened this Issue Oct 25, 2016 · 2 comments

Member

nicolas-raoul commented Oct 25, 2016

  1. Chose picture
  2. Entered file name "Nintendo sign at Tokyo branch in Taito"
  3. The proposed categories did not contain anything related to Nintendo nor Taito.
  4. Waiting some time did not change the proposed categories.

That's because the API searches for "Nintendo sign at Tokyo branch in Taito" instead of "Nintendo" and "sign" and "Tokyo" and "branch" and "Taito".

We would need to split the title into words, remove grammar words such as "the", "is", "first", "to", "into" (these seem to be called stop words), or more generally all short words, and then perform a search for each remaining, seemingly relevant word.
It is less easy for languages without spaces (like Japanese), but most file names are in space-separated languages so for now that's not a big problem.
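
The splitting and filtering described above could look roughly like the sketch below. It is only an illustration: the `STOP_WORDS` set, the `min_length` threshold and the function name `extract_keywords` are assumptions made for this example, not existing app code.

```python
import re

# Hypothetical, deliberately tiny English stop-word list (illustration only).
STOP_WORDS = {"a", "an", "at", "in", "into", "is", "of", "on", "the", "to", "first"}


def extract_keywords(title, min_length=3):
    """Split a title into words, then drop stop words and very short words."""
    keywords = []
    # \w+ splits on spaces and punctuation; it does not help for languages
    # written without spaces, such as Japanese.
    for word in re.findall(r"\w+", title):
        if word.lower() in STOP_WORDS or len(word) < min_length:
            continue
        if word not in keywords:  # keep each keyword once, in title order
            keywords.append(word)
    return keywords


print(extract_keywords("Nintendo sign at Tokyo branch in Taito"))
# ['Nintendo', 'sign', 'Tokyo', 'branch', 'Taito']
```

Each returned keyword would then be sent to the category search separately.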

A bigger problem is that most titles are not in English, which means we would have to first guess the language, and then remove stop words in the context of that language. To keep the app from getting too big, we could write a multi-language extractor (for instance using nltk) and host it on a Wikimedia server.
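
One simple way to guess the language is to count how many of the title's words appear in each language's stop-word list. The sketch below assumes that heuristic; the word lists are tiny hypothetical samples and the function names are made up for this example. A hosted extractor would presumably use fuller lists, such as the ones shipped with nltk.

```python
import re

# Hypothetical, very small stop-word lists per language (illustration only).
STOP_WORDS_BY_LANGUAGE = {
    "en": {"the", "is", "at", "in", "of", "to", "into", "and"},
    "fr": {"le", "la", "les", "de", "du", "des", "et", "au", "dans"},
    "es": {"el", "la", "los", "las", "de", "del", "en", "y", "con"},
}


def guess_language(words, default="en"):
    """Pick the language whose stop-word list matches the most words."""
    lowered = [word.lower() for word in words]
    best_language, best_hits = default, 0
    for language, stop_words in STOP_WORDS_BY_LANGUAGE.items():
        hits = sum(1 for word in lowered if word in stop_words)
        if hits > best_hits:
            best_language, best_hits = language, hits
    return best_language


def extract_keywords(title):
    """Guess the title's language, then drop that language's stop words."""
    words = re.findall(r"\w+", title)
    language = guess_language(words)
    stop_words = STOP_WORDS_BY_LANGUAGE[language]
    return language, [word for word in words if word.lower() not in stop_words]


print(extract_keywords("Enseigne Nintendo dans le quartier de Taito"))
# ('fr', ['Enseigne', 'Nintendo', 'quartier', 'Taito'])
```

Titles that contain no stop words at all fall back to the default language, so short titles made only of names are left unchanged.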

Collaborator

whym commented Mar 27, 2017

> we would have to first guess the language

Do we? Since we are not just receiving a string from an unknown source but also (kind of) have access to the input side, isn't it possible to retrieve and use the locale/keyboard information? It doesn't always work, though: I could more or less enter Spanish with an English keyboard if I were too lazy to switch (and if I knew how to write in Spanish :)). But I believe it often does.
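
As a rough sketch of the locale idea, the stop-word list could be chosen from a locale tag instead of being guessed from the text. Everything here is an assumption for illustration: the tag format (`es_ES` or `es-MX`), the small word lists, the function names, and the fallback to English when the locale's language has no list.

```python
# Hypothetical, very small stop-word lists per language (illustration only).
STOP_WORDS_BY_LANGUAGE = {
    "en": {"the", "is", "at", "in", "of", "to", "into", "and"},
    "es": {"el", "la", "los", "las", "de", "del", "en", "y", "con"},
}


def language_from_locale(locale_tag, default="en"):
    """Map a tag such as 'es_ES' or 'es-MX' to a language that has a list."""
    language = locale_tag.replace("-", "_").split("_")[0].lower()
    return language if language in STOP_WORDS_BY_LANGUAGE else default


def keywords_for_locale(title, locale_tag):
    """Drop the stop words of the language implied by the locale tag."""
    stop_words = STOP_WORDS_BY_LANGUAGE[language_from_locale(locale_tag)]
    return [word for word in title.split() if word.lower() not in stop_words]


print(keywords_for_locale("Cartel de Nintendo en Tokio", "es_ES"))
# ['Cartel', 'Nintendo', 'Tokio']
```

This gets the lazy-keyboard case wrong, as noted above: a Spanish title typed under an English locale would be filtered with the English list.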

Another problem is translation: titles may not be in English, while most categories are. In theory the search can find the right category by looking at the category's multilingual descriptions, but in reality not many categories have any non-English descriptions. In the future this might be resolved by https://meta.wikimedia.org/wiki/Community_Tech/Allow_categories_in_Commons_in_all_languages but for now we'd have to translate (although, as in Nicolas's example, company names and place names sometimes might not need translation/transliteration).

Member

nicolas-raoul commented Mar 28, 2017

How about searching in English, then in the app's locale, and showing the results of both?
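
A minimal sketch of that idea follows: run each keyword through the search once for English and once for the app's locale, then merge the results without duplicates. The `search` parameter is a hypothetical callable standing in for the real category search, and `FAKE_INDEX` contains made-up data used only to make the example runnable.

```python
def search_both_languages(keywords, app_language, search):
    """Search in English first, then in the app's language, and merge results.

    `search` is a hypothetical callable: (keyword, language) -> list of names.
    """
    languages = ["en"]
    if app_language != "en":
        languages.append(app_language)

    results, seen = [], set()
    for language in languages:
        for keyword in keywords:
            for category in search(keyword, language):
                if category not in seen:  # show each category only once
                    seen.add(category)
                    results.append(category)
    return results


# Made-up index standing in for the real search backend.
FAKE_INDEX = {
    ("en", "nintendo"): ["Example category A", "Example category B"],
    ("ja", "nintendo"): ["Example category A"],
    ("ja", "taito"): ["Example category C"],
}


def fake_search(keyword, language):
    return FAKE_INDEX.get((language, keyword.lower()), [])


print(search_both_languages(["Nintendo", "Taito"], "ja", fake_search))
# ['Example category A', 'Example category B', 'Example category C']
```

English results come first here; whether they should be ranked above the locale results is an open design choice.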
