Improve title-based category search #306
That's because the API searches for "Nintendo sign at Tokyo branch in Taito" instead of "Nintendo" and "sign" and "Tokyo" and "branch" and "Taito".
We would need to split the title into words, remove grammar words such as "the", "is", "first", "to", "into" (these seem to be called stop words) or more generally all small words, and then perform a search for each seemingly relevant word.
A bigger problem is that most titles are not in English, which means we would first have to guess the language and then extract stop words in the context of that language. To avoid making the app too big, we could write a multi-language extractor (for instance using nltk) and host it on a Wikimedia server.
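Here is a rough sketch of that splitting step in Python, just to make the idea concrete. The `STOP_WORDS` set, the `extract_keywords` name and the `min_length` cutoff are all made up for illustration; a real stop-word list would be much longer, and the app itself would need its own implementation.

```python
import re

# Illustrative English stop-word list (an assumption, not a complete list).
STOP_WORDS = {"a", "an", "the", "is", "at", "in", "of", "to", "into",
              "on", "and", "first"}

def extract_keywords(title, stop_words=STOP_WORDS, min_length=3):
    """Split a title into words and keep only the seemingly relevant ones."""
    words = re.findall(r"\w+", title)
    keywords = []
    seen = set()
    for word in words:
        key = word.lower()
        # Drop stop words, very short words, and repeated words.
        if key in stop_words or len(key) < min_length or key in seen:
            continue
        seen.add(key)
        keywords.append(word)
    return keywords

print(extract_keywords("Nintendo sign at Tokyo branch in Taito"))
# ['Nintendo', 'sign', 'Tokyo', 'branch', 'Taito']
```

Each returned word could then be sent as its own category search instead of searching for the whole title.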
Do we? Since we are not just receiving a string from an unknown source but also (kind of) have access to the input side, isn't it possible to retrieve and use the locale/keyboard information? It doesn't always work, though - I could more or less enter Spanish with an English keyboard if I were too lazy to switch (and if I knew how to write in Spanish :)). But I believe it often does.
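To illustrate the locale idea, a minimal Python sketch, assuming the locale arrives as a tag like `es_ES` or `en-US`. The `STOP_WORDS_BY_LANGUAGE` table and both function names are hypothetical, and the word lists are tiny samples. When the language is unknown, it falls back to dropping small words only.

```python
import re

# Tiny sample stop-word lists per language code (assumptions for illustration).
STOP_WORDS_BY_LANGUAGE = {
    "en": {"the", "is", "at", "in", "of", "to", "into", "a", "an", "and"},
    "es": {"el", "la", "los", "las", "de", "del", "en", "y", "un", "una"},
}

def language_from_locale(locale_tag):
    """Return the language part of a tag such as 'es_ES' or 'en-US'."""
    if not locale_tag:
        return None
    return re.split(r"[-_]", locale_tag.strip())[0].lower() or None

def keywords_for_locale(title, locale_tag, min_length=3):
    """Filter title words using the stop words suggested by the locale."""
    language = language_from_locale(locale_tag)
    # Unknown language: no stop-word list, only the length filter applies.
    stop_words = STOP_WORDS_BY_LANGUAGE.get(language, set())
    words = re.findall(r"\w+", title)
    return [w for w in words
            if w.lower() not in stop_words and len(w) >= min_length]

print(keywords_for_locale("Letrero de Nintendo en Tokio", "es_ES"))
# ['Letrero', 'Nintendo', 'Tokio']
```

As noted, the locale is only a hint: a Spanish title typed on an English keyboard would be filtered with the wrong list, so the length filter acts as a safety net.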
Another problem is translation - titles may not be in English, while most categories are in English. In theory the search can find the right category by looking at the category's multilingual descriptions, but in reality not many categories have any non-English descriptions. In future this might be resolved by https://meta.wikimedia.org/wiki/Community_Tech/Allow_categories_in_Commons_in_all_languages but for now we'd have to translate (although, as in Nicolas's example, sometimes company names and place names might not need translation/transliteration).