Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support matching double-barrelled names #780

Open
CHTJonas opened this issue Dec 14, 2019 · 5 comments
Open

Support matching double-barrelled names #780

CHTJonas opened this issue Dec 14, 2019 · 5 comments

Comments

@CHTJonas
Copy link
Member

@CHTJonas CHTJonas commented Dec 14, 2019

I've just noticed that searching for 'harding' produces a hit for someone called 'Izzie Harding-Perrott' whereas searching for 'perrot' does not. I think we should probably be hyphen-agnostic in these kind of instances.

@GKFX

This comment has been minimized.

Copy link
Member

@GKFX GKFX commented Jan 17, 2020

I do wonder whether we'd be better off searching the database directly with a MySQL query. Elasitcsearch seems to only cause trouble in both poor results and having to keep the indexes up to date. It also occupies about ⅙ of Antigone's RAM. Perhaps there might be some performance issues if we received a huge volume of searches but I don't think we operate on a scale where that would be a concern.

This would also make an "advanced search" page etc. very easy to implement as that's just a few more conditions to add to a where clause.

@CHTJonas

This comment has been minimized.

Copy link
Member Author

@CHTJonas CHTJonas commented Jan 18, 2020

👍 to the above, at least in principle; Elasticsearch is great, but I feel we could be overcomplicating things at our scale. Also reclaiming some RAM back would also be nice! 😄

As a quick example, I was investigating Elasticsearch for ADC Room Booking but decided to go with the full text searching functionality that was present in the database instead (Room Booking uses Postgres instead of MySQL/MariaDB so not sure how they compare here). It might be that actually coding around non-ASCII characters, common name permutations, spelling mistakes and strange edge cases is more complicated than just using Elasticsearch but then again the DB might actually handle this for us.

Also relevant to this discussion is #519 and #489 which was also "advanced search" at one point.

@GKFX

This comment has been minimized.

Copy link
Member

@GKFX GKFX commented Jan 18, 2020

To see how MySQL handles non-ASCII characters I tried putting some accents in https://www.camdram.net/shows/1997-api-test-1 and ran

SELECT title FROM acts_shows WHERE title LIKE "API %";

and it successfully returned both ÁPÏ Test 1 and API Test 2.

SELECT title FROM acts_shows WHERE title = "API Test 1";

also returns ÁPÏ Test 1, so it seems that both the LIKE and even the equality operator are insensitive to accents.

MySQL (and MariaDB) actually seem to have a search engine mode built in which might be our best approach; it just needs a FULLTEXT index on the columns we want to search against. This would mean we could also search against the body text of shows etc.

@philosophicles

This comment has been minimized.

Copy link
Member

@philosophicles philosophicles commented Jan 20, 2020

so it seems that both the LIKE and even the equality operator are insensitive to accents

Note, not sure if I'm stating the obvious to you both here, but that behavior is only the case because the prod database was created with a case-insentitive default collation (utf8mb4_unicode_ci).

It's unlikely but possible that specific objects within the database use different collations, which might be case-sensitive (having a cs suffix, predictably enough). Collations also define accent-sensitivity, somewhat independently from case-sensitivity in at least some cases.

So this isn't guaranteed to behave exactly the same for every possible query - it's not an intrinsic property of LIKE or = operator behaviour.

(I'm familiar with how collations work on MS SQL Server mainly; MySQL/MariaDB will presumably be similar but certainly not identical in behaviour.)

MySQL (and MariaDB) actually seem to have a search engine mode built in which might be our best approach; it just needs a FULLTEXT index on the columns we want to search against. This would mean we could also search against the body text of shows etc.

I've used full-text indexing on MS SQL Server and it certainly works pretty well there. So I'd also agree this is worth exploring further.

@CHTJonas

This comment has been minimized.

Copy link
Member Author

@CHTJonas CHTJonas commented Jan 23, 2020

Thanks @philosophicles - I'd forgotten the peculiarities of MySQL collations!

I do like the look of SELECT * FROM acts_shows WHERE MATCH (title,author) AGAINST ('Shakespeare' IN NATURAL LANGUAGE MODE); and seems like it could be well-suited to our needs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.