Clarify why we need ElasticSearch #8

Open
raquelalegre opened this issue Dec 1, 2020 · 5 comments

Comments

@raquelalegre
Contributor

raquelalegre commented Dec 1, 2020

Estimate: 3 days

Most of the search requirements can be satisfied by Postgres (partial search, search by similarity, etc.). Perhaps we can get rid of ElasticSearch altogether?
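
For reference, a minimal sketch of what Postgres can already do through django.contrib.postgres; the Fragment model and field names are hypothetical placeholders, not our actual schema:

```python
from django.contrib.postgres.search import SearchQuery, SearchVector, TrigramSimilarity

from myapp.models import Fragment  # hypothetical model with a plain text field

# Full-text search on stemmed words ("partial" search)
matches = Fragment.objects.annotate(
    search=SearchVector("original_text"),
).filter(search=SearchQuery("bellum"))

# Similarity search, tolerant of spelling variation (needs the pg_trgm extension)
similar = (
    Fragment.objects.annotate(similarity=TrigramSimilarity("original_text", "bellum"))
    .filter(similarity__gt=0.3)
    .order_by("-similarity")
)
```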

From the Reqs doc:

  • Search mechanism for ancient texts, summaries and comments, to accept terms in English, Latin and Greek.
  • Search should account for different spelling variations, symbols and grammatical forms, whether by use of wildcards ("*"), automatic substitution (e.g. v ↔ u) or more sophisticated rules.
  • Search should not be affected by editorial symbols and other non-textual characters.
  • Option to restrict search to text in particular fields (e.g. fragment text, testimonia text, antiquarian essay), or to search across all content.
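
One way the spelling-variation and editorial-symbol requirements could be handled is to normalise text before indexing and apply the same normalisation to incoming search terms. A rough sketch; the symbol set and substitution rules here are placeholders and would need to come from the project team:

```python
import re
import unicodedata

# Placeholder set of editorial symbols; the real list would come from the reqs doc.
EDITORIAL_SYMBOLS = re.compile(r"[\[\]<>{}*†]")

def normalise(text: str) -> str:
    """Normalisation applied both at indexing time and to search terms."""
    text = unicodedata.normalize("NFC", text)
    text = EDITORIAL_SYMBOLS.sub("", text)   # drop editorial symbols
    text = text.lower()
    text = text.replace("v", "u")            # v <-> u substitution from the reqs
    return text
```
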
@raquelalegre
Contributor Author

  • Collate the use cases for searching the database and the exact feature requirements.
  • Examine / prototype whether these can be done using Postgres search tools.
  • It's possible that Elastic was included in the tech considerations because it was used by Arches, but it may only have been needed there because that platform's internal data storage was unsuitable for fast search.
  • If ElasticSearch is not required, the tech stack is simpler and might be easier to code, deploy and maintain.

@raquelalegre
Contributor Author

We'll wait for some feedback once the project team starts using the application's search functionality, but it doesn't sound like they'll have very complex search requirements. See #1.

@raquelalegre
Contributor Author

raquelalegre commented Jan 12, 2021

Looking further into ElasticSearch vs PostgreSQL: although the functionality requested by the team is not very complex, PostgreSQL will become less performant than Elastic as the team adds more text. However, maintaining Elastic is much more costly, and we'd have to change the current data ingestion functionality, since so far everything has been done only in Postgres.

We could keep it simple, using only Postgres in a way that is about as performant as Elastic, as explained in this post. This requires some database tuning as well as some tweaking of the queries, which poses a problem since this is a Django application and all SQL is autogenerated. It is possible to write bespoke SQL statements, but they would be harder to maintain and less understandable.

Next steps would be:

  • Find which database queries actually need the tweaking - if it's too many of them, we'd be better off with ElasticSearch.
  • See how GIN indexes can be added to the model declaration in Django, along with the other performance tweaks mentioned in the post (see the sketch after this list).
  • See what Postgres-specific optimisation is provided in the version of Django we use.
  • What would it imply to start using Elastic? Take into account that this won't be part of the MVP/first release and might imply migrating data that is only in Postgres to Elastic rather than copying it.
  • What does maintenance of Elastic imply?
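
For the GIN index item above, a rough sketch of what the model-level declaration could look like, again with a hypothetical Fragment model and illustrative field names:

```python
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class Fragment(models.Model):
    original_text = models.TextField()
    # Precomputed tsvector column, kept up to date by a trigger or in save(),
    # so each query doesn't have to recompute to_tsvector over the whole text.
    search_vector = SearchVectorField(null=True)

    class Meta:
        indexes = [GinIndex(fields=["search_vector"])]
```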

@raquelalegre raquelalegre added this to the Mid-project milestone Feb 25, 2021
@raquelalegre
Contributor Author

It seems like the text search is getting complicated with Postgres, so we are considering using ElasticSearch. We could:

  1. Replace Postgres completely with Elastic - not a very good idea, as we might lose data from the production server.
  2. Duplicate the data and save it in both Postgres and Elastic. We've considered two approaches:
    • Overload the save method that writes to the Postgres DB so it also writes to Elastic (rough sketch below) - this might not be a good idea either, as we might miss places in the codebase where the database is modified and eventually end up with two different versions of the data.
    • Have a process (e.g. something configured with Celery and RabbitMQ) that mirrors the two databases, always keeping Postgres as the main DB.

We'd only use the Elastic version of the database for the search functionality of the website.
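
A rough sketch of the save-override approach in option 2 (with the caveat above about missed write paths). Everything here - the client setup, index name and field mapping - is a placeholder; a Celery task could wrap the indexing call instead so web requests don't block on Elastic:

```python
from django.db import models
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch("http://localhost:9200")  # placeholder connection settings

class Fragment(models.Model):
    original_text = models.TextField()

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # Mirror only the searchable fields; Postgres remains the source of truth.
        es.index(
            index="fragments",  # placeholder index name
            id=self.pk,
            document={"original_text": self.original_text},  # "body=" in older clients
        )
```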

Some questions we need to address:

  • VM data storage:
    • How much storage space do we have left in production? Would it fit a copy of the database?
    • How much space does the current Postgres database take? Would it be comparable to what we expect an Elastic version of it to take? (See the snippet after this list for checking the current size.)
    • If we need to increase data storage, how do we fund it?
  • Should we mirror the whole database or only the fields that are searched, i.e. those in the dropdown in the search section, or those with more text like original texts, translations, biographies and comments?
  • How often should we mirror the DBs? Can we configure a trigger instead of running the mirror at a fixed interval, e.g. every hour?
  • How does our data model have to change in order to make the queries that need running available?
  • How can we deal with languages and translations?
  • How about rare characters? How are we handling those currently in Django?
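
For the storage question, the current Postgres database size can be checked from a Django shell with standard Postgres functions (a quick sketch):

```python
from django.db import connection

with connection.cursor() as cursor:
    # pg_database_size / pg_size_pretty are built-in Postgres functions
    cursor.execute("SELECT pg_size_pretty(pg_database_size(current_database()))")
    print(cursor.fetchone()[0])
```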

@acholyn
Collaborator

acholyn commented Jun 13, 2024

Might be worth using in regard to #433 and #454; otherwise, we should get rid of it.
