Clarify why we need ElasticSearch #8

Open
raquelalegre opened this issue Dec 1, 2020 · 5 comments

Comments

@raquelalegre
Contributor

raquelalegre commented Dec 1, 2020

Estimate: 3 days

Most of the search requirements can be satisfied by Postgres (partial search, search by similarity, etc.). Perhaps we can get rid of ElasticSearch altogether?
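
For reference, a minimal sketch of what Postgres can already do through django.contrib.postgres; the Fragment model and field names are hypothetical placeholders, not our actual schema:

```python
from django.contrib.postgres.search import SearchQuery, SearchVector, TrigramSimilarity

from myapp.models import Fragment  # hypothetical model with a plain text field

# Full-text search on stemmed words ("partial" search)
matches = Fragment.objects.annotate(
    search=SearchVector("original_text"),
).filter(search=SearchQuery("bellum"))

# Similarity search, tolerant of spelling variation (needs the pg_trgm extension)
similar = (
    Fragment.objects.annotate(similarity=TrigramSimilarity("original_text", "bellum"))
    .filter(similarity__gt=0.3)
    .order_by("-similarity")
)
```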

From the Reqs doc:

  • Search mechanism for ancient texts, summaries and comments, to accept terms in English, Latin and Greek.
  • Search should account for different spelling variations, symbols and grammatical forms, whether by use of wildcards ("*"), automatic substitution (e.g. v ↔ u) or more sophisticated rules.
  • Search should not be affected by editorial symbols and other non-textual characters.
  • Option to restrict search to text in particular fields (e.g. fragment text, testimonia text, antiquarian essay), or to search across all content.
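
One way the spelling-variation and editorial-symbol requirements could be handled is to normalise text before indexing and apply the same normalisation to incoming search terms. A rough sketch; the symbol set and substitution rules here are placeholders and would need to come from the project team:

```python
import re
import unicodedata

# Placeholder set of editorial symbols; the real list would come from the reqs doc.
EDITORIAL_SYMBOLS = re.compile(r"[\[\]<>{}*†]")

def normalise(text: str) -> str:
    """Normalisation applied both at indexing time and to search terms."""
    text = unicodedata.normalize("NFC", text)
    text = EDITORIAL_SYMBOLS.sub("", text)   # drop editorial symbols
    text = text.lower()
    text = text.replace("v", "u")            # v <-> u substitution from the reqs
    return text
```
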
@raquelalegre
Contributor Author

  • Collate the use cases for searching the database and the exact feature requirements.
  • Examine / prototype whether these can be done using Postgres search tools.
  • It's possible that Elastic was included in the tech considerations because it was used by Arches, but it may only have been needed there because that platform's internal data storage was unsuitable for fast search.
  • If ElasticSearch is not required, the tech stack is simpler and might be easier to code, deploy and maintain.

@raquelalegre
Contributor Author

We'll wait for some feedback once the project team starts using the application's search functionality, but it doesn't sound like they'll have very complex search requirements. See #1.

@raquelalegre
Contributor Author

raquelalegre commented Jan 12, 2021

Looking further into ElasticSearch vs PostgreSQL: although the functionality requested by the team is not very complex, PostgreSQL will become less performant than Elastic as the team adds more text. However, maintaining Elastic is much more costly, and we'd have to change the current data ingestion functionality, since so far everything has been done only in Postgres.

We could keep it simple, using only Postgres in a way that is about as performant as Elastic, as explained in this post. This requires some database tuning as well as some tweaking of the queries, which poses a problem since this is a Django application and all SQL is autogenerated. It is possible to write bespoke SQL statements, but they would be harder to maintain and less understandable.

Next steps would be:

  • Find which database queries actually need the tweaking - if it's too many of them, we'd be better off with ElasticSearch.
  • See how GIN indexes can be added to the model declaration in Django, along with the other performance tweaks mentioned in the post (see the sketch after this list).
  • See what Postgres-specific optimisation is provided in the version of Django we use.
  • What would it imply to start using Elastic? Take into account that this won't be part of the MVP/first release and might imply migrating data that is only in Postgres to Elastic rather than copying it.
  • What does maintenance of Elastic imply?
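
For the GIN index item above, a rough sketch of what the model-level declaration could look like, again with a hypothetical Fragment model and illustrative field names:

```python
from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVectorField
from django.db import models

class Fragment(models.Model):
    original_text = models.TextField()
    # Precomputed tsvector column, kept up to date by a trigger or in save(),
    # so each query doesn't have to recompute to_tsvector over the whole text.
    search_vector = SearchVectorField(null=True)

    class Meta:
        indexes = [GinIndex(fields=["search_vector"])]
```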

@raquelalegre raquelalegre added this to the Mid-project milestone Feb 25, 2021
@raquelalegre
Contributor Author

It seems like the text search is getting complicated with Postgres, so we are considering using ElasticSearch. We could:

  1. Replace Postgres completely with Elastic - not a very good idea, as we might lose data from the production server.
  2. Duplicate the data and save it in both Postgres and Elastic. We've considered two approaches:
    • Overload the save method that writes to the Postgres DB so it also writes to Elastic (rough sketch below) - this might not be a good idea either, as we might miss places in the codebase where the database is modified and eventually end up with two different versions of the data.
    • Have a process (e.g. something configured with Celery and RabbitMQ) that mirrors the two databases, always keeping Postgres as the main DB.

We'd only use the Elastic version of the database for the search functionality of the website.
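
A rough sketch of the save-override approach in option 2 (with the caveat above about missed write paths). Everything here - the client setup, index name and field mapping - is a placeholder; a Celery task could wrap the indexing call instead so web requests don't block on Elastic:

```python
from django.db import models
from elasticsearch import Elasticsearch  # official Python client

es = Elasticsearch("http://localhost:9200")  # placeholder connection settings

class Fragment(models.Model):
    original_text = models.TextField()

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # Mirror only the searchable fields; Postgres remains the source of truth.
        es.index(
            index="fragments",  # placeholder index name
            id=self.pk,
            document={"original_text": self.original_text},  # "body=" in older clients
        )
```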

Some questions we need to address:

  • VM data storage:
    • How much storage space do we have left in production? Would it fit a copy of the database?
    • How much space does the current Postgres database take? Would it be comparable to what we expect an Elastic version of it to take? (See the snippet after this list for checking the current size.)
    • If we need to increase data storage, how do we fund it?
  • Should we mirror the whole database or only the fields that are searched, i.e. those in the dropdown in the search section, or those with more text like original texts, translations, biographies and comments?
  • How often should we mirror the DBs? Can we configure a trigger instead of running the mirror at a fixed interval, e.g. every hour?
  • How does our data model have to change in order to make the queries that need running available?
  • How can we deal with languages and translations?
  • How about rare characters? How are we handling those currently in Django?
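
For the storage question, the current Postgres database size can be checked from a Django shell with standard Postgres functions (a quick sketch):

```python
from django.db import connection

with connection.cursor() as cursor:
    # pg_database_size / pg_size_pretty are built-in Postgres functions
    cursor.execute("SELECT pg_size_pretty(pg_database_size(current_database()))")
    print(cursor.fetchone()[0])
```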

@acholyn
Collaborator

acholyn commented Jun 13, 2024

Might be worth using in regard to #433 and #454; otherwise, we should get rid of it.
