Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for PostgreSQL and MySQL only #2825

Closed
nijel opened this issue Jun 22, 2019 · 24 comments
Closed

Support for PostgreSQL and MySQL only #2825

nijel opened this issue Jun 22, 2019 · 24 comments
Assignees
Labels
enhancement Adding or requesting a new feature.
Milestone

Comments

@nijel
Copy link
Member

nijel commented Jun 22, 2019

Currently Weblate supports any database backend supported by Django. That's nice, however that limits possibilities to use database specific features.

For example switching to PostgreSQL only would allow us to use it's full text search removing need for external full text engine. The Whoosh usage currently leads to some problems for example see #2549 and #2543. Having in database full text engine would also heavily improve it's performance, because currently it's first doing search in Whoosh to get fulltext matches and then filter the results in the database to perform filtering by translation and permission checks, what leads to issues as #2876.


Want to back this issue? Post a bounty on it! We accept bounties via Bountysource.

@nijel nijel added enhancement Adding or requesting a new feature. undecided These features might not be implemented. Can be prioritized by sponsorship. labels Jun 22, 2019
@nijel
Copy link
Member Author

nijel commented Jun 22, 2019

The migration guide for the database can be found here https://www.calazan.com/migrating-django-app-from-mysql-to-postgresql/

nijel added a commit that referenced this issue Jun 24, 2019
We might eventually drop support for MySQL, so better give
users an early warning.

Issue #2825

Signed-off-by: Michal Čihař <michal@cihar.com>
nijel added a commit that referenced this issue Jul 5, 2019
We will most likely drop support for MySQL, so avoid pointing new users
to it.

See #2825

Signed-off-by: Michal Čihař <michal@cihar.com>
@nijel
Copy link
Member Author

nijel commented Jul 25, 2019

Another nice benefit of switching to Postgres might be using Agagama instead of writing translation memory service on our own, see https://github.com/translate/amagama

@nijel nijel removed the undecided These features might not be implemented. Can be prioritized by sponsorship. label Aug 31, 2019
@nijel nijel added this to the 4.0 milestone Aug 31, 2019
nijel added a commit that referenced this issue Aug 31, 2019
Add check showing warning about not supported setup in future and add
migration docs.

Issue #2825

Signed-off-by: Michal Čihař <michal@cihar.com>
@nijel nijel changed the title Reconsider database backends support Support PostgreSQL only Aug 31, 2019
@nijel
Copy link
Member Author

nijel commented Aug 31, 2019

Things to improve based on this change:

@maxking
Copy link
Contributor

maxking commented Sep 20, 2019

While whoosh is pretty slow, we have seen a lot of performance improvements when using Django-haystack with Elasticsearch in Mailman project. It is definitely another service to support, but with containers it isn't a lot of work to setup. There are system packages for most operating system.

Xapian is written in C++ and also is one the supported backends in Django-haystack.

Just a thought, in case the only reason to drop support for other databases is slow searching.

@nijel
Copy link
Member Author

nijel commented Sep 20, 2019

In case the PostgreSQL only solution will not perform well, Elasticsearch is certainly worth trying. However my current impression is that we can do all what we need with PostgreSQL and reasonable query parser. We might stick with Whoosh for query parsing as it was quite easy to adapt for our purpose, see https://github.com/WeblateOrg/weblate/blob/master/weblate/utils/search.py.

PS: The performance is not the biggest issue with Whoosh for me, it does search quite fast. The biggest issue is huge memory consumption (the updating process easily consumes gigabytes of memory) and lockups while updating (I've not yet found time to debug this).

nijel added a commit that referenced this issue Sep 22, 2019
See #2825

Signed-off-by: Michal Čihař <michal@cihar.com>
@unho
Copy link

unho commented Nov 20, 2019

Another nice benefit of switching to Postgres might be using Agagama instead of writing translation memory service on our own, see https://github.com/translate/amagama

amaGama is a standalone service, so there is no need for Weblate to migrate to PostgreSQL since Weblate will communicate with it using http://docs.translatehouse.org/projects/amagama/en/latest/api.html. And yes, amaGama is PostgreSQL only, AFAICT.

@nijel
Copy link
Member Author

nijel commented Nov 21, 2019

@unho I know it's standalone service. Still for most deployments it makes sense if they can share same database.

Anyway, the main motivation of switch to PostgreSQL is to have database with decent full-text search and to get rid of Whoosh as a full-text engine which apparently doesn't scale well for our purposes.

@DominicWatson
Copy link

I would love it if the database engine implementation remained agnostic. For those of us deploying Weblate into existing infrastructure, being able to use our existing database engines is a real bonus.

If this were Postgres only, I simply wouldn't use it (I don't expect anyone else to care about that, but just stating my case here).

Seems crazy to break compatibility when you already have it.

@nijel
Copy link
Member Author

nijel commented Nov 26, 2019

I'd love to keep the compatibility, but I don't see way to do it right now. We need to replace Whoosh with something what scales better and doing the full-text search in the database seems to be the way to go. Django has built in support for it on PostgreSQL so that looks like the best choice. Support for MySQL is not implemented (see for example adamchainz/django-mysql#314 which is open for years) and we certainly don't have manpower to implement that.

I won't reject merge requests adding support for other databases, but we will definitely switch to using in database full-text search and native JSON fields in the 4.0 with initial support for PostgreSQL only.

@unho
Copy link

unho commented Nov 27, 2019

Sorry for interject. I guess that Whoosh was being used to provide translation memory matches, and you need a faster replacemente, have you considered using Elasticsearch instead?

@nijel
Copy link
Member Author

nijel commented Nov 28, 2019

It's used for search in the translations as well. Doing that in Elasticsearch would mean either to duplicate whole Weblate database there (to be able to do filtering based on user permissions, projects or languages). Right now we do ID lookup in Whoosh and then additional filtering in the database, what performs terrible in case of common terms which produce millions of matches in the full-text search.

But you're right that Elasticseach would be indeed great solution for translation memory, which is pretty much write only and needs only fulltext searches. We can also switch to using external service for translation memory such as amaGama.

@DominicWatson
Copy link

Support for MySQL is not implemented (see for example adamchainz/django-mysql#314 which is open for years) and we certainly don't have manpower to implement that.

Roger that (didn't realise that MySQL was already not supported, spent quite a bit of time trying to get it working ;) ).

Makes sense in that case.

@unho
Copy link

unho commented Nov 28, 2019

If I recall correctly Pootle has the concept of external TM which are not meant to be altered (like amaGama), and local TM using Elasticsearch which are updated every time a translation is saved and can be updated/bootstraped using TMX or PO files. More in https://pootle.readthedocs.io/en/latest/features/translation_memory.html

Just sharing in case it might help.

@nijel
Copy link
Member Author

nijel commented Nov 28, 2019

MySQL is currently supported. The Django integration just doesn't support features we will use in upcoming 4.0 release.

@unho We have external read only memories supported as well. What we're looking for is read-write support and I'm still not sure whether we will support that for third party services, but some people are interested in that as well.

@nijel
Copy link
Member Author

nijel commented Feb 20, 2020

Did some more research on this and it seems that keeping support for MySQL is doable:

The question is whether it's worth of the effort ;-).

@nijel
Copy link
Member Author

nijel commented Feb 21, 2020

On the other side, there are some limitations of MySQL which we are hitting already and causing performance penalties or other limitations:

  • Lack of support for distinct on field. See https://code.djangoproject.com/ticket/17974, current workaround is to fetch all results and do this in Python:
    def get_distinct_translations(units):
    """Return list of distinct translations.
    It should be possible to use distinct('target') since Django 1.4, but it is not
    supported with MySQL, so let's emulate that based on presumption we won't get too
    many results.
    """
    targets = {}
    result = []
    for unit in units:
    if unit.target in targets:
    continue
    targets[unit.target] = 1
    result.append(unit)
    return result
  • Bulk creation limits, big queries make server fail with "MySQL server has gone away" error, so bulk_insert has to be used with limits
  • Case insensitive comparison forces us to compare strings again in Python, see https://code.djangoproject.com/ticket/9682 and workaround we use:
    sugs = Suggestion.objects.filter(
    unit=suggestion.unit, target=suggestion.target
    ).exclude(id=suggestion.id)
    # Do not rely on the SQL as MySQL compares strings case insensitive
    for other in sugs:
    if other.target == suggestion.target:
    suggestion.delete_log(
    anonymous_user, change=Change.ACTION_SUGGESTION_CLEANUP
    )
    break
  • The utf8m4 charset (required for higher Unicode plans such as emojis) limits indexed char field to at most 192 chars
  • The regex engine in MySQL is limited what reduces search capabilities
  • The fulltext index behaves wrongly inside transactions, see https://dev.mysql.com/doc/refman/8.0/en/innodb-fulltext-index.html#innodb-fulltext-index-transaction

@rob006
Copy link
Contributor

rob006 commented Feb 21, 2020

* Case insensitive comparison forces us to compare strings again in Python, see https://code.djangoproject.com/ticket/9682 and workaround we use:

That depends on used collation. If you want strict comparison, utf8mb4_bin can be used as table charset.

* The utf8m4 charset (required for higher Unicode plans such as emojis) limits indexed char field to at most 192 chars

This is configurable, and default configuration was changed years ago, it should not be a problem anymore.

@nijel
Copy link
Member Author

nijel commented Feb 21, 2020

That depends on used collation. If you want strict comparison, utf8mb4_bin can be used as table charset.

True, but that breaks case-insensitive lookups in Django. See https://code.djangoproject.com/ticket/9682

This is configurable, and default configuration was changed years ago, it should not be a problem anymore.

Thanks for update, I was not aware of that.

I believe that many of the issues we see could be addressed at ORM level in Django, but PosgtgreSQL seems to be the primary focus for them.

@mestaritonttu
Copy link

Just ran into this, which might be useful for you: Postgresql Search: From the trenches

@nijel
Copy link
Member Author

nijel commented Feb 28, 2020

Thanks, I've already read a lot on the topic and I'm aware that it's not a silver bullet.

I've done some testing on data from Hosted Weblate and PostgreSQL performs reasonably well on that (about 10 times faster than our current implementation using Whoosh) and MySQL is performance wise about similar to current solution (this is mostly caused by lack of index merging). In both cases it would address huge memory consumption we see with Whoosh. So, it seems to be the way to go for now because it will scale from small installations to bigger ones.

As for Elasticseach, I don't think it's way to go for us. It's definitely great in doing searches on text. The problem is in the more complex queries we support. We probably would have to mirror most of our database to it (what could easily go wrong). The alternative approach of doing full text in Elasticsearch and then filtering results in database doesn't scale well for things that have millions of matches in Elasticsearch and would be heavily filtered on the database side.

@nijel nijel changed the title Support PostgreSQL only Support for PostgreSQL and MySQL only Mar 2, 2020
nijel added a commit that referenced this issue Mar 2, 2020
nijel added a commit that referenced this issue Mar 2, 2020
Support for it will be removed in the 4.0 release.

Issue #2825
nijel added a commit that referenced this issue Mar 2, 2020
nijel added a commit that referenced this issue Mar 2, 2020
It will be gone in 4.0.

Issue #2825
nijel added a commit that referenced this issue Mar 2, 2020
This is probably more widely used than original MySQL.

Issue #2825
@nijel nijel self-assigned this Mar 10, 2020
@nijel nijel unpinned this issue Mar 10, 2020
@nijel nijel closed this as completed in a091b10 Mar 10, 2020
@github-actions
Copy link

Thank you for your report, the issue you have reported has just been fixed.

  • In case you see a problem with the fix, please comment on this issue.
  • In case you see similar problem, please open separate issue.
  • If you are happy with the outcome, consider supporting Weblate by donating.

@nijel
Copy link
Member Author

nijel commented Mar 12, 2020

The good news is that I've maintained MySQL support for 4.0. The bad news is that it seems to perform really badly. The Weblate testsuite takes 35 minutes on PostgreSQL, but 150 minutes on MySQL/MariaDB. I've probably made something wrong there (or MySQL really sucks in this use case). Any suggestions to improve the peformance are welcome.

@mestaritonttu
Copy link

The good news is that I've maintained MySQL support for 4.0. The bad news is that it seems to perform really badly. The Weblate testsuite takes 35 minutes on PostgreSQL, but 150 minutes on MySQL/MariaDB. I've probably made something wrong there (or MySQL really sucks in this use case). Any suggestions to improve the peformance are welcome.

In your commit 9c75b52 you have a comment "This is workaroud for MySQL as FULL TEXT index does not work well inside a transaction, so we avoid using transactions for tests".

From what little I know, InnoDB perf can be horrible, if not utilising transactions.

@nijel
Copy link
Member Author

nijel commented Mar 12, 2020

Yes, that can be reason for the slowness (what would be good as it would not affect real usage). The tests before executed in reasonable time (https://github.com/WeblateOrg/weblate/runs/501385902?check_suite_focus=true, though there are several failures caused by not updated fulltext indices). This change however also makes MySQL actually update the full-text indices, so that might be the slow part as well.

The method however doesn't affect whether transactions are user or not, but only how tests are executed in Django. The default approach is to use transactions to rollback database to original state, which sadly doesn't work with MySQL because fulltext index is updated only on commit and thus using it during the transaction results in not getting up to date results (see https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html#innodb-fulltext-index-transaction).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding or requesting a new feature.
Projects
None yet
Development

No branches or pull requests

6 participants