Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(sqlite3.OperationalError) database is locked #4653

Closed
epigramx opened this issue Apr 3, 2021 · 47 comments · Fixed by #4671
Closed

(sqlite3.OperationalError) database is locked #4653

epigramx opened this issue Apr 3, 2021 · 47 comments · Fixed by #4671
Labels
Triage Needed Issues yet to verify

Comments

@epigramx
Copy link
Contributor

epigramx commented Apr 3, 2021

I was fiddling with the web ui (only viewing: reloading at the order tab: no actual trading or changes) and the 'regular' software crashed at that moment while it was trying to order something (it went through at the exchange but not at the database).

freqtrade/freqtradebot.py", line 625, in execute_buy
    Trade.session.flush()
  File "<string>", line 2, in flush
sqlalchemy/orm/session.py", line 3233, in flush
    self._flush(objects)
sqlalchemy/orm/session.py", line 3373, in _flush
    transaction.rollback(_capture_exception=True)
sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
sqlalchemy/util/compat.py", line 198, in raise_
    raise exception
sqlalchemy/orm/session.py", line 3369, in _flush
    transaction.commit()
sqlalchemy/orm/session.py", line 836, in commit
    trans.commit()
sqlalchemy/engine/base.py", line 2257, in commit
    self._do_commit()
sqlalchemy/engine/base.py", line 2421, in _do_commit
    self._connection_commit_impl()
sqlalchemy/engine/base.py", line 2392, in _connection_commit_impl
    self.connection._commit_impl()
sqlalchemy/engine/base.py", line 930, in _commit_impl
    self._handle_dbapi_exception(e, None, None, None, None)
sqlalchemy/engine/base.py", line 1929, in _handle_dbapi_exception
    util.raise_(
sqlalchemy/util/compat.py", line 198, in raise_
    raise exception
sqlalchemy/engine/base.py", line 928, in _commit_impl
    self.engine.dialect.do_commit(self.connection)
sqlalchemy/engine/default.py", line 634, in do_commit
    dbapi_connection.commit()
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
(Background on this error at: http://sqlalche.me/e/14/e3q8)
freqtrade.commands.trade_commands - INFO - worker found ... calling exit
@epigramx epigramx added the Triage Needed Issues yet to verify label Apr 3, 2021
@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

Also I mentioned "viewing only" to state the severity of the problem. It does not mean it should crash if the web ui was trying to edit trading while the rest of the software does the same. It's clear that there should be a transactional nature in these things and a lock of some sort that avoids this sort of thing is missing.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

This thread is interesting, though not just for the top answer, but also this answer: https://stackoverflow.com/a/53470179/277716

@xmatthias
Copy link
Member

there can be many things that cause this - blaming frequi might be a quick assumption - which i don't see confirmed by the logs you provide.

The same error will probably also happen if the sqlite database is locked at a file-system level.
Freqtrade does not use explicit cursors - so sqlalchemy will always do an immediate cursor.close().

Please note: I will start closing issues without the issue template filled out. It is unclear which version you're using (stable, develop, an older stable version) - and i am sick of having to ask that all the time as people simply delete the issue template.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

I'm not lying: the only thing that was being done to the dbase was accessing the web-ui apart from the regular operation (the web-ui was being reloaded the moment it happened).

Also it would be hard for users to prove with logging they're not doing anything else with the dbase anyway since you can't see their entire OS anyway don't you think?

linux python 3.8
ccxt 1.45.44
develop-45b7b0c98 (from a couple of days ago)

@xmatthias
Copy link
Member

i didn't say you're lying - i said that you're concluding it's frequi - while in reality it might have other causes i see no reason to exclude at the moment.

Obviously it's very difficult to prove (or disprove) if another process / app was using the file at that moment.

another stackoverflow thread suggests multiple other reasons (for example acces via command line - or even a virus scanner)...
other samples might be backup programs ...

If you're sure it was frequi, we'll need to know which request (which endpoint) was causing it - otherwise there'll be little way to have a chance of fixing this.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

It's an ubuntu server updated to the latest packages and the only thing that it was running apart from sshd is ft, so it's either very unlikely to be an OS issue or most users of ft on server environments have the bug.

I wonder if this is more related:

Traceback (most recent call last):
freqtrade/commands/trade_commands.py", line 19, in start_trading
    worker.run()
freqtrade/freqtrade/worker.py", line 74, in run
    state = self._worker(old_state=state)
freqtrade/freqtrade/worker.py", line 111, in _worker
    self._throttle(func=self._process_running, throttle_secs=self._throttle_secs)
freqtrade/freqtrade/worker.py", line 132, in _throttle
    result = func(*args, **kwargs)
freqtrade/freqtrade/worker.py", line 145, in _process_running
    self.freqtrade.process()
freqtrade/freqtradebot.py", line 188, in process
    self.enter_positions()
freqtrade/freqtradebot.py", line 388, in enter_positions
    trades_created += self.create_trade(pair)
freqtrade/freqtradebot.py", line 495, in create_trade
    return self.execute_buy(pair, stake_amount)
freqtrade/freqtradebot.py", line 625, in execute_buy
    Trade.session.flush()
  File "<string>", line 2, in flush
sqlalchemy/orm/session.py",
    self._flush(objects)
sqlalchemy/orm/session.py", line 3373, in _flush
    transaction.rollback(_capture_exception=True)
sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
sqlalchemy/util/compat.py", line 198, in raise_
    raise exception
sqlalchemy/orm/session.py", line 3369, in _flush
    transaction.commit()
sqlalchemy/orm/session.py", line 836, in commit
    trans.commit()
sqlalchemy/engine/base.py", line 2257, in commit
    self._do_commit()
sqlalchemy/engine/base.py", line 2421, in _do_commit
    self._connection_commit_impl()
sqlalchemy/engine/base.py", line 2392, in _connection_commit_impl
    self.connection._commit_impl()
sqlalchemy/engine/base.py", line 930, in _commit_impl
    self._handle_dbapi_exception(e, None, None, None, None)
sqlalchemy/engine/base.py", line 1929, in _handle_dbapi_exception
    util.raise_(
sqlalchemy/util/compat.py", line 198, in raise_
    raise exception
sqlalchemy/engine/base.py", line 928, in _commit_impl
    self.engine.dialect.do_commit(self.connection)
sqlalchemy/engine/default.py", line 634, in do_commit
    dbapi_connection.commit()
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) database is locked
(Background on this error at: http://sqlalche.me/e/14/e3q8)
freqtrade.commands.trade_commands - INFO - worker found ... calling exit

@xmatthias
Copy link
Member

is that the same log - or a new one?

i wonder as you're the first to ever report this - so if this would be a real issue, all users of frequi would encounter this - or at least more than one - considering that freqUI is inclued now fro more than one version ...

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

It's the same log: I just hadn't posted the top of it with the full ft-related parts in it.

It's not common (!): I encountered it only once after using the web UI for weeks.

I was just about to say: since SQLAlchemy documentation appears to pass the blame to the underlying database technology (when they describe OperationalError) and since this might be rare for most users: then: an exception handling that just re-tries a moment later might be useful (even though I seriously suspect that the practical cause of this is reloading the web-ui the very moment of altering the dbase with orders (which should be something reproducible with a "spammy" dry-run configuration and spam-reloading/spam-using the web-ui)).

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

I couldn't reproduce it with a spam-webpage-reloading during spam-buying in a dryrun (and the exchange hit a throttle too). I guess I'll just stop using the web UI when I suspect ordering may happen.

I noticed flush() is used in a dozen of places so one couldn't wrap it with an exception retry that easily ..even though I don't know if the web UI ever uses it indirectly or directly.

@xmatthias
Copy link
Member

reloading the UI (pressing f5) is the wrong approach - that will only download the html / javascript files - not make the API calls (which - if any - are causing this).
It's a frontend / client app - which doesn't touch the server (except for api calls to refresh the data) afterwards.

It's rather interresting as .flush() and .commit() - while there in the code for safety reasons - should not do anything as freqtrade sessions are created with autoflush=True and autocommit=True.

so i suspect that if an explicit .flush() is doing anything - you might encounter a different kinda problem "earlier" (check the logs right before the crash - maybe there was something else going on as well?)

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

The only problem that I was getting from the web UI (consistently and every time) was that it was complaining in the log "WARNING - Could not get rate for pair" for a couple of pairs but it does that every time with no further issue at all (and the log has nothing interesting before the sqlite3.OperationalError exception other than that warning and after it the order to buy (which it did at the exchange but never reached the point to save it to the db).

Interesting what you say that flush() should do nothing at all which points to a bug with SQLAlchemy itself (and I had not touched any code related to it either).

Unless perhaps the web server does some indirect double-usage of the dbase that I wouldn't know about because I don't really understand its code.

@xmatthias
Copy link
Member

which version of SQLAlchemy have you got? (pip freeze | grep sqlalchemy -i) ...

The webserver uses one of the scopedSessions that are generated in models.py - which one i cannot say, but it's according to the SQLAlchemy docs for potentially multi-threaded applications.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

SQLAlchemy==1.4.3 (when run from the ft virtual environment (outside of it it reports 1.3.23 but I doubt it affects it since ./setup -u finds it updated)).

I get the vague impression from its descriptions that if scoped_session() is used: then all the threads should access it that way to be consistent.

I wonder if the software has ways to access the dbase that are not just that thread-safe way.

@xmatthias
Copy link
Member

no, all queries run through .session.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

I can't imagine how it could happen if ft or the OS do nothing wrong and sqlalchemy would only blame the underlying dbase tech. A bug with the OS sqlite library itself perhaps?

A definite red flag is that you expect flush() to do nothing but it does if you're right about that.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 3, 2021

Wait a second. The most likely scenario here is that a flush() was called on its own (which is obvious on the trace) right before another access to the db which would call another flush() either automatically via the scoped_session() code or on its own again from ft.
I really don't think thread safety is handled perfectly either by ft or sqlalchemy because I know I haven't touched that code or double-accessed the db.

@xmatthias xmatthias changed the title Frequi (viewing-only) can crash the software if normal trading tries to edit the database at that moment: ERROR - (sqlite3.OperationalError) database is locked (sqlite3.OperationalError) database is locked Apr 3, 2021
@epigramx
Copy link
Contributor Author

epigramx commented Apr 4, 2021

Some background on it to help analyzing it and a possible solution:

ft has check_same_thread==false at create_engine(); it is done to allow it to work multithreaded; sqlalchemy docs warn "when using multiple threads with the same connection writing operations should be serialized by the user to avoid data corruption."

Part of one of the solutions to that is scoped_session() which is based on threading.local(); under the hood threading.local() restricts usage to one thread; comprehensive usage documentation for scoped_session() is here https://docs.sqlalchemy.org/en/14/orm/contextual.html#unitofwork-contextual

Later in the same page at "Thread-Local Scope" it mentions "The Session object is entirely designed to be used in a non-concurrent fashion, which in terms of multithreading means “only in one thread at a time”. So our above example of scoped_session usage, where the same Session object is maintained across multiple calls, suggests that some process needs to be in place such that multiple calls across many threads don’t actually get a handle to the same session. We call this notion thread local storage, which means, a special object is used that will maintain a distinct object per each application thread. Python provides this via the threading.local() construct. The scoped_session object by default uses this object as storage, so that a single Session is maintained for all who call upon the scoped_session registry, but only within the scope of a single thread. Callers who call upon the registry in a different thread get a Session instance that is local to that other thread."

Verdict: they are partly cryptic in text but look into the strong possibility that scoped_session() is used well only if you know you have made session objects with scoped_session() for each thread that needs one and not by using the same object on multiple threads (because if that possibility is true and only one object is used then it might be thread-safe in the sense that it protects the dbase from corruption but it will lock the dbase and crash the software in the rare occasion of concurrency from multiple threads) and the strong indication pointing to that is that scoped_session is using threading.local() (by default) which is a restrict-to-one-thread method.

@xmatthias
Copy link
Member

that's not true - when we introduced scoped_session i investigated this, and the 2 different threads (main thread, webserver thread) do get different session ID's.

This is also supported by the above / documentation - as the scoped_session is using threading.local() for session storage - and will therefore be giving a new session to each new thread. Obviously, if the same thread uses a session multiple times, the same session will be returned.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 4, 2021

Are you saying that scoped_session() is called only once, and it still gives internally a new object for each thread for the purpose of multithreading and it never needs to be called again for the purpose of multithreading specifically? If yes where is that explicitly stated or done in SQLAlchemy (because I'm not sure that even if it returns a new ID it completes the needs for that use case)?

@pizzidellamaremma
Copy link

Hi,
I'm having the same issue with current stable build on Raspberry OS 10.
I've tried stable and develop, both installed today, but in my case it crash on building trades table.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 4, 2021

^If it's an issue, I bet it's more of an issue on slow machines because race conditions are more likely. I also saw it on a slow machine (and when I tried to reproduce it and I couldn't it was on a fast machine).

@xmatthias
Copy link
Member

It's rather easy to prove that different sessions are used via debugger.

Simply set a breakpoint "somewhere" in freqtradebot.py (in my case in process())- and one in rpc.py (for example in _rpc_trade_history).
Then (in the debugger) - look at the ID for the session as follows:

Trade.query.session

2021-04-04-202039_578x156_scrot

In the screenshot above - the first 2 calls are made from the main thread - while the other 2 are made from an API invoked thread. Notice that they do have different ID's.
if after continuing, you then wait again for the main loop to break - you'll see that the id is the main one again.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 4, 2021

The .session is same (not the query.session).

Trade.session
<sqlalchemy.orm.scoping.scoped_session object at 0x7ff9d8cb7280>
Trade.query.session
<sqlalchemy.orm.session.Session object at 0x7ff9d8cb7ca0>
Trade.session
<sqlalchemy.orm.scoping.scoped_session object at 0x7ff9d8cb7280>
Trade.query.session
<sqlalchemy.orm.session.Session object at 0x7ff9f77fa850>

I assume you believe .session is correct to stay the same because it represents scoped_session() but is that certainly the correct usage for this case?

@epigramx
Copy link
Contributor Author

epigramx commented Apr 5, 2021

See this from https://docs.sqlalchemy.org/en/14/orm/contextual.html#using-thread-local-scope-with-web-applications

The scoped_session.remove() method, as always, removes the current Session associated with the thread, if any. However, one advantage of the threading.local() object is that if the application thread itself ends, the “storage” for that thread is also garbage collected. So it is in fact “safe” to use thread local scope with an application that spawns and tears down threads, without the need to call scoped_session.remove(). However, the scope of transactions themselves, i.e. ending them via Session.commit() or Session.rollback(), will usually still be something that must be explicitly arranged for at the appropriate time, unless the application actually ties the lifespan of a thread to the lifespan of a transaction.

That seems to state that even if everything is done perfectly with assigning scoped_session(): it still needs timing management of commit() or rollback().

Ft doesn't seem to use commit directly but it does autocommit with an implied flush() so I wonder if the whole problem is having explicit flush()es at all and just moving them out of the way would prevent the issue.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 5, 2021

Also it's not calling Trade.query.session.flush(), but Trade.session.flush(), so is it flushing them all at the same time without knowing if another flush is ongoing😱?

@xmatthias
Copy link
Member

not having Trade.session.flush() in the code did cause (subtle) problems in the past.

Now it's been multiple years since we had to add this, so maybe it would not be necessary, however it's one of these subtle bugs which you only encounter every few 1000 times (IIRC, then the problem was that a trade can be closed / updated for the main loop, but telegram / api still got the "wrong" information).

While removing it would not cause any immediate harm, it could still reintroduce these subtle bugs without noticing - as it's a "once in a few 1000 times" bug which you can't reliably reporduce (and therefore can't be sure it's gone either).


I don't think using Trade.session is correct, but Trade.query.session should be used instead - that's the thread-local session, whereas Trade.session is of type scopedSession - and is apparently NOT recreated when calling from a new thread - which .query is handling automatically.

Please give the PR linked to this issue a try and let me know if it fixes the problem for you (it still does not reproduce for me - like - at all - not on fast, not on slow systems).

@epigramx
Copy link
Contributor Author

epigramx commented Apr 5, 2021

Thanks: I will report if the issue with flush returns (or someone else will I guess since they can search for the flush backtrace in here).

Regarding the rarity: on one hand it's probably more common on slower machines (as evidenced by the original report and the other person reporting it) when something 'heavier' like the web ui is produced: though on the other hand I didn't reproduce it more than once on the slow machine with regular operation either so it's exceptionally rare on most machines (maybe an exception to that are the most slow machines that run it (e.g. old raspberries or when they are under additional load by other servers)).

Regarding the code: it does make more sense now since Trade.session had all ids under its wing while .query has a new id every time: it's only curious how it was working with .add etc. without having to call it only via query in the first place.

@xmatthias
Copy link
Member

well technically both are session objects - but the Trade.session is not thread-safe (which was apparently missed in the change we did).

As it doesn't reproduce it's difficult to spot too ... and obviously, using FreqUI will expose this as otherwise we'd not be hammering the database from multiple threads on a regular basis.

I suspect it's not so much a "power" issue than a "speed of IO" issue - where raspberries (with SD cards, anyway) are rather prone / affected - but it's a guess at best ...

@epigramx
Copy link
Contributor Author

epigramx commented Apr 5, 2021

Seems fine on dry and live. Only weird thing I noticed was this 'yellow' on code

image

but it's spammy with warnings by default so I guess it's nothing important.

@xmatthias
Copy link
Member

pylint is ... sometimes strange when it comes to certain things - too strickt in many instances (hence i'm not using it at all, but instead use mypy + flake8 - aligned to CI).

@epigramx
Copy link
Contributor Author

epigramx commented Apr 5, 2021

^Thanks. Yeah it's kinda spammy, and half of the time I google about it people say "disable it manually" which defies the purpose.

@xmatthias
Copy link
Member

^Thanks. Yeah it's kinda spammy, and half of the time I google about it people say "disable it manually" which defies the purpose.

Well that's what i did - but globally in the editor - as i found too often that i'd need to "disable for this line" instead of getting helpful warnings or errors.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 10, 2021

Disclaimer: I'm not confident about what the solution here is so I'm just reporting what I know only and I do not claim to know the definitive solution to make a PR; at the same time I do not demand for any help so if anyone reading does not want to help then don't.

I got a related error today but it's different (it appears it could be starting from the load from the UI again (because that's the only way those "could not get rate for pair" warnings are coming up here) though since it happened 5 seconds later it could be irrelevant):

2021-04-10 04:48:35,732 - freqtrade.rpc.rpc - WARNING -  Could not get rate for pair DON.
2021-04-10 04:48:35,734 - freqtrade.rpc.rpc - WARNING -  Could not get rate for pair QI.
2021-04-10 04:48:40,781 - freqtrade.commands.trade_commands - ERROR - (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely)
(sqlite3.OperationalError) database is locked
(Background on this error at: http://sqlalche.me/e/14/e3q8)
2021-04-10 04:48:40,792 - freqtrade.commands.trade_commands - ERROR - Fatal exception!
Traceback (most recent call last):
  File ".env/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 928, in _commit_impl
    self.engine.dialect.do_commit(self.connection)
  File ".env/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 644, in do_commit
    dbapi_connection.commit()
sqlite3.OperationalError: database is locked

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "freqtrade/commands/trade_commands.py", line 19, in start_trading
    worker.run()
  File "freqtrade/worker.py", line 74, in run
    state = self._worker(old_state=state)
  File "freqtrade/worker.py", line 111, in _worker
    self._throttle(func=self._process_running, throttle_secs=self._throttle_secs)
  File "freqtrade/worker.py", line 132, in _throttle
    result = func(*args, **kwargs)
  File "freqtrade/worker.py", line 145, in _process_running
    self.freqtrade.process()
  File "freqtrade/freqtradebot.py", line 188, in process
    if self.get_free_open_trades():
  File "freqtrade/freqtradebot.py", line 243, in get_free_open_trades
    open_trades = len(Trade.get_open_trades())
  File "freqtrade/persistence/models.py", line 641, in get_open_trades
    return Trade.get_trades(Trade.is_open.is_(True)).all()
  File ".env/lib/python3.8/site-packages/sqlalchemy/orm/query.py", line 2665, in all

(if some line numbers in freqtradebot.py are out of order it's irrelevant since I have some personal comments in it (no actual different code)

Because of the difference now being "raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely": I'm reading in docs '''autoflush – When True, all query operations will issue a Session.flush() call to this Session before proceeding.''' so I wonder if that means it issues a flush to the session with the same id for all threads and not to the query.session of the last patch so I'm considering to disable autoflush in the code as an experiment and see what happens (and by the way: the docs report that autocommit is deprecated and it will be removed in 2.0 so I don't know if the general treatment of the software regarding sessionmaker instantiation is correct).

@epigramx
Copy link
Contributor Author

epigramx commented Apr 10, 2021

So I got this again after the last comment's attempt with autoflush being off and the backtrace is from freqtradebot.py's process' Trade.query.session.flush().

It did not happen at saving a trade but after heavy usage of the web server so I guess that's progress since it didn't ruin the record.

At this moment I suspect the sqlite lib itself is incapable in some cases at slow systems but I don't want to believe that too easily.

@xmatthias
Copy link
Member

xmatthias commented Apr 10, 2021

is the database stored on a slow filesystem (like a sdcard on a raspberry)?
are you sure you're not running 2 bots against the same database? (maybe one as a "hidden / forgotten" docker container or systemd service?)

considering how python handles threading (the "GIL") - i don't think it's possible to have "true" concurrency against the database from one freqtrade instance (hence i'd want to exclude the above), as we only use multiple threads, not multiple processes (which could be active at the same time).

sqlite does not handle concurrent writes pretty well - but concurrent reads should be fine (especially as freqtrade isn't really concurrent due to the python GIL).

it's also pretty strange - i'm running quite a few bots (with UI) - in some dry-run instances on pretty slow machines - and didn't experience this problem even once.

How did you install freqtrade? docker? virtualenv? system python ... ?
which version of python ...

@epigramx
Copy link
Contributor Author

epigramx commented Apr 10, 2021

Yes it's on a slow card (directly on linux with ./setup on a venv of python 3.8.5) though the RAM is never filled so I didn't suspect the drive writes (though I probably should) and it never run two instances since I check with top (and always close it with ctrl-c after a stop anyway).

I observe that a practical cause is heavy load from the web server because I kept reloading the ui at that time and top kept showing high CPU usage from processes like kworker/1:1-events.

Looking at all three logs it appears it happens after arbitrary moments of freqtradebot.py; the three moments are: execute_buy()'s Trade.session.flush() (before the recent patch) and models.py's get_open_trades()'s Trade.get_trades(Trade.is_open.is_(True)).all() and process()'s Trade.query.session.flush() (after the session related patch).

@epigramx
Copy link
Contributor Author

epigramx commented Apr 10, 2021

If you think a slow drive is an inevitable cause, it might be possible to be tested in a way that becomes faster than even the best nvme drives by using a ram disk (which can also become persistent).

Though I find it hard to believe that a slow drive is an inevitable cause of these problems to be honest (even though I would not be surprised if a fast drive can workaround them).

@epigramx
Copy link
Contributor Author

epigramx commented Apr 11, 2021

So even a ramdisk holding the sqlite file didn't help: failed after freqtradebot.py's process()'s Trade.query.session.flush() which is the same with the previous crash and it happened several seconds AFTER the web ui was used and CLOSED.

What I'm adding here which wasn't clearly stated before (but I was thinking about it for a while) is that there appears to be system load that persists for several seconds AFTER the web ui stops being used (the browser tab closed).

It's very consistent and not a chance because I have never seen any of these problems if at least ~5 minutes have passed after using the ui.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 11, 2021

Back to the basics: I believe there's a very good chance the session handling is not safe especially when ui threads are involved in light of the fact not even a ram disk holding the database helped (even if of course a faster CPU might have workarounded the issue but that wouldn't be a practical solution here and it wouldn't solve the general issue either).

I'm reading this person that insists even with scoped_session each thread should be closing their sessions derived from a separate call to that one scoped session individually https://stackoverflow.com/a/64828162/277716 and later claims that they should be removed too https://stackoverflow.com/a/64846206/277716 by citing the official docs about it https://docs.sqlalchemy.org/en/13/orm/contextual.html

What I'm getting at: we might need to have each thread making a new session from _session() for each transaction (_session as it is currently derived from scoped session) and then remove()ing them/discarding them when that is done; right now it appears that instead all the threads get a query property from the one same _session from scoped_session and always use that (*I know debugging may show different ids but talking about the actual code conflicting with the methodology of the docs (they cite some_session = Session() after Session = scoped_session(session_factory) instead of directly querying Session as FT currently does)).

@xmatthias
Copy link
Member

If you have the possibility, please test this on a different system (ideally using the official docker image).

I run about 8+ bots on different systems (with different specs and performance) - and NONE of these has EVER had this problem in ~2 years of usage over the different versions.
This seems to be limited to you (and apparently one other user) - but not the majority (i've not seen any reports of this from anyone else in the 3 years i'm involved in this project).

This makes me think that maybe it's some kind of specifity about your system / configuration.

You also talk about "reloading the UI" - is this "clicking the "refresh" button - or pressing f5 reloading the whole UI html stuff?

@epigramx
Copy link
Contributor Author

epigramx commented Apr 11, 2021

I believe it mainly happens with relatively heavy reloading with shift+f5 etc. and in general when the ui server is pummeled with some requests (and the server's system load appears to remain for seconds after the browser tab is closed); also I'm not sure if your systems have much better CPUs since after the ramdisk on fast RAM failed to fix it then I bet it can be workarounded only with a fast CPU; also it might depend on how much trading they do (e.g. if they do only 4-5 trades at the same time the situation might be much lighter than one doing 10 or more trades).

I consider testing making a function that returns query objects from a new session that is = _session().

In general I think the docs are clear that they want people to do new_session = Session() (with Session = scoped_session())

@xmatthias
Copy link
Member

using shift+f5 with freqUI does not make any sense whatsoever - the results you get from this is a static HTML page with a lot of Javascript.
Once that page is loaded (and if login is successfull) - then freqUI will make some requests against the API to get the data / content updated.

So instead of using shift+f5 (which will not touch the database whatsoever until the UI is fully started, by the way), you should simply use the "reload" button in the "botcontrols" section of FreqUI to refresh the content of the page.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 11, 2021

I bet you're right though I find it unimportant in the context of the problem since I first noticed the problem by using the page very normally without reloading it at all or reloading it with a single hit of a browser's reload button. I mentioned shift+f5 in the context of how I was trying to make it get higher load lately while testing its resilience to the problem but I bet you're right I should be using your controls instead.

I want to see if I can follow the docs' example of new_session = Session() (when Session = scoped_session()) without messing up too much with the existing code.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 11, 2021

Oh sorry: it becomes clear to me you meant to not even use a single "normal" refresh of a browser in this case: since it may inflict heavy load to weak systems for no reason. I bet you're right that might alleviate the problem.

I'd still want to avoid the chance since who knows what else may inflict load on it (an automated security update or whatever).

@xmatthias
Copy link
Member

when you reload (shift+f5 or f5) the page, you're causing additional disk load (uvicorn needs to get the files from disk - and send them to your browser) ...
normally, F5 (without shift) "should" use most things from cache (but it's not guaranteed) - and it'll depend on the browser (and browser settings) if that's really the case.

i wouldn't consider either of the 2 as "normal" load - but neither of the 2 will crash my bot either (not on my good VPS, neither on my shitty VPS with constantly failing disks, nor on my raspberry...).

@epigramx
Copy link
Contributor Author

I bet I might have some additional conditions locally (e.g. more trades or an older/bigger database or whatever). I haven't made any changes to relevant code or use any weird OS so I don't think it's user error or even exceptionally rare (e.g. at least one other person gets it).

Also sqlite is often accused "simplistic" but in practice this software is small enough for it so I would much easier look into sqlalchemy or FT causing the problem than such an ancient overused library.

@epigramx
Copy link
Contributor Author

epigramx commented Apr 11, 2021

I just realized something very possible: what if FT is perfectly thread safe and it does nothing wrong at all it tems of SQLAlchemy handling,
HOWEVER: flimsy sqlite and when thousands of old trades are involved: it hogs the dbase file for half a second or so in some cases which is logical,
so what might make sense is to catch the locking exception and just make it retry a second later (which ironically was my first idea days ago).
After all: the software is NOT crashing because of thread unsafety: it ONLY crashes by complaining "the sqlite file is busy" basically.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Triage Needed Issues yet to verify
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants