This repository has been archived by the owner on Jan 21, 2022. It is now read-only.

Asyncio rewrite of magneticod #76

Merged
merged 22 commits into boramalper:master on Jun 2, 2017

Conversation

richardkiss
Contributor

I've rewritten and simplified much of magneticod to use asyncio.

This is a big change, so I can understand if you're reluctant to merge it. If you have requests about specific stylistic or other changes, I'm happy to work with you.

I've tested it in a Digital Ocean droplet, and it got 700 hashes in 30 minutes, about the same as v0.3.0. It seems to be a bit lighter on CPU.

The code is smaller by over 100 lines, and all the globals in main are now gone.

@richardkiss
Contributor Author

I used single underscores rather than double underscores because double underscores trigger confusing name mangling that makes it more difficult to use pdb, which I depend on a lot.
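
To illustrate (a throwaway example, not from the PR; the class and attribute names are made up):

class Node:
    def __init__(self):
        self._port = 0          # single underscore: still visible as self._port in pdb
        self.__node_id = b""    # double underscore: mangled to self._Node__node_id

n = Node()
print(n._port)           # works as written
print(n._Node__node_id)  # the mangled name you have to remember inside pdb
# print(n.__node_id)     # AttributeError outside the class body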

@ad-m
Contributor

ad-m commented May 15, 2017

@richardkiss , see https://github.com/boramalper/magnetico/projects/1 — there is a pending task there for the asyncio rewrite.

@boramalper
Owner

Hey @richardkiss, thanks a ton! Really.

I'll review it when I'm done with my finals (one week left...) but in the meantime, could you check how it performs with uvloop? I'd expect some increase in hash throughput or decrease in CPU usage, and I wonder whether that will actually happen.

@richardkiss
Contributor Author

I'm skeptical because although I expect uvloop to reduce CPU usage, magneticod doesn't seem to be CPU-bound. I'll give it a try though.

@richardkiss
Contributor Author

richardkiss commented May 15, 2017

Here is the experiment I run to measure CPU usage and torrent count:

rm -rf ~/.local/share/magneticod/

(or you can rename this path if you want to preserve the data in it)

(time PYTHONPATH=`pwd` python -m magneticod.__main__ &); sleep 1800; killall python; sqlite3 ~/.local/share/magneticod/database.sqlite3 'select count(rowid) from torrents'

This runs magneticod for a half hour from a clean slate, then counts the torrents gathered.

@ad-m
Contributor

ad-m commented May 15, 2017

@richardkiss , see #60 for real speed tracking.

@richardkiss
Contributor Author

richardkiss commented May 15, 2017

OK, I have some measurements.

v0.3.0:

580 torrents
real    30m0.018s
user    4m8.596s
sys     0m32.716s

asyncio:

700 torrents
real    30m0.017s
user    2m6.848s
sys     0m32.216s

uvloop:

510 torrents
real    30m0.039s
user    1m30.584s
sys     0m26.088s

@richardkiss
Contributor Author

richardkiss commented May 15, 2017

We could also have dht keep a running total of bytes sent and received quite easily, since there is only one function for each (.sendto and .datagram_received).
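
Something like this, roughly (the counter attributes and the class name are hypothetical; the method names follow asyncio.DatagramProtocol):

import asyncio

class DHTProtocol(asyncio.DatagramProtocol):
    def __init__(self):
        self._transport = None
        self._bytes_sent = 0
        self._bytes_received = 0

    def connection_made(self, transport):
        self._transport = transport

    def sendto(self, data, addr):
        self._bytes_sent += len(data)        # count every outgoing datagram here
        self._transport.sendto(data, addr)

    def datagram_received(self, data, addr):
        self._bytes_received += len(data)    # and every incoming one here
        # ... existing message handling ...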

@itdaniher

For benchmarking, it's important to choose a unique random port for each trial, otherwise the state of the global DHT will impact the results.

@richardkiss
Contributor Author

I haven't been specifying a port on the command line, so the default of 0.0.0.0:0 is being used. I suspect that picks a random port.
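
A quick sanity check with plain sockets (independent of magneticod) suggests that binding to port 0 does hand back an ephemeral port chosen by the OS:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.bind(("0.0.0.0", 0))
print(s.getsockname())  # e.g. ('0.0.0.0', 49731) -- a fresh port each run
s.close()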

@itdaniher

You'd hope, but uvloop coming in last makes me wonder. It definitely increased my throughput with mana...

@richardkiss
Contributor Author

I agree that it doesn't make sense. I think there is a lot of variance which means one trial doesn't really give that much information: it's just a random sample.

One thing I think this data shows though is that we're within an order of magnitude in performance. And I think the code is a lot less fragile, maybe even easier to follow.

import uvloop
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
logging.info("using uvloop")
except ModuleNotFoundError:
Contributor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/artur/.local/bin/magneticod", line 9, in <module>
    load_entry_point('magneticod==0.3.0', 'console_scripts', 'magneticod')()
  File "/home/artur/.local/lib/python3.5/site-packages/magneticod/__main__.py", line 45, in main
    except ModuleNotFoundError:
NameError: name 'ModuleNotFoundError' is not defined
2017-05-17 09:15:21,906  INFO      magneticod v0.3.0 started
Traceback (most recent call last):
  File "/home/artur/.local/lib/python3.5/site-packages/magneticod/__main__.py", line 42, in main
    import uvloop
ImportError: No module named 'uvloop'

Contributor

Regardless, this suggests that setup.py should print a message during installation about whether uvloop will be used or not.

Contributor Author

Ah, I tested this on Python 3.6. I will change ModuleNotFoundError to ImportError.
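
Something along these lines (a sketch; the fallback log message is illustrative):

import asyncio
import logging

try:
    import uvloop
    asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    logging.info("using uvloop")
except ImportError:
    # ModuleNotFoundError only exists on Python 3.6+; ImportError also covers 3.5.
    logging.info("uvloop not found, using the default asyncio event loop")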

Contributor Author

Fixed.

@ad-m
Contributor

ad-m commented May 17, 2017

For the record, here is the snippet I'm using:

git checkout master;
git fetch origin pull/76/head:uvloop  --force;
git checkout uvloop;
git pull origin pull/60/head;
cd magneticod;
pip3 install . --user -U;
while true; do
    ~/.local/bin/magneticod |& tee $(date +"log-%s.txt")
done;

@ad-m
Contributor

ad-m commented May 17, 2017

The following exception appears after some time:

2017-05-17 11:52:33,894  ERROR     Exception in callback UVTransport._call_connection_made
handle: <Handle UVTransport._call_connection_made>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 52, in uvloop.loop.Handle._run (uvloop/loop.c:48414)
  File "uvloop/handles/tcp.pyx", line 141, in uvloop.loop.TCPTransport._call_connection_made (uvloop/loop.c:80488)
  File "uvloop/handles/basetransport.pyx", line 151, in uvloop.loop.UVBaseTransport._call_connection_made (uvloop/loop.c:65852)
  File "uvloop/handles/stream.pyx", line 257, in uvloop.loop.UVStream._start_reading (uvloop/loop.c:70782)
  File "uvloop/handles/handle.pyx", line 150, in uvloop.loop.UVHandle._ensure_alive (uvloop/loop.c:54917)
RuntimeError: unable to perform operation on <TCPTransport closed=True reading=False 0x2073c98>; the handler is closed

magneticod does not crash; it only logs the error.

@boramalper
Owner

magneticod does not crash; it only logs the error.

That's nice! Unfortunately, uvloop prints these errors without giving us a way to silence or catch them, I guess...

@ad-m
Contributor

ad-m commented May 17, 2017

2 hours and no crash. Average speed is 6.40 metadata / s.

CPU usage and memory grow continuously. There might be a bug, e.g. a data-structure leak, so it keeps growing and lookups become progressively slower, but this could also be fine.

Can the DHT network really take this long to bootstrap (>70 minutes) before reaching full speed, or is there some bug?
[screenshots: linux-server-overview, linux-server-overview-memory]

@ad-m
Contributor

ad-m commented May 17, 2017

Does our SQLite usage require loading the full database into memory? I flushed the database before startup, and the size of the database corresponds to the increase in memory consumption.

$ du ~/.local/share/magneticod/database.sqlite3 -h
175M	/home/artur/.local/share/magneticod/database.sqlite3

@richardkiss
Contributor Author

I found a leak of DisposablePeer objects that I will fix.

@richardkiss
Contributor Author

Regarding:

2017-05-17 11:52:33,894  ERROR     Exception in callback UVTransport._call_connection_made
handle: <Handle UVTransport._call_connection_made>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 52, in uvloop.loop.Handle._run (uvloop/loop.c:48414)
  File "uvloop/handles/tcp.pyx", line 141, in uvloop.loop.TCPTransport._call_connection_made (uvloop/loop.c:80488)
  File "uvloop/handles/basetransport.pyx", line 151, in uvloop.loop.UVBaseTransport._call_connection_made (uvloop/loop.c:65852)
  File "uvloop/handles/stream.pyx", line 257, in uvloop.loop.UVStream._start_reading (uvloop/loop.c:70782)
  File "uvloop/handles/handle.pyx", line 150, in uvloop.loop.UVHandle._ensure_alive (uvloop/loop.c:54917)
RuntimeError: unable to perform operation on <TCPTransport closed=True reading=False 0x2073c98>; the handler is closed

Since the only TCP connection we open is in bittorrent.DisposablePeer, my guess is that what's happening is: we make a connection to a remote peer, it's accepted, and then they immediately disconnect. Seems like a bug in uvloop. We should try to create a reproducible case and submit it to uvloop (I've already sent them one bug report at MagicStack/uvloop#91).

@ad-m
Contributor

ad-m commented May 17, 2017

At the end of the day (23:24 my time), just before shutting the app down, I'm sharing the current results.
Graphs: [screenshots: linux-server-overview-memory, linux-server-overview]
It is interesting that CPU consumption falls in the evening. This may, however, be due to other services on the server (a private torrent client was running the whole time).

Calculation:

$ du ~/.local/share/magneticod/database.sqlite3 -h # database size
405M	/home/artur/.local/share/magneticod/database.sqlite3
$ echo "scale=4; (" $(grep '10 metadata' -c  log-1495030970.txt).0 "* 10.0) " / $(expr $(date +"%s - 1495030970")).0 | bc # speed per second
5.3335

Logs: https://transfer.sh/qjEeb/log-1495030970.txt.tar.gz

@richardkiss , I have no idea how to debug uvloop.
It occurs three times in the log:

2017-05-17 11:52:33,894  ERROR     Exception in callback UVTransport._call_connection_made
handle: <Handle UVTransport._call_connection_made>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 52, in uvloop.loop.Handle._run (uvloop/loop.c:48414)
  File "uvloop/handles/tcp.pyx", line 141, in uvloop.loop.TCPTransport._call_connection_made (uvloop/loop.c:80488)
  File "uvloop/handles/basetransport.pyx", line 151, in uvloop.loop.UVBaseTransport._call_connection_made (uvloop/loop.c:65852)
  File "uvloop/handles/stream.pyx", line 257, in uvloop.loop.UVStream._start_reading (uvloop/loop.c:70782)
--
2017-05-17 12:57:06,304  DEBUG     closing 413078cd56a976c224b379601ab959870ffc7be8 to ('77.122.255.25', 55528)
2017-05-17 12:57:06,317  DEBUG     closing 0c049f5c227d00553871439499061d0ef8daa0e6 to ('178.93.200.30', 28265)
2017-05-17 12:57:06,318  ERROR     Exception in callback UVTransport._call_connection_made
handle: <Handle UVTransport._call_connection_made>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 52, in uvloop.loop.Handle._run (uvloop/loop.c:48414)
  File "uvloop/handles/tcp.pyx", line 141, in uvloop.loop.TCPTransport._call_connection_made (uvloop/loop.c:80488)
  File "uvloop/handles/basetransport.pyx", line 151, in uvloop.loop.UVBaseTransport._call_connection_made (uvloop/loop.c:65852)
  File "uvloop/handles/stream.pyx", line 257, in uvloop.loop.UVStream._start_reading (uvloop/loop.c:70782)
--
2017-05-17 14:06:06,037  DEBUG     closing e011f28ceef0d6f0505bbb3defc776cd214cf334 to ('5.144.190.33', 43461)
2017-05-17 14:06:06,037  DEBUG     closing 9c8ecc53755cee4dace7a7493c2bd6696a8a8e96 to ('24.202.103.219', 38212)
2017-05-17 14:06:06,038  ERROR     Exception in callback UVTransport._call_connection_made
handle: <Handle UVTransport._call_connection_made>
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 52, in uvloop.loop.Handle._run (uvloop/loop.c:48414)
  File "uvloop/handles/tcp.pyx", line 141, in uvloop.loop.TCPTransport._call_connection_made (uvloop/loop.c:80488)
  File "uvloop/handles/basetransport.pyx", line 151, in uvloop.loop.UVBaseTransport._call_connection_made (uvloop/loop.c:65852)
  File "uvloop/handles/stream.pyx", line 257, in uvloop.loop.UVStream._start_reading (uvloop/loop.c:70782)

@richardkiss
Contributor Author

I think it's a bug in uvloop, and I believe it's safe to ignore.

@richardkiss
Contributor Author

My 9b1bbfc check-in should fix at least some of the leaks.

@richardkiss
Contributor Author

Some leaks definitely still exist.

@richardkiss
Contributor Author

With this latest change, I don't know of any further outstanding leaks (although I would not be surprised if there were some).

@ad-m
Contributor

ad-m commented May 24, 2017

@richardkiss , do you need me to run this on my server and collect data on how the application behaves over a longer period, or are you still working on it and such testing is not intended yet?

@richardkiss
Contributor Author

I think testing this for memory leaks and crashes would be useful.

@richardkiss
Contributor Author

There are definitely still leaks... I ran out of memory on DigitalOcean after a few days. I need to try https://docs.python.org/3/library/tracemalloc.html on it.
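
Something like this bolted onto the event loop might do (a sketch; the interval and the top-10 cutoff are arbitrary):

import asyncio
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation

async def log_top_allocations(interval: int = 600) -> None:
    while True:
        await asyncio.sleep(interval)
        snapshot = tracemalloc.take_snapshot()
        for stat in snapshot.statistics("lineno")[:10]:
            print(stat)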

@boramalper
Owner

boramalper commented May 29, 2017

Does our SQLite usage require loading the full database into memory? I flushed the database before startup, and the size of the database corresponds to the increase in memory consumption.

Commit e4c33ce fixes the issue. I had initially thought that keeping the temporary files in memory would increase performance, but memory consumption increased drastically as well.
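
For reference, the knob in question is presumably SQLite's temp_store pragma; the exact change is in e4c33ce, this is just the shape of it:

import sqlite3

conn = sqlite3.connect("database.sqlite3")
conn.execute("PRAGMA temp_store=1;")  # 1 = FILE: keep temporary tables/indices on disk
# previously: conn.execute("PRAGMA temp_store=2;")  # 2 = MEMORY: kept them in RAM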

@richardkiss Do you think you might have mis-identified this as the cause of the leaks? If not, can you please be more specific about what you mean when you say "leaks"? I would love to help. =)

@richardkiss
Contributor Author

A couple days ago I looked to see where all the memory was going and it appeared to be the set of known info hashes. I wrote a patch to query the SQLite db instead and that drastically reduced the memory usage. I can check it in tomorrow.
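
Roughly this shape (the helper name and the info_hash column name are assumptions; the torrents table is the one queried in the benchmark command above):

def is_infohash_new(self, info_hash: bytes) -> bool:
    # Look the hash up in SQLite instead of keeping every known hash in a Python set.
    cur = self._db_conn.execute(
        "SELECT 1 FROM torrents WHERE info_hash = ? LIMIT 1;", (info_hash,)
    )
    return cur.fetchone() is None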

Owner

@boramalper boramalper left a comment

I left some comments on unclear points.

@@ -6,6 +6,7 @@
]
PENDING_INFO_HASHES = 10 # threshold for pending info hashes before being committed to database:

TICK_INTERVAL = 1 # in seconds (soft constraint)
Owner

Why did you remove the tick interval?

Contributor Author

There's no real reason to have it or to remove it; either way is fine. I probably did it because I don't usually factor constants out to one place when I write Python.

def when_peer_found(info_hash: InfoHash, peer_addr: PeerAddress) -> None:
raise NotImplementedError()
async def launch(self, loop):
self._loop = loop
Owner

Couldn't we just use asyncio.get_event_loop() here, or can it return different event loops each time it is called? I looked at the documentation and it talks about the context, but it was unclear to me what is meant by the word "context".

Contributor Author

Yes, you can do that. I think in general, it's better practice to explicitly pass along the current loop, but we could certainly get away with not doing that here.
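
For comparison, the two styles look like this (names illustrative):

import asyncio

class ExplicitLoop:
    async def launch(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop                      # explicit: the caller decides which loop we use

class ImplicitLoop:
    async def launch(self) -> None:
        self._loop = asyncio.get_event_loop()  # implicit: whatever loop the current context has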

if addr[1] == 0:
continue
def connection_lost(self, exc):
self._is_paused = True
Owner

What can connection_lost mean in a connectionless protocol? Can this ever be called in an asyncio.DatagramProtocol, and if so, under what conditions? Again, asking because I couldn't make sense of it.

Contributor Author

I don't remember whether this gets called or not. I may have just added it when I added connection_made. You could try logging or putting a pdb.set_trace() there to see if it's ever called. I believe it's called when close is called on the transport. https://github.com/python/cpython/blob/master/Lib/asyncio/selector_events.py#L621
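
E.g., as temporary instrumentation (not part of the PR):

import logging

def connection_lost(self, exc):
    logging.debug("connection_lost on the datagram transport: exc=%r", exc)
    self._is_paused = True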


async def on_tick(self) -> None:
while True:
await asyncio.sleep(1)
Owner

constants.TICK_INTERVAL could be used here, if not removed.

Contributor Author

Agreed

@boramalper
Owner

boramalper commented May 30, 2017

Hello @richardkiss. First and foremost, thank you! A lot! I was planning the same but never had enough courage and energy at the same time to dare.

I left some comments above about certain places that I couldn't understand. I went through the whole PR and it seems perfectly fine to me, although I'll make some changes myself to clarify naming and add typing annotations (plus some other small changes, if necessary).

A couple days ago I looked to see where all the memory was going and it appeared to be the set of known info hashes.

How much memory was it consuming and for how many info-hashes?

EDIT: Holy!

$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> s2 = set(os.urandom(20) for _ in range(3400000))  # 3.4 million torrent info hashes
>>> 

316.0 MiB!

Owner

@boramalper boramalper left a comment

Hey @richardkiss, once again!

I would be glad if you can clarify one more point. I'm nearly done with cleanup, annotation, and some renaming so it will be merged quite soon. =)

parent_task.child_count -= 1
try:
metadata = child_task.result()
if metadata and not parent_task.done():
Owner

Why do we check for parent_task being done here when a child got a result? I mean, if parent_task is done beforehand, and successful, all of its children will be terminated and this function cannot be called anyway.

Contributor Author

Suppose two child tasks are fetching the same metadata for a parent and they finish at the same time (or very close). The first one wakes up and sets the parent_task result, which causes the done callback to be scheduled. The scheduler might still choose the second child task to run next (why not? it's been waiting longer) before the parent has a chance to cancel it.

Contributor Author

I just looked through asyncio and confirmed that indeed, this really could happen.
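
Spelled out, the callback looks roughly like this (only the lines quoted above are from the PR; the rest is illustrative):

import asyncio
import functools

def _when_child_done(self, parent_task, child_task):
    # registered via child_task.add_done_callback(
    #     functools.partial(self._when_child_done, parent_task))
    parent_task.child_count -= 1
    try:
        metadata = child_task.result()
        # Without the done() check, a second child finishing in the same scheduler
        # pass could call set_result() on an already-finished future and raise
        # asyncio.InvalidStateError.
        if metadata and not parent_task.done():
            parent_task.set_result(metadata)
    except asyncio.CancelledError:
        pass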

@boramalper
Owner

boramalper commented Jun 2, 2017

I seem to have forgotten some commits:

  • Add some resource debug logging. 5d37737
  • Refactor create_tasks out of main. d7ead95
  • Query DB when checking if an infohash is new or not. 0e389aa

I'm trying to solve the conflict now. Sorry for my git skills. =)

@boramalper boramalper merged commit 0e389aa into boramalper:master Jun 2, 2017