Improve performance using threads #23
Do you see the tool being CPU-bound on your box? In my case it doesn't even reach 100% CPU on one core; it's I/O-bound. Given that, adding threads would complicate the code but is unlikely to give us noticeable performance wins. |
Hi ambv, Yes, totally 100% CPU; just note that /tmp is tmpfs, i.e. RAM. Regarding the solution: 1. I don't know Python (you seem to know a bit... ;) ) 2. I just wrote some ballpark ideas. The normal and obvious way would be to first measure the bottlenecks and then fix them. ;) #time bitrot -v (Ctrl-C)
60.2%^CTraceback (most recent call last):
File "/home/charly/.local/bin/bitrot", line 30, in <module>
run_from_command_line()
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
bt.run()
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
cur, p_uni, new_mtime, new_sha1,
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
if os.path.exists(stored_path):
File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
os.stat(path)
KeyboardInterrupt
real 1m28.468s
user 1m12.890s
sys 0m14.966s
#time for file in *; do sha1sum $file &>/dev/null; done
real 1m11.391s
user 0m19.609s
sys 0m50.629s
So, CPU-bound? ;) By the way, I did a VERY quick and dirty profiling pass with KCacheGrind yesterday before writing this. I don't know Python, so I just used the first search hit: https://julien.danjou.info/blog/2015/guide-to-python-profiling-cprofile-concrete-case-carbonara And I think I remember seeing it hit hard by lstat. :) PS: I was thinking you had abandoned this project... Cheers and thanks!! :) |
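For reference, a minimal sketch of the cProfile workflow covered in the linked article: profile a workload, print the top entries, and dump stats that KCacheGrind can open via the third-party pyprof2calltree tool. The workload function here is a stand-in, not bitrot's code:

import cProfile
import hashlib
import pstats

def workload():
    # stand-in for the code under test; replace with the real entry point
    for _ in range(1000):
        hashlib.sha1(b'x' * 65536).hexdigest()

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(10)   # top 10 by cumulative time
profiler.dump_stats('bitrot.prof')               # then: pyprof2calltree -i bitrot.prof -k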
To clarify: when I launch bitrot it starts at around 50% CPU and keeps increasing. So: #time bitrot -v
77.7%^CTraceback (most recent call last):
File "/home/charly/.local/bin/bitrot", line 30, in <module>
run_from_command_line()
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
bt.run()
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
cur, p_uni, new_mtime, new_sha1,
File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
if os.path.exists(stored_path):
File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
os.stat(path)
KeyboardInterrupt
real 11m15.066s
user 8m56.544s
sys 2m14.869s |
If you're willing to test how this could work with a multiprocessing worker pool, I'm happy to review it. I'm not going to be able to spend time implementing it myself. |
OK, I'll give it a try, not sure if this week or the next, but it seems pretty easy and fun :) https://wltrimbl.github.io/2014-06-10-spelman/intermediate/python/04-multiprocessing.html Could you recommend any profiling tool for Python (on Linux)? Cheers! |
Done: #time ~/Clones/bitrot/src/bitrot.py
Finished. 0.62 MiB of data read. 0 errors found.
32769 entries in the database, 32769 new, 0 updated, 0 renamed, 0 missing.
Updating bitrot.sha512... done.
real 0m4.879s
user 0m4.086s
sys 0m0.726s
I have created some tests in bats (a Bash testing framework) to support the implementation (what a burden it is to work on something without tests ;) ) |
1. Use two new data structures (see the sketch below):
- paths (set): every file currently present in the filesystem
- hashes (dict): replaces the SQLite query with dict[hash] = set(db paths)
2. Minimal unit tests created with bats (bash script).
See #23 for details.
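A minimal sketch of those two structures, with hypothetical names; the helper below is illustrative, not the actual patch:

paths = set()   # every file currently present in the filesystem
hashes = {}     # sha1 -> set of paths recorded in the database

def remember(sha1, path):
    # replaces a per-file SQLite SELECT with an in-memory index
    hashes.setdefault(sha1, set()).add(path)

remember('da39a3ee5e6b4b0d3255bfef95601890afd80709', '/tmp/a.txt')
paths.add('/tmp/a.txt')
print('/tmp/a.txt' in paths, hashes)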
So... using threads didn't help much. But then I used a ProcessPoolExecutor and the thing is easily 4X faster on my box. Here's a test on a 5 GB directory of 1,000 files. Before:
After:
|
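A minimal sketch of the ProcessPoolExecutor approach described above, assuming Python 3; the helper name and the 16 KiB chunk size are illustrative, not bitrot's actual code:

import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

def sha1_of(path, chunk_size=16 * 1024):
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return path, digest.hexdigest()

if __name__ == '__main__':
    files = [p for p in os.listdir('.') if os.path.isfile(p)]
    with ProcessPoolExecutor() as pool:   # one worker per CPU by default
        for path, sha1 in pool.map(sha1_of, files):
            print(sha1, path)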
@ambv do you have any insight into optimal chunk sizes? I have found the chunk size to REALLY affect the time needed to hash on my magnetic-drive RAID. |
Interesting, that's like 64X bigger. The last time I measured it was way back in 2013 and I haven't touched it since. What's the difference that you're observing? |
@ambv Going from like 7 hours to 4 hours on roughly 4 terabytes on a magnetic-drive RAID 0 array |
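For what it's worth, a rough way to probe the CPU side of the chunk-size question; the sizes below are arbitrary test points, not recommendations. On a magnetic RAID the dominant factor is the read size issued to the disks, so a representative test should read a large file from the actual array rather than hash an in-memory buffer:

import hashlib
import os
import time

payload = os.urandom(64 * 1024 * 1024)   # 64 MiB of in-memory test data

for chunk_size in (16 * 1024, 128 * 1024, 1024 * 1024):
    start = time.time()
    digest = hashlib.sha1()
    for offset in range(0, len(payload), chunk_size):
        digest.update(payload[offset:offset + chunk_size])
    print('%9d bytes/chunk: %.3fs' % (chunk_size, time.time() - start))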
Hi,
I've noticed that the runtime grows with the number of files. I'm trying to use it on a large number of files and it takes very long.
I suppose it could be made to work with threads and improved a lot; hashing with SHA-1 and inserting into SQLite single-threaded for that many files is too old school nowadays. ;)
I'm using an Intel i7, so I have plenty of spare threads to burn. I reckon something like a central buffer/MQ/DB where the files to be hashed get inserted, plus n threads that hash them and insert/update the results (or one extra thread dedicated to SQLite), could work: the workers would collect files from the central buffer, n at a time. Sounds like a cool project. ;)
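A minimal sketch of that idea with illustrative names and Python 3 modules: queue.Queue as the central buffer, four worker threads, and a results queue that a single SQLite writer could drain. Note that CPython's hashlib releases the GIL for large buffers, so threads can overlap some hashing, though as the later comments in this thread show, a process pool ended up working better:

import hashlib
import os
import queue
import threading

files = queue.Queue()     # the "central buffer"
results = queue.Queue()   # hashed results; one consumer could write to SQLite

def hasher():
    while True:
        path = files.get()
        if path is None:              # sentinel: shut this worker down
            break
        digest = hashlib.sha1()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(65536), b''):
                digest.update(chunk)
        results.put((path, digest.hexdigest()))

workers = [threading.Thread(target=hasher) for _ in range(4)]
for w in workers:
    w.start()
for path in (p for p in os.listdir('.') if os.path.isfile(p)):
    files.put(path)
for _ in workers:
    files.put(None)
for w in workers:
    w.join()
while not results.empty():
    print(results.get())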
I'm using it for this tool. ;)
https://github.com/liloman/heal-bitrots
Cheers!