
Improve performance using threads #23

Closed
liloman opened this issue Feb 19, 2017 · 10 comments

Comments

@liloman
Contributor

liloman commented Feb 19, 2017

Hi,

I've noticed that run time scales with the number of files. I'm trying to use it on a very large number of files and it takes very long.

cd /tmp; mkdir more; cd more/
# create a 320 KB file
dd if=/dev/zero of=masterfile bs=1 count=327680
# split it into 32768 files (instantly) + masterfile = 32769
split -b 10 -a 10 masterfile
# waiiiiiiiiiiiiiiiiiiiiiiiit
bitrot -v

I suppose it could be made to work with threads and improve a lot, because calculating the SHA1 and inserting into SQLite single-threaded for that many files is too old school nowadays. ;)

I'm using an Intel i7, so I have plenty of spare threads to burn. I reckon something like a central buffer/MQ/DB where the files to be hashed are inserted, plus n threads that collect files from that buffer (n at a time) to calculate and insert/update them (or just one extra thread dedicated to SQLite), could work. Sounds like a cool project. ;)
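Just to make the idea concrete, here is a rough Python 3 sketch (not based on bitrot's actual code; the worker count, chunk size, and names are made up) of worker threads pulling paths from a shared queue and handing digests back to a single consumer that could own the SQLite connection:

import hashlib
import os
import queue
import threading

NUM_WORKERS = 4        # e.g. one per spare core
CHUNK_SIZE = 16384     # read size used while hashing

def sha1_of(path):
    # Hash one file in fixed-size chunks.
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b''):
            digest.update(chunk)
    return digest.hexdigest()

def worker(paths, results):
    while True:
        path = paths.get()
        if path is None:   # sentinel: no more work
            break
        results.put((path, sha1_of(path)))

paths, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(paths, results))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for root, _, files in os.walk('.'):
    for name in files:
        paths.put(os.path.join(root, name))
for _ in threads:
    paths.put(None)
for t in threads:
    t.join()

# A single consumer (e.g. the main thread) drains the results and does
# all the SQLite inserts, keeping database access single-threaded.
while not results.empty():
    print(*results.get())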

I'm using it for this tool. ;)

https://github.com/liloman/heal-bitrots

Cheers!

@ambv
Owner

ambv commented Feb 21, 2017

Do you see the tool being bound by CPU on your box? In my case it doesn't even reach 100% CPU on one core; it is I/O bound. In that sense, adding threads would complicate the code but is unlikely to give us noticeable performance wins.

@liloman
Contributor Author

liloman commented Feb 21, 2017

Hi ambv,

Yes, totally 100% CPU; just note that /tmp is tmpfs, so it's backed by RAM.

Regarding the solution:

1. I don't know Python (you seem to know a bit... ;) )

2. Those were just some ballpark ideas. The normal and obvious way would be to first measure the bottlenecks and then fix them. ;)

#time bitrot -v (ctrl -c )
 60.2%^CTraceback (most recent call last):
  File "/home/charly/.local/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
    bt.run()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
    cur, p_uni, new_mtime, new_sha1,
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
    if os.path.exists(stored_path):
  File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
    os.stat(path)
KeyboardInterrupt

real	1m28.468s
user	1m12.890s
sys	0m14.966s

#time for file in *; do sha1sum $file &>/dev/null; done
real	1m11.391s
user	0m19.609s
sys	0m50.629s

So CPU bound? ;)

By the way, I did some VERY quick and dirty profiling with KCacheGrind before writing this yesterday; I don't know Python, so I just used the first search hit:

https://julien.danjou.info/blog/2015/guide-to-python-profiling-cprofile-concrete-case-carbonara

And I think I remember seeing it hit by lstat. :)
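In case it helps to reproduce, something along these lines is roughly what that guide boils down to (the .prof file name is arbitrary): dump stats with cProfile, then inspect them with pstats, or convert the file for KCacheGrind with a tool like pyprof2calltree.

import cProfile
import pstats
from bitrot import run_from_command_line

# sys.argv would need to look like a normal invocation, e.g. ['bitrot', '-v'].
cProfile.run('run_from_command_line()', 'bitrot.prof')

stats = pstats.Stats('bitrot.prof')
stats.sort_stats('cumulative').print_stats(20)   # top 20 by cumulative time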

PS: I was thinking you had abandoned this project...

Cheers and thanks!! :)

@liloman
Contributor Author

liloman commented Feb 21, 2017

To clarify: when I launch bitrot it starts at around 50% and keeps increasing. So:

#time bitrot -v 
 77.7%^CTraceback (most recent call last):
  File "/home/charly/.local/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
    bt.run()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
    cur, p_uni, new_mtime, new_sha1,
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
    if os.path.exists(stored_path):
  File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
    os.stat(path)
KeyboardInterrupt

real	11m15.066s
user	8m56.544s
sys	2m14.869s

@ambv
Owner

ambv commented Feb 21, 2017

If you're willing to test how this could work with a multiprocessing worker pool, I'm happy to review this. I'm not going to be able to spend time implementing this myself.

@liloman
Contributor Author

liloman commented Feb 21, 2017

OK, I will give it a try, not sure if this week or the next, but it seems pretty easy and fun: :)

https://wltrimbl.github.io/2014-06-10-spelman/intermediate/python/04-multiprocessing.html

Could you recommend a profiling tool for Python (on Linux)?

Cheers!

@liloman
Contributor Author

liloman commented Feb 22, 2017

Done:
It was just a stat and SQLite issue. Now it goes blaaaaazing fast even on my laptop: ;)

#time ~/Clones/bitrot/src/bitrot.py 
Finished. 0.62 MiB of data read. 0 errors found.
32769 entries in the database, 32769 new, 0 updated, 0 renamed, 0 missing.
Updating bitrot.sha512... done.

real	0m4.879s
user	0m4.086s
sys	0m0.726s

I have created some tests in bats (bash scripts) to implement it (what a burden it is to work on something without tests ;) )

ambv pushed a commit that referenced this issue Mar 3, 2017
1. Use 2 new data structures:
   - paths (set): contains all the files in the actual filesystem
   - hashes (dictionary): replaces the SQLite query with dict[hash] = set(db paths)

2. Minimal unit tests created with bats (bash script)

See #23 for details.
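Roughly speaking, the two structures described in that commit message look like the sketch below (the names mirror the commit message, but this is just an illustration, not the actual patch):

import os

# paths: every file currently on disk, for fast membership tests
paths = set()
for root, _, files in os.walk('.'):
    for name in files:
        paths.add(os.path.join(root, name))

# hashes: replaces the per-file SQLite lookup; maps each known hash to the
# set of database paths with that hash, so rename detection becomes a dict
# lookup instead of a query per file.
hashes = {}   # e.g. {'da39a3...': {'./old/name.txt'}}

def remember(sha1, db_path):
    hashes.setdefault(sha1, set()).add(db_path)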
@ambv
Owner

ambv commented May 17, 2020

So... using threads didn't help much. But then I used a ProcessPoolExecutor and the thing is easily 4X faster on my box. Here's a test on a 5 GB directory of 1000 files.

Before:

bitrot -tv  7.46s user 1.33s system 88% cpu 9.993 total

After:

bitrot -tv  14.68s user 2.67s system 814% cpu 2.129 total
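For reference, a stripped-down sketch of the ProcessPoolExecutor approach (file discovery and the actual bitrot bookkeeping omitted; function names, chunk size, and chunksize are illustrative, not the committed code):

import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 16384

def sha1_of(path):
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b''):
            digest.update(chunk)
    return path, digest.hexdigest()

def hash_all(paths, max_workers=None):
    # Each worker process hashes files independently, sidestepping the GIL;
    # the parent collects the digests and can update SQLite serially.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        yield from pool.map(sha1_of, paths, chunksize=16)

if __name__ == '__main__':
    files = [os.path.join(r, f) for r, _, fs in os.walk('.') for f in fs]
    for path, digest in hash_all(files):
        print(digest, path)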

@ambv ambv closed this as completed in 0dc3390 May 17, 2020
@p1r473
Contributor

p1r473 commented Jun 9, 2020

@ambv have you any insight into optimal chunk sizes?
Right now I am using 1048576, which is 1 MiB.
It used to be 16384, which is the block size in HFS+ and 4X the block size in ext4.

I have found the chunk size to REALLY affect the time needed to hash on my magnetic RAID drives.
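For what it's worth, a quick way to compare candidate chunk sizes on one representative big file might look like this (the file name and the size list are just examples, not a recommendation):

import hashlib
import time

def sha1_with_chunk_size(path, chunk_size):
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

for chunk_size in (16384, 131072, 1048576, 8388608):   # 16 KiB .. 8 MiB
    start = time.monotonic()
    sha1_with_chunk_size('masterfile', chunk_size)      # any big test file
    print(chunk_size, round(time.monotonic() - start, 3), 's')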

@ambv
Owner

ambv commented Jun 10, 2020

Interesting, that's like 64X bigger. The last time I measured it was way back in 2013 and I haven't touched it since. What's the difference that you're observing?

@p1r473
Contributor

p1r473 commented Jun 17, 2020

Interesting, that's like 64X bigger. The last time I measured it was way back in 2013 and I haven't touched it since. What's the difference that you're observing?

@ambv Going from roughly 7 hours down to 4 hours on about 4 terabytes on magnetic drives in RAID 0.
I have been trying to see if the RAID stripe size matters... I am not an expert on this stuff, but a bigger chunk size is definitely faster.
