
Improve performance using threads #23

Closed
liloman opened this issue Feb 19, 2017 · 10 comments

Comments

@liloman
Contributor

liloman commented Feb 19, 2017

Hi,

I've noticed that run time scales with the number of files. I'm trying to use it on a very large number of files and it takes very long.

cd /tmp; mkdir more; cd more/
# create a 320 KB file
dd if=/dev/zero of=masterfile bs=1 count=327680
# split it into 32768 files (instantly) + masterfile = 32769
split -b 10 -a 10 masterfile
# waiiiiiiiiiiiiiiiiiiiiiiiit
bitrot -v

I suppose it could be made to work with threads and improve a lot, because calculating the SHA1 and inserting into SQLite single-threaded for that many files is too old school nowadays. ;)

I'm using an Intel i7, so I have plenty of spare threads to burn. I reckon something like a central buffer/MQ/DB where the files to be hashed are inserted, plus n threads that collect files from that buffer (n at a time) to calculate and insert/update them (or just one extra thread dedicated to SQLite), could work. Sounds like a cool project. ;)
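Just to make the idea concrete, here is a rough Python 3 sketch (not based on bitrot's actual code; the worker count, chunk size, and names are made up) of worker threads pulling paths from a shared queue and handing digests back to a single consumer that could own the SQLite connection:

import hashlib
import os
import queue
import threading

NUM_WORKERS = 4        # e.g. one per spare core
CHUNK_SIZE = 16384     # read size used while hashing

def sha1_of(path):
    # Hash one file in fixed-size chunks.
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b''):
            digest.update(chunk)
    return digest.hexdigest()

def worker(paths, results):
    while True:
        path = paths.get()
        if path is None:   # sentinel: no more work
            break
        results.put((path, sha1_of(path)))

paths, results = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(paths, results))
           for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for root, _, files in os.walk('.'):
    for name in files:
        paths.put(os.path.join(root, name))
for _ in threads:
    paths.put(None)
for t in threads:
    t.join()

# A single consumer (e.g. the main thread) drains the results and does
# all the SQLite inserts, keeping database access single-threaded.
while not results.empty():
    print(*results.get())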

I'm using it for this tool. ;)

https://github.com/liloman/heal-bitrots

Cheers!

@ambv
Owner

ambv commented Feb 21, 2017

Do you see the tool being bound by CPU on your box? In my case it doesn't even reach 100% CPU on one core; it is I/O bound. In that sense, adding threads would complicate the code but is unlikely to give us noticeable performance wins.

@liloman
Contributor Author

liloman commented Feb 21, 2017

Hi ambv,

Yes, totally 100% CPU; just note that /tmp is tmpfs, so it's backed by RAM.

Regarding the solution:

1. I don't know Python (you seem to know a bit... ;) )

2. Those were just some ballpark ideas. The normal and obvious way would be to first measure the bottlenecks and then fix them. ;)

#time bitrot -v (ctrl -c )
 60.2%^CTraceback (most recent call last):
  File "/home/charly/.local/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
    bt.run()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
    cur, p_uni, new_mtime, new_sha1,
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
    if os.path.exists(stored_path):
  File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
    os.stat(path)
KeyboardInterrupt

real	1m28.468s
user	1m12.890s
sys	0m14.966s

#time for file in *; do sha1sum $file &>/dev/null; done
real	1m11.391s
user	0m19.609s
sys	0m50.629s

So CPU bound? ;)

By the way, I did some VERY quick and dirty profiling with KCacheGrind before writing this yesterday; I don't know Python, so I just used the first search hit:

https://julien.danjou.info/blog/2015/guide-to-python-profiling-cprofile-concrete-case-carbonara

And I think I remember seeing it hit by lstat. :)
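In case it helps to reproduce, something along these lines is roughly what that guide boils down to (the .prof file name is arbitrary): dump stats with cProfile, then inspect them with pstats, or convert the file for KCacheGrind with a tool like pyprof2calltree.

import cProfile
import pstats
from bitrot import run_from_command_line

# sys.argv would need to look like a normal invocation, e.g. ['bitrot', '-v'].
cProfile.run('run_from_command_line()', 'bitrot.prof')

stats = pstats.Stats('bitrot.prof')
stats.sort_stats('cumulative').print_stats(20)   # top 20 by cumulative time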

PS: I was thinking you had abandoned this project...

Cheers and thanks!! :)

@liloman
Contributor Author

liloman commented Feb 21, 2017

To clarify: when I launch bitrot it starts at around 50% and keeps increasing. So:

#time bitrot -v 
 77.7%^CTraceback (most recent call last):
  File "/home/charly/.local/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 524, in run_from_command_line
    bt.run()
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 230, in run
    cur, p_uni, new_mtime, new_sha1,
  File "/home/charly/.local/lib/python2.7/site-packages/bitrot.py", line 360, in handle_unknown_path
    if os.path.exists(stored_path):
  File "/usr/lib64/python2.7/genericpath.py", line 26, in exists
    os.stat(path)
KeyboardInterrupt

real	11m15.066s
user	8m56.544s
sys	2m14.869s

@ambv
Owner

ambv commented Feb 21, 2017

If you're willing to test how this could work with a multiprocessing worker pool, I'm happy to review this. I'm not going to be able to spend time implementing this myself.

@liloman
Contributor Author

liloman commented Feb 21, 2017

OK, I will give it a try, not sure if this week or the next, but it seems pretty easy and fun: :)

https://wltrimbl.github.io/2014-06-10-spelman/intermediate/python/04-multiprocessing.html

Could you recommend a profiling tool for Python (on Linux)?

Cheers!

@liloman
Contributor Author

liloman commented Feb 22, 2017

Done:
It was just a stat and SQLite issue. Now it goes blaaaaazing fast even on my laptop: ;)

#time ~/Clones/bitrot/src/bitrot.py 
Finished. 0.62 MiB of data read. 0 errors found.
32769 entries in the database, 32769 new, 0 updated, 0 renamed, 0 missing.
Updating bitrot.sha512... done.

real	0m4.879s
user	0m4.086s
sys	0m0.726s

I have created some tests in bats (bash scripts) to implement it (what a burden it is to work on something without tests ;) )

ambv pushed a commit that referenced this issue Mar 3, 2017
1. Use 2 new data structures:
   - paths (set): contains all the files in the actual filesystem
   - hashes (dictionary): replaces the SQLite query with dict[hash] = set(db paths)

2. Minimal unit tests created with bats (bash script)

See #23 for details.
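Roughly speaking, the two structures described in that commit message look like the sketch below (the names mirror the commit message, but this is just an illustration, not the actual patch):

import os

# paths: every file currently on disk, for fast membership tests
paths = set()
for root, _, files in os.walk('.'):
    for name in files:
        paths.add(os.path.join(root, name))

# hashes: replaces the per-file SQLite lookup; maps each known hash to the
# set of database paths with that hash, so rename detection becomes a dict
# lookup instead of a query per file.
hashes = {}   # e.g. {'da39a3...': {'./old/name.txt'}}

def remember(sha1, db_path):
    hashes.setdefault(sha1, set()).add(db_path)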
@ambv
Owner

ambv commented May 17, 2020

So... using threads didn't help much. But then I used a ProcessPoolExecutor and the thing is easily 4X faster on my box. Here's a test on a 5 GB directory of 1000 files.

Before:

bitrot -tv  7.46s user 1.33s system 88% cpu 9.993 total

After:

bitrot -tv  14.68s user 2.67s system 814% cpu 2.129 total
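For reference, a stripped-down sketch of the ProcessPoolExecutor approach (file discovery and the actual bitrot bookkeeping omitted; function names, chunk size, and chunksize are illustrative, not the committed code):

import hashlib
import os
from concurrent.futures import ProcessPoolExecutor

CHUNK_SIZE = 16384

def sha1_of(path):
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(CHUNK_SIZE), b''):
            digest.update(chunk)
    return path, digest.hexdigest()

def hash_all(paths, max_workers=None):
    # Each worker process hashes files independently, sidestepping the GIL;
    # the parent collects the digests and can update SQLite serially.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        yield from pool.map(sha1_of, paths, chunksize=16)

if __name__ == '__main__':
    files = [os.path.join(r, f) for r, _, fs in os.walk('.') for f in fs]
    for path, digest in hash_all(files):
        print(digest, path)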

@ambv ambv closed this as completed in 0dc3390 May 17, 2020
@p1r473
Contributor

p1r473 commented Jun 9, 2020

@ambv have you any insight into optimal chunk sizes?
Right now I am using 1048576, which is 1 MiB.
It used to be 16384, which is the block size in HFS+ and 4X the block size in ext4.

I have found the chunk size to REALLY affect the time needed to hash on my magnetic RAID drives.
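For what it's worth, a quick way to compare candidate chunk sizes on one representative big file might look like this (the file name and the size list are just examples, not a recommendation):

import hashlib
import time

def sha1_with_chunk_size(path, chunk_size):
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

for chunk_size in (16384, 131072, 1048576, 8388608):   # 16 KiB .. 8 MiB
    start = time.monotonic()
    sha1_with_chunk_size('masterfile', chunk_size)      # any big test file
    print(chunk_size, round(time.monotonic() - start, 3), 's')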

@ambv
Owner

ambv commented Jun 10, 2020

Interesting, that's like 64X bigger. The last time I measured it was way back in 2013 and I haven't touched it since. What's the difference that you're observing?

@p1r473
Contributor

p1r473 commented Jun 17, 2020

Interesting, that's like 64X bigger. The last time I measured it was way back in 2013 and I haven't touched it since. What's the difference that you're observing?

@ambv Going from roughly 7 hours down to 4 hours on about 4 terabytes on magnetic drives in RAID 0.
I have been trying to see if the RAID stripe size matters... I am not an expert on this stuff, but a bigger chunk size is definitely faster.
