Do faster hashing #1462

Closed

SubJunk wants to merge 5 commits into master from faster_hashing

Conversation

3 participants
Member

SubJunk commented Mar 23, 2018

No description provided.

SubJunk added some commits Mar 24, 2018


Collaborator

Nadahar commented Mar 24, 2018

@SubJunk Although mapped files are much faster for things like this, I understand that they have a somewhat high cost to establish. Mapped files also consume some memory, and they will probably break on 32-bit OSes. The memory they consume is outside the Java heap, so it won't be visible there.

32-bit processes have a limited virtual address space, so mapping 5-6 GB files simply isn't possible. That's why I've stayed away from mapped files so far. I've been thinking that if we ever implement it, we'd need an alternative solution for 32-bit.

The ideal solution would be a way to read the first 64k and then jump straight to the last 64k. Could RandomAccessFile be a better solution, where you simply move the cursor? I know it's kind of "old school", but that's how we used to work with files "back in the day", and I think it has filesystem support so that no data has to be read when you move the cursor. The cursor is a 64-bit long, so it should be able to address huge files.

This obviously won't work for streams either, only for files on disk.
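
For illustration, here is a minimal, hypothetical sketch (not code from this PR) of the seek-based approach described above, assuming the hash in question is the usual OpenSubtitles-style checksum of the file size plus the first and last 64 KiB; the class and method names are made up:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class SeekHashSketch {

    private static final long CHUNK_SIZE = 64 * 1024; // 64 KiB head and tail

    /**
     * File size plus 64-bit sums of the first and last 64 KiB, read via
     * seek() so the middle of the file is never touched.
     */
    public static String computeHash(RandomAccessFile raf) throws IOException {
        long size = raf.length();
        long chunk = Math.min(CHUNK_SIZE, size);

        long hash = size;
        hash += sumLongs(raf, 0, chunk);                               // first chunk
        hash += sumLongs(raf, Math.max(size - CHUNK_SIZE, 0), chunk);  // last chunk

        return String.format("%016x", hash);
    }

    private static long sumLongs(RandomAccessFile raf, long position, long length) throws IOException {
        byte[] buffer = new byte[(int) length];
        raf.seek(position);    // moves the file pointer without reading any data
        raf.readFully(buffer);

        ByteBuffer bytes = ByteBuffer.wrap(buffer).order(ByteOrder.LITTLE_ENDIAN);
        long sum = 0;
        while (bytes.remaining() >= 8) {
            sum += bytes.getLong();   // 64-bit overflow is intentional and part of the hash
        }
        return sum;
    }
}
```

The point is that seek() only repositions the file pointer, so a 21 GB file should cost no more to hash than a 1 GB one, as long as the filesystem supports seeking.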


Member

SubJunk commented Mar 24, 2018

I compared the speed before and after, and unfortunately it wasn't as much of a bottleneck as I anticipated, taking only around 30ms per file. It still adds up over thousands of files, but I doubt this was the cause of the problem.
Yes, the RandomAccessFile solution seems promising. Can you commit it to this branch so I can benchmark it the same way?


Collaborator

Nadahar commented Mar 24, 2018

@SubJunk I don't know how you benchmark this, but be aware that it can be really tricky to get right. It obviously depends on the speed and type of hard drive, but there are also multiple cache mechanisms involved. Many hard drives will read ahead and cache what they anticipate will be read next, for example. I guess you need to do this from many threads at once to really see the consequences; on traditional hard drives (not SSDs) that forces the read head to jump around a lot. This is when you see the speed drop drastically; compare the sequential read versus random access figures found in benchmarks.

I don't know the size of your test files either, but a typical modern hard drive read speed is somewhere in the range of 140-180 MB/s. I can't understand how a 4 GB file can be hashed in 30 ms if the file actually has to be read through.

I can implement random access, but I'm in the middle of rewriting the XML parsing, so it will have to wait until I'm done with that commit.
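
As a rough illustration of the kind of measurement being discussed, a hypothetical timing harness (not the net.pms.Benchmark class added later in this thread) might hash every file in a folder from a configurable number of threads, which is what exposes the seek penalty on spinning disks. It reuses the SeekHashSketch.computeHash() helper sketched earlier:

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public final class HashBenchmarkSketch {

    public static void main(String[] args) throws Exception {
        File folder = new File(args[0]);             // folder containing the test files
        int numThreads = Integer.parseInt(args[1]);  // e.g. 1 vs 100 to provoke read-head seeking

        File[] files = folder.listFiles(File::isFile);
        if (files == null || files.length == 0) {
            System.err.println("No files to hash in " + folder);
            return;
        }

        List<Callable<String>> tasks = new ArrayList<>();
        for (File file : files) {
            // SeekHashSketch.computeHash() stands in for whichever hashing variant is being measured;
            // note that the OS page cache will make every run after the first one look much faster.
            tasks.add(() -> {
                try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                    return SeekHashSketch.computeHash(raf);
                }
            });
        }

        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        long start = System.nanoTime();
        pool.invokeAll(tasks);                       // blocks until every hash has finished
        long elapsedNs = System.nanoTime() - start;
        pool.shutdown();

        System.out.printf("Hashed %d files using %d threads in %d ms (%d ns average per file)%n",
                files.length, numThreads, TimeUnit.NANOSECONDS.toMillis(elapsedNs), elapsedNs / files.length);
    }
}
```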


Contributor

valib commented Mar 24, 2018

I agree with @Nadahar that testing this is a little bit tricky. I tried to implement RandomAccessFile, but the result seems about the same as the current implementation.


Member

SubJunk commented Mar 25, 2018

I tested it by restarting the computer and logging how long the hashing took on a folder with files ranging from 700MB up to 21GB. I mostly paid attention to the biggest files; there were heaps of them over 10GB each, and the drive is slower than normal (it's a WD Green, so it spins a bit slower than regular 7200 RPM drives), which makes me like it for testing. However, you're right that I didn't make sure it wasn't serving from its cache. I did run it subsequent times, though, and confirmed it takes about 1ms to hash files after the first time they're read, and it's obviously very unlikely that the 128MB cache on my 3TB hard drive is even holding relevant data from both ends of the 21GB file, let alone the entire folder, which is over 1TB on its own. Very tricky, like everyone has said!


Collaborator

Nadahar commented Mar 25, 2018

@SubJunk @valib I've pushed a "benchmark" commit where I've added all the different hashing variants to ImdbUtil. Don't run UMS, run net.pms.Benchmark instead.

The parameters are set in the code itself:

  • folder - Hardcoded path to the folder whose files to hash
  • method - Which hashing method to use:
    • 1 = ImdbUtil.computeHashFileChannel(File)
    • 2 = ImdbUtil.computeHashFileChannel(Path)
    • 3 = ImdbUtil.computeHashInputStream(file)
    • 4 = ImdbUtil.computeHashRandomAccessFile(file)
  • numThreads - The number of concurrent threads.
  • multiply - Sets the number of times to repeat scanning of the files in the folder. I made this because I don't have enough files in any folder to make the run last a meaningful amount of time. It might not be necessary for really large folders.

By setting the number of threads to a high value, performance decreases (for me at least). This is probably because the read head has to jump around a lot, but it might also be overhead from the threading itself. I suspect this is part of the problem in #1372: there is no upper limit to the number of threads spawned, and hashing is triggered for every call to getDisplayName() until the result is written to InfoDb, which means many times for each file.

I think the "real problem" is the many calls to the prettifier (via getDisplayName()) that trigger hashing until the result is written to InfoDb. This is simply bad design; it shouldn't start a new hash for the same file if one is already in progress. But since InfoDb has been removed in 7.0.0, there's hardly any point in fixing this. The question is whether 7.0.0 handles this any better, though.

The way it was handled, until I moved the hashing out of the synchronized block, was that only one file could be hashed at any one time. Reverting to that might be the easiest "quick fix", although it is far from ideal. The best solution would be for threads requesting the IMDB info for files that have already been resolved to get an answer without blocking, while those requesting a file that isn't resolved would have to wait, ideally serviced by a thread pool with a limited number of threads (5-10, or whatever gives the best average performance).

That said, I can't really get my computer to "lock up" no matter what parameters I try. I hope that you, @SubJunk, can, using a large number of files and many threads on your "green" disk.
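
As a hypothetical sketch of the design outlined above (a bounded pool so only a few files are hashed concurrently, plus a map of in-flight futures so a second request for the same file never starts a second hash), something along these lines could work; none of these names exist in UMS, they only illustrate the idea:

```java
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

public final class HashCoordinatorSketch {

    // Bounded pool: at most a handful of files are hashed at once, so a
    // spinning disk is not forced to seek between dozens of files.
    private final ExecutorService pool = Executors.newFixedThreadPool(5);

    // One future per file: callers asking for a file that is already being
    // hashed get the existing future instead of triggering another pass.
    // Completed futures stay in the map and double as a cache of resolved hashes.
    private final ConcurrentMap<Path, CompletableFuture<String>> inFlight = new ConcurrentHashMap<>();

    public CompletableFuture<String> hashAsync(Path file, Function<Path, String> hashFunction) {
        return inFlight.computeIfAbsent(file, f ->
                CompletableFuture.supplyAsync(() -> hashFunction.apply(f), pool));
    }
}
```

A caller like getDisplayName() could then check whether the returned future is already done and fall back to the plain name otherwise, so files that have been resolved are answered without blocking while unresolved ones simply wait their turn.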


Collaborator

Nadahar commented Mar 25, 2018

By the way, by looking at the code I've discovered that none of the methods actually read through the files. That would have been extremely slow. The reason is that FileInputStream.skip() actually uses a filesystem "seek" under the hood, so even this method jumps over most of the file. Using memory-mapped files on 32-bit systems shouldn't be a worry either, since only the 64 kB chunks to be hashed are mapped, not the whole file.

I thought that memory mapping was too expensive to be beneficial for such small mappings, but in my tests memory mapping performs slightly better than the other approaches. That might vary with hardware, though. It seems that using File is slightly faster than using Path (which is a bit surprising to me), and that using RandomAccessFile is slightly faster than using FileInputStream.

The advantage of using InputStream is that it can also handle sources other than files, like network streams. I am skeptical, however, about whether those support "seeking", or whether skip() would end up reading through the whole stream. If so, this would be incredibly slow, meaning you'd have to download the whole file just to hash it.
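
For illustration, a sketch of what mapping only the 64 kB chunks looks like with FileChannel, again assuming the head-plus-tail hash described earlier; the real ImdbUtil.computeHashFileChannel() may differ in detail:

```java
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.LongBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public final class MappedHashSketch {

    private static final long CHUNK_SIZE = 64 * 1024;

    public static String computeHash(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = channel.size();
            long chunk = Math.min(CHUNK_SIZE, size);

            // Only the head and tail regions are mapped, never the whole file,
            // so the mappings fit comfortably even in a 32-bit address space.
            long hash = size
                    + sumLongs(channel.map(FileChannel.MapMode.READ_ONLY, 0, chunk))
                    + sumLongs(channel.map(FileChannel.MapMode.READ_ONLY, Math.max(size - CHUNK_SIZE, 0), chunk));

            return String.format("%016x", hash);
        }
    }

    private static long sumLongs(MappedByteBuffer buffer) {
        LongBuffer longs = buffer.order(ByteOrder.LITTLE_ENDIAN).asLongBuffer();
        long sum = 0;
        while (longs.hasRemaining()) {
            sum += longs.get();   // 64-bit overflow is intentional
        }
        return sum;
    }
}
```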


Contributor

valib commented Mar 25, 2018

@Nadahar I have no idea how to run net.pms.Benchmark

EDIT: Oh I found it


Collaborator

Nadahar commented Mar 30, 2018

@SubJunk @valib How am I to interpret the lack of feedback? Was the "benchmark" I made useless, did you not get any results? Was it too hard to use, or is there another reason for the silence?

Time is ticking, and I will soon have forgotten the details.

Contributor

valib commented Mar 30, 2018

@Nadahar sorry, I forgot to give the results. The difference is minimal, but it seems that methods 2 and 1 are the best.
Here are the results, in order from method 1 to method 4:

Benchmarking of hashing 138000 files using 1 threads took 21589 ms (156446 ns average per file)
Benchmarking of hashing 138000 files using 100 threads took 10310 ms (74714 ns average per file)

Benchmarking of hashing 138000 files using 1 threads took 17042 ms (123495 ns average per file)
Benchmarking of hashing 138000 files using 100 threads took 10488 ms (76003 ns average per file)

Benchmarking of hashing 138000 files using 1 threads took 22675 ms (164314 ns average per file)
Benchmarking of hashing 138000 files using 100 threads took 14584 ms (105683 ns average per file)

Benchmarking of hashing 138000 files using 1 threads took 30252 ms (219222 ns average per file)
Benchmarking of hashing 138000 files using 100 threads took 14149 ms (102535 ns average per file)

Member

SubJunk commented Apr 2, 2018

Benchmarking of hashing 152000 files using 1 threads took 57277 ms (376824 ns average per file)
Benchmarking of hashing 152000 files using 100 threads took 20130 ms (132437 ns average per file)

Benchmarking of hashing 152000 files using 1 threads took 56675 ms (372867 ns average per file)
Benchmarking of hashing 152000 files using 100 threads took 21373 ms (140615 ns average per file)

Benchmarking of hashing 152000 files using 1 threads took 75716 ms (498133 ns average per file)
Benchmarking of hashing 152000 files using 100 threads took 330825 ms (2176486 ns average per file)

Benchmarking of hashing 152000 files using 1 threads took 51090 ms (336121 ns average per file)
Benchmarking of hashing 152000 files using 100 threads took 326446 ms (2147671 ns average per file)

Similar results to @valib's, where the second method is the fastest, but wow, the final two really ran slowly for me with lots of threads. This was on a folder with mostly big files and my slower-than-normal hard drive, which explains the extreme results:

[image: screenshot of the benchmark runs]

Based on these results, I think we should merge the second method, and I will be interested to test the subtitles branch after that's synchronized too.

SubJunk closed this Apr 3, 2018

SubJunk added a commit that referenced this pull request Apr 7, 2018

Nadahar added a commit that referenced this pull request Apr 7, 2018

valib added a commit that referenced this pull request Apr 10, 2018

valib added a commit that referenced this pull request Apr 15, 2018

valib added a commit that referenced this pull request Apr 19, 2018

valib added a commit that referenced this pull request May 13, 2018

SubJunk deleted the faster_hashing branch Sep 17, 2018

SubJunk restored the faster_hashing branch Sep 17, 2018
