Use hash code instead of full source text for language service caches #5944
Closed
wants to merge 3 commits into from
Conversation

@cartermp (Contributor) commented Nov 20, 2018

Not sure if this is the correct overall approach - but hopefully this can help encourage a better fix 🙂

The following issues track large source files ultimately living on the large object heap, forcing horrible GC delays over time:

#5937
#5936
#5935
#4881

But as mentioned in #4881 (comment), the language service uses the full source text as a key for caches, which makes it harder to simply stop passing the full source text around. This PR seeks to rectify that. It may also help with the linked issues.

```fsharp
// Before: the full source text is part of the cache key
let AreSameForParsing((fileName1: string, source1: string, options1), (fileName2, source2, options2)) =
    fileName1 = fileName2 && options1 = options2 && source1 = source2

// After: only a hash of the source text is part of the cache key
let AreSameForParsing((fileName1: string, source1Hash: int, options1), (fileName2, source2Hash, options2)) =
    fileName1 = fileName2 && options1 = options2 && source1Hash = source2Hash
```
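
For illustration only, a minimal caller-side sketch of what the change means; the file and variable names here are hypothetical and not taken from the PR's actual call sites:

```fsharp
// Hypothetical caller-side sketch, not actual compiler code: hash the source
// once and pass the int around instead of the full string.
let fileName = "Library.fs"
let sourceText = "module Library\n\nlet answer = 42\n"
let sourceHash = sourceText.GetHashCode()

// Old key shape:  (fileName, sourceText, parsingOptions)
// New key shape:  (fileName, sourceHash, parsingOptions)
```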
Contributor (review comment on the change above):

It depends on how accurate the string hash codes are and how much we care about very odd behaviour on collisions.

Checking only hash codes means that when there is a collision we will reuse results from some other random parse or typecheck.

So if collisions happen very rarely this may just be ok.

If hash codes are ever based on a partial hash of the string (e.g. a prefix), then this shouldn't be done.

Contributor Author (reply):

We should definitely have a think about this. At least initially, I feel that if the file names are the same and the hashes are the same, it's highly unlikely we'd get some arbitrary cached value. But let's definitely make sure we feel confident on that :)

@0x53A (Contributor) commented Nov 21, 2018

I don't really trust string.GetHashCode; I doubt it was designed with strong collision resistance in mind.

If a true cryptographic hash like SHA is too expensive, can you maybe at least use a specialised one like MurmurHash?

(Sorry, can't sleep)

@cartermp (Contributor Author):

I think what's worth thinking about is whether this is a scenario where strong collision resistance is important. For example, the key varies for each of the caches here:

  • AreSimilarForParsing uses the file name as the key
  • AreSameForParsing uses the file name, FSharpParsingOptions (also contains source files, so this needs to be thought about), and (now) the hash of the source file as the key

And so on. All of the current keys are aggregates of several pieces of information. Given that, does it make sense to use a cryptographically strong hash function? Is the birthday problem applicable to IDE tooling? Would doing so be meaningful perf-wise? I'm not entirely sure how to answer each question, but we definitely do want to move away from storing the whole source text if we don't actually need it.
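
To make the aggregate-key point concrete, here is a sketch under the assumption that a cache key bundles the file name, a hash of the parsing options, and the source hash; the type and function names are made up for illustration and are not the compiler's actual key types:

```fsharp
// Illustrative only; these are not the compiler's actual key types. The point
// is that the hash is never the key by itself: it is one field of an aggregate.
type ParseCacheKey =
    { FileName: string
      SourceHash: int
      OptionsHash: int }   // stand-in for (a hash of) FSharpParsingOptions

let makeKey (fileName: string) (source: string) (optionsHash: int) =
    { FileName = fileName
      SourceHash = source.GetHashCode()
      OptionsHash = optionsHash }

// Two entries only collide if the file name, the options, and the source hash
// all match at once, which is far rarer than a bare 32-bit hash collision.
```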

@0x53A (Contributor) commented Nov 21, 2018

I think I often sound too negative. Hashes are awesome and this PR is great.

String.GetHashCode is 32 bits, so about 4 billion possibilities.

Really large source files can be many kB to maybe even MB if they are auto-generated?

I was never that good at statistics, and am just talking out of my behind because I'm bored. But that's a few orders of magnitude too few for me to be comfortable ;D

But I would reverse that question: is there any reason NOT to use a strong hash?

Typechecking is probably so expensive that running a sha256 over the source wouldn't be noticeable.

And if it is, there is still murmur3, which is a fast, specialised 128-bit hash. Heck, you could also use MD5 (also 128-bit).
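
As a rough sketch of what hashing the source with just the BCL could look like (SHA-256 shown; MD5 or SHA-1 substitute the same way, and whether the cost is actually negligible would need measuring):

```fsharp
open System.Text
open System.Security.Cryptography

// Hedged sketch: SHA-256 over the UTF-8 bytes of the source, rendered as hex.
// MD5.Create() or SHA1.Create() could be dropped in the same way.
let sha256OfSource (source: string) : string =
    use sha = SHA256.Create()
    Encoding.UTF8.GetBytes(source)
    |> sha.ComputeHash
    |> Array.map (fun b -> b.ToString("x2"))
    |> String.concat ""
```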

@cartermp (Contributor Author) commented Nov 21, 2018

I'm pretty sure my stats need some work (it's been a while!), but here's a quick calculation in the awesome Fable repl. For 10k files, we could see a ~1.1% chance of a collision with string's GetHashCode, assuming we only use that as the key for a cache. Given that we aggregate additional information, I think it's fine, but I wouldn't take that as gospel 🙂
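
The repl snippet itself isn't reproduced here, but the standard birthday-problem approximation gives roughly the same figure; a small sketch:

```fsharp
// Birthday-problem approximation: probability of at least one collision among
// n uniformly random k-bit hashes is roughly 1 - exp(-n(n-1) / 2^(k+1)).
let collisionProbability (n: float) (hashBits: int) =
    let space = 2.0 ** float hashBits
    1.0 - exp (-(n * (n - 1.0)) / (2.0 * space))

// ≈ 0.0116 for 10,000 files with a 32-bit hash, i.e. about a 1.1% chance.
printfn "%f" (collisionProbability 10000.0 32)
```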

That said, I certainly don't see anything wrong with a stronger hash function if the perf isn't a problem. We'd have to measure that, but if it's negligible then I think something like MD5 is fine. I'd much rather use something in the BCL than hand-code an F# implementation of a hash in here.

@cartermp (Contributor Author) commented Nov 21, 2018

So, the second commit is, uh, a lot more. But it addresses every place where source files are used as keys in the caches. Let me make sure this PR is marked as WIP so we don't accidentally pull it before really vetting the approach.

@cartermp changed the title from "Use hash code instead of full source text for some language service caches" to "[WIP] Use hash code instead of full source text for some language service caches" on Nov 21, 2018
@cartermp changed the title from "[WIP] Use hash code instead of full source text for some language service caches" to "[WIP] Use hash code instead of full source text for language service caches" on Nov 21, 2018
@vasily-kirichenko (Contributor):

> Typechecking is probably so expensive that running a sha256 over the source wouldn't be noticeable.
> And if it is, there is still murmur3, which is a fast, specialised 128-bit hash. Heck, you could also use MD5 (also 128-bit).

.NET's MD5 and SHA-1 implementations have approximately the same performance; SHA-256/384/512 are way slower. So I'd use SHA-1, or murmur3 if it's fast.
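
A quick way to sanity-check that claim on a given machine is a Stopwatch sketch like the one below; it uses a few MB of synthetic source text as input and is not a rigorous benchmark:

```fsharp
open System.Diagnostics
open System.Text
open System.Security.Cryptography

// Time a single ComputeHash call over the same input for each algorithm.
let timeHash (name: string) (algo: HashAlgorithm) (bytes: byte[]) =
    use algo = algo
    let sw = Stopwatch.StartNew()
    algo.ComputeHash(bytes) |> ignore
    printfn "%s: %d ms" name sw.ElapsedMilliseconds

// Roughly 5 MB of synthetic "source text".
let input = Encoding.UTF8.GetBytes(String.replicate 500_000 "let x = 1\n")

timeHash "MD5"     (MD5.Create())    input
timeHash "SHA-1"   (SHA1.Create())   input
timeHash "SHA-256" (SHA256.Create()) input
```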

@ncave (Contributor) commented Nov 24, 2018

@cartermp Is the second commit necessary? What are we saving by hashing source file names?

@cartermp (Contributor Author):

Ah, yeah, they're just the paths + names (even though they're not named that way). I'll get rid of the second commit.

@cartermp changed the title from "[WIP] Use hash code instead of full source text for language service caches" to "Use hash code instead of full source text for language service caches" on Nov 28, 2018
@TIHan (Member) commented Dec 15, 2018

Closing this, as PR #6001 makes source text equality abstract for caching. If you are using the string implementation of ISourceText, it does a direct string compare. But for VS, we implement ISourceText's ContentEquals in terms of Roslyn's SourceText.ContentEquals, which already does the right thing: http://source.roslyn.io/#Microsoft.CodeAnalysis/Text/SourceText.cs,dc290adccf6f9ba7
Other IDEs will be recommended to implement their own ISourceText so they are in charge of efficiency there, but I will always recommend that they just take a dependency on Microsoft.CodeAnalysis and implement ISourceText using Roslyn's SourceText.
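
For anyone outside VS following that recommendation, the Roslyn side looks roughly like this; a sketch that assumes a package reference to Microsoft.CodeAnalysis and is not code from this repo:

```fsharp
open Microsoft.CodeAnalysis.Text

// Roslyn's SourceText compares by checksum/content rather than requiring the
// caller to hold on to and compare the raw strings itself.
let a = SourceText.From("module M\n\nlet x = 1\n")
let b = SourceText.From("module M\n\nlet x = 1\n")

printfn "%b" (a.ContentEquals(b))   // true: same content, different instances
```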

@TIHan closed this Dec 15, 2018
@TIHan (Member) commented Dec 15, 2018

A path forward for caching will be interesting. Ideally we might want to use an int based on time to determine file changes.
