Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Libraries shouldn't know about filesystems (or web clients, or...) #5

Closed
rianjs opened this issue May 29, 2017 · 6 comments
Closed
Milestone

Comments

@rianjs
Copy link

rianjs commented May 29, 2017

(For all I know, this is from hunspell proper, so maybe this feedback is in the wrong place.)

tl;dr- Libraries should be as pure as possible, because purity maximizes flexibility, composability, testability, and greatly decreases the maintenance burden on the library author.

I would suggest an API like:

var checker = new HunspellDictionary(ISet<string> dictionaryWords, string affixContent);
bool notOk = hunspell.Check("teh");
var suggestions = hunspell.Suggest("teh");
bool ok = hunspell.Check("the");

As it is now, this library thinks in terms of filesystems for affix and dictionary files. But what if data source is a database? Or a web service endpoint? This is especially true for people that might consume this in an ASP.NET Core web application.

  • Throw useful parse exception if the application developer gives you bad input.
  • Leave the culture handling to the framework + application context
  • Reuse ISet.Comparer for GetHashCode and Equals implementations where it makes sense to do so

A bunch of stuff goes away afterwards:

  • CulturedStringComparer
  • Utf16StringLineReader
  • HunspellLineReaderExtensions
  • IHunspellLineReader
  • DynamicEncodingLineReader
  • StaticEncodingLineReader
  • Most of HunspellDictionary
  • All System.IO dependencies
  • AffixReader gets considerably simpler
  • Probably a bunch of other stuff

This sets you up to delete (or delegate to the framework) a bunch of other stuff:

  • CharacterSet -> HashSet<char>
  • ArrayWrapper, ArrayComparer -> use HashSet<T>
  • Deduper
  • EncodingEx

And avoids bugs around:

  • Byte-order markers (BOM)
  • Endian-ness

By maintaining purity, all of your operations are CPU-bound, so the need for async disappears. (Or rather, it shifts to the application developer who may want to run it on a threadpool thread, but that would be their choice to make.)

@aarondandy
Copy link
Owner

aarondandy commented May 29, 2017

I have a lot of control over the code, most of the design is my trying to wrangle Hunspell into .NET, not just port it over. So the feedback is welcome and may result in some improvements! Some of the design though is very much from hunspell. for example an "Affix" is a core concept in Hunspell. While it is a file, it is also a concept in their problem domain that is responsible for configuration for a dictionary with all kinds of levers and knobs on it as well as memory optimization.

Regarding the public API I do think that it could be much simpler with respect to instantiating a dictionary in code. I'll break that out to a new issue for a minor release increment maybe, or this upcoming 1.0.0. You can already do it today but its a bit more typing than you would want. You have to make a builder for a builder and make a thing to give to a builder then give a list of things to a builder and then give that instance to a dictionary .... and its not pretty. I think providing a simpler API for that use case would be good and may even clean some of the test code up.

While most of the code you mentioned is for reading files and all the fun quirks that came along with it the library definitely does not think in terms of files. I made damn sure of that and that is part of what makes it testable, debugable, and easier to maintain. I think it does not appear that way because of the exposed API but within the code if you look for the nested builder types, that will lead you down the .NET in-memory path. The code I wrote in those classes is absolutely required to parse the files as they have some interesting rules, get in touch using a different medium and I can take you on a tour and maybe we can find some spots where .NET may provide tools I don't know of.

With respect to the other classes I'm going to add some comments to the code as well as here:

  • CulturedStringComparer - Used to dedup suggestion results using the culture rules of the dictionary.
  • Utf16StringLineReader - This is just around to support unit testing, and I don't actually think UTF16 is the right name... its UCS-2 I think?
  • HunspellLineReaderExtensions - I may break this out to another file but these just help out a bit
  • IHunspellLineReader - I wanted this as it provides a simpler APi surface area to implement against than a stream as hunspell only really cares about lines when it reads text from a file.
  • DynamicEncodingLineReader - I really hate this thing. I had to make this because an affix file when it is read can change encoding on the fly based on a command. There is nothing that I can find in .NET that supports this behavior and probably for good reason. I spent a great deal of time on this code researching how .NET does preamble handling and have run it though tests against many dictionaries so I am pretty confident I got it right.
  • StaticEncodingLineReader - This is handy for reading a dictionary file as it gets loaded after the affix encoding is known, which it inherits.
  • HunspellDictionary - I see your point here, and will think about this. It will need to be sorted before 1.0.0
  • System.IO - It only hurts if you use it and most users will be loading a resource or file from a stream.
  • AffixReader - Affix reader reads affixes, I don't think it can get much simpler. if you want to just create an affix you can use the builder and bypass the reader entirely, but the API could be easier and an issue for that should be incoming.
  • "Probably a bunch of other stuff" - I think most of this stuff has to stay to be able to read from streams which I expect to be the normal use case.
  • CharacterSet - This is not needed but I get two things from types like this, first I get a nice static typed name to represent the responsibility of the type and second I can use this to optimize memory (via deduping) after a file is loaded. If you don't load a file most of this stuff won't hurt any.
  • ArrayWrapper - I want to avoid using hash set as some of the collection types are ordered (allowing binary search) and I want that ordering to be preserved to match the implementation. In other cases the order may matter and I don't want to deviate from the existing behavior in hunspell while trying my best to be in line with .NET.
  • Deduper - there is a lot of duplicate stuff in the files! If you don't load a file though it should not hurt any.
  • EncodingEx - this is to avoid leaking exceptions and to try fixing some encoding names when loading files. Should have no cost unless an affix is loaded from a stream using the reader.
  • "BOM" and "Endian" stuff - I tested this a bunch and spent some time researching it and am pretty sure I got it right. If you don't load from a stream though its no problem but there are definitely quirks in how those files are loaded that need to be handled.
  • async - the only async is around file loading and if somebody wants there are sync variants of all methods, this library is even compatible with portable-net45+win8+wpa81+wp8

I'm going to extract some issues from this for now and if/when we talk maybe get some more out of you. Thanks for taking the time to write all that up.

@aarondandy
Copy link
Owner

I feel you with respect to the library not being dependent on FileStream and really it isn't but I provide it there as a convenience and I think it has very little cost as it is part of netstandard1.3 onwards. There are netstandard1.1 and Profile259 targets that don't have FileStream in them for example and can be fed either plain Streams (from embedded resources) or constructed Affic/Dictionary objects.

@rianjs
Copy link
Author

rianjs commented May 29, 2017

I have a lot of control over the code, most of the design is my trying to wrangle Hunspell into .NET, not just port it over

Nice, I thought that might be the case. It looked like more than a mechanical conversion. If only the license wasn't confusing, but I know that's not your fault.

"Affix" is a core concept in Hunspell. While it is a file, it is also a concept in their problem domain that is responsible for configuration for a dictionary with all kinds of levers and knobs on it as well as memory optimization

Do you mean that the affix format governs memory usage as well as specifying the business rules for word expansion? Do you have a link to that?

While most of the code you mentioned is for reading files and all the fun quirks that came along with it the library definitely does not think in terms of files.

I noticed the separation as I explored. I'd still suggest that reading files should be left entirely to the application. I did notice that creating the various object representations of the domain concepts was difficult because the entry point was in HunspellDictionary (IIRC) which wanted files.

I had to make this because an affix file when it is read can change encoding on the fly based on a command. There is nothing that I can find in .NET that supports this behavior and probably for good reason. I spent a great deal of time on this code researching how .NET does preamble handling and have run it though tests against many dictionaries so I am pretty confident I got it right.

Hmm...

The first line specifies the character set used for both the wordlist and the affix file (should be all uppercase).

For example:

SET ISO8859-1

And the second line specifies the characters to be used in building suggestions for misspelled words.

That doesn't seem to make sense to me, though my knowledge of codepages and character sets is limited. How can you read a file if you don't know how it's encoded? Oh, I see... affix seems to be limited to ISO-8859? So a SET command with an ISO 8859 subset should always be readable since it's in the 7-bit ASCII range, so you could always read the first n characters with Encoding.ASCII, and then restart the read using the specified encoding.

https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx

...nope that wouldn't work because UTF-16 wouldn't parse properly due to the padding. You'd have to taste the string's bytes, and try to infer the encoding. Wow, that's genuinely very bad. Something tells me if this were a real RFC, this design decision wouldn't make it past the first draft. So you'd have to taste the bytes, infer the encoding, parse the first line, and restart the read.

That's a big WTF right there.

Of course if you force the application developer to give you a string or collection of strings, this becomes their problem instead.

It will need to be sorted before 1.0.0

I use (and like) semantic versioning: semver.org

It only hurts if you use it and most users will be loading a resource or file from a stream.

My use case is a web service, and I'd prefer not to load files. I suspect most application developers would prefer strings to streams, since File.Read* is close to hand. I do understand that the library prefers streams (yay, bytes!) because of the SET command issues above.

Affix reader reads affixes, I don't think it can get much simpler. if you want to just create an affix you can use the builder and bypass the reader entirely, but the API could be easier and an issue for that should be incoming.

Yeah, most of AffixReader is concerned with the reading of bytes from one source or other, and using builder objects, IIRC. If you could get out of the business of reading bytes, and just use strings (or StringReader), the library gets a lot smaller.

I want to avoid using hash set as some of the collection types are ordered (allowing binary search) and I want that ordering to be preserved to match the implementation.

What's the use case for this? I did see some binary search, but I didn't look too closely at it. Why is a binary search preferable to an O(1) HashSet? Is the ordering just for the binary search, or is there a non-search reason to maintain it?

In other cases the order may matter and I don't want to deviate from the existing behavior in hunspell while trying my best to be in line with .NET.

Oh, I think you're saying that that's how Hunspell does it, so if they make enhancements, you want to be able to pick them up, too. That's certainly a design decision you could make, and it's your library. I find it weird that Hunspell itself doesn't seem to use hashtables?

I tested this a bunch and spent some time researching it and am pretty sure I got it right. If you don't load from a stream though its no problem but there are definitely quirks in how those files are loaded that need to be handled.

That's good to hear. Did you enjoy writing it? I wouldn't have, so I would have worked hard to not have to. :)

@aarondandy
Copy link
Owner

aarondandy commented May 29, 2017

Affix is the name used in Hunspell for the model that contains all kinds of configuration and dangling bits of data that did not come out of a .dic file. In my opinion, the closest thing to represent this model is in https://github.com/hunspell/hunspell/blob/aab258adbd9c78931a36b96e58975a08000249a8/src/hunspell/affixmgr.hxx and I mapped what I saw there as I ported to https://github.com/aarondandy/Hunspell.NetCore/blob/master/src/Hunspell.NetCore/AffixConfig.cs . This AffixConfig combined with WordList are what a "dictionary" is composed of.

FileStream is isolated to a few helper methods here and there for those that want to load things from disk (I know I will). I plan on having better docs and methods to simplify bringing in a raw Stream and even values from memory. I feel pretty strongly about that being a situation where people just should not use it if they don't want to but I do see where you are coming from with respect to making it easy to just create the data structures.

That's a big WTF right there.

Now you see :) It's worse too because the files that you open may have like a BOM or something and yet be totally in some other code page, not even UTF8 and if you don't load it just right a bunch of specialized letters will fail to work. This was a massive pain to work out!

this becomes their problem

nah, I want things to be easy but I think I left enough room for somebody to do things their way, just need to simplify it

semver.org

Yup and the build (when it works) is already setup for it. I just want to get any breaking changes that I know about done up front before 1.0.0 goes out.

My use case is a web service, and I'd prefer not to load files.

I would like to go over this with you using some kind of immediate communication to get a better idea of how you want to build up an instance.

most application developers would prefer strings to streams

Maybe a UCS-2 reader can be added to an upcoming minor release with docs on how to use it and that use case. The tests certainly make use of it already, if somebody else needs that functionality it should be moved (and named correctly). Maybe string is a better word instead of UCS-2 and certainly UTF16.

an O(1) HashSet

So both in theory and in practice a HashSet is going to be mostly O(1) but the real story is the one told by the CPU. In my profiling I didn't really see anything that would indicate that an optimization here would do much, but it could be attempted in a fork. This also pretty much how hunspell works and stores these flags and ordinal values, just without types that are as strong and the collections are small enough that the light weight arrays are search pretty smoothly. Always worth a try though for exploration.

how Hunspell does it, so if they make enhancements

That is part of it but on the other hand it seemed plenty good enough to me. There seem to be some nice performance advantages with arrays for small sorted collections.

Did you enjoy writing it?

Weirdly, I did...

@aarondandy
Copy link
Owner

Also, I plan on getting the performance tests to be easier to run. After that it should be pretty easy to see if something helps/hurts. Its really hard to guess with this stuff.

@aarondandy
Copy link
Owner

I got a lot out of this and think I converted most of it into other issues. If you have any other things you want to see or talk about feel free to create more issues or just get in touch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants