Libraries shouldn't know about filesystems (or web clients, or...) #5
I have a lot of control over the code; most of the design is my attempt to wrangle Hunspell into .NET, not just port it over. So the feedback is welcome and may result in some improvements! Some of the design, though, is very much from Hunspell. For example, an "Affix" is a core concept in Hunspell. While it is a file, it is also a concept in their problem domain that is responsible for configuring a dictionary, with all kinds of levers and knobs on it, as well as for memory optimization.

Regarding the public API, I do think it could be much simpler with respect to instantiating a dictionary in code. I'll break that out into a new issue, maybe for a minor release increment, or for this upcoming 1.0.0. You can already do it today, but it's a bit more typing than you would want: you have to make a builder for a builder, make a thing to give to a builder, then give a list of things to a builder, and then give that instance to a dictionary... and it's not pretty. I think providing a simpler API for that use case would be good and may even clean up some of the test code.

While most of the code you mentioned is for reading files and all the fun quirks that came along with that, the library definitely does not think in terms of files. I made damn sure of that, and that is part of what makes it testable, debuggable, and easier to maintain. I think it does not appear that way because of the exposed API, but within the code, if you look for the nested builder types, that will lead you down the .NET in-memory path. The code I wrote in those classes is absolutely required to parse the files, as they have some interesting rules; get in touch using a different medium and I can take you on a tour, and maybe we can find some spots where .NET may provide tools I don't know of. With respect to the other classes, I'm going to add some comments to the code as well as here:
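The builder-for-a-builder shape described above might look roughly like this. All of the type and member names here are hypothetical stand-ins for illustration, not the library's actual API:

```csharp
// Hypothetical sketch only: AffixConfig, WordList, WordEntry, and their
// builders are illustrative stand-ins, not the library's real types.
var affixBuilder = new AffixConfig.Builder();      // a builder...
var affix = affixBuilder.ToImmutable();

var wordsBuilder = new WordList.Builder(affix);    // ...for a builder
wordsBuilder.Add(new WordEntry("color"));          // things to give the builder
wordsBuilder.Add(new WordEntry("colour"));
var dictionary = wordsBuilder.ToImmutable();       // then give that to a dictionary

// A convenience overload could collapse all of that into one call
// for the common "I just have a list of words" case:
var simpler = WordList.CreateFromWords(new[] { "color", "colour" });
```

A convenience entry point like the last line would not remove the builders; callers who need the knobs could still use them.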
I'm going to extract some issues from this for now, and if/when we talk, maybe get some more out of you. Thanks for taking the time to write all that up.
I feel you with respect to the library not being dependent on FileStream, and really it isn't, but I provide it there as a convenience, and I think it has very little cost as it is part of netstandard1.3 onwards. There are netstandard1.1 and Profile259 targets that don't have FileStream in them, for example, and those can be fed either plain Streams (from embedded resources) or constructed Affix/Dictionary objects.
Nice, I thought that might be the case. It looked like more than a mechanical conversion. If only the license wasn't confusing, but I know that's not your fault.
Do you mean that the affix format governs memory usage as well as specifying the business rules for word expansion? Do you have a link to that?
I noticed the separation as I explored. I'd still suggest that reading files should be left entirely to the application. I did notice that creating the various object representations of the domain concepts was difficult because the entry point was in
Hmm...
That doesn't seem to make sense to me, though my knowledge of codepages and character sets is limited. How can you read a file if you don't know how it's encoded? Oh, I see... affix seems to be limited to ISO-8859? So a https://msdn.microsoft.com/en-us/library/windows/desktop/dd317756(v=vs.85).aspx ...nope, that wouldn't work, because UTF-16 wouldn't parse properly due to the padding. You'd have to taste the file's bytes and try to infer the encoding. Wow, that's genuinely very bad. Something tells me that if this were a real RFC, this design decision wouldn't make it past the first draft. So you'd have to taste the bytes, infer the encoding, parse the first line, and restart the read. That's a big WTF right there. Of course, if you force the application developer to give you a
I use (and like) semantic versioning: semver.org
My use case is a web service, and I'd prefer not to load files. I suspect most application developers would prefer strings to streams, since File.Read* is close to hand. I do understand that the library prefers streams (yay, bytes!) because of the
Yeah, most of
What's the use case for this? I did see some binary search, but I didn't look too closely at it. Why is a binary search preferable to an O(1) HashSet? Is the ordering just for the binary search, or is there a non-search reason to maintain it?
Oh, I think you're saying that that's how Hunspell does it, so if they make enhancements, you want to be able to pick them up, too. That's certainly a design decision you could make, and it's your library. I find it weird that Hunspell itself doesn't seem to use hashtables?
That's good to hear. Did you enjoy writing it? I wouldn't have, so I would have worked hard to not have to. :)
Affix is the name used in Hunspell for the model that contains all kinds of configuration and dangling bits of data. FileStream usage is isolated to a few helper methods here and there for those who want to load things from disk (I know I will). I plan on having better docs and methods to simplify bringing in a raw Stream, and even values from memory. I feel pretty strongly that this is a situation where people just should not use it if they don't want to, but I do see where you are coming from with respect to making it easy to just create the data structures.
Now you see :) It's worse, too, because the files you open may have, like, a BOM or something and yet be totally in some other code page, not even UTF-8, and if you don't load them just right, a bunch of specialized letters will fail to work. This was a massive pain to work out!
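The two-pass dance being described here (taste the bytes, find the encoding the file declares about itself, then restart the read) could be sketched like this. `DecodeAffixText` is an illustrative helper, not the library's API; the only assumption taken from the affix format is that the file declares its encoding on a `SET` line inside its own content:

```csharp
using System;
using System.Linq;
using System.Text;

static class AffixEncodingSniffer
{
    // Sketch: decode once with a byte-transparent encoding just to locate
    // the "SET <encoding>" directive, then re-decode the whole buffer with
    // the declared encoding. A real reader also has to reconcile a BOM that
    // may disagree with the declared code page.
    public static string DecodeAffixText(byte[] bytes)
    {
        // First pass: ISO-8859-1 maps every byte to a char, so we can
        // safely scan for the directive without decoding errors.
        var draft = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
        var setLine = draft
            .Split('\n')
            .Select(line => line.Trim())
            .FirstOrDefault(line => line.StartsWith("SET ", StringComparison.Ordinal));

        var encodingName = setLine?.Substring(4).Trim() ?? "ISO-8859-1";

        // Second pass: restart the read with the declared encoding.
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }
}
```

Note that this only works for declared encodings whose bytes pass harmlessly through the first decode; a UTF-16 file would need a BOM check before this step, which is part of why the format earns the "WTF" above.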
Nah, I want things to be easy, but I think I left enough room for somebody to do things their way; I just need to simplify it.
Yup, and the build (when it works) is already set up for it. I just want to get any breaking changes that I know about done up front before 1.0.0 goes out.
I would like to go over this with you using some kind of immediate communication to get a better idea of how you want to build up an instance.
Maybe a UCS-2 reader can be added in an upcoming minor release, with docs on how to use it and that use case. The tests certainly make use of it already; if somebody else needs that functionality, it should be moved (and named correctly). Maybe
So both in theory and in practice a HashSet is going to be mostly O(1), but the real story is the one told by the CPU. In my profiling I didn't really see anything to indicate that an optimization here would do much, but it could be attempted in a fork. This is also pretty much how Hunspell works and stores these flags and ordinal values, just without types that are as strong, and the collections are small enough that the lightweight arrays search pretty smoothly. Always worth a try, though, for exploration.
That is part of it, but on the other hand it seemed plenty good enough to me. There seem to be some nice performance advantages to arrays for small sorted collections.
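As a rough illustration of the trade-off being discussed (not the library's actual internals): for a small sorted collection, a binary search over a compact array is competitive with a hash lookup, because the whole array tends to sit in one or two cache lines and there is no hashing or bucket indirection.

```csharp
using System;
using System.Collections.Generic;

static class FlagLookup
{
    // Membership test over a small sorted char array, as a Hunspell-style
    // flag collection might be stored. Requires the array to be pre-sorted.
    public static bool ContainsSorted(char[] sortedFlags, char flag) =>
        Array.BinarySearch(sortedFlags, flag) >= 0;

    // The O(1)-on-average alternative; pays for hashing on every call and
    // carries more memory overhead per element.
    public static bool ContainsHashed(HashSet<char> flags, char flag) =>
        flags.Contains(flag);
}
```

Which one wins at these sizes is exactly the kind of question the profiler, not big-O, has to answer.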
Weirdly, I did...
Also, I plan on making the performance tests easier to run. After that it should be pretty easy to see if something helps or hurts. It's really hard to guess with this stuff.
I got a lot out of this and think I converted most of it into other issues. If you have anything else you want to see or talk about, feel free to create more issues or just get in touch.
(For all I know, this is from hunspell proper, so maybe this feedback is in the wrong place.)
tl;dr- Libraries should be as pure as possible, because purity maximizes flexibility, composability, testability, and greatly decreases the maintenance burden on the library author.
I would suggest an API like:
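A minimal sketch of such a filesystem-free surface, with hypothetical names (none of these types are being claimed as the library's real API):

```csharp
// Hypothetical sketch only: ISpellChecker and SpellCheckerFactory are
// illustrative names, not a real proposal's API.
public interface ISpellChecker
{
    bool Check(string word);
    IEnumerable<string> Suggest(string word);
}

public static class SpellCheckerFactory
{
    // The application acquires the text however it likes (file, database,
    // web service, embedded resource) and hands the library pure strings.
    public static ISpellChecker FromText(string affixText, string dictionaryText)
        => throw new NotImplementedException("sketch only");
}
```

The point is only the shape: the library accepts in-memory data and never opens anything itself.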
As it is now, this library thinks in terms of filesystems for affix and dictionary files. But what if the data source is a database? Or a web service endpoint? This is especially true for people who might consume this in an ASP.NET Core web application.
Use `ISet.Comparer` for `GetHashCode` and `Equals` implementations where it makes sense to do so.

A bunch of stuff goes away afterwards:

- `CulturedStringComparer`
- `Utf16StringLineReader`
- `HunspellLineReaderExtensions`
- `IHunspellLineReader`
- `DynamicEncodingLineReader`
- `StaticEncodingLineReader`
- `HunspellDictionary` `System.IO` dependencies
- `AffixReader` gets considerably simpler

This sets you up to delete (or delegate to the framework) a bunch of other stuff:

- `CharacterSet` -> `HashSet<char>`
- `ArrayWrapper`, `ArrayComparer` -> use `HashSet<T>`
- `Deduper`
- `EncodingEx`
And avoids bugs around:
By maintaining purity, all of your operations are CPU-bound, so the need for `async` disappears. (Or rather, it shifts to the application developer, who may want to run it on a threadpool thread, but that would be their choice to make.)
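Under that model, a caller who wants concurrency offloads the synchronous work themselves. `checker` here is a hypothetical pure spell-checker instance with a synchronous `Check` method, used only to illustrate the shift:

```csharp
// The library stays synchronous and CPU-bound; the application decides
// whether (and where) to move the work off the request thread.
Task<bool> result = Task.Run(() => checker.Check("recieve"));
```

This keeps the async machinery, its overhead, and its scheduling decisions entirely on the application's side of the boundary.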