New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Culture revamp, cacheable resources, small tweaks and fixes #20
Conversation
Excellent, looking forward to checking it out. Looks like the tests failed on Linux |
The tests failed on linux going back a month though, was that from your test demo maybe? I remember some demonstrations about tests for different platforms. The failure is about |
One other slight change you might consider "breaking" about But I did also forget to mention that the format is now a parameter and thus overridable as well, but you'd have to pick a min and max first to get at it when using |
My fault. I used the tests in a demo to show how it failed on Linux. You can comment that out. With these changes, probably going to push the version 2.0. Wonder if it's worth putting some basic tests in the current one to test simple values returned from each template? |
Tests are always a good thing, but I think we'd need to refactor some more to separate the random data generation from formatting. We could also use tests for template parsing. |
Agreed. Figured since this PR is refactoring and probably unintentionally breaking things, wondering if it's worth to put the effort in for some "smoke" tests before merging the PR. Or just version this 2.0, put tests in now and know it is breaking, which it will be anyway. |
I still need some PRs for Hacktober, but this month is terrible for me personal time wise. Testing random data generations for multiple cultures is going to be a challenge. We could do some simple tests like the test for the Here are some simple ones off of the top of my head.
Looking at the code, I think the Caching logic is way overly complicated. The size of the data is so small, you could just load it all into memory statically. Then you just need one helper function that loads a given culture that runs with The culture support is awesome and the module is due for a refactor from the ad-hoc, proff-of-concept origins. |
Agreed, not sure how to test a random data generator. First thought was to mock the functions and just make sure they were called. Simple test may be just use a template and see if it returns data. Just to see that things still "work" |
I did fix the tests already by removing the invalid one and adding a single culture-specific test, but I agree that there's a lot of refactoring to be done that would help in writing and performing tests. I tend to try to solve everything at once so I'm trying to show some restraint in not changing absolutely everything at once. I have a few ideas on how to generalize some of this module in a way that:
But for now I wanted to at least start getting feedback on changes thus far. I may yet still add some more tests; writing that "simple" test to ensure spaces re returned with culture in mind did actually surface a bug that I also fixed, it's amazing how even simple tests are so useful. |
@cdhunt re:
Could you expand on "load it into memory statically"? Culture isn't a set-it-one-time thing, as you can call individual functions with individual cultures, and you can change your thread culture on the fly. While the data is small now, it may not stay that way, and multiplied by additional cultures it further grows. Even if the total size is "small enough" we would still need a way to organize it, which the caching does now. The caching logic is meant give template function authors a method to load content from a file and handle keeping that data in memory once asked for, so that when you call Some things could be changed, but overall it makes for (what I think is) a straightforward invocation for templates, while keeping the logic consistent (in one place). The caching logic is separate from culture resolving logic; you can use the cache content functions with a direct path that has nothing to do with cultures. You can also use culture logic ( But I'd love to get more specific feedback on your complexity concerns and what changes you'd like to see. My responses here are kind of guessing at what you're getting at so maybe I can address your concerns more directly. |
Right now, all data for all available cultures is under 50 KB. I don't think that warrants any kind of memory management logic. Just load all of the data on module load. You can use a compound key like Sorry if it sounds like I'm attacking your caching logic. It's good code, just way more than I think this module needs for a long time - if ever. I 100% support keeping it in memory and avoiding file access for every function call. A few line for-loop should be all that's needed to load the data and your lookup code will probably also be a little bit simplified replacing path checking for a Running some benchmarks with |
@briantist Cool code. I'd prefer to wait on caching. I put this module together as a lark based on a TypeScript implementation I thought was cool. I always like using the file system to "index" into subsections of data rather than manage it in memory. |
I think there's some confusion around what I've done here, and it's really my fault for not explaining any of it in a big-ish PR, and providing no context, comments, or documentation for it, so hopefully I can fix that! @cdhunt I think you're correct about the size of the data, and that even in the future, the total size will never be a problem to have in memory all at once. But that was not what led to this design anyway. And I want to stress that there is no "memory management logic", for that reason. I never bother taking anything out of memory unless asked by some invocation. At no point are statistics kept, or a certain memory size targeted, or hit counts evaluated, and I also don't think that will ever be needed. I think perhaps I should not have used the term "caching" at all, as it seems to imply a lot that doesn't exist. The main expense this saves isn't even disk access, it's processing time for anything that doesn't directly use the data that lives on disk (like The design goal of this "caching" mechanism was a near-minimal way to do a simple "store it in memory and forget about it" that meets a few criteria:
So all that being said, I want to directly address your suggestion to read all the files at once in the beginning, but first I have to talk about the culture-resolution logic, which is by-design completely independent. I know I'm being a bit verbose here but if this code lands I figure this comment will be the basis for documentation of this anyway. When I first started looking into how to do this, I read a bunch of resources on organizing localization resources, and one of those was Microsoft's document called Hierarchical organization of resources for localization. This was the inspiration for the file structure I chose (even though they're talking about in-process resources), and I'll briefly explain why: Cultures are separated into language (neutral culture) and specific-culture, so something like The recommendation they have is taking advantage of this hierarchy so that you get good fallback when you don't have something culture-specific but you can provide their language. If someone is running on the Brief note: a neutral-culture is also a valid culture, so setting your program's culture to just This has some cool implications for us if we resolve culture at the file level, because it means (to start) supporting a more specific culture doesn't have to be all-or-nothing once we create the specific directory. I demonstrated this with the But the great thing about that is that you can do Further, because culture becomes a parameter to a template function, you can mix cultures regardless of what your app/thread is set to: The downside of this hierarchy, is perhaps that this logic could be seen as "complex", but I would say, that's why it's wrapped in a Some of this stuff will end up being really important in trying to support languages that don't use alphabets, I think, but again that part isn't fully fleshed out. So, the reason I say all that is that your suggestion to read all the files at the beginning would break a few things on both the caching side (I don't really care too much), and the culture logic side (bigger deal). Memory isn't a concern, and we could just
So after all that, I'll try to summarize:
Sorry a lot of this was missing earlier, and I hope this sheds some light on the design decisions and makes them a bit less scary or confusing. Still happy to discuss further! |
Also just want to say I love the |
Also I ran some benchmarks and to my surprise, even just reads with
Of course this is a lot more iterations than a normal use, so if the consensus is still "rip out cache" then so be it, I was just surprised there was a bigger difference between |
The big news here is the beginnings of adding support for multiple cultures and languages. Directory structure and file naming have been changed for that, and all methods referencing those files have as well.
To that end, I've removed "floating" variables that contain information that should be localized (namely the
$nouns
, $adjectives, and
$verbs` variables), but to get back the performance benefits I've implemented a system whereby the contents of such files are cached when asked for so that they aren't read again (lots of options for overriding and controlling that behavior). Methods relying on that content have been updated to be localization aware.Some methods that relied on hardcoded information have been changed to use localized files (like the
space
function, since space characters are not the same in every language and culture).The
randomdate
function has been updated to be localization aware and to take parameters for min and max date. The only breaking change in this PR from my perspective is changing the defaults for these values. I've changed them to[DateTime]::MinValue
and[DateTime]::MaxValue
respectively. They can now be overridden i.e.ig '[randomdate 1/1/1970 12/31/2070]
but if you feel strongly about your1/1/1970
and[DateTime]::Now
defaults I'll keep the params and change their values.Added
[colors]
(American English, British English, and Japanese), but the lists are pretty short.There's more work to be done in adding localization support, that I haven't gotten to yet (namely script support which would replace direct alphabet support, haven't fully figured out how to do that yet). Adding support for other number notations should be interesting too, especially in a way choosable by the user, as many cultures use several number systems.
I've also had a lot of trouble finding word lists for non-American English, especially lists that have part-of-speech so that nouns, adjectives, and verbs can be shown appropriately. Any help in finding good comprehensive word lists in any language is much appreciated.
Also need to add documentation for new methods and patterns.
Feedback welcome while I finish fleshing out some of the changes above.