Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opening filenames with Unicode on Windows not possible #3408

Closed
chrisdone opened this issue Jan 25, 2018 · 11 comments
Closed

Opening filenames with Unicode on Windows not possible #3408

chrisdone opened this issue Jan 25, 2018 · 11 comments
Assignees

Comments

@chrisdone
Copy link

chrisdone commented Jan 25, 2018

  • The method DB::Open actually calls createDirIfMissing.
  • That in turn calls CreateDir from port/win/env_win.cc.
  • Finally, that in turn calls _mkdir, which is from Direct.h and has this type: int _mkdir(const char* pathname).
  • To properly support Unicode on Windows you typically use a wide char (wchar_t or equivalent Win32 API name). The bytes passed into _mkdir are interpreted as the current codepage (CP 437, in my case) as multi-byte characters, as opposed to UTF-16 encoded byte pairs.

So I think it would be easier if we could directly expose a function that uses _wmkdir which accepts wide-character input. This way you have access to the range that fits in two bytes, rather than the more limited range of CP 437, for example.

It's not normal to create a database with non-ASCII, but in our use-case we use absolute paths which may include usernames of people, so that's where the range of unicode support comes in handy.

If I open a PR, would it be considered for inclusion?

@domenkozar
Copy link

We should probably replace whole #include <direct.h> // _rmdir, _mkdir, _getcwd. Moreover, CreateFileA needs to become CreateFileW, etc.

@maysamyabandeh
Copy link
Contributor

@yuslepukhin thoughts?

@domenkozar
Copy link

@maysamyabandeh @yuslepukhin would you accept a patch migrating windows api to unicode types?

@yuslepukhin
Copy link
Contributor

@domenkozar While getting rid of the intermediates like _rmdir is a good idea, you then will have to do the conversion to UTF-16 whereas A entry points do it for you now.

The rest of the product before you hit port layer is a single byte encoding thus, you should be able to pass UTF-8 more or less problem free and convert to UTF-16 unless someone somewhere will try to iterate std::string char by char.

For all other encodings you will need somehow to pass or autodetect the encoding and then convert to UTF-16 which is not an attractive proposition.

@chrisdone
Copy link
Author

Changing all the upper types to wchar_t on windows will probably be too much of an upheaval for the codebase.

I'm fine with just encoding UTF-8 to UTF-16 before hitting the native Windows API calls.

@yuslepukhin
Copy link
Contributor

@chrisdone Here is what I would do if I were in your shoes for the path of least resistance.
Rocksdb is modular. You have an option of creating your own environment that you can do as you please. There is a number that exists already.
That environment can be a of Stackable that would intercept certain calls and forward others OR you can create your own re-using other relevant parts via composition/inheritance of existing parts of win env.
Then you can substitute the default for yours in your build by replacing a few .cc files OR simply override in the options.

Couple of possible problems that may be a non-issue but worth mentioning include

  • how existing code base treats the paths. Those paths are passes around in many ways, part of the version and are used when opening files. It may be important in certain cases that std::strings are no longer treated as ASCII
  • Educational issue as to how people that make changes do it in the future. This is, of course, becomes one more thing for reviewers to check.

@chrisdone
Copy link
Author

@yuslepukhin So would you accept a PR that simply encodes UTF-8 to UTF-16 before hitting the Windows API? Or are you suggesting we maintain a fork of rocksdb with such a change?

@yuslepukhin
Copy link
Contributor

@chrisdone I am suggesting something in the middle. Forks are not desirable in general. Rocksdb allows you to create a custom environment. You can see plenty of examples such as mock_env, chroot env, hdfs.
You can do the same for your purposes and check it into the master branch. When opening the database you can instantiate your env and set it in the database options. In my opinion, it has the following benefits:

  • remains part of the CI build and master branch and, therefore, subject to any refactoring if there are any changes in windows env or env in general.
  • remains non-intrusive for others and allows you weed out Unicode/multibyte specific bugs in the rest of the code w/o affecting others.

@miasantreble
Copy link
Contributor

As suggested by @yuslepukhin creating a custom environment is probably a better option here. Please refer to https://github.com/facebook/rocksdb/tree/master/env and https://github.com/facebook/rocksdb/tree/master/hdfs for examples of creating and using custom envs.

@udoprog
Copy link

udoprog commented Jul 22, 2019

Note for anyone looking this up in the future:

This is now fixed in #4469 without having to use a custom env by using the WITH_WINDOWS_UTF8_FILENAMES option which defines ROCKSDB_WINDOWS_UTF8_FILENAMES.

@rockbmb
Copy link

rockbmb commented Jul 25, 2019

@udoprog thanks for the heads-up!
I'll make sure the people currently working on one of the projects that ran into this hear about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants