FileSystemObject Unicode filepath truncations #26

Open
tatewise opened this issue Feb 13, 2022 · 15 comments

@tatewise

This involves LuaCOM 1.3 and Lua 5.1 using the Microsoft FileSystemObject on Windows 10.
If a file path includes Unicode code points encoded as UTF-8, some methods return a truncated path, for example:

require("luacom")
fso = luacom.CreateObject("Scripting.FileSystemObject")
strParent = fso:GetParentFolderName("C:\\Root\\ĀĒĪŌŪ Unicode\\Folder")

That should return C:\Root\ĀĒĪŌŪ Unicode but actually returns C:\Root\ĀĒĪŌŪ Un, which is truncated by 5 bytes.
It is always truncated by the number of multi-byte UTF-8 code points.

Similar problems affect other methods such as fso:GetFolder(...) and fso:GetFile(...) regarding file path names.

When the same script is used with Lua 5.3 and Windows 10 everything works correctly.
Unfortunately, I am forced to use a precompiled Lua 5.1 application.
As a check, I ran similar FileSystemObject methods in Windows PowerShell on the same PC and that worked correctly.
Another user has the same symptoms on a different PC with Lua 5.1 and Windows 11.

Is there any workaround for this problem?

@robertlzj

robertlzj commented May 22, 2022

I tested your path on my computer and the return value is correct.
Setup: Windows 10, code page 936, script file saved as UTF-8, Lua 5.3, LuaCOM 1.4 (I think).

Alternatively, you could use a Lua pattern:

_,_,strParent = string.find('C:\\Root\\ĀĒĪŌŪ Unicode\\Folder','(.+)\\')
assert(strParent==[[C:\Root\ĀĒĪŌŪ Unicode]])

because FSO does not seem to validate paths strictly, I think:

strParent = fso:GetParentFolderName([["C:\not exist directory\not exist file"]])
assert(strParent=="\"C:\\not exist directory")
--	strange path; was the closing quotation mark truncated?

@tatewise
Author

tatewise commented May 22, 2022

Yes, as I said, everything is OK in Lua 5.3 but is faulty in Lua 5.1.
Yes, there are workarounds for GetParentFolderName(...) but not for GetFolder(...) and GetFile(...) and other methods.

@robertlzj

robertlzj commented May 22, 2022

Sorry, I missed the '5.1'.
I tested with Lua 5.1 and see the same issue.

assert(#"C:\\Root\\ĀĒĪŌŪ Unicode"==26 and #'ĀĒĪŌŪ'==10)
strParent = fso:GetParentFolderName("C:\\Root\\ĀĒĪŌŪ Unicode\\Folder")
assert(strParent==[[C:\Root\ĀĒĪŌŪ Un]] and #strParent==26-5)
----
assert(#'C:\\Root\\啊啊啊啊啊 Unicode'==31 and #'啊啊啊啊啊'==15)--3 bytes per character
strParent = fso:GetParentFolderName("C:\\Root\\啊啊啊啊啊 Unicode\\Folder")
assert(strParent=='C:\\Root\\啊啊啊啊\229' and #strParent==31-10)
----
strParent = fso:GetParentFolderName("C:\\Root\\ĀĒĪŌŪ Unicode     \\Folder")--cheat: append one single-byte character (a space) per lost byte
assert(strParent==[[C:\Root\ĀĒĪŌŪ Unicode]])

I have only just started learning about FSO, and only because of your post.
Until now I have used lfs, the command line and patterns to handle file and directory paths.
I also recently found this, which may help: Windows Shell Items: Lua parsing library. It parses binary files directly to get information on various file formats. The documentation is incomplete, but it could work.
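
The assertions in the comment above fit one simple pattern: in both tests the returned string is 21 bytes long, which is the number of code points in the correct result (26 bytes for the two-byte letters, 31 bytes for the three-byte ones). The sketch below is plain Lua 5.1, needs no LuaCOM, and only checks that arithmetic. The reading that the result is cut to "as many bytes as there are code points" is an inference from the numbers in this thread, not something confirmed from the LuaCOM source, and the helper names are made up for illustration.

```lua
-- Count UTF-8 code points: every byte that is not a continuation
-- byte (128..191) starts a new code point.
local function utf8_codepoints(s)
  local n = 0
  for i = 1, #s do
    local b = s:byte(i)
    if b < 128 or b > 191 then n = n + 1 end
  end
  return n
end

-- ASSUMPTION (inferred from the tests above, not from LuaCOM's code):
-- the returned string keeps only as many bytes as the correct
-- result has code points.
local function predict_truncated(expected)
  return expected:sub(1, utf8_codepoints(expected))
end

-- Number of single-byte padding characters that would compensate,
-- as in the "cheat" with five trailing spaces.
local function padding_needed(expected)
  return #expected - utf8_codepoints(expected)
end

local latin = "C:\\Root\\ĀĒĪŌŪ Unicode"      -- 2 bytes per accented letter
local cjk   = "C:\\Root\\啊啊啊啊啊 Unicode"  -- 3 bytes per character

print(#latin, utf8_codepoints(latin))  --> 26  21
print(#cjk, utf8_codepoints(cjk))      --> 31  21

assert(predict_truncated(latin) == "C:\\Root\\ĀĒĪŌŪ Un")
assert(predict_truncated(cjk) == "C:\\Root\\啊啊啊啊\229")
assert(padding_needed(latin) == 5 and padding_needed(cjk) == 10)
-- Padding the name restores the full string under the same assumption.
assert(predict_truncated(latin .. string.rep(" ", 5)) == latin)
```

If the assumption holds, the loss is one byte per continuation byte, so three-byte characters cost two bytes each rather than one.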

@robertlzj

robertlzj commented May 22, 2022

Hi @tatewise, I have a workaround, but I am not sure about its limitations since I don't know much about code points.
It may be just another workaround for GetParentFolderName(...) only 🤣
The key is to break up the code points first by substituting their bytes, pass the result to FSO, then convert (reassemble) the returned string.
Something like this: ugly, but it works for your example (GetParentFolderName only).

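-- Byte-substitution trick: replace each high (non-ASCII) byte with an unused low byte,
-- call FSO, then map the bytes back. Assumes fso was created as in the first post.
path='C:\\Root\\ĀĒĪŌŪ Unicode\\Folder'	-- named path, not string, to avoid shadowing the string library
print(path)
print(string.byte(path,1,#path))
-- two-way map: high byte -> low byte (hash part), low byte -> high byte (array part)
map={[196]=1,[146]=2,[170]=3,[197]=4,[140]=5,[128]=6,
	196,146,170,197,140,128
}
tem_str_bytes={}
index=1
while index<=#path do
	byte=string.byte(path,index)
	byte=map[byte] or byte
	assert(byte<128,byte)	-- no unmapped high byte may remain
	table.insert(tem_str_bytes,string.char(byte))
	index=index+1
end
tem_str=table.concat(tem_str_bytes)
print(tem_str)
tem_strParent = fso:GetParentFolderName(tem_str)
print(string.byte(tem_strParent,1,#tem_strParent))
print(tem_strParent)
-- map the low bytes in the result back to the original high bytes
tem_str_bytes={}
index=1
while index<=#tem_strParent do
	byte=string.byte(tem_strParent,index)
	byte=map[byte] or byte
	table.insert(tem_str_bytes,string.char(byte))
	index=index+1
end
strParent=table.concat(tem_str_bytes)
print(strParent==[[C:\Root\ĀĒĪŌŪ Unicode]])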

print output:

C:\Root\ĀĒĪŌŪ Unicode\Folder
67	58	92	82	111	111	116	92	196	128	196	146	196	170	197	140	197	170	32	85	110	105	99	111	100	101	92	70	111	108	100	101	114
C:\Root\���������� Unicode\Folder
67	58	92	82	111	111	116	92	1	6	1	2	1	3	4	5	4	3	32	85	110	105	99	111	100	101
C:\Root\���������� Unicode
C:\Root\ĀĒĪŌŪ Unicode
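
The hard-coded map above covers only the six high bytes in this one path. A minimal sketch of the same substitution with the map built from whatever high bytes the path contains could look like the following. The names build_maps and remap are hypothetical. It assumes the path contains no control bytes 1..31 of its own and uses at most 31 distinct high bytes. Only the round trip is checked here; whether FSO accepts the substituted string has only been observed for GetParentFolderName in the example above.

```lua
-- Build two maps from the high bytes (>= 128) actually present in path:
-- to_low[high_byte] = low_byte and to_high[low_byte] = high_byte.
local function build_maps(path)
  local to_low, to_high, next_low = {}, {}, 1
  for i = 1, #path do
    local b = path:byte(i)
    if b >= 128 and not to_low[b] then
      assert(next_low < 32, "too many distinct high bytes for this trick")
      to_low[b], to_high[next_low] = next_low, b
      next_low = next_low + 1
    end
  end
  return to_low, to_high
end

-- Replace every byte that has an entry in map; keep all others.
local function remap(s, map)
  return (s:gsub(".", function(c)
    local m = map[c:byte()]
    return m and string.char(m) or c
  end))
end

local path = "C:\\Root\\ĀĒĪŌŪ Unicode\\Folder"
local to_low, to_high = build_maps(path)
local encoded = remap(path, to_low)          -- pure 7-bit string, same length
assert(#encoded == #path)
assert(not encoded:find("[\128-\255]"))
assert(remap(encoded, to_high) == path)      -- round trip restores the path
```

Because the substituted string no longer names a real folder, this can only help methods that manipulate the path as text, not methods that must open an actual file or folder.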

@tatewise
Author

There are probably many workarounds just for GetParentFolderName(...) but they do not work for GetFolder(...) or GetFile(...) or other methods where they must interact with actual folders or files.
Your suggestion does not work in Lua 5.3 for those other methods let alone in Lua 5.1 ☹

@robertlzj

robertlzj commented May 23, 2022

OK, good luck!
Just to mention it again: if they can be substituted, some of the functions of the FSO Folder and File objects (I only took a glance) could be implemented with LuaFileSystem (lfs), or with command lines invoked through io.popen, etc. I have tried some of these in Lua 5.3.
There may also be some forks of luacom.
I hope you haven't missed them. 😃
File object | Microsoft Docs

@tatewise
Author

tatewise commented May 23, 2022

Unfortunately, lfs and io.popen only support file paths using the 256 ANSI character set and do NOT support file paths containing any UTF-8 characters such as Ā Ē Ī Ō Ū, etc. I know lfs and the io library very well and used them until switching to luacom FileSystemObjects to handle UTF-8 file paths, but then ran into this issue when using Lua 5.1.

@tatewise tatewise reopened this May 23, 2022
@robertlzj

robertlzj commented May 23, 2022

Wait, I use them in a GBK (CP 936) system environment, which can handle many non-ANSI characters too.
In my experience you need to convert from UTF-8 to GBK (my system code page), and then lfs works!
I have tried this a lot in Lua 5.3; I am not sure about Lua 5.1.

local lfs=require'lfs'
local gbk=require'gbk'
a=lfs.attributes(gbk.fromutf8[[C:\Ā Ē Ī Ō Ū]])--from utf8 to gbk
assert(a)

The same probably goes for all the io functions! 😁

I must have asked for help with similar questions before, maybe on the lfs issue page 😂
Until you mentioned the UTF-8 problems with lfs I had almost forgotten about them, because I wrapped gbk and lfs together so the conversion goes unnoticed.

Another, older method: save the script (or just the path argument) in ANSI encoding, then pass it to lfs. That works too!
The garbled characters below are 'ĀĒĪŌŪ', in code for Lua 5.1:
[image]

So it seems that lfs works in the system code page (which is essentially the ANSI code page) but is not suitable for UTF-8, the encoding of the script file? Nice summary 🤣

@tatewise
Author

I need to be able to support all Unicode UTF-8 code points and not just a subset.
It must also work on any other user system that I cannot control because my script is published for any user to download.

@robertlzj

robertlzj commented May 23, 2022

Oh, then try iconv, which converts between various encodings. There is a Lua binding for Windows, but it needs compiling (which I am not familiar with), so I have not tried it.

Or you could convert from UTF-8 to the local ANSI code page, the second method above.
That might be easier? I have not thought about it deeply. A conversion may still be needed, since UTF-8 is a compact (transfer) encoding of Unicode, so the text would first have to be converted from the local character set...
Tested on Lua 5.1:

local lfs=require'lfs'
a=lfs.attributes('\168\161 \168\165 \168\169 \168\173 \168\177')
--	ANSI bytes (the local code page is used beyond ASCII?) for: Ā Ē Ī Ō Ū, the same garbled characters as in the picture above
assert(a)

See this, which mentions PowerShell and iconv (the command-line tool). It covers file conversion (I used Save As above); perhaps string conversion too?
It may be a laborious workaround.

I have probably made many misstatements about UTF, Unicode and character sets; I lack the relevant knowledge for now.
Bookmarked, and I am learning.

@robertlzj

Hi, there is another solution: utf8_filenames.lua.
It is not ideal for me, since I use the 'gbk' conversion, but you could try it.

@tatewise
Author

Unfortunately, that is NOT a general solution for arbitrary UTF-8 symbols because as its comments say:
-- Please note that filenames must contain only symbols from your Windows ANSI codepage (which depends on OS locale).
-- Unfortunately, it's impossible to work with a file having arbitrary UTF-8 symbols in its name.
In other words, all it does is convert UTF-8 to ANSI for the 256 characters in the locale Code Page.

@robertlzj

Yes, I see, it is very limited 😂
Although the gbk conversion is enough for me, I am still looking for a general solution too. iconv seems to be the best solution I can find for now.

@robertlzj

robertlzj commented Jun 10, 2022

Hi, I have built and tested lua-iconv (based on [libiconv - GNU Project - Free Software Foundation (FSF)](http://www.gnu.org/software/libiconv/)) on Windows 10 with Lua 5.3. It works fine, so you could give it a try.

@1linux

1linux commented Sep 6, 2022

We ran into a similar problem: customers created file paths containing UTF-16 characters on a Windows machine. You can even get such a path into a UTF-8 string, which Lua handles with no problem.
However, the limiting factor is the C runtime library (msvcrt): on Windows you simply cannot access files whose names are passed to it encoded in UTF-8.

Our solution was to write a Lua library. One of its functions looks like this:

int file_get_contents(lua_State *L, const char*filename, int offset, int maxlen, int encoding) {
	luaL_Buffer luabuffer;
	unsigned char buff[1024];
	int wsz=0;
	wchar_t *winfilename=NULL;
	FILE *pf=NULL;
	if(encoding>0) {
		wsz=to_utf16(filename,encoding,&winfilename);
	} else {
		wsz = to_utf16(filename,CP_UTF8,&winfilename) || to_utf16(filename,CP_ACP,&winfilename);
	}
	if(!wsz) {
		lua_pushnil(L);
		lua_pushstring(L, "convert to windows utf-16 filename fail");
		return 2;
	}
	pf = _wfopen(winfilename, L"rb");
	free(winfilename);
	if (pf == NULL) {
		lua_pushnil(L);
		lua_pushstring(L, strerror(errno));
		return 2;
	}
	if(maxlen<=0) {
		fseek (pf, 0, SEEK_END); 
		maxlen=ftell(pf) - offset;
		if(maxlen<0) maxlen=0;
	}
	if(maxlen<=0) {
		fclose(pf);
		lua_pushstring(L, "");
		return 1;
	}
	luaL_buffinit(L, &luabuffer);
	fseek(pf,offset,SEEK_SET);
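	while(maxlen>0) {
		size_t want = maxlen > 1024 ? 1024 : (size_t)maxlen;
		size_t got = fread(buff,1,want,pf);
		if(got==0) break;	/* EOF or read error: stop instead of appending stale buffer bytes */
		luaL_addlstring(&luabuffer, (const char *)buff, got);
		maxlen -= (int)got;
	}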
	luaL_pushresult(&luabuffer);
	fclose(pf);
	return 1;
}
